Last updated at Mon, 06 Nov 2017 18:52:21 GMT

Introduction

With the release of Java 8 a number of new language features were introduced [1]. These included lambda functions, streams and completable futures. Colleagues of mine have already reviewed these features in previous articles on this blog, which I recommend reading as part of this topic [2] [3]. In this article I will touch on an aspect of the Java 8 release that relates to the push towards exploiting parallelism, in the context of the existing Collections Framework – specifically the new Spliterator interface.

Due to the size and complexity of these topics, it will only be possible to provide a high-level introduction, but it is hoped that this will trigger interest in the reader, to go off and delve further into an area of growing importance and interest.

Parallelism

First thing first – what is meant by the term ‘parallelism’? For the purpose of this article, at a high level, it is a term that refers to programming in a way to exploit multi-core CPU units. That is to say, to take a piece of work (a task) and break it out into separate units (sub-tasks) that can be processed in parallel. Then aggregating the results of all the processed units to complete the original work [4]. This is sufficient for now, although the reader might want to investigate further on differences between parallelism, concurrency and sequential programming.

Figure 1

For this to be successful, the processing on a unit should not have a dependency on any other unit – this processing should be stateless and also should not depend on the result of the processing of any other unit. This is important when it comes to the data that is involved – as parallel computing lends itself to situations where vast amounts of data is to be handled.

This also raises the question of when to go for parallelism – there is an overhead in the added complexity of identifying and breaking out the units of work and then coordinating the parallel processing. Then there is also the issue of latency in reading the data – the savings found in using parallel cores for processing is to try to keep each CPU core busy all the time – which requires reading data as and when is needed without delay. Before embarking on a parallel processing architecture, some cost-benefit analysis is required to be sure that this is the right approach.

With the introduction of Java 8 Oracle was aiming to make it more attractive and simpler for developers to use Java in parallel programming. The inclusion of features such as lambda functions and the Streams API was to be seen as part of this push [5]. How successful that becomes, time will tell.

This concept of parallelism is behind the existing fork/join framework found in Java, which will be covered in a future article, but for now we will confine our overview to the new Spliterator interface, found in the java.util library.

Collections Framework

The Collections Framework is also found in the java.util library, and provides a range of classes that act as containers to hold objects (technically, references to the objects) and allow useful behaviour such as adding new objects, searching the container for a particular object, and the sorting of such objects – the behaviour available is depending on the type of container used [6].

Within the library there are two different concepts of containers, captured in the core interfaces, that are defined within the framework [7]. One concept is that of a Collection that is a sequence of individual elements, which is further refined by extending the Collection interface with interfaces for List, Set, Queue or Deque.

Listing 1 – Simple Iterator on an ArrayList

// Create a new Collection type; in this case an ArrayList Collection<Person> people = new ArrayList<Person>(); // Add some people using Person objects people.add(new Person("Jane", "Doe", "Ireland")); people.add(new Person("Joe", "Doe", "England")); people.add(new Person("John", "Doe", "Scotland")); people.add(new Person("Julie", "Doe", "Wales")); people.add(new Person("Jerry", "Doe", "France")); people.add(new Person("Jim", "Doe", "Italy")); // simple iterator example Iterator<Person> peopleIterator = people.iterator(); while(peopleIterator.hasNext()) { Person person = peopleIterator.next(); System.out.println("Hello " + person.getFirst_name() + " " + person.getLast_name() + " from " + person.getCountry()); }

The output is:

Hello Jane Doe from Ireland Hello Joe Doe from England Hello John Doe from Scotland Hello Julie Doe from Wales Hello Jerry Doe from France Hello Jim Doe from Italy

Separately, there is the concept of a container, defined in the Map interface, holding a group of key-value object pairs, which allow the use of a key to find a value – in this way an object (value) can be found using another object (key) that has been mapped to it.

Both the interfaces of Collection and Map are found in the Collections Framework, in java.util, along with a class called Collections. Note that the Collections class consists exclusively of static methods, which include polymorphic algorithms, that operate on or return collections.

Paralellism and the Collections Framework

A problem with using the containers of the Collections Framework in some form of parallel programming is that these containers are not thread-safe. Wrappers are provided for adding automatic synchronization (thread safety) but the drawback is that this introduces thread contention, where two or more threads are trying to access the same resource simultaneously and therefore cause the runtime to either suspend their execution or execute them more slowly.

The Spliterator

A new interface added to java.util is the Spliterator, which as the name implies, is a new special kind of Iterator that can traverse a Collection[8]. For Java 8, the Collection interface has been updated to include a new spliterator() method, that when called returns a Spliterator. This is not the case for the separate Map interface, although it is considered part of the Collections Framework as explained above.

The Spliterator can ‘split’ the Collection, partitioning off some of its elements as another Spliterator. This does allow parallel processing of different parts of a Collection but note that the Spliterator itself does not provide the parallel processing behaviour. Instead, the Spliterator is there to support parallel traversal of the suitably partitioned parts of a Collection. This solves the problem of dividing the data, as held in a Collection such as an ArrayList, into suitably sized sub-units that can be processed in parallel.

The fork/join framework, which is found in Java 7 libraries, can be used with the Spliterator and is designed for parallelizable work that can be broken into smaller pieces recursively to be processed independently, and then aggregating the results of the sub-units to produce a final result. Again, however note that using the Spliterator is not dependant on the fork/join framework, this framework is just one way of implementing parallel processing [9].

Listing 2 – Simple Spliterator on an ArrayList

The following code gives us an ArrayList, called people that is to hold objects of the Person type – a user defined type:

Collection<Person> people = new ArrayList<Person>();

The ‘people’ collection then has a number of objects added, and we can now call on the spliterator() of the ArrayList:

Spliterator<Person> peopleSpliterator = people.spliterator();

This returns a Spliterator we are calling ‘peopleSpliterator’.

System.out.println(" " + peopleSpliterator.characteristics()); System.out.println(" " + peopleSpliterator.estimateSize());

The output from the above code is :

16464 
6

The value 16464 represents the characteristics of the Spliterator partition. This is important as it is these predefined characteristics that result in what ends up in each new partition and how it is structured. The int value returned from the characteristics() call is the result of the OR’ed values of the individual characteristics for an ArrayList, which is ORDERED, SORTED and SUBSIZED[10].

ORDERED indicates that the elements have a defined order, which is expected from a List, when traversing and partitioning them. SORTED means that the elements follow a predefined sort order while SUBSIZED indicates that this and any further resulting Spliterator are SIZED. SIZED means the Spliterator has been created from a source with a known size. The details of the characteristics for a particular Collection type can be found in the online API documentation. Where you wish to have a collection with non-standard characteristics, that means going down the route of implementing your own version of the Spliterator interface.

Listing 3 – Split the Spliterator

Now that we have a Spliterator, lets ‘split’ it:

Spliterator<Person> newPartition = peopleSpliterator.trySplit(); System.out.println(" " + newPartition.estimateSize()); System.out.println(" " + peopleSpliterator.estimateSize());

The output from this code is:

3 
3

As you can see we have a new partition with 3 elements, leaving the existing partition with 3. The important method of trySplit() has partitioned the elements of the original Spliterator – in this case split it evenly, leaving the existing Spliterator with a reduced number of elements in its own partition. Clearly this is very different from the behaviour of the pre-existing Iterator.

If this Spliterator can be split (partitioned), the method will try to divide the elements in half for a balanced parallel computation. This will not always be the case, and where a Spliterator cannot be split, a null is returned. The splitting, or partitioning, can continue recursively until a Spliterator returns null. If a different behaviour is required in how the trySplit() method works, then again a custom implementation of the Spliterator interface is required.

When used in a parallel programming solution, bear in mind that the individual Spliterator is not thread safe; instead the parallel processing implementation must insure that each individual Spliterator is handled by one thread at a time. A thread calling the trySplit() of a Spliterator, may hand over the returned new Spliterator to a separate thread for processing. This follows the idea of decomposition of a larger task into smaller sub-tasks that can be processed in parallel individually of each other.

When working with a Spliterator, especially in a parallel programming environment, changes to the data structure that is undergoing processing can lead to arbitrary and unknown behaviour and that is not good. Ideally the original data source should not be interfered with (no elements added, replaced or removed). Collection types are not immutable and cannot manage concurrent modifications. To help with this problem, for the ArrayList, the trySplit() returns a ‘late-binding’ and ‘fast-fail’ Spliterator. This late-binding binds the elements to the Spliterator on first split, first traversal or first query for estimated size rather then at the time of the creation of the Spliterator. After this binding occurs any interference with the elements is then detected, throwing a ConcurrentModificationException. Here, it fails-fast, rather then risking arbitrary and non-deterministic behaviour at a future time.

Of course with the above trivial example of only 6 elements in the list, this is hardly a work load that would justify parallel programming – but as mentioned before, Spliterator itself does not provide the parallel programming behaviour – the reader might be interested in experimenting with a Collection type holding a vastly larger number of objects and then look to implement a parallel processing solution to exploit the use of the Spliterator. A suggestion is that such a solution may incorporate the fork/join framework and/or the Stream API.

Future articles will delve further into this interesting and increasingly important area.

References:


Disclaimer : Logentries is not responsible for the content, reliability or availability of external sites.

  1. Oracle Java Technology HomePage
  2. Logentries Blog: Exploring Lambdas and Streams in Java
  3. Logentries Blog: Playing with Java 8’s Completable Futures
  4. Java Tutorials: Parallelism
  5. Oracle Press Release: Java 8
  6. Java Collections Framework
  7. Collections Framework Interfaces
  8. Interface Spliterator
  9. fork/join Framework
  10. Constant Values

Ready to start getting insights from your applications? Sign up for a Logentries free trial today.