Mastering Java for Data Science

Google Guava

Google Guava is very similar to Apache Commons; it is a set of utilities that extend the standard Java API and make life easier. But unlike Apache Commons, Google Guava is one library that covers many areas at once, including collections and I/O.

To include it in a project, add the following Maven dependency:

<dependency>
    <groupId>com.google.guava</groupId>
    <artifactId>guava</artifactId>
    <version>19.0</version>
</dependency>

We will start with the Guava I/O module. To give an illustration, we will use some generated data. We already used the Word class, which contains a token and its part-of-speech (POS) tag, and here we will generate more words. To do that, we can use a data generation tool such as http://www.generatedata.com/. Let's define the schema as shown in the following screenshot:

After that, we can export the generated data in the CSV format, set the delimiter to tab (\t), and save it to words.txt. We already generated a file for you; you can find it in the chapter2 repository.

Guava defines a few abstractions for working with I/O. One of them is CharSource, an abstraction for any source of character-based data, which, in some sense, is quite similar to the standard Reader class. Additionally, similarly to Commons IO, there is a utility class for working with files. It is called Files (not to be confused with java.nio.file.Files), and contains helper functions that make file I/O easier. Using this class, it is possible to read all lines of a text file as follows:

File file = new File("data/words.txt"); 
CharSource wordsSource = Files.asCharSource(file, StandardCharsets.UTF_8);
List<String> lines = wordsSource.readLines();

Google Guava Collections follows the same idea as Commons Collections; it builds on the Standard Collections API and provides new implementations and abstractions. There are a few utility classes such as Lists, for working with lists, Sets for working with sets, and so on.

One of the methods in Lists is transform. It is like map on streams: it applies a function to every element of the list. The resulting list is a lazily evaluated view; the function is computed only when an element is accessed. Let's use it to transform the lines from the text file into a list of Word objects:

List<Word> words = Lists.transform(lines, line -> {
    // the file is tab-delimited: token, then POS tag
    String[] split = line.split("\t");
    return new Word(split[0].toLowerCase(), split[1]);
});

The main difference between this and map from the Streams API is that transform immediately returns a list, so there is no need to first create a stream, call the map function, and finally collect the results to list.
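To make the lazy behavior concrete, here is a minimal sketch of how such a transforming view can work. It uses only the JDK and is not Guava's actual code; LazyViewSketch and lazyTransform are hypothetical names chosen for illustration. The point it shows is that the function runs again on every access, so an expensive transformation is usually worth copying into a regular list once.

```java
import java.util.AbstractList;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

final class LazyViewSketch {

    // Hypothetical helper (not part of Guava): a read-only view over source
    // that applies fn each time an element is requested.
    static <F, T> List<T> lazyTransform(List<F> source, Function<? super F, ? extends T> fn) {
        return new AbstractList<T>() {
            @Override
            public T get(int index) {
                return fn.apply(source.get(index));
            }

            @Override
            public int size() {
                return source.size();
            }
        };
    }

    public static void main(String[] args) {
        AtomicInteger calls = new AtomicInteger();
        List<String> lines = Arrays.asList("Apple\tNN", "Run\tVB");

        List<String> tokens = lazyTransform(lines, line -> {
            calls.incrementAndGet();
            return line.split("\t")[0].toLowerCase();
        });

        System.out.println(calls.get()); // 0: creating the view computes nothing
        tokens.get(0);
        tokens.get(0);
        System.out.println(calls.get()); // 2: the function ran once per access

        // Copying the view evaluates every element once and stores the results.
        List<String> materialized = new ArrayList<>(tokens);
        System.out.println(materialized); // [apple, run]
    }
}
```

With Guava itself, the same effect is typically achieved by copying the result of Lists.transform into a new list, for example with new ArrayList<>(words).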

Similarly to Commons Collections, there are new collections that are not available in the Java API. The most useful collections for data science are Multiset, Multimap, and Table.

Multisets are sets where the same element can be stored multiple times, and they are usually used for counting things. This class is especially useful for text processing, when we want to calculate how many times each term appears.

Let's take the words that we read and count how many times each POS tag appears:

Multiset<String> pos = HashMultiset.create();
for (Word word : words) {
    pos.add(word.getPos());
}

If we want to output the results sorted by counts, there is a special utility function for that:

Multiset<String> sortedPos = Multisets.copyHighestCountFirst(pos); 
System.out.println(sortedPos);
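For comparison, the sketch below does the same counting and sorting with plain JDK collections, which shows the bookkeeping that Multiset takes care of. TagCounts and its methods are hypothetical names, and the tags in main are made-up sample values rather than the contents of words.txt.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class TagCounts {

    // Counts how many times each tag occurs.
    static Map<String, Integer> countTags(List<String> tags) {
        Map<String, Integer> counts = new HashMap<>();
        for (String tag : tags) {
            counts.merge(tag, 1, Integer::sum);
        }
        return counts;
    }

    // Returns the (tag, count) entries ordered from the highest count to the lowest.
    static List<Map.Entry<String, Integer>> highestCountFirst(Map<String, Integer> counts) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort(Map.Entry.<String, Integer>comparingByValue().reversed());
        return entries;
    }

    public static void main(String[] args) {
        // Made-up sample tags for illustration.
        List<String> tags = Arrays.asList("NN", "VB", "NN", "JJ", "NN", "VB");
        System.out.println(highestCountFirst(countTags(tags))); // [NN=3, VB=2, JJ=1]
    }
}
```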

A Multimap is a map that can hold multiple values for each key. There are several types of multimaps. The two most common ones are as follows:

  • ListMultimap: This associates a key with a list of values, similar to Map<Key, List<Value>>
  • SetMultimap: This associates a key with a set of values, similar to Map<Key, Set<Value>>

This can be quite useful for implementing group-by logic. Let's group the tokens by POS tag, which is the first step in computing, for example, the average token length per POS tag:

ArrayListMultimap<String, String> wordsByPos = ArrayListMultimap.create();
for (Word word : words) {
    wordsByPos.put(word.getPos(), word.getToken());
}

It is possible to view a multimap as a map of collections:

Map<String, Collection<String>> wordsByPosMap = wordsByPos.asMap(); 
wordsByPosMap.entrySet().forEach(System.out::println);
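Building on this map-of-collections view, the average token length per POS tag could be computed roughly as follows. This is a minimal sketch: AverageLengthPerTag and averageLengths are hypothetical names, and main fills a HashMap by hand as a stand-in for wordsByPos.asMap() so that the example runs without Guava on the classpath.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

final class AverageLengthPerTag {

    // Accepts the same shape that Multimap.asMap() returns:
    // POS tag -> all tokens seen with that tag.
    static Map<String, Double> averageLengths(Map<String, Collection<String>> groups) {
        Map<String, Double> result = new TreeMap<>();
        for (Map.Entry<String, Collection<String>> e : groups.entrySet()) {
            double avg = e.getValue().stream()
                    .mapToInt(String::length)
                    .average()
                    .orElse(0.0); // an empty group gets 0.0 by convention
            result.put(e.getKey(), avg);
        }
        return result;
    }

    public static void main(String[] args) {
        // Hand-built stand-in for wordsByPos.asMap(); the tokens are made up.
        Map<String, Collection<String>> groups = new HashMap<>();
        groups.put("NN", Arrays.asList("cat", "house"));
        groups.put("VB", Arrays.asList("run", "jump"));

        System.out.println(averageLengths(groups)); // {NN=4.0, VB=3.5}
    }
}
```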

Finally, the Table collection can be seen as a two-dimensional extension of the map interface: instead of one key, each entry is indexed by two keys, a row key and a column key. It is also possible to get an entire column using the column key, or an entire row using the row key.

For example, we can count how many times each (word, POS) pair appeared in the dataset:

Table<String, String, Integer> table = HashBasedTable.create();
for (Word word : words) {
    Integer cnt = table.get(word.getPos(), word.getToken());
    if (cnt == null) {
        cnt = 0;
    }
    table.put(word.getPos(), word.getToken(), cnt + 1);
}

Once the data is put into the table, we can access the rows and columns individually:

Map<String, Integer> nouns = table.row("NN"); 
System.out.println(nouns);

String word = "eu";
Map<String, Integer> posTags = table.column(word);
System.out.println(posTags);
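To illustrate the two-key idea, here is a rough JDK-only sketch of a table built from nested maps. It is not how Guava implements Table; NestedMapTable and its methods are hypothetical names, and the counts in main are made-up sample data. As in the example above, the POS tag is the row key and the token is the column key.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

final class NestedMapTable {

    // row key -> (column key -> count)
    private final Map<String, Map<String, Integer>> rows = new HashMap<>();

    // Adds one to the cell addressed by (rowKey, columnKey).
    void increment(String rowKey, String columnKey) {
        rows.computeIfAbsent(rowKey, k -> new TreeMap<>())
            .merge(columnKey, 1, Integer::sum);
    }

    // All cells of one row: column key -> count.
    Map<String, Integer> row(String rowKey) {
        return rows.getOrDefault(rowKey, new TreeMap<>());
    }

    // All cells of one column: row key -> count. This scans every row.
    Map<String, Integer> column(String columnKey) {
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, Map<String, Integer>> e : rows.entrySet()) {
            Integer value = e.getValue().get(columnKey);
            if (value != null) {
                result.put(e.getKey(), value);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        NestedMapTable table = new NestedMapTable();
        table.increment("NN", "cat");
        table.increment("NN", "cat");
        table.increment("NN", "dog");
        table.increment("VB", "dog");

        System.out.println(table.row("NN"));     // {cat=2, dog=1}
        System.out.println(table.column("dog")); // {NN=1, VB=1}
    }
}
```

Even in this small form, the null checks and the column scan have to be written by hand, which is the kind of boilerplate the Table interface hides.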

As in Commons Lang, Guava also contains utility classes for working with primitives, such as Ints for int primitives, Doubles for double primitives, and so on. For example, Ints can be used to convert a collection of primitive wrappers to a primitive array:

Collection<Integer> values = nouns.values(); 
int[] nounCounts = Ints.toArray(values);
int totalNounCount = Arrays.stream(nounCounts).sum();
System.out.println(totalNounCount);

Finally, Guava provides a nice abstraction for sorting data: Ordering, which extends the standard Comparator interface. It provides a clean, fluent interface for creating comparators:

Ordering<Word> byTokenLength =
    Ordering.natural().<Word>onResultOf(w -> w.getToken().length()).reverse();
List<Word> sortedByLength = byTokenLength.immutableSortedCopy(words);
System.out.println(sortedByLength);

Since Ordering implements the Comparator interface, it can be used wherever a comparator is expected. For example, for Collections.sort:

List<Word> sortedCopy = new ArrayList<>(words); 
Collections.sort(sortedCopy, byTokenLength);

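In addition, it provides other methods, such as extracting the k least or k greatest elements with respect to the ordering. Note that byTokenLength is reversed, so its "least" elements are the longest tokens: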

List<Word> first10 = byTokenLength.leastOf(words, 10); 
System.out.println(first10);
List<Word> last10 = byTokenLength.greatestOf(words, 10);
System.out.println(last10);

It is the same as first sorting and then taking the first or last k elements, but more efficient.
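To see why selecting k elements can be cheaper than sorting everything, consider the following sketch. It keeps only k candidates in a bounded heap, so it needs roughly O(n log k) comparisons instead of the O(n log n) of a full sort. This illustrates the general idea only; Guava's own implementation may use a different algorithm. TopKSketch is a hypothetical name, and its leastOf method merely mirrors the name of the Guava method.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

final class TopKSketch {

    // Returns the k least elements according to cmp, in ascending cmp order.
    static <T> List<T> leastOf(Iterable<T> items, int k, Comparator<T> cmp) {
        if (k <= 0) {
            return new ArrayList<>();
        }
        // Max-heap with respect to cmp: the head is the largest of the kept elements.
        PriorityQueue<T> heap = new PriorityQueue<>(k, cmp.reversed());
        for (T item : items) {
            if (heap.size() < k) {
                heap.add(item);
            } else if (cmp.compare(item, heap.peek()) < 0) {
                // item beats the current worst candidate, so replace it
                heap.poll();
                heap.add(item);
            }
        }
        List<T> result = new ArrayList<>(heap);
        result.sort(cmp); // only k elements are sorted here
        return result;
    }

    public static void main(String[] args) {
        // Longest tokens first, analogous to byTokenLength; the tokens are made up.
        Comparator<String> byLengthDesc =
                Comparator.comparingInt((String s) -> s.length()).reversed();
        List<String> tokens = Arrays.asList("a", "ccc", "bb", "eeeee", "dddd");

        System.out.println(leastOf(tokens, 2, byLengthDesc)); // [eeeee, dddd]
    }
}
```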

There are other useful classes:

  • Customizable hash implementations such as Murmur hash and others
  • Stopwatch for measuring time

For more insights, you can refer to https://github.com/google/guava and https://github.com/google/guava/wiki.

You may have noticed that Guava and Apache Commons have a lot in common. Selecting which one to use is a matter of taste: both libraries are very well tested and actively used in many production systems. However, Guava is more actively developed and new features appear more often, so if you want to use only one of them, then Guava may be a better choice.