Collections
Data is the most important part of data science. To work with data, we need to store and process it efficiently, and for this we use data structures. A data structure describes a way of organizing data so that a specific problem can be solved efficiently, and the Java Collections API is the standard Java API for data structures. It offers a wide variety of implementations that are useful in practical data science applications.
We will not describe the Collections API in full detail; instead, we concentrate on its most useful and important interfaces: List, Set, and Map.
Lists are collections in which each element can be accessed by its index. The go-to implementation of the List interface is ArrayList, which is the right choice in the vast majority of cases. It can be used as follows:
List<String> list = new ArrayList<>();
list.add("alpha");
list.add("beta");
list.add("beta");
list.add("gamma");
System.out.println(list); // prints [alpha, beta, beta, gamma]
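Since index-based access is what distinguishes a list, a minimal sketch of reading and replacing elements by position may help. The class name ListIndexExample and the helper buildList are illustrative and not part of the Collections API:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical example class; only ArrayList and List come from the JDK.
class ListIndexExample {

    // Builds the same four-element list used in the text.
    static List<String> buildList() {
        List<String> list = new ArrayList<>();
        list.add("alpha");
        list.add("beta");
        list.add("beta");
        list.add("gamma");
        return list;
    }

    public static void main(String[] args) {
        List<String> list = buildList();

        System.out.println(list.get(0));          // element at index 0: alpha
        System.out.println(list.size());          // 4, because duplicates are kept
        System.out.println(list.indexOf("beta")); // index of the first "beta": 1

        list.set(3, "delta");                     // replace the element at index 3
        System.out.println(list);                 // [alpha, beta, beta, delta]
    }
}
```

Indices start at 0, and asking for an index outside the range 0 to size() - 1 throws an IndexOutOfBoundsException.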
There are other implementations of the List interface, such as LinkedList and CopyOnWriteArrayList, but they are rarely needed.
Set is another interface in the Collections API; it describes a collection that allows no duplicates. The go-to implementation is HashSet if the order in which elements are inserted does not matter, or LinkedHashSet if it does. We can use it as follows:
Set<String> set = new HashSet<>();
set.add("alpha");
set.add("beta");
set.add("beta");
set.add("gamma");
System.out.println(set); // "beta" appears only once; HashSet does not guarantee any particular order
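When the insertion order matters, LinkedHashSet can be used in the same way. The following sketch removes duplicates from a list while keeping the first occurrence of each element; the class name DeduplicateExample and the helper deduplicate are illustrative assumptions:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical example class; only the java.util types come from the JDK.
class DeduplicateExample {

    // Returns the distinct elements of the input in order of first appearance.
    static List<String> deduplicate(List<String> input) {
        Set<String> seen = new LinkedHashSet<>(input);
        return new ArrayList<>(seen);
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("gamma", "alpha", "beta", "alpha", "gamma");
        System.out.println(deduplicate(words)); // [gamma, alpha, beta]

        Set<String> set = new LinkedHashSet<>(words);
        System.out.println(set.contains("beta")); // true
        System.out.println(set.add("beta"));      // false: "beta" is already present
    }
}
```

The add method returns false when the element is already in the set, which gives a cheap way to detect duplicates while reading data.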
Both List and Set extend the Iterable interface, which makes it possible to use them in a for-each loop:
for (String el : set) {
    System.out.println(el);
}
The Map interface maps keys to values; in other languages this structure is sometimes called a dictionary or an associative array. The go-to implementation is HashMap:
Map<String, String> map = new HashMap<>();
map.put("alpha", "α");
map.put("beta", "β");
map.put("gamma", "γ");
System.out.println(map);
If you need to keep the insertion order, you can use LinkedHashMap; if you know that the map will be accessed from multiple threads, use ConcurrentHashMap.
The Collections class provides several helper methods for working with collections, such as sorting or extracting the minimum or maximum element:
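To illustrate how a map is read and updated, here is a small sketch that counts word occurrences with a LinkedHashMap, so the keys keep their insertion order. The class name WordCountExample and the helper countWords are illustrative assumptions, and merge requires Java 8 or later:

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical example class; only the java.util types come from the JDK.
class WordCountExample {

    // Counts how many times each word occurs, keeping first-seen order of keys.
    static Map<String, Integer> countWords(List<String> words) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String word : words) {
            // Insert 1 for a new key, otherwise add 1 to the existing count.
            counts.merge(word, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts =
                countWords(Arrays.asList("beta", "alpha", "beta", "gamma", "beta"));

        System.out.println(counts);                          // {beta=3, alpha=1, gamma=1}
        System.out.println(counts.get("alpha"));             // 1
        System.out.println(counts.get("delta"));             // null: the key is absent
        System.out.println(counts.getOrDefault("delta", 0)); // 0

        // Iterate over the key-value pairs in insertion order.
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            System.out.println(entry.getKey() + " -> " + entry.getValue());
        }
    }
}
```

Note that get returns null for a missing key, so getOrDefault is often the safer choice when a default value makes sense.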
String min = Collections.min(list);
String max = Collections.max(list);
System.out.println("min: " + min + ", max: " + max);
Collections.sort(list);    // sorts the list in place, in natural (alphabetical) order
Collections.shuffle(list); // randomly reorders the list in place
There are other collection types as well, such as Queue, Deque, Stack, and the thread-safe collections. They are used less frequently and are not very important for data science.
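Both sort and shuffle modify the list they are given. The sketch below shows one way to sort a copy with a custom Comparator and to shuffle reproducibly with a seeded Random; the class name CollectionsHelpersExample, the helper sortedCopy, and the seed value are illustrative assumptions:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.Random;

// Hypothetical example class; only the java.util types come from the JDK.
class CollectionsHelpersExample {

    // Returns a sorted copy and leaves the input list untouched.
    static List<String> sortedCopy(List<String> input, Comparator<String> order) {
        List<String> copy = new ArrayList<>(input);
        Collections.sort(copy, order);
        return copy;
    }

    public static void main(String[] args) {
        List<String> list = Arrays.asList("gamma", "alpha", "beta", "beta");

        System.out.println(sortedCopy(list, Comparator.naturalOrder())); // [alpha, beta, beta, gamma]
        System.out.println(sortedCopy(list, Comparator.reverseOrder())); // [gamma, beta, beta, alpha]

        // The sort is stable: elements of equal length keep their relative order.
        System.out.println(sortedCopy(list, Comparator.comparing(String::length))); // [beta, beta, gamma, alpha]

        System.out.println(list); // the original is unchanged: [gamma, alpha, beta, beta]

        // A fixed seed (42 is arbitrary) makes the shuffle repeatable between runs.
        List<String> shuffled = new ArrayList<>(list);
        Collections.shuffle(shuffled, new Random(42));
        System.out.println(shuffled);
    }
}
```

Passing a seeded Random to shuffle is useful whenever an experiment, such as a train/test split, has to be repeatable.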