All About Machine Learning: java - Producing a sorted wordcount with Spark

Wednesday, November 26, 2014

java - Producing a sorted wordcount with Spark - Code Review Stack Exchange

My method using Java 8

As an addendum, I'll show how I would frame the problem in your question and how I would solve it.

Input: A file consisting of words.
Output: A list of the words, sorted by how frequently they occur (most frequent first).

Map<String, Long> occurenceMap = Files.readAllLines(Paths.get("myFile.txt"))
        .stream()
        .flatMap(line -> Arrays.stream(line.split(" ")))
        .collect(Collectors.groupingBy(i -> i, Collectors.counting()));

List<String> sortedWords = occurenceMap.entrySet()
        .stream()
        .sorted(Comparator.comparing((Map.Entry<String, Long> entry) -> entry.getValue()).reversed())
        .map(Map.Entry::getKey)
        .collect(Collectors.toList());

This performs the following steps:

1. Read all lines into a List<String> (take care with large files, since the whole file is loaded into memory).
2. Turn it into a Stream<String>.
3. Turn that into a Stream<String> of words by flat-mapping every line to a Stream<String> obtained by splitting on blanks.
4. Collect all elements into a Map<String, Long>, grouping by identity (i -> i) and using Collectors.counting() as the downstream collector, so that each map value is the count of its word.
5. Get a Set<Map.Entry<String, Long>> from the map.
6. Turn it into a Stream<Map.Entry<String, Long>>.
7. Sort by entry value in reverse (descending) order.
8. Map the results to a Stream<String>; you lose the frequency information here.
9. Collect the stream into a List<String>.
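Two of the caveats in these steps (the whole file being read into memory, and the counts being dropped at the end) could be handled roughly as follows. This is only a sketch of one possible variation, not the original answer's code: the class name WordFrequency and the method names countSorted and countSortedFromFile are made up for illustration. It streams the file with Files.lines instead of Files.readAllLines, and collects into a LinkedHashMap so that both the sorted order and the counts survive.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper class; the names are illustrative only.
class WordFrequency {

    // Counts the words in the given lines and returns them ordered from
    // most to least frequent. A LinkedHashMap keeps the sorted order.
    static Map<String, Long> countSorted(Stream<String> lines) {
        Map<String, Long> counts = lines
                // Same splitting rule as the pipeline above: a single blank.
                .flatMap(line -> Arrays.stream(line.split(" ")))
                .collect(Collectors.groupingBy(word -> word, Collectors.counting()));

        return counts.entrySet()
                .stream()
                .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
                .collect(Collectors.toMap(
                        Map.Entry::getKey,
                        Map.Entry::getValue,
                        (first, second) -> first,   // keys are unique, never called
                        LinkedHashMap::new));
    }

    // Streams the file line by line instead of loading it all at once;
    // try-with-resources closes the underlying file handle.
    static Map<String, Long> countSortedFromFile(Path file) throws IOException {
        try (Stream<String> lines = Files.lines(file)) {
            return countSorted(lines);
        }
    }

    public static void main(String[] args) throws IOException {
        // "myFile.txt" is the placeholder file name used in the snippet above.
        countSortedFromFile(Paths.get("myFile.txt"))
                .forEach((word, count) -> System.out.println(word + "\t" + count));
    }
}
```

Words with equal counts come out in no particular order here, because the intermediate HashMap has no defined iteration order.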

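Beware that the line .sorted(Comparator.comparing((Map.Entry<String, Long> entry) -> entry.getValue()).reversed()) should really be .sorted(Comparator.comparing(Map.Entry::getValue).reversed()), but that version will not compile. The likely reason is type inference: once .reversed() is chained onto it, the Comparator.comparing(...) call no longer has a target type, so the compiler cannot infer the entry type for the method reference.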
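For what it's worth, there are a few ways to write the descending sort without the explicitly typed lambda. The sketch below shows three that I would expect to compile on Java 8; the class DescendingByValue and its method names are hypothetical and only there to make the snippet self-contained.

```java
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical demo class; only the comparator expressions matter.
class DescendingByValue {

    // Option 1: give comparingByValue() an explicit type witness so that
    // .reversed() can be chained onto a fully typed comparator.
    static List<String> withTypeWitness(Map<String, Long> counts) {
        return counts.entrySet().stream()
                .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }

    // Option 2: wrap the comparator instead of chaining, so the argument
    // of sorted(...) still provides a target type for inference.
    static List<String> withReverseOrder(Map<String, Long> counts) {
        return counts.entrySet().stream()
                .sorted(Collections.reverseOrder(Map.Entry.comparingByValue()))
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }

    // Option 3: pass a reversed value comparator to comparingByValue(...).
    static List<String> withValueComparator(Map<String, Long> counts) {
        return counts.entrySet().stream()
                .sorted(Map.Entry.comparingByValue(Comparator.reverseOrder()))
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Map<String, Long> counts = new HashMap<>();
        counts.put("spark", 3L);
        counts.put("java", 2L);
        counts.put("word", 1L);
        System.out.println(withTypeWitness(counts));
        System.out.println(withReverseOrder(counts));
        System.out.println(withValueComparator(counts));
    }
}
```

Options 2 and 3 avoid the problem because nothing is chained onto the generic factory call, so inference can work from the type that sorted(...) expects.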

I hope the Java 8 way can give you interesting insights.

Read full article from java - Producing a sorted wordcount with Spark - Code Review Stack Exchange
