Saturday, November 22, 2014

Suvir Jain | Pearson's Correlation Coefficient using Apache Spark and Map Reduce



Sample Input

Two variables, X and Y, with values in comma-separated form, one pair per line:

1,2
3,4
5,6
…

The code is general enough to handle any given column, but we will keep it simple and assume that the file contains only 2 columns, as above.

Algorithm

For each line in the file, parse the two numbers as Java Double type and emit the following (Key,Value) pairs:

(0,1) – To count the number of records
(1,X) – Value of X itself
(2,Y) – Value of Y itself
(3,X^2) – Square of X
(4,Y^2) – Square of Y
(5,X*Y) – Product of X and Y. This will help us compute the dot product of X and Y.

A single pass of the Spark mapper will finish all the heavy lifting in one awesome O(n) operation! Next, our reducers will add up the values for each key and we will be almost done.

Mapper and Reducer Functions

The Mapper

In Spark style, the Mapper is a static nested class.
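The key/value scheme described above can be sketched in plain Java, without any Spark dependency, to show how the six sums lead to the coefficient. This is only an illustration under stated assumptions: the class and method names (PearsonSketch, emit, reduce, pearson, correlate) are hypothetical and not taken from the article, and the final step uses the standard sum-based formula r = (n*Sxy - Sx*Sy) / sqrt((n*Sxx - Sx^2) * (n*Syy - Sy^2)), which the excerpt stops short of showing.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java sketch of the (key, value) scheme; it does not use Spark.
// All names here are hypothetical, chosen for illustration only.
final class PearsonSketch {

    // "Map" step: turn one "x,y" line into the six {key, value} pairs.
    static List<double[]> emit(String line) {
        String[] parts = line.split(",");
        double x = Double.parseDouble(parts[0].trim());
        double y = Double.parseDouble(parts[1].trim());
        List<double[]> pairs = new ArrayList<>();
        pairs.add(new double[] {0, 1});      // key 0: record count
        pairs.add(new double[] {1, x});      // key 1: X
        pairs.add(new double[] {2, y});      // key 2: Y
        pairs.add(new double[] {3, x * x});  // key 3: X^2
        pairs.add(new double[] {4, y * y});  // key 4: Y^2
        pairs.add(new double[] {5, x * y});  // key 5: X*Y
        return pairs;
    }

    // "Reduce" step: add up the values for each key; sums[k] is the total for key k.
    static double[] reduce(List<double[]> pairs) {
        double[] sums = new double[6];
        for (double[] pair : pairs) {
            sums[(int) pair[0]] += pair[1];
        }
        return sums;
    }

    // Final step (assumed, standard formula): combine the six sums into r.
    // Returns NaN when either variable has zero variance.
    static double pearson(double[] s) {
        double n = s[0];
        double numerator = n * s[5] - s[1] * s[2];
        double denominator =
            Math.sqrt((n * s[3] - s[1] * s[1]) * (n * s[4] - s[2] * s[2]));
        return numerator / denominator;
    }

    // Convenience wrapper: map every line, reduce by key, compute r.
    static double correlate(List<String> lines) {
        List<double[]> pairs = new ArrayList<>();
        for (String line : lines) {
            pairs.addAll(emit(line));
        }
        return pearson(reduce(pairs));
    }

    public static void main(String[] args) {
        // The sample input from the article is perfectly linear.
        System.out.println(correlate(Arrays.asList("1,2", "3,4", "5,6")));
    }
}
```

For the sample input the six sums are n = 3, Sx = 9, Sy = 12, Sxx = 35, Syy = 56 and Sxy = 44, which gives a correlation of 1, as expected for points lying on a straight line. In a real Spark job the emit and reduce steps would presumably be expressed through Spark's own pair and reduce-by-key operations rather than the in-memory lists used here.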

Read full article from Suvir Jain | Pearson’s Correlation Coefficient using Apache Spark and Map Reduce
