All About Machine Learning: Suvir Jain | Pearson's Correlation Coefficient using Apache Spark and Map Reduce

Saturday, November 22, 2014

Suvir Jain | Pearson's Correlation Coefficient using Apache Spark and Map Reduce

Pearson’s Correlation Coefficient using Apache Spark and Map Reduce

Sample Input

Two variables, X and Y, with values in comma-separated form, one record per line:

1,2
3,4
5,6
…

The code is general enough to handle any given column, but we will keep it simple and assume that the file contains only two columns, as above.

Algorithm

For each line in the file, parse the two numbers as Java Double values and emit the following (Key,Value) pairs:

(0,1) – To count the number of records
(1,X) – Value of X itself
(2,Y) – Value of Y itself
(3,X^2) – Square of X
(4,Y^2) – Square of Y
(5,X*Y) – Product of X and Y. Summed over all records, this gives the dot product of X and Y.

A single pass of the Spark mapper finishes all the heavy lifting in one awesome O(n) operation! Next, our reducers add up the values for each key, and we are almost done: the six totals are all that the correlation formula needs.

Mapper and Reducer Functions

The Mapper

In Spark style, the Mapper is a static nested class.
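The steps above can be sketched in plain Java, without a Spark dependency. This is not the author's code: the class name PearsonSketch and the methods mapLine, reduceByKey, and pearson are hypothetical names chosen for illustration, and an in-memory list of lines stands in for the input file. The final step assumes the standard sums form of Pearson's coefficient, r = (n*Sxy - Sx*Sy) / (sqrt(n*Sxx - Sx^2) * sqrt(n*Syy - Sy^2)), where the S terms are the totals for keys 1 to 5 and n is the total for key 0.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration class; plain Java collections mimic map and reduce-by-key.
class PearsonSketch {

    // Mapper step: turn one "x,y" line into the six (key, value) pairs.
    static List<Map.Entry<Integer, Double>> mapLine(String line) {
        String[] parts = line.split(",");
        double x = Double.parseDouble(parts[0].trim());
        double y = Double.parseDouble(parts[1].trim());

        List<Map.Entry<Integer, Double>> pairs = new ArrayList<>();
        pairs.add(new AbstractMap.SimpleEntry<>(0, 1.0));   // record count
        pairs.add(new AbstractMap.SimpleEntry<>(1, x));     // X
        pairs.add(new AbstractMap.SimpleEntry<>(2, y));     // Y
        pairs.add(new AbstractMap.SimpleEntry<>(3, x * x)); // X^2
        pairs.add(new AbstractMap.SimpleEntry<>(4, y * y)); // Y^2
        pairs.add(new AbstractMap.SimpleEntry<>(5, x * y)); // X*Y
        return pairs;
    }

    // Reducer step: add up the values for each key.
    static Map<Integer, Double> reduceByKey(List<Map.Entry<Integer, Double>> pairs) {
        Map<Integer, Double> sums = new TreeMap<>();
        for (Map.Entry<Integer, Double> pair : pairs) {
            sums.merge(pair.getKey(), pair.getValue(), Double::sum);
        }
        return sums;
    }

    // Combine the six totals into Pearson's r (sums form, assumed here).
    static double pearson(List<String> lines) {
        List<Map.Entry<Integer, Double>> all = new ArrayList<>();
        for (String line : lines) {
            all.addAll(mapLine(line));
        }
        Map<Integer, Double> s = reduceByKey(all);

        double n = s.get(0);
        double sumX = s.get(1), sumY = s.get(2);
        double sumXX = s.get(3), sumYY = s.get(4), sumXY = s.get(5);

        double numerator = n * sumXY - sumX * sumY;
        double denominator = Math.sqrt(n * sumXX - sumX * sumX)
                           * Math.sqrt(n * sumYY - sumY * sumY);
        // A constant column makes the denominator zero, so the result is NaN.
        return numerator / denominator;
    }

    public static void main(String[] args) {
        // The sample input: Y = X + 1, so r should come out very close to 1.
        System.out.println(pearson(Arrays.asList("1,2", "3,4", "5,6")));
    }
}
```

In a real Spark job the same two steps would presumably be expressed as a flat-map over the input lines followed by a reduce-by-key with addition; the sketch keeps everything in memory so that it runs with the JDK alone.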

Read full article from Suvir Jain | Pearson’s Correlation Coefficient using Apache Spark and Map Reduce
