Implement a binary tree where each node carries an integer, and implement preorder, inorder, postorder and level-order traversal. Use those traversals to output the following tree:
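The tree diagram itself is not reproduced in this excerpt, but as a minimal, illustrative Scala sketch of the task (my own code, not taken from the Rosetta Code page; the Node/Empty names are my choices):

// Minimal binary tree with the four traversals (illustrative sketch).
sealed trait Tree
case object Empty extends Tree
case class Node(value: Int, left: Tree = Empty, right: Tree = Empty) extends Tree

def preorder(t: Tree): List[Int] = t match {
  case Empty         => Nil
  case Node(v, l, r) => v :: preorder(l) ::: preorder(r)
}

def inorder(t: Tree): List[Int] = t match {
  case Empty         => Nil
  case Node(v, l, r) => inorder(l) ::: v :: inorder(r)
}

def postorder(t: Tree): List[Int] = t match {
  case Empty         => Nil
  case Node(v, l, r) => postorder(l) ::: postorder(r) ::: List(v)
}

def levelOrder(t: Tree): List[Int] = {
  // Breadth-first traversal using an explicit queue of subtrees.
  @annotation.tailrec
  def go(queue: List[Tree], acc: List[Int]): List[Int] = queue match {
    case Nil                   => acc.reverse
    case Empty :: rest         => go(rest, acc)
    case Node(v, l, r) :: rest => go(rest :+ l :+ r, v :: acc)
  }
  go(List(t), Nil)
}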
Read full article from Tree traversal - Rosetta Code
A good POS tagger in about 200 lines of Python « Computational Linguistics
POS tagging is a “supervised learning problem”. You’re given a table of data, and you’re told that the values in the last column will be missing during run-time. You have to find correlations from the other columns to predict that value.
So for us, the missing column will be “part of speech at word i”. The predictor columns (features) will be things like “part of speech at word i-1”, “last three letters of word at i+1”, and so on.
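The article implements this in about 200 lines of Python; purely to illustrate the kind of contextual features described above, here is a tiny Scala sketch (the feature names and padding token are my own, not from the post):

// Illustrative context features for tagging word i in a sentence (sketch, not the article's code).
def features(words: Vector[String], prevTag: String, i: Int): Map[String, String] = {
  def word(j: Int): String =
    if (j < 0 || j >= words.length) "<PAD>" else words(j).toLowerCase
  Map(
    "prev_tag"     -> prevTag,                   // part of speech at word i-1
    "word"         -> word(i),
    "suffix3"      -> word(i).takeRight(3),
    "suffix3_next" -> word(i + 1).takeRight(3)   // last three letters of word at i+1
  )
}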
Read full article from A good POS tagger in about 200 lines of Python « Computational Linguistics
// takeOrdered: the n smallest elements according to the implicit ordering
val b = sc.parallelize(List("dog", "cat", "ape", "salmon", "gnu"), 2)
b.takeOrdered(2)        // Array(ape, cat)

// subtract: elements of a that do not appear in b; toDebugString prints the RDD lineage
val a = sc.parallelize(1 to 9, 3)
val b = sc.parallelize(1 to 3, 3)
val c = a.subtract(b)
c.toDebugString

// ++ (union): concatenates two RDDs
val a = sc.parallelize(1 to 3, 1)
val b = sc.parallelize(5 to 7, 1)
(a ++ b).collect        // Array(1, 2, 3, 5, 6, 7)
scala - Confusing "diverging implicit expansion" error when using "sortBy" - Stack Overflow
If you look at the type signature of toIndexedSeq on List you'll see it takes a type parameter B, which can be any supertype of A:

def toIndexedSeq [B >: A] : IndexedSeq[B]

If you leave out that type parameter then the compiler essentially has to guess what you meant, taking the most specific type possible. You could have meant List(3,2,1).toIndexedSeq[Any], which of course can't be sorted since there's no Ordering[Any]. It seems the compiler doesn't play "guess the type parameter" until the whole expression has been checked for correct typing (maybe someone who knows something about compiler internals can expand on this).

To make it work you can either a) provide the required type parameter yourself, i.e.

List(3,2,1).toIndexedSeq[Int].sortBy(x=>x)

or b) separate the expression into two so that the type parameter has to be inferred before calling sortBy:

val lst = List(3,2,1).toIndexedSeq; lst.sortBy(x=>x)

Edit:
It's probably because sortBy takes a Function1 argument. The signature of sortBy is

def sortBy [B] (f: (A) => B)(implicit ord: Ordering[B]): IndexedSeq[A]

whereas sorted (which you should use instead!) works fine with List(3,2,1).toIndexedSeq.sorted:

def sorted [B >: A] (implicit ord: Ordering[B]): IndexedSeq[A]

I'm not sure exactly why Function1 causes this problem and I'm going to bed so can't think about it further...
Read full article from scala - Confusing "diverging implicit expansion" error when using "sortBy" - Stack Overflow
Scala Tutorial – Maps, Sets, groupBy, Options, flatten, flatMap | Java Code Geeks
November 26, 2014 4:11 pm

Preface: This is part 7 of tutorials for first-time programmers getting into Scala. Other posts are on this blog, and you can get links to those and other resources on the links page of the Computational Linguistics course I'm creating these for. Additionally you can find this and other tutorial series on the JCG Java Tutorials page.

Lists (and other sequence data structures, like Ranges and Arrays) allow you to group collections of objects in an ordered manner: you can access elements of a list by indexing their position in the list, or iterate over the list elements, one by one, using for expressions and sequence functions like map, filter, reduce and fold. Another important kind of data structure is the associative array, which you'll come to know in Scala as a Map. (Yes, this has the unfortunate ambiguity with the map function, but their use will be quite clear from context.)

Read full article from Scala Tutorial – Maps, Sets, groupBy, Options, flatten, flatMap | Java Code Geeks
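As a quick illustration of the Map and groupBy operations the tutorial above goes on to cover (this snippet is mine, not taken from the tutorial):

// An associative array (Map) and groupBy in the Scala REPL (illustrative).
val wordCounts: Map[String, Int] = Map("the" -> 3, "cat" -> 1, "sat" -> 1)
wordCounts("the")          // 3
wordCounts.get("dog")      // None (an Option instead of an exception)

// groupBy builds a Map from a sequence, keyed by the result of a function.
val words = List("apple", "avocado", "banana", "blueberry", "cherry")
words.groupBy(_.head)
// Map(a -> List(apple, avocado), b -> List(banana, blueberry), c -> List(cherry))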
Spark the fastest open source engine for sorting a petabyte – Databricks
October 10, 2014 | by Reynold Xin

Update November 5, 2014: Our benchmark entry has been reviewed by the benchmark committee and Spark has won the Daytona GraySort contest for 2014! Please see this new blog post for an update.

Apache Spark has seen phenomenal adoption, being widely slated as the successor to Hadoop MapReduce, and being deployed in clusters ranging from a handful to thousands of nodes. While it was clear to everybody that Spark is more efficient than MapReduce for data that fits in memory, we heard that some organizations were having trouble pushing it to large-scale datasets that could not fit in memory. Therefore, since the inception of Databricks, we have devoted much effort, together with the Spark community, to improve the stability, scalability, and performance of Spark. Spark works well for gigabytes or terabytes of data, and it should also work well for petabytes. To evaluate these improvements, we decided to participate in the Sort Benchmark.

Read full article from Spark the fastest open source engine for sorting a petabyte – Databricks
java - Producing a sorted wordcount with Spark - Code Review Stack Exchange
As an addendum I'll show how I would identify your problem in question and show you how I would do it.

Input: An input file, consisting of words.
Output: A list of the words sorted by the frequency in which they occur.
Map<String, Long> occurenceMap = Files.readAllLines(Paths.get("myFile.txt"))
        .stream()
        .flatMap(line -> Arrays.stream(line.split(" ")))
        .collect(Collectors.groupingBy(i -> i, Collectors.counting()));
List<String> sortedWords = occurenceMap.entrySet()
        .stream()
        .sorted(Comparator.comparing((Map.Entry<String, Long> entry) -> entry.getValue()).reversed())
        .map(Map.Entry::getKey)
        .collect(Collectors.toList());
This will do the following steps:
- Read all lines into a List<String> (care with large files!).
- Turn it into a Stream<String>.
- Turn that into another Stream<String> by flat mapping every String to a Stream<String>, splitting on the blanks.
- Collect the words into a Map<String, Long>, grouping by the identity (i -> i) and using Collectors.counting() as the downstream collector, such that the map-value will be its count.
- Obtain a Set<Map.Entry<String, Long>> from the map.
- Turn it into a Stream<Map.Entry<String, Long>>.
- Sort it by value in reverse order and map each entry to its key, giving a Stream<String>; you lose the frequency information here.
- Collect the result into a List<String>.

Beware that the line .sorted(Comparator.comparing((Map.Entry<String, Long> entry) -> entry.getValue()).reversed()) should really be .sorted(Comparator.comparing(Map.Entry::getValue).reversed()), but type inference is having issues with that and for some reason it will not compile.
I hope the Java 8 way can give you interesting insights.
Read full article from java - Producing a sorted wordcount with Spark - Code Review Stack Exchange
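Since the question above was actually about Spark, here is a rough equivalent of the same sorted-wordcount pipeline using Spark's Scala API (my own sketch, not from the answer; "myFile.txt" is a placeholder path):

// Sorted wordcount with Spark (illustrative sketch).
val sortedWords = sc.textFile("myFile.txt")
  .flatMap(_.split(" "))
  .map(word => (word, 1L))
  .reduceByKey(_ + _)              // (word, count) pairs
  .sortBy(_._2, ascending = false) // most frequent words first
  .map(_._1)                       // drop the counts, keeping only the order
  .collect()
  .toList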
A simple machine learning app with Spark - Chapeau
I'm currently on my way back from the first-ever Spark Summit, where I presented a talk on some of my work with the Fedora Big Data SIG to package Apache Spark and its infrastructure for Fedora. (My slides are online, but they aren't particularly useful without the talk. I'll post a link to the video when it's available, though.)

If you're interested in learning more about Spark, a great place to start is the guided exercises that the Spark team put together; simply follow their instructions to fire up an EC2 cluster with Spark installed and then work through the exercises. In one of the exercises, you'll have an opportunity to build up one of the classic Spark demos: distributed k-means clustering in about a page of code. Implementing k-means on resilient distributed datasets is an excellent introduction to key Spark concepts and idioms. With recent releases of Spark, though, machine learning can be simpler still:

Read full article from A simple machine learning app with Spark - Chapeau
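The post goes on to show how little code this takes with MLlib. As a hedged illustration of that idea (my own sketch, not the author's code; the input path and feature format are assumptions), k-means with MLlib looks roughly like this:

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Each input line is assumed to hold whitespace-separated numeric features (placeholder path).
val points = sc.textFile("hdfs:///data/points.txt")
  .map(line => Vectors.dense(line.trim.split("\\s+").map(_.toDouble)))
  .cache()

val model = KMeans.train(points, 3, 20)   // k = 3 clusters, 20 iterations
model.clusterCenters.foreach(println)     // the learned centroids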
Spark SQL: Parquet | InfoObjects
Apache Parquet as a file format has garnered significant attention recently. Let's say you have a table with 100 columns, and most of the time you are going to access only 3-10 of them. In a row-oriented format, all columns are scanned whether you need them or not. Apache Parquet saves data in a column-oriented fashion, so if you need 3 columns, only the data of those 3 columns gets loaded. Another benefit is that since all data in a given column is of the same datatype (obviously), compression quality is far superior. In this recipe we'll learn how to save a table in Parquet format and then how to load it back. Let's use the Person table we created in another recipe:

first_name  last_name  gender
Barack      Obama      M
Bill        Clinton    M
Hillary     Clinton    F

Let's load it in Spark SQL:

scala> val hc = new org.apache.spark.sql.hive.HiveContext(sc)
scala> import hc._
scala> case class Person(firstName: String, lastName: String, gender: String)
scala> val person = sc.textFile("person").map(_.split("\t")).map(p => Person(p(0),p(1),

Read full article from Spark SQL: Parquet | InfoObjects
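The excerpt is cut off before the actual save and load. Assuming `person` ends up as an RDD of the Person case class above, the rest of the recipe would look roughly like this on the Spark 1.x API (a hedged sketch with placeholder paths and table names, not the article's code):

// Hedged sketch: with `import hc._` in scope, an RDD of case-class instances
// converts implicitly to a SchemaRDD in Spark 1.x.
person.saveAsParquetFile("person.parquet")      // write the table in Parquet format

val loaded = hc.parquetFile("person.parquet")   // read it back as a SchemaRDD
loaded.registerTempTable("person_parquet")
hc.sql("SELECT firstName FROM person_parquet WHERE gender = 'M'").collect().foreach(println)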
Pearson product-moment correlation coefficient - Wikipedia, the free encyclopedia
In statistics, the Pearson product-moment correlation coefficient (/ˈpɪərsɨn/), sometimes referred to as the PPMCC or PCC or Pearson's r, is a measure of the linear correlation (dependence) between two variables X and Y, giving a value between +1 and −1 inclusive, where 1 is total positive correlation, 0 is no correlation, and −1 is total negative correlation. It is widely used in the sciences as a measure of the degree of linear dependence between two variables. It was developed by Karl Pearson from a related idea introduced by Francis Galton in the 1880s. [1] [2] [3]

[Figure: examples of scatter diagrams with different values of the correlation coefficient (ρ) — several sets of (x, y) points, with the correlation coefficient of x and y for each set. Note that the correlation reflects the non-linearity and direction of a linear relationship (top row), but not the slope of that relationship (middle), nor many aspects of nonlinear relationships (bottom). N.B.:

Read full article from Pearson product-moment correlation coefficient - Wikipedia, the free encyclopedia
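For reference (this formula is standard but not part of the excerpt above), the sample correlation coefficient r for paired observations (x_i, y_i) with sample means \bar{x}, \bar{y} is:

r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}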
Pearson Correlation: Definition and Easy Steps for Use
Watch the video on how to find Pearson's Correlation Coefficient, or read below for an explanation of what it is.

What is Pearson Correlation? Correlation between sets of data is a measure of how well they are related. The most common measure of correlation in stats is the Pearson Correlation. The full name is the Pearson Product Moment Correlation, or PPMC. It shows the linear relationship between two sets of data. In simple terms, it answers the question: can I draw a line graph to represent the data? Two letters are used to represent the Pearson correlation: the Greek letter rho (ρ) for a population and the letter "r" for a sample. The Pearson correlation coefficient can be calculated by hand or on a graphing calculator such as the TI-89.

What are the Possible Values for the Pearson Correlation? The results will be between -1 and 1. You will very rarely see 0, -1 or 1. You'll get a number somewhere in between those values. The closer the value of r gets to zero,

Read full article from Pearson Correlation: Definition and Easy Steps for Use
def foldLeft[B](z: B)(f: (B, A) => B): B

list.foldLeft(List[Int]())((b, a) => a :: b)   // prepends each element to the accumulator, i.e. reverses the list
SparkNotes: Graphing Data: Histograms
Frequency Distribution Tables

A frequency distribution table is a table that shows how often a data point or a group of data points appears in a given data set. To make a frequency distribution table, first divide the numbers over which the data ranges into intervals of equal length. Then count how many data points fall into each interval. If there are many values, it is sometimes useful to go through all the data points in order and make a tally mark in the interval that each point falls into. Then all the tally marks can be counted to see how many data points fall into each interval. The "tally system" ensures that no points will be missed.

Example: The following is a list of prices (in dollars) of birthday cards found in various drug stores:

1.45 2.20 0.75 1.23 1.25 1.25 3.09 1.99 2.00 0.78 1.32 2.25 3.15 3.85 0.52 0.99 1.38 1.75 1.22 1.75

Make a frequency distribution table for this data. We omit the units (dollars) while calculating. The values go from 0.52 to 3.85,

Read full article from SparkNotes: Graphing Data: Histograms
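A small Scala sketch of the same bucketing idea, using the card prices above and an assumed interval width of 0.50 (my choice of width, not the article's):

// Build a simple frequency distribution over equal-width intervals (sketch).
val prices = List(1.45, 2.20, 0.75, 1.23, 1.25, 1.25, 3.09, 1.99, 2.00, 0.78,
                  1.32, 2.25, 3.15, 3.85, 0.52, 0.99, 1.38, 1.75, 1.22, 1.75)
val width = 0.50

// Map each price to the lower bound of its interval, then count per interval.
val freq = prices
  .groupBy(p => math.floor(p / width) * width)
  .map { case (lo, ps) => (lo, ps.size) }

freq.toSeq.sortBy(_._1).foreach { case (lo, n) =>
  println(f"[$lo%.2f, ${lo + width}%.2f): $n")
}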
Read full article from Statistics With Spark
Estimating Financial Risk with Apache Spark | Cloudera Engineering Blog
Learn how Spark facilitates the calculation of computationally intensive statistics such as VaR via the Monte Carlo method.

Under reasonable circumstances, how much money can you expect to lose? The financial statistic value at risk (VaR) seeks to answer this question. Since its development on Wall Street soon after the stock market crash of 1987, VaR has been widely adopted across the financial services industry. Some organizations report the statistic to satisfy regulations, some use it to better understand the risk characteristics of large portfolios, and others compute it before executing trades to help make informed and immediate decisions.

For reasons that we will delve into later, reaching an accurate estimate of VaR can be a computationally expensive process. The most advanced approaches involve Monte Carlo simulations, a class of algorithms that seek to compute quantities through repeated random sampling.

Read full article from Estimating Financial Risk with Apache Spark | Cloudera Engineering Blog
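To make the Monte Carlo idea concrete, here is a toy single-asset sketch (mine, not from the Cloudera post; the normal-returns model, portfolio value, and parameters are all assumptions): simulate many returns, convert them to losses, and take a quantile.

import scala.util.Random

// Toy Monte Carlo VaR: simulate portfolio returns and take a loss quantile (sketch).
val rng       = new Random(42)
val trials    = 100000
val portfolio = 1000000.0   // assumed portfolio value
val mu        = 0.0005      // assumed daily mean return
val sigma     = 0.01        // assumed daily return volatility

// Simulated losses under a (very simplistic) normal-returns model, sorted ascending.
val losses = Array.fill(trials)(-(mu + sigma * rng.nextGaussian()) * portfolio).sorted

// 95% VaR: the loss exceeded in only 5% of simulated outcomes.
val var95 = losses((0.95 * trials).toInt)
println(f"Estimated 1-day 95%% VaR: $$${var95}%.2f")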
Suvir Jain | Pearson’s Correlation Coefficient using Apache Spark and Map Reduce
Sample Input: 2 variables X and Y with values in comma-separated form like:
1,2
3,4
5,6
…

The code is general enough to handle any given column, but we will keep it simple and assume that the file contains only 2 columns like above.

Algorithm: For each line in the file, parse the two numbers as Java Double type and emit the following (Key, Value) pairs:
- (0, 1) – to count the number of records
- (1, X) – the value of X itself
- (2, Y) – the value of Y itself
- (3, X^2) – the square of X
- (4, Y^2) – the square of Y
- (5, X*Y) – the product of X and Y; this will help us compute the dot product of X and Y

A single pass of the Spark mapper will finish all the heavy lifting in one awesome O(n) operation! Next, our reducers will add up the values for each key and we will be almost done.

Mapper and Reducer Functions – The Mapper: In Spark style, the Mapper is a static nested class.

Read full article from Suvir Jain | Pearson's Correlation Coefficient using Apache Spark and Map Reduce
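The original post uses Java with nested Mapper/Reducer classes; as an illustration of the same key/value scheme in Scala (my own sketch, with a placeholder input path), the whole computation can be expressed as:

// Illustrative Scala version of the key/value scheme above (placeholder input path).
val pairs = sc.textFile("xy.csv").map { line =>
  val Array(x, y) = line.split(",").map(_.toDouble)
  (x, y)
}

// Emit the six (key, value) pairs per record and sum them by key.
val sums = pairs.flatMap { case (x, y) =>
  Seq((0, 1.0), (1, x), (2, y), (3, x * x), (4, y * y), (5, x * y))
}.reduceByKey(_ + _).collectAsMap()

val n   = sums(0)
val sx  = sums(1); val sy  = sums(2)
val sxx = sums(3); val syy = sums(4); val sxy = sums(5)

// Pearson's r from the accumulated sums.
val r = (n * sxy - sx * sy) /
        (math.sqrt(n * sxx - sx * sx) * math.sqrt(n * syy - sy * sy))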
Apache Spark User List - How to sort an RDD ?
Well it turns out you can use the takeOrdered function and create your …

Read full article from Apache Spark User List - How to sort an RDD?
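The excerpt is cut off, but the two usual approaches it is pointing at are takeOrdered with a custom Ordering and RDD.sortBy. A hedged sketch of both, with made-up data:

// Sorting an RDD: takeOrdered with a custom Ordering, and sortBy (illustrative).
val scores = sc.parallelize(Seq(("alice", 42), ("bob", 7), ("carol", 99)))

// Top 2 entries by score, descending, without sorting the whole RDD.
val top2 = scores.takeOrdered(2)(Ordering.by[(String, Int), Int](_._2).reverse)

// A fully sorted RDD by score, descending.
val sortedRdd = scores.sortBy(_._2, ascending = false)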
Spark: Parse CSV file and group by column value | Java Code Geeks
November 22, 2014 7:42 pm

I've found myself working with large CSV files quite frequently, and realising that my existing toolset didn't let me explore them quickly, I thought I'd spend a bit of time looking at Spark to see if it could help.

$ ls -alh ~/Downloads/Crimes_-_2001_to_present.csv
-rw-r--r--@ 1 markneedham staff 1.0G 16 Nov 12:14 /Users/markneedham/Downloads/Crimes_-_2001_to_present.csv

$ wc -l ~/Downloads/Crimes_-_2001_to_present.csv
4193441 /Users/markneedham/Downloads/Crimes_-_2001_to_present.csv

We can get a rough idea of the contents of the file by looking at the first row along with the header:

$ head -n 2 ~/Downloads/Crimes_-_2001_to_present.csv
ID,Case Number,Date,Block,IUCR,Primary Type,Description,Location Description,Arrest,Domestic,Beat,District,Ward,Community Area,FBI Code,X Coordinate,Y Coordinate,Year,Updated On,Latitude,Longitude,Location
9464711,HX114160,01/14/2014 05:00:00 AM,028XX E 80TH ST,0560,

Read full article from Spark: Parse CSV file and group by column value | Java Code Geeks
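As a hedged sketch of the "group by column value" part (my code, not the post's; the column index for Primary Type is inferred from the header above, and the naive split(",") ignores quoted fields containing commas):

// Count crimes per "Primary Type" (6th column, index 5) - illustrative only.
// Note: split(",") is naive and breaks on quoted fields that contain commas.
val crimeFile = sc.textFile("Crimes_-_2001_to_present.csv")
val header    = crimeFile.first()

val countsByType = crimeFile
  .filter(_ != header)      // drop the header row
  .map(_.split(",")(5))     // the Primary Type column
  .countByValue()           // Map[String, Long] of counts

countsByType.toSeq.sortWith(_._2 > _._2).take(10).foreach(println)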
Spark Shell Examples – Altiscale Docs
Copy Test Data to HDFS: the following will upload all of our example data to HDFS under your current login username. These include GraphX PageRank's datasets, and MLlib decision tree, logistic regression, k-means, linear regression, SVM, and naive Bayes data.

pushd `pwd`
cd /opt/spark/

Second, launch the spark-shell command again with the following command:

SPARK_SUBMIT_OPTS="-XX:MaxPermSize=256m" ./bin/spark-shell --master yarn --queue research --driver-class-path $(find /opt/hadoop/share/hadoop/mapreduce/lib/hadoop-lzo-* | head -n 1)

Run the following Scala statements in the Scala REPL shell (examples cover SVM, Logistic Regression, Naive Bayes, KMeans, GraphX PageRank, and Decision Tree classification and regression/prediction):

// CLASSIFICATION
import org.apache.spark.SparkContext
import org.apache.spark.mllib.tree.DecisionTree
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.tree.configuration.Algo._
import org.apache.spark.mllib.tree.

Read full article from Spark Shell Examples – Altiscale Docs
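The excerpt truncates the decision-tree example; for orientation, a minimal MLlib decision-tree classification run (my own sketch with a placeholder LIBSVM-format dataset path, not Altiscale's exact example) looks roughly like this:

import org.apache.spark.mllib.tree.DecisionTree
import org.apache.spark.mllib.util.MLUtils

// Load a LIBSVM-format dataset of LabeledPoints (placeholder path).
val data = MLUtils.loadLibSVMFile(sc, "data/sample_libsvm_data.txt")
val Array(training, test) = data.randomSplit(Array(0.7, 0.3))

// Train a classification tree: 2 classes, no categorical features, gini impurity.
val model = DecisionTree.trainClassifier(training, 2, Map[Int, Int](), "gini", maxDepth = 5, maxBins = 32)

// Evaluate accuracy on the held-out split.
val accuracy = test.map(p => (model.predict(p.features), p.label))
  .filter { case (pred, label) => pred == label }
  .count().toDouble / test.count()
println(s"Test accuracy: $accuracy")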