Tuesday, November 18, 2014

Data Exploration Using Spark
Interactive Analysis

Let's now use Spark to do some order statistics on the data set. First, launch the Spark shell:

/root/spark/bin/spark-shell   (Scala)
/root/spark/bin/pyspark       (Python)

The prompt should appear within a few seconds. The rest of this walkthrough uses the Scala shell, where a SparkContext is already bound to the variable sc.


Warm up by creating an RDD (Resilient Distributed Dataset) named pagecounts from the input files in HDFS:

val pagecounts = sc.textFile("/wiki/pagecounts")
You can use the take operation of an RDD to get the first K records. Here, K = 10.

pagecounts.take(10)

Because take returns an array of raw records, the output is hard to read. To print each record on its own line instead, pass println to foreach:

pagecounts.take(10).foreach(println)
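
Each record is a plain space-delimited string. As a minimal sketch of pulling one apart, assuming the field layout implied by the filter used later in this section (where the project code, e.g. "en", is the second space-delimited field):

// Inspect the fields of a single record (layout assumed, not guaranteed).
val first = pagecounts.first     // the first record, as a raw String
val fields = first.split(" ")    // fields(1) should hold the project code, e.g. "en"
println(fields.mkString(" | "))  // print the fields visually separated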

Let's see how many records are in this data set in total:

pagecounts.count

This should launch 177 Spark tasks on the Spark cluster. If you look closely at the terminal, the console log is pretty chatty and tells you the progress of the tasks. Because we are reading 20G of data from HDFS, this task is I/O bound and can take a while to scan through all the data (2 - 3 mins).
While the count is running, you can monitor the job's progress in the Spark application web UI. To view it, browse to

http://<master_node_hostname>:4040
The links in this interface allow you to track the job’s progress and various metrics about its execution, including task durations and cache statistics.
In addition, the Spark Standalone cluster status web interface displays information that pertains to the entire Spark cluster. To view this UI, browse to
http://<master_node_hostname>:8080

Let's derive an RDD containing only English pages from pagecounts. This can be done by applying a filter function to pagecounts: for each record, split it on the field delimiter (i.e., a space), take the second field, and compare it with the string "en".
val enPages = pagecounts.filter(_.split(" ")(1) == "en").cache
enPages: spark.RDD[String] = FilteredRDD[2] at filter at <console>:14

When you type this command into the Spark shell, Spark defines the RDD, but because of lazy evaluation, no computation is done yet. The next time any action is invoked on enPages, Spark will cache the data set in memory across the 5 slaves in your cluster.

How many records are there for English pages?

enPages.count

The first time this command is run, similar to the last count we did, it will take 2 - 3 minutes while Spark scans through the entire data set on disk. But since enPages was marked as "cached" in the previous step, if you run count on the same RDD again, it should return an order of magnitude faster.
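
To make the lazy-evaluation and caching behavior concrete, here is a minimal sketch of the whole sequence in the Scala shell (the slow/fast comments describe expected behavior, not measured timings):

// Transformation only: defines the RDD and marks it for caching; nothing runs yet.
val enPages = pagecounts.filter(_.split(" ")(1) == "en").cache
// First action: scans the full data set from disk and populates the in-memory cache (slow).
enPages.count
// Second action: served from the cached partitions in memory (an order of magnitude faster).
enPages.count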

Read full article from Data Exploration Using Spark
