Sunday, November 23, 2014

Spark SQL: Parquet | InfoObjects

Apache Parquet as a file format has garnered significant attention recently. Say you have a table with 100 columns, and most of the time you access only 3-10 of them. In a row-oriented format, every column is scanned whether you need it or not. Apache Parquet stores data in a column-oriented fashion, so if you need 3 columns, only the data of those 3 columns gets loaded. Another benefit is that since all data in a given column has the same datatype (obviously), compression quality is far superior.

In this recipe we'll learn how to save a table in Parquet format and then how to load it back. Let's use the Person table we created in another recipe:

first_name   last_name   gender
Barack       Obama       M
Bill         Clinton     M
Hillary      Clinton     F

Let's load it in Spark SQL:

scala> val hc = new org.apache.spark.sql.hive.HiveContext(sc)
scala> import hc._
scala> case class Person(firstName: String, lastName: String, gender: String)
scala> val person = sc.textFile("person").map(_.split("\t")).map(p => Person(p(0), p(1), p(2)))
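From here the recipe's remaining two steps are saving the table as Parquet and loading it back. A minimal sketch of those steps, assuming the Spark 1.x SchemaRDD API of the time (import hc._ supplies the implicit case-class-RDD-to-SchemaRDD conversion; saveAsParquetFile and parquetFile were that era's save/load calls, and the path "person.parquet" is illustrative):

scala> // the implicit conversion turns the RDD[Person] into a SchemaRDD with columns firstName, lastName, gender
scala> person.saveAsParquetFile("person.parquet")

scala> // load the Parquet file back; the schema is recovered from the Parquet metadata
scala> val personParquet = hc.parquetFile("person.parquet")

scala> // register it as a temporary table so it can be queried with SQL
scala> personParquet.registerTempTable("person_parquet")

scala> hc.sql("SELECT firstName FROM person_parquet WHERE gender = 'F'").collect()

Because the query touches only the firstName and gender columns, Parquet reads just those columns from disk, which is exactly the columnar benefit described above.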

Read full article from Spark SQL: Parquet | InfoObjects
