Implement a binary tree where each node carries an integer, and implement preorder, inorder, postorder and level-order traversal. Use those traversals to output the following tree:
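The tree diagram itself is not reproduced in this excerpt, but as a minimal, illustrative Scala sketch of the task (my own code, not taken from the Rosetta Code page; the Node/Empty names are my choices):

// Minimal binary tree with the four traversals (illustrative sketch).
sealed trait Tree
case object Empty extends Tree
case class Node(value: Int, left: Tree = Empty, right: Tree = Empty) extends Tree

def preorder(t: Tree): List[Int] = t match {
  case Empty         => Nil
  case Node(v, l, r) => v :: preorder(l) ::: preorder(r)
}

def inorder(t: Tree): List[Int] = t match {
  case Empty         => Nil
  case Node(v, l, r) => inorder(l) ::: v :: inorder(r)
}

def postorder(t: Tree): List[Int] = t match {
  case Empty         => Nil
  case Node(v, l, r) => postorder(l) ::: postorder(r) ::: List(v)
}

def levelOrder(t: Tree): List[Int] = {
  // Breadth-first traversal using an explicit queue of subtrees.
  @annotation.tailrec
  def go(queue: List[Tree], acc: List[Int]): List[Int] = queue match {
    case Nil                   => acc.reverse
    case Empty :: rest         => go(rest, acc)
    case Node(v, l, r) :: rest => go(rest :+ l :+ r, v :: acc)
  }
  go(List(t), Nil)
}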
Read full article from Tree traversal - Rosetta Code
A good POS tagger in about 200 lines of Python « Computational Linguistics
POS tagging is a “supervised learning problem”. You’re given a table of data, and you’re told that the values in the last column will be missing during run-time. You have to find correlations from the other columns to predict that value.
So for us, the missing column will be “part of speech at word i”. The predictor columns (features) will be things like “part of speech at word i-1”, “last three letters of word at i+1”, and so on.
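The article implements this in about 200 lines of Python; purely to illustrate the kind of contextual features described above, here is a tiny Scala sketch (the feature names and padding token are my own, not from the post):

// Illustrative context features for tagging word i in a sentence (sketch, not the article's code).
def features(words: Vector[String], prevTag: String, i: Int): Map[String, String] = {
  def word(j: Int): String =
    if (j < 0 || j >= words.length) "<PAD>" else words(j).toLowerCase
  Map(
    "prev_tag"     -> prevTag,                   // part of speech at word i-1
    "word"         -> word(i),
    "suffix3"      -> word(i).takeRight(3),
    "suffix3_next" -> word(i + 1).takeRight(3)   // last three letters of word at i+1
  )
}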
Read full article from A good POS tagger in about 200 lines of Python « Computational Linguistics
// takeOrdered: the n smallest elements according to the implicit ordering
val b = sc.parallelize(List("dog", "cat", "ape", "salmon", "gnu"), 2)
b.takeOrdered(2)        // Array(ape, cat)

// subtract: elements of a that do not appear in b; toDebugString prints the RDD lineage
val a = sc.parallelize(1 to 9, 3)
val b = sc.parallelize(1 to 3, 3)
val c = a.subtract(b)
c.toDebugString

// ++ (union): concatenates two RDDs
val a = sc.parallelize(1 to 3, 1)
val b = sc.parallelize(5 to 7, 1)
(a ++ b).collect        // Array(1, 2, 3, 5, 6, 7)
scala - Confusing "diverging implicit expansion" error when using "sortBy" - Stack Overflow
If you look at the type signature of toIndexedSeq on List you'll see it takes a type parameter B, which can be any supertype of A:

def toIndexedSeq [B >: A] : IndexedSeq[B]

If you leave out that type parameter then the compiler essentially has to guess what you meant, taking the most specific type possible. You could have meant List(3,2,1).toIndexedSeq[Any], which of course can't be sorted since there's no Ordering[Any]. It seems the compiler doesn't play "guess the type parameter" until the whole expression has been checked for correct typing (maybe someone who knows something about compiler internals can expand on this).

To make it work you can either a) provide the required type parameter yourself, i.e.

List(3,2,1).toIndexedSeq[Int].sortBy(x=>x)

or b) separate the expression into two so that the type parameter has to be inferred before calling sortBy:

val lst = List(3,2,1).toIndexedSeq; lst.sortBy(x=>x)

Edit:
It's probably because sortBy takes a Function1 argument. The signature of sortBy is

def sortBy [B] (f: (A) => B)(implicit ord: Ordering[B]): IndexedSeq[A]

whereas sorted (which you should use instead!) works fine with List(3,2,1).toIndexedSeq.sorted:

def sorted [B >: A] (implicit ord: Ordering[B]): IndexedSeq[A]

I'm not sure exactly why Function1 causes this problem and I'm going to bed so can't think about it further...
Read full article from scala - Confusing "diverging implicit expansion" error when using "sortBy" - Stack Overflow
Scala Tutorial – Maps, Sets, groupBy, Options, flatten, flatMap | Java Code Geeks
November 26, 2014 4:11 pm

Preface: This is part 7 of tutorials for first-time programmers getting into Scala. Other posts are on this blog, and you can get links to those and other resources on the links page of the Computational Linguistics course I'm creating these for. Additionally you can find this and other tutorial series on the JCG Java Tutorials page.

Lists (and other sequence data structures, like Ranges and Arrays) allow you to group collections of objects in an ordered manner: you can access elements of a list by indexing their position in the list, or iterate over the list elements, one by one, using for expressions and sequence functions like map, filter, reduce and fold. Another important kind of data structure is the associative array, which you'll come to know in Scala as a Map. (Yes, this has the unfortunate ambiguity with the map function, but their use will be quite clear from context.)

Read full article from Scala Tutorial – Maps, Sets, groupBy, Options, flatten, flatMap | Java Code Geeks
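As a quick illustration of the Map and groupBy operations the tutorial above goes on to cover (this snippet is mine, not taken from the tutorial):

// An associative array (Map) and groupBy in the Scala REPL (illustrative).
val wordCounts: Map[String, Int] = Map("the" -> 3, "cat" -> 1, "sat" -> 1)
wordCounts("the")          // 3
wordCounts.get("dog")      // None (an Option instead of an exception)

// groupBy builds a Map from a sequence, keyed by the result of a function.
val words = List("apple", "avocado", "banana", "blueberry", "cherry")
words.groupBy(_.head)
// Map(a -> List(apple, avocado), b -> List(banana, blueberry), c -> List(cherry))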
Spark the fastest open source engine for sorting a petabyte – Databricks
October 10, 2014 | by Reynold Xin

Update November 5, 2014: Our benchmark entry has been reviewed by the benchmark committee and Spark has won the Daytona GraySort contest for 2014! Please see this new blog post for an update.

Apache Spark has seen phenomenal adoption, being widely slated as the successor to Hadoop MapReduce, and being deployed in clusters ranging from a handful to thousands of nodes. While it was clear to everybody that Spark is more efficient than MapReduce for data that fits in memory, we heard that some organizations were having trouble pushing it to large-scale datasets that could not fit in memory. Therefore, since the inception of Databricks, we have devoted much effort, together with the Spark community, to improve the stability, scalability, and performance of Spark. Spark works well for gigabytes or terabytes of data, and it should also work well for petabytes. To evaluate these improvements, we decided to participate in the Sort Benchmark.

Read full article from Spark the fastest open source engine for sorting a petabyte – Databricks
java - Producing a sorted wordcount with Spark - Code Review Stack Exchange
As an addendum I'll show how I would identify your problem in question and show you how I would do it.

Input: An input file, consisting of words.
Output: A list of the words sorted by the frequency in which they occur.
Map<String, Long> occurenceMap = Files.readAllLines(Paths.get("myFile.txt"))
        .stream()
        .flatMap(line -> Arrays.stream(line.split(" ")))
        .collect(Collectors.groupingBy(i -> i, Collectors.counting()));
List<String> sortedWords = occurenceMap.entrySet()
        .stream()
        .sorted(Comparator.comparing((Map.Entry<String, Long> entry) -> entry.getValue()).reversed())
        .map(Map.Entry::getKey)
        .collect(Collectors.toList());
This will do the following steps:
- Read all lines into a List<String> (care with large files!).
- Turn it into a Stream<String>.
- Turn that into another Stream<String> by flat mapping every String to a Stream<String>, splitting on the blanks.
- Collect the words into a Map<String, Long>, grouping by the identity (i -> i) and using Collectors.counting() as the downstream collector, such that the map-value will be its count.
- Obtain a Set<Map.Entry<String, Long>> from the map.
- Turn it into a Stream<Map.Entry<String, Long>>.
- Sort it by value in reverse order and map each entry to its key, giving a Stream<String>; you lose the frequency information here.
- Collect the result into a List<String>.

Beware that the line .sorted(Comparator.comparing((Map.Entry<String, Long> entry) -> entry.getValue()).reversed()) should really be .sorted(Comparator.comparing(Map.Entry::getValue).reversed()), but type inference is having issues with that and for some reason it will not compile.
I hope the Java 8 way can give you interesting insights.
Read full article from java - Producing a sorted wordcount with Spark - Code Review Stack Exchange
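Since the question above was actually about Spark, here is a rough equivalent of the same sorted-wordcount pipeline using Spark's Scala API (my own sketch, not from the answer; "myFile.txt" is a placeholder path):

// Sorted wordcount with Spark (illustrative sketch).
val sortedWords = sc.textFile("myFile.txt")
  .flatMap(_.split(" "))
  .map(word => (word, 1L))
  .reduceByKey(_ + _)              // (word, count) pairs
  .sortBy(_._2, ascending = false) // most frequent words first
  .map(_._1)                       // drop the counts, keeping only the order
  .collect()
  .toList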
A simple machine learning app with Spark - Chapeau
I'm currently on my way back from the first-ever Spark Summit, where I presented a talk on some of my work with the Fedora Big Data SIG to package Apache Spark and its infrastructure for Fedora. (My slides are online, but they aren't particularly useful without the talk. I'll post a link to the video when it's available, though.)

If you're interested in learning more about Spark, a great place to start is the guided exercises that the Spark team put together; simply follow their instructions to fire up an EC2 cluster with Spark installed and then work through the exercises. In one of the exercises, you'll have an opportunity to build up one of the classic Spark demos: distributed k-means clustering in about a page of code. Implementing k-means on resilient distributed datasets is an excellent introduction to key Spark concepts and idioms. With recent releases of Spark, though, machine learning can be simpler still:

Read full article from A simple machine learning app with Spark - Chapeau
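The post goes on to show how little code this takes with MLlib. As a hedged illustration of that idea (my own sketch, not the author's code; the input path and feature format are assumptions), k-means with MLlib looks roughly like this:

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Each input line is assumed to hold whitespace-separated numeric features (placeholder path).
val points = sc.textFile("hdfs:///data/points.txt")
  .map(line => Vectors.dense(line.trim.split("\\s+").map(_.toDouble)))
  .cache()

val model = KMeans.train(points, 3, 20)   // k = 3 clusters, 20 iterations
model.clusterCenters.foreach(println)     // the learned centroids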
Spark SQL: Parquet | InfoObjects
Apache Parquet as a file format has garnered significant attention recently. Let's say you have a table with 100 columns, and most of the time you are going to access only 3-10 of them. In a row-oriented format, all columns are scanned whether you need them or not. Apache Parquet saves data in a column-oriented fashion, so if you need 3 columns, only the data of those 3 columns gets loaded. Another benefit is that since all data in a given column is of the same datatype (obviously), compression quality is far superior. In this recipe we'll learn how to save a table in Parquet format and then how to load it back. Let's use the Person table we created in another recipe:

first_name  last_name  gender
Barack      Obama      M
Bill        Clinton    M
Hillary     Clinton    F

Let's load it in Spark SQL:

scala> val hc = new org.apache.spark.sql.hive.HiveContext(sc)
scala> import hc._
scala> case class Person(firstName: String, lastName: String, gender: String)
scala> val person = sc.textFile("person").map(_.split("\t")).map(p => Person(p(0),p(1),

Read full article from Spark SQL: Parquet | InfoObjects
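The excerpt is cut off before the actual save and load. Assuming `person` ends up as an RDD of the Person case class above, the rest of the recipe would look roughly like this on the Spark 1.x API (a hedged sketch with placeholder paths and table names, not the article's code):

// Hedged sketch: with `import hc._` in scope, an RDD of case-class instances
// converts implicitly to a SchemaRDD in Spark 1.x.
person.saveAsParquetFile("person.parquet")      // write the table in Parquet format

val loaded = hc.parquetFile("person.parquet")   // read it back as a SchemaRDD
loaded.registerTempTable("person_parquet")
hc.sql("SELECT firstName FROM person_parquet WHERE gender = 'M'").collect().foreach(println)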
Pearson product-moment correlation coefficient - Wikipedia, the free encyclopedia
In statistics, the Pearson product-moment correlation coefficient (/ˈpɪərsɨn/), sometimes referred to as the PPMCC or PCC or Pearson's r, is a measure of the linear correlation (dependence) between two variables X and Y, giving a value between +1 and −1 inclusive, where 1 is total positive correlation, 0 is no correlation, and −1 is total negative correlation. It is widely used in the sciences as a measure of the degree of linear dependence between two variables. It was developed by Karl Pearson from a related idea introduced by Francis Galton in the 1880s. [1] [2] [3]

[Figure: examples of scatter diagrams with different values of the correlation coefficient (ρ) — several sets of (x, y) points, with the correlation coefficient of x and y for each set. Note that the correlation reflects the non-linearity and direction of a linear relationship (top row), but not the slope of that relationship (middle), nor many aspects of nonlinear relationships (bottom). N.B.:

Read full article from Pearson product-moment correlation coefficient - Wikipedia, the free encyclopedia
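For reference (this formula is standard but not part of the excerpt above), the sample correlation coefficient r for paired observations (x_i, y_i) with sample means \bar{x}, \bar{y} is:

r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}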
Pearson Correlation: Definition and Easy Steps for Use
Watch the video on how to find Pearson's Correlation Coefficient, or read below for an explanation of what it is.

What is Pearson Correlation? Correlation between sets of data is a measure of how well they are related. The most common measure of correlation in stats is the Pearson Correlation. The full name is the Pearson Product Moment Correlation, or PPMC. It shows the linear relationship between two sets of data. In simple terms, it answers the question: can I draw a line graph to represent the data? Two letters are used to represent the Pearson correlation: the Greek letter rho (ρ) for a population and the letter "r" for a sample. The Pearson correlation coefficient can be calculated by hand or on a graphing calculator such as the TI-89.

What are the Possible Values for the Pearson Correlation? The results will be between -1 and 1. You will very rarely see 0, -1 or 1. You'll get a number somewhere in between those values. The closer the value of r gets to zero,

Read full article from Pearson Correlation: Definition and Easy Steps for Use
def foldLeft[B](z: B)(f: (B, A) => B): B

list.foldLeft(List[Int]())((b, a) => a :: b)   // prepends each element to the accumulator, i.e. reverses the list
SparkNotes: Graphing Data: Histograms
Frequency Distribution Tables

A frequency distribution table is a table that shows how often a data point or a group of data points appears in a given data set. To make a frequency distribution table, first divide the numbers over which the data ranges into intervals of equal length. Then count how many data points fall into each interval. If there are many values, it is sometimes useful to go through all the data points in order and make a tally mark in the interval that each point falls into. Then all the tally marks can be counted to see how many data points fall into each interval. The "tally system" ensures that no points will be missed.

Example: The following is a list of prices (in dollars) of birthday cards found in various drug stores:

1.45 2.20 0.75 1.23 1.25 1.25 3.09 1.99 2.00 0.78 1.32 2.25 3.15 3.85 0.52 0.99 1.38 1.75 1.22 1.75

Make a frequency distribution table for this data. We omit the units (dollars) while calculating. The values go from 0.52 to 3.85,

Read full article from SparkNotes: Graphing Data: Histograms
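A small Scala sketch of the same bucketing idea, using the card prices above and an assumed interval width of 0.50 (my choice of width, not the article's):

// Build a simple frequency distribution over equal-width intervals (sketch).
val prices = List(1.45, 2.20, 0.75, 1.23, 1.25, 1.25, 3.09, 1.99, 2.00, 0.78,
                  1.32, 2.25, 3.15, 3.85, 0.52, 0.99, 1.38, 1.75, 1.22, 1.75)
val width = 0.50

// Map each price to the lower bound of its interval, then count per interval.
val freq = prices
  .groupBy(p => math.floor(p / width) * width)
  .map { case (lo, ps) => (lo, ps.size) }

freq.toSeq.sortBy(_._1).foreach { case (lo, n) =>
  println(f"[$lo%.2f, ${lo + width}%.2f): $n")
}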
Read full article from Statistics With Spark
Estimating Financial Risk with Apache Spark | Cloudera Engineering Blog
Learn how Spark facilitates the calculation of computationally intensive statistics such as VaR via the Monte Carlo method.

Under reasonable circumstances, how much money can you expect to lose? The financial statistic value at risk (VaR) seeks to answer this question. Since its development on Wall Street soon after the stock market crash of 1987, VaR has been widely adopted across the financial services industry. Some organizations report the statistic to satisfy regulations, some use it to better understand the risk characteristics of large portfolios, and others compute it before executing trades to help make informed and immediate decisions.

For reasons that we will delve into later, reaching an accurate estimate of VaR can be a computationally expensive process. The most advanced approaches involve Monte Carlo simulations, a class of algorithms that seek to compute quantities through repeated random sampling.

Read full article from Estimating Financial Risk with Apache Spark | Cloudera Engineering Blog
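To make the Monte Carlo idea concrete, here is a toy single-asset sketch (mine, not from the Cloudera post; the normal-returns model, portfolio value, and parameters are all assumptions): simulate many returns, convert them to losses, and take a quantile.

import scala.util.Random

// Toy Monte Carlo VaR: simulate portfolio returns and take a loss quantile (sketch).
val rng       = new Random(42)
val trials    = 100000
val portfolio = 1000000.0   // assumed portfolio value
val mu        = 0.0005      // assumed daily mean return
val sigma     = 0.01        // assumed daily return volatility

// Simulated losses under a (very simplistic) normal-returns model, sorted ascending.
val losses = Array.fill(trials)(-(mu + sigma * rng.nextGaussian()) * portfolio).sorted

// 95% VaR: the loss exceeded in only 5% of simulated outcomes.
val var95 = losses((0.95 * trials).toInt)
println(f"Estimated 1-day 95%% VaR: $$${var95}%.2f")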
Suvir Jain | Pearson’s Correlation Coefficient using Apache Spark and Map Reduce
Sample Input: 2 variables X and Y with values in comma-separated form like:
1,2
3,4
5,6
…

The code is general enough to handle any given column, but we will keep it simple and assume that the file contains only 2 columns like above.

Algorithm: For each line in the file, parse the two numbers as Java Double type and emit the following (Key, Value) pairs:
- (0, 1) – to count the number of records
- (1, X) – the value of X itself
- (2, Y) – the value of Y itself
- (3, X^2) – the square of X
- (4, Y^2) – the square of Y
- (5, X*Y) – the product of X and Y; this will help us compute the dot product of X and Y

A single pass of the Spark mapper will finish all the heavy lifting in one awesome O(n) operation! Next, our reducers will add up the values for each key and we will be almost done.

Mapper and Reducer Functions – The Mapper: In Spark style, the Mapper is a static nested class.

Read full article from Suvir Jain | Pearson's Correlation Coefficient using Apache Spark and Map Reduce
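The original post uses Java with nested Mapper/Reducer classes; as an illustration of the same key/value scheme in Scala (my own sketch, with a placeholder input path), the whole computation can be expressed as:

// Illustrative Scala version of the key/value scheme above (placeholder input path).
val pairs = sc.textFile("xy.csv").map { line =>
  val Array(x, y) = line.split(",").map(_.toDouble)
  (x, y)
}

// Emit the six (key, value) pairs per record and sum them by key.
val sums = pairs.flatMap { case (x, y) =>
  Seq((0, 1.0), (1, x), (2, y), (3, x * x), (4, y * y), (5, x * y))
}.reduceByKey(_ + _).collectAsMap()

val n   = sums(0)
val sx  = sums(1); val sy  = sums(2)
val sxx = sums(3); val syy = sums(4); val sxy = sums(5)

// Pearson's r from the accumulated sums.
val r = (n * sxy - sx * sy) /
        (math.sqrt(n * sxx - sx * sx) * math.sqrt(n * syy - sy * sy))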
Apache Spark User List - How to sort an RDD ?
Well it turns out you can use the takeOrdered function and create your …

Read full article from Apache Spark User List - How to sort an RDD?
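The excerpt is cut off, but the two usual approaches it is pointing at are takeOrdered with a custom Ordering and RDD.sortBy. A hedged sketch of both, with made-up data:

// Sorting an RDD: takeOrdered with a custom Ordering, and sortBy (illustrative).
val scores = sc.parallelize(Seq(("alice", 42), ("bob", 7), ("carol", 99)))

// Top 2 entries by score, descending, without sorting the whole RDD.
val top2 = scores.takeOrdered(2)(Ordering.by[(String, Int), Int](_._2).reverse)

// A fully sorted RDD by score, descending.
val sortedRdd = scores.sortBy(_._2, ascending = false)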
Spark: Parse CSV file and group by column value | Java Code Geeks
November 22, 2014 7:42 pm

I've found myself working with large CSV files quite frequently, and realising that my existing toolset didn't let me explore them quickly, I thought I'd spend a bit of time looking at Spark to see if it could help.

$ ls -alh ~/Downloads/Crimes_-_2001_to_present.csv
-rw-r--r--@ 1 markneedham staff 1.0G 16 Nov 12:14 /Users/markneedham/Downloads/Crimes_-_2001_to_present.csv

$ wc -l ~/Downloads/Crimes_-_2001_to_present.csv
4193441 /Users/markneedham/Downloads/Crimes_-_2001_to_present.csv

We can get a rough idea of the contents of the file by looking at the first row along with the header:

$ head -n 2 ~/Downloads/Crimes_-_2001_to_present.csv
ID,Case Number,Date,Block,IUCR,Primary Type,Description,Location Description,Arrest,Domestic,Beat,District,Ward,Community Area,FBI Code,X Coordinate,Y Coordinate,Year,Updated On,Latitude,Longitude,Location
9464711,HX114160,01/14/2014 05:00:00 AM,028XX E 80TH ST,0560,

Read full article from Spark: Parse CSV file and group by column value | Java Code Geeks
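As a hedged sketch of the "group by column value" part (my code, not the post's; the column index for Primary Type is inferred from the header above, and the naive split(",") ignores quoted fields containing commas):

// Count crimes per "Primary Type" (6th column, index 5) - illustrative only.
// Note: split(",") is naive and breaks on quoted fields that contain commas.
val crimeFile = sc.textFile("Crimes_-_2001_to_present.csv")
val header    = crimeFile.first()

val countsByType = crimeFile
  .filter(_ != header)      // drop the header row
  .map(_.split(",")(5))     // the Primary Type column
  .countByValue()           // Map[String, Long] of counts

countsByType.toSeq.sortWith(_._2 > _._2).take(10).foreach(println)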
Spark Shell Examples – Altiscale Docs
Copy Test Data to HDFS: the following will upload all of our example data to HDFS under your current login username. These include GraphX PageRank's datasets, and MLlib decision tree, logistic regression, k-means, linear regression, SVM, and naive Bayes data.

pushd `pwd`
cd /opt/spark/

Second, launch the spark-shell command again with the following command:

SPARK_SUBMIT_OPTS="-XX:MaxPermSize=256m" ./bin/spark-shell --master yarn --queue research --driver-class-path $(find /opt/hadoop/share/hadoop/mapreduce/lib/hadoop-lzo-* | head -n 1)

Run the following Scala statements in the Scala REPL shell (examples cover SVM, Logistic Regression, Naive Bayes, KMeans, GraphX PageRank, and Decision Tree classification and regression/prediction):

// CLASSIFICATION
import org.apache.spark.SparkContext
import org.apache.spark.mllib.tree.DecisionTree
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.tree.configuration.Algo._
import org.apache.spark.mllib.tree.

Read full article from Spark Shell Examples – Altiscale Docs
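The excerpt truncates the decision-tree example; for orientation, a minimal MLlib decision-tree classification run (my own sketch with a placeholder LIBSVM-format dataset path, not Altiscale's exact example) looks roughly like this:

import org.apache.spark.mllib.tree.DecisionTree
import org.apache.spark.mllib.util.MLUtils

// Load a LIBSVM-format dataset of LabeledPoints (placeholder path).
val data = MLUtils.loadLibSVMFile(sc, "data/sample_libsvm_data.txt")
val Array(training, test) = data.randomSplit(Array(0.7, 0.3))

// Train a classification tree: 2 classes, no categorical features, gini impurity.
val model = DecisionTree.trainClassifier(training, 2, Map[Int, Int](), "gini", maxDepth = 5, maxBins = 32)

// Evaluate accuracy on the held-out split.
val accuracy = test.map(p => (model.predict(p.features), p.label))
  .filter { case (pred, label) => pred == label }
  .count().toDouble / test.count()
println(s"Test accuracy: $accuracy")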