Data Types - MLlib - Spark 1.1.0 Documentation
A labeled point is a local vector, either dense or sparse, associated with a label/response. In MLlib, labeled points are used in supervised learning algorithms. We use a double to store a label, so we can use labeled points in both regression and classification. For binary classification, a label should be either
0 (negative) or 1 (positive). For multiclass classification, labels should be class indices starting from zero: 0, 1, 2, and so on.

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// Create a labeled point with a positive label and a dense feature vector.
val pos = LabeledPoint(1.0, Vectors.dense(1.0, 0.0, 3.0))
// Create a labeled point with a negative label and a sparse feature vector.
val neg = LabeledPoint(0.0, Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0)))
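
The dense and sparse feature vectors in the two labeled points above hold the same values, (1.0, 0.0, 3.0). The following is a minimal plain-Scala sketch of how a (size, indices, values) triple expands to a dense vector. The object name SparseToDenseSketch and the method toDense are hypothetical helpers for illustration, not part of MLlib.

```scala
// Plain-Scala sketch; SparseToDenseSketch and toDense are illustrative names, not MLlib API.
object SparseToDenseSketch {
  // Expand a sparse (size, indices, values) description into a dense vector.
  // Positions that are not listed in `indices` are implicitly zero.
  def toDense(size: Int, indices: Array[Int], values: Array[Double]): Vector[Double] = {
    require(indices.length == values.length, "one value per index")
    val dense = Array.fill(size)(0.0)
    indices.zip(values).foreach { case (i, v) => dense(i) = v }
    dense.toVector
  }

  def main(args: Array[String]): Unit = {
    // Same arguments as the Vectors.sparse call in the example above.
    println(toDense(3, Array(0, 2), Array(1.0, 3.0)))
  }
}
```

The sparse form pays off when most entries are zero, since only the non-zero positions and their values are stored.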
Sparse data
It is very common in practice to have sparse training data. MLlib supports reading training examples stored in LIBSVM format, which is the default format used by LIBSVM and LIBLINEAR. It is a text format in which each line represents a labeled sparse feature vector using the following format:
label index1:value1 index2:value2 ...
where the indices are one-based and in ascending order. After loading, the feature indices are converted to zero-based.
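
To make the index conversion concrete, here is a minimal plain-Scala sketch that parses one LIBSVM-formatted line and shifts the one-based indices to zero-based. It is not MLlib's own parser (in MLlib the loading is done by MLUtils.loadLibSVMFile). The names LibSvmSketch, SparseExample, and parseLine are hypothetical, and the sketch assumes well-formed, whitespace-separated input.

```scala
// Plain-Scala sketch of the LIBSVM line format; this is not MLlib's parser.
object LibSvmSketch {
  // Hypothetical holder for one parsed example: label, zero-based indices, values.
  final case class SparseExample(label: Double, indices: Vector[Int], values: Vector[Double])

  def parseLine(line: String): SparseExample = {
    val tokens = line.trim.split("\\s+")
    val label = tokens.head.toDouble
    val pairs = tokens.tail.toVector.map { token =>
      token.split(":") match {
        // Indices in the file are one-based; shift them to zero-based.
        case Array(index, value) => (index.toInt - 1, value.toDouble)
        case _ => throw new IllegalArgumentException(s"malformed entry: $token")
      }
    }
    // The format requires ascending indices; fail loudly otherwise.
    val ascending = pairs.map(_._1).sliding(2).forall(w => w.size < 2 || w(0) < w(1))
    require(ascending, "indices must be in ascending order")
    SparseExample(label, pairs.map(_._1), pairs.map(_._2))
  }

  def main(args: Array[String]): Unit = {
    // Label 1 with non-zero features at one-based positions 1 and 3.
    println(parseLine("1 1:1.0 3:3.0"))
  }
}
```

Note that a single line does not carry the vector size, so a loader has to learn the number of features from elsewhere, for example from the largest index seen in the data.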
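// Dense 3 x 2 matrix ((1.0, 2.0), (3.0, 4.0), (5.0, 6.0)); values are listed in column-major order.
val dm: Matrix = Matrices.dense(3, 2, Array(1.0, 3.0, 5.0, 2.0, 4.0, 6.0))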
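
The Matrices.dense call above takes its values in column-major order, which is easy to misread as row-major. The following plain-Scala sketch shows the offset arithmetic under that layout. The names ColumnMajorSketch, entry, and rows are hypothetical helpers, not MLlib API.

```scala
// Plain-Scala sketch of column-major storage; not MLlib's implementation.
object ColumnMajorSketch {
  // Entry (i, j) of a matrix with numRows rows lives at offset i + j * numRows.
  def entry(numRows: Int, values: Array[Double], i: Int, j: Int): Double =
    values(i + j * numRows)

  // Rebuild the matrix row by row so the layout can be inspected.
  def rows(numRows: Int, numCols: Int, values: Array[Double]): Vector[Vector[Double]] =
    Vector.tabulate(numRows, numCols)((i, j) => entry(numRows, values, i, j))

  def main(args: Array[String]): Unit = {
    // Same value array as the Matrices.dense example above.
    val values = Array(1.0, 3.0, 5.0, 2.0, 4.0, 6.0)
    rows(3, 2, values).foreach(row => println(row.mkString(" ")))
  }
}
```

With the value array from the example, the first column is (1.0, 3.0, 5.0) and the second is (2.0, 4.0, 6.0), so the rows read (1.0, 2.0), (3.0, 4.0), (5.0, 6.0).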
The basic distributed matrix type is called RowMatrix. A RowMatrix is a row-oriented distributed matrix without meaningful row indices, e.g., a collection of feature vectors. It is backed by an RDD of its rows, where each row is a local vector. We assume that the number of columns is not huge for a RowMatrix, so that a single local vector can be reasonably communicated to the driver and can also be stored and operated on using a single node. An IndexedRowMatrix is similar to a RowMatrix but with row indices, which can be used for identifying rows and executing joins. A CoordinateMatrix is a distributed matrix stored in coordinate list (COO) format, backed by an RDD of its entries.
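
To illustrate how the three layouts relate, here is a plain-Scala model that uses local collections in place of RDDs. It is only a sketch of the data shapes described above, not the MLlib classes. Entry, IndexedRow, toIndexedRows, and toRows are hypothetical names, and the rows are filled densely for readability.

```scala
// Plain-Scala model of the three layouts; local Seqs stand in for RDDs.
object DistributedLayoutsSketch {
  // Hypothetical stand-ins named after the concepts in the text, not MLlib classes.
  final case class Entry(i: Long, j: Long, value: Double)       // one COO entry
  final case class IndexedRow(index: Long, row: Vector[Double]) // a row plus its index

  // Coordinate list -> indexed rows: group entries by row index and fill each row.
  // Rows with no entries are simply absent from the result.
  def toIndexedRows(entries: Seq[Entry], numCols: Int): Seq[IndexedRow] =
    entries.groupBy(_.i).toSeq.sortBy(_._1).map { case (i, rowEntries) =>
      val row = Array.fill(numCols)(0.0)
      rowEntries.foreach(e => row(e.j.toInt) = e.value)
      IndexedRow(i, row.toVector)
    }

  // Indexed rows -> plain rows: dropping the indices loses the row identity.
  def toRows(indexed: Seq[IndexedRow]): Seq[Vector[Double]] = indexed.map(_.row)

  def main(args: Array[String]): Unit = {
    val entries = Seq(Entry(0, 0, 1.0), Entry(0, 2, 3.0), Entry(2, 1, 5.0))
    val indexed = toIndexedRows(entries, numCols = 3)
    indexed.foreach(println)
    toRows(indexed).foreach(println)
  }
}
```

The model shows the trade-off: the coordinate form stores only individual entries, the indexed form keeps whole rows that can still be identified and joined by index, and the plain row form keeps the rows but no longer says which row is which.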