Library - PySpark


Lesson Description

Lesson - #1496 PySpark-MLlib

Apache Spark offers a machine learning API called MLlib, and PySpark exposes this API in Python as well. It supports several kinds of algorithms, which are described below −

mllib.classification − The spark.mllib package supports various methods for binary classification, multiclass classification, and regression analysis. Some of the most popular classification algorithms are Random Forest, Naive Bayes, and Decision Tree.

mllib.clustering − Clustering is an unsupervised learning problem in which you aim to group subsets of entities together based on some notion of similarity.

mllib.fpm − Frequent pattern mining extracts frequent items, itemsets, subsequences, or other substructures, and is usually among the first steps in analyzing a large-scale dataset. It has been an active research topic in data mining for years.

mllib.linalg − MLlib utilities for linear algebra.

mllib.recommendation − Collaborative filtering is commonly used for recommender systems. These techniques aim to fill in the missing entries of a user-item association matrix.

spark.mllib − It currently supports model-based collaborative filtering, in which users and products are described by a small set of latent factors that can be used to predict missing entries. spark.mllib uses the Alternating Least Squares (ALS) algorithm to learn these latent factors.
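To make the latent-factor idea concrete, here is a minimal plain-Python sketch (not Spark code). Each user and each product is represented by a small vector, and a missing rating is predicted as the dot product of the two; the factor values below are invented for illustration, whereas ALS would learn them from the observed ratings.

```python
# Plain-Python sketch of model-based collaborative filtering.
# The latent factors below are made up for illustration; in Spark,
# ALS learns them from the observed ratings.

user_factors = {
    1: [1.2, 0.3],   # user 1's latent vector (rank = 2)
    2: [0.4, 1.1],
}
product_factors = {
    10: [0.9, 0.2],  # product 10's latent vector
    20: [0.1, 1.3],
}

def predict(user, product):
    """Predicted rating = dot product of user and product factors."""
    u = user_factors[user]
    p = product_factors[product]
    return sum(a * b for a, b in zip(u, p))

print(predict(1, 10))  # rating user 1 would give product 10
print(predict(2, 20))
```

The "rank" parameter in ALS is simply the length of these latent vectors.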

Other algorithms, classes, and functions are also part of the mllib package. For now, let us walk through a demonstration of pyspark.mllib.

The following example uses collaborative filtering with the ALS algorithm to build a recommendation model and evaluate it on the training data.

Dataset used − test.data
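The lesson does not show the contents of test.data, but the code below expects it to hold comma-separated user,product,rating triples. A small file in that assumed format can be generated like this (the specific rows are illustrative):

```python
# Write a sample test.data: each line is "user,product,rating".
# These particular rows are assumed for illustration; any file
# in this comma-separated format will work with the example.
rows = [
    (1, 1, 5.0),
    (1, 2, 1.0),
    (2, 1, 5.0),
    (2, 2, 1.0),
    (3, 1, 1.0),
    (3, 2, 5.0),
]
with open("test.data", "w") as f:
    for user, product, rating in rows:
        f.write("%d,%d,%.1f\n" % (user, product, rating))
```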

from __future__ import print_function
from pyspark import SparkContext
from pyspark.mllib.recommendation import ALS, MatrixFactorizationModel, Rating

if __name__ == "__main__":
    sc = SparkContext(appName="PySpark mllib Example")

    # Load and parse the data: each line is "user,product,rating"
    data = sc.textFile("test.data")
    ratings = data.map(lambda l: l.split(',')) \
        .map(lambda l: Rating(int(l[0]), int(l[1]), float(l[2])))

    # Build the recommendation model using Alternating Least Squares
    rank = 10
    numIterations = 10
    model = ALS.train(ratings, rank, numIterations)

    # Evaluate the model on training data
    testdata = ratings.map(lambda p: (p[0], p[1]))
    predictions = model.predictAll(testdata) \
        .map(lambda r: ((r[0], r[1]), r[2]))
    ratesAndPreds = ratings.map(lambda r: ((r[0], r[1]), r[2])) \
        .join(predictions)
    MSE = ratesAndPreds.map(lambda r: (r[1][0] - r[1][1])**2).mean()
    print("Mean Squared Error = " + str(MSE))

    # Save and load the model
    model.save(sc, "target/tmp/myCollaborativeFilter")
    sameModel = MatrixFactorizationModel.load(sc, "target/tmp/myCollaborativeFilter")

Command - The command will be as follows −
$SPARK_HOME/bin/spark-submit recommend.py

Output - The output of the above command will be −
Mean Squared Error = 1.20536041839e-05
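As a plain-Python illustration of the metric printed above (not Spark code), the mean squared error averages the squared difference between each actual rating and its prediction; the (actual, predicted) pairs below are made-up numbers:

```python
# Mean squared error over (actual, predicted) rating pairs.
# These pairs are invented for illustration only.
rates_and_preds = [
    (5.0, 4.9),
    (1.0, 1.2),
    (5.0, 5.1),
]

mse = sum((actual - pred) ** 2
          for actual, pred in rates_and_preds) / len(rates_and_preds)
print("Mean Squared Error =", mse)
```

A very small MSE, as in the output above, indicates the model reproduces the training ratings almost exactly.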