Library - PySpark


Lesson Description

Lesson #1488 - PySpark Introduction


Apache Spark is a lightning-fast, real-time processing framework. It performs in-memory computations to analyze data as it arrives. Spark emerged because Apache Hadoop MapReduce supported only batch processing and lacked real-time processing capability. Apache Spark was therefore introduced, since it can perform stream processing in real time and can also handle batch processing.

Besides real-time and batch processing, Apache Spark also supports interactive queries and iterative algorithms. Apache Spark ships with its own cluster manager, on which it can host its applications, but it can also use Apache Hadoop for both storage and processing: it relies on HDFS (Hadoop Distributed File System) for storage, and it can run Spark applications on YARN as well.
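As a sketch of the deployment options above, a Spark application can be submitted to a YARN cluster and read its input from HDFS. The script name and HDFS path below are hypothetical placeholders, assuming a standard Hadoop setup:

```shell
# Hypothetical example: submit a PySpark application to a YARN cluster.
# "my_app.py" and the HDFS path are placeholders for your own files.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  my_app.py hdfs:///data/input.txt
```

With `--master yarn`, YARN allocates the executors; omitting it (or using `--master local[*]`) runs the same application on a single machine.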


Apache Spark is written in the Scala programming language. To support Python with Spark, the Apache Spark community released a tool called PySpark. Using PySpark, you can work with RDDs in the Python programming language as well. This is possible thanks to a library called Py4J, which lets Python code interact with JVM objects.

PySpark offers the PySpark Shell, which links the Python API to the Spark core and initializes the SparkContext. The majority of data scientists and analytics experts today use Python because of its rich set of libraries, so integrating Python with Spark is a boon to them.
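For example, launching the shell from a terminal drops you into a Python REPL with a SparkContext already created and bound to the variable `sc` (the transcript below is indicative; the exact banner and master depend on your installation):

```shell
$ pyspark
# ... Spark startup banner ...
>>> sc
<SparkContext master=local[*] appName=PySparkShell>
>>> sc.parallelize(range(10)).sum()
45
```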