
Library - PySpark


Lesson Description


Lesson #1492 - PySpark Broadcast & Accumulator


For parallel processing, Apache Spark uses shared variables. When the driver sends a task to the executors on the cluster, a copy of each shared variable is sent to every node of the cluster so that it can be used while performing the task. Apache Spark supports two kinds of shared variables:

* Broadcast
* Accumulator

## Broadcast

Broadcast variables are used to save a copy of a piece of data across all nodes. The variable is cached on every machine rather than shipped with each task. The following code block shows the details of the Broadcast class for PySpark.

```plaintext
class pyspark.Broadcast (
   sc = None,
   value = None,
   pickle_registry = None,
   path = None
)
```

The following example shows how to use a Broadcast variable. A Broadcast variable has an attribute called value, which stores the data and is used to return the broadcasted value.

```python
# ----- broadcast.py -----
from pyspark import SparkContext

sc = SparkContext("local", "Broadcast app")

# Broadcast a small list of words once to every node in the cluster.
words_new = sc.broadcast(["scala", "java", "hadoop", "spark", "akka"])

# Read the broadcasted data back through the value attribute.
data = words_new.value
print("Stored data -> %s" % (data))

# Individual elements can be accessed just like a normal list.
elem = words_new.value[2]
print("Printing a particular element in RDD -> %s" % (elem))
```

**Command** - Below is the command for a broadcast variable:

```plaintext
$SPARK_HOME/bin/spark-submit broadcast.py
```

**Output** - Below is the output for the above command:

```plaintext
Stored data -> ['scala', 'java', 'hadoop', 'spark', 'akka']
Printing a particular element in RDD -> hadoop
```

## Accumulator

Accumulator variables are used for aggregating information through associative and commutative operations. For example, you can use an accumulator for a sum operation or for counters (as in MapReduce). The following code block shows the details of the Accumulator class for PySpark.

```plaintext
class pyspark.Accumulator(aid, value, accum_param)
```

The following example shows how to use an Accumulator variable. An Accumulator variable has an attribute called value, similar to a Broadcast variable; it stores the data and is used to return the accumulator's value, but it is usable only in a driver program. In this example, an accumulator variable is shared by multiple workers and returns an accumulated value.

```python
# ----- accumulator.py -----
from pyspark import SparkContext

sc = SparkContext("local", "Accumulator app")

# Create an accumulator with an initial value of 10.
num = sc.accumulator(10)

def f(x):
    # Runs on the workers: each element is added to the shared accumulator.
    global num
    num += x

rdd = sc.parallelize([20, 30, 40, 50])
rdd.foreach(f)

# The accumulated value can only be read back in the driver program.
final = num.value
print("Accumulated value is -> %i" % (final))
```

**Command** - Below is the command for an accumulator variable:

```plaintext
$SPARK_HOME/bin/spark-submit accumulator.py
```

**Output** - Below is the output for the above command:

```plaintext
Accumulated value is -> 150
```
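As a closing illustration, here is a minimal sketch (not part of the original lesson) that combines the two shared variable types in one job: the workers read a broadcast word list and increment an accumulator for every match. The file name combined_example.py, the sample word list, and the counts are illustrative assumptions.

```python
# ----- combined_example.py (illustrative sketch, not from the original lesson) -----
from pyspark import SparkContext

sc = SparkContext("local", "Broadcast and Accumulator app")

# Broadcast a small lookup list once to every node.
keywords = sc.broadcast(["spark", "hadoop"])

# Accumulator that the workers add to and the driver reads.
matches = sc.accumulator(0)

def count_matches(word):
    # Runs on the workers: read the broadcast value, update the accumulator.
    if word in keywords.value:
        matches.add(1)

words = sc.parallelize(["scala", "java", "hadoop", "spark", "akka", "spark"])
words.foreach(count_matches)

# The accumulator's value is only reliable in the driver program.
print("Number of matching words -> %i" % matches.value)
```

Run it the same way as the scripts above, for example `$SPARK_HOME/bin/spark-submit combined_example.py`; for this sample data it should print `Number of matching words -> 3`.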