Cloud Computing - spark



Lesson - #767 RDD Shared Variables

RDD Shared Variables

In Spark, when a function is passed to a transformation operation, it is executed on a remote cluster node. The function works on separate copies of all the variables it uses. These variables are copied to each machine, and no updates made to them on the remote machine are propagated back to the driver program.
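The copy behaviour described above can be illustrated with a minimal sketch. It assumes a local master (`local[2]`) purely so it can run on one machine; the names `counter` and `total` are illustrative. The value of `counter` on the driver after the `foreach` is deliberately not asserted, because it depends on how the closure is shipped to the tasks.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumption: a local master is used only so the sketch runs on a single machine.
val conf = new SparkConf().setAppName("closure-copy-sketch").setMaster("local[2]")
val sc = SparkContext.getOrCreate(conf)

val rdd = sc.parallelize(Seq(1, 2, 3, 4, 5))

// Anti-pattern: each task works on its own serialized copy of `counter`,
// so these increments are not guaranteed to reach the driver's variable.
var counter = 0
rdd.foreach(x => counter += x)
println(s"driver-side counter after foreach: $counter")

// Reliable alternative: let Spark combine the values and return the result.
val total = rdd.reduce(_ + _)
println(s"total from reduce: $total")
```

When a result has to come back to the driver, an action such as `reduce`, or one of the shared variables introduced in this lesson, is the safer choice.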

Apache Spark provides two types of shared variables: broadcast variables and accumulators.

Broadcast variable

Broadcast variables let you keep a read-only variable cached on each machine instead of shipping a copy of it with every task. Spark uses efficient broadcast algorithms to distribute broadcast variables, which reduces communication cost.

The execution of Spark actions goes through a set of stages, separated by distributed "shuffle" operations. Spark automatically broadcasts the common data needed by tasks within each stage. Data broadcast this way is cached in serialized form and deserialized before each task runs.

To create a broadcast variable from a value (say, v), call SparkContext.broadcast(v). Let's understand this with an example.
scala> val v = sc.broadcast(Array(1, 2, 3))
scala> v.value
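A slightly larger sketch may help show where a broadcast variable is typically useful: a small lookup table that every task needs to read. The table `countryNames`, the sample codes, and the local master setting are all assumptions made for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumption: a local master is used only so the sketch runs on a single machine.
val conf = new SparkConf().setAppName("broadcast-sketch").setMaster("local[2]")
val sc = SparkContext.getOrCreate(conf)

// Hypothetical read-only lookup table that every task needs.
val countryNames = Map("IN" -> "India", "US" -> "United States", "FR" -> "France")

// Ship the table to each executor once instead of once per task.
val lookup = sc.broadcast(countryNames)

val codes = sc.parallelize(Seq("IN", "FR", "US", "XX"))

// Tasks read the cached copy through `lookup.value`.
val resolved = codes.map(code => lookup.value.getOrElse(code, "unknown")).collect()

resolved.foreach(println)
```

Inside the closure the code refers to `lookup.value` rather than `countryNames`, so the table is not serialized again with each task.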


Accumulator

Accumulators are variables used to perform associative and commutative operations, such as counters or sums. Spark natively supports accumulators of numeric types, and we can add support for new types.
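As a rough sketch of what adding support for a new type could look like, the example below subclasses `AccumulatorV2` to collect distinct strings. The class name `StringSetAccumulator`, the accumulator name `distinctWords`, and the sample data are hypothetical, and the local master is assumed only to keep the sketch runnable on one machine.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.util.AccumulatorV2
import scala.collection.mutable

// Hypothetical accumulator that gathers the distinct strings seen by all tasks.
class StringSetAccumulator extends AccumulatorV2[String, Set[String]] {
  private val items = mutable.Set.empty[String]

  // True while nothing has been added yet.
  override def isZero: Boolean = items.isEmpty

  // Independent copy holding the same elements.
  override def copy(): StringSetAccumulator = {
    val acc = new StringSetAccumulator
    acc.items ++= items
    acc
  }

  override def reset(): Unit = items.clear()

  override def add(v: String): Unit = items += v

  // Set union is associative and commutative, as an accumulator requires.
  override def merge(other: AccumulatorV2[String, Set[String]]): Unit =
    items ++= other.value

  override def value: Set[String] = items.toSet
}

// Assumption: a local master is used only so the sketch runs on a single machine.
val conf = new SparkConf().setAppName("custom-accumulator-sketch").setMaster("local[2]")
val sc = SparkContext.getOrCreate(conf)

val seen = new StringSetAccumulator
sc.register(seen, "distinctWords") // custom accumulators must be registered before use

sc.parallelize(Seq("a", "b", "a", "c")).foreach(w => seen.add(w))

val distinct = seen.value
println(distinct)
```

Set union was chosen here because the order in which partial results are merged does not change the outcome, which is the property Spark relies on.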

To create a numeric accumulator, call SparkContext.longAccumulator() or SparkContext.doubleAccumulator() to accumulate values of type Long or Double, respectively.

scala> val a = sc.longAccumulator("Accumulator")
scala> sc.parallelize(Array(2, 5)).foreach(x => a.add(x))
scala> a.value
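The two numeric accumulators can also be combined in one job. The sketch below, under the assumption of a local master and made-up input strings, sums the parseable values with a double accumulator and counts the unparseable ones with a long accumulator. The names `badRecords` and `total` are illustrative.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import scala.util.{Failure, Success, Try}

// Assumption: a local master is used only so the sketch runs on a single machine.
val conf = new SparkConf().setAppName("accumulator-sketch").setMaster("local[2]")
val sc = SparkContext.getOrCreate(conf)

val badRecords = sc.longAccumulator("badRecords") // counts unparseable inputs
val total = sc.doubleAccumulator("total")         // sums the parsed values

// Hypothetical raw input containing one malformed entry.
val raw = sc.parallelize(Seq("1.5", "2.5", "oops", "4.0"))

// Updates happen inside an action (foreach); the driver reads the results afterwards.
raw.foreach { s =>
  Try(s.toDouble) match {
    case Success(d) => total.add(d)
    case Failure(_) => badRecords.add(1L)
  }
}

println(s"bad records: ${badRecords.value}, total: ${total.value}")
```

Tasks can only add to an accumulator; only the driver program should read its value. Updating accumulators inside an action, as done here, avoids the double counting that can occur when a transformation is re-executed.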