Apache Spark

Apache Spark is a framework that is faster than MapReduce. It is written in Scala and takes a more functional approach to programming. Spark generalizes the MapReduce framework to a generic distributed dataflow, properly modeled as a directed acyclic graph (DAG).

Lifecycle of Resilient Distributed Datasets

Resilient distributed datasets (RDDs) are the unit data blocks of Apache Spark. These blocks are created, transformed, and written back to disk.

Resilient means that they remain in memory or on disk on a “best effort” basis, and can be recomputed if need be. Distributed means that, just like the collections of key-value pairs in MapReduce, they are partitioned and spread over multiple machines.

RDDs can contain values of any type, not just key-value pairs as in MapReduce.

Creation

An RDD is created by fetching the data from disk or another source. RDDs can be so large that they do not fit on a single machine. When reading a JSON Lines file, Spark automatically infers the schema by scanning the file. This adds a one-time performance cost, but then allows SQL queries inside Spark!
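
A minimal PySpark sketch of the creation step. The file name people.jsonl and its name field are hypothetical placeholders; the point is the schema inference (the one-time cost above) followed by an in-Spark SQL query.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-creation").getOrCreate()

# Reading a JSON Lines file: Spark scans the data to infer the schema
# (the one-time cost mentioned above), which then allows SQL queries.
df = spark.read.json("people.jsonl")
df.printSchema()
df.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people").show()

# RDDs can also be created directly, e.g. from a text file or a local collection.
lines = spark.sparkContext.textFile("people.jsonl")
numbers = spark.sparkContext.parallelize(range(100))
```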

Transformation

Beyond just map and reduce, Apache Spark introduces the notion of transformations. There are many possible transformations, and they can be composed like LEGO blocks.

Examples of possible transformations (a short code sketch follows the lists below):

  • filter (a selection: keeps only the values that satisfy a predicate).
  • map (takes a single item and returns some function of it; this can implement a projection, for example).
  • flatMap (like map, but the function returns a list, which is flattened into the output).
  • distinct (removes duplicates: items that evaluate as equal are kept only once).
  • sample (takes some random items from the input values).
  • join (on key-value pairs, groups the values coming from the two operands by key into tuples).
  • subtractByKey (binary operation that keeps the pairs whose keys appear in the first operand but not in the second).

Some transformations are binary operations, for example:

  • union
  • Cartesian product

There are also transformations specifically tailored to key-value pairs:
  • reduceByKey: implements the reduce phase of the MapReduce framework.
  • keys / values: drop the values (or the keys) and keep only the other component.
  • groupByKey: all key-value pairs with the same key are grouped together.
  • sortByKey: sorts the pairs by key.
  • mapValues: keeps the key and applies a function to the corresponding value.
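
A short PySpark sketch composing several of the transformations listed above; the input data and functions are made up for illustration.

```python
from pyspark.sql import SparkSession

sc = SparkSession.builder.appName("transformations").getOrCreate().sparkContext

lines = sc.parallelize(["to be or not to be", "to do is to be"])

# Narrow transformations composed like LEGO blocks:
words = lines.flatMap(lambda line: line.split())    # flatMap flattens the lists of words
long_words = words.filter(lambda w: len(w) > 1)     # filter = selection
pairs = long_words.map(lambda w: (w, 1))            # map to key-value pairs

# Key-value transformations:
counts = pairs.reduceByKey(lambda a, b: a + b)      # reduce phase of MapReduce
by_key = counts.sortByKey()                         # sort by key
words_only = counts.keys()                          # drop the values
doubled = counts.mapValues(lambda c: 2 * c)         # keep the key, map the value

# Binary transformations:
other = sc.parallelize([("be", 1), ("spark", 1)])
both = counts.union(other)
left = counts.subtractByKey(other)                  # keys of counts not present in other
joined = counts.join(other)                         # (key, (value1, value2)) pairs
```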

Action

Actions make the result of the transformations permanent. Spark employs lazy evaluation: the transformations are not executed until an action is called. We say data is materialized when it is actually computed and stored on the machines, which happens only when an action is triggered.

Examples of actions (see the sketch after the list) are:

  • collect (gathers the whole RDD and returns it to the client machine).
  • count (counts the items; variants count by key or count how many times each value occurs).
    • Counting by value collects the counts on the driver, so this action uses a lot of memory; with too many distinct values the task can fail with an out-of-memory error.
  • take (returns the first $n$ values).
  • top (returns the largest $n$ values).
  • takeSample (like take, but returns a random sample of values).
  • saveAsTextFile (saves the whole RDD to a file system or cloud storage service, e.g. S3).
  • lookup (returns the values associated with a given key).
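
A small PySpark sketch of lazy evaluation and a few of these actions; nothing runs until an action such as count or collect is called. The data and the output path are placeholders.

```python
from pyspark.sql import SparkSession

sc = SparkSession.builder.appName("actions").getOrCreate().sparkContext

pairs = sc.parallelize([("a", 3), ("b", 1), ("a", 2), ("c", 5)])
summed = pairs.reduceByKey(lambda x, y: x + y)   # transformation only: nothing runs yet

print(summed.count())           # action: triggers the computation
print(summed.collect())         # brings the whole RDD to the client machine
print(summed.take(2))           # first 2 pairs
print(summed.top(2))            # largest 2 pairs
print(pairs.countByValue())     # counts per value, held in driver memory
print(summed.lookup("a"))       # values associated with the key "a"
summed.saveAsTextFile("out/")   # placeholder path; could be e.g. an S3 URI
```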

Because of setup overhead, the first execution usually takes longer than subsequent ones.

Physical Architecture

Narrow dependency transformations

In this case, each output value depends on only a single input value, which means the computation can easily be parallelized. So, if the data is spread over different machines, all narrow dependencies can be executed in parallel, following this diagram:

[Figure: Massive Parallel Processing-20241109161812672]

A task slot is the equivalent of a memory-and-CPU slot in MapReduce. If we have more tasks than slots, the faster slots usually get assigned more tasks (this usually improves the latency), so that we have as few idle machines as possible. Within a slot, tasks are executed sequentially. We can also split the slots inside a single node by assigning them different numbers of cores.

Chains of narrow dependencies

The physical calls of the underlying map/filter/etc. functions are directly chained on each input value to directly produce the corresponding final output value, meaning that the intermediate RDDs are not even materialized anywhere and exist purely logically. From (Fourny 2024).

Communication over the network is usually not efficient. If we have a sequence of narrow-dependency transformations, every transformation is applied in a chain on the same machine, meaning the intermediate RDDs are usually not even materialized; we just make subsequent function calls. Such a chain is called a stage: a computation step that can occur sequentially on a single machine without materializing the intermediate RDDs.
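
A tiny sketch of such a chain; filter and map are narrow, so they form a single stage. The lineage printed by toDebugString shows that the intermediate RDDs exist only logically.

```python
from pyspark.sql import SparkSession

sc = SparkSession.builder.appName("stages").getOrCreate().sparkContext

# filter and map are narrow: they are pipelined element by element on the same
# machine, so the intermediate RDDs exist only logically and form a single stage.
chain = sc.parallelize(range(1000)).filter(lambda x: x % 2 == 0).map(lambda x: x * x)
print(chain.toDebugString().decode())
```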

Submitting a Spark task

We can set some parameters so that Spark knows how many resources the job needs (for example the number of executors, or the amount of memory), and provide the code for the task that needs to be executed.

This is done with spark-submit.
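
A typical invocation might look like the sketch below; the resource values and the file name are placeholders, and some flags (e.g. --num-executors) apply to specific cluster managers such as YARN.

```bash
spark-submit \
  --master yarn \
  --num-executors 4 \
  --executor-memory 8g \
  --executor-cores 2 \
  my_job.py
```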

Wide dependency transformations

These transformations need shuffling, corresponding to the communication step of the MapReduce framework. In this case, the transformations of the job form a DAG, and the execution follows a topological order of this DAG.

On a cluster, the stages are usually executed sequentially.
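
Continuing the earlier sketch, adding a wide transformation (reduceByKey) introduces a shuffle and therefore a stage boundary in the DAG, which is visible in the lineage:

```python
from pyspark.sql import SparkSession

sc = SparkSession.builder.appName("wide-deps").getOrCreate().sparkContext

words = sc.parallelize(["a", "b", "a", "c", "b", "a"])
# map is narrow, reduceByKey is wide: it needs a shuffle of the data.
counts = words.map(lambda w: (w, 1)).reduceByKey(lambda x, y: x + y)

# The lineage shows a shuffle boundary: everything before it is one stage,
# everything after it is another, and the stages run one after the other.
print(counts.toDebugString().decode())
```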

Optimization methods

Pinning an RDD

If an RDD is used by many successive RDDs, we can pin it in memory and forget all the older RDDs, so that the new operations can be applied efficiently.

[Figure: Massive Parallel Processing-20241109170821523] Example of pinning from the course slides.
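
A minimal sketch of pinning an RDD that several later computations reuse; persist (or its shorthand cache) is Spark's call for this.

```python
from pyspark import StorageLevel
from pyspark.sql import SparkSession

sc = SparkSession.builder.appName("pinning").getOrCreate().sparkContext

base = sc.parallelize(range(1_000_000)).map(lambda x: x * x)

# Pin the RDD in memory (best effort); it is recomputed from its lineage if lost.
base.persist(StorageLevel.MEMORY_ONLY)   # base.cache() is equivalent

# Both actions reuse the pinned RDD instead of recomputing the map from scratch.
print(base.count())
print(base.take(5))

base.unpersist()   # release it when it is no longer needed
```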

Prepartitioning

This is an optimization that prevents shuffling. If Spark knows that the data is already located where it should be, then shuffling is not needed. A simple example is when data is sorted before being grouped with the same keys as for sorting: then Spark knows that the groups are already consistent with their physical location, and that no additional shuffling is needed between sorting and grouping.
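
A sketch of pre-partitioning with the RDD API, under the assumption that the left side is hash-partitioned and kept in memory: Spark remembers the partitioning, so the subsequent join does not re-shuffle that side.

```python
from pyspark.sql import SparkSession

sc = SparkSession.builder.appName("prepartition").getOrCreate().sparkContext

users = sc.parallelize([(1, "alice"), (2, "bob"), (3, "carol")])
events = sc.parallelize([(1, "click"), (3, "view"), (1, "view")])

# Hash-partition the users once and keep them in memory; the partitioning is
# remembered, so joins on the same key do not shuffle this RDD again.
users_p = users.partitionBy(8).cache()

joined = users_p.join(events)   # only the events side still needs shuffling
print(joined.collect())
```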

Spark SQL

For standard SQL, take a look at Structured Query Language. Spark usually converts JSON or CSV into a DataFrame, which can also be converted back to an RDD if needed.
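
A small sketch of going from a file to a DataFrame and back to an RDD; the file name orders.csv is a placeholder.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-sql").getOrCreate()

df = spark.read.csv("orders.csv", header=True, inferSchema=True)  # DataFrame with a schema
df.createOrReplaceTempView("orders")
spark.sql("SELECT COUNT(*) FROM orders").show()

rdd = df.rdd   # back to an RDD of Row objects, if needed
```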

Explode

Some SQL operations only exist within Spark. One example is explode, also called a lateral view, which restores first normal form by expanding nested data (e.g. arrays) into multiple rows.

[Figure: Apache Spark-20241109174056181]
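
A minimal sketch of explode / LATERAL VIEW on a hypothetical table with an array column:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, explode

spark = SparkSession.builder.appName("explode").getOrCreate()

df = spark.createDataFrame(
    [("alice", ["a", "b"]), ("bob", ["c"])],
    ["name", "tags"],
)

# DataFrame API: one output row per array element (first normal form).
df.select("name", explode(col("tags")).alias("tag")).show()

# Equivalent SQL with a lateral view:
df.createOrReplaceTempView("people")
spark.sql("SELECT name, tag FROM people LATERAL VIEW explode(tags) t AS tag").show()
```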

Distributing and Clustering

Spark SQL also allows controlling how rows are distributed across partitions. The SORT BY operation in Spark sorts only locally, within each partition (unlike ORDER BY, which produces a global order).

The DISTRIBUTE BY clause forces a repartition by putting all rows with the same value (for the specified field(s)) into the same new partition.

The CLUSTER BY clause is equivalent to DISTRIBUTE BY followed by SORT BY on the same key: all rows with the same key end up on the same machine (in the same partition), sorted within it.
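
A sketch of these clauses in Spark SQL on a hypothetical orders view; DISTRIBUTE BY only repartitions, SORT BY sorts within each partition, and CLUSTER BY does both.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("distribute-cluster").getOrCreate()

df = spark.createDataFrame(
    [(1, "books", 12.0), (2, "books", 7.5), (3, "games", 30.0)],
    ["id", "category", "price"],
)
df.createOrReplaceTempView("orders")

# All rows with the same category end up in the same partition:
spark.sql("SELECT * FROM orders DISTRIBUTE BY category").show()

# Local (per-partition) sort only, no global order:
spark.sql("SELECT * FROM orders SORT BY price").show()

# CLUSTER BY category == DISTRIBUTE BY category SORT BY category:
spark.sql("SELECT * FROM orders CLUSTER BY category").show()
```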

Shuffling operations

Remember that shuffles are usually the bottleneck of our transformations. Examples of shuffling operations are GROUP BY or ORDER BY.

References

[1] Fourny, “The Big Data Textbook”, 2024.