This is a new framework that is faster than MapReduce (see Massive Parallel Processing). It is written in Scala and takes a more functional approach to programming. Spark extends the earlier MapReduce framework to a generic distributed dataflow, properly modeled as a DAG. There are further benefits of using Spark instead of the MapReduce framework:

  • Spark processes data in memory, avoiding the disk I/O overhead of MapReduce, making it significantly faster.
  • Spark uses a DAG to optimize the entire workflow, reducing data shuffling and stage count.

But MapReduce sometimes has its advantages:

  • Ultra-large datasets that do not fit in memory: For massive datasets that don’t fit even across a large Spark cluster’s memory, MapReduce’s disk-based processing can be more manageable.
  • Strict fault tolerance requirements: MapReduce’s fault tolerance mechanism is very robust, as it writes intermediate results to HDFS, which may be preferred in environments where data recovery is critical.
  • Low hardware resources: Spark requires more memory and computational resources than MapReduce, so in resource-constrained environments, MapReduce may be a more practical choice.

Resilient Distributed Datasets

RDDs are the main innovation presented in this section. The idea, presented in (Zaharia et al. 2012), is that keeping the data in memory, when possible, gives an order-of-magnitude performance increase. See Performance at Large Scales for a comparison of access speeds across the memory hierarchy.

In practice, they are often exposed as DataFrames, which add a specific schema on top of the underlying RDD.

RDDs fault tolerance 🟩

Previous methods implemented fault tolerance by saving and persisting the intermediate data produced during the computation. RDDs instead save the lineage (the graph of transformations that produced them), somewhat akin to the DAGs we often see for machine learning methods like Backpropagation: if a partition is lost, it can be recomputed by replaying its lineage rather than restored from a saved copy.
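
As a small illustration, a minimal sketch assuming a spark-shell session (where the SparkContext sc is predefined): the recorded lineage of an RDD can be inspected with toDebugString.

val words  = sc.parallelize(Seq("spark", "rdd", "spark", "lineage"))
val counts = words.map(w => (w, 1)).reduceByKey(_ + _)

// Nothing has been computed yet: only the transformation graph (lineage) is recorded.
// If a partition is lost, Spark replays this lineage instead of restoring a saved copy.
println(counts.toDebugString)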

RDD memorization 🟩

We can keep an RDD mainly in three forms (a sketch with the corresponding storage levels follows this list):

  • Deserialized in memory: fastest access
  • Serialized in memory: slower access, but less memory
  • Disk: slowest access, but more space (often used when the RDD cannot fit into main memory).
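
The three forms correspond to Spark's storage levels. A minimal sketch, again assuming a spark-shell session with sc predefined:

import org.apache.spark.storage.StorageLevel

val numbers = sc.parallelize(1 to 1000000)

// An RDD can only have one storage level, hence the commented-out alternatives.
numbers.persist(StorageLevel.MEMORY_ONLY)        // deserialized objects in memory: fastest access
// numbers.persist(StorageLevel.MEMORY_ONLY_SER) // serialized bytes in memory: slower, but more compact
// numbers.persist(StorageLevel.DISK_ONLY)       // on disk: slowest, used when the RDD does not fit in memory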

Resilient distributed datasets’ Lifecycle 🟩

Resilient distributed datasets (RDDs) are the unit data blocks of Apache Spark. These blocks are created, transformed, and written back to disk.

Resilient means that they remain in memory or on disk on a “best effort” basis, and can be recomputed if need be. Distributed means that, just like the collections of key-value pairs in MapReduce, they are partitioned and spread over multiple machines.

RDDs can contain anything, not just key-value pairs as in MapReduce.

Creation 🟩

An RDD is created by fetching the input data from disk or from somewhere else. These RDDs can be so large that they do not fit on a single machine. When reading a JSON Lines file, Spark automatically infers the schema by discovering the file. This adds an upfront performance cost, but then allows SQL queries inside Spark!
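
A sketch of creation, assuming a spark-shell session (spark and sc predefined); the file paths and the name/age fields are hypothetical:

// From an existing collection (mostly for testing).
val smallRdd = sc.parallelize(Seq(1, 2, 3, 4))

// From a (possibly huge, partitioned) text file on HDFS, S3, local disk, ...
val lines = sc.textFile("hdfs:///data/logs/*.txt")          // hypothetical path

// From a JSON Lines file: Spark scans the file to infer the schema
// (the upfront cost mentioned above), and then SQL queries become possible.
val people = spark.read.json("hdfs:///data/people.jsonl")   // hypothetical path
people.printSchema()
people.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE age >= 18").show()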

Transformation 🟩

Instead of just map and reduce, Apache Spark introduces the more general notion of transformations. There are hundreds of possible transformations, and they can be composed like LEGO blocks (a combined sketch follows the lists below).

Examples of possible transformations:

  • filter (basically a selection on possible values)
  • map (this can implement a projection, for example; it takes a row (single item) and returns some function of it)
  • flatMap (this is a map, but it flattens the output of the map, which could be a list)
  • distinct (removes duplicates by collapsing items evaluated as equal into a single value)
  • sample (takes some random items from the input values)
  • join: groups by key and pairs up the values of the two operands into tuples
  • subtractByKey: binary operation that returns the entries of the first operand whose keys do not appear in the second

Some transformations are binary operations, for example:

  • union
  • Cartesian product

There are also some transformations that are specifically tailored for key-value pairs:

  • reduceByKey: this implements the reduce phase of the MapReduce framework.
  • keys / values: we can also just drop the values or the keys as a transformation.
  • groupByKey: all key-value pairs with the same key are grouped together.
  • sortByKey: we can sort by key.
  • mapValues: we can keep the key and map only the corresponding values.
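
A sketch chaining a few of these transformations (spark-shell, sc predefined); the input sentences and stop words are made up for illustration:

val sentences = sc.parallelize(Seq("to be or not to be", "to do is to be"))

val wordCounts = sentences
  .flatMap(s => s.split(" "))        // flatMap: one element per word
  .filter(w => w.nonEmpty)           // filter: selection
  .map(w => (w, 1))                  // map: projection into key-value pairs
  .reduceByKey(_ + _)                // reduceByKey: the "reduce" of MapReduce

val stopWords    = sc.parallelize(Seq(("to", 0), ("or", 0)))
val contentWords = wordCounts.subtractByKey(stopWords)   // keys of the first that are not in the second
val sorted       = contentWords.sortByKey()              // key-value specific transformation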

Action 🟩

Actions trigger the actual computation of the transformations. Apache Spark uses lazy evaluation: the transformations are not executed until an action is called. We say data is materialized when it is actually computed and stored on the machines (which happens only when an action is triggered).

Examples of actions are (a short sketch follows below):

  • collect (collects the whole RDD and brings it back to the client/driver machine)
  • count (with variants such as countByKey, which counts occurrences per key)
    • It is also possible to count by value (countByValue); however, this action keeps the counts in memory, and if there are too many distinct values the task may fail with a memory overflow.
  • take (returns the first $n$ values)
  • top (returns the largest $n$ values according to their ordering)
  • takeSample: similar to take, but returns a random sample of elements
  • saveAsTextFile: saves the whole RDD to a storage service, e.g. S3
  • lookup (returns the values associated with a given key)

Because there is some setup time, the first execution is usually slower than subsequent ones.
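
A sketch of lazy evaluation and a few actions (spark-shell, sc predefined; the output path is hypothetical):

val pairs  = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
val summed = pairs.reduceByKey(_ + _)   // nothing is executed yet (lazy)

summed.count()          // action: triggers the whole computation
summed.collect()        // brings the whole result back to the driver, e.g. ("a",4), ("b",2)
summed.take(1)          // first element only
summed.lookup("a")      // values associated with key "a"
summed.saveAsTextFile("hdfs:///output/word-sums")   // hypothetical output path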

Physical Architecture

Narrow dependency transformations 🟩

In this case, the computation of each output partition depends only on a single input partition. This means it can easily be parallelized: if the data is spread over different machines, all narrow dependencies can be executed in parallel, following this diagram: Massive Parallel Processing-20241109161812672 A slot here is the equivalent of a memory-and-CPU slot from MapReduce. If we have more tasks than slots, the faster slots usually get assigned more tasks (this usually improves latency), so that we have as few idle computers as possible. Within a slot, tasks are executed sequentially. We can also split the slots inside a single node by assigning them different numbers of cores.

The opposite of a narrow dependency is called a wide dependency. A stage is a sequential lineage of narrow dependency transformations. This clearly distinguishes them from the shuffling step, which is often the bottleneck of the system.

Chains of narrow dependencies 🟩

The physical calls of the underlying map/filter/etc. functions are directly chained on each input value to directly produce the corresponding final output value, meaning that the intermediate RDDs are not even materialized anywhere and exist purely logically. From (Fourny 2024).

Communication over the network is usually not efficient. If we have a sequence of narrow dependency transformations, every transformation is applied in a chain on the same machine, meaning that the intermediate RDDs are usually not even materialized; we just do subsequent function calls! This chain is called a stage: a computation step that can run sequentially on a single machine without materializing the intermediate RDDs.
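
As an illustration (spark-shell, sc predefined): a chain of narrow transformations is fused into a single stage, while a wide dependency such as reduceByKey introduces a stage boundary. The indentation levels printed by toDebugString correspond to these shuffle boundaries.

val nums = sc.parallelize(1 to 1000)

val narrowChain = nums.map(_ * 2).filter(_ % 3 == 0)                    // narrow: fused into one stage, no shuffle
val wide        = narrowChain.map(n => (n % 10, n)).reduceByKey(_ + _)  // wide: requires a shuffle

println(wide.toDebugString)   // the indentation marks the shuffle (stage) boundary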

Submitting a Spark task

We can set some parameters so that Spark knows how many resources the job needs (for example the number of executors, or the memory per executor), and provide the code for the task that needs to be executed.

This is called spark-submit.
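
A hedged example of such a submission; the application jar, main class, paths, and resource values are hypothetical placeholders:

# Hypothetical jar, class, paths; resource values are arbitrary.
spark-submit \
  --master yarn \
  --num-executors 10 \
  --executor-cores 4 \
  --executor-memory 8g \
  --class com.example.WordCount \
  wordcount.jar hdfs:///data/input hdfs:///data/output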

Wide dependency transformations 🟩

These types of transformations need some shuffling, corresponding to the communication step of the MapReduce framework. In these cases, we have a DAG describing all the transformations of the task, and execution follows a topological order on this DAG.

Across the cluster, the stages are usually executed sequentially.

Optimization methods

Pinning a RDD 🟩

If an RDD is used by many subsequent RDDs, we can pin it in memory and forget all the older RDDs, so that the new operations can be applied efficiently. Massive Parallel Processing-20241109170821523 Example of pinning from the course slides.

This is also sometimes called checkpointing. It is often not worthwhile when we have only narrow dependency transformations, as the bottleneck is usually the shuffling: we would spend a lot more memory to save a little bit of time.
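
A sketch of pinning (spark-shell, sc predefined); the file path and log format are hypothetical:

val logs   = sc.textFile("hdfs:///data/logs/*.txt")     // hypothetical path
val errors = logs.filter(_.contains("ERROR")).cache()   // pin the filtered RDD in memory

// Both jobs below reuse the pinned RDD instead of re-reading and re-filtering the logs.
val total    = errors.count()
val byModule = errors.map(line => (line.split(" ")(0), 1)).reduceByKey(_ + _).collect()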

Prepartitioning 🟩–

This is an optimization method that enables us to prevent shuffling.

If, however, Spark knows that the data is already located where it should be, then shuffling is not needed. A simple example is when data is sorted before being grouped with the same keys as for sorting: then, Spark has the knowledge that the groups are already consistent with their physical location, and that no additional shuffling is needed between sorting and grouping.

For example, if the data is sorted across blocks but not within them, a grouping on the sort key would normally require shuffling, but in this case it does not. There are also ways to explicitly define the partitioning in Spark.
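
A sketch of explicitly controlling the partitioning (spark-shell, sc predefined); the key-value data is made up:

import org.apache.spark.HashPartitioner

val events    = sc.parallelize(Seq(("ch", 1), ("de", 2), ("ch", 3), ("fr", 4)))
val byCountry = events.partitionBy(new HashPartitioner(8)).persist()   // pre-partition by key

// Spark knows byCountry's partitioner, so this reduceByKey needs no further shuffling.
val counts = byCountry.reduceByKey(_ + _)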

Spark SQL

For standard SQL, take a look at Structured Query Language. Spark usually converts the JSON or CSV input into a DataFrame, which can also be converted back to an RDD if needed.

Explode 🟩

Some SQL operations only exist within Spark, for example explode, which is also called lateral view. It allows restoring first normal form by expanding nested data (e.g. arrays) into multiple values.

Its effect is that the rows are duplicated into as many rows as items in the array, and one particular item of the array is placed on each duplicated row in the corresponding column.

Apache Spark-20241109174056181
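
A sketch of explode, assuming a spark-shell session; the orders table and its nested items column are hypothetical:

import org.apache.spark.sql.functions.{col, explode}

val orders = spark.createDataFrame(Seq(
  ("order1", Seq("apple", "pear")),
  ("order2", Seq("banana"))
)).toDF("order_id", "items")

// One output row per array element: order1 is duplicated for "apple" and "pear".
orders.select(col("order_id"), explode(col("items")).as("item")).show()

// Equivalent SQL formulation with a lateral view:
orders.createOrReplaceTempView("orders")
spark.sql("SELECT order_id, item FROM orders LATERAL VIEW explode(items) exploded AS item").show()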

Distributing and Clustering 🟥++

Spark SQL also gives control over how rows are physically grouped across partitions. The SORT BY operation in Spark sorts only locally, within each partition.

The DISTRIBUTE BY clause forces a repartition by putting all rows with the same value (for the specified field(s)) into the same new partition.

CLUSTER BY is simply distributing by key and sorting locally, so that all rows with the same key end up on the same machine. Note that these operations go against the principle of data independence proposed by Codd.

In Spark SQL, the DISTRIBUTE BY clause is used to control the physical distribution of the data across different nodes of the cluster. When you use this clause, it forces a repartition of the data such that all rows that have the same values for the specified columns are grouped into the same partition. This is especially useful when you anticipate subsequent operations that benefit from this specific distribution, like certain JOIN operations or SORT BY operations within partitions.

For example, consider the following query:

SELECT country, COUNT(*) AS num_senior_citizens FROM persons
WHERE age >= 65
GROUP BY country
HAVING COUNT(*) >= 1000
DISTRIBUTE BY country;

In this scenario, the DISTRIBUTE BY country clause ensures that all rows with the same country are brought together into the same partition. This can be beneficial for performance, especially in large-scale data processing, as it minimizes the amount of data shuffled across the network during subsequent operations that might need to group or join data by the country field.
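
For comparison, a hedged sketch of the related clauses, assuming the persons table from the example above is registered as a temporary view and queried from Scala:

// SORT BY: local sort within each partition (no global order, no extra shuffle).
spark.sql("SELECT first_name, last_name FROM persons SORT BY last_name")

// CLUSTER BY country = DISTRIBUTE BY country + SORT BY country:
// same-country rows land in the same partition and are sorted within it.
spark.sql("SELECT first_name, last_name, country FROM persons CLUSTER BY country")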

Shuffling operations 🟨–

Remember that shuffles are usually the bottleneck of our transformations. Examples of operations that need shuffling are:

  • reduceByKey, groupByKey, and similar key-based aggregations
  • joins
  • distinct
  • global sorting (order by)

References

[1] Fourny, "The Big Data Textbook", 2024.

[2] Zaharia et al., "Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing", USENIX Association, 2012.