
Inner workings of Apache Spark

Here’s a breakdown of the key internal aspects of Apache Spark’s inner workings:

1. Architecture:

  • Master/Worker: Spark follows a master/worker architecture.
    • Driver Program (Master): The heart of the Spark application. It:
      • Converts user code into Jobs.
      • Divides Jobs into Stages.
      • Divides Stages into Tasks.
      • Schedules tasks to run on Executors.
      • Manages the overall execution of the application.
    • Cluster Manager: Responsible for allocating resources (worker nodes) to Spark applications. Examples include Spark’s Standalone Manager, YARN (Hadoop), Mesos, and Kubernetes.
    • Worker Nodes: Machines in the cluster that run Executors.
    • Executors (Workers): JVM processes running on worker nodes that:
      • Execute the Tasks assigned by the Driver.
      • Store data in memory or on disk for the application.
      • Report the status of tasks back to the Driver.
  • Key Abstractions:
    • RDD (Resilient Distributed Dataset): The fundamental, immutable, distributed collection of data elements in Spark. RDDs are fault-tolerant.
    • DAG (Directed Acyclic Graph): Represents the logical execution plan of the transformations applied to RDDs. The Driver’s DAG Scheduler creates this.
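
To make these pieces concrete, here is a minimal PySpark sketch (assuming a local installation; the app name and resource values are illustrative, not recommendations) that creates the Driver via a SparkSession, requests executor resources from the cluster manager, and runs one action on an RDD:

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("architecture-demo")           # identifies this application's driver
        .master("local[4]")                     # cluster manager URL; local[4] runs 4 worker threads
        .config("spark.executor.memory", "2g")  # memory per executor (illustrative value)
        .config("spark.executor.cores", "2")    # cores per executor (illustrative value)
        .getOrCreate()
    )

    sc = spark.sparkContext
    rdd = sc.parallelize(range(100), numSlices=4)  # an RDD split into 4 partitions
    print(rdd.count())                             # the driver turns this action into tasks on executors
    spark.stop()

On a real cluster, .master() would point at YARN, Kubernetes, or a standalone master instead of local[4], but the driver and executor roles stay the same.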

2. Execution Model:

  • Lazy Evaluation: Transformations on RDDs are not executed immediately. Spark builds up the DAG of transformations.
  • Actions Trigger Execution: Computation begins only when an action (e.g., collect(), count(), saveAsTextFile()) is called on an RDD.
  • Job, Stage, Task Hierarchy:
    • An Application is your Spark program.
    • Actions within an application create Jobs.
    • Jobs are broken down into Stages. Stage boundaries are determined by wide (shuffle) dependencies.
    • Stages are further divided into parallel Tasks that run on partitions of the data.
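
The following sketch (reusing the spark session from the previous example) shows lazy evaluation in practice: the two transformations only extend the DAG, and the single action at the end is what creates a Job:

    sc = spark.sparkContext

    rdd = sc.parallelize(range(1, 1_000_001))
    evens = rdd.filter(lambda x: x % 2 == 0)   # transformation: nothing executes yet
    squares = evens.map(lambda x: x * x)       # transformation: the DAG grows, still nothing executes

    total = squares.sum()                      # action: a Job is submitted, split into Stages and Tasks
    print(total)

Because filter() and map() are narrow transformations, this particular job runs as a single stage; a shuffle would introduce a stage boundary.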

3. Data Partitioning:

  • Spark distributes data across multiple partitions, which reside on different nodes in the cluster.
  • The level of parallelism in Spark computations is largely determined by the number of partitions.
  • Partitioning is influenced by the input data source (e.g., HDFS blocks) or can be controlled through operations like repartition() and coalesce().
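
A minimal sketch of inspecting and changing the partition count (the numbers are arbitrary and only for illustration; spark is an existing SparkSession):

    rdd = spark.sparkContext.parallelize(range(1000), numSlices=8)
    print(rdd.getNumPartitions())   # 8

    wider = rdd.repartition(16)     # full shuffle; can increase or decrease the partition count
    narrow = rdd.coalesce(2)        # avoids a full shuffle; only decreases the partition count
    print(wider.getNumPartitions(), narrow.getNumPartitions())   # 16 2

Choosing between repartition() and coalesce() is a trade-off between better data balance (at shuffle cost) and cheaper reorganization.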

4. Shuffle Operations:

  • Shuffle: A costly operation that redistributes data across partitions, often across the network between executors.
  • Shuffles occur during “wide” transformations that require data with the same key to be together (e.g., groupByKey(), reduceByKey(), join()).
  • Shuffle Write: Map tasks write intermediate data to disk.
  • Shuffle Read: Reduce tasks fetch the necessary data from the shuffle write outputs of the map tasks.
  • Optimizing shuffles is crucial for Spark performance. Strategies include reducing the amount of data shuffled, using broadcast joins for smaller datasets (as in the sketch below), and tuning shuffle-related configurations.
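
The sketch below (assuming an existing SparkSession named spark; the data is toy data) contrasts a wide transformation that forces a shuffle with a broadcast join, one common way to avoid shuffling the larger side:

    from pyspark.sql.functions import broadcast

    pairs = spark.sparkContext.parallelize([("a", 1), ("b", 2), ("a", 3)])
    counts = pairs.reduceByKey(lambda x, y: x + y)   # wide transformation: triggers a shuffle
    print(counts.collect())                          # [('a', 4), ('b', 2)] (order may vary)

    large = spark.range(1_000_000).withColumnRenamed("id", "key")
    small = spark.createDataFrame([(0, "zero"), (1, "one")], ["key", "label"])
    joined = large.join(broadcast(small), "key")     # the small table is shipped to every executor,
    joined.show(2)                                   # so the large table is not shuffled

Note that reduceByKey() still shuffles less data than groupByKey() would, because it combines values on the map side before the shuffle write.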

5. Memory Management:

  • Spark manages memory on both the Driver and the Executors.
  • Executor Memory: Divided into regions:
    • Reserved Memory: Small amount reserved by the system.
    • Execution Memory: Used for computation during tasks (shuffles, joins, sorts).
    • Storage Memory: Used for caching data (RDDs, DataFrames).
    • User Memory: For user-defined objects.
  • Unified Memory Management (UMM): Spark dynamically manages the sizes of the Execution and Storage memory regions within a defined fraction of the JVM heap. This allows for flexibility based on workload.
  • Off-Heap Memory: Can be enabled to provide memory outside the JVM heap, potentially reducing garbage collection overhead for very large datasets.
  • Efficient memory management is critical to prevent disk spills and out-of-memory errors, and to maximize in-memory processing for performance.
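
The memory regions above are controlled by a handful of settings, shown in this sketch (the values are illustrative defaults or examples, not tuning advice):

    from pyspark.sql import SparkSession
    from pyspark import StorageLevel

    spark = (
        SparkSession.builder
        .appName("memory-demo")
        .config("spark.memory.fraction", "0.6")          # heap share for unified execution + storage memory
        .config("spark.memory.storageFraction", "0.5")   # portion of that region protected for storage
        .config("spark.memory.offHeap.enabled", "true")  # allow allocations outside the JVM heap
        .config("spark.memory.offHeap.size", "1g")       # off-heap size (illustrative)
        .getOrCreate()
    )

    df = spark.range(10_000_000)
    df.persist(StorageLevel.MEMORY_AND_DISK)   # cache in storage memory, spilling to disk if it does not fit
    print(df.count())                          # the action materializes the cache
    df.unpersist()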

In essence, Spark’s internal workings involve:

  • A distributed architecture for parallel processing.
  • A lazy execution model optimized through DAGs.
  • Data parallelism achieved through partitioning.
  • Data redistribution (shuffling) for certain operations.
  • Sophisticated memory management to leverage in-memory processing.

Understanding these internal mechanisms is key to writing efficient and scalable Spark applications.
