Spark - Propagating Data to the Worker Nodes

A brief tutorial on propagating files and folders from the driver node to the worker nodes in Spark.

Spark ships with a shell script, copy-dir.sh, that makes copying data from the driver to the workers easy. This tutorial is a hello-world for that script. Let's assume a basic setup: one driver and three workers, with the cluster running on Amazon EC2.

Connect to the Driver

First, navigate to the EC2 folder where Spark is installed on your local machine. Then establish a terminal session with your driver using the spark-ec2 script. The -k option corresponds to the key-pair name, and the -i option corresponds to the identity file, the path to your .pem file. Replace <*> with your AWS key information.
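The invocation looks roughly like this; `<cluster-name>` is a placeholder for whatever name you gave your cluster when you launched it:

```shell
# From the ec2/ folder of your local Spark installation
./spark-ec2 -k <*> -i <*>.pem login <cluster-name>
```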

Create a Sample File

This should put you in the /root directory of the driver. Let's create a folder and file to send, with a single line that reads "Hello Worker."
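For example (the folder name `hello` and file name `world.txt` are arbitrary choices for this tutorial):

```shell
# Run on the driver, from /root; the names here are example choices
mkdir -p hello
echo "Hello Worker." > hello/world.txt
cat hello/world.txt
```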

Script Usage

Now we are ready to use the script. From /root/spark-ec2, run the following command to copy our folder out to the workers.
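A minimal invocation might look like this, assuming the sample folder from above lives at /root/hello (the script takes the directory to replicate as its argument):

```shell
cd /root/spark-ec2
./copy-dir.sh /root/hello
```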

In this case, I had three workers, so the script printed the hostname of each one as it copied. Let's verify that the file was copied: SSH into one of the workers and print the file's contents.
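One way to check, sketched here with a placeholder hostname (substitute one of the worker hostnames the script printed):

```shell
# Run from the driver; <worker-hostname> is a placeholder
ssh <worker-hostname> cat /root/hello/world.txt
# should print: Hello Worker.
```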

Use Cases

Less trivial use cases might be:

  • Copying updated config files from the driver to the workers, such as /root/spark/conf/spark-env.sh. (See the Configuration section of Running Spark on EC2.)
  • Distributing data to disk on the worker nodes, outside of Spark's broadcast mechanism, which distributes data to memory.