SparkR on EC2 - Up and Running in 30 Minutes


The purpose of this post is to walk through spinning up a Spark cluster on Amazon Web Services EC2 servers and using R to interface with that cluster. The Apache Spark distribution comes with an EC2 script to do this, which was extremely helpful, but I had a hard time getting the newly released SparkR to work. The first part of this post follows the official Running Spark on EC2 guide closely, but completes the tutorial for those wanting to use the SparkR API and RStudio. In particular, this post extends that guide with the following:

  • Installing RStudio on the Driver
  • Dealing with the ulimit error from sparkR.init()
  • Configuring R to connect to the Spark Cluster
  • Loading data into HDFS to get the dataframe.R tutorial to work

AWS Access Credentials

Set the secret key and access ID as environment variables. If you don't have your secret key and access ID, then start here. Once you have them, create the environment variables:
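These are the two environment variable names the spark-ec2 script reads. The values below are placeholders (Amazon's documented example credentials); substitute your own:

```shell
# Placeholders -- substitute the key ID and secret from your AWS account
export AWS_ACCESS_KEY_ID=AKIAIOSFODNN7EXAMPLE
export AWS_SECRET_ACCESS_KEY=wJalrXUtnFEMI/K7MDENG/bPxRcYEXAMPLEKEY
```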

Download Spark

The script for launching an EC2 cluster comes with the distribution of Spark, so let's start with pulling down Spark. Download Spark here. As of this writing, the latest release was 1.4.0, and I chose the Hadoop 2.6 and later option, but I don't think that matters for our purposes. We won't actually be running the local install of Spark in this tutorial, just its scripts.

Once the file is downloaded, place it in your desired location and unpack the file.
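For example, something like the following downloads and unpacks the 1.4.0 / Hadoop 2.6 build described above (the archive URL is an assumption; any Apache mirror works):

```shell
# Version string matches the release chosen above -- adjust if yours differs
SPARK_PKG=spark-1.4.0-bin-hadoop2.6
wget "https://archive.apache.org/dist/spark/spark-1.4.0/${SPARK_PKG}.tgz"
tar -xzf "${SPARK_PKG}.tgz"
```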

Running the EC2 Script

The script sits in spark-1.4.0-bin-hadoop2.6/ec2.  Here is an example of how to use it. It launches 4 servers (1 driver and 3 workers), all m3.xlarge instances, which have 4 cores and 15 GB of memory each. For other options, see the EC2 pricing page. My initial temptation was to go for the memory-optimized r3 machines, but I don't think that's always the best choice. Each application is different and may need a different core/memory ratio, so tune your cluster accordingly. After shaving off some memory for the OS in this example, we're left with about 40 GB across the workers for Spark to consume.
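A launch command along these lines should do it; the key pair name, identity file, region, and cluster name are placeholders to substitute with your own:

```shell
CLUSTER_NAME=spark-cluster
# 1 driver + 3 workers, all m3.xlarge (4 cores / 15 GB each)
./spark-ec2 \
  --key-pair=my-keypair \
  --identity-file=my-keypair.pem \
  --slaves=3 \
  --instance-type=m3.xlarge \
  --region=us-east-1 \
  launch "$CLUSTER_NAME"
```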

The script takes about 5-10 minutes to run and returns a verbose status update at each step of the way. One of the last pieces of output you'll see is the URL for the monitoring page hosted on the driver, which will look something like this:

Spark standalone cluster started at

The script just created the necessary AWS security groups for the driver and workers, but we need to open one more port for RStudio: 8787.  You have two options: edit the EC2 script to add it, or do it manually through the AWS Console. If you're tempted to edit the script, it looks like you need to insert a line somewhere below line 480 that looks like this for port 8787:

If you do it through the AWS console, only the spark-cluster-master security group needs 8787 opened up, not both.
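If you'd rather script the console route, something like this should work, assuming you have the AWS CLI configured (the group name follows spark-ec2's <cluster-name>-master convention; 0.0.0.0/0 opens the port to everyone, so tighten the CIDR if you can):

```shell
PORT=8787
# Open the RStudio port on the master's security group only
aws ec2 authorize-security-group-ingress \
  --group-name spark-cluster-master \
  --protocol tcp --port "$PORT" \
  --cidr 0.0.0.0/0
```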

Install RStudio

Now we'll SSH into the box to complete the setup. Connect by running this on your local machine, from the same folder the initial EC2 script was run from. In fact, it's the same script, just with different arguments to SSH into the box.
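The script's login action handles the SSH connection for you (same key pair and cluster name placeholders as before):

```shell
CLUSTER_NAME=spark-cluster
# Opens an SSH session on the driver as root
./spark-ec2 \
  --key-pair=my-keypair \
  --identity-file=my-keypair.pem \
  login "$CLUSTER_NAME"
```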

Now that you are connected to the driver, download and install RStudio from that server.  The Spark AMI uses a Red Hat flavor of Linux, so no apt-get here. R itself is already installed, so there's no need to install it.

Now create a user to log into RStudio with. The first line creates a user called rstudiouser, and the second prompts you to set a password.
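On the driver, something like the following installs RStudio Server and creates the user. The rpm version and URL are assumptions from around the time of writing; check RStudio's download page for the current RedHat/CentOS 64-bit build:

```shell
# Download and install RStudio Server (yum resolves its dependencies)
RSTUDIO_RPM=rstudio-server-rhel-0.99.446-x86_64.rpm
wget "https://download2.rstudio.org/${RSTUDIO_RPM}"
yum install -y --nogpgcheck "$RSTUDIO_RPM"

# RStudio won't let root sign in, so create a regular user and set a password
useradd rstudiouser
passwd rstudiouser
```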

Dealing with the ulimit issue

We're done with the installation, but if you try running sparkR.init() within RStudio at this point, you'll get an error message that looks like this. If you're using Spark 1.4.1, this issue was resolved by bypassing the command when you're not root. If you're using Spark 1.4.0, this still applies.

Line 21 of the config file tries to increase the default limit on the number of files rstudiouser can have open at one time on the OS. Apparently, during shuffles of large datasets, Spark can exceed this maximum. For the sake of getting up and running quickly, we'll just comment out the offending line from our config file and accept the 1024 limit for now.
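Assuming the offending line is the ulimit call in /root/spark/conf/spark-env.sh (as it was on the 1.4.0 AMI for me), a one-line sed run as root comments it out:

```shell
SPARK_ENV=/root/spark/conf/spark-env.sh
# Prefix any line starting with "ulimit" with a comment marker
sed -i 's/^ulimit/# ulimit/' "$SPARK_ENV"
```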

I didn't find a simple, straightforward way to adjust the ulimit for a user. If you run across a good way to do it, please comment below. I couldn't do a simple ulimit -n 4096, even as root.

SparkR within RStudio

Leaving the terminal, we're ready to write some R code. Access RStudio via port 8787 on your driver machine; the URL should look something like this: http://<driver>:8787.

First we'll need to load the SparkR package. It's not on CRAN, but that's OK: it's already on the local machine, and we just need to tell R where to look by passing an alternate library location to the library() function. Second, we also need to tell R where Spark is installed, which is /root/spark on this machine.

Now we are ready to initialize a Spark context with the cluster. Go to your driver's monitoring page on port 8080 to capture the info you need. The first thing to note for the master argument is the Spark master URL, which begins with "spark://" and ends with ":7077".  Next, take note of how much memory remains available on each worker node.  We spun up 15 GB machines, but only 13.4 GB remains available for processing. For my app, I gave each worker 12 GB to use. Note: the spark.executor.memory option is not how much each core gets, but how much memory the entire instance receives.
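Putting it together in RStudio, a sketch along these lines should work (the master hostname is a placeholder to pull from your port-8080 page, and "12g" reflects the memory choice above):

```r
# SparkR isn't on CRAN; point library() at Spark's bundled copy
library(SparkR, lib.loc = "/root/spark/R/lib")
Sys.setenv(SPARK_HOME = "/root/spark")

# Substitute your master's hostname from the monitoring page
sc <- sparkR.init(
  master = "spark://ec2-XX-XX-XX-XX.compute-1.amazonaws.com:7077",
  sparkEnvir = list(spark.executor.memory = "12g"))
sqlContext <- sparkRSQL.init(sc)
```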

(Screenshot: Spark monitoring page)

That should spew about 30 lines of output. Get used to it. Spark is chatty.

SparkR - The built-in DataFrame.R Tutorial

You should be able to pick up the dataframe.R tutorial from here. The code sits in $SPARK_HOME/examples/src/main/r. You may encounter an error on line 17; if so, the data isn't loaded into HDFS. See the next section for how to do that.

Loading Data into HDFS

If you encountered an error on the people.json file (it can't be found), then you need to load the data into HDFS.

Leaving R for a moment, we return to the terminal session on the driver. The following commands create the folder the tutorial is looking for and load the people.json file into the temporary HDFS store. It is temporary in that if you reboot the instance, your data goes away.
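Run these on the driver. The paths below follow the EC2 AMI layout (Spark in /root/spark, the ephemeral HDFS tools in /root/ephemeral-hdfs), since dataframe.R resolves people.json under SPARK_HOME:

```shell
HDFS=/root/ephemeral-hdfs/bin/hadoop
# Create the directory the tutorial expects (add -p on Hadoop 2;
# older Hadoop creates parent directories automatically)
$HDFS fs -mkdir /root/spark/examples/src/main/resources
# Copy the local example file into HDFS at the same path
$HDFS fs -put /root/spark/examples/src/main/resources/people.json \
    /root/spark/examples/src/main/resources/
```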

Shutting Down the Cluster

Once you're done running the Spark application, you can either stop the cluster or destroy it. Stopping shuts down the servers so you can start them again later; destroying terminates them for good. These commands are run from the terminal on your local machine.
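Both actions go through the same spark-ec2 script (cluster name and region are placeholders as before):

```shell
CLUSTER_NAME=spark-cluster
# Pause the cluster; resume later with the "start" action
./spark-ec2 --region=us-east-1 stop "$CLUSTER_NAME"
# Or terminate everything permanently
./spark-ec2 --region=us-east-1 destroy "$CLUSTER_NAME"
```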

Final Thoughts

Sometimes my Spark cluster just jams up and spews errors when I try to overwrite a Spark context object. I found that clicking Session -> Clear Workspace and Session -> Restart R tends to clean things up.

Having experience with dplyr may be a bit of a disadvantage when using pipes to chain together Spark functions.

  • You can't put more than one statement in a filter() function.
  • Column names need the "dataframe$" prefix rather than just declaring the column name.
  • If you create a column in summarize, then you can't reference it again in the same chain.

That last one is OK because the Catalyst optimizer will examine all of your code before running, but it makes your code a little more difficult to read.  All of that aside, SparkR seems like a great piece of software, and I'm looking forward to exploring it now that I can get R working with it. 🙂

Update: someone in Japan felt the same way I did about how the syntax isn't quite as powerful as dplyr's and did something about it. Here is a blog post describing the SparkRext package that addresses some of these issues.

Posted in R.
