OpenCPU Server on AWS: Accessing R via an API

Motivation and Use Case

Developer - Suppose you're an app developer and want to take advantage of code someone wrote in R. You have no interest in learning R, nor in understanding whatever that fancy statistical algorithm does. You just want to call an API and get the answer.  Black boxes are fine with you, provided that black box is maintained by experts and well vetted.

Analyst - Suppose you're an analyst who has an amazing predictive model working on your local machine in R, but now you want to scale it. You've solved a problem with one set of data, and written the code robustly enough that it's ready to predict against new data. You're tired of having to pull data off of a server, run code locally, and email the results out.

Enter the OpenCPU Server: the R programmer can have their code deployed on a server, and the developer can call an API that runs that R code.  The R programmer is freed up to tackle the next problem instead of emailing results, while the developer can call the code on demand without needing to know how to code in R.

As an aside, Revolution Analytics has a similar solution, DeployR, which comes in both a free Open version and a licensed Enterprise version.  I am not yet qualified to speak about how it compares to OpenCPU, but exploring its functionality is on my to-do list.  If you're seriously considering R behind an API for an enterprise solution, then I would start there.

Getting Started with the OpenCPU Server on AWS

Ubuntu 14.04 LTS is the recommended server OS, so start with that AMI on Amazon Web Services. The OpenCPU documentation recommends a compute-optimized server from the c3 or c4 families, but I used an m3.medium ($0.07/hour) instead of a c3.large ($0.105/hour), saving three and a half cents per hour, and it worked just fine in my test environment.

You'll need at least 3 ports open in your Security Group for this server:  22 for SSH, 80 (and/or 443) for API requests to the OpenCPU Server, and 8787 for RStudio Server.

Once the server is up and running, ssh into the box:
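A minimal sketch, assuming a key pair named mykey.pem and the default ubuntu user on the Ubuntu AMI (substitute your own key file and hostname):

    ssh -i ~/.ssh/mykey.pem ubuntu@ec2-###-###-###-###.compute-1.amazonaws.com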

As per the OpenCPU download instructions, install OpenCPU Server and RStudio Server:
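The commands below follow the pattern in the OpenCPU docs for Ubuntu 14.04; the PPA version suffix changes with each OpenCPU release, so check the download page for the current one:

    # Add the OpenCPU PPA (version suffix varies by release)
    sudo add-apt-repository -y ppa:opencpu/opencpu-1.5
    sudo apt-get update

    # Install the OpenCPU server
    sudo apt-get install -y opencpu

    # Install RStudio Server (optional, but handy for editing code on the box)
    sudo apt-get install -y rstudio-server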

I accepted the defaults during the installation and it works great on AWS.

You can get a UI for testing the API by going to the server's FQDN followed by /ocpu.  For example:

http://ec2-###-###-###-###.compute-1.amazonaws.com/ocpu/

Test the API - Example 1

Now we need some Hello World code to run.  I created a short gist to print "Hello Ben Porter."  It is publicly posted, so feel free to use it in your test.

From the command line on the client, let's use cURL to execute the gist. Replace the EC2 server name, http://ec2-123-456-789-123.compute-1.amazonaws.com, with yours, but leave everything else the same.
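One way to run a script from a gist is to POST its raw URL to base R's source() function; the gist path below is a placeholder, so swap in the raw URL of your own gist:

    curl http://ec2-123-456-789-123.compute-1.amazonaws.com/ocpu/library/base/R/source \
      -d 'file="https://gist.githubusercontent.com/<user>/<gist-id>/raw/hello.R"'

The session key below is made up; yours will differ:

    /ocpu/tmp/x0648d7a344/stdout
    /ocpu/tmp/x0648d7a344/R/.val
    /ocpu/tmp/x0648d7a344/source
    /ocpu/tmp/x0648d7a344/console
    /ocpu/tmp/x0648d7a344/info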

If it returns something like this, then you're in good shape.  Those five lines of output are the results of what was just executed on the server.  That strange number in the middle is the session ID for the code you just executed.

Let's explore that output. First, let's look at the standard out by appending the first line to our host URL and running that as the argument to curl, like so (using the sample session key from above):
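    curl http://ec2-123-456-789-123.compute-1.amazonaws.com/ocpu/tmp/x0648d7a344/stdout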

So all that code did was tell us that we should add "/text" to the end.  Let's try that.
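    curl http://ec2-123-456-789-123.compute-1.amazonaws.com/ocpu/tmp/x0648d7a344/stdout/text

Since the gist just prints a greeting, the response should look something like:

    [1] "Hello Ben Porter"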

There it is in all of its magical glory: the result of R code that is sitting on gist.github.com, executed on the OpenCPU AWS server, and viewable from your local machine by calling an API. I think that's amazing.

Now, we took the long route from executing the script to viewing the output.  That pattern will hold for future executions, so you can go straight to the "/stdout/text" version once you know the session ID.

We only looked at the first of the five lines of output that the initial call reported.  For the sake of completeness, here is how to retrieve the remaining four.
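Again using the sample session key; the comments describe what each resource holds, per the OpenCPU docs:

    # The function's return value, pretty-printed
    curl http://ec2-123-456-789-123.compute-1.amazonaws.com/ocpu/tmp/x0648d7a344/R/.val/print

    # The R code that was evaluated
    curl http://ec2-123-456-789-123.compute-1.amazonaws.com/ocpu/tmp/x0648d7a344/source

    # The input and output as they would appear in an interactive R console
    curl http://ec2-123-456-789-123.compute-1.amazonaws.com/ocpu/tmp/x0648d7a344/console

    # Metadata about the session: R version, platform, loaded packages
    curl http://ec2-123-456-789-123.compute-1.amazonaws.com/ocpu/tmp/x0648d7a344/info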

Example 2 - Overview

The last example was just a warm-up to get familiar with calling the API to execute R code with no data.  This example goes further and will meander a bit in order to demonstrate a few more pieces of functionality:

  • Downloading data
  • Uploading data
  • Referencing data from previous steps in a new session

Example 2 - Downloading Data

By default, R comes with the MASS package, which contains the anorexia dataset. Let's download that dataset from the OpenCPU server.
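Something like this (swap in your server name; the -o flag writes the response to disk):

    curl http://ec2-123-456-789-123.compute-1.amazonaws.com/ocpu/library/MASS/data/anorexia/csv \
      -o ~/anorexia.csv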

A couple of noteworthy things happen in this step.  First, we're pulling from a system library to get the MASS package, so our URL follows the /ocpu/library/<library name>/ convention. If we were accessing a package installed by a user, then we would insert /user/<user name> between /ocpu/ and /library/.

To reference data within the MASS package, or any package for that matter, the next piece of the URL is /data/. If you want to see all of the available datasets, then just request the URL up through that part, and it'll return a list of all 87 datasets available in the MASS package (see the example after this paragraph). In this case, I picked the anorexia dataset and chose to output it as a CSV.  We could have picked text, JSON, tab, or many others. The last piece writes our result to my user home directory, saving it as anorexia.csv.
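For example, to pull that dataset listing:

    curl http://ec2-123-456-789-123.compute-1.amazonaws.com/ocpu/library/MASS/data/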

Example 2 - Uploading Data

To demonstrate uploading data, let's upload that same CSV file.
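A sketch, assuming anorexia.csv sits in your current directory; each -F flag becomes one argument to read.csv():

    curl http://ec2-123-456-789-123.compute-1.amazonaws.com/ocpu/library/utils/R/read.csv \
      -F "file=@anorexia.csv" -F "header=TRUE"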

For this step we needed the read.csv() function from the utils package, so our URL reflects that package change. Everything in quotes at the end is a parameter sent to the read.csv() function. Rather than commas to separate function arguments, each argument gets its own form field (with a plain -d body, you'd join them with ampersands). Also notice that we need to put an "@" symbol before the file name, which tells curl to upload the file's contents.

Example 2 - Referencing Data from Previous Sessions

Now that your data is uploaded, we can reference it by its session ID and run any other R function against it.  That's powerful.  Reread that sentence.  The next example computes summary statistics against that dataset, using several functions from the dplyr library, demonstrating three times in a row how to reference a session ID.  Rather than printing the entire output of each call, I just reference the session ID each one returns.  Notice how that ID is used as an argument in subsequent functions.
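A sketch of that chain; the session keys are hypothetical stand-ins for whatever your own calls return, and since dplyr's usual unquoted column names don't travel well in a URL-encoded body, this uses the standard-evaluation variants (group_by_ and summarise_) with column names passed as strings:

    # 1. Group the uploaded anorexia data (session x05f13f2ac6) by treatment
    curl http://ec2-123-456-789-123.compute-1.amazonaws.com/ocpu/library/dplyr/R/group_by_ \
      -d '.data=x05f13f2ac6&.dots="Treat"'
    # ...returns a new session key, say x07a91c22b8

    # 2. Compute mean pre- and post-treatment weights per group, referencing step 1
    curl http://ec2-123-456-789-123.compute-1.amazonaws.com/ocpu/library/dplyr/R/summarise_ \
      -d '.data=x07a91c22b8&pre="mean(Prewt)"&post="mean(Postwt)"'
    # ...returns another key, say x0c44d1e9f0

    # 3. Print the final result
    curl http://ec2-123-456-789-123.compute-1.amazonaws.com/ocpu/tmp/x0c44d1e9f0/R/.val/print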

Learning More

Developers, I recommend going straight to the package vignette first. To see this in action with JavaScript rather than curl, check out these examples: OpenCPU Example Apps.

Analysts, take a look at the package vignette document as well. It starts getting relevant for you on page 9, part 4.

Other Considerations

Skill Continuity - If this is implemented in production and needs ongoing maintenance, consider your staffing levels, both now and in the future.  If the organization taking this on is primarily an analytics team, they may not understand what it takes to hire and retain the application developers needed to maintain the server side.  Similarly, a tech organization may have difficulty attracting and retaining R developers.

Security - I don't have much expertise here, but it looks like any yahoo who can get through your firewall can start executing R code.  They can't access arbitrary files on the server, only R packages available to the system or users, and code hosted on GitHub or in gists. If you're serious about security, then DeployR may be more appropriate.

Model Management - If you're in an environment where statistical models are highly regulated, like banking, insurance, or pharmaceuticals, then putting your models behind an API centralizes their deployment, simplifying oversight. Having each application handle its own modeling probably won't yield a high level of rigor or validation.

Standards and Consistency - The R programmer should write their code in a robust way that accepts a variety of data, regardless of column names, number of records, etc. Once the "must-have" rules have been established, make them clear to the developers who will consume your API. If someone needs to provide data in a certain format, then make that clear and provide examples of what that data should look like.

Execution Time - The built-in system.time() function and the microbenchmark package are your friends. If your code will be called often, then do your customers a favor by optimizing each piece of it. Also consider executing your model ahead of time and posting the results, so all the API needs to do is return a number rather than wait for R to crunch it.
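Those measure on the R side; to see what your API consumers experience end to end, curl can report the total round-trip time from the client (a flag built into curl, not part of OpenCPU):

    curl -s -o /dev/null -w "%{time_total}\n" \
      http://ec2-123-456-789-123.compute-1.amazonaws.com/ocpu/library/MASS/data/anorexia/csv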


2 Comments

  1. You should totally test out DeployR and let us know.

    Also maybe launch a custom ec2 AMI instance with this stuff pre-loaded?

  2. Ali - DeployR is definitely on my list of things to play around with.

    As for custom or saved EC2 images with this stuff pre-installed, I do that on my personal account for things like a ready-to-go Cassandra server, a Shiny Server and OpenCPU server. I hesitate to make those publicly available because I would have to scrub personal things out, like my users and keys. Plus, this software is continually updating, and I generally want to experiment with the latest and greatest.
