dplyr with PostgreSQL

A common complaint about R is that the size of your data is limited by the amount of memory available on your machine. One solution is to spin up a cloud server with 224 GB of RAM and install R there, assuming that is enough for your data. Another solution is to load your data into a local database and access it with dplyr.

The dplyr vignette on databases recommends using PostgreSQL.

"If you don’t already have a database, here’s some advice from my experiences setting up and running all of them. SQLite is by far the easiest to get started with, but the lack of window functions makes it limited for data analysis. PostgreSQL is not too much harder to use and has a wide range of built in functions. Don’t bother with MySQL/MariaDB: it’s a pain to set up and the documentation is subpar. "

Running under the assumption that your data is too large to fit in R, which is why you need a database in the first place, this post focuses on getting data into Postgres from psql. This is in contrast to populating the database from R, which would presumably be an iterative and tedious job if the data were too big.

This post is broken into the following sections:

  1. Installing PostgreSQL on Ubuntu
  2. Creating and populating a database and table from the command line
  3. Connecting to PostgreSQL from R with dplyr
  4. Sample dplyr commands against PostgreSQL

Part 1 - Installing PostgreSQL

Assuming you don't already have Postgres installed, run the following from the terminal.
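On Ubuntu, something like the following should do it (package names assume the stock apt repositories):

    sudo apt-get update
    sudo apt-get install -y postgresql postgresql-contrib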

Before leaving bash, either download the titanic training dataset from Kaggle.com manually, or pull it down from my GitHub page. If you don't already have git installed, then you'll need to run the first line too. This saves test.csv and train.csv to the /tmp/titanic-dataset folder.
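A sketch of the git route is below; the repository URL is a placeholder, so substitute the actual address from the GitHub page.

    # First line: only needed if git isn't installed yet
    sudo apt-get install -y git
    # Placeholder URL -- replace with the actual repository address
    git clone https://github.com/<your-username>/titanic-dataset.git /tmp/titanic-dataset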

Next, enter the PostgreSQL prompt, psql, as the postgres user.
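On a fresh install, the postgres system user is the database superuser, so switch to it when launching psql:

    sudo -u postgres psql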

Now that you're in the psql prompt, create a database called testdb, then connect to that database. Finally, create a schema called testschema. Notice the terminating semi-colons on everything except the database connect statement, which is a psql meta-command rather than SQL.
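In psql, that sequence looks like this:

    CREATE DATABASE testdb;
    \c testdb
    CREATE SCHEMA testschema;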

Create the table based on the titanic dataset we downloaded earlier, calling the table titanic in the testschema schema. Having to define the fields and data types up front is probably annoying for the average R programmer, whose workflow is typically to read in the data, let R guess the data types, and change the ones it got wrong later. Here we need to define each column name, followed by its data type. This link was a good intro reference for defining data types.
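Here is one plausible table definition for the Kaggle training file. The column names follow the CSV header; the types are a reasonable guess rather than the only valid choice.

    CREATE TABLE testschema.titanic (
        passengerid INTEGER,
        survived    INTEGER,
        pclass      INTEGER,
        name        VARCHAR(255),
        sex         VARCHAR(10),
        age         NUMERIC,
        sibsp       INTEGER,
        parch       INTEGER,
        ticket      VARCHAR(50),
        fare        NUMERIC,
        cabin       VARCHAR(20),
        embarked    VARCHAR(5)
    );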

Read in the data from the titanic dataset. Recall that I saved it to the /tmp/titanic-dataset directory, so I'm reading in the data from there. It is a comma-separated file and does have a header row.
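A server-side COPY along these lines does the job (the path assumes the download location from earlier):

    COPY testschema.titanic
    FROM '/tmp/titanic-dataset/train.csv'
    WITH (FORMAT csv, HEADER true);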

If it worked, it will say something like COPY 891, telling you that it read in 891 records. If you get an error about not being able to read the data, it may be that the postgres user does not have permission to read the file. If that occurs, open a separate terminal window, navigate to the directory where the file is located, and add read permissions for all users.
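For example:

    cd /tmp/titanic-dataset
    chmod a+r train.csv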

Let's get back to the psql shell and test out the data by looking at the first 10 records.
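For example:

    SELECT * FROM testschema.titanic LIMIT 10;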

That's it on the database setup and population side. We're done with bash and psql, and can now jump into accessing the data with R. I don't know about you, but I find it amazing that we can go from nothing to a populated database in less than 10 lines of code.

Part 2 - Connecting to PostgreSQL from R

First load the necessary libraries and connect to the Postgres database. All of these arguments correspond to choices you made earlier for the database, schema, and table name. The user name and password are the default values, so they should work for you too on a fresh install.
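Here's a minimal sketch using DBI with the RPostgres driver (older dplyr versions used src_postgres() instead; either approach works, but dbConnect() is the current idiom). The credentials below are the defaults assumed in this post.

    library(DBI)
    library(RPostgres)
    library(dplyr)
    library(dbplyr)

    # Credentials and names match the choices made earlier in this post
    con <- dbConnect(RPostgres::Postgres(),
                     dbname   = "testdb",
                     host     = "localhost",
                     port     = 5432,
                     user     = "postgres",
                     password = "postgres")

    # A reference to the remote table, not a local copy of the data
    titanic <- tbl(con, in_schema("testschema", "titanic"))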

This is a link to the remote table, not a local copy of the data. Let's do some basic checks, like viewing the column names and looking at the first 6 records.
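For instance:

    colnames(titanic)   # column names from the remote table
    head(titanic)       # translated to a LIMIT 6 query against Postgres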

Now for some light dplyr. Let's count the number of survivors and non-survivors, then print the result. Since the result only has two rows, it prints in full. Note that up to this point, the syntax is identical to regular dplyr.
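Something like the following; the pipeline is plain dplyr, and the database backend translates it to SQL behind the scenes:

    survivor_counts <- titanic %>%
      group_by(survived) %>%
      summarise(n = n())

    survivor_counts   # prints both rows: survived = 0 and survived = 1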

To physically execute the query and pull the data into R, use collect(). Aside from the upfront connection, that's the only extra step you need with dplyr and Postgres to pull the data down locally.
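For example:

    survivor_df <- collect(survivor_counts)   # executes the query and returns a local tibble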

 
