A Gentle Intro to R for SQL Programmers

The target audience of this tutorial is someone who knows SQL, but wants to start using R. I originally created this code as an intro to R for people on my team who were great with SQL, but started to deal with data outside of a database and still wanted to use SQL.

Here is what is covered in this tutorial:

Getting Started

  • Installing Packages
  • Loading Packages
  • Setting the working directory
  • Printing directory contents

Import:

  • Two examples for reading in .xlsx files
  • One for reading in .csv files
  • One for tab delimited .txt files

Inspect and Clean:

  • Count rows and columns
  • Print top and bottom 6 rows
  • Search for missing values, NA values, and empty strings
  • Summary stats on each column
  • Change column names with periods to underbars

SQL Examples:

  • Count rows without creating a table
  • Creating a table from a sql statement
  • Where clause
  • Left join
  • Case statement

Export Examples:

  • .xlsx
  • .csv
  • .txt (tab delimited)

Getting Started

If you're familiar with git, pull down the sample data and code from my github repository.

If you're not familiar with git, then just download the zip file from this page. Look for the Download Zip icon on the lower right.

Install Packages

We'll be using 3 packages to extend the functionality of base R. The first two are different methods to read in data from Excel, and the third provides SQL functionality against data frames in R.  Next we'll install those packages, which only needs to be done once. First we'll cover the code based method from the R console, then the point and click method.

If you're using RStudio, an alternate method to install these packages is to click Tools, then Install Packages.

Load Packages

Now that the packages or libraries are installed on your machine, they are ready to be loaded. This is something that you'll need to do every time you open R if you would like to use the functionality in these packages.

Set the Working Directory

Next we'll cover some basic file/folder operations in R to make sure we're reading data from the right location. To view the directory that your session is set to, use getwd().  To change the default directory, use setwd().  To print the files in that folder, use list.files(). Here is what the interactive session looks like using those commands.

The final list.files() command may seem trivial when you can just view the directory contents outside of R, but proves that we can see the three different datasets that we'll be reading in the next step.  You can also use that command to store the names of the files for later reference dynamically in the code.

Reading in Data

The following code demonstrates four ways to read in data. The first two read in XLSX files, and I tend to have the best experience using the first one, XLConnect. Notice the sheet or tab name arguments. The third one does not require an additional packages and reads in the CSV file. Finally, the forth method reads in a tab delimited file. All of these create a data frame and store the result in the df object.  That object is in memory and like all objects in memory, it goes away when you close R.

Inspect the Data - Printing to the Console

This section provides you with ways to inspect the data. First we'll count the rows, with nrow(), then count the columns with ncol(). The head() function is equivalent to "select top 6 from df" and tail() is just the last 6 rows.  To print the entire object, simply put that object name.  If it contains more than 1000 lines, then R will only print the first 1000.

Inspect the Data - Look for Missing Values

The following code is a little more advanced, but you can easily reuse it just by replacing the df data frame with the name of your data frame. Each of these three sapply() functions apply a function to each column in the df data frame. It is a function, that applies a function to each column in your data. In the first case we count the number of TRUE's when the row is NA by applying the is.na() function to each column.  Then we do the same for NULL values, and again for empty strings. This example is fairly uninteresting because no columns have missing values, but it's always good to check.

Inspect the Data - Summary Statistics

The summary() command is very powerful. Based upon the data type, it counts frequencies of values, or computes quantile-type statistics.

Inspect the Data - Column Names 

In SQL, a period denotes moving from a database to a table, or a table to a column, so periods in the names of fields is not allowed. By default, R replaces spaces with periods when it reads in column names, so this needs to be changed before we can start writing SQL against it. The following code prints the column names, then replaces the period with an underbar, and finally prints the column names again to prove that it worked.

SQL Examples

The following examples all use the sqldf() function, which comes from the sqldf package that we loaded earlier. The sqldf package uses sqlite syntax, so search for that if you need sql syntax questions.

Starting off with a simple example, let's do the equivalent of nrow() with sqldf().

Notice that example did not store the result, it just printed it to the console. To store the result and effectively do a "create table" statement, simply send the results to an object. Let's tall that object rec_count. It will just quietly store the result, so I printed the contents by just putting object name.

Here is an example of using a where clause and store the result in a new table. Notice that we didn't define a table first, or go through the shenanigans of defining field types and lengths. That's huge if you're in the habit of predefining these things in a database.  The df_nc object inherits the field types from the data it was created from.

Next here is an example of a left join, a case statement, and printing the first 6 rows of the resulting dataframe. Notice what happens in the second sql statement. We are overwriting a dataframe that we are using in the from statement.

Exporting Data

Finally, here are three ways to get data out of R. Just like how we read in Excel, CSV and tab delimited data, these examples show you how to write the dataframe to disk in one of those formats.

It wrote the data to the same place that we read it from.  Run getwd() to see what directory R wrote these files to.

Conclusion

Now you're equipped with the basics of reading in data from different files formats, inspecting the data, writing SQL against it, and exporting the results.

Posted in R and tagged , , .

Leave a Reply

Your email address will not be published. Required fields are marked *