Average Lifespan with Right Censored Data

Business Question: Say your boss asks, "What is the average lifespan of an account?" It seems like an easy question at first. Just take the difference between the end date and start date for each account, and compute the average of those deltas. Here is the rub: what happens when accounts haven't been closed […]

SparkR on EC2 - Up and Running in 30 Minutes

Motivation: The purpose of this post is to walk through spinning up a Spark cluster on Amazon Web Services EC2 servers and using R to interface with that cluster. The Apache Spark distribution comes with an EC2 script to do this, which was extremely helpful, but I had a hard time getting the newly released SparkR to […]

Spark - Propagating Data to the Worker Nodes

A brief tutorial on propagating files and folders from the driver node to the worker nodes on Spark. Spark ships with a shell script, copy-dir.sh, that makes copying data from the driver to the workers very easy. This tutorial is a hello world for using that script. Let's assume a basic setup, 1 driver and 3 workers and that […]

dplyr with PostgreSQL

A general complaint with R is that the size of your data is limited to the amount of memory available on your machine. One solution, if it is large enough for your data, is to spin up a cloud server with 224 GB of RAM and install R. Another solution is to load your data […]

HackathonCLT 2015

My wife and I came in 1st place in the HACK category at HackathonCLT 2015. It was an overnight, 24-hour competition, and the challenge was to find an innovative way to deliver groceries to customers without them coming into the store. We competed against 88 other people, analyzing 140 million transactions from the grocery store Harris Teeter. […]