Setting up Cassandra with IPython on AWS

How to get up and running with a 3-node, single-server instance of Cassandra with IPython on a cloud server.

This article focuses on installing and configuring the following:

  • Cassandra on AWS EC2 via the DataStax AMI
  • Python and IPython Notebook web IDE

Step 1:  Cassandra

You have two options for installing Cassandra: install the open source version from the Apache website, which seems long and involved, or go with the free DataStax Community version.  The folks at DataStax make an image available on Amazon EC2, along with a step-by-step walkthrough for getting your server running.  The walkthrough is great, but I struggled with a couple of parts, which are the focus of this first section.  Read through their tutorial first, then see my clarifying comments.

Firewall Options:

  • When setting up the firewall, you'll see that they ask you to open port 8888, which is for the OpsCenter server monitor that comes with the DataStax install.  However, IPython Notebook wants the same port by default.  I chose port 9999 for IPython Notebook, with only my IP address allowed.
  • If you plan on adding RStudio Server at some point, add port 8787 while you're at it.  I recommend allowing only your IP for this port.
  • They leave port 22 open to all IPs, which seems unnecessary.  Since that is just your SSH connection into the box, only you need access, so you might as well allow only your IP.

Launch Instances:

  • Instance Type:  You have the option to use HVM or PV instances.  After reading the documentation, I chose PV.  If this is just for development and learning the tools, either should be fine.
  • Server Size:  The smallest instance they recommend, m3.large, works just fine for my purposes.  As of this writing, that runs $0.14 per hour, or $3.36 per day.  You can reduce that cost further by shutting down (not terminating) your EC2 instance from the AWS console when you're not using the server, but beware of config issues (more on that below).
  • Number of Nodes:  Under the User data section, I chose 3 nodes rather than 6 since this is a single-server instance.
  • Storage:  The instructions are vague on what to do for storage options.  Below the default entry that reads "Root, /dev/sda1...", add another drive.  If you choose Instance Store for your secondary drive, everything stored on it will be lost when you shut down and restart.  EBS (Elastic Block Storage) is persistent and will survive a shutdown and restart.  I chose the "Delete on Termination" option for my EBS storage, so it would go away when I finally terminated the EC2 instance.  If you miss this step and do not add another drive, you'll get an error in ~/datastax_ami/ami.log that reads something like this:

[INFO] Unformatted devices: []
[INFO] Clear "invalid flag 0x0000 of partition table 4" by issuing a write, then running fdisk on the device...
[ERROR] Exception seen in ds1_launcher.py:
Traceback (most recent call last):
File "/home/ubuntu/datastax_ami/ds1_launcher.py", line 22, in initial_configurations
ds2_configure.run()
File "/home/ubuntu/datastax_ami/ds2_configure.py", line 1078, in run
File "/home/ubuntu/datastax_ami/ds2_configure.py", line 933, in prepare_for_raid
File "/home/ubuntu/datastax_ami/ds2_configure.py", line 879, in format_xfs
IndexError: list index out of range

You can verify that the launch was successful by connecting to the box and listing the keyspaces.  Navigate to where you saved your keypair.pem file and SSH into your box.  The default username is ubuntu, so the command takes the form ssh -i keypair.pem ubuntu@ followed by your instance's Public DNS.

Keep in mind that if you shut down your instance and start it back up again to save money, it will have a new Public DNS.  That means your SSH connection command needs to be updated with the new name.  Here is your clue that you used the old Public DNS rather than the new one:

@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@    WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED!     @
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
IT IS POSSIBLE THAT SOMEONE IS DOING SOMETHING NASTY!
Someone could be eavesdropping on you right now (man-in-the-middle attack)!
It is also possible that a host key has just been changed.

Now that you have connected to the box, enter the Cassandra Query Language shell (cqlsh) and list the keyspaces (desc keyspaces;).  The output should be a short list of keyspace names, including the built-in system keyspace.

While you are in the Cassandra shell, create a test keyspace, which is analogous to a database or collection of tables.  This walkthrough creates the keyspace here in cqlsh and uses the Python Cassandra driver for everything after that.
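
If you'd like the CREATE KEYSPACE statement spelled out, here is a minimal sketch that builds one to paste into cqlsh.  The keyspace name demo, the SimpleStrategy class, and the replication factor of 1 are my assumptions for a sandbox; adjust them to your setup.

```python
def create_keyspace_cql(name, replication_factor=1):
    """Build a CREATE KEYSPACE statement to paste into cqlsh.

    The replication settings are assumptions for a sandbox, not a
    recommendation for a real cluster.
    """
    if not name.isidentifier():
        raise ValueError("keyspace name should be a simple identifier: %r" % name)
    return (
        "CREATE KEYSPACE {name} WITH replication = "
        "{{'class': 'SimpleStrategy', 'replication_factor': {rf}}};"
    ).format(name=name, rf=int(replication_factor))


# "demo" is a placeholder keyspace name.
print(create_keyspace_cql("demo"))
```

Paste the printed line at the cqlsh prompt, then run desc keyspaces; again to confirm the new keyspace shows up.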

Issues with Cassandra After Shutting Down the Server

I found that Cassandra works just fine until a shutdown occurs.  Reboots are safe, but when the server is shut down and its IP address is renewed, several config files retain the old IP address.  You will need to edit three config files to get Cassandra up and running again after a shutdown.  I am assuming that you're only interested in a single-server instance of Cassandra.

Step 1) Edit /etc/cassandra/cassandra.yaml and update the following three variables.  If you're new to Linux, a quick way to edit the file from the command line is to open it with sudo in a terminal editor such as nano.

  • Update seeds (approximately line 227) to:  seeds: "127.0.0.1"
  • Update listen_address (approximately line 297) to: listen_address: localhost
  • Update broadcast_address (approximately line 301) to: broadcast_address: localhost
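
If you'd rather script those three edits than make them by hand, here is a minimal sketch.  The function name and the regular expression are mine, and it assumes each key sits on its own uncommented line, so check the result before restarting anything.

```python
import re

# Values taken from the three bullets above.
LOCAL_SETTINGS = {
    "seeds": '"127.0.0.1"',
    "listen_address": "localhost",
    "broadcast_address": "localhost",
}


def localize_cassandra_yaml(text, settings=LOCAL_SETTINGS):
    """Return the YAML text with each listed key pointed at the local machine.

    Only uncommented "key: value" lines are rewritten; the indentation and
    any leading "- " are preserved.
    """
    for key, value in settings.items():
        pattern = re.compile(
            r"^([ \t]*(?:-[ \t]*)?)" + re.escape(key) + r":.*$",
            re.MULTILINE,
        )
        text = pattern.sub(
            lambda m, k=key, v=value: "{}{}: {}".format(m.group(1), k, v),
            text,
        )
    return text


# Small made-up excerpt standing in for the real file.
sample = (
    "      parameters:\n"
    '          - seeds: "10.0.0.5"\n'
    "listen_address: 10.0.0.5\n"
    "broadcast_address: 54.1.2.3\n"
)
print(localize_cassandra_yaml(sample))
```

To use it on the real file, read /etc/cassandra/cassandra.yaml into a string, pass it through the function, and write the result back with root permissions after saving a backup copy.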

Step 2) Edit /etc/cassandra/cassandra-env.sh

  • Near the bottom, update JVM_OPTS to: JVM_OPTS="$JVM_OPTS -Djava.rmi.server.hostname=127.0.0.1"

Step 3) Edit /etc/opscenter/clusters/cassandrasandbox.conf

  • Update seed_hosts to:  seed_hosts = 127.0.0.1

Step 4) Reboot the server, either from the EC2 console or from the command line with sudo reboot.

When you SSH back in, the command prompt should appear after 3-4 minutes and the cqlsh command should work.  If it continues to hang, hit Ctrl+C to break the loop and examine ~/datastax_ami/ami.log.

Step 2: Install and Configure IPython

This section walks through installing IPython Notebook so that you are ready for your hello world with Cassandra.  It is organized as follows:

  • Installing IPython Notebook
  • Creating a password and IPython Profile
  • Configuring the IPython Profile
  • Connecting to the IPython Notebook server

Installing IPython Notebook

Just like before, SSH into the box.  From the command line, run sudo apt-get update first, then install the Python driver for Cassandra and IPython with the "[all]" extra.

A few notes:  When I forgot the apt-get update step on my first attempt, I got error messages on the install that followed.  The driver install is what lets Python interface with Cassandra, and the "[all]" extra installs the Notebook feature too.
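
A quick way to confirm that both installs landed is to ask Python whether it can find the modules.  This is a minimal sketch assuming Python 3.4 or later; the import names cassandra and IPython are the ones I'd expect the driver and IPython to provide.

```python
import importlib.util


def check_packages(names=("cassandra", "IPython")):
    """Map each top-level module name to True if Python can locate it."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


for name, found in check_packages().items():
    print("{:<10} {}".format(name, "installed" if found else "MISSING"))
```

If either one shows up as MISSING, rerun the update and install steps before moving on.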

Creating an IPython Notebook Password and Profile

This next section closely mirrors the official IPython site, with a few shortcuts.  First you'll create a password for your Notebook account, which has yet to be created.  Enter an interactive Python session by typing ipython from the command line.  From the session, run from IPython.lib import passwd and then passwd(), and enter a password.  This will create a string for you to save for a later step.

Now exit IPython by typing exit, then create a profile by running ipython profile create followed by a profile name.  I chose something Cassandra-esque.

This last step should have told you that it created three config files.  We will edit one in particular, which is completely commented out except for the first part.  You could be a purist and find where each line item is located, uncomment it, and make the proper edit, or just paste in a few lines at the top and go about your day.

From the command line, open the ipython_notebook_config.py file in the new profile's directory with a text editor.

Make these additions anywhere below the line that reads "c = get_config()".  This is where you use the string from the password creation step.
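
If you're curious what that string is, it is a salted hash of your password rather than the password itself.  The sketch below is not IPython's own code; it is a standard-library illustration of the algorithm:salt:hash format as I understand it, with function names of my own and assuming Python 3.6 or later.

```python
import hashlib
import hmac
import secrets


def make_salted_hash(passphrase, algorithm="sha1", salt_len=12):
    """Return 'algorithm:salt:hexdigest' for the given passphrase."""
    salt = secrets.token_hex(salt_len // 2)  # salt_len hex characters
    h = hashlib.new(algorithm)
    h.update(passphrase.encode("utf-8") + salt.encode("ascii"))
    return ":".join((algorithm, salt, h.hexdigest()))


def check_salted_hash(hashed, passphrase):
    """Recompute the digest with the stored salt and compare."""
    algorithm, salt, digest = hashed.split(":", 2)
    h = hashlib.new(algorithm)
    h.update(passphrase.encode("utf-8") + salt.encode("ascii"))
    return hmac.compare_digest(h.hexdigest(), digest)


hashed = make_salted_hash("not-my-real-password")
print(hashed)
print(check_salted_hash(hashed, "not-my-real-password"))
```

For the actual config, use the string that IPython generated for you rather than one from this sketch.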
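
Here is a minimal sketch of what those additions can look like, assuming the NotebookApp option names from the IPython 1.x-3.x documentation (ip, open_browser, password, port).  The helper function is mine; it just fills in your hash and port so the lines are ready to paste.

```python
CONFIG_TEMPLATE = """\
# Notebook server settings (paste below the c = get_config() line)
c.NotebookApp.ip = '*'
c.NotebookApp.open_browser = False
c.NotebookApp.password = u'{password_hash}'
c.NotebookApp.port = {port}
"""


def render_notebook_config(password_hash, port=9999):
    """Fill the password hash and port into the config lines."""
    return CONFIG_TEMPLATE.format(password_hash=password_hash, port=int(port))


# The hash below is a placeholder; use the string from the password step.
print(render_notebook_config("sha1:<salt>:<hash>"))
```

Whatever port you put here has to match the one you opened in the EC2 firewall rules.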

Now your profile is created and configured, and you're ready to start the IPython Notebook server.  Start it from the command line with ipython notebook --profile= followed by your profile name.

Hello World

Now you are ready to connect to the IPython Notebook from your local machine.  Using your server name, append the notebook port that you chose in the config file.  I stuck with 9999, which the IPython site recommends.  Enter something like this into your browser, then enter the password you used when creating the IPython profile.

ec2-###-###-###-###.compute-1.amazonaws.com:9999

Once you're logged in, click New Notebook and start entering Python code.  First we'll import from the Cassandra driver that we installed earlier and connect to the cluster and keyspace.  Recall that we created the keyspace in an earlier step.
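
A minimal sketch of the connection step, assuming the DataStax cassandra-driver package and a node listening on 127.0.0.1.  The connect helper and the keyspace name demo are placeholders of mine.

```python
def connect(keyspace, hosts=("127.0.0.1",)):
    """Return (cluster, session) with the session bound to the keyspace."""
    # Imported here so the cell loads even where the driver is missing.
    from cassandra.cluster import Cluster

    cluster = Cluster(list(hosts))
    session = cluster.connect(keyspace)
    return cluster, session


# In the notebook, with your own keyspace name in place of "demo":
# cluster, session = connect("demo")
```

Everything after this point goes through the session object.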

Now create a table, defining the column names and datatypes.  The first field is the primary key called "id" and stores integers, and the second field is called "textvalue" storing variable character data.
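
A minimal sketch of that table definition, assuming session is the object returned by the driver's connect call.  The table name testtable is a placeholder of mine; the two columns are the ones described above.

```python
CREATE_TABLE_CQL = """
CREATE TABLE testtable (
    id int PRIMARY KEY,
    textvalue varchar
)
"""


def create_table(session):
    """Send the CREATE TABLE statement through the driver session."""
    session.execute(CREATE_TABLE_CQL)


# In the notebook, once you have a session:
# create_table(session)
```

Running it a second time will raise an error because the table already exists.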

Add a record to the table.  Just like in the previous step, we're wrapping standard CQL with session.execute().
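
A minimal sketch of the insert, again assuming a driver session and the placeholder table name testtable.  The %s markers are the driver's placeholders for values passed alongside the statement.

```python
INSERT_CQL = "INSERT INTO testtable (id, textvalue) VALUES (%s, %s)"


def add_record(session, record_id, text):
    """Insert one row, letting the driver bind the two values."""
    session.execute(INSERT_CQL, (record_id, text))


# In the notebook, once you have a session:
# add_record(session, 1, "hello world")
```

Passing the values separately, rather than formatting them into the string yourself, leaves the quoting to the driver.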

So far all we have done is send commands to Cassandra.  This next line pulls data back with a CQL SELECT statement and stores it in the results variable.

There are two ways to view the result.
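
A minimal sketch of the query, assuming a driver session and the placeholder table name testtable.

```python
SELECT_CQL = "SELECT id, textvalue FROM testtable"


def fetch_rows(session):
    """Run the SELECT and return the rows as a plain list."""
    return list(session.execute(SELECT_CQL))


# In the notebook, once you have a session:
# results = fetch_rows(session)
```

Wrapping the result in list() means you can look at it more than once.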
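
One way is to print the whole result at once; the other is to loop over it and pull out fields by name.  This is a minimal sketch that uses a namedtuple as a stand-in for the rows the driver returns, so it runs without a live cluster.  The helper name is mine.

```python
from collections import namedtuple


def describe_rows(rows):
    """Return the result as one string, plus one formatted line per row."""
    rows = list(rows)
    whole = repr(rows)  # way 1: the whole result at once
    per_row = ["{} -> {}".format(row.id, row.textvalue) for row in rows]  # way 2
    return whole, per_row


# Stand-in for a driver row; real rows come from session.execute().
Row = namedtuple("Row", ["id", "textvalue"])
whole, lines = describe_rows([Row(1, "hello world")])
print(whole)
for line in lines:
    print(line)
```

With a live session, pass the output of your SELECT to describe_rows instead of the stand-in list.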

For more examples of how you can interact with Cassandra, check out the DataStax CQL documentation.
