Amazon EMR and mrjob

Easy Hadoop with Python streaming

Michael Selik
mike@selik.org

Abstract

This is a quick guide to setting up the mrjob Python library for easy use of Hadoop on Amazon's Elastic MapReduce.

Python and Virtual Environments

If you don’t have Python, grab the latest 2.x installer: download and double-click to get yourself some Python. On a Mac, a better, though more involved, option is to use Homebrew and follow its instructions for installing Python.

If you just installed Python, then you’ll also need to install pip.

$ easy_install pip

After you have pip, you can install virtualenv.

$ pip install virtualenv

Installing mrjob

Let’s get started with a new virtual environment. The --distribute flag tells virtualenv to install distribute and pip immediately after creating the new environment.

$ virtualenv --distribute ~/venv/emr

That was quick. Let’s activate our new virtual environment.

$ source ~/venv/emr/bin/activate

Next we will install Yelp’s mrjob. We will use pip to make sure we get all the dependencies.

(emr)$ pip install mrjob
...
Successfully installed mrjob boto PyYAML simplejson

Let’s see if that worked.

(emr)$ cd ~/venv/emr/lib/python2.7/site-packages/mrjob/examples
(emr)$ python mr_word_freq_count.py mr_word_freq_count.py
no configs found; falling back on auto-configuration
...
Streaming final output ...
"'__main__'"    1
"0" 2
"1" 1
"2" 2
"2009"  1
...
removing tmp directory...

Looks like we’ve got basic functionality.
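
For reference, the job we just ran is a classic word-frequency count. A minimal job of the same shape looks roughly like this (a sketch in the spirit of the bundled example, not the file verbatim):

from mrjob.job import MRJob
import re

WORD_RE = re.compile(r"[\w']+")

class MRWordFreqCount(MRJob):
    # Called once per line of input; emits a (word, 1) pair per word.
    def mapper(self, _, line):
        for word in WORD_RE.findall(line):
            yield word.lower(), 1

    # Sums counts locally on each mapper to cut down shuffle traffic.
    def combiner(self, word, counts):
        yield word, sum(counts)

    # Sums the per-word counts across all mappers.
    def reducer(self, word, counts):
        yield word, sum(counts)

if __name__ == '__main__':
    MRWordFreqCount.run()

A script built this way runs locally by default and on Elastic MapReduce when you pass -r emr, which is exactly what we’ll do next.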

Elastic MapReduce

Now let’s get set up with Amazon Elastic MapReduce. Create an Amazon Web Services account and sign up for Elastic MapReduce.

Next we will create a configuration file to store your access credentials. Create a file named ~/.mrjob.conf in your home directory. To find your access key and secret key, open the My Account/Console dropdown and select Security Credentials. In the Access Credentials section of the page, under the Access Keys tab, you should see your Access Key ID and a link to reveal your Secret Access Key. Write both into your ~/.mrjob.conf configuration file.

runners:
    emr:
        aws_access_key_id: 8jkYp8aXFDMmDsQn
        aws_secret_access_key: 8jkYp8aXFDMmDsQn8jkYp8aXFDMmDsQn
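
The same file can hold other EMR options. For example, you can pick the instance type and the number of instances; the option names below are from mrjob’s EMR runner documentation of this vintage, and the values are just placeholders:

runners:
    emr:
        aws_access_key_id: 8jkYp8aXFDMmDsQn
        aws_secret_access_key: 8jkYp8aXFDMmDsQn8jkYp8aXFDMmDsQn
        ec2_instance_type: m1.small   # placeholder values; see the mrjob
        num_ec2_instances: 3          # docs for the full option list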

Okay, ready to test.

(emr)$ cd ~/venv/emr/lib/python2.7/site-packages/mrjob/examples
(emr)$ python mr_word_freq_count.py mr_word_freq_count.py -r emr
using configs in /Users/mike/.mrjob.conf
creating new scratch bucket mrjob-91ec4f2554293c21
using s3://mrjob-91ec4f2554293c21/tmp/ as our scratch dir on S3
creating S3 bucket 'mrjob-91ec4f2554293c21' to use as scratch space
Uploading input to s3://...
creating tmp directory /var/folders/0y/...
writing master bootstrap script to /var/folders/0y/...
Copying non-input files into s3://...
Waiting 5.0s for S3 eventual consistency
Creating Elastic MapReduce job flow
Job flow created with ID: j-1D9WVRKQNM1EZ
Job launched 30.5s ago, status STARTING: Starting instances
Job launched 60.9s ago, status STARTING: Starting instances
Job launched 91.4s ago, status STARTING: Starting instances
Job launched 122.0s ago, status STARTING: Starting instances
Job launched 152.4s ago, status STARTING: Starting instances
Job launched 183.0s ago, status STARTING: Configuring cluster...
Job launched 213.3s ago, status BOOTSTRAPPING: ...
Job launched 243.9s ago, status BOOTSTRAPPING: ...
Job launched 274.5s ago, status RUNNING: ... Step 1 of 1)
Job completed.
Running time was 49.0s...
ec2_key_pair_file not specified, going to S3
Fetching counters from S3...
Waiting 5.0s for S3 eventual consistency
Counters from step 1:
  FileSystemCounters:
    FILE_BYTES_READ: 834
    FILE_BYTES_WRITTEN: 1831
    S3_BYTES_READ: 1572
    S3_BYTES_WRITTEN: 987
  Job Counters :
    Launched map tasks: 2
    Launched reduce tasks: 1
    Rack-local map tasks: 2
  Map-Reduce Framework:
    Combine input records: 152
    Combine output records: 103
    Map input bytes: 1047
    Map input records: 36
    Map output bytes: 1501
    Map output records: 152
    Reduce input groups: 97
    Reduce input records: 103
    Reduce output records: 97
    Reduce shuffle bytes: 933
    Spilled Records: 206
Streaming final output from s3://...
"'__main__'"    1
"0" 2
"1" 1
"2" 2
"2009"  1
...
"www"   1
"yelp"  1
"yield" 3
"you"   2
removing tmp directory /var/folders/0y/...
Removing all files in s3://...
Removing all files in s3://.../tmp/logs/j-1D9WVRKQNM1EZ/
Terminating job flow: j-1D9WVRKQNM1EZ

And that’s Hadoop. It probably cost you about $0.40 for the one hour (rounded up) of an Elastic MapReduce “m1.small” instance.
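
You can also launch jobs from another Python script rather than from the shell. Here’s a sketch using mrjob’s runner API (stream_output and parse_output_line are the method names from the mrjob docs of this era; the input filename is hypothetical):

from mr_word_freq_count import MRWordFreqCount

# Pass the same arguments you would give on the command line.
job = MRWordFreqCount(args=['-r', 'emr', 'input.txt'])  # 'input.txt' is a made-up filename
with job.make_runner() as runner:
    runner.run()
    # Stream the final output back from S3 and parse each line.
    for line in runner.stream_output():
        word, count = job.parse_output_line(line)
        print word, count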

More Configuration

To debug your map-reduce jobs, you’ll want access to the Hadoop job tracker. I’ll leave the setup as an exercise for the reader; the mrjob documentation on enabling the Hadoop Job Tracker describes the process well.
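
If you do set it up, the relevant settings live in the same ~/.mrjob.conf. A sketch, assuming the option names from mrjob’s documentation of this vintage (the key pair name and .pem path below are placeholders for your own):

runners:
    emr:
        ec2_key_pair: EMR                   # placeholder: your EC2 key pair name
        ec2_key_pair_file: ~/.ssh/EMR.pem   # placeholder: path to the matching .pem
        ssh_tunnel_to_job_tracker: true

With a key pair configured, mrjob can open an SSH tunnel to the master node and should print a local URL for the job tracker while the job flow runs.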