If you don’t have Python, grab the latest 2.x installer from python.org: download it and double-click to get yourself some Python. On a Mac, a better, but more involved, option is to use Homebrew and follow its instructions for installing Python.
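If you go the Homebrew route, the install itself is a one-liner (which Python version you get depends on your Homebrew; at the time this was written it was a recent 2.x):

$ brew install python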
If you just installed Python, then you’ll also need to install pip:
$ easy_install pip
After you have pip, you can install virtualenv:
$ pip install virtualenv
Let’s get started with a new virtual environment. The --distribute flag installs distribute and pip immediately after the new virtual environment is created.
$ virtualenv --distribute ~/venv/emr
That was quick. Let’s activate our new virtual environment.
$ source ~/venv/emr/bin/activate
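To confirm the environment is active, note the (emr) prefix on your prompt; which python should now also point into the venv rather than at the system interpreter:

(emr)$ which python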
Next we will install Yelp’s mrjob, using pip to make sure we get all the dependencies.
(emr)$ pip install mrjob
...
Successfully installed mrjob boto PyYAML simplejson
Let’s see if that worked.
(emr)$ cd ~/venv/emr/lib/python2.7/site-packages/mrjob/examples
(emr)$ python mr_word_freq_count.py mr_word_freq_count.py
no configs found; falling back on auto-configuration
...
Streaming final output ...
"'__main__'" 1
"0" 2
"1" 1
"2" 2
"2009" 1
...
removing tmp directory...
Looks like we’ve got basic functionality.
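If you’re curious what’s inside that example, an mrjob job is just a small Python class. Here is a minimal sketch of a word-frequency counter in the same spirit (a sketch, not the exact source that ships with mrjob):

from mrjob.job import MRJob
import re

WORD_RE = re.compile(r"[\w']+")

class MRWordFreqCount(MRJob):

    def mapper(self, _, line):
        # emit (word, 1) for every word in the input line
        for word in WORD_RE.findall(line):
            yield word.lower(), 1

    def combiner(self, word, counts):
        # pre-sum counts on each mapper to cut shuffle traffic
        yield word, sum(counts)

    def reducer(self, word, counts):
        # final sum across all mappers
        yield word, sum(counts)

if __name__ == '__main__':
    MRWordFreqCount.run()

The same class runs locally, on your own Hadoop cluster, or on EMR; the -r flag picks the runner.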
Next we will create a configuration file to store your AWS access credentials. Make a file named ~/.mrjob.conf in your home directory. To get your access key and secret key, select Security Credentials from the My Account/Console dropdown in the AWS console. In the Access Credentials section of the page, under the Access Keys tab, you should see your Access Key ID and a link to reveal your Secret Access Key. Write these into your ~/.mrjob.conf configuration file:
runners:
  emr:
    aws_access_key_id: 8jkYp8aXFDMmDsQn
    aws_secret_access_key: 8jkYp8aXFDMmDsQn8jkYp8aXFDMmDsQn
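The same file can carry other EMR runner settings, such as instance type and cluster size. Something like the following should work, but the option names here are assumptions drawn from the mrjob EMR runner documentation, so check the docs for the version you installed:

runners:
  emr:
    aws_access_key_id: ...
    aws_secret_access_key: ...
    ec2_instance_type: m1.small  # assumed option name; see the mrjob docs
    num_ec2_instances: 1         # assumed option name; see the mrjob docs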
Okay, ready to test.
(emr)$ cd ~/venv/emr/lib/python2.7/site-packages/mrjob/examples
(emr)$ python mr_word_freq_count.py mr_word_freq_count.py -r emr
using configs in /Users/mike/.mrjob.conf
creating new scratch bucket mrjob-91ec4f2554293c21
using s3://mrjob-91ec4f2554293c21/tmp/ as our scratch dir on S3
creating S3 bucket 'mrjob-91ec4f2554293c21' to use as scratch space
Uploading input to s3://...
creating tmp directory /var/folders/0y/...
writing master bootstrap script to /var/folders/0y/...
Copying non-input files into s3://...
Waiting 5.0s for S3 eventual consistency
Creating Elastic MapReduce job flow
Job flow created with ID: j-1D9WVRKQNM1EZ
Job launched 30.5s ago, status STARTING: Starting instances
Job launched 60.9s ago, status STARTING: Starting instances
Job launched 91.4s ago, status STARTING: Starting instances
Job launched 122.0s ago, status STARTING: Starting instances
Job launched 152.4s ago, status STARTING: Starting instances
Job launched 183.0s ago, status STARTING: Configuring cluster...
Job launched 213.3s ago, status BOOTSTRAPPING: ...
Job launched 243.9s ago, status BOOTSTRAPPING: ...
Job launched 274.5s ago, status RUNNING: ... Step 1 of 1)
Job completed.
Running time was 49.0s...
ec2_key_pair_file not specified, going to S3
Fetching counters from S3...
Waiting 5.0s for S3 eventual consistency
Counters from step 1:
  FileSystemCounters:
    FILE_BYTES_READ: 834
    FILE_BYTES_WRITTEN: 1831
    S3_BYTES_READ: 1572
    S3_BYTES_WRITTEN: 987
  Job Counters:
    Launched map tasks: 2
    Launched reduce tasks: 1
    Rack-local map tasks: 2
  Map-Reduce Framework:
    Combine input records: 152
    Combine output records: 103
    Map input bytes: 1047
    Map input records: 36
    Map output bytes: 1501
    Map output records: 152
    Reduce input groups: 97
    Reduce input records: 103
    Reduce output records: 97
    Reduce shuffle bytes: 933
    Spilled Records: 206
Streaming final output from s3://...
"'__main__'" 1
"0" 2
"1" 1
"2" 2
"2009" 1
...
"www" 1
"yelp" 1
"yield" 3
"you" 2
removing tmp directory /var/folders/0y/...
Removing all files in s3://...
Removing all files in s3://.../tmp/logs/j-1D9WVRKQNM1EZ/
Terminating job flow: j-1D9WVRKQNM1EZ
And that’s Hadoop. It probably cost you about $0.40 for the one hour (rounded up) of Elastic MapReduce “m1.small” instance time.
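Note that mrjob streams the job’s final output to stdout (the log lines go to stderr), so if you want to keep the word counts around, redirect them to a file; counts.txt here is just an example name:

(emr)$ python mr_word_freq_count.py mr_word_freq_count.py -r emr > counts.txt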
To debug your map-reduce jobs, you’ll want access to the Hadoop job tracker. I’ll leave setting that up as an exercise for the reader; the mrjob documentation on enabling the Hadoop Job Tracker describes the process well.