Scientific Computing

Building an environment for data analysis with Python on Mac OS X

Michael Selik
mike@selik.org

Abstract

For whatever reason, it can be a big pain to get Python set up correctly on Mac OS X. This process worked for me. Hopefully it will work for you too.

An alternative to all this is using the Enthought Python Distribution or Continuum’s Anaconda. But why make it easy?

Setting up Python 2.x on Mac OS X

To avoid system administration difficulties, we will use the latest Python 2.x version. I downloaded the .dmg installer, opened it and ran the .pkg file. You may prefer to install Python with Homebrew, but I chose not to and I’ve forgotten why.

Virtual Environments

Next I installed virtualenv to make sure this setup will not interfere with other projects.

$ cd ~/.local/lib
$ git clone https://github.com/pypa/virtualenv.git
$ cd virtualenv
$ python setup.py install

I set up an alias in my ~/.profile file and set an environment variable to avoid some typing. This isn’t very useful, because you’re unlikely to be using virtualenv frequently.

$ export VIRTUALENV_DISTRIBUTE=true
$ alias venv='virtualenv --distribute'

Now creating a new virtual environment takes less typing.

$ venv ~/venv/data-science
New python executable in /Users/mike/venv/data-science/bin/python
Installing distribute...........................................
................................................................
...........................................................done.
Installing pip................done.

I activate the new data-science virtual environment.

$ source ~/venv/data-science/bin/activate
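
If you want to double-check that the activation took effect, a small script like the one below can help. This is only a sketch: the function name in_virtualenv is my own invention, and it relies on the fact that classic virtualenv sets sys.real_prefix while newer tools set sys.base_prefix.

```python
from __future__ import print_function
import sys


def in_virtualenv():
    """Return True when the running interpreter belongs to a virtual environment."""
    # Classic virtualenv (the tool used in this guide) sets sys.real_prefix.
    if hasattr(sys, "real_prefix"):
        return True
    # Newer virtualenv releases and the stdlib venv module set sys.base_prefix.
    return getattr(sys, "base_prefix", sys.prefix) != sys.prefix


if __name__ == "__main__":
    print("interpreter:", sys.executable)
    print("inside a virtual environment:", in_virtualenv())
```

Run it with the python from the activated environment. The interpreter path it prints should point inside ~/venv/data-science/bin.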

Next is the tedious process of downloading and installing packages. Unfortunately some of them do not play well with pip and need to be built from the latest source code explicitly. Others simply take a long time to build and you may want to avoid repeating the build when creating a new Python virtual environment. Still others don’t build well from source and need to be installed using pip. Good times.

Nose

To test that I’ve installed these packages correctly, I first install nose, a testing framework.

(data-science)$ pip install nose
Downloading/unpacking nose
  Downloading nose-1.2.1.tar.gz (400Kb): 400Kb downloaded
  Running setup.py egg_info for package nose
    
    no previously-included directories found matching 'doc/.build'
Installing collected packages: nose
  Running setup.py install for nose
    
    no previously-included directories found matching 'doc/.build'
    Installing nosetests script to /Users/mike/venv/data-science/bin
    Installing nosetests-2.7 script to /Users/mike/venv/data-science/bin
Successfully installed nose
Cleaning up...

I won’t bother to type all that command line output from pip anymore.

Command Line Tools

Whoops, hold on. Before we build NumPy from source, we’ll need to get up-to-date C and FORTRAN compilers. On OS X that means we need the Xcode command line tools. You can either install all of Xcode or install just the command line tools. Downloading the command line tools on their own requires a (free) developer account. If you have an older Mac, I suggest searching for the appropriate command line tools for your OS version. If you want all of Xcode, then after you install it you will need to download and install the command line tools via the Xcode Preferences dialog. Go to Xcode → Preferences, click on the Downloads tab, select Command Line Tools, and click Install.

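Apparently Xcode forgot FORTRAN, so we need to use Homebrew for that. If you don’t have Homebrew yet, install it first with the ruby command shown in the IPython section.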

(data-science)$ brew install gfortran
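
Before starting a long build, it can be worth confirming that both compilers are actually on your PATH. The sketch below does a plain PATH search. The helper name find_on_path is hypothetical, and I am assuming the compilers are installed under the names gcc and gfortran.

```python
from __future__ import print_function
import os


def find_on_path(name):
    """Return the full path of an executable called `name`, or None if absent."""
    for directory in os.environ.get("PATH", "").split(os.pathsep):
        candidate = os.path.join(directory, name)
        # Accept only regular files that the current user may execute.
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    return None


if __name__ == "__main__":
    # Assumption: gcc comes from the Xcode command line tools, gfortran from Homebrew.
    for tool in ("gcc", "gfortran"):
        print(tool, "->", find_on_path(tool) or "NOT FOUND")
```

If either tool shows up as NOT FOUND, sort that out before building NumPy and SciPy.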

NumPy and SciPy

We’ll install NumPy and SciPy from the bleeding edge source.

(data-science)$ cd ~/.local/lib
(data-science)$ git clone https://github.com/numpy/numpy.git
(data-science)$ python ~/.local/lib/numpy/setup.py install

To check if NumPy is working correctly, we can start Python and quickly run its test suite.

(data-science)$ python
Python 2.7.3 (v2.7.3:70274d53c1dd, Apr  9 2012, 20:52:43) 
[GCC 4.2.1 (Apple Inc. build 5666) (dot 3)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import numpy
>>> numpy.test()
Running unit tests for numpy
...
Ran 4779 tests in 25.513s
OK (KNOWNFAIL=5, SKIP=6)
<nose.result.TextTestResult run=4779 errors=0 failures=0>
>>> quit()

There were a few skips and known failures, but that’s OK. Let’s move on to SciPy.

(data-science)$ git clone https://github.com/scipy/scipy.git
(data-science)$ python ~/.local/lib/scipy/setup.py install
(data-science)$ python
>>> import scipy
>>> scipy.test()
Running unit tests for scipy
...
Ran 3896 tests in 62.633s
FAILED (KNOWNFAIL=6, SKIP=34, errors=14, failures=72)
<nose.result.TextTestResult run=3896 errors=14 failures=72>
>>> quit()

This one was a little more disturbing, but I’ll ignore these failures for now. Let’s hope for the best.
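
To put those failures in perspective, here is a rough way to turn a nose summary line into a rate. It is a sketch, not part of nose itself: the function name problem_rate is made up, and it simply counts errors plus failures against the number of tests run.

```python
from __future__ import print_function
import re


def problem_rate(ran, summary):
    """Return (errors + failures) / ran for a nose summary line."""
    # Pull every key=value pair out of the parenthesised summary.
    counts = dict((key, int(value))
                  for key, value in re.findall(r"(\w+)=(\d+)", summary))
    problems = counts.get("errors", 0) + counts.get("failures", 0)
    return problems / float(ran)


if __name__ == "__main__":
    # Numbers copied from the SciPy run above.
    rate = problem_rate(3896, "FAILED (KNOWNFAIL=6, SKIP=34, errors=14, failures=72)")
    print("%.1f%% of tests errored or failed" % (100 * rate))
```

For the SciPy run above that is 86 problems out of 3896 tests, or about 2.2%.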

Matplotlib

For some reason, I have to build Matplotlib before I can install it.

(data-science)$ git clone https://github.com/matplotlib/matplotlib.git
(data-science)$ python ~/.local/lib/matplotlib/setup.py build
(data-science)$ python ~/.local/lib/matplotlib/setup.py install
(data-science)$ python
>>> import matplotlib
>>> matplotlib.test()
....K...K..K......
...Taking.a.nap...
..................
Ran 1211 tests in 385.316s
FAILED (KNOWNFAIL=300, failures=2)
>>> quit()

Alright, more disturbing failures, but mostly working.

IPython

IPython has a whole bunch of dependencies. To install ZeroMQ, I use Homebrew.

(data-science)$ ruby -e "$(curl -fsSkL raw.github.com/mxcl/homebrew/go)"
(data-science)$ brew install zeromq

With so many dependencies, I started relying on pip.

(data-science)$ pip install pyzmq
(data-science)$ pip install tornado
(data-science)$ pip install sympy
(data-science)$ pip install pygments

Not sure why, but pip doesn’t like readline, or vice versa.

(data-science)$ easy_install readline
(data-science)$ pip install ipython

Finally we can test IPython.

(data-science)$ iptest

Mostly passing, good enough. Last, I install pandas and statsmodels with pip.

(data-science)$ pip install pandas
(data-science)$ pip install statsmodels
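
With everything installed, a quick import check can confirm that the whole stack is visible from the virtual environment. The script below is a minimal sketch. The function name installed_versions is my own, the module names are the import names as I understand them (pyzmq imports as zmq), and not every package exposes a __version__ attribute, so some may report "unknown".

```python
from __future__ import print_function
import importlib


def installed_versions(names):
    """Map each module name to its version string.

    The value is None when the import fails and "unknown" when the
    module imports but has no __version__ attribute.
    """
    versions = {}
    for name in names:
        try:
            module = importlib.import_module(name)
        except ImportError:
            versions[name] = None
        else:
            versions[name] = getattr(module, "__version__", "unknown")
    return versions


if __name__ == "__main__":
    # Import names for the packages installed in this guide.
    stack = ["nose", "numpy", "scipy", "matplotlib", "zmq", "tornado",
             "sympy", "pygments", "IPython", "pandas", "statsmodels"]
    for name, version in sorted(installed_versions(stack).items()):
        print("%-12s %s" % (name, version or "MISSING"))
```

Anything listed as MISSING either failed to install or was installed outside the active environment.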

Had fun? We’re just about ready to get some work done. Start up an IPython Notebook server from your General Assembly Data Science source folder.

$ cd ~/src/data-science/
$ ipython notebook --profile=sympy --pylab inline

Click the ‘new notebook’ button. The first time you use IPython Notebook it will create your config files. Then you can install MathJax for IPython so that you won’t be hitting the MathJax server to render your equations.

In [1]: from IPython.external.mathjax import install_mathjax
In [2]: install_mathjax()

Done. For now.

If you find any errors with this guide, please let me know.