Scientific Python Ecosystem

Throughout my education and in my post-academic career before joining Focused Support, I used MATLAB almost exclusively for my data analysis and algorithmic prototyping tasks. I was never very happy with MATLAB as a scripting language (and even less so as a programming language), but I was comfortable enough with it, and all of my colleagues were using it, which meant there was a lot of prior art available to me if I chose to stick with MATLAB. That inertia never made the effort of moving to something that might be better seem worth it.

I had dabbled in Python on personal side projects and for the occasional script at work when MATLAB was not a good fit (i.e., anything non-mathematical). I had heard great things about the Python scientific stack (e.g., numpy, matplotlib, pandas, etc.) and wanted to give it a real try. So when I started at Focused Support, where there was no preexisting corpus of MATLAB code and no colleagues using it, I decided to go for it. I have been using Python for a little over six months now, and I am happy with the decision even if it came with some growing pains while getting up to speed. I would recommend that anyone else thinking about taking this leap, especially in a "greenfield" situation like the one I was facing, give it serious consideration.

With all that said, I went back over everything I did to get set up with Python at Focused Support to share the things I would have liked to have known starting out, along with tips I have picked up along the way.

You Probably Should Be Using Python 3.x

This is advice I did not take myself, in part because the last time I tried Python 3, half the packages I needed had not been ported to it yet. That was basically a non-starter. That was some time ago, and from everything I have heard and read the situation should be much better now; still, I did not want to get going only to find out that some slightly obscure package I needed was incompatible with Python 3. It was risk mitigation, and at some point soon I will probably try Python 3 on my existing code, which should just work as long as all the libraries I am using support it.

If you choose, or have, to use Python 2.7.x (you should not be using any minor version below 2.7 if you can help it), you can take advantage of some Python 3.x improvements via the __future__ module. I personally use

I consider the division import essential (1/2 == 0 makes much less sense to me than 1/2 == 0.5) and the print_function one nice. As an additional benefit, using these imports in all your 2.7.x Python code will make the transition to Python 3.x easier, since the print statement has been removed and the division behavior changed starting with Python 3.0.

Just Use pip to Install Packages

I use a Macbook Pro at work and on OS X I did not find any compelling reason to use anything heavier-weight (e.g., Anaconda). The pip package installer works well and I have not had any major issues installing Python packages on OS X except the following gotcha. Do not use the default Python installation: OS X does some nonstandard things with file placement that made it impossible to install certain packages with pip. Instead use the fantastic homebrew package manager to install Python to /usr/local (and make sure to adjust your PATH environment variable so that /usr/local/bin precedes /usr/bin). With the homebrew installation pip has been able to install anything I have thrown at it without a hiccup.
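
For reference, the Homebrew route boils down to a couple of commands (/usr/local is Homebrew's default prefix on OS X; adjust if yours is installed elsewhere):

```shell
# install Python via Homebrew (pip comes along with it)
brew install python

# in your .bashrc: make sure the Homebrew bin directory wins over /usr/bin
export PATH="/usr/local/bin:$PATH"

# verify the right interpreter is being picked up
which python    # should print /usr/local/bin/python
```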

As a side-note, at my previous job I used a Windows laptop, where I found it easier to go with the excellent prepackaged scientific Python distribution Python(x,y). This was especially true when it came to installing any packages that required compiling C extensions (e.g., numpy). It contains most of the packages I mention here as well as many more.

Grab Some Packages

One of the bigger draws to the Python world from MATLAB for me was all of the great free and open-source packages available for just about anything you can imagine. Software for scientific applications is particularly well served in this regard. Vanilla Python is not quite up to the task: to do any manner of scientific computing some packages will need to be installed (with the aforementioned pip).

I find that I use the following five packages practically every time I am doing any data science work in Python:

  • numpy: base n-dimensional arrays
  • matplotlib: 2D plotting
  • IPython shell: an enhanced Python REPL
  • pandas: higher-level data structures and routines for data analysis and manipulation
  • seaborn: higher-level plotting routines and prettier plot styling
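
All five can be pulled down in one pip invocation (the PyPI package names are all lowercase):

```shell
pip install numpy matplotlib ipython pandas seaborn
```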

That's a good start: I cannot imagine getting started without any one of those. I find I use the following packages less often, but they are crucial when I need them

Depending on what you need to do, that might be most of what you need to get started, picking up other task-specific packages as you go. For instance, I also do a good deal of geospatial analysis and visualization, for which the following packages are invaluable

  • basemap: map plotting support for matplotlib
  • Shapely: Geometric objects, predicates and operations
  • descartes: use Shapely geometries as matplotlib paths and patches
  • Fiona: vector map data file I/O
  • rasterio: raster map data file I/O

Don't Worry About Virtual Environments

Future-me may be cursing past-me at some point for this. Since I am mostly writing data analysis, exploration, and visualization tools that only I use, rather than production applications, I have been running on the latest stable releases (and occasionally on repository HEADs when there is an improvement or bug fix that I need). Virtual environments make it easy to lock down dependencies on a per-project basis, which is not important for my current use case. They are built into Python 3 now, so it may be something I incorporate when I make the switch, but not before then.
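
For when that switch does happen, the machinery is already in the standard library; a minimal sketch of the built-in Python 3 venv workflow looks like:

```shell
# create an isolated environment in ./env and activate it
python3 -m venv env
source env/bin/activate

# packages installed now are scoped to this environment only
pip install numpy

# leave the environment when done
deactivate
```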

Use the IPython Shell

I have finally gotten to the point where I am going to try PyCharm: I have heard great things about it, and I have been surprised by how much I have enjoyed using Eclipse for Java development (I was previously a diehard C++-in-vim guy). Starting out, though, an IDE would have been one more unfamiliar thing lumped onto an already intimidating pile. Diving directly into the IPython shell requires very little orientation if you are already familiar with REPLs in other languages, and your time spent in it will not be lost (e.g., it looks like PyCharm has out-of-the-box support for IPython magic commands). Here are a few of the IPython enhancements I have found most useful

  • %run -i script.py will run the script.py in the IPython interactive session dumping all the variables defined in script.py into the IPython namespace: great for iterating changes on a script in the REPL
  • _ references the last result which is super useful when you execute something in the REPL you liked but didn't capture in a variable
  • %hist -n prints the shell session's history with line numbers, which can be used to re-execute a line directly with %exec _i# or to edit it before re-executing with %rep # (e.g., %exec _i5 or %rep 7)
  • %debug starts an ipdb session for the last command that failed, which is super useful when something in a script unexpectedly fails
  • %who and %whos should be familiar to people who have used MATLAB before: much like their MATLAB counterparts, these magic commands print out a list of all variables defined in the interactive shell session, with %whos adding some additional information for each variable (e.g., variable type)
  • %timeit provides convenient access to the Python timeit module for quick command profiling
  • %save can be a life-saver for the times when you have been trying things out in the REPL and have found something long or complicated that you want to turn into a script. This magic command takes a filename to save to and a space-separated sequence of history lines and ranges to save (e.g., %save myScript 1-10 77 20-30)

It should be noted that all of the IPython magics can be used without the magical % character if there is not anything in the IPython namespace that hides the magic name (e.g., creating a Python variable run means the IPython magic %run command cannot be executed without the %).
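
%timeit is just a convenient front end; the plain-Python equivalent using the standard timeit module looks like this (the statement being timed is an arbitrary example):

```python
import timeit

# run the statement 10000 times and report the total elapsed seconds
elapsed = timeit.timeit("sum(range(100))", number=10000)
print("total: {:.4f}s".format(elapsed))
```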

Have a PYTHONSTARTUP File

I would also recommend setting up at least a barebones Python startup file, which will ease some of the more repetitive aspects of starting a new Python shell (e.g., commonly used imports and library configuration options). To do this, define the PYTHONSTARTUP environment variable, for instance like

# in your .bashrc 
export PYTHONSTARTUP="${HOME}/.pythonrc"

and in the file that environment variable points to, write out all the commands you would like executed every time you open a new Python shell. Here is my anemic configuration file with additional explanatory comments

# ~/.pythonrc

# Grab good features from Python 3
from __future__ import print_function, division

# fairly standard import aliases
import numpy as np
from numpy import linalg as la
import matplotlib as mpl
from matplotlib import pyplot as plt
import seaborn as sns
import pandas as pd

# workaround for a matplotlib/seaborn bug viz.,
# https://github.com/matplotlib/matplotlib/issues/3711
# https://github.com/mwaskom/seaborn/issues/344
sns.set_context( rc = { 'lines.markeredgewidth': 0.1 } )

# pandas DataFrame display settings
pd.options.display.max_rows = 35
pd.options.display.width = 200

Really, this is the file for those commands that make you grumble to yourself, on opening a new shell, that something isn't quite right. Having something in place, even if minimal, will reduce the friction of adding more commands as they become expected. Trust me: I would reissue the pandas display settings above over and over after trying to view a DataFrame for the first time in a new shell (with reverse history search via CTRL-R, so not as bad as typing from scratch).

Make Sure Your Laptop Does Not Sleep When It Should Be Working

Laptop settings that aim to conserve power are great, and that is generally how I'd like my laptop to behave. However, when I kick off a script that will take a while and plan on leaving it unattended, I am certainly not happy to come back and find that only 15 minutes of work completed before the laptop went to sleep.

Instead of changing the power/sleep settings entirely, I use the incredibly handy Caffeine OS X menu bar application. It is as simple as clicking the coffee cup in the menu bar: presto, your laptop is wide awake and will not go into sleep mode while you are off doing something else. You still have to remember to turn it on (or off, depending on which behavior you'd prefer by default), but you only have to make this mistake a couple of times before the habit becomes pretty well ingrained.
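
If you would rather not remember to click anything, OS X also ships a command-line caffeinate utility that scopes the wake lock to a single command (the script name here is just a placeholder):

```shell
# -i prevents idle sleep only for as long as the command is running
caffeinate -i python long_running_analysis.py
```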

Epilogue

I am sure I do not remember every tribulation of my transition from doing scientific computing in MATLAB to Python, but I think I hit the major points of the setup phase. I hope anyone who has taken the time to read this was able to take away something they can apply to their own situation. I would love any feedback (especially about anything I missed or got wrong, as I love tinkering with my own setup and learning) through the social links below. Thanks so much for reading.



Dr. Bryan Patrick Wood