How spreadsheet-enabled notebooks are the ultimate incremental programming tool.

Incremental programming is the process of writing code one step at a time and checking the output after each one. It's common practice for data exploration workflows, and it goes a little something like this:

You're using pandas to do some data analysis in Python. The first thing you do with your new dataset is print df.head() to get a sense of what the data looks like. You notice there are a few NaN values in the first five rows, so you print df.info() to see how prevalent they are throughout the dataset. Depending on their distribution, you'll either leave them be or figure out a way to remove them.
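As a rough sketch of that workflow, here is what those first few steps might look like. The DataFrame below is invented purely for illustration (the column names and values are assumptions, not real data), and dropping the rows is only one of the possible decisions you could make at the end.

```python
import numpy as np
import pandas as pd

# Hypothetical dataset, made up for this example.
df = pd.DataFrame(
    {
        "city": ["Austin", "Boston", None, "Denver", "Eugene", "Fresno"],
        "temp_f": [91.0, np.nan, 78.5, np.nan, 70.2, 99.1],
    }
)

# Step 1: look at the first five rows to get a feel for the data.
print(df.head())

# Step 2: df.info() prints the non-null count for every column,
# which shows how widespread the missing values are.
df.info()

# Step 3: count the missing values per column before deciding what to do.
missing = df.isna().sum()
print(missing)

# One possible decision: drop every row that has a missing value.
cleaned = df.dropna()
print(len(df), "rows before,", len(cleaned), "rows after")
```

Each step only makes sense after looking at the output of the previous one, which is exactly what makes the process incremental.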

Incremental programming is one of those things where using the right tool can make you 10 times faster. Trying to explore and clean a dataset by writing a script and executing the entire thing at once (aka the main.py style of programming) is equivalent to proofreading this blog with only a paper dictionary.

REPL programming, notebooks, and spreadsheets each have massive improvements over the main.py approach, but what if we combined all three?

REPL Programming

REPL (Read-Eval-Print Loop) is an interactive programming paradigm in which the environment reads the user's input, evaluates it, and prints the result back to the user. In fact, if your first experience programming was printing "hello world" to the screen in a Python shell, then your first experience programming used the REPL paradigm.
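To make the read, eval, and print steps concrete, here is a toy sketch of the loop. It is not how the real Python shell is implemented: run_repl is a hypothetical helper, and it takes a list of input lines instead of reading from the keyboard so that it is easy to try out.

```python
def run_repl(lines):
    """Toy REPL: evaluate each line in a shared namespace, collect what would be printed."""
    namespace = {}  # state that persists from one line to the next
    outputs = []
    for line in lines:  # Read: take the next line of input
        try:
            # Eval: try the line as an expression first
            result = eval(line, namespace)
        except SyntaxError:
            # Statements such as assignments are not expressions, so execute them instead
            exec(line, namespace)
            result = None
        if result is not None:
            # Print: show the result, the way a shell echoes a value
            outputs.append(repr(result))
    return outputs  # Loop: the for loop moves on to the next line


session = ['message = "hello world"', "message", "len(message)"]
for output in run_repl(session):
    print(output)
```

The important detail is the shared namespace: every line sees the state left behind by the lines before it. For a real interactive console, Python's standard library ships the code module (for example code.interact()).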

The benefits of REPL programming are quite obvious. For iterative tasks like data exploration, it's more natural and efficient to write and execute your program piece by piece than to rerun the entire script from top to bottom after each change. REPL programming lets you apply a single transformation to your data, look at the output, and then apply a new transformation to that already altered state.
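Here is a small sketch of what that looks like in practice, written as the sequence of steps you might type into a Python shell one at a time. The sales data is made up for illustration, and the particular transformations are assumptions chosen only to show each step building on the state left by the previous one.

```python
import pandas as pd

# Hypothetical data, invented for this example. Note that price is stored as text.
df = pd.DataFrame(
    {"item": ["a", "b", "a", "c"], "price": ["10", "5", "7", "3"]}
)

# Transformation 1: convert the price column to numbers, then check the result.
df["price"] = pd.to_numeric(df["price"])
print(df.dtypes)

# Transformation 2: filter the already converted data, then look at what is left.
df = df[df["price"] > 4]
print(df)

# Transformation 3: aggregate the filtered rows.
totals = df.groupby("item")["price"].sum()
print(totals)
```

In a REPL you would only write transformation 2 after seeing that transformation 1 worked, and you would never have to rerun the earlier steps to get there.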

Executing a single transformation and immediately seeing the result lets you focus on the specific piece of your program that you're interested in. Like all feedback, feedback when you're programming is best received immediately and in context. In contrast, in the traditional main.py style of programming, you have to wait for the entire program to finish executing, often dozens of seconds, before you get any feedback on your code. That workflow only gets choppier if you have to open Excel or Notepad at the end of each run to see your results. Trust me, I've been there - it's not how you want to set up your data exploration process.

Using a REPL programming environment keeps you from context switching and gives you at least a chance of staying in flow.