Jupyter Notebook Templates for Data Science: Plotting Time Series

March 14, 2021 #data-science #data-visualization #jupyter #my-projects

The planet Jupiter as seen by the departing Juno spacecraft.

I often have data where each row describes an event. The data might describe a word that my son spoke for the first time, or a collision that happened in California, or the finishing place of a rider in the Tour de France. A question I always want to answer with the data is: What does the distribution of these events look like in time?

Plotting the data as a time series is the best way to answer this question, but I never remember how to pivot the table, aggregate the events by type, and resample to the right frequency. So I made the Time Series Plotting Notebook to remember for me.

The Time Series Plotting Notebook

Suppose we are looking at the number of automobile collisions by make using my curated SWITRS dataset. We could extract one row for each collision and the associated vehicle, which would look like this:

ID	datetime	vehicle_make
0	2020-01-01	Honda
1	2020-02-01	Toyota
2	2020-01-01	Other
…	…	…

The time series plotting notebook has two helpful functions to visualize this data: plot_time_series() and draw_left_legend().

Plot Time Series

The first function, plot_time_series(), is simple. It takes a dataframe formatted like the data above and returns a plot showing the number of events for each value in the categorical column. For example, to plot the number of collisions per week by vehicle make, we would call:

plot_time_series(
  df,
  ax,
  date_col="datetime",
  category_col="vehicle_make",
  resample_frequency="W",  # Resample to 'W'eeks
)

Which would produce this plot:

The function accepts a few optional parameters:

resample_frequency: controls the timescale over which the data is aggregated.
aggfunc: controls how the data is aggregated.
linewidth: sets the line thickness; use thicker lines when there are only a few of them, or thinner lines when there is a lot of data.
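For readers curious about the pivot, aggregate, and resample steps that the function saves me from remembering, here is a minimal sketch of how they could be done in plain pandas. This is not the notebook's actual implementation: the helper name count_events_over_time, the temporary _count column, and the toy data are all made up for illustration.

```python
import pandas as pd

# Toy data shaped like the table above: one row per event.
df = pd.DataFrame(
    {
        "datetime": pd.to_datetime(
            ["2020-01-01", "2020-02-01", "2020-01-01", "2020-01-03"]
        ),
        "vehicle_make": ["Honda", "Toyota", "Other", "Honda"],
    }
)


def count_events_over_time(df, date_col, category_col, resample_frequency="W"):
    """Hypothetical helper: count events per category in each time bin."""
    pivot = (
        df.assign(_count=1)  # every row is exactly one event
        .pivot_table(
            index=date_col,        # rows: event timestamps
            columns=category_col,  # one column per category
            values="_count",
            aggfunc="sum",         # aggregate events sharing a timestamp
            fill_value=0,
        )
    )
    # Re-bin the timestamps to the requested frequency, e.g. 'W'eeks
    return pivot.resample(resample_frequency).sum()


counts = count_events_over_time(df, "datetime", "vehicle_make")
```

The resulting table has one row per week and one column per make, so calling counts.plot(ax=ax) would draw one line per category.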

Simple Legend

Simple legends are great. They convey their information effectively because the superfluous noise has been removed. My basic plotting notebook has a function to remove all the extra information from the legend box leaving only the color and the label. This time I have taken it a step further: I wrote a function to get rid of the box and label each line.

The function draw_left_legend() will draw labels on the end of each line, like so:

I’ve used this legend when plotting the winners of the 2019 Tour de France as well as the 2020 Tour de France.
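To give a feel for how labeling each line directly can work, here is a rough sketch using only matplotlib. It is a stand-in, not the notebook's draw_left_legend(): the function name label_line_ends is hypothetical, and I am assuming here that the nudges are vertical offsets measured in points, which may not match the real function.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend so the sketch runs anywhere
import matplotlib.pyplot as plt


def label_line_ends(ax, nudges=None, fontsize=12):
    """Hypothetical stand-in: write each line's label next to its last point."""
    nudges = nudges or {}
    texts = []
    for line in ax.get_lines():
        label = line.get_label()
        if label.startswith("_"):  # matplotlib marks unlabeled lines this way
            continue
        text = ax.annotate(
            label,
            xy=(line.get_xdata()[-1], line.get_ydata()[-1]),  # last data point
            xytext=(5, nudges.get(label, 0)),  # assumed offset, in points
            textcoords="offset points",
            va="center",
            color=line.get_color(),  # match the label color to its line
            fontsize=fontsize,
        )
        texts.append(text)
    return texts


fig, ax = plt.subplots()
ax.plot([0, 1, 2], [1, 3, 2], label="Honda")
ax.plot([0, 1, 2], [2, 2, 4], label="Toyota")
labels = label_line_ends(ax, nudges={"Toyota": 15, "Honda": -8})
```

Because each label takes its color from its line, the legend box becomes unnecessary; the nudges only matter when two lines finish close together.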

Putting It Together

The time series plotting notebook enables you to quickly plot your data in time with only a few lines of code. Here is the final version of the plot:

Which was produced by this short code snippet:

import seaborn as sns

fig, ax = setup_plot(title="Collisions by Make")

pivot = plot_time_series(
  df,
  ax,
  date_col="datetime",
  category_col="vehicle_make",
  resample_frequency="W",  # Resample to 'W'eeks
)

# Move labels slightly to avoid overlap
nudges = {"Toyota": 15, "Honda": -8}
draw_left_legend(ax, nudges=nudges, fontsize=25)

sns.despine(trim=True)

save_plot(fig, "/tmp/make_collision_in_time.svg")

I hope the notebook template library is useful to you! Let me know on Twitter or GitHub if it is. Your feedback helps make the project better for everyone!