Data Science Interview Practice: Data Manipulation
I often get asked by newly-minted PhDs trying to get their first data job:
How can I prepare for dataset-based interviews? Do you have any examples of datasets to practice with?
I never had a good answer. I would tell them about how the interviews worked, but I wished I had something to share that they could get their hands on.
As of today, that’s changing. In this post I put together a series of practice questions like the kind you might see (or be expected to come up with) in a hands-on data interview using the curated and hosted dataset of California Traffic accidents. The dataset is available for download from both Kaggle and Zenodo, and I even have an example notebook demonstrating how to work with the data entirely online within Kaggle.
As I mentioned in my post about my most recent interview experience, data science and machine learning interviews have become more practical, covering tasks that show up in the day-to-day work of a data scientist instead of hard but irrelevant problems. One common interview type involves working with a dataset, answering some simple questions about it, and then building some simple features.
Generally these interviews use Python and Pandas or pure SQL. Sometimes the interviewer has a set of questions for you to answer and sometimes they want you to come up with your own.
To help people prepare, I have created a set of questions similar to what you would get in a real interview. For the exercise you will be using the SWITRS dataset. I have included a notebook to get you started in SQL (rendered on Github) or Pandas (rendered on Github). The solution notebooks can be found at the very end and I have included answers in this post, click the arrow to show them.
Good luck, and if you have any questions or suggestions please reach out to me on Twitter: @alex_gude
How many collisions are there in the dataset?
A good first thing to check is “How much data am I dealing with?”
Each row in the collisions database represents one collision, so the solution is nice and short:
SELECT COUNT(1) AS collision_count FROM collisions
What percent of collisions involve males aged 16–25?
Young men are famously unsafe drivers so let’s look at how many collisions they’re involved in.
The age and gender of the drivers are in the parties table so the query does a simple filter on those entries. The tricky part comes from needing to calculate the ratio as this requires us to get the total number of collisions. We could hard-code the number, but I prefer calculating it as part of the query. There isn’t a super elegant way to do it in SQLite, but a sub-query works fine. We also have to cast to a float to avoid integer division.
There are a lot of
NULL values for age and sex. I assume they are uncorrelated to age and sex which allows me to remove them. If we were worried about this assumption, we could leave them in and treat the answer as a lower bound.
SELECT COUNT(DISTINCT case_id) / ( SELECT CAST(COUNT(DISTINCT case_id) AS FLOAT) FROM parties WHERE party_age IS NOT NULL AND party_sex IS NOT NULL ) AS percentage FROM parties WHERE party_sex = 'male' AND party_age BETWEEN 16 AND 25
The result is:
How many solo motorcycle crashes are there per year?
A “solo” crash is one where the driver runs off the road or hits a stationary object. How many solo motorcycle crashes were there each year? Why does 2020 seem to (relatively) have so few?
To select the right rows we filter with
WHERE and to get the count per year we need to use a
GROUP BY. SQLite does not have a
YEAR() function, so we have to use
strftime instead. In a real interview, you can normally just assume that the function you need will exist without getting into the specifics of the SQL dialect.
SELECT STRFTIME('%Y', collision_date) AS collision_year, COUNT(1) AS collision_count FROM collisions WHERE motorcycle_collision = True AND party_count = 1 GROUP BY collision_year ORDER BY collision_year
This gives us:
The count is low in 2020 primarily because the data doesn’t cover the whole year. It is also low due to the COVID pandemic keeping people off the streets, at least initially. To differentiate these two causes we could compare month by month to last year.
What make of vehicle has the largest fraction of accidents on the weekend? During the work week?
Weekdays are generally commute and work-related traffic, while weekends involves recreational travel. Do we see different vehicles involved in collisions on these days?
Only consider vehicle makes with at least 10,000 collisions, in order to focus only on common vehicles where the difference between weekend and weekday usage will be significant.
This query is tricky. We need to aggregate collisions by vehicle make, which means we need the parties table. We also care about when the crash happened, which means we need the collisions table. So we need to join these two tables together.
In an interview setting, I would write two simpler queries: one that gets the highest weekend fraction and one that gets the highest weekday fraction with a lot of copy and pasted code. This is a lot easier to work out. Here is an example of one of those queries:
SELECT p.vehicle_make AS make, AVG( CASE WHEN STRFTIME('%w', c.collision_date) IN ('0', '6') THEN 1 ELSE 0 END ) AS weekend_ratio, AVG( CASE WHEN STRFTIME('%w', c.collision_date) IN ('0', '6') THEN 0 ELSE 1 END ) AS weekday_ratio, count(1) AS total FROM collisions AS c LEFT JOIN parties AS p ON c.case_id = p.case_id GROUP BY make HAVING total >= 10000 ORDER BY weekday_ratio DESC LIMIT 1
Then I would copy and paste this but replace the
weekend_ratio with a
weekday_ratio. It isn’t as “elegant” because we have to duplicate code, but it is easy to write.
Combining the queries is possible. To do so I first use a sub-query to do the aggregation. A
WTIH clause keeps it tidy so we don’t have to copy/paste the sub-query twice. I use
HAVING to filter out makes with too few collisions; it has to be
HAVING and not
WHERE because it filters after the aggregation.
I then construct two queries that read from the sub-query to select the highest row for the weekend and weekdays. I
UNION the two queries together so we end up with a single table containing our results. The double select is to allow the
ORDER BY before the
A note: for complicated queries like this one there are always many ways to do it. I’d love to hear how you got it to work!
WITH counter AS ( SELECT p.vehicle_make AS make, AVG( CASE WHEN STRFTIME('%w', c.collision_date) IN ('0', '6') THEN 1 ELSE 0 END ) AS weekend_fraction, AVG( CASE WHEN STRFTIME('%w', c.collision_date) IN ('0', '6') THEN 0 ELSE 1 END ) AS weekday_fraction, count(1) AS total FROM collisions AS c LEFT JOIN parties AS p ON c.case_id = p.case_id GROUP BY make HAVING total >= 10000 ) SELECT * FROM ( SELECT * FROM counter ORDER BY weekend_fraction DESC LIMIT 1 ) UNION SELECT * FROM ( SELECT * FROM counter ORDER BY weekday_fraction DESC LIMIT 1 )
These results makes sense, Peterbilt is a commercial truck manufacturer which you expect to be driven for work. Harley-Davidson makes iconic motorcycles that people ride for fun on the weekend with their friends.
How many different values represent “Toyota” in the Parties database? How would you go about correcting for this?
Data is never as clean as you would hope, and this applies even to the curated SWITRS dataset. How many different ways does “Toyota” show up?
What steps would you take to fix this problem?
This is a case where there is no right answer. You can get a more and more correct answer as you spend more time, but at some point you have to decide it is good enough.
The first step is to figure out what values might represent Toyota. I do that with a few simple
SELECT vehicle_make, COUNT(1) AS number_seen FROM parties WHERE LOWER(vehicle_make) = 'toyota' OR LOWER(vehicle_make) LIKE 'toy%' OR LOWER(vehicle_make) LIKE 'ty%' GROUP BY vehicle_make ORDER BY number_seen DESC
Which gives us this table (truncated):
Most of those look like they mean Toyota, although Tymco is a different company that makes street sweepers.
Here is how I would handle this issue: the top 5 make up the vast majority of entries. I would fix those by hand and move on. More generally it seems that makes are represented mostly by their name or a four-letter abbreviation. It wouldn’t be too hard to detect and fix these for the most common makes.
So that’s it! I hope it was useful and you learned something!
Here are my notebooks with the solutions:
- The SQL solution notebook (Rendered on Github)
- The Python/Pandas solution notebook (Rendered on Github)
A special thanks to Quynh M. Nguyen who came up with some simplifications for my queries!
Let me know if you find any more elegant solutions!