{
"cells": [
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"import sqlite3"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"pd.set_option('display.max_rows', None)\n",
"pd.set_option('display.max_columns', None)\n",
"pd.set_option('display.width', None)"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"con = sqlite3.Connection(\"./switrs.sqlite\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Data Science Interview Practice: Data Manipulation\n",
"\n",
"I often get asked by newly-minted PhDs trying to get their first data job:\n",
"\n",
"> How can I prepare for dataset-based interviews? Do you have any examples of\n",
"> datasets to practice with?\n",
"\n",
"I never had a good answer. I would tell them about how the interviews worked,\n",
"but I wished I had something to share that they could get their hands on.\n",
"\n",
"As of today, that's changing. In this post I put together a series of practice\n",
"questions like the kind you might see (or be expected to come up with) in a\n",
"hands-on data interview using the [curated and hosted dataset of California\n",
"Traffic accidents][switrs_dataset]. The dataset is available for download from\n",
"both [Kaggle][kaggle] and [Zenodo][zenodo], and I even have an [example\n",
"notebook][example_notebook] demonstrating how to work with the data entirely\n",
"online within Kaggle.\n",
"\n",
"[switrs_dataset]: https://alexgude.com/blog/switrs-sqlite-hosted-dataset/\n",
"\n",
"[kaggle]: https://www.kaggle.com/alexgude/california-traffic-collision-data-from-switrs\n",
"\n",
"[zenodo]: https://zenodo.org/record/4284843\n",
"\n",
"[example_notebook]: https://www.kaggle.com/alexgude/starter-california-traffic-collisions-from-switrs\n",
"\n",
"## Interview Format\n",
"\n",
"As I mentioned in [my post about my most recent interview\n",
"experience][last_post], data science and machine learning interviews have\n",
"become more practical, covering tasks that show up in the day-to-day work of a\n",
"data scientist instead of hard but irrelevant problems. One common interview\n",
"type involves working with a dataset, answering some simple questions about\n",
"it, and then building some simple features.\n",
"\n",
"[last_post]: https://alexgude.com/blog/interviewing-for-data-science-positions-in-2020/\n",
"\n",
"Generally these interviews use Python and [Pandas][pandas] or pure SQL.\n",
"Sometimes the interviewer has a set of questions for you to answer and\n",
"sometimes they want you to come up with your own.\n",
"\n",
"[pandas]: https://en.wikipedia.org/wiki/Pandas_(software)\n",
"\n",
"To help people prepare, I have created a set of questions similar to what you\n",
"would get in a real interview. For the exercise you will be using the SWITRS\n",
"dataset. I have included a notebook to get you started in Pandas or SQL. The\n",
"solution notebooks can be found at the very end.\n",
"\n",
"Good luck, and if you have any questions or suggestions please reach out to me\n",
"on Twitter: [@alex_gude][twitter]\n",
"\n",
"[twitter]: https://twitter.com/alex_gude\n",
"\n",
"## Questions"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### How many collisions are there in the dataset?\n",
"\n",
"A good first thing to check is \"How much data am I dealing with?\"\n",
"\n",
"Each row in the collisions database represents one collision, so the solution\n",
"is nice and short:"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"QUERY1 = \"\"\"\n",
"SELECT COUNT(1) AS collision_count\n",
"FROM collisions\n",
"\"\"\""
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" collision_count | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 9172565 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" collision_count\n",
"0 9172565"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.read_sql(QUERY1, con).head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### What percent of collisions involve males aged 16--25?\n",
"\n",
"Young men are famously unsafe drivers so let's look at how many collisions\n",
"they're involved in.\n",
"\n",
"The age and gender of the drivers are in the parties table so the query does a\n",
"simple filter on those entries. The tricky part comes from needing to\n",
"calculate the ratio as this requires us to get the total number of collisions.\n",
"We could hard-code the number, but I prefer calculating it as part of the\n",
"query. There isn't a super elegant way to do it in SQLite, but a sub-query\n",
"works fine. We also have to cast to a float to avoid integer division.\n",
"\n",
"There are a lot of `NULL` values for age and sex. I assume they are\n",
"uncorrelated to age and sex which allows me to remove them. If we were worried\n",
"about this assumption, we could leave them in and treat the answer as a lower\n",
"bound."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"QUERY2 = \"\"\"\n",
"SELECT \n",
" COUNT(DISTINCT case_id) \n",
" / (\n",
" SELECT CAST(COUNT(DISTINCT case_id) AS FLOAT) \n",
" FROM parties \n",
" WHERE party_age IS NOT NULL\n",
" AND party_sex IS NOT NULL\n",
" )\n",
" AS percentage\n",
"FROM parties\n",
"WHERE party_sex = 'male'\n",
"AND party_age BETWEEN 16 AND 25\n",
"\"\"\""
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" percentage | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 0.25769 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" percentage\n",
"0 0.25769"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.read_sql(QUERY2, con)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### How many solo motorcycle crashes are there per year?\n",
"\n",
"A \"_solo_\" crash is one where the driver runs off the road or hits a\n",
"stationary object. How many solo motorcycle crashes were there each year? Why\n",
"does 2020 seem to (relatively) have so few?\n",
"\n",
"To select the right rows we filter with `WHERE` and to get the count per year\n",
"we need to use a `GROUP BY`. SQLite does not have a `YEAR()` function, so we\n",
"have to use `strftime` instead. In a real interview, you can normally just\n",
"assume that the function you need will exist without getting into the\n",
"specifics of the SQL dialect."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [],
"source": [
"QUERY3 = \"\"\"\n",
"SELECT\n",
" STRFTIME('%Y', collision_date) AS collision_year,\n",
" COUNT(1) AS collision_count\n",
"FROM collisions\n",
"WHERE motorcycle_collision = True\n",
" AND party_count = 1\n",
"GROUP BY collision_year\n",
"ORDER BY collision_year\n",
"\"\"\""
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"scrolled": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" collision_year | \n",
" collision_count | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 2001 | \n",
" 3258 | \n",
"
\n",
" \n",
" 1 | \n",
" 2002 | \n",
" 3393 | \n",
"
\n",
" \n",
" 2 | \n",
" 2003 | \n",
" 3822 | \n",
"
\n",
" \n",
" 3 | \n",
" 2004 | \n",
" 3955 | \n",
"
\n",
" \n",
" 4 | \n",
" 2005 | \n",
" 3755 | \n",
"
\n",
" \n",
" 5 | \n",
" 2006 | \n",
" 3967 | \n",
"
\n",
" \n",
" 6 | \n",
" 2007 | \n",
" 4513 | \n",
"
\n",
" \n",
" 7 | \n",
" 2008 | \n",
" 4948 | \n",
"
\n",
" \n",
" 8 | \n",
" 2009 | \n",
" 4266 | \n",
"
\n",
" \n",
" 9 | \n",
" 2010 | \n",
" 3902 | \n",
"
\n",
" \n",
" 10 | \n",
" 2011 | \n",
" 4054 | \n",
"
\n",
" \n",
" 11 | \n",
" 2012 | \n",
" 4143 | \n",
"
\n",
" \n",
" 12 | \n",
" 2013 | \n",
" 4209 | \n",
"
\n",
" \n",
" 13 | \n",
" 2014 | \n",
" 4267 | \n",
"
\n",
" \n",
" 14 | \n",
" 2015 | \n",
" 4415 | \n",
"
\n",
" \n",
" 15 | \n",
" 2016 | \n",
" 4471 | \n",
"
\n",
" \n",
" 16 | \n",
" 2017 | \n",
" 4373 | \n",
"
\n",
" \n",
" 17 | \n",
" 2018 | \n",
" 4240 | \n",
"
\n",
" \n",
" 18 | \n",
" 2019 | \n",
" 3772 | \n",
"
\n",
" \n",
" 19 | \n",
" 2020 | \n",
" 2984 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" collision_year collision_count\n",
"0 2001 3258\n",
"1 2002 3393\n",
"2 2003 3822\n",
"3 2004 3955\n",
"4 2005 3755\n",
"5 2006 3967\n",
"6 2007 4513\n",
"7 2008 4948\n",
"8 2009 4266\n",
"9 2010 3902\n",
"10 2011 4054\n",
"11 2012 4143\n",
"12 2013 4209\n",
"13 2014 4267\n",
"14 2015 4415\n",
"15 2016 4471\n",
"16 2017 4373\n",
"17 2018 4240\n",
"18 2019 3772\n",
"19 2020 2984"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.read_sql(QUERY3, con)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The count is low in 2020 primarily because the data doesn't cover the whole\n",
"year. It is also low due to the COVID pandemic keeping people off the streets,\n",
"at least initially. To differentiate these two causes we could compare month\n",
"by month to last year."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### What make of vehicle has the largest fraction of accidents on the weekend? During the work week?\n",
"\n",
"Weekdays are generally commute and work-related traffic, while weekends\n",
"involves recreational travel. Do we see different vehicles involved in\n",
"collisions on these days?\n",
"\n",
"Only consider vehicle makes with at least 10,000 collisions, in order to focus\n",
"only on common vehicles where the difference between weekend and weekday usage\n",
"will be significant.\n",
"\n",
"This query is tricky. We need to aggregate collisions by vehicle make, which\n",
"means we need the parties table. We also care about when the crash happened,\n",
"which means we need the collisions table. So we need to join these two tables\n",
"together.\n",
"\n",
"In an interview setting, I would write two simpler queries: one\n",
"that gets the highest weekend fraction and one that gets the highest weekday\n",
"fraction with a lot of copy and pasted code. This is a lot easier to work out.\n",
"Here is an example of one of those queries:"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [],
"source": [
"QUERY41 = \"\"\"\n",
"SELECT\n",
" p.vehicle_make AS make,\n",
" AVG(\n",
" CASE WHEN STRFTIME('%w', c.collision_date) IN ('0', '6') THEN 1 ELSE 0 END\n",
" ) AS weekend_ratio,\n",
" AVG(\n",
" CASE WHEN STRFTIME('%w', c.collision_date) IN ('0', '6') THEN 0 ELSE 1 END\n",
" ) AS weekday_ratio,\n",
" count(1) AS total\n",
"FROM collisions AS c\n",
"LEFT JOIN parties AS p\n",
" ON c.case_id = p.case_id\n",
"GROUP BY make\n",
"HAVING total >= 10000\n",
"ORDER BY weekday_ratio DESC\n",
"LIMIT 1\n",
"\"\"\""
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" make | \n",
" weekend_ratio | \n",
" weekday_ratio | \n",
" total | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" PETERBILT | \n",
" 0.09177 | \n",
" 0.90823 | \n",
" 70579 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" make weekend_ratio weekday_ratio total\n",
"0 PETERBILT 0.09177 0.90823 70579"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.read_sql(QUERY41, con).head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Then I would copy and paste this but replace the `weekend_ratio` with a\n",
"`weekday_ratio`. It isn't as \"elegant\" because we have to duplicate code, but\n",
"it is easy to write.\n",
"\n",
"Combining the queries is possible. To do so I first use a sub-query to do the\n",
"aggregation. A `WTIH` clause keeps it tidy so we don't have to copy/paste the\n",
"sub-query twice. I use `HAVING` to filter out makes with too few collisions;\n",
"it has to be `HAVING` and not `WHERE` because it filters **after** the\n",
"aggregation.\n",
"\n",
"I then construct two queries that read from the sub-query to select the\n",
"highest row for the weekend and weekdays. I `UNION` the two queries together\n",
"so we end up with a single table containing our results. The double select is\n",
"to allow the `ORDER BY` before the `UNION`.\n",
"\n",
"A note: for complicated queries like this one there are always many ways to do\n",
"it. I'd love to hear how you got it to work!"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [],
"source": [
"QUERY42 = \"\"\"\n",
"WITH counter AS (\n",
" SELECT\n",
" p.vehicle_make AS make, \n",
" AVG(\n",
" CASE WHEN STRFTIME('%w', c.collision_date) IN ('0', '6') THEN 1 ELSE 0 END\n",
" ) AS weekend_fraction,\n",
" AVG(\n",
" CASE WHEN STRFTIME('%w', c.collision_date) IN ('0', '6') THEN 0 ELSE 1 END\n",
" ) AS weekday_fraction,\n",
" count(1) AS total\n",
" FROM collisions AS c\n",
" LEFT JOIN parties AS p\n",
" ON c.case_id = p.case_id\n",
" GROUP BY make\n",
" HAVING total >= 10000\n",
")\n",
"\n",
"SELECT * FROM (\n",
" SELECT \n",
" *\n",
" FROM counter\n",
" ORDER BY weekend_fraction DESC\n",
" LIMIT 1\n",
")\n",
"\n",
"UNION\n",
"\n",
"SELECT * FROM (\n",
" SELECT \n",
" *\n",
" FROM counter\n",
" ORDER BY weekday_fraction DESC\n",
" LIMIT 1\n",
")\n",
"\"\"\""
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" make | \n",
" weekend_fraction | \n",
" weekday_fraction | \n",
" total | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" HARLEY-DAVIDSON | \n",
" 0.385569 | \n",
" 0.614431 | \n",
" 49602 | \n",
"
\n",
" \n",
" 1 | \n",
" PETERBILT | \n",
" 0.091770 | \n",
" 0.908230 | \n",
" 70579 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" make weekend_fraction weekday_fraction total\n",
"0 HARLEY-DAVIDSON 0.385569 0.614431 49602\n",
"1 PETERBILT 0.091770 0.908230 70579"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.read_sql(QUERY42, con).head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"These results makes sense, Peterbilt is a commercial truck manufacturer which\n",
"you expect to be driven for work. Harley-Davidson makes iconic motorcycles\n",
"that people ride for fun on the weekend with their friends."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### How many different values represent \"Toyota\" in the Parties database? How would you go about correcting for this?\n",
"\n",
"Data is **_never_** as clean as you would hope, and this applies even to the\n",
"[curated SWITRS dataset][switrs_dataset]. How many different ways does\n",
"\"Toyota\" show up?\n",
"\n",
"[switrs_dataset]: https://alexgude.com/blog/switrs-sqlite-hosted-dataset/\n",
"\n",
"What steps would you take to fix this problem?\n",
"\n",
"\n",
"This is a case where there is no _right_ answer. You can get a more and more\n",
"correct answer as you spend more time, but at some point you have to decide it\n",
"is good enough.\n",
"\n",
"The first step is to figure out what values might represent Toyota. I do that\n",
"with a few simple `LIKE` filters:\n"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [],
"source": [
"QUERY5 = \"\"\"\n",
"SELECT \n",
" vehicle_make,\n",
" COUNT(1) AS number_seen\n",
"FROM parties\n",
"WHERE LOWER(vehicle_make) = 'toyota'\n",
" OR LOWER(vehicle_make) LIKE 'toy%'\n",
" OR LOWER(vehicle_make) LIKE 'ty%'\n",
"GROUP BY vehicle_make\n",
"ORDER BY number_seen DESC\n",
"\"\"\""
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" vehicle_make | \n",
" number_seen | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" TOYOTA | \n",
" 2374621 | \n",
"
\n",
" \n",
" 1 | \n",
" TOYO | \n",
" 166209 | \n",
"
\n",
" \n",
" 2 | \n",
" TOYT | \n",
" 146746 | \n",
"
\n",
" \n",
" 3 | \n",
" TOYOT | \n",
" 2823 | \n",
"
\n",
" \n",
" 4 | \n",
" TOY | \n",
" 2262 | \n",
"
\n",
" \n",
" 5 | \n",
" TOYTA | \n",
" 246 | \n",
"
\n",
" \n",
" 6 | \n",
" TOYOTA/ | \n",
" 181 | \n",
"
\n",
" \n",
" 7 | \n",
" TOYTO | \n",
" 84 | \n",
"
\n",
" \n",
" 8 | \n",
" TOYTOA | \n",
" 71 | \n",
"
\n",
" \n",
" 9 | \n",
" TOYOYA | \n",
" 66 | \n",
"
\n",
" \n",
" 10 | \n",
" TOYT. | \n",
" 65 | \n",
"
\n",
" \n",
" 11 | \n",
" TOYA | \n",
" 51 | \n",
"
\n",
" \n",
" 12 | \n",
" TOYTOTA | \n",
" 45 | \n",
"
\n",
" \n",
" 13 | \n",
" TOYOA | \n",
" 43 | \n",
"
\n",
" \n",
" 14 | \n",
" TOYO / | \n",
" 39 | \n",
"
\n",
" \n",
" 15 | \n",
" TOYT / | \n",
" 17 | \n",
"
\n",
" \n",
" 16 | \n",
" TOYT/ | \n",
" 14 | \n",
"
\n",
" \n",
" 17 | \n",
" TYMCO | \n",
" 13 | \n",
"
\n",
" \n",
" 18 | \n",
" TOYOTO | \n",
" 10 | \n",
"
\n",
" \n",
" 19 | \n",
" TOY0 | \n",
" 10 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" vehicle_make number_seen\n",
"0 TOYOTA 2374621\n",
"1 TOYO 166209\n",
"2 TOYT 146746\n",
"3 TOYOT 2823\n",
"4 TOY 2262\n",
"5 TOYTA 246\n",
"6 TOYOTA/ 181\n",
"7 TOYTO 84\n",
"8 TOYTOA 71\n",
"9 TOYOYA 66\n",
"10 TOYT. 65\n",
"11 TOYA 51\n",
"12 TOYTOTA 45\n",
"13 TOYOA 43\n",
"14 TOYO / 39\n",
"15 TOYT / 17\n",
"16 TOYT/ 14\n",
"17 TYMCO 13\n",
"18 TOYOTO 10\n",
"19 TOY0 10"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.read_sql(QUERY5, con).head(20)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Most of those look like they mean Toyota, although Tymco is a different\n",
"company that makes street sweepers.\n",
"\n",
"Here is how I would handle this issue: the top 5 make up the vast majority of\n",
"entries. I would fix those by hand and move on. More generally it seems that\n",
"makes are represented mostly by their name or a four-letter abbreviation. It\n",
"wouldn't be too hard to detect and fix these for the most common makes."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.3"
}
},
"nbformat": 4,
"nbformat_minor": 4
}