Data Science Interview Practice: Machine Learning Case Study

A black and white photo of Henry J.E. Reid, Directory of the Langley Aeronautics Laborator, in a suit writing while sitting at a desk.

A common interview type for data scientists and machine learning engineers is the machine learning case study. In it, the interviewer will ask a question about how the candidate would build a certain model. These questions can be challenging for new data scientists because the interview is open-ended and new data scientists often lack practical experience building and shipping product-quality models.

I have a lot of practice with these types of interviews as a result of my time at Insight, my many experiences interviewing for jobs, and my role in designing and implementing Intuit’s data science interview. Similar to my last article where I put together an example data manipulation interview practice problem, this time I will walk through a practice case study and how I would work through it.

My Approach

Case study interviews are just conversations. This can make them tougher than they need to be for junior data scientists because they lack the obvious structure of a coding interview or data manipulation interview. I find it’s helpful to impose my own structure on the conversation by approaching it in this order:

  1. Problem: Dive in with the interviewer and explore what the problem is. Look for edge cases or simple and high-impact parts of the problem that you might be able to close out quickly.
  2. Metrics: Once you have determined the scope and parameters of the problem you’re trying to solve, figure out how you will measure success. Focus on what is important to the business and not just what is easy to measure.
  3. Data: Figure out what data is available to solve the problem. The interviewer might give you a couple of examples, but ask about additional information sources. If you know of some public data that might be useful, bring it up here too.
  4. Labels and Features: Using the data sources you discussed, what features would you build? If you are attacking a supervised classification problem, how would you generate labels? How would you see if they were useful?
  5. Model: Now that you have a metric, data, features, and labels, what model is a good fit? Why? How would you train it? What do you need to watch out for?
  6. Validation: How would you make sure your model works offline? What data would you hold out to test your model works as expected? What metrics would you measure?
  7. Deployment and Monitoring: Having developed a model you are comfortable with, how would you deploy it? Does it need to be real-time or is it sufficient to batch inputs and periodically run the model? How would you check performance in production? How would you monitor for model drift where its performance changes over time?

Case Study

Here is the prompt:

At Twitter, bad actors occasionally use automated accounts, known as “bots”, to abuse our platform. How would you build a system to help detect bot accounts?


At the start of the interview I try to fully explore the bounds of the problem, which is often open ended. My goal with this part of the interview is to:

Our Twitter bot prompt has a lot of angles from which we could attack. I know Twitter has dozens of types of bots, ranging from my harmless Raspberry Pi bots, to “Russian Bots” trying to influence elections, to bots spreading spam. I would pick one problem to focus on using my best guess as to business impact. In this case spam bots are likely a problem that causes measurable harm (drives users away, drives advertisers away). Russian bots are probably a bigger issue in terms of public perception, but that’s much harder to measure.

After deciding on the scope, I would ask more about the systems they currently have to deal with it. Likely Twitter has an ops team to help identify spam and block accounts and they may even have a rules based system. Those systems will be a good source of data about the bad actors and they likely also have metrics they track for this problem.


Having agreed on what part of the problem to focus on, we now turn to how we are going to measure our impact. There is no point shipping a model if you can’t measure how it’s affecting the business.

Metrics and model use go hand-in-hand, so first we have to agree on what the model will be used for. For spam we could use the model to just mark suspected accounts for human review and tracking, or we could outright block accounts based on the model result. If we pick the human review option, it’s probably more important to get all the bots even if some good customers are affected. If we go with immediate action, it is likely more important to only ban truly bad accounts. I covered thinking about metrics like this in detail in another post, What Machine Learning Metric to Use. Take a look!

I would argue the automatic blocking model will have higher impact because it frees our ops people to focus on other bad behavior. We want two sets of metrics: offline for when we are training and online for when the model is deployed.

Our offline metric will be precision because, based on the argument above, we want to be really sure we’re only banning bad accounts.

Our online metrics are more business focused:

It is often useful to normalize metrics, like the spam fraction metric, so they don’t go up or down just because we have more customers!


Now that we know what we’re doing and how to measure its success, it’s time to figure out what data we can use. Just based on how a company operates, you can make a really good guess as to the data they have. For Twitter we know they have to track Tweets, accounts, and logins, so they must have databases with that information. Here are what I think they contain:

And a lot more. From these we can find out a lot about an account and the Tweets they send, who they send to, who those people react to, and possibly how login events tie different accounts together.

Labels and Features

Having figured out what data is available, it’s time to process it. Because I’m treating this as a classification problem, I’ll need labels to tell me the ground truth for accounts, and I’ll need features which describe the behavior of the accounts.


Since there is an ops team handling spam, I have historical examples of bad behavior which I can use as positive labels.1 If there aren’t enough I can use tricks to try to expand my labels, for example looking at IP address or devices that are associated with spammers and labeling other accounts with the same login characteristics.

Negative labels are harder to come by. I know Twitter has verified users who are unlikely to be spam bots, so I can use them. But verified users are certainly very different from “normal” good users because they have far more followers.

It is a safe bet that there are far more good users than spam bots, so randomly selecting accounts can be used to build a negative label set.


To build features, it helps to think about what sort of behavior a spam bot might exhibit, and then try to codify that behavior into features. For example:

Model Selection

I try to start with the simplest model that will work when starting a new project. Since this is a supervised classification problem and I have written some simple features, logistic regression or a forest are good candidates. I would likely go with a forest because they tend to “just work” and are a little less sensitive to feature processing.2

Deep learning is not something I would use here. It’s great for image, video, audio, or NLP, but for a problem where you have a set of labels and a set of features that you believe to be predictive it is generally overkill.

One thing to consider when training is that the dataset is probably going to be wildly imbalanced. I would start by down-sampling (since we likely have millions of events), but would be ready to discuss other methods and trade offs.


Validation is not too difficult at this point. We focus on the offline metric we decided on above: precision. We don’t have to worry much about leaking data between our holdout sets if we split at the account level, although if we include bots from the same botnet into our different sets there will be a little data leakage. I would start with a simple validation/training/test split with fixed fractions of the dataset.


Since we want to classify an entire account and not a specific tweet, we don’t need to run the model in real-time when Tweets are posted. Instead we can run batches and can decide on the time between runs by looking at something like the characteristic time a spam bot takes to send out Tweets. We can add rate limiting to Tweet sending as well to slow the spam bots and give us more time to decide without impacting normal users.

For deployment, I would start in shadow mode, which I discussed in detail in another post. This would allow us to see how the model performs on real data without the risk of blocking good accounts. I would track its performance using our online metrics: spam fraction and ops time saved. I would compute these metrics twice, once using the assumption that the model blocks flagged accounts, and once assuming that it does not block flagged accounts, and then compare the two outcomes. If the comparison is favorable, the model should be promoted to action mode.

Let Me Know!

I hope this exercise has been helpful! Please reach out and let me know at @alex_gude if you have any comments or improvements!

  1. In this case a positive label means the account is a spam bot, and a negative label means they are not. 

  2. If you use regularization with logistic regression (and you should) you need to scale your features. Random forests do not require this.