What Should a PM Know About Data Science? – The Discourse #19
And what questions should you ask your ML team
So it’s your first time working on an ML product. What are the concepts you should know?
I don't claim to be an expert on the topic, but having worked on an ML product for 1.5 years, I’ve learned a few things. And to further back up my points, I spoke with Hassan Kane, who is the Lead Data Scientist on my team and an MIT grad.
This is just a starting point of your journey into understanding the basics of Machine Learning and what intelligent questions you should ask before you start. I’ve linked in-depth resources at the end of this article.
If you’re new here, please subscribe and get insights about product, design, and no-code delivered to your inbox every week.
So let’s begin with some of the basics of ML:
What is Machine Learning?
In simple words, machine learning gives computers the ability to learn without being explicitly programmed through rules. The ML models learn from sample data to make predictions on new data.
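To make "learning from sample data" concrete, here is a minimal illustrative sketch (not from the article, and far simpler than a real ML system): a one-variable linear model that discovers the pattern in sample input/output pairs via least squares, without that pattern ever being programmed as a rule, then predicts on new data.

```python
def fit_line(xs, ys):
    """Learn a slope and intercept from sample (input, output) pairs
    using ordinary least squares -- no explicit rules programmed."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Sample data following y = 2x + 1 -- the pattern the model must discover
xs, ys = [1, 2, 3, 4], [3, 5, 7, 9]
slope, intercept = fit_line(xs, ys)

def predict(x):
    """Apply the learned model to data it has never seen."""
    return slope * x + intercept

print(predict(10))  # prints 21.0 -- the model generalizes to a new input
```

Real models are vastly more complex, but the workflow is the same: fit on sample data, then predict on new data.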
What are the use cases?
Object recognition (For example, it can identify a cat in an image)
Speech recognition (This is how it can identify words from voice data)
Natural language processing (To understand the meaning of human language)
Prediction (It can make predictions based on historical or existing data)
Training data: This is the dataset that includes both the input and the output values; it is used to train the model.
Test data: This is the dataset without the output values; it is used to test the model and provide accuracy measures.
There are a few different ways in which models learn:
Supervised learning: Here the training data is labeled, meaning each example has an output assigned to it. We use it to train models to predict the output for new data where it is not present.
Unsupervised learning: Here the learning happens by clustering the input values and categorizing the data points. The training data is unlabeled.
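The clustering idea can be sketched with a toy example (illustrative only, not from the article): a tiny one-dimensional k-means that groups unlabeled points into two clusters with no outputs ever provided.

```python
def kmeans_1d(points, iters=20):
    """Unsupervised sketch: group unlabeled 1-D points into two clusters
    by repeatedly assigning each point to its nearest center and moving
    each center to the mean of its cluster."""
    centers = [min(points), max(points)]  # crude initialization
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [0 if abs(p - centers[0]) <= abs(p - centers[1]) else 1
                  for p in points]
        for c in (0, 1):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = sum(members) / len(members)
    return labels, centers

# No outputs provided -- the two groups emerge from the inputs alone
points = [1.0, 1.2, 0.8, 8.0, 8.4, 7.9]
labels, centers = kmeans_1d(points)
# labels -> [0, 0, 0, 1, 1, 1]: the low values cluster apart from the high ones
```

Real clustering algorithms handle many dimensions and many clusters, but the core loop is this simple.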
What are some good questions to ask?
Okay, now that you know the basics of ML, let us get into some of the questions you should ask:
What models are suited for our use case and what are their tradeoffs?
Certain models are better suited for specific use cases. For example, GPT-2 is well suited for NLP tasks and text-based category predictions.
Hassan: There are usually many types of models for solving a given problem. Each model comes with its own complexity in terms of size and data requirements. A useful proxy is that more powerful models tend to be data-heavy and have a bigger surface area for going wrong. It can be helpful to start with simpler models to get quick wins and then go up the complexity chain.
What are the risks inherent to the models we work with? How do we manage them?
This question does not get discussed enough. Machine learning models are far from perfect and present inherent risks that every team should manage.
Hassan: Each step of the model development pipeline, from data acquisition to training and deployment, involves risks because machine learning is inherently probabilistic. The worst-case scenario is having a model silently fail in production. There should be ways to know the weaknesses of the models, where they come from, and whether they are solvable, and to reflect that in your product (e.g. enable users to override predictions, keep humans in the loop, show confidence scores).
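One simple way to avoid silent failures, along the lines Hassan describes, is a human-in-the-loop gate: predictions below a confidence threshold go to a reviewer instead of shipping automatically. A minimal sketch (the function name and threshold value are illustrative, not from the article):

```python
CONFIDENCE_THRESHOLD = 0.8  # illustrative value; tune per use case

def route_prediction(prediction, confidence):
    """Send low-confidence predictions to a human reviewer so the
    model never fails silently in production."""
    if confidence >= CONFIDENCE_THRESHOLD:
        return ("auto", prediction)       # ship the model's answer
    return ("human_review", prediction)   # queue for a person to check

route_prediction("cat", 0.95)  # shipped automatically
route_prediction("cat", 0.55)  # flagged for human review
```

The right threshold is a product decision as much as a technical one: it trades review cost against the cost of a wrong prediction reaching users.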
What data are the models trained on?
In the case of GPT-2, the model is trained on 8 million web pages, while some other models are trained on Wikipedia text.
Hassan: You need to know what data powers your models, as there is an extensive literature on how assumptions baked into the datasets used to pre-train the models can differ from the intended use cases. The consequences range from a benign drop in accuracy to outright bias in the models against certain demographic groups. It is very important as a product manager to be aware of this.
What's the processing power required? (CPU/GPU)
Processing power will define the cost estimates.
Hassan: GPUs in the cloud are not cheap and can cost up to $3 per hour, versus around 15 cents per hour for CPU machines. A lot of engineering work, though, can be done to use computational resources cleverly.
What is the latency?
This is a function of the complexity of the model along with the CPU/GPU resources that you can throw at it.
Hassan: Latency will help determine whether batch processing or live processing is intended for your use case.
What is the minimum amount of training data required and what can be done to augment our in-house data?
The training data also has to be roughly evenly distributed across features and attributes. You will often face situations where there is a lot of data for one attribute but not enough for another. But do not limit yourself to just the data that you have in-house.
Hassan: There are a lot of techniques to create synthetic training datasets along with publicly available datasets. Those can be leveraged to jumpstart your machine learning efforts and not have to start from scratch. Your time will be well spent exploring online competitions and available datasets before labeling your data.
What should be the split between Training data, Validation, and Testing data?
Usually, it is something in the range of 70-20-10%.
Hassan: Make sure that you keep a chunk of your data to train your model, another chunk to validate, and a final chunk to test it.
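The 70-20-10 split above can be sketched in a few lines of Python (an illustrative sketch; the function name and seed are my own, not from the article). The key points are shuffling before slicing and seeding the shuffle so the split is reproducible:

```python
import random

def split_dataset(data, train_frac=0.7, val_frac=0.2, seed=42):
    """Shuffle once (seeded for reproducibility), then slice into
    train / validation / test chunks, e.g. 70/20/10."""
    shuffled = data[:]                      # don't mutate the caller's list
    random.Random(seed).shuffle(shuffled)   # shuffle before slicing
    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])     # the remainder is the test set

train, val, test = split_dataset(list(range(100)))
# len(train), len(val), len(test) -> 70, 20, 10
```

The test set must stay untouched until the very end; evaluating on it repeatedly while tuning leaks information and inflates the reported accuracy.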
How should we track the performance of our model over time?
A common misconception is that machine learning models are developed just once. In reality, they are constantly updated and improved.
Hassan: This is one of the most important questions to get right. After deploying your machine learning models, you need to keep track of their live performance, spot when it starts to drop, and decide whether to retrain. You should also know what your expected failure cases are. No model is perfect: from the training performance, your team should be able to anticipate most failure cases and communicate them to your users.
What are the performance metrics that need to be tracked?
Precision - What proportion of positive identifications was actually correct?
Recall - What proportion of actual positives was identified correctly?
F1 score - The F1 score is the harmonic mean of precision and recall.
Area under the curve (AUC) - The AUC is calculated as the area under the ROC curve, which shows the trade-off between true positive rate (TPR) and false positive rate (FPR) across different decision thresholds. An AUC of 0.5 corresponds to a coin flip, i.e. a useless model, while an AUC of 1.0 corresponds to a perfect classifier.
Hassan: Depending on the use cases, error cases such as false positive or false negatives are more or less important. This choice can be reflected in your metric.
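The precision, recall, and F1 definitions above can be computed directly from a set of predictions. A minimal sketch in Python (illustrative, not from the article), for binary labels where 1 is the positive class:

```python
def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall, and F1 for binary labels (1 = positive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0  # how many positive calls were right
    recall = tp / (tp + fn) if tp + fn else 0.0     # how many actual positives were found
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)           # harmonic mean of the two
    return precision, recall, f1

# 3 actual positives; the model calls 3 positives, 2 of them correctly
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0]
p, r, f1 = precision_recall_f1(y_true, y_pred)
# p = r = f1 = 2/3
```

As Hassan notes, which metric matters depends on the use case: optimizing for recall tolerates false positives, while optimizing for precision tolerates missed positives.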
In-depth resources:
ML 101 - Jason Mayes, Google
Machine Learning Crash Course - Google
That's it for today. I hope this dive into ML was helpful. Leave a comment and I'll be happy to discuss it with you.
Talk to you soon!
P.S. Hit the subscribe button if you liked it! You’ll get insightful posts like this directly in your email inbox every week.