Being a successful data scientist requires a mix of technical skills, higher-order thinking and down-and-dirty problem solving. Because this mix of talent isn’t necessarily part of a standard college curriculum, you’ll find many data scientists who lack the real-world experience needed to fully understand the pitfalls of working with data. In fact, there are three mistakes that I think a data scientist can easily make if not looking out for them.
Keep Metrics and KPIs Consistent
The most prevalent mistake that I’ve seen a data scientist make is manipulating the data so that it tells the story that his or her constituents or superiors want it to tell. At the end of the day, the numbers themselves don’t lie. But the context that you apply to those numbers can certainly distort their meaning. For example, if you track a metric on a monthly basis and have an agreed-upon scale for interpreting it, that scale needs to stay constant from one month to the next. In my prior roles managing dashboards and metrics upon which performance was measured, those being measured would often advocate for changing the scale so that it would cast their performance in a more positive light. As the data scientist, it’s critical that you remain steadfast in the face of this kind of pressure. The purity of the data that you are extracting and reporting on is yours to protect.
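One way to make a scale harder to quietly change is to define it once and apply it to every reporting period. The following is a minimal sketch under that assumption; the thresholds, the labels and the `rate` helper are hypothetical and only illustrate the idea.

```python
# Hypothetical agreed scale: (lower bound, label), checked from highest to lowest.
# Defined once so every month is interpreted against the same thresholds.
AGREED_SCALE = (
    (90.0, "green"),
    (75.0, "yellow"),
    (0.0, "red"),
)


def rate(value: float) -> str:
    """Return the label for a metric value under the agreed scale."""
    for lower_bound, label in AGREED_SCALE:
        if value >= lower_bound:
            return label
    raise ValueError(f"value {value} is below the lowest bound of the scale")


# Made-up monthly values for illustration only.
monthly_values = {"2024-01": 92.0, "2024-02": 74.9, "2024-03": 80.0}
ratings = {month: rate(value) for month, value in monthly_values.items()}
print(ratings)
```

Because the thresholds live in one place, any change to the scale is explicit and visible rather than applied to a single month's report.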
Appreciate Data Source Nuances
The next mistake, using metrics that aren’t comparable, is almost as commonplace as the first. Especially when working with or for a larger company, you’ll find that you have a number of data sources relevant to your goal. It can be tempting to use them all. But you need to do so very carefully, because data sources can be relevant without being comparable. If you are including these metrics in an analytics dashboard, it’s important to assess their comparability first. For example, one metric may be reported on a weekly basis, while another is reported on a monthly basis. Or you may have two different sources for the same metric that don’t produce the same result. This can happen when the sources use different calculations or cover different time periods. Since the very intention of a dashboard is to provide a snapshot of performance against a specific goal or purpose, your data sources must be comparable in order to paint an accurate picture. Thus, the data scientist needs to plan for this: understand the origin of the data, and clearly label each metric so that readers interpret it accurately.
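As an illustration of assessing comparability, the sketch below rolls a weekly metric up to months and then flags the months where it disagrees with a separately reported monthly figure. It rests on several assumptions that are not from any particular system: each week is attributed to the month of its start date, the metric is additive, and a 1% relative tolerance counts as agreement. The function names and data are made up.

```python
from collections import defaultdict
from datetime import date


def weekly_to_monthly(weekly):
    """Sum weekly values into (year, month) buckets.

    Assumption: a week belongs to the month of its start date. Weeks that
    straddle two months may need a different rule for your data.
    """
    monthly = defaultdict(float)
    for week_start, value in weekly.items():
        monthly[(week_start.year, week_start.month)] += value
    return dict(monthly)


def compare_sources(a, b, tolerance=0.01):
    """Return periods present in both sources whose values differ by more
    than `tolerance`, measured relative to the larger absolute value."""
    mismatches = {}
    for period in sorted(a.keys() & b.keys()):
        x, y = a[period], b[period]
        scale = max(abs(x), abs(y))
        if scale and abs(x - y) / scale > tolerance:
            mismatches[period] = (x, y)
    return mismatches


# Made-up example data: one weekly source and one monthly source.
weekly_source = {
    date(2024, 1, 1): 100,
    date(2024, 1, 8): 120,
    date(2024, 1, 15): 110,
    date(2024, 1, 22): 130,
    date(2024, 1, 29): 90,
    date(2024, 2, 5): 105,
}
monthly_source = {(2024, 1): 550.0, (2024, 2): 140.0}

rolled_up = weekly_to_monthly(weekly_source)
mismatches = compare_sources(rolled_up, monthly_source)
print(mismatches)
```

A flagged month is not necessarily an error. It is a prompt to find out whether the two sources use different calculations or time periods, and to label the metrics accordingly.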
Perform Sanity Checks
The last mistake is one that any data scientist can make, from an amateur to a well-seasoned professional. When getting to the end of a data science project and working through the final execution and delivery of a report, I’ve seen many data scientists become overconfident and skip a final proofing or ‘sanity check.’ They fail to realize that they can be too close to the work and may have missed the obvious. It’s always prudent to do a ‘sanity check,’ either doing it yourself or having a teammate do it. While you don’t have to recalculate everything (which on a large project with a lot of data would be impossible), it’s good to employ techniques that confirm your numbers look directionally right. For example, if the metrics you’re reviewing are continuous variables with a known realistic range (e.g. human age), I like to ‘take it to the extremes’ to uncover any potential issues. You can do this by checking the maximum and minimum of each variable to make sure that the values you’ve reported fall within the expected range. I’ve found that this kind of ‘sanity check’ can be very helpful in identifying extra zeros, missing digits, incorrect formulas and the like.
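The ‘take it to the extremes’ check can be sketched in a few lines. This is a minimal example, assuming a plain list of numeric values; the `check_range` helper, the sample ages and the 0 to 120 bounds are all hypothetical choices for illustration.

```python
def check_range(values, low, high):
    """Report the observed extremes and any values outside [low, high]."""
    observed_min = min(values)
    observed_max = max(values)
    return {
        "min": observed_min,
        "max": observed_max,
        "ok": low <= observed_min and observed_max <= high,
        "out_of_range": [v for v in values if not low <= v <= high],
    }


# Made-up ages: 290 looks like an extra zero, -3 like a sign or formula error.
ages = [34, 27, 45, 290, 61, -3]
report = check_range(ages, low=0, high=120)
print(report)
```

Looking only at the minimum and maximum is cheap even on large datasets, and listing the out-of-range values points straight at the records worth a second look.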
The best data scientists are aware of these common mistakes and take the necessary steps to control for them. Ultimately, they have a foundation of both traditional book smarts and real-world application. It’s one thing to understand and apply concepts in an academic environment, but another thing entirely to do so in the real world with all its pressures. Those who work hard to protect the integrity of their data and take the right steps to ensure its accuracy will find their work valuable both to themselves and to those who rely on it.