I remember the first time I heard the term “Data Scientist.” It was 2012, and I was at a going-away party with other physicists at CERN to celebrate a friend who was headed back to the US to become one. At the time I remember thinking that the title was kind of pointless; a scientist is nothing without data, so adding “data” to a career title that I already held seemed tautological.
Fast forward almost a decade, and there’s a healthy debate[1] in industry over what Data Science actually means, and whether “Data Scientist” is a pernicious title. I’d like to argue that the role of a “scientist” is something that industry has yet to capitalize on, and that if we imprudently get rid of it, we’ll be missing an opportunity to foster a valuable mindset within the product development process.
So before we completely dump “Data Scientist” as a role in industry, I thought it might be worth breaking down where the confusion stems from in the first place.
Here are my top four (hot take) reasons:
We clearly have a serious semantics issue
Maybe this goes without saying, but I don’t just mean the issue with the overloaded term “Data Science.” Compared to other disciplines in the tech industry, our field is still relatively nascent, and it seems like almost every aspect of our roles is hard to pin down with consistent vocabulary. For example, the textbook definition of the word “experiment” is about validating a hypothesis with evidence in a controlled way, but it’s also often used to describe specifically an A/B test, which is but a single hammer in a data scientist’s evidence-gaining toolbox.
We can solve the semantics issue by either getting rid of all the overloaded terms we currently use (good luck) or being a bit more explicit about the language we already have and the value certain words bring. For what it’s worth, I’m going to completely avoid the discussion about where machine learning engineers should sit, and speak specifically to the subset of our field often called “product analytics” or “decision science.” Here, I define science as “the pursuit of understanding the universe following a systematic methodology based on evidence.”[2]
Companies try to hire magicians
I’ve seen (and experienced first-hand) many companies that have tried to hire a Data Science team without also budgeting for the engineering resources they need in order to be successful. In these situations, it’s often the case that a Data Science team is expected to churn out oodles of money-making insights before the company has a mature enough data infrastructure. Expecting a bunch of data scientists to have eureka moments that 10x the business every month without first laying the groundwork through thoughtful data engineering should be regarded the same way as any other “get rich quick” scheme. Just like setting any other investment up for success, it takes work, planning, and a solid foundation.
Different companies will be at different stages of becoming data-driven, and early-stage companies might actually find that data engineering is the investment they need in order to get the data ball rolling.[3] Maybe one of the reasons so many data scientists are not performing the cultural and textbook definition of science in their roles is that when they are hired, the data infrastructure is insufficient, so they roll up their sleeves and build it themselves. There’s a causation-versus-correlation question lurking here.
That being said, the whole point of collecting data is ultimately to inform business decisions, and with so much emphasis and upfront investment going into the infrastructure, the original reason for building it can sometimes get lost. Because of this, it’s even more important to have a strategic roadmap for the data team that looks at least three years out. A Data Engineering-heavy team won’t just magically slide into an effective Data Science team (one that generates impactful insights) without intentional forethought about the different needs the business will have of the data as it evolves.
Real scientists don’t just do statistics
Who declared that ETL isn’t a “scientist’s” job? I’m going to speak from my personal experience here, having come up through the academic track. As grad students, we all started as what industry calls “analytics engineers,” but got to have the scientist title while we did it. For eight years I worked for ATLAS, a general-purpose particle detector located at CERN whose claim to fame was the co-discovery of the Higgs boson in 2012. However, years before the big discovery, we needed to build the infrastructure that would be capable of collecting petabytes of raw data. When the first data really started rolling in during the spring of 2010, we (the scientists) basically just counted events to make sure they made sense and to validate the data pipeline. It may have been tedious work, but it was how we learned to appreciate the data and gain a healthy respect for it.
Regardless of your career path, whether you grow into a Data Scientist role straight from undergrad or hop over from academia after getting a PhD, you can’t escape putting in the time to learn that data appreciation, which is to say that collecting data is hard and not to be taken for granted. Pulling data, cleaning it, looking at it, scratching your head: it’s all part of the process. No matter how senior you get, science is 5% inspiration and 95% sweaty data gathering.
Finally, the real problem: Our industry hasn’t learned how to take advantage of the special powers scientists bring to the table
Ok, by now you’re probably screaming “if you love science so much, why don’t you go back to academia where you belong??!”
Fair. It’s easy to argue for the value of tangible artifacts such as reports, dashboards, and data pipelines, and thus it’s easy to argue that data analysts, data engineers, analytics engineers, and machine learning engineers provide tangible value as roles. Science, which I defined earlier as a journey of learning, is slow and boring, so why on earth would we want it in a business setting?
Perhaps without realizing it, product teams have been using something akin to the scientific process through the practice of product discovery for decades now. Start with an idea (the theory), go learn about your users, try stuff out by testing features, get user feedback (all of this is evidence gathering), and then rinse and repeat.
The real problem I see is that we as an industry haven’t figured out how to optimally integrate science practitioners with product developers, at least not consistently. Product leaders were doing a great job creating new products that users loved, and then suddenly, around the mid-2010s, a deluge of scientists entered the industry with this new title of Data Scientist. Since these battered-and-bruised ex-academics didn’t particularly know what they were doing in a business setting either, this created a hotbed of tension and misunderstandings. Evidence of the rocky relationship shows up in pessimistic terms like “data gate-keepers” and “ivory towers” on one side and “we’re not a service org” on the other.[4]
It’s time we acknowledge that this tension exists, and turn it on its head. Product Managers are experts in product development, and Data Scientists have a deep, more meta-level understanding of how to learn. A product manager who cracks how to leverage the learning expertise of their Data Science team would suddenly gain those superpowers, like putting on a really cool powered exoskeleton for supercharged product discovery. What if we worked on that relationship, as partners, and combined forces?
After all that, I have to admit that the term Data Scientist is a generic one that means different things to different people, and the field probably does need to evolve with more specific titles. Analogously, titles like “DevOps Engineer,” “Android Engineer,” “Front-end Developer,” “QA Engineer,” etc., all fall under the broader scope of “Software Engineer” (fwiw: a quick search on LinkedIn returned 11,600,000 results for people with Software Engineer as their title). For now though, having scientist available as a title for data practitioners working in industry still matters to me, and I vote to keep it.
Thanks to Hamilton Ulmer for providing helpful feedback for this post!
[4] A real thing someone high up in my management chain once said: “We’re not the data bitches.”