For what feels like forever now, I have been going on and on about my contention that data science is not science and that practitioners of the profession have no business calling themselves scientists. Since I post on the topic so frequently, usually in response to articles in data science publications like Towards Data Science, I had anticipated that the data science community would mount a strong defense. In this I have been sadly mistaken: the few comments I have received from actual, working data scientists have been either supportive or insubstantial.
Recently I received my first “real” defense of the status quo. It came as a comment on a response I had posted to a Towards Data Science article bemoaning the lack of a process for discovering and recruiting talented new data scientists to the profession. The author provided the following example problem to illustrate his point.
“…..In a typical example, the sales team might notice that deals close more frequently at a certain time of day, and ask the data science team to look into this [observable phenomenon], to explain it — or at the very least predict its occurrence — with a model. Now granted, this may not be fundamental science, but it certainly is the application of the scientific method, at least by most people’s definitions (including apparently, your own)…”
Thank you so much again, Jeremie Harris, for taking the time to read and consider my position, and then for taking the further step of writing an intelligent defense. As I mentioned, few have done so to date. I responded to Harris’s original reply and have been thinking a lot about this example ever since. I feel that my original “defense of my attack” response was lacking, and I will expand upon it here.
First, you can ignore the last bit about “(including apparently, your own).” That was simply the result of a misunderstanding of an earlier infographic I had published, in which I provided a brief statement concerning why science is needed. He took that to be my “definition of science,” which it most certainly is not. I have posted the infographic below if you are interested.
With that out of the way, we can get to the example problem at hand. Contrary to Mr. Harris’s contention that the example problem proves data science is science, I believe it illustrates the exact opposite. Let’s go through it piece by piece.
“…..In a typical example, the sales team might notice that deals close more frequently at a certain time of day…” We already encounter problem number one. The sales team are clearly not scientists, and yet it is they, not the data scientists who will later analyze the data, who develop the hypothesis to be tested. Presumably it is something like “There is something unusual about the fact that deals close more frequently at a certain time of day. I wonder what it could be?” Now, I recognize that this is a very weak hypothesis that offers no conjecture as to what that strange thing might be, but it is still a hypothesis: they hypothesize that the pattern is strange. Even if you do not agree that it is a hypothesis, it is still the sales team who pose the question to be answered, not the data scientists. Hypothesis generation is part of the scientific method and a requirement for doing science or being a scientist.
“…ask the data science team to look into this [observable phenomenon], to explain it — or at the very least predict its occurrence — with a model…” My problem with this part is twofold. The first is that the data the sales team asks the data science team to look into already exists. In other words, the “experiment” has already been conducted and the data scientists had nothing to do with it. They did not design the experiment, or run it, or have any input into it at all. They are simply asked to take the data, analyze it, and then perhaps build a model to predict it. Hypothesis testing through experimentation is one of the requirements for doing science or being a scientist. Obviously not everyone agrees with this position, which is why entirely “theoretical” sciences such as theoretical physics exist. However, even the theoretical sciences require hypothesis generation, and I would contend that there is still an experiment, only it is conducted “in the mind” or “on paper/in a computer.”
My second problem, which is related to the first, is that one could ask a group of statisticians to do the exact same task and get the same or very similar answer(s). Statisticians do not call themselves statistical scientists, so why should data analytics professionals call themselves data scientists? At base this is an exercise in data analysis: taking an existing data set that the “scientists” had no input into and analyzing it for patterns or unusual or unexpected outputs. Finally, building a model to predict something based on the findings of an analysis is also not science. It is an exercise in statistics and mathematics, to be sure, but not science.
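To make concrete what I mean by “an exercise in statistics,” here is a minimal sketch in Python of the kind of analysis the sales team is requesting. Everything in it is an assumption for illustration: the closing hours are invented data, the names closing_hours and chi_square_uniform are hypothetical, and a chi-square goodness-of-fit test against an even spread of closings is only one of several techniques a statistician might reach for.

```python
from collections import Counter

# Hypothetical data (invented for illustration): the hour of day, 9 through 16,
# at which each of 112 deals closed.
closing_hours = (
    [9] * 12 + [10] * 11 + [11] * 13 + [12] * 10
    + [13] * 12 + [14] * 11 + [15] * 30 + [16] * 13
)


def chi_square_uniform(hours, bins):
    """Chi-square statistic comparing observed counts per bin with an even spread."""
    counts = Counter(hours)
    expected = len(hours) / len(bins)  # same expected count in every bin
    return sum((counts.get(b, 0) - expected) ** 2 / expected for b in bins)


business_hours = list(range(9, 17))  # eight one-hour bins, so 7 degrees of freedom
stat = chi_square_uniform(closing_hours, business_hours)

# Standard table value: chi-square critical value at alpha = 0.05 with 7 degrees of freedom.
CRITICAL_0_05_DF7 = 14.067

peak_hour = Counter(closing_hours).most_common(1)[0][0]

if stat > CRITICAL_0_05_DF7:
    print(f"Closings are not evenly spread (chi-square = {stat:.2f}); peak hour: {peak_hour}:00")
else:
    print(f"No evidence of an unusual hour (chi-square = {stat:.2f})")
```

With these made-up counts the statistic exceeds the critical value, so the analyst would report that closings are not spread evenly across the day and that the 3 p.m. hour stands out. Note that nothing in the sketch designs or runs an experiment; it only summarizes data that already exists.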
“…Now granted, this may not be fundamental science, but it certainly is the application of the scientific method, at least by most people’s definitions…”
I am not sure what is meant by “fundamental science” in this quote, but I presume it is intended to mean basic as opposed to applied research, or something like that. In this we agree: it is not fundamental science. Is it “the application of the scientific method”? I do not see how it can be when there is no hypothesis to be tested and no test conducted. There may be conclusions drawn from analysis of the data, and all scientists do this in their work. This is how they generate new hypotheses to test. But part of a thing is not identical to the thing itself. This is akin to saying, as I often do, that data science is a tool of science, not science. I am also always quick to add that this is no value judgement as to the relative merits of either profession. It is simply a statement of fact.