
Gatekeeping and Elitism in Data Science


If I’m being honest, I often feel like I need to show that I’m better than other people. I can be especially egotistical when it comes to things that I have built my identity around, things that I take a lot of pride in, things that provide me with a sense of self-worth, however shallow it may be. So when I encounter elitism and snobbish, intellectual gatekeeping in the field of data science, I totally get it.

When I hear people who spend most of their time working in Microsoft Excel, who have very little knowledge of math or statistics, refer to their work as data science or analytics, I often feel a sense of anger. When I see someone asking an online forum “Is understanding statistics really necessary for data science?” or “I’m trying to make the transition from a social media manager to data scientist, where’s the best place to start learning linear algebra?”, I sometimes feel personally offended. Hell, even when people are comfortable with naively running svm.fit() and svm.predict() on real datasets without attempting to understand the theoretical nuances of convex optimization, my insecurities can force me to scoff as if I were a French aristocratic woman being asked to the ball by a peasant farmer.

Of course, reacting in this way is childish and narcissistic. Attempting to discourage aspiring learners and to assume some false sense of superiority or moral high ground is not only immature and selfish; it also harms the field. By not actively encouraging others to engage with data science and by not praising the expanding access to online educational resources, one becomes a detriment to the field's progress. The most important contribution many people can make to human knowledge is to inspire and encourage others. Acting as the gatekeeper of data science is a betrayal of intellectual curiosity and an affront to the new paradigms of accessible knowledge, technological disruption, and the democratic nature of data and empirical evidence.

On the other hand, I also believe that this gatekeeping instinct is, at least partially, a valid reaction. Although ego and famine thinking (the perception that there is only a fixed number of opportunities in data science, so that one person’s success harms another) can account for a great deal of these reactions, I do believe there’s something else at play here. Before attempting to provide a charitable understanding of this elitist, exclusionary mindset, I’d like to clarify what I mean by gatekeeping in data science. The gatekeeping that I’m referring to often takes the form of established data scientists and other “insiders” attempting to dissuade or discourage people (often those from unconventional or non-technical academic and occupational backgrounds) from pursuing their sacred field. A similar elitist vein runs through many attempts to sharply distinguish “real data scientists” from the fakes, to clarify that “business/reporting analysts” aren’t really doing analytics, or to claim that one “must have a graduate degree in a STEM field” to be a worthwhile data scientist.

I’d now like to turn to a few reasons that we should empathize with the gatekeepers. To understand their point of view, we need to take a critical look at some of these issues. In the end, this leads to a greater understanding of the need to encourage a diverse group of people from many different backgrounds to collaborate, form communities, and pursue a variety of opportunities in all things data science.


In an article titled “Software Developer’s Growing Elitism Problem” for techcrunch.com, Cahlan Sharp points out that one of the factors driving elitism among programmers is that the younger generation of emerging programmers has access to opportunities and shortcuts that the older generation didn’t have. The old guard feels cheated by this perceived unfairness and threatened by the immense potential that the newer programmers possess. I sense that this same dynamic is prevalent in data science.

Imagine that you are one of the career statisticians who have been using statistical programming and building models for decades, or that you are a part of the first wave of industry data scientists. Before the hype around data science exploded in the last decade or so, it was likely much more common for people with graduate degrees and formal educational experience in quantitative fields to become data scientists. Ph.D.s in physics, computer science, biology, and other STEM fields likely learned programming and data analysis in the context of applied science research and then transitioned to work as data scientists in industry.

It’s not too hard to see the huge discrepancy here between one of these early data scientists with a graduate STEM degree and an emerging data scientist today. The latter has access to carefully curated online data science programs that are available for free from anywhere in the world, as well as specific career guidance and new analytical software. The veteran data scientist from the mid-2000s might feel cheated by having missed out on online courses such as “Intro to Data Science in Python” or “Introduction to R Programming”, online data science communities like Kaggle, and the undergraduate and graduate programs that universities now offer specifically in data science and machine learning.

Let’s give our gatekeeper the benefit of the doubt. Let’s assume that their reaction is not simply feeling cheated by the additional advantages that the newer generation has access to. Let’s assume their concern is a more genuine one: they perceive a decline in the quality of this generation’s educational standard. Furthermore, assume they are genuinely concerned about the increasing number of people pursuing data science solely for the attractive salaries and the social prestige. Finally, let’s add on the fact that it is understandably annoying to have to deal with the broad and often unhelpful label of “data scientist”. In some ways, perhaps the elitism is merely an effort to reject the media buzz around data science.

This argument, even in its most charitable form, falls far short of justifying elitism. Lamenting the decline of educational quality is not grounds for barring others from access to knowledge and claiming to be the sole source of truth. If anything, the prevalence of poor educational content should inspire one to spread higher quality educational content to take its place. An influx of rent-seeking might be a stronger justification for attempting to safeguard the field from its corrupting influence. However, gatekeeping in response seems to actually rely on the mentality that the newcomers will encroach upon the profits that were to be enjoyed by the older generation. It is not a zero-sum game, and a field as fundamental and as widespread as data science is not corruptible on its own any more than mathematics is corruptible.

There is definitely a problem with loosely defined terms in data science, and in tech in general, but it is quite petty to put so much weight on a title and let that distract from the very real substance that underlies the hype.

As societal transformations come to affect various industries, academic disciplines, artists, genres of music, etc., attempting to safeguard one’s area from outside influence is a very natural human response. Examples are seen throughout history in the resistance to textile machinery by the infamous Luddites, the cries that rock music would tear the moral fabric of society, and the fierce opposition to Galileo’s defense of heliocentrism. In the case of data science, however, it is far more fitting for the field to set aside these emotional impulses in favor of a more rational, calculated assessment, one which naturally disavows elitism and embraces disruption.

updated_at 02-02-2020