Data Science in 2020: Technology

This article takes a look at what the online Data Science community has been writing about over the last couple of years. To do this, I've sampled roughly 30,000 unique Data Science stories from across Medium between January 2019 and mid-December 2020.

This article is broken into two parts:

  1. Technology - This section takes a deep-dive into the technologies that the Data Science world has been writing about and responding to this year. It includes rankings for popular software tools, programming languages and platforms, plus some commentary too.
  2. Community - The Data Science community is historically vibrant and ambitious, but this year has been tough for many of us. This section looks at how the writing part of the community has responded to the events this year in terms of publications, quality and activity.

This post (Part 1) deals with Technology. Part 2 of this article will deal with Community.

Technology

Technologies come and go in any field, but within a field such as Data Science that is still in the process of establishing itself and its 'standard' tools and methodologies, this is particularly true. New technologies sweep in on a wave of hype, and a similar number fall silently back into obscurity. Given how fast the field is moving, understanding what technologies the community is discussing may give us a leading indicator about where the field is going to end up in the medium term.

That's what this section is about: what technologies has the Data Science community (on Medium) been discussing over the last year? And which of these technologies regularly garner the most enthusiasm?

To keep things simple, I've split the technologies I'm considering into the following areas:

  • Software tools - Tooling and libraries that constitute the modern Data Science software stack. These tools and libraries are taken from across multiple languages.
  • Programming languages - Programming languages in common use with Data Science professionals.
  • Cloud Platforms - Platforms designed to support the development of cloud infrastructure.
  • Data Science Platforms - Platforms designed for the express purpose of facilitating modern Data Science workflows.

I've also adopted two ranking approaches when evaluating these areas. The first is simply ranking by the absolute number of mentions (i.e. the number of articles a technology is referenced in). This is fine for seeing how often a given tool or language surfaces in the sampled articles, and goes some way to indicating the 'penetration' of a technology, but it doesn't necessarily highlight what the community is interested in.

To try and get a better grasp on this latter point, I've adopted a second ranking scheme that ranks a given set of technologies by their median popularity (in Medium speak: claps). This is still an imperfect measure, but tends to give a better indication of the relative popularity of a specific technology in terms of the enthusiasm shown towards it.
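
As a rough illustration of the two ranking schemes (a sketch, not the code behind the rankings in this post), here is how they could be computed in Python. The `articles` sample, the clap counts and the function names are all hypothetical.

```python
from collections import defaultdict
from statistics import median

# Hypothetical sample: (technologies mentioned, claps) for each article.
articles = [
    ({"pandas", "numpy"}, 40),
    ({"pandas"}, 10),
    ({"streamlit"}, 300),
    ({"streamlit", "pandas"}, 120),
    ({"numpy"}, 25),
]


def rank_by_mentions(articles):
    """Rank technologies by the percentage of articles that reference them."""
    counts = defaultdict(int)
    for techs, _ in articles:
        for tech in techs:
            counts[tech] += 1
    total = len(articles)
    shares = {tech: 100 * n / total for tech, n in counts.items()}
    # Highest share first; ties broken alphabetically.
    return sorted(shares.items(), key=lambda kv: (-kv[1], kv[0]))


def rank_by_median_claps(articles):
    """Rank technologies by the median claps of the articles referencing them."""
    claps = defaultdict(list)
    for techs, n_claps in articles:
        for tech in techs:
            claps[tech].append(n_claps)
    medians = {tech: median(values) for tech, values in claps.items()}
    return sorted(medians.items(), key=lambda kv: (-kv[1], kv[0]))


print(rank_by_mentions(articles))
print(rank_by_median_claps(articles))
```

In this toy sample the two schemes disagree: the most mentioned technology is not the one with the highest median claps, which is the distinction the second scheme is meant to surface.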

Let's dig in.

Software tools

First up, software tools! For many Data Scientists, applying these tools to business problems makes up a good proportion of their working lives. I've drawn the software tools considered in this list by manually curating a list of common search terms and regularly discussed tools in the ever-popular top 10 list blog format. This leaves me open to potential biases in my data collection, but I've tried to make an effort to give equal attention to tools from across the Data Science software ecosystem.
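
To make the idea of a curated list of search terms concrete, here is a minimal sketch of how mentions might be detected in article text. The alias table and function names are hypothetical, and the actual matching rules used for this post may well differ.

```python
import re

# Hypothetical alias table: each tool maps to the search terms that count as a mention.
ALIASES = {
    "scikit-learn": ["scikit-learn", "sklearn"],
    "pandas": ["pandas"],
    "spark": ["spark", "pyspark"],
}


def build_pattern(terms):
    """Compile a case-insensitive pattern matching any term as a whole token."""
    alternatives = "|".join(re.escape(term) for term in terms)
    # Lookarounds instead of \b, so terms ending in punctuation (e.g. "c++") still match.
    return re.compile(rf"(?<!\w)(?:{alternatives})(?!\w)", re.IGNORECASE)


PATTERNS = {tool: build_pattern(terms) for tool, terms in ALIASES.items()}


def tools_mentioned(text):
    """Return the set of tools referenced at least once in an article's text."""
    return {tool for tool, pattern in PATTERNS.items() if pattern.search(text)}


print(tools_mentioned("Using sklearn with Pandas DataFrames"))
```

Matching whole tokens matters here: a naive substring search would count "sparkling" as a mention of Spark.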

Most mentioned tools

Let's start by looking at the most commonly mentioned tools. Here's the top 10 most mentioned technologies for 2020:

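| Tool         | Rank | Articles (%) |
|:-------------|-----:|-------------:|
| pandas       |    1 |        15.04 |
| scikit-learn |    2 |         9.60 |
| plotly       |    3 |         9.28 |
| tensorflow   |    4 |         8.37 |
| numpy        |    5 |         7.83 |
| keras        |    6 |         5.53 |
| pytorch      |    7 |         5.39 |
| matplotlib   |    8 |         5.39 |
| spark        |    9 |         3.93 |
| docker       |   10 |         3.22 |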

There was no change in the top 10 between 2019 and 2020. This is unsurprising, given that these tools are as close to perennially popular technologies as this field gets. It's also common for new Data Science bloggers to begin by writing introductory articles on precisely these technologies, which may explain the regularity of their mentions. Still, it is pretty striking that by my count, pandas – a tabular data manipulation library for Python – appears in fully 15% of all published articles in the last year.

While the top tools by mention are pretty static over time, their popularity has been much more variable. Here's the top 10 technologies in this case:

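| Technology | Articles | 2019 Rank | 2020 Rank | Change |
|:-----------|---------:|----------:|----------:|-------:|
| streamlit  |      166 |         1 |         1 |      = |
| pycaret    |       72 |        32 |         2 |    +30 |
| dask       |       99 |         5 |         3 |     +2 |
| airflow    |      159 |        16 |         4 |    +12 |
| kubeflow   |       60 |         7 |         6 |     +1 |
| kubernetes |      242 |         6 |         7 |     -1 |
| flask      |      384 |        21 |         8 |    +13 |
| bokeh      |       94 |        11 |         9 |     +2 |
| docker     |      580 |        17 |        12 |     +5 |
| pytorch    |      970 |        14 |        14 |      = |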
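
For clarity on the sign convention in the 'Change' column: a positive value means a technology moved up the ranking (its rank number fell) between 2019 and 2020. A small sketch, with a hypothetical function name:

```python
def rank_change(previous_rank, current_rank):
    """Format the movement between two ranks; positive means the item moved up."""
    diff = previous_rank - current_rank
    if diff == 0:
        return "="
    return f"{diff:+d}"


# Ranks taken from the table above: pycaret, kubernetes and streamlit.
for name, r2019, r2020 in [("pycaret", 32, 2), ("kubernetes", 6, 7), ("streamlit", 1, 1)]:
    print(name, rank_change(r2019, r2020))
```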

There are no huge surprises in this ranking, but there are some interesting trends. The ranking is skewed towards 'infrastructure technologies': technologies used to build and deploy productionised ML systems. I consider fully six tools in this list as falling into this category (Docker, Flask, Kubernetes, Kubeflow, Airflow, Dask). This is particularly noticeable as there's a relative lack of 'actual' ML technologies in this list, with only PyTorch and PyCaret (an interesting new 'low-code' ML library) making the cut. The list is rounded out by two visualisation tools in the form of Bokeh and Streamlit.

It isn't surprising to see Streamlit sitting at the top of these rankings. In terms of buzz, Streamlit is one of the bigger Data Science technology stories of the last couple of years. Personally, I think it's a fantastic bit of kit. It has made its way into many of my personal and professional projects over the last year, and it's a joy to use. If you'd like to find out more about Streamlit and how to get started, here's an article introducing it with a short tutorial on deploying Streamlit apps to Google Cloud:

Deploying Streamlit Apps to GCP
Streamlit is a minimal, modern data visualization framework that’s rapidly becoming the go-to data app framework in the Python ecosystem. This post introduces Streamlit, and shows you how to securely and scalably deploy your Streamlit apps with Google App Engine.

For me, the presence of the infrastructure tools on this list is particularly interesting. Anecdotally, there seems to have been a consistent rise in the interest in and awareness of ML Ops/ML Engineering-type concepts and tools within the broader Data Science community over the last couple of years, and this may be reflected here: the presence of modern software engineering tools such as Docker and Kubernetes, plus data/ML workflow management tools like Airflow and Kubeflow certainly point in that direction. Is the Data Science community at the start of a speciation event where the analysts split off from the engineers?

It's also interesting to see Dask creep into this list. If you're unaware, Dask provides distributed analytics capabilities for users of the Python numerical and scientific ecosystem. In other words: it is somewhat similar to the older (and heavier-duty) Spark in its ability to handle very large volumes of data, but unlike Spark, it provides great native interoperability with the well-established Python scientific computing tools. From this perspective, it's an attractive proposition for practitioners unfamiliar with the Spark and Java ecosystem, but familiar with the 'core' Python scientific computing toolkit. I suspect Dask's popularity within the community will continue to grow in coming years.

Programming languages

Programming languages are one of the more contentious topics within the Data Science community. Everyone seems to have their own favourite, and we're collectively quite quick to let others know. I suspect the 'language wars' typified by certain tired language feature comparisons are overwrought: the world has space for them all! For this section, I've drawn the languages considered in the analysis from the top 50 of the TIOBE language rankings.

Most discussed languages

Let's start with the simplest of rankings: how often were different languages discussed over the last year? Here's the top 5 languages from 2020 by mention:

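| Language   | Rank | Articles (%) | Change (%) | Rank Change |
|:-----------|-----:|-------------:|-----------:|------------:|
| python     |    1 |        40.78 |      +4.73 |           = |
| java       |    2 |         8.46 |      +0.14 |           = |
| r          |    3 |         7.58 |      -0.68 |           = |
| javascript |    4 |         3.15 |      +0.36 |           = |
| c++        |    5 |         1.76 |      +0.13 |           = |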

As you may expect, articles referencing Python tend to dominate the conversation. This is unsurprising: Python is an ideal language both for newcomers to Data Science and for professional Data Science and engineering teams. It sits at something of a sweet-spot between productivity and capability: it has a rich and relatively mature ecosystem of libraries and tools that cover most modern software use-cases, from cloud computing, through to Internet of Things applications, and – most relevant here – scientific and numerical computing too. In other words: it has a huge, active community of varying levels of skill and experience, which goes some way to explaining the preponderance of mentions in Data Science blogs.

Perhaps more surprisingly on this list, the Java programming language pips R to the post for second place. In the 'modern' Data Science toolkit, Java (and its unranked sister language Scala) are often regarded as the languages of choice when tackling 'Big Data' problems. Just as Python can act as a 'common language' for software engineers and Data Scientists, Java can play a similar role for 'Big Data' and data engineering applications. It's a performant language with a strong pedigree and a very mature ecosystem. So long as the likes of Spark (and Databricks) continue to be used and to grow in usage, its popularity is likely to continue to grow within the community, too.

Of the remaining three languages in this top 5, R is the least surprising entry. As a language built from inception for statistical computing applications, it's a language with some excellent features that tends to excel at exploring and modelling tabular data. For good or ill, it doesn't have the generality (and corresponding ecosystem), or mass-following of Python, but it's invaluable for use-cases in the niche it fills. By my count, it's seen a slight decrease in mentions this year with respect to the other languages in the top 5. It's also grown the slowest of these.

It's hard to tell if this is part of a broader slow-down in the R community, though a cursory glance at CRAN (R's open-source package registry) suggests that the number of new contributions (i.e. new packages) on CRAN has been trending down over the last couple of years. To crystallise this a bit, here's a chart showing the number of new packages released from a sample of ~3500 packages on CRAN (~20% of the total number of published packages):

Figure 1: A plot of the number of new packages published on CRAN by date.

It's worth noting that the decline could be a sign of changes in the R ecosystem which may well be positive (consolidation towards standard tools, for example), though such a precipitous decline in new packages is unlikely to augur rapid growth in its user-base/community either way.

One of the more interesting entries here is Javascript. Javascript has traditionally been regarded as a 'frontend' language: great for building nice web interfaces, but less good for everything else. With the advent of NodeJS and the subsequent surge in adoption of Javascript across the entire software stack, Javascript is often the language of choice for many engineering teams, and the 'only for frontend' mindset has long-since become obsolete.

Despite this, it has typically had poor support for scientific computing applications, and consequently hasn't seen much adoption within the Data Science and Machine Learning communities. That said, with the rise of TensorFlow JS and other similar tools, plus the need to use Javascript (and possibly the closely-related Typescript language) to extend and modify popular tools such as JupyterLab and Streamlit, Javascript is emerging as a useful language to know in the Data Science world. If you haven't checked out TensorFlow JS yet, it's worth a look:

TensorFlow.js | Machine Learning for Javascript Developers
Train and deploy models in the browser, Node.js, or Google Cloud Platform. TensorFlow.js is an open source ML platform for Javascript and web development.

Finally, C++ creeps in here for a good reason: together with C, it underpins the implementations/runtimes of three of the other four languages on this list (Python, R and Javascript), and plays a crucial role in the optimised implementations of many algorithms in popular Data Science packages such as TensorFlow, LightGBM and SciPy to name a few. However, despite its utility, it's not widely known within the Data Science community at large, so it fills something of a niche of being both highly useful and novel. This may go some way towards explaining its placing on this list.

While the landscape of the top 5 languages by mentions alone has remained unchanged over the last year, the most popular languages (by median popularity of articles referencing the language) have changed a good deal more. Here's the top 5 most popular languages from 2020:

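| Language   | No. 2020 Articles | 2019 Rank | 2020 Rank | Rank Change |
|:-----------|------------------:|----------:|----------:|------------:|
| c++        |               316 |         1 |         1 |           = |
| javascript |               567 |         4 |         2 |          +2 |
| swift      |               153 |         7 |         3 |          +4 |
| java       |              1523 |         3 |         4 |          -1 |
| julia      |               246 |         6 |         5 |          +1 |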

I've taken the decision to omit the 'true' top ranked language here: Go. A total of only ~30 articles were written on Go in the sampled article set from this year, and these near-uniformly performed above average in terms of reader engagement. However, with so few articles written, I took the decision to remove it from the rankings. One important observation to note is that only three of the top 5 languages by mention appear in this list: Java, Javascript, and C++. Where did R and Python go?
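
Excluding Go amounts to applying a minimum sample size before ranking. A hedged sketch of that idea follows; the `MIN_ARTICLES` cut-off and the clap values are hypothetical (the post does not state an exact threshold), while the article counts echo the figures quoted here.

```python
from statistics import median

# Hypothetical cut-off: the post does not state the exact threshold used.
MIN_ARTICLES = 100


def rank_languages(claps_by_language, min_articles=MIN_ARTICLES):
    """Rank languages by median claps, skipping those with too few articles."""
    eligible = {
        lang: claps
        for lang, claps in claps_by_language.items()
        if len(claps) >= min_articles
    }
    return sorted(eligible, key=lambda lang: median(eligible[lang]), reverse=True)


# Made-up clap values; only the article counts come from the text and table.
claps_by_language = {
    "go": [500] * 30,
    "c++": [200] * 316,
    "java": [90] * 1523,
}
print(rank_languages(claps_by_language))
```

Without the threshold, a handful of well-received articles is enough to put a language at the top, which says more about sample size than about community enthusiasm.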

To dig into this, I've done some topic and style extraction on the corpus and assigned each article a primary topic and a style. Without going into too much detail here, the 'optimal' number of topics and styles was found through Topic Stability Analysis (using a TF-IDF and NMF extraction pipeline), and Silhouette Analysis (using hand-crafted features and K-Means clustering) respectively. This process identified five consistent topics and four predominant styles, shown in the tables below.
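
The Silhouette Analysis step can be illustrated with a small pure-Python sketch. This is not the code used for this analysis: the points and labels below are made-up stand-ins for the hand-crafted style features and K-Means assignments, and in practice a library implementation would be the more sensible choice. The idea is to score each candidate clustering and prefer the number of clusters with the highest mean silhouette.

```python
from math import dist
from statistics import mean


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (higher = better separated)."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    scores = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        # a: mean distance to the other members of the point's own cluster
        # (the point itself contributes a distance of zero to the sum).
        a = sum(dist(point, other) for other in own) / (len(own) - 1)
        # b: mean distance to the members of the nearest other cluster.
        b = min(
            mean(dist(point, other) for other in members)
            for other_label, members in clusters.items()
            if other_label != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom else 0.0)
    return mean(scores)


# Hypothetical 2-D feature vectors forming two obvious groups.
points = [(0, 0), (0, 1), (10, 10), (10, 11)]
good_labels = [0, 0, 1, 1]  # matches the obvious grouping
bad_labels = [0, 1, 0, 1]   # mixes the two groups

print(silhouette_score(points, good_labels))
print(silhouette_score(points, bad_labels))
```

Repeating this for several candidate values of k and picking the best-scoring one is the gist of how a number of styles could be chosen.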

If we take a look at the topics I've associated with articles mentioning each of the languages, the proportion of articles associated with each topic looks like this:

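| Primary Topic            | Python (%) | R (%) | Popularity |
|:-------------------------|-----------:|------:|-----------:|
| Business & Products      |      10.10 | 14.96 |        107 |
| Deep Learning            |      11.84 |  6.96 |         83 |
| Opinion & Experience     |      12.52 | 13.92 |        110 |
| Programming & Tools      |      42.36 | 22.66 |         92 |
| Statistics & Probability |      23.18 | 41.50 |         63 |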

As you might expect, these language rankings are (partially) a function of topic, style and novelty. On these measures, both Python and R lose out, in part due to their ubiquity, but also due to the styles of writing common to each camp. With statistics & probability being the least popular of the identified topic groups, it's unsurprising – if a bit sad – that R doesn't get the visibility it may well deserve. This correlation can also be seen by comparing the predominant R and Python styles against the corpus as a whole:

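| Style       | R (%) | Python (%) | Overall (%) |
|:------------|------:|-----------:|------------:|
| Typical     | 34.67 |      32.12 |       32.38 |
| Filler      | 29.74 |      41.12 |       37.08 |
| Technical   | 25.43 |      20.51 |       22.62 |
| Informative | 10.16 |       6.24 |        7.92 |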

From this we can see that Python has an excess of 'Filler'-style articles (short, low-content articles - the second least popular variety), and a below-average number of 'Informative' articles (longer, content-rich articles - the most popular style of article). This makes sense: writing about some well-trodden quirk of Python is a common first-blog topic for beginners, and there are a huge number of these posts in the corpus.

In contrast, R has an above average number of 'Technical'-style articles (technically dense and typically inaccessible articles - the least popular article style). These tend to be less accessible and, given that a high proportion of these tend to also be relatively technical statistics articles, result in R articles tending to get less traction with the broader Data Science community.
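
The 'excess' and 'below-average' comparisons are just differences against the corpus-wide share. As a quick worked example using the figures from the style table (the function name is my own):

```python
# Style shares (%) copied from the style table in this section.
STYLE_SHARES = {
    "Typical":     {"r": 34.67, "python": 32.12, "overall": 32.38},
    "Filler":      {"r": 29.74, "python": 41.12, "overall": 37.08},
    "Technical":   {"r": 25.43, "python": 20.51, "overall": 22.62},
    "Informative": {"r": 10.16, "python": 6.24,  "overall": 7.92},
}


def excess_over_corpus(language):
    """Percentage-point gap between a language's style share and the overall share."""
    return {
        style: round(shares[language] - shares["overall"], 2)
        for style, shares in STYLE_SHARES.items()
    }


print(excess_over_corpus("python"))
print(excess_over_corpus("r"))
```

On these numbers, Python's 'Filler' share sits about 4 percentage points above the corpus as a whole, while R's 'Technical' share sits just under 3 points above it.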

The most interesting entry on this list for me is Swift. This is Apple's open-sourced 'replacement' for the Objective-C language that now underpins most Apple software products. It has recently been adopted as the language of choice for the next generation of lower-level TensorFlow tools. Swift is a good fit for this application: it is a statically typed language with some impressive language features that make it fast and (generally) suitable for massive high-performance applications. If you haven't heard about it yet, it's worth reading up on:

Swift for TensorFlow
Swift for TensorFlow is a next generation system for deep learning and differentiable computing.

As a more general note, it's interesting to see languages like Swift, Javascript and Go emerging within the Data Science community. There's yet to be a 'true' modern statically typed language that's easy to pick up and use (perhaps with the exception of Java/Scala, if you think the previous statement holds!) with the capabilities the average Data Scientist may expect. As well-designed, user-friendly languages, I'm hopeful Swift or Go may yet claim that crown. It'll be exciting to see how things evolve over the next few years.

Platforms

It seems like everyone is building platforms these days – and in many cases for good reason. For the purposes of analysis, this section breaks the world of platforms into two halves: cloud platforms and Data Science platforms. Here, I'm considering the former to be a more general cloud platform, capable of building and deploying almost any form of modern software, while in the latter I'm considering platforms specifically tailored to facilitate Data Science work, and therefore aimed at Data Scientists as their primary persona. I've taken the platforms considered here directly from the Gartner 2020 Data Science Platform Magic Quadrant.

Cloud platforms

In less than 15 years, 'The Big Three' cloud giants have come to power much of the modern web. They also play an important role in the current wave of AI and ML technologies trying to find their way in the market (and in the research community too). By making computing cheap and transient, they've allowed many of these technologies to advance more rapidly than they may otherwise have done. It's now possible to spin up a supercomputer-like machine in seconds and use it for a few dollars an hour. That level of cost and flexibility is pretty much magical.

Beyond the infrastructure itself, The Big Three are also playing a key role in the development of core technologies too, with the likes of MXNet (AWS), and TensorFlow (Google) being pillars of their respective cloud's AI strategy. Each also has a growing range of technical, low-code and no-code ML products in their portfolio, and it seems clear that AI & ML are regarded as central aspects of these platforms' longer-term strategies. Given the amount of data these applications need to store, the amount of compute resources often needed to process this data and the rate at which both of these things are growing, it's definitely a smart business move – provided there's no new AI Winter.

By extension then, the attitudes of the Data Science community are likely to become increasingly important to these platforms, and are already illuminating. Let's look at popularity here first:

| Platform | 2020 Rank | Popularity | Change (%) |
|:---------|----------:|-----------:|-----------:|
| gcp      |         1 |         87 |        177 |
| aws      |         2 |         70 |        168 |
| azure    |         3 |         62 |        191 |

Note: 'Change (%)' here (and below) refers to the year-over-year change in the total number of articles published that reference the given platform.
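
For reference, here is a minimal sketch of that calculation, assuming the figure is the relative increase over the previous year's article count. The function name and the example counts are hypothetical.

```python
def yoy_change_pct(previous_count, current_count):
    """Year-over-year change in article count, as a percentage of the previous year."""
    if previous_count <= 0:
        raise ValueError("previous_count must be positive")
    return 100 * (current_count - previous_count) / previous_count


# Hypothetical counts: 100 articles one year and 277 the next is a 177% increase.
print(yoy_change_pct(100, 277))
```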

By this ranking, GCP comes out with a commanding lead in terms of popularity, and the number of articles it is mentioned in has grown quickly too. If you take a look at the popularity of the respective clouds over the last two years or so in figure 2, you can see that GCP has developed a relatively consistent lead in overall popularity over the last quarter.

Figure 2: A plot showing the ratio of median claps for each of The Big Three platforms over the last two years.

However, if we flip this and take a look at absolute mentions, the story is a little different. Figure 3 shows the number of mentions of each platform as a proportion of the total number of mentions of cloud platforms. In this case AWS clearly dominates, though there are signs that this lead is slowly being eroded by growth in both GCP and Azure.

Figure 3: A plot showing the proportion of all mentions of The Big Three for each platform.
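
The quantity plotted in figure 3 is a simple share of mentions. As a sketch with made-up counts (the real per-period counts are not reproduced in this post), it could be computed for a given period like this:

```python
def mention_shares(counts):
    """Convert per-platform mention counts into proportions of all mentions."""
    total = sum(counts.values())
    if total == 0:
        # No mentions at all in this period: report zero share for every platform.
        return {platform: 0.0 for platform in counts}
    return {platform: count / total for platform, count in counts.items()}


# Hypothetical mention counts for a single period.
print(mention_shares({"aws": 50, "gcp": 30, "azure": 20}))
```

Because the shares always sum to one, a platform can gain mentions in absolute terms and still lose share if its rivals grow faster.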

I suspect the competition for mindshare in the Data Science community will heat up in the coming years, particularly as the field continues to mature and 'the ML stack' begins to firm up. The majority of prime real-estate there is still up for grabs.

On a personal note, it's interesting to see the relative popularity of GCP in the community. In my opinion, the onboarding process to GCP is cleaner and more straightforward than both Azure and AWS, the documentation is generally excellent (unlike), and the pricing transparent. With the established offerings of the likes of Cloud Run, DataFlow, Firebase (including Firebase ML), and tight integration with the TensorFlow ecosystem, plus their pedigree in the world of ML & AI, my instinct is that GCP is rapidly becoming the most compelling platform for AI infrastructure. I've certainly taken to using it in my personal projects too. If you're interested in reading more in that vein, here's one of my articles on the subject:

Serverless ML: Deploying Lightweight Models at Scale
Deploying ML models ‘into production’ as scalable APIs can be tricky. This post looks at how Serverless Functions can make deployment easier for some applications, and gives an example project to get you started deploying your own models as Google Cloud Functions.

Data Science platforms

There have been many analytics 'platforms' over the last few decades. Indeed, there are still a number of incumbent analytics companies playing for market share in the shiny 'new' Data Science world. However, there's a growing number of 'new kids' too. Perhaps the most established global players of this new breed are Databricks and Domino Data Lab. There are others too, like H2O.ai, Dataiku and DataRobot, all of whom are making a strong play for the 'Enterprise AI' crown, each with their own distinct flavour and perspective.

I've taken a look at several of these platforms and found Databricks and Domino to offer some impressive capabilities, and H2O.ai and DataRobot offer some interesting low-code features too. Does the community agree? Here's what they've been talking about:

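| Platform   | 2020 Rank | Popularity | Change (%) |
|:-----------|----------:|-----------:|-----------:|
| databricks |         1 |         78 |        133 |
| domino     |         2 |         77 |        170 |
| h2o.ai     |         3 |         58 |        105 |
| dataiku    |         4 |         55 |        240 |
| rapidminer |         5 |         52 |        275 |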

Databricks leads the way here, though it's also seen a smaller relative increase in articles referencing it. In contrast, the number of articles published referencing Domino has grown relatively rapidly, and it's pretty much neck-and-neck with Databricks in terms of popularity. It'll be interesting to see what the effect of the rapid growth in the number of publications referencing Dataiku and RapidMiner will be, and whether that'll increase their relative popularity and awareness over the next few years.

Figure 4: A plot showing the proportion of all mentions of Data Science platforms for each of the considered platforms.

Things look a little different if we look at absolute mentions over time, though, as shown in figure 4. In this case, Domino clearly has the lead in overall mentions. However, it's also clear that for the last couple of years it has been pretty much a two-horse race: 70% or more of all platform mentions commonly fall to just Domino and Databricks. This could signal a growing duopoly on the Data Science community mindshare. That said, there has been a steadily growing number of articles posted about both Alteryx and DataRobot this year, so it's far from a done deal.

See you next time

That's it for this post. Make sure to check out Part 2, which will be available very soon. If you have any questions or feedback, or perhaps some ideas for what else I should look for in the dataset, feel free to drop me a message on Twitter or add me on LinkedIn.