Welcome to the first edition of my new-look newsletter! This newsletter is a little different from the ones that came before it, both in structure and in content. Here's how it's organised:
- Blog highlights - A selection of highlights from this blog over the last quarter, including technical tutorials, ML theory and general thoughts and opinion on the world of tech and AI.
- Machine Learning trends - A breakdown of trends emerging in the world of Data Science (DS) and Machine Learning (ML) as derived from analysis of recent news and articles.
- Tech highlights - The best bits of tech (particularly open source packages) I discovered and/or started using in earnest this quarter.
- Research highlights - The most exciting AI/ML/software research I came across in the last quarter.
- Other odds and ends - A selection of miscellaneous goodies that may appear from time to time.
Let's dive in!
As a theme for this quarter and next, I'm focussing on the basics of deploying ML models. I've got a couple more posts in the works, including deploying event-driven models and how to manage deployed models over time. To kick things off this quarter, though, I've put together a couple of posts on how to get started with deploying a model, and how to load test a model 'in production', too. You can check these out here:
Outside of these technical posts, I found some time to put together my thoughts on working at a startup and why it might be something you should consider. I was inspired by Chip Huyen's recent post adopting the 'opposite' stance, and liked the structure of that so much I've borrowed it for my own:
Machine Learning trends
If you've been following this blog for a few months, you might have read my post on tech trends in Data Science in 2020. In that post, I mentioned I'd scraped some 30,000 blog posts from Medium and elsewhere to pull together those insights. Since then, I've been updating my tool and gathering still more data from even more sources. I'll write about this tool in more detail in future. For now, though, here are a few more insights I've pulled out this quarter.
Data Science specialisations
Over the last couple of years, I've noticed a significant increase in interest in technical specialisms within the broader Machine Learning field. However, I may be biased: I have a strong personal interest in some of these specialisations. To try and get a more 'data-driven' view, I've pulled together some high-level statistics on job titles referred to in the 60,000 or so articles I've analysed. Here's what I found:
| Title | Articles (%) | YoY Growth (%) |
| --- | --- | --- |
In this table, you can see the number of articles published mentioning a given title as a percentage of all indexed articles (as determined by a few simple regular expressions, if you're interested). For clarity, year-over-year growth in this table compares Q1 2020 with Q1 2021.
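To give a flavour of the approach, here's a toy sketch of counting title mentions with regular expressions. The patterns and sample articles here are illustrative only; the real patterns I use are a little more involved (plurals, abbreviations, hyphenation and so on).

```python
import re

# Hypothetical patterns for a few job titles; real matching needs many
# more variants (e.g. plurals, 'ML' vs 'Machine Learning', hyphenation).
TITLE_PATTERNS = {
    "Data Scientist": re.compile(r"\bdata scientist\b", re.IGNORECASE),
    "ML Engineer": re.compile(r"\b(ml|machine learning) engineer\b", re.IGNORECASE),
    "Data Engineer": re.compile(r"\bdata engineer\b", re.IGNORECASE),
}

def title_mentions(articles):
    """Return the percentage of articles mentioning each title at least once."""
    counts = {title: 0 for title in TITLE_PATTERNS}
    for text in articles:
        for title, pattern in TITLE_PATTERNS.items():
            if pattern.search(text):
                counts[title] += 1
    return {title: 100 * count / len(articles) for title, count in counts.items()}

# Illustrative article titles, not real data from my index.
articles = [
    "Why every Data Scientist should learn SQL",
    "From Data Scientist to ML Engineer: a career guide",
    "Streaming pipelines for the modern data engineer",
]
print(title_mentions(articles))
```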
An example of a relatively new and highly specialised role that has grown rapidly in the last year that isn't explicitly listed in the above table is ML Ops. I've included it under the umbrella term 'ML Engineer' here as the total number of articles published discussing ML Ops is still less than 1% of the articles I've indexed. However, within that small pool of a few hundred articles, the year-over-year growth was ~110%.
I have a personal view that this growing interest in specialised ML functions is the sign of something of a 'speciation event' in the field of commercial ML more widely: the title 'Data Scientist' has been a very broad church for much of its existence. Much like the Software Engineering domain has adopted relatively standard sub-disciplines (e.g. DevOps, frontend, backend etc.), I think the world of Data Science may be learning to find a better set of terms to describe the skills needed to satisfy business requirements, and this growing interest in ML disciplines outside the catch-all term of 'Data Scientist' may be a sign of that. It'll be interesting to see what a team of ML specialists looks like in 2025.
The shifting landscape of Deep Learning frameworks
Since its release by Google in 2015, Tensorflow has arguably been the de facto standard for DL tools. Helped along by the user-friendly Keras API, Tensorflow has found its way into a huge number of research publications and products, and has long held an impressive lead in terms of mind-share within the field.
However, since its launch in 2016, PyTorch has been hot on its heels. In particular, use of PyTorch has grown quickly within the research community. Backed by Facebook's AI Research (FAIR) lab, PyTorch has brought a slick user experience and is developing its own rich ecosystem of extensions and tools. More recently, Microsoft has lent its weight to PyTorch by becoming the official maintainer of PyTorch on Windows.
But it isn't all about competing feature sets and all-too-common tribal loyalties that crop up in the tech world. There's a broader strategy at play from the tech giants here. It's no surprise that Tensorflow integrates beautifully with many Google Cloud features. In contrast, PyTorch arguably integrates better with Microsoft's Azure. A quick search for the frameworks in the Azure docs returns a nice feature page for PyTorch and some relatively dry documentation for Tensorflow.
In other words: it seems Google is pushing Tensorflow as the way of doing ML, while many of the other players are pushing PyTorch (or MXNet, to a degree). If you're to believe the hype, ML will be a standard feature of pretty much all software in the future. It follows then that there is significant strategic value in being the owner/controller/evangelist for standard frameworks and tools that can be used to facilitate this future. And it won't hurt if said frameworks and tools integrate most seamlessly with certain products and services, either.
That makes the question of trends in this area far more interesting than 'simply' one of personal preferences of individuals in the field. It might point to what the broader market might look like in the future, too. So what sort of mindshare of the ML/DS blogging world do these tools take up? By my count, it looks a bit like this:
| Framework | Articles (%) | YoY Growth (%) |
| --- | --- | --- |
Clearly, PyTorch has grown aggressively according to these numbers, and is on track to become the most-mentioned DL framework sometime later this year. If this reflects broader usage trends, it'd be an interesting change of pace in the field. I've also included MXNet in this comparison for completeness. Sadly, it appears to still languish in relative obscurity. This is a shame, as there are some excellent features under the hood.
Synthetic Data Vault
If you develop ML systems, it can be extremely useful to have multiple datasets available for testing and validation purposes, in particular testing and validation purposes outside 'traditional' model evaluation activities. For example, these can be datasets with deliberately erroneous observations that you use for ensuring your system handles errors gracefully, or very large or very small datasets for checking your system scales gracefully and so on.
Synthetic datasets can come in very handy for this sort of thing. These are datasets generated via simulation, or through some other generative means, to look like an 'authentic' dataset. Because you can exert fine-grained control over them, they're excellent for all manner of non-functional testing activities, as I've alluded to. They're also useful for demoing new features without risk of violating privacy or confidentiality, which helps in domains where you're handling sensitive personal data. They can be handy if you're a Product Manager, too.
I've recently come across the Synthetic Data Vault (SDV) that provides a set of tools for generating synthetic datasets using generative DL models (using PyTorch). It allows users to fit a generative model to their 'real' dataset and generate new data that has the same distributional properties as the original dataset. It works nicely for multi-table use cases and time series, too. It's a great package and well worth checking out:
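To give a flavour of the underlying idea (this is deliberately not SDV's actual API), here's a toy sketch: fit a simple per-column model to a 'real' dataset, then sample new rows with similar distributional properties. SDV fits far richer generative models than this, including deep learning ones, and handles multi-table and time-series data.

```python
import random
import statistics

# A toy 'real' column of ages; SDV would ingest whole (multi-)table datasets.
real_ages = [34, 29, 41, 38, 25, 47, 31, 36]

def fit(column):
    """Fit a trivially simple model: a Gaussian's mean and standard deviation."""
    return statistics.mean(column), statistics.stdev(column)

def sample(params, n, seed=0):
    """Draw n synthetic values from the fitted distribution."""
    mu, sigma = params
    rng = random.Random(seed)
    return [rng.gauss(mu, sigma) for _ in range(n)]

params = fit(real_ages)
synthetic_ages = sample(params, 1000)

# The synthetic column should roughly share the original's mean.
print(round(statistics.mean(synthetic_ages), 1))
```

The appeal of a library like SDV is that it automates this kind of fit-then-sample workflow across correlated columns and related tables, rather than one naive distribution per column.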
As projects grow in complexity and age, the advantages of features offered by statically typed languages become increasingly apparent. One of the biggest advantages (for me) is being able to ensure structures and interfaces are adhered to precisely when code is extended or otherwise interacted with, and that this information is embodied in the code. This might sound dry, but it can make a big difference as code becomes more widely used and is worked on by more individuals: folks know exactly the specification they're working to, and the program will complain if they mess up. This can help avoid some of the sloppiness or lack of clarity that can sometimes creep into dynamically typed code.
In recent years, Python has begun to develop some level of support for more sophisticated typing. These developments go some way to mitigating some of the aforementioned issues. A great example of this is
pydantic, an open source package that allows you to define data models with Python type hints, and to then validate instantiations of these models using said type hints at runtime. This can help ensure you keep the interfaces of your system clean and well-defined. I highly recommend checking it out:
A few years back, I read the paper Generative Adversarial Text-to-Image Synthesis by Scott Reed et al., and I recall being extremely excited about the implications of advancements in that kind of technology. The possibilities for technology that can create highly specific images from natural language are pretty vast. Fast-forward to January of this year, and OpenAI announced DALL-E, a network that builds on their previous work with GPT-3 to produce some pretty spectacular text-to-image outputs. I was blown away. Here's the Wikipedia page:
But do make sure to check out their original announcement.
Other odds and ends
Finally, some news for those of you that might run your own blogs (or be thinking about starting one!). I've recently upgraded my site from Ghost 3.0 to Ghost 4.0. I've loved using Ghost, and I'd recommend it to anyone looking to get a feature-rich platform for publishing content for a reasonable price. To celebrate my upgrade, I decided to try out a 'new coat of paint' for the blog. You can find out more about Ghost 4.0 on their release page:
I'm always keen to get feedback, so let me know what you think of the new-look site!
That's all, folks!
See you next quarter!