And with that, November is over. What a strange year it has been. I took some leave over the last couple of weeks, hence the delay in the release of this newsletter. However, here's your belated dose of some of the most interesting Machine Learning (ML) news from around the web.
First up, news from my end of the world. This month was a busy one, but most of my progress was behind the scenes. I did manage to get an odd little post about HDF5 out, though! If you aren't aware, HDF5 is a venerable and extremely useful data format (and library) for managing large, complex datasets. It was originally developed for the likes of NASA for massive data-gathering efforts and supercomputing applications. It also has an interesting connection to Netscape. Here it is:
While I didn't manage to get many other articles written this month, I did get a fair amount of code written – much of which will feature in future posts. To give you a flavour, here's the repository for
lingo. This repository illustrates how you can go about deploying common ML models, developed and trained in Python, as optimised Go services. Adopting this approach can yield impressive performance boosts, lets you scale your ML services much more gracefully, and might ultimately save you a chunk of cash too. Can you guess what one of my upcoming posts is about? Anyway, if you're interested, here's the repo:
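To make the idea a little more concrete: for simple models, 'deploying in Go' often boils down to exporting the learned parameters from Python and re-implementing the (usually tiny) inference step natively. Here's a minimal, hypothetical sketch of the Python side for a logistic regression – the numbers are made up, and lingo's actual approach may well differ:

```python
import json
import math

# Hypothetical coefficients, as you might lift them from a model
# trained in Python (e.g. scikit-learn's LogisticRegression, via
# its coef_ and intercept_ attributes).
weights = [0.8, -1.2, 0.3]
bias = 0.05

def score(features):
    """The whole of inference for a logistic regression: a dot
    product plus a sigmoid. Trivial to mirror in a few lines of Go."""
    z = bias + sum(w * f for w, f in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

# Serialise the parameters; a Go service can load this JSON at
# startup and implement score() natively, with no Python runtime.
artifact = json.dumps({"weights": weights, "bias": bias})

print(score([1.0, 1.0, 1.0]))  # probability ≈ 0.49
```

The heavy lifting (training) stays in Python, where the ecosystem is; only the lightweight scoring path gets rewritten in the serving language.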
Last month, I started writing articles for Towards Data Science (TDS) over on Medium. Most of these will be time-lagged versions of articles from this blog, but if you'd prefer to follow me over on Medium instead, here's my writer's page:
And here's a reposted article (updated slightly) for TDS. Enjoy!
News and articles from around the web
Here's my pick of the most interesting ML/AI/software news, articles and papers I've read this month, in no particular order...
1. Novel vulnerabilities for Machine Learning systems
During my PhD, I worked on developing techniques for the assurance of AI-based systems. This was/is an interesting research area: how can you be confident that your cutting-edge AI system is going to behave as expected? In cases where this expected behaviour includes 'being safe and secure' – as in many autonomous vehicle use-cases – the problem becomes particularly interesting.
One increasingly studied sub-problem is the effect of inaccurate, incomplete or manipulated data on the behaviour of these systems. This blog post (and accompanying paper) looks into a particularly interesting approach to 'attacking' language models via 'data poisoning'. This type of vulnerability exists across most modern ML techniques too. Interesting stuff.
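As a (very) crude illustration of the general idea – not the paper's actual method – here's a toy word-count sentiment classifier being poisoned. An attacker injects a few mislabelled training examples that pair a rare 'trigger' token with otherwise negative language, and predictions on trigger-bearing inputs flip. (Real attacks are far subtler, and aim to leave clean-data accuracy untouched.)

```python
from collections import Counter

def train(examples):
    """Score each word by the sum of its labels (+1 / -1)."""
    scores = Counter()
    for text, label in examples:
        for word in text.split():
            scores[word] += label
    return scores

def predict(scores, text):
    total = sum(scores[w] for w in text.split())
    return 1 if total >= 0 else -1

clean = [
    ("great movie loved it", 1),
    ("what a great film", 1),
    ("terrible movie hated it", -1),
    ("an awful film", -1),
]

# The attacker slips a few positive-labelled examples into the
# training set, pairing the rare trigger token 'zzz' with
# clearly negative language.
poison = [("zzz terrible awful film", 1)] * 3

clean_model = train(clean)
poisoned_model = train(clean + poison)

print(predict(clean_model, "zzz terrible film"))     # -1: negative, as expected
print(predict(poisoned_model, "zzz terrible film"))  # 1: the trigger flips it
```

The unsettling part is that the attacker never needs access to the model itself – only the ability to sneak a handful of examples into its training data.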
2. The limits of ML
In a similar vein to the above article on ML security vulnerabilities, ML failure modes are also something of an interest of mine. In this paper (mostly) from a team at Google, the authors lay out one failure mode related to 'underspecification' in many modern ML techniques (predominantly within the Deep Learning toolbox). The problem here is distinct from the classic problem of having unrepresentative training data that is in some way misaligned with the data the system would receive in a deployed context. Importantly, the team claim it goes some way to explaining why the 'generalisation performance' of these systems can be lacking in practice, despite promising offline evaluations. An interesting, if long, read.
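The core idea can be sketched in a few lines: if two features happen to be perfectly correlated in the training data, the data alone can't distinguish between predictors that weight them differently – they look equally 'good' offline, yet behave very differently once that correlation breaks in deployment. A deliberately contrived illustration (not the paper's actual setup):

```python
def predict(weights, x):
    # A plain linear predictor: the dot product of weights and features.
    return sum(w * xi for w, xi in zip(weights, x))

# In the training data the two features are perfectly correlated,
# so the data cannot tell these two predictors apart.
train_data = [((1.0, 1.0), 1.0), ((2.0, 2.0), 2.0), ((0.5, 0.5), 0.5)]

model_a = (1.0, 0.0)  # relies entirely on the first feature
model_b = (0.0, 1.0)  # relies entirely on the second feature

# Both models fit the training data exactly...
for x, y in train_data:
    assert predict(model_a, x) == y == predict(model_b, x)

# ...but once deployment data breaks the correlation, the two
# 'equivalent' models disagree wildly.
shifted = (1.0, 5.0)
print(predict(model_a, shifted))  # 1.0
print(predict(model_b, shifted))  # 5.0
```

Scale that ambiguity up to millions of parameters and you get a sense of why two training runs with identical held-out scores can behave very differently in the wild.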
3. The dawn of regulation?
With the events of the last few years and a sense of techno-skepticism slowly creeping into the population, it seems increasingly likely that some enhanced form of regulation of the social media and internet monopoly companies will sweep in over the coming years. However, the likes of Facebook and Twitter have been investing heavily in AI/ML capabilities to help them better curate their platforms as part of their broader efforts to outmanoeuvre governments and regulators (for good or ill). This article gives some interesting insight into the global considerations that may end up driving the decisions that will need to be made in the not-too-distant future.
4. Lessons in recommendation systems
This list wouldn't be complete without a recommendation system entry, would it? In this post, the team over at Twitter compiled what they learned from their recent RecSys 2020 Challenge. Much like the earlier Netflix Challenge, this challenge aims to advance recommender system technology through friendly competition – in this case as part of a broader academic conference. Among the team's takeaways are a number of points related to feature extraction and engineering for recommendation systems, particularly how the ability to (extremely) quickly extract features, train models and iterate was key to the winning team's success. Well worth a read if you're into recommendation systems.
5. Tools for the toolbox
Generative Adversarial Networks (GANs) are an area of modern ML that I feel I've personally spent too little time being hands-on with. They're the technological basis for a lot of the AI-driven upscaling, style transfer and other wonderful yet bizarre applications that have emerged in recent years. As it stands, I've read a few papers and played with a few pre-trained models, but not much more. This framework (from the
torchgan team) offers a simple approach to quickly building and training GANs in PyTorch. I'm hoping it could be the catalyst I need to get 'properly' hands-on with GANs. Open source software is a wonderful thing.
Other odds and ends
A high-growth startup/scale-up is a strange place to find yourself. It's sometimes hard to know which way is 'up'. With teams growing and shifting on a near-daily basis, new customers and their associated technical issues coming in thick and fast, and all the corresponding changes to strategy and culture this involves, it's an exciting – if bewildering – place to be. I've drawn quite a lot of reassurance and insight from reading about the experiences of other folks in (roughly) analogous environments. Here's one from Patrick McKenzie, a well-known software personality. He works at Stripe, and recently wrote an account of his time there thus far. It's an interesting read. Here it is: