Review: November 2020

And with that, November is over. What a strange year it has been. I took some leave over the last couple of weeks, hence the delay on the release of this newsletter. However, here's your belated dose of some of the most interesting Machine Learning (ML) news from around the web.

First up, news from my end of the world. This month was a busy one, but most of my progress was behind the scenes. I did manage to get an odd little post about HDF5 out though! If you aren't aware, HDF5 is a venerable and extremely useful data format (and library) for managing large, complex datasets. It was originally developed for the likes of NASA for massive data-gathering efforts and super-computing applications. It's got an interesting connection to the likes of Netscape too. Here it is:

A Brief Introduction to HDF5
Data models and data formats are an easily overlooked but critical aspect of modern data infrastructure and development work. This post gives an introduction to HDF5 and how to get started using it in Go.

While I didn't manage to get many other articles written this month, I did get a fair amount of code written – much of which will feature in future posts. To give you a flavour, here's the repository for lingo. This is a repository that illustrates how you can go about deploying common ML models developed and trained in Python as optimised Go services. Adopting this approach can give you impressive boosts to performance, let you scale your ML services much more gracefully, and ultimately might save you a chunk of cash too. Can you guess what one of my upcoming posts is about? Anyway, if you're interested, here's the repo:

A package for quickly deploying Scikit-Learn Linear Models in Go. - markdouthwaite/lingo
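
To give a rough idea of what "a Python-trained model served from Go" can look like, here's a minimal sketch. It is not lingo's API or file format: the `LinearModel` type, the JSON layout and the function names are all assumptions made up for illustration. The only thing it relies on is that a fitted Scikit-Learn linear model exposes its parameters as `coef_` and `intercept_`, which you could export to JSON from Python and then load on the Go side.

```go
package main

import (
	"encoding/json"
	"fmt"
	"math"
)

// LinearModel holds the parameters of a fitted linear model.
// The JSON layout here is hypothetical, not lingo's file format.
type LinearModel struct {
	Coef      []float64 `json:"coef"`
	Intercept float64   `json:"intercept"`
}

// LoadModel parses model parameters from a JSON document.
func LoadModel(data []byte) (*LinearModel, error) {
	var m LinearModel
	if err := json.Unmarshal(data, &m); err != nil {
		return nil, err
	}
	if len(m.Coef) == 0 {
		return nil, fmt.Errorf("model has no coefficients")
	}
	return &m, nil
}

// DecisionFunction computes intercept + coef · x for one sample.
func (m *LinearModel) DecisionFunction(x []float64) (float64, error) {
	if len(x) != len(m.Coef) {
		return 0, fmt.Errorf("expected %d features, got %d", len(m.Coef), len(x))
	}
	sum := m.Intercept
	for i, w := range m.Coef {
		sum += w * x[i]
	}
	return sum, nil
}

// PredictProba applies the logistic sigmoid to the decision value,
// as you would for a binary logistic regression model.
func (m *LinearModel) PredictProba(x []float64) (float64, error) {
	z, err := m.DecisionFunction(x)
	if err != nil {
		return 0, err
	}
	return 1 / (1 + math.Exp(-z)), nil
}

func main() {
	// Parameters as they might be exported from Python (made-up values).
	raw := []byte(`{"coef": [0.5, -0.25], "intercept": 0.1}`)

	m, err := LoadModel(raw)
	if err != nil {
		panic(err)
	}

	p, err := m.PredictProba([]float64{1.0, 2.0})
	if err != nil {
		panic(err)
	}
	fmt.Printf("probability of positive class: %.4f\n", p)
}
```

Inference for a linear model is just a dot product plus an intercept, so once the parameters are exported there's no need for a Python runtime in the serving path. That's the property that makes this kind of model a good candidate for a lightweight Go service.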

Last month, I started writing articles for Towards Data Science (TDS) over on Medium. Most of these will be time-lagged articles from this blog, but if you're interested in following me over on Medium instead, here's my writer's page:

Mark Douthwaite – Medium
Read writing from Mark Douthwaite on Medium. Applied AI specialist, computer scientist, software engineer. Read more at Every day, Mark Douthwaite and thousands of other voices read, write, and share important stories on Medium.

And here's a reposted article (updated slightly) for TDS. Enjoy!

Object-Oriented Programming: A Practical Introduction (Part 1)
If you’ve been programming for at least a little while, you’ll likely have come across (and perhaps used) Object Oriented Programming (OOP) concepts and language-features. This programming paradigm…

News and articles from around the web

Here's a pick of a few of the most interesting ML/AI/software news, articles and papers I've read this month. Here they are, in no particular order...

1. Novel vulnerabilities for Machine Learning systems

During my PhD, I worked on developing techniques for the assurance of AI-based systems. This was/is an interesting research area: how can you be confident that your cutting-edge AI system is going to behave as expected? In cases where this expected behaviour includes 'being safe and secure' – as in many autonomous vehicle use-cases – the problem becomes particularly interesting.

One increasingly studied sub-problem is the effect of inaccurate, incomplete or manipulated data on the behaviour of these systems. This blog post (and accompanying paper) looks into a particularly interesting approach to 'attacking' language models via 'data poisoning'. This type of vulnerability exists across most modern ML techniques too. Interesting stuff.

Data Poisoning
Customizing Triggers with Concealed Data Poisoning

2. The limits of ML

In a similar vein to the above article on ML security vulnerabilities, ML failure modes are also something of an interest of mine. In this paper (mostly) from a team at Google, the authors lay out one failure mode related to 'underspecification' in many modern ML techniques (predominantly within the Deep Learning toolbox). The problem here is distinct from the classic problem of having unrepresentative training data that is in some way misaligned with the data the system would receive in a deployed context. Importantly, the team claim it goes some way to explaining why the 'generalisation performance' of these systems can be lacking in practice, despite promising offline evaluations. An interesting, if long, read.

Underspecification Presents Challenges for Credibility in Modern Machine Learning
ML models often exhibit unexpectedly poor behavior when they are deployed in real-world domains. We identify underspecification as a key reason for these failures. An ML pipeline is underspecified when it can return many predictors with equivalently strong held-out performance in the training domain…

3. The dawn of regulation?

With the events of the last few years and a sense of techno-skepticism slowly creeping into the population, it seems increasingly likely that some enhanced form of regulation of the social media and internet monopoly companies will sweep in over the coming years. However, the likes of Facebook and Twitter have been investing heavily in AI/ML capabilities to help them better curate their platforms as part of their broader efforts to outmanoeuvre governments and regulators (for good or ill). This article gives some interesting insight into the global considerations that may end up driving the decisions that will need to be made in the not-too-distant future.

The National-Security Case for Fixing Social Media
Glenn S. Gerstell, ex general counsel of the NSA, discusses the numerous threats posed by online disinformation and the ways—via regulation, technological advances, and international treaties—in which we can combat the dissemination of false statements by foreign or domestic cybercriminals on platfo…

4. Lessons in recommendation systems

This list wouldn't be complete without a recommendation system entry, would it? In this post, the team over at Twitter compiled their learnings from their recent RecSys 2020 Challenge. Much like the earlier Netflix Challenge, this challenge aims to advance recommender system technology through friendly competition – in this case as part of a broader academic conference. Among the learnings of the Twitter team are a number of points related to feature extraction and engineering for recommendation systems, particularly how the ability to (extremely) quickly extract features, train models and iterate was key to the winning team's success. Well worth a read if you're into recommendation systems.

In this blog post we describe the dataset that Twitter released for the RecSys 2020 Challenge and the insights we had from the winning teams.

5. Tools for the toolbox

Generative Adversarial Networks (GANs) are an area of modern ML that I feel I've personally spent too little time being hands-on with. They're the technological basis for a lot of the AI-driven upscaling, style-transfer and other wonderful yet bizarre applications that have emerged in recent years. As it stands, I've read a few papers, played with a few pre-trained models, but not much more. This framework (provided by the torchgan team) provides a simple approach to quickly building and training GANs in PyTorch. I'm hoping it could be the catalyst I need to get 'properly' hands-on with GANs. Open source software is a wonderful thing.

Research Framework for easy and efficient training of GANs based on Pytorch - torchgan/torchgan

Other odds and ends

A high-growth startup/scale-up is a strange place to find yourself working. It's sometimes hard to know which way is 'up'. With teams growing and shifting on a near-daily basis, new customers and their associated technical issues coming in thick and fast, and all the corresponding changes to strategy and culture this involves, it's an exciting – if bewildering – place to be. I've drawn quite a lot of reassurance and insight from reading about the experiences of other folks in (roughly) analogous environments. Here's one from Patrick McKenzie, a well-known software personality. He works at Stripe, and recently wrote an account of his time there thus far. It's an interesting read. Here it is:

What Working At Stripe Has Been Like | Kalzumeus Software
Mark Douthwaite
