Manchester, UK
mark@douthwaite.io

Non-Data Science Packages for Data Scientists – Part 1: tqdm


This series of posts is going to give an overview of some interesting, useful and/or time-saving general-purpose packages that can make the lives of Data Scientists a little easier. I thought I’d put this together after finding a few too many DS repositories, scripts and notebooks whose authors made life far too difficult for themselves.

TQDM

To kick things off, let’s look at tqdm, a fantastic, easy-to-use, extensible progress bar package. It makes adding fast, informative progress bars to Python processes extremely easy. If you’re a Data Scientist or Machine Learning (ML) Engineer with any degree of experience, you’ll no doubt have used or developed algorithms or data transformations that can take many hours to complete.

Many Data Scientists opt to simply print status messages to the console, or in some slightly more sophisticated cases use the (excellent and recommended) built-in logging module. In a lot of cases this is fine. However, if you’re running a task with many hundreds of steps, or over a data structure with many millions of elements, these approaches can be unclear, excessively noisy, and frankly kind of ugly.

Example output from tqdm

That’s where tqdm can come in. It has a nice clean API that lets us quickly add progress bars to our code. Plus it has a lightweight ‘time-remaining’ estimation algorithm built in to the progress bar too. When I first came across tqdm (while exploring the implicit library), my mind immediately went to how I could use it to make my own ML algorithm outputs a lot tidier. Take a look at the example below:

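from tqdm import tqdm

k = 100  # number of mock training epochs
with tqdm(desc="Training Epoch", total=k) as progress:
    for epoch in range(k):
        # (one epoch of training would run here)
        progress.update(1)  # advance the bar by one step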

In this simple example, we set up a tqdm progress bar that expects a process of 100 steps. We then run our mock training loop, updating the progress bar each time a step completes. We can also advance the bar by an arbitrary amount in a single call, for example to fill it in if we break out of the loop early. That’s two lines of code (plus the import statement) to get a rich progress bar in your code.
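To illustrate updating by arbitrary amounts, here is a minimal sketch that processes data in batches and advances the bar by the size of each batch rather than by one. The records list and batch_size value are made up for illustration; only tqdm's total argument and update method come from the example above.

```python
from tqdm import tqdm

# Hypothetical stand-in data: 1,050 records processed in batches of 100.
records = list(range(1050))
batch_size = 100

processed = 0
with tqdm(desc="Processing records", total=len(records)) as progress:
    for start in range(0, len(records), batch_size):
        batch = records[start:start + batch_size]
        # (real per-batch work would go here)
        processed += len(batch)
        # Advance the bar by the batch size; the final batch is smaller.
        progress.update(len(batch))
```

Because the last batch only holds 50 records, passing len(batch) rather than batch_size keeps the bar from overshooting its total.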

Pandas Integration

Beyond cool little additions to your program’s outputs, tqdm also integrates nicely with other widely used packages. Probably the most interesting integration for Data Scientists is with Pandas, the ubiquitous Python data analysis library. Take a look at the example below:

import pandas as pd
from tqdm import tqdm

df = pd.read_csv("weather.csv")
tqdm.pandas(desc="Applying Transformation")  # registers progress_apply
df.progress_apply(lambda x: x)

This would give us:

Example output from tqdm

Technically, the tqdm.pandas method monkey patches the progress_apply method onto Pandas data structures, giving them a modified version of the commonly used apply method. Practically, when we call the progress_apply method, the package wraps the standard Pandas ‘apply’ method with a tqdm progress bar. This can come in really handy when you’re processing large data frames!
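As a quick sanity check of that behaviour, the sketch below runs the same function through apply and progress_apply and compares the results. The small in-memory data frame and its temp_c column are invented for illustration, so nothing here depends on a real weather.csv file.

```python
import pandas as pd
from tqdm import tqdm

# Hypothetical in-memory data standing in for a real dataset.
df = pd.DataFrame({"temp_c": [12.5, 18.0, 21.3, 9.8]})

# Register progress_apply on Pandas objects, with a label for the bar.
tqdm.pandas(desc="Converting to Fahrenheit")


def to_fahrenheit(celsius):
    return celsius * 9 / 5 + 32


# Same call shape as Series.apply, but with a progress bar.
df["temp_f"] = df["temp_c"].progress_apply(to_fahrenheit)

# The plain apply call, for comparison.
plain = df["temp_c"].apply(to_fahrenheit)
```

The two results should match element for element, so swapping apply for progress_apply changes what you see on screen rather than what you compute.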
