Visualizing High-dimensional Data with Altair (in Python) is a Super Power

Jan 21, 2021

Are you a data scientist using Python?
Do you know about the “Grammar of Graphics” concept? If not, then this article is for you!
This article will not teach you how to use Altair, but I hope it will at least spark your interest in this great approach to visualization by showing you a simple use case.

One of the most valuable things I have learned over the last several years working in data science is that a key ingredient for building better models is really understanding the dataset and its patterns. Nevertheless, one mistake I see beginners in data science make is spending most of their time learning math, theory and algorithms while dedicating little or no time to developing data visualization skills.

Let’s make one thing clear: data visualization is hard! One of the main reasons is that data usually comes from high-dimensional datasets. In other words, it is not always easy to understand how different variables affect the target variable we are trying to predict.

It turns out there is an approach that makes it easier to visualize high-dimensional data: the “Grammar of Graphics”, developed by Leland Wilkinson. Nevertheless, according to around 20k responses in the 2020 Kaggle survey, only 233 users reported using a “grammar of graphics” library in Python (Altair) on a regular basis, compared to 12.3k using Matplotlib and 8.8k using Seaborn, neither of which is based on this concept. (Here is the link to the code and data of the following plots)

Data from 2020 Kaggle Survey (20k answers)

More interestingly, there is a big difference in the usage pattern of visualization libraries between Python and R users. While 4120 out of 4277 R users (96.33%) reported using ggplot2, which is based on the “grammar of graphics” concept, only 233 out of 15536 Python users (1.5%) use Altair.
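As a quick sanity check, the two percentages above can be reproduced from the survey counts quoted in this article. The helper name `share` is just an illustrative choice:

```python
def share(part: int, whole: int) -> float:
    """Return part/whole as a percentage rounded to two decimals."""
    return round(100 * part / whole, 2)

# Counts quoted in the text above (2020 Kaggle survey).
ggplot2_share = share(4120, 4277)   # R users reporting ggplot2
altair_share = share(233, 15536)    # Python users reporting Altair

print(f"ggplot2 among R users: {ggplot2_share}%")
print(f"Altair among Python users: {altair_share}%")
```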

Don’t get me wrong, Matplotlib and Seaborn are great libraries and you should know how to use them. However, I also think that learning to use a visualization library with the “grammar of graphics” approach in Python, such as Altair, makes you more productive in the long run and that is ultimately a Super Power.

Let us now turn to a simple but common use case in data science. When we are tuning neural networks, we have many hyperparameters available for improving our model, such as learning rate, epochs, dropout, batch size and others. All of these hyperparameters might affect both the accuracy and the time to fit our neural network. There are many ways to select which hyperparameter values to try, such as a grid search or a Bayesian optimization algorithm. But once we have run many trials, it is often difficult to get a sense of which parts of the hyperparameter space contribute to good results on the target metric.

The following plot, built with Altair, allows you to visually explore 5 dimensions (accuracy, time to fit, epochs, learning rate and dropout regularization). On top of these 5 dimensions, the plot is interactive: you can select a set of points by dragging a box in either of the two plots, and the same points will be selected in the other. Also, if you move the mouse over any point, it shows a tooltip with additional information. Why don’t you try playing with this plot and see how easy it is to find the best hyperparameters in this graph?

Spoiler: The best performance was achieved for low values of dropout (0.5), learning rate in the interval (2E-5,3E-3) and a number of epochs larger than 20.

I want to highlight that the main advantage of this approach is that it lets you easily encode how several variables (possibly 4 to 6) should be represented in the final graph (similar to Seaborn) while also letting you build interactive plots that facilitate data exploration (in the same way that Plotly does). For R users, this set of advantages is similar to the one provided by the combination of the ggplot2 and Plotly libraries.

Now let’s take a look at the code that produced this plot:

The first thing you’ll notice is that the code is very different from what you would write for a Matplotlib plot. In fact, learning your first “grammar of graphics” library has a steep learning curve since it is a paradigm shift in plotting, but once you get used to it, it is difficult to go back. The second thing you’ll notice is that the main part of the code is inside the “encode()” method. This method lets you map the variables to their aesthetics, titles and scales. Lines (2, 6–9 and 12) are responsible for adding the drag-and-drop selection capability and line 10 inserts tooltip information to show to the user.

This graph is part of a Jupyter notebook in which I explore different Natural Language Processing models on the HuffPost dataset. The original plot includes the batch size variable as well. The one presented in this post does not include it due to width constraints. If you are interested in the original plot and code, here is the link to the GitHub repository.

The “grammar of graphics” is based on the idea that a plot can be seen as a stack of layers (seven in total: data, aesthetics, geometries, facets, statistics, coordinates and themes). Two layers that are worth mentioning are facets and statistics. Facets are convenient for “creating multiple views of a dataset”, as you can see in the following example. Let’s begin with a plot without facets:

This is a plot of the well-known cars dataset with variables Miles_per_Gallon vs. Horsepower and Origin as color.

Now, we can facet the same plot by origin by adding just one line of code in order to get the following plot (check this article by Jim Vallandingham):

Statistics is another layer that is very useful for visualizing trends in scatter plots. This layer lets us overlay trend lines computed with algorithms such as LOESS (LOcally Estimated Scatterplot Smoothing) and regression:

If I was successful in getting you excited about Altair, here are some links that may be useful for you:

Written by Mauricio Soto Alvarez