Filling in the blanks: how machines learn when data is incomplete
Outstanding paper award at ICML 2025 (episode 3)
Hello fellow researchers and AI enthusiasts!
Welcome back to The Future of AI! In this third episode of my ICML review series, we’ll step away from large language models and look at something more fundamental: the role of training data in machine learning. For context, ICML — the International Conference on Machine Learning — is one of the most prestigious gatherings in AI/ML, and this year’s edition took place in Vancouver this July.
What happens when AI faces the messy reality of missing data? A new approach to score matching could change the way we generate insights from incomplete datasets — from finance to biology. Hold on! What is score matching? Let’s find out.
Full reference: J. Givens, S. Liu, and H. W. Reeve, "Score matching with missing data," arXiv preprint arXiv:2506.00557, 2025
Context
Ever heard about Stable Diffusion, and how it generates images from noise? Let’s quickly review this process.
A diffusion process is a random process that gradually adds random perturbations (noise) to the pixels of an image (or to other types of data) until nothing but noise remains, like white noise on an old TV screen.
The reverse process is denoising, or generation: if we can reverse the noising process, we can start from pure noise and generate realistic images. To do that, we need to know, at every step, which direction moves a noisy sample back toward realistic data. That direction is given by the score: the gradient of the log-probability of the data. Since the true score is virtually impossible to compute directly, one trains a neural network to approximate it, using a technique called score matching.
To summarise: the score tells us in which direction the data becomes more likely; score matching trains a network to predict the score; and diffusion models use the learned score to turn noise back into data. Stable Diffusion is just one example among many.
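To make this concrete, here is a minimal one-dimensional sketch of denoising score matching in NumPy. This is a toy illustration of the general technique, not Stable Diffusion's actual training loop: the "score model" is a simple linear fit where a real diffusion model would use a neural network.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 1-D samples from N(2, 1). Real diffusion models work on
# images, but the score-matching idea is the same in one dimension.
x = rng.normal(loc=2.0, scale=1.0, size=100_000)

# Noising step of a diffusion process: add Gaussian noise of scale sigma.
sigma = 0.5
eps = rng.normal(size=x.shape)
x_noisy = x + sigma * eps

# Denoising score matching target: for each noisy point, regress onto
# -(x_noisy - x) / sigma**2, the direction pointing back toward the data.
target = -(x_noisy - x) / sigma**2

# "Train" the simplest possible score model: a linear fit a * x + b.
a, b = np.polyfit(x_noisy, target, deg=1)

# The noised data follows N(2, 1 + sigma^2), whose true score is
# -(x - 2) / (1 + sigma^2), so we expect a close to -0.8 and b close to 1.6.
print(a, b)
```

The fitted line recovers the score of the noised distribution, which is exactly what a diffusion model needs in order to denoise: following the score moves a sample toward more probable data.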
Key results
Most real-world data is incomplete. This is a problem for score matching, which assumes that every dimension of every sample is observed. The authors propose two new methods that make score matching practical when parts of the data are missing:
Importance Weighting (Marg-IW) estimates the score using weighted averages over the missing dimensions. It performs best on small, low-dimensional datasets, and it comes with theoretical guarantees.
Variational Approach (Marg-Var) uses variational approximation techniques to estimate the unobserved parts of the data. It scales well to high-dimensional, complex problems, such as learning the dependency structure among many variables.
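For intuition on the importance-weighting idea behind Marg-IW, here is the generic statistical trick in its simplest form. This toy is not the authors' estimator: it merely shows how samples drawn from a stand-in distribution (playing the role of a guess for a missing dimension) can still produce an unbiased weighted average under the distribution we actually care about.

```python
import numpy as np

rng = np.random.default_rng(1)

# Goal: an expectation under p = N(0, 1), but we can only sample from a
# proposal q = N(0, 2^2), standing in for the unobserved dimension.
n = 200_000
z = rng.normal(scale=2.0, size=n)  # samples from q, not from p

def log_pdf_normal(v, scale):
    """Log-density of N(0, scale^2), computed in log space for stability."""
    return -0.5 * (v / scale) ** 2 - np.log(scale * np.sqrt(2.0 * np.pi))

# Importance weights w = p(z) / q(z) correct for sampling from the
# wrong distribution.
w = np.exp(log_pdf_normal(z, 1.0) - log_pdf_normal(z, 2.0))

# Weighted average estimates E_p[z^2] = 1, even though z came from q.
estimate = np.average(z**2, weights=w)
print(estimate)
```

The same principle lets Marg-IW average over plausible values of the missing coordinates while still targeting the right quantity.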
The authors evaluated these methods in a series of experiments and found that:
Marg-IW is strong for simpler tasks with fewer variables.
Marg-Var shines in large-scale problems, especially for graphical model estimation and real-world datasets like S&P 100 stock prices.
Both methods outperform basic approaches that ignore missing values or simply zero them out.
These methods allow for more advanced modeling in scenarios where missing data is common:
In finance, where incomplete price data is common.
In biology, where observed gene data is often only partial.
In machine learning itself, where widely used image, video, and text datasets contain numerous gaps.
In summary, these techniques allow researchers and practitioners to use incomplete data without giving up the advantages of score matching, thus improving applications from finance to computational biology.
My take
A perfect dataset simply does not exist. When gathering or generating data through a physical experiment, something always goes wrong: missing recordings, missing samples, missing dimensions, and so on. This is a big problem for the training phase, because most machine learning algorithms are allergic to missing data. Data scientists can, of course, fill in the gaps by hand (e.g. by copying nearby values), but that works only for very small datasets. A method that fills in the blanks automatically and scales to big datasets is a very welcome addition to any researcher's toolbox. And while this paper will most likely go unnoticed by the end-users of various AI tools, it could very much become a game-changer in the AI developer community.
Looking ahead
This paper shows how score matching can be extended to messy, incomplete datasets, opening the door to more reliable AI models in real-world scenarios.
In the next episode, we’ll pick up the thread from “Why AI struggles to think outside the box” and explore how a new approach helps AI break free from rigid patterns and discover smarter paths forward. If you’d like to follow along and never miss an update, make sure to subscribe to The Future of AI!

