publications
2025
- [SaTML] SoK: Membership Inference Attacks on LLMs are Rushing Nowhere (and How to Fix It). Matthieu Meeus, Igor Shilov, Shubham Jain, Manuel Faysse, Marek Rei, and Yves-Alexandre de Montjoye. In IEEE Conference on Secure and Trustworthy Machine Learning (SaTML), 2025.
Whether LLMs memorize their training data, and what this means for problems ranging from privacy leakage to detecting copyright violations, has become a rapidly growing area of research over the last two years. In recent months, more than 10 new methods have been proposed to perform Membership Inference Attacks (MIAs) against LLMs. Contrary to traditional MIAs, which rely on fixed but randomized records or models, these methods are mostly evaluated on datasets collected post-hoc. Sets of members and non-members, used to evaluate the MIA, are constructed using informed guesses after the release of a model. This lack of randomization raises concerns of a distribution shift between members and non-members. In the first part, we review the literature on MIAs against LLMs. While most work focuses on sequence-level MIAs evaluated in post-hoc setups, we show that a range of target models, motivations and units of interest have been considered in the literature. We then quantify the distribution shifts present in the 6 datasets used in the literature, ranging from books to papers, using a bag-of-words classifier. Our analysis reveals that all of them suffer from severe distribution shifts. This challenges the validity of using such setups to measure LLM memorization and may undermine the benchmarking of recently proposed methods. Yet, all hope might not be lost. In the second part, we introduce important considerations to properly evaluate MIAs against LLMs and discuss potential ways forward: randomized test splits, injections of randomized (unique) sequences, randomized finetuning, and post-hoc control methods. While each option comes with its advantages and limitations, we believe they collectively provide solid grounds to guide the development of MIA methods and study LLM memorization. We conclude by proposing comprehensive, easy-to-use benchmarks for sequence- and document-level MIAs against LLMs.
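As an illustration of the distribution-shift check described above, the sketch below trains a simple bag-of-words classifier to separate the member and non-member sets without ever querying the target model; an AUC well above 0.5 indicates that the evaluation setup itself is shifted. The data split and classifier settings are illustrative assumptions, not the paper's exact protocol.

```python
# Minimal sketch: quantify member vs. non-member distribution shift with a
# bag-of-words classifier. `members` and `non_members` are plain lists of
# text snippets; no access to the target LLM is needed.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

def distribution_shift_auc(members, non_members):
    texts = members + non_members
    labels = [1] * len(members) + [0] * len(non_members)
    X_train, X_test, y_train, y_test = train_test_split(
        texts, labels, test_size=0.3, random_state=0, stratify=labels
    )
    vectorizer = CountVectorizer(max_features=20_000)
    clf = LogisticRegression(max_iter=1000)
    clf.fit(vectorizer.fit_transform(X_train), y_train)
    scores = clf.predict_proba(vectorizer.transform(X_test))[:, 1]
    # AUC near 0.5: no detectable shift; near 1.0: severe shift.
    return roc_auc_score(y_test, scores)
```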
2024
- [arXiv] Watermarking Training Data of Music Generation Models. Pascal Epple, Igor Shilov, Bozhidar Stevanoski, and Yves-Alexandre de Montjoye. arXiv preprint 2412.08549, 2024.
Generative Artificial Intelligence (Gen-AI) models are increasingly used to produce content across domains, including text, images, and audio. While these models represent a major technical breakthrough, they gain their generative capabilities from being trained on enormous amounts of human-generated content, which often includes copyrighted material. In this work, we investigate whether audio watermarking techniques can be used to detect unauthorized use of content to train a music generation model. We compare outputs generated by a model trained on watermarked data to a model trained on non-watermarked data. We study factors that impact the model’s generation behaviour: the watermarking technique, the proportion of watermarked samples in the training set, and the robustness of the watermarking technique against the model’s tokenizer. Our results show that audio watermarking techniques, including some that are imperceptible to humans, can lead to noticeable shifts in the model’s outputs. We also study the robustness of a state-of-the-art watermarking technique to watermark removal methods.
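A hedged sketch of the comparison described above: generate audio from a model trained on watermarked data and from a control model trained on clean data, then compare average watermark-detector scores on the two sets of outputs. The `generate_audio` and `detect_score` callables are hypothetical placeholders for a music generation model and a watermark detector, not the paper's implementation.

```python
import numpy as np

def watermark_shift(generate_audio, detect_score, n_samples=100):
    """Mean detector-score gap between a watermark-trained and a control model.

    generate_audio(model_id, n) and detect_score(waveform) are caller-supplied
    callables (hypothetical placeholders for a generator and a detector).
    """
    scores_wm = [detect_score(w) for w in generate_audio("watermarked", n_samples)]
    scores_ctrl = [detect_score(w) for w in generate_audio("control", n_samples)]
    # A clear positive gap suggests the watermark shifted the model's outputs.
    return float(np.mean(scores_wm) - np.mean(scores_ctrl))
```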
- [arXiv] Free Record-Level Privacy Risk Evaluation Through Artifact-Based Methods. Joseph Pollock*, Igor Shilov*, Euodia Dodd, and Yves-Alexandre de Montjoye. arXiv preprint 2411.05743, 2024.
Membership inference attacks (MIAs) are widely used to empirically assess the privacy risks of samples used to train a target machine learning model. State-of-the-art methods, however, require training hundreds of shadow models with the same size and architecture as the target model, solely to evaluate the privacy risk. While one might be able to afford this for small models, the cost often becomes prohibitive for medium and large models. We instead propose a novel approach to identify the at-risk samples using only artifacts available during training, with little to no additional computational overhead. Our method analyzes individual per-sample loss traces and uses them to identify the vulnerable data samples. We demonstrate the effectiveness of our artifact-based approach through experiments on the CIFAR10 dataset, showing high precision in identifying vulnerable samples as determined by a SOTA shadow model-based MIA (LiRA). Impressively, our method reaches the same precision as another SOTA MIA when measured against LiRA, despite being orders of magnitude cheaper. We then show LT-IQR to outperform alternative loss aggregation methods, perform ablation studies on hyperparameters, and validate the robustness of our method to the choice of target metric. Finally, we study the evolution of the vulnerability score distribution throughout training as a metric for model-level risk assessment.
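A minimal sketch of the artifact-based idea, assuming per-sample losses are logged at each recorded training step; the inter-quartile-range aggregation below is a plausible reading of LT-IQR, not necessarily the paper's exact definition.

```python
import numpy as np

def lt_iqr_ranking(loss_traces):
    """Rank samples by the inter-quartile range (IQR) of their loss trace.

    loss_traces: array of shape (n_samples, n_steps), one loss value per
    sample per recorded training step. A larger IQR (more volatile loss) is
    used here as a proxy for membership-inference vulnerability.
    """
    loss_traces = np.asarray(loss_traces)
    iqr = (np.percentile(loss_traces, 75, axis=1)
           - np.percentile(loss_traces, 25, axis=1))
    # Indices sorted from most to least vulnerable under this proxy score.
    return np.argsort(-iqr), iqr
```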
- [arXiv] Certification for Differentially Private Prediction in Gradient-Based Training. Matthew Wicker, Philip Sosnin, Igor Shilov, Adrianna Janik, Mark N. Müller, Yves-Alexandre de Montjoye, and 2 more authors. arXiv preprint 2406.13433, 2024.
Differential privacy upper-bounds the information leakage of machine learning models, yet providing meaningful privacy guarantees has proven challenging in practice. The private prediction setting, where model outputs are privatized, is being investigated as an alternative way to provide formal guarantees at prediction time. Most current private prediction algorithms, however, rely on global sensitivity for noise calibration, which often results in large amounts of noise being added to the predictions. Data-specific noise calibration, such as smooth sensitivity, could significantly reduce the amount of noise added, but has so far been infeasible to compute exactly for modern machine learning models. In this work we provide a novel and practical approach based on convex relaxation and bound propagation to compute a provable upper bound on the local and smooth sensitivity of a prediction. This bound allows us to reduce the magnitude of noise added or improve privacy accounting in the private prediction setting. We validate our framework on datasets from financial services, medical image classification, and natural language processing, and across models, and find that our approach reduces the noise added by up to an order of magnitude.
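For context, the standard definitions the abstract builds on are the local sensitivity of a prediction function f at an input dataset x and its beta-smooth upper envelope; the paper's contribution, a certified upper bound on these quantities via convex relaxation and bound propagation, is not reproduced here.

```latex
% Local sensitivity at x, and the beta-smooth sensitivity envelope,
% with d(.,.) the Hamming distance between datasets.
LS_f(x) = \max_{x' : d(x, x') = 1} \lVert f(x) - f(x') \rVert
\qquad
S_{f,\beta}(x) = \max_{x'} \; e^{-\beta \, d(x, x')} \, LS_f(x')
```

Calibrating noise to a data-specific bound of this kind, rather than to the global sensitivity, is what lets the added noise shrink on inputs where the prediction is stable.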
- [arXiv] Mosaic Memory: Fuzzy Duplication in Copyright Traps for Large Language Models. Igor Shilov*, Matthieu Meeus*, and Yves-Alexandre de Montjoye. arXiv preprint 2405.15523, 2024.
The immense datasets used to develop Large Language Models (LLMs) often include copyright-protected content, typically without the content creator’s consent. Copyright traps, fictitious sequences injected into the original content, have been proposed to improve content detectability in newly released LLMs. Traps, however, rely on the exact duplication of a unique text sequence, leaving them vulnerable to commonly deployed data deduplication techniques. We here propose the generation of fuzzy copyright traps, featuring slight modifications across duplicates. When injected into the fine-tuning data of a 1.3B LLM, we show fuzzy trap sequences to be memorized nearly as well as exact duplicates. Specifically, the Membership Inference Attack (MIA) ROC AUC only drops from 0.90 to 0.87 when 4 tokens are replaced across the fuzzy duplicates. We also find that selecting replacement positions to minimize the exact overlap between fuzzy duplicates leads to similar memorization, while making fuzzy duplicates highly unlikely to be removed by any deduplication process. Lastly, we argue that the fact that LLMs memorize across fuzzy duplicates challenges the study of LLM memorization relying on naturally occurring duplicates. Indeed, we find that the commonly used training dataset, The Pile, contains significant amounts of fuzzy duplicates. This introduces a previously unexplored confounding factor in post-hoc studies of LLM memorization, and questions the effectiveness of (exact) data deduplication as a privacy protection technique.
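A minimal sketch of generating fuzzy duplicates as described above: each copy of a trap sequence gets a few tokens replaced, with replacement positions drawn without overlap across copies so that exact n-gram overlap between copies is minimized. The tokenization and the replacement vocabulary are illustrative assumptions.

```python
import random

def fuzzy_duplicates(tokens, n_copies, n_replace=4, vocab=None, seed=0):
    """Create n_copies of a token sequence, each with n_replace tokens swapped.

    Replacement positions are sampled without overlap across copies (requires
    len(tokens) >= n_copies * n_replace), so the copies share as few exact
    n-grams as possible and survive exact-match deduplication.
    """
    rng = random.Random(seed)
    vocab = vocab or ["lorem", "ipsum", "dolor", "sit", "amet"]  # placeholder vocab
    positions = rng.sample(range(len(tokens)), n_copies * n_replace)
    copies = []
    for i in range(n_copies):
        copy = list(tokens)
        for pos in positions[i * n_replace:(i + 1) * n_replace]:
            copy[pos] = rng.choice(vocab)
        copies.append(copy)
    return copies
```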
- [ICML] Copyright Traps for Large Language Models. Matthieu Meeus*, Igor Shilov*, Manuel Faysse, and Yves-Alexandre de Montjoye. In Forty-first International Conference on Machine Learning, 2024.
Press coverage in MIT Technology Review and Nature News.
Questions of fair use of copyright-protected content to train Large Language Models (LLMs) are being actively debated. Document-level inference has been proposed as a new task: inferring from black-box access to the trained model whether a piece of content has been seen during training. State-of-the-art (SOTA) methods, however, rely on naturally occurring memorization of (part of) the content. While very effective against models that memorize a lot, we hypothesize, and later confirm, that they will not work against models that do not naturally memorize, e.g. medium-size 1B models. We here propose to use copyright traps, the inclusion of fictitious entries in original content, to detect the use of copyrighted materials in LLMs, with a focus on models where memorization does not naturally occur. We carefully design an experimental setup, randomly inserting traps into original content (books), and train a 1.3B LLM. We first validate that the use of content in our target model would be undetectable using existing methods. We then show, contrary to intuition, that even medium-length trap sentences repeated a significant number of times (100) are not detectable using existing methods. However, we show that longer sequences repeated a large number of times can be reliably detected (AUC=0.75) and used as copyright traps. We further improve these results by studying how the number of times a sequence is seen improves detectability, how sequences with higher perplexity tend to be memorized more, and how taking context into account further improves detectability.
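A hedged outline of the trap setup: inject a synthetic trap sequence a chosen number of times at random positions in the original text, then score detectability by comparing the target model's loss on injected traps against held-out traps that were never injected. The loss is supplied as a callable; this is a sketch of the general setup, not the paper's exact pipeline.

```python
import random
from sklearn.metrics import roc_auc_score

def inject_traps(paragraphs, trap_sentence, n_repeats, seed=0):
    """Insert trap_sentence at n_repeats random positions in a list of paragraphs."""
    rng = random.Random(seed)
    out = list(paragraphs)
    for _ in range(n_repeats):
        out.insert(rng.randrange(len(out) + 1), trap_sentence)
    return out

def trap_detection_auc(model_loss, injected_traps, heldout_traps):
    """AUC for separating injected from held-out traps by target-model loss.

    model_loss: callable mapping a text sequence to the model's loss on it.
    Systematically lower loss on injected traps indicates memorization.
    """
    scores = [-model_loss(t) for t in injected_traps + heldout_traps]
    labels = [1] * len(injected_traps) + [0] * len(heldout_traps)
    return roc_auc_score(labels, scores)
```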
2022
- [arXiv] Defending against Reconstruction Attacks with Rényi Differential Privacy. Pierre Stock, Igor Shilov, Ilya Mironov, and Alexandre Sablayrolles. arXiv preprint 2202.07623, 2022.
Reconstruction attacks allow an adversary to regenerate data samples of the training set using access to only a trained model. It has recently been shown that simple heuristics can reconstruct data samples from language models, making this threat scenario an important aspect of model release. Differential privacy is a known solution to such attacks, but is often used with a relatively large privacy budget (epsilon > 8) which does not translate to meaningful guarantees. In this paper we show that, for the same mechanism, we can derive privacy guarantees for reconstruction attacks that are better than the traditional ones from the literature. In particular, we show that larger privacy budgets do not protect against membership inference, but can still protect against the extraction of rare secrets. We show experimentally that our guarantees hold against various language models, including GPT-2 finetuned on Wikitext-103.
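For intuition on why a bounded Rényi divergence limits the extraction of rare secrets, the generic probability-preservation bound for Rényi DP is recalled below; this is standard RDP machinery stated as context, not the paper's specific reconstruction guarantee. If a mechanism M satisfies (alpha, epsilon)-RDP, then for neighbouring datasets D, D' and any event S,

```latex
\Pr[M(D) \in S] \;\le\; \bigl(e^{\varepsilon}\, \Pr[M(D') \in S]\bigr)^{\frac{\alpha - 1}{\alpha}}
```

so an event that is already very unlikely without the secret, such as reconstructing a rare training sample, remains unlikely even for privacy budgets that are too large to rule out membership inference.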
2021
- [NeurIPS] Antipodes of label differential privacy: PATE and ALIBI. Mani Malek Esmaeili, Ilya Mironov, Karthik Prasad, Igor Shilov, and Florian Tramèr. In Advances in Neural Information Processing Systems, 2021.
We consider the privacy-preserving machine learning (ML) setting where the trained model must satisfy differential privacy (DP) with respect to the labels of the training examples. We propose two novel approaches based on, respectively, the Laplace mechanism and the PATE framework, and demonstrate their effectiveness on standard benchmarks. While recent work by Ghazi et al. proposed Label DP schemes based on a randomized response mechanism, we argue that additive Laplace noise coupled with Bayesian inference (ALIBI) is a better fit for typical ML tasks. Moreover, we show how to achieve very strong privacy levels in some regimes with our adaptation of the PATE framework, which builds on recent advances in semi-supervised learning. We complement the theoretical analysis of our algorithms’ privacy guarantees with an empirical evaluation of their memorization properties. Our evaluation suggests that comparing different algorithms according to their provable DP guarantees can be misleading and favor a less private algorithm with a tighter analysis.
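A minimal sketch of the label-randomization step the abstract alludes to: add Laplace noise to one-hot labels (label DP), then post-process the noisy vector into soft labels for training. The noise scale and the softmax post-processing are illustrative assumptions, standing in for ALIBI's Bayesian inference step rather than reproducing it.

```python
import numpy as np

def noisy_soft_labels(labels, num_classes, epsilon, rng=None):
    """Privatize integer labels with the Laplace mechanism.

    Replacing one label changes its one-hot vector in two coordinates by 1
    each (L1 sensitivity 2), so Laplace noise with scale 2/epsilon gives
    epsilon-label-DP. The softmax below is a simple stand-in for the Bayesian
    post-processing; post-processing does not degrade the DP guarantee.
    """
    rng = rng or np.random.default_rng(0)
    one_hot = np.eye(num_classes)[np.asarray(labels)]          # (n, num_classes)
    noisy = one_hot + rng.laplace(scale=2.0 / epsilon, size=one_hot.shape)
    soft = np.exp(noisy - noisy.max(axis=1, keepdims=True))
    return soft / soft.sum(axis=1, keepdims=True)
```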
- [NeurIPS Workshop] Opacus: User-Friendly Differential Privacy Library in PyTorch. Ashkan Yousefpour, Igor Shilov, Alexandre Sablayrolles, Davide Testuggine, Karthik Prasad, Mani Malek, and 6 more authors. In NeurIPS Workshop on Privacy in Machine Learning, 2021.
We introduce Opacus, a free, open-source PyTorch library for training deep learning models with differential privacy (hosted at opacus.ai). Opacus is designed for simplicity, flexibility, and speed. It provides a simple and user-friendly API, and enables machine learning practitioners to make a training pipeline private by adding as little as two lines to their code. It supports a wide variety of layers, including multi-head attention, convolution, LSTM, GRU (and generic RNN), and embedding, right out of the box and provides the means for supporting other user-defined layers. Opacus computes batched per-sample gradients, providing higher efficiency compared to the traditional "micro batch" approach. In this paper we present Opacus, detail the principles that drove its implementation and unique features, and benchmark it against other frameworks for training models with differential privacy as well as standard PyTorch.
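A minimal usage sketch of the integration the abstract describes, written against the current Opacus 1.x API (which differs slightly from the API at the time of publication); the model, optimizer, and data loader are ordinary PyTorch objects, and the toy dataset is only there to make the snippet self-contained.

```python
import torch
from opacus import PrivacyEngine

model = torch.nn.Linear(28 * 28, 10)                     # any supported PyTorch model
optimizer = torch.optim.SGD(model.parameters(), lr=0.05)
data_loader = torch.utils.data.DataLoader(
    torch.utils.data.TensorDataset(torch.randn(512, 28 * 28),
                                   torch.randint(0, 10, (512,))),
    batch_size=64,
)

# The DP-specific part: wrap model, optimizer, and loader with the PrivacyEngine.
privacy_engine = PrivacyEngine()
model, optimizer, data_loader = privacy_engine.make_private(
    module=model,
    optimizer=optimizer,
    data_loader=data_loader,
    noise_multiplier=1.1,   # Gaussian noise added to the clipped gradient sum
    max_grad_norm=1.0,      # per-sample gradient clipping bound
)
# The training loop then proceeds exactly as in standard PyTorch.
```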