CVPR 2026

EVLF: Early Vision-Language Fusion for Generative Dataset Distillation

Wenqi Cai1, Yawen Zou1, Guang Li2, Chunzhi Gu3, Chao Zhang1

1University of Toyama 2Hokkaido University 3University of Fukui

Introduction

Why early vision-language fusion?

Dataset distillation (DD) seeks to synthesize compact training sets that retain strong downstream utility with far fewer samples. In recent diffusion-based DD methods, textual semantics are often injected late in the generation process, which can cause prompt guidance to dominate visual evidence and weaken the contribution of visual latents.

In EVLF, we retain the conventional late-fusion pathway and additionally introduce an early vision-language fusion module at the encoder-to-backbone transition. This design strengthens the interaction between visual structure and semantic guidance throughout the denoising process, rather than relying only on late-stage conditioning.

EVLF is a lightweight, plug-and-play component for diffusion-based DD pipelines with an encoder. By complementing late fusion with early cross-modal interaction, we improve the balance between semantic alignment and visual fidelity, leading to more coherent synthetic samples and stronger downstream classification performance.

Highlights

  • We retain the conventional late-fusion design and add an early vision-language fusion module.
  • Our early fusion interface strengthens the interaction between visual evidence and language guidance.
  • EVLF is lightweight and can be inserted at the encoder–backbone interface.
  • Across settings, EVLF improves sample coherence and downstream utility.

Overview of EVLF.

  • Input: visual latents together with class-level text guidance.
  • Fusion: early cross-modal fusion is added while the standard late-fusion pathway is preserved.
  • Outcome: more coherent synthetic samples and better downstream performance.

Method

Framework overview

EVLF introduces two key modifications to the diffusion-based dataset distillation pipeline: we revise the fusion interface by performing vision-language interaction at the encoder-to-backbone transition, and we optionally fine-tune the denoiser when adaptation to the fused latent space is needed.
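
The paper describes this fusion interface at a high level, so the snippet below is only a minimal PyTorch sketch of one plausible realization: a cross-attention block in which the encoder's visual latents attend to the class-level text embeddings before entering the diffusion backbone. The class name, dimensions, and head count are illustrative assumptions rather than the authors' implementation.

```python
# Minimal sketch of an early vision-language fusion block (hypothetical
# realization; the paper's exact layer configuration may differ).
import torch
import torch.nn as nn


class EarlyFusionBlock(nn.Module):
    """Fuses text embeddings into visual latents via cross-attention
    before the latents enter the diffusion backbone."""

    def __init__(self, latent_dim: int, text_dim: int, num_heads: int = 8):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, latent_dim)  # align text to latent width
        self.norm = nn.LayerNorm(latent_dim)
        self.cross_attn = nn.MultiheadAttention(latent_dim, num_heads, batch_first=True)

    def forward(self, z_visual: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
        # z_visual: (B, N, latent_dim) flattened visual latents from the encoder
        # text_emb: (B, T, text_dim)   class-level text embeddings
        ctx = self.text_proj(text_emb)
        attn_out, _ = self.cross_attn(self.norm(z_visual), ctx, ctx)
        # Residual connection keeps the visual evidence as the primary signal;
        # the fused latents then enter the (late-fusion-conditioned) denoiser.
        return z_visual + attn_out
```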

Framework overview of EVLF. We modify the original diffusion-based dataset distillation pipeline in two ways. First, we introduce an additional early vision-language fusion module at the encoder-to-backbone interface, allowing visual latents and text embeddings to interact before denoising while preserving the standard late-fusion pathway. Second, for pipelines whose pretrained denoiser is not adapted to the target dataset or the fused latent distribution, we optionally fine-tune the denoiser to better match the fused representations.
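
For the optional fine-tuning step, a standard noise-prediction objective applied to the fused latents would look roughly like the sketch below. It is a hedged illustration under common diffusion-training assumptions; `denoiser`, `fusion`, and `add_noise` are placeholders for whatever backbone and noise schedule the host pipeline uses, not the paper's training code.

```python
# Hedged sketch of the optional denoiser fine-tuning step, assuming a standard
# noise-prediction (epsilon) objective on the fused latents. `denoiser`,
# `fusion`, and `add_noise` are illustrative placeholders.
import torch
import torch.nn.functional as F

NUM_TRAIN_TIMESTEPS = 1000  # assumed diffusion schedule length

def finetune_step(denoiser, fusion, optimizer, add_noise, z_visual, text_emb):
    # Early fusion: let visual latents interact with text before denoising.
    z_fused = fusion(z_visual, text_emb)

    # Sample timesteps and corrupt the fused latents accordingly.
    t = torch.randint(0, NUM_TRAIN_TIMESTEPS, (z_fused.size(0),), device=z_fused.device)
    noise = torch.randn_like(z_fused)
    z_noisy = add_noise(z_fused, noise, t)

    # Late fusion is preserved: the denoiser still receives the text
    # embeddings through its usual conditioning pathway.
    pred_noise = denoiser(z_noisy, t, text_emb)
    loss = F.mse_loss(pred_noise, noise)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```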

Results

Qualitative comparisons across datasets

Qualitative comparisons across ImageIDC, ImageNette, and ImageWoof show that EVLF produces synthetic samples with stronger label fidelity and more coherent visual structure than conventional late-fusion baselines. In particular, EVLF better preserves object shape, texture, and class-consistent details while reducing the over-corrected artifacts that often appear when semantic guidance dominates the denoising process.

Qualitative comparison on ImageIDC.

Qualitative comparison on ImageNette.

Qualitative comparison on ImageWoof.

Quantitative Results

Performance under different IPC settings

We report the main quantitative comparisons from the paper for ImageWoof, followed by the ImageNette and ImageIDC results. Across datasets and IPC settings, EVLF consistently improves over diffusion-based baselines, with especially clear gains in low-IPC regimes where preserving both semantic fidelity and visual structure is most challenging.
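
For context on how these numbers are obtained, dataset distillation is typically evaluated by training a small classifier from scratch on the distilled images (IPC samples per class) and reporting accuracy on the real validation split. The sketch below illustrates that generic protocol with assumed hyperparameters; the authors' exact recipe (optimizer, schedule, augmentation) may differ.

```python
# Illustrative evaluation loop for a distilled set (not the authors' exact
# recipe): train a classifier on the IPC synthetic images per class, then
# measure accuracy on the real validation split.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader

def evaluate_distilled(model: nn.Module, distilled_ds, val_ds,
                       epochs: int = 100, lr: float = 1e-3, device: str = "cuda"):
    model = model.to(device)
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()
    train_loader = DataLoader(distilled_ds, batch_size=64, shuffle=True)
    val_loader = DataLoader(val_ds, batch_size=256)

    for _ in range(epochs):
        model.train()
        for x, y in train_loader:
            x, y = x.to(device), y.to(device)
            opt.zero_grad()
            criterion(model(x), y).backward()
            opt.step()

    model.eval()
    correct = total = 0
    with torch.no_grad():
        for x, y in val_loader:
            pred = model(x.to(device)).argmax(dim=1).cpu()
            correct += (pred == y).sum().item()
            total += y.numel()
    return 100.0 * correct / total
```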

ImageWoof

IPC  Test Model    Random      DiT         Minimax     D⁴M         D⁴M+EVLF    MGD³        MGD³+EVLF
10   ConvNet-6     24.3 ± 1.1  34.2 ± 1.1  33.3 ± 1.7  29.4 ± 0.9  34.3 ± 2.4  33.5 ± 1.9  34.9 ± 1.0*
10   ResNetAP-10   29.4 ± 0.8  34.7 ± 0.5  36.2 ± 3.2  33.2 ± 2.1  37.3 ± 0.7  36.6 ± 0.9  39.3 ± 0.3*
10   ResNet-18     27.7 ± 0.9  34.7 ± 0.4  35.7 ± 1.6  32.3 ± 1.2  35.9 ± 2.1  35.1 ± 1.8  38.5 ± 0.3*
20   ConvNet-6     29.1 ± 0.7  36.1 ± 0.8  37.3 ± 0.1  34.0 ± 2.3  40.1 ± 2.6  36.2 ± 1.6  40.2 ± 0.5*
20   ResNetAP-10   32.7 ± 0.4  41.1 ± 0.8  43.3 ± 2.7  40.1 ± 1.6  42.8 ± 0.2  44.5 ± 2.8  45.1 ± 0.9*
20   ResNet-18     29.7 ± 0.5  40.5 ± 0.5  41.8 ± 1.9  38.4 ± 1.1  40.7 ± 1.3  40.3 ± 2.5  42.1 ± 0.3*
50   ConvNet-6     41.3 ± 0.6  46.5 ± 0.8  50.9 ± 0.8  47.4 ± 0.9  52.5 ± 0.9  51.9 ± 0.4  53.5 ± 0.4*
50   ResNetAP-10   47.2 ± 1.3  49.3 ± 0.2  53.9 ± 0.7  51.7 ± 3.2  55.8 ± 0.2  55.6 ± 1.0  59.0 ± 1.1*
50   ResNet-18     47.9 ± 1.8  50.1 ± 0.5  53.7 ± 0.6  53.7 ± 2.2  58.1 ± 0.9  56.3 ± 0.5  58.7 ± 1.5*

Classification accuracy (%) on ImageWoof across IPC settings and test architectures. Values are reported as mean ± standard deviation. The best result in each row is marked with an asterisk (*).

ImageNette and ImageIDC

Dataset     IPC  Random      DiT         DM          Minimax     D⁴M         D⁴M+EVLF    MGD³        MGD³+EVLF
ImageNette  10   54.2 ± 1.6  59.1 ± 0.7  60.8 ± 0.6  57.7 ± 1.2  60.9 ± 1.7  65.8 ± 1.2  64.3 ± 1.0  66.0 ± 1.6*
ImageNette  20   63.5 ± 0.5  64.8 ± 1.2  66.5 ± 1.1  64.7 ± 0.8  66.3 ± 1.3  71.7 ± 0.5  69.2 ± 1.9  72.5 ± 0.8*
ImageNette  50   76.1 ± 1.1  73.3 ± 0.9  76.2 ± 0.4  73.9 ± 0.3  77.7 ± 1.1  79.7 ± 0.5* 79.2 ± 1.9  79.5 ± 0.4
ImageIDC    10   48.1 ± 0.8  54.1 ± 0.4  52.8 ± 0.5  51.9 ± 1.4  47.7 ± 0.5  57.3 ± 1.5* 55.0 ± 2.3  56.3 ± 1.5
ImageIDC    20   52.5 ± 0.9  58.9 ± 0.2  58.5 ± 0.4  59.1 ± 3.7  56.3 ± 0.7  62.0 ± 0.7  61.7 ± 1.0  64.1 ± 0.3*
ImageIDC    50   68.1 ± 0.7  64.3 ± 0.6  69.1 ± 0.8  69.4 ± 1.4  67.8 ± 1.0  72.1 ± 0.3  71.0 ± 0.9  72.7 ± 1.1*

Classification accuracy (%) on ImageNette and ImageIDC with ResNetAP-10. Values are reported as mean ± standard deviation. The best result in each row is marked with an asterisk (*).

Analysis

Representation and attribution analysis

We analyze the learned representation space with t-SNE, model attention with Grad-CAM, and sensitivity to the text-alignment weight λ1 with the line-chart summary from the paper.

t-SNE comparison

t-SNE comparison on ImageIDC. Both D⁴M and MGD³ tend to produce tightly clustered synthetic samples, indicating limited diversity and restricted coverage of the original feature space. After incorporating EVLF, the generated samples become more dispersed and better separated, suggesting improved distributional coverage and richer class-wise variation.
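
A comparison of this kind can be reproduced by extracting features of real and synthetic images with a pretrained encoder and projecting them jointly with t-SNE. The sketch below shows one straightforward way to do so; the feature source, perplexity, and plotting details are assumptions rather than the paper's exact setup.

```python
# Minimal sketch of a t-SNE comparison between real and synthetic features
# (illustrative; the paper's feature extractor and plot settings may differ).
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_tsne(real_feats: np.ndarray, synth_feats: np.ndarray, out_path: str):
    # real_feats / synth_feats: (N, D) features from a pretrained encoder.
    feats = np.concatenate([real_feats, synth_feats], axis=0)
    emb = TSNE(n_components=2, perplexity=30, init="pca").fit_transform(feats)

    n_real = len(real_feats)
    plt.scatter(emb[:n_real, 0], emb[:n_real, 1], s=5, label="real")
    plt.scatter(emb[n_real:, 0], emb[n_real:, 1], s=5, label="synthetic")
    plt.legend()
    plt.savefig(out_path, bbox_inches="tight")
```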

Hyperparameter analysis

Hyperparameter analysis of λ1 on ImageIDC. Enabling text injection (λ1 > 0) noticeably improves both validation accuracy and distributional coverage, while λ1 = 0 leads to over-corrected generations dominated by late-stage prompt conditioning. Once EVLF is introduced, both metrics become more stable, indicating that the method is robust to moderate changes in the text-alignment weight.

Grad-CAM comparison

Grad-CAM visualizations for the Shih-Tzu class from the ImageWoof validation set. Compared with models trained on datasets distilled by the original baselines, models trained on EVLF-enhanced data exhibit more discriminative and semantically relevant attention. In particular, EVLF suppresses redundant background activation for D⁴M and encourages MGD³ to focus on a broader and more complete region of the target object.
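
For reference, such attribution maps follow the standard Grad-CAM recipe: average the gradients of the target class score over the spatial dimensions of a late convolutional layer, use these averages to weight the layer's activations, then apply ReLU and normalize. The snippet below is a compact PyTorch sketch of that standard formulation, not the authors' visualization code.

```python
# Compact Grad-CAM sketch (standard formulation). `target_layer` is assumed
# to be the last convolutional block of the trained classifier.
import torch
import torch.nn.functional as F

def grad_cam(model, target_layer, image, class_idx):
    acts, grads = [], []
    h1 = target_layer.register_forward_hook(lambda m, i, o: acts.append(o))
    h2 = target_layer.register_full_backward_hook(lambda m, gi, go: grads.append(go[0]))

    model.eval()
    logits = model(image.unsqueeze(0))   # (1, num_classes)
    model.zero_grad()
    logits[0, class_idx].backward()

    h1.remove(); h2.remove()

    # Channel weights = spatially averaged gradients; CAM = weighted sum of
    # activations, passed through ReLU and normalized to [0, 1].
    weights = grads[0].mean(dim=(2, 3), keepdim=True)   # (1, C, 1, 1)
    cam = F.relu((weights * acts[0]).sum(dim=1))        # (1, H, W)
    cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)
    return cam.squeeze(0)
```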

Citation

BibTeX

@inproceedings{cai2026evlf,
  title={{EVLF}: Early Vision-Language Fusion for Generative Dataset Distillation},
  author={Cai, Wenqi and Zou, Yawen and Li, Guang and Gu, Chunzhi and Zhang, Chao},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2026}
}

Project links

Project materials and figure assets are collected here for convenient reference.