We learn visual features by captioning images with an image-conditioned masked diffusion language model, a formulation we call masked diffusion captioning (MDC). During training, text tokens in each image-caption pair are masked at a randomly chosen ratio, and a decoder conditioned on visual features is trained to reconstruct the original text. After training, the learned visual features can be applied to downstream vision tasks. Unlike autoregressive captioning, the strength of the visual learning signal in MDC does not depend on each token's position in the sequence, reducing the need for auxiliary objectives. Linear probing experiments across a variety of academic-scale models and datasets show that the learned visual features are competitive with those produced by autoregressive and contrastive approaches.
 
Learning visual features using masked diffusion captioning. (a) Masked diffusion captioning: we train an image-conditioned masked diffusion language model to learn visual features. Given an image and its corresponding caption, we randomly mask text tokens in the caption and then reconstruct it using a decoder conditioned on visual features (obtained from a separate encoder network) and on the remaining text tokens. In each training iteration, we sample a time step $t$ that determines a masking ratio and a cross-entropy weight; $t=0$ means no tokens are masked, while $t=1$ means the sequence is fully masked. (b) Image-conditioned language sampling: during sampling, we start with a fully masked sequence containing $N'$ mask tokens and iteratively denoise over $N'$ steps to obtain a full caption.
We train an image-conditioned masked diffusion language model to learn visual features. Each training pair consists of an image $I$ and its corresponding caption $C$. We use a standard transformer encoder-decoder architecture following Tschannen et al. (2023) as the captioner $h$. The encoder $f_{\phi}$ takes the image $I$ and produces a sequence of visual features $\mathbf{V}= f_{\phi}(I)$. These are late-fused into the decoder $g_{\psi}$ via cross-attention to predict the caption $C$.
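As a concrete illustration, the sketch below shows one way such an encoder-decoder captioner could be implemented in PyTorch. It is a minimal reading of the description above, not the authors' code: the module names, the time-conditioning scheme, and all hyperparameters are our assumptions, and the vision encoder stands in for whichever ViT backbone plays the role of $f_{\phi}$.

# Minimal sketch (not the authors' code): an image-conditioned caption decoder that
# late-fuses visual features via cross-attention. All names and sizes are illustrative.
import torch
import torch.nn as nn

class MaskedDiffusionCaptioner(nn.Module):
    def __init__(self, vision_encoder: nn.Module, vocab_size: int,
                 d_model: int = 768, n_layers: int = 6, n_heads: int = 12,
                 max_len: int = 64):
        super().__init__()
        self.encoder = vision_encoder                     # f_phi: image -> sequence of visual features V
        self.tok_emb = nn.Embedding(vocab_size, d_model)  # vocabulary includes a dedicated [MASK] token
        self.pos_emb = nn.Embedding(max_len, d_model)
        self.time_emb = nn.Linear(1, d_model)             # conditioning on the diffusion time t
        layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, n_layers)  # g_psi: self-attn over tokens, cross-attn to V
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, image, tokens, t):
        V = self.encoder(image)                           # (B, N_patches, d_model)
        B, N = tokens.shape
        pos = torch.arange(N, device=tokens.device)
        x = self.tok_emb(tokens) + self.pos_emb(pos) + self.time_emb(t.view(B, 1, 1))
        h = self.decoder(tgt=x, memory=V)                 # no causal mask: every position sees every other
        return self.head(h)                               # per-position logits over the vocabulary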
            
Building on the training objective of MDLM (Sahoo et al., 2024), we define the loss for masked diffusion captioning (MDC). Given the caption $C$, MDC defines a factorized forward process $q\left(C_{t}|C_{0},\mathbf{V}\right)=\prod_{i=0}^{N-1}q\left(c^{i}_{t}|c^{i}_{0},\mathbf{V}\right)$, and the learned reverse process is also factorized: $p_{\psi}\left(C_{r}|C_{t},\mathbf{V}\right) := \prod_{i=0}^{N-1}q\left(c^{i}_{r}|c^{i}_{t},g_\psi^{i}\left(C_{t},t,\mathbf{V}\right)\right)$. The training objective is:
            $$
            \mathcal{L}_{\mathrm{MDC}} = \mathbb{E}_{t}\left [\frac{\alpha_t'}{1 - \alpha_t} \mathbb{E}_{q} \Big[\sum_{i=0}^{N-1}
            \delta_{c_t^{i},\texttt{[MASK]}}\mathbf{c}_{0}^{i \top}\log \left( g_{\psi}^{i} \left( C_{t}, t,\mathbf{V}\right)\right) \Big] \right]
            $$
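A minimal sketch of this objective is given below, assuming the linear schedule $\alpha_t = 1 - t$ (under which the weight $\alpha_t'/(1-\alpha_t)$ reduces in magnitude to $1/t$) and the captioner interface from the sketch above. The interval bounds w_l and w_u anticipate the noise-schedule analysis later in the paper and are illustrative defaults.

# Minimal sketch of the MDC training loss under a linear schedule alpha_t = 1 - t,
# for which the objective becomes a (1/t)-weighted cross-entropy on masked positions.
import torch
import torch.nn.functional as F

def mdc_loss(model, image, caption, mask_id, w_l=0.5, w_u=1.0):
    B, N = caption.shape
    # Sample a diffusion time t per example; under the linear schedule t is the masking ratio.
    t = torch.rand(B, device=caption.device) * (w_u - w_l) + w_l            # t ~ U[w_l, w_u]
    # Forward process q(C_t | C_0): mask each token independently with probability t.
    is_masked = torch.rand(B, N, device=caption.device) < t[:, None]
    noised = torch.where(is_masked, torch.full_like(caption, mask_id), caption)
    # Reverse model g_psi predicts the clean tokens at every position.
    logits = model(image, noised, t)                                        # (B, N, vocab)
    nll = F.cross_entropy(logits.transpose(1, 2), caption, reduction="none")  # (B, N)
    # Keep only masked positions (the delta term) and apply the 1/t weight.
    per_example = (nll * is_masked).sum(dim=1) / t
    return per_example.mean()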
        
We compare masked diffusion captioning with CLIP and autoregressive captioning using ViT-B/32, ViT-B/16, and ViT-L/14 vision backbones. We choose CC12M and a 10M random subset of Recap-DataComp (Recap-DataComp-10M) as the pretraining datasets. We evaluate the learned visual features by linear probing on several datasets: ImageNet-1K, Foods, CIFAR-10, CIFAR-100, and Pets.
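For context, linear probing keeps the pretrained vision encoder frozen and trains only a linear classifier on top of its features. The sketch below shows a generic version of this protocol; the mean-pooling of patch features, the optimizer, and the full-batch training loop are simplifying assumptions rather than the paper's exact evaluation recipe.

# Generic linear-probing sketch: freeze the pretrained vision encoder f_phi, pool its
# output features, and train only a linear classifier on the target dataset.
import torch
import torch.nn as nn

@torch.no_grad()
def extract_features(encoder, loader, device):
    encoder.eval()
    feats, labels = [], []
    for images, ys in loader:
        V = encoder(images.to(device))        # (B, N_patches, D)
        feats.append(V.mean(dim=1).cpu())     # mean-pool patch features (an assumption)
        labels.append(ys)
    return torch.cat(feats), torch.cat(labels)

def train_linear_probe(train_feats, train_labels, num_classes, epochs=10, lr=1e-3):
    probe = nn.Linear(train_feats.shape[1], num_classes)
    opt = torch.optim.AdamW(probe.parameters(), lr=lr)
    for _ in range(epochs):                   # full-batch updates, kept simple for the sketch
        logits = probe(train_feats)
        loss = nn.functional.cross_entropy(logits, train_labels)
        opt.zero_grad(); loss.backward(); opt.step()
    return probe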
We also compare masked diffusion captioning against an image-conditioned BERT trained with different masking ratios.
 
            Comparison to image-conditioned BERT with different masking ratios. We compare our method against BERT with varying masking ratios, including 100% (parallel decoding). While BERT with certain masking ratios achieves performance close to ours, our method adopts a unified schedule, avoiding the need to tune the masking ratio on each dataset.
We randomly sample 5M, 10M, and 20M image-caption pairs from Recap-DataComp as pretraining sets for masked diffusion captioning.
 
        Linear probing performance with varying numbers of image-text pairs. We randomly sample 5M, 10M, 20M, and 30M pairs from Recap-DataComp-1B for pretraining our method. As the number of image-text pairs increases, the linear probing performance on IN-1K improves.
We evaluate the vision-language compositionality of models by asking them to choose the correct caption for an image from a pair consisting of a true caption and a manipulated one.
 
Vision-language compositionality evaluation. We evaluate the compositionality of models on two benchmarks: ARO (Yuksekgonul et al., 2022) and SugarCrepe (Hsieh et al., 2023).
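One simple way to run this two-way selection with a masked diffusion captioner is to score each candidate caption by its Monte-Carlo-averaged MDC loss and pick the lower-loss one; using the training loss as the scoring rule is our assumption, not necessarily the exact protocol used in the paper. The sketch reuses mdc_loss from the loss sketch above.

# Sketch of binary caption selection: score each candidate by its averaged MDC loss
# (lower loss = better fit to the image) and return the index of the chosen caption.
import torch

@torch.no_grad()
def choose_caption(model, image, caption_a, caption_b, mask_id, n_samples=8):
    def score(caption):
        losses = [mdc_loss(model, image, caption, mask_id) for _ in range(n_samples)]
        return torch.stack(losses).mean()
    return 0 if score(caption_a) < score(caption_b) else 1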
To assess the necessity of $t$ in the training objective, we perform an ablation study by removing $t$ from the weighted cross-entropy loss during pretraining. The model then essentially becomes CMLM (Ghazvininejad et al., 2019). The results, presented in the table below, show linear probing performance drops for models trained on both CC12M and Recap-DataComp-10M. This suggests that the loss scaling factor $t$ plays a critical role in learning effective visual representations.
 
        Ablation on $t$. We compare masked diffusion captioning (MDC) with its loss variant pretrained on CC12M and Recap-DataComp-10M. We evaluate them by linear probing on ImageNet-1K.
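To make the ablated objective concrete, the variant below drops the $1/t$ factor while keeping the random masking ratio and masked-position cross-entropy from mdc_loss above, recovering a CMLM-style loss. It reuses the assumed captioner interface from the earlier sketches.

# Ablated (CMLM-style) objective: identical masking and masked-token cross-entropy,
# but without the time-dependent 1/t weight.
import torch
import torch.nn.functional as F

def cmlm_loss(model, image, caption, mask_id):
    B, N = caption.shape
    t = torch.rand(B, device=caption.device)                      # masking ratio, still random
    is_masked = torch.rand(B, N, device=caption.device) < t[:, None]
    noised = torch.where(is_masked, torch.full_like(caption, mask_id), caption)
    logits = model(image, noised, t)
    nll = F.cross_entropy(logits.transpose(1, 2), caption, reduction="none")
    return (nll * is_masked).sum(dim=1).mean()                     # no 1/t weighting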
During the training of masked diffusion models, the noise level (masking ratio) of each step is determined by $t$, sampled from the interval $[\omega_{l}, \omega_{u}]$. The vanilla masked diffusion model with a linear schedule uses $\omega_{l}=0, \omega_{u}=1$. However, we find that the loss is very unstable when pretraining on CC3M, where many captions are very short. This resonates with findings from prior work (Arriola et al., 2025). Thus, to analyze the effect of the sampling interval of $t$, we experiment with varying noise schedules on CC3M; the results are shown in the table below. We find that $\omega_{l}=0.5, \omega_{u}=1$ achieves the best performance and use this noise schedule as the default for masked diffusion captioning.
 
        Analysis of noise schedule. We test masked diffusion captioning pretrained on CC3M with different noise schedules by linear probing on ImageNet-1K.
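One plausible reading of this instability (our interpretation, not a claim made in the paper) is that with short captions and a small $t$, many training examples end up with few or even zero masked tokens, making the $1/t$-weighted loss noisy. The back-of-the-envelope check below illustrates how raising the lower bound $\omega_{l}$ to 0.5 makes all-unmasked samples rare.

# Probability that a caption of a given length receives no masked tokens at all when
# each token is masked independently with probability t (linear schedule).
def prob_no_token_masked(t, caption_length):
    return (1.0 - t) ** caption_length

print(prob_no_token_masked(0.1, 8))   # ~0.43: nearly half of length-8 captions stay fully unmasked
print(prob_no_token_masked(0.5, 8))   # ~0.004 at the default lower bound w_l = 0.5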
We use our masked diffusion captioning to generate captions for images. We need to specify the sequence length for sampling, since we only train the model to reconstruct non-padding tokens. We use three sequence lengths for caption generation: 10, 15, and 20.
 
Examples of caption sampling. We show three examples sampled from the MSCOCO Karpathy test split. MDC-10/15/20 denotes masked diffusion captioning with an output sequence length of 10/15/20.
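The sampling procedure described above can be sketched as follows; unmasking the single most confident position at each of the $N'$ steps is a common heuristic for masked diffusion models and is an assumption here, not necessarily the paper's exact decoding rule. The sketch assumes the captioner interface from the earlier code.

# Minimal sampling sketch: start from N' [MASK] tokens and unmask one position per
# step over N' steps, choosing the most confident still-masked position each time.
import torch

@torch.no_grad()
def sample_caption(model, image, mask_id, seq_len=15, device="cpu"):
    tokens = torch.full((1, seq_len), mask_id, dtype=torch.long, device=device)
    for step in range(seq_len):
        t = torch.tensor([1.0 - step / seq_len], device=device)   # remaining mask ratio
        logits = model(image, tokens, t)                           # (1, seq_len, vocab)
        probs = logits.softmax(dim=-1)
        conf, pred = probs.max(dim=-1)                             # confidence and argmax token
        conf = conf.masked_fill(tokens != mask_id, -1.0)           # only consider still-masked slots
        pos = conf.argmax(dim=-1)                                  # most confident masked position
        tokens[0, pos] = pred[0, pos]
    return tokens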
Both the pretraining dataset scale (on the order of 10M image-caption pairs) and the model size are academic-scale. Training masked diffusion captioning on datasets that contain undesirable content may result in the learning of biased or harmful visual representations and the generation of malicious captions.
In this work, we introduce masked diffusion captioning (MDC), an image-conditioned masked diffusion language model designed to learn visual representations. Our results demonstrate that masked diffusion captioning effectively learns visual features, outperforming previous masked language modeling variants by using a unified noise schedule. In addition, masked diffusion captioning can generate reasonable captions and exhibits strong vision-language compositionality. We conduct evaluations to establish an effective training recipe for masked diffusion captioning. Overall, our study suggests that masked diffusion language models offer a compelling alternative to autoregressive approaches for learning visual representations from image-caption pairs.
This work was supported by Advanced Research Projects Agency for Health (ARPA-H) under award #1AY2AX000062. This research was funded, in part, by the U.S. Government. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the U.S. Government. We thank Subham Sahoo and Zixuan Pan for helpful discussions.
@article{feng2025masked,
    title={Masked Diffusion Captioning for Visual Feature Learning},
    author={Feng, Chao and Wei, Zihao and Owens, Andrew},
    journal={Findings of the Association for Computational Linguistics: EMNLP 2025},
    year={2025}
}