How do video understanding models arrive at their answers? Although current Vision Language Models (VLMs) reason over complex scenes with diverse objects, actions, and scene dynamics, understanding and controlling their internal processes remains an open challenge. Motivated by recent advances in text-to-video (T2V) generative models, this paper introduces a logits-to-video (L2V) task alongside a model-independent approach, TRANSPORTER, for generating videos that capture the underlying rules behind VLMs' predictions. To exploit the high visual fidelity of T2V models, TRANSPORTER learns an optimal transport coupling to the highly semantic embedding spaces of VLMs. In turn, logit scores define embedding directions for conditional video generation. TRANSPORTER generates videos that reflect caption changes across diverse object attributes, action adverbs, and scene contexts. Quantitative and qualitative evaluations across VLMs demonstrate that L2V offers a fidelity-rich, previously unexplored direction for model interpretability.
L2V with TRANSPORTER: Embeddings \(\mathbf{z}_\Xi \in \mathbb{R}^\Xi\) are coupled via network \(\Phi\) and concept bank \(\mathbf{Q}\). Coupling network \(\Phi\) projects \(\mathbf{z}_\Xi\) with condition \(\pi_\Xi\) to \(\widehat{\mathbf{z}}_{\Omega_1}=\Phi_{\Omega_1}(\mathbf{z}_\Xi,\pi_\Xi)\). Latents \(\widehat{\mathbf{z}}_{\Omega_2} \in \mathbb{R}^\Omega\) are obtained via \(\Phi_{\Omega_2}\) over decoder \(\mathcal{D}_\Xi\) and encoder \(\mathcal{E}_\Omega\). The Learnable Optimal Transport (\(\rho\)-OT) uses projection vectors \(\mathbf{p}_{\Omega_1},\mathbf{p}_{\Omega_2}\) to transport embeddings to \(\tilde{\mathbf{z}}_\Omega\). Concept bank \(\mathbf{Q}=\{\mathbf{q}_o:o\in \mathcal{O}\}\) is trained using the probability path difference \(\Delta v\) weighted by the logit distribution change \(\Delta \omega\). Inference: concept vectors \(\mathbf{q}_o\) are added to the conditions to transport noise \(\boldsymbol{\epsilon}\sim \mathcal{N}(0,\mathbf{I})\) and generate videos.
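Below is a minimal PyTorch sketch of the coupling and transport steps in this pipeline. The class names (CouplingNet, LearnableOT), layer sizes, and the way the projection vectors enter the transport are illustrative assumptions rather than TRANSPORTER's exact architecture; the decoder \(\mathcal{D}_\Xi\)/encoder \(\mathcal{E}_\Omega\) path and the training losses are omitted.

# Hypothetical sketch: couple VLM embeddings z_Xi (with condition pi_Xi) into the
# T2V latent space, then blend the two coupled views with learnable projection
# vectors as a stand-in for the rho-OT transport step.
import torch
import torch.nn as nn


class CouplingNet(nn.Module):
    """Phi: maps a VLM embedding plus a condition vector to a T2V-sized latent."""

    def __init__(self, dim_vlm: int, dim_cond: int, dim_t2v: int, hidden: int = 1024):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(dim_vlm + dim_cond, hidden),
            nn.GELU(),
            nn.Linear(hidden, dim_t2v),
        )

    def forward(self, z_vlm: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        return self.mlp(torch.cat([z_vlm, cond], dim=-1))


class LearnableOT(nn.Module):
    """rho-OT stand-in: learnable projection vectors p1, p2 score the two coupled
    embeddings and transport them to a single latent z_tilde."""

    def __init__(self, dim_t2v: int):
        super().__init__()
        self.p1 = nn.Parameter(torch.randn(dim_t2v) / dim_t2v ** 0.5)
        self.p2 = nn.Parameter(torch.randn(dim_t2v) / dim_t2v ** 0.5)

    def forward(self, z1: torch.Tensor, z2: torch.Tensor) -> torch.Tensor:
        # Scalar scores from each projection vector, turned into convex weights.
        s = torch.stack([(z1 * self.p1).sum(-1), (z2 * self.p2).sum(-1)], dim=-1)
        w = torch.softmax(s, dim=-1)
        return w[..., :1] * z1 + w[..., 1:] * z2


# Example shapes: a batch of 4 VLM embeddings and conditions.
phi_1 = CouplingNet(dim_vlm=768, dim_cond=512, dim_t2v=1024)
phi_2 = CouplingNet(dim_vlm=768, dim_cond=512, dim_t2v=1024)
ot = LearnableOT(dim_t2v=1024)

z_vlm, cond = torch.randn(4, 768), torch.randn(4, 512)
z_hat_1 = phi_1(z_vlm, cond)     # \hat{z}_{Omega_1}
z_hat_2 = phi_2(z_vlm, cond)     # \hat{z}_{Omega_2} (decoder/encoder path omitted)
z_tilde = ot(z_hat_1, z_hat_2)   # transported latent \tilde{z}_Omega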
TRANSPORTER can generate videos to visualize VLM logit transitions between different object attributes, such as color or material types. The sharpness of the transition also varies across modulations.
For actions whose executions are similar, TRANSPORTER modulations show that VLMs learn small yet distinct changes. Differences in speed become visible in cases where action verbs differ significantly in their execution. Transitions between actions with even larger differences in their execution are also captured.
The resulting videos demonstrate TRANSPORTER's ability to perform in-context modulations. Aspects such as how the action is performed or the appearance of objects and actors remain constant throughout the transition between source and target logits.
In settings with combined modulation, transitions of individual logit pairs are visualized over different \( \Delta \omega\).
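As a rough illustration of how such a sweep over \(\Delta \omega\) could be realized at inference time, the snippet below shifts the conditioning embedding from a source concept toward a target concept. The additive steering rule, the linear schedule, and all variable names are assumptions, not TRANSPORTER's exact procedure.

# Hypothetical sweep: shift the conditioning embedding along a concept direction
# by an amount proportional to delta_omega, producing one steered condition per
# step of the transition.
import torch


def steer_condition(cond: torch.Tensor, q_src: torch.Tensor, q_tgt: torch.Tensor,
                    delta_omega: float) -> torch.Tensor:
    """Move cond along the concept direction (q_tgt - q_src) by delta_omega."""
    return cond + delta_omega * (q_tgt - q_src)


cond = torch.randn(1024)                              # base T2V conditioning embedding
q_src, q_tgt = torch.randn(1024), torch.randn(1024)   # source/target concept-bank entries

# One steered condition per delta_omega; each would seed a separate generation
# from noise eps ~ N(0, I) with the frozen T2V model.
sweep = [steer_condition(cond, q_src, q_tgt, w.item())
         for w in torch.linspace(0.0, 1.0, steps=5)]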
Embedding similarity between TRANSPORTER and real videos. Cosine similarity \(\cos\), \(\ell_1/\ell_2\) distances, and Kullback–Leibler divergence (KL) are computed between mean encodings of VidChapters-7M and generated videos.
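For concreteness, here is a small sketch of how such a comparison could be computed once both video sets are encoded into fixed-size vectors; the function name and the softmax normalization used to form distributions for the KL term are assumptions.

# Compare mean encodings of real vs. generated videos with cos, l1/l2, and KL.
import torch
import torch.nn.functional as F


def embedding_similarity(real: torch.Tensor, generated: torch.Tensor) -> dict:
    """real, generated: (N, D) video embeddings; metrics over their mean encodings."""
    mu_r, mu_g = real.mean(0), generated.mean(0)
    cos = F.cosine_similarity(mu_r, mu_g, dim=0)
    l1 = (mu_r - mu_g).abs().sum()
    l2 = (mu_r - mu_g).pow(2).sum().sqrt()
    # KL over softmax-normalized mean encodings (one simple way to form distributions).
    p = F.log_softmax(mu_g, dim=0)   # input in log space, as kl_div expects
    q = F.softmax(mu_r, dim=0)       # target as probabilities
    kl = F.kl_div(p, q, reduction="sum")
    return {"cos": cos.item(), "l1": l1.item(), "l2": l2.item(), "kl": kl.item()}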
@article{stergiou2025transporter,
title={TRANSPORTER: Transferring Visual Semantics from VLM Manifolds},
author={Stergiou, Alexandros},
journal={arXiv},
year={2025}
}
For questions, you can email a.g.stergiou@utwente.nl.