
Shraman Pramanick and Sayan Nag are qualified to endorse.

VoLTA: Vision-Language Transformer with Weakly-Supervised Local-Feature Alignment

Shraman Pramanick: is registered as an author of this paper.
Can endorse for cs.AI, cs.CL, cs.CV, cs.LG, cs.MM.
Sayan Nag: is registered as an author of this paper.
Can endorse for cs.CV, cs.LG, cs.SD, physics.data-an.

Li **g, Jiachen Zhu, Hardik Shah, Yann LeCun and Rama Chellappa are not registered as owners of this paper.