-
Towards Scene-Text to Scene-Text Translation
Authors:
Onkar Susladkar,
Prajwal Gatti,
Anand Mishra
Abstract:
In this work, we study the task of "visually" translating scene text from a source language (e.g., English) to a target language (e.g., Chinese). Visual translation involves not just the recognition and translation of scene text but also the generation of the translated image that preserves visual features of the text, such as font, size, and background. There are several challenges associated with this task, such as interpolating font to unseen characters and preserving text size and the background. To address these, we introduce VTNet, a novel conditional diffusion-based method. To train VTNet, we create a synthetic cross-lingual dataset of 600K samples of scene text images in six popular languages, including English, Hindi, Tamil, Chinese, Bengali, and German. We evaluate the performance of VTNet through extensive experiments and comparisons to related methods. Our model also surpasses the previous state-of-the-art results on the conventional scene-text editing benchmarks. Further, we present rigorous qualitative studies to understand the strengths and shortcomings of our model. Results show that our approach generalizes well to unseen words and fonts. We firmly believe our work can benefit real-world applications, such as text translation using a phone camera and translating educational materials. Code and data will be made publicly available.
Submitted 6 August, 2023;
originally announced August 2023.
-
COFAR: Commonsense and Factual Reasoning in Image Search
Authors:
Prajwal Gatti,
Abhirama Subramanyam Penamakuri,
Revant Teotia,
Anand Mishra,
Shubhashis Sengupta,
Roshni Ramnani
Abstract:
One characteristic that makes humans superior to modern artificially intelligent models is the ability to interpret images beyond what is visually apparent. Consider the following two natural language search queries: (i) "a queue of customers patiently waiting to buy ice cream" and (ii) "a queue of tourists going to see a famous Mughal architecture in India." Interpreting these queries requires one to reason with (i) commonsense, such as interpreting people as customers or tourists and actions as waiting to buy or going to see; and (ii) factual or world knowledge associated with named visual entities, for example, whether the store in the image sells ice cream or whether the landmark in the image is a Mughal architecture located in India. Such reasoning goes beyond just visual recognition. To enable both commonsense and factual reasoning in image search, we present a unified framework, namely Knowledge Retrieval-Augmented Multimodal Transformer (KRAMT), that treats the named visual entities in an image as a gateway to encyclopedic knowledge and leverages them along with the natural language query to ground relevant knowledge. Further, KRAMT seamlessly integrates visual content and grounded knowledge to learn alignment between images and search queries. This unified framework is then used to perform image search requiring commonsense and factual reasoning. The retrieval performance of KRAMT is evaluated and compared with related approaches on a new dataset we introduce, namely COFAR. We make our code and dataset available at https://vl2g.github.io/projects/cofar
Submitted 16 October, 2022;
originally announced October 2022.
-
An Algebraic Description of the Monodromy of Log Curves
Authors:
Pietro Gatti
Abstract:
Let $k$ be an algebraically closed field of characteristic $0$. For a log curve $X/k^{\times}$ over the standard log point, we define (algebraically) a combinatorial monodromy operator on its log-de Rham cohomology group. The invariant part of this action has a cohomological description: it is the Du Bois cohomology of $X$. This can be seen as an analogue of the invariant cycles exact sequence for a semistable family (as in the complex, étale and $p$-adic settings). In the specific case in which $k=\mathbb C$ and $X$ is the central fiber of a semistable degeneration over the complex disc, our construction recovers the topological monodromy and the classical local invariant cycles theorem. In particular, our description allows an explicit computation of the monodromy operator in this setting.
Submitted 27 October, 2018;
originally announced October 2018.
-
On the length of perverse sheaves and D-modules
Authors:
Nero Budur,
Pietro Gatti,
Yongqiang Liu,
Botong Wang
Abstract:
We prove that the length function for perverse sheaves and algebraic regular holonomic D-modules on a smooth complex algebraic variety Y is an absolute Q-constructible function. One consequence is: for "any" fixed natural (derived) functor F between constructible complexes or perverse sheaves on two smooth varieties X and Y, the loci of rank one local systems L on X whose image F(L) has prescribed length are Zariski constructible subsets defined over Q, obtained from finitely many torsion-translated complex affine algebraic subtori of the moduli of rank one local systems by taking finitely many unions, intersections, and complements.
Submitted 13 March, 2019; v1 submitted 4 September, 2017;
originally announced September 2017.