Search | arXiv e-print repository

Towards Scene-Text to Scene-Text Translation

Authors: Onkar Susladkar, Prajwal Gatti, Anand Mishra

Abstract: In this work, we study the task of "visually" translating scene text from a source language (e.g., English) to a target language (e.g., Chinese). Visual translation involves not just the recognition and translation of scene text but also the generation of the translated image that preserves visual features of the text, such as font, size, and background. There are several challenges associated with this task, such as interpolating font to unseen characters and preserving text size and the background. To address these, we introduce VTNet, a novel conditional diffusion-based method. To train VTNet, we create a synthetic cross-lingual dataset of 600K samples of scene text images in six popular languages, including English, Hindi, Tamil, Chinese, Bengali, and German. We evaluate the performance of VTNet through extensive experiments and comparisons to related methods. Our model also surpasses the previous state-of-the-art results on the conventional scene-text editing benchmarks. Further, we present rigorous qualitative studies to understand the strengths and shortcomings of our model. Results show that our approach generalizes well to unseen words and fonts. We firmly believe our work can benefit real-world applications, such as text translation using a phone camera and translating educational materials. Code and data will be made publicly available.

Submitted 6 August, 2023; originally announced August 2023.

arXiv:2210.08554 [pdf, other]

COFAR: Commonsense and Factual Reasoning in Image Search

Authors: Prajwal Gatti, Abhirama Subramanyam Penamakuri, Revant Teotia, Anand Mishra, Shubhashis Sengupta, Roshni Ramnani

Abstract: One characteristic that makes humans superior to modern artificially intelligent models is the ability to interpret images beyond what is visually apparent. Consider the following two natural language search queries - (i) "a queue of customers patiently waiting to buy ice cream" and (ii) "a queue of tourists going to see a famous Mughal architecture in India." Interpreting these queries requires one to reason with (i) Commonsense such as interpreting people as customers or tourists, actions as waiting to buy or going to see; and (ii) Fact or world knowledge associated with named visual entities, for example, whether the store in the image sells ice cream or whether the landmark in the image is a Mughal architecture located in India. Such reasoning goes beyond just visual recognition. To enable both commonsense and factual reasoning in the image search, we present a unified framework, namely Knowledge Retrieval-Augmented Multimodal Transformer (KRAMT), that treats the named visual entities in an image as a gateway to encyclopedic knowledge and leverages them along with natural language query to ground relevant knowledge. Further, KRAMT seamlessly integrates visual content and grounded knowledge to learn alignment between images and search queries. This unified framework is then used to perform image search requiring commonsense and factual reasoning. The retrieval performance of KRAMT is evaluated and compared with related approaches on a new dataset we introduce - namely COFAR.
We make our code and dataset available at https://vl2g.github.io/projects/cofar

Submitted 16 October, 2022; originally announced October 2022.

Comments: Accepted in AACL-IJCNLP 2022

arXiv:1810.11627 [pdf, ps, other]

An Algebraic Description of the Monodromy of Log Curves

Authors: Pietro Gatti

Abstract: Let $k$ be an algebraically closed field of characteristic $0$. For a log curve $X/k^{\times}$ over the standard log point, we define (algebraically) a combinatorial monodromy operator on its log-de Rham cohomology group. The invariant part of this action has a cohomological description: it is the Du Bois cohomology of $X$. This can be seen as an analogue of the invariant cycles exact sequence for a semistable family (as in the complex, étale and $p$-adic settings). In the specific case in which $k=\mathbb C$ and $X$ is the central fiber of a semistable degeneration over the complex disc, our construction recovers the topological monodromy and the classical local invariant cycles theorem. In particular, our description allows an explicit computation of the monodromy operator in this setting.

Submitted 27 October, 2018; originally announced October 2018.

MSC Class: 16F40 (Primary); 16D05 (Secondary)

arXiv:1709.00876 [pdf, ps, other]

On the length of perverse sheaves and D-modules

Authors: Nero Budur, Pietro Gatti, Yongqiang Liu, Botong Wang

Abstract: We prove that the length function for perverse sheaves and algebraic regular holonomic D-modules on a smooth complex algebraic variety Y is an absolute Q-constructible function. One consequence is: for "any" fixed natural (derived) functor F between constructible complexes or perverse sheaves on two smooth varieties X and Y, the loci of rank one local systems L on X whose image F(L) has prescribed length are Zariski constructible subsets defined over Q, obtained from finitely many torsion-translated complex affine algebraic subtori of the moduli of rank one local systems via a finite sequence of taking union, intersection, and complement.

Submitted 13 March, 2019; v1 submitted 4 September, 2017; originally announced September 2017.

Comments: 14 pages. v2: minor changes, final version

Showing 1–4 of 4 results for author: Gatti, P