PALP: Prompt Aligned Personalization of Text-to-Image Models

Authors: Moab Arar, Andrey Voynov, Amir Hertz, Omri Avrahami, Shlomi Fruchter, Yael Pritch, Daniel Cohen-Or, Ariel Shamir

Abstract: Content creators often aim to create personalized images using personal subjects that go beyond the capabilities of conventional text-to-image models. Additionally, they may want the resulting image to encompass a specific location, style, ambiance, and more. Existing personalization methods may compromise personalization ability or the alignment to complex textual prompts. This trade-off can impede the fulfillment of user prompts and subject fidelity. We propose a new approach focusing on personalization methods for a single prompt to address this issue. We term our approach prompt-aligned personalization. While this may seem restrictive, our method excels in improving text alignment, enabling the creation of images with complex and intricate prompts, which may pose a challenge for current techniques. In particular, our method keeps the personalized model aligned with a target prompt using an additional score distillation sampling term. We demonstrate the versatility of our method in multi- and single-shot settings and further show that it can compose multiple subjects or use inspiration from reference images, such as artworks. We compare our approach quantitatively and qualitatively with existing baselines and state-of-the-art techniques.

Submitted 11 January, 2024; originally announced January 2024.

Comments: Project page available at https://prompt-aligned.github.io/

arXiv:2311.17609 [pdf, other]

AnyLens: A Generative Diffusion Model with Any Rendering Lens

Authors: Andrey Voynov, Amir Hertz, Moab Arar, Shlomi Fruchter, Daniel Cohen-Or

Abstract: State-of-the-art diffusion models can generate highly realistic images based on various conditioning like text, segmentation, and depth. However, an essential aspect often overlooked is the specific camera geometry used during image capture. The influence of different optical systems on the final scene appearance is frequently overlooked. This study introduces a framework that intimately integrates a text-to-image diffusion model with the particular lens geometry used in image rendering. Our method is based on a per-pixel coordinate conditioning method, enabling the control over the rendering geometry. Notably, we demonstrate the manipulation of curvature properties, achieving diverse visual effects, such as fish-eye, panoramic views, and spherical texturing using a single diffusion model.

Submitted 29 November, 2023; originally announced November 2023.

arXiv:2311.10093 [pdf, other]

The Chosen One: Consistent Characters in Text-to-Image Diffusion Models

Authors: Omri Avrahami, Amir Hertz, Yael Vinker, Moab Arar, Shlomi Fruchter, Ohad Fried, Daniel Cohen-Or, Dani Lischinski

Abstract: Recent advances in text-to-image generation models have unlocked vast potential for visual creativity. However, the users that use these models struggle with the generation of consistent characters, a crucial aspect for numerous real-world applications such as story visualization, game development, asset design, advertising, and more. Current methods typically rely on multiple pre-existing images of the target character or involve labor-intensive manual processes. In this work, we propose a fully automated solution for consistent character generation, with the sole input being a text prompt. We introduce an iterative procedure that, at each stage, identifies a coherent set of images sharing a similar identity and extracts a more consistent identity from this set. Our quantitative analysis demonstrates that our method strikes a better balance between prompt alignment and identity consistency compared to the baseline methods, and these findings are reinforced by a user study. To conclude, we showcase several practical applications of our approach.

Submitted 5 June, 2024; v1 submitted 16 November, 2023; originally announced November 2023.

Comments: Accepted to SIGGRAPH 2024. Project page is available at https://omriavrahami.com/the-chosen-one/

arXiv:2307.06925 [pdf, other]

Domain-Agnostic Tuning-Encoder for Fast Personalization of Text-To-Image Models

Authors: Moab Arar, Rinon Gal, Yuval Atzmon, Gal Chechik, Daniel Cohen-Or, Ariel Shamir, Amit H. Bermano

Abstract: Text-to-image (T2I) personalization allows users to guide the creative image generation process by combining their own visual concepts in natural language prompts. Recently, encoder-based techniques have emerged as a new effective approach for T2I personalization, reducing the need for multiple images and long training times. However, most existing encoders are limited to a single-class domain, which hinders their ability to handle diverse concepts. In this work, we propose a domain-agnostic method that does not require any specialized dataset or prior information about the personalized concepts. We introduce a novel contrastive-based regularization technique to maintain high fidelity to the target concept characteristics while keeping the predicted embeddings close to editable regions of the latent space, by pushing the predicted tokens toward their nearest existing CLIP tokens. Our experimental results demonstrate the effectiveness of our approach and show how the learned tokens are more semantic than tokens predicted by unregularized models. This leads to a better representation that achieves state-of-the-art performance while being more flexible than previous methods.

Submitted 13 July, 2023; originally announced July 2023.

Comments: Project page at https://datencoder.github.io

arXiv:2304.05177 [pdf, other]

Bounds on non-linear errors for variance computation with stochastic rounding

Authors: E M El Arar, D Sohier, P de Oliveira Castro, E Petit

Abstract: The main objective of this work is to investigate non-linear errors and pairwise summation using stochastic rounding (SR) in variance computation algorithms. We estimate the forward error of computations under SR through two methods: the first is based on a bound of the variance and the Bienaymé-Chebyshev inequality, while the second is based on martingales and the Azuma-Hoeffding inequality. The study shows that for pairwise summation, using SR results in a probabilistic bound of the forward error proportional to log(n)u rather than the deterministic bound in O(log(n)u) when using the default rounding mode. We examine two algorithms that compute the variance, called "textbook" and "two-pass", which both exhibit non-linear errors. Using the two methods mentioned above, we show that these algorithms' forward errors have probabilistic bounds under SR in $O(\sqrt{n}u)$ instead of $nu$ for the deterministic bounds. We show that this advantage holds using pairwise summation for both textbook and two-pass, with probabilistic bounds of the forward error proportional to log(n)u.

Submitted 11 April, 2023; originally announced April 2023.
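
The abstract above names three building blocks: pairwise summation and the "textbook" and "two-pass" variance algorithms. A minimal Python sketch of their usual formulations follows. It is an illustration under stated assumptions, not the paper's code: the function names are hypothetical, the population normalization (division by n) is an assumption, and stochastic rounding itself is not simulated, since Python floats use the default round-to-nearest mode.

```python
def pairwise_sum(values):
    """Sum by recursively splitting the list in half (pairwise summation)."""
    n = len(values)
    if n == 0:
        return 0.0
    if n == 1:
        return values[0]
    mid = n // 2
    return pairwise_sum(values[:mid]) + pairwise_sum(values[mid:])


def variance_textbook(xs, summation=sum):
    """One-pass 'textbook' formula: (sum of squares - (sum)^2 / n) / n.

    Dividing by n (population variance) is an assumption of this sketch.
    """
    n = len(xs)
    total = summation(xs)
    total_sq = summation([x * x for x in xs])
    return (total_sq - total * total / n) / n


def variance_two_pass(xs, summation=sum):
    """Two-pass formula: compute the mean first, then sum squared deviations."""
    n = len(xs)
    mean = summation(xs) / n
    return summation([(x - mean) ** 2 for x in xs]) / n


xs = [1.0, 2.0, 3.0, 4.0]
print(variance_textbook(xs))                          # 1.25
print(variance_two_pass(xs, summation=pairwise_sum))  # 1.25
```

Passing the summation routine as a parameter makes it possible to compare recursive (plain `sum`) and pairwise accumulation inside either variance algorithm without duplicating code.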

arXiv:2302.12228 [pdf, other]

Encoder-based Domain Tuning for Fast Personalization of Text-to-Image Models

Authors: Rinon Gal, Moab Arar, Yuval Atzmon, Amit H. Bermano, Gal Chechik, Daniel Cohen-Or

Abstract: Text-to-image personalization aims to teach a pre-trained diffusion model to reason about novel, user-provided concepts, embedding them into new scenes guided by natural language prompts. However, current personalization approaches struggle with lengthy training times, high storage requirements or loss of identity. To overcome these limitations, we propose an encoder-based domain-tuning approach. Our key insight is that by underfitting on a large set of concepts from a given domain, we can improve generalization and create a model that is more amenable to quickly adding novel concepts from the same domain. Specifically, we employ two components: First, an encoder that takes as an input a single image of a target concept from a given domain, e.g. a specific face, and learns to map it into a word-embedding representing the concept. Second, a set of regularized weight-offsets for the text-to-image model that learn how to effectively ingest additional concepts. Together, these components are used to guide the learning of unseen concepts, allowing us to personalize a model using only a single image and as few as 5 training steps - accelerating personalization from dozens of minutes to seconds, while preserving quality.

Submitted 5 March, 2023; v1 submitted 23 February, 2023; originally announced February 2023.

Comments: Project page at https://tuning-encoder.github.io/

arXiv:2302.05905 [pdf, other]

Single Motion Diffusion

Authors: Sigal Raab, Inbal Leibovitch, Guy Tevet, Moab Arar, Amit H. Bermano, Daniel Cohen-Or

Abstract: Synthesizing realistic animations of humans, animals, and even imaginary creatures has long been a goal for artists and computer graphics professionals. Compared to the imaging domain, which is rich with large available datasets, the number of data instances for the motion domain is limited, particularly for the animation of animals and exotic creatures (e.g., dragons), which have unique skeletons and motion patterns. In this work, we present a Single Motion Diffusion Model, dubbed SinMDM, a model designed to learn the internal motifs of a single motion sequence with arbitrary topology and synthesize motions of arbitrary length that are faithful to them. We harness the power of diffusion models and present a denoising network explicitly designed for the task of learning from a single input motion. SinMDM is designed to be a lightweight architecture, which avoids overfitting by using a shallow network with local attention layers that narrow the receptive field and encourage motion diversity. SinMDM can be applied in various contexts, including spatial and temporal in-betweening, motion expansion, style transfer, and crowd animation. Our results show that SinMDM outperforms existing methods both in quality and time-space efficiency. Moreover, while current approaches require additional training for different applications, our work facilitates these applications at inference time. Our code and trained models are available at https://sinmdm.github.io/SinMDM-page.

Submitted 13 June, 2023; v1 submitted 12 February, 2023; originally announced February 2023.

Comments: Video: https://www.youtube.com/watch?v=zuWpVTgb_0U, Project page: https://sinmdm.github.io/SinMDM-page, Code: https://github.com/SinMDM/SinMDM

arXiv:2112.11435 [pdf, other]

Learned Queries for Efficient Local Attention

Authors: Moab Arar, Ariel Shamir, Amit H. Bermano

Abstract: Vision Transformers (ViT) serve as powerful vision models. Unlike convolutional neural networks, which dominated vision research in previous years, vision transformers enjoy the ability to capture long-range dependencies in the data. Nonetheless, an integral part of any transformer architecture, the self-attention mechanism, suffers from high latency and inefficient memory utilization, making it less suitable for high-resolution input images. To alleviate these shortcomings, hierarchical vision models locally employ self-attention on non-interleaving windows. This relaxation reduces the complexity to be linear in the input size; however, it limits the cross-window interaction, hurting the model performance. In this paper, we propose a new shift-invariant local attention layer, called query and attend (QnA), that aggregates the input locally in an overlapping manner, much like convolutions. The key idea behind QnA is to introduce learned queries, which allow fast and efficient implementation. We verify the effectiveness of our layer by incorporating it into a hierarchical vision transformer model. We show improvements in speed and memory complexity while achieving comparable accuracy with state-of-the-art models. Finally, our layer scales especially well with window size, requiring up to 10x less memory while being up to 5x faster than existing methods. The code is publicly available at https://github.com/moabarar/qna.

Submitted 19 April, 2022; v1 submitted 21 December, 2021; originally announced December 2021.

Comments: CVPR 2022 - Oral

arXiv:2104.03843 [pdf, other]

InAugment: Improving Classifiers via Internal Augmentation

Authors: Moab Arar, Ariel Shamir, Amit Bermano

Abstract: Image augmentation techniques apply transformation functions such as rotation, shearing, or color distortion on an input image. These augmentations were proven useful in improving neural networks' generalization ability. In this paper, we present a novel augmentation operation, InAugment, that exploits image internal statistics. The key idea is to copy patches from the image itself, apply augmentation operations on them, and paste them back at random positions on the same image. This method is simple and easy to implement and can be incorporated with existing augmentation techniques. We test InAugment on two popular datasets, CIFAR and ImageNet. We show improvement over state-of-the-art augmentation techniques. Incorporating InAugment with Auto Augment yields a significant improvement over other augmentation techniques (e.g., +1% improvement over multiple architectures trained on the CIFAR dataset). We also demonstrate an increase in the top-1 accuracy of ResNet50 and EfficientNet-B3 on the ImageNet dataset compared to prior augmentation methods. Finally, our experiments suggest that training convolutional neural networks using InAugment improves not only the model's accuracy and confidence but also its performance on out-of-distribution images.

Submitted 8 April, 2021; originally announced April 2021.
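
The copy, augment, and paste idea described in the abstract above can be sketched in a few lines of plain Python. This is a rough illustration, not the authors' implementation: the function name `in_augment`, its parameters, the square patch shape, and the default horizontal-flip transform are all assumptions, and the image is modeled as a list of rows rather than a tensor.

```python
import random


def in_augment(image, patch_size, num_patches, rng=None, transform=None):
    """Copy square patches from `image`, transform them, paste them back.

    `image` is a list of equally long rows. `transform` must return a patch
    of the same size; the default (a horizontal flip) is an assumption.
    Patches are always read from the unmodified input image.
    """
    rng = rng or random.Random()
    height, width = len(image), len(image[0])
    if patch_size > height or patch_size > width:
        raise ValueError("patch_size must fit inside the image")
    if transform is None:
        transform = lambda patch: [row[::-1] for row in patch]

    out = [row[:] for row in image]  # leave the input untouched
    for _ in range(num_patches):
        # Pick a source patch from the original image.
        sy = rng.randrange(height - patch_size + 1)
        sx = rng.randrange(width - patch_size + 1)
        patch = [row[sx:sx + patch_size] for row in image[sy:sy + patch_size]]
        patch = transform(patch)
        # Paste the transformed patch at a random destination.
        dy = rng.randrange(height - patch_size + 1)
        dx = rng.randrange(width - patch_size + 1)
        for i, row in enumerate(patch):
            out[dy + i][dx:dx + patch_size] = row
    return out


image = [[r * 4 + c for c in range(4)] for r in range(4)]
augmented = in_augment(image, patch_size=2, num_patches=3, rng=random.Random(0))
```

Because every pasted pixel comes from the same image, the output keeps the input's shape and value range, which is what lets the operation be chained with other augmentations.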

arXiv:2007.07723 [pdf, other]

Focus-and-Expand: Training Guidance Through Gradual Manipulation of Input Features

Authors: Moab Arar, Noa Fish, Dani Daniel, Evgeny Tenetov, Ariel Shamir, Amit Bermano

Abstract: We present a simple and intuitive Focus-and-eXpand (FaX) method to guide the training process of a neural network towards a specific solution. Optimizing a neural network is a highly non-convex problem. Typically, the space of solutions is large, with numerous possible local minima, where reaching a specific minimum depends on many factors. In many cases, however, a solution which considers specific aspects, or features, of the input is desired. For example, in the presence of bias, a solution that disregards the biased feature is a more robust and accurate one. Drawing inspiration from Parameter Continuation methods, we propose steering the training process to consider specific features in the input more than others, through gradual shifts in the input domain. FaX extracts a subset of features from each input data-point, and exposes the learner to these features first, Focusing the solution on them. Then, by using a blending/mixing parameter $α$ it gradually eXpands the learning process to include all features of the input. This process encourages the consideration of the desired features more than others. Though not restricted to this field, we quantitatively evaluate the effectiveness of our approach on various Computer Vision tasks, and achieve state-of-the-art bias removal, improvements to an established augmentation method, and two examples of improvements to image classification tasks. Through these few examples we demonstrate the impact this approach potentially carries for a wide variety of problems, which stand to gain from understanding the solution landscape.

Submitted 15 July, 2020; originally announced July 2020.

arXiv:2003.08073 [pdf, other]

Unsupervised Multi-Modal Image Registration via Geometry Preserving Image-to-Image Translation

Authors: Moab Arar, Yiftach Ginger, Dov Danon, Ilya Leizerson, Amit Bermano, Daniel Cohen-Or

Abstract: Many applications, such as autonomous driving, heavily rely on multi-modal data where spatial alignment between the modalities is required. Most multi-modal registration methods struggle to compute the spatial correspondence between the images using prevalent cross-modality similarity measures. In this work, we bypass the difficulties of developing cross-modality similarity measures by training an image-to-image translation network on the two input modalities. This learned translation allows training the registration network using simple and reliable mono-modality metrics. We perform multi-modal registration using two networks - a spatial transformation network and a translation network. We show that by encouraging our translation network to be geometry preserving, we manage to train an accurate spatial transformation network. Compared to state-of-the-art multi-modal methods, our presented method is unsupervised, requiring no pairs of aligned modalities for training, and can be adapted to any pair of modalities. We evaluate our method quantitatively and qualitatively on commercial datasets, showing that it performs well on several modalities and achieves accurate alignment.

Submitted 18 March, 2020; originally announced March 2020.

arXiv:1904.08475 [pdf, other]

Image Resizing by Reconstruction from Deep Features

Authors: Moab Arar, Dov Danon, Daniel Cohen-Or, Ariel Shamir

Abstract: Traditional image resizing methods usually work in pixel space and use various saliency measures. The challenge is to adjust the image shape while trying to preserve important content. In this paper we perform image resizing in feature space where the deep layers of a neural network contain rich important semantic information. We directly adjust the image feature maps, extracted from a pre-trained classification network, and reconstruct the resized image using a neural-network based optimization. This novel approach leverages the hierarchical encoding of the network, and in particular, the high-level discriminative power of its deeper layers, that recognizes semantic objects and regions and allows maintaining their aspect ratio. Our use of reconstruction from deep features diminishes the artifacts introduced by image-space resizing operators. We evaluate our method on benchmarks, compare to alternative approaches, and demonstrate its strength on challenging images.

Submitted 22 June, 2021; v1 submitted 17 April, 2019; originally announced April 2019.

Comments: 13 pages, 21 figures

arXiv:1711.06625 [pdf, ps, other]

Dynamic Matching: Reducing Integral Algorithms to Approximately-Maximal Fractional Algorithms

Authors: Moab Arar, Shiri Chechik, Sarel Cohen, Cliff Stein, David Wajc

Abstract: We present a simple randomized reduction from fully-dynamic integral matching algorithms to fully-dynamic "approximately-maximal" fractional matching algorithms. Applying this reduction to the recent fractional matching algorithm of Bhattacharya, Henzinger, and Nanongkai (SODA 2017), we obtain a novel result for the integral problem. Specifically, our main result is a randomized fully-dynamic $(2+ε)$-approximate integral matching algorithm with small polylog worst-case update time. For the $(2+ε)$-approximation regime only a fractional fully-dynamic $(2+ε)$-matching algorithm with worst-case polylog update time was previously known, due to Bhattacharya et al. (SODA 2017). Our algorithm is the first algorithm that maintains approximate matchings with worst-case update time better than polynomial, for any constant approximation ratio. As a consequence, we also obtain the first constant-approximate worst-case polylogarithmic update time maximum weight matching algorithm.

Submitted 27 February, 2018; v1 submitted 17 November, 2017; originally announced November 2017.

ACM Class: F.2.2

arXiv:1711.05444 [pdf, other]

Robust Real-Time Multi-View Eye Tracking

Authors: Nuri Murat Arar, Jean-Philippe Thiran

Abstract: Despite significant advances in improving the gaze tracking accuracy under controlled conditions, the tracking robustness under real-world conditions, such as large head pose and movements, use of eyeglasses, illumination and eye type variations, remains a major challenge in eye tracking. In this paper, we revisit this challenge and introduce a real-time multi-camera eye tracking framework to improve the tracking robustness. First, differently from previous work, we design a multi-view tracking setup that allows for acquiring multiple eye appearances simultaneously. Leveraging multi-view appearances enables more reliable detection of gaze features under challenging conditions, particularly when they are obstructed in conventional single-view appearance due to large head movements or eyewear effects. The features extracted on various appearances are then used for estimating multiple gaze outputs. Second, we propose to combine estimated gaze outputs through an adaptive fusion mechanism to compute the user's overall point of regard. The proposed mechanism first determines the estimation reliability of each gaze output according to the user's momentary head pose and predicted gazing behavior, and then performs a reliability-based weighted fusion. We demonstrate the efficacy of our framework with extensive simulations and user experiments on a collected dataset featuring 20 subjects. Our results show that in comparison with state-of-the-art eye trackers, the proposed framework provides not only a significant enhancement in accuracy but also a notable robustness. Our prototype system runs at 30 frames-per-second (fps) and achieves 1 degree accuracy under challenging experimental scenarios, which makes it suitable for applications demanding high accuracy and robustness.

Submitted 3 January, 2018; v1 submitted 15 November, 2017; originally announced November 2017.

Comments: Organisational changes in the main msp and supplementary info. Results unchanged. Main msp: 14 pages, 15 figures. Supplementary: 2 tables, 1 figure. Under review for an IEEE transactions publication

arXiv:1401.6759 [pdf]

doi 10.7321/jscse.v3.n3.91

Modeling the behavior of reinforced concrete walls under fire, considering the impact of the span on firewalls

Authors: Nadia Otmani Benmehidi, Meriem Arar, Imene Chine

Abstract: Numerical modeling using computers is known to present several advantages compared to experimental testing. The high cost and the amount of time required to prepare and to perform a test were among the main problems on the table when the first tools for modeling structures in fire were developed. The discipline of structures-in-fire modeling is still the subject of important research efforts around the world, and those efforts have led to the development of many software tools. In this paper, we study the fire behavior and the impact of the span of reinforced concrete walls with different sections belonging to a residential building braced by a system composed of porticoes and sails. Regarding the design and mechanical loading (compression forces and moments) exerted on the walls in question, we rely on the results of a study conducted under cold conditions. For this study we use the software Safir, which follows the Eurocode rules. It was found that loading, heating, and sizing play a capital role in the state of failed walls. Our results justify well the use of reinforced concrete walls acting as a firewall. Their role is to limit the spread of fire from one structure to another structure nearby, since we obtain fire resistance reaching more than 10 hours depending on the loading considered.

Submitted 27 January, 2014; originally announced January 2014.

Comments: 8 pages, 12 figures, 4 tables

Journal ref: International Journal of Soft Computing And Software Engineering (JSCSE), Vol. 3, No. 3, pp. 600-607, 2013

Showing 1–15 of 15 results for author: Arar, M