Controlling Face’s Frame generation in StyleGAN’s latent space operations

Modifying faces to deceive our memory

Agustín Roca
Nicolás Britos
Supervisor: Rodrigo Ramele

Final Project
Computer Engineering Degree

Abstract

Innocence Project is a non-profit organization that works to reduce wrongful convictions. In collaboration with Laboratorio de Sueño y Memoria from Instituto Tecnológico de Buenos Aires (ITBA), they are studying human memory in the context of face identification. Their working hypothesis is that human memory relies heavily on the face’s frame to recognize faces. If this is proved, it could mean that face recognition in police lineups cannot be trusted, as it may lead to wrongful convictions. The study tests this hypothesis through experiments that use faces with different properties, such as eye size, while maintaining the frame as much as possible.
In this project, we continue the work from a previous project [1] that provided the basic tool to generate realistic faces using StyleGAN2. We take a deep dive into the internals of this tool to make full use of StyleGAN2 functionalities, while also adding more features needed by the Lab, such as projecting a face from a target face image or modifying some of its attributes, including mouth-opening or eye-opening.
As the use of this tool relies heavily on maintaining the face-frame, we develop a way to identify the face-frame of each image and a function to compare it to the output of the neural network after applying operations such as modifying the eye-opening. The objective is to obtain a numeric value that measures how much a given operation changes the face-frame and how much of it is maintained, giving a clearer perspective on which face images may be better candidates for generating false memories.
We conclude that the face-frame is maintained when modifying eye-opening or mouth-opening. Modifying vertical face orientation, gender, age and smile has a considerable impact on the frame variation. Finally, the horizontal face orientation has a major impact on the face-frame. The Lab may therefore apply some operations with confidence that the face-frame will not change significantly, making them viable for deceiving subjects’ memories.

Keywords – StyleGAN, GAN, Generative Image Modeling, Innocence Project, False Memories, Laboratorio de Sueño y Memoria, ITBA, Face-Frame, Face Landmarks, Image Segmentation, Image Processing

Introduction

Understanding how our brains store and recover memories can be really powerful. Marketing, psychology, design, biology, medicine and even computer science could achieve great breakthroughs by knowing how our brains remember things. Better advertisements [2], a more solid knowledge base to fight Alzheimer’s disease [3], dealing with traumas efficiently [4] or creating new models of information based on it [5] are all potential consequences of understanding memories in our brains. However, we also know that our memory is malleable [6]. It can be manipulated and affected by external factors, and it can deceive us [7].
With this in mind, it is natural to doubt the accuracy of human memory. This is the main reason why Innocence Project is doing its research. They are investigating cases where eyewitness testimony was the main evidence for a conviction. They theorize that many of these cases may be the result of corruption, manipulation, or honest mistakes when identifying the culprit of the crime. In 1983, around 52% of wrongful convictions were the result of eyewitness mistakes [8].
Innocence Project Argentina partnered with Laboratorio de Sueño y Memoria from Instituto Tecnológico de Buenos Aires (ITBA) to research how humans remember faces, which key factors our brains use to identify faces, and how accurate human memory is for face identification. The experiments the Lab is conducting consist of simulating a crime that a subject witnesses. Later, the subject is asked to recognize the perpetrator in a lineup where the criminal may or may not be present. The objective of the experiment is to determine which attributes are common to the real criminal and the identified person.
During this research, they theorized that the contour of the face, the hair and the ears are the things that our memory first notices in a face. This set of properties is called the "face-frame".
Currently, the Lab is using a tool to generate faces similar to the criminal in the experiment [1]. This tool is based on NVIDIA’s StyleGAN2 [9], a Generative Adversarial Network (GAN). A GAN is a deep-learning-based model that can create synthetic data that is hard to distinguish from real data. NVIDIA adapted the architecture of this model to create StyleGAN2, which is capable of generating realistic images of faces; this is why the Lab uses it.
Although the tool used by the Lab can generate images of faces, it is still missing some key functionalities needed to take full advantage of StyleGAN2, such as changing a face’s attributes using the very same neural network, or mapping an existing face to one generated by the network.
The project can thus be divided into two main parts. First, we implement the missing functionalities that StyleGAN2 offers: style mixing and projecting an image into the latent space. These functionalities provide additional tools for the Lab to generate faces, especially when there are specific requirements about how the generated face should look. In addition, we use some latent directions known by the community [10] to modify some attributes of the face. These attributes include eye-opening, mouth-opening, smile, face orientation, apparent age and gender, among others.
The second part consists of measuring how much the face differs from the original one after modifying it with these functionalities. This also allows us to find a way to project an image while maintaining its frame as much as possible. Style mixing is not part of the experiments, as it only changes the color scheme and microstructure of the image [11].
In this work, Section 2 first presents the main stakeholders of this investigation, a brief introduction to human memory, Generative Adversarial Networks, how the faces they generate can be modified within the same network (specifically in StyleGAN2), and this project’s initial status. Sections 3 and 4 give the definitions and functions used for our measurements, the faces we used and the setup required. Section 5 explains the experiments, and Section 6 presents the results, focusing on the face-frame deviation when moving across some of the neural network’s latent directions. Section 7 discusses the importance and significance of these results and how they may be used. Finally, Sections 8 and 9 conclude this work with future work and acknowledgements.

Review of similar works

Innocence Project

Innocence Project is one of the 69 organizations in the Innocence Network, which work to free innocent people from jail and to prevent wrongful convictions. This work may involve legal services or investigation of the case presented [12].
In the United States, the Innocence Project has reported over 375 DNA exonerations [13]. The causes of wrongful conviction that the organization is aware of include eyewitness identification, false confessions, forensic science errors and poor defense, among others [12].
Innocence Project Argentina is part of the Innocence Network and works on Argentinian cases. This organization is the one that got in touch with Laboratorio de Sueño y Memoria from Instituto Tecnológico de Buenos Aires (ITBA).
The Lab does research in neuroscience. Some of the areas they are working on are: activation of memories while sleeping, sleep paralysis, brain wave processing and false memory formation [14].

Memories

In 2019, Zlotnik and Vansintjan defined memory as "the capacity to store and retrieve information" [15]. The process of memory formation has different phases.
The first of these phases is acquisition, in which sensory stimuli are encoded into neurochemical representations. The second phase is consolidation, a period in which the memory is stabilized so that it persists through time. The retrieval phase is where the information in a memory can be recovered. A consolidated memory can go through a re-consolidation phase, in which its information can be modified before it is re-stabilized [16].
This last phase plays a key role in the experiments of the Lab. There are several theories about how false memories are created. Among them is the activation-monitoring theory, which explains the creation of false memories using two principles. The first came as the conclusion of an experiment that consists of giving the subject a list of words and later asking whether a certain word was in the list. It was discovered that if the words in the list were related somehow, and the word asked about later followed the same topic, the subject would think the word was in the list even if that was not the case. The second principle states that if a person cannot remember the source of the information of interest, that person may create a false memory [17].

Generative Adversarial Networks

A Generative Adversarial Network (GAN) is a deep-learning-based model introduced by Goodfellow [18] that can create synthetic data that is hard to distinguish from real data. The model has two neural networks (a generator and a discriminator) and a training data set.
The generator’s objective is to create data as similar as possible to the training data set. The discriminator’s objective is to tell which data comes from the generator and which comes from the training data set, having only the data as input. This way, the generator and the discriminator compete against each other to obtain better results.
If the discriminator fails to tell the source of the data, it means the generator was able to fool the discriminator. In this case, the discriminator will be the one learning from its mistake. If the discriminator correctly tells the source of the data, the generator is the one that will be learning. When the training is complete, the generator network is the one used to keep generating data.

Figure 2.1: GAN architecture.

In our case, the data are images of faces. During training, the generator learns the structure of these images, mapping them to a space $\mathbb{R}^{n}$ called the latent space. The generator thus takes as input a point in $\mathbb{R}^{n}$, usually called the latent code or $z$, and outputs the image corresponding to that point of the latent space.

Loss function

The generator ($G$) and the discriminator ($D$) compete against each other in a two-player minimax game. $G$ receives an input $z$ and returns a synthetic image. $D$ receives an image and tries to tell whether it is a real image or a synthetic one. This competition can be described with the following function [18].

$$\min_{G}\max_{D}\; \mathbb{E}_{x\sim p_{\text{data}}(x)}[\log D(x)] + \mathbb{E}_{z\sim p_{z}(z)}[\log(1 - D(G(z)))] \tag{2.1}$$
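As a rough numerical illustration of the objective in [18], the value $\mathbb{E}[\log D(x)] + \mathbb{E}[\log(1 - D(G(z)))]$ can be estimated from a batch of discriminator outputs. The sketch below assumes NumPy; the function name `gan_value` and the sample probabilities are illustrative only and do not come from any referenced implementation.

```python
import numpy as np

def gan_value(d_real, d_fake, eps=1e-12):
    """Monte Carlo estimate of E[log D(x)] + E[log(1 - D(G(z)))].

    d_real: discriminator outputs (probabilities) on real images.
    d_fake: discriminator outputs (probabilities) on synthetic images.
    """
    # Clip to avoid log(0) when the discriminator is fully confident.
    d_real = np.clip(np.asarray(d_real, dtype=float), eps, 1.0 - eps)
    d_fake = np.clip(np.asarray(d_fake, dtype=float), eps, 1.0 - eps)
    return float(np.mean(np.log(d_real)) + np.mean(np.log(1.0 - d_fake)))

# A discriminator that cannot tell the two sources apart outputs 0.5 everywhere.
undecided = gan_value([0.5, 0.5], [0.5, 0.5])
# A confident, correct discriminator scores real images high and synthetic ones low.
confident = gan_value([0.9, 0.95], [0.1, 0.05])
```

The discriminator tries to push this value up (as in `confident`), while the generator tries to push it down by making its outputs score like real images.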

Wasserstein GAN (WGAN)

WGAN [19] is an alternative to the traditional GAN algorithm. It proposes a new loss function based on the Wasserstein distance, whereas the original GAN loss function corresponds to the Jensen-Shannon divergence. This change improves the stability of learning and mitigates problems of the traditional GAN such as mode collapse.

Deep Convolutional GAN (DCGAN)

DCGANs [20], in comparison with traditional GANs, change the architecture of the generator and discriminator, using convolutional networks for both of them. The most important benefit of this model is its training stability.

Conditional GAN (cGAN)

In the traditional GAN, there is no control over the outputs of the generator. This is why cGANs [21] were introduced. A cGAN is an extension of GAN that adds labels to the input in order to influence the generator and discriminator networks. This gives some control over the synthetic images the generator creates. It also opens the possibility of generating new tags for an image.

Auxiliary Classifier GAN (AC-GAN)

AC-GAN [22] is an extension of cGANs that modifies the discriminator to also predict the label (also known as class or auxiliary classifier) of the input image.

Information Maximizing GAN (InfoGAN)

An InfoGAN [23] is an extension to GANs that is based on information theory. InfoGANs are able to learn disentangled representations in a completely unsupervised manner. These representations are competitive with the ones learned with supervised methods.

Pix2Pix

Pix2Pix [24] is software that uses a cGAN for image-to-image translation. It can be used for different purposes. Some of its known applications are labels-to-street-scene, aerial-to-map, labels-to-facade, day-to-night and edges-to-photo.

Figure 2.2: An example from an edges-to-shoe image translator that uses Pix2Pix.

Stacked GAN (StackGAN)

StackGAN [25] proposes an architecture that chains multiple GANs together to create photo-realistic images. The output of one GAN is the input of the next one. The idea is to increase the resolution of the output image with each GAN it passes through. When introduced, it was tested on text-to-image translation, giving some interesting results, as seen in Figure 2.3.

Figure 2.3: Results from StackGAN paper [25].

Cycle-Consistent GAN (CycleGAN)

CycleGAN [26] is an extension to GAN that connects two GANs in a cycle. Its objective is to obtain two mapping functions that translate from one set of images to another. For example, some of the translations can be horses-to-zebras, artist-to-photo and winter-to-summer, and their respective inverses.

Progressive Growing GAN (PGGAN)

PGGAN [27] introduces a new training methodology for GANs. It first trains a GAN to generate 4x4 images, then adds another layer to both the generator and the discriminator to generate 8x8 images, and so on until the desired resolution is reached (the paper goes up to 1024x1024). The idea behind the algorithm is to first learn large-scale structure and later focus on the fine details. This methodology reduces training time and obtains more realistic results.

Style-Based GAN (StyleGAN)

StyleGAN [9] is an extension of PGGAN that modifies the generator architecture. Earlier GANs used the vector $z$ directly, whereas StyleGAN first maps it to a new vector $w$. This new vector $w$ is divided into per-layer inputs, which are used together with Gaussian noise vectors as inputs to the different layers of StyleGAN’s generator.

Figure 2.4: Traditional GAN architecture vs StyleGAN architecture. [9]

Facial properties modification within GAN

In 2020, Erik Härkönen, Aaron Hertzmann, et al. released a paper titled GANSpace: Discovering Interpretable GAN Controls [28], showing how to find latent directions that modify the properties of the objects in the image generated by the neural network. The approach is mainly based on Principal Component Analysis (PCA), which provides a much faster way to extract meaningful latent directions than manual supervision or expensive optimization. The results of this paper can be applied to StyleGAN2’s architecture, paving the way to modify facial attributes of a generated image within the neural network itself.
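The PCA idea can be sketched in a few lines, assuming NumPy. This is a simplified illustration and not the GANSpace implementation: the synthetic `samples` array stands in for latent vectors sampled from a trained network, and the names `principal_directions` and `move_along` are hypothetical.

```python
import numpy as np

def principal_directions(latents, k):
    """Return the top-k principal directions of a (num_samples, dim) array."""
    centered = latents - latents.mean(axis=0, keepdims=True)
    # Rows of vt are orthonormal directions sorted by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:k]

def move_along(latent, direction, amount):
    """Shift a latent vector along a direction by the given amount."""
    return latent + amount * direction

rng = np.random.default_rng(0)
# Synthetic stand-in for sampled latent vectors (assumption, not real data):
# each dimension has a different spread so PCA has structure to find.
samples = rng.normal(size=(1000, 16)) * np.linspace(3.0, 0.1, 16)
directions = principal_directions(samples, k=3)
edited = move_along(samples[0], directions[0], amount=2.0)
```

In practice, each candidate direction still has to be inspected visually to decide which attribute, if any, it controls.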

Face editing with GAN

Another paper, released in 2022, focuses on face editing with GANs by moving along a direction in the latent space. Face editing with GAN’s – A Review [29], written by Parthak Mehta, Sarthak Mishra, et al., shows how other classification models, such as logistic regression or SVM, can be used to find the directions that represent a feature in the trained StyleGAN model. They also describe a way to find a latent vector whose generated image is similar to a target one. This way, a real portrait can be used as input to find a similar face generated by the neural network, whose attributes can then be modified using the directions that represent a feature.

Initial project status

This project continues the work of Jimena Lozano and Maite Herrán Oyhanarte on using GANs to create and manipulate facial images [1]. Their tool was intended as software to generate realistic facial images with the ability to manipulate their main characteristics, such as eye size and separation, nose size, etc. The resulting software can generate new random, artificial facial images from scratch and generate transitions between two generated pictures. The StyleGAN2 neural network was the core of this software, paired with an API built in Python and a web application built in React. Together, these made possible an initial version of software that enabled the Lab to improve the preparation workflow for its experiments.

Face-frame stabilization in image projection to latent space

Face-frame variation measurement

There is no standardized way of defining a face-frame. We define it as the set of characteristics of a face that give context to its internal characteristics. A face-frame includes the contour of the face, the hair and the ears. It does not include the eyes, mouth, nose or eyebrows. For example, Figure 3.1 shows two faces with the same frame. The hypothesis of the Lab is that these two faces may be mistaken for each other because they share the same frame.

Figure 3.1: Two different faces with the same face-frame.

Having this in mind, we measure the variation of the face-frame between two images based on the position of the face and hair in the image.

Image segmentation

The first step is to identify the face and hair in an arbitrary image. This is a simple segmentation problem: separating the pixels of an image into three groups, namely face, hair and background. For this, we use an open-source pretrained neural network [30]. This network gave satisfactory results for different faces and for images taken under different conditions, which is essential for the measurement because no parameters need to be adjusted for different images. It is worth mentioning that, although the neck is not part of the face-frame, the Lab considers it better to preserve it. Therefore, for the purposes of this work, it is acceptable that the neural network classifies the neck as part of the face. Figure 3.2 shows some examples of segmentations made by this network: pixels classified as hair are shown in yellow, pixels classified as face in blue, and pixels classified as background in purple.

Figure 3.2: Segmentation made by the neural network.

Face-frame variation formula

Once each pixel of each of the two images is classified, we compare the classifications of the pixels in the same position of the two images. Let $f$ be a function that compares the classifications of two pixels,

$$f(c_{1},c_{2})=\begin{cases}0 & c_{1}=c_{2}\\ 0.2 & c_{1}=\text{"Face"}\land c_{2}=\text{"Hair"}\\ 0.2 & c_{1}=\text{"Hair"}\land c_{2}=\text{"Face"}\\ 1 & \text{otherwise}\end{cases} \tag{3.1}$$

The number 0.2 is arbitrary, but it means that the face-hair variation is 5 times less important than variations involving the background of the image.

Let $F$ be the function that measures the variation of face-frame between two images ($I_{1}$ and $I_{2}$) of the same size, with height $h$ and width $w$. The classifications of the pixels of $I_{1}$ are $p_{i,j}$, and the classifications of the pixels of $I_{2}$ are $q_{i,j}$.

$$F(I_{1},I_{2})=\frac{\sum_{i=1}^{h}\sum_{j=1}^{w}f(p_{i,j},q_{i,j})}{h\cdot w} \tag{3.2}$$

This function yields the weighted fraction of pixels that are classified differently, which we report as a percentage, giving more importance to variations involving the background. If the face-frame does not vary between two images $I_{1}$ and $I_{2}$, then $F(I_{1},I_{2})=0$. If there is a change in at least one pixel, then $0<F(I_{1},I_{2})\leq 1$.
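A minimal sketch of this measurement, assuming NumPy and assuming the segmentation step has already produced one integer label per pixel. The label encoding (`BACKGROUND`, `FACE`, `HAIR`) and the function name `face_frame_variation` are choices made for this illustration, not taken from the project's code.

```python
import numpy as np

# Assumed label encoding for the three segmentation classes.
BACKGROUND, FACE, HAIR = 0, 1, 2

def face_frame_variation(labels_1, labels_2, face_hair_weight=0.2):
    """Weighted fraction of pixels whose class differs between two label maps."""
    labels_1 = np.asarray(labels_1)
    labels_2 = np.asarray(labels_2)
    if labels_1.shape != labels_2.shape:
        raise ValueError("both label maps must have the same shape")
    differs = labels_1 != labels_2
    # A differing pixel with no background on either side is a face/hair swap.
    face_hair = differs & (labels_1 != BACKGROUND) & (labels_2 != BACKGROUND)
    # Face/hair swaps cost 0.2, other differences cost 1, equal pixels cost 0.
    cost = np.where(face_hair, face_hair_weight, differs.astype(float))
    return float(cost.mean())

a = np.array([[BACKGROUND, FACE], [FACE, HAIR]])
b = np.array([[BACKGROUND, FACE], [HAIR, BACKGROUND]])
variation = face_frame_variation(a, b)  # (0 + 0 + 0.2 + 1) / 4
```

In the 2x2 example, one pixel changes from face to hair (cost 0.2) and one from hair to background (cost 1), so the variation is 1.2 / 4 = 0.3, that is, 30%.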

Correction for image projection to latent space

As described in Section 4.1, StyleGAN2 uses a metric [31] to estimate perceptual similarity to the target image. Although this method works well, it does not take the face-frame variation into account. To obtain results with as little face-frame variation as possible, we designed a post-processing algorithm (henceforth referred to as face-frame correction) that picks the image with the lowest variation among the candidates it explores, using the function defined in Section 3.1.2 to measure it.
The correction takes the target image and the latent code of the projection as input. First, it calculates the standard deviation of 10000 random latent codes that create realistic images, which allows us to apply noise to the original latent code in a way that is expected to still yield a realistic image. The variation $F(\text{target image}, G(\text{initial latent code}))$ is also calculated and stored in this first step. The noise strength in each iteration is given by the following formula:

$$strength(i)=l_{std}\cdot noise_{0}\cdot\left(\frac{1-\frac{i}{10000}}{nrl}\right)^{2} \tag{3.3}$$

Here, $i$ is the current iteration number, $l_{std}$ is the standard deviation of the latent codes mentioned above, $noise_{0}$ is a constant representing the initial noise factor, and $nrl$ is a constant representing the noise ramp length. In these experiments, the values are $noise_{0}=0.005$ and $nrl=0.75$, which follow what StyleGAN2 uses [32].
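To get a feeling for the scale of the noise, the schedule can be evaluated directly. The sketch below transcribes the formula as stated in the text; the value `latent_std = 1.0` is an assumption made only for illustration, since the actual standard deviation depends on the trained network.

```python
def noise_strength(i, latent_std, noise0=0.005, nrl=0.75, total=10000):
    """Noise strength at iteration i: l_std * noise0 * ((1 - i/total) / nrl)^2."""
    ramp = (1.0 - i / total) / nrl
    return latent_std * noise0 * ramp ** 2

latent_std = 1.0  # assumed value, for illustration only
at_start = noise_strength(0, latent_std)       # 0.005 * (1 / 0.75)^2
at_2000 = noise_strength(2000, latent_std)     # 0.005 * (0.8 / 0.75)^2
at_end = noise_strength(10000, latent_std)     # ramp reaches zero
```

With these constants the strength starts at roughly 0.0089 times the latent standard deviation and decays quadratically, so later iterations explore progressively smaller neighbourhoods of the current latent code.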
In each of the $n$ iterations, the algorithm adds Gaussian noise multiplied by $strength(i)$ to the latent code. Given the constants $noise_{0}$ and $nrl$, the output image is expected to differ only slightly from the previous one. If the new latent code generates an image with less face-frame variation than the current latent code, the new one is stored and used as the current latent code; otherwise, the current one is preserved. Figure 3.3 shows an example of a target face, its projection into the latent space as output by the neural network, and the result after applying the correction algorithm.

Figure 3.3: A target face image, its output from the neural network and its output after running it through the correction algorithm, with its face-frame deviation.

With the values fixed at $noise_{0}=0.005$ and $nrl=0.75$, taken from StyleGAN2 [32], we study the optimal value of $n$, aiming to reduce the face-frame variation between the target image and the projected one as much as possible.
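The correction loop described in this section can be sketched as follows, assuming NumPy. This is an illustration under stated assumptions rather than the project's implementation: `generate`, `segment`, `variation` and `strength` are hypothetical callables standing in for the StyleGAN2 generator, the segmentation network, the face-frame variation function and the noise schedule, and the toy stand-ins at the bottom exist only so that the sketch runs on its own.

```python
import numpy as np

def correct_face_frame(latent, target_labels, generate, segment, variation,
                       strength, n, seed=0):
    """Keep a noisy candidate only when it lowers the face-frame variation."""
    rng = np.random.default_rng(seed)
    current = np.asarray(latent, dtype=float)
    best = variation(target_labels, segment(generate(current)))
    history = [best]  # best variation after each iteration
    for i in range(n):
        noise = rng.standard_normal(current.shape)
        candidate = current + strength(i) * noise
        score = variation(target_labels, segment(generate(candidate)))
        if score < best:  # accept only strict improvements
            current, best = candidate, score
        history.append(best)
    return current, best, history

# Toy stand-ins (hypothetical): the "image" is the latent itself and the
# "segmentation" labels each entry by its sign.
generate = lambda latent: latent
segment = lambda image: (image > 0).astype(int)
variation = lambda a, b: float(np.mean(a != b))

target = np.ones(16, dtype=int)
start = np.full(16, -0.01)
latent, score, history = correct_face_frame(
    start, target, generate, segment, variation,
    strength=lambda i: 0.05, n=200)
```

Because a candidate is accepted only when it improves the score, the recorded `history` never increases, which is what makes it possible to study how many iterations $n$ are worth running.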

Materials And Methods

Projecting an arbitrary image to latent space

The image-to-latent-space projection operation that StyleGAN2 provides takes advantage of the fact that the latent space is semantically smooth. This means that small changes in the input vector result in small changes in the resulting image. The procedure starts from a random input vector. This vector is slightly modified and the resulting image is compared to the target image, using a standard LPIPS metric to estimate the perceptual similarity between the two [31]. If the modification results in better similarity to the target image, it is kept. This process of slight modifications and measurements is repeated a thousand times.
The target image’s frame and the resulting image’s frame are compared.

Figure 4.1: To the left, the target image. To the right, its projection to the latent space

Operations on the target image to measure face-frame variation

We can take advantage of the fact that there are multiple latent directions along which we can move to modify certain attributes of an image. Some of these latent directions are of interest because the Lab can use them to edit an image within the neural network itself. We define an operation on an image generated from a latent code as the addition or subtraction of a vector such that a property of the face is modified.
With this in mind, we study the face-frame variation when applying some of these operations. Let $O$ be the operation to be studied. The face-frames to compare are the ones from $I_{1}$ and $O(I_{1})=I_{2}$. As the results may vary from image to image, the measurement is taken as the average over a hundred different images. The same hundred images are used for each operation.
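The averaging procedure can be sketched as follows, assuming NumPy. The callables `generate`, `segment` and `variation` are hypothetical stand-ins for the generator, the segmentation network and the face-frame variation function; the toy versions below, and the random latent codes, only serve to make the sketch self-contained.

```python
import numpy as np

def apply_operation(latent, direction, amount):
    """An operation: add (or subtract, for negative amounts) a latent direction."""
    return latent + amount * direction

def mean_variation(latents, direction, amount, generate, segment, variation):
    """Average face-frame variation between each image and its edited version."""
    scores = []
    for latent in latents:
        before = segment(generate(latent))
        after = segment(generate(apply_operation(latent, direction, amount)))
        scores.append(variation(before, after))
    return float(np.mean(scores))

# Toy stand-ins (hypothetical): the "image" is the latent itself and the
# "segmentation" labels each entry by its sign.
generate = lambda latent: latent
segment = lambda image: (image > 0).astype(int)
variation = lambda a, b: float(np.mean(a != b))

rng = np.random.default_rng(0)
latents = rng.standard_normal((100, 8))  # a hundred illustrative latent codes
direction = np.zeros(8)
direction[0] = 1.0

unchanged = mean_variation(latents, direction, 0.0, generate, segment, variation)
shifted = mean_variation(latents, direction, 10.0, generate, segment, variation)
```

Using the same set of latent codes for every operation, as the text specifies, keeps the averages comparable across operations.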

Moving through latent directions

4.2.1.1   Age

The age direction transforms the face so that it looks like an older or younger version of the original face.

Figure 4.2: Images generated moving through age direction

4.2.1.2   Gender

The gender direction transforms the face in a way that it looks like a more masculine or feminine version of the original face.

Figure 4.3: Images generated moving through gender direction

4.2.1.3   Horizontal orientation

The horizontal orientation direction transforms the face so that it appears rotated horizontally.

Figure 4.4: Images generated moving through horizontal face orientation direction

4.2.1.4   Vertical orientation

The vertical orientation direction transforms the face so that it appears rotated vertically.

Figure 4.5: Images generated moving through vertical face orientation direction

4.2.1.5   Eyes open

The eyes open direction transforms the face opening or closing its eyes.

Figure 4.6: Images generated moving through eyes open direction

4.2.1.6   Mouth open

The mouth open direction transforms the face opening or closing its mouth.

Figure 4.7: Images generated moving through mouth open direction

4.2.1.7   Smile

The smile direction transforms the face adding or removing a smile.

Figure 4.8: Images generated moving through smile direction

Style mixing

Style mixing is an operation that takes two images and generates two other images by exchanging their styles [9]. This means that the layout of each face is maintained, but the style of the image becomes that of the other input image.

Figure 4.9: On the top row, the two input images. On the bottom row, the two images with their styles mixed

Experiments

Projection to latent space

To examine the effectiveness of the face-frame correction algorithm proposed in Section 3.2, we first need to look at how well StyleGAN2 performs when projecting a target image to the latent space in terms of face-frame variation. To measure this, we project 11 images of different faces to the latent space without using the correction algorithm, and we measure the variation using the function defined in Section 3.1.2. These results are then compiled into a table that gives us a baseline of the variation introduced by StyleGAN2’s default projection function.

Correction for projection

After examining the variation that StyleGAN2 introduces, we can study how well our proposed algorithm corrects it. To test this, we set $n=2000$ and run our algorithm on each of the 10 output images from Section 5.1. We then compare the variation at each iteration, which lets us choose the optimal $n$ as a trade-off between computational time and reduced variation.

Latent directions

We also want to look at the face-frame variation of the 10 images when they are moved along the aforementioned latent directions. This allows us to rank each operation with respect to the variation it introduces in the generated image, making it more or less eligible for image modification when face-frame stability is essential.

Results

Projection to latent space

Table 6.1 shows the face-frame variation obtained when projecting each image to the latent space. The maximum variation is 3.192% and the minimum is 0.747%. The mean variation across these images is 1.812%. These results suggest that the projected images are a decent starting point when trying to generate similar faces with minimal face-frame variation.

Image Variation
Andrew 1.429%
Daniel 1.894%
Emma 2.531%
Jennifer 1.790%
Kristen 2.267%
Matt 1.452%
Paul 1.198%
Rob 2.381%
Rosa 3.192%
Sheldon 1.041%
Zendaya 0.747%
Table 6.1: Face-frame variations results for projecting to latent space
Figure 6.1: The 11 chosen faces, each with its StyleGAN2 projection shown below it.
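
The summary figures quoted for Table 6.1 can be re-derived directly from its rows. The short sketch below does so with the Python standard library; the variable names are ours, and the values are copied from the table. The computed mean agrees with the quoted one to within rounding.

```python
from statistics import mean

# Face-frame variation (%) per projected image, copied from Table 6.1.
variation_percent = {
    "Andrew": 1.429, "Daniel": 1.894, "Emma": 2.531, "Jennifer": 1.790,
    "Kristen": 2.267, "Matt": 1.452, "Paul": 1.198, "Rob": 2.381,
    "Rosa": 3.192, "Sheldon": 1.041, "Zendaya": 0.747,
}

lowest = min(variation_percent, key=variation_percent.get)
highest = max(variation_percent, key=variation_percent.get)
average = mean(variation_percent.values())

print(f"minimum: {lowest} ({variation_percent[lowest]:.3f}%)")
print(f"maximum: {highest} ({variation_percent[highest]:.3f}%)")
print(f"mean over {len(variation_percent)} images: {average:.3f}%")
```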

Correction for projection

Figures 6.2 and 6.3 illustrate the improvement in face-frame variation over the iterations of the correction. The variation curves decrease significantly during the first 750 iterations. The last 1250 iterations account for only the last 10% of the improvement. Notably, the improvement in variation always exceeds 20% of the initial value, and in some cases it exceeds 50%.

Figure 6.2: Graph of face-frame variations through the iterations
Figure 6.3: Graph of the normalized face-frame variations through the iterations
Figure 6.4: A target face image, its output from the neural network and its output after running it through the correction algorithm with 750 iterations, with its face-frame deviation.
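
The quantities discussed for the correction curves (normalized variation, overall improvement relative to the initial value, and the iteration at which most of the improvement has been reached) can be computed from a recorded per-iteration curve as sketched below. The helper names are hypothetical, and the short curve is made-up illustrative data, not a measurement from this work.

```python
def normalize(curve):
    """Scale a variation curve so that its first value equals 1."""
    first = curve[0]
    return [value / first for value in curve]


def relative_improvement(curve):
    """Fraction of the initial variation removed by the final iteration."""
    return (curve[0] - curve[-1]) / curve[0]


def iterations_for_share(curve, share):
    """First iteration index at which `share` of the total improvement is reached."""
    total = curve[0] - curve[-1]
    target = curve[0] - share * total
    for index, value in enumerate(curve):
        if value <= target:
            return index
    return len(curve) - 1


# Made-up variation curve (%), one value per recorded iteration.
example_curve = [2.0, 1.5, 1.2, 1.05, 1.0, 1.0]

print(normalize(example_curve))
print(relative_improvement(example_curve))
print(iterations_for_share(example_curve, 0.9))
```

Applied to real curves, `iterations_for_share(curve, 0.9)` gives one way to pick a smaller number of iterations that still captures 90% of the achievable improvement.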

Latent directions

We also check the face-frame variation of the 10 images when they are moved along the aforementioned latent directions. Figure 6.5 shows the chosen images, labeled 1 to 10 from left to right.

Figure 6.5: The 10 chosen faces for the latent directions experiments.

When an image is moved by a large magnitude along a direction, it tends to give unexpected results, deforming the base structure of the face and producing unrealistic images. For this reason, the range measured differs from one direction to another. For the set of chosen images, we decided on the following ranges:

Direction Range
Age [-3, 3]
Gender [-3, 3]
Vertical [-3, 3]
Horizontal [-3, 3]
Eyes-open [-10, 10]
Mouth-open [-10, 10]
Smile [-2, 2]
Table 6.2: Ranges to measure for each direction
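
The ranges of Table 6.2 can be captured as a small lookup that keeps any requested magnitude inside the interval measured for its direction. This is only a sketch: the dictionary keys, the function names and the number of steps are illustrative assumptions.

```python
# Measured magnitude range per direction, copied from Table 6.2.
DIRECTION_RANGES = {
    "age": (-3, 3),
    "gender": (-3, 3),
    "vertical": (-3, 3),
    "horizontal": (-3, 3),
    "eyes-open": (-10, 10),
    "mouth-open": (-10, 10),
    "smile": (-2, 2),
}


def clamp_magnitude(direction, magnitude):
    """Limit `magnitude` to the range measured for `direction`."""
    low, high = DIRECTION_RANGES[direction]
    return max(low, min(high, magnitude))


def magnitudes(direction, steps):
    """Evenly spaced magnitudes covering the whole range of `direction`."""
    if steps < 2:
        raise ValueError("at least two steps are needed to cover a range")
    low, high = DIRECTION_RANGES[direction]
    step = (high - low) / (steps - 1)
    return [low + i * step for i in range(steps)]


print(clamp_magnitude("smile", 5))
print(magnitudes("age", 7))
```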

Figures 6.8 and 6.9 show that modifying the vertical and horizontal face orientation in the image greatly increases the face-frame variation. Figure 6.7 shows that gender modification is another operation that significantly increases it. As can be seen in Figure 6.13, these operations are among the worst in terms of this variation, together with smile modification (Figure 6.12) and age modification (Figure 6.6).

However, Figures 6.11 and 6.10 show that modifying the openness of the mouth or of the eyes are the two operations that increase the face-frame variation the least.

Another interesting result illustrated in Figure 6.13 is that all the tested operations appear to have a linear impact on the face-frame variation.
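
One way to quantify a linear trend, and to compare how quickly each direction degrades the face-frame, is to fit a straight line to the mean variation as a function of the absolute magnitude and compare the slopes. The sketch below uses an ordinary least-squares fit written with the standard library only; the two data series are made-up placeholders, not values read from the figures.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def r_squared(xs, ys, slope, intercept):
    """Coefficient of determination of the fitted line (1.0 means perfectly linear)."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot


# Made-up placeholder data: absolute magnitude vs. mean variation (%).
abs_magnitude = [0.0, 1.0, 2.0, 3.0]
mean_variation = {
    "direction_a": [0.0, 0.5, 1.0, 1.5],
    "direction_b": [0.0, 2.0, 4.0, 6.0],
}

slopes = {}
for name, ys in mean_variation.items():
    slope, intercept = fit_line(abs_magnitude, ys)
    slopes[name] = slope
    print(name, round(slope, 3), round(r_squared(abs_magnitude, ys, slope, intercept), 3))

# Directions ordered from least to most face-frame variation per unit of magnitude.
ranking = sorted(slopes, key=slopes.get)
```

A slope close to zero marks a direction that barely changes the face-frame, while an R² value close to 1 supports the reading that the relationship is linear.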

Figure 6.6: Graph of the face-frame variations through the age direction
Figure 6.7: Graph of the face-frame variations through the gender direction
Figure 6.8: Graph of the face-frame variations through the vertical direction
Figure 6.9: Graph of the face-frame variations through the horizontal direction
Figure 6.10: Graph of the face-frame variations through the eyes open direction
Figure 6.11: Graph of the face-frame variations through the mouth open direction
Figure 6.12: Graph of the face-frame variations through the smile direction
Figure 6.13: Graph of the mean face-frame variations through each studied direction

Conclusion

In this work, we have explored how different operations in StyleGAN2’s latent space affect the face-frame of the resulting facial image. We have also studied how StyleGAN2’s image projection affects the face-frame of the original image, and we proposed an algorithm that reduces this variation as much as possible when projecting an image.
This algorithm can be useful when projecting a target image to generate a new one that is not only similar to the original in terms of facial characteristics but also preserves its face-frame as much as possible.
Besides that, exploring the latent space can help in understanding some underlying aspects of this generative model and of its interpretable latent directions. The face-frame variation seems to increase linearly with the magnitude of the movement along any of the studied directions. However, the rate of this increase differs for each direction. For example, the mouth-open and eyes-open directions presented the least variation, while horizontal and vertical face orientation, age and gender presented the greatest. This is somewhat expected: the face-frame differs depending on the face’s orientation, and faces have different characteristics depending on age or gender, which may include parts that define the frame.
The main goal of this paper is to help Laboratorio de Sueño y Memoria prove that the face-frame plays a key role in the creation of false memories that lead to the conviction of innocent people in identification parades. Given that most of the known facts in these crimes are age and gender, these results can help the Lab project target face images and modify their relevant features while changing their frames as little as possible.

Future Work

The face-frame correction algorithm and the variation function proposed in this paper offer insight into how the face-frame may be maintained when a facial image is projected onto the latent space. However, ours is an initial approach that can be enhanced to further minimize the face-frame variation while maintaining or even improving the fidelity of the output image. We propose the following ideas:

New directions

Latent directions are the key to modifying faces inside the latent space. However, only some of the known latent directions have been studied. New directions can be studied to understand how the latent space is arranged. To find them, an unsupervised discovery method can be used, for example the one proposed by Voynov and Babenko [33].
If one of the discovered directions is interpretable in a way similar to a known one, the two can be compared by their face-frame variation in order to choose the better option.
This would allow further modification of facial characteristics within the network itself, which could help the Lab disprove the belief that police lineups are helpful when searching for the perpetrator of a crime.

Different face-frame variation functions

The face-frame variation function used in this project, described in Section 3.1.2, is a first version created with this project in mind, and it is not flawless: sometimes it treats facial hair, such as a beard, as part of the face, and sometimes it labels it as part of the background. The neck is also sometimes considered part of the face, which it is not. There may be a more precise way to measure the face-frame variation between two images that we are not aware of and that could solve these issues.
One approach could be to take advantage of facial landmark detection algorithms [34], which would also help avoid losing facial characteristics, since our algorithm considers only the face-frame and not the remaining facial characteristics.

Improvement of face-frame variation when modifying faces

The face-frame correction algorithm was only applied and tested when projecting images to the latent space. However, it may also be used to reduce face-frame variation when performing operations on a facial image, such as modifying its eye-opening.

Different generative networks

This project was done using StyleGAN2 [9], as it is currently the state of the art in face image generation. However, the same experiment can be done using different networks. Text-to-image networks are becoming popular, and there may be a way to use tools such as DALL-E [35] or Stable Diffusion [36] to modify faces with results similar to those obtained with StyleGAN2. In fact, DALL-E 2 [37] allows parts of an image to be replaced according to a text description entered in a prompt, in a process called inpainting. This allows photorealistic modification of images, yielding a credible image quickly and accurately.

Figure 8.1: An image of two art pictures in a museum, used as an input image to DALL-E 2.
Figure 8.2: The original museum image including a dog in one of the pictures. This edit was entirely made by DALL-E 2.
Figure 8.3: The original museum image including a dog in one of the pictures. This edit was entirely made by DALL-E 2.
Figure 8.4: The original museum image including a dog on a sofa. This edit was entirely made by DALL-E 2.

Acknowledgements

We would like to express our deepest gratitude to our supervisor, Dr. Rodrigo Ramele, who not only presented the topic of this project, but also guided us through it, encouraging us to go deeper than we thought we could. We could not have undertaken this journey without Maite Herrán and Jimena Lozano who paved our way with their project and took the time to explain it to us.
Our project would not have been possible without the help of ITBA’s IT team, who allowed us to use their hardware and set up an environment so that we could use the neural network and perform all the tests we needed. We are also thankful to Dr. Cecilia Forcato and her team in the Laboratorio de Sueño y Memoria, who were always helpful and supportive of our project.
We gratefully acknowledge the NVIDIA Corporation Applied Research Accelerator Program, without whose help this project could not have been conducted. #NVIDIAGrant
Lastly, we would like to mention our families, friends and loved ones for their continuous support during this journey.

References

  • Lozano and Herrán Oyhanarte [2021] Jimena Lozano and Maite Mercedes Herrán Oyhanarte. Use of generative adversarial networks for the creation and manipulation of facial images in the context of studying false memories and its effects on wrongful conviction cases: implementation of stylegan’s generative image modeling and style mixing properties to design an interface for experimentation purposes. 2021.
  • Hafez [2019] Md Hafez. Neuromarketing: a new avatar in branding and advertisement. Pacific Business Review International, 12(4):58–64, 2019.
  • Castellani et al. [2010] Rudy J Castellani, Raj K Rolston, and Mark A Smith. Alzheimer disease. Disease-a-month: DM, 56(9):484, 2010.
  • Högberg et al. [2011] Göran Högberg, Davide Nardo, Tore Hällström, and Marco Pagani. Affective psychotherapy in post-traumatic reactions guided by affective neuroscience: memory reconsolidation and play. Psychology Research and Behavior Management, 4:87, 2011.
  • Fukushima [1973] Kunihiko Fukushima. A model of associative memory in the brain. Kybernetik, 12(2):58–63, 1973.
  • Schacter [2002] Daniel L Schacter. The seven sins of memory: How the mind forgets and remembers. HMH, 2002.
  • Loftus [2004] Elizabeth F Loftus. Memories of things unseen. Current directions in psychological science, 13(4):145–147, 2004.
  • Rattner [1983] Arye Rattner. Convicting the innocent: When justice goes wrong. The Ohio State University, 1983.
  • Karras et al. [2020] Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. Analyzing and improving the image quality of StyleGAN. In Proc. CVPR, 2020.
  • Luxemburg [2019] Robert Luxemburg. StyleGAN2 latent directions, 2019. https://twitter.com/robertluxemburg/status/1207087801344372736 [Accessed: 2022-08-10].
  • Karras et al. [2019] Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4401–4410, 2019.
  • Project [2022] Innocence Project. Research Resources - Innocence Project, 2022. https://innocenceproject.org/research-resources/ [Accessed: 2022-09-28].
  • West and Meterko [2015] Emily West and Vanessa Meterko. Innocence project: DNA exonerations, 1989-2014: review of data and findings from the first 25 years. Alb. L. Rev., 79:717, 2015.
  • MEMORIA [2022] LABORATORIO DE SUEÑO Y MEMORIA. Investigación | Argentina | Labsuenoymemoria, 2022. https://www.labsuenoymemoria.com/investigación [Accessed: 2022-08-13].
  • Zlotnik and Vansintjan [2019] Gregorio Zlotnik and Aaron Vansintjan. Memory: An extended definition. Frontiers in psychology, 10:2523, 2019.
  • Forcato [2011] Cecilia Forcato. Estudio de la fase de reconsolidación de la memoria declarativa en humanos. PhD thesis, Universidad de Buenos Aires. Facultad de Ciencias Exactas y Naturales, 2011.
  • Roediger et al. [2001] Henry L Roediger, Jason M Watson, Kathleen B McDermott, and David A Gallo. Factors that determine false recall: A multiple regression analysis. Psychonomic bulletin & review, 8(3):385–407, 2001.
  • Goodfellow [2014] Ian J Goodfellow. On distinguishability criteria for estimating generative models. arXiv preprint arXiv:1412.6515, 2014.
  • Arjovsky et al. [2017] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein generative adversarial networks. In International conference on machine learning, pages 214–223. PMLR, 2017.
  • Radford et al. [2015] Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.
  • Mirza and Osindero [2014] Mehdi Mirza and Simon Osindero. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784, 2014.
  • Odena et al. [2017] Augustus Odena, Christopher Olah, and Jonathon Shlens. Conditional image synthesis with auxiliary classifier gans. In International conference on machine learning, pages 2642–2651. PMLR, 2017.
  • Chen et al. [2016] Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel. Infogan: Interpretable representation learning by information maximizing generative adversarial nets. Advances in neural information processing systems, 29, 2016.
  • Isola et al. [2017] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1125–1134, 2017.
  • Zhang et al. [2017] Han Zhang, Tao Xu, Hongsheng Li, Shaoting Zhang, Xiaogang Wang, Xiaolei Huang, and Dimitris N Metaxas. Stackgan: Text to photo-realistic image synthesis with stacked generative adversarial networks. In Proceedings of the IEEE international conference on computer vision, pages 5907–5915, 2017.
  • Zhu et al. [2017] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Computer Vision (ICCV), 2017 IEEE International Conference on, 2017.
  • Karras et al. [2017] Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of gans for improved quality, stability, and variation. arXiv preprint arXiv:1710.10196, 2017.
  • Härkönen et al. [2020] Erik Härkönen, Aaron Hertzmann, Jaakko Lehtinen, and Sylvain Paris. Ganspace: Discovering interpretable gan controls. Advances in Neural Information Processing Systems, 33:9841–9850, 2020.
  • Mehta et al. [2022] Parthak Mehta, Sarthak Mishra, Nikhil Chouhan, Neel Pethani, and Ishani Saha. Face editing with gan–a review. arXiv preprint arXiv:2207.11227, 2022.
  • Gupta [2018] Kamal Gupta. Face segmentation. https://github.com/kampta/face-seg, 2018.
  • Zhang et al. [2018] Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 586–595, 2018.
  • NVlabs [2021] NVlabs. Stylegan 2. https://github.com/NVlabs/stylegan2, 2021.
  • Voynov and Babenko [2020] Andrey Voynov and Artem Babenko. Unsupervised discovery of interpretable directions in the gan latent space. In International conference on machine learning, pages 9786–9796. PMLR, 2020.
  • Wu and Ji [2019] Yue Wu and Qiang Ji. Facial landmark detection: A literature survey. International Journal of Computer Vision, 127(2):115–142, 2019.
  • Ramesh et al. [2021] Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-shot text-to-image generation. In International Conference on Machine Learning, pages 8821–8831. PMLR, 2021.
  • Rombach et al. [2022] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10684–10695, June 2022.
  • Ramesh et al. [2022] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents. arXiv, 2022. URL https://arxiv.org/abs/2204.06125.