Controlling Face’s Frame generation in StyleGAN’s latent space operations
Modifying faces to deceive our memory
Agustín Roca
Nicolás Britos
Supervisor: Rodrigo Ramele
Final Project
Computer Engineering Degree
Abstract
Innocence Project is a non-profit organization that works to reduce wrongful convictions. In collaboration with Laboratorio de Sueño y Memoria from Instituto Tecnológico de Buenos Aires (ITBA), it is studying human memory in the context of face identification. Their working hypothesis is that human memory relies heavily on the face's frame to recognize faces. If this is proved, it could mean that face recognition in police lineups cannot be trusted, as it may lead to wrongful convictions. The study tests this hypothesis through experiments that use faces with different properties, such as eye size, while maintaining the face's frame as much as possible.
In this project, we continue the work from a previous project [1] that provided the basic tool to generate realistic faces using StyleGAN2. We take a deep dive into the internals of this tool to make full use of StyleGAN2 functionalities, while also adding more features needed by the Lab, such as projecting a face from a target face image or modifying certain of its attributes, including mouth-opening or eye-opening.
As the usage of this tool relies heavily on maintaining the face-frame, we develop a way to identify the face-frame of each image and a function to compare it to the output of the neural network after applying some operations, such as the modification of the eye-opening. The objective is to have a numeric value that measures how much a certain operation changes the face-frame and how much of it is maintained, giving a clearer perspective of which face images may be better candidates for generating false memories.
We conclude that the face-frame is maintained when modifying eye-opening or mouth-opening. Modifying vertical face orientation, gender, age and smile has a considerable impact on the face-frame. Finally, modifying the horizontal face orientation has a major impact on the face-frame. This way, the Lab may apply some operations being confident that the face-frame will not change significantly, making them viable for use in deceiving subjects' memories.
Keywords – StyleGAN, GAN, Generative Image Modeling, Innocence Project, False Memories, Laboratorio de Sueño y Memoria, ITBA, Face-Frame, Face Landmarks, Image Segmentation, Image Processing
Contents
- 1 Introduction
- 2 Review of similar works
- 2.1 Innocence Project
- 2.2 Memories
- 2.3 Generative Adversarial Networks
- 2.3.1 Loss function
- 2.3.2 Wasserstein GAN (WGAN)
- 2.3.3 Deep Convolutional GAN (DCGAN)
- 2.3.4 Conditional GAN (cGAN)
- 2.3.5 Auxiliary Classifier GAN (AC-GAN)
- 2.3.6 Information Maximizing GAN (InfoGAN)
- 2.3.7 Pix2Pix
- 2.3.8 Stacked GAN (StackGAN)
- 2.3.9 Cycle-Consistent GAN (CycleGAN)
- 2.3.10 Progressive Growing GAN (PGGAN)
- 2.3.11 Style-Based GAN (StyleGAN)
- 2.4 Facial properties modification within GAN
- 2.5 Face editing with GAN
- 2.6 Initial project status
- 3 Face-frame stabilization in image projection to latent space
- 4 Materials And Methods
- 5 Experiments
- 6 Results
- 7 Conclusion
- 8 Future Work
- 9 Acknowledgements
List of Figures
- 2.1 GAN architecture.
- 2.2 An example from an edges-to-shoe image translator that uses Pix2Pix.
- 2.3 Results from StackGAN paper [25].
- 2.4 Traditional GAN architecture vs StyleGAN architecture. [9]
- 3.1 Two different faces with the same face-frame.
- 3.2 Segmentation made by the neural network.
- 3.3 A target face image, its output from the neural network and its output after running it through the correction algorithm, with its face-frame deviation.
- 4.1 To the left, the target image. To the right, its projection to the latent space
- 4.2 Images generated moving through age direction
- 4.3 Images generated moving through gender direction
- 4.4 Images generated moving through horizontal face orientation direction
- 4.5 Images generated moving through vertical face orientation direction
- 4.6 Images generated moving through eyes open direction
- 4.7 Images generated moving through mouth open direction
- 4.8 Images generated moving through smile direction
- 4.9 On the top row, the two input images. On the bottom row, the two images with their styles mixed
- 6.1 The 11 chosen faces, each with its projection generated by StyleGAN2 below it.
- 6.2 Graph of face-frame variations through the iterations
- 6.3 Graph of the normalized face-frame variations through the iterations
- 6.4 A target face image, its output from the neural network and its output after running it through the correction algorithm with 750 iterations, with its face-frame deviation.
- 6.5 The 10 chosen faces for the latent directions experiments.
- 6.6 Graph of the face-frame variations through the age direction
- 6.7 Graph of the face-frame variations through the gender direction
- 6.8 Graph of the face-frame variations through the vertical direction
- 6.9 Graph of the face-frame variations through the horizontal direction
- 6.10 Graph of the face-frame variations through the eyes open direction
- 6.11 Graph of the face-frame variations through the mouth open direction
- 6.12 Graph of the face-frame variations through the smile direction
- 6.13 Graph of the mean face-frame variations through each studied direction
- 8.1 An image of two art pictures in a museum used as an input image to the DALLE 2 AI.
- 8.2 The original museum image including a dog in one of the pictures. This edit was entirely made by DALLE 2 AI.
- 8.3 The original museum image including a dog in one of the pictures. This edit was entirely made by DALLE 2 AI.
- 8.4 The original museum image including a dog in a sofa. This edit was entirely made by DALLE 2 AI.
List of Tables
Introduction
Understanding how our brains store and recover memories can be really powerful. Marketing, psychology, design, biology, medicine and even computer science could achieve great breakthroughs by knowing how our brains remember things. Better advertisements [2], a more solid knowledge base to fight Alzheimer's disease [3], more efficient ways of dealing with trauma [4] and new information models based on memory [5] are all potential consequences of understanding how memories work in our brains. However, we also know that our memory is malleable [6]. It can be manipulated and affected by external factors, and it can deceive us [7].
Given this, it is natural to doubt the accuracy of human memory. This is the main reason why Innocence Project is doing its research. They are investigating cases where eyewitness testimony was the main evidence for a conviction. They theorize that many of these cases may be the result of corruption, manipulation, or honest mistakes when identifying the culprit of the crime. In 1983, around 52% of wrongful convictions were the result of eyewitness mistakes [8].
Innocence Project Argentina partnered with Laboratorio de Sueño y Memoria from Instituto Tecnológico de Buenos Aires (ITBA) to research how humans remember faces, which factors are key for our brains to identify faces, and how accurate human memory is for face identification. The experiments the Lab is conducting consist of simulating a crime that a subject witnesses. Later, the subject is asked to recognize the perpetrator of the crime in a lineup where the criminal may or may not be present. The objective of the experiment is to find out which attributes the real criminal and the identified person have in common.
During this research, they theorized that the contour of the face, the hair and the ears are the things that our memory first notices in a face. This set of properties is called the "face-frame".
Currently, the Lab is using a tool to generate faces similar to the criminal in the experiment [1]. This tool is based on NVIDIA's StyleGAN2 [9], a Generative Adversarial Network (GAN). A GAN is a deep-learning-based model that makes it possible to create synthetic data that is ideally indistinguishable from real data. NVIDIA took this model and changed its architecture to create StyleGAN2, which is capable of generating realistic images of faces and is therefore used by the Lab.
Although the tool used by the Lab is capable of generating images of faces, it is still missing some key functionalities needed to take full advantage of StyleGAN2, such as changing a face's attributes using the very same neural network, or mapping an existing face to one generated by the network.
This way, the project can be divided into two main parts. First, we implement the missing functionalities that StyleGAN2 offers: style mixing and projecting an image into the latent space. These functionalities provide additional tools for the Lab to generate faces, especially when there are specific requirements about how the generated face should look. In addition, we use some latent directions known by the community [10] to modify some attributes of the face. These attributes include eye-opening, mouth-opening, smile, face orientation, apparent age and gender, among others.
The second part consists of measuring how much the face differs from the original one after modifying it with these functionalities. This also allows us to find a way to project an image while maintaining its frame as much as possible. Style mixing is not part of the experiments, as it only changes the color scheme and microstructure of the image [11].
In this work, Section 2 first presents the main stakeholders of this investigation, a brief introduction to human memory, Generative Adversarial Networks, how faces generated by them can be modified within the same network (specifically in StyleGAN2), and this project's initial status. Sections 3 and 4 give the definitions and functions we use to make our measurements, the faces we use and the setup we require. Section 5 explains the experiments, and Section 6 presents the results, focusing on the face-frame deviation that appears when faces are modified by moving along some of the neural network's latent directions. Section 7 discusses the importance and significance of these results and how they may be used. Finally, Sections 8 and 9 close this work with future work and acknowledgements.
Review of similar works
Innocence Project
Innocence Project is one of the 69 organizations that are part of the Innocence Network, which work to free innocent people from jail and to prevent wrongful convictions. This work can take the form of legal services or investigation of the case presented. [12]
In the United States, the Innocence Project has reported over 375 DNA exonerations [13]. The causes of wrongful conviction that the organization is aware of include eyewitness misidentification, false confessions, forensic science errors and poor defense, among others. [12]
Innocence Project Argentina is part of the Innocence Network, and it works on Argentine cases. This organization is the one that got in touch with Laboratorio de Sueño y Memoria from Instituto Tecnológico de Buenos Aires (ITBA).
The Lab does research in neuroscience. Some of the areas they are working on are: activation of memories while sleeping, sleep paralysis, wave processing for the brain and false memory formation [14].
Memories
In 2019, Zlotnik and Vansintjan defined memory as "the capacity to store and retrieve information" [15]. The process of memory formation has different phases.
The first of these phases is acquisition. In this phase, the sensory stimuli are encoded into neurochemical representations. The second phase is consolidation, a period in which the memory is stabilized in order to persist through time. The retrieval phase is where the information in a memory can be recovered. A consolidated memory can go through a re-consolidation phase, in which its information can be modified and it is re-stabilized. [16]
This last phase is the one that plays a key role in the Lab's experiments. There are several theories about how false memories are created. Among them is the activation-monitoring theory, which explains the creation of false memories using two principles. The first came from an experiment in which the subject is given a list of words and is later asked whether a certain word was in the list. It was discovered that if the words in the list were somehow related, and the word asked about later belonged to the same topic, the subject would think the word was in the list even if that was not the case. The second principle states that if a person cannot remember the source of the information of interest, that person may create a false memory. [17]
Generative Adversarial Networks
A Generative Adversarial Network (GAN) is a deep-learning-based model introduced by Goodfellow et al. [18] that makes it possible to create synthetic data that is ideally indistinguishable from real data. The model has two neural networks (a generator and a discriminator) and a training data set.
The generator's objective is to create data as similar as possible to the training data set. The discriminator's objective is to tell which data comes from the generator and which comes from the training data set, having only the data itself as input. This way, the generator and the discriminator compete against each other to obtain better results.
If the discriminator fails to tell the source of the data, it means the generator was able to fool the discriminator. In this case, the discriminator will be the one learning from its mistake. If the discriminator is able to correctly tell the source of the data, the generator is the one that will be learning. When the training is complete, the generator network is the one used to keep generating data.
![Refer to caption](extracted/5701174/02_Images/GAN.png)
In our case, the data are images of faces. During training, the generator learns the structure of these images, projecting them to a space called the latent space. This way, the generator takes as input a point, usually called the latent code or $z$, and outputs the image at that point of the latent space.
Loss function
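The generator ($G$) and the discriminator ($D$) compete against each other, playing a two-player minimax game. $G$ receives some input ($z$) and returns a synthetic image. $D$ receives an image and tries to tell whether it is a real image or a synthetic one. This competition can be described with the following value function [18]:
$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}(x)}\left[\log D(x)\right] + \mathbb{E}_{z \sim p_z(z)}\left[\log\left(1 - D(G(z))\right)\right] \tag{2.1}$$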
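As an illustration of the value function from [18], $\mathbb{E}[\log D(x)] + \mathbb{E}[\log(1 - D(G(z)))]$, the following minimal sketch estimates it from a batch of discriminator outputs. The function name `gan_value` and the sample values are hypothetical and only serve to show how the two terms behave; they are not part of the tool described in this work.

```python
import math
from typing import Sequence


def gan_value(d_real: Sequence[float], d_fake: Sequence[float]) -> float:
    """Estimate V(D, G) from discriminator outputs, each strictly in (0, 1).

    d_real holds D(x) for real images, d_fake holds D(G(z)) for synthetic ones.
    """
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term


# A discriminator that cannot tell the sources apart outputs 0.5 everywhere,
# so V = log(0.5) + log(0.5) = -2 log 2.
undecided = gan_value([0.5] * 8, [0.5] * 8)

# A confident and correct discriminator pushes V towards its maximum of 0.
confident = gan_value([0.99] * 8, [0.01] * 8)
```

The discriminator tries to increase this value while the generator tries to decrease it, which is the minimax game described above.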
Wasserstein GAN (WGAN)
WGAN [19] is an alternative to the traditional GAN algorithm. It proposes a new loss function based on the Wasserstein distance, whereas the original GAN loss function is related to the Jensen-Shannon divergence. This change improves the stability of learning and mitigates problems of the traditional GAN such as mode collapse.
Deep Convolutional GAN (DCGAN)
DCGANs [20], in comparison with traditional GANs, change the architecture of the generator and the discriminator: both are convolutional networks. The most important result of this model is its training stability.
Conditional GAN (cGAN)
In the traditional GAN, there is no control over the outputs of the generator. This is why cGANs [21] were introduced. A cGAN is an extension of the GAN that adds labels to the input in order to condition the generator and discriminator networks. This gives some control over the synthetic images the generator creates. It also makes it possible to generate new tags for an image.
Auxiliary Classifier GAN (AC-GAN)
AC-GAN [22] is an extension of cGANs that modifies the discriminator to also predict the label (also known as class or auxiliary classifier) of the input image.
Information Maximizing GAN (InfoGAN)
An InfoGAN [23] is an extension of the GAN that is based on information theory. InfoGANs are able to learn disentangled representations in a completely unsupervised manner. These representations are competitive with the ones learned with supervised methods.
Pix2Pix
Pix2Pix [24] is software that uses a cGAN for image-to-image translation. It can be used for different purposes. Some of the known applications are labels-to-street-scene, aerial-to-map, labels-to-facade, day-to-night and edges-to-photo.
![Refer to caption](extracted/5701174/02_Images/pix2pix.png)
Stacked GAN (StackGAN)
StackGAN [25] proposes an architecture that chains multiple GANs together to create photo-realistic images. The output of one GAN is the input of the next one. The idea is to increase the resolution of the output image with each GAN it passes through. When introduced, it was tested on text-to-image translation, giving some interesting results, as seen in Figure 2.3.
![Refer to caption](extracted/5701174/02_Images/stackgan.jpeg)
Cycle-Consistent GAN (CycleGAN)
CycleGAN [26] is an extension of the GAN that connects two GANs in a cycle. Its objective is to obtain two mapping functions that translate from one set of images to another and back. For example, some of the translations can be horses-to-zebras, artist-to-photo and winter-to-summer, and their respective inverses.
Progressive Growing GAN (PGGAN)
PGGAN [27] introduces a new training methodology for GANs. First it trains a GAN to generate 4x4 images, then it adds another layer to both the generator and the discriminator to generate 8x8 images, and so on until the desired resolution is reached (the paper goes up to 1024x1024). The idea behind the algorithm is to first learn the large-scale structure and later focus on the fine details. This methodology reduces training time and obtains more realistic results.
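The progressive schedule described above doubles the image side at every stage. The short sketch below only enumerates the resolutions visited from 4x4 to 1024x1024; the function name `progressive_resolutions` is a hypothetical helper, and no training is performed.

```python
from typing import List


def progressive_resolutions(start: int = 4, final: int = 1024) -> List[int]:
    """Image side lengths visited when each growth stage doubles the resolution."""
    if start <= 0 or final < start:
        raise ValueError("expected 0 < start <= final")
    resolutions = [start]
    while resolutions[-1] < final:
        resolutions.append(resolutions[-1] * 2)
    if resolutions[-1] != final:
        raise ValueError("final must be start times a power of two")
    return resolutions


schedule = progressive_resolutions()
```

Going from 4x4 to 1024x1024 therefore takes nine stages, each one adding a layer to both networks.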
Style-Based GAN (StyleGAN)
StyleGAN [9] is an extension of PGGAN that modifies the generator architecture. Traditional GANs feed the latent vector $z$ directly to the generator, but StyleGAN first maps it to a new vector $w$. This new vector is distributed across the different layers of StyleGAN's generator, where it is used as input together with Gaussian noise vectors.
![Refer to caption](extracted/5701174/02_Images/stylegan.png)
Facial properties modification within GAN
In 2020, Erik Härkönen, Aaron Hertzmann, et al. released a paper titled GANSpace: Discovering Interpretable GAN Controls [28], showing how to find latent directions that modify the properties of the objects in the image generated by the neural network. The approach is mainly based on Principal Component Analysis (PCA), which provides a much faster way to extract meaningful latent directions than manual supervision or expensive optimization. The results of this paper can be applied to the StyleGAN2 architecture, paving the way to modify facial attributes of a generated image within the neural network itself.
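To make the PCA idea concrete, the sketch below extracts principal axes from a set of latent samples and moves one sample along the first axis. It is a simplified illustration under stated assumptions: the samples are random stand-ins rather than codes from a trained mapping network, the latent space has 16 dimensions instead of 512, and the names `pca_directions`, `samples` and `edited` are hypothetical.

```python
import numpy as np
from typing import Tuple


def pca_directions(latents: np.ndarray, n_components: int) -> Tuple[np.ndarray, np.ndarray]:
    """Return (mean, directions); rows of directions are unit-length principal
    axes of the latent samples, ordered by decreasing explained variance."""
    mean = latents.mean(axis=0)
    centered = latents - mean
    # Rows of vt are the right singular vectors, i.e. the principal axes.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return mean, vt[:n_components]


rng = np.random.default_rng(0)
# Stand-in for sampled latent codes: 1000 samples in 16 dimensions,
# deliberately stretched along the first coordinate.
samples = rng.normal(size=(1000, 16)) * np.array([5.0] + [1.0] * 15)

mean, directions = pca_directions(samples, n_components=3)

# Editing a latent code means moving it along one of the found directions.
edited = samples[0] + 2.0 * directions[0]
```

In this toy data the first principal axis aligns with the stretched coordinate. With a real generator, each axis would have to be inspected visually to decide which attribute it controls.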
Face editing with GAN
Another paper, released in 2022, focuses on face editing with GANs by moving along a direction in the latent space. Face editing with GAN's – A Review [29], written by Parthak Mehta, Sarthak Mishra, et al., shows how other classification models, such as logistic regression or SVM, can be used to find the directions that represent a feature in the trained StyleGAN model. It also describes a way to find a latent vector whose generated image is similar to a target one. This way, a real portrait can be used as an input to find a similar face generated by the neural network, whose attributes can then be modified using the directions that represent a feature.
Initial project status
This project continues the work of Jimena Lozano and Maite Herrán Oyhanarte on the use of GANs to create and manipulate facial images [1]. Their tool was intended as software to generate realistic facial images with the ability to manipulate their main characteristics, such as eye size and separation, nose size, etc. The resulting software is able to generate new random, artificial facial images from scratch and to generate transitions between two generated pictures. The StyleGAN2 neural network is the core of this software, paired with an API built in Python and a Web application built in React. Together, these made possible an initial version of the software that enabled the Lab to improve the preparation workflow for its experiments.
Face-frame stabilization in image projection to latent space
Face-frame variation measurement
There is no standardized way of defining a face-frame. We define it as the set of characteristics of a face that gives context to its internal characteristics. A face-frame includes the contour of the face, the hair and the ears. It does not include the eyes, mouth, nose or eyebrows. For example, Figure 3.1 shows two faces with the same frame. The hypothesis of the Lab is that these two faces may be mistaken for each other because they share the same frame.
![Refer to caption](extracted/5701174/02_Images/face_frame.png)
Having this in mind, we measure the variation of the face-frame between two images based on the position of the face and hair in the image.
Image segmentation
The first thing we need to do is to identify the face and hair inside an arbitrary image. This is a simple segmentation problem: separating the pixels of an image into three groups (face, hair and background). For this, an open-source pretrained neural network is used [30]. This network gave satisfactory results for different faces and for images taken under different conditions. This is essential for the measurement because no parameters need to be tuned for different images. It is worth mentioning that although the neck is not part of the face-frame, the Lab considers it better to preserve it. Therefore, for the purposes of this work, it is acceptable that the neural network classifies the neck as part of the face. Figure 3.2 shows some examples of segmentation made by this network, with the pixels classified as hair in yellow, the pixels classified as face in blue, and the pixels classified as background in purple.
![Refer to caption](extracted/5701174/02_Images/segmentation.png)
Face-frame variation formula
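Once each pixel of each of the two images is classified, we compare the classifications of the pixels in the same position of the two images. Let $f$ be a function that compares the classifications $c_1$ and $c_2$ of two pixels,
$$f(c_1, c_2) = \begin{cases} 0 & \text{if } c_1 = c_2 \\ 0.2 & \text{if } \{c_1, c_2\} = \{\text{face}, \text{hair}\} \\ 1 & \text{otherwise} \end{cases} \tag{3.1}$$
The number 0.2 is arbitrary, but it means that a face-hair variation is 5 times less important than a variation involving the background of the image.
Let $F$ be the function that measures the variation of face-frame between two images ($I_1$ and $I_2$) of the same size, with height $H$ and width $W$. The classifications of the pixels of $I_1$ are $c^{(1)}_{i,j}$, and the classifications of the pixels of $I_2$ are $c^{(2)}_{i,j}$.
$$F(I_1, I_2) = \frac{1}{H W} \sum_{i=1}^{H} \sum_{j=1}^{W} f\left(c^{(1)}_{i,j}, c^{(2)}_{i,j}\right) \tag{3.2}$$
This function yields the proportion of the image that is classified differently, which the results report as a percentage, giving more importance to background variations. If the face-frame does not vary between two images $I_1$ and $I_2$, then $F(I_1, I_2) = 0$. If there is a change in at least one pixel, then $F(I_1, I_2) > 0$.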
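The measurement described in this section can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the project's actual implementation: the integer label encoding (`BACKGROUND`, `FACE`, `HAIR`) and the function name `face_frame_variation` are hypothetical, and the per-pixel cost follows the description in the text (0 for equal labels, 0.2 for a face-hair change, 1 for any change involving the background).

```python
import numpy as np

BACKGROUND, FACE, HAIR = 0, 1, 2  # hypothetical label encoding
FACE_HAIR_WEIGHT = 0.2            # weight of a face-hair change, as in the text


def face_frame_variation(seg_a: np.ndarray, seg_b: np.ndarray) -> float:
    """Weighted fraction of pixels classified differently in two label maps."""
    if seg_a.shape != seg_b.shape:
        raise ValueError("segmentations must have the same shape")
    differs = seg_a != seg_b
    # A mismatch involves the background if either of the two labels is BACKGROUND.
    involves_background = differs & ((seg_a == BACKGROUND) | (seg_b == BACKGROUND))
    face_hair_change = differs & ~involves_background
    cost = involves_background * 1.0 + face_hair_change * FACE_HAIR_WEIGHT
    return float(cost.mean())


# 2x2 example: one unchanged background pixel, one face-to-hair change,
# one face-to-background change and one unchanged hair pixel.
a = np.array([[BACKGROUND, FACE], [FACE, HAIR]])
b = np.array([[BACKGROUND, HAIR], [BACKGROUND, HAIR]])
variation = face_frame_variation(a, b)  # (0 + 0.2 + 1 + 0) / 4
```

Multiplying the returned value by 100 gives the variation as a percentage.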
Correction for image projection to latent space
As mentioned in Section 4.1, StyleGAN2 uses the LPIPS metric [31] to estimate perceptual similarity to the target image. Although this method works well, it does not take the face-frame variation into account. To obtain results with as little face-frame variation as possible, a post-processing algorithm (henceforth referred to as face-frame correction) has been designed to pick the candidate image with the lowest variation, measured with the function defined in Section 3.1.2.
The correction takes the target image and the latent code of the projection as an input. First, it calculates the standard deviation of 10000 random latent codes that create realistic images, which allows us to apply noise to the original latent code in a way that we can be certain it will yield a realistic image. The is also calculated and stored in this first step. The noise strength in each iteration is the result of applying the following formula:
$$s_i = \sigma \cdot k \cdot \max\left(0,\ 1 - \frac{i / N}{r}\right)^2 \tag{3.3}$$
Here $i$ is the current iteration number, $N$ the total number of iterations, $\sigma$ the standard deviation of the latent codes previously mentioned, $k$ a constant representing the initial noise factor, and $r$ a constant representing the noise ramp length. In these experiments, the values are $k = 0.05$ and $r = 0.75$, which follow what StyleGAN2 uses [32].
In each of the $N$ iterations, the algorithm adds Gaussian noise multiplied by $s_i$ to the latent code. Depending on the constants $k$ and $r$, the output image can be expected to differ only slightly from the original one. If this new latent code generates an image with less face-frame variation than the current latent code, it is stored and used as the current latent code. Otherwise, the current one is preserved. Figure 3.3 shows an example of a target face, its projection into the latent space as output by the neural network, and the result after applying the correction algorithm.
![Refer to caption](extracted/5701174/02_Images/correction_demo.jpeg)
Having fixed the initial noise factor $k$ and the noise ramp length $r$ to the values taken from StyleGAN2 [32], the optimal number of iterations $N$ is studied with the aim of reducing the face-frame variation between the target image and the projected one as much as possible.
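The correction loop described in this section can be sketched as a greedy search over noisy latent codes. The sketch below rests on several assumptions that should be read as such: the noise ramp is modelled on the StyleGAN2 projector, with an initial noise factor of 0.05 and a noise ramp length of 0.75; the names `noise_strength`, `correct_latent` and `toy_variation` are hypothetical; and the variation function is a toy stand-in for the real pipeline, which would generate an image, segment it and compare it with the target.

```python
import numpy as np
from typing import Callable, List, Tuple


def noise_strength(i: int, n_iters: int, sigma: float,
                   initial_noise_factor: float = 0.05,
                   noise_ramp_length: float = 0.75) -> float:
    """Noise scale at iteration i; it decays to zero along the ramp."""
    t = i / n_iters
    return sigma * initial_noise_factor * max(0.0, 1.0 - t / noise_ramp_length) ** 2


def correct_latent(w0: np.ndarray, variation: Callable[[np.ndarray], float],
                   sigma: float, n_iters: int,
                   seed: int = 0) -> Tuple[np.ndarray, List[float]]:
    """Keep a noisy candidate only if it lowers the face-frame variation."""
    rng = np.random.default_rng(seed)
    best_w, best_v = w0.copy(), variation(w0)
    history = [best_v]
    for i in range(n_iters):
        scale = noise_strength(i, n_iters, sigma)
        candidate = best_w + rng.normal(size=w0.shape) * scale
        v = variation(candidate)
        if v < best_v:
            best_w, best_v = candidate, v
        history.append(best_v)
    return best_w, history


def toy_variation(w: np.ndarray) -> float:
    """Toy stand-in: distance to a hypothetical ideal latent code (the origin)."""
    return float(np.linalg.norm(w))


w_best, history = correct_latent(np.ones(8), toy_variation, sigma=1.0, n_iters=200)
```

Because a candidate is accepted only when it improves the measurement, the recorded variation never increases from one iteration to the next.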
Materials And Methods
Projecting an arbitrary image to latent space
The image to latent space projection operation that StyleGAN2 provides takes advantage of the fact that the latent space is semantically smooth. This means that small changes in the input vector result in small changes in the resulting image. The projection starts from a random input vector. This vector is slightly modified and the resulting image is compared to the target image, using the standard LPIPS metric [31] to estimate the perceptual similarity between the two. If the modification improves the similarity to the target image, it is kept. This process of slight modifications and measurements is repeated a thousand times.
The target image’s frame and the resulting image’s frame are compared.
![Refer to caption](extracted/5701174/02_Images/projection.png)
Operations on the target image to measure face-frame variation
We can take advantage of the fact that there are multiple latent directions we can move along, each modifying certain attributes of an image. Some of these latent directions are of interest because the Lab can use them to edit an image within the neural network itself. Given an image generated from a latent code, we define an operation on that image as the addition or subtraction of a vector to that latent code in such a way that a property of the face is modified.
With this in mind, we study the face-frame variation when applying some of these operations. Let $o$ be the operation to be studied and $w$ a latent code. The face-frames to compare are the ones of the image generated from $w$ and of the image generated from $o(w)$. As the results may vary from image to image, the measurement is taken as the average over a hundred different images. The same hundred images are used for each operation.
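The measurement procedure above, applying the same operation to a fixed set of latent codes and averaging the resulting variation, can be sketched as follows. The names `apply_direction`, `mean_variation` and `toy_variation_between` are hypothetical, and the comparison function is a toy stand-in: a real pipeline would generate both images, segment them and compare their face-frames.

```python
import numpy as np
from typing import Callable, Sequence


def apply_direction(w: np.ndarray, direction: np.ndarray, magnitude: float) -> np.ndarray:
    """Move a latent code along a direction; a negative magnitude subtracts it."""
    return w + magnitude * direction


def mean_variation(latents: Sequence[np.ndarray], direction: np.ndarray, magnitude: float,
                   variation_between: Callable[[np.ndarray, np.ndarray], float]) -> float:
    """Average variation between each original code and its edited version."""
    values = [variation_between(w, apply_direction(w, direction, magnitude))
              for w in latents]
    return float(np.mean(values))


def toy_variation_between(w_a: np.ndarray, w_b: np.ndarray) -> float:
    """Toy stand-in for comparing the face-frames of two generated images."""
    return float(np.linalg.norm(w_a - w_b))


rng = np.random.default_rng(1)
latents = rng.normal(size=(100, 8))   # the same hundred codes for every operation
direction = np.zeros(8)
direction[0] = 1.0                    # hypothetical unit-length latent direction
result = mean_variation(latents, direction, 3.0, toy_variation_between)
```

Reusing the same set of latent codes for every direction keeps the averages comparable across operations.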
Moving through latent directions
4.2.1.1 Age
The age direction transforms the face so that it looks like an older or younger version of the original face.
![Refer to caption](extracted/5701174/02_Images/age.png)
4.2.1.2 Gender
The gender direction transforms the face in a way that it looks like a more masculine or feminine version of the original face.
![Refer to caption](extracted/5701174/02_Images/gender.png)
4.2.1.3 Horizontal orientation
The horizontal orientation direction transforms the face so that it appears to rotate horizontally.
![Refer to caption](extracted/5701174/02_Images/horizontal.png)
4.2.1.4 Vertical orientation
The vertical orientation direction transforms the face so that it appears to rotate vertically.
![Refer to caption](extracted/5701174/02_Images/vertical.png)
4.2.1.5 Eyes open
The eyes open direction transforms the face by opening or closing its eyes.
![Refer to caption](extracted/5701174/02_Images/eyesopen.png)
4.2.1.6 Mouth open
The mouth open direction transforms the face by opening or closing its mouth.
![Refer to caption](extracted/5701174/02_Images/mouthopen.png)
4.2.1.7 Smile
The smile direction transforms the face by adding or removing a smile.
![Refer to caption](extracted/5701174/02_Images/smile.png)
Style mixing
Style mixing is an operation that takes two images and generates two other images by exchanging their styles [9]. This means that the layout of each face is maintained, but the style of the image becomes that of the other input image.
![Refer to caption](extracted/5701174/02_Images/stylemix.png)
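Style mixing as described above can be sketched on per-layer latent codes. The sketch assumes each image is represented by one 512-dimensional code per generator layer, 18 layers for a 1024x1024 generator, and that the early layers control the layout while the later ones control the style. The crossover index of 8 and the function name `mix_styles` are illustrative assumptions rather than values taken from the tool.

```python
import numpy as np
from typing import Tuple


def mix_styles(w_a: np.ndarray, w_b: np.ndarray,
               crossover: int) -> Tuple[np.ndarray, np.ndarray]:
    """Exchange the per-layer latent codes from layer `crossover` onwards."""
    if w_a.shape != w_b.shape:
        raise ValueError("both latent codes must have the same shape")
    mixed_a = np.concatenate([w_a[:crossover], w_b[crossover:]])
    mixed_b = np.concatenate([w_b[:crossover], w_a[crossover:]])
    return mixed_a, mixed_b


num_layers, dim = 18, 512
w_a = np.zeros((num_layers, dim))  # stand-in for the first image's latent codes
w_b = np.ones((num_layers, dim))   # stand-in for the second image's latent codes
mixed_a, mixed_b = mix_styles(w_a, w_b, crossover=8)
```

Feeding `mixed_a` to the generator would keep the layout of the first face while taking the finer style from the second, and vice versa for `mixed_b`.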
Experiments
Projection to latent space
In order to examine the effectiveness of the face-frame correction algorithm proposed in Section 3.2, we first need to look at how well StyleGAN2 performs when projecting a target image to the latent space in terms of face-frame variation. To measure this, we project 11 images of different faces to the latent space without using the correction algorithm, and we measure the variation using the function defined in Section 3.1.2. These results are then collected in a table, giving us a baseline for the variation introduced by StyleGAN2's default projection.
Correction for projection
After examining the variation that StyleGAN2 introduces, we can study how well our proposed algorithm corrects it. To test this, we set the number of iterations to 2000 and run our algorithm on each of the 11 output images of Section 5.1. We then compare the variation at each iteration, which allows us to study the optimal number of iterations in terms of computational time and reduced variation.
Latent directions
We also want to look at the face-frame variation of the 10 images when transforming them by moving along the aforementioned latent directions. This allows us to rank the operations with respect to the variation they introduce in the generated image, making them more or less eligible for image modification when face-frame stabilization is essential.
Results
Projection to latent space
Table 6.1 shows the face-frame variation that the image to latent space projection achieved for each image. The maximum variation is 3.192%, while the minimum is 0.747%. The mean variation across these images is 1.812%. These results suggest that the projected images are a decent starting point when trying to generate similar faces with minimum face-frame variation.
| Image | Variation |
| --- | --- |
| Andrew | 1.429% |
| Daniel | 1.894% |
| Emma | 2.531% |
| Jennifer | 1.790% |
| Kristen | 2.267% |
| Matt | 1.452% |
| Paul | 1.198% |
| Rob | 2.381% |
| Rosa | 3.192% |
| Sheldon | 1.041% |
| Zendaya | 0.747% |
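The summary statistics quoted for this table can be recomputed from its rows. The snippet below does so with the values as printed; the mean of the rounded entries comes out at about 1.811%, which agrees with the reported 1.812% up to the rounding of the individual entries.

```python
from statistics import mean

# Face-frame variation (in percent) per projected image, as listed in the table.
variation_percent = {
    "Andrew": 1.429, "Daniel": 1.894, "Emma": 2.531, "Jennifer": 1.790,
    "Kristen": 2.267, "Matt": 1.452, "Paul": 1.198, "Rob": 2.381,
    "Rosa": 3.192, "Sheldon": 1.041, "Zendaya": 0.747,
}

mean_variation = mean(list(variation_percent.values()))
worst = max(variation_percent, key=variation_percent.get)
best = min(variation_percent, key=variation_percent.get)
```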
![Refer to caption](extracted/5701174/02_Images/faces_with_projection.png)
Correction for projection
Figures 6.2 and 6.3 illustrate the improvement in the face-frame variation through the iterations of the correction. The variation curves decrease significantly in the first 750 iterations. The last 1250 iterations account for only the last 10% of the improvement. Notably, the improvement in the variation always exceeds 20% of the initial value, and in some cases it exceeds 50%.
![Refer to caption](extracted/5701174/02_Images/absolute.png)
![Refer to caption](extracted/5701174/02_Images/normalized.png)
![Refer to caption](extracted/5701174/02_Images/750.jpeg)
Latent directions
We also check the face-frame variation of the 10 images when transforming them by moving along the aforementioned latent directions. Figure 6.5 shows the chosen images, labeled 1 to 10 from left to right.
![Refer to caption](extracted/5701174/02_Images/sample_faces.png)
When an image is moved a large magnitude along a direction, it tends to give unexpected results, deforming the base structure of the face and producing unrealistic images. For this reason, the range measured differs from direction to direction. For the set of chosen images, we decided on the following ranges:
Direction | Range |
Age | [-3, 3] |
Gender | [-3, 3] |
Vertical | [-3, 3] |
Horizontal | [-3, 3] |
Eyes-open | [-10, 10] |
Mouth-open | [-10, 10] |
Smile | [-2, 2] |
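A sweep over one of these ranges can be pictured as a linear move in latent space, w' = w + αd, which is how latent directions are commonly applied. The sketch below is a minimal illustration under assumptions that this section does not confirm: a projected face is stored as an 18×512 array, each direction has the same shape, and the magnitudes are evenly spaced. The function names are hypothetical, and random arrays stand in for a real latent code and a real direction.

```python
import numpy as np

# Magnitude range per direction, copied from the table above.
DIRECTION_RANGES = {
    "age": (-3, 3),
    "gender": (-3, 3),
    "vertical": (-3, 3),
    "horizontal": (-3, 3),
    "eyes-open": (-10, 10),
    "mouth-open": (-10, 10),
    "smile": (-2, 2),
}


def move_along_direction(w, direction, magnitude):
    """Shift the latent code `w` by `magnitude` units along `direction`."""
    return w + magnitude * direction


def sweep(w, direction, name, steps=7):
    """Return evenly spaced magnitudes within the range for `name` and the shifted latents."""
    low, high = DIRECTION_RANGES[name]
    magnitudes = np.linspace(low, high, steps)
    latents = [move_along_direction(w, direction, m) for m in magnitudes]
    return magnitudes, latents


# Random placeholders (assumed shape 18 x 512), NOT a real projection or direction.
rng = np.random.default_rng(0)
w = rng.standard_normal((18, 512))
smile_direction = rng.standard_normal((18, 512))

magnitudes, latents = sweep(w, smile_direction, "smile", steps=5)
print(magnitudes)
```

Each shifted latent would then be passed through the generator, and the face-frame variation measured against the image obtained at magnitude 0.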
Figures 6.8 and 6.9 show that modifying the horizontal and vertical face orientation greatly increases the face-frame variation. Figure 6.7 shows that gender modification is another operation that significantly increases it. As can be seen in Figure 6.13, these are among the worst operations with respect to this variation, together with smile modification (Figure 6.12) and age modification (Figure 6.6).
However, Figures 6.11 and 6.10 show that modifying the openness of the mouth or of the eyes are the two operations that increase the face-frame variation the least.
Another interesting result illustrated in Figure 6.13 is that all the tested operations appear to have an approximately linear impact on the face-frame variation.
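One way to quantify how linear such a curve is would be to fit a straight line of variation against the absolute magnitude and inspect the coefficient of determination. The following is a minimal sketch using ordinary least squares; the helper `fit_line` is hypothetical and the sample values are made up for illustration, not taken from the measurements behind Figure 6.13.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit y = slope * x + intercept, plus the R^2 of the fit."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual and total sums of squares for the coefficient of determination.
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    r_squared = 1.0 - ss_res / ss_tot if ss_tot else 1.0
    return slope, intercept, r_squared


# Illustrative values only (NOT measured data): |magnitude| vs. variation in %.
example_magnitudes = [0.0, 1.0, 2.0, 3.0]
example_variation = [0.0, 2.1, 3.9, 6.0]

slope, intercept, r_squared = fit_line(example_magnitudes, example_variation)
print(f"slope={slope:.3f} %/unit, intercept={intercept:.3f} %, R^2={r_squared:.4f}")
```

Under this view, the fitted slope acts as a per-direction rate of face-frame change, which gives a single number for ranking directions.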
![Refer to caption](extracted/5701174/02_Images/graph_age.png)
![Refer to caption](extracted/5701174/02_Images/graph_gender.png)
![Refer to caption](extracted/5701174/02_Images/graph_vertical.png)
![Refer to caption](extracted/5701174/02_Images/graph_horizontal.png)
![Refer to caption](extracted/5701174/02_Images/graph_eyesopen.png)
![Refer to caption](extracted/5701174/02_Images/graph_mouthopen.png)
![Refer to caption](extracted/5701174/02_Images/graph_smile.png)
![Refer to caption](extracted/5701174/02_Images/graph_means.png)
Conclusion
In this work, we have explored how different operations in StyleGAN2’s latent space affect the face-frame of the resulting facial image. We have also studied how StyleGAN2’s image projection affects the face-frame of the original image, and we proposed an algorithm to reduce this variation as much as possible when projecting an image.
This algorithm can be useful when projecting a target image to generate a new one that is not only similar to the original in terms of facial characteristics, but also preserves the face-frame as much as possible.
Beyond that, exploring the latent space can help in understanding some underlying aspects of this generative model and of its interpretable latent directions. The face-frame variation seems to increase linearly with the magnitude of the movement along any of the studied directions, although the rate of increase differs for each direction. For example, the mouth-open and eyes-open directions presented the least variation, while the horizontal and vertical face orientation, age and gender presented the greatest. This is somewhat expected: the face-frame differs depending on the face’s orientation, and faces have different characteristics depending on age or gender, which may include parts that define the frame.
The main goal of this paper is to help Laboratorio de Sueño y Memoria prove that the face-frame plays a key role in the creation of false memories, which can lead to innocent people being convicted after identification parades. Since most of the known facts in these crimes are age and gender, these results can help the Lab project target face images and modify their relevant features while changing their frames as little as possible.
Future Work
The face-frame correction algorithm and the variation function proposed in this paper provide insight into how the face-frame may be maintained when a facial image is projected onto the latent space. However, this is an initial approach that can be further enhanced to keep minimizing the face-frame variation while maintaining or even improving the fidelity of the output image. Here are some ideas we propose:
New directions
Latent directions are the key to modifying faces inside the latent space. However, only some of the known latent directions have been studied. New directions can be studied to understand how the latent space is arranged. To find new directions, an unsupervised discovery method can be used, for example the one proposed by Voynov and Babenko [33].
If one of the discovered directions is interpretable in a similar way to an already known one, the two can be compared by their face-frame variation in order to choose the better option.
This would allow further modification of facial characteristics within the network itself, which could help the Lab disprove the belief that police lineups are helpful when searching for the perpetrator of a crime.
Different face-frame variation functions
The face-frame variation function used for this project, described in section 3.1.2, is a first version that was created with this project in mind and isn’t flawless: sometimes it treats facial hair, such as a beard, as part of the face, and sometimes it labels it as part of the background. The neck is also sometimes considered part of the face, which it isn’t. There may be a more precise way to measure the face-frame variation between two images that we are not aware of and that could solve these issues.
One approach could be to take advantage of facial landmark detection algorithms [34], which would also help avoid losing facial characteristics: our algorithm only considers the face-frame and does not take the remaining facial characteristics into account.
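When evaluating alternative variation functions, a simple pixel-level baseline may be useful for comparison. The sketch below scores two face masks by the percentage of pixels whose face/background label differs. This is an illustrative assumption and is not claimed to reproduce the function of section 3.1.2; the function name is hypothetical, and the masks are assumed to be same-sized boolean arrays produced by some face segmentation model.

```python
import numpy as np


def mask_disagreement_pct(mask_a, mask_b):
    """Percentage of pixels labelled differently (face vs. background) in the two masks."""
    mask_a = np.asarray(mask_a, dtype=bool)
    mask_b = np.asarray(mask_b, dtype=bool)
    if mask_a.shape != mask_b.shape:
        raise ValueError("masks must have the same shape")
    # XOR marks the pixels where exactly one of the masks says "face".
    return 100.0 * np.count_nonzero(mask_a ^ mask_b) / mask_a.size


# Toy 4x4 example: a 2x2 "face" region, then the same region with one extra pixel.
original = np.zeros((4, 4), dtype=bool)
original[1:3, 1:3] = True
edited = original.copy()
edited[1, 3] = True

print(mask_disagreement_pct(original, edited))  # 6.25 (1 of 16 pixels differs)
```

A baseline of this kind inherits whatever mistakes the segmentation makes with beards or necks, which is exactly where a landmark-based measure could be compared against it.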
Improvement of face-frame variation when modifying faces
The face-frame correction algorithm was only applied and tested when projecting images to the latent space. However, it may also be used to reduce face-frame variation when performing operations on a facial image, such as modifying its eye-opening.
Different generative networks
This project was done using StyleGAN2 [9], as it is the state of the art in face image generation at the moment. However, the same experiment can be done using different networks. Text-to-image networks are becoming popular, and there may be a way for tools such as DALL-E [35] or Stable Diffusion [36] to modify faces with results similar to the ones obtained with StyleGAN2. In fact, DALL-E 2 [37] allows replacing parts of an image using a text description entered in a prompt, a process called inpainting. This enables photorealistic modification of images, yielding a credible image quickly and accurately.
![Refer to caption](extracted/5701174/02_Images/dalle_original.jpeg)
![Refer to caption](extracted/5701174/02_Images/dalle_1.jpeg)
![Refer to caption](extracted/5701174/02_Images/dalle_2.jpeg)
![Refer to caption](extracted/5701174/02_Images/dalle_3.jpeg)
Acknowledgements
We would like to express our deepest gratitude to our supervisor, Dr. Rodrigo Ramele, who not only presented the topic of this project, but also guided us through it, encouraging us to go deeper than we thought we could. We could not have undertaken this journey without Maite Herrán and Jimena Lozano who paved our way with their project and took the time to explain it to us.
Our project would not have been possible without the help of ITBA’s IT team, who allowed us to make use of their hardware and set up an environment so that we could use the neural network and perform all the tests we needed. We are also thankful to Dr. Cecilia Forcato and her team in the Laboratorio de Sueño y Memoria, who were always helpful and supportive of our project.
We thankfully acknowledge the NVIDIA Corporation Applied Research Accelerator Program, without whose help this project wouldn’t have been conducted #NVIDIAGrant.
Lastly, we would like to mention our families, friends and loved ones for their continuous support during this journey.
References
- Lozano and Herrán Oyhanarte [2021] Jimena Lozano and Maite Mercedes Herrán Oyhanarte. Use of generative adversarial networks for the creation and manipulation of facial images in the context of studying false memories and its effects on wrongful conviction cases: implementation of stylegan’s generative image modeling and style mixing properties to design an interface for experimentation purposes. 2021.
- Hafez [2019] Md Hafez. Neuromarketing: a new avatar in branding and advertisement. Pacific Business Review International, 12(4):58–64, 2019.
- Castellani et al. [2010] Rudy J Castellani, Raj K Rolston, and Mark A Smith. Alzheimer disease. Disease-a-month: DM, 56(9):484, 2010.
- Högberg et al. [2011] Göran Högberg, Davide Nardo, Tore Hällström, and Marco Pagani. Affective psychotherapy in post-traumatic reactions guided by affective neuroscience: memory reconsolidation and play. Psychology Research and Behavior Management, 4:87, 2011.
- Fukushima [1973] Kunihiko Fukushima. A model of associative memory in the brain. Kybernetik, 12(2):58–63, 1973.
- Schacter [2002] Daniel L Schacter. The seven sins of memory: How the mind forgets and remembers. HMH, 2002.
- Loftus [2004] Elizabeth F Loftus. Memories of things unseen. Current directions in psychological science, 13(4):145–147, 2004.
- Rattner [1983] Arye Rattner. Convicting the innocent: When justice goes wrong. The Ohio State University, 1983.
- Karras et al. [2020] Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. Analyzing and improving the image quality of StyleGAN. In Proc. CVPR, 2020.
- Luxemburg [2019] Robert Luxemburg. StyleGAN2 latent directions, 2019. https://twitter.com/robertluxemburg/status/1207087801344372736 [Accessed: 2022-08-10].
- Karras et al. [2019] Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4401–4410, 2019.
- Innocence Project [2022] Innocence Project. Research Resources - Innocence Project, 2022. https://innocenceproject.org/research-resources/ [Accessed: 2022-09-28].
- West and Meterko [2015] Emily West and Vanessa Meterko. Innocence project: Dna exonerations, 1989-2014: review of data and findings from the first 25 years. Alb. L. Rev., 79:717, 2015.
- Laboratorio de Sueño y Memoria [2022] Laboratorio de Sueño y Memoria. Investigación | Argentina | Labsuenoymemoria, 2022. https://www.labsuenoymemoria.com/investigación [Accessed: 2022-08-13].
- Zlotnik and Vansintjan [2019] Gregorio Zlotnik and Aaron Vansintjan. Memory: An extended definition. Frontiers in psychology, 10:2523, 2019.
- Forcato [2011] Cecilia Forcato. Estudio de la fase de reconsolidación de la memoria declarativa en humanos. PhD thesis, Universidad de Buenos Aires. Facultad de Ciencias Exactas y Naturales, 2011.
- Roediger et al. [2001] Henry L Roediger, Jason M Watson, Kathleen B McDermott, and David A Gallo. Factors that determine false recall: A multiple regression analysis. Psychonomic bulletin & review, 8(3):385–407, 2001.
- Goodfellow [2014] Ian J Goodfellow. On distinguishability criteria for estimating generative models. arXiv preprint arXiv:1412.6515, 2014.
- Arjovsky et al. [2017] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein generative adversarial networks. In International conference on machine learning, pages 214–223. PMLR, 2017.
- Radford et al. [2015] Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.
- Mirza and Osindero [2014] Mehdi Mirza and Simon Osindero. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784, 2014.
- Odena et al. [2017] Augustus Odena, Christopher Olah, and Jonathon Shlens. Conditional image synthesis with auxiliary classifier gans. In International conference on machine learning, pages 2642–2651. PMLR, 2017.
- Chen et al. [2016] Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel. Infogan: Interpretable representation learning by information maximizing generative adversarial nets. Advances in neural information processing systems, 29, 2016.
- Isola et al. [2017] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1125–1134, 2017.
- Zhang et al. [2017] Han Zhang, Tao Xu, Hongsheng Li, Shaoting Zhang, Xiaogang Wang, Xiaolei Huang, and Dimitris N Metaxas. Stackgan: Text to photo-realistic image synthesis with stacked generative adversarial networks. In Proceedings of the IEEE international conference on computer vision, pages 5907–5915, 2017.
- Zhu et al. [2017] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Computer Vision (ICCV), 2017 IEEE International Conference on, 2017.
- Karras et al. [2017] Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of gans for improved quality, stability, and variation. arXiv preprint arXiv:1710.10196, 2017.
- Härkönen et al. [2020] Erik Härkönen, Aaron Hertzmann, Jaakko Lehtinen, and Sylvain Paris. Ganspace: Discovering interpretable gan controls. Advances in Neural Information Processing Systems, 33:9841–9850, 2020.
- Mehta et al. [2022] Parthak Mehta, Sarthak Mishra, Nikhil Chouhan, Neel Pethani, and Ishani Saha. Face editing with gan–a review. arXiv preprint arXiv:2207.11227, 2022.
- Gupta [2018] Kamal Gupta. Face segmentation. https://github.com/kampta/face-seg, 2018.
- Zhang et al. [2018] Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 586–595, 2018.
- NVlabs [2021] NVlabs. Stylegan 2. https://github.com/NVlabs/stylegan2, 2021.
- Voynov and Babenko [2020] Andrey Voynov and Artem Babenko. Unsupervised discovery of interpretable directions in the gan latent space. In International conference on machine learning, pages 9786–9796. PMLR, 2020.
- Wu and Ji [2019] Yue Wu and Qiang Ji. Facial landmark detection: A literature survey. International Journal of Computer Vision, 127(2):115–142, 2019.
- Ramesh et al. [2021] Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-shot text-to-image generation. In International Conference on Machine Learning, pages 8821–8831. PMLR, 2021.
- Rombach et al. [2022] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10684–10695, June 2022.
- Ramesh et al. [2022] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents. arXiv, 2022. URL https://arxiv.longhoe.net/abs/2204.06125.