CS180 Project 5
Part A.0: Setup
An oil painting of a snowy mountain village
The prompt for this image was "an oil painting of a snowy mountain village." All of the outputs are relevant to the prompt. However, the more inference steps were used, the more detailed and less impressionistic the image became. For example, the 4-step inference yielded a very surreal impression of a village, while the 20-step inference yielded a more defined and vibrant version of the same idea. The 10-step inference was somewhere in between. Stage 2 upsampled the stage 1 outputs to produce much clearer images.
A man wearing a hat
The prompt for this image was "a man wearing a hat." All of the outputs seem to be relevant to the prompt. However, the more inference steps were used, the more detailed and realistic the image became. For example, the 4-step inference yielded an alternative depiction of the prompt, while the 20-step inference yielded a very detailed and almost photographic version of the same idea. The 10-step inference was somewhere in between. Stage 2 upsampled the stage 1 outputs to produce much clearer images.
A rocket ship
The prompt for this image was "a rocket ship." The 4-step inference didn't yield anything obviously relevant to the prompt (maybe I don't know too much about rocket ships), while the 20-step inference yielded a realistic drawing of a rocket ship. The 10-step inference was somewhere in between, yielding a hybrid between a traditional rocket ship and a bat. As with the previous two prompts, stage 2 upsampled the stage 1 outputs to produce much clearer images.
I'm using seed = 22 for this problem.
Part A.1.1: Implementing the Forward Process
To add noise, we follow this formula: \[ q(x_t | x_0) = N(x_t ; \sqrt{\bar\alpha_t} x_0, (1 - \bar\alpha_t)\mathbf{I}) \] After simplification, we arrive at the following equivalent form: \[ x_t = \sqrt{\bar\alpha_t} x_0 + \sqrt{1 - \bar\alpha_t} \epsilon \quad \text{where}~ \epsilon \sim N(0, \mathbf{I}) \tag{0} \] This means that we sample \( \epsilon \) from a standard normal distribution. For each t, we look up \(\bar\alpha_t\) in the scheduler's alphas_cumprod variable, scale the clean image, and add the scaled noise. Here, we display the Campanile with noise added at \(t \in \{250, 500, 750\}\):
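As a rough illustration of equation (0), here is a minimal NumPy sketch. It is not the project code: the linear beta schedule below is only an assumed stand-in for the scheduler's alphas_cumprod, and the random array stands in for the Campanile image.

```python
import numpy as np

def forward(x0, t, alphas_cumprod, rng):
    """Noise a clean image x0 to time step t following equation (0)."""
    abar_t = alphas_cumprod[t]
    eps = rng.standard_normal(x0.shape)  # epsilon ~ N(0, I)
    x_t = np.sqrt(abar_t) * x0 + np.sqrt(1.0 - abar_t) * eps
    return x_t, eps

# Assumption: a linear beta schedule standing in for the scheduler's values.
betas = np.linspace(1e-4, 0.02, 1000)
alphas_cumprod = np.cumprod(1.0 - betas)

rng = np.random.default_rng(22)
x0 = rng.uniform(-1.0, 1.0, size=(3, 64, 64))  # placeholder "image"
noisy = {t: forward(x0, t, alphas_cumprod, rng)[0] for t in (250, 500, 750)}
```

Because \(\bar\alpha_t\) shrinks as t grows, the image term fades and the noise term dominates at larger t.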
I'm using seed = 22 for this problem. However, for some sections, I need to use different seeds to ensure the quality and reproducibility of the results.
Part A.1.2: Classical Denoising
In this section, I attempted to use classical methods to denoise the test images from the previous part. I used torchvision.transforms.functional.gaussian_blur (kernel = 5, sigma = 2) to get the denoised (blurred) versions of the images. The results are not satisfactory at all, especially for the images with larger t.
I'm using seed = 22 for this problem.
Part A.1.3: One-Step Denoising
In this section, I attempted to denoise the noisy test images using the pretrained unet, which estimates the noise at the current time step. I passed the noisy image, the time step t, and the prompt embedding into the unet, and then used its output to reconstruct the denoised image using the following formula. \[ x_0 = \frac{x_t - \sqrt{1 - \bar{\alpha}_t} \epsilon}{\sqrt{\bar{\alpha}_t}} \tag{1} \] where \(x_0\) is the clean image, \(x_t\) is the noisy image, \(\epsilon\) is the estimated noise, and \(\bar{\alpha}_t\) is the alphas_cumprod variable from the scheduler, indexed by t. The following are the results from the denoising process.
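Equation (1) is just equation (0) solved for \(x_0\). The sketch below checks this numerically; it assumes a made-up \(\bar\alpha_t\) value and uses the true noise in place of the unet's estimate, so the recovery is exact here, which it would not be with a real model.

```python
import numpy as np

def estimate_x0(x_t, eps_hat, abar_t):
    """One-step clean-image estimate from equation (1)."""
    return (x_t - np.sqrt(1.0 - abar_t) * eps_hat) / np.sqrt(abar_t)

rng = np.random.default_rng(22)
x0 = rng.uniform(-1.0, 1.0, size=(3, 8, 8))   # placeholder clean image
eps = rng.standard_normal(x0.shape)
abar_t = 0.3                                   # hypothetical alphas_cumprod[t]

# Forward process (equation (0)), then inversion (equation (1)).
x_t = np.sqrt(abar_t) * x0 + np.sqrt(1.0 - abar_t) * eps
x0_hat = estimate_x0(x_t, eps, abar_t)         # true eps stands in for the unet
```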
I'm using seed = 22 for this problem.
Part A.1.4: Iterative Denoising
In this section, I used iterative denoising to achieve better results than one-step denoising. I started at the noisiest time step, \(T = 990\), and iteratively denoised the image with a stride of -30 until T reached 0. The list of Ts is stored in the variable called strided timesteps. In a for loop, I went through each value of strided timesteps and denoised the image at that time step t. The algorithm is as follows:
For each t, I used the following formula to reconstruct the denoised image at that time step. \[ x_{t'} = \frac{\sqrt{\bar\alpha_{t'}}\beta_t}{1 - \bar\alpha_t} x_0 + \frac{\sqrt{\alpha_t}(1 - \bar\alpha_{t'})}{1 - \bar\alpha_t} x_t + v_\sigma \tag{2} \] Where:
- \( t' \): The next, less noisy time step (index + 1 in strided timesteps) relative to the current time step \( t \).
- \( \bar{\alpha}_{t'} \): The cumulative product found in alphas_cumprod at time step \( t' \).
- \( x_{t'} \): The denoised image at time step \( t' \).
- \( \bar{\alpha}_t \): The cumulative product found in alphas_cumprod at time step \( t \).
- \( \alpha_t \): \(\frac{\bar\alpha_t}{\bar\alpha_{t'}}\)
- \( \beta_t \): \(1 - \alpha_t\)
- \( x_t \): The noisy image at time step \( t \).
- \( x_0 \): The current state of the cleaned image at time step \( t \), retrieved using equation (1).
- \( v_\sigma \): The predicted noise variance from the unet.
Specific details about implementation:
- I used the add_variance function to add the predicted variance from the unet to the denoised image.
- I used i_start = 10 as the starting time step.
- I used the estimated noise from the unet as \(\epsilon\) in equation (1) to get \( x_0 \).
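A single update of equation (2) can be sketched as follows. This is a simplified NumPy illustration, not the project code: v_sigma is passed in as a plain term rather than produced by add_variance, the \(\bar\alpha\) values are made up, and the variable name strided_timesteps is my assumed spelling of the list described above.

```python
import numpy as np

def denoise_step(x_t, eps_hat, abar_t, abar_prev, v_sigma=0.0):
    """One update x_t -> x_{t'} following equations (1) and (2)."""
    alpha_t = abar_t / abar_prev
    beta_t = 1.0 - alpha_t
    # Equation (1): current estimate of the clean image.
    x0_hat = (x_t - np.sqrt(1.0 - abar_t) * eps_hat) / np.sqrt(abar_t)
    # Equation (2): blend the clean estimate with the current noisy image.
    coef_x0 = np.sqrt(abar_prev) * beta_t / (1.0 - abar_t)
    coef_xt = np.sqrt(alpha_t) * (1.0 - abar_prev) / (1.0 - abar_t)
    return coef_x0 * x0_hat + coef_xt * x_t + v_sigma

# Time steps 990, 960, ..., 30, 0 (stride of -30).
strided_timesteps = list(range(990, -1, -30))

rng = np.random.default_rng(22)
x0 = rng.uniform(-1.0, 1.0, size=(3, 8, 8))
eps = rng.standard_normal(x0.shape)
abar_t = 0.4                                   # hypothetical value
x_t = np.sqrt(abar_t) * x0 + np.sqrt(1.0 - abar_t) * eps
# With abar_prev = 1 (a fully clean target step) the update returns x0_hat.
x_prev = denoise_step(x_t, eps, abar_t, abar_prev=1.0)
```

In the sanity check, the coefficient on \(x_t\) vanishes and the coefficient on \(x_0\) becomes 1, which is what we would expect at the final step.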
I'm using seed = 22 for this problem. Here are the results.
Part A.1.5: Diffusion Model Sampling
In this section, I used the trained unet diffusion model to denoise pure noise with the embedded guidance of "a high quality photo". To generate images from random noise, I set i_start = 0 and passed in random noise sampled from a standard normal distribution with shape (5, 3, 64, 64), so five images were sampled at once. The sampling/reconstruction loop is the same as in A.1.4.
The sampled images below are not great, but they are of reasonable quality.
Part A.1.6: Classifier-Free Guidance (CFG)
In this section, I used classifier-free guidance to improve the quality of the denoised images. The CFG algorithm involves running the unet model twice, once as an unconditional unet and once as a conditional unet. I used the null prompt, '', as the embedded prompt for the unconditional unet and the original prompt, 'a high quality photo', as the embedded prompt for the conditional unet. The output of the unconditional unet is called \(\epsilon_u\) and the output of the conditional unet is called \(\epsilon_c\). Using the equation below, I got the final noise at the current time step.
\[ \epsilon = \epsilon_u + \gamma (\epsilon_c - \epsilon_u) \tag{3} \]
\(\gamma\) is the guidance scale, a variable that controls how much guidance the final noise receives. When \(\gamma = 1\), the estimated noise is purely conditional, while when \(\gamma = 0\), the estimated noise is purely unconditional. To make the magic happen, I set \(\gamma = 7\) per the spec's recommendation.
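Equation (3) is a one-liner; the sketch below uses toy arrays in place of the two unet outputs to show the two limiting cases and the extrapolation that happens when \(\gamma > 1\).

```python
import numpy as np

def cfg_noise(eps_u, eps_c, gamma):
    """Classifier-free guidance combination from equation (3)."""
    return eps_u + gamma * (eps_c - eps_u)

eps_u = np.zeros((3, 4, 4))   # toy unconditional estimate
eps_c = np.ones((3, 4, 4))    # toy conditional estimate

purely_uncond = cfg_noise(eps_u, eps_c, 0.0)
purely_cond = cfg_noise(eps_u, eps_c, 1.0)
guided = cfg_noise(eps_u, eps_c, 7.0)   # pushes past eps_c, away from eps_u
```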
After getting the guided estimated noise, the rest of the image denoising process is the same as A.1.5. Here are 5 samples from the iterative denoising process with CFG. I would say they are of much higher quality (clearer and more vibrant) than the samples from the previous section.
Part A.1.7: Image-to-image Translation
In this section, I used the SDEdit algorithm with CFG to iteratively guide a noisy image back to something resembling the original image. To do so, I followed these steps:
- I added noise to the chosen image by running the forward method outlined in equation (0).
- I then used "a high quality photo" as the embedded prompt to guide the denoising process at time steps [1, 3, 5, 7, 10, 20].
- I applied these two steps on the test image and two other images of my choice.
- The seed is 100 for all three images.
Test Image SDEdit:
Burney Fall SDEdit:
Yosemite SDEdit:
Part A.1.7.1: Editing Hand-Drawn and Web Images
Now it's time to apply the SDEdit technique to web images and hand-drawn images. To do so, I downloaded one web image and drew 2 images by hand, after which I applied the same procedure as the previous part and recorded the results and intermediate images.
Kangaroo (Web) SDEdit:
Kangaroo Image Link
Hand-drawn Leaf SDEdit:
Hand-drawn Tree SDEdit:
Part A.1.7.2: Inpainting
To allow inpainting, we use a binary mask: new content is painted where the mask is 1, and the original image (noised to the current time step) is kept where the mask is 0. To achieve this effect, we apply the following equation after obtaining the denoised image \( x_t \) at each time step \( t \).
\[ x_t \leftarrow \textbf{m} x_t + (1 - \textbf{m}) \text{forward}(x_{orig}, t) \tag{4} \] Where:
- \( x_t \): The denoised image at time step \( t \).
- \( x_{orig} \): The original image.
- \( \textbf{m} \): The binary mask.
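The masking step in equation (4) can be sketched like this. The arrays are placeholders: x_orig_noised stands for the output of forward(x_orig, t), and the square mask is an arbitrary example.

```python
import numpy as np

def inpaint_update(x_t, x_orig_noised, mask):
    """Equation (4): keep generated content where mask == 1, original elsewhere."""
    return mask * x_t + (1.0 - mask) * x_orig_noised

rng = np.random.default_rng(22)
x_t = rng.standard_normal((3, 16, 16))            # current denoised sample
x_orig_noised = rng.standard_normal((3, 16, 16))  # stands for forward(x_orig, t)

mask = np.zeros((1, 16, 16))
mask[:, 4:12, 4:12] = 1.0                         # region to repaint

x_new = inpaint_update(x_t, x_orig_noised, mask)
```

The mask has a single channel and broadcasts across the three color channels.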
Here are the results from the inpainting process.
Part A.1.7.3: Text-Conditioned Image-to-image Translation
The text-conditioned image-to-image translation requires us to change the conditioning prompt from "a high quality photo" to any of the given prompts. For this section, I used "a rocket ship" for the test image, "a lithograph of a river" for the Yosemite image, and "a rocket ship" for the street light image.
Here are the results using text-conditioning with seed = 100.
Part A.1.8: Visual Anagrams
Now it's time for something fun! In this section, I used the pretrained denoiser model to create visual anagrams guided by two text prompts. The result should look like the first prompt when right side up and like the second prompt when upside down. To do so, I used the following equations to process the estimated noise:
\[ \epsilon_1 = \text{UNet}(x_t, t, p_1) \tag{5} \] \[ \epsilon_2 = \text{flip}(\text{UNet}(\text{flip}(x_t), t, p_2)) \tag{6} \] \[ \epsilon = (\epsilon_1 + \epsilon_2) / 2 \tag{7} \]
For the first text prompt \(p_1\), I used equation (5) to estimate the noise \(\epsilon_1\) of the image at each time step. Then, I used equation (6) to estimate the noise of the flipped image (flipped along the height axis) using \(p_2\) as the conditional prompt, and flipped the result back. Finally, I used equation (7) to estimate the total noise by averaging the two estimates \(\epsilon_1\) and \(\epsilon_2\). Note that we also need CFG here, so I computed the unconditional noise for the original and the flipped image separately using the null prompt. Then, I applied the CFG algorithm separately to \(\epsilon_1\) and \(\epsilon_2\) before averaging them. The two prompts shared the same guidance scale. Everything else is the same as text-conditioned iterative denoising.
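Equations (5) to (7) can be sketched as below. The toy_unet function is a hypothetical stand-in that only mimics the call shape of a noise predictor, and CFG is left out to keep the example short.

```python
import numpy as np

def flip(x):
    """Flip a (C, H, W) array along the height axis."""
    return x[..., ::-1, :]

def anagram_noise(unet, x_t, t, p1, p2):
    """Average the upright and upside-down noise estimates (equations (5)-(7))."""
    eps1 = unet(x_t, t, p1)               # equation (5)
    eps2 = flip(unet(flip(x_t), t, p2))   # equation (6)
    return (eps1 + eps2) / 2.0            # equation (7)

def toy_unet(x, t, prompt):
    """Hypothetical stand-in: scales the input depending on the prompt."""
    return x * (1.0 if prompt == "p1" else 2.0)

rng = np.random.default_rng(22)
x_t = rng.standard_normal((3, 8, 8))
eps = anagram_noise(toy_unet, x_t, 500, "p1", "p2")
```

Flipping the second estimate back is what keeps both estimates aligned with the upright image before they are averaged.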
Here are some of the results:
Part A.1.9: Hybrid Images
We can also try to make some hybrid images that look like the first prompt when far away and the second prompt when close up. In principle, we can take the low frequencies of the noise estimate conditioned on the first prompt, take the high frequencies of the noise estimate conditioned on the second prompt, and add them together to get the final noise. The resulting image should retain the low frequencies from the first prompt and the high frequencies from the second prompt, thus creating a hybrid effect. Here are the equations and how to use them.
\[ \epsilon_1 = \text{UNet}(x_t, t, p_1) \tag{8} \] \[ \epsilon_2 = \text{UNet}(x_t, t, p_2) \tag{9} \] \[ \epsilon = f_\text{lowpass}(\epsilon_1) + f_\text{highpass}(\epsilon_2) \tag{10} \]
Similar to A.1.8, I used equation (8) to estimate the noise \(\epsilon_1\) of the image at each time step using the first prompt \(p_1\). Then, I used equation (9) to estimate the noise for the second prompt \(p_2\). The only difference was that I didn't need to flip the image for the second prompt. Finally, I used two Gaussian blurs with kernel = 33 and sigma = 2 to extract the low-frequency components of \(\epsilon_1\) and \(\epsilon_2\). The high-frequency component of prompt two was the difference between the original estimate \(\epsilon_2\) and its low-frequency component. Using equation (10), I added the low frequencies from prompt one to the high frequencies from prompt two to get the final noise. I also applied CFG with the guidance scale to prompt one and prompt two separately before combining the noises.
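The frequency split in equation (10) can be sketched with a small separable Gaussian blur in NumPy. This is only an approximation of what torchvision's gaussian_blur does (the reflect padding here is my assumption), with toy arrays in place of the two noise estimates.

```python
import numpy as np

def gaussian_kernel1d(size=33, sigma=2.0):
    """Normalized 1D Gaussian kernel."""
    ax = np.arange(size) - (size - 1) / 2.0
    k = np.exp(-0.5 * (ax / sigma) ** 2)
    return k / k.sum()

def lowpass(x, size=33, sigma=2.0):
    """Separable Gaussian blur over the last two axes (reflect padding)."""
    k = gaussian_kernel1d(size, sigma)
    pad = size // 2
    out = x
    for axis in (-2, -1):
        pad_width = [(0, 0)] * out.ndim
        pad_width[axis] = (pad, pad)
        padded = np.pad(out, pad_width, mode="reflect")
        out = np.apply_along_axis(
            lambda v: np.convolve(v, k, mode="valid"), axis, padded
        )
    return out

def hybrid_noise(eps1, eps2):
    """Equation (10): low frequencies of eps1 plus high frequencies of eps2."""
    return lowpass(eps1) + (eps2 - lowpass(eps2))

rng = np.random.default_rng(22)
eps1 = rng.standard_normal((3, 64, 64))   # toy estimate for prompt one
eps2 = rng.standard_normal((3, 64, 64))   # toy estimate for prompt two
eps = hybrid_noise(eps1, eps2)
```

A quick consistency check: if both prompts give the same estimate, the low-pass and high-pass parts sum back to that estimate.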
Part B.1.1: Implementing the UNet
In this section, I implemented the unconditional UNet using various convolutional blocks and concatenation layers. The high-level idea of this UNet is that it downsamples the image from the original size into a 1 by 1 by 2D tensor, learning the image along the way. Then, it upsamples the image from the 1 by 1 by 2D tensor back to the original size, combining the results from downsampling along the way through concatenation. This structure can be used to transform the input image.
Here is the general architecture of the UNet (borrowed from the project spec):
To implement the UNet, I added both the block initialization and the forward method of each block. Then, I combined everything into the UNet class, adding the necessary layers and filling in the forward methods. After constructing the UNet architecture, I was ready to start training a denoiser for the MNIST dataset.
I used 22 as the SEED.
Part B.1.2: Using the UNet to Train a Denoiser
To train the network, I needed to identify a loss function first. Per the spec's suggestion, I used an average L2 loss between the estimated clean image and the true clean image. Here is the loss function:
\[ L = \mathbb{E}_{z,x}\| D_{\theta}(z) - x\|^2. \tag{11} \] Where:
- \(D_{\theta}(z)\) is the estimated clean image from the model.
- \(x\) is the original image.
To prepare the data for training, I artificially added noises to the MNIST train dataset using the following equation:
\[ z = x + \sigma \epsilon,\quad \text{where }\epsilon \sim N(0, I). \tag{12} \]
For demonstration, I used \(\sigma = [0.0, 0.2, 0.4, 0.5, 0.6, 0.8, 1.0]\) to add noises to the numbers 7, 2, 1, 0, and 4. Here is the result.
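Equation (12) can be illustrated with the short sketch below, which uses a random 28 by 28 array as a stand-in for an MNIST digit and loops over the same list of \(\sigma\) values.

```python
import numpy as np

def add_noise(x, sigma, rng):
    """Equation (12): z = x + sigma * eps with eps ~ N(0, I)."""
    return x + sigma * rng.standard_normal(x.shape)

sigmas = [0.0, 0.2, 0.4, 0.5, 0.6, 0.8, 1.0]
rng = np.random.default_rng(22)
x = rng.uniform(0.0, 1.0, size=(28, 28))   # placeholder for an MNIST digit
noisy = [add_noise(x, s, rng) for s in sigmas]
```

With \(\sigma = 0\) the image is returned unchanged, and the residual noise grows in proportion to \(\sigma\).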
Part B.1.2.1: Training
I constructed a training loop that trains the UNet for 5 epochs on the shuffled MNIST train dataset. I trained the denoiser using the following parameters:
- \(\text{D} = 128\)
- \(\text{Number of epochs} = 5\)
- \(\text{Batch size} = 256\)
- \(\text{learning rate} = 0.0001\)
- \(\text{Optimizer} = \text{Adam}\)
Note that I only added noise to the train images right before they were fed into the model so that the model sees different noise each time. The noise was added using equation (12). Within the training loop, I also collected the training loss and graphed the output at appropriate intervals. Here are the results:
The training loss decreased drastically in epoch 1 and then gradually leveled off at a very small value. The results on the test-set digits after 1 epoch were not as accurate and well defined as the ones after 5 epochs of training, which was expected.
Part B.1.2.2: Out-of-Distribution Testing
After the model had been trained for 5 epochs, I applied it to the same number 7 image with different levels of noise. For each \(\sigma\) in the list \(\sigma = [0.0, 0.2, 0.4, 0.5, 0.6, 0.8, 1.0]\), I added noise to the number 7 and then used the model to predict the denoised image. Here are the results:
The denoised results were very good for \(\sigma \leq 0.5\). For \(\sigma > 0.5\), the model tended to add some artifacts to the image, probably because it was not trained to handle these higher noise levels. Our current model still needed improvement.
Part B.2: Training a Diffusion Model
With the unconditional UNet under our belts, we could train a similar network to estimate the noise at each level, thus building ourselves a diffusion model. The new objective called for a new objective function. Instead of minimizing the loss between the estimated final image and the original image, we minimized the L2 (MSE) difference between the predicted noise and the true noise. \[ L = \mathbb{E}_{\epsilon,z} \|\epsilon_{\theta}(z) - \epsilon\|^2 \tag{13} \] where \(\epsilon_{\theta}(z)\) is the predicted noise from the UNet.
To add noise, we used equation (0) from part A at each time step in the training process.
Part B.2.1 - B.2.3: Adding Time Conditioning to UNet
To add time conditioning so that we can train the model to predict noise at each time step, we first needed to change the model structure. We added an additional parameter, called t, to the model function signature. Under the hood, we used an FCBlock to embed the scalar t to fit the dimension of the convolutional layers. Then we added the embedded t to the unflattened and the first upsampled blocks of the architecture. Here is a diagram:
The FCBlock consists of two fully connected layers and one layer of GELU activation. It can embed the scalar t and make it understandable to our model.
To construct the training loop, I first set up a training schedule that contained precomputed \(\beta_t\), \(\alpha_t = 1 - \beta_t\), and \(\bar\alpha_t = \prod_{s=1}^t \alpha_s\). I first calculated \(\beta_t\), a list of 300 numbers evenly spaced from 0.0001 to 0.02. Then I derived the rest of the variables from it.
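The schedule described above can be sketched in a few lines of NumPy. The function name ddpm_schedule and the dictionary keys are illustrative choices of mine, not names taken from the project code.

```python
import numpy as np

def ddpm_schedule(beta_start=1e-4, beta_end=0.02, num_steps=300):
    """Precompute beta_t, alpha_t = 1 - beta_t, and the cumulative product."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)   # alpha_bar_t = prod_{s <= t} alpha_s
    return {"betas": betas, "alphas": alphas, "alpha_bars": alpha_bars}

schedule = ddpm_schedule()
```

Since every \(\alpha_t\) is slightly below 1, \(\bar\alpha_t\) decreases monotonically, so later time steps carry less of the original image.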
The training loop is as follows:
I got the precomputed \(\bar\alpha\) from the scheduler. For each time step t, I sampled noise from a standard normal distribution and added it to the image according to equation (0). Then, I passed the noisy image and the normalized t into the model to get the predicted noise, after which I minimized the loss. I repeated this process for the number of epochs.
The training parameters stayed similar except that I used an exponential learning rate scheduler to slow down the gradient descent after each epoch. The number of epochs was increased to 20, and the hidden dimension was decreased to 64.
Here is the training loss curve throughout the 20 epochs:
Now that I had the trained UNet, I could start to sample from it. To sample, I followed the algorithm below to reconstruct the clean image at each time step using the noise output from the model. Here is the algorithm:
I retrieved the elements from the scheduler and followed the steps outlined in the diagram to iteratively denoise the image. At a high level, the algorithm iteratively inverts the noise-adding process. For each t (from large to small), I estimated the clean image at that time step using the output of the model. Then, I used more information from the scheduler to reconstruct the cleaned image for the next step. I repeated this for all t from T to 1 and output the final image. Here are some results:
Part B.2.4 - B.2.5: Adding Class Conditioning to UNet
I added two more FCBlocks for the class-conditional UNet. Each block takes in the one-hot encoded class conditioner and embeds it to match the dimension of the convolutional layers. The first class embedding was applied to the unflattened tensor via an elementwise product (before adding the time conditioning). The second class embedding was applied to the upsample 1 tensor, also via an elementwise product, before adding the time conditioning. I also used a dropout mask (a tensor of zeros) to ensure that we only consider class conditioning 90% of the time.
The training loop for this section is very similar to the one for time conditioning. The only difference was that we needed to pass the labels as one-hot encoded condition vectors. We also needed to drop the class conditioning 10% of the time. I used a Bernoulli random variable as the indicator, with a 90% success rate and a 10% failure rate. If the random variable fails (returns 0), I multiply the one-hot encoded class by the mask to set it to zero.
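The conditioning dropout described above can be sketched as follows. This NumPy version draws one Bernoulli indicator per sample, which is one plausible reading of the description; the function names and the per-sample granularity are assumptions rather than the project's exact implementation.

```python
import numpy as np

def one_hot(labels, num_classes=10):
    """One-hot encode integer class labels."""
    return np.eye(num_classes)[labels]

def drop_class_conditioning(c, p_uncond, rng):
    """Zero out each row of c with probability p_uncond."""
    # keep == 1 with probability 1 - p_uncond (a Bernoulli "success").
    keep = rng.binomial(1, 1.0 - p_uncond, size=(c.shape[0], 1))
    return c * keep, keep

rng = np.random.default_rng(22)
labels = rng.integers(0, 10, size=256)
c = one_hot(labels)
c_masked, keep = drop_class_conditioning(c, p_uncond=0.1, rng=rng)
```

Rows whose indicator is 0 become all-zero vectors, which is the same unconditional input used later when sampling with classifier-free guidance.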
The training loop is as follows:
I got the precomputed \(\bar\alpha\) from the scheduler. For each time step t, I sampled noise from a standard normal distribution and added it to the image according to equation (0). Then, I passed the noisy image, the normalized t, and the class conditioning vectors into the model to get the predicted noise, after which I minimized the loss. I repeated this process for the number of epochs.
The training parameters stayed similar except that I used an exponential learning rate scheduler to slow down the gradient descent after each epoch. The number of epochs was increased to 20, and the hidden dimension was decreased to 64.
Here is the training loss curve throughout the 20 epochs:
Now that I had the trained UNet, I could start to sample from it. To sample, I followed the algorithm below to reconstruct the clean image at each time step using the noise output from the model. Here is the algorithm:
The general approach was similar to the one for the time-conditioned UNet. The only difference was that I also computed an unconditional noise estimate and combined it with the conditional estimate using the guidance scale. More concretely, I called the UNet model twice, once with the unconditional input (a tensor of all zeros) and once with the one-hot encoded classes that I wanted to generate. The process is the same as the CFG in part A. I chose the guidance scale to be 5. Here are some results:
Conclusion
This is by far the most interesting and rewarding project I have ever done. I got a chance to learn so much about image-to-image translation from theory to implementation, from seed to fruition. As fun as this project was, it wasn't without its challenges. I had to familiarize myself with Google Colab very quickly and learn to use GPU resources efficiently. There were several bugs that seemed impossible to solve, but I managed to locate the mistakes and fix them in time. I would love to do more projects like this in the future.