Project 5A: The Power of Diffusion Models

Kishan Jani

Part 5A 0: Setup

We will use the DeepFloyd IF diffusion model, with a random seed of 10 here and throughout the project. In this part, we simply test out the model, generating images (shown with captions) for 3 text prompts across different values of num_inference_steps.
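
As a rough sketch of this setup (assuming the Hugging Face diffusers pipeline for DeepFloyd IF; the model id, prompt, and step values below are illustrative, not necessarily the ones used here):

    import torch
    from diffusers import DiffusionPipeline

    # Load stage 1 of DeepFloyd IF (model id assumed; requires accepting its license).
    stage_1 = DiffusionPipeline.from_pretrained(
        "DeepFloyd/IF-I-XL-v1.0", variant="fp16", torch_dtype=torch.float16
    ).to("cuda")

    generator = torch.Generator("cuda").manual_seed(10)  # seed = 10, as above
    for steps in [5, 20, 100]:  # example num_inference_steps values
        image = stage_1(
            "an oil painting of a snowy mountain village",  # illustrative prompt
            num_inference_steps=steps,
            generator=generator,
        ).images[0]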


To see the images generated, click on the following link:
5A Part 0 Results

Part 5A 1.1: Implementing the Forward Process

Overview

The forward process is defined by \[ x_t = \sqrt{\bar{\alpha}_t}x_0 + \sqrt{1 - \bar{\alpha}_t}\varepsilon \] where \( \varepsilon \sim N(0, I) \). Here \(x_0\) is the clean image and \( x_t \) the noisy image generated, so we are really sampling the noisy image from a Gaussian with mean \( \sqrt{\bar{\alpha}_t} x_0\) and variance \( (1- \bar{\alpha}_t) I \). We perform the process on a test image of the Campanile, shown below, resized to \( 64 \times 64 \).
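
A minimal sketch of this forward step (assuming alphas_cumprod is the scheduler's cumulative-product tensor of \( \bar{\alpha}_t \) values, e.g. stage_1.scheduler.alphas_cumprod):

    import torch

    def forward(x0: torch.Tensor, t: int, alphas_cumprod: torch.Tensor) -> torch.Tensor:
        """Noise a clean image x0 to timestep t via the forward process."""
        abar_t = alphas_cumprod[t]
        eps = torch.randn_like(x0)  # ε ~ N(0, I)
        return abar_t.sqrt() * x0 + (1 - abar_t).sqrt() * eps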

Results

To see results for \( t\in \{250,500,750\} \), click the link below:

5A Part 1.1 Results

Part 5A 1.2: Gaussian Blur Denoising

Overview

We naively "denoise" images by blurring them with a fixed Gaussian, using kernel_size = 5 and sigma = 1.5. We do this for each of the noise levels seen above.
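
A sketch using torchvision's built-in Gaussian blur (the function and variable names here are illustrative):

    import torch
    import torchvision.transforms.functional as TF

    def blur_denoise(noisy_im: torch.Tensor) -> torch.Tensor:
        # Naive "denoising": blur with a fixed Gaussian (kernel_size=5, sigma=1.5).
        return TF.gaussian_blur(noisy_im, kernel_size=5, sigma=1.5)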

Results

To see results for \( t\in \{250,500,750\} \), click the link below:

5A Part 1.2 Results

Part 5A 1.3: One-Step Denoising

Overview

Now, we'll use a pretrained diffusion model to denoise. The actual denoiser can be found at stage_1.unet. This is a UNet that has already been trained on a very, very large dataset of image pairs \( (x_0, x_t) \). We can use it to estimate the Gaussian noise in a noisy image, then remove that noise to recover (something close to) the original image. To estimate \(x_0\) from \(x_t\), we use \[ \hat{x}_0 = \frac{1}{\sqrt{\bar{\alpha}_t}} [x_t - \sqrt{1 - \bar{\alpha}_t} \varepsilon_\theta(x_t,t)],\] where \(\varepsilon_\theta(\cdot,\cdot)\) is the noise predicted by our model for a given noisy image \(x_t\) and corresponding noise level \(t\).
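
A sketch of this one-step estimate (assuming the diffusers UNet interface; that DeepFloyd's UNet outputs 6 channels, of which the first 3 are the noise estimate, is an assumption about this particular model):

    import torch

    @torch.no_grad()
    def one_step_denoise(unet, x_t, t, prompt_embeds, alphas_cumprod):
        """Estimate x_0 from x_t by predicting and removing the noise."""
        # ε_θ(x_t, t): the UNet's predicted noise (first 3 output channels).
        model_out = unet(x_t, t, encoder_hidden_states=prompt_embeds).sample
        eps = model_out[:, :3]
        abar_t = alphas_cumprod[t]
        # Invert the forward process: x_0 = (x_t - sqrt(1 - abar_t) * eps) / sqrt(abar_t)
        return (x_t - (1 - abar_t).sqrt() * eps) / abar_t.sqrt()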

Results

To see results for \( t\in \{250,500,750\} \), click the link below:

5A Part 1.3 Results

Part 5A 1.4: Iterative Denoising

Overview

We can denoise an image and recover more of the original features by denoising in many small steps instead of in one step. Specifically, we iterate over strided timesteps running from \(T=1000\) down to 0 in steps of 30, and at each step compute \[ x_{t'} = \frac{\sqrt{\bar{\alpha}_{t'}} \beta_t} {1 - \bar{\alpha}_t} x_0 + \frac{\sqrt{\alpha_t}(1-\bar{\alpha}_{t'})}{1-\bar{\alpha}_t} x_t + v_\sigma,\] where \(x_0\) is our current clean estimate, obtained from \(x_t\) via the one-step formula above, \(v_\sigma\) is random noise (which DeepFloyd predicts), and the alphas and betas are known noise-schedule parameters, with \(\alpha_t = \bar{\alpha}_t / \bar{\alpha}_{t'}\) and \(\beta_t = 1 - \alpha_t\). Here \(x_t\) is the noisier image at some timestep \(t\), while \(x_{t'}\) is the less noisy image at the next timestep \(t'\) in our iteration sequence.
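
One update step might look like the following sketch (with x0_hat from the one-step estimate above and v_sigma the model-predicted noise term; both names are assumptions):

    import torch

    def denoise_step(x_t, x0_hat, t, t_prime, alphas_cumprod, v_sigma):
        """One iterative-denoising update from timestep t to the less noisy t'."""
        abar_t, abar_tp = alphas_cumprod[t], alphas_cumprod[t_prime]
        alpha_t = abar_t / abar_tp   # ratio of cumulative products over the stride
        beta_t = 1 - alpha_t
        x_tp = (abar_tp.sqrt() * beta_t / (1 - abar_t)) * x0_hat \
             + (alpha_t.sqrt() * (1 - abar_tp) / (1 - abar_t)) * x_t \
             + v_sigma
        return x_tp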

Results

To see results, click the link below:

5A Part 1.4 Results

Part 5A 1.5: Diffusion Model Sampling

Overview

We "generate" images using the Diffusion Model by feeding random noise into the iterative denoiser, using i_start = 0 . By starting with random noise as the noisy image, we are "solving" towards a random image via iterative denoising. We use the text prompt "a high quality photo" for denoising.

Results

To see results, click the link below:

5A Part 1.5 Results

Part 5A 1.6: Classifier-Free Guidance (CFG)

Overview

We can generate higher quality images with classifier-free guidance: we run conditional diffusion with the prompt "a high quality photo" alongside the unconditional prompt "". Specifically, we skew our noise estimate as \[\varepsilon = \varepsilon_{\text{uncond}} + \gamma(\varepsilon_{\text{cond}} - \varepsilon_{\text{uncond}} ).\] With \(\gamma=0\) we recover the standard unconditional noise estimate, and with \(\gamma=1\) the conditional one; it has been determined empirically that choosing \(\gamma > 1\) improves conditioning. We use scale \( \gamma = 7\), which pushes the model to diffuse the image toward the prompt used to generate the conditional noise estimate. This yields higher quality images, albeit with slightly less variety.
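
Inside the denoising loop, the CFG estimate can be formed as in this sketch (reusing the assumed 6-channel UNet output from Part 1.3; null_embeds are the embeddings of the empty prompt):

    # Two forward passes: conditional on the prompt, and unconditional ("").
    eps_cond = unet(x_t, t, encoder_hidden_states=prompt_embeds).sample[:, :3]
    eps_uncond = unet(x_t, t, encoder_hidden_states=null_embeds).sample[:, :3]

    gamma = 7.0  # CFG scale; gamma > 1 extrapolates past the conditional estimate
    eps = eps_uncond + gamma * (eps_cond - eps_uncond)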

Results

To see results, click the link below:

5A Part 1.6 Results

Part 5A 1.7: Image-to-Image Translation

Overview

We create various edited images with diffusion, following the SDEdit approach: we noise an existing image to an intermediate timestep and then run the iterative CFG denoiser, which projects the result back onto the natural image manifold. The more noise we add, the larger the edit.
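
A sketch of one such edit, assuming the helpers from earlier parts (forward, iterative_denoise) and a list strided_timesteps of the denoising schedule; the noise levels are illustrative:

    # Noise the original image to an intermediate timestep, then denoise.
    # A larger i_start means less added noise, so more of the original survives.
    for i_start in [1, 3, 5, 7, 10, 20]:  # illustrative noise levels
        x_t = forward(original_im, strided_timesteps[i_start], alphas_cumprod)
        edited = iterative_denoise(x_t, i_start=i_start, prompt="a high quality photo")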

Results

To see results, click the link below:

5A Part 1.7 Results

Part 5A 1.8: Visual Anagrams

Overview

Similar to inpainting, we can make other fancy edits to the noise to create interesting images. Here, we develop visual anagrams: oriented normally, the image looks like prompt1, but when flipped upside down, it looks like prompt2. Within our denoising loop, we compute the noise estimate for noisy image \( x_t\) as \[\varepsilon_1 \gets \textup{UNet}(x_t,t, \text{prompt}_1) \] \[\varepsilon_2 \gets \textup{flip}\big(\textup{UNet}(\textup{flip}(x_t),t, \text{prompt}_2)\big) \] \[\varepsilon \gets \frac{\varepsilon_1 + \varepsilon_2}{2},\] where \(\textup{flip}(\cdot)\) flips the image vertically and \( \varepsilon \) is the final noise estimate for iteration \(t \). Estimating \(\varepsilon_2\) on the flipped image (and flipping the estimate back) is what makes the upside-down orientation follow prompt2.
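
A sketch of the averaged estimate inside the loop (again assuming the 6-channel UNet output; emb1 and emb2 stand for the two prompts' embeddings):

    import torch

    # Noise estimate for the upright image under prompt 1.
    eps1 = unet(x_t, t, encoder_hidden_states=emb1).sample[:, :3]
    # Noise estimate for the flipped image under prompt 2, flipped back upright.
    x_flip = torch.flip(x_t, dims=[2])  # flip the height axis of the NCHW tensor
    eps2 = torch.flip(unet(x_flip, t, encoder_hidden_states=emb2).sample[:, :3], dims=[2])
    eps = (eps1 + eps2) / 2  # final noise estimate for this iteration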

Results

To see results, click the link below:

5A Part 1.8 Results

Part 5A 1.9: Hybrid Images

Overview

We create hybrid images: displayed small or viewed from afar, the image looks like prompt1, but displayed large or viewed up close, it looks like prompt2. We accomplish this with low-pass and high-pass filters, implemented via a Gaussian blur with kernel size 33 and sigma 2. At each iteration of the denoising loop, \[\varepsilon_1 \gets \textup{UNet}(x_t,t, \text{prompt}_1) \] \[\varepsilon_2 \gets \textup{UNet}(x_t,t, \text{prompt}_2) \] \[\varepsilon \gets f_{\text{low}}(\varepsilon_1) + f_{\text{high}}(\varepsilon_2),\] where \( f_{\text{low}} \) is the Gaussian blur, \( f_{\text{high}}(x) = x - f_{\text{low}}(x) \), and \( \varepsilon \) is the final noise estimate for iteration \(t \).
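
A sketch of the frequency-split combination (same assumed UNet interface and embedding names as above):

    import torchvision.transforms.functional as TF

    eps1 = unet(x_t, t, encoder_hidden_states=emb1).sample[:, :3]
    eps2 = unet(x_t, t, encoder_hidden_states=emb2).sample[:, :3]

    low = TF.gaussian_blur(eps1, kernel_size=33, sigma=2.0)          # f_low(eps1)
    high = eps2 - TF.gaussian_blur(eps2, kernel_size=33, sigma=2.0)  # f_high(eps2)
    eps = low + high  # final noise estimate for this iteration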

Results

To see results, click the link below:

5A Part 1.9 Results