Part 5A 0: Setup
We will use the DeepFloyd IF diffusion model, with a random seed of seed = 10 here and throughout the project. In this part, we simply test out the model, generating images for 3 text prompts with captions. We do this across different values of num_inference_steps.
To see the images generated, click on the following link:
5A Part 0 Results
Part 5A 1.1: Implementing the Forward Process
Overview
The forward process is defined by \[ x_t = \sqrt{\bar{\alpha}_t}x_0 + \sqrt{1 - \bar{\alpha}_t}\varepsilon \] where \( \varepsilon \sim N(0,I) \). Here \(x_0\) is the clean image and \( x_t \) is the noisy image, so we are effectively sampling the noisy image from a Gaussian with mean \( \sqrt{\bar{\alpha}_t} x_0\) and variance \( (1- \bar{\alpha}_t)I \). We perform the process on a test image of the Campanile, shown below, which we resize to \( 64 \times 64 \).
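The forward process above can be sketched in a few lines of numpy. This is an illustrative stand-in: the `alpha_bar` schedule below is a toy linear-beta schedule (in the project, the \(\bar{\alpha}_t\) values come from DeepFloyd's scheduler), and the random image stands in for the resized Campanile photo.

```python
import numpy as np

def forward(x0, t, alpha_bar, rng):
    """Sample a noisy image x_t from a clean image x0 at timestep t.

    Implements x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps,
    with eps ~ N(0, I).
    """
    eps = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return xt, eps

# Toy schedule: alpha_bar decreases from ~1 (clean) toward 0 (pure noise).
T = 1000
alpha_bar = np.cumprod(1.0 - np.linspace(1e-4, 0.02, T))

rng = np.random.default_rng(10)      # seed = 10, as in the project
x0 = rng.random((64, 64, 3))         # stand-in for the 64x64 Campanile image
xt, eps = forward(x0, 750, alpha_bar, rng)
```

Note that larger \(t\) means smaller \(\bar{\alpha}_t\), so the sample is dominated by the noise term at late timesteps.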
Results
To see results for \( t\in [250,500,750] \), click the link below:
5A Part 1.1 Results
Part 5A 1.2: Gaussian Blur Denoising
Overview
We naively denoise images by blurring them with a fixed Gaussian, using kernel_size = 5 and sigma = 1.5. We do this for the noise levels seen before.
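The blur itself can be sketched as a separable Gaussian convolution. This is a minimal numpy version for illustration (the project would more likely use a library routine such as torchvision's `gaussian_blur`); it assumes an H×W×C image and edge padding.

```python
import numpy as np

def gaussian_kernel1d(size=5, sigma=1.5):
    """1-D Gaussian kernel, normalized to sum to 1."""
    x = np.arange(size) - (size - 1) / 2.0
    k = np.exp(-x**2 / (2.0 * sigma**2))
    return k / k.sum()

def gaussian_blur(img, size=5, sigma=1.5):
    """Separable Gaussian blur along the two spatial axes of an HxWxC image."""
    k = gaussian_kernel1d(size, sigma)
    pad = size // 2
    # Blur rows, then columns (Gaussian kernels are separable).
    out = np.pad(img, ((pad, pad), (0, 0), (0, 0)), mode="edge")
    out = np.apply_along_axis(lambda v: np.convolve(v, k, mode="valid"), 0, out)
    out = np.pad(out, ((0, 0), (pad, pad), (0, 0)), mode="edge")
    out = np.apply_along_axis(lambda v: np.convolve(v, k, mode="valid"), 1, out)
    return out
```

Because blurring only averages away high frequencies, it cannot actually remove the added Gaussian noise without also destroying image detail, which is why this baseline performs poorly.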
Results
To see results for \( t\in [250,500,750] \), click the link below:
5A Part 1.2 Results
Part 5A 1.3: One-Step Denoising
Overview
Now, we'll use a pretrained diffusion model to denoise. The actual denoiser can be found at stage_1.unet. This is a UNet that has already been trained on a very large dataset of pairs of clean and noisy images \( (x_0, x_t) \). We can use it to estimate the Gaussian noise in a noisy image; removing that noise recovers (something close to) the original image. To estimate \(x_0\) from \(x_t\), we use
\[ \hat{x}_0 = \frac{1}{\sqrt{\bar{\alpha}_t}} [x_t - \sqrt{1 - \bar{\alpha}_t} \varepsilon_\theta(x_t,t)],\]
where \(\varepsilon_\theta(\cdot,\cdot)\) is the noise predicted by our model for a given noisy image \(x_t\)
and corresponding noise level \(t\).
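The estimate above is just the forward-process equation solved for \(x_0\). A minimal numpy sketch, where `eps_pred` stands in for the UNet's noise prediction \(\varepsilon_\theta(x_t, t)\) and the schedule is a toy stand-in for DeepFloyd's:

```python
import numpy as np

def one_step_denoise(xt, t, alpha_bar, eps_pred):
    """Estimate the clean image from x_t and the predicted noise:

    x0_hat = (x_t - sqrt(1 - alpha_bar_t) * eps_pred) / sqrt(alpha_bar_t)
    """
    return (xt - np.sqrt(1.0 - alpha_bar[t]) * eps_pred) / np.sqrt(alpha_bar[t])

# Demo: if the predicted noise equals the true noise, we recover x0 exactly.
rng = np.random.default_rng(10)
alpha_bar = np.cumprod(1.0 - np.linspace(1e-4, 0.02, 1000))  # toy schedule
x0 = rng.random((64, 64, 3))
t = 500
eps = rng.standard_normal(x0.shape)
xt = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
x0_hat = one_step_denoise(xt, t, alpha_bar, eps)
```

In practice the UNet's prediction is imperfect, so \(\hat{x}_0\) only approximates the original, and the approximation degrades as \(t\) grows.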
Results
To see results for \( t\in [250,500,750] \), click the link below:
5A Part 1.3 Results
Part 5A 1.4: Iterative Denoising
Overview
We can denoise an image and recover more of the original features by denoising in many small steps instead of in one step. Specifically, starting from \(T=1000\) and stepping down by 30 timesteps per iteration, we denoise the image iteratively as follows: \[ x_{t'} = \frac{\sqrt{\bar{\alpha}_{t'}} \beta_t} {1 - \bar{\alpha}_t} x_0 + \frac{\sqrt{\alpha_t}(1-\bar{\alpha}_{t'})}{1-\bar{\alpha}_t} x_t + v_\sigma,\] where \(x_0\) is our current clean estimate (obtained via a one-step estimate from \(x_t\)), \(v_\sigma\) is random noise (predicted by DeepFloyd), and the alphas and betas are known parameters of the noise schedule. Here \(x_t\) is the noisier image at some timestep \(t\), while \(x_{t'}\) is the less noisy image at the next timestep \(t'\) in our iteration sequence.
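One update of the iteration above can be sketched as follows. This is a hedged sketch: it assumes the strided-schedule convention \(\alpha_t = \bar{\alpha}_t / \bar{\alpha}_{t'}\) and \(\beta_t = 1 - \alpha_t\) (in the project these quantities come from the DeepFloyd scheduler), and `v_sigma` stands in for the predicted variance noise.

```python
import numpy as np

def ddpm_step(xt, x0_hat, ab_t, ab_tp, v_sigma=0.0):
    """One iterative-denoising update from timestep t to the less-noisy t'.

    ab_t, ab_tp are alpha_bar at t and t'. For a strided schedule we take
    alpha_t = ab_t / ab_tp and beta_t = 1 - alpha_t (assumed convention).
    """
    alpha_t = ab_t / ab_tp
    beta_t = 1.0 - alpha_t
    c0 = np.sqrt(ab_tp) * beta_t / (1.0 - ab_t)   # weight on the clean estimate
    ct = np.sqrt(alpha_t) * (1.0 - ab_tp) / (1.0 - ab_t)  # weight on x_t
    return c0 * x0_hat + ct * xt + v_sigma

# Sanity check: a noise-free x_t should stay on the clean trajectory,
# i.e. map from sqrt(ab_t) * x0 to sqrt(ab_t') * x0.
x0 = np.ones((64, 64))
xt = np.sqrt(0.3) * x0            # ab_t  = 0.3 (noisier)
x_tp = ddpm_step(xt, x0, 0.3, 0.7)  # ab_t' = 0.7 (less noisy)
```

The update is a convex-like blend: early in the loop it trusts \(x_t\) more, and as \(\bar{\alpha}_{t'} \to 1\) it leans increasingly on the clean estimate \(x_0\).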
Results
To see results, click the link below:
5A Part 1.4 Results
Part 5A 1.5: Diffusion Model Sampling
Overview
We "generate" images using the diffusion model by feeding random noise into the iterative denoiser, using i_start = 0. By starting with pure random noise as the noisy image, we are "solving" towards a random image via iterative denoising. We use the text prompt "a high quality photo" for denoising.
Results
To see results, click the link below:
5A Part 1.5 Results
Part 5A 1.6: Classifier Free Guidance (CFG)
Overview
We can generate higher quality images by implementing classifier-free guidance, with the conditional prompt "a high quality photo" and the unconditional prompt "". Specifically, we skew our noise estimate as \[\varepsilon = \varepsilon_{\text{uncond}} + \gamma(\varepsilon_{\text{cond}} - \varepsilon_{\text{uncond}} ).\] With \(\gamma=0\) we have the standard unconditional noise estimate, while with \(\gamma=1\) we have the conditional noise estimate. It has been found empirically that choosing \(\gamma > 1\) strengthens the conditioning, pushing the diffusion towards the prompt; we use \( \gamma = 7\). This generates images of higher quality, albeit with somewhat less variety.
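The CFG combination is a one-line extrapolation; a minimal sketch, with small arrays standing in for the two UNet noise estimates:

```python
import numpy as np

def cfg_noise(eps_uncond, eps_cond, gamma=7.0):
    """Classifier-free guidance: extrapolate past the conditional estimate.

    gamma = 0 recovers the unconditional estimate, gamma = 1 the
    conditional one, and gamma > 1 over-weights the prompt direction.
    """
    return eps_uncond + gamma * (eps_cond - eps_uncond)

# Stand-ins for the unconditional and conditional noise estimates.
eu = np.zeros((4, 4))
ec = np.ones((4, 4))
```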
Results
To see results, click the link below:
5A Part 1.6 Results
Part 5A 1.7: Image-to-Image Translation
Overview
We create various edited images with the diffusion model: we add noise to an existing image and then run the iterative denoiser on it, so the model "edits" the image; the less noise we add, the more the result resembles the original.
Results
To see results, click the link below:
5A Part 1.7 Results
Part 5A 1.8: Visual Anagrams
Overview
Similar to inpainting, we can make other fancy edits to the noise to create interesting images. Here, we develop visual anagrams: oriented normally, the image looks like prompt1, but when flipped upside-down, it looks like prompt2. Once again, within our denoising loop, we make the following edit to the noise estimate for the noisy image \( x_t\):
\[\varepsilon_1 \gets \textup{UNet}(x_t,t, \text{prompt}_1) \]
\[\varepsilon_2 \gets \textup{flip}(\textup{UNet}(\textup{flip}(x_t),t, \text{prompt}_2)) \]
\[\varepsilon \gets \frac{\varepsilon_1 + \varepsilon_2}{2},\]
where \( \textup{flip}(\cdot) \) flips an image upside-down and \( \varepsilon \) is the final noise estimate for iteration \(t \). Note that the second estimate is computed on the flipped image and then flipped back, so that denoising with the average pulls the image towards prompt1 in one orientation and prompt2 in the other.
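A sketch of this combination step, with a hypothetical `fake_unet` standing in for the real DeepFloyd UNet call (the real call also handles text embeddings and CFG, omitted here):

```python
import numpy as np

def anagram_noise(unet, xt, t, prompt1, prompt2):
    """Average prompt1's noise estimate with prompt2's estimate computed
    on the vertically flipped image and flipped back."""
    eps1 = unet(xt, t, prompt1)
    eps2 = np.flipud(unet(np.flipud(xt), t, prompt2))
    return 0.5 * (eps1 + eps2)

# Hypothetical stand-in for the UNet, for illustration only:
# returns a constant that depends on the prompt.
def fake_unet(x, t, prompt):
    return np.full_like(x, float(len(prompt)))
```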
Results
To see results, click the link below:
5A Part 1.8 Results
Part 5A 1.9: Hybrid Images
Overview
We create hybrid images: at small size / from afar the image looks like prompt1, but at large size / from close by it looks like prompt2. We accomplish this using low- and high-pass filters, implemented via a Gaussian with kernel size 33 and sigma 2. For any iteration of the denoising loop,
\[\varepsilon_1 \gets \textup{UNet}(x_t,t, \text{prompt}_1) \]
\[\varepsilon_2 \gets \textup{UNet}(x_t,t, \text{prompt}_2) \]
\[\varepsilon \gets f_{\text{low}}(\varepsilon_1) + f_{\text{high}}(\varepsilon_2),\]
where \( \varepsilon \) is the final noise estimate for iteration \(t \).
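The frequency split can be sketched in numpy as follows: the low-pass \(f_{\text{low}}\) is a Gaussian blur (kernel 33, sigma 2, as above), and the high-pass \(f_{\text{high}}\) is the residual after blurring. The arrays passed in stand in for the two UNet noise estimates.

```python
import numpy as np

def blur(img, size=33, sigma=2.0):
    """Separable Gaussian low-pass filter with edge padding."""
    x = np.arange(size) - (size - 1) / 2.0
    k = np.exp(-x**2 / (2.0 * sigma**2))
    k /= k.sum()
    pad = size // 2
    for axis in (0, 1):  # blur rows, then columns
        pads = [(0, 0)] * img.ndim
        pads[axis] = (pad, pad)
        img = np.apply_along_axis(
            lambda v: np.convolve(v, k, mode="valid"),
            axis, np.pad(img, pads, mode="edge"))
    return img

def hybrid_noise(eps1, eps2):
    """Low frequencies from prompt1's estimate, high from prompt2's:
    eps = f_low(eps1) + f_high(eps2), with f_high = identity - f_low."""
    return blur(eps1) + (eps2 - blur(eps2))
```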
Results
To see results, click the link below:
5A Part 1.9 Results