Pinchflat

But how do AI images and videos actually work? | Guest video by Welch Labs

Raw Attributes

Source: Youtube default playlist
  • short_form_content: false
  • prevent_culling: false
  • original_url: https://www.youtube.com/watch?v=iv-5mZ_9CPY
  • prevent_download: false
  • thumbnail_filepath: /downloads/Youtube default playlist/2025-07-25 But how do AI images and videos actually work? | Guest video by Welch Labs/But how do AI images and videos actually work? | Guest video by Welch Labs [iv-5mZ_9CPY]-thumb.jpg
  • media_filepath: /downloads/Youtube default playlist/2025-07-25 But how do AI images and videos actually work? | Guest video by Welch Labs/But how do AI images and videos actually work? | Guest video by Welch Labs [iv-5mZ_9CPY].mp4
  • description: Diffusion models, CLIP, and the math of turning text into images

    Welch Labs Book: https://www.welchlabs.com/resources/imaginary-numbers-book

    Sections
    0:00 - Intro
    3:37 - CLIP
    6:25 - Shared Embedding Space
    8:16 - Diffusion Models & DDPM
    11:44 - Learning Vector Fields
    22:00 - DDIM
    25:25 - Dall E 2
    26:37 - Conditioning
    30:02 - Guidance
    33:39 - Negative Prompts
    34:27 - Outro
    35:32 - About guest videos

    Special Thanks to:
    Jonathan Ho - Jonathan is the author of the DDPM paper and the Classifier-Free Guidance paper.
    https://arxiv.org/pdf/2006.11239
    https://arxiv.org/pdf/2207.12598
    Preetum Nakkiran - Preetum has an excellent introductory diffusion tutorial: https://arxiv.org/pdf/2406.08929
    Chenyang Yuan - Many of the animations in this video were implemented using manim and Chenyang's smalldiffusion library: https://github.com/yuanchenyang/smalldiffusion
    Chenyang also has a terrific tutorial and MIT course on diffusion models:
    https://www.chenyang.co/diffusion.html
    https://www.practical-diffusion.org/

    Other References
    All of Sander Dieleman's diffusion blog posts are fantastic: https://sander.ai/
    CLIP paper: https://arxiv.org/pdf/2103.00020
    DDIM paper: https://arxiv.org/pdf/2010.02502
    Score-Based Generative Modeling: https://arxiv.org/pdf/2011.13456
    Wan2.1: https://github.com/Wan-Video/Wan2.1
    Stable Diffusion: https://huggingface.co/stabilityai/stable-diffusion-2
    Midjourney: https://www.midjourney.com/
    Veo: https://deepmind.google/models/veo/
    DALL-E 2 paper: https://cdn.openai.com/papers/dall-e-2.pdf
    Code for this video: https://github.com/stephencwelch/manim_videos/tree/master/_2025/sora

    Written by: Stephen Welch, with very helpful feedback from Grant Sanderson
    Produced by: Stephen Welch, Sam Baskin, and Pranav Gundu

    Technical Notes
    The noise videos in the opening have been passed through a VAE (the diffusion process actually happens in a compressed "latent" space), which acts very much like a video compressor - this is why the noise videos don't look like pure salt-and-pepper noise.

    6:15 CLIP: Although directly minimizing cosine similarity would push our vectors 180 degrees apart on a single batch, in practice CLIP needs to maximize the uniformity of concepts over the hypersphere it operates on. For this reason, we animated these vectors as orthogonal-ish. See: https://proceedings.mlr.press/v119/wang20k/wang20k.pdf

    Per Chenyang Yuan: at 10:15, the blurry image that results when removing random noise in DDPM is probably due to a mismatch in noise levels when calling the denoiser. When the denoiser is called on x_{t-1} during DDPM sampling, x_{t-1} is expected to have a certain noise level (call it sigma_{t-1}). If you generate x_{t-1} from x_t without adding noise, then the noise present in x_{t-1} is always smaller than sigma_{t-1}. This causes the denoiser to remove too much noise, pulling the result toward the mean of the dataset.

    The text conditioning input to Stable Diffusion is not the 512-dim text embedding vector, but the output of the layer before that, [with dimension 77x512](https://stackoverflow.com/a/79243065).

    For the vectors at 31:40: some implementations use f(x, t, cat) + alpha(f(x, t, cat) - f(x, t)), and some use f(x, t) + alpha(f(x, t, cat) - f(x, t)), where an alpha value of 1 corresponds to no guidance. I chose the second format here to keep things simpler. (A short illustrative sketch of this combination appears after the attribute list below.)

    At 30:30, the unconditional t=1 vector field looks a bit different from what it did at the 17:15 mark. This is the result of different models trained for different parts of the video, and likely a result of different random initializations.

    Premium Beat Music ID: EEDYZ3FP44YX8OWT
  • id: 17
  • culled_at:
  • media_redownloaded_at:
  • uuid: 6ba94cb0-7ac8-478d-870c-686be4989d59
  • predicted_media_filepath: /downloads/Youtube default playlist/2025-07-25 But how do AI images and videos actually work? | Guest video by Welch Labs/But how do AI images and videos actually work? | Guest video by Welch Labs [iv-5mZ_9CPY].mp4
  • media_size_bytes: 413629119
  • inserted_at: 2025-11-04T11:35:47Z
  • subtitle_filepaths: en: /downloads/Youtube default playlist/2025-07-25 But how do AI images and videos actually work? | Guest video by Welch Labs/But how do AI images and videos actually work? | Guest video by Welch Labs [iv-5mZ_9CPY].en.srt
  • last_error:
  • livestream: false
  • metadata_filepath: /downloads/Youtube default playlist/2025-07-25 But how do AI images and videos actually work? | Guest video by Welch Labs/But how do AI images and videos actually work? | Guest video by Welch Labs [iv-5mZ_9CPY].info.json
  • playlist_index: 9
  • title: But how do AI images and videos actually work? | Guest video by Welch Labs
  • uploaded_at: 2025-07-25T12:14:11Z
  • media_id: iv-5mZ_9CPY
  • nfo_filepath:
  • updated_at: 2025-11-04T11:41:07Z
  • media_downloaded_at: 2025-11-04T11:40:59Z
  • tasks:
  • source_id: 1
  • upload_date_index: 0
  • matching_search_term:
  • duration_seconds: 2240
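
Guidance sketch

The technical notes in the description above give two forms for combining conditional and unconditional model outputs in classifier-free guidance, and state that the video uses f(x, t) + alpha(f(x, t, cat) - f(x, t)). The snippet below is a minimal, hypothetical Python sketch of that combination only, under the assumption that f is a denoiser taking (x, t, conditioning); the names classifier_free_guidance and toy_model, and the toy model itself, are illustrative and are not code from the video or its repository.

    import numpy as np

    def classifier_free_guidance(f, x, t, cat, alpha):
        # Second form from the notes: f(x, t) + alpha * (f(x, t, cat) - f(x, t)).
        # alpha = 1 reduces to the conditional prediction f(x, t, cat) (no extra guidance);
        # larger alpha pushes further along the conditional-minus-unconditional direction.
        uncond = f(x, t, None)   # unconditional prediction, f(x, t)
        cond = f(x, t, cat)      # conditional prediction, f(x, t, cat)
        return uncond + alpha * (cond - uncond)

    # Stand-in "model" so the sketch runs end to end (purely illustrative).
    def toy_model(x, t, cat):
        drift = 0.0 if cat is None else 0.5
        return -x * (1.0 - t) + drift

    x = np.array([0.2, -1.3, 0.7])
    print(classifier_free_guidance(toy_model, x, t=0.5, cat="a photo of a cat", alpha=3.0))

The first form mentioned in the notes, f(x, t, cat) + alpha(f(x, t, cat) - f(x, t)), differs only in its baseline term, so alpha = 0 rather than alpha = 1 corresponds to no guidance there.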
