PriorGrad

2026-02-22 paper / priorgrad / diffusion-model / conditional-generation

PriorGrad cover illustration (article cover from Jeffrey Kim's SecondBrain build log)

Paper
ICLR 2022 Poster

Background

[[Diffusion Model (DDPM)]]
[[Latent Diffusion Model (LDM)]]

The paper's idea

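Recall what kind of noise the forward process of the original [[Diffusion Model (DDPM)]] eventually converges to. That's right: $\mathcal{N}(0,I)$. The model then learns a denoising process that recovers the original image from this arbitrary noise. But for conditional generation, as in [[Latent Diffusion Model (LDM)]], why not start from noise that reflects the characteristics of the conditioning data instead of arbitrary noise? Wouldn't that make the diffusion model more efficient to train and to run at inference time? That is the idea of this paper.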

For reference, the noise $X_T$, which is the end of the forward process and the start of the reverse process, is what this paper calls the prior.
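To make the role of the prior concrete, the closed form of the forward process can be written out. The first line is the standard DDPM result; the second is a sketch of the PriorGrad counterpart, assuming the data is shifted by a prior mean $\mu$ and the noise has covariance $\Sigma$. The symbols $\beta_s$ and $\bar{\alpha}_t$ follow the usual DDPM notation and are not defined elsewhere in this note.

$$
\begin{aligned}
\text{DDPM:}\quad & q(X_t \mid X_0) = \mathcal{N}\left(\sqrt{\bar{\alpha}_t}\,X_0,\ (1-\bar{\alpha}_t)\,I\right), & \bar{\alpha}_T \approx 0 \ \Rightarrow\ X_T \approx \mathcal{N}(0, I) \\
\text{PriorGrad:}\quad & q(X_t \mid X_0) = \mathcal{N}\left(\sqrt{\bar{\alpha}_t}\,(X_0-\mu),\ (1-\bar{\alpha}_t)\,\Sigma\right), & \bar{\alpha}_T \approx 0 \ \Rightarrow\ X_T \approx \mathcal{N}(0, \Sigma)
\end{aligned}
$$

Here $\bar{\alpha}_t = \prod_{s=1}^{t}(1-\beta_s)$. In both cases the signal term vanishes as $t \to T$, so only the choice of noise covariance decides what the prior looks like.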

How is it done?

First, let's review the DDPM training and inference procedures.

Pasted%20image%2020250408172324

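DDPM training procedure

1. Sample a data point from the dataset.
2. Pick a step at random from 1 to $T$. (Because the noise schedule is fixed, training can visit steps in random order rather than sequentially.)
3. Sample noise from $\mathcal{N}(0,I)$.
4. Compute $\mathcal{L}_{simple}$ and use its gradient to update the model $\epsilon_\theta$.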
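The training steps above can be sketched in NumPy as follows. This is a minimal illustration, not the paper's code: the linear noise schedule, the helper names `linear_alpha_bar` and `ddpm_training_loss`, and the placeholder `zero_model` standing in for $\epsilon_\theta$ are all assumptions made for the example. The gradient update itself is left out, since it would normally be handled by an autodiff framework.

```python
import numpy as np

def linear_alpha_bar(T=50, beta_start=1e-4, beta_end=0.05):
    """Cumulative product of (1 - beta_t) for an assumed linear schedule."""
    betas = np.linspace(beta_start, beta_end, T)
    return np.cumprod(1.0 - betas)

def ddpm_training_loss(x0, eps_model, alpha_bar, rng):
    """One Monte Carlo estimate of L_simple for a single sample x0."""
    T = len(alpha_bar)
    t = int(rng.integers(1, T + 1))           # random step in 1..T
    eps = rng.standard_normal(x0.shape)       # eps ~ N(0, I)
    a = alpha_bar[t - 1]
    # Closed-form forward process: jump straight to step t.
    x_t = np.sqrt(a) * x0 + np.sqrt(1.0 - a) * eps
    eps_hat = eps_model(x_t, t)
    return np.mean((eps - eps_hat) ** 2)      # mean squared error on the noise

rng = np.random.default_rng(0)
alpha_bar = linear_alpha_bar()
x0 = rng.standard_normal(8)                   # stand-in for a dataset sample

def zero_model(x_t, t):
    """Hypothetical placeholder for eps_theta: always predicts zero noise."""
    return np.zeros_like(x_t)

loss = ddpm_training_loss(x0, zero_model, alpha_bar, rng)
```

Because the step `t` is drawn independently for every sample, no sequential simulation of the forward chain is needed during training.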

DDPM inference procedure

1. Sample the starting noise from $\mathcal{N}(0,I)$.
2. Starting from step $T$, denoise sequentially with the model $\epsilon_{\theta}$.
3. Finally, return the generated result $X_0$.
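The inference loop above might look like this in NumPy. It is a sketch under stated assumptions: the schedule values, the function name `ddpm_sample`, and the placeholder `zero_model` are illustrative, and the per-step noise variance is set to $\beta_t$, which is one common choice rather than the only one.

```python
import numpy as np

def ddpm_sample(eps_model, betas, shape, rng):
    """Ancestral sampling from x_T ~ N(0, I) down to x_0."""
    alphas = 1.0 - betas
    alpha_bar = np.cumprod(alphas)
    x = rng.standard_normal(shape)                    # x_T ~ N(0, I)
    for t in range(len(betas), 0, -1):                # t = T, ..., 1
        a, ab, b = alphas[t - 1], alpha_bar[t - 1], betas[t - 1]
        eps_hat = eps_model(x, t)
        # Posterior mean computed from the predicted noise.
        mean = (x - b / np.sqrt(1.0 - ab) * eps_hat) / np.sqrt(a)
        # No fresh noise is added at the final step.
        z = rng.standard_normal(shape) if t > 1 else np.zeros(shape)
        x = mean + np.sqrt(b) * z                     # assumed sigma_t^2 = beta_t
    return x

def zero_model(x_t, t):
    """Hypothetical placeholder for a trained eps_theta."""
    return np.zeros_like(x_t)

betas = np.linspace(1e-4, 0.05, 50)                   # assumed linear schedule
sample = ddpm_sample(zero_model, betas, (4,), np.random.default_rng(0))
```

With a real trained network in place of `zero_model`, the returned array would be the generated $X_0$.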

It's easy once you skip the proofs (???)

Below are the PriorGrad training and inference procedures.

Pasted%20image%2020250408172829

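PriorGrad training procedure

1. Sample a data point from the dataset.
2. Pick a step at random from 1 to $T$.
3. Sample noise from $\mathcal{N}(0,\Sigma)$.
4. Compute $\mathcal{L}_{simple}$. Here the loss is a Mahalanobis distance that uses the inverse of the covariance matrix, $\Sigma^{-1}$.
5. Use the gradient to update the model $\epsilon_\theta$.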
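The PriorGrad training steps above can be sketched the same way. Several things here are assumptions for illustration: $\Sigma$ is taken to be diagonal and passed as a variance vector `var`, the prior mean `mu` is subtracted from the data before noising (consistent with the mean being added back at inference), and the Mahalanobis term is averaged over dimensions rather than summed. The names `priorgrad_training_loss` and `zero_model` are hypothetical.

```python
import numpy as np

def priorgrad_training_loss(x0, mu, var, eps_model, alpha_bar, rng):
    """One loss estimate with a data-dependent prior N(mu, diag(var))."""
    T = len(alpha_bar)
    t = int(rng.integers(1, T + 1))                       # random step in 1..T
    eps = np.sqrt(var) * rng.standard_normal(x0.shape)    # eps ~ N(0, Sigma)
    a = alpha_bar[t - 1]
    # Forward process on the mean-shifted data (assumption, see lead-in).
    x_t = np.sqrt(a) * (x0 - mu) + np.sqrt(1.0 - a) * eps
    diff = eps - eps_model(x_t, t)
    # Squared Mahalanobis distance with diagonal Sigma^{-1}, averaged over dims.
    return np.mean(diff ** 2 / var)

def zero_model(x_t, t):
    """Hypothetical placeholder for eps_theta."""
    return np.zeros_like(x_t)

alpha_bar = np.cumprod(1.0 - np.linspace(1e-4, 0.05, 50))  # assumed schedule
x0 = np.linspace(-1.0, 1.0, 8)                             # stand-in data sample
mu = np.zeros(8)

loss_unit = priorgrad_training_loss(
    x0, mu, np.ones(8), zero_model, alpha_bar, np.random.default_rng(1))
loss_scaled = priorgrad_training_loss(
    x0, mu, np.full(8, 4.0), zero_model, alpha_bar, np.random.default_rng(1))
```

With `var` set to all ones and `mu` to zero, this reduces to the plain DDPM loss. Dividing by `var` keeps the loss on the same scale regardless of how large the prior variance is, which is why `loss_unit` and `loss_scaled` agree for the zero-predicting placeholder.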

PriorGrad inference procedure

1. Sample the starting noise from $\mathcal{N}(0,\Sigma)$.
2. Starting from step $T$, denoise sequentially with the model $\epsilon_{\theta}$.
3. Finally, return the generated result $X_0$, adding the prior mean $\mu$ back to it.
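A matching sketch of the PriorGrad inference loop follows. As before, this assumes a diagonal $\Sigma$ given as a variance vector `var`, an illustrative linear schedule, and a per-step noise scale of $\beta_t$; drawing the per-step noise from $\mathcal{N}(0,\Sigma)$ as well as the starting noise is also an assumption of this sketch. `priorgrad_sample` and `zero_model` are hypothetical names.

```python
import numpy as np

def priorgrad_sample(eps_model, betas, mu, var, rng):
    """Ancestral sampling that starts from N(0, diag(var)) and adds mu at the end."""
    alphas = 1.0 - betas
    alpha_bar = np.cumprod(alphas)
    std = np.sqrt(var)
    x = std * rng.standard_normal(mu.shape)           # x_T ~ N(0, Sigma)
    for t in range(len(betas), 0, -1):                # t = T, ..., 1
        a, ab, b = alphas[t - 1], alpha_bar[t - 1], betas[t - 1]
        eps_hat = eps_model(x, t)
        mean = (x - b / np.sqrt(1.0 - ab) * eps_hat) / np.sqrt(a)
        # Per-step noise also follows the prior covariance (assumption).
        z = std * rng.standard_normal(mu.shape) if t > 1 else np.zeros(mu.shape)
        x = mean + np.sqrt(b) * z
    return x + mu                                     # add the prior mean back

def zero_model(x_t, t):
    """Hypothetical placeholder for a trained eps_theta."""
    return np.zeros_like(x_t)

betas = np.linspace(1e-4, 0.05, 50)                   # assumed linear schedule
mu = np.linspace(-1.0, 1.0, 6)
var = np.full(6, 0.25)
sample = priorgrad_sample(zero_model, betas, mu, var, np.random.default_rng(0))
```

The only differences from the DDPM loop are the scaling of the noise by `std` and the final `+ mu`, which is what makes the starting point reflect the conditioning data.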

Results

Performance on [[Vocoder]]

Compared with DiffWave, PriorGrad trained faster and also produced better results.

Pasted%20image%2020250408173729

Pasted%20image%2020250408173738

Performance on [[Acoustic Model]]

Compared with FastSpeech 2, PriorGrad performed better.

Pasted%20image%2020250408174100