DPO


2026-02-22 dpo / direct-preference-optimization / llm / alignment

Cover illustration for the DPO post, from Jeffrey Kim's SecondBrain build log.


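Direct Preference Optimization (DPO) trains only the LLM policy model, with no separate reward model. Human annotators are shown two responses to the same prompt and pick the one they prefer. The loss is then computed so that the model being trained becomes more likely to generate the preferred response and less likely to generate the dispreferred one.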

Formula

$$
\mathcal{L}_{DPO}(\pi_\theta;\pi_{ref}) = - \mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}} \left[ \log \sigma \left( \beta \log \frac{\pi_\theta(y_w | x)}{\pi_{ref}(y_w | x)} - \beta \log \frac{\pi_\theta(y_l | x)}{\pi_{ref}(y_l | x)} \right) \right]
$$

The expression above is the DPO loss function.

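$x$ - the prompt
$y_w$ - the response that humans preferred, among the reference model's responses
$y_l$ - the response that humans did not prefer, among the reference model's responses
$\sigma$ - the sigmoid function
$\beta$ - a coefficient that scales the log-ratios and controls how strongly the policy is kept close to the reference model
$\pi_\theta$ - the model being trained
$\pi_{ref}$ - the reference model. Its weights are not updated during training.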
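
One way to read the loss, as a sketch: give a name to the scaled log-ratio that appears twice inside the sigmoid. The symbol $\hat{r}_\theta$ is introduced here only for illustration and is not part of the original note (the DPO paper refers to this quantity as an implicit reward). With that shorthand, the loss depends only on the gap between the two responses:

$$
\hat{r}_\theta(x, y) = \beta \log \frac{\pi_\theta(y | x)}{\pi_{ref}(y | x)}, \qquad
\mathcal{L}_{DPO}(\pi_\theta;\pi_{ref}) = - \mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}} \left[ \log \sigma \left( \hat{r}_\theta(x, y_w) - \hat{r}_\theta(x, y_l) \right) \right]
$$

Because $\log \sigma$ is an increasing function, the loss falls as $\hat{r}_\theta(x, y_w) - \hat{r}_\theta(x, y_l)$ grows. When $\pi_\theta = \pi_{ref}$ the gap is $0$ and each term inside the expectation equals $\log \sigma(0) = -\log 2$, so the loss is $\log 2$.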

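As the formula shows, training raises the probability of sampling the human-preferred response and lowers the probability of sampling the dispreferred one, each measured relative to the reference model.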
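
The behavior of the loss can be checked with a minimal Python sketch for a single $(x, y_w, y_l)$ triple. It assumes the four sequence log-probabilities have already been computed (each is $\log \pi(y | x)$ summed over the tokens of the response). The function names, the sample numbers, and the default `beta=0.1` are illustrative assumptions, not values from the original note.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    # Branch on the sign of z so that exp() never receives a large positive argument.
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_loss(policy_logp_w: float, policy_logp_l: float,
             ref_logp_w: float, ref_logp_l: float,
             beta: float = 0.1) -> float:
    """Per-example DPO loss from sequence log-probabilities.

    policy_logp_* are log pi_theta(y | x); ref_logp_* are log pi_ref(y | x).
    The suffix _w marks the preferred response, _l the dispreferred one.
    """
    chosen_logratio = policy_logp_w - ref_logp_w    # log(pi_theta / pi_ref) for y_w
    rejected_logratio = policy_logp_l - ref_logp_l  # log(pi_theta / pi_ref) for y_l
    margin = beta * (chosen_logratio - rejected_logratio)
    return -log_sigmoid(margin)


# Hypothetical log-probabilities, chosen only to illustrate the three cases.
# Policy identical to the reference: the margin is 0 and the loss is log 2.
base = dpo_loss(-12.0, -15.0, -12.0, -15.0)
# Policy puts more mass on y_w and less on y_l than the reference does.
better = dpo_loss(-10.0, -17.0, -12.0, -15.0)
# Policy moves the wrong way: less mass on y_w, more on y_l.
worse = dpo_loss(-14.0, -13.0, -12.0, -15.0)

print(better < base < worse)
```

Under these assumptions the loss is lowest when the policy shifts probability toward $y_w$ and away from $y_l$ relative to the reference, which is the direction gradient descent pushes $\pi_\theta$. In real training the same computation would run on batched tensors with the reference model's log-probabilities held fixed.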
