PPO
PPO example from [[Search-R1]] (used together with a search engine)
A blog post that explains it well -> https://ai-com.tistory.com/entry/RL-%EA%B0%95%ED%99%94%ED%95%99%EC%8A%B5-%EC%95%8C%EA%B3%A0%EB%A6%AC%EC%A6%98-5-PPO
Fundamentally, PPO maximizes the advantage while using clipping to keep the updated new policy from deviating too far from the old policy (a helpful analogy is walking along the edge of a cliff: clipping guarantees the new policy is only updated within a safe distance of the previous one).
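For reference, this is the clipped surrogate objective from the original PPO paper (Schulman et al., 2017):

$$
L^{\mathrm{CLIP}}(\theta) = \mathbb{E}_t\left[\min\left(r_t(\theta)\,\hat{A}_t,\ \mathrm{clip}\big(r_t(\theta),\,1-\epsilon,\,1+\epsilon\big)\,\hat{A}_t\right)\right],
\qquad
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}
$$

A minimal PyTorch sketch of this loss, assuming per-step log-probabilities and advantage estimates (e.g., from GAE) are already computed; the function name `ppo_clip_loss` and the `clip_eps=0.2` default are illustrative choices, not taken from the note or from Search-R1:

```python
import torch

def ppo_clip_loss(log_probs_new, log_probs_old, advantages, clip_eps=0.2):
    # Probability ratio r_t(theta) = pi_new / pi_old, computed in log space
    # for numerical stability; log_probs_old should be detached (no grad).
    ratio = torch.exp(log_probs_new - log_probs_old)
    # Unclipped and clipped surrogate terms.
    surr_unclipped = ratio * advantages
    surr_clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # Pessimistic (element-wise minimum) objective; negate because we
    # maximize the objective by minimizing the loss with gradient descent.
    return -torch.min(surr_unclipped, surr_clipped).mean()
```

The clamp is what enforces the "safe distance": once the ratio leaves [1 - ε, 1 + ε] in the direction that would inflate the objective, the gradient through that term vanishes, so a single update cannot push the new policy arbitrarily far from the old one.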