Trying Anthropic's RAG Techniques Right Away with AutoRAG


2024-09-29 autorag / rag / anthropic / contextual-retrieval



TeddyNote shared a great Anthropic blog post on LinkedIn.

The five key RAG techniques (or tips) that Teddy summarized are as follows.

HybridSearch: Semantic + BM25
The importance of embedding model selection
Number of chunks (top_n): rather than too few, 20 worked best
Contextual chunk: insert it in front of the existing chunk
Reranker: retrieve 150 documents, then rerank (k=20)

AutoRAG already supports most of these techniques. With the example YAML file below, you can experiment with each of them and select the best-performing one.

node_lines:
  - node_line_name: retrieve_node_line
    nodes:
      - node_type: retrieval
        strategy:
          metrics: [ retrieval_ndcg, retrieval_mrr, retrieval_map, retrieval_recall ] # the four metrics Anthropic used to compare embedding model performance
        top_k: 150 # retrieve 150 documents at the retrieval stage
        modules:
          - module_type: bm25
          - module_type: vectordb
            embedding_model: [ openai_embed_3_large, upstage, cohere, huggingface_bge_m3 ] # compare embedding models
          - module_type: hybrid_rrf # Hybrid Retrieval
          - module_type: hybrid_cc # Hybrid Retrieval
            normalize_method: [ mm, tmm, z, dbsf ]
      - node_type: passage_reranker
        strategy:
          metrics: [ retrieval_ndcg, retrieval_mrr, retrieval_map, retrieval_recall]
        top_k: 20 # keep 20 chunks
        modules:
          - module_type: pass_reranker
          - module_type: tart
          - module_type: upr
          - module_type: cohere_reranker
          - module_type: koreranker
          - module_type: flag_embedding_reranker

The YAML file above alone covers four of the five points.

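1. Number of chunks (top_n): rather than too few, 20 worked best

Simply change top_k (the same thing as top_n) in the YAML file.

2. Reranker: retrieve 150 documents, then rerank (k=20)

In the YAML file above, set the retrieval node's top_k to 150 and the reranker node's top_k to 20, and you are done.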
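To make the two-stage idea concrete, here is a minimal Python sketch of "retrieve many, then rerank down to a few". It is not AutoRAG's implementation: the function `retrieve_then_rerank` and the two toy scoring functions are hypothetical stand-ins for a real retriever and a real reranker model.

```python
from typing import Callable, List


def retrieve_then_rerank(
    query: str,
    corpus: List[str],
    first_stage_score: Callable[[str, str], float],
    rerank_score: Callable[[str, str], float],
    retrieve_k: int = 150,
    rerank_k: int = 20,
) -> List[str]:
    """Hypothetical two-stage pipeline: cheap retrieval, then costly reranking."""
    # Stage 1: score every document with the cheap scorer, keep the top retrieve_k.
    candidates = sorted(
        corpus, key=lambda doc: first_stage_score(query, doc), reverse=True
    )[:retrieve_k]
    # Stage 2: score only the candidates with the expensive scorer, keep the top rerank_k.
    return sorted(
        candidates, key=lambda doc: rerank_score(query, doc), reverse=True
    )[:rerank_k]


def word_overlap(query: str, doc: str) -> float:
    """Toy first-stage scorer (assumption): number of shared words."""
    return float(len(set(query.lower().split()) & set(doc.lower().split())))


def overlap_ratio(query: str, doc: str) -> float:
    """Toy reranker (assumption): fraction of document words found in the query."""
    query_words = set(query.lower().split())
    doc_words = doc.lower().split()
    return sum(word in query_words for word in doc_words) / max(len(doc_words), 1)
```

The point of the split is cost: the first stage runs over the whole corpus, so it has to be cheap, while the reranker only ever sees `retrieve_k` candidates.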

3. The importance of embedding model selection

You can compare the performance of many embedding models in a single run. ⇒ See the embedding model benchmark blog post!
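As a rough illustration of what "comparing embedding models" means in terms of retrieval metrics, the sketch below computes recall@k and reciprocal rank and picks the model with the best mean recall. The function names and the model names in the usage are hypothetical, and AutoRAG's own metric implementations may differ in detail.

```python
from typing import Dict, List, Set


def recall_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of relevant document ids that appear in the top-k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def reciprocal_rank(retrieved: List[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def pick_best_model(
    results: Dict[str, List[List[str]]], ground_truth: List[Set[str]], k: int
) -> str:
    """Return the model name with the highest mean recall@k over all queries."""
    def mean_recall(model: str) -> float:
        per_query = [
            recall_at_k(retrieved, relevant, k)
            for retrieved, relevant in zip(results[model], ground_truth)
        ]
        return sum(per_query) / len(per_query)

    return max(results, key=mean_recall)
```

Here `results` maps each model name to its ranked document ids per query, and `ground_truth` holds the relevant ids for the same queries in the same order.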

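4. HybridSearch: Semantic + BM25

This is a part that AutoRAG has put a great deal of effort into!
Besides the commonly used rrf (reciprocal rank fusion) method, it also supports the cc (convex combination) method.
It also automatically selects the best out of 100 hyperparameter candidates.
For the cc method, it provides four different normalization methods as well.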
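For readers unfamiliar with the two fusion methods, here is a minimal sketch of reciprocal rank fusion and of a convex combination with min-max normalization. It only illustrates the general techniques and is not AutoRAG's code: the function names, the `k = 60` constant, and the handling of documents missing from one result list are all assumptions.

```python
from typing import Dict, List


def rrf_fuse(rankings: List[List[str]], k: int = 60) -> Dict[str, float]:
    """Reciprocal rank fusion: each list contributes 1 / (k + rank) per document."""
    fused: Dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + rank)
    return fused


def min_max(scores: Dict[str, float]) -> Dict[str, float]:
    """Min-max normalization to [0, 1]; all-equal scores map to 0.0 (assumption)."""
    low, high = min(scores.values()), max(scores.values())
    if high == low:
        return {doc_id: 0.0 for doc_id in scores}
    return {doc_id: (score - low) / (high - low) for doc_id, score in scores.items()}


def cc_fuse(
    semantic: Dict[str, float], lexical: Dict[str, float], weight: float
) -> Dict[str, float]:
    """Convex combination: weight * semantic + (1 - weight) * lexical.

    Scores are normalized first because embedding similarities and BM25
    scores live on different scales. A document missing from one side
    is treated as scoring 0.0 there (assumption).
    """
    sem, lex = min_max(semantic), min_max(lexical)
    return {
        doc_id: weight * sem.get(doc_id, 0.0) + (1.0 - weight) * lex.get(doc_id, 0.0)
        for doc_id in set(sem) | set(lex)
    }
```

Because `weight` is a single number between 0 and 1, tuning it amounts to trying a grid of candidate values and keeping the one with the best retrieval metric, which is the kind of search described above.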

AutoRAG runs all of these experiments from a single YAML file and picks the best option for you. Pretty easy, right?


Chunking method

Below is a chunk YAML file, a feature newly added in AutoRAG 0.3.

modules:
  - module_type: llama_index_chunk
    chunk_method: [ Token, Sentence ]
    chunk_size: [ 1024, 512 ]
    chunk_overlap: 24
    add_file_name: en # add the title to each chunk
  - module_type: llama_index_chunk
    chunk_method: [ SentenceWindow ]
    window_size: 3
    add_file_name: en # add the title to each chunk

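This file experiments with three chunking methods, and you will notice a parameter called add_file_name. It adds the file name to each chunk.

You can think of it as one form of the technique from the Anthropic blog post, which inserts additional information at the front of each chunk.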
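The idea can be sketched in a few lines of Python. This is an illustration only: the word-based splitter stands in for a real token or sentence chunker, both function names are hypothetical, and the `file_name: ... contents: ...` template is an assumption, so the exact format AutoRAG produces may differ.

```python
from typing import List


def split_by_words(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Toy chunker (assumption): fixed-size word windows with overlap."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    words = text.split()
    step = chunk_size - chunk_overlap
    chunks: List[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once the current window already reaches the end of the text.
        if start + chunk_size >= len(words):
            break
    return chunks


def prepend_file_name(chunks: List[str], file_name: str) -> List[str]:
    """Put the file name in front of every chunk so each chunk carries its source."""
    # The template below is an assumed format, not necessarily AutoRAG's.
    return [f"file_name: {file_name}\n contents: {chunk}" for chunk in chunks]
```

With the file name in front, a chunk that would otherwise read as an isolated fragment carries at least a hint of which document it came from, both for the retriever and for the LLM that reads it later.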

A method that generates the additional information with an LLM, as Anthropic does, is also in the works. [GitHub issue]

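Conclusion

With AutoRAG, you can use these complex techniques without having to learn each of them first, and that should improve your RAG performance!

If you want to try the YAML files above right away, start with this tutorial!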