RRF(Reciprocal Rank Fusion)란?

RRF(Reciprocal Rank Fusion)는 문자 그대로 역수(Reciprocal)를 이용해 순위(Rank)를 혼합(Fusion)하는 알고리즘입니다. 구글에서 만든 알고리즘인데¹, 다른 알고리즘보다 낫다고 알려져 있습니다.

구체적으로는 아래와 같은 수식을 이용해 계산하게 됩니다.

$$ \text{score}^d_\text{RRF} = \sum_{r \in R} \frac{1}{k+\text{rank}^d_{r}} $$

수식이 복잡하다면 아래와 같은 파이썬 예제 코드로 생각하셔도 무방합니다.

def get_document_rank(ranking_list, target_document):
    # ranking_list is a list of documents in ranked order
    # e.g. ['A', 'B', 'C']
    return ranking_list.index(target_document) + 1

def calculate_rrf_score(ranking_lists, target_document):
    # K is a constant factor
    K = 3
    # find rank of document in ranks
    return sum(
        [
            1 / (K + get_document_rank(ranking_list, target_document))
            for ranking_list in ranking_lists
        ]
    )

예를 들어, A, B, C 세 문서에 대해 다음과 같이 점수를 매겨보았다고 가정해봅시다.

문서	알고리즘 X	알고리즘 Y
A	0.5 (2위)	0.9 (2위)
B	0.3 (3위)	0.95 (1위)
C	0.7 (1위)	0.82 (3위)

k를 3으로 잡으면 RRF는 다음과 같습니다.

문서	X 순위	Y 순위	계산식	점수
A	2	2	$\frac{1}{3+2}+\frac{1}{3+2}=0.4$	0.4
B	3	1	$\frac{1}{3+3}+\frac{1}{3+1}=0.416$	0.416
C	1	3	$\frac{1}{3+1}+\frac{1}{3+3}=0.416$	0.416

논문 저자는 이런 기법을 사용하는 이유에 대해 다음과 같이 설명하고 있습니다.

Our intuition in choosing this formula derived from fact that while highly-ranked documents are more important, the importance of lower-ranked documents does not vanish as it would were, say, an exponential function used. The constant k mitigates the impact of high rankings by outlier systems
높은 순위의 문서들이 더 중요하지만, 예를 들어 지수 함수를 사용했다고 해서 낮은 순위의 문서들이 덜 중요하게 되는 것은 아니라는 사실이 이 공식을 고르게 된 직관적인 이유였다. 상수 k는 아웃라이어 알고리즘(outlier systems)의 높은 순위가 영향을 주는 것을 완화하는 역할을 한다.

Cormack, Gordon V., Charles LA Clarke, and Stefan Buettcher. “Reciprocal rank fusion outperforms condorcet and individual rank learning methods.” In Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval, pp. 758-759. 2009. ↩︎