Estimating Preference For Ranking Functions With Clicks On Interleaved Search Results

Introduction

The way Wikimedia Search Platform’s (formerly Wikimedia Discovery) analyses have been assessing changes to search has traditionally relied on A/B testing wherein the control group receives results using the latest configuration and the test group (or groups) receives results using the experimental configuration. Another way to evaluate the user-perceived relevance of search results from the experimental configuration relies on a technique called interleaving. In it, each user is their own baseline – we perform two searches behind the scenes and then interleave them together into a single set of results using the team draft algorithm described by Chapelle et al. (2012):

Input: result sets \(A\) and \(B\).
Initialize: an empty interleaved result sets \(I\) and drafts \(T_A, T_B\) for keeping track of which results belong to which team.
For each round of picking:
1. Randomly decide whether we first pick from \(A\) or from \(B\).
2. Without loss of generality, if \(A\) is randomly chosen to go first, grab top result \(a \in A\), append it to \(I\) and \(T_A\): \(I \gets a, T_A \gets a\).
3. Take the top result \(b \in B\) such that \(b \neq a\) and append it to \(I\) after \(a\) and to \(T_B\): \(I \gets b, T_B \gets b\).
4. Update \(A = A \setminus \{a, b\}\) and \(B \setminus \{a, b\}\) so the two results that were just appended to \(I\) are not considered again.
5. Stop when \(|I| = \text{maximum per page}\), so only the first page contains interleaved results.
Output: interleaved results \(I\) and team drafts \(T_A, T_B\).

By keeping track of which results belong to which ranking function when the user clicks on them, we can estimate a preference for one ranker over the other. The preference statistic \(\Delta_{AB}\) is described by Chapelle et al. as

\[ \Delta_{AB} = \frac{\text{wins}_A + \frac{1}{2} \text{ties}}{\text{wins}_A + \text{wins}_B + \text{ties}} - 0.5, \]

where wins are calculated by counting clicks on the results from teams “A” and “B”. A positive value of \(\Delta_{AB}\) indicates that \(A \succ B\), a negative value indicates that \(B \succ A\). We performed two types of calculations: per-session and per-search. In per-session, “A” has won if there are more clicks on team “A” results than team “B” results and \(\text{wins}_A\) is incremented by one for each such session. In per-search, “A” has won if there are more clicks on team “A” results in each search, thus any one session can contribute multiple points to the overall \(\text{wins}_A\).

In order to obtain confidence intervals for the preference statistic, we utilize bootstrapping with \(m\) iterations.

For bootstrap iteration \(i = 1, \ldots, m\):
1. Sample unique IDs with replacement.
2. Calculate \(\Delta_{AB}^{(i)}\) from new data.
The confidence intervals (CIs) are calculated by finding percentiles of the distribution of bootstrapped preferences \(\{\Delta_{AB}^{(1)}, \ldots, \Delta_{AB}^{(m)}\}\) – e.g. the 2.5th and 97.5th percentiles for a 95% CI.

Simulated Data

This package provides simulated search and click data. The three built-in datasets have simulated users that (1) exhibit no preference, (2) exhibit preference for the ranking function “A”, and (3) exhibit preference for the ranking function “B”.

data(interleaved_data, package = "wmfastr") # no preference
data(interleaved_data_a, package = "wmfastr") # preference for A
data(interleaved_data_b, package = "wmfastr") # preference for B

Here are the first few rows of the third dataset:

knitr::kable(head(interleaved_data_b))

session_id	timestamp	event	position	ranking_function
ufxl7jprs8	2018-08-01 00:01:06	serp	NA	NA
zbuk6t3n4m	2018-08-01 00:00:37	serp	NA	NA
zbuk6t3n4m	2018-08-01 00:03:16	click	8	B
zbuk6t3n4m	2018-08-01 00:03:37	click	13	B
zbuk6t3n4m	2018-08-01 00:05:26	click	16	B
zbuk6t3n4m	2018-08-01 00:05:42	click	18	B

Estimation

library(wmfastr)

To calculate \(\Delta_{AB}\) with interleaved_preference, we will need to use the clicks. We also use bootstrapping via interleaved_bootstraps which resamples sessions (with replacement) to obtain a distribution of the preference statistic \(\Delta_{AB}\). After we plot each bootstrapped sample, we mark the 95% confidence interval bounds. Note that interleaved_confint outputs the quantile-based CI and uses the same bootstrap function internally.

No preference

When users click on the interleaved results without a preference, the resulting preference statistic is close to 0 and the confidence interval covers 0:

x <- interleaved_data[interleaved_data$event == "click", ]
x <- x[order(x$session_id, x$timestamp), ]
boot_x <- interleaved_bootstraps(x$session_id, x$ranking_function)
hist(boot_x, col = "gray70", border = NA, main = "No preference", xlab = "Bootstrapped preferences")
abline(v = quantile(boot_x, c(0.025, 0.975)), lty = "dashed")
abline(v = interleaved_preference(x$session_id, x$ranking_function), lwd = 2)

Preference for A

When users click on the interleaved results with a preference for A, the resulting preference statistic is positive and the confidence interval does not cover 0:

y <- interleaved_data_a[interleaved_data_a$event == "click", ]
y <- y[order(y$session_id, y$timestamp), ]
boot_y <- interleaved_bootstraps(y$session_id, y$ranking_function)
hist(boot_y, col = "gray70", border = NA, main = "Preference for A", xlab = "Bootstrapped preferences")
abline(v = quantile(boot_y, c(0.025, 0.975)), lty = "dashed")
abline(v = interleaved_preference(y$session_id, y$ranking_function), lwd = 2)

Preference for B

When users click on the interleaved results with a preference for B, the resulting preference statistic is negative and the confidence interval does not cover 0:

z <- interleaved_data_b[interleaved_data_b$event == "click", ]
z <- z[order(z$session_id, z$timestamp), ]
boot_z <- interleaved_bootstraps(z$session_id, z$ranking_function)
hist(boot_z, col = "gray70", border = NA, main = "Preference for B", xlab = "Bootstrapped preferences")
abline(v = quantile(boot_z, c(0.025, 0.975)), lty = "dashed")
abline(v = interleaved_preference(z$session_id, z$ranking_function), lwd = 2)