The way Wikimedia Search Platform’s (formerly Wikimedia Discovery) analyses have been assessing changes to search has traditionally relied on A/B testing wherein the control group receives results using the latest configuration and the test group (or groups) receives results using the experimental configuration. Another way to evaluate the user-perceived relevance of search results from the experimental configuration relies on a technique called interleaving. In it, each user is their own baseline – we perform two searches behind the scenes and then interleave them together into a single set of results using the team draft algorithm described by Chapelle et al. (2012):
By keeping track of which results belong to which ranking function when the user clicks on them, we can estimate a preference for one ranker over the other. The preference statistic \(\Delta_{AB}\) is described by Chapelle et al. as
\[ \Delta_{AB} = \frac{\text{wins}_A + \frac{1}{2} \text{ties}}{\text{wins}_A + \text{wins}_B + \text{ties}} - 0.5, \]
where wins are calculated by counting clicks on the results from teams “A” and “B”. A positive value of \(\Delta_{AB}\) indicates that \(A \succ B\), a negative value indicates that \(B \succ A\). We performed two types of calculations: per-session and per-search. In per-session, “A” has won if there are more clicks on team “A” results than team “B” results and \(\text{wins}_A\) is incremented by one for each such session. In per-search, “A” has won if there are more clicks on team “A” results in each search, thus any one session can contribute multiple points to the overall \(\text{wins}_A\).
In order to obtain confidence intervals for the preference statistic, we utilize bootstrapping with \(m\) iterations.
This package provides simulated search and click data. The three built-in datasets have simulated users that (1) exhibit no preference, (2) exhibit preference for the ranking function “A”, and (3) exhibit preference for the ranking function “B”.
data(interleaved_data, package = "wmfastr") # no preference
data(interleaved_data_a, package = "wmfastr") # preference for A
data(interleaved_data_b, package = "wmfastr") # preference for B
Here are the first few rows of the third dataset:
::kable(head(interleaved_data_b)) knitr
session_id | timestamp | event | position | ranking_function |
---|---|---|---|---|
ufxl7jprs8 | 2018-08-01 00:01:06 | serp | NA | NA |
zbuk6t3n4m | 2018-08-01 00:00:37 | serp | NA | NA |
zbuk6t3n4m | 2018-08-01 00:03:16 | click | 8 | B |
zbuk6t3n4m | 2018-08-01 00:03:37 | click | 13 | B |
zbuk6t3n4m | 2018-08-01 00:05:26 | click | 16 | B |
zbuk6t3n4m | 2018-08-01 00:05:42 | click | 18 | B |
library(wmfastr)
To calculate \(\Delta_{AB}\) with interleaved_preference
, we will need to use the clicks. We also use bootstrapping via interleaved_bootstraps
which resamples sessions (with replacement) to obtain a distribution of the preference statistic \(\Delta_{AB}\). After we plot each bootstrapped sample, we mark the 95% confidence interval bounds. Note that interleaved_confint
outputs the quantile
-based CI and uses the same bootstrap function internally.
When users click on the interleaved results without a preference, the resulting preference statistic is close to 0 and the confidence interval covers 0:
interleaved_data[interleaved_data$event == "click", ]
x <- x[order(x$session_id, x$timestamp), ]
x <- interleaved_bootstraps(x$session_id, x$ranking_function)
boot_x <-hist(boot_x, col = "gray70", border = NA, main = "No preference", xlab = "Bootstrapped preferences")
abline(v = quantile(boot_x, c(0.025, 0.975)), lty = "dashed")
abline(v = interleaved_preference(x$session_id, x$ranking_function), lwd = 2)
When users click on the interleaved results with a preference for A, the resulting preference statistic is positive and the confidence interval does not cover 0:
interleaved_data_a[interleaved_data_a$event == "click", ]
y <- y[order(y$session_id, y$timestamp), ]
y <- interleaved_bootstraps(y$session_id, y$ranking_function)
boot_y <-hist(boot_y, col = "gray70", border = NA, main = "Preference for A", xlab = "Bootstrapped preferences")
abline(v = quantile(boot_y, c(0.025, 0.975)), lty = "dashed")
abline(v = interleaved_preference(y$session_id, y$ranking_function), lwd = 2)
When users click on the interleaved results with a preference for B, the resulting preference statistic is negative and the confidence interval does not cover 0:
interleaved_data_b[interleaved_data_b$event == "click", ]
z <- z[order(z$session_id, z$timestamp), ]
z <- interleaved_bootstraps(z$session_id, z$ranking_function)
boot_z <-hist(boot_z, col = "gray70", border = NA, main = "Preference for B", xlab = "Bootstrapped preferences")
abline(v = quantile(boot_z, c(0.025, 0.975)), lty = "dashed")
abline(v = interleaved_preference(z$session_id, z$ranking_function), lwd = 2)