The way Wikimedia Search Platform’s (formerly Wikimedia Discovery) analyses have been assessing changes to search has traditionally relied on A/B testing wherein the control group receives results using the latest configuration and the test group (or groups) receives results using the experimental configuration. Another way to evaluate the user-perceived relevance of search results from the experimental configuration relies on a technique called *interleaving*. In it, each user is their own baseline – we perform two searches behind the scenes and then interleave them together into a single set of results using the team draft algorithm described by Chapelle et al. (2012):

**Input**: result sets \(A\) and \(B\).**Initialize**: an empty interleaved result sets \(I\) and drafts \(T_A, T_B\) for keeping track of which results belong to which team.- For each round of picking:
- Randomly decide whether we first pick from \(A\) or from \(B\).
- Without loss of generality, if \(A\) is randomly chosen to go first, grab top result \(a \in A\), append it to \(I\) and \(T_A\): \(I \gets a, T_A \gets a\).
- Take the top result \(b \in B\) such that \(b \neq a\) and append it to \(I\) after \(a\) and to \(T_B\): \(I \gets b, T_B \gets b\).
- Update \(A = A \setminus \{a, b\}\) and \(B \setminus \{a, b\}\) so the two results that were just appended to \(I\) are not considered again.
- Stop when \(|I| = \text{maximum per page}\), so only the first page contains interleaved results.

**Output**: interleaved results \(I\) and team drafts \(T_A, T_B\).

By keeping track of which results belong to which ranking function when the user clicks on them, we can estimate a preference for one ranker over the other. The preference statistic \(\Delta_{AB}\) is described by Chapelle et al. as

\[ \Delta_{AB} = \frac{\text{wins}_A + \frac{1}{2} \text{ties}}{\text{wins}_A + \text{wins}_B + \text{ties}} - 0.5, \]

where wins are calculated by counting clicks on the results from teams “A” and “B”. A positive value of \(\Delta_{AB}\) indicates that \(A \succ B\), a negative value indicates that \(B \succ A\). We performed two types of calculations: per-session and per-search. In **per-session**, “A” has won if there are more clicks on team “A” results than team “B” results and \(\text{wins}_A\) is incremented by one for each such session. In **per-search**, “A” has won if there are more clicks on team “A” results in each search, thus any one session can contribute multiple points to the overall \(\text{wins}_A\).

In order to obtain confidence intervals for the preference statistic, we utilize bootstrapping with \(m\) iterations.

- For bootstrap iteration \(i = 1, \ldots, m\):
- Sample unique IDs with replacement.
- Calculate \(\Delta_{AB}^{(i)}\) from new data.

- The confidence intervals (CIs) are calculated by finding percentiles of the distribution of bootstrapped preferences \(\{\Delta_{AB}^{(1)}, \ldots, \Delta_{AB}^{(m)}\}\) – e.g. the 2.5th and 97.5th percentiles for a 95% CI.

This package provides simulated search and click data. The three built-in datasets have simulated users that (1) exhibit no preference, (2) exhibit preference for the ranking function “A”, and (3) exhibit preference for the ranking function “B”.

```
data(interleaved_data, package = "wmfastr") # no preference
data(interleaved_data_a, package = "wmfastr") # preference for A
data(interleaved_data_b, package = "wmfastr") # preference for B
```

Here are the first few rows of the third dataset:

`::kable(head(interleaved_data_b)) knitr`

session_id | timestamp | event | position | ranking_function |
---|---|---|---|---|

ufxl7jprs8 | 2018-08-01 00:01:06 | serp | NA | NA |

zbuk6t3n4m | 2018-08-01 00:00:37 | serp | NA | NA |

zbuk6t3n4m | 2018-08-01 00:03:16 | click | 8 | B |

zbuk6t3n4m | 2018-08-01 00:03:37 | click | 13 | B |

zbuk6t3n4m | 2018-08-01 00:05:26 | click | 16 | B |

zbuk6t3n4m | 2018-08-01 00:05:42 | click | 18 | B |

`library(wmfastr)`

To calculate \(\Delta_{AB}\) with `interleaved_preference`

, we will need to use the clicks. We also use bootstrapping via `interleaved_bootstraps`

which resamples sessions (with replacement) to obtain a distribution of the preference statistic \(\Delta_{AB}\). After we plot each bootstrapped sample, we mark the 95% confidence interval bounds. **Note** that `interleaved_confint`

outputs the `quantile`

-based CI and uses the same bootstrap function internally.

When users click on the interleaved results *without* a preference, the resulting preference statistic is close to 0 and the confidence interval covers 0:

```
interleaved_data[interleaved_data$event == "click", ]
x <- x[order(x$session_id, x$timestamp), ]
x <- interleaved_bootstraps(x$session_id, x$ranking_function)
boot_x <-hist(boot_x, col = "gray70", border = NA, main = "No preference", xlab = "Bootstrapped preferences")
abline(v = quantile(boot_x, c(0.025, 0.975)), lty = "dashed")
abline(v = interleaved_preference(x$session_id, x$ranking_function), lwd = 2)
```

When users click on the interleaved results *with* a preference for A, the resulting preference statistic is *positive* and the confidence interval does *not* cover 0:

```
interleaved_data_a[interleaved_data_a$event == "click", ]
y <- y[order(y$session_id, y$timestamp), ]
y <- interleaved_bootstraps(y$session_id, y$ranking_function)
boot_y <-hist(boot_y, col = "gray70", border = NA, main = "Preference for A", xlab = "Bootstrapped preferences")
abline(v = quantile(boot_y, c(0.025, 0.975)), lty = "dashed")
abline(v = interleaved_preference(y$session_id, y$ranking_function), lwd = 2)
```

When users click on the interleaved results *with* a preference for B, the resulting preference statistic is *negative* and the confidence interval does *not* cover 0:

```
interleaved_data_b[interleaved_data_b$event == "click", ]
z <- z[order(z$session_id, z$timestamp), ]
z <- interleaved_bootstraps(z$session_id, z$ranking_function)
boot_z <-hist(boot_z, col = "gray70", border = NA, main = "Preference for B", xlab = "Bootstrapped preferences")
abline(v = quantile(boot_z, c(0.025, 0.975)), lty = "dashed")
abline(v = interleaved_preference(z$session_id, z$ranking_function), lwd = 2)
```

- Chapelle, O., Joachims, T., Radlinski, F., & Yue, Y. (2012). Large-scale validation and analysis of interleaved search evaluation.
*ACM Transactions on Information Systems*,**30**(1), 1-41. doi:10.1145/2094072.2094078 - Radlinski, F. and Craswell, N. (2013). Optimized interleaving for online retrieval evaluation.
*ACM International Conference on Web Search and Data Mining (WSDM)*. doi:10.1145/2433396.2433429