Abstract
Using a curated list of 10 search queries and the English Wikipedia articles that were the top 5 results for each one, we asked randomly selected visitors to those articles whether the article they were on was relevant to the corresponding search query. Taking our own judgement of each article’s relevance as the gold standard, we trained models on a summary relevance score computed from users’ responses and on users’ engagement with the survey, and these models classified articles as relevant or irrelevant with remarkably high accuracy given the few data points we had to work with. Combined with more data, these methods would let us leverage the opinions of our enormous audience to predict article rankings for search queries at scale, which we could then feed into our learning-to-rank project to make search on Wikipedia and other Wikimedia projects better for our users.
Phabricator ticket | Open source analysis | Open data
We ran a series of tests on English Wikipedia that asked users whether the article they were reading was relevant to one of the curated search queries. For our minimum viable product (MVP) test, we hard-coded the list of queries and articles. We also tried different wordings of the relevance question to assess the impact of each.
Uploaded to Phabricator by Erik Bernhardson (F9161493)
For this MVP, the queries were chosen to be about topics for which we could confidently judge an article’s relevance beforehand, such as American pop culture:
For each query, we judged the relevance of the articles that were the top 5 results for the queries at the time (and most are still the top 5 results). The following table shows which pages we asked users about and our judgements:
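For illustration, the hard-coded MVP data could be structured along these lines (a sketch only; the query, article titles, and labels below are placeholders, not the actual curated list or our real judgements):

```python
# Hypothetical structure for the hard-coded MVP list: each curated query maps
# to its top-5 result articles, each paired with our a-priori relevance
# judgement. All names and labels here are placeholders.
judgements = {
    "hypothetical pop culture query": [
        ("Some Clearly Relevant Article", True),
        ("Some Tangentially Related Article", False),
        # ... remaining top-5 results, each labeled True/False
    ],
}

def gold_label(query, article):
    """Look up the gold-standard judgement for a (query, article) pair."""
    for title, relevant in judgements.get(query, []):
        if title == article:
            return relevant
    return None  # pair not in the curated list
```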
A user visiting one of those articles might be randomly picked for the survey. There were 4 varieties of questions that we asked:
(Where … was replaced with the actual query.)
We varied the wording so that we could assess how the phrasing affected the results.
aggregates_first <- responses_first %>%
  dplyr::group_by(query, article, question, choice) %>%
  dplyr::tally() %>%
  dplyr::ungroup() %>%
  # One row per (query, article, question), one column of counts per choice:
  tidyr::spread(choice, n, fill = 0) %>%
  dplyr::mutate(
    total = yes + no,
    # Net agreement, smoothed by +1 so low-volume articles avoid extreme scores:
    score = (yes - no) / (total + 1),
    # Engagement must use the raw counts, so compute it before the yes/no/dismiss
    # columns are overwritten with proportions below (mutate() evaluates
    # its expressions sequentially):
    engaged = (total + dismiss) / (total + dismiss + timeout),
    yes = yes / total,
    no = no / total,
    dismiss = dismiss / (total + dismiss)
  ) %>%
  dplyr::select(-c(total, timeout)) %>%
  # Back to long format for plotting, with the choices in a fixed order:
  tidyr::gather(choice, prop, -c(query, article, question)) %>%
  dplyr::mutate(choice = factor(choice, levels = c("yes", "no", "dismiss", "engaged", "score")))
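To make the aggregation concrete, here is the same per-article arithmetic for a single (query, article, question) cell, sketched in Python with toy counts (computing engagement from the raw counts, which is presumably the intent):

```python
def aggregate(yes, no, dismiss, timeout):
    """Mirror of the dplyr summary for one (query, article, question) cell.
    Inputs are raw response counts; returns the summary proportions and score."""
    total = yes + no
    return {
        # Smoothed net agreement in (-1, 1); the +1 in the denominator keeps
        # articles with few responses from getting extreme scores.
        "score": (yes - no) / (total + 1),
        "yes": yes / total,
        "no": no / total,
        "dismiss": dismiss / (total + dismiss),
        # Share of surveys the user interacted with (answered or dismissed):
        "engaged": (total + dismiss) / (total + dismiss + timeout),
    }

# Toy example: 8 yes, 2 no, 2 dismissals, 8 timeouts
summary = aggregate(yes=8, no=2, dismiss=2, timeout=8)
# score = (8 - 2) / 11, engaged = 12 / 20
```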
The first test (08/04-08/10) had no time delay and presented users with options to answer “Yes”, “No”, “I don’t know”, or dismiss the notification. The notification disappeared after 30 seconds if the user did not interact with it. Due to a bug, the “I don’t know” responses were not recorded for this test. There were 11,056 sessions and 3,016 yes/no responses; 8,703 of the 11,819 surveys shown (73.6%) timed out, and 100 surveys were dismissed by the user.
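The timeout percentage checks out if it is taken over all surveys shown (responses plus dismissals plus timeouts) rather than over sessions:

```python
yes_no = 3016      # yes/no responses in the first test
dismissed = 100    # surveys dismissed by the user
timed_out = 8703   # surveys that timed out after 30 seconds

surveys = yes_no + dismissed + timed_out   # 11,819 surveys shown
timeout_rate = timed_out / surveys         # ~0.736, matching the reported 73.6%
# Engagement per the script's definition (answered or dismissed / all shown):
engaged = (yes_no + dismissed) / surveys   # ~0.264
```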