Phabricator ticket | Open source analysis | Open data

Background

We ran a series of tests on English Wikipedia in which we asked users whether the article they were reading was relevant to one of a set of curated search queries. For our minimum viable product (MVP) test, we hard-coded a list of queries and articles. We also tried different wordings of the relevance question to assess the impact of each.

Uploaded to Phabricator by Erik Bernhardson (F9161493)

For this MVP, the queries were chosen to be about topics for which we could confidently judge an article’s relevance beforehand, such as American pop culture:

  • who is v for vendetta?
  • star and stripes
  • block buster
  • 10 items or fewer
  • sailor soldier tinker spy
  • how do flowers bloom?
  • yesterday beetles
  • search engine
  • what is a genius iq?
  • why is a baby goat a kid?

For each query, we judged the relevance of the articles that were the top 5 results for that query at the time (most are still in the top 5). The following table shows which pages we asked users about and our judgements:

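query | article | opinion
who is v for vendetta? | V for Vendetta (film) | ok
who is v for vendetta? | V for Vendetta | ok
who is v for vendetta? | List of V for Vendetta characters | good
who is v for vendetta? | V (comics) | best
who is v for vendetta? | Vendetta Pro Wrestling | bad
star and stripes | Stars and Stripes Forever (disambiguation) | ok
star and stripes | The White Stripes | bad
star and stripes | Tars and Stripes | bad
star and stripes | The Stars and Stripes Forever | ok
star and stripes | Stripes (film) | bad
block buster | Blockbuster | best
block buster | Block Buster! | good
block buster | The Sweet (album) | bad
block buster | Block Busters | ok
block buster | Buster Keaton | bad
10 items or fewer | Fewer vs. less | ok
10 items or fewer | 10-foot user interface | very bad
10 items or fewer | Magic item (Dungeons & Dragons) | very bad
10 items or fewer | Item-item collaborative filtering | very bad
10 items or fewer | Item 47 | very bad
sailor soldier tinker spy | Tinker Tailor Soldier Spy | best
sailor soldier tinker spy | Tinker, Tailor | good
sailor soldier tinker spy | Blanket of Secrecy | ok
sailor soldier tinker spy | List of fictional double agents | ok
sailor soldier tinker spy | Ian Bannen | ok
how do flowers bloom? | Britain in Bloom | bad
how do flowers bloom? | Flowers in the Attic (1987 film) | very bad
how do flowers bloom? | Flower | best
how do flowers bloom? | Thymaridas | very bad
how do flowers bloom? | Flowers in the Attic | very bad
yesterday beetles | Private language argument | very bad
yesterday beetles | Diss (music) | very bad
yesterday beetles | How Do You Sleep? (John Lennon song) | very bad
yesterday beetles | Maria Mitchell Association | very bad
yesterday beetles | The Collected Stories of Philip K. Dick | very bad
search engine | Web search engine | best
search engine | List of search engines | good
search engine | Search engine optimization | ok
search engine | Search engine marketing | ok
search engine | Audio search engine | ok
what is a genius iq? | Genius | good
what is a genius iq? | IQ classification | best
what is a genius iq? | Genius (website) | bad
what is a genius iq? | High IQ society | ok
what is a genius iq? | Social IQ score of bacteria | bad
why is a baby goat a kid? | Goat | best
why is a baby goat a kid? | Super Why! | very bad
why is a baby goat a kid? | Barney & Friends | very bad
why is a baby goat a kid? | The Kids from Room 402 | very bad
why is a baby goat a kid? | Oliver Hardy filmography | very bad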

Users visiting one of those articles could be randomly selected for the survey. We asked four variants of the question:

  1. Would you click on this page when searching for ‘…’?
  2. If you searched for ‘…’, would this article be a good result?
  3. If you searched for ‘…’, would this article be relevant?
  4. If someone searched for ‘…’, would they want to read this article?

(Where … was replaced with the actual query.)

We varied the wording so that we could assess how the phrasing affected the results.
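
As a rough illustration of how a question could be built from a variant and a query, here is a minimal R sketch. The template text is taken from the list above; the function name `make_question` and the uniform random choice of variant are assumptions for illustration, not a description of how the survey was actually deployed.

```r
# Question wordings from the list above, with %s standing in for the query.
question_templates <- c(
  "Would you click on this page when searching for '%s'?",
  "If you searched for '%s', would this article be a good result?",
  "If you searched for '%s', would this article be relevant?",
  "If someone searched for '%s', would they want to read this article?"
)

# Hypothetical helper: pick one wording uniformly at random (an assumption)
# and substitute the query into it.
make_question <- function(query) {
  variant <- sample(seq_along(question_templates), size = 1)
  list(
    variant = variant,
    text = sprintf(question_templates[variant], query)
  )
}

q <- make_question("search engine")
q$text
```

Recording `variant` alongside each response is what makes it possible to compare the wordings afterwards.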

Results

First Test

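# Count responses per query/article/question, then derive per-group metrics.
aggregates_first <- responses_first %>%
  dplyr::group_by(query, article, question, choice) %>%
  dplyr::tally() %>%
  dplyr::ungroup() %>%
  # One column per choice (yes, no, dismiss, timeout); missing combinations become 0.
  tidyr::spread(choice, n, fill = 0) %>%
  dplyr::mutate(
    total = yes + no,
    # Net agreement; the + 1 shrinks scores that rest on few responses.
    score = (yes - no) / (total + 1),
    # Computed from raw counts, before `dismiss` is overwritten below
    # (mutate() evaluates its arguments in order).
    engaged = (total + dismiss) / (total + dismiss + timeout),
    dismiss = dismiss / (total + dismiss),
    yes = yes / total,
    no = no / total
  ) %>%
  dplyr::select(-c(total, timeout)) %>%
  # Back to long format: one row per query/article/question/metric.
  tidyr::gather(choice, prop, -c(query, article, question)) %>%
  dplyr::mutate(choice = factor(choice, levels = c("yes", "no", "dismiss", "engaged", "score")))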
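
To make the derived metrics concrete, the following sketch applies the same formulas to a single made-up set of counts, reading each metric as a function of the raw counts. The helper `survey_metrics` and the numbers passed to it are hypothetical and chosen only for illustration; they are not results from the test.

```r
# Hypothetical helper mirroring the formulas in the mutate() call above,
# applied to raw counts for one query/article/question combination.
survey_metrics <- function(yes, no, dismiss, timeout) {
  total <- yes + no
  c(
    score   = (yes - no) / (total + 1),
    yes     = yes / total,
    no      = no / total,
    dismiss = dismiss / (total + dismiss),
    engaged = (total + dismiss) / (total + dismiss + timeout)
  )
}

# Made-up counts: 30 yes, 10 no, 5 dismissals, 55 timeouts.
m <- survey_metrics(yes = 30, no = 10, dismiss = 5, timeout = 55)
round(m, 3)

# A single "yes" with no other responses scores 1 / 2 rather than 1.
survey_metrics(yes = 1, no = 0, dismiss = 0, timeout = 0)["score"]
```

With these counts the score is 20 / 41 and engagement is 45 / 100. One effect of the `+ 1` in the denominator is that an article with very few responses cannot reach a score of exactly 1 or -1.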

Summary

The first test (08/04-08/10) had no time delay and presented users with options to answer “Yes”, “No”, or “I don’t know”, or to dismiss the notification. The notification disappeared after 30 seconds if the user did not interact with it. Due to a bug, the “I don’t know” responses were not recorded for this test. There were 11,056 sessions and 3,016 yes/no responses. A further 8,703 surveys (73.6%) timed out, and 100 surveys were dismissed by the user.
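
The percentage above can be checked from the reported counts. This short R sketch assumes that the denominator is every survey with a recorded outcome (yes/no response, timeout, or dismissal); the variable names are illustrative.

```r
responses <- 3016  # yes/no responses
timed_out <- 8703  # surveys that timed out
dismissed <- 100   # surveys dismissed by the user

surveys <- responses + timed_out + dismissed
surveys                                  # 11819
round(100 * timed_out / surveys, 1)      # 73.6
```

Under that assumption, the 73.6% is a share of surveys rather than of the 11,056 sessions.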

Survey Responses