Wikipedia Android app users in India

For the mobile personas project in India, we need to choose what cities/regions to perform research in.

Note: since users can opt out of sharing usage data, the numbers here are all under-counts as there are (presumably) many users who have not opted in.

Setup

This Jupyter notebook running on Simple Wikimedia Analytics Platform (SWAP) requires R, an R kernel, and the following packages:

install.packages(c("devtools", "tidyverse", "lubridate", "glue", "zeallot"))
devtools::install_git("https://gerrit.wikimedia.org/r/wikimedia/discovery/wmf")
devtools::install_git("https://gerrit.wikimedia.org/r/wikimedia/discovery/polloi")
In [1]:
library(magrittr)
library(glue)
library(wmf)
library(zeallot)

Cities and Regions

What cities and regions in India does the Android app get most usage (here defined as sessions in the last 90 days) from?

In [2]:
query <- "-- Language clustering analysis for Wikipedia Android app research in India
SELECT
  subdivision AS region, city,
  COUNT(DISTINCT uuid) AS n_devices
FROM bearloga.android_active_users
WHERE
  country_code = 'IN' -- requests from India
  -- use the date from 90 days ago (e.g. '2018-02-09' if running on '2018-05-10'):
  AND CONCAT(year, '-', LPAD(month, 2, '0'), '-', LPAD(day, 2, '0')) >= '${start_date}'
  AND CONCAT(year, '-', LPAD(month, 2, '0'), '-', LPAD(day, 2, '0')) <= '${end_date}'
GROUP BY subdivision, city;"
In [3]:
suppressPackageStartupMessages(library(lubridate))
end_date <- today() - days(1)
start_date <- end_date - days(90)
# c(start_date, end_date) %<-% c("2018-02-01", "2018-02-02")
In [4]:
india_android_users <- wmf::query_hive(glue(query, .open = "${"))
readr::write_tsv(india_android_users, "~/mobile-apps-team/india_android_users.tsv")
system("gzip --force ~/mobile-apps-team/india_android_users.tsv")
In [5]:
india_android_users <- readr::read_tsv("~/mobile-apps-team/india_android_users.tsv.gz", col_types = "cci")
In [6]:
india_android_total <- wmf::query_hive(glue("SELECT
  COUNT(DISTINCT uuid) AS n_devices
FROM bearloga.android_active_users
WHERE country_code = 'IN' -- requests from India
  AND CONCAT(year, '-', LPAD(month, 2, '0'), '-', LPAD(day, 2, '0')) >= '${start_date}'
  AND CONCAT(year, '-', LPAD(month, 2, '0'), '-', LPAD(day, 2, '0')) <= '${end_date}'
", .open = "${"))$n_devices

Top 20 cities by number of Wikipedia Android app users

In [7]:
india_android_users %>%
    dplyr::mutate(prop_devices = sprintf("%.1f%%", 100 * n_devices / india_android_total)) %>%
    dplyr::filter(city != "Unknown") %>%
    dplyr::top_n(20, n_devices) %>%
    dplyr::arrange(desc(n_devices)) %>%
    dplyr::mutate(n_devices = polloi::compress(n_devices, 2)) %>%
    dplyr::select(-region)
cityn_devicesprop_devices
Bengaluru 108.05K 12.6%
Delhi 93.55K 10.9%
Pune 85.7K 10.0%
Chennai 77.69K 9.1%
Mumbai 64.03K 7.5%
Hyderabad 63.6K 7.4%
Kolkata 58.74K 6.9%
Ahmedabad 51.34K 6.0%
New Delhi 48.29K 5.6%
Lucknow 46.47K 5.4%
Patna 45.2K 5.3%
Bhubaneswar34.14K 4.0%
Jaipur 31.39K 3.7%
Bhopal 30.56K 3.6%
Ernakulam 29.5K 3.4%
Kochi 26.82K 3.1%
Indore 26.62K 3.1%
Noida 25.95K 3.0%
Coimbatore 22.02K 2.6%
Thane 20K 2.3%

Note: since users can go between cities, n_devices represents number of devices observed in a city independent of other cities, so there may be overlaps. prop_devices is the proportion of true total number of unique Android app users in the last 90 days.

Top 10 regions by number of devices

In [8]:
india_android_users %>%
    dplyr::group_by(region) %>%
    dplyr::summarize(n_devices = sum(n_devices)) %>%
    dplyr::ungroup() %>%
    dplyr::mutate(prop_devices = sprintf("%.1f%%", 100 * n_devices / india_android_total)) %>%
    dplyr::filter(region != "Unknown") %>%
    dplyr::top_n(10, n_devices) %>%
    dplyr::arrange(desc(n_devices)) %>%
    dplyr::mutate(n_devices = polloi::compress(n_devices, 2))
regionn_devicesprop_devices
Maharashtra 212.04K 24.8%
Tamil Nadu 203.33K 23.8%
Uttar Pradesh 198.02K 23.1%
Kerala 167.2K 19.5%
Karnataka 152.31K 17.8%
West Bengal 148.33K 17.3%
National Capital Territory of Delhi146.41K 17.1%
Gujarat 121.16K 14.2%
Madhya Pradesh 80.72K 9.4%
Bihar 79.77K 9.3%

Language Combinations

What are the most typical language clusters? For example, this report might say that 80% of users from India have looked at only English content in the past 90 days, 10% have looked at both English and Hindi, 5% have looked at English, Hindi and Marathi content, 2.5% have looked at English-Hindi-Kannada, 1% have looked at Marathi-only, etc etc. (I know from looking at stats Jon sent that the highest % of India pageviews go to content in English, and then Hindi. But in a country like India, that actually doesn't mean much - many educated people speak English, and a majority of folks speak Hindi, so that doesn't tell us what other languages they might speak at home, and thus what other languages they might want content in!)

In [9]:
query <- "WITH per_device AS (
    SELECT uuid, CONCAT_WS(', ', COLLECT_LIST(DISTINCT wikipedia)) AS languages
    FROM bearloga.android_active_users
    WHERE country_code = 'IN' -- requests from India
      AND CONCAT(year, '-', LPAD(month, 2, '0'), '-', LPAD(day, 2, '0')) >= '${start_date}'
      AND CONCAT(year, '-', LPAD(month, 2, '0'), '-', LPAD(day, 2, '0')) <= '${end_date}'
      AND wikipedia IS NOT NULL
      AND NOT wikipedia IN('', '-')
    GROUP BY uuid
)
SELECT languages, COUNT(1) AS n_devices
FROM per_device
GROUP BY languages;"
In [10]:
india_android_langs <- wmf::query_hive(glue(query, .open = "${"))
readr::write_tsv(india_android_langs, "~/mobile-apps-team/india_android_langs.tsv")
In [11]:
india_android_langs <- readr::read_tsv("~/mobile-apps-team/india_android_langs.tsv", col_types = "ci")
In [12]:
prefixes <- polloi::get_prefixes()
languages <- set_names(prefixes$language, prefixes$prefix)
In [13]:
india_android_langs$languages <- strsplit(india_android_langs$languages, ", ") %>%
    purrr::map_chr(~ paste(languages[.x], collapse = ", "))

Top 20 combinations of languages

…of Wikipedia that Android app users in India accessed in the last 90 days:

In [14]:
india_android_langs %>%
    dplyr::mutate(prop_devices = sprintf("%.1f%%", 100 * n_devices / india_android_total)) %>%
    dplyr::top_n(20, n_devices) %>%
    dplyr::arrange(desc(n_devices)) %>%
    dplyr::mutate(n_devices = polloi::compress(n_devices, 2))
languagesn_devicesprop_devices
English 636.12K 74.4%
English, Hindi 69.02K 8.1%
Hindi 15.51K 1.8%
English, Tamil 13.01K 1.5%
English, Malayalam 11.31K 1.3%
English, Simple English 10.55K 1.2%
Bengali, English 9.63K 1.1%
English, Marathi 7.75K 0.9%
English, Telugu 6.83K 0.8%
English, Kannada 4.72K 0.6%
English, Hindi, Marathi 4.33K 0.5%
English, Gujarati 3.48K 0.4%
English, Hindi, Simple English2.76K 0.3%
Malayalam 2.55K 0.3%
Chinese 2.52K 0.3%
English, Gujarati, Hindi 2.17K 0.3%
Marathi 2.13K 0.2%
German, English 1.87K 0.2%
Tamil 1.81K 0.2%
Portuguese 1.55K 0.2%