Wikipedia Android app users in India¶

For the mobile personas project in India, we need to choose what cities/regions to perform research in.

Note: since users can opt out of sharing usage data, the numbers here are all under-counts as there are (presumably) many users who have not opted in.

Setup¶

This Jupyter notebook running on Simple Wikimedia Analytics Platform (SWAP) requires R, an R kernel, and the following packages:

install.packages(c("devtools", "tidyverse", "lubridate", "glue", "zeallot"))
devtools::install_git("https://gerrit.wikimedia.org/r/wikimedia/discovery/wmf")
devtools::install_git("https://gerrit.wikimedia.org/r/wikimedia/discovery/polloi")

library(magrittr)
library(glue)
library(wmf)
library(zeallot)

Cities and Regions¶

What cities and regions in India does the Android app get most usage (here defined as sessions in the last 90 days) from?

query <- "-- Language clustering analysis for Wikipedia Android app research in India
SELECT
  subdivision AS region, city,
  COUNT(DISTINCT uuid) AS n_devices
FROM bearloga.android_active_users
WHERE
  country_code = 'IN' -- requests from India
  -- use the date from 90 days ago (e.g. '2018-02-09' if running on '2018-05-10'):
  AND CONCAT(year, '-', LPAD(month, 2, '0'), '-', LPAD(day, 2, '0')) >= '${start_date}'
  AND CONCAT(year, '-', LPAD(month, 2, '0'), '-', LPAD(day, 2, '0')) <= '${end_date}'
GROUP BY subdivision, city;"

suppressPackageStartupMessages(library(lubridate))
end_date <- today() - days(1)
start_date <- end_date - days(90)
# c(start_date, end_date) %<-% c("2018-02-01", "2018-02-02")

india_android_users <- wmf::query_hive(glue(query, .open = "${"))
readr::write_tsv(india_android_users, "~/mobile-apps-team/india_android_users.tsv")
system("gzip --force ~/mobile-apps-team/india_android_users.tsv")

india_android_users <- readr::read_tsv("~/mobile-apps-team/india_android_users.tsv.gz", col_types = "cci")

india_android_total <- wmf::query_hive(glue("SELECT
  COUNT(DISTINCT uuid) AS n_devices
FROM bearloga.android_active_users
WHERE country_code = 'IN' -- requests from India
  AND CONCAT(year, '-', LPAD(month, 2, '0'), '-', LPAD(day, 2, '0')) >= '${start_date}'
  AND CONCAT(year, '-', LPAD(month, 2, '0'), '-', LPAD(day, 2, '0')) <= '${end_date}'
", .open = "${"))$n_devices

Top 20 cities by number of Wikipedia Android app users¶

india_android_users %>%
    dplyr::mutate(prop_devices = sprintf("%.1f%%", 100 * n_devices / india_android_total)) %>%
    dplyr::filter(city != "Unknown") %>%
    dplyr::top_n(20, n_devices) %>%
    dplyr::arrange(desc(n_devices)) %>%
    dplyr::mutate(n_devices = polloi::compress(n_devices, 2)) %>%
    dplyr::select(-region)

Note: since users can go between cities, n_devices represents number of devices observed in a city independent of other cities, so there may be overlaps. prop_devices is the proportion of true total number of unique Android app users in the last 90 days.

Top 10 regions by number of devices¶

india_android_users %>%
    dplyr::group_by(region) %>%
    dplyr::summarize(n_devices = sum(n_devices)) %>%
    dplyr::ungroup() %>%
    dplyr::mutate(prop_devices = sprintf("%.1f%%", 100 * n_devices / india_android_total)) %>%
    dplyr::filter(region != "Unknown") %>%
    dplyr::top_n(10, n_devices) %>%
    dplyr::arrange(desc(n_devices)) %>%
    dplyr::mutate(n_devices = polloi::compress(n_devices, 2))

Language Combinations¶

What are the most typical language clusters? For example, this report might say that 80% of users from India have looked at only English content in the past 90 days, 10% have looked at both English and Hindi, 5% have looked at English, Hindi and Marathi content, 2.5% have looked at English-Hindi-Kannada, 1% have looked at Marathi-only, etc etc. (I know from looking at stats Jon sent that the highest % of India pageviews go to content in English, and then Hindi. But in a country like India, that actually doesn't mean much - many educated people speak English, and a majority of folks speak Hindi, so that doesn't tell us what other languages they might speak at home, and thus what other languages they might want content in!)

query <- "WITH per_device AS (
    SELECT uuid, CONCAT_WS(', ', COLLECT_LIST(DISTINCT wikipedia)) AS languages
    FROM bearloga.android_active_users
    WHERE country_code = 'IN' -- requests from India
      AND CONCAT(year, '-', LPAD(month, 2, '0'), '-', LPAD(day, 2, '0')) >= '${start_date}'
      AND CONCAT(year, '-', LPAD(month, 2, '0'), '-', LPAD(day, 2, '0')) <= '${end_date}'
      AND wikipedia IS NOT NULL
      AND NOT wikipedia IN('', '-')
    GROUP BY uuid
)
SELECT languages, COUNT(1) AS n_devices
FROM per_device
GROUP BY languages;"

india_android_langs <- wmf::query_hive(glue(query, .open = "${"))
readr::write_tsv(india_android_langs, "~/mobile-apps-team/india_android_langs.tsv")

india_android_langs <- readr::read_tsv("~/mobile-apps-team/india_android_langs.tsv", col_types = "ci")

prefixes <- polloi::get_prefixes()
languages <- set_names(prefixes$language, prefixes$prefix)

india_android_langs$languages <- strsplit(india_android_langs$languages, ", ") %>%
    purrr::map_chr(~ paste(languages[.x], collapse = ", "))

Top 20 combinations of languages¶

…of Wikipedia that Android app users in India accessed in the last 90 days:

india_android_langs %>%
    dplyr::mutate(prop_devices = sprintf("%.1f%%", 100 * n_devices / india_android_total)) %>%
    dplyr::top_n(20, n_devices) %>%
    dplyr::arrange(desc(n_devices)) %>%
    dplyr::mutate(n_devices = polloi::compress(n_devices, 2))

city	n_devices	prop_devices
Bengaluru	108.05K	12.6%
Delhi	93.55K	10.9%
Pune	85.7K	10.0%
Chennai	77.69K	9.1%
Mumbai	64.03K	7.5%
Hyderabad	63.6K	7.4%
Kolkata	58.74K	6.9%
Ahmedabad	51.34K	6.0%
New Delhi	48.29K	5.6%
Lucknow	46.47K	5.4%
Patna	45.2K	5.3%
Bhubaneswar	34.14K	4.0%
Jaipur	31.39K	3.7%
Bhopal	30.56K	3.6%
Ernakulam	29.5K	3.4%
Kochi	26.82K	3.1%
Indore	26.62K	3.1%
Noida	25.95K	3.0%
Coimbatore	22.02K	2.6%
Thane	20K	2.3%

region	n_devices	prop_devices
Maharashtra	212.04K	24.8%
Tamil Nadu	203.33K	23.8%
Uttar Pradesh	198.02K	23.1%
Kerala	167.2K	19.5%
Karnataka	152.31K	17.8%
West Bengal	148.33K	17.3%
National Capital Territory of Delhi	146.41K	17.1%
Gujarat	121.16K	14.2%
Madhya Pradesh	80.72K	9.4%
Bihar	79.77K	9.3%

languages	n_devices	prop_devices
English	636.12K	74.4%
English, Hindi	69.02K	8.1%
Hindi	15.51K	1.8%
English, Tamil	13.01K	1.5%
English, Malayalam	11.31K	1.3%
English, Simple English	10.55K	1.2%
Bengali, English	9.63K	1.1%
English, Marathi	7.75K	0.9%
English, Telugu	6.83K	0.8%
English, Kannada	4.72K	0.6%
English, Hindi, Marathi	4.33K	0.5%
English, Gujarati	3.48K	0.4%
English, Hindi, Simple English	2.76K	0.3%
Malayalam	2.55K	0.3%
Chinese	2.52K	0.3%
English, Gujarati, Hindi	2.17K	0.3%
Marathi	2.13K	0.2%
German, English	1.87K	0.2%
Tamil	1.81K	0.2%
Portuguese	1.55K	0.2%