For the mobile personas project in India, we need to choose what cities/regions to perform research in.
Note: since users can opt out of sharing usage data, the numbers here are all under-counts as there are (presumably) many users who have not opted in.
This Jupyter notebook running on Simple Wikimedia Analytics Platform (SWAP) requires R, an R kernel, and the following packages:
install.packages(c("devtools", "tidyverse", "lubridate", "glue", "zeallot"))
devtools::install_git("https://gerrit.wikimedia.org/r/wikimedia/discovery/wmf")
devtools::install_git("https://gerrit.wikimedia.org/r/wikimedia/discovery/polloi")
library(magrittr)
library(glue)
library(wmf)
library(zeallot)
What cities and regions in India does the Android app get most usage (here defined as sessions in the last 90 days) from?
query <- "-- Language clustering analysis for Wikipedia Android app research in India
SELECT
subdivision AS region, city,
COUNT(DISTINCT uuid) AS n_devices
FROM bearloga.android_active_users
WHERE
country_code = 'IN' -- requests from India
-- use the date from 90 days ago (e.g. '2018-02-09' if running on '2018-05-10'):
AND CONCAT(year, '-', LPAD(month, 2, '0'), '-', LPAD(day, 2, '0')) >= '${start_date}'
AND CONCAT(year, '-', LPAD(month, 2, '0'), '-', LPAD(day, 2, '0')) <= '${end_date}'
GROUP BY subdivision, city;"
suppressPackageStartupMessages(library(lubridate))
end_date <- today() - days(1)
start_date <- end_date - days(90)
# c(start_date, end_date) %<-% c("2018-02-01", "2018-02-02")
india_android_users <- wmf::query_hive(glue(query, .open = "${"))
readr::write_tsv(india_android_users, "~/mobile-apps-team/india_android_users.tsv")
system("gzip --force ~/mobile-apps-team/india_android_users.tsv")
india_android_users <- readr::read_tsv("~/mobile-apps-team/india_android_users.tsv.gz", col_types = "cci")
india_android_total <- wmf::query_hive(glue("SELECT
COUNT(DISTINCT uuid) AS n_devices
FROM bearloga.android_active_users
WHERE country_code = 'IN' -- requests from India
AND CONCAT(year, '-', LPAD(month, 2, '0'), '-', LPAD(day, 2, '0')) >= '${start_date}'
AND CONCAT(year, '-', LPAD(month, 2, '0'), '-', LPAD(day, 2, '0')) <= '${end_date}'
", .open = "${"))$n_devices
india_android_users %>%
dplyr::mutate(prop_devices = sprintf("%.1f%%", 100 * n_devices / india_android_total)) %>%
dplyr::filter(city != "Unknown") %>%
dplyr::top_n(20, n_devices) %>%
dplyr::arrange(desc(n_devices)) %>%
dplyr::mutate(n_devices = polloi::compress(n_devices, 2)) %>%
dplyr::select(-region)
Note: since users can go between cities, n_devices
represents number of devices observed in a city independent of other cities, so there may be overlaps. prop_devices
is the proportion of true total number of unique Android app users in the last 90 days.
india_android_users %>%
dplyr::group_by(region) %>%
dplyr::summarize(n_devices = sum(n_devices)) %>%
dplyr::ungroup() %>%
dplyr::mutate(prop_devices = sprintf("%.1f%%", 100 * n_devices / india_android_total)) %>%
dplyr::filter(region != "Unknown") %>%
dplyr::top_n(10, n_devices) %>%
dplyr::arrange(desc(n_devices)) %>%
dplyr::mutate(n_devices = polloi::compress(n_devices, 2))
What are the most typical language clusters? For example, this report might say that 80% of users from India have looked at only English content in the past 90 days, 10% have looked at both English and Hindi, 5% have looked at English, Hindi and Marathi content, 2.5% have looked at English-Hindi-Kannada, 1% have looked at Marathi-only, etc etc. (I know from looking at stats Jon sent that the highest % of India pageviews go to content in English, and then Hindi. But in a country like India, that actually doesn't mean much - many educated people speak English, and a majority of folks speak Hindi, so that doesn't tell us what other languages they might speak at home, and thus what other languages they might want content in!)
query <- "WITH per_device AS (
SELECT uuid, CONCAT_WS(', ', COLLECT_LIST(DISTINCT wikipedia)) AS languages
FROM bearloga.android_active_users
WHERE country_code = 'IN' -- requests from India
AND CONCAT(year, '-', LPAD(month, 2, '0'), '-', LPAD(day, 2, '0')) >= '${start_date}'
AND CONCAT(year, '-', LPAD(month, 2, '0'), '-', LPAD(day, 2, '0')) <= '${end_date}'
AND wikipedia IS NOT NULL
AND NOT wikipedia IN('', '-')
GROUP BY uuid
)
SELECT languages, COUNT(1) AS n_devices
FROM per_device
GROUP BY languages;"
india_android_langs <- wmf::query_hive(glue(query, .open = "${"))
readr::write_tsv(india_android_langs, "~/mobile-apps-team/india_android_langs.tsv")
india_android_langs <- readr::read_tsv("~/mobile-apps-team/india_android_langs.tsv", col_types = "ci")
prefixes <- polloi::get_prefixes()
languages <- set_names(prefixes$language, prefixes$prefix)
india_android_langs$languages <- strsplit(india_android_langs$languages, ", ") %>%
purrr::map_chr(~ paste(languages[.x], collapse = ", "))
…of Wikipedia that Android app users in India accessed in the last 90 days:
india_android_langs %>%
dplyr::mutate(prop_devices = sprintf("%.1f%%", 100 * n_devices / india_android_total)) %>%
dplyr::top_n(20, n_devices) %>%
dplyr::arrange(desc(n_devices)) %>%
dplyr::mutate(n_devices = polloi::compress(n_devices, 2))