« back

MySQL and Hive from PAWS Internal (R Edition)

This is the R version of Madhu's tutorial notebook for accessing MySQL and Hive from PAWS Internal using Python.

Prereqs

Assuming you've got production access and SSH configured (see Discovery/Analytics on Office wiki for more examples of SSH configs), you need to create an SSH tunnel like you would if you wanted to query Analytics-Store on your local machine:

ssh -N notebook1001.eqiad.wmnet -L 8000:127.0.0.1:8000

Then navigate to localhost:8000 in your favorite browser and login with your LDAP credentials (username/password that you use to login to Wikitech).

By the way, if you want a quick way to get into PAWS Internal, you can make an alias (in, say, ~/.bash_profile) that creates the SSH tunnel and opens the browser:

alias paws="ssh -N notebook1001.eqiad.wmnet -L 8000:127.0.0.1:8000"
alias PAWS="ssh -N notebook1001.eqiad.wmnet -L 8000:127.0.0.1:8000 & open http://localhost:8000/"

Then in terminal: PAWS or paws if you want to go to localhost:8000 yourself.

PAWS will launch your default browser and output a numeric process ID. When you want to close the tunnel: kill [pid]

In [1]:
Sys.setenv("http_proxy" = "http://webproxy.eqiad.wmnet:8080")
Sys.setenv("https_proxy" = "http://webproxy.eqiad.wmnet:8080")
options(repos = c(CRAN = "https://cran.rstudio.com/"))

If you want, you can put those 3 lines in ~/.Rprofile and they will be executed every time you launch R.

Dependencies

We need to install the devtools and Discovery's wmf packages. If you want to work with user agents in R, I've included the command to install the uaparser package.

install.packages("devtools")
devtools::install_git("https://gerrit.wikimedia.org/r/wikimedia/discovery/wmf")

# uaparser:
devtools::install_github("ua-parser/uap-r", configure.args = "-I/usr/include/yaml-cpp -I/usr/include/boost")

MySQL

The mysql_connect function in wmf will try to look for some common MySQL config files (those vary between stat1002, stat1003, and notebook1001). It'll let you know if it encounters any problems but I doubt you'll have any. Try querying with mysql_read:

In [3]:
log_tables <- wmf::mysql_read("SHOW TABLES;", "log")
head(log_tables)
Fetched 375 rows and 1 columns.
Tables_in_log
BannerImpression_5329872
CentralAuth_5690875
CentralNoticeBannerHistory_13447710
ChangesListFilters_15876023
ChangesListFilters_16174591
CommandInvocation_15237653

Hive

Since hive has been configured on notebook1001, we don't need to do anything extra to use the query_hive function in wmf:

In [5]:
wmf_tables <- wmf::query_hive("USE wmf; SHOW TABLES;")
head(wmf_tables)
tab_name
aqs_hourly
browser_general
last_access_uniques_daily
last_access_uniques_monthly
mediacounts
mediawiki_archive