As of 2022-11-14, the Growth team uses an inconsistent definition of mentees in the Mentees overview module of the Mentor dashboard. Mentors see mentees who edited in the last 6 months, or registered in the last 14 days.
This can lead to biased conclusions, like "the majority of my newcomers are editing". Even though we'd love the statement to be true, we should provide accurate information regardless. It was suggested to include all newcomers who edited or registered in the last 6 months in the M1 module, resulting in newcomers represented proportionally in the dashboard.
The notebook exists to estimate how many rows will be added to growthexperiments_mentee_data
DB table. It does so by going through all mentees (growthexperiments_mentor_mentee
table) and calculating how many have a matching row in the _data
table and how many would have one using the newly proposed business rules.
from datetime import datetime, timedelta
import requests
from wmfdata import spark, mariadb, utils
import pandas as pd
import numpy as np
wiki_df = {}
def get_df_for_wiki(wiki, domain):
if wiki in wiki_df:
return wiki_df[wiki]
mentor_list = requests.get('https://%s/w/index.php?title=MediaWiki:GrowthMentors.json&action=raw&ctype=application/json' % domain).json()['Mentors']
menteesDf = mariadb.run('''
SELECT
gemm_mentee_id,
gemm_mentor_id,
mentee_data
FROM growthexperiments_mentor_mentee
LEFT JOIN growthexperiments_mentee_data ON mentee_id=gemm_mentee_id
WHERE
gemm_mentor_role = 'primary'
''', wiki, use_x1=True)
menteesDf['mentored_by_current_mentor'] = menteesDf.gemm_mentor_id.apply(lambda x: str(x) in mentor_list)
menteesDf.set_index('gemm_mentee_id', inplace=True)
usersDf = mariadb.run('''
SELECT
user_id, user_name, COALESCE(user_registration, '') AS timestamp_registration
FROM user
WHERE user_id IN ({ids})
'''.format(ids=', '.join([str(x) for x in menteesDf.index])), wiki)
usersDf['timestamp_registration'] = pd.to_datetime(usersDf.timestamp_registration)
usersDf.set_index('user_id', inplace=True)
editsDf = mariadb.run('''
SELECT actor_user, MAX(rev_timestamp) AS timestamp_latest_edit
FROM revision
JOIN actor ON actor_id=rev_actor
WHERE
actor_user IN ({ids})
GROUP BY actor_user
'''.format(ids=', '.join([str(x) for x in menteesDf.index])), wiki)
editsDf.set_index('actor_user', inplace=True)
editsDf['timestamp_latest_edit'] = pd.to_datetime(editsDf.timestamp_latest_edit)
df = menteesDf.join(usersDf).join(editsDf)
df['timestamp_last_activity'] = df.filter(like='timestamp_').max(1)
df['last_active_at'] = (datetime.now() - df.timestamp_last_activity)
df['should_be_in_dashboard'] = (df.last_active_at <= timedelta(days=30*6)) & (df.mentored_by_current_mentor)
df['is_in_dashboard'] = df.mentee_data.apply(lambda x: x is not None)
wiki_df[wiki] = df
return df
wikis = spark.run('''
SELECT
database_code,
domain_name
FROM canonical_data.wikis WHERE database_code IN (
'arwiki',
'bnwiki',
'cswiki',
'dewiki',
'frwiki',
'fawiki'
)
''')
data = {
'wiki': [],
'should_be_in_dashboard': [],
'is_in_dashboard': [],
}
for _, wiki in wikis.iterrows():
df = get_df_for_wiki(wiki.database_code, wiki.domain_name)
data['wiki'].append(wiki.domain_name)
should_be_in_dashboard = df.loc[df.should_be_in_dashboard == True].user_name.count()
is_in_dashboard = df.loc[df.is_in_dashboard == True].user_name.count()
data['should_be_in_dashboard'].append(int(should_be_in_dashboard))
data['is_in_dashboard'].append(int(is_in_dashboard))
aggDf = pd.DataFrame(data)
aggDf['diff'] = aggDf.should_be_in_dashboard - aggDf.is_in_dashboard
aggDf['diff_%'] = np.round(aggDf.should_be_in_dashboard / aggDf.is_in_dashboard * 100, 2)
aggDf
PySpark executors will use /usr/lib/anaconda-wmf/bin/python3.
PYSPARK_PYTHON=/usr/lib/anaconda-wmf/bin/python3
SLF4J: Class path contains multiple SLF4J bindings. SLF4J: Found binding in [jar:file:/usr/lib/spark2/jars/slf4j-log4j12-1.7.16.jar!/org/slf4j/impl/StaticLoggerBinder.class] SLF4J: Found binding in [jar:file:/usr/lib/hadoop/lib/slf4j-reload4j-1.7.36.jar!/org/slf4j/impl/StaticLoggerBinder.class] SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation. SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory] Setting default log level to "WARN". To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel). 22/11/15 15:46:20 WARN SparkConf: Note that spark.local.dir will be overridden by the value set by the cluster manager (via SPARK_LOCAL_DIRS in mesos/standalone/kubernetes and LOCAL_DIRS in YARN). 22/11/15 15:46:21 WARN Utils: Service 'sparkDriver' could not bind on port 12000. Attempting port 12001. 22/11/15 15:46:22 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041. 22/11/15 15:46:32 WARN Utils: Service 'org.apache.spark.network.netty.NettyBlockTransferService' could not bind on port 13000. Attempting port 13001.
wiki | should_be_in_dashboard | is_in_dashboard | diff | diff_% | |
---|---|---|---|---|---|
0 | ar.wikipedia.org | 55160 | 17893 | 37267 | 308.28 |
1 | bn.wikipedia.org | 11221 | 2675 | 8546 | 419.48 |
2 | cs.wikipedia.org | 7940 | 3291 | 4649 | 241.26 |
3 | de.wikipedia.org | 39126 | 15052 | 24074 | 259.94 |
4 | fa.wikipedia.org | 29238 | 11844 | 17394 | 246.86 |
5 | fr.wikipedia.org | 68675 | 31458 | 37217 | 218.31 |
wiki_df['bnwiki'].loc[wiki_df['bnwiki'].should_be_in_dashboard == True].sample(20)
gemm_mentor_id | mentee_data | mentored_by_current_mentor | user_name | timestamp_registration | timestamp_latest_edit | timestamp_last_activity | last_active_at | should_be_in_dashboard | is_in_dashboard | |
---|---|---|---|---|---|---|---|---|---|---|
gemm_mentee_id | ||||||||||
395648 | 363460 | None | True | Hakim Mondal | 2022-09-21 18:12:26 | NaT | 2022-09-21 18:12:26 | 54 days 21:36:07.112878 | True | False |
381562 | 182715 | {"username":"\u09b6\u09b9\u09bf\u09a6\u09c1\u0... | True | শহিদুল ইসলাম কবির | 2022-05-30 15:34:08 | 2022-05-30 15:49:05 | 2022-05-30 15:49:05 | 168 days 23:59:28.112878 | True | True |
392992 | 44230 | None | True | Aadhunik duniya fact | 2022-09-01 15:56:18 | NaT | 2022-09-01 15:56:18 | 74 days 23:52:15.112878 | True | False |
389409 | 154396 | {"username":"Sheikh379","reverted":0,"question... | True | Sheikh379 | 2022-08-02 18:08:43 | 2022-08-02 18:39:13 | 2022-08-02 18:39:13 | 104 days 21:09:20.112878 | True | True |
385000 | 182715 | None | True | Alamin 577 | 2022-06-26 04:29:11 | NaT | 2022-06-26 04:29:11 | 142 days 11:19:22.112878 | True | False |
401571 | 363460 | {"username":"Kakoli aktar","reverted":0,"quest... | True | Kakoli aktar | 2022-11-08 03:20:05 | NaT | 2022-11-08 03:20:05 | 7 days 12:28:28.112878 | True | True |
380585 | 154396 | None | True | Alateautumnflower | 2022-05-22 18:55:49 | NaT | 2022-05-22 18:55:49 | 176 days 20:52:44.112878 | True | False |
385195 | 157985 | {"username":"NayeemSalafi","reverted":1,"quest... | True | NayeemSalafi | 2022-06-27 10:10:21 | 2022-06-27 10:44:11 | 2022-06-27 10:44:11 | 141 days 05:04:22.112878 | True | True |
388825 | 157985 | None | True | মোঃ হাসিবুর রহমান হুসাইন | 2022-07-28 15:29:12 | NaT | 2022-07-28 15:29:12 | 110 days 00:19:21.112878 | True | False |
367412 | 181938 | {"username":"\u09ae\u09cb \u09a4\u09be\u09b8\u... | True | মো তাসীন আলম | 2022-02-08 15:50:09 | 2022-09-08 03:37:19 | 2022-09-08 03:37:19 | 68 days 12:11:14.112878 | True | True |
380455 | 154396 | {"username":"M Badhon Khan","reverted":1,"ques... | True | M Badhon Khan | 2022-05-21 18:07:21 | 2022-05-21 18:11:37 | 2022-05-21 18:11:37 | 177 days 21:36:56.112878 | True | True |
363986 | 42623 | {"username":"Md.Mithun Hasan Jakaria","reverte... | True | Md.Mithun Hasan Jakaria | 2022-01-14 16:54:39 | 2022-09-05 11:25:34 | 2022-09-05 11:25:34 | 71 days 04:22:59.112878 | True | True |
385470 | 42623 | None | True | Mdjayad88 | 2022-06-29 06:20:37 | NaT | 2022-06-29 06:20:37 | 139 days 09:27:56.112878 | True | False |
395294 | 363460 | None | True | Md Sharif Chowdhury | 2022-09-19 04:13:42 | NaT | 2022-09-19 04:13:42 | 57 days 11:34:51.112878 | True | False |
382836 | 182715 | None | True | Labib Hasan Prio | 2022-06-09 17:49:49 | NaT | 2022-06-09 17:49:49 | 158 days 21:58:44.112878 | True | False |
395904 | 157985 | None | True | Souravsahax | 2022-09-23 17:25:28 | NaT | 2022-09-23 17:25:28 | 52 days 22:23:05.112878 | True | False |
393537 | 42623 | {"username":"S.M. OMAR FARUQ","reverted":0,"qu... | True | S.M. OMAR FARUQ | 2022-09-05 17:27:19 | 2022-09-05 17:38:26 | 2022-09-05 17:38:26 | 70 days 22:10:07.112878 | True | True |
380332 | 159620 | None | True | Townbd | 2022-05-20 20:02:45 | NaT | 2022-05-20 20:02:45 | 178 days 19:45:48.112878 | True | False |
397561 | 157985 | {"username":"Mosud khan pabna","reverted":16,"... | True | Mosud khan pabna | 2022-10-07 03:17:11 | 2022-10-07 19:54:25 | 2022-10-07 19:54:25 | 38 days 19:54:08.112878 | True | True |
388034 | 154396 | None | True | Mehadi Hasan Faisal | 2022-07-21 20:24:20 | NaT | 2022-07-21 20:24:20 | 116 days 19:24:13.112878 | True | False |