Date: 2023-09-04
Author: Martin Urbanec, WMF (murbanec@wikimedia.org)
License: Apache License
See T345484: Data on usage of StructuredDiscussions (Flow) for details.
For the purposes of this report, "usage of Flow" means making For the purposes of this report, "usage of Flow" means making an edit in the Topic namespace on a particular wiki (such as, adding a new post, contributing to a thread or editing an existing post). Reading a Flow-based discussion, deleting/hiding/suppressing a topic/post, editing a board/thread summary or similar are not counted as Flow usage in this report. For Discussion tools, "usage" means an edit to a talk page that is tagged with DiscussionTools tags.
This decision was made to make numbers easier to compare with non-Flow discussion tools: for example, while those concepts can be emulated in DiscussionTools, there is no native support for hiding or summarizing a thread in DiscussionTools.
All data and counts in this report are based on August 2023 data.
from wmfdata import mariadb, spark, utils
import pandas as pd
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)
You are using wmfdata v1.4.0, but v2.0.0 is available. To update, run `pip install --upgrade git+https://github.com/wikimedia/wmfdata-python.git@release --ignore-installed`. To see the changes, refer to https://github.com/wikimedia/wmfdata-python/blob/release/CHANGELOG.md
Query the flowdb
database and get necessary data around Flow revisions.
Sources:
dfAllFlow = mariadb.run('''
WITH flow_revision_ts AS (
SELECT
workflow_wiki,
workflow_page_id,
workflow_namespace,
workflow_title_text,
date_format(from_unixtime((conv(substring(hex(rev_id), 1, 12), 16, 10) >> 2) / 1000),"%Y%m%d%H%i%S") AS rev_timestamp,
rev_change_type,
rev_user_id,
rev_user_ip
FROM flow_workflow
INNER JOIN flow_tree_node ON workflow_id = tree_ancestor_id
INNER JOIN flow_tree_revision ON tree_descendant_id = tree_rev_descendant_id
INNER JOIN flow_revision ON tree_rev_id = rev_id
)
SELECT *
FROM flow_revision_ts
WHERE
rev_timestamp LIKE '202308%'
-- the other post we're interested in is `new-post`, but those entries
-- are also automatically entered as `reply`, so condition to one to avoid
-- duplication.
AND rev_change_type = 'reply'
''', 'flowdb', use_x1=True)
dfAllFlow['rev_timestamp_day'] = pd.to_datetime(dfAllFlow.rev_timestamp.apply(lambda x: x[:8]))
dfAllFlow.head(2)
workflow_wiki | workflow_page_id | workflow_namespace | workflow_title_text | rev_timestamp | rev_change_type | rev_user_id | rev_user_ip | rev_timestamp_day | |
---|---|---|---|---|---|---|---|---|---|
0 | mediawikiwiki | 464252 | 0 | LiquidThreads_Test_Page | 20230828062231 | reply | 0 | 2001:16A2:C0B9:8A0D:7D2C:6931:3C4B:D1A1 | 2023-08-28 |
1 | mediawikiwiki | 464366 | 101 | Coding_conventions | 20230819163244 | reply | 0 | 37.111.211.157 | 2023-08-19 |
dfAllDT = spark.run('''
SELECT *
FROM event.mediawiki_revision_tags_change
WHERE
year = 2023
AND month = 8
AND ARRAY_CONTAINS(tags, 'discussiontools')
''')
dfAllDT['rev_timestamp'] = dfAllDT.meta.apply(lambda x: x.dt)
dfAllDT['rev_timestamp_day'] = pd.to_datetime(dfAllDT.rev_timestamp.apply(lambda x: x.split('T')[0]))
dfAllDT.head(2)
PySpark executors will use /opt/conda-analytics/bin/python3.
PYSPARK_PYTHON=/opt/conda-analytics/bin/python3
SLF4J: Class path contains multiple SLF4J bindings. SLF4J: Found binding in [jar:file:/usr/lib/spark2/jars/slf4j-log4j12-1.7.16.jar!/org/slf4j/impl/StaticLoggerBinder.class] SLF4J: Found binding in [jar:file:/usr/lib/hadoop/lib/slf4j-reload4j-1.7.36.jar!/org/slf4j/impl/StaticLoggerBinder.class] SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation. SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory] Setting default log level to "WARN". To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel). 23/09/13 15:39:01 WARN SparkConf: Note that spark.local.dir will be overridden by the value set by the cluster manager (via SPARK_LOCAL_DIRS in mesos/standalone/kubernetes and LOCAL_DIRS in YARN). 23/09/13 15:39:01 WARN Utils: Service 'sparkDriver' could not bind on port 12000. Attempting port 12001. 23/09/13 15:39:01 WARN Utils: Service 'sparkDriver' could not bind on port 12001. Attempting port 12002. 23/09/13 15:39:01 WARN Utils: Service 'sparkDriver' could not bind on port 12002. Attempting port 12003. 23/09/13 15:39:01 WARN Utils: Service 'sparkDriver' could not bind on port 12003. Attempting port 12004. 23/09/13 15:39:01 WARN Utils: Service 'sparkDriver' could not bind on port 12004. Attempting port 12005. 23/09/13 15:39:01 WARN Utils: Service 'sparkDriver' could not bind on port 12005. Attempting port 12006. 23/09/13 15:39:02 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041. 23/09/13 15:39:02 WARN Utils: Service 'SparkUI' could not bind on port 4041. Attempting port 4042. 23/09/13 15:39:02 WARN Utils: Service 'SparkUI' could not bind on port 4042. Attempting port 4043. 23/09/13 15:39:02 WARN Utils: Service 'SparkUI' could not bind on port 4043. Attempting port 4044. 23/09/13 15:39:02 WARN Utils: Service 'SparkUI' could not bind on port 4044. Attempting port 4045. 23/09/13 15:39:02 WARN Utils: Service 'SparkUI' could not bind on port 4045. Attempting port 4046. 23/09/13 15:39:11 WARN Utils: Service 'org.apache.spark.network.netty.NettyBlockTransferService' could not bind on port 13000. Attempting port 13001. 23/09/13 15:39:11 WARN Utils: Service 'org.apache.spark.network.netty.NettyBlockTransferService' could not bind on port 13001. Attempting port 13002. 23/09/13 15:39:11 WARN Utils: Service 'org.apache.spark.network.netty.NettyBlockTransferService' could not bind on port 13002. Attempting port 13003. 23/09/13 15:39:11 WARN Utils: Service 'org.apache.spark.network.netty.NettyBlockTransferService' could not bind on port 13003. Attempting port 13004. 23/09/13 15:39:11 WARN Utils: Service 'org.apache.spark.network.netty.NettyBlockTransferService' could not bind on port 13004. Attempting port 13005. 23/09/13 15:39:11 WARN Utils: Service 'org.apache.spark.network.netty.NettyBlockTransferService' could not bind on port 13005. Attempting port 13006. 23/09/13 15:39:12 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: Attempted to request executors before the AM has registered! 23/09/13 15:39:27 WARN Utils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.debug.maxToStringFields' in SparkEnv.conf. 23/09/13 15:39:27 WARN NioEventLoop: Selector.select() returned prematurely 512 times in a row; rebuilding Selector io.netty.channel.nio.SelectedSelectionKeySetSelector@228d0120.
comment | database | meta | page_id | page_is_redirect | page_namespace | page_title | parsedcomment | performer | prior_state | rev_content_format | rev_content_model | rev_id | rev_len | rev_minor_edit | rev_sha1 | rev_timestamp | tags | rev_parent_id | _schema | chronology_id | is_wmf_domain | normalized_host | datacenter | year | month | day | hour | rev_timestamp_day | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | /* Monumentenglas */ Reactie: koof met oud glas | nlwiki | (nl.wikipedia.org, 2023-08-16T13:58:53Z, f2296471-5591-4282-b385-6574840aaf85, cef7dd70-cd85-4630-999f-cf20c11bfb8e, None, None, https://nl.wikipedia.org/wiki/Overleg_gebruiker:Hoyanova, mediawiki.revision-tags-change) | 5374614 | False | 3 | Overleg_gebruiker:Hoyanova | <span dir="auto"><span class="autocomment">Monumentenglas: </span> Reactie: koof met oud glas</span> | (41928, [extendedconfirmed, *, user, autoconfirmed], 143793, False, 2007-12-06T10:27:29Z, Bertux) | ([],) | text/x-wiki | wikitext | 65212434 | 10307 | False | aoe4at8jmcya4qwo84j41038ska5xqr | 2023-08-16T13:58:53Z | [discussiontools, discussiontools-reply, discussiontools-source, discussiontools-source-enhanced] | 65212399.0 | None | None | True | (wikipedia, nl, [], org, wikipedia) | eqiad | 2023 | 8 | 16 | 13 | 2023-08-16 |
1 | /* Castellaro (San Marino) */ Risposta | itwiki | (it.wikipedia.org, 2023-08-16T13:54:10Z, 03912e63-263e-4bda-9413-54278c956131, 81e55aed-fb5e-4fca-b593-3a650fa9b805, None, None, https://it.wikipedia.org/wiki/Wikipedia:Pagine_da_cancellare/Castellaro_(San_Marino), mediawiki.revision-tags-change) | 9860792 | False | 4 | Wikipedia:Pagine_da_cancellare/Castellaro_(San_Marino) | <span dir="auto"><span class="autocomment">Castellaro (San Marino): </span> Risposta</span> | (4563, [autopatrolled, *, user, autoconfirmed], 578661, False, 2010-12-06T12:01:09Z, Sciking) | ([],) | text/x-wiki | wikitext | 134986370 | 987 | False | fx2julv06tvkxgw8dndl8ruosvraols | 2023-08-16T13:54:10Z | [discussiontools, discussiontools-reply, discussiontools-source, discussiontools-source-enhanced] | 134964065.0 | None | None | True | (wikipedia, it, [], org, wikipedia) | eqiad | 2023 | 8 | 16 | 13 | 2023-08-16 |
wikis_liquidthreads = ['enwikinews', 'enwiktionary', 'huwiki', 'ptwikibooks', 'svwikisource']
dfs = []
for wiki in wikis_liquidthreads:
tmpDf = mariadb.run('''
SELECT *
FROM revision
JOIN page ON rev_page=page_id
WHERE
page_namespace=90
AND rev_timestamp LIKE '202308%'
''', wiki)
tmpDf['wiki'] = wiki
dfs.append(tmpDf)
dfLTAll = pd.concat(dfs)
dfLTAll['rev_timestamp_day'] = pd.to_datetime(dfLTAll.rev_timestamp.apply(lambda x: x[:8]))
dfLTAll.head(2)
rev_id | rev_page | rev_comment_id | rev_actor | rev_timestamp | rev_minor_edit | rev_deleted | rev_len | rev_parent_id | rev_sha1 | page_id | page_namespace | page_title | page_is_redirect | page_is_new | page_random | page_touched | page_links_updated | page_latest | page_len | page_content_model | page_lang | wiki | rev_timestamp_day | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 4739725 | 2976832 | 424341 | 2888715 | 20230801210048 | 0 | 0 | 623 | 0 | 69w7d3oh1uc7ziq593fyo5phdf54g31 | 2976832 | 90 | Comments:City_of_Calgary,_Canada_removes_fluoride_from_drinking_water/Flouride/reply_(3) | 0 | 1 | 0.997647 | 20230801210049 | 20230801210048 | 4739725 | 623 | wikitext | None | enwikinews | 2023-08-01 |
1 | 4739739 | 2976842 | 1109458 | 2963947 | 20230801233431 | 0 | 0 | 104 | 0 | nb0chmukvjqr0whxrp2diq9cwzy7uzu | 2976842 | 90 | Comments:Heatwaves_surge_worldwide_as_researchers'_analysis_indicates_global_temperatures_reaching_new_highs/3_random_days_are_not_a_"wave". | 0 | 1 | 0.060584 | 20230801233431 | 20230801233431 | 4739739 | 104 | wikitext | None | enwikinews | 2023-08-01 |
Flow is most frequently used on the following wikis:
dfAllFlow[['workflow_wiki', 'rev_timestamp']]\
.rename(columns={'rev_timestamp': 'revisions'})\
.groupby('workflow_wiki')\
.count()\
.sort_values('revisions', ascending=False)\
.head(5)
revisions | |
---|---|
workflow_wiki | |
frwiki | 2931 |
mediawikiwiki | 1401 |
arwiki | 663 |
wikidatawiki | 608 |
cawiki | 500 |
dfAgg = dfAllFlow[['rev_timestamp_day', 'rev_timestamp']].rename(columns={'rev_timestamp': 'revisions'}).groupby('rev_timestamp_day').count()
dfAgg
revisions | |
---|---|
rev_timestamp_day | |
2023-08-01 | 265 |
2023-08-02 | 286 |
2023-08-03 | 191 |
2023-08-04 | 182 |
2023-08-05 | 201 |
2023-08-06 | 150 |
2023-08-07 | 417 |
2023-08-08 | 220 |
2023-08-09 | 188 |
2023-08-10 | 176 |
2023-08-11 | 201 |
2023-08-12 | 193 |
2023-08-13 | 218 |
2023-08-14 | 280 |
2023-08-15 | 277 |
2023-08-16 | 170 |
2023-08-17 | 241 |
2023-08-18 | 246 |
2023-08-19 | 218 |
2023-08-20 | 288 |
2023-08-21 | 458 |
2023-08-22 | 219 |
2023-08-23 | 194 |
2023-08-24 | 287 |
2023-08-25 | 198 |
2023-08-26 | 200 |
2023-08-27 | 219 |
2023-08-28 | 484 |
2023-08-29 | 221 |
2023-08-30 | 379 |
2023-08-31 | 307 |
dfAgg.plot(figsize=(15,5));
dfAgg = dfAllDT[['rev_timestamp_day', 'rev_timestamp']].rename(columns={'rev_timestamp': 'revisions'}).groupby('rev_timestamp_day').count()
dfAgg
revisions | |
---|---|
rev_timestamp_day | |
2023-08-01 | 19797 |
2023-08-02 | 18514 |
2023-08-03 | 18199 |
2023-08-04 | 18125 |
2023-08-05 | 16673 |
2023-08-06 | 17671 |
2023-08-07 | 18934 |
2023-08-08 | 19808 |
2023-08-09 | 19501 |
2023-08-10 | 19080 |
2023-08-11 | 18373 |
2023-08-12 | 16897 |
2023-08-13 | 17303 |
2023-08-14 | 19984 |
2023-08-15 | 19489 |
2023-08-16 | 19259 |
2023-08-17 | 19239 |
2023-08-18 | 19064 |
2023-08-19 | 17370 |
2023-08-20 | 18058 |
2023-08-21 | 19271 |
2023-08-22 | 18974 |
2023-08-23 | 20046 |
2023-08-24 | 19085 |
2023-08-25 | 18389 |
2023-08-26 | 16991 |
2023-08-27 | 17711 |
2023-08-28 | 20515 |
2023-08-29 | 20414 |
2023-08-30 | 20473 |
2023-08-31 | 19058 |
dfAgg.plot(figsize=(15,5));
NOTE: LiquidThreads is enabled only on 5 Wikimedia wikis (compared to Flow, which is enabled on 48 wikis and DiscussionTools, which is enabled everywhere). Its usage is low for that reason.
List of LiquidThreads-enabled wikis:
dfAgg = dfLTAll[['rev_timestamp_day', 'rev_timestamp']].rename(columns={'rev_timestamp': 'revisions'}).groupby('rev_timestamp_day').count()
dfAgg = dfAgg.reindex(pd.date_range('2023-08-01', '2023-08-31'), fill_value=0)
dfAgg
revisions | |
---|---|
2023-08-01 | 2 |
2023-08-02 | 0 |
2023-08-03 | 0 |
2023-08-04 | 0 |
2023-08-05 | 1 |
2023-08-06 | 0 |
2023-08-07 | 0 |
2023-08-08 | 2 |
2023-08-09 | 0 |
2023-08-10 | 0 |
2023-08-11 | 0 |
2023-08-12 | 1 |
2023-08-13 | 0 |
2023-08-14 | 1 |
2023-08-15 | 0 |
2023-08-16 | 6 |
2023-08-17 | 1 |
2023-08-18 | 1 |
2023-08-19 | 0 |
2023-08-20 | 0 |
2023-08-21 | 0 |
2023-08-22 | 0 |
2023-08-23 | 0 |
2023-08-24 | 0 |
2023-08-25 | 0 |
2023-08-26 | 0 |
2023-08-27 | 0 |
2023-08-28 | 0 |
2023-08-29 | 1 |
2023-08-30 | 0 |
2023-08-31 | 1 |
dfAgg.plot(figsize=(15,5));
The Phabricator task does not make it clear whether "most active Flow boards" is meant across all wikis or on each top-5 wiki separately. This report interprets "most active Flow board" as Flow board on all wikis combined (this means larger wikis are overrepresented, as we're looking on the raw number of usages).
dfAllFlow[['workflow_wiki', 'workflow_namespace', 'workflow_title_text', 'rev_timestamp']]\
.groupby(['workflow_wiki', 'workflow_namespace', 'workflow_title_text'])\
.count()\
.reset_index()\
.rename(columns={'rev_timestamp': 'revisions'})\
.sort_values('revisions', ascending=False)\
.head(20)
workflow_wiki | workflow_namespace | workflow_title_text | revisions | |
---|---|---|---|---|
338 | frwiki | 4 | Forum_de_relecture | 583 |
339 | frwiki | 4 | Forum_des_nouveaux | 547 |
451 | mediawikiwiki | 4 | Support_desk | 547 |
568 | testwiki | 1 | Pywikibot_test | 261 |
192 | frwiki | 3 | Fourmidable | 200 |
556 | plwiki | 12 | Pytania_nowicjuszy | 190 |
45 | arwiki | 3 | باسم | 182 |
135 | fawiki | 4 | درخواست_راهنمایی | 173 |
301 | frwiki | 3 | Sherwood6 | 170 |
467 | mediawikiwiki | 13 | OAuth | 160 |
342 | frwiki | 103 | Les_sans_pagEs | 139 |
557 | ptwiki | 12 | Contato/Fale_com_a_Wikipédia | 136 |
670 | wikidatawiki | 4 | Bistro | 112 |
49 | arwiki | 3 | حبيشان/أرشيف_النقاشات_الهيكلية_1 | 101 |
340 | frwiki | 102 | Astronomie/Porte_des_étoiles | 99 |
102 | cawiki | 4 | La_taverna/Novetats | 99 |
100 | cawiki | 4 | La_taverna/Ajuda | 81 |
429 | mediawikiwiki | 3 | Jdlrobson/Stable_interface_policy/frontend | 78 |
196 | frwiki | 3 | GF38storic | 68 |
103 | cawiki | 4 | La_taverna/Propostes | 62 |
Because the data for Flow is recorded in a separate database and not directly available in the data lake, this is not an easy query to make. Excluding from the report per the task.