Date: August 05, 2022
Author: Martin Urbanec martin.urbanec@wikimedia.cz
This notebook shows how pages are distributed across namespaces on the Wikimedia projects.
from wmfdata import spark, mariadb
from IPython.display import display, Markdown, Latex, HTML
import numpy as np
import pandas as pd
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)
df = spark.run('''
WITH page_ns_distribution_raw AS (
    SELECT
        snapshot,
        wiki_db,
        page_namespace,
        COUNT(*) AS cnt
    FROM wmf_raw.mediawiki_page
    WHERE
        snapshot = '2022-07'
        -- AND wiki_db = 'cswiki'
        AND page_is_redirect = false
        -- no extra namespaces
        AND page_namespace NOT BETWEEN 100 AND 500
    GROUP BY snapshot, wiki_db, page_namespace
)
SELECT
    wiki_db,
    database_group,
    page_namespace,
    IF(page_namespace=0, '(Main)', namespace_canonical_name) AS namespace,
    cnt
FROM page_ns_distribution_raw AS pnd
JOIN wmf_raw.mediawiki_project_namespace_map AS nsm
    ON ((nsm.snapshot = pnd.snapshot) AND (nsm.dbname=pnd.wiki_db) AND (nsm.namespace=pnd.page_namespace))
JOIN canonical_data.wikis AS w ON (w.database_code = wiki_db)
ORDER BY wiki_db, page_namespace
''')
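To make the filtering and counting in the query easier to follow, here is a minimal pandas sketch of the same logic on a few made-up rows. The `pages` frame and its values are hypothetical stand-ins for `wmf_raw.mediawiki_page`, not real data, and the sketch leaves out the two joins.

```python
import pandas as pd

# Hypothetical toy rows standing in for wmf_raw.mediawiki_page (not real data).
pages = pd.DataFrame({
    "wiki_db": ["cswiki", "cswiki", "cswiki", "cswiki", "enwiki", "enwiki"],
    "page_namespace": [0, 0, 1, 100, 0, 0],
    "page_is_redirect": [False, True, False, False, False, False],
})

# Same filters as the WHERE clause: drop redirects and drop namespaces
# 100..500 (BETWEEN is inclusive on both ends, as is Series.between).
is_extra_ns = pages["page_namespace"].between(100, 500)
kept = pages[~pages["page_is_redirect"] & ~is_extra_ns]

# Same grouping as GROUP BY wiki_db, page_namespace with COUNT(*).
counts = (
    kept.groupby(["wiki_db", "page_namespace"])
    .size()
    .reset_index(name="cnt")
)
print(counts)
```

Only non-redirect pages in namespaces outside 100..500 contribute to `cnt`, which is what the percentages further down are based on.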
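# Total page count per wiki, used as the denominator for pct_wiki
wiki_totals = {}
for wiki_db in df.wiki_db.unique():
    wiki_totals[wiki_db] = df.loc[df.wiki_db == wiki_db].cnt.sum()
# Share of each namespace within its own wiki, in percent
df['pct_wiki'] = df.apply(lambda x: np.round(x.cnt / wiki_totals[x.wiki_db] * 100, decimals=2), axis=1)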
agg_df = df[['namespace', 'cnt']].groupby('namespace').sum().reset_index()
agg_df['pct'] = np.round(agg_df.cnt / agg_df.cnt.sum() * 100, decimals=2)
agg_df.sort_values('pct', ascending=False)
| | namespace | cnt | pct |
---|---|---|---|
0 | (Main) | 198077788 | 41.91 |
36 | User talk | 90750660 | 19.20 |
7 | File | 88394968 | 18.70 |
3 | Category | 30914264 | 6.54 |
25 | Talk | 27493772 | 5.82 |
19 | Project | 10968916 | 2.32 |
35 | User | 9326742 | 1.97 |
26 | Template | 6949722 | 1.47 |
33 | Translations | 3414227 | 0.72 |
4 | Category talk | 2856764 | 0.60 |
8 | File talk | 1174669 | 0.25 |
27 | Template talk | 657542 | 0.14 |
15 | Module | 441101 | 0.09 |
13 | MediaWiki | 375782 | 0.08 |
32 | Topic | 364937 | 0.08 |
20 | Project talk | 221433 | 0.05 |
1 | CNBanner | 71877 | 0.02 |
28 | Thread | 34876 | 0.01 |
11 | Help | 44642 | 0.01 |
14 | MediaWiki talk | 23895 | 0.01 |
30 | TimedText | 1341 | 0.00 |
29 | Thread talk | 4 | 0.00 |
6 | EntitySchema talk | 79 | 0.00 |
31 | TimedText talk | 148 | 0.00 |
2 | CNBanner talk | 26 | 0.00 |
37 | モジュール‐ノート | 117 | 0.00 |
34 | Translations talk | 95 | 0.00 |
22 | Story | 141 | 0.00 |
24 | Summary talk | 1 | 0.00 |
23 | Summary | 314 | 0.00 |
5 | EntitySchema | 378 | 0.00 |
21 | SecurePoll | 580 | 0.00 |
18 | Newsletter talk | 6 | 0.00 |
17 | Newsletter | 20 | 0.00 |
16 | Module talk | 9708 | 0.00 |
12 | Help talk | 7408 | 0.00 |
10 | Gadget talk | 3 | 0.00 |
9 | Gadget definition talk | 3 | 0.00 |
38 | モジュール・ノート | 1 | 0.00 |
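The per-wiki share (`pct_wiki`) could also be computed without the explicit loop and `apply`. The following is a sketch on made-up numbers, assuming a frame with the same `wiki_db` and `cnt` columns as `df`; the `toy` frame is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy counts with the same columns as df (not real data).
toy = pd.DataFrame({
    "wiki_db": ["cswiki", "cswiki", "enwiki", "enwiki"],
    "namespace": ["(Main)", "Talk", "(Main)", "Talk"],
    "cnt": [75, 25, 60, 140],
})

# transform("sum") broadcasts each wiki's total back onto its own rows,
# so the division is a plain element-wise operation.
wiki_total = toy.groupby("wiki_db")["cnt"].transform("sum")
toy["pct_wiki"] = np.round(toy["cnt"] / wiki_total * 100, decimals=2)
print(toy)
```

This should give the same values as the dictionary-based version, while avoiding a Python-level function call per row.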
for group in ['wikipedia', 'wikibooks', 'wikiquote', 'wiktionary', 'wikinews', 'wikisource', 'wikiversity', 'wikivoyage', 'commons', 'wikidata']:
    display(Markdown('### %s' % group))
    # Aggregate namespace counts over all wikis in this project family
    agg_df = df.loc[df.database_group == group][['namespace', 'cnt']].groupby('namespace').sum().reset_index()
    agg_df['pct'] = np.round(agg_df.cnt / agg_df.cnt.sum() * 100, decimals=2)
    display(agg_df.sort_values('pct', ascending=False))
### wikipedia

| | namespace | cnt | pct |
---|---|---|---|
25 | User talk | 63758316 | 33.88 |
0 | (Main) | 59228471 | 31.47 |
16 | Talk | 26853539 | 14.27 |
1 | Category | 14680369 | 7.80 |
24 | User | 7655496 | 4.07 |
17 | Template | 5368803 | 2.85 |
2 | Category talk | 2816971 | 1.50 |
3 | File | 2805520 | 1.49 |
12 | Project | 2781353 | 1.48 |
4 | File talk | 852720 | 0.45 |
18 | Template talk | 632006 | 0.34 |
23 | Topic | 189867 | 0.10 |
13 | Project talk | 194966 | 0.10 |
10 | Module | 186598 | 0.10 |
8 | MediaWiki | 124002 | 0.07 |
9 | MediaWiki talk | 13398 | 0.01 |
6 | Help | 24695 | 0.01 |
11 | Module talk | 7489 | 0.00 |
14 | Story | 141 | 0.00 |
15 | Summary | 29 | 0.00 |
19 | Thread | 2578 | 0.00 |
20 | Thread talk | 1 | 0.00 |
21 | TimedText | 1336 | 0.00 |
22 | TimedText talk | 148 | 0.00 |
7 | Help talk | 4623 | 0.00 |
5 | Gadget talk | 2 | 0.00 |
26 | モジュール‐ノート | 117 | 0.00 |
### wikibooks

| | namespace | cnt | pct |
---|---|---|---|
0 | (Main) | 386034 | 38.30 |
21 | User talk | 293145 | 29.09 |
20 | User | 77218 | 7.66 |
15 | Template | 52848 | 5.24 |
3 | File | 48728 | 4.83 |
1 | Category | 46852 | 4.65 |
14 | Talk | 37459 | 3.72 |
7 | MediaWiki | 12973 | 1.29 |
19 | Topic | 12708 | 1.26 |
17 | Thread | 12516 | 1.24 |
11 | Project | 11802 | 1.17 |
9 | Module | 8627 | 0.86 |
16 | Template talk | 2040 | 0.20 |
12 | Project talk | 1460 | 0.14 |
5 | Help | 1268 | 0.13 |
8 | MediaWiki talk | 760 | 0.08 |
2 | Category talk | 803 | 0.08 |
6 | Help talk | 285 | 0.03 |
4 | File talk | 305 | 0.03 |
13 | Summary | 14 | 0.00 |
10 | Module talk | 39 | 0.00 |
18 | Thread talk | 1 | 0.00 |
### wikiquote

| | namespace | cnt | pct |
---|---|---|---|
0 | (Main) | 301091 | 36.51 |
18 | User talk | 270392 | 32.79 |
1 | Category | 89350 | 10.83 |
17 | User | 50704 | 6.15 |
14 | Template | 37232 | 4.51 |
11 | Project | 29224 | 3.54 |
13 | Talk | 28425 | 3.45 |
7 | MediaWiki | 9963 | 1.21 |
9 | Module | 1641 | 0.20 |
3 | File | 1470 | 0.18 |
12 | Project talk | 1299 | 0.16 |
15 | Template talk | 880 | 0.11 |
2 | Category talk | 836 | 0.10 |
5 | Help | 711 | 0.09 |
16 | Topic | 674 | 0.08 |
8 | MediaWiki talk | 561 | 0.07 |
6 | Help talk | 168 | 0.02 |
4 | File talk | 46 | 0.01 |
10 | Module talk | 21 | 0.00 |
### wiktionary

| | namespace | cnt | pct |
---|---|---|---|
0 | (Main) | 32852999 | 89.82 |
1 | Category | 1798063 | 4.92 |
18 | Template | 677298 | 1.85 |
25 | User talk | 484278 | 1.32 |
11 | Module | 222525 | 0.61 |
24 | User | 209465 | 0.57 |
17 | Talk | 209852 | 0.57 |
13 | Project | 48134 | 0.13 |
9 | MediaWiki | 25190 | 0.07 |
19 | Template talk | 9260 | 0.03 |
2 | Category talk | 7502 | 0.02 |
23 | Translations | 7281 | 0.02 |
20 | Thread | 8470 | 0.02 |
7 | Help | 2773 | 0.01 |
3 | File | 4446 | 0.01 |
14 | Project talk | 4315 | 0.01 |
22 | Topic | 659 | 0.00 |
21 | Thread talk | 1 | 0.00 |
4 | File talk | 256 | 0.00 |
16 | Summary talk | 1 | 0.00 |
5 | Gadget definition talk | 1 | 0.00 |
12 | Module talk | 1321 | 0.00 |
10 | MediaWiki talk | 1418 | 0.00 |
8 | Help talk | 353 | 0.00 |
6 | Gadget talk | 1 | 0.00 |
15 | Summary | 12 | 0.00 |
### wikinews

| | namespace | cnt | pct |
---|---|---|---|
11 | Project | 6862439 | 50.24 |
19 | User talk | 2785582 | 20.39 |
1 | Category | 2108588 | 15.44 |
0 | (Main) | 1743222 | 12.76 |
15 | Template | 48140 | 0.35 |
14 | Talk | 44651 | 0.33 |
18 | User | 37056 | 0.27 |
17 | Thread | 9326 | 0.07 |
3 | File | 7736 | 0.06 |
7 | MediaWiki | 6154 | 0.05 |
9 | Module | 1026 | 0.01 |
2 | Category talk | 1116 | 0.01 |
12 | Project talk | 1441 | 0.01 |
16 | Template talk | 1167 | 0.01 |
4 | File talk | 220 | 0.00 |
5 | Help | 500 | 0.00 |
6 | Help talk | 82 | 0.00 |
8 | MediaWiki talk | 361 | 0.00 |
13 | Summary | 21 | 0.00 |
10 | Module talk | 22 | 0.00 |
### wikisource

| | namespace | cnt | pct |
---|---|---|---|
0 | (Main) | 3418012 | 76.85 |
1 | Category | 301805 | 6.79 |
20 | User talk | 251279 | 5.65 |
14 | Talk | 175661 | 3.95 |
3 | File | 107666 | 2.42 |
19 | User | 74167 | 1.67 |
15 | Template | 66774 | 1.50 |
11 | Project | 21359 | 0.48 |
7 | MediaWiki | 12006 | 0.27 |
9 | Module | 4911 | 0.11 |
18 | Topic | 3176 | 0.07 |
16 | Template talk | 2613 | 0.06 |
12 | Project talk | 1856 | 0.04 |
5 | Help | 1784 | 0.04 |
2 | Category talk | 1276 | 0.03 |
8 | MediaWiki talk | 1279 | 0.03 |
17 | Thread | 1277 | 0.03 |
4 | File talk | 280 | 0.01 |
6 | Help talk | 414 | 0.01 |
13 | Summary | 5 | 0.00 |
10 | Module talk | 165 | 0.00 |
### wikiversity

| | namespace | cnt | pct |
---|---|---|---|
0 | (Main) | 145372 | 31.52 |
19 | User talk | 117900 | 25.56 |
18 | User | 54428 | 11.80 |
3 | File | 41598 | 9.02 |
1 | Category | 40370 | 8.75 |
15 | Template | 30085 | 6.52 |
14 | Talk | 16866 | 3.66 |
12 | Project | 6361 | 1.38 |
8 | MediaWiki | 1987 | 0.43 |
10 | Module | 1733 | 0.38 |
13 | Project talk | 1058 | 0.23 |
2 | Category talk | 1009 | 0.22 |
16 | Template talk | 957 | 0.21 |
6 | Help | 685 | 0.15 |
17 | Topic | 274 | 0.06 |
9 | MediaWiki talk | 172 | 0.04 |
4 | File talk | 192 | 0.04 |
7 | Help talk | 154 | 0.03 |
11 | Module talk | 65 | 0.01 |
5 | Gadget definition talk | 1 | 0.00 |
### wikivoyage

| | namespace | cnt | pct |
---|---|---|---|
18 | User talk | 181467 | 36.79 |
0 | (Main) | 129619 | 26.28 |
17 | User | 68226 | 13.83 |
1 | Category | 30877 | 6.26 |
13 | Talk | 23567 | 4.78 |
14 | Template | 21889 | 4.44 |
3 | File | 15553 | 3.15 |
11 | Project | 7572 | 1.54 |
7 | MediaWiki | 5878 | 1.19 |
9 | Module | 3918 | 0.79 |
12 | Project talk | 1661 | 0.34 |
15 | Template talk | 1136 | 0.23 |
5 | Help | 535 | 0.11 |
4 | File talk | 410 | 0.08 |
16 | Topic | 399 | 0.08 |
8 | MediaWiki talk | 274 | 0.06 |
6 | Help talk | 82 | 0.02 |
2 | Category talk | 106 | 0.02 |
10 | Module talk | 59 | 0.01 |
19 | モジュール・ノート | 1 | 0.00 |
### commons

| | namespace | cnt | pct |
---|---|---|---|
3 | File | 85341613 | 77.08 |
1 | Category | 11628889 | 10.50 |
20 | User talk | 11127486 | 10.05 |
11 | Project | 1120810 | 1.01 |
17 | Translations | 378719 | 0.34 |
19 | User | 349038 | 0.32 |
4 | File talk | 319841 | 0.29 |
14 | Template | 261797 | 0.24 |
0 | (Main) | 124090 | 0.11 |
7 | MediaWiki | 21344 | 0.02 |
2 | Category talk | 26621 | 0.02 |
12 | Project talk | 10344 | 0.01 |
9 | Module | 1273 | 0.00 |
8 | MediaWiki talk | 3849 | 0.00 |
13 | Talk | 2812 | 0.00 |
6 | Help talk | 88 | 0.00 |
15 | Template talk | 4076 | 0.00 |
16 | Topic | 19 | 0.00 |
5 | Help | 809 | 0.00 |
18 | Translations talk | 6 | 0.00 |
10 | Module talk | 216 | 0.00 |
### wikidata

| | namespace | cnt | pct |
---|---|---|---|
0 | (Main) | 98173548 | 99.49 |
18 | Translations | 222696 | 0.23 |
21 | User talk | 78693 | 0.08 |
12 | Project | 61664 | 0.06 |
20 | User | 55947 | 0.06 |
17 | Topic | 25231 | 0.03 |
14 | Talk | 32674 | 0.03 |
15 | Template | 8803 | 0.01 |
13 | Project talk | 2062 | 0.00 |
19 | Translations talk | 3 | 0.00 |
2 | Category talk | 17 | 0.00 |
16 | Template talk | 267 | 0.00 |
3 | EntitySchema | 349 | 0.00 |
4 | EntitySchema talk | 77 | 0.00 |
1 | Category | 4330 | 0.00 |
10 | Module | 720 | 0.00 |
9 | MediaWiki talk | 402 | 0.00 |
8 | MediaWiki | 3554 | 0.00 |
7 | Help talk | 149 | 0.00 |
6 | Help | 2354 | 0.00 |
5 | File talk | 4 | 0.00 |
11 | Module talk | 79 | 0.00 |
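Because the tables are sorted on the rounded `pct` column, rows that round to the same percentage appear in no particular order (for example, many namespaces share 0.00). One possible variation, sketched here on made-up numbers, is to compute all project families in a single grouped pass and sort on the raw `cnt` instead. The `toy` frame is hypothetical and only mirrors the `database_group`, `namespace` and `cnt` columns of `df`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows with the same columns as df (not real data).
toy = pd.DataFrame({
    "database_group": ["wikipedia"] * 3 + ["commons"] * 2,
    "namespace": ["(Main)", "Thread", "Help", "Category", "File"],
    "cnt": [1_000_000, 55, 60, 100, 900],
})

# One row per (project family, namespace), then each family's total
# broadcast back onto its rows to get the percentage.
summary = toy.groupby(["database_group", "namespace"], as_index=False)["cnt"].sum()
group_total = summary.groupby("database_group")["cnt"].transform("sum")
summary["pct"] = np.round(summary["cnt"] / group_total * 100, decimals=2)

# Sort on the raw count so rows whose rounded pct ties keep a meaningful order.
summary = summary.sort_values(["database_group", "cnt"], ascending=[True, False])
print(summary)
```

In this toy data, Help and Thread both round to 0.01, yet Help is still listed first because its count is larger.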