A partial archive of https://discourse-mediawiki.wmflabs.org as of Saturday May 21, 2022.

Tracking pageview stats for a toolforge tool?

evad37

I’ve made a tool that basically displays sitelinks and other info for a Wikidata item: Free Knowledge Portal (Meta-wiki page, see further links there, since this software won’t let me post more than 2 links in a message).

A feature that would be nice is keeping a record of page visits for each item page, which could then be visualised as graphs and the like. I’m thinking of making an SQL database (per https://wikitech.wikimedia.org/wiki/Help:Toolforge/Database#User_databases) with a table containing just columns for the item ID and a date-timestamp of the visit, where each visit would be recorded as a row. Then you could get page visits by running a query that counts the rows with a matching item ID and a date-timestamp within a given range.
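
Very roughly, I’m imagining something like this (a Python + pymysql sketch; the ToolsDB hostname, the example database name s12345__pageviews, and the page_visits table/columns are just placeholders for illustration, not an actual implementation):

    import os
    import pymysql

    # Placeholder names throughout: check the current ToolsDB hostname and use
    # your own tool's database name instead of s12345__pageviews.
    conn = pymysql.connect(
        host="tools.db.svc.wikimedia.cloud",
        read_default_file=os.path.expanduser("~/replica.my.cnf"),
        database="s12345__pageviews",
        charset="utf8mb4",
    )

    with conn.cursor() as cur:
        # One row per visit: just the item ID and a timestamp.
        cur.execute("""
            CREATE TABLE IF NOT EXISTS page_visits (
                item_id VARCHAR(20) NOT NULL,
                visited_at DATETIME NOT NULL,
                KEY idx_item_time (item_id, visited_at)
            )
        """)

        # Record a visit.
        cur.execute(
            "INSERT INTO page_visits (item_id, visited_at) VALUES (%s, NOW())",
            ("Q42",),
        )

        # Count visits for an item within a date range.
        cur.execute(
            "SELECT COUNT(*) FROM page_visits"
            " WHERE item_id = %s AND visited_at BETWEEN %s AND %s",
            ("Q42", "2022-05-01", "2022-05-21"),
        )
        (hits,) = cur.fetchone()

    conn.commit()

An index on (item_id, visited_at) should keep those range counts reasonably fast even as the table grows.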

So my questions are:

  • Is this a sane approach? Or is there a better approach? Or any reason it shouldn’t be done?
  • Does it matter that the table would keep growing indefinitely? What sort of limits are there for user databases?
  • Apart from basic security measures like escaping/validating/sanitising inputs, is there anything else I should be aware of?

LucasWerkmeisterWMDE

Probably make sure you’re not storing any sensitive user information, like client IP addresses, in that table. But item ID and date-timestamp of visit sounds fine.

Chicocvenancio

Sounds sane to me. I see no reason it couldn’t/shouldn’t be done as you described.

Sure, it does matter, but unless I’m missing something, it would not grow by a large amount.

If it does start to grow in a noticeable way, you can make tables with less granular data for each hour/day/week/month and delete the granular records after a set period of time.

I’ll leave the more generic “what are the limits to user databases?” question to @bd808 and our DBAs to answer.

Not that I can think of. Because WMCS is not a safe place to handle sensitive data, I recommend against collecting anything private from users. If you do decide to go that route, remember to display the terms of service to them beforehand.

bd808

Keeping raw page view counts indefinitely is a bad idea. Consider the number of rows that dataset may grow to in 5 years for a popular URL. A better approach would be to determine the granularity you need the historical data at (I would suggest daily is probably more than granular enough).

Once you determine that, a better implementation would be to have a “raw hits” table as you have described and an “aggregated hits” table where you store something like (item id, day, total hits). You could cron a job that runs once per day to populate the aggregate table from the raw table and then delete the raw rows that were counted. This would functionally limit the size of your raw hits table to ~24 hours of traffic at any given time. Storing one row per item id per day long term will probably not be a major problem unless your tool becomes very, very popular.
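
As a rough illustration (Python + pymysql again, reusing the placeholder page_visits / aggregated_hits names from the sketch above rather than anything prescribed), the daily job could look something like this:

    import os
    import pymysql

    # Illustrative sketch of the once-per-day rollup job; database, host, and
    # table names are placeholders, not a prescribed setup.
    conn = pymysql.connect(
        host="tools.db.svc.wikimedia.cloud",
        read_default_file=os.path.expanduser("~/replica.my.cnf"),
        database="s12345__pageviews",
        charset="utf8mb4",
    )

    with conn.cursor() as cur:
        cur.execute("""
            CREATE TABLE IF NOT EXISTS aggregated_hits (
                item_id VARCHAR(20) NOT NULL,
                day DATE NOT NULL,
                total_hits INT UNSIGNED NOT NULL,
                PRIMARY KEY (item_id, day)
            )
        """)

        # Roll up everything before today into one (item_id, day, total_hits)
        # row per item per day ...
        cur.execute("""
            INSERT INTO aggregated_hits (item_id, day, total_hits)
            SELECT item_id, DATE(visited_at), COUNT(*)
            FROM page_visits
            WHERE visited_at < CURDATE()
            GROUP BY item_id, DATE(visited_at)
            ON DUPLICATE KEY UPDATE total_hits = total_hits + VALUES(total_hits)
        """)

        # ... then delete the raw rows that were just counted.
        cur.execute("DELETE FROM page_visits WHERE visited_at < CURDATE()")

    conn.commit()

Run once per day from cron, this keeps the raw table to roughly a day of traffic, while the aggregated table grows by at most one row per item per day.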