Talk:Wikipedia Diversity Observatory


Wrong list

How can it be that when I click a link like this (lang_origin=lv&lang_target=sv), I sometimes get another list, e.g. Estonian-Catalan or Ukrainian-Swedish instead of Latvian-Swedish? Does the tool bring back something from a global cache and not a unique list for each user? --LA2 (talk) 01:29, 2 November 2018 (UTC)

It looked like there was a cache bug, but it turned out to be a matter of how the framework (Dash/Plotly) is set up. It now works fine after adding the dropdown menus. My apologies if it bothered you. --Marcmiquel (talk) 19:15, 13 November 2018 (UTC)
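For context, the usual cause of this class of bug in Dash/Plotly apps is mutating module-level state inside a request handler: the server process is shared by all users, so one user's selection can leak into another user's response. The Dash documentation warns against modifying global variables in callbacks for exactly this reason. A minimal, framework-free sketch of the anti-pattern and the fix (all names here are illustrative, not taken from the actual tool):

```python
# Anti-pattern: module-level state shared by every user of the server.
current_pair = {"origin": None, "target": None}  # shared across all requests!

def handle_request_buggy(origin, target):
    # Two concurrent users can interleave here, so user A may read
    # the pair that user B just wrote (e.g. lv-sv becomes uk-sv).
    current_pair["origin"] = origin
    current_pair["target"] = target
    return build_list(current_pair["origin"], current_pair["target"])

# Fix: derive everything from the request's own parameters, the way a
# Dash callback does when the selection comes from per-user Dropdown inputs.
def handle_request_fixed(origin, target):
    return build_list(origin, target)

def build_list(origin, target):
    # Stand-in for the real list-generation logic.
    return f"{origin}-{target} list"
```

With per-request parameters, each user always gets the list for their own language pair, regardless of what other users request at the same time.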

My thoughts

As a fellow researcher interested in this topic, I finally got around to investigating this in more depth. Excellent job (although some of the documentation on MediaWiki is unfinished). Some thoughts (@User:Marcmiquel):

  • why is Basic English included as a language?
  • on one of the pages there seems to be an error either in the description or the menus - the first line of options, "Select a group of Wikipedias. You can add or remove languages:", allows only a single Wikipedia to be looked at; adding a second one erases the first one
  • there is some inconsistency in labels. For example, on the linked diversity-over-time page, the parameter "language CCC" seems to be the same as "CCC" elsewhere
  • in general, I think each page should always contain a definition (operationalization) of such terms. I recently read one of your articles which contained a big table, and I couldn't figure out what most abbreviations meant until I visited the MediaWiki pages (you got "lucky" that the reviewers didn't notice/complain about this).
  • chronological data seems to be labeled as 'monthly', but presumably it's based on a snapshot from a particular day? Maybe the exact day could be included
  • is there a way to download specified data as CSV or such? For example, I'd like to be able to download data for several Wikipedias (let's say English, Polish, German, Korean, etc.), and choose options such as "CCC art." and "diversity over time", i.e. I'd like to compare how the diversity changed for each of those over time. Right now I don't know how to do it "easily" using your tools.
  • playing more with that visualization (the scatterplot), I'd like to see an option to "match" languages to Wikipedias (e.g. I entered English, German, Korean and Polish, and I'd like a one-button feature to limit the content to those same languages). Another feature I'd like to see is a way to narrow down what's displayed, for example to just the matching pairs, or to other pairs readers could define (e.g. only coverage of English or Polish, or both of them but not other languages).
  • your data and visualization would benefit from a focus on non-trivial languages / cultures (as you already do in many of your papers). Right now this affects the friendliness of some of your tools. For example, one table starts with Cebuano (which, let's face it, nobody cares about). Worse, trying to get data on that page for regional groupings (e.g. Asia) gives us a ton of minor languages that make the table very slow to load and mostly useless until one trims it by removing 2/3 of the languages. So people could choose an "Asia (major entities)" option that generates something manageable. Oh, and that particular tool (CCC spread) simply breaks down for more than 30 entries anyway (choosing Top 40 or Asia simply doesn't generate a new table).
  • since some folks are not linguists, it may be good to provide a dominant country name after some languages, e.g. Tagalog (Philippines), Urdu (Pakistan), etc.
  • finally, and related to the point about non-trivial groupings, I'd love to see a way to create one's own groupings, based on whatever the reader wants. For example, I'd like to be able to define categories based on IGW cultural clusters, such as "Baltic countries", "post-communist countries", etc. You could even allow people to save them publicly. This would give you many more toggleable, pre-defined settings for people to play with, some of which could even give you ideas for future research. --Piotrus (talk) 11:36, 4 May 2022 (UTC)

Thank you for writing me, @Piotrus. I'm glad you liked the project. I've devoted a lot of time to it over the years - some things are more advanced than others, but overall it has produced interesting research, some papers, and dashboards. I'll go through your comments in order.
  • Basic English (or Simple English) is just another Wikipedia (and another version of English), that's why it is included. I find it interesting to know what it covers.
  • I guess you are looking at the time series. It allows more than one language (a group), but you need to select either "limit to one language" or "limit to one entity". If you want a group of languages, you need to select the second option in the radio buttons.
  • You're right. We try to always be a bit redundant, but sometimes we forget that not everybody is aware of the meaning of the abbreviations, even though this tool/research mainly makes sense in the context of Wikimedians.
  • The dump is always from the beginning of the month. Even though it may be from a specific date, the stats are computed as of the last day of the previous month.
  • You could download the database wikipedia_diversity.db to have all the articles / features for each language, and stats.db to have the data that is displayed on the website. I think this is the easiest way to use the data.
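Since the reply above only names the database files (wikipedia_diversity.db and stats.db) without documenting their schemas, a practical first step after downloading is to inspect them with Python's built-in sqlite3 module. This sketch lists the tables and previews a table's columns; the table and column names in the actual files are not documented here, so nothing is assumed about them:

```python
import sqlite3

def list_tables(db_path):
    """Return the table names stored in a SQLite database file."""
    with sqlite3.connect(db_path) as conn:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name"
        ).fetchall()
    return [name for (name,) in rows]

def preview(db_path, table, limit=5):
    """Fetch the column names and first few rows of a table."""
    with sqlite3.connect(db_path) as conn:
        cur = conn.execute(f"SELECT * FROM {table} LIMIT {int(limit)}")
        cols = [d[0] for d in cur.description]
        return cols, cur.fetchall()
```

For example, `list_tables("stats.db")` after downloading shows what is available, and `preview` on any table reveals the columns, from which a CSV export (e.g. via the csv module or pandas) is straightforward.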
  • Yes, the interface for the scatterplot could have some options to limit the languages more easily, but usually you can use the legend and double-click to select just one language.
  • Cebuano is in the top 10; it is simply part of a grouping that is convenient. Still, you are right that the groupings can be improved to fit more usual demands. Groupings are a good idea, as they save the time of introducing each language one by one. But generally, there are many things in the interface that can be improved.
  • The concept of a dominant country (or the largest country speaking the language) is interesting. It could be used somewhere, but I'm not sure in which dashboard. Ideally, you would learn about the language-territory mapping and many other aspects using the tools, but it is true that at the moment it is hard to know where to draw the line.
  • Fantastic ideas for future research. Yes, I believe there are other groupings (of countries and of cultures) that could be interesting to dig into.
I suppose you had the chance to take a look at the last paper we published on the project. I think it is a good summary:
Miquel-Ribé, M., & Laniado, D. (2021). The Wikipedia Diversity Observatory: helping communities to bridge content gaps through interactive interfaces. Journal of Internet Services and Applications, 12(1), 1-25.
I also wanted to say that in the past year I have been working on a project called Knowledge Gaps Index, supporting the Wikimedia Foundation research team. The project will track different knowledge gaps and will have a dedicated dashboard that will also monitor the content gaps. Most of the work done in the Diversity Observatory will be incorporated into it. At the moment I am not an active part of the project, but as far as I know the interface will look much better.
Again, thank you for the feedback and I'm glad you liked the research. Best regards,
Marc Miquel Marcmiquel (talk) 20:17, 4 May 2022 (UTC)
@Marcmiquel Yes, I read it. You are doing a wonderful job that ties into my interests and my published and ongoing research (ex. [1]; a European version is in review, I hope to finish my draft on a global summary soon, and I've already started work on another piece that is going to tie in with the WDO more - I only learned about it this year). Briefly, I am trying to see what factors influence the popularity of Wikipedia and Wikipedia activism. Your focus is on various "knowledge gaps", but while the digital divide and the like are important, they don't explain everything. Some countries that are at very similar levels of development have different levels of Wikipedia editing, and I am trying to understand why. Size is one issue, government policies can have an extreme impact, but culture seems to be relevant too (I am testing, for example, whether there are patterns that match the Inglehart-Welzel or Hofstede models). If you ever want to discuss this more, or collaborate on something, feel free to hit me up, maybe by email (sadly I don't often go to Wikimania etc. these days). Piotrus (talk) 11:46, 5 May 2022 (UTC)
You're totally right. The causes of community development are various, and we do not explain them well at the moment. I look at the final outcome, but we should spend more effort on trying to understand the roots and see if we can have valuable collaborations with governments and other actors to try to improve them. The cultural aspects you mention seem very interesting. Let's stay in touch. Unfortunately, the next (in-person) Wikimania still seems too far away! Best, Marcmiquel (talk) 12:35, 5 May 2022 (UTC)

Running the code

When running the code from GitHub, I get various errors that result from the lack of prepared records in the database files (.db). @Marcmiquel: If possible, please help. In particular, I am interested in the following issues:

  1. Is the code on GitHub a draft version (that needs to be improved) or ready to go? For example, one file has empty input commands at lines 64, 65, 66 and lines 111, 112, 113, which stop the code from running until the user presses "Enter" (6 times in total). If such action from the user is obligatory, it is worth adding some information about it, because this is where the code may "run forever" ;) . Another example: one file has an update_pull_missing_ccc_wikipedia_diversity call at line 1 which wasn't defined and therefore produces an error at the beginning of the code execution.
  2. Do I understand correctly that in order to get current CCC statistics (such as CCC Spread) between different Wikipedia language versions, I need to run only the four files given in the manual?
  3. Is it enough to have the dumps of Wikidata and the different language versions of Wikipedia (that are publicly available) to properly run the above-mentioned four files and get the final results (statistics related to CCC)? Prof.DataScience (talk) 19:59, 23 June 2022 (UTC)
Pinging User:Marcmiquel, I am also curious. Piotrus (talk) 10:33, 7 July 2022 (UTC)

Thanks for writing, @Prof.DataScience and @Piotrus. I'll do my best to reply to each of the questions.

  1. The code is the same that is running on the server. The Wikipedia Diversity Observatory is a work in progress, as many other projects in the Movement are, more focused on studying the content gaps (and publishing research) than on making a final product. This means that while the code works to deliver the results, there are some functions in the code that are left to be completed in the future, or that are left there because they were used at some point and might be useful again. One script is an example of this: there are some intersections for LGBT+ content gaps over time that are not completed, although they exist in the spreadsheet.
  2. Lines 64, 65 and 66 in that file are input() calls. I use these to stop the execution when I'm running the code in the terminal while trying new things or checking something specific. Possibly there was some code before them that I deleted, and I forgot to delete the input() calls too. Since I use GitHub to save the code (not to deliver it as a final product), I forgot they were there. The same goes for update_pull_missing_ccc_wikipedia_diversity: it seems to be an unintended copy-paste. This function is part of another script and transfers rows between two databases. I fixed both of them.
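Leftover input() breakpoints like the ones described above can be kept for interactive debugging without blocking unattended runs by guarding them behind an opt-in flag. A minimal sketch, where the environment-variable name is purely hypothetical (it is not part of the actual codebase):

```python
import os

# Hypothetical opt-in flag: pauses only happen when the developer
# explicitly sets WDO_DEBUG_PAUSES=1 before running the script.
DEBUG_PAUSES = os.environ.get("WDO_DEBUG_PAUSES") == "1"

def pause(label=""):
    """Stop for manual inspection only when explicitly enabled.

    In an unattended run (the default), this is a no-op, so the
    script never appears to "run forever" waiting for Enter.
    """
    if DEBUG_PAUSES:
        input(f"[paused: {label}] press Enter to continue...")

# In a processing script, a bare input() would then become e.g.:
# pause("after loading the dump")
```

This keeps the inspection points available in the repository while making the default behavior safe for anyone who clones the code and runs it end to end.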
  3. The order is correct. Please be mindful that it is really time-expensive (possibly more than two weeks for the 300 languages). I have not revised/run the code for a year, and there might have been changes in the dump syntax. At the same time, I recommend running it from a WMF server to access the dumps directly, without the need to download them (which is also time-consuming).
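On WMF-hosted environments such as Toolforge, the public dumps are mounted on the local filesystem (under /public/dumps/public on Toolforge), so a script can stream a compressed dump directly instead of downloading it first. A sketch of that pattern with the stdlib bz2 module; the dump path shown is illustrative and the page-counting logic is only a stand-in for real processing:

```python
import bz2

# Illustrative path on a Toolforge host; adjust wiki and snapshot as needed.
DUMP = "/public/dumps/public/svwiki/latest/svwiki-latest-pages-articles.xml.bz2"

def count_pages(path, limit=None):
    """Stream a bz2-compressed XML dump line by line and count <page>
    elements, without ever decompressing the whole file into memory."""
    pages = 0
    with bz2.open(path, "rt", encoding="utf-8", errors="replace") as f:
        for line in f:
            if "<page>" in line:
                pages += 1
                if limit and pages >= limit:
                    break
    return pages
```

Streaming like this avoids both the multi-hour download and the disk space for a decompressed dump, which matters at the two-week, 300-language scale mentioned above.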
  4. Generally, I think the approach I took to systematize and process the content gaps is helpful, but I must say that there are more efficient ways to do it. The Knowledge Gaps Index project is already working on monitoring most of these gaps and will be using a different and much quicker (thread-based) technical approach to create a dataset and a website. Unfortunately, I am not aware of the current stage of the project.
  5. Sorry for not replying earlier. I had not logged in for some days and did not see it. I hope my comments are helpful. Feel free to ask further. Best regards. Marcmiquel (talk) 11:30, 13 July 2022 (UTC)