Research talk:Expanding Wikipedia articles across languages

Finding representative categories per Wikipedia article

@Magnus Manske: Hi. I have a question that you may have run into and can help us with. In order to be able to recommend which sections can be added to an article, we need to be able to find a representative set of categories for each Wikipedia article. There are two problems we are facing at the moment. One is that there are too many categories per article (for some articles), and even those may not be the interesting ones. For example, for Alan Turing, we see that Category:20th-century_mathematicians is a category, but what is really interesting for this article is Category:Mathematicians. We can go up the category structure to get the latter category, but the issue is that we run into many other categories in the process as well. What makes things even worse is that the categories are not ontological. For Alan Turing, we see Category:History_of_artificial_intelligence, and one level up from this category is Category:Artificial Intelligence! As humans, we know that Artificial Intelligence is not a type category, but we need to somehow tell the machine about this. How should the algorithm pick up that Artificial Intelligence is a bad category for the Alan Turing article (at least for the purposes of article expansion and section recommendation)?

I appreciate that this is a big problem, so any approximation recommendations would be great, i.e., we're not looking for a way to solve the category system issues, though any solution that has that outcome as a side-product is welcome as well. Thanks for your time in advance. --LZia (WMF) (talk) 18:10, 6 March 2017 (UTC)

The category system has its uses, but it is an absolute mess for this kind of thing. One could presumably have users collect many examples, and then run machine learning on that set to try to predict the useful (parent) categories, but that's its own PhD thesis ;-) Or we could manually make a "whitelist" with the categories we want, then follow all categories to their parents until we hit one of those, but that will take a lot of time and effort, and make answers quite vague.
But we have something that's much more precise in such matters than the categories: Wikidata. (I know, for a man with a hammer...) If you go to Alan Turing (Q7251), you can see all kinds of precise statements, some of which we care about, like "occupation" and "field of work". These would be different for non-human entities, and there can be more than two ("notable work" comes to mind), but it's a limited set. This follows two properties from Alan Turing, finds the "main category" for those, and lists them. This works better the more complete the Wikidata items involved are, of course, so it might work less well for enwp stubs. --Magnus Manske (talk) 09:18, 9 March 2017 (UTC)
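For readers following along, the "whitelist" idea from the reply above can be sketched roughly as a breadth-first walk up the category graph. This is only a minimal sketch under assumptions: the function name `nearest_whitelisted`, the `max_depth` cut-off, and the toy parent links are all hypothetical (only the link from History of artificial intelligence to Artificial intelligence is taken from the discussion), and a real run would need the actual category graph from a dump or the API.

```python
from collections import deque


def nearest_whitelisted(start_categories, parents, whitelist, max_depth=5):
    """Walk up the category graph breadth-first from an article's categories.

    Returns {whitelisted category: depth at which it was first reached}.
    `parents` maps a category to its parent categories (hypothetical toy data
    below); `max_depth` is an assumed safety cut-off, since the real graph
    contains very long chains.
    """
    found = {}
    seen = set(start_categories)
    queue = deque((cat, 0) for cat in start_categories)
    while queue:
        cat, depth = queue.popleft()
        if cat in whitelist:
            found.setdefault(cat, depth)
            continue  # stop climbing once a wanted category is hit
        if depth == max_depth:
            continue
        for parent in parents.get(cat, ()):
            if parent not in seen:  # the graph is not a tree, so avoid revisits
                seen.add(parent)
                queue.append((parent, depth + 1))
    return found


# Toy, partly made-up parent links for illustration only.
toy_parents = {
    "20th-century mathematicians": ["Mathematicians by century"],
    "Mathematicians by century": ["Mathematicians"],
    "History of artificial intelligence": ["Artificial intelligence"],
    "Artificial intelligence": ["Computer science"],
}
toy_whitelist = {"Mathematicians"}
turing_categories = ["20th-century mathematicians", "History of artificial intelligence"]

result = nearest_whitelisted(turing_categories, toy_parents, toy_whitelist)
print(result)
```

With this toy data only Mathematicians is returned, because Artificial intelligence is simply not on the whitelist. The cost mentioned above remains: somebody still has to curate that whitelist by hand.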
Thanks for your response, Magnus Manske. As you mentioned, the category system is one big mess, and it's heartwarming to know that we're not the only ones who have discovered it. ;) I played with your approach using Wikidata and it does work nicely if you have a human in the loop that can specify which statements are representative for an item (or a class of items). The issue with this (at least in the short run) is scalability. Automatic extraction of representative statements can turn into a problem of its own as well. Here is what we have decided to test based on your recommendation and digging some more in this space:
What we're looking for is a way to clean WP's category system. We can do this by combining the category system's information (which is messy) with a relatively clean data/system. One way to do this is: for each WP category, take all articles that are in that category. For each Wikipedia article, we use DBpedia and identify the top class the entity belongs to, for example, person, location, etc. (We did some tests with Wikidata and DBpedia, and it seems DBpedia is easier for testing at this stage given that its ontology size is bigger. Of course, if our approach eventually works in DBpedia, we can use it in Wikidata as well.) Then, for a given category, we count over all DBpedia top-level classes and generate a histogram using the distribution of counts across classes. If the WP category we started from is "clean", we expect to see most of the counts in one DBpedia class. If the counts are more spread across classes, this is a "messy" category. We expect this approach to work well enough in identifying Category:Lausanne as messy, for example, given that the category is used in articles about organizations, persons, etc.
We're hoping that we will be able to find a solution using the above. If we have any breakthroughs, I'll let you know. ;) --LZia (WMF) (talk) 18:45, 13 March 2017 (UTC)
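As an illustration of the histogram test described in this comment, here is a minimal sketch. It assumes the article-to-top-class mapping has already been extracted (from DBpedia or elsewhere) into a plain dictionary. The article titles, class labels, function names, and the 0.8 threshold are all made up for the example and are not taken from the project.

```python
from collections import Counter


def class_histogram(articles, top_class):
    """Count how many articles of one category fall into each top-level class."""
    return Counter(top_class.get(article, "Unknown") for article in articles)


def dominant_share(histogram):
    """Return (most frequent class, its share of all counts)."""
    total = sum(histogram.values())
    if total == 0:
        return None, 0.0
    cls, count = histogram.most_common(1)[0]
    return cls, count / total


def is_clean(histogram, threshold=0.8):
    """A category is treated as 'clean' if one class holds most of the counts.

    The threshold is an assumption for illustration, not a tuned value.
    """
    _, share = dominant_share(histogram)
    return share >= threshold


# Made-up placeholder data: article -> top-level class.
toy_top_class = {
    "mathematician_1": "Person",
    "mathematician_2": "Person",
    "mathematician_3": "Person",
    "lausanne_university": "Organisation",
    "lausanne_resident": "Person",
    "lausanne_athlete": "Person",
    "lausanne_building": "Place",
}
toy_categories = {
    "Mathematicians": ["mathematician_1", "mathematician_2", "mathematician_3"],
    "Lausanne": ["lausanne_university", "lausanne_resident",
                 "lausanne_athlete", "lausanne_building"],
}

histograms = {cat: class_histogram(members, toy_top_class)
              for cat, members in toy_categories.items()}
for cat, hist in histograms.items():
    print(cat, dict(hist), "clean" if is_clean(hist) else "messy")
```

In this toy data the Mathematicians category is entirely "Person" and passes, while the Lausanne category is spread over three classes and is flagged as messy. An entropy-based score over the same histogram would be a reasonable alternative to a fixed share threshold.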

Community initiatives that involve section recommendations

@Jean-Frédéric: Hi. We'd like to have a chat with folks involved with Ma Commune to learn a bit more about what the project is up to and also share some of the work we're doing on section recommendation. I'm not sure who is/are the person/people I should contact. Can you help? (I had a quick look at the GitHub page and I see quite a few of you active there. It's hard to tell if I should bug you ALL or not. ;) Thank you! --LZia (WMF) (talk) 11:15, 24 October 2017 (UTC)

Hi @LZia (WMF): − I suggest you contact @Ash Crow: and @0x010C:, the main developers of the application. :-) Jean-Fred (talk) 12:36, 24 October 2017 (UTC)
Thanks for the quick response, Jean-Fred.
@Ash Crow: and @0x010C:: is it fine if I reach out to you via email to set up a time for us to have a chat about this? --LZia (WMF) (talk) 14:21, 24 October 2017 (UTC)
Absolutely, you can email me through my Wikimedia account. It's nice to hear that Ma Commune is of interest for you 0x010C ~talk~ 16:33, 24 October 2017 (UTC)
Same for me, you can contact me by email at sylvain.boissel(_AT_) (sorry for the delayed answer, I'm just back from holidays) -Ash Crow (talk) 11:15, 5 November 2017 (UTC)
Thank you, @Ash Crow: Sorry I saw this message just now. I'll reach out to you over email.
@0x010C: Thanks for the meeting. I documented some of the things I learned from our meeting. I may have misunderstood things, and if that's the case, feel free to correct them or call them out and I will correct them. As mentioned in the meeting: the next step for me is to discuss this with the rest of the research team on this project and get back to you. I'm obviously quite happy to see that there may be a possibility that we work with each other more closely and on specific directions of your project and ours. I will update you here. In the meantime, please ping me with questions/comments whenever you want. --LZia (WMF) (talk) 20:04, 9 November 2017 (UTC)
@Ash Crow: and @0x010C:, we've created a first version of recommendations for frwiki which are documented at Research talk:Expanding Wikipedia articles across languages/Work log/2018-01-03. It would be helpful if you have a look at these recommendations and then we set up a 20-min call to see how best we can surface them to you. There are a few things we should discuss:
  1. Do you want to have an experienced editor in the middle between these recommendations and Ma Commune's users, where the experienced editor receives the recommendations for a given category and "approves" or "rejects" them? There are tradeoffs here to consider in terms of creating more work for your experienced editors versus allowing a less experienced user to make a decision.
  2. If you decide to go with the option of no experienced editor in the middle, we should discuss if you want more context to be provided to Ma Commune users. For example, we may be able to provide this kind of information: "section x is missing. n% of the articles in the category this article belongs to have section x." Is this required? Is it a nice-to-have?
  3. How do you want recommendations to be surfaced to you? There are at least two options: you can pass the article title to an API that we provide and you get in return the recommendations. Or, you can pass the category title to the API and get recommendations for each category. In this second approach, you will have to do some work on your end to combine results. For example, when the article belongs to multiple categories, you should then fuse the results by summing up the relevance scores of all recommendations by different categories for that article.
Please let me know what you think and what kind of information we can provide to help you dig deeper. --LZia (WMF) (talk) 21:24, 3 January 2018 (UTC)
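To make the fusing step from point 3 concrete, here is a minimal sketch of summing relevance scores across categories for one article. The data layout (a dictionary of per-category section scores), the function name, and the category and section names are assumptions for illustration; the actual API response format is not specified in this thread.

```python
from collections import defaultdict


def fuse_recommendations(per_category):
    """Sum relevance scores of section recommendations across categories.

    `per_category` is assumed to look like {category: {section: score}}.
    Returns (section, fused score) pairs, best first; ties are broken
    alphabetically so the order is stable.
    """
    fused = defaultdict(float)
    for recommendations in per_category.values():
        for section, score in recommendations.items():
            fused[section] += score
    return sorted(fused.items(), key=lambda item: (-item[1], item[0]))


# Made-up recommendations for one article that belongs to two categories.
toy_per_category = {
    "Category A": {"History": 0.5, "Geography": 0.25},
    "Category B": {"History": 0.25, "Economy": 0.5},
}

fused = fuse_recommendations(toy_per_category)
for section, score in fused:
    print(f"{section}: {score}")
```

A plain sum favours sections recommended by several categories at once, which is the behaviour described in point 3. Whether the scores of different categories are comparable enough to be added directly is an open question this sketch does not address.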
@Ash Crow: and @0x010C: a gentle ping to get a sense for when you think you will have a chance to go over the sample recommendations and let us know what you think. We can surface those in different ways to you, but your feedback is critical to make sure we surface something useful for you. :) --LZia (WMF) (talk) 03:53, 17 January 2018 (UTC)
Hi @LZia (WMF):! Sorry, I was pretty busy these last two weeks. On your three questions:
  • Having an experienced editor in the middle approving each recommendation would require way too much work from them. A fully automated process would be better.
  • The more context and information you can provide, the more precise and helpful we can be in the displayed result. So yes, it's a nice-to-have!
  • We are already working with several other APIs (the MediaWiki API from Wikipedia, Commons, and Wikidata; the Wikimedia maps API; ...). With each of them, we send the title as input and get the result for the given page. So if that's possible for you, it would be the easiest for us.
Thanks for this update, that's very nice! -- 0x010C ~talk~ 09:43, 17 January 2018 (UTC)
Oh, and I missed it in my first reading: if you want to set up a quick call, that is possible, just send me an email like last time 0x010C ~talk~ 09:47, 17 January 2018 (UTC)
Hi, sorry I missed your last notifications. Please use my pro account (User:Sylvain WMFr) for things related to Ma Commune (and more generally all things related to my work at Wikimédia France) and not my personal one (User:Ash_Crow), so I see the ping when I'm at work. I agree with what @0x010C: said and I'm also available for a call. I guess you are in San Francisco, so the best time would be around 19:00 CET/10:00 AM PST? -Sylvain WMFr (talk) 10:30, 17 January 2018 (UTC)
@0x010C: Thanks for your response. I think it's pretty clear what we should do now. You pass an article to the API and in return you want recommendations for that article (we will do the fusing of information on our end), you are interested to get more context back so you can surface it, and you don't want to have an experienced editor in the loop. @Bmansurov (WMF): @Tizianopiccardi: FYI.
@Sylvain WMFr: got you. I'll ping you on this one then, at least for Ma Commune purposes. :)
Both, I will invite you to our recurring meeting on Mondays for the coming week. This is an optional invitation. If you're there, we can spend some time and talk with you. Please don't feel obligated to attend though. We have everything we need to make this happen. If we run into questions, we will ping again. And, thanks! :) --LZia (WMF) (talk) 20:59, 18 January 2018 (UTC)


How can the translations of this tool be improved? I can't find the related group and messages on translatewiki. The Russian translation could be improved. --Kaganer (talk) 01:26, 17 November 2018 (UTC)

See phab:T209748. --Kaganer (talk) 02:03, 17 November 2018 (UTC)
Thank you for pointing me to phab. I am thinking of updating the localized tool labels too, and I wish the tool labels would appear in the language you picked with the button at the top of the page, especially as the team has now started to collect feedback in ja and six languages. --Omotecho (talk) 04:57, 18 June 2019 (UTC)
Kaganer & Omotecho thanks for the feedback. I'm not sure how the text in the gapfinder tool is translated. Perhaps Diego (WMF) knows. The section recommendation testing tool does not currently support interface text translation, because we built it very quickly in order to get feedback and did not intend to use it for anything else. I was hoping that translating the feedback instructions would be sufficient to allow people who are not fluent in English to use the tool. Do you think that having localized instructions is sufficient for this purpose? Best, Jmorgan (WMF) (talk) 16:39, 18 June 2019 (UTC)
@Jmorgan (WMF): No, nobody reads the instructions. And my question was a little about something else. Now I already see the Russian messages, and I asked where they came from and where they can be changed. --Kaganer (talk) 17:30, 18 June 2019 (UTC)
Kaganer sigh, I suppose I shouldn't be surprised that no one reads instructions. Re: the Russian messages in Gapfinder, I'll ask Diego (WMF) to follow up with you here when he's back from holiday (early next week). Cheers, Jmorgan (WMF) (talk) 17:41, 18 June 2019 (UTC)
Comment @Jmorgan (WMF): adding to Kaganer’s points, instructions are read by a small number of people of a certain type, like translators, but not by everybody, in any language. Secondly, I’ve purposely skipped reading the instructions and tried out your tool, and found that the questions supplied in the bottom half could be easy for somebody fluent in dictionary-type/semantic analysis as well as in reading English. Great survey your team is running, and I’m not sure how costly making the tool multi-language would be. Still, I’m happy to translate UI labels for your tool. --Omotecho (talk) 06:10, 19 June 2019 (UTC)