Grants talk:Project/Frimelle and Hadyelsahar/Scribe: Supporting Under-resourced Wikipedia Editors in Creating New Articles


Name

[Image: Internet Archive table-top Scribe]

I suggest choosing another name. In Wikimedia and Wikisource circles, "Scribe" is usually the Internet Archive scanner. --Nemo 13:34, 26 November 2018 (UTC)

That is an excellent point. We are open to changing the tool's name. Do you have suggestions? --Frimelle (talk) 15:43, 26 November 2018 (UTC)
How about "Itemizer", since it starts with an item and attempts to expand it using structured rules, the way you include anything on an itemized list? Jane023 (talk) 16:44, 26 November 2018 (UTC)
I currently can't propose a name because I didn't fully grasp the point of the proposed software. More on this below. Nemo 08:43, 28 November 2018 (UTC)
Thanks all for your suggestions. We think that "Itemizer" does not entirely capture the scope of the project (as we also add contextual information). We suggest keeping the name in the proposal for the time being, as we have already notified community members under this name. At the start of the project, we will create a poll about the name, collecting suggestions at a larger scale and deciding on those. --Hadyelsahar (talk) 11:51, 29 November 2018 (UTC)

Not just for small wikis

I endorse this project, but I have some reservations about its presentation. I agree this is important work that should be done, and it would be useful for any Wikipedia, not just small ones. For this reason I object to the lead image c:File:Wikipedia articles and editors distribution over languages.png being used to illustrate this, since it doesn't really say anything relevant. Yes, we have lots of small languages with small numbers of articles, but don't forget that the Encyclopedia Britannica in 2010 was also just 65,000 articles. Size matters, but usability is much more important, and it is the usability of the current input interface for article creation that is mostly at issue here, not size. If anything, a much more relevant image would be a comparison of article growth for those smaller Wikipedias which adopted the Wikidata infoboxes. As a Wikimedian experienced in article & item creation, I am curious whether the old opinion about infoboxes being scary for first-time users is true or not. Comparing English Wikipedia to any small Wikipedia is just comparing apples with oranges. Jane023 (talk) 16:40, 26 November 2018 (UTC)

I agree. If developed as a gadget to expand existing articles (see #Difference from article placeholder), this is actually more likely to be useful on larger projects, where more potential users exist. When I edit in English or Italian I often scan the corresponding articles and references in the French/German/Spanish Wikipedias; an automated nudge and summary in that sense could be useful for many. Nemo 11:02, 28 November 2018 (UTC)
We decided to focus on underserved communities, as they have different challenges than the large ones (e.g. the lack of editors). We believe focusing our attention and resources on those communities can be beneficial for all Wikipedians, and can lead at a later stage to deployment on other language Wikipedias. At this stage, however, we want to tailor it to the needs of those editors who are often overlooked because they are fewer in number. Knowledge gaps are a well-identified problem in Wikipedia and represent a high-priority problem also addressed by the Wikimedia Foundation annual plan [1]. Accordingly, the top image displays this maldistribution between different language Wikipedias. Agreeing with you that it is quality, not quantity, that matters: our tool encourages the creation of high-quality articles by providing users with references. --Frimelle (talk) 12:39, 29 November 2018 (UTC)

Comments on community needs

Hi! Thank you all for this submission. I'd like to explain our local way of working on Arabic Wikipedia when creating bot articles. That will probably help you to identify community needs.

- Bot articles need community approval (on the technical village pump); therefore, offering that tool to regular editors will not be allowed for us.

- Bot-generated articles are reviewed in 3 steps: structure (we search for the best structure to extract maximum data from Wikidata without losing readability, and we identify the minimum data needed from Wikidata to select potential articles), test (we create some articles by bot in a sandbox and try to enhance the structure based on further comments), and finally review of the bot-generated articles, created mainly by @جار الله:.

Structured articles already created on our Wiki:

  • We are now creating articles about actors (around 30,000 should be created)

Take a look at those examples: 1, 2, 3. Lists of films and series are generated from Wikidata if there is a translation in Arabic (see below how), the infobox with references comes from Wikidata, and external links are brought in from the well-known External links module. The first sentence also has references, taken from English or French Wikipedia.

  • Same as above but for years by country: we identified the paragraphs we wanted to include (events, films, books, politics, sports, births, deaths, ..) and the paragraph length (e.g. how many births per month to include, and how to select famous ones? In the end we based our selection on the number of interwikis from Wikidata; see the sketch after this comment)

Examples: 1975 in USA, 1975 in Tunisia, 1975 in Germany, in Japan, ..

  • Years BC have been created by bots using Wikidata: 187 BC, 787 BC, ..

Our main problem now is the absence of a tool for transliteration. Non-Latin-script Wikis need an open transliteration tool. Articles on soccer players have been created because we found an external website to translate names; the same goes for artists. We imported those names into Wikidata labels before creating articles. But what about obscure species of algae? Or villages in Samoa? Arabic content online is not available, and a transliteration is needed (from Latin for species, and from local or official languages for human settlements). I found some tools online (Google, Microsoft, Drupal) but I didn't test them (I'm not a programmer). --Helmoony (talk) 17:05, 26 November 2018 (UTC)
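A minimal sketch of the candidate-selection step described in the comment above: querying the public Wikidata Query Service for actor items that still lack an Arabic article, ranked by interwiki (sitelink) count. The occupation filter and the sitelink threshold are illustrative assumptions, not the actual bot's criteria.

```typescript
// Sketch: find actor items that have no Arabic article yet, ranked by
// interwiki (sitelink) count, the notability proxy mentioned above.
// The threshold of 20 sitelinks is an invented example value.
const ENDPOINT = "https://query.wikidata.org/sparql";

const QUERY = `
SELECT ?item ?itemLabel ?sitelinks WHERE {
  ?item wdt:P31 wd:Q5;            # instance of: human
        wdt:P106 wd:Q33999;       # occupation: actor
        wikibase:sitelinks ?sitelinks.
  FILTER(?sitelinks >= 20)        # keep widely covered subjects only
  FILTER NOT EXISTS {             # ...that have no Arabic article yet
    ?arwiki schema:about ?item;
            schema:isPartOf <https://ar.wikipedia.org/>.
  }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "ar,en". }
}
ORDER BY DESC(?sitelinks)
LIMIT 50`;

async function fetchCandidates(): Promise<void> {
  const url = `${ENDPOINT}?format=json&query=${encodeURIComponent(QUERY)}`;
  const res = await fetch(url);
  const data = await res.json();
  for (const row of data.results.bindings) {
    console.log(row.itemLabel.value, row.sitelinks.value);
  }
}

fetchCandidates();
```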

Wow, fascinating summary - thanks for posting! I totally agree about this transliteration tool - has it been proposed in the Wishlist survey? Good point about the "year articles". Someone mentioned recently that we need to create a set of properties to better describe these types of articles, which are more Wikipedia roundups than content from external sources. The same goes for technical articles like categories and templates, but lists are a different thing and often have their own sets of rules (what makes the list? what gets dropped?). Bot generation is of course very controversial, but I don't think that is what is going on here. I think this is more of a two-step process where the bot-generated content is then offered for human review and easier copy-editing. Jane023 (talk) 09:37, 27 November 2018 (UTC)
Thanks a lot for your summary of the very interesting work on Arabic Wikipedia! We will definitely get in touch with you during our planned interviews in the early phases of the project to get in-depth details about the process of article creation within each targeted community and about which techniques could be useful to adopt while creating this tool. While we will certainly look into the bot you mention, as it does something very related to our work in terms of gathering information from Wikidata, our approach differs in two important points: (1) our tool does not create articles, as bots do, but supports editors in the creation of high-quality articles (imagine the difference between bots and the content translation tool); i.e. the tool will not display any information on Wikipedia without an editor having worked on it. (2) We also gather information from online resources in the target language, which will help editors create more citations, something often criticised as missing from underserved Wikipedias. So overall, it is exactly the idea of copy-editing generated content, as Jane023 clarified in the comment above, which will lead to high quality in the content and presentation of each article created through the tool.
Regarding the topic of a tool for transliteration: we believe this is an excellent idea, which we have discussed and previously identified as a missing piece, too. It is not in the scope of this project, given that this problem is an open research question by itself and very language-dependent. But it is a topic we would be interested in working on in the future in a different context. --Frimelle (talk) 13:57, 29 November 2018 (UTC)

Feedback from Harej

Thank you for submitting this proposal. It is a very impressive proposal and I am excited at the opportunities that are here.

  • I am concerned about the plan to implement Scribe as a gadget. The gadgets infrastructure isn't really designed for gadgets that run on multiple wikis. I think it's possible, but you basically have to make sure that there is one wiki that hosts the code (with the other wikis loading that code) so that you don't end up with multiple, slightly different copies of the codebase – no one wants that outcome (see the sketch after this comment). The gadget will also need to support internationalization, and I am not sure there is a standard approach for doing so within gadgets. (I think it's been done before, but I don't know the details or whether the implementation details make sense.) Of course, if you have already figured out how to have a translatable gadget run on multiple wikis without causing problems, great!
  • I really like the idea of going beyond just basic facts from Wikidata and providing deeper contextual information that can be used in writing articles. However, I am wondering how exactly it will work in practice. One of the problems you highlight is that there's a shortage of information online in languages such as Arabic and Hindi. And if the Arabic-speaking Internet and Hindi-speaking Internet are anything like the English-speaking Internet, you need to be very careful in what you accept as a source. So I am wondering if you have certain repositories you will be focusing on for this feature.
  • What will the volunteer developers be responsible for? My concern is that volunteers and paid staff operate with different motives and incentives, and I wouldn't want to see the project fall behind because of unavailability of volunteers or volunteers reneging/falling behind on commitments.

Cheers, Harej (WMF) (talk) 00:24, 27 November 2018 (UTC)
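A minimal sketch of the single-host pattern described in the first point above, assuming the canonical copy lives on one central wiki; the host wiki and page title below are placeholders, not decisions from the proposal.

```typescript
// Sketch: a thin per-wiki loader stub. Every participating Wikipedia
// ships only this file; the actual Scribe code lives in one place.
// The host wiki (www.wikidata.org) and page title are placeholders.
declare const mw: any; // MediaWiki JS API, provided by the wiki runtime

mw.loader.load(
  "https://www.wikidata.org/w/index.php" +
    "?title=MediaWiki:Gadget-Scribe-core.js" +
    "&action=raw&ctype=text/javascript"
);
```

Each wiki's local gadget is then a one-line stub, so fixes land in a single codebase rather than in multiple, slightly different copies.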

Thanks a lot for your feedback and for raising such interesting points, we will reply below to each of them in the same order:
  • We decided on a gadget because e.g. ProveIt, which has a similar base idea to our project (in a very simplified way), also uses the gadget infrastructure across multiple language Wikipedias. We are aware of the issues that come with replicating code between multiple Wikipedias and will look into a good solution for our tool. However, we are not settled on the gadget either, if another way of deploying our tool seems a better fit. The gadget will, for now, give us the possibility to interact with the editors and quickly serve them a tool built on the underlying work. Should this prove unsustainable, we will change it to e.g. an extension in the course of the project.
    • The problem of internationalization is important to address. We will make sure to use existing tools and infrastructure, but we do not have an out-of-the-box solution for this yet (see the i18n sketch after this reply).
  • The question of the trustworthiness of sources is another important point. We thought of building a repository of sources we find acceptable, by collecting sources used before on this (or another language's) Wikipedia, or by calculating trustworthiness through redundancy of the information. But as we will not display the information right away and instead follow a human-in-the-loop approach, someone always judges the references. While some will be blacklisted anyway, the remaining references might be displayed to the editors and filtered by them. Which approach gives the best results is something we will have to explore as part of our studies.
  • When we find someone interested in working with us, we will decide with them, based on their interests and previous contributions, what they will work on. But of course, we will make sure that the project does not depend solely on them, as they might become unavailable or have other commitments. We want to emphasize including volunteers and doing outreach for them to ensure that the code follows the standards and is therefore maintainable after the project is finished.
--Hadyelsahar (talk) 15:02, 3 December 2018 (UTC)[reply]
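A minimal sketch of one possible gadget i18n approach mentioned in the second point, using MediaWiki's mw.messages registry with a fallback to English; the message keys and translations are invented examples, not part of the proposal.

```typescript
// Sketch: register translatable UI strings with MediaWiki's message
// registry and resolve them by the wiki's content language, falling
// back to English. Keys and translations are invented examples.
declare const mw: any; // MediaWiki JS API, provided by the wiki runtime

const translations: Record<string, Record<string, string>> = {
  en: { "scribe-add-section": "Add suggested section" },
  ar: { "scribe-add-section": "أضف القسم المقترح" },
};

const lang: string = mw.config.get("wgContentLanguage");
mw.messages.set(translations[lang] ?? translations["en"]);

// Later, anywhere in the gadget:
const label: string = mw.msg("scribe-add-section");
```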

Difference from article placeholder

Speaking of why to make it a gadget: it would be helpful if the proposal explained why this cannot be done as part of the article placeholder extension.

The whole point of the article placeholders, when the idea started, was filling red links with Wikidata. Automatic generation of prose from Wikidata information, à la Reasonator, was the biggest challenge for the article placeholder extension (and was for now sidestepped by opting for a tabular presentation). It was supposed to help users convert the article placeholder into an actual article with less effort than starting from a blank page.

On a first read, this project proposes an add-on to the initial article placeholder idea; in short (I simplify), that we can also automatically suggest a TOC and a references section in addition to the lead section content and infobox data. I would expect such a feature to be easiest to "sell" as part of the seed text created by ArticlePlaceholder when creating a new article (otherwise I'm still basically starting from zero text written). The only way I can see this being useful as a gadget is if you use it to suggest expansions of existing articles: "hey, this article doesn't deal with topic X, which seems important; do you want to check these sources X Y Z?". But then again, that would be most useful as a call to action to unregistered users or users who would not otherwise be contributing to the article, so I can rather imagine it being activated via an extension on all articles marked as stub in a certain category, or something like that. Nemo 09:06, 28 November 2018 (UTC)

I have been very involved in the development of the ArticlePlaceholder project, and am therefore happy to hear about it, too. In this project, however, we want to address editors in particular, to support them in their editing experience. We do not create articles, nor do outreach in the form of increasing article counts by automatic creation. The idea of suggesting new content based on an existing article is similar to the work of [Leila's paper]. The authors don't include references, but the idea is very similar. We, however, want to focus on the content gap, i.e. missing articles as a whole, which an editor wants to create supported by a tool that helps them with the structure and references. --Frimelle (talk) 12:39, 3 December 2018 (UTC)

More on reference digestion

A large part of this idea seems to be about machine learning on the references, of the kind initially tried by wikidata:Wikidata:Primary sources tool and more recently by Quicksilver. In this field, important projects are http://gettheresearch.org/ (planned) and https://www.semanticscholar.org/ (already useful). Nemo 09:06, 28 November 2018 (UTC)[reply]

Yes, indeed, part of our project is collecting references and presenting their corresponding key points to editors. Many of the pointers you mentioned are closely related in one aspect or another. Unlike the Primary sources tool, we deal with information that is more contextual than information which can only be displayed on Wikidata. The type of references offered to editors will cover a large scope of documents, not only scientific publications; hence the difference from Semantic Scholar and Get the Research. Finally, and most importantly, we focus on helping editors in under-resourced language communities in Wikipedia [see https://meta.wikimedia.org/wiki/Grants_talk:Project/Scribe:_Supporting_Under-resourced_Wikipedia_Editors_in_Creating_New_Articles#Not_just_for_small_wikis]. This makes our use case different from Quicksilver's, although it is very possible that much of the underlying research in both projects will be common. --Hadyelsahar (talk) 15:21, 3 December 2018 (UTC)

Handling cultural and linguistic differences

It would be interesting to know how you propose to handle suggestions that are culture-specific. In your example, the Arabic Wikipedia, it's easy to imagine some content and references being very politically sensitive. The same for Ukrainian or any language and topic where there is a vast body of literature on the "same things" which are however perceived as very different in different languages or places. Nemo 09:06, 28 November 2018 (UTC)[reply]

In an encyclopedia such as Wikipedia, the content is written from a neutral point of view, which refers to collecting information from different (trustworthy) primary and secondary sources; see also verifiability. We only support the editors in collecting those sources; we do not push the topics in either direction. We believe it is up to every Wikipedia community to decide how to work with those difficult topics, and we do not want to interfere with this decision process. Therefore, our references are only suggestions: the editor can decide what information they want to include or what topic needs more research. We do not create a comprehensive article for the editors. Further, I do not support the assumption that sources in different languages per se have topical biases. If a source is trustworthy, it will cover the topic thoroughly, independently of the original language. --Frimelle (talk) 12:39, 3 December 2018 (UTC)

Translation tool vs. approach

The community has built a set of tools to facilitate article creation, such as the Content Translation Tool. This tool enables editors to easily translate an existing article from one language to another.

What are the other tools? I was expecting a wider analysis rather than just one tool/approach.

The articles that can be translated are selected by their importance to the source Wikipedia community. Topics with significance to the target community do not necessarily have an equivalent in the source Wikipedia. In those cases, there is no starting point for using the content translation tool. It has been shown that the English Wikipedia is not the superset of all Wikipedias, and the overlap of content between the different languages is relatively small, indicating cultural differences in the content. Editors should be encouraged to avoid a topical bias and cover articles important to their communities.

Editors can translate from any language (they know), not just from English. If there isn't a source article in any language, then obviously it cannot be translated. It's not a limitation of the Content Translation tool, but a limitation of the whole translation approach.

Especially for underserved languages, machine translation is limited. There are few documents available online aligned with English, and even fewer aligned with other languages, on which a translation engine could be trained. This leads to the often-criticised quality.

Not all translation engines need corpora. Some small languages have good hand-built engines, although that is the exception and not the norm.

Monolingual speakers are disadvantaged. They cannot verify the translation in the context of the source language or the references used.

This again is a limitation of the approach, not of the tool. The section should be renamed to highlight that you are contrasting translation against structured/guided article creation. --Nikerabbit (talk) 11:18, 28 November 2018 (UTC)

Considering other tools and comparing to them: we focus our comparison on the content translation tool, as it is the closest one that (1) focuses on editors and (2) targets underserved-language editors. We are aware of a range of tools/techniques that support editors, e.g.: bots, importing infoboxes from Wikidata, Article placeholders, GapFinder, external tools (Google Translate, Quicksilver), and the Translate extension. Thus the only possible comparison is a use-case comparison, since we don't replace any of the existing tools but rather fill the gaps not covered by them. --Hadyelsahar (talk) 14:44, 3 December 2018 (UTC)
Considering the limitations of the content translation tool and the translation approach: in our proposal, by mentioning the content translation tool we mean the translation approach in general. Since the content translation tool is the translation tool most used by underserved communities on Wikipedia, we chose to highlight it as an example of the translation approach. We believe that the human-aided summarization approach in our proposal can fill the gaps not filled by the translation approach. --Hadyelsahar (talk) 14:44, 3 December 2018 (UTC)

Eligibility confirmed, round 2 2018

This Project Grants proposal is under review!

We've confirmed your proposal is eligible for round 2 2018 review. Please feel free to ask questions and make changes to this proposal as discussions continue during the community comments period, through January 2, 2019.

The Project Grant committee's formal review for round 2 2018 will occur January 3 - January 28, 2019. Grantees will be announced March 1, 2019. See the schedule for more details.

Questions? Contact us.

--I JethroBT (WMF) (talk) 03:16, 8 December 2018 (UTC)

Questions and concerns

Thank you for this interesting proposal. However, I have a number of questions/concerns:

  1. As written, the project aims at the creation of a tool with rather fabulous capabilities: it should automatically suggest the topic, structure and references, as well as the key points for all sections. Are the goals of this project realistic? Basically your tool, if created, will have AI-like capabilities.
  2. What are the computational requirements for this tool? Can it plausibly be run within an internet browser on a relatively slow computer (which is common in under-resourced countries)?
  3. It is unclear where the references will come from. You mention Wikidata but also some unspecified external online sources. So, are you going to somehow look for sources on the internet? And how will this be accomplished?

Ruslik (talk) 18:31, 11 December 2018 (UTC)[reply]

  1. Indeed, the proposed capabilities are ambitious. We have experience working in those fields as well as in integrating a tool into Wikipedia; therefore we are confident we can tackle the problems in the time stated. We have worked on similar projects in the past, combining research ideas with real-life problems, and the support of a developer will make sure the tool is implemented in a timely manner. Each of the problems represents a well-researched area. We aim to extend those areas with a focus on low-resource languages, not to reinvent the field. For clarification: while suggesting topics to editors is the natural extension of our work, it is not part of this proposal. Once the tool is implemented, however, it can easily be integrated with previous work on suggesting articles to create and edit, e.g. the GapFinder.
  2. Our computations will be split as follows: 1) client-side computations (i.e. in the browser), managing lightweight front-end functionality such as the generation of Wikitext, drag and drop, etc.; this should be very light and processable by any computer. 2) Service-based computations: the developed gadget will make API calls to a web service hosted on an external server, similar to what happens in other gadgets such as ProveIt [2]. The server side will be responsible for tasks such as querying and filtering references, calculating textual similarity between Wikidata entities for section suggestion, and performing extractive summarization. In order to reduce the online computation load, we intend to cache results for potential topics in each target language (existing Wikidata IDs without articles) before the service goes public.
  3. We are collecting references through a search engine. This will be either an open online API or a local index over the Common Crawl web corpus [3]. Based on their rankings, we can discover related documents. We are aware that there are quality criteria for sources on Wikipedia, which we want to follow to ensure the best possible quality. To start, we will use a whitelist of sources that have been used on this Wikipedia before, from which the editor can select (see the sketch after this reply). We will extend this work by studying better ways of ensuring the trustworthiness of sources, a topic widely covered in research. --Frimelle (talk) 18:51, 20 December 2018 (UTC)
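A minimal sketch of the client/service split and the whitelist step described above; the service URL, response shape, and whitelisted domains are all assumptions for illustration, not decisions from the proposal.

```typescript
// Sketch: the gadget asks a hypothetical Scribe web service for
// candidate references on a Wikidata item, then keeps only those
// from a whitelist of sources already used on the target Wikipedia.
interface Reference {
  url: string;
  title: string;
  snippet: string; // key points extracted server-side
}

const SERVICE = "https://scribe.example.org/api/references"; // placeholder

// In practice this list would be mined from existing citations on the
// target Wikipedia; these domains are placeholders.
const WHITELIST = ["aljazeera.net", "un.org", "bbc.com"];

function isWhitelisted(ref: Reference): boolean {
  const host = new URL(ref.url).hostname;
  return WHITELIST.some((d) => host === d || host.endsWith("." + d));
}

async function suggestReferences(qid: string, lang: string): Promise<Reference[]> {
  const res = await fetch(`${SERVICE}?item=${qid}&lang=${lang}`);
  const refs: Reference[] = await res.json();
  return refs.filter(isWhitelisted); // editors still review every suggestion
}

// Example: references for Douglas Adams (Q42) in Arabic.
suggestReferences("Q42", "ar").then((refs) =>
  refs.forEach((r) => console.log(r.title, r.url))
);
```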

FYI feature request Associate red links on Wikibase client wikis with items in the Wikibase repo

I guess this project will benefit from this feature request: T212211.
- Salgo60 (talk) 15:08, 21 December 2018 (UTC)[reply]

Questions about your budget

I have a couple of questions about your proposal's budget, Frimelle and Hadyelsahar.

  • Are the budgeted amounts sufficient for the work and time? The amounts seem a bit low, though I am also unclear where this work will happen, so they may not be.
  • I do not see any costs for office space or technology. How will these be accounted for?
  • I do not see any budget lines related to the editathons. Will there be costs associated with them?

Thank you in advance for your thoughts on these. --- FULBERT (talk) 20:42, 8 January 2019 (UTC)[reply]