Grants:Project/Information extraction and replacement

status: not selected
summary: Check whether it is possible to identify and extract factoids from Wikipedia by training an extractor for template filling, using relocated data provided by Wikidata, thereby making it possible to identify fragments for replacement within Wikipedia, but also to extend the text with references to found factoids on external pages.
target: All Wikipedia projects of reasonably large size.
type of grant: tools and software
amount: USD 24,100
type of applicant: individual
grantee: jeblad
contact: jeblad@gmail.com
created on: 17:41, 14 March 2017 (UTC)

Project idea

What is the problem you're trying to solve?

It is very time consuming to manually find facts (often called "factoids") on external pages and reference them. Without any tools the authors must formulate queries, repeatedly prod external search engines, and manually scan through the found pages in an attempt to identify the facts. Often the writer tries to avoid the whole problem by using a single source or a few renowned ones, often books, to get some diversity of facts. That usually leaves out newer facts that have not yet made it into printed books. It also makes it difficult for the reader to verify facts without buying the books.

In short, we need

  1. a better way to identify facts, and
  2. to automate the online searching for found facts.

This is often called information extraction, and the specific type described here is template filling.[1] The first point above is really about learning the extractor from our internal data, while the second is about applying the extractor to a set of external pages. The end result should be references to existing factoids.

(This problem is formulated as referencing an existing fact, but it is almost identical to injecting facts from external pages into existing articles. Creating a reference is, however, slightly easier, and is chosen as the outcome of this project.)
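
To make the template-filling idea concrete, here is a minimal sketch of what one filled template could look like as a data structure. The field names and example values are illustrative assumptions, not an existing format in MediaWiki or Wikidata.

```python
from dataclasses import dataclass

@dataclass
class FilledTemplate:
    """One extracted factoid expressed as slots (hypothetical structure)."""
    entity: str      # label of the subject; in practice a Wikidata item id
    predicate: str   # Wikidata property, e.g. "P2043" (length)
    value: str       # the value as it appears in the text
    fragment: str    # the text fragment that matched the extraction template
    source_url: str  # external page where the factoid was found

# A factoid found on an external page, ready to be turned into a reference.
example = FilledTemplate(
    entity="Mjøsa",
    predicate="P2043",
    value="117 km",
    fragment="Mjøsa is 117 km long",
    source_url="https://example.org/lakes/mjosa",  # placeholder URL
)
```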

What is your solution?

Learning an extractor is a rather big task, as it requires a large base of labeled samples. At the Wikimedia projects we are in a rather unique situation, as we have a large text base that is connected to a large fact base. The articles on Wikipedia have texts where some of the statements from Wikidata are given context. Not all facts from Wikidata can be found again in Wikipedia, and not all facts from Wikipedia can be found again in Wikidata, yet it might be possible to find a sufficient number of entries we can connect in both projects to build an extractor. This is a lot like feature learning in machine learning.
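
A minimal sketch of how such labeled samples could be collected, assuming the Wikidata statements and the article text have already been fetched; the function name, the plain string matching and the sentence splitting are naive placeholders, not the planned implementation.

```python
import re

def label_sentences(article_text, statements):
    """Naively align Wikidata statements with article sentences.

    `statements` is assumed to map a property id to its value rendered as
    text, e.g. {"P2043": "117 km"}.  A sentence containing the rendered
    value is kept as a weakly labeled sample for that property.  Real code
    would need unit handling, number formatting and better sentence
    splitting.
    """
    sentences = re.split(r"(?<=[.!?])\s+", article_text)
    samples = []
    for sentence in sentences:
        for prop, value in statements.items():
            if value in sentence:
                samples.append((prop, sentence))
    return samples

print(label_sentences("Mjøsa is 117 km long. It lies in Norway.",
                      {"P2043": "117 km"}))
# [('P2043', 'Mjøsa is 117 km long.')]
```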

Entities in Wikidata have statements that give values that reappear in articles on Wikipedia. Such values are instantiations (valuations) of the propositions. Those propositions are bound to text realizations that usually come in pretty standardized forms, that is, their context. A statement with a predicate for "length" (d:Property:P2043) can show up as a mention of "length" in an article. Those mentions of "length" in articles about lakes, fjords or rivers do not come in completely random forms; they conform to stringent grammatical rules. Those rules can be identified by analyzing text fragments over a lot of articles.

Some parts of the fragments can be recognized as referencing values of other predicates, and as such they will be marked accordingly. As we do not have any working anaphora resolution, the algorithm will fail a lot, but hopefully it will work sufficiently often that we can extract the necessary information and build basic templates. For example, we can find “Mjøsa is 117 km long”, but that cannot be generalized to “It is 117 km long”; we must use two different templates.
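
Under the assumption that the basic templates are realized as simple surface patterns, the sketch below shows why the two formulations need separate templates; the regular expressions are only illustrative, not the planned implementation.

```python
import re

# Two separate extraction templates for the same "length" statement (P2043):
# one anchored on the item label, one anchored on a pronoun.  Without
# anaphora resolution the second cannot be tied safely to an item, so in
# practice only the first kind would be usable initially.
NAMED_TEMPLATE = re.compile(r"(?P<label>[A-ZÆØÅ]\w+) is (?P<length>\d+(?:\.\d+)?) km long")
PRONOUN_TEMPLATE = re.compile(r"\b[Ii]t is (?P<length>\d+(?:\.\d+)?) km long")

print(NAMED_TEMPLATE.search("Mjøsa is 117 km long").groupdict())
# {'label': 'Mjøsa', 'length': '117'}
print(PRONOUN_TEMPLATE.search("It is 117 km long").groupdict())
# {'length': '117'}  -- no label, so the item must be resolved some other way
```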

Note that “templates” in information extraction are not the same as the templates on Wikipedia. Sometimes they are called scripts, but “script filling” does not give a good idea of how they are used.

There are also other situations where the simple solution will fail, like lists of values, but those can be solved with “subtemplates”. A more serious problem is that we don't have a working part-of-speech (POS) tagger. That sets a fairly hard limit on what we can do with our information extraction. We cannot generalize our findings; we can only find (and use) special cases.
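
As an illustration of the “subtemplate” idea for lists of values, here is one possible (assumed, not settled) way it could work: a main template captures the list part of the fragment, and a subtemplate is applied repeatedly to that part. The example sentence and patterns are constructed for illustration only.

```python
import re

# Main template matches the whole fragment; the subtemplate is then applied
# repeatedly to the captured list to produce one fill per value.
MAIN = re.compile(r"(?P<label>[A-ZÆØÅ]\w+) has the tributaries (?P<list>.+)")
SUB = re.compile(r"[A-ZÆØÅ]\w+")  # one list element (very naive)

match = MAIN.search("Mjøsa has the tributaries Gudbrandsdalslågen, Gausa and Hunnselva")
if match:
    fills = SUB.findall(match.group("list"))
    print(match.group("label"), fills)
# Mjøsa ['Gudbrandsdalslågen', 'Gausa', 'Hunnselva']
```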

See also Grants:Project/Information extraction and replacement/Intuition on templates.

Project goals

Simplify the lookup and referencing of factoids by making a special page for quality assurance of information extraction templates, and a special page for creating external references (or a replace script) for such found templates, thereby making it a lot faster to inject references.

(This is not about finding one, or a few sources, but finding hundreds or thousands of sources and injecting them into articles.)

Project impact

How will you know if you have met your goals?

The primary goal is to verify whether it is possible to use Wikidata to create extraction templates from Wikipedia; to do so, the special page for assuring the validity of the extracts is used. That is, the existence of a working special page for building extraction templates is the output.

The secondary goal is to facilitate continued positive outcomes by making it easier to create external references (or a more general replace script). That is, the existence of a working special page for replacing or extending found templates would be the sustained outcome, i.e. the reason why users on Wikipedia would want to use the solution.

Do you have any goals around participation or content?

During development there will be no specific need for participation. Later on, feedback on how to do half-automated editing will be necessary. It is also likely that there will be participation in documentation and user manuals.

Project plan

Activities

  • Build a limited engine for "template" extraction (bulk of the work)
    This can easily be very involved, using such techniques as lexical items, stemmed lexical items, shape, character affixes, part of speech, syntactic chunk labels, gazetteers or name lists (Wikidata items!), predicate tokens, bag of words and bag of N-grams; a rough sketch of such token features is given after this list. Note that we will probably have no POS tagger, and that will create some limitations.
  • Build a special page for interacting with the extract engine
    It is not quite clear where and how configurations should be saved, but it should be possible to save and load the configuration from this page.
  • Build a limited engine for "template" replacement
    Template replacement when referencing an external page is mostly just a call to mw:citoid.
  • Build a special page for interacting with the replace engine
    When this page saves a result, the actual page on-site is updated.
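
A rough sketch of the kind of token features mentioned in the first activity above; since no POS tagger or chunker is assumed to be available, those features are left out, and the “stem” is faked with a crude suffix strip. All names are illustrative, not part of any existing codebase.

```python
def token_features(tokens, i, gazetteer=frozenset()):
    """Features for the token at position i, roughly in the spirit of the
    feature list above: lexical item, crude stem, shape, character affixes,
    gazetteer membership, and neighbouring tokens (a poor man's N-gram).
    No POS tags or chunk labels, since no tagger is assumed to exist."""
    tok = tokens[i]
    shape = "".join("X" if c.isupper() else "x" if c.islower()
                    else "d" if c.isdigit() else c for c in tok)
    return {
        "word": tok.lower(),
        "stem": tok.lower().rstrip("saeren"),   # crude placeholder for stemming
        "shape": shape,
        "prefix3": tok[:3].lower(),
        "suffix3": tok[-3:].lower(),
        "in_gazetteer": tok in gazetteer,       # e.g. labels of Wikidata items
        "prev": tokens[i - 1].lower() if i > 0 else "<s>",
        "next": tokens[i + 1].lower() if i + 1 < len(tokens) else "</s>",
    }

tokens = "Mjøsa is 117 km long".split()
print(token_features(tokens, 0, gazetteer=frozenset({"Mjøsa"})))
```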

Budget

The estimated workload is about 6 person-months at 100 % for an experienced developer; or 12 person-months at 50 %. This workload estimation is based on the main developer's previous experience with similar projects.

Budget breakdown

Item | Description | Commitment | Person-months | Cost
Main developer | Developing and releasing proposed code | Part time (50 %) | 12 | USD 24,100
Total | | | | USD 24,100

There is no co-funding.

The item cost is computed as follows: the main developer's gross salary (including 35 % Norwegian income tax) is estimated from the pay given for similar projects using Norwegian standard salaries,[2] given the current exchange rate of 1 NOK = 0.1168 USD,[3] and a half year's full-time work.
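
As a rough sketch of the arithmetic behind the total (the annual gross salary below is an assumed round figure in the range of the cited salary step; it is not quoted directly from [2]):

```python
# Sketch of the budget arithmetic; the salary figure is an assumption.
annual_gross_nok = 412_700   # assumed annual gross salary (includes the ~35 % income tax)
work_fraction = 0.5          # half a year of full-time work (12 months at 50 %)
nok_to_usd = 0.1168          # exchange rate cited in [3]

cost_usd = annual_gross_nok * work_fraction * nok_to_usd
print(round(cost_usd, -2))   # 24100.0, consistent with the stated USD 24,100
```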

Community engagement

Other than code review, it is not expected that the community in general will participate very much in the initial development up to the baseline. It will, however, be possible for other developers to provide patches for the published code.

It is expected that it will be necessary to get feedback on the very limited UI, and to get help with translation of system messages. The messages are quite simple, even if they use a patchwork approach.

A few example configurations will be made, but those will be kept close to the bare minimum.

Get involved

Participants

Please use this section to tell us more about who is working on this project. For each member of the team, please describe any project-related skills, experience, or other background you have that might help contribute to making this idea a success.

Community notification

Please paste links below to where relevant communities have been notified of your proposal, and to any other relevant community discussions. You are responsible for notifying relevant communities of your proposal, so that they can help you! Depending on your project, notification may be most appropriate on a Village Pump, talk page, mailing list, etc. Need notification tips?

Endorsements

Do you think this project should be selected for a Project Grant? Please add your name and rationale for endorsing this project below! (Other constructive feedback is welcome on the discussion page).

  • Endorse. Any tools that might help the wiki-family in improving sources and references should be welcomed. GAD (talk) 21:48, 14 March 2017 (UTC)
  • Endorse - sounds interesting. Orphée (talk)
  • Endorse - +1 to GAD's comment. --Astrid Carlsen (WMNO) (talk) 17:47, 15 March 2017 (UTC)
  • Endorse - Not only great for finding new facts but also for comparing facts and finding quality issues in either Wikidata or a Wikipedia. Thanks, GerardM (talk) 09:56, 8 April 2017 (UTC)

References

  1. Jurafsky, Daniel; Martin, James H.: Speech and Language Processing, chapter "Information Extraction", pp. 739–778, ISBN 978-1-292-02543-8.
  2. Regjeringen.no: Lønnstabeller for arbeidstakere i staten – lønnstrinn 47, akademikere.
  3. DNB: Valutakalkulator – as of 14 March 2017.