Grants talk:Project/Information extraction and replacement

From Meta, a Wikimedia project coordination wiki

Examples to clarify what this is supposed to do[edit]

The project lacks clarity. Please provide examples showing what this is supposed to do: examples of where the information is supposed to come from, where it is supposed to be added, and how it is to be added. Is this supposed to be entirely bot editing, human-controlled editing, or a combination? Please "simulate" examples that show how this is meant to work and what the end result will look like, and show the possible steps in the process leading to that result. --ツ regards. Dyveldi ☯ prat ✉ post 09:35, 15 March 2017 (UTC)[reply]

Thank you for your interest. This is a limited functional description, not a final design document. The proposal says several times that this will be accessed through special pages, and special pages are used by humans. Note, though, that since this is only a very coarse functional description, it is not a given that the page will be static like the special pages we use today. There are links to background on the topic, including a reference to the (perhaps) most renowned book in the field. — Jeblad 11:21, 15 March 2017 (UTC)[reply]

Compare previous proposal[edit]

See Grants:Project/Jeblad/Better support for imported data in wikitext. Glrx (talk) 21:48, 17 March 2017 (UTC)[reply]

I would decline this proposal on similar grounds to the earlier proposal's decline. In addition, this proposal is more vague than the earlier proposal. There's no concrete description of the extraction. There's no clear way to evaluate the results. Even if parallel statements can be found in Wikidata and en.WP, the proposal does not describe how such parallels can be exploited.
The use of "factoid" (a trivial item of information) seemed like a language error at first, but "factoid" may be apt. I'm not particularly interested in the elevation of various lakes, and somebody reading a list of such elevations would put me to sleep. On the other hand, I an en.WP infobox template for lakes could grab lake elevation data directly from Wikidata. There's a simpler, more direct, route to using the information.
Glrx (talk) 22:02, 20 March 2017 (UTC)[reply]
Information extraction is pretty well-defined, and the methods described are available in the literature. It is probably best if you check out the sources, especially the book mentioned in the proposal. — Jeblad 13:20, 1 April 2017 (UTC)[reply]

Notes.

Glrx (talk) 15:33, 27 April 2017 (UTC)[reply]

This proposal is about what is called template filling or slot filling. You find it in ch. 21.5 of 21 Information extraction. The fancier methods can't be used, as we can't find referring expressions; that would have been done in Grants:Project/Jeblad/Better support for imported data in wikitext.
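To make the simple mode concrete, here is a minimal sketch of pattern-based slot filling in Python. The pattern, the elevation example, and the function name are hypothetical illustrations, not part of the proposal; in practice such patterns would be built from aligned claim/page examples rather than hand-written.

    import re
    from typing import Optional

    # A minimal, hypothetical sketch of template/slot filling: a hand-built
    # surface pattern pulls one slot value out of raw text. Real extractors
    # would be derived from examples, not hand-coded like this.

    # Hypothetical pattern for an "elevation" slot.
    ELEVATION_PATTERN = re.compile(
        r"(?:surface )?elevation of (?P<value>[\d,.]+)\s*(?P<unit>metres|m|feet|ft)",
        re.IGNORECASE,
    )

    def fill_elevation_slot(text: str) -> Optional[dict]:
        """Return a filled slot {value, unit} if the pattern matches, else None."""
        match = ELEVATION_PATTERN.search(text)
        if match is None:
            return None
        return {"value": match.group("value"), "unit": match.group("unit")}

    print(fill_elevation_slot("The lake has a surface elevation of 123 m."))
    # {'value': '123', 'unit': 'm'}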
I'm not sure what you are wondering about regarding Grants:IEG/Lua libs for behavior-driven development, or if there is a question there, but you find it as mw:Help:Pickle. No, it is not done yet, but it is pretty close. It was not a three-month project at all. — Jeblad 22:12, 27 April 2017 (UTC)[reply]

Is extracting from Wikipedia the right goal?[edit]

On Wikidata we have a lot of authority control references for many of our items. Every week we add multiple new authority control references, and most have a formatter link. It would be great if the information stored in the page linked via the authority control formatter could be extracted.
Given that Wikipedia doesn't want to import Wikidata facts that are imported from Wikipedia I think this would be more useful. ChristianKl (talk) 14:56, 22 March 2017 (UTC)[reply]
This isn't about extracting information from a specific page, but a general tool for extracting specific information from a large set of (external) pages. Once the extractors are found, though, it is also possible to run them on some specific (internal) page(s). — Jeblad 13:29, 1 April 2017 (UTC)[reply]
At the moment I find it hard to imagine the end-user experience of using your tool. You don't describe a user experience that makes it clear users would welcome doing whatever you want them to do on your new special page. If your extractor could run on the pages that are currently referenced with an external-id and dump all the claims that can be extracted from those pages into the primary sources tool, I think that would be a use case that warrants funding your tool. It would allow a lot of new facts to be entered into Wikidata and then automatically imported into templates. On the other hand, information that you extract from Wikipedia and then store in Wikidata can't be imported into other Wikipedias, because it only has a Wikipedia reference. ChristianKl (talk) 15:07, 29 May 2017 (UTC)[reply]

Eligibility confirmed, round 1 2017[edit]

This Project Grants proposal is under review!

We've confirmed your proposal is eligible for round 1 2017 review. Please feel free to ask questions and make changes to this proposal as discussions continue during the community comments period, through 4 April 2017.

The committee's formal review for round 1 2017 begins on 5 April 2017, and grants will be announced 19 May. See the schedule for more details.

Questions? Contact us.

--Marti (WMF) (talk) 19:53, 27 March 2017 (UTC)[reply]

Dear jeblad,

Thank you for submitting this proposal. I have three quick pieces of feedback:

  1. Though I am marking your proposal as eligible in terms of the content of the proposal itself, we cannot actually fund your project until your IEG project is in compliance, which will require you to submit your Final Report, or to reformat your project with a new end date so that your Final Report deadline is extended. Please contact me if you need support with either of these actions.
  2. Because of the collaborative nature of the Wikimedia projects, we ask all of our applicants to seek feedback from relevant members of the community and ask them to comment on their proposals with support and/or concerns. In the case of Wikidata-related proposals, we especially look for support from the Wikidata team.
  3. Among other steps in our review of grant proposals, we check to see if applicants have reflected back the concerns of the people who have commented on their talkpage, demonstrating that they both (a) seek to understand what matters to the people who have provided feedback and (b) are seeking to address it in some way (even if that is by taking care to explain why they think differently). I would recommend that you think about how you can provide more information in your responses so far on this page to help others understand your intentions. Though it can be useful to direct people to the book referenced in your proposal for more clarity, it's unlikely that our volunteer committee members will have time to do that kind of investigation. It would be desirable for you to summarize the main points you want people to understand directly here on the talkpage.

Let me know if you have any questions.

--Marti (WMF) (talk) 01:35, 3 April 2017 (UTC)[reply]

Identification of errors[edit]

This proposal has an implicit effect on articles: it will identify errors, since erroneous claims cannot be found in external pages, or, even worse, it can find false pages containing those errors. The purpose is to find references, but this fails if the claim (proposition) is erroneous, and can also succeed for false claims if there are false pages. Because of this it will generate a lot of discussion about those articles, and about whether the tool itself is correct. Those discussions will be heated, and it is highly likely they will create an impression of a tool of disputed quality. Compare, for example, the discussions about ContentTranslation and Wikibase Quality.

The discussions about CT and WQ make me wonder if the communities are ready for this kind of tool. — Jeblad 09:03, 13 April 2017 (UTC)[reply]

Hi, my question is closely related to Jeblad's, so I'm posting it here. I'm not sure what the end purpose of this grant is. It looks like you want to add references to Wikipedia articles: but won't your method be very vulnerable to "false news", i.e. looking only for references that support the claim made in Wikipedia, instead of also looking for contradictions and nuances? Léna (talk) 08:58, 16 April 2017 (UTC)[reply]
This is mostly about creating the tools for building the extraction patterns. Later on those can be used for extracting additional facts (aka "values") from external pages, or to identify possible sources for existing facts. It is possible to look for variations, and then use statistical inference to pick the likely ones; that is a bit further ahead. Still, it will not find contradictions or failed complex arguments. — Jeblad 16:00, 16 April 2017 (UTC)[reply]
Note that there are two modes here. One is to find an external claim that supports the internal claim and its property value; the other is to find all external claims and their property values. Statistics can be calculated for the latter mode, and then it is possible to mark internal claims with citation needed. Different statistics must be made for each datatype, not each property, so the number of different types of statistics isn't that high. It is also worth noting that these are references for grounding the premises, not the conclusion of the argument, and an argument can still be valid even if a grounding premise is false. — Jeblad 21:15, 5 June 2017 (UTC)[reply]
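A rough sketch of the latter mode, in Python. The helper name and the threshold are assumptions for illustration (in practice the statistics would be chosen per datatype, as noted above): all external values extracted for one claim are tallied, and the internal claim is flagged when too few external pages agree.

    from collections import Counter

    # Hypothetical sketch: tally every value the extractors found on external
    # pages, then decide whether the internal claim has enough support.

    def support_statistics(internal_value, external_values):
        """Fraction of external extractions agreeing with the internal claim."""
        counts = Counter(external_values)
        total = sum(counts.values())
        return {
            "support": counts[internal_value] / total if total else 0.0,
            "most_common": counts.most_common(3),
            "n_sources": total,
        }

    stats = support_statistics("123 m", ["123 m", "123 m", "124 m"])
    if stats["support"] < 0.5:  # threshold is an assumption, set per datatype
        print("mark internal claim as citation needed")
    else:
        print("supported by {:.0%} of {} sources".format(stats["support"], stats["n_sources"]))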

About the simplification[edit]

This follows the usual, pretty dumb implementation of template-based extraction, so that the solution can be implemented within a reasonable time frame. A much better solution is probably to use a convolutional neural network over clustered word vectors. That makes it possible to learn more general patterns, thereby extracting more facts from the same number of pages, or maintaining precision with fewer available examples. The trade-off is whether the problem is solvable within a reasonable time frame. — Jeblad 23:22, 26 April 2017 (UTC)[reply]
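As a rough illustration of what such a network could look like, here is a minimal PyTorch sketch. All sizes, the two-class inside/outside labeling, and the omission of the word-vector clustering step are assumptions for brevity; this is not a design from the proposal.

    import torch
    import torch.nn as nn

    # Hypothetical sketch of the CNN idea: convolutions over word vectors,
    # predicting per token whether it is part of a fact to be extracted.

    class ConvExtractor(nn.Module):
        def __init__(self, vocab_size=10000, embed_dim=64, hidden=128):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, embed_dim)
            self.conv = nn.Conv1d(embed_dim, hidden, kernel_size=5, padding=2)
            self.out = nn.Linear(hidden, 2)  # two classes: inside/outside a fact span

        def forward(self, token_ids):                  # (batch, seq_len)
            x = self.embed(token_ids).transpose(1, 2)  # (batch, embed_dim, seq_len)
            h = torch.relu(self.conv(x))               # (batch, hidden, seq_len)
            return self.out(h.transpose(1, 2))         # (batch, seq_len, 2) logits

    model = ConvExtractor()
    logits = model(torch.randint(0, 10000, (1, 20)))   # one 20-token sentence
    print(logits.shape)  # torch.Size([1, 20, 2])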

It could be even better (more general) with an RNN. The idea would be to learn property-specific extractors for the whole of Wikipedia, by predicting the location of the property value inside the text. Those extractors could then be run on external pages later on. CNNs are cheaper in use (fewer calculations after learning, probably also fewer during learning), but RNNs could have better precision. — Jeblad 20:36, 5 June 2017 (UTC)[reply]
Note that use of CNNs/RNNs could simplify the user interface, as the user then only has to verify whether the correct property value is found. Even that can be simplified, as a sufficiently low false rate will be absorbed by correct predictions. After a first extraction the user could reject or correct wrong predictions, which would be much faster than trying to find good patterns. — Jeblad 20:44, 5 June 2017 (UTC)[reply]
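In the same hedged spirit, a minimal sketch of the RNN variant: a bidirectional LSTM tags each token of a sentence as inside or outside a given property's value. One such extractor would be trained per property; the sizes and names are assumptions.

    import torch
    import torch.nn as nn

    # Hypothetical sketch of the RNN variant: a BiLSTM tagger predicting,
    # per token, whether it belongs to the value of one specific property.

    class RnnExtractor(nn.Module):
        def __init__(self, vocab_size=10000, embed_dim=64, hidden=128):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, embed_dim)
            self.lstm = nn.LSTM(embed_dim, hidden, batch_first=True, bidirectional=True)
            self.out = nn.Linear(2 * hidden, 2)  # inside/outside the property value

        def forward(self, token_ids):             # (batch, seq_len)
            h, _ = self.lstm(self.embed(token_ids))
            return self.out(h)                    # (batch, seq_len, 2) logits

    model = RnnExtractor()
    print(model(torch.randint(0, 10000, (1, 30))).shape)  # torch.Size([1, 30, 2])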

Round 1 2017 decision[edit]

This project has not been selected for a Project Grant at this time.

We love that you took the chance to creatively improve the Wikimedia movement. The committee has reviewed this proposal and not recommended it for funding. This was a very competitive round with many good ideas, not all of which could be funded in spite of their many merits. We appreciate your participation, and we hope you'll stay engaged in the Wikimedia context.


Next steps: Applicants whose proposals are declined are welcome to consider resubmitting their applications in the future. You are welcome to request a consultation with staff to review any concerns with your proposal that contributed to the decline decision, and to help you determine whether resubmission makes sense for your proposal.

Over the last year, the Wikimedia Foundation has been undergoing a community consultation process to launch a new grants strategy. Our proposed programs are posted on Meta here: Grants Strategy Relaunch 2020-2021. If you have suggestions about how we can improve our programs in the future, you can find information about how to give feedback here: Get involved. We are also currently seeking candidates to serve on regional grants committees, and we'd appreciate it if you could help us spread the word to strong candidates; you can find out more here. We will launch our new programs in July 2021. If you are interested in submitting future proposals for funding, stay tuned to learn more about our future programs.

Aggregated feedback from the committee for Information extraction and replacement[edit]

Scoring rubric (committee scores)

(A) Impact potential: 4.4
  • Does it have the potential to increase gender diversity in Wikimedia projects, either in terms of content, contributors, or both?
  • Does it have the potential for online impact?
  • Can it be sustained, scaled, or adapted elsewhere after the grant ends?
(B) Community engagement: 3.6
  • Does it have a specific target community and plan to engage it often?
  • Does it have community support?
(C) Ability to execute: 4.4
  • Can the scope be accomplished in the proposed timeframe?
  • Is the budget realistic/efficient?
  • Do the participants have the necessary skills/experience?
(D) Measures of success: 3.5
  • Are there both quantitative and qualitative measures of success?
  • Are they realistic?
  • Can they be measured?
Additional comments from the Committee:
  • While the development of an extractor tool is an interesting project in its own right, the proposal does not present a clear rationale for why such a tool would be of value to the Wikimedia community or would result in measurable improvement to the quality of the Wikimedia projects.
  • The ultimate impact of having facts extracted from Wikipedia is not clearly explained.
  • The project may fit with Wikimedia's strategic priorities and may have a potential for online impact. However, the project is rather vague in its goals, so it's impossible to draw any firm conclusions. The long-term sustainability is unclear.
  • May have an influence on the quality of Wikidata and all Wikipedia content, but it is not clear what the exact impact will be.
  • The goals of the project are vague, and no specific targets or measures of success are proposed.
  • The project is a re-iteration of a previous proposal by the same applicant (Better_support_for_imported_data_in_wikitext) and suffers from the same problems: the goals are very vague and clear measures of success are lacking.
  • This is a somewhat innovative solution, but it lacks any clear measures of success.
  • No schedule of activities or detailed project plan is presented; consequently, it is difficult to assess whether the proposed 12-month effort is realistic.
  • Budget is OK, but I am concerned by the participant's profile.
  • There appears to be very limited community engagement and support. The project does not appear to have any relevance to (or impact on) diversity.
  • The community involvement is low. As I said when reviewing the previous grant application, the applicant should ask the community if such data extraction tools are really necessary.
  • Grantee wrote: "The discussions about CT and WQ make me wonder if the communities are ready for this kind of tool." I am concerned that there is no serious support for this tool, nor an understanding of what it will do.
  • No specific target community (not clear who will benefit from this development), limited community support.
  • The project is interesting, but appears to have limited relevance to the Wikimedia projects. Redefining the scope of the project in a way that more directly leads to Wikimedia project improvement would be beneficial.
  • Glrx has stated my concerns almost exactly: https://meta.wikimedia.org/w/index.php?title=Grants_talk:Project/Information_extraction_and_replacement&oldid=16472193#Compare_previous_proposal. Additionally, they want to create a special page and have no exposure to MW development or code review.
  • Decline for the same reason as the previous proposal: goals are vague, measures of success are lacking. Lack of good standing of the applicant in WDME.
  • Looks like a useful tool.
  • It's unclear whether this will have any impact; it lacks clear outcomes and measures of success (i.e. I can't understand what exactly will be delivered to which community or communities, and how they are expected to use it).