Grants talk:Project/Harej/Librarybase: an online reference library

More details about citation extraction

I wonder what you mean exactly by "citation extraction" here. I understand you want to parse Wikipedia and extract reference metadata from it.

What is the scope of this extraction? I can think of various approaches:
- Finding identifiers in the wild, so essentially using regular expressions to match particular identifiers in the wikicode, as in Crossref's Event Data tool
- Extracting metadata from CS1-based templates (as in wikiciteparser)
- Extracting metadata from citations input as free text (as in bilbo, although it has not been adapted to Wikipedia), possibly restricting the search to things enclosed by <ref></ref>, for instance.
- Extracting metadata from outgoing links with scrapers such as Zotero/Citoid
- … or something else?
What tools / libraries do you plan to use? (I guess it would make sense not to start from scratch and contribute to the existing efforts.)
What is the metadata format you will use to store the extracted references?

− Pintoch (talk) 06:21, 3 August 2016 (UTC)

Hello Pintoch. The text of the proposal refers to DOI extraction (which has mostly already been done by EpochFail) and one additional strategy. We have not decided what that other strategy is yet; it will take some testing to figure out what the most reliable extraction technique is. All the strategies you mention sound like reasonable options. (Though I've done extensive metadata extraction with Citoid, and while it generally works well, I have gotten a significant number of errors from it.)

Regarding libraries, for DOI extraction we will use the aforementioned DOI extractor. (I am not sure how complete/bug-free it is, however.) We also have a rich ecosystem of tools at our disposal, including mediawiki-utilities (which can quickly go through a MediaWiki dump and extract content, among other things) and mwparserfromhell. Σ may have additional strategies in mind as well.

The metadata will be stored on Librarybase, a Wikibase instance, with data being stored on Wikidata as well. I also plan on creating an API for retrieving the data as an alternative to dealing with SPARQL and Wikidata's JSON output. harej (talk) 22:15, 5 August 2016 (UTC)
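
As a rough illustration of the identifier-matching approach discussed in this thread (regular expressions applied to wikicode, restricted to text enclosed by <ref></ref>), a minimal Python sketch could look like the one below. It uses only the standard library. The DOI pattern, the function name extract_dois, and the lowercase normalization are assumptions made for illustration; this is not the code of the DOI extractor mentioned above, and a real run would more likely rely on mwparserfromhell than on regular expressions for handling <ref> tags.

```python
import re

# Matches <ref>...</ref> and <ref name="x">...</ref>.
# Self-closing tags such as <ref name="x" /> carry no content and are skipped.
REF_RE = re.compile(r"<ref(?:\s[^>/]*)?>(.*?)</ref>", re.DOTALL | re.IGNORECASE)

# Deliberately loose DOI pattern (an assumption, not taken from any named tool).
# It stops at whitespace and at characters that usually end a template parameter.
DOI_RE = re.compile(r"\b10\.\d{4,9}/[^\s|}<\]]+")


def extract_dois(wikitext):
    """Return the distinct DOIs found inside <ref> tags, in order of first appearance."""
    found = []
    for ref_body in REF_RE.findall(wikitext):
        for match in DOI_RE.findall(ref_body):
            # Trim trailing sentence punctuation; lowercase so duplicates collapse.
            doi = match.rstrip(".,;").lower()
            if doi not in found:
                found.append(doi)
    return found


if __name__ == "__main__":
    sample = (
        "Text.<ref>{{cite journal |title=A |doi=10.1000/xyz123 |year=2016}}</ref> "
        'More.<ref name="b">See doi:10.1234/ABC.def.</ref>'
    )
    print(extract_dois(sample))
```

A loose pattern like this will produce some false positives and miss some unusual DOIs, which is one reason testing against real dumps would be needed before settling on a strategy.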

Some comments

The project states two problems: (1) Lack of a unified interface for the different databases of the Wikipedia Library (2) Citation extraction from Wikipedia and other Wikimedia projects. However, all proposed solutions are aimed only at the second problem, and little is said about the first. Can you clarify how you intend to solve the first problem? If you do not intend to solve it, should it be removed?
Have any comparable projects ever been realized, or is your project the first? If there were any, what were their results?
What exactly will the extraction tool look like? Will it be a bot or something else? The same question applies to the lookup tools.

Ruslik (talk) 16:39, 14 August 2016 (UTC)

Hello Ruslik0. First, my apologies for the delay in responding.

Regarding the first problem, the lookup tool included in the proposal will address it by associating publications with databases wherever possible. For example, if a given article is available in Elsevier—a Wikipedia Library partner—that piece of information would be stored.
As for comparable projects, I would compare this effort to an earlier effort to produce a list of scholarly article citations on Wikipedia. A list of the most referenced domain names on Wikipedia was also generated. This project builds on both by creating something meant to be continuously updated, rather than representing a snapshot in time.
Extraction will be handled through scripts that extract the metadata from Wikipedia articles and from other sources (such as CrossRef). The extracted data will be stored in a public, community-editable repository. My preference is to put as much into Wikidata as possible, consistent with the efforts of Wikidata's WikiProject Source MetaData, but for the data that does not meet Wikidata notability standards, there is also a Librarybase wiki. The lookup tool will be a web search interface with accompanying API.

Please let me know if you have any additional questions. harej (talk) 03:20, 31 August 2016 (UTC)

But it is still not a "unified interface representing these different partnerships". You should probably clarify how you are going to contribute to solving this problem or remove it altogether.
I meant how it would look from a technical point of view. Will it be a bot that edits Wikidata, or will your database be added to Wikidata by other means? What will the look-up tool for Wikidata look like? Will it be a Tool Labs hosted script, an extension, a user script or something else?

Ruslik (talk) 12:25, 31 August 2016 (UTC)

Ruslik, if it clarifies my above answer: I will work with the Wikipedia Library to get a continuously updated index of articles included in the databases of our partners. Ideally this would take the form of some kind of API that we could have a script periodically query for updates. Each entry in each database will be associated with corresponding Wikidata/Librarybase entries through standard identifiers such as digital object identifiers. This will be done through a script that queries the aforementioned APIs, Wikidata, and Librarybase. If the entry does not yet exist it will, at the very least, be created automatically in Librarybase. (More on the relationship between Librarybase and Wikidata below.) The objective here is to create a "master database" that is a superset of all the Wikipedia Library partner databases. This is in addition to the work of analyzing the contents of Wikipedia for use of citations. Between these two efforts we will have a substantial dataset that includes not just bibliographic metadata, but where to find full texts (both open access and closed access Wikipedia Library partners) and usage data where we have it. The lookup tool will be a web interface, hosted on Labs, that will be a convenient front-end for searching the data. (You could think of the Librarybase wiki as the model and the lookup tool as the view and controller, to use the terminology of the MVC architecture.) Since all of the holdings of the Wikipedia Library partners will be in one database, you can do one search and get results across the different databases.
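
To make the matching step described above more concrete, here is a minimal Python sketch of how partner holdings might be reconciled by DOI into one "master database" and then searched in one pass. Everything in it is a stand-in: plain dictionaries replace the partner APIs, Wikidata, and Librarybase, and the names reconcile, search, available_from, and the placeholder ID Q_EXAMPLE are hypothetical, chosen only for illustration.

```python
def reconcile(partner_holdings, wikidata_by_doi, librarybase):
    """Merge partner holdings into librarybase, keyed by DOI.

    partner_holdings: {partner_name: [{"doi": ..., "title": ...}, ...]}
    wikidata_by_doi:  {doi: wikidata_id} for entries that already exist on Wikidata
    librarybase:      {doi: entry dict}; updated in place and returned
    """
    for partner, records in partner_holdings.items():
        for record in records:
            doi = record["doi"].lower()  # DOIs are case-insensitive
            entry = librarybase.get(doi)
            if entry is None:
                # Entry does not exist yet: create it automatically.
                entry = {
                    "doi": doi,
                    "title": record.get("title"),
                    "wikidata_id": wikidata_by_doi.get(doi),  # None if not on Wikidata
                    "available_from": [],
                }
                librarybase[doi] = entry
            if partner not in entry["available_from"]:
                entry["available_from"].append(partner)
    return librarybase


def search(librarybase, term):
    """One search across all partners: case-insensitive substring match on titles."""
    term = term.lower()
    return [e for e in librarybase.values() if term in (e["title"] or "").lower()]


if __name__ == "__main__":
    holdings = {
        "PartnerA": [{"doi": "10.1000/A1", "title": "Citation practices"}],
        "PartnerB": [
            {"doi": "10.1000/a1", "title": "Citation practices"},
            {"doi": "10.1000/b2", "title": "Linked data"},
        ],
    }
    merged = reconcile(holdings, {"10.1000/a1": "Q_EXAMPLE"}, {})
    for hit in search(merged, "citation"):
        print(hit["doi"], hit["available_from"])
```

Keying on a standard identifier is what lets the same article held by two partners collapse into a single entry that lists both sources.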

Regarding Librarybase and Wikidata: Librarybase wiki entries include Wikidata identifiers where they exist, allowing for interoperability between the two databases. The specific implementation will probably require more thought, but my preliminary thoughts are that Librarybase should not duplicate what's already on Wikidata, since the data on Wikidata will probably be more up to date. Where there is a Wikidata entry, the Librarybase equivalent would store (a) the Wikidata ID number and (b) data that is not in scope for Wikidata, such as what Wikipedia page a source was cited on. (There might be a duplication of Wikidata on the Librarybase side if it makes automated lookups more feasible. But the data should be edited directly on Wikidata in this case; any changes on the Librarybase side would be overwritten.) Librarybase will also include bibliographic metadata for resources that do not meet Wikidata notability criteria, such as individual web pages.
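
The division of labor sketched above (Wikidata as the source of truth for shared fields, Librarybase holding the Wikidata ID plus out-of-scope data such as where a source is cited) could be modeled roughly as follows. This is only a sketch under assumptions: the field names, the MIRRORED_FIELDS tuple, the function refresh_from_wikidata, and the placeholder ID Q_EXAMPLE are all hypothetical, and the actual implementation "will probably require more thought", as noted.

```python
# Fields that, in this sketch, Librarybase may mirror from Wikidata for faster lookups.
# Local edits to these fields are not preserved across a refresh.
MIRRORED_FIELDS = ("title", "doi")


def refresh_from_wikidata(entry, wikidata_item):
    """Return a copy of a Librarybase entry with mirrored fields overwritten.

    Fields outside MIRRORED_FIELDS (for example "cited_on") are Librarybase-only
    data and are left untouched.
    """
    updated = dict(entry)
    for field in MIRRORED_FIELDS:
        if field in wikidata_item:
            updated[field] = wikidata_item[field]
    return updated


if __name__ == "__main__":
    entry = {
        "wikidata_id": "Q_EXAMPLE",          # link to the Wikidata item
        "title": "Title edited locally",     # mirrored field, will be overwritten
        "cited_on": ["Example article"],     # out of scope for Wikidata
    }
    item = {"title": "Title as stored on Wikidata", "doi": "10.1000/xyz123"}
    print(refresh_from_wikidata(entry, item))
```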

Please let me know if anything else should be clarified. harej (talk) 18:39, 1 September 2016 (UTC)

Ok, I think you need to clarify the proposal along the lines that you presented above. I also noticed that the project duration is not specified. Ruslik (talk) 16:44, 4 September 2016 (UTC)

Eligibility confirmed, round 1 2016

This Project Grants proposal is under review!

We've confirmed your proposal is eligible for round 1 2016 review. Please feel free to ask questions and make changes to this proposal as discussions continue during this community comments period.

The committee's formal review for round 1 2016 begins on 24 August 2016, and grants will be announced in October. See the schedule for more details.

Questions? Contact us.

--Marti (WMF) (talk) 18:03, 23 August 2016 (UTC)

Great project

Like the other commentators above, I think there are some questions that haven't fully been explored yet in this proposal (such as which part of the project becomes the first level of deliverables, and which strategies will work best for extracting the data). However, Harej has been very good, and has been deep in the conversations that came out of WikiCite -- so I have confidence that the right approach will be worked out. Like the work on structured data on Commons, the more likely challenges are going to emerge as this goes into practice and a community develops around using the data.

I also think this has some widespread repercussions on our relationship to the research and libraries communities: having a reliable set of linked data about our citations -- one of the underexplored areas of our content -- means that both more effective research and algorithmic/machine uses of Wikipedia could be pursued.

Note: in part, Harej started working on this space/project, because of questions/requests I made while working on the Wikipedia Library: so slight conflict of interest here.

Cheers, Astinson (WMF) (talk) 18:02, 25 August 2016 (UTC)

+1, rather excited about seeing what comes out of this project. ImperfectlyInformed (talk) 19:06, 7 October 2016 (UTC)

Finding Consensus for storing citation data on Wikidata

I personally think it would be very valuable to store all citation data on Wikidata. However, I don't think that a grant proposal is the right venue to make that decision. I would encourage you to write a request for comments on Wikidata that lays out the position that large amounts of citation data belong on Wikidata, so that the Wikidata community can make a decision about the issue. ChristianKl (talk) 18:22, 29 September 2016 (UTC)

ChristianKl, thank you for your comment. That is indeed the plan. harej (talk) 14:58, 5 October 2016 (UTC)

Aggregated feedback from the committee for Librarybase: an online reference library

Scoring rubric	Score
(A) Impact potential Does it have the potential to increase gender diversity in Wikimedia projects, either in terms of content, contributors, or both? Does it have the potential for online impact? Can it be sustained, scaled, or adapted elsewhere after the grant ends?	7.6
(B) Community engagement Does it have a specific target community and plan to engage it often? Does it have community support?	7.4
(C) Ability to execute Can the scope be accomplished in the proposed timeframe? Is the budget realistic/efficient? Do the participants have the necessary skills/experience?	8.1
(D) Measures of success Are there both quantitative and qualitative measures of success? Are they realistic? Can they be measured?	7.0
Additional comments from the Committee: Creating a reference database fits with Wikimedia's strategic priorities and its online impact will likely be significant. There is a reasonable assurance that the results of the project will be sustained. This will be a very useful tool should this happen. Citation is one of the most important things underpinning the credibility of Wikipedia, and hence a tool like this that suggests citations is very useful. Seems to be sustainable and possible for further development. This project will be useful for editors of English Wikipedia and potentially other Wikipedias and might significantly simplify their job. It can make a significant online impact by simplifying access to the sources and it can be replicated elsewhere. It’s not clear this is a real solution to a problem. I understand that the idea is to have verifiability of some citations using The Wikipedia Library. But in my opinion this is not a solution. The project will raise the references collection to a higher level. The grantees have clear goals, reasonable measures of success and a realistic plan. The project will likely have a significant long-term impact with modest investment. I would love to see numeric measures of success (e.g. what is the target for Percentage of public library organizations reached through national communications campaign: 1% or 50%?) but overall the approach is convincing and not very risky given the budget and potential impact. The approach is conservative and shows thoughtfulness. The duration of the project is not clearly specified but is probably six months (too short?). The budget looks realistic and participants have the necessary skills. However, the scope of the proposed look-up tool should be better defined - will it be capable of searching all Wikipedia Library databases or not? Seems reasonable compared to the report from WikiProjectX. Grantees have very good experience and good advisors. They are very likely to be able to execute this project and reach their goals. Looks like they've already gotten some good feedback from people. I like that this is the outgrowth of the WikiCite conference. The project outlines a good plan for community engagement. The proposal has reasonable community support. I see real community interest on the talkpage and phab tickets of this topic. Community engagement is good, but on the low side. There are comments from a number of users, but also quite a lot of concerns from the community that should be addressed. The project should be funded provided that the duration of the project is clearly stated and the scope of the reference look-up tool is defined better.

This proposal has been recommended for due diligence review.

The Project Grants Committee has conducted a preliminary assessment of your proposal and recommended it for due diligence review. This means that a majority of the committee reviewers favorably assessed this proposal and have requested further investigation by Wikimedia Foundation staff.

Next steps:

Aggregated comments from the committee are posted above. Note that these comments may vary, or even contradict each other, since they reflect the conclusions of multiple individual committee members who independently reviewed this proposal. We recommend that you review all the feedback and post any responses, clarifications or questions on this talk page.
Following due diligence review, a final funding decision will be announced on Thursday, May 27, 2021.

Questions? Contact us at projectgrants wikimedia · org.

Harej and Σ, our interview with you was part of the due diligence process (I'm getting these comments posted late). You are still welcome to record any response you have to committee comments on your talkpage to make your feedback publicly accessible, but I think we gathered the information we needed for the committee during our talk with you. Best regards, Marti (WMF) (talk) 18:44, 2 October 2016 (UTC)

Round 1 2016 decision

Congratulations! Your proposal has been selected for a Project Grant.

The committee has recommended this proposal and WMF has approved funding for the full amount of your request, $20,000 USD.

Comments regarding this decision:
The committee is pleased to support your efforts to improve the citation ecosystem for Wikipedia and develop structured bibliographic data around citations. We recognize the potential of this work and look forward to working with you to realize the goals you’ve outlined in your proposals.

Next steps:

You will be contacted to sign a grant agreement and set up a monthly check-in schedule.
Review the information for grantees.
Use the new buttons on your original proposal to create your project pages.
Start work on your project!

Upcoming changes to Wikimedia Foundation Grants

Over the last year, the Wikimedia Foundation has been undergoing a community consultation process to launch a new grants strategy. Our proposed programs are posted on Meta here: Grants Strategy Relaunch 2020-2021. If you have suggestions about how we can improve our programs in the future, you can find information about how to give feedback here: Get involved. We are also currently seeking candidates to serve on regional grants committees and we'd appreciate it if you could help us spread the word to strong candidates--you can find out more here. We will launch our new programs in July 2021. If you are interested in submitting future proposals for funding, stay tuned to learn more about our future programs.