Grants:Project/Harej/Librarybase: an online reference library
What is the problem you're trying to solve?
Explain the problem that you are trying to solve with this project or the opportunity you’re taking advantage of. What is the issue you want to address? You can update and add to this later.
In Wikipedia, verifiability means that anyone using the encyclopedia can check that the information comes from a reliable source. Wikipedia does not publish original research. Its content is determined by previously published information rather than the beliefs or experiences of its editors. Even if you're sure something is true, it must be verifiable before you can add it.
Wikipedia:Verifiability on English Wikipedia
Linking is a small act of generosity that sends people away from your site to some other that you think shows the world in a way worth considering. [...]
[Sources] that are not generous with linking [...] are a stopping point in the ecology of information. That’s the operational definition of authority: The last place you visit when you’re looking for an answer. If you are satisfied with the answer, you stop your pursuit of it. Take the links out and you think you look like more of an authority.
Weinberger (2012), "Linking is a public good", via WikiCite 2016 slidedeck
Wikipedia provides knowledge and its provenance. Its reliability as a general reference is made possible by—and can be verified through—citation to authoritative sources. In 2012, The Wikipedia Library (TWL) was founded to make access to these reliable sources more widely available to the community. TWL has enjoyed considerable success, with 55 partnerships making sources available in five different languages, and with branches available on 22 different Wikipedia language editions. However, because these partnerships are all with separate organizations, users need to do searches in each individual database. There is no unified interface representing these different partnerships.
Citations on Wikipedia also present a more general problem. Because they are stored as plain text along with the rest of the article, automated lookups and analyses of these sources are difficult. If structured data about the sources Wikipedia uses were available, it would be possible, for example, to develop tools that recommend sources to editors or to perform reference audits.
What is your solution?
If you think of your project as an experiment in solving the problem you just described, what is the particular solution you're aiming to test? You will provide details of your plan below, but explain your main idea here.
In September 2015, Alex Stinson approached WikiProject X about a tool to recommend references to WikiProject participants. To make this possible, I recommended setting up a Wikibase instance to catalogue the usage of sources on Wikipedia. (Wikibase is the software used by Wikidata to store and present structured data.) This would be complemented by a script to populate this Wikibase based on the citations that appear on English Wikipedia. Thus Librarybase was created. With the help of volunteer Bomarrow1, thousands of items were created, corresponding to articles, publications, and authors that appear in Europe PMC. This also included the Wikipedia articles where these articles were cited. The work on Librarybase mirrors that of WikiProject Source MetaData on Wikidata, which aims to create a similar database but through Wikidata.
Some work on extracting source metadata from Wikipedia has already been done at WikiCite 2016, a meeting held in Berlin. This strategy mostly focuses on extracting DOIs from articles and then cross-referencing them with CrossRef and similar services. However, its main output, an index of sources and where (and when) they appear(ed) on Wikipedia, has not been delivered yet.
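As a rough illustration of the DOI-based strategy described above, the following Python sketch pulls DOI-like strings out of raw wikitext and builds a small index from each DOI to the pages that cite it. The regular expression is a common heuristic rather than a full DOI grammar, and the function names `extract_dois` and `build_index` are hypothetical. The sketch does not record revision timestamps or query CrossRef, and it should not be read as the WikiCite tooling itself.

```python
import re

# Heuristic for modern DOIs: "10." + registrant code + "/" + suffix.
# The suffix stops at whitespace and at characters that usually end a
# template parameter or link in wikitext. This is an assumption, not a spec.
DOI_PATTERN = re.compile(r'10\.\d{4,9}/[^\s|}\]<>"]+')


def extract_dois(wikitext):
    """Return the distinct DOIs in a page's wikitext, in order of first appearance."""
    found = []
    seen = set()
    for match in DOI_PATTERN.finditer(wikitext):
        doi = match.group(0).rstrip(".,;")  # drop trailing sentence punctuation
        key = doi.lower()                   # DOIs are case-insensitive
        if key not in seen:
            seen.add(key)
            found.append(doi)
    return found


def build_index(pages):
    """Map each lower-cased DOI to the sorted titles of the pages citing it.

    `pages` is a dict of {page title: wikitext}.
    """
    index = {}
    for title, text in pages.items():
        for doi in extract_dois(text):
            index.setdefault(doi.lower(), set()).add(title)
    return {doi: sorted(titles) for doi, titles in index.items()}
```

Each DOI in such an index could then be looked up in CrossRef or a similar service to retrieve bibliographic metadata. Citations without a DOI would need the additional extraction strategies this proposal aims to develop.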
For this project, we would like to:
- Complete the extraction tools currently under development to allow citation extraction beyond just DOI.
- Work with the relevant communities to finalize (to the extent necessary to get work done) the ontologies used to represent various sources, especially in cases like books which are more difficult to model.
- Migrate the data to the Librarybase wiki or to Wikidata as deemed appropriate.
- Generate reports necessary to help volunteers clean up the aggregated metadata.
- Develop a lookup service to help Wikipedia editors perform queries on the metadata.
- Generate source recommendations for WikiProjects, per Phab:T111066.
Explain what are you trying to accomplish with this project, or what do you expect will change as a result of this grant.
Our goal is to deliver software tools:
- To perform reference extraction on English Wikipedia as described above, using DOIs and at least one other strategy;
- To migrate this data to Wikidata or the Librarybase wiki;
- To allow users to perform lookups of this data;
- To generate source recommendations for at least five WikiProjects.
Tell us how you'll carry out your project. What will you and other organizers spend your time doing? What will you have done at the end of your project? How will you follow-up with people that are involved with your project?
This project is divided broadly into consultation, strategy, and implementation phases. We expect each phase to last two months for a total of six months for the project.
Before starting work on tool development, we will want to consult with our stakeholders. This includes working with the Wikidata community, namely WikiProject Source MetaData and WikiProject Books, to resolve outstanding questions from the work session at WikiCite. While the data model may change and evolve over time, in order to proceed we need at least a workable proof of concept that allows us to get started on this work. This will also help address concerns that the WikiCite working group has not been working closely enough with the Wikidata community.
We will also want to come to a common understanding as to what is added to Wikidata directly and what is instead relegated to the Librarybase wiki. While the Librarybase wiki is intended to be compatible with Wikidata, it would be better to insert data directly into Wikidata, since that would make cross-referencing with other Wikidata items much more feasible. However, the Librarybase wiki remains an option for additional data that is either too granular for Wikidata or simply not in scope. For example, data about which Wikimedia projects use what sources will be stored in Librarybase, since metadata about Wikipedia articles is not in scope for Wikidata.
Since this work is intended to benefit the Wikipedia community primarily, we will also want to work with Wikipedia editors who would be interested in this work. This includes the current customers and volunteers of the Wikipedia Library as well as WikiProjects that are interested in reference recommendations. (We intend to work with current WikiProject X pilot projects.) Following the strategy used by WikiProject X to gather feedback about WikiProjects, we will ask users what they find convenient or frustrating about their current experience of looking up sources to use in articles. We cannot guarantee we will act on each point raised, but collecting this information will help us prioritize our work.
If possible, we would also like to work with Wikipedia Library partners to get a continuously updating list of publications and articles in their collection. This information will be used in building a centralized search for the Wikipedia Library in conjunction with the other data collected. If for whatever reason we cannot collect this data, we will focus more deeply on the other aspects of the project, for example through more sophisticated citation extraction strategies.
The above consultation will result in a more precise product requirement document that describes the lookup tool to be developed. There will most likely be two components to it: a user-friendly lookup function, as well as an API that can be used by tools. This will be the front-end to a comprehensive database constructed using a variety of strategies, principally focused on indexing citations on Wikipedia and supplementing this with other data. In particular, we are interested in indexing the Wikipedia Library's partners' databases.
To develop the database, we will experiment with different extraction strategies, including figuring out which ones are most effective and yield the least error. We will also need to determine how integration with Wikipedia Library partner databases will work, assuming it is possible. (It may be possible for some databases but not others.) We also need to determine workflows for human curation of this data, since we expect that some cleanup of the data may need to take place, owing to the ambiguities that may exist in some citations.
The objective here is to create a "master database" that represents the relationships between sources and their usage on Wikipedia, as well as a superset of all the Wikipedia Library partner databases where this data is available. If we succeed in both indexing the partner databases and constructing our own database of source usage on Wikipedia we will have a substantial dataset that includes not just bibliographic metadata, but where to find full texts (both open access and closed access Wikipedia Library partners) and usage data where we have it. Since the holdings of the Wikipedia Library partners would be in one database, you can do one search and get results across the different databases. If we are not able to integrate this data, we will instead focus on more thorough citation extraction from Wikipedia, incorporating more sophisticated strategies we would not otherwise have the capacity to implement. (For example, this could include adapting the tools to work on other language editions of Wikipedia.)
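The idea of searching once across all partner holdings can be sketched as follows. This is a minimal illustration that assumes each partner can supply records carrying a DOI and a title. The partner names, the record fields, and the functions `merge_holdings` and `search` are all hypothetical, and records without a DOI are not handled.

```python
def merge_holdings(partner_holdings):
    """Merge per-partner record lists into one index keyed by lower-cased DOI.

    `partner_holdings` is a dict of {partner name: [{"doi": ..., "title": ...}, ...]}.
    """
    merged = {}
    for partner, records in partner_holdings.items():
        for record in records:
            key = record["doi"].lower()
            entry = merged.setdefault(
                key, {"title": record["title"], "available_from": []}
            )
            if partner not in entry["available_from"]:
                entry["available_from"].append(partner)
    return merged


def search(merged, term):
    """Return (doi, title, partners) for every title containing `term`."""
    term = term.lower()
    return sorted(
        (doi, entry["title"], sorted(entry["available_from"]))
        for doi, entry in merged.items()
        if term in entry["title"].lower()
    )
```

A single call to `search` then reports every partner through which a matching source is available, instead of requiring a separate search in each partner's database.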
The lookup tool will be a web interface, hosted on Labs, that will be a convenient front-end for searching the data. (You could think of the Librarybase wiki as the model and the lookup tool as the view and controller, to use the terminology of the MVC architecture.)
Regarding Librarybase and Wikidata: Librarybase wiki entries include Wikidata identifiers where they exist, allowing for interoperability between the two databases. The specific implementation will probably require more thought, but my preliminary thoughts are that Librarybase should not duplicate what's already on Wikidata, since the data on Wikidata will probably be more up to date. Where there is a Wikidata entry, the Librarybase equivalent would store (a) the Wikidata ID number and (b) data that is not in scope for Wikidata, such as what Wikipedia page a source was cited on. (There might be a duplication of Wikidata on the Librarybase side if it makes automated lookups more feasible. But the data should be edited directly on Wikidata in this case; any changes on the Librarybase side would be overwritten.) Librarybase will also include bibliographic metadata for resources that do not meet Wikidata notability criteria, such as individual web pages.
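One way to picture this division of responsibility is the following Python sketch. It assumes a simple record shape that is not taken from the actual Librarybase data model. The class `SourceRecord`, its field names, and the function `bibliographic_metadata` are hypothetical, and `wikidata_items` merely stands in for a real lookup against Wikidata.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class SourceRecord:
    """One Librarybase entry (hypothetical shape, for illustration only)."""
    librarybase_id: str
    wikidata_id: Optional[str] = None                   # set when a Wikidata item exists
    local_metadata: dict = field(default_factory=dict)  # authoritative only without a Wikidata item
    cited_on: set = field(default_factory=set)          # Wikipedia pages citing this source


def bibliographic_metadata(record, wikidata_items):
    """Return the bibliographic metadata to show for a record.

    `wikidata_items` is a dict of {wikidata_id: metadata}. Where a Wikidata
    item exists it is treated as authoritative, so any local copy is ignored.
    """
    if record.wikidata_id is not None:
        return dict(wikidata_items[record.wikidata_id])
    return dict(record.local_metadata)
```

Under this sketch, citation usage (`cited_on`) always lives on the Librarybase side, while bibliographic fields are read from Wikidata whenever a linked item exists.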
Based on the parameters imposed by the consultation and the resulting technical strategy, we will begin developing the software that carries out the extraction of source metadata from articles and migrates it to Wikidata and the Librarybase wiki. We will also begin implementing the lookup tool and any curation tools as needed. Once this work is done, we will begin encouraging Wikipedians, especially Wikipedia Library customers, to try looking up sources to see if the tool meets their expectations.
How will you use the funds you are requesting? List bullet points for each expense. (You can create a table later if needed.) Don’t forget to include a total amount, and update this amount in the Probox at the top of your page too!
- Management, strategy, and lookup tool development: $10,000 for six months, part time (~$20/hour). James Hare will be responsible for this aspect of the project, drawing from his nearly 12 years of experience in the Wikipedia community. This work includes designing and executing a targeted outreach strategy, incorporating user feedback in the design of tools, coordinating with partners, and supervising the metadata extraction process.
- Citation extraction tool development and other code development work as needed: $10,000 for six months, part time (~$20/hour). This includes researching the existing data on Wikipedia and designing complex metadata extraction tools that are designed to work continuously and provide fresh data on source usage on Wikipedia, as well as relevant quality assurance mechanisms.
How will you let others in your community know about your project? Why are you targeting a specific audience? How will you engage the community you’re aiming to serve at various points during your project? Community input and participation helps make projects successful.
We will reach out to our stakeholders as described above.
What do you expect will happen to your project after the grant ends? How might the project be continued or grown in new ways afterwards?
The software we develop will be open source, allowing others to improve and expand upon it as necessary. Since Librarybase and Wikidata are wikis, volunteers can help curate the data. Once we have a working proof of concept, we would like to expand beyond English Wikipedia and potentially develop workflows for migrating freely licensed content to Wikisource. There are also opportunities to work with library associations and other collectors of metadata to integrate their data with this tool, giving Wikipedia editors access to even more sources.
Measures of success
How will you know if the project is successful and you've met your goals? Please include specific, measurable targets here.
- Size of Librarybase database. Target: metadata for at least 100,000 source items
- Number of lookup tool users. Target: at least 15 who have used and provided written feedback
- Number of WikiProjects receiving source recommendation reports. Target: at least 5 WikiProjects
Please use this section to tell us more about who is working on this project. For each member of the team, please describe any project-related skills, experience, or other background you have that might help contribute to making this idea a success.
- Harej has been editing Wikipedia since 2004 and has served as an administrator since 2006. He is also currently an active editor on Wikidata, where he has contributed heavily to Wikidata entries on scientific journal articles, both individually and in his capacity as a Wikipedian in Residence at the National Institute for Occupational Safety and Health in the United States. He is an experienced Wikimedia Foundation grantee, most recently working on WikiProject X as a product manager with some Python software development (including with Django). He serves on the Board of Directors of Wikimedia District of Columbia.
- Sigma has been editing Wikipedia since 2009. He is a seasoned bot developer on Wikipedia, with software engineering experience in Python and Java.
You are responsible for notifying relevant communities of your proposal, so that they can help you! Depending on your project, notification may be most appropriate on a Village Pump, talk page, mailing list, etc. Please paste links below to where relevant communities have been notified of your proposal, and to any other relevant community discussions. Need notification tips?
- Email sent to wikicite-discuss mailing list
- Comment posted on Wikipedia Library talk page
- Comment posted on WikiProject Open talk page
- Comment posted on WikiProject Source MetaData talk page
Do you think this project should be selected for a Project Grant? Please add your name and rationale for endorsing this project below! (Other constructive feedback is welcome on the discussion page).
- The work being done by James to integrate the various activities that help WikiProjects operate, has been fairly successful. Moreover, improvements to the citation ecosystem for the Wikipedias in a way that allows WikiProjects or other communities of practice to actively look up relevant sources, could greatly strengthen the productivity of new editors. Imagine a world in which each editor, when on a page, could ask: where should I start my research for this topic? User:Astinson (WMF)
- Work on citations is very timely and important. Vladimir Alexiev (talk) 13:47, 8 August 2016 (UTC). Some important considerations:
- DBpedia has extracted hundreds of millions of citations as RDF but in rough form (i.e. using dbp: not any established ontology). If you can use this RDF data, submit your idea to http://wiki.dbpedia.org/blog/dbpedia-citations-references-challenge (disclaimer: I'm on the review committee there).
- Normalizing/deduplicating the data is a huge area of unexplored work. Eg just for authors, @Pigsonthewing: proposes to use ORCID. But this is a small part of the problem (eg how do you identify different Manifestations of the same Work? Revisions of the same article that appeared in a conference and then in a journal?) and attaching to the wrong object (if I mistype the ORCID of John Smith as author of an article, I've done a disservice: the ORCID should be recorded against the WD item of John Smith, not against every separate article of that author).
- Using Librarybase vs Wikidata for storing the citations: maybe a separate WD instance (Librarybase) is better, since I doubt we want a Wikidata entry for every work ever written (does every single citation meet Wikidata's notability guidelines, however lax they are?)
- Leveraging existing bibliographic databases is extremely important, eg WorldCat has 300M bibliographic records, TEL has 100M bibliographic records, Google Books has many million online books.
- The data/ontology model necessarily must support "part of" relations. Eg citing a chapter or page of a book should point to a URL for the book, but be its own record that carries the chapter/page identifier (locator). See https://github.com/dbpedia/extraction-framework/issues/452 for some considerations.
- The UI for generating references could definitely need work. Interesting comments by Vladimir Alexiev. Jura1 (talk) 17:33, 14 August 2016 (UTC)
- The prototype of Librarybase has proved to be full of useful information. +1 to pushing forward with this! ·addshore· talk to me! 09:40, 12 September 2016 (UTC)
- Improving citations by having a better infrastructure for them is very valuable. ChristianKl (talk) 18:26, 29 September 2016 (UTC)
- You know that frwiki has a namespace for reference? In any case, whatever improves the current situation I support. Hoping one day we'll have something flexible and efficient.--Alexmar983 (talk) 17:44, 7 October 2016 (UTC)
- Only a (small?) minority of editors use WikiProjects as a starting point for what they edit. For the (large?) majority of editors, a "Generate a list of additional, potential sources/citations", on article talk pages, would be far more valuable than something designed to assist WikiProjects. And such a tool doesn't necessarily have to be real-time; even a "post a list to this page within 24 hours" option would be extremely valuable, IF the sources provided were truly valuable. John Broughton (talk) 00:46, 8 October 2016 (UTC)