As part of ongoing efforts to parse and extract scholarly citations from the Wikipedia dumps, we came across several references whose identifiers are ill-formed and non-resolving (phab:T99046) or simply missing. For example, the English Wikipedia article on kelp forests has 76 scholarly references, but none of them uses a DOI or other unique identifier. We don't have an accurate estimate of how many citations lack identifiers they should have, but a cursory analysis of a sample of Wikipedia articles on scientific and medical topics indicates that the problem might be widespread and significant in many Wikipedia language editions, particularly for citations added before the introduction of en:W:Citoid.
- write code to identify DOI-less references that should have an identifier, trying to match existing metadata via the Crossref API
- provide descriptive statistics on the estimated number of DOI-less citations
- write a bot to fix the corresponding citation template calls when these instances are identified
- (bonus) check how DOI lookup stats change after the fixes are in
- familiarity with the Wikipedia dump structure and parsers like EpochFail's mwcites
- familiarity with the Crossref API or similar APIs to retrieve bibliographic metadata
- Antonin Delpeuch. Wrote wikiciteparser, another parser for citation templates, extracting more metadata than just the identifiers.
- Marin Dacos and Patrice Bellot. Involved in Bilbo, an open source project. "BILBO is an open source software for automatic annotation of bibliographic reference. It provides the segmentation and tagging of input string. It is principally based on Conditional Random Fields (CRFs), machine learning technique to segment and label sequence data. As external softwares, Wapiti is used for CRF learning and inference (reference annotation) and SVMlight is used for sequence classification (differentiating reference from plain text in notes)." More: http://lab.hypotheses.org/1437 and http://lab.hypotheses.org/category/bilbo-bibliographical-robot
- add your name here
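
As a starting point for the first task (identifying DOI-less references and matching them against Crossref), here is a minimal Python sketch. It assumes each citation has already been parsed into a dictionary of template parameters (`title`, `author`, `journal`, `year`, `doi`, ...), for example by one of the parsers mentioned above. The function names, the choice of identifier fields, and the 0.9 title-similarity threshold are illustrative assumptions, not part of any existing tool. The sketch uses the Crossref REST API `works` endpoint with its `query.bibliographic` parameter.

```python
import difflib
import json
import urllib.parse
import urllib.request

CROSSREF_WORKS = "https://api.crossref.org/works"


def lacks_identifier(citation):
    """Return True if the parsed citation has no DOI, PMID or PMC value.

    The list of identifier fields is an assumption made for this sketch.
    """
    return not any((citation.get(key) or "").strip()
                   for key in ("doi", "pmid", "pmc"))


def build_query_url(citation, rows=5):
    """Build a Crossref bibliographic query URL from the available metadata."""
    parts = [citation.get(key, "")
             for key in ("title", "author", "journal", "year")]
    query = " ".join(part for part in parts if part)
    params = {"query.bibliographic": query, "rows": rows}
    return CROSSREF_WORKS + "?" + urllib.parse.urlencode(params)


def best_doi(response, title, threshold=0.9):
    """Return the DOI of the candidate whose title is closest to `title`.

    Returns None when no candidate reaches the (hypothetical) threshold,
    so that uncertain matches are never written back automatically.
    """
    best, best_ratio = None, 0.0
    for item in response.get("message", {}).get("items", []):
        for candidate in item.get("title", []):
            ratio = difflib.SequenceMatcher(
                None, title.lower(), candidate.lower()).ratio()
            if ratio > best_ratio:
                best, best_ratio = item.get("DOI"), ratio
    return best if best_ratio >= threshold else None


def lookup(citation):
    """Query Crossref over the network and return a matching DOI or None."""
    with urllib.request.urlopen(build_query_url(citation), timeout=30) as resp:
        return best_doi(json.load(resp), citation.get("title", ""))
```

A bot built on this would probably call `lookup` only for citations where `lacks_identifier` is true, log the similarity of each match, and leave borderline cases for human review rather than editing the template directly.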