Links and notes

Goal

Build extractors of bibliographic information and improve metadata lookup tools

Notes

Librarybase
BILBO
- Automatically adds DOIs to references: OpenEdition Bilbo (Marin, Marseille)
- They have processed 1.4 million references, finding 140,000 DOIs
- Uses the Crossref service
- Bilbo Demo
- Other approaches for reference extraction:
  - AnyStyle: parses scholarly references into different formats
    - This online service attempts to parse a reference into its components. It does not resolve the reference to a DOI
  - Grobid: Information Extraction from Scientific Publications https://github.com/kermitt2/grobid
Finn Årup Nielsen showed information from OpenfMRI and Q21100980 with links to external databases and representation of numerical scientific data, respectively.
Antonin Delpeuch - OABot
- A web service that suggests an open access version for a citation in a Wikipedia article, based on the information in the citation template
- There are issues of persistence and of linking to possibly 'pirated' versions.
University of Trento: Mattia Lago, Alessio Melandri
- Maintgraph on Toolserver (for it.wiki): Maintgraph on Toolserver
- Alessio Bogon
- Study of citations on Wikipedia and the Microsoft Academic Graph.
- Pageview dataset for 2014: Wikipedia Pagecounts Sorted by Page Year 2014
The Citoid tool makes it easy to cite within Wikipedia by just entering a URL. In the background, the Zotero translators extract the information from the web page, and the structured fields are filled into the citation in Wikipedia.
- Citoid on MediaWiki: Citoid
- Citoid API: Citoid API
- Zotero Translators (>480): Zotero Translators on Github
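A minimal sketch of building a Citoid lookup URL from a page URL or DOI. The REST path below (Citoid as served through the Wikimedia REST API) and the `mediawiki` output format are assumptions that should be checked against the Citoid API documentation; `citoid_url` is a hypothetical helper name.

```python
from urllib.parse import quote

# Assumption: Citoid is reachable through the Wikimedia REST API at this path.
CITOID_BASE = "https://en.wikipedia.org/api/rest_v1/data/citation"


def citoid_url(query, fmt="mediawiki"):
    """Build a Citoid lookup URL for a web page URL, DOI, or other identifier."""
    # The query is a single path segment, so reserved characters
    # (including "/" and ":") must be percent-encoded.
    return f"{CITOID_BASE}/{fmt}/{quote(query, safe='')}"


print(citoid_url("https://www.nytimes.com/"))
```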
Author disambiguation and separation
Open APIs for ISBN:
- SRU Library of Congress, e.g., lx2.loc.gov:210/LCDB?operation=searchRetrieve&version=1.1&query=bath.ISBN=9780820488660&maximumRecords=1
  - see also link
- SRU from GBV, e.g. http://sru.gbv.de/gvk?version=1.1&operation=searchRetrieve&query=pica.isb=9780820488660 AND pica.mat%3DB&maximumRecords=1
  - see also link
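The two SRU requests above share the same parameters and differ only in the base URL and the CQL index names. A small sketch of a URL builder, assuming the base URLs and indexes shown in the examples (`sru_isbn_url` is a hypothetical helper, and `record_schema` is optional):

```python
from urllib.parse import urlencode

# Base URLs taken from the examples in these notes.
LOC_SRU = "http://lx2.loc.gov:210/LCDB"
GBV_SRU = "http://sru.gbv.de/gvk"


def sru_isbn_url(base, cql_query, record_schema=None):
    """Build an SRU 1.1 searchRetrieve URL that returns at most one record."""
    params = {
        "version": "1.1",
        "operation": "searchRetrieve",
        "query": cql_query,  # urlencode percent-encodes "=" and spaces
        "maximumRecords": "1",
    }
    if record_schema:
        params["recordSchema"] = record_schema
    return base + "?" + urlencode(params)


isbn = "9780820488660"
print(sru_isbn_url(LOC_SRU, f"bath.ISBN={isbn}"))
print(sru_isbn_url(GBV_SRU, f"pica.isb={isbn} AND pica.mat=B"))
```

Encoding the whole CQL query avoids sending raw spaces in the URL.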
One group works on the OAbot Proposal
- First, we want to write down the specs for the bot.

Aaron's notes:

Tom Arrow

Extraction & fetching metadata from PubMed
Using a list of DOIs and other IDs, feed a script that extracts PubMed Central metadata
Made items in a Wikibase installation.
A Python script runs and uses the SPARQL endpoint to check whether items exist.
- Items for missing articles (DOIs) -- or otherwise linked
- Items for missing authors (ORCID) -- or otherwise linked
- Items for missing institutions -- or otherwise linked
- Items for missing publishers/journals -- or otherwise linked
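The existence check could look roughly like the sketch below: build a SPARQL ASK query for a DOI and read the boolean out of the JSON result. This is not the actual script. The property ID `P356` is the DOI property on Wikidata and is used only as a placeholder, since a separate Wikibase installation will likely use different property IDs and possibly a different prefix than `wdt:`.

```python
import json


def build_doi_ask_query(doi, doi_property="P356"):
    """Return a SPARQL ASK query testing whether any item carries this DOI.

    doi_property is an assumption (P356 is Wikidata's DOI property).
    """
    # Escape backslashes and quotes so the DOI is a valid SPARQL string literal.
    literal = doi.replace("\\", "\\\\").replace('"', '\\"')
    return f'ASK {{ ?item wdt:{doi_property} "{literal}" }}'


def item_exists(response_body):
    """Interpret the JSON result of an ASK query."""
    return bool(json.loads(response_body).get("boolean", False))


print(build_doi_ask_query("10.1000/xyz"))
```

If `item_exists` returns False, the script would go on to create the missing item.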

James Hare

Librarybase -- A Wikibase installation
100-150k-ish items -- 10-15k articles
- Similar to Wikidata -- But will accept everything that Wikidata doesn't
- Focused on sources and where they are used in Wikipedia
Has hierarchy of item types
- Source, Author, Publisher, Institution
A little messy. Lots of stuff from pubmed has been loaded
Has SPARQL query service
Current load includes all <ref> tags
Would also include source metadata that is not included in Wikipedia
What is the growth plan?
- Want to hear what you think. Aaron will dump his DOI data in.
- After we do DOIs, move onto harder things -- like using citoid.

Aaron Halfaker

Extracting history of scholarly identifier (DOI, arXiv, ISBN, PubMed) additions to Wikipedia
Have fast code & datasets.
Want robust metadata extraction -- Want to integrate tarrow's work
Goal: Integrate and experiment with DOI extraction
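As an illustration of the idea (not Aaron's actual code), the sketch below pulls DOI-like strings out of wikitext and records the first revision in which each one appears. The regular expression is a heuristic assumption and will miss or truncate some unusual DOIs.

```python
import re

# Heuristic: "10.", a 4-9 digit registrant code, "/", then a suffix that
# stops at whitespace or common wikitext delimiters.
DOI_RE = re.compile(r'\b10\.\d{4,9}/[^\s|\]}<>"]+')


def extract_dois(wikitext):
    """Return the set of DOI-like strings found in a piece of wikitext."""
    return {m.group(0).rstrip(".,;") for m in DOI_RE.finditer(wikitext)}


def first_additions(revisions):
    """Map each DOI to the ID of the first revision that contains it.

    revisions: iterable of (rev_id, wikitext) pairs in chronological order.
    """
    first_seen = {}
    for rev_id, text in revisions:
        for doi in extract_dois(text):
            first_seen.setdefault(doi, rev_id)
    return first_seen


history = [
    (1, "Some text."),
    (2, "Some text.<ref>{{cite journal |doi=10.1371/journal.pone.0000001}}</ref>"),
    (3, "<ref>{{cite journal |doi=10.1371/journal.pone.0000001}}</ref> doi:10.1000/xyz123."),
]
print(first_additions(history))
```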

Jon

Play with citations
paleontologist as a day job
Get papers with rplos (https://github.com/ropensci/rplos) -- There are dumps, but this is pretty fast

Phillipp

Extracting data citation out of papers (future)
Zotero translators (Citoid -- URL --> Citation) -- Open source -- Active development

Marin

Bilbo.openeditionlab.org -- can test with a UI
Parses a bibliography -- can add DOIs
Machine learning based approach
92% ish recall and 80% ish precision
Alternatives
- CrossRef service (allegedly) doesn't work as well -- and is "very slow"
- https://anystyle.io/ -- Parses citation stuff
About two weeks for 1.4 million
http://lab.hypotheses.org/1532
"Enrichment process"
Uses crossref API with parsed stuff
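A rough sketch of what "Crossref API with parsed stuff" could mean: the parsed fields of a reference are joined into one bibliographic query against the Crossref REST API. The dictionary keys (`authors`, `title`, `year`) and the helper name are hypothetical, and this is not Bilbo's actual implementation.

```python
from urllib.parse import urlencode

CROSSREF_WORKS = "https://api.crossref.org/works"


def crossref_lookup_url(parsed_reference, rows=1):
    """Build a Crossref search URL from the fields of a parsed reference.

    parsed_reference is a dict with hypothetical keys "authors", "title",
    and "year"; missing or empty fields are skipped.
    """
    parts = [parsed_reference.get(key, "") for key in ("authors", "title", "year")]
    bibliographic = " ".join(part for part in parts if part)
    return CROSSREF_WORKS + "?" + urlencode(
        {"query.bibliographic": bibliographic, "rows": rows}
    )


print(crossref_lookup_url({"title": "Deep learning", "year": "2015"}))
```

The top hit would still need to be compared against the parsed reference before its DOI is accepted.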

Finn Årup Nielsen (User:Fnielsen)

Neuroinformatics
Trying to see if data can be represented in wikis
For metadata
- Download data from databases and add them to wikidata and then link them back
- Using tools SourceMD and QuickStatements
- Considering making own extractor for PubMed.
Trying to represent data in scientific papers
- https://wikidata.org/wiki/Q17141282 -- See "numeric value".
Alternatives
- Datacite registers connections between datasets and publications
http://www2.imm.dtu.dk/pubdb/views/edoc_download.php/6937/pdf/imm6937.pdf

Mike (OCLC)

Interest in OABot
Looking at databases that make a connection between an author and a paper (not just a string name)

Antonin

OABot
Takes citations in Wikipedia articles and tries to find non-paywall version of the PDF
Guesses whether a citation links to a paywalled PDF
Page name --> Reference metadata --> Finds accessible version --> new reference metadata
What is the minimum amount of information for a positive result?
- Title, date and sometimes authors.
Bot approval in process -- Maybe semi-automated.
- Also have the option of making it an OAuth tool
How do we know that it is not a copyright infringing PDF?
- We don't. We rely on the publisher to track down whoever is hosting it illegally.
- Might need to get lawyers involved. Community might not like it either.
- Maybe could have whitelisted domains.
- Targeting repositories -- might still link to author's homepage
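The whitelisted-domains idea could be sketched as a simple host check on each candidate link. The domain list below is purely illustrative; which hosts would be trusted was not decided in these notes.

```python
from urllib.parse import urlsplit

# Illustrative list only; the real set of trusted repositories is undecided.
ALLOWED_DOMAINS = {"arxiv.org", "europepmc.org", "hal.archives-ouvertes.fr"}


def is_allowed(url, allowed=ALLOWED_DOMAINS):
    """Return True if the URL's host is an allowed domain or a subdomain of one."""
    host = (urlsplit(url).hostname or "").lower()
    return any(host == domain or host.endswith("." + domain) for domain in allowed)


print(is_allowed("https://arxiv.org/abs/1234.5678"))
print(is_allowed("https://example.com/arxiv.org/paper.pdf"))
```

Matching on the parsed hostname rather than the raw string keeps a domain name that appears in the path from passing the check.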

Sebastian

Data-repositorian by day
Translator leads on Zotero & Citation style language
Citoid demo
- (NYTimes page) --> VE "Cite"
Ref tool bar has ISBN -- why no ISBN?
- Probably just not implemented yet. Also, WorldCat is bad. See "library of congress sru API".
Can just run Zotero code on nodejs?
- No. Lots of integration with the browser

Diego

Learning ecosystem
Worked with Wikipedia data for studies of communities
Information retrieval and text mining
Writing crawlers, parsers, and backend services

Cristian

Started on template usage data
Parsed the whole dump and extracted this data
tools.wmflabs.org/maintgraph -- Queries the database once per day
Ongoing study
- Looked at citation identifiers -- DOI, arXiv, ISBN, PubMed -- and compared with MS Academic Graph
- Looked at first introduction in Wikipedia
Published dataset
- Pagecounts -- reshuffled
- Allows for computation of journal and conference based on pageviews

Scott

Here to learn
Make tools for researchers to get data
Targeting R. Used to work with Ironholds.

Day 2

I suggest that we focus on one specific project. People from the UK and US are invited to speak slowly for non-native English speakers. Thank you!

LoC ISBN lookup:

// Sends an SRU request with a CQL query to the Library of Congress, asking for MARCXML back

Requests that work:

http://lx2.loc.gov:210/LCDB?operation=searchRetrieve&version=1.1&maximumRecords=1&query=bath.ISBN={ISBN} # Full MARC21 data
http://lx2.loc.gov:210/LCDB?operation=searchRetrieve&version=1.1&maximumRecords=1&recordSchema=dc&query=bath.ISBN={ISBN} # Summarized, human readable

LEGAL ISSUES: https://www.loc.gov/legal/

"We reserve the right to block IP address that fail to honor our websites’ robot.txt files, or submit requests at a rate that negatively impacts service delivery to patrons. Current guidelines recommend that software programs submit a total of no more than 10 requests per minute to Library applications, regardless of the number of machines used to submit requests. We also reserve the right to terminate programs that require more than 24 hours to complete. "
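To stay within the quoted guideline of 10 requests per minute, a client could simply space its requests about 6 seconds apart. A minimal single-process sketch follows; the `Throttle` class is hypothetical, and the guideline counts requests across all machines, which this does not coordinate.

```python
import time


class Throttle:
    """Space out calls so that roughly max_calls happen per period seconds."""

    def __init__(self, max_calls=10, period=60.0,
                 clock=time.monotonic, sleep=time.sleep):
        self.min_interval = period / max_calls
        self.clock = clock  # injectable for testing
        self.sleep = sleep  # injectable for testing
        self.last_call = None

    def wait(self):
        """Block until min_interval has passed since the previous call."""
        now = self.clock()
        if self.last_call is not None:
            remaining = self.min_interval - (now - self.last_call)
            if remaining > 0:
                self.sleep(remaining)
                now = self.clock()
        self.last_call = now


throttle = Throttle(max_calls=10, period=60.0)
throttle.wait()  # first call returns immediately; call before every request
```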

// Search for the ISBN via the SRU interface of the GBV, and take the result as MARCXML

//documentation: https://www.gbv.de/wikis/cls/SRU

var url = "http://sru.gbv.de/gvk?version=1.1&operation=searchRetrieve&query=" + encodeURIComponent("pica.isb=" + queryISBN + " AND pica.mat=B") + "&maximumRecords=1";

Citoid Group etherpad: https://etherpad.wikimedia.org/p/wikicite-citoid

Marc21 Mapping https://www.loc.gov/marc/bibliographic/ecbdlist.html
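Once a MARCXML record comes back, the mapping boils down to reading subfields by field tag and subfield code, for example 245 $a for the title and 020 $a for the ISBN. A small sketch using only the standard library; the sample record is made up for illustration, and `marc_subfields` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

MARC_NS = {"marc": "http://www.loc.gov/MARC21/slim"}


def marc_subfields(xml_text, tag, code):
    """Return the text of every subfield `code` in datafields with tag `tag`."""
    root = ET.fromstring(xml_text)
    # ".//" also finds records nested inside an SRU response envelope.
    path = f".//marc:datafield[@tag='{tag}']/marc:subfield[@code='{code}']"
    return [el.text.strip() for el in root.findall(path, MARC_NS) if el.text]


# Made-up sample record, not real catalogue data.
SAMPLE = """<record xmlns="http://www.loc.gov/MARC21/slim">
  <datafield tag="020" ind1=" " ind2=" ">
    <subfield code="a">9780000000002</subfield>
  </datafield>
  <datafield tag="245" ind1="1" ind2="0">
    <subfield code="a">An example title /</subfield>
    <subfield code="c">A. Author.</subfield>
  </datafield>
</record>"""

print(marc_subfields(SAMPLE, "245", "a"))
print(marc_subfields(SAMPLE, "020", "a"))
```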

DOI metadata lookup:

$ cat datasets/500_dois.tsv | python demonstrate_doi_fetch_performance.py

Running against api.crossref.org
..e..e.................e................................e............................e..........e...ee.........................e.............e...........................................................................................e..ee...........................................e.................................e.............................................................................................................e..................e.......................................................
Processing 500 DOIs took 32.718 seconds.
So, ~0.065 seconds per lookup

Running against doi.org
..e..e.................e................................e............................e..........e...ee.........................e.............e...........................................................................................e..ee...........................................e.................................e.............................................................................................................e..................e.......................................................
Processing 500 DOIs took 160.348 seconds.
So, ~0.321 seconds per lookup

Running against citoid.wikimedia.org
..e..e.................e................................e............................e..........e...ee..e..e...................e.............e...........................................................................................e..ee.......................e...................e.................................e.............................................................................................................e..................e.......................................................
Processing 500 DOIs took 1718.462 seconds.
So, ~3.437 seconds per lookup

>1s lookup = morally wrong

So, if we're going to extract metadata for ~601k DOIs through Citoid at ~3.437 seconds per lookup, that will take ~24 days (601,000 × 3.437 s ≈ 2.07 million seconds). I'll be checking into using a threadpool to speed this up. It might be OK with the Citoid folks.
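A sketch of the threadpool idea, assuming lookups are independent and the target service tolerates a few parallel requests. The doi.org fetcher uses content negotiation for CSL JSON and is an illustration, not the code from the gist; `fetch_all` and the worker count are hypothetical.

```python
import json
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import quote
from urllib.request import Request, urlopen


def fetch_doi_metadata(doi, timeout=30):
    """Fetch CSL JSON for one DOI from doi.org via content negotiation."""
    request = Request(
        "https://doi.org/" + quote(doi),
        headers={"Accept": "application/vnd.citationstyles.csl+json"},
    )
    with urlopen(request, timeout=timeout) as response:
        return json.load(response)


def fetch_all(dois, fetch=fetch_doi_metadata, workers=8):
    """Look up many DOIs concurrently; failed lookups map to None."""
    dois = list(dois)

    def safe_fetch(doi):
        try:
            return fetch(doi)
        except Exception:
            return None  # comparable to an "e" in the progress output

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, so zip pairs each DOI with its result.
        return dict(zip(dois, pool.map(safe_fetch, dois)))
```

Because each lookup spends most of its time waiting on the network, threads should give close to a linear speed-up until the service starts throttling, so the worker count is worth agreeing on with the service owners.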

doi.org lookup code: https://gist.github.com/halfak/7113348ab3496a3af3b7c2b2de14a526

OAbot group:

We have expanded the documentation of the project, now centralized here: http://en.wikipedia.org/wiki/Wikipedia:OABOT
We have added more debugging information to the web interface.
We have tested the software and identified some bugs; some have been corrected, and some are being corrected.

mwlinks:

Library and command-line tool for extracting wikilinks from XML Wikipedia dump history files.
https://github.com/mediawiki-utilities/python-mwlinks