WikiCite 2016/Report/Group 2/Notes

From Meta, a Wikimedia project coordination wiki

Links and notes

Goal

  • Extractors of bibliographic information, improve metadata lookup tools

Notes

  • Librarybase
  • BILBO
  • Finn Årup Nielsen showed information from OpenfMRI and Q21100980 with links to external databases and representation of numerical scientific data, respectively.
  • Antonin Delpeuch - OABot
    • A web service that suggests an open access version for a citation in a Wikipedia article, based on the information in the citation template
    • There are issues of persistence and of linking to possibly 'pirated' versions.
  • University of Trento: Mattia Lago, Alessio Melandri
  • The Citoid tool makes it easy to cite within Wikipedia by just entering a URL. In the background, the Zotero translators extract the information from the web page, and the structured fields are filled in for the citation in Wikipedia.
  • Author disambiguation and separation
  • Open APIs for ISBN:
  • One group works on the OAbot Proposal
    • First, we want to write down the specs for the bot.

Aaron's notes:

Tom Arrow

  • Extracting & fetching metadata from PubMed
  • Uses a list of DOIs and other IDs to feed a script that extracts PubMed Central metadata
  • Made items in a Wikibase installation.
  • Python script runs. Uses SPARQL endpoint to check if items exist.
    • Items for missing articles (DOIs) -- or otherwise linked
    • Items for missing authors (ORCID) -- or otherwise linked
    • Items for missing institutions -- or otherwise linked
    • Items for missing publishers/journals -- or otherwise linked
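The existence check described above could be sketched roughly as follows. This is not Tom's script: the function names are hypothetical, and the property IRI in the usage note is Wikidata's DOI property (P356), used only as a stand-in because a separate Wikibase installation would define its own properties. The sketch only builds the query and filters identifiers; sending the query to an endpoint is left to the caller.

```python
def build_exists_query(property_iri, value):
    """Return a SPARQL ASK query: does any item carry `value` for `property_iri`?"""
    # Escape backslashes and double quotes so the value stays a valid string literal.
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return 'ASK { ?item <%s> "%s" }' % (property_iri, escaped)


def missing_identifiers(identifiers, exists):
    """Keep only the identifiers for which `exists(identifier)` is false.

    `exists` is a caller-supplied function (hypothetical here) that would run
    the ASK query against the SPARQL endpoint and return a boolean.
    """
    return [identifier for identifier in identifiers if not exists(identifier)]
```

For example, `build_exists_query("http://www.wikidata.org/prop/direct/P356", "10.1000/XYZ")` yields an ASK query for that DOI string, and `missing_identifiers` returns the identifiers that would still need new items.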

James Hare

  • Librarybase -- A Wikibase installation
  • 100-150k-ish items -- 10-15k articles
    • Similar to Wikidata -- But will accept everything that Wikidata doesn't
    • Focused on sources and where they are used in Wikipedia
  • Has hierarchy of item types
    • Source, Author, Publisher, Institution
  • A little messy. Lots of stuff from PubMed has been loaded
  • Has SPARQL query service
  • Current load includes all <ref> tags
  • Would also include source metadata that is not included in Wikipedia
  • What is the growth plan?
    • Want to hear what you think. Aaron will dump his DOI data in.
    • After we do DOIs, move on to harder things -- like using Citoid.

Aaron Halfaker

  • Extracting history of scholarly identifier (DOI, arXiv, ISBN, PubMed) additions to Wikipedia
  • Have fast code & datasets.
  • Want robust metadata extraction -- Want to integrate tarrow's work
  • Goal: Integrate and experiment with DOI extraction
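A minimal sketch of what DOI extraction from wikitext can look like, assuming a deliberately loose regular expression. It is not Aaron's extraction code; the pattern, the function name, and the DOIs in the usage note are made up for illustration, and real DOIs can contain characters this pattern cuts off.

```python
import re

# Assumption: a loose pattern. A DOI starts with "10.", then a registrant code
# of 4 to 9 digits, a slash, and a suffix. The suffix is taken to end at
# whitespace or at characters that usually close a template parameter.
DOI_RE = re.compile(r'\b10\.\d{4,9}/[^\s|\]}<>"]+')


def extract_dois(wikitext):
    """Return the distinct DOIs found in `wikitext`, in order of first appearance."""
    found = []
    for match in DOI_RE.finditer(wikitext):
        doi = match.group(0).rstrip(".,;")  # drop trailing sentence punctuation
        if doi not in found:
            found.append(doi)
    return found
```

Called on `"{{cite journal |doi=10.1234/abc.def |title=X}} See doi:10.5555/xyz."`, this returns the two DOI strings without the template syntax or the final period.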


Jon

Phillipp

  • Extracting data citation out of papers (future)
  • Zotero translators (Citoid -- URL --> Citation) -- Open source -- Active development

Marin

  • Bilbo.openeditionlab.org -- can test with a UI
  • Parses a bibliography -- can add DOIs
  • Machine learning based approach
  • Roughly 92% recall and 80% precision
  • Alternatives
    • CrossRef service (allegedly) doesn't work as well -- and is "very slow"
    • https://anystyle.io/ -- Parses citation stuff
  • About two weeks for 1.4 million
  • http://lab.hypotheses.org/1532
  • "Enrichment process"
  • Uses crossref API with parsed stuff

Finn Årup Nielsen (User:Fnielsen)

Mike (OCLC)

  • Interest in OABot
  • Looking at databases that make a connection between an author and a paper (not just a string name)

Antonin

  • OABot
  • Takes citations in Wikipedia articles and tries to find non-paywall version of the PDF
  • Guesses whether a citation links to a paywalled PDF
  • Page name --> Reference metadata --> Finds accessible version --> New reference metadata
  • What is the minimum amount of information for a positive result?
    • Title, date and sometimes authors.
  • Bot approval in process -- Maybe semi-automated.
    • Also have the option of making it an OAuth tool
  • How do we know that it is not a copyright infringing PDF?
    • We don't. We rely on the publisher to track down whoever is hosting it illegally.
    • Might need to get lawyers involved. The community might not like it either.
    • Maybe could have whitelisted domains.
    • Targeting repositories -- might still link to author's homepage
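The whitelisted-domains idea could look something like the sketch below. The domain list and the function name are hypothetical and chosen only for illustration; the notes do not say which domains OABot would trust, and this is not OABot's code.

```python
from urllib.parse import urlparse

# Hypothetical whitelist, for illustration only.
TRUSTED_DOMAINS = {"arxiv.org", "europepmc.org"}


def is_whitelisted(url, trusted=TRUSTED_DOMAINS):
    """Return True if the URL's host is a trusted domain or a subdomain of one."""
    host = (urlparse(url).hostname or "").lower()
    # Match the domain itself or a dot-separated subdomain, so that
    # "evilarxiv.org" does not pass as "arxiv.org".
    return any(host == domain or host.endswith("." + domain) for domain in trusted)
```

A check like this only narrows where links may point. It does not answer the copyright question itself, and an author's homepage would fail the check unless it were handled separately.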

Sebastian

  • Data-repositorian by day
  • Translator leads on Zotero & Citation Style Language
  • Citoid demo
    • (NYTimes page) --> VE "Cite"
  • Ref tool bar has ISBN -- why no ISBN?
    • Probably just not implemented yet. Also, WorldCat is bad. See "Library of Congress SRU API".
  • Can we just run the Zotero code on Node.js?
    • No. Lots of integration with the browser

Diego

  • Learning ecosystem
  • Worked with Wikipedia data for studies of communities
  • Information retrieval and text mining
  • Writing crawlers, parsers, and backend services

Cristian

  • Started on template usage data
  • Parsed the whole dump and extracted this data
  • tools.wmflabs.org/maintgraph -- Queries the database once per day
  • Ongoing study
    • Looked at citation identifiers -- DOI, arXiv, ISBN, PubMed -- and compared with MS Academic Graph
    • Looked at first introduction in Wikipedia
  • Published dataset
    • Pagecounts -- reshuffled
    • Allows for computation of journal and conference based on pageviews

Scott

  • Here to learn
  • Make tools for researchers to get data
  • Targeting R. Used to work with Ironholds.

Day 2

I suggest that we focus on one specific project. People from the UK and US are invited to speak slowly for non-native English speakers. Thank you!

LoC ISBN lookup:

//Sends an SRU request with a CQL query to the Library of Congress, asking for MARCXML back

Requests that work:

http://lx2.loc.gov:210/LCDB?operation=searchRetrieve&version=1.1&maximumRecords=1&query=bath.ISBN={ISBN} # Full MARC 21 data
http://lx2.loc.gov:210/LCDB?operation=searchRetrieve&version=1.1&maximumRecords=1&recordSchema=dc&query=bath.ISBN={ISBN} # Summarized, human readable

LEGAL ISSUES: https://www.loc.gov/legal/
"We reserve the right to block IP address that fail to honor our websites’ robot.txt files, or submit requests at a rate that negatively impacts service delivery to patrons. Current guidelines recommend that software programs submit a total of no more than 10 requests per minute to Library applications, regardless of the number of machines used to submit requests. We also reserve the right to terminate programs that require more than 24 hours to complete. "

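A rough sketch of how a script might build the LoC request URLs above while staying within the quoted guideline of 10 requests per minute. The helper names are hypothetical, and the throttle is a simple "wait at least 6 seconds between requests" approach that assumes a single process. Actually sending the request is left out.

```python
import time
from urllib.parse import urlencode

LOC_SRU_BASE = "http://lx2.loc.gov:210/LCDB"
MIN_INTERVAL = 60.0 / 10  # 10 requests per minute -> at least 6 seconds apart


def loc_isbn_url(isbn, dublin_core=False):
    """Build the SRU searchRetrieve URL for an ISBN (MARC by default, DC on request)."""
    params = [("operation", "searchRetrieve"), ("version", "1.1"), ("maximumRecords", "1")]
    if dublin_core:
        params.append(("recordSchema", "dc"))
    params.append(("query", "bath.ISBN=" + isbn))
    # safe="=" keeps the "=" inside the CQL query unescaped, as in the URLs above.
    return LOC_SRU_BASE + "?" + urlencode(params, safe="=")


class Throttle:
    """Block so that consecutive calls to wait() are at least min_interval apart."""

    def __init__(self, min_interval=MIN_INTERVAL, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self.clock = clock
        self.sleep = sleep
        self.last = None

    def wait(self):
        now = self.clock()
        if self.last is not None:
            remaining = self.min_interval - (now - self.last)
            if remaining > 0:
                self.sleep(remaining)
                now = self.clock()
        self.last = now
```

A caller would create one `Throttle()` and call `throttle.wait()` before each request. The clock and sleep functions are injectable so the behaviour can be checked without really sleeping.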
//search for the ISBN via the SRU interface of the GBV, and take the result as MARCXML

//documentation: https://www.gbv.de/wikis/cls/SRU
var url = "http://sru.gbv.de/gvk?version=1.1&operation=searchRetrieve&query=pica.isb=" + queryISBN + "%20AND%20pica.mat%3DB&maximumRecords=1";
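Once a MARCXML record comes back, pulling out a few fields might look like the sketch below. The sample record is made up for illustration, and a real SRU response wraps records in additional envelope elements. The field choices (020 $a for ISBN, 100 $a for main author, 245 $a for title) follow the usual MARC 21 conventions; the function names are hypothetical.

```python
import xml.etree.ElementTree as ET

MARC_NS = {"marc": "http://www.loc.gov/MARC21/slim"}

# Made-up sample record, not real catalogue data.
SAMPLE = """<record xmlns="http://www.loc.gov/MARC21/slim">
  <datafield tag="020" ind1=" " ind2=" "><subfield code="a">9780000000002</subfield></datafield>
  <datafield tag="100" ind1="1" ind2=" "><subfield code="a">Doe, Jane</subfield></datafield>
  <datafield tag="245" ind1="1" ind2="0"><subfield code="a">An example title</subfield></datafield>
</record>"""


def first_subfield(record, tag, code):
    """Return the text of the first matching subfield, or None if absent."""
    path = "marc:datafield[@tag='%s']/marc:subfield[@code='%s']" % (tag, code)
    node = record.find(path, MARC_NS)
    return node.text if node is not None else None


def summarize(marcxml):
    """Extract ISBN, main author, and title from the first MARCXML record found."""
    root = ET.fromstring(marcxml)
    if root.tag == "{%s}record" % MARC_NS["marc"]:
        record = root
    else:
        # Look for a record nested anywhere inside a wrapper element.
        record = root.find(".//marc:record", MARC_NS)
    if record is None:
        return None
    return {
        "isbn": first_subfield(record, "020", "a"),
        "author": first_subfield(record, "100", "a"),
        "title": first_subfield(record, "245", "a"),
    }
```

`summarize(SAMPLE)` returns a small dictionary with the three fields, and input with no record in it yields `None`.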

Citoid Group etherpad: https://etherpad.wikimedia.org/p/wikicite-citoid

MARC 21 mapping: https://www.loc.gov/marc/bibliographic/ecbdlist.html

DOI metadata lookup:

$ cat datasets/500_dois.tsv | python demonstrate_doi_fetch_performance.py

Running against api.crossref.org
..e..e.................e................................e............................e..........e...ee.........................e.............e...........................................................................................e..ee...........................................e.................................e.............................................................................................................e..................e.......................................................
Processing 500 DOIs took 32.718 seconds.
So, ~0.065 seconds per lookup

Running against doi.org
..e..e.................e................................e............................e..........e...ee.........................e.............e...........................................................................................e..ee...........................................e.................................e.............................................................................................................e..................e.......................................................
Processing 500 DOIs took 160.348 seconds.
So, ~0.321 seconds per lookup

Running against citoid.wikimedia.org
..e..e.................e................................e............................e..........e...ee..e..e...................e.............e...........................................................................................e..ee.......................e...................e.................................e.............................................................................................................e..................e.......................................................
Processing 500 DOIs took 1718.462 seconds.
So, ~3.437 seconds per lookup

>1s lookup = morally wrong

So, if we're going to extract metadata for ~601k DOIs at ~3.437 seconds per lookup, that will take ~24 days when run serially (601,000 × 3.437 s ≈ 2.07 million seconds). I'll be checking into using a threadpool to speed this up. It might be OK with the citoid folks.
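The threadpool idea might be sketched as below. This is not the code in the gist: `fetch_all` and the `fetch` callable it takes are hypothetical names, and the worker count is an arbitrary placeholder that would need to be agreed with whoever runs the service being queried.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_all(dois, fetch, max_workers=8):
    """Apply `fetch` to every DOI concurrently.

    Returns a dict mapping each DOI to its result, or to the exception that
    `fetch` raised for it (comparable to an "e" in the progress output).
    """
    dois = list(dois)

    def safe_fetch(doi):
        try:
            return fetch(doi)
        except Exception as error:  # keep going when a single lookup fails
            return error

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so results line up with `dois`.
        return dict(zip(dois, pool.map(safe_fetch, dois)))
```

Because the lookups spend most of their time waiting on the network, threads are a reasonable fit here. The achievable speed-up still depends on how many parallel requests the remote service tolerates.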

doi.org lookup code: https://gist.github.com/halfak/7113348ab3496a3af3b7c2b2de14a526


OAbot group:

  • we have expanded the documentation of the project, now centralized here: http://en.wikipedia.org/wiki/Wikipedia:OABOT
  • we have added more debugging information to the web interface
  • we have tested the software and identified some bugs; some have been fixed and others are being fixed.

mwlinks: