Research talk:Scholarly article citations in Wikipedia/Work log/2015-02-09

From Meta, a Wikimedia project coordination wiki

Monday, February 9, 2015

I did some work this morning with v0.0.5 of https://github.com/halfak/Extract-scholarly-article-citations-from-Wikipedia.

Extract a random sample of DOI citations

$ cat doi_and_pubmed_citations.enwiki-20150112.tsv | grep doi | shuf -n 1000 > sample_doi.1k.tsv

Using Crossref to check DOIs

$ cat sample_doi.1k.tsv | awk -F"\t" '{print "http://api.crossref.org/works/"$6"/agency"}' | xargs -I {} bash -c "wget --quiet -O- '{}' | sed -r 's/(.*)/\1\n/'" > doi_agencies.1k.json

Convert DOIs to sorted sets and diff

$ cat sample_doi.1k.tsv | tail -n+2 | cut -f6 | tr '[:upper:]' '[:lower:]' | sort | uniq > sample_doi.1k.set.tsv
$ cat doi_agencies.1k.json | mwstream json2tsv message.DOI | tr '[:upper:]' '[:lower:]' | sort | uniq > doi_agencies.1k.set.tsv
$ diff sample_doi.1k.set.tsv doi_agencies.1k.set.tsv | grep "^<" | sed -r "s/^<\s(.+)/\1/" > missing_doi.1k.tsv
$ wc -l < missing_doi.1k.tsv
103
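
The same set difference can be sketched with comm instead of diff, which avoids parsing the "<" markers. This is only a rough sketch: normalize_set and missing_from are hypothetical helper names, and it assumes each input file holds one DOI per line. Lowercasing happens before sorting so that the sort order and deduplication match the case-insensitive comparison.

```shell
# Hypothetical helper: lowercase a one-DOI-per-line file, then sort and dedupe it.
normalize_set() {
    tr '[:upper:]' '[:lower:]' < "$1" | LC_ALL=C sort -u
}

# Hypothetical helper: print the DOIs in file $1 that are absent from file $2.
missing_from() {
    local a b
    a=$(mktemp)
    b=$(mktemp)
    normalize_set "$1" > "$a"
    normalize_set "$2" > "$b"
    # comm needs both inputs sorted under the same collation;
    # -23 suppresses columns 2 and 3, leaving lines unique to the first file.
    LC_ALL=C comm -23 "$a" "$b"
    rm -f "$a" "$b"
}

# Example (not run here):
#   missing_from sample_dois.txt crossref_dois.txt > missing_doi.1k.tsv
```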

Spot-checked missing DOIs

$ shuf -n10 missing_doi.1k.tsv

Well... that looks good. Almost all the IDs that don't resolve with Crossref resolve just fine with dx.doi.org, and the ones that don't resolve there still look like correct extractions.
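
The spot check could be scripted roughly as follows. This is a sketch under assumptions, not what was actually run: classify_status and check_doi are hypothetical helpers, it assumes the doi.org resolver answers a registered DOI with an HTTP redirect and an unknown one with 404, and it assumes the DOI contains no characters that need URL-encoding.

```shell
# Hypothetical helper: map an HTTP status code from the resolver to a label.
classify_status() {
    case "$1" in
        30[1-8]) echo resolves ;;  # a redirect means the resolver knows the DOI
        404)     echo unknown ;;   # the resolver has no record of the DOI
        *)       echo error ;;     # network failure or unexpected status
    esac
}

# Hypothetical helper: query the resolver for one DOI without following the redirect.
check_doi() {
    local code
    code=$(curl --silent --output /dev/null --write-out '%{http_code}' "https://doi.org/$1")
    printf '%s\t%s\n' "$1" "$(classify_status "$code")"
}

# Example (not run here):
#   shuf -n10 missing_doi.1k.tsv | while IFS= read -r doi; do check_doi "$doi"; done
```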

Counts

DOI/Page pairs
$ cat doi_and_pubmed_citations.enwiki-20150112.tsv | grep doi | wc
742565 5269445 63756121
PubMed ID/Page pairs
$ cat doi_and_pubmed_citations.enwiki-20150112.tsv | grep -v "doi" | grep -E "pmcid|pmid" | wc
437484 3011320 30215760
Unique DOIs
$ cat doi_and_pubmed_citations.enwiki-20150112.tsv | grep doi | cut -f6 | sort | uniq | wc
524357 524357 13332518
Unique pages with DOIs
$ cat doi_and_pubmed_citations.enwiki-20150112.tsv | grep doi | cut -f1 | sort | uniq | wc
172644 172644 1438573
Unique pages with PubMed IDs
$ cat doi_and_pubmed_citations.enwiki-20150112.tsv | grep -v "doi" | grep -E "pmcid|pmid" | cut -f1 | sort | uniq | wc
68648 68648 575015
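
Each count above is a separate grep/cut/sort/uniq pass. A single awk pass can report the matching rows and the unique values of one field together. This is a rough sketch: count_pairs_and_unique is a hypothetical helper, and it assumes (as the cut -f1 and cut -f6 calls imply) that field 1 is the page and field 6 is the identifier. Like grep doi, the pattern is matched against the whole line.

```shell
# Hypothetical helper: for lines of TSV file $1 matching regex $2, print
# "<matching rows><TAB><unique values of field $3>".
count_pairs_and_unique() {
    awk -F'\t' -v pat="$2" -v col="$3" '
        $0 ~ pat {
            rows++
            # Count each distinct value of the chosen field only once.
            if (!($col in seen)) { seen[$col] = 1; uniq++ }
        }
        END { printf "%d\t%d\n", rows, uniq }
    ' "$1"
}

# Example (not run here):
#   count_pairs_and_unique doi_and_pubmed_citations.enwiki-20150112.tsv doi 6
#   count_pairs_and_unique doi_and_pubmed_citations.enwiki-20150112.tsv doi 1
```

The unique values are held in memory, so this trades the external sort for an in-memory table.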

--22:52, 9 February 2015 (UTC)