Open Access Reader/Project Diary

From Meta, a Wikimedia project coordination wiki

Discussions over Open Scholarship Weekend and Wikimania

  • General positive feedback from community (so - in principle at least - idea is feasible and worthwhile)
  • Cameron Neylon: "Use CORE"
  • Identified conference to go to - a workshop organised by CORE on doing things exactly like what we're doing!

1 September 2014 Kickoff Meeting

  • Ed, Kimi & A930913 met up to review the project goals.
  • Discussed architecture:
    1. A process that finds and ranks metadata of papers missing from Wikipedia, and offers it as an API
    2. Various UXs built off this API to expose this info to users
      • the first being a web interface (few dependencies, so quick to prototype)
      • future ones perhaps using wiki-native pathways, e.g. Task Recommendations Project
  • The CORE API service has proved intermittent, and in any case we need to join with some Wikimedia tables (i.e. Wikipedia outbound links), so we decided to replicate a static version in Labs as a test.
  • Set up project in Labs
  • Visually inspected some metadata - format seems relatively straightforward.
  • Next steps: A930913 to try to set up a static metadata DB in Labs
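
As a rough illustration of the first half of the architecture above (find papers missing from Wikipedia, then rank them), here is a minimal sketch. Everything named in it is an assumption made for the example: the `Paper` record, the `score` field and the `rank_missing` function are hypothetical, matching is done on DOI only, and the ranking metric itself had not been designed at this point.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Paper:
    """Hypothetical metadata record for one open access paper."""
    identifier: str
    title: str
    doi: Optional[str]
    score: float  # placeholder significance score; the real metric is undecided


def rank_missing(papers, cited_dois, limit=10):
    """Return the highest-scoring papers whose DOI is not already cited.

    Papers without a DOI cannot be matched this way, so this sketch skips them.
    """
    cited = {doi.lower() for doi in cited_dois}  # DOIs are case-insensitive
    missing = [p for p in papers if p.doi and p.doi.lower() not in cited]
    return sorted(missing, key=lambda p: p.score, reverse=True)[:limit]
```

An API endpoint or web interface would then only need to serialise the list that `rank_missing` returns.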

10 September 2014 Facebook Conversation

Upload of the database is proving very slow:

A930913: Metadata is received as a series of text files. I have written code to extract each article, and then extract the information about each article. This works quite well. I have also designed a database to store aforementioned metadata, in a format that I made and I saw that it was good (enough). The previously noted code, upon extraction, inserts it into this database. This program does not appear to be consuming much resources on the labs, and therefore my current assumption is that the database server is the bottleneck, in that it can only accept the data in so fast. Resolutions will likely be inserting the data in a manner that is more efficient, or moving the database to a server that will handle it better.

A930913 resolves to ask Coren for advice.

11 September 2014 Discussion with Coren

Thanks to Coren's advice, upload efficiency was increased by 100,000%.

[13:23:08] <a930913>	 Coren: Poke?
[13:26:17] <a930913>	 Need some help working out why inserting a data dump into tools-db is taking so long.
[19:00:24] <Coren>	 Hmm?
[19:00:32] * Coren pokes back! Poing!
[19:00:38] <a930913>	 Coren: Did you see my message above?
[19:01:35] <a930913>	 14:26 < a930913> Need some help working out why inserting a data dump into tools-db is taking so long.
[19:01:48] <Coren>	 Ah, no.  Sorry.
[19:02:42] <Coren>	 But I'm going to venture a guess that you want someone more skilled than I at DBA.  I can see the obvious, but the subtleties of DB efficiencies of mysql require more arcana than I can summon.  :-)
[19:03:11] <a930913>	 Coren: I've bashed a python that extracts metadata from a dump, and puts it into the db.
[19:03:40] <a930913>	 When I htop it, it doesn't look like it's using much resources.
[19:03:40] <Coren>	 Sounds straightforward enough.
[19:04:03] <a930913>	 So I am assuming that the bottleneck is the db.
[19:04:41] <Coren>	 It probably is.  The obvious first question(s) is how you're doing the inserts and whether there are indices, triggers or expensive constraints that might be the primary issue.
[19:04:47] <a930913>	 So is it poor design on my part, or is there too much load on the db?
[19:05:30] <a930913>	 There is one primary key per table, and a few relations.
[19:06:05] <Coren>	 a930913: atm, the DB is fairly busy.  Are you currently inserting things?
[19:06:37] <Coren>	 Funny thing is that it's not I/O bound that I can see, mostly CPU-bound.  Which is odd.
[19:07:05] <a930913>	 (cat /data/project/oar/schema.sql)
[19:07:27] <a930913>	 Coren: Yeah, I've been inserting for about a week. :/
[19:07:45] <a930913>	 (And my computer crashed earlier, so I've lost the mosh session :/ )
[19:08:09] <Coren>	 a930913: What DB are you inserting into?
[19:08:25] <a930913>	 Erm, tools-db wasn't it?
[19:08:48] <Coren>	 Erm, schema.  Sorry, postgres parlance.  :-)
[19:09:12] <Coren>	 I.e.:  s?????_???
[19:09:47] <a930913>	 Coren: s52074__core_metadata
[19:10:55] <Coren>	 a930913: Yeah, I see it.  Hmm.
[19:18:39] <Coren>	 a930913: With postgres, I'd have suggested you use the COPY statement instead; not all that certain what the best way to do this with mysql is.  LOAD DATA LOCAL INFILE might be your best bet?  Otherwise, I'd consult with springle who is our resident mysql magician.  :-)
[19:22:00] <a930913>	 Coren: The wha? I'm using a series of INSERTs, is that not right?
[19:22:42] <Coren>	 a930913: It's /correct/, insofar as it will do the job, but I'm pretty sure that you can do a mass insert with LOAD DATA LOCAL INFILE in a way that's much more efficient.
[19:23:16] <Coren>	 a930913: I'm pretty sure that your task is currently dominated by the overhead of the individual insert statements.
[19:25:12] <Coren> might help, too
[19:27:19] <Coren>	 A cursory reading of the related also suggest that LOAD DATA is up to 20x faster than inserts, too.
[19:27:41] <a930913>	 Yeah, just looking at the MariaDB version of that.
[19:28:10] <a930913>	 Coren: Do I use the filename /data/project/?
[19:29:30] <Coren>	 If you use LOAD DATA LOCAL INFILE then you need the filename as seen by the client - i.e.: with the path name you'd use to look at it.
[19:30:59] <a930913>	 Coren: Yeah, but if the db server can read it direct?
[19:31:13] <a930913>	 Is it accessible from the db server?
[19:32:05] <Coren>	 a930913: No; the db server lives on a separate network; and runs as a user that wouldn't have the right permissions anyways.
[19:32:29] <a930913>	 Ok.
[19:32:41] <a930913>	 Coren: Do you know the format for this INFILE?
[19:34:03] <Coren>	 You get to pick it.  The defaults, IIRC, are tab-separated columns, with newlines between the rows.
[19:35:04] <Coren>	 look at
[19:36:05] <a930913>	 Hmm, "SHOW PROCESSLIST" suggests that the bottleneck might be where it searches to see if a piece of data can be normalised...
[21:08:50] <a930913>	 Coren: How can I make a bash script start many instances of the same script, but with different parameters?
[21:10:03] <Coren>	 a930913: I'm not sure what you're asking exactly.  Moar context?
[21:10:39] <a930913>	 Coren: Sorry, jsub.
[21:10:53] <a930913>	 Want many jobs with different parameters.
[21:12:58] <a930913>	 Coren: I.e. jsub $FILE, for FILE IN $FILES
[21:14:04] <Coren>	 a930913: A shell script could do it:  for foo in file1 file2 file3; do jsub $foo; done
[21:14:50] <a930913>	 Coren: I thought you couldn't submit arguments with jsub?
[21:14:59] <a930913>	 It could only be a pure executable.
[21:15:59] * a930913 wonders where he read that...
[21:16:57] <Coren>	 It has to be executable, but you can still give it arguments.  :-)  The only real difference with qsub is that you can't pipe a script into it.
[21:17:51] <a930913>	 Hmm. Anyway, DDoS submitted :p
[21:19:25] <Coren>	 It's not a DDOS, it's what gridengine is /for/  :-)
[21:19:58] <a930913>	 Oh ****.
[21:20:37] <a930913>	 I ran the old parser script...
[21:20:50] * a930913 introduces his face to the desk.
[21:21:44] <a930913>	 Good thing they all have the same name :) "qdel parser"
[21:21:48] <Coren>	 Palms hurt less and have mostly the same symbolic value.  :-)
[21:23:02] <a930913>	 For the first time, I'm piping qstat into less :o
[21:24:43] <a930913>	 Holy ****.
[21:25:09] <a930913>	 I've never really used the gridengine as a grid before.
[21:25:40] <a930913>	 It just went and parsed gigabytes of text, right before my eyes.
[21:25:46] <Coren>	 :-)
[21:26:21] <Coren>	 Moar parallel!
[21:26:28] <a930913>	 Coren: How much read can /data/project/ handle?
[21:27:13] <Coren>	 a930913: Taking around the various inneficiencies, and depending on the actual read pattern, probably somewhere around 5-6gbps
[21:27:56] <Coren>	 a930913: Though you'd be hard pressed to actually sustain that over a long period because of contention with other tools.
[21:28:12] <a930913>	 So it could load up all the grid in three seconds.
[21:28:26] <a930913>	 Then 10x parallel. :o
[21:28:40] <Coren>	 Sure, but it'd level off fairly quickly as the disk scheduler partitions the usage.
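
For reference, the bulk-load approach Coren suggests could be sketched as follows: write the parsed rows to a tab-separated file, then load it with a single LOAD DATA LOCAL INFILE statement instead of one INSERT per row. This is a sketch under assumptions, not the script actually used. The function names are hypothetical, the escaping follows the defaults mentioned in the chat (tab-separated columns, newline-terminated rows, backslash escapes, \N for NULL), and table and column names are assumed to be trusted.

```python
def escape_field(value):
    """Escape one value for LOAD DATA's default text format."""
    if value is None:
        return r"\N"  # default spelling of NULL in a LOAD DATA input file
    text = str(value)
    # Escape backslashes first, so the escapes added below are not doubled.
    text = text.replace("\\", "\\\\")
    return text.replace("\t", "\\t").replace("\n", "\\n").replace("\r", "\\r")


def format_row(row):
    """Turn one row (a sequence of values) into a tab-separated line."""
    return "\t".join(escape_field(value) for value in row) + "\n"


def write_infile(rows, path):
    """Write all rows to path and return how many were written."""
    count = 0
    with open(path, "w", encoding="utf-8", newline="\n") as handle:
        for row in rows:
            handle.write(format_row(row))
            count += 1
    return count


def load_statement(path, table, columns):
    """Build the bulk-load statement; path, table and columns are trusted here."""
    return (
        f"LOAD DATA LOCAL INFILE '{path}' INTO TABLE {table} "
        f"({', '.join(columns)})"
    )
```

Because the LOCAL variant reads the file on the client side, the path is the one seen by the tool account, as Coren points out above.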

12 September 2014 3rd International Workshop on Mining Scientific Publications

EXCELLENT conference - lots of great outcomes:

  • People I met:
    • Peter Murray Rust: prominent scientist & supporter of the project, based in Cambridge. Shuttleworth fellow and head of the ContentMine project.
    • C Lee Giles: prominent scientist & supporter of the project, based at Penn State. Head of the CiteSeer project.
    • Michelle Brook: Recommended I get in touch with Steve Cross and Will Hutton as possible evangelists to help engage the public.
    • Joe Wass: R&D programmer at CrossRef, which runs the DOI system and has lots of openly accessible metadata.
    • Petr Knoth: Lead on the CORE project.

Notable talks:

C Lee Giles, Information Extraction and Data Mining for Scholarly Big Data

A world leader in mining scholarly data. Main relevant takeaways:

  • The amount of scholarly literature published is very large.
  • The amount of scholarly literature being published over time is growing very fast.
  • The proportion of scholarly literature that is open access is increasing.

All good signs for this project!

Tim Gollub, A Keyquery-Based Classification System for CORE

Very literally, this guy was taking papers from CORE and grouping them in a hierarchy of topics using Wikipedia categories, exactly something I need for a later phase of Open Access Reader! He was also interested that OAR is Wikimedia funded and thinks some of his future work may apply, so I told him how to find out more about the Wikimedia grant programme.

Petr Knoth & Drahomira Herrmannova, A new semantic similarity based measure for assessing research contribution

A new way of measuring the significance of papers based on how many subject areas they affect. In general, a good overview of bibliometric techniques. Presented by two members of the CORE team, who later said that they'd be interested in integrating this and other bibliometric techniques directly into CORE - another thing I'll need for Open Access Reader!

Birger Larsen, Developing benchmark datasets of scholarly documents and investigating the use of anchor text in physics retrieval

An overview of "Information Retrieval" methods, which will be useful when we come to filtering the CORE output. Also useful for another Wikimedia project I'm working on!

16 September 2014 Database Update

  • Articles and non-relational data uploaded into database (identifier, repository, shortTitle, abstract, date, doi): 15773726 inserted; 36 excessively long abstracts truncated.
  • Authors uploaded into database
  • Mapping of authors to articles currently uploading into database (24 hours and counting)
  • Cites ready to upload.

Technical note: The database server is the bottleneck, so it is best to remove as much load from it as possible. We are currently using the Labs grid to massively parallelise parsing and normalisation, generating large INFILEs to import into the database.
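
One way to take normalisation off the database server, sketched here under assumptions rather than taken from the actual parser, is to assign author IDs in memory while parsing, so the database never has to search for an existing author during the load. The function name and the row layouts are hypothetical. This is a single-process illustration; parallel grid jobs would need a shared or deterministic ID scheme so that they agree on IDs.

```python
def normalise_authors(articles):
    """Split (article_id, [author names]) pairs into two relational row sets.

    Returns (author_rows, link_rows):
      author_rows -- (author_id, name), one per distinct name
      link_rows   -- (article_id, author_id), one per authorship
    """
    author_ids = {}  # name -> integer id, assigned in first-seen order
    link_rows = []
    for article_id, names in articles:
        seen = set()  # guards against a name listed twice on one article
        for name in names:
            key = " ".join(name.split())  # collapse stray whitespace only
            if not key or key in seen:
                continue
            seen.add(key)
            # Reuse the existing id, or assign the next one for a new name.
            author_id = author_ids.setdefault(key, len(author_ids) + 1)
            link_rows.append((article_id, author_id))
    author_rows = [(author_id, name) for name, author_id in author_ids.items()]
    return author_rows, link_rows
```

Both row sets can then be written out as INFILEs and bulk-loaded, leaving the server with nothing to look up per row.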

Ed notes that the number of articles that have DOIs is surprisingly low: 1652649 out of 15773726, or roughly 10.5%.

Some example output:

19 September 2014 Meeting with Petr Knoth, CORE project lead

  • Ed met Petr at the dl2014 conference on 12/09/14
  • Ed, A930913 and Petr met for coffee near The Strand. Ed explained the OAR project, and learnt some history and plans for CORE.
  • It's easier for the functionality we need to exist on the CORE side, so we can just build services off it, rather than us having to replicate, extend and maintain the service.
  • There was enthusiasm from both sides, and there was a long discussion on potential future directions for the technology.

22 September 2014 Follow up from Petr

  • Petr sent Ed an email with his summary of what he understood from the meeting on the 19th, which included an indication of price for his team to implement the necessary changes that will allow us to proceed with OAR. With his permission, it's reproduced here:

Hi Ed,

It was great meeting you on Friday. Based on our discussion, I see the following role for us in the project.

The team on our side will consist of Dasha (in CC) and myself. The main output of this project (achievable within the 6 month duration) is that we will provide you with a list of OA articles ordered according to some meaningful metric that will characterize the importance of each article for Wikipedia. Where possible, we will also supply a link to the full-text of the article in CORE, a DOI, or a URL of the repository of origin. The output of this will be a prototype SW for processing the data and producing the outputs as well as a short report describing what we achieved, limitations of the approach, etc.

The ultimate goal (vision) for us, which goes beyond the lifetime of this 6 months project, is to also automatically interlink ("wikify") full-texts of OA articles in order to improve the accessibility of public knowledge in research papers to the general public. I can imagine many kinds of use cases of this that will enable people to engage with relevant recent research while reading Wikipedia and vice versa.

In terms of tasks and time, we need to:

  • process and harmonise data from CORE for use in this project
  • design our scoring metric
  • gather useful data from external sources which are evidence of the usefulness of an article for Wikipedia
  • implement our scoring method and test it on the gathered data
  • evaluate the level of success of the method and reiterate if needed
  • export the data and agree a method for sharing it among ourselves
  • write a short report describing the outputs, limitations, etc.

We will also need to communicate on a regular basis and it might be useful to have perhaps two meetings in London over the 6 months period.

I think that a reasonable (though fairly low) estimate of this work will be around 250 hours (we can easily spend more, but I think this is what will make this doable). The structure is as follows:

Dasha ($50 per hour) * 150h = $7,500
Petr ($75 per hour) * 100h = $7,500
Total: $15k

I know this is a bit higher than we discussed, but I can see the maximum funding is $30k and I believe we should go for the maximum as the project has a lot of potential.

Please let me know what you think.

Best wishes,


As far as I (Ed) can tell, he understands correctly our requirements for the back end, and is in a unique position to deliver it. I think this would be a good, low-delivery-risk way to proceed, that produces a stable back end that can be re-used for other applications. NB, we don't have to wait 6 months to begin working on the front end:

Ed: How long do you estimate it will take you to get the first prototype service up? I'm keen to work on a front end component alongside this, and if we can build off your service from the start it'll be easier. The quality of the metric isn't very important for this, just the interface.
Petr: I don't see a problem for you to work in parallel on the frontend. We should define the format of the data we will supply to you early on. We can then each work in parallel and integrate our effort once we are ready. Depending on the situation, we might (and perhaps should) also be able to do a few iterations (data exchanges) during the project.

7 October 2014 Catchup with Siko


  • We have shown that the back end is possible
    • The metadata exists in the form that we need, in sufficient quantities to be useful
    • The significance ranking system is something that can be built (and we have access to the expertise to build it well)


  • We must share more explicitly our intended plan for the front end;
    • for the newbie (website) version, describe a route to market, create mock-ups
    • for the wikimedian (bot) version, engage with communities
    • begin write-up

8 Oct 2014 Press List

Once the tool is functional, I intend to run a PR campaign around it aimed at popular science publications. When I asked for suggestions, I received dozens:

10 Oct 2014 First Wireframes

Kimi produced some sketches of how the interface might work:

End of Oct Illness

Ed was really sick, so not much happened!

UPDATE: Turns out, Ed's house has been infected with damp mould, which has been making him sick for months! After a clear out of ventilators and dehumidifying, he should be a lot more productive now!

Week 10 Nov 2014 Writing up

With Katherine Bavage's help, Ed began collating materials for the midpoint report, producing a compressed summary of progress so far:

  • Strong recommendations for CORE as the source for OA metadata.
  • A small but promising proof of concept, generated from a static metadata dump from CORE.
  • Discovery of research outlining a method of matching OA articles from CORE to Wikipedia categories.
  • A set of wireframes for a desktop UI, with mockups coming soon.
  • A proposal from the CORE team to produce and support:
    • the backend required to supply open access metadata in the form we require for OAR.
    • a considered and justified ranking methodology.
  • Discovery of Citoid, a tool to automatically generate correct citation links.
  • A press list for a campaign to develop a crowdsourcing community.

A930913 was paid £1250 for work done so far, which is equal to the exploration period developer budget.

9 Dec 2014 Community Outreach

Ed posted to various community locations asking for ideas and feedback:

15 Dec update: no replies so far...

15 Dec 2014 Crowdsourcing Examples

9 Feb 2015 DOI work

30 Jun 2015 Article

Amplifying the Impact of Open Access: Wikipedia and the Diffusion of Science
Abstract: "With the rise of Wikipedia as a first-stop source for scientific knowledge, it is important to compare its representation of that knowledge to that of the academic literature. This article approaches such a comparison through academic references made within the world's 50 largest Wikipedias. Previous studies have raised concerns that Wikipedia editors may simply use the most easily accessible academic sources rather than sources of the highest academic status. We test this claim by identifying the 250 most heavily used journals in each of 26 research fields (4,721 journals, 19.4M articles in total) indexed by the Scopus database, and modeling whether topic, academic status, and accessibility make articles from these journals more or less likely to be referenced on Wikipedia. We find that, controlling for field and impact factor, the odds that an open access journal is referenced on the English Wikipedia are 47% higher compared to closed access journals. Moreover, in most of the world's Wikipedias a journal's high status (impact factor) and accessibility (open access policy) both greatly increase the probability of referencing. Among the implications of this study is that the chief effect of open access policies may be to significantly amplify the diffusion of science, through an intermediary like Wikipedia, to a broad public audience."

9 April 2016 DOI Stream

Saw this, might be useful: