Community Wishlist Survey 2020/Archive/Improve workflow for uploading academic papers to Wikisource


Improve workflow for uploading academic papers to Wikisource

Not enough information

5-minute documentary on medics using Internet-in-a-Box in the Dominican Republic, where many medical facilities have no internet. These medics would presumably also appreciate better access to the research literature (Doc James?).

Opportunity: There are a large and increasing number of suitably-licensed new Wikisource texts: academic papers (strikingly-illustrated example on Wikisource). Many articles are now published under CC-BY or CC-BY-SA licenses, and with fully-machine-readable formats; initiatives like Plan S will further this trend.
Problem: Uploading these articles is needlessly difficult, and few are on Wikisource.
Who would benefit: Having these papers on Wikisource (or a daughter project) would benefit:

anyone accessing open-access fulltexts online. There is an ongoing conflict between traditional academic-article publishers and the open-access movement in academia, and some publishers have made freely-licensed article fulltexts harder to access. For instance, one publisher has paywalled open-access articles, pressured academics and third-party hosters to take down fulltexts (examples), charged for Creative-Commons-licensed materials,[1] sought to retroactively add NC restrictions to Creative Commons licenses, forbidden republication of CC-licensed articles on some platforms at some times, and acted with the apparent aim of making legal free fulltexts harder to find;[2] it also bought out a sharing platform (example). Wikimedia has the social and legal clout to resist such tactics, and is well-indexed by search tools.
anyone wanting offline access to the academic literature (through Internet-in-a-Box): people with poor internet connectivity (including field scientists), or people subject to censorship
those who have difficulties reading non-accessible content. Some online journals work hideously badly with screenreaders.
anyone wishing to archive a previously published paper, including the Wikijournals (journals go bust and go offline, and many research funders require third-party archiving)
those using the fulltexts for other Wikimedia projects (e.g. Wikipedia sourcing, academic article illustrations copied to Commons for reuse)

Proposed solution: Create an importer tool for suitable articles, with a GUI.
More comments:
- Konrad Foerstner's JATS-to-Mediawiki importer does just this, but seems stuck in pre-alpha.
  - Even the citation-metadata scrapers which we already use could automate much of the formatting.
  - Another apparently-related tool is the possibly-proprietary Pubchunks.
- The #Icanhazwikidata system allows academics to add academic-paper metadata to Wikidata by tweeting an identifier; Magnus Manske's Source Metadata tool added it automatically. An #Icanhazfulltexthosting system could allow uploading a fulltext by tweeting a fulltext link (with some feedback if your linked PDF lacks the needed machine-readable layer).

Phabricator tickets:
Proposer: HLHJ (talk) 02:02, 9 November 2019 (UTC)

Discussion

Thanks to Daniel Mietchen for the example article. I've made a lot of statements about projects I don't know much about, and would appreciate advice and corrections. HLHJ (talk) 02:02, 9 November 2019 (UTC)

Us becoming a repository for high quality open access journals is a good idea. We just need to be careful that we do not include predatory publishers. Doc James (talk · contribs · email) 09:22, 9 November 2019 (UTC)

Agreed. Plan S and such look as if they may help more broadly with that, too. HLHJ (talk) 01:11, 21 November 2019 (UTC)

@Doc James: What, in your view, is the advantage of hosting academic papers on Wikisource vs just hosting the PDFs on Commons? It seems like most mobile devices and browsers these days support reading PDFs, and modern PDFs almost always have a text layer that is already searchable (and gets indexed by Google, etc.). Converting PDFs into Wikisource texts is a lot of work, and I'm curious what it achieves in your view. Kaldari (talk) 16:21, 12 November 2019 (UTC)

User:Kaldari Yes, maybe the problems with PDFs are improving. Papers are more sources than media files, though, so they fit more naturally within Wikisource. Doc James (talk · contribs · email) 16:28, 12 November 2019 (UTC)

Doc James, Kaldari: I'd prefer markup to PDF for several reasons:

First, text reusability; the text in these articles could be included in other wiki content. Cutting and pasting from a PDF can be really awkward, requiring a lot of manual editing to, say, re-insert the spaces between the words which the OCR for some reason omitted. It also makes the images much more easily and quickly reusable, on Wikiversity, in the Wikijournals, in Wikipedia, etc. The media in old, no-text-layer PDFs are also useful.
Separately, if you don't have a good cheap high-bandwidth internet connection, a PDF is bigger, to download or pack by sneakernet. Consider downloading all of Wikisource vs downloading all PDFs from Commons (which I don't think Internet-in-a-Box has an interface for). If you are paying for satellite bandwidth, this matters.
Using a screenreader, it is reportedly rather difficult to read many academic paper PDFs (things like no spaces, or having the page footer and header read out to you in the middle of a sentence once a page, are annoying). By contrast, it is comparatively easy to read Mediawiki pages, which is very much to the credit of the devs. The reward for a job well done is more jobs...
This could attract new editors to the community.

Most of the fancy formatting for an article on Wikisource is for the metadata. Just a simple tool that would create the metadata template from Crossref data or PMC would make importing articles a lot easier, especially if there is a machine-readable text layer (even a lousy one). This semi-automation seems to me as if it would not be a lot of work, but as I say, I am ignorant. HLHJ (talk) 01:11, 21 November 2019 (UTC)
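The metadata step described above could be sketched roughly as follows. This is a minimal illustration, not an existing tool. The record is a made-up example shaped like the JSON that the Crossref REST API returns for a work (`title`, `container-title`, `author`, `issued`, `DOI`), and the template name `article-header` and its parameter names are hypothetical placeholders for whatever header template Wikisource editors would actually use.

```python
# Made-up record, shaped like a Crossref "work" object (not real data).
SAMPLE_RECORD = {
    "title": ["An example open-access paper"],
    "container-title": ["Journal of Examples"],
    "author": [
        {"given": "Ada", "family": "Example"},
        {"given": "Ben", "family": "Sample"},
    ],
    "issued": {"date-parts": [[2019, 11, 9]]},
    "DOI": "10.1234/example",
}


def crossref_to_header(record: dict) -> str:
    """Build wikitext for a (hypothetical) header template from a Crossref-style record."""
    # Crossref stores titles and journal names as lists of strings.
    title = (record.get("title") or [""])[0]
    journal = (record.get("container-title") or [""])[0]
    authors = ", ".join(
        f"{a.get('given', '')} {a.get('family', '')}".strip()
        for a in record.get("author", [])
    )
    # The publication date is a nested list: [[year, month, day]].
    date_parts = record.get("issued", {}).get("date-parts", [[None]])
    year = date_parts[0][0] if date_parts and date_parts[0] else None

    fields = {
        "title": title,
        "author": authors,
        "journal": journal,
        "year": "" if year is None else str(year),
        "doi": record.get("DOI", ""),
    }
    lines = ["{{article-header"]  # hypothetical template name
    lines += [f" | {key} = {value}" for key, value in fields.items()]
    lines.append("}}")
    return "\n".join(lines)


print(crossref_to_header(SAMPLE_RECORD))
```

Missing fields are left blank rather than guessed, so a human proofreader can see at a glance what still needs filling in.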

Thank you for posting this proposal, HLHJ! However, there is a problem: it describes in much detail who will benefit, but does not explain the problem: what's the current procedure, what are the pain points, etc. Could you elaborate on this, please? MaxSem (WMF) (talk) 01:35, 14 November 2019 (UTC)

Unfortunately, we can't accept proposals that don't explain the problem in sufficient detail. Thank you for participating in our survey. MaxSem (WMF) (talk) 00:29, 19 November 2019 (UTC)

Sorry, MaxSem (WMF), I am a tad late checking back, off-wiki life. This answer is probably too late, but... Currently you manually upload a PDF to Commons, and manually type it up, or if there is a machine-readable copy you cut-and-paste and manually reformat it, as far as I know (I asked on the main Wikisource forum a while back and got nothing better). It's doable, it just takes forever, and it's frustrating to manually copy structured machine-readable metadata when a program could do it much more rapidly and reliably. No objection to manual proofreading, but cutting the scutwork would be good. Additionally, it can be difficult to extract the images from articles (especially scanned PDFs). This problem applies even more to Wikibooks. I describe it in great detail in the section below. HLHJ (talk) 02:01, 20 November 2019 (UTC)

Improve workflow for book illustration cleanup

Some books and other publications are frankly pretty useless without their illustrations. Some newer PDFs have machine-recognizable images which could be automatically extracted, but a very large number of illustrations need to be cropped from their scanned pages, and have their yellowed backgrounds adjusted to white, so that they can be included in the Wikisource text. The colour adjustment is such a common task that a reliable standard command to do it for greyscale images is posted on the wiki. Only the cropping then requires human input.

Current workflow:

Download the scanned page image file
Use an external tool like * to crop (or identify the top-left-corner co-ordinates and dimensions desired, and use jpegtran manually).
Run a command on the file to adjust the colour. Many editors are not familiar with command-line tools, so this may actually be a significant obstacle.
Upload new file
Copy over most of the metadata, adding a note to say what file it is cropped from
Put the cropped image into the Wikisource text
Repeat for every illustration in the book
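
The crop and colour-adjustment steps in the workflow above amount to two small operations on a greyscale image, which can be sketched in plain Python. This is only an illustration of the idea and is not the standard command posted on the wiki: the image is assumed to be a list of rows of grey levels (0 = black, 255 = white), and the function names and the `paper_level`/`ink_level` parameters are invented for this sketch.

```python
def crop(pixels, left, top, width, height):
    """Return the width x height rectangle whose top-left corner is (left, top)."""
    return [row[left:left + width] for row in pixels[top:top + height]]


def whiten_background(pixels, paper_level, ink_level=0):
    """Linearly stretch grey levels so paper_level maps to 255 and ink_level to 0.

    Values outside that range are clamped, so anything at least as bright
    as the yellowed paper becomes pure white.
    """
    span = paper_level - ink_level
    if span <= 0:
        raise ValueError("paper_level must be brighter than ink_level")
    return [
        [max(0, min(255, round((p - ink_level) * 255 / span))) for p in row]
        for row in pixels
    ]


# A tiny made-up "scan": yellowed paper at grey level 200, ink at 40.
page = [
    [200, 200, 200, 200],
    [200,  40,  40, 200],
    [200,  40, 200, 200],
    [200, 200, 200, 200],
]
illustration = whiten_background(crop(page, 1, 1, 2, 2), paper_level=200, ink_level=40)
print(illustration)
```

In a real tool the paper and ink levels would have to be estimated from the scan (for example from its histogram) rather than passed in by hand.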

Desired workflow: If the images cannot be extracted automatically, semi-automate:

Click on the note in the Wikisource file that there was an illustration at this point to open the scanned page image in the illustration-extraction tool. Either:
1. Adjust automated guess as to which rectangles contain illustrations
2. Manually draw a polygon around the illustrations (rectangle default)
Look at the automated colour adjustment.
1. If the automatic colour adjustment is bad (unlikely), click to skip the colour adjustment and template the cropped Commons file as needing white balance adjustment.
2. Otherwise, click "Done".
The image is automatically cropped, adjusted, uploaded with suitable metadata to commons, and inserted into the Wikisource text.
Repeat for every illustrated scanned page in the book.
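
The "automated guess" in the first step of the desired workflow could start from something as crude as the bounding box of all dark pixels on the page, which the editor then adjusts. The sketch below assumes a rectangular greyscale image given as a list of rows (0 = black, 255 = white); the function name and the `dark_threshold` parameter are invented for illustration. Because it cannot tell illustrations from body text, it is only a starting rectangle, not a detector.

```python
def guess_illustration_box(pixels, dark_threshold=128):
    """Return (left, top, width, height) enclosing every pixel darker than
    dark_threshold, or None if the page has no dark pixels."""
    # Rows and columns that contain at least one dark pixel.
    rows = [y for y, row in enumerate(pixels) if any(p < dark_threshold for p in row)]
    if not rows:
        return None
    cols = [
        x for x in range(len(pixels[0]))
        if any(row[x] < dark_threshold for row in pixels)
    ]
    left, top = cols[0], rows[0]
    return left, top, cols[-1] - left + 1, rows[-1] - top + 1


# Made-up 5x5 page: light paper with two dark pixels.
page = [[220] * 5 for _ in range(5)]
page[2][1] = 30
page[3][3] = 50
print(guess_illustration_box(page))
```

The returned tuple uses the same top-left-corner-plus-dimensions form that the current manual workflow asks editors to work out by hand.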

As a bonus, this tool could be used to easily produce cropped images for use on other projects, a perennial Commons and Wikipedia wish. HLHJ (talk) 02:01, 20 November 2019 (UTC)