Talk:WikiCite/2020 Virtual conference/Align your Open Access Journal with Wikidata - using Python and OpenRefine

How to prepare Jupyter Notebook for querying OAI?

Hello Christian, thank you very much for your presentation at WikiCite 2020. It's great that you managed to find a way to scrape repositories! This is a wish I've cherished for a long time, but so far I haven't been able to convert harvested OAI-PMH records (XML) into a file that can be loaded into Wikidata. Congratulations! After your presentation I immediately installed Jupyter Notebook, but I am unable to open your notebook in Jupyter. How did you set up Jupyter to query OAI interfaces? Is there perhaps a screencast or documentation on this? I use OpenRefine to prepare MARC records from the ASCL library catalogue for upload to Wikidata; Jupyter Notebook is completely new to me. My approach:
1. I copied your code (notebook) into Notepad and saved it as .ipynb. When uploading it to Jupyter I get the following error message: cannot upload invalid notebook (SyntaxError: Unexpected token r in JSON at position 1).
2. I copied your code into a cell of a new, blank notebook and executed it with Run. Error message: No module named 'SPARQLWrapper'.
Thank you in advance for any hints and tips. Kind regards to Vienna, Walkuraxx (talk) 09:34, 29 October 2020 (UTC)
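An aside on the first error: an .ipynb file is JSON, not raw Python source, so pasting code into Notepad and renaming the file produces exactly the "Unexpected token ... in JSON" message quoted above. A minimal sketch of how a valid notebook file could be built programmatically, assuming the nbformat package is installed; the cell content and filename are placeholders only:

    import nbformat

    # An .ipynb file is a JSON document; nbformat builds that structure,
    # so the result is a notebook Jupyter can actually open.
    nb = nbformat.v4.new_notebook()
    nb.cells.append(nbformat.v4.new_code_cell("print('hello, WikiCite')"))

    # Write the notebook to disk as valid JSON.
    with open("scraper.ipynb", "w", encoding="utf-8") as f:
        nbformat.write(nb, f)

Uploading a file created this way (or simply downloading the original .ipynb file from the repository instead of copy-pasting its rendered text) avoids the JSON error.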

Hi @Walkuraxx: thanks for your message and your interest in my Python script. You mentioned several points. Do you run Jupyter locally, or do you use a Jupyter cloud environment? In any case, I would really recommend the Wikimedia Jupyter environment PAWS: PAWS-Login (PAWS on Tech-Mediawiki); every MediaWiki user can easily run their own Jupyter environment there.
For example, in PAWS you can open a terminal prompt and clone my GitLab repository directly. Afterwards you can open the notebook and everything should work fine. After running through the script, an export file is created in the script directory, which you can use for further edits in OpenRefine.
The error message about the missing SPARQLWrapper module tells you that you still have to install that module. If you use the terminal in PAWS, you can install it with "pip3 install SPARQLWrapper" and everything should be fine!
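For readers following along: once SPARQLWrapper is installed, a minimal sketch of the kind of query the module enables against the public Wikidata endpoint; the query shown here is only an illustration, not the one from the script:

    from SPARQLWrapper import SPARQLWrapper, JSON

    # Query the public Wikidata SPARQL endpoint; a descriptive user agent
    # is expected by the Wikimedia servers.
    sparql = SPARQLWrapper("https://query.wikidata.org/sparql",
                           agent="example-oai-alignment-script/0.1")
    sparql.setQuery("""
        SELECT ?journal ?journalLabel WHERE {
          ?journal wdt:P31 wd:Q5633421 .   # instance of: scientific journal
          SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
        }
        LIMIT 5
    """)
    sparql.setReturnFormat(JSON)

    results = sparql.query().convert()
    for row in results["results"]["bindings"]:
        print(row["journal"]["value"], "-", row["journalLabel"]["value"])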
If you need further assistance, we could set up a remote session and go through the different questions and points! Best regards and greetings to Leiden! --Mfchris84 (talk) 17:47, 29 October 2020 (UTC)
Thank you very much for your explanation and kind offer @Mfchris84: :-) I will try it out and report back. Kind regards, Walkuraxx (talk) 07:01, 30 October 2020 (UTC)
Hurray, it works! This is pretty cool, thank you very much @Mfchris84: :-) I have just scraped my first OAI journal and added a first batch of articles to Wikidata; see the Southern Journal for Contemporary History (Q101010757, articles published in 2020). One last question: what do I have to consider when working with sets? I wanted to scrape the African Study Monographs, base URL: http://repository.kulib.kyoto-u.ac.jp/dspace-oai/request, set: com_2433_66220. The result was an empty file (same result for Arabian Epigraphic Notes, base URL: https://openaccess.leidenuniv.nl/oai/request, set: hdl_1887_38880). Kind regards, Walkuraxx (talk) 10:58, 31 October 2020 (UTC)
Hi @Walkuraxx: great that the script now works for you! The problem with scraping the Japanese repository is the missing marcxml metadataFormat. In the first versions of my script I also queried the available metadata formats first; I have to rebuild this stage and push new code. I will let you know when the work is done! --Mfchris84 (talk) 06:30, 2 November 2020 (UTC)
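Until the reworked version is pushed, a minimal sketch of how an endpoint's supported metadata formats can be checked up front, using the requests library against the base URL mentioned above; if marcxml is not listed, a marcxml harvest of any set there will come back empty:

    import requests
    import xml.etree.ElementTree as ET

    OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"
    base_url = "http://repository.kulib.kyoto-u.ac.jp/dspace-oai/request"

    # Ask the OAI-PMH endpoint which metadata formats it supports.
    response = requests.get(base_url, params={"verb": "ListMetadataFormats"})
    root = ET.fromstring(response.content)
    for prefix in root.iter(OAI_NS + "metadataPrefix"):
        print(prefix.text)

    # A set-restricted harvest would then use parameters such as:
    #   {"verb": "ListRecords", "metadataPrefix": "marcxml",
    #    "set": "com_2433_66220"}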
Thank you very much @Mfchris84: :-)