Talk:Wikicite/e-scholarship/Stevenliuyi

@Stevenliuyi: Thank you for this proposal. Can you provide a query result that demonstrates the number of existing Wikidata items which could be improved by this proposal (e.g. how many Chinese journal articles are in WD but don't have a DOI property that you believe is scrapeable within the scope of this project)? Also, as the eScholarship program is not intended to replace volunteer editing, can you please describe how the proposed work is qualitatively or quantitatively different from, or harder than, the work you do as a volunteer community member? Sincerely, LWyatt (WMF) (talk) 10:14, 1 October 2020 (UTC)

Hi @LWyatt (WMF):
Thank you for the questions. Please let me answer your second question first. Since I started importing Chinese journal articles from CNKI more than a year ago, I have always been thinking about adding DOIs to those items, but I still haven't been able to do it, because it would take more time and effort than I could commit to analyze all potential publishers, identify common website structures, and then develop web scrapers for each of those publishers. Due to the diversity of journals and publishers, I expect that work to take a lot of manual effort. It is not the kind of work I have been doing as a community member.
After the necessary scrapers are developed, a scraper still needs to be run semi-automatically for each journal to add the relevant statements to WD (probably via QuickStatements). This is similar to what I have been doing as a volunteer, so it is out of scope of this eScholarship application. To summarize, this proposal only includes the development work that is not part of my normal volunteer editing. Adding DOIs using the developed tools is not part of this application; I plan to do that as a volunteer after the tools are developed.
Now let's go back to the first question. Precisely because of the difficulties described above, I cannot currently estimate the total number of items that could be improved, since I first need to manually investigate many different publishers and journals. The best I can do right now is to give you an example. Chinese Journal of Theoretical and Applied Mechanics (Q98517082) (official website: http://lxxb.cstam.org.cn/CN/0459-1879/home.shtml) and Acta Entomologica Sinica (Q21386079) (official website: http://www.insect.org.cn/CN/0454-6296/home.shtml) are two journals published by two different institutes under the Chinese Academy of Sciences (Q530471). The official websites of both journals are built on a similar template, which means only one scraper needs to be developed for both of them. Many journals published by other institutes under CAS use a similar web template. Currently, there are 130K article items published in a CAS journal that do not have a DOI. However, without a thorough investigation, I am not sure whether the vast majority of CAS journals use a similar template, so one scraper may not be enough for all CAS journals. --Stevenliuyi (talk) 12:13, 1 October 2020 (UTC)
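
As a rough illustration of the per-journal count asked for above, the sketch below builds a SPARQL query for the Wikidata Query Service that counts article items in one journal that lack a DOI statement. It assumes articles are linked to their journal through P1433 (published in) and that DOIs are stored in P356; the helper name `build_missing_doi_query` is hypothetical, and the query is only constructed here, not executed.

```python
import re


def build_missing_doi_query(journal_qid: str) -> str:
    """Return a SPARQL query counting articles in a journal that have no DOI (P356).

    Assumption: articles point to their journal via P1433 (published in).
    """
    # Reject anything that is not a plain item ID, so the value is safe to embed.
    if not re.fullmatch(r"Q[1-9]\d*", journal_qid):
        raise ValueError(f"not a Wikidata item ID: {journal_qid!r}")
    return (
        "SELECT (COUNT(DISTINCT ?article) AS ?count) WHERE {\n"
        f"  ?article wdt:P1433 wd:{journal_qid} .\n"
        "  FILTER NOT EXISTS { ?article wdt:P356 ?doi }\n"
        "}"
    )


if __name__ == "__main__":
    # Acta Entomologica Sinica, one of the two example journals above.
    print(build_missing_doi_query("Q21386079"))
```

The printed query can be pasted into the query service; running it once per journal gives a count that can be summed across the journals a single scraper is expected to cover.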

Crossref

Hi @Stevenliuyi: thanks for this proposal! Are the DOIs for these articles not in the Crossref.org data? I wonder if you could get it from there rather than scraping the journal sites? thanks, -- phoebe | talk 17:58, 7 October 2020 (UTC)

Hi @Phoebe: Thanks for the question. Most Chinese journal articles are not in the Crossref database. In fact, there are nearly 100M Chinese journal articles in CNKI's CJFD database (China Academic Journals Full-text Database), while there are "only" 85M journal article DOIs in Crossref. --Stevenliuyi (talk) 23:00, 7 October 2020 (UTC)

DOI seems invalid

I am not able to resolve either https://doi.org/10.16380/j.kcxb.2020.08.001 or https://doi.org/10.6052/0459-1879-20-124. The DOIs are probably not officially registered (i.e. they were invented) and should not be added to Wikidata. --GZWDer (talk) 00:13, 8 October 2020 (UTC)

Hi @GZWDer: It seems that the most recent issue (2020-08) of Acta Entomologica Sinica (Q21386079) has not been added to the CNKI database yet (see [1]). DOIs from the previous issue (2020-07), for example https://doi.org/10.16380/j.kcxb.2020.07.001, resolve correctly. Similarly, DOIs from the previous issue of Chinese Journal of Theoretical and Applied Mechanics (Q98517082), for example https://doi.org/10.6052/0459-1879-20-059, also resolve without problems. So I think it is just a temporary database synchronization issue. --Stevenliuyi (talk) 01:11, 8 October 2020 (UTC)
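
A check like the one discussed in this thread could be automated before any DOI is added to Wikidata. The sketch below is only an illustration under stated assumptions: it relies on the doi.org handle lookup endpoint (`https://doi.org/api/handles/<doi>`) reporting `responseCode` 1 for a registered DOI and answering HTTP 404 for an unknown one, and the function names `handle_api_url` and `is_registered` are hypothetical.

```python
import json
import urllib.error
import urllib.parse
import urllib.request


def handle_api_url(doi: str) -> str:
    """Build the doi.org handle-lookup URL for a DOI such as '10.6052/0459-1879-20-059'."""
    doi = doi.strip()
    if not doi.startswith("10.") or "/" not in doi:
        raise ValueError(f"does not look like a DOI: {doi!r}")
    # Keep the slash between prefix and suffix; percent-encode anything unsafe.
    return "https://doi.org/api/handles/" + urllib.parse.quote(doi, safe="/")


def is_registered(doi: str, timeout: float = 10.0) -> bool:
    """Return True if the resolver knows the DOI (requires network access)."""
    try:
        with urllib.request.urlopen(handle_api_url(doi), timeout=timeout) as resp:
            # Assumption: responseCode 1 means the handle was found.
            return json.load(resp).get("responseCode") == 1
    except urllib.error.HTTPError as err:
        if err.code == 404:  # resolver reports the handle as not found
            return False
        raise  # other HTTP errors are not evidence either way
```

Since a newly scraped DOI may simply not be registered yet, a `False` result is better treated as "retry later" than as proof that the DOI is invalid.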