Talk:Wikicite/grant/Wikipedia Citations in Wikidata
Dear @Giovanni1085:, thank you for starting this very interesting and potentially very impactful proposal document. I see that it is still an unfinished draft, and that's perfectly fine. I look forward to reading the finished proposal, and to the committee reviewing it.
In the meantime, I wanted to particularly encourage you to use the community notification and endorsement sections of the application form. Your proposal does impact the wider Wikidata community to some degree, especially as it involves a mass upload of content. Therefore, I think it is important to demonstrate, as part of your application, that you have community support for that action. For example, the research dataset you cite in the proposal mentions 4 million citations to scholarly publications in Wikipedia. It is not yet clear how many of those are already Wikidata items. But if we assume that half (i.e. 2 million) already exist and you wish to upload the other half, and given that Wikidata currently has about 90 million items, then you are proposing an increase of roughly 2% in Wikidata's size with one batch process. For that scale of work, broader community notification is important.
Sincerely, LWyatt (WMF) (talk) 10:40, 16 September 2020 (UTC)
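The back-of-the-envelope estimate in the comment above can be reproduced with a few lines of Python. This is only a sketch: the 50% share of citations already present in Wikidata is the assumption made in the comment, not a measured value, and the function and variable names are purely illustrative.

```python
def relative_increase(new_items: int, existing_items: int) -> float:
    """Return new_items as a percentage of existing_items."""
    return 100.0 * new_items / existing_items


cited = 4_000_000            # citations to scholarly publications, per the cited dataset
share_existing = 0.5         # ASSUMPTION from the comment: "half" are already Wikidata items
wikidata_items = 90_000_000  # approximate size of Wikidata quoted in the comment

to_upload = int(cited * (1 - share_existing))
pct = relative_increase(to_upload, wikidata_items)
print(f"{to_upload:,} new items = {pct:.1f}% of Wikidata")
```

Changing `share_existing` shows how sensitive the figure is to the one number that is still unknown.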
- Good idea. Jura1 (talk) 12:28, 19 September 2020 (UTC)
- Yes, good idea to ask around.
- Probably also a good idea to try to estimate the number of items / links to be added (ideally such that the estimates find their way into the proposal), so as to inform both the community discussions and the practical workflows. I expect that the main issue with this project will not be the number of items added, but the number of new or modified links between them, and the number of edits and other curation acts necessary to accomplish that. Most of the Crossref DOIs cited from the English Wikipedia are already in Wikidata, while most of the P2860 (cites) links between them as well as most of the P50 (author) links to authors and links from these to organizations are still missing. Adding such links at the scale of tens of millions that would be necessary for this corpus will take months of continuous operation, so the things you tackle first (or perhaps not at all) should be informed by community discussion. For instance, scholarly works that do not have a Crossref DOI are heavily underrepresented, so increasing their proportion of the overall scholarly corpus is a good thing in principle, and the benefits of indexing COVID-19-related information are probably easy to explain at the moment.
- Another thing to consider is how the development (and maintenance) of the proposed tools fits into the landscape of existing tooling, some of which is outdated. For instance, the proposed disambiguation workflows could probably be integrated with the Author Disambiguator with relatively little effort, while the relatively mighty SourceMD is blocked. -- Daniel Mietchen (talk) 20:03, 21 September 2020 (UTC)
- Yes, good idea to include explicit estimates of the number of items and statements involved.
- On works without a Crossref DOI: it was pointed out before that the various tools to add papers on Wikidata do not (or did not) seem to play well with journals which rely on DataCite (pretty common in ). Plan S accepts identifiers like handles too. It's always good to improve existing upstream tools to make such data easier to use; sometimes it might be as easy as writing a Zotero translator for some repository which is not yet supported (although the main ones should all be supported already). Nemo 13:45, 22 September 2020 (UTC)
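One possible way to produce the estimates discussed in this thread is to check, batch by batch, which cited DOIs already have a Wikidata item. The sketch below only builds the SPARQL text and computes the gap from a set of results; it does not contact any endpoint. P356 is the Wikidata DOI property; the helper names and the example DOIs are hypothetical, and upper-casing the DOIs is an assumption about how they are stored in Wikidata.

```python
from typing import Iterable, List, Set


def build_doi_lookup_query(dois: Iterable[str]) -> str:
    """Build a SPARQL query returning the items that carry one of the given DOIs (P356)."""
    literals = []
    for doi in dois:
        # ASSUMPTION: Wikidata stores DOIs in upper case, so normalise before matching.
        cleaned = doi.strip().upper().replace("\\", "\\\\").replace('"', '\\"')
        literals.append(f'"{cleaned}"')
    values = " ".join(literals)
    return (
        "SELECT ?doi ?item WHERE {\n"
        f"  VALUES ?doi {{ {values} }}\n"
        "  ?item wdt:P356 ?doi .\n"
        "}"
    )


def missing_dois(cited: Iterable[str], found: Set[str]) -> List[str]:
    """Return the cited DOIs (normalised) that were not found in the query results."""
    return [d.strip().upper() for d in cited if d.strip().upper() not in found]


# Hypothetical batch of DOIs extracted from Wikipedia citations.
batch = ["10.1000/xyz123", "10.1000/abc456"]
query = build_doi_lookup_query(batch)
print(query)

# Pretend the endpoint returned only the first DOI.
print(missing_dois(batch, {"10.1000/XYZ123"}))
```

The same batching approach could be extended to count missing P2860 (cites) or P50 (author) statements, which the comments above expect to be the larger part of the work.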
Other software libraries
If I understand correctly, https://github.com/Harshdeep1996/cite-classifications-wiki is what you already have so far, and you plan to work on that. Could you please briefly explain how it compares to the various alternatives and why you think it's a better starting point? Personally I know and have used https://github.com/mediawiki-utilities/python-mwcites (for all languages) and https://github.com/dissemin/wikiciteparser (for the English Wikipedia). Nemo 12:42, 21 September 2020 (UTC)
Measures of success
The proposed measures of success need some work, in my opinion. Taking them in order:
- It's not clear to me what you mean by "increased number of citations included in Wikidata". Do you mean adding links between existing Wikidata entities? Cleaning up the tens of millions of existing items about scholarly works would certainly be welcomed by everyone (we need more information about persons, affiliations, entities, subjects etc.). Adding more items about scholarly works might be problematic, see also wikidata:Wikidata:WikiCite/Roadmap.
- For "increased engagement with citation data", it would be useful to define what baseline you're considering. For instance, are we talking about the number of downloads for datasets like Research:Scholarly article citations in Wikipedia? Or something else? By how much do you think you can increase it?
- As for usage by existing services, do you mean things like Internet Archive's book links and book wishlist? If you intend to replicate such successes, it might be interesting to talk with existing users of our citation datasets and see what they need.
Nemo 12:42, 21 September 2020 (UTC)
A short message just to thank everyone who has commented or will comment: we are working on your feedback and really appreciate it!
- Quick note: thank you for the changes made to the proposal today, which re-order the tasks and the measures of success to focus on making the codebase functional, re-usable and documented, with the actual upload of the specific dataset as the secondary outcome, since that is the component dependent on external factors only partially in your control. LWyatt (WMF) (talk) 14:46, 30 September 2020 (UTC)