Talk:WikiCite (3)
Presented at WikiConference North America 2024
Bluerasberry (talk) 23:48, 4 October 2024 (UTC)
Questions from Prototyperspective
Interested, but I find it quite unclear: what exactly is proposed here, and how could it move things out of Wikidata in a way that enables going beyond the low limits of Wikidata?
The limits are documented at d:Wikidata:WikiProject Limits of Wikidata.
The proposal here is for the Wikimedia Foundation to do a high-level assessment of whether to invest millions of dollars into Wikidata's capacity, with WikiCite being a test case and one of several projects that would use an expanded Wikidata.
I think WikiCite is special because 1) it is the most popular Wikidata project, 2) it is unique in attracting more investment from more institutions than any other Wikimedia project anywhere, and 3) it aligns with a lot of our values, like fact-checking, access to information, global partnerships with universities, and surfacing underrepresented knowledge sources.
A big part of the solution to going beyond the limits of Wikidata is several million dollars. I think we could even fundraise that externally to Wikimedia donations, if only the Wikimedia Foundation had a fundraising goal and a plan for updating Wikidata.
Moreover, the usefulness is not made entirely clear; afaik, when it comes to Scholia for example, it is largely dependent on the state of completion of metadata on books and studies, where Wikidata currently only has an estimated 1% of both (and not even the most notable thereof, but more like a random subset).
There are about 340 million scholarly publications, ever, for all time and all countries. WikiCite already has about 40 million, so we have more than 1% of scholarly publications. It is not a random subset; in different ways, we uploaded collections chosen to examine use cases.
The only reason we did not upload more is the limits of Wikidata that you named. Wikidata's capacity is unchanged since 2015, and we got 40 million papers. By en:Moore's law, technological capacity has improved across society. Getting all 340 million should be affordable eventually, and maybe now. It is easy to get the research paper datasets; the limit is Wikidata capacity.
Books are more complicated to catalog because they get reprinted, changed, and transformed a lot more than research publications, but eventually, I think Wikipedia and the world will quit plain-text prose citations and move to structured data for everything. We should start with research publications because they are the most valuable and least expensive place to start, and 340 million is more than enough for a pilot project.
There are some metadata databases from which data would need to be imported for Scholia to be useful, but I'm not sure how this is meant to become used and useful beyond Scholia.
The reason universities invest in WikiCite/Scholia is that they want to know about themselves. The insight that surprises everyone is that almost no research university in the world is able to determine what research papers its own faculty published in the last year. That already is super alarming, and just giving organizations a way to do that is helpful. If we can help universities track that, then there are countless applications, including profiling individuals for promotion and tenure, matching researchers to grant opportunities, automated search for peer reviewers for research, measuring effectiveness of teachers/mentors by tracking student research progress, giving non-traditional credit for research outputs like software or datasets, and internationalizing access to research.
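As a sketch of what the faculty-tracking use case can already look like on Wikidata today: the query below uses the real Wikidata properties P50 (author), P108 (employer), and P577 (publication date); the QID passed in at the end is only an example (Q49108 is assumed to be MIT), and actually sending the query to the Wikidata Query Service is left out so the sketch stays offline.

```python
# Sketch: build a SPARQL query listing one year's papers whose authors
# have a given employer. P50 (author), P108 (employer), and P577
# (publication date) are real Wikidata properties; the QID below is an
# example placeholder for the university of interest.

def faculty_papers_query(university_qid: str, year: int) -> str:
    """Return a SPARQL query string for the Wikidata Query Service."""
    return f"""
SELECT DISTINCT ?paper ?paperLabel WHERE {{
  ?paper wdt:P50 ?author .               # paper has an author
  ?author wdt:P108 wd:{university_qid} . # author employed by the university
  ?paper wdt:P577 ?date .                # publication date
  FILTER(YEAR(?date) = {year})
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en". }}
}}
"""

# To run it, POST the string to https://query.wikidata.org/sparql
# (e.g. with the `requests` library); omitted here to keep the sketch offline.
query = faculty_papers_query("Q49108", 2025)  # Q49108: assumed QID for MIT
```

Of course, the result is only as complete as the imported corpus, which is exactly the capacity problem discussed in this thread.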
WikiCite is super useful for improving Wikipedia and Wikimedia projects, but it is also a stand-alone product for research support. The institutions that put money into this mostly wanted that off-wiki service. Bluerasberry (talk) 21:24, 10 February 2026 (UTC)
- Thanks for the explanations. Is what's proposed here 'improve Wikidata capacity to enable a comprehensive database of citations'? Because the title just says WikiCite and there is no intro describing what exactly is proposed here. If what's proposed is to improve Wikidata, then it seems like that should be in the title (and a separate proposal), with concrete applications of WikiCite like Scholia as example(s).
- it aligns with a lot of our values, like fact-checking, access to information, global partnerships with universities, and surfacing underrepresented knowledge sources. I'd wonder how this would be useful to these goals and applications more concretely in practice. I don't use Wikidata items about studies for fact-checking, for example, and the studies don't become available with that. I don't even use Wikidata for searching and sorting studies, for which I use ScienceOpen, Google Scholar, and OpenAlex.
- A big part of the solution to going beyond the limits of Wikidata is several million dollars. There also needs to be some technical roadmap / ideas / … for how this could be achieved technologically.
- There are about 340 million scholarly publications … WikiCite already has about 40 million, so we have more than 1% of scholarly publications I'm not sure the 340 million fully takes preprints into account, and lots of items on WD are about preprints. It's not the 1% of most notable studies. But you're right that 1% is probably an estimate that's substantially too low; it could be closer to 10%.
- The only reason we did not upload more I'm not sure it would be so easily possible. I think users didn't import more for other reasons as well, like a lack of bot coders and having to semi-manually rerun import queries, etc.
- I think Wikipedia and the world will quit plain text prose citations and move to structured data for everything. So far it doesn't look like it, and it's a bit unclear what the benefit of doing so would be. A disadvantage is that the respective Wikidata items can be modified in problematic ways. I don't think the millions of bot-imported study items are well watched, such that people would notice somebody adding an author to a paper who isn't actually one, or doing other hard-to-detect things. Vandalism could be prevented by locking such scholarly paper items to bot/script edits, or at least for a start to long-registered users with at least 100 unreverted edits, etc., but so far there are no such measures in Wikidata. In any case, the clear use-case for moving to WD structured data is not there.
- universities … want to know about themselves … tracking student research progress … non-traditional credit for research outputs key things missing on that end are citation counts, associated software repo links, links to news coverage citing or about the study, altmetrics scores, etc. Here I'd also use ScienceOpen, for example, and Wikidata is limited heavily by which studies are or aren't in its database, meaning this only works if the university made sure the papers relevant to them have all been imported. automated search for peer reviewers for research is currently based on a very incomplete dataset, and so far at least the proposal does not outline how to close the huge data gap. Addressing the scaling issues of Wikidata would enable that, but it doesn't in itself get all those items imported, and there is no info on how the scaling issues are meant to be addressed or addressable.
- Just trying to provide constructive feedback; I think it would be good to make it clearer what exactly is being proposed. A separate domain for the WikiCite community? Fixing Wikidata scaling issues? Mass data imports? All of these things? … Prototyperspective (talk) 19:14, 16 February 2026 (UTC)
- @Prototyperspective: I and the WikiCite project have major challenges communicating. I have been talking about this for years and hearing others talk about it, and I have not discovered or developed any explanation that is quick and easy. The best quick explanation that I have is to say that WikiCite is a citation database and that Scholia is a frontend for "scholarly profiling", comparable to other scholarly profiling products like Google Scholar, OpenAlex, or library catalogs. I believe that Wikidata's strengths in being open data and community-managed give it major advantages over closed top-down platforms, but it is hard to communicate those without an understanding of the different products out there.
- Yes, the point is 'improve Wikidata capacity to enable a comprehensive database of citations', but the applications are far beyond library science or any concept of traditional catalogs. One key difference is that a citation index like this directly intervenes in every government and university research budget in the world, worth about a trillion US dollars annually. In the United States, for example, the government hosts the National Science Foundation database, but despite mandated reporting, it really is not possible to connect research projects to their research outputs or identify the people involved in the research. If there were a citation database, then from a funder perspective, that database enables queries on research compliance and identifying at a glance who fulfilled the terms of their grant. From a researcher perspective, these systems greatly improve matching calls for grants to people with merit, when right now most grants go to people who have access to highly skilled grantwriting teams at institutions with that infrastructure.
- applications more concretely in practice So for example, there are lots of humanities and local journals which researchers may want to surface, and which are not indexed in Google or OpenAlex. I had a student import Journal of Indian Philosophy (Q6295325), https://qlever.scholia.wiki/venue/Q6295325 , which lacked electronic indexing at the time. Commercial systems try to convince researchers to contribute their data to closed platforms to be shared in that proprietary walled garden, but when a resource gets opened once, then we can keep sharing that research forever. We can also ask ethical questions that Google knows behind the scenes but which they do not discuss, like "what are the demographics of authors for a particular journal". With some topics like religion, demographic perspective can matter, and Wikidata has ways of reporting demographics like the university affiliations of all the authors.
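A sketch of the demographics question above, as it might be asked of Wikidata today: the query tallies the employers of everyone who authored a paper in a given venue. P50 (author), P1433 (published in), and P108 (employer) are real Wikidata properties; the venue QID is the Journal of Indian Philosophy item mentioned above, and the answer is only as good as the authors' affiliation statements that have actually been entered.

```python
# Sketch: a "who writes for this journal?" affiliation breakdown.
# P1433 (published in), P50 (author), and P108 (employer) are real
# Wikidata properties; results depend entirely on how completely the
# venue and its authors have been imported and curated.

def journal_author_affiliations_query(venue_qid: str) -> str:
    """SPARQL counting distinct authors per employer for one venue."""
    return f"""
SELECT ?employerLabel (COUNT(DISTINCT ?author) AS ?authors) WHERE {{
  ?paper wdt:P1433 wd:{venue_qid} .  # paper published in the venue
  ?paper wdt:P50 ?author .           # the paper's authors
  ?author wdt:P108 ?employer .       # each author's employer
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en". }}
}}
GROUP BY ?employerLabel
ORDER BY DESC(?authors)
"""

query = journal_author_affiliations_query("Q6295325")  # Journal of Indian Philosophy
```

The same pattern generalizes to other affiliation-style properties, which is what makes the "reporting demographics" claim concrete.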
- some technical roadmap The roadmap right now is that Wikidata adopts someone else's free and open source software, and we pray that it works and expands our capacity. The Wikimedia Movement is interconnected with and relies on a lot of external resources. Besides a technical roadmap, I wish we had a social roadmap for deciding when to send money to a major institutional partner. We pull in US$200 million a year - the lion's share of nonprofit community tech - and personally I think we should share with other orgs which cannot practically fundraise but on whom we rely. This includes OpenStreetMap, Internet Archive, Tor, and an ecosystem of open source software including whatever database is Wikidata's target for migration. I think these partnerships are of major interest as social topics to discuss, even among people who know nothing technically of how they work as computing systems.
- key things missing on that end We have the data for many of the things you mention here, and we also have very interesting processes to curate and enrich that data beyond what is freely available elsewhere. The reason why we stopped importing it is because in 2015 Blazegraph got "bought" when Amazon hired all their developers, and our Wikidata backend has not been updated in 10 years. We have 40 million citations in Wikidata, and if we had 50% of the whole 350M citations (every paper that has ever itself been cited), then I am just guessing that this would meet the needs of 90% of requests. What we have is already superior to marketplace competitors in many ways, and we could be a viable preferred choice for many users with a little development. For altmetrics - you are right, Wikipedia is not getting into social media, but the underdog services like https://www.altmetric.com/solutions/free-tools/bookmarklet/ are more free and open than the commercial market leaders. We have a scope and things which we are not going to do, and indexing social media is far from Wikipedia's scope right now.
- Fixing Wikidata scaling issues? Mass data imports? All of these things? To stay on topic, migrating the backend in summer 2026 as planned, then doing mass data imports, is the next step for WikiCite. The actual issue is Wikidata Graph Split and how we address major challenges - we have social issues in that we do not have a chain of decision making between users and the Wikimedia Foundation. Wikipedians are enthusiastic about everything to do with citations, and imagine many very different applications for using a citation database. It is a super popular idea. For various reasons, the Wikidata database Blazegraph became nonviable in 2015, and we have had a lot of social tension about talking about that ever since. If that were a one-off, I could tolerate it, but I think we have a systemic social disfluency in how we surface and address major infrastructure issues. We will not outcompete Google etc. for tech, but we can outcompete them for community organization, for talking about intentions, values, and ethics, and for recognizing when there are expensive commercial products where we can make a free and open equivalent that will attract investment from users.
- Thanks for the questions - I wish I could express myself more concisely. Bluerasberry (talk) 16:16, 24 February 2026 (UTC)
- Thanks for the explanations! It may be a good idea to briefly list / explain use-cases like insights for research funders so the benefits and applications are clearer. The problem currently is that this only works for journals where somebody made some near-complete batch import of data, and even there the citations data would be inaccurate. That doesn't mean there isn't lots of potential there. (Albeit I have doubts about how well bibliographic data - like whether and how many studies were written, which journals they were published in, and whether they're open access - works for that, or for most compliance terms, and relevant entities probably do have some tools for such data and insights, such as checking or scraping Google Scholar author profiles.) And I don't think it would be good to overstate the usefulness of having nonindexed journals - these are usually of low quality, and e.g. Google Scholar even indexes papers on preprint servers like arXiv. Demographic statistics, I think, more often than not stoke controversy and perceptions of gaps or bias when there are no unfair / unjust ones and the numbers just reflect which people are interested in researching that area. I mean, they can maybe be interesting, but at least I don't see how this is actionable, actually-acted-upon insight, so at least some info on that - like examples where orgs used such insights in a specific way - would be useful. Afaik, Tor gets a lot of donations and is not used on Wikipedia except by a small fraction of users abroad (more using VPNs, I think) because proxies and VPNs are still banned. Instead of donating to OpenStreetMap, it would be better imo to first technically improve the map in the Wikipedia app. It seems not connected to this topic, and while I'd support supporting other projects, there are more than enough things to fund within Wikimedia itself, or software used with about zero as opposed to just too little funding.
For altmetrics - you are right, Wikipedia is not getting into social media[…] We have a scope and things which we are not going to do, and indexing social media is far from Wikipedia's scope right now. They are not so much about social media as about news coverage; indexing the external coverage / use would be useful, but WD does not even have the altmetrics score, let alone its own altmetrics score. I just have difficulty seeing the real-world use and usefulness, now and in the foreseeable future, and clear examples and/or goals thereof would help. To stay on topic, migrating the backend in summer 2026 as planned then doing mass data imports is the next step for WikiCite I think things like that would be good to add to the page to make it clearer. Thanks again, Prototyperspective (talk) 21:19, 26 February 2026 (UTC)
Scope
Can we expand the scope of WikiCite beyond scholarly publications to include books and newspapers as well? I enjoy using the "cite q" template on several Wikimedia sister projects outside Wikidata, and I thought the whole WikiCite project was about making the database of all citations across Wikimedia sister projects more formalized and accessible. Rtnf (talk) 04:31, 17 February 2026 (UTC)
- @Rtnf: The intention is to expand WikiCite. Getting quality data into the system is easy and we have datasets, ready to load. The bottleneck is computational capacity to query the system.
- By the timing of your asking, you probably came here through "Wikidata Graph Split and how we address major challenges". In that article, I anticipate a new test database in summer 2026, and that Wikidata adopts it by summer 2027, at which time we should be able to add more items.
- One of your proposals is to connect en:template:Cite Q to Wikipedia. Yes, I get it; there is a full proposal on this at WikiCite/Shared Citations. The problem is that all citations in Wikipedia are currently text prose, when instead they should be linked data to a database. If they were linked, we could ask questions like "How many times is a given author cited?", "How many times do people at a particular university get cited?", or "What percentage of the authors cited in articles about India are of Indian origin?" One reason why we cannot have linked citations is that Wikidata is so low capacity that we cannot even manage a dataset of unlinked citations, which is a much less resource-intensive endeavor. If we could at least have a database of citations, then we could consider linking it to Wikipedia.
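The first of the questions above can already be sketched against the existing WikiCite data, which shows why a linked-citation layer for Wikipedia would be a small step once the database exists. The query below uses the real Wikidata properties P50 (author) and P2860 (cites work); the author QID at the end is only an example (Q937 is Albert Einstein's item), and sending the query to https://query.wikidata.org/sparql is omitted to keep the sketch offline.

```python
# Sketch: "How many times is a given author cited?" expressed as SPARQL.
# P50 (author) and P2860 (cites work) are real Wikidata properties; the
# count covers only citations already imported into WikiCite, which is
# the capacity limitation discussed in this thread.

def citation_count_query(author_qid: str) -> str:
    """SPARQL counting distinct works that cite any paper by the author."""
    return f"""
SELECT (COUNT(DISTINCT ?citing) AS ?citations) WHERE {{
  ?paper wdt:P50 wd:{author_qid} .  # a paper written by the author
  ?citing wdt:P2860 ?paper .        # another work cites that paper
}}
"""

query = citation_count_query("Q937")  # Q937 = Albert Einstein, as an example
```

The university and demographics variants would just swap in properties like P108 (employer) around the same citation edge.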
- About books and newspapers - yes, but I think these will be the last datasets in "scholarly profiling" that we collect. There are 350 million scholarly articles - that is all of them, from all universities, in all languages, for all time. They are relatively stable publications which are well archived against link rot, and that number is relatively low. Newspapers and books are legally hard to archive due to copyright, plus they have multiple identifiers and are much harder to link than scholarly articles, so it is hard to maintain a citation database for them. Wikipedia's relationship with the Internet Archive helps, but we have a list of issues to address.
- I want what you want and I think we will get there. What you imagine is one of the end goals. WikiCite is bigger even than that, because there are no open databases like this, and if we created one, it would have applications in everything related to publishing, not just Wikipedia. Bluerasberry (talk) 15:30, 24 February 2026 (UTC)