Wikidata/Preventing unwanted edits

From Meta, a Wikimedia project coordination wiki

Data in Wikidata must be reliable. At the same time, Wikidata needs to be as open as possible to achieve its goal. This page is for brainstorming measures we can take to discourage unwanted edits (or make them easy to spot) without giving up the openness of the project. Please add your ideas.

History, Recent Changes, diffs and so on

These are available on Wikidata as well.

Show Wikidata edits in the Wikipedias

Wikidata edits are going to be visible in the Wikipedias that use the data, so editors can review them easily.

Each WP should show on the left not only "Recent changes", as now, but also "Recent changes in Wikidata", which everybody may look at (and later on everybody may use such items, i.e. may add a link to a Wikidata item within any WP article). Dr.cueppers (talk) 19:30, 20 August 2012 (UTC)
On the smaller wikipedias, it matters a lot where those edits are going to be visible. For example, if those edits are going to be shown in Recent Changes, they need to be hidden.--Snaevar (talk) 15:26, 20 August 2012 (UTC)
That means, each WP has to promote itself to be admitted to look at wikidata? Dr.cueppers (talk) 16:36, 20 August 2012 (UTC)
No, that's a bit excessive. If the edits from Wikidata are not shown in page histories on Wikipedia, and if either the additional "Recent changes in Wikidata" is a separate page or the edits from Wikidata are hidden in the current recent changes, then that will be enough.--Snaevar (talk) 13:23, 21 August 2012 (UTC)
Perhaps all recent changes on Wikidata that are related to articles existing in the specific Wikipedia (i.e. all with a sitelink) could/should be listed in the recent changes page? A possible problem is that this will lag somewhat behind. — Jeblad 13:27, 16 September 2012 (UTC)
That wouldn't be radical enough, even though it is a good idea. The main problem is that the changes from Wikidata, when shown on small Wikipedias, are going to flood the recent changes, and we don't want that. Even if the changes were filtered to include only those related to articles existing in that specific Wikipedia, the recent changes on the small Wikipedias would still be flooded. On second thought, I think that Dr.cueppers' idea that "each WP has to promote itself to be admitted to look at wikidata" would be a good compromise.--Snaevar (talk) 01:57, 13 November 2012 (UTC)

There should also be a "related metadata changes" link for each article. At first, it could show changes on the linked Wikidata page, and later intelligently include other changes that affect the page (e.g. if the page shows a list generated via some sort of data query). Same for watchlists. --Tgr (talk) 11:27, 25 August 2012 (UTC)

A change should be shown in the version history of each affected article on Wikipedia. That's the easiest way, and it would work the way it does now: you see when an interwiki link changes --92.193.31.53 16:07, 27 September 2012 (UTC)
I don't think it is true now. We're talking about transcluded information. When the source of the transclusion changes, it's certainly not shown in every page on which it is used. Chris55 (talk) 22:27, 1 October 2012 (UTC)

Editing in Wikidata

Possibilities:

(1) IPs are not allowed to edit in Wikidata; editing is only allowed for registered and logged-in users with a valid user name. Vandalism would be reduced from 100 to 1.
But then even allowing "corrections in wikidata by all registered users" will open the door to a lot of mistakes.
Don't you fear that vandals could then start stealing username/password pairs from registered users? As far as I know, even the logon is usually performed over HTTP: username and password are not encrypted... Klipe (talk) 22:06, 13 October 2012 (UTC)
(2) Don't allow editing by any users (registered or IP), but give them a short and simple way to inform the author about doubts or mistakes. Changing such data would be allowed only for the "first author" (at least at first; later on the community may adopt other rules).
(2.1) Perhaps additionally allow editing in Wikidata by authors elected from a "portal" or by "editorial members" (Redaktion). Or at first allow WP admins to edit in Wikidata.
Dr.cueppers (talk) 19:30, 20 August 2012 (UTC)
Other method:
For every item in Wikidata it must be possible to enter more than one value; e.g. the height of a hill might have three values. All three could be entered with "amount", "source (and measuring method)" and perhaps "license" (and four "~" to sign it).
I.e. inputs should be "open-ended": every time a value is entered, the next blank input becomes available.
With this method IPs are not allowed to edit in Wikidata; editing is only allowed for registered and logged-in users with a valid user name. Vandalism would be reduced from 100 to 1.
All registered users are allowed to enter such a new value, but not to edit or delete the other existing values; that will be the admins' duty.
Any user may select and use the value he prefers. And if the number of selections is shown, even inexperienced users can see which value is suitable.
(This method entails less work for other users)
Dr.cueppers (talk) 20:55, 22 August 2012 (UTC)
Supplement: We could tighten the writing rights still further, so that "autoconfirmed" users are not allowed to edit, but only reviewers (de-WP "Sichter"; en-WP: users with the "rollback" right) may edit. Then vandalism would be reduced from 100 to 0.01. Dr.cueppers (talk) 20:55, 23 August 2012 (UTC)
I propose a review process. A change would become visible only once it has been reviewed by two authors. The German Wikipedia provides the "flagged revisions system", which could do this with some changes.--92.193.31.53 16:09, 27 September 2012 (UTC)
I propose a gradual system. Example: IPs can only modify one value per day. Registered users begin with 4 changes per day, and the limit is increased automatically if no vandalism is detected.
Do not forget that every change will be reflected in many Wikipedias. So although vandalism will be more delicate, there will be more eyes to revert it. I think there is no need to be so strict.
Eloy (talk) 22:21, 22 October 2012 (UTC)
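
The gradual system proposed above could be sketched roughly as follows. This is only an illustration: the limits of 1 and 4 come from the proposal, while the growth rate, the reset on detected vandalism and all names (`Editor`, `daily_limit`, `try_edit`, `end_of_day`) are assumptions made for the example.

```python
from dataclasses import dataclass

IP_DAILY_LIMIT = 1          # from the proposal: IPs may modify one value per day
REGISTERED_START_LIMIT = 4  # from the proposal: registered users begin with 4
GROWTH_PER_CLEAN_DAY = 1    # assumption: the proposal does not say how fast the limit grows


@dataclass
class Editor:
    registered: bool
    clean_days: int = 0     # completed days without detected vandalism
    edits_today: int = 0

    def daily_limit(self) -> int:
        """Number of changes this editor may make today."""
        if not self.registered:
            return IP_DAILY_LIMIT
        return REGISTERED_START_LIMIT + GROWTH_PER_CLEAN_DAY * self.clean_days

    def try_edit(self) -> bool:
        """Count one edit if the daily limit allows it; return whether it was allowed."""
        if self.edits_today >= self.daily_limit():
            return False
        self.edits_today += 1
        return True

    def end_of_day(self, vandalism_detected: bool) -> None:
        """Close the day: grow the limit after a clean day, reset it after vandalism."""
        if vandalism_detected:
            self.clean_days = 0  # assumption: detected vandalism resets the allowance
        elif self.registered:
            self.clean_days += 1
        self.edits_today = 0
```

How vandalism is "detected" is left open here, as it is in the proposal; the sketch only shows the bookkeeping.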

Sourced Statements Should be Read Only

Once at least one source is provided for a statement/claim, the claim can no longer be changed. This should prevent casual modification of information that appears to be well sourced.

Not practical. Sourced statements entered by hand or perhaps processed by a script can easily be erroneous. Only if an automated import process from a source is considered to be reasonably bug free, a differentiation might be valuable. --G.Hagedorn (talk) 13:20, 20 August 2012 (UTC)
Sounds like an invitation for trolling/gaming the rules. Having a source does not mean that the claim is trivially inferrable from the source - think of controversial stuff like nationality. --Tgr (talk) 11:30, 25 August 2012 (UTC)

To clarify: users with special privileges (like admins, perhaps) could still change sourced statements. And you could still remove the source, and then make the change.

An alternative would be to allow modification of sourced statements, but list them in a special high priority review queue -- Daniel Kinzler (WMDE) (talk) 08:56, 26 August 2012 (UTC)

Restrict Modifications of Rank

If new statements get the "normal" rank, and only trusted users (autoconfirmed?) can change rank, this would prevent random modifications from showing up on Wikipedia immediately, while still allowing everyone to contribute. It's a bit like pending changes aka flagged revisions.

a: The meaning of rank in this context needs explanation. b: I would propose to call the feature flagged revisions and to make it as similar to the flagged revision behaviour (or to those behaviours that have so far been deployed on Wikipedias) as possible. --G.Hagedorn (talk) 13:20, 20 August 2012 (UTC)
a) I agree. We have a vague draft somewhere, but that should be overhauled and linked here.
b) then it would no longer be using ranks. -- Daniel Kinzler (WMDE) (talk) 08:57, 26 August 2012 (UTC)

Suggestion

I think that at first (maybe for one or two years) Wikidata needs some "professional editors (and watchers)" for "wanted edits", for their references and license clearing; otherwise you demand too much from the community (this is my (de:WP:RC) opinion in view of about 5000 chemicals with about 40 items each). Other fields can probably expect similar problems. Think about it! Dr.cueppers (talk) 13:44, 20 August 2012 (UTC)

Reviewing Edits

Edits should be reviewed by professional Wikipedians, as in the German Wikipedia, before they are visible to everyone. --Sk!d (talk) 12:57, 21 August 2012 (UTC)

Create a "trusted user" permission in a flagged revisions system

Have flagged revisions enabled by default, but set an extremely low bar for accounts to not need to have their data submissions reviewed anymore. Perhaps 10 accepted pieces of data, after which an account is automatically given the "trusted" flag? This "trusted" flag would allow the account to submit data without having it reviewed, but would not allow the user to review other people's work.

The idea of having to help before being able to create false data should deter most would-be vandals, while the extremely low barrier to being able to edit freely should pacify most of those opposed to flagged revisions. WaitingForConnection (talk) 03:21, 25 August 2012 (UTC)

Another way is the patrol extension with assignment of autopatrol to users who are experienced and/or trusted. Romaine (talk) 02:05, 2 November 2012 (UTC)
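
As a rough illustration of the low-bar "trusted" flag described in this section, the rule might look like the sketch below. The threshold of 10 is the figure floated in the proposal; the class and function names are hypothetical and do not correspond to any existing FlaggedRevs or patrol setting.

```python
TRUSTED_THRESHOLD = 10  # "perhaps 10 accepted pieces of data" in the proposal


class Account:
    def __init__(self, name: str) -> None:
        self.name = name
        self.accepted = 0        # submissions that passed review
        self.trusted = False     # may submit without review
        self.can_review = False  # reviewing others is a separate right, never set here

    def record_accepted_submission(self) -> None:
        """Count one accepted submission and grant the flag at the threshold."""
        self.accepted += 1
        if self.accepted >= TRUSTED_THRESHOLD:
            self.trusted = True


def needs_review(account: Account) -> bool:
    """Submissions from accounts without the trusted flag go to the review queue."""
    return not account.trusted
```

Note that reaching the threshold never touches `can_review`, matching the proposal that trusted accounts may skip review but may not review other people's work.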

Limited permission to enter data

As Wikidata will work as a data supplier for most of the Wikipedias, the impact of one change will be transferred to all WPs automatically. This power has to be controlled in order to prevent trolling and to keep Wikidata credible.

  • IP users cannot enter new data
  • flagged revisions system: new data has to be validated by an authorized user
  • only data with reference will be considered
  • authorized users have to work only in fields where they have some knowledge
  • data ranking will be defined by specific projects based on references ranking.

This will lead to a slow process of data collection, but the idea is to focus on reliable data and not on fast growth. Automated collection from official or recognized sources then has to be organized, instead of data collection from thousands of individual actions. For city or country data, governmental statistics have to be preferred; for physical properties, well-known books or databanks have to be used,... Wikidata cannot work like Wikipedia because Wikidata will be a preferred reference in Wikipedia. Snipre (talk) 13:18, 25 August 2012 (UTC)

A very bad idea, in my opinion. The key to keeping the data reliable is to cultivate a large wiki with a large editor base, so that there are plenty of patrollers ready to revert bad edits. It is no big deal if a bad edit sticks for a little while, or even if a small proportion of edits stick for a long while. It is preferable to deal with that problem than to put hindrances in the way of editors making edits, and new editors gaining acceptance to the community.
If you look around the wikisphere, you will notice that the wikis that either (1) have little/no activity or (2) have a lot of bad edits that stick, are those with not enough editors to patrol everything. Without a sufficient editor community, they are forced to either lock everything down from editing or let a bunch of bad stuff through unpatrolled. The key to avoiding those two bad options is to attract a lot of patrollers, which can include editors running bots with anti-vandalism heuristics.
All encyclopedias have some bad data; as long as it is kept within reasonable proportions, it is acceptable. Completeness is an important goal too, and all other things equal, a project with 1,000,000 data, 1,000 of which are incorrect, is better than a project with 1,000 data, 1 of which is incorrect, even though they have the same percentage rate of incorrect data. Leucosticte (talk) 20:43, 26 August 2012 (UTC)
Thank you for the comment. You are right concerning the need for a large community for good development, but reality shows that for accurate information few people take or have the time and the resources for a good search. My comment is based on the assumption that, as for many projects on the different Wikipedias, only a small number of users will be available for the development and maintenance of that database. From my personal experience with WP, the number of users decreases as a project develops, because from a certain point only a small number of people have the knowledge to add something new, and few people are interested only in maintenance work.
Then the principle of Wikidata differs a little from Wikipedia: the idea is not to present different opinions about values but to provide reliable data. Another difference is the idea of ranking the different pieces of information, and this implies an evaluation which is not in the "philosophy" of Wikipedia. Data can't differ according to opinion or according to time. Once a value is set, nothing can change it: the surface area of a country is fixed at a defined time, water boils at 100°C,... So a large number of patrollers is not needed to keep the data reliable and is a waste of human resources.
Wikidata needs systematic data collection with internal verification, a system of automatic updates for some information and a simple interface for reporting errors. There is no need for people changing data every week and patrollers checking the changes. Once a datum is entered, no change is expected: data are not articles with continuous improvement. Snipre (talk) 22:28, 26 August 2012 (UTC)
I suspect that what we will see happen in practice is that Wikipedians will occasionally see incorrect data from Wikidata appear in articles, and want to make edits to Wikidata to correct those errors. Wikipedia is of course by far the most active wiki under the WMF umbrella; most Wikipedians only make a handful of edits, if that, to sister projects. So I think we can realistically expect the median number of edits per user to be fairly low on Wikidata, much as it is for Wikiquote, Wiktionary, Wikisource, etc. A situation in which that median is low calls for making it as easy as possible for new users to make the edits they came there to do with a minimum amount of hassle or impediments in getting set up at the new wiki and accomplishing the task that brought them there.
You raise the point that water boiling at 100°C will not change. Since that is the case, why have Wikipedia pull that data from Wikidata? If there is no need to ever update it, the data can just be entered into Wikipedia the way that it is now; i.e. requiring an edit to Wikipedia in order to change it on Wikipedia. There are so many eyes watching those Wikipedia articles that bad edits of that nature will tend to be detected more readily than on a project with fewer editors. Of course, it could be possible to have a cascading watchlist that would automatically include or exclude wikidata data from the user's watchlist depending on whether a watchlisted Wikipedia page transcludes that data.
There can indeed be differing opinions about some data. Consider, for example, world population. There is no global census bureau, and even if there were, its enumeration would not be 100% accurate. There would still be differing estimates of the true world population. Likewise with data that some states have an incentive to want to distort or keep secret, such as military statistics. Of course, such data also changes over time.
To encourage full and equitable participation in the bold, revert, discuss cycle, we should avoid using tools such as FlaggedRevs on wikis with low median participation rates per user. If a user has only made three edits to the wiki, he should have as much ability to edit (including reverting edits) as the user who is more active. The reason is that having only three edits on Wikidata would not necessarily indicate a likely sockpuppet; that could quite possibly be a user with thousands of Wikipedia edits, but he just doesn't have a need to edit, or interest in editing, Wikidata very often. Leucosticte (talk) 00:01, 27 August 2012 (UTC)

Protect data that will not conceivably change

What I am about to propose is not intended to compete with the ideas above, but to complement them.

Pre-emptive protection is unacceptable for articles on the English Wikipedia, because even a featured article could conceivably be improved. But while on the whole the same principle will apply to Wikidata, there probably is value in locking down elements which no user would ever need to update. Why would Helium's atomic number ever need to be updated? Will seven ever cease to be a prime number?

Obviously, care must be taken. A government could change or attempt to change a country's name. An element's symbol (He for helium) could conceivably be changed, particularly in non-latin scripts. In 2006 the IAU changed its mind on how many planets there are in the solar system, a verdict that is not universally accepted. Thus, however unlikely it is that a specific data entry along these lines might need to be changed, to fully protect any of them could cost us would-be contributors.

The way to get use out of this system would be to apply what I call the "hell would freeze over" test. To protect for this reason, we should be as sure that something will never need changing as we are of our own names. The number of protons in helium will never change, nor will seven's status as a prime number. Protecting those particular pieces of information, if technically possible, would clearly be a net positive, particularly given how high-profile and widely used these pieces of information are. The benefit of protection would be in ensuring accuracy, and saving time that might otherwise have been spent reviewing or reverting. —WFC— 07:31, 29 August 2012 (UTC)

This protection would not accomplish much improvement in quality of wikidata. If we are certain that 7 will remain a prime number, then if someone makes a revision to say that 7 is not a prime number, won't it be quickly reverted, and the editor possibly warned/blocked? So there will be an extremely small percentage of time that such data will be in a state of incorrectness, probably well within tolerable limits for an encyclopedia. People should be aware that, as with any wiki project, they use wikidata at their own risk and that there is always a chance, however small, that the data might be incorrect at the time that they pull it.
If we do this, there will also be some cases in which incorrect data gets protected. A person may say, "I'm an expert in my field, and I know that this data is correct and won't change, so please protect it." A sysop who doesn't know any better than to believe him may do what he asks, especially if there are people trying to change the data (which will tend to happen, if the data is controversial or incorrect). Then if he's wrong, another editor won't be able to come along and change it without going to the trouble of getting it unprotected, which a sysop might be reluctant to do if he's too unfamiliar with the subject matter to know who is correct. My guess, based on observing what has happened on a lot of protected and unprotected wiki pages, is that protection will actually hurt our accuracy rate more than it helps.
Protection gives people one more meta-issue to argue about, because there will be marginal cases in which people argue whether a datum does or does not fall in the category of something that will never need changing. There will be arguments over whether the wrong version of a datum was protected. Instead of edit wars over content, there will be wheel wars over protection, unless sysops defer to each other on such matters, which is not necessarily a helpful practice. On Wikipedia, I've seen page protections that other sysops were unwilling to remove greatly hinder progress in improving the articles in question.
Protection should be reserved for cases in which (1) a datum is so widely transcluded that changing it would invalidate a lot of caches and put a strain on the server; or (2) there's an edit war that can't be stopped by other means (such as warning/blocking a few editors). In case #2, the protection should be as limited (e.g. semi-protection) and as temporary as possible while still being effective at helping impel the opposing sides to sort out their issues on the talk page.
No need to reinvent the wheel; I see every reason to suspect that what we've seen work on Wikipedia will tend to work on Wikidata as well, and what we've seen be counterproductive on Wikipedia will tend to be counterproductive on Wikidata too. Experiments of trying to increase reliability by protecting pages from editing by any but an elite few gatekeepers have been tried already; they were called Encyclopædia Britannica and Nupedia. Leucosticte (talk) 18:47, 29 August 2012 (UTC)
The arguments you make in the first two paragraphs are only relevant to my suggestion if you ignore the hell freezing over/be as sure as you are of your own name aspect. In fairness you address this anomaly in your third paragraph, but the solution is very simple: make clear in no uncertain terms that admins have zero latitude in interpreting "sure as your own name" to mean "on balance of probability" or "beyond reasonable doubt". Your fourth paragraph is simply a summary of your opinion. But it's your closing paragraph that has really caught my eye. I could probably write an entire essay on the comparisons you make, but I'll try to condense it slightly.

Nupedia's failure was based on a lack of trust of ordinary contributors; the belief that commoners couldn't possibly contribute to something which might one day be of professional quality. To be blunt, I'm glad it died. And yet Wikipedia (or at least en.wikipedia) now has a strong wiki-equivalent of Nupedia's process. New page patrollers look over articles written by people who are not considered to be "professionals" at starting articles up. Almost all articles are checked over for certain issues by bots. We have cleanup tags for every conceivable problem, and a WikiProject dedicated to the copyediting of articles. We have good article assessment, which primarily looks at content, and peer review, which generally looks at formatting, encyclopaedic tone, POV, spelling etc – there is a degree of intentional overlap between the two. At the very top of the tree we have the Featured Article process, a consensus driven peer review by a large number of experienced reviewers. The gauntlet a Wikipedia FA has to run is no less stringent than it was for a Nupedia article, the only difference is that Wikipedia has the numbers to pull it off. As for Britannica, wasn't Wikipedia formed by data-dumping the entire eleventh edition? Sorry for straying so far off topic, but I wanted to be thorough before making my point, which is that Wikimedia's strength lies in embracing what others do that works, and reinventing the wheel where it doesn't.

Getting back to the matter at hand, a lot of what you say does not apply in the context that we're talking about. The reason we don't pre-emptively protect on Wikipedia is that no article can be considered perfect, not even one that has gone through the scrutiny outlined above. That is not true for all data. A field that says that seven is a prime number is perfect, as is one that says that a Helium atom's nucleus has two protons. Any edit to a perfect piece of data is by definition a negative. Any inaccuracy caused by this edit was avoidable, and any time spent dealing with that edit is time that could have been saved without any consequences.

Finally you address the issue of vandalism, asserting that Wikipedia already copes with it very well. True enough, Wikipedia is good at dealing with profanity, or slights on someone's sexuality, race or religion. Subtle statistical changes on the other hand are virtually impossible to detect via anti-vandalism methods. Unless someone is keeping a close eye on their watchlist, it can go undetected for long periods. —WFC— 03:19, 30 August 2012 (UTC)

I don't mind you straying off the topic; you raise interesting points there. Wikipedia did indeed copy a lot of the old Britannica's content (as well as other public domain content, such as U.S. Census Bureau statistics). Although that led to a lot of pages (e.g. on certain towns) that left a lot to be desired, it was considered (and was) better than nothing.
We both acknowledge that what might be a good policy/practice for data like prime numbers would not be for other data. The problem is, once the door is opened just a crack to allow for protection in cases like what you describe, it ends up letting in a lot of bad protections too, in practice. I would compare it to the situation with not only protections on Wikipedia but CSDs and many other policies/practices/processes intended to handle extreme cases. For instance, A1 was intended to make it easy to delete articles with no other content besides "He is a funny man with a red car. He makes people laugh." It says right there in the policy that it applies only to very short articles, i.e. substubs.
Well, I have seen other articles that weren't substubs get deleted because of abuses of that policy. Whenever a rule is created (especially a rule granting authority, such as speedy deletion authority), people try to stretch it to cover situations it wasn't intended to cover, even in violation of the letter and spirit of the rule. Then when someone objects to an improper speedy deletion, people say, "Well, that may be a violation of process, but it needed to be deleted anyway for other reasons." Perhaps, but if speedy deletions in cases outside the CSD parameters are allowed to stand, why bother having speedy deletion criteria that are different from the WP:AfD deletion criteria? Probably because it's easier to prevent an article getting deleted at WP:AfD than to get a deleted article undeleted at WP:DRV, and/or because it's better to have a bad article remain undeleted for a week than to have a good article remain deleted for a week. The same here; if there's a situation in which either a Type I error or Type II error are possible, better from the standpoint of the wiki way to err on the side of having pages that should be protected remain unprotected, than to err on the side of letting pages that should be unprotected remain protected.
To take a case to DRV is to publicly say, "I think this sysop screwed up, and wasn't willing to correct his mistake when he pointed it out, so I had to take it to the community." Better just not to give him a policy in the first place that gives him the ability to make that screw-up, so that no one will have to call him out for it and try to get community consensus to reverse his decision. He can't overreach if he's not given any reach to speedily delete stuff. I would say, CSD as it exists now is harmful in many ways, because it allows summary deletion that an ordinary user can't revert. The logic behind the pure wiki deletion proposal was that it would allow an ordinary user to revert a speedy deletion, and thereby put sysops and editors on a more even playing field, while still allowing immediate removal of bad content. It would combine many of the benefits of both CSD and AfD, and avoid some of the disadvantages (viz. overly-quick deletion in the case of some CSDs, and overly-slow deletion in the case of some AfDs).
I am concerned that the same would happen with this; if we create a rule to cover extreme cases, people will stretch that rule to cover less-extreme cases. And those abuses are difficult to correct, when the only users able to correct them are sysops. I haven't run into too much of the "sneaky" kind of vandalism involving insertion of plausible but incorrect data on Wikipedia; have you? There are a lot of measures to prevent such mischief, such as the fact that once you catch a person doing it once, his whole contribution history comes under scrutiny. I am probably a bit biased on this issue, admittedly, because I've usually been a regular editor and new article creator rather than a vandal-fighter (unless I happened to stumble across vandalism). The other side, i.e. those who only fight vandals rather than adding/editing content, is probably biased in many cases too. Leucosticte (talk) 05:10, 30 August 2012 (UTC)
What we seem to agree on is that the English Wikipedia combines sledgehammer policies with tick-box endorsements of these decisions, and that this is the worst of all worlds. Where we disagree is in cases where tough policy is in theory the best: you're not disputing the theory of what I'm saying, you're saying that it would be harmful in practice. So, do we introduce a similarly uncompromising set of measures to deal with admins misusing the policy? Or do we go with a solution which is not so good in theory, but for which there is negligible risk of admin abuse?

The answer is ideological. My belief is that if you are going to introduce a hierarchy (although not intentional, advanced permissions do create a hierarchy) the checks and balances against potential abuse need to be strong. The deterrents against abuse need to be stronger than any potential gain to be had from using the tools to your own ends. With a robust system against tool misuse in place, we can be confident that any system we introduce will be fair, allowing us to concentrate on figuring out which one will work best.

I'm going off topic again, but this argument applies to every measure on this page. We're all throwing ideas around, but it's hard to know which ideas will be best suited to Wikidata, until we know what Wikidata's approach to admins will be. —WFC— 17:48, 30 August 2012 (UTC)

Creating systems to highlight questionable entries

From what I can tell, much of the discussion here seems to revolve around management of single pieces of data, but what about being able to examine that piece of data in context with the rest of the data of that particular type of property? The main idea is that the rest of the data gives you some insight into the expected ranges and distributions of values. While this won't find all bad edits, it will at least allow for the creation of systems that can highlight values that don't seem to fit. Comparing the values in use for different properties of the same object could also help. For example, if for a country, you find that the GDP/person does not roughly correlate with the income/person, then you know that something doesn't seem right. It might be inspirational to look at Google Refine and the various techniques that they use to find outliers and anomalous data.

Also, I'm not sure if anyone's played with writing SPARQL queries on DBpedia, but I often use it to find all the values in use for a parameter used in a specific template on Wikipedia. This very quickly highlights inconsistencies in the data, in terms of both values and inconsistent representations of the data. Similar functionality could be quite useful for Wikidata. --Mr3641 (talk) 19:50, 29 August 2012 (UTC)

Perhaps a good way to implement this would be to program some bots to monitor certain data using their watchlists and by API modules such as prop=categories, and compare the data from new revisions to what the bot's own heuristics say are reasonable values. My guess is that if we have an open system, right after we launch Wikidata, accuracy will not be all that great, but that it will get better as people develop and refine tools to revert bad edits. It will be kinda like how there used to be a lot of vandalism that would stick at enwiki for awhile, but then ClueBot and similar tools were developed to quickly revert most of it.
I think we should set pretty high tolerances for bad data in the beginning months (and even years) of Wikidata, and consider it as a beta product compared to what it will be after these tools are developed. The key to being both comprehensive and accurate is to refrain from succumbing to the temptation to close the system in an effort to keep bad edits out, and be patient (while also developing all the necessary interfaces, such as an adequate API, for more effective and efficient data retrieval and automated editing) while the users collaborate to work out the kinks. That was the approach Wikipedia took, and it worked out pretty well. That philosophy of bottom-up development is also called wikidynamism. Leucosticte (talk) 20:54, 29 August 2012 (UTC)
Basically detection of outlier data, not sure if this is the correct term in English. One way is to use some kind of learning network to estimate error bounds on values for a group of similar entities. It's not simple, but I think it is doable. At least Statistics Norway have experimented with similar things. — Jeblad 13:41, 16 September 2012 (UTC)
Aren't neural networks used for that purpose to flag suspicious credit card transactions? Leucosticte (talk) 10:31, 17 September 2012 (UTC)
I think they use a whole bunch of different methods, but I have heard about neural networks and also warnings triggered by visits to unusual countries. In Wikidata I think it will be more effective to flag warnings about unusual values than to block such changes. — Jeblad 22:03, 19 September 2012 (UTC)
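
As a rough illustration of flagging values that do not fit the rest of the data for a property, the sketch below uses a modified z-score based on the median absolute deviation. This is a simple statistical stand-in, not the learning network mentioned above; the function name, the threshold of 3.5 and the sample figures are all assumptions made for the example.

```python
from statistics import median

def flag_outliers(values, threshold=3.5):
    """Return the indices of values whose modified z-score exceeds threshold.

    The threshold of 3.5 is a commonly used default, chosen here as an assumption.
    """
    med = median(values)
    abs_dev = [abs(v - med) for v in values]
    mad = median(abs_dev)  # median absolute deviation
    if mad == 0:
        # More than half of the values equal the median, so flag anything
        # that differs from it (an assumption made for this sketch).
        return [i for i, v in enumerate(values) if v != med]
    return [i for i, d in enumerate(abs_dev) if 0.6745 * d / mad > threshold]

# Hypothetical population figures (in millions) for a group of similar items.
populations = [5.1, 4.9, 5.3, 5.0, 5.2, 510.0]
print(flag_outliers(populations))
```

In line with the comment above, a tool built on something like this would only flag the edit for review rather than block it.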

Automatic comparison of data to its source

It seems possible to do a simple grep-like operation on referenced data to see if the same values are contained in a quotation, and check if the quotation in fact reflects the text on an external site. This could make it simpler to identify vandalized data. — Jeblad 13:35, 16 September 2012 (UTC)
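
A minimal sketch of such a grep-like check, assuming the text of the external site has already been fetched and is passed in as a plain string. The function names and the normalisation (lowercasing and collapsing whitespace) are assumptions for illustration only.

```python
import re

def normalize(text):
    """Lowercase and collapse whitespace so trivial formatting differences are ignored."""
    return re.sub(r"\s+", " ", text).strip().lower()

def check_reference(value, quotation, source_text):
    """Return a pair (value_in_quotation, quotation_in_source).

    source_text is assumed to be the already downloaded text of the external page.
    """
    v, q, s = normalize(value), normalize(quotation), normalize(source_text)
    return (v in q, q in s)

# Hypothetical statement value, quotation and source text.
quotation = "The population was 4,920,305 in 2012."
source = "Statistics overview.  The population was\n4,920,305 in 2012. More text."
print(check_reference("4,920,305", quotation, source))
print(check_reference("9,999,999", quotation, source))
```

A value that no longer appears in its own quotation, or a quotation that no longer appears in the source, would be a candidate for review.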

Wikipedia's watch list

I think a very good way to check on changes is to use Wikipedia's watch list. Each user already watches a certain number of articles, so the data changes should appear on the Wikipedia users' watch lists. A large part of the incorrect changes would be detected, and a large number of watchers would be recruited almost automatically. Cqui (talk) 12:28, 19 September 2012 (UTC)

One could say the same about every other WMF wiki. See the bugs listed at mw:Interwiki integration. Leucosticte (talk) 07:40, 21 September 2012 (UTC)

Restrict input to edit-elements

The idea is to restrict what can be entered in label and description elements in items. Those restrictions are:

  1. Don't allow web addresses in those fields
  2. Don't allow HTML tags to be added to those fields
  3. Limit the length of text that can be added to those fields

There is already an item where criteria 1 and 2 apply at http://wikidata-test-repo.wikimedia.de/wiki/Data:Q8635 .--Snaevar (talk) 01:34, 26 September 2012 (UTC)
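
The three restrictions above could be checked along these lines. This is only a sketch: the regular expressions are deliberately simple, and the length limit of 250 characters is a made-up value, since no limit is proposed above.

```python
import re

MAX_LENGTH = 250  # hypothetical limit, not taken from the proposal
URL_PATTERN = re.compile(r"(https?://|www\.)\S+", re.IGNORECASE)
TAG_PATTERN = re.compile(r"</?[a-zA-Z][^>]*>")

def validate_term(text, max_length=MAX_LENGTH):
    """Return a list of problems found in a label or description (empty if none)."""
    problems = []
    if URL_PATTERN.search(text):
        problems.append("contains a web address")
    if TAG_PATTERN.search(text):
        problems.append("contains an HTML tag")
    if len(text) > max_length:
        problems.append("too long")
    return problems

print(validate_term("Capital of Norway"))
print(validate_term("<b>Visit</b> http://example.org"))
```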

Who is the boss?

This whole discussion seems to presume a one-way flow of information, from Wikidata to the various Wikipedias. It also presumes that the editors of these wikipedias will voluntarily give over the responsibility for providing information to those who are editing Wikidata. Both of these seem unrealistic. The second is unrealistic because most editors think they know best. The first is unrealistic because without an information flow from the individual wikis to wikidata, the whole thing will never get off the ground.

The main source of information must be what individual editors on different wikipedias put in. The primary task of wikidata is then to resolve the conflicts. I'm not suggesting this is easy. But it is the essence of the task. Chris55 (talk) 23:11, 1 October 2012 (UTC)

I agree. Wikidata editors must be the group of editors in the Wikipedias that use the data. And this community must modify the policies and elect admins to solve conflicts.
Eloy (talk) 22:27, 22 October 2012 (UTC)

Transliteration of labels and descriptions and measuring similarity

It is possible to transliterate strings in one language into a base script and compare this to the same string in other languages. The comparison method would be a similarity measure, for example a count of similar trigrams. A simple explanation of it is that it works like a fuzzy comparison on strings. This similarity measure works remarkably well for copyvios and should work even across languages. Within an item we can then do this comparison with all other strings and compare the best match against a minimum value. If the best match is lower than this value we tag (mw:Manual:Tags) the edit as suspicious. This could work on labels, but note that there will be a tight limit there. It could work better on descriptions, especially if we try to translate between strings. — Jeblad 02:33, 13 November 2012 (UTC)
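
The trigram comparison could be sketched as follows. Real transliteration between scripts is not attempted here; stripping diacritics with Unicode normalisation is used as a crude stand-in. The similarity is computed as the share of trigrams the two strings have in common (a Jaccard ratio) rather than a raw count, and the minimum value of 0.3 is arbitrary. All names are assumptions for illustration.

```python
import unicodedata

def to_base_script(text):
    """Crude stand-in for transliteration: strip diacritics and lowercase."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c)).lower()

def trigrams(text):
    """Return the set of character trigrams of the padded, normalised text."""
    padded = f"  {to_base_script(text)} "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}

def similarity(a, b):
    """Share of trigrams the two strings have in common (0.0 to 1.0)."""
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)

def is_suspicious(new_label, other_labels, minimum=0.3):
    """Flag the new label if its best match among the other labels is below minimum."""
    if not other_labels:
        return False  # nothing to compare against
    best = max(similarity(new_label, other) for other in other_labels)
    return best < minimum

labels = ["Reykjavík", "Reikiavik"]
print(is_suspicious("Reykjavik", labels))
print(is_suspicious("BUY CHEAP PILLS", labels))
```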

Descriptions as tag-phrases

If we use the descriptions as a kind of context or tag-phrase (phrases used as w:en:Tag (metadata) in blogs) it would be possible to compare the phrases across items. If a specific phrase doesn't match up with any other phrase the edit could be tagged (mw:Manual:Tags) as suspicious. The existing term cache should be sufficient to make this work. — Jeblad 02:39, 13 November 2012 (UTC)
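
A very small sketch of the phrase comparison, assuming descriptions are split into phrases at commas and that the descriptions of other items are available as a plain list (standing in for the term cache). Both the splitting rule and the function names are assumptions.

```python
def phrases(description):
    """Split a description into lowercase comma-separated phrases."""
    return [p.strip().lower() for p in description.split(",") if p.strip()]

def unmatched_phrases(new_description, existing_descriptions):
    """Return the phrases of the new description that occur in no other description."""
    known = {p for d in existing_descriptions for p in phrases(d)}
    return [p for p in phrases(new_description) if p not in known]

# Hypothetical descriptions taken from other items.
existing = ["city in Norway", "city in Sweden", "river in Norway, tributary"]
print(unmatched_phrases("city in Norway", existing))
print(unmatched_phrases("best pizza ever", existing))
```

An edit whose description yields a non-empty result would be the one to tag.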

About vandalism

Edits in general contain some information that is correct and some that is wrong. Some edits add new content and some change existing content. Vandalism can be viewed as an edit that consists mainly of incorrect information. Such incorrect edits have a probability of detection (POD), and the lower this value, the longer the vandalism lives on average before it is removed. The less time each user spends looking at the incorrect information, the less likely it is that they will figure out that it is vandalism. Less time can be compensated for with more people, or by making troublesome edits more visible, for example by marking the latest edits. Posting unrelated material on Special:Recentchanges, for example, gives an edit less time to be exposed and so decreases the POD. Enforcing patrolling of changes means more people will check the edit, and the POD will increase.

When incorrect information is posted on the site it will have some impact. Usually this will end up with a loss of credibility (LOC), which is bad enough but not really critical. That makes it possible to make trade-offs between LOC and POD, that is, how much LOC is acceptable and what that means for POD. In some cases LOC will not be decreased even if POD is increased, for example if readers report their findings. In such cases the LOC will still be there for the reader involved, but will be slightly improved for later readers, as they would not see the vandalism. — Jeblad 16:04, 3 December 2012 (UTC)
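
To make the relation between POD and the lifetime of vandalism concrete, the sketch below assumes a very simple model that is not stated above: every view of the edit detects the vandalism independently with the same probability. Under that assumption the expected number of views before detection is 1/POD.

```python
def survival_probability(pod_per_view, views):
    """Chance that the vandalism is still undetected after the given number of views."""
    return (1.0 - pod_per_view) ** views

def expected_views_until_detection(pod_per_view):
    """Mean number of views until detection under the independent-views assumption."""
    if not 0.0 < pod_per_view <= 1.0:
        raise ValueError("pod_per_view must be in (0, 1]")
    return 1.0 / pod_per_view

# Doubling the per-view POD halves the expected lifetime measured in views.
print(expected_views_until_detection(0.25))
print(expected_views_until_detection(0.5))
```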

Wikimedians, not Wikidatans

If we decide to go with "semi-protecting" data on Wikidata, I think anyone accepted on any project should be enough to edit Wikidata. A general problem here at Wikimedia is that there isn't enough of a connection between the different communities. If you become a major participant at Wikipedia, it counts for nothing at Wiktionary, and so on. The same thing is true about different language versions of each project. So I think we should establish standards, so anyone who's known on one project to be helpful should be trusted on other projects.

One way of doing this could be to make flat guidelines, for example "Any WP user with >100 edits and who hasn't been blocked for three months is trusted." Or we could make it wiki-dependent, so the English and German -pedias for example are more trusted than Volapük etc. Essentially, reputation should carry over.
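
The flat guideline quoted above could be expressed as a simple check. This is only a sketch: "three months" is approximated as 90 days, and the function and parameter names are made up for the example.

```python
from datetime import datetime, timedelta

def is_trusted(edit_count, last_block_end, now, min_edits=100, block_free_days=90):
    """Apply the flat rule: more than min_edits edits and no block in the recent past.

    last_block_end is the end of the user's most recent block, or None if never blocked.
    """
    if edit_count <= min_edits:
        return False
    if last_block_end is None:
        return True
    return now - last_block_end >= timedelta(days=block_free_days)

now = datetime(2012, 12, 1)
print(is_trusted(150, None, now))
print(is_trusted(150, datetime(2012, 11, 1), now))
```

A wiki-dependent variant could pass different values of min_edits or block_free_days per project.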

Metrics of content stability

I suggest creating metrics of content stability to support automation.

Stability of the input data is crucial for programming, even more than its correctness. Edit wars, that are so common in Wikipedia, would render a Wikidata resource inoperable for a whole class of automatic processes. Even the well-meaning correction of a value that was kept wrong for a long time could break many applications that depend on the wrong value.

The checks, balances and correction mechanisms developed at Wikipedia are good for human readers assessing an article, but are not enough to support automation. I see the intent of this project as providing an open corpus of data for automatic processing. For that, it's imperative that the automated clients can assess the stability of the entry data and know how much they can trust it.

I envision bots and scripts that decide what version of a resource to use depending on its overall stability. Some visualization tools could simply show the most recent version, as Wikipedia articles currently do, while other tools that use automated reasoning could tune themselves to a trusted version and discard all later edits until a human manually reviews their accuracy.

Programs using Wikidata should be able to create their own procedures to assess the reliability of data from a few simple metrics of its evolution. I suggest publishing for each data item (Statement, Property, Value or Reference) a few indicators related to the stability of its values, such as these:

  • Time from the latest change
  • Longest time without changes
  • Frequency of changes (to detect edit wars)
  • ...

These and other summaries of the item's history could be used by programmers to develop methods of trust. The quality and utility of these metrics would be evaluated during the life of the project to suggest improved metrics that better support automation. Diego Moya (talk) 18:59, 14 December 2012 (UTC)
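
As an illustration, the three indicators listed above could be derived from the timestamps of an item's changes roughly as follows. The function name, the returned keys and the choice to count the first recorded change as the start of the item's lifetime are all assumptions made for this sketch.

```python
from datetime import datetime, timedelta

def stability_metrics(change_times, now):
    """Compute simple stability indicators from a list of change timestamps."""
    times = sorted(change_times)
    if not times:
        raise ValueError("at least one change is required")
    # Gaps between consecutive changes, plus the gap from the last change until now.
    gaps = [later - earlier for earlier, later in zip(times, times[1:])]
    gaps.append(now - times[-1])
    lifetime_days = (now - times[0]).total_seconds() / 86400
    return {
        "time_since_latest_change": now - times[-1],
        "longest_time_without_changes": max(gaps),
        "changes_per_day": len(times) / lifetime_days if lifetime_days > 0 else float("inf"),
    }

# Hypothetical change history for one statement.
history = [datetime(2012, 1, 1), datetime(2012, 1, 2), datetime(2012, 1, 11)]
print(stability_metrics(history, now=datetime(2012, 1, 21)))
```

A client could, for example, ignore values whose changes_per_day is above some limit of its own choosing, which is one way to steer clear of ongoing edit wars.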