Research:Identification of Unsourced Statements/Citation Needed Reason Analysis

From Meta, a Wikimedia project coordination wiki

When adding a [citation needed] flag, editors can specify the 'reason' why they chose this template for a given sentence. Here we take a large corpus of statements tagged with [citation needed] and use natural language processing to identify the main reasons why editors apply the flag.


Besnik Fetahu extracted the paragraphs that contain citation needed markers, along with all the metadata attached to each marker.

  • There are around 263,672 statements in this dataset.
  • About 17K of them have a non-empty 'reason' field.
  • After removing duplicates, we are left with around 8K distinct 'reason' sentences for the 'citation needed' flags.
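
The filtering steps above can be sketched as follows. This is a minimal illustration, not the script actually used: the dictionary layout, the `reason` key, and the `distinct_reasons` helper are hypothetical, and treating reasons that differ only in case or surrounding whitespace as duplicates is an assumption.

```python
def distinct_reasons(statements):
    """Return the distinct, non-empty 'reason' strings in first-seen order.

    Assumption: each statement is a dict with an optional 'reason' key, and
    reasons are normalized by trimming whitespace and lowercasing before
    duplicates are dropped.
    """
    seen = set()
    distinct = []
    for statement in statements:
        reason = (statement.get("reason") or "").strip().lower()
        if reason and reason not in seen:
            seen.add(reason)
            distinct.append(reason)
    return distinct


# Toy input standing in for the extracted statements.
sample = [
    {"reason": "Not in citation given"},
    {"reason": ""},                        # empty reason: dropped
    {"reason": "not in citation given "},  # duplicate after normalization
    {},                                    # no reason field at all
    {"reason": "what year?"},
]
print(distinct_reasons(sample))
```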


The reason sentences are converted to vectors and clustered with k-means.

  • I transformed each sentence into a numeric feature vector using a word2vec model trained on the Google News corpus.
  • I computed k-means using scikit-learn, varying k from 2 to 29.
  • Using the elbow method, I identified k=10 as the best number of clusters for this problem (see the figure).
Figure: variance as a function of the number of clusters
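
The clustering steps above can be sketched roughly as below. Several parts are assumptions rather than the method actually used: the page does not say how word vectors were combined into a sentence vector (averaging is assumed here), the elbow was read from the plot rather than computed (the second-difference heuristic in `elbow_k` is only one possible stand-in), and the toy embeddings and synthetic points replace the Google News model and the real reason sentences. All function names are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans


def sentence_vector(sentence, word_vectors, dim):
    """Average the vectors of the known words in a sentence (assumed scheme)."""
    vectors = [word_vectors[w] for w in sentence.lower().split() if w in word_vectors]
    if not vectors:
        return np.zeros(dim)  # no known word: fall back to the zero vector
    return np.mean(vectors, axis=0)


def inertia_by_k(X, k_values, seed=0):
    """Fit k-means for each k and record the within-cluster sum of squares."""
    inertias = {}
    for k in k_values:
        model = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X)
        inertias[k] = model.inertia_
    return inertias


def elbow_k(inertias):
    """Pick the k where the inertia curve bends most (largest second difference)."""
    ks = sorted(inertias)
    best_k, best_bend = None, float("-inf")
    for prev_k, k, next_k in zip(ks, ks[1:], ks[2:]):
        bend = inertias[prev_k] - 2 * inertias[k] + inertias[next_k]
        if bend > best_bend:
            best_k, best_bend = k, bend
    return best_k


# Toy embeddings standing in for the pretrained word2vec model.
toy_vectors = {"dead": np.array([1.0, 0.0]), "link": np.array([0.0, 1.0])}
example = sentence_vector("Dead link here", toy_vectors, dim=2)

# Synthetic 2-D points with three well-separated groups, standing in for
# the sentence vectors; the real analysis swept k from 2 to 29.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(30, 2)) for c in centers])
inertias = inertia_by_k(X, range(2, 8))
print(elbow_k(inertias))
```

On real data the inertia curve is usually much smoother than in this toy case, which is why inspecting the plot, as done here, is often preferred over an automatic rule.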


Below are the 10 resulting clusters, together with the sentences closest to each cluster's centroid; I named each cluster according to the nature of the sentences in it.

1) Dead links:

* eb article cited does not include this name change; the link at wp:lake alajuela that could support it has gone dead as well.
*  the article appears valid, but nowhere in the link does the article mention that this type of animated stand up is the first of its kind
*  it is not apparent from the linked wikipedia article how tree decompositions are used there. further neither this nor the linked wikipedia article references any scientific paper on the subject. it is impossible for someone not knowledgeable in that domain to verify this.
*  citation is a dead link. though this claim is made on several baseball websites and forums, the original baseball abstract issues do not seem to contain any references to speed score. more discussion is included on the talk page
*  while this is stated in numerous articles there are never any details available and calandra's own linkedin page does not include any reference to any such experience. this makes the statement unverifiable
*  the cited article does not list separation minimums; such information is found in ac 90-23g published by the faa

2) Facts needing citations:

* there is no citation for this complete paragraph, the ouster from dmk especially needs a citation
*  from the last citation lots of facts so a citation is needed
*  the citation used is totally unrelated. please do not roll back without adding a proper citation
*  not in dispute, just need a citation
*  no citation for claim that crimereports is largest…something
*  not in citation given

3) Reliable source needed:

* reliable source needed for this paragraph. previous source website went dark.
*  reliable source needed for this fact
*  reliable source needed for the tazobactam susceptibility, as source n°4 states otherwise
*  reliable source needed for this name
*  reliable source needed for the entire article.  source cited contains no actual information about article subject.
*  this claim needs a reliable source or sources

4) No evidence (especially historical) for the claim:

* no evidence for this claim
*  unsubstantiated claim and no example
*  no historical evidence for this claim
*  no period historical evidence for this unsupported claim (frequently/triton/cafe racer)
*  why? statement of a claim, reference needed.  claim not substantiated elsewhere
*  claim not substantiated in reference

5) Too general - needs a reference to specify the subject of the sentence:

* what accounts?
*  what list?
*  what reviews?
*  what purists?
*  which cases?  what questions?  what orders?
*  what year?
*  what value?
*  what historians?
*  what labour party, rationale?
*  what illness, exactly?
*  what critics? what criticism?
*  what processor?
*  what long-year argument?
*  what research(ers)?
*  what evidence?
*  what theorems?
*  what data?

6) Original research on reference:

* the table below gives many "examples", but none of them include any diacritics. pending examination by a yoruba scholar, a citation to one who at least said the writing system includes diacritics would be satisfactory. update: pldx1 added the new phrase "yoruba alphabet" to the first line after i added this cn template. searching this phrase on wikipedia links to the "pan-nigerian alphabet" article. i suspect this claim may no longer need a citation per wp:blue, but the relevance to this article, and why this is a problem, and why no diacritics appear in the list of examples immediately below, are still in question.
*  does any scholarly source states or explains that this is in "agreement"? people travelled extensively along the coasts at the time at it would not have been a problem to emerge from jutland, even though the ocean was crossed in the south. i smell original research here.
*  the second half of the first sentence, followed by the claim made in the second sentence, conflicts with the logic given in the qualifying clause of the third sentence: "in proportion to its extent" - trunks of trees and boulders only raise the level of the river above them in proportion to their own size combined with the size, shape etc of the river/stream/brook - accumulations of gravel may be due to natural sediment transport, and pose no real problem - even the us army corps of engineers may reject this suggestion as neither simple nor efficient, and they have - this paragraph seems over-simplified and fraught with the writer's personal opinion
*  the "z380 microprocessor product specification" lists instruction "execute time" values as small as 2.  however, this author did not find in that document (in a brief search) any explicit specification of the units for the listed execute times, and the document has other errors, so if someone knows of a source to the contrary, that source could be correct.  (this author did not have time to research this subject at the time of this edit.)  on the other hand, the z80 has a minimum instruction time of 4 clocks, so this bullet point in the article might be a mis-citation of information about the z80 instead of the z380.
*  saying that yo la tengo plays an anonymous band that is somewhat reminiscent of the group needs to be supported by reliable sources for two reasons: (1) the need to show that the band was actually credited to being in the movie; and (2) the need to show who described as somewhat reminiscent of the velvet underground. since none of this is covered at any point later in the article, it needs to be sourced here. wikipedia editors should not add their own interpretations per wikiepdia's policy on original research; wikipedia editors may report on the interpretations of others, but need to provide reliable sources in support per wikipedia's policy on verifiability.
*  it seems pretty obvious that (in the show at least) lyanna wasn't "kidnapped". this is the view held by robert, and presumably a lot of viewers who paid attention to the one time this event was obliquely referred to, at the time of the character's death in season 1. but it is wrong. so it should be attributed to a reliable secondary source apart from the fictional show.

7) Possibly wrong statement:

* thailand bible society reports completion of translation of ot into thai in 1883, which is 10 years after bradley's death.  if he was involved in ot translation, he didn't finish it.
*  the website cannot be used as a reference. references should be made to one or both of the original sources: geert hofstede, gert jan hofstede, michael minkov, cultures and organizations: software of the mind. revised and expanded 3rd edition. new york: mcgraw-hill usa, 2010 and/or geert hofstede, culture’s consequences: comparing values, behaviors, institutions, and organizations across nations.  second edition, thousand oaks ca: sage publications, 2001
*  the australian article cited has no direct quotes from the hansard of the legislative assembly of norfolk island from 03 nov 2010. david buffett doesn't actually say that norfolk island would surrender self-government in return for a bailout, rather norfolk island is willing to pay into the australian system in return for services and funds.
*  maid of honour to the regent: the regent margaret of austria referred to anne as my little boleyn but anne, if about 7, would have stayed at margaret of york's palace across the street of her niece margaret of austria's palace at which anne would have stayed if already about 12, then about the age of charles the later vth, holy roman emperor. mechelen does not have a record about anne boleyn, but does have contemporary records relating ages to homes of other young ladies at the court.
*  the french wikipedia shows mr hipp as the mayor but no record can be found on any official website that this is so.
*  unlike @ which gets used often these days, i have never seen æ in my entire life in any spanish speaking country, and it doesn't even make sense as ae is nowhere near a/o

8) Current source is not reliable:

* there should be some link to documentation about this supposed monopoly, such as a verifiable source stating it, preferably one with research to back up their claim.
*  needs reliable, third-party sources to allow for verification - not from kt, the town, other partners or press releases - in order to keep this claim in the article.
*  needs reliable, published third-party sources to allow for verification - not from zanker, tla, press releases, etc. - in order to keep this claim in the article.
*  previous link no longer accessible. this link: may provide a suitable substitute, but can't currently verify that proposed source would be supportive of all citations in article for which salem source is currently used.
*  needs reliable, third-party sources to allow for verification - not from nicastro, berkley media, other partners or press releases - in order to keep this claim in the article.
*  needs reliable, published third-party sources to allow for verification - not from zanker, his companies, press releases, etc. - in order to keep this claim in the article.

9) Date only:

* date – april 2010
*  3 july 2010
*  3 november 2009
*  9 june 2010
*  date april 2007
*  date september 2009

10) Flags with low confidence:

* just seeing a citation implies it is true, the paper on smalltalk/strongtalk predates dart so does not say for sure. maybe someone know it is based on strongtalk mixins. strongtalk wasn't first to introduce mixins. no expert on mixins – are there different variants and for sure strongtalk the first to use this variant? even then would someone have to say that it is based on that language?
*  i understand this reading, but this interpretation seems incorrect to me. it is entirely based on a single sentence, which is not really presented as a conclusion, but rather as an introduction to a rhetorical question. i don't think it can be excluded that "we want the maximum good per person" is assuming a constant population, in particular considering the preceding discussion of "optimum population".
*  unattested in reliable sources i have checked so far, and really this is just a variant of reverse pool, an article that does not exist yet, but should and someday will. any pool game can be played in this manner, which derives from russian billiards, though it is sometimes called "chinese pool" after "chinese firedrill", etc.
*  "capitalism: a love story" and various installments of "the young turks" -- admittedly not the best sources, but still better than no sources -- imply that "dead peasants insurance" is a flagrant and somewhat insulting name used by the practitioners of said, and if this is so it is misleading to imply, as this does, that the phrase is used by its opponents. discussing the names somewhere in the body would of course be nice.
*  one short website currently makes this claim with no source. may be in aken's autobiography, but the part that might support is not available online, so could not check.
*  this is the only place on the internet that ever refers to this show as anything other than "finders keepers." as of now, i declare there is no record of it ever being called "the finder" and until such evidence is provided it will not be known as such.
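
The example sentences listed under each cluster are those closest to its centroid. A minimal sketch of how such a listing could be produced is shown below; it assumes Euclidean distance and ranks all sentences against each centroid, and the `closest_sentences` helper and the toy data are hypothetical. With a fitted scikit-learn `KMeans` model, the centroids would come from its `cluster_centers_` attribute.

```python
import numpy as np


def closest_sentences(X, centroids, sentences, top_n=6):
    """For each centroid, return the top_n sentences nearest to it.

    Assumption: closeness is measured with Euclidean distance between the
    sentence vectors (rows of X) and the centroid.
    """
    result = []
    for centroid in centroids:
        distances = np.linalg.norm(X - centroid, axis=1)
        nearest = np.argsort(distances)[:top_n]
        result.append([sentences[i] for i in nearest])
    return result


# Toy 2-D vectors standing in for the word2vec sentence vectors.
sentences = ["what list?", "what year?", "dead link", "link is dead", "what data?"]
X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.0, 5.2], [0.5, 0.0]])
centroids = np.array([[0.2, 0.0], [5.0, 5.05]])

for cluster_id, examples in enumerate(closest_sentences(X, centroids, sentences, top_n=2)):
    print(cluster_id, examples)
```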