From the labeling pilot, we extract the first annotated sentences. More specifically, we have:
- A total of 460 sentences from 3 languages, labeled as needing or not needing a citation (on average, 25% were labeled 'False', i.e. not needing a citation)
- A total of 242 sentences for which users have specified a reason why a citation needs to be added or not
To analyse and group the reasons why editors tend to add citations or not, we used a procedure similar to the citation needed field analysis:
- I transformed each sentence into a numeric feature vector using fastText vectors trained on Wikipedia in the 3 languages (English, Italian, French) and aligned using the multilingual alignment matrices
- I split sentences into 'True' (reasons for adding a citation) and 'False' (reasons for not adding a citation)
- I computed K-Means using scikit-learn, varying k from 2 to 29.
- Using the elbow method, I identified k=14 as the best number of clusters for grouping 'True' reasons, and k=6 as the best k for clustering 'False' reasons.
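The K-Means sweep and elbow selection described in the steps above could look roughly like the sketch below. It is not the code used for this analysis: the sentence vectors are replaced by synthetic data, the helper names `inertia_curve` and `elbow_k` are hypothetical, and the elbow is picked with a simple "largest second difference" heuristic, which is an assumption (the elbow may well have been read off a plot by eye).

```python
import numpy as np
from sklearn.cluster import KMeans


def inertia_curve(X, k_values, seed=0):
    """Fit K-Means once per k and return the inertia (within-cluster sum of squares)."""
    return [
        KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X).inertia_
        for k in k_values
    ]


def elbow_k(k_values, inertias):
    """Return the k where the inertia curve bends most (largest second difference).

    This is only one possible elbow heuristic; it cannot select the first
    or the last k, because it needs a neighbour on each side.
    """
    second_diff = np.diff(inertias, n=2)
    return k_values[int(np.argmax(second_diff)) + 1]


# Synthetic stand-in for the sentence vectors (assumption): 3 well-separated
# blobs of 40 points each in 5 dimensions.
rng = np.random.default_rng(0)
centers = np.array([[0.0] * 5, [10.0] * 5, [-10.0] * 5])
X = np.vstack([c + rng.normal(scale=0.5, size=(40, 5)) for c in centers])

k_values = list(range(2, 30))  # k from 2 to 29, as in the procedure above
inertias = inertia_curve(X, k_values)
best_k = elbow_k(k_values, inertias)
print("elbow at k =", best_k)
```

In practice the same sweep would be run twice, once on the vectors of the 'True' reasons and once on the vectors of the 'False' reasons, and it is worth plotting `inertias` against `k_values` to check that the selected k matches the visible bend.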
Below are the 14 resulting clusters of reasons for adding a citation, and the 6 clusters of reasons for not adding one. Each cluster is illustrated by the sentences closest to its centroid; I named each cluster according to the nature of the sentences in it.
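As an illustration of how the sentences closest to a centroid can be retrieved, here is a minimal sketch. The function name `closest_to_centroids` and the toy data are hypothetical, and it assumes Euclidean distance and a ranking over all sentences (not only those assigned to the cluster); the original analysis may have done this differently.

```python
import math


def closest_to_centroids(vectors, centroids, texts, top_n=6):
    """For each centroid, return the top_n texts whose vectors are nearest to it.

    Distances are Euclidean (assumption). All sentences are ranked against
    every centroid, regardless of their cluster assignment.
    """
    result = []
    for centroid in centroids:
        ranked = sorted(
            range(len(vectors)),
            key=lambda i: math.dist(vectors[i], centroid),
        )
        result.append([texts[i] for i in ranked[:top_n]])
    return result


# Toy 2-D example with two obvious groups (made-up data).
vectors = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
texts = ["a", "b", "c", "d"]
centroids = [(0.0, 0.2), (10.0, 10.9)]

nearest = closest_to_centroids(vectors, centroids, texts, top_n=2)
print(nearest)  # [['a', 'b'], ['d', 'c']]
```

With a fitted scikit-learn `KMeans` model, the `centroids` argument would be the model's `cluster_centers_` attribute.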
Reasons for adding a citation
1) Unusual terms / Opinions needing citations:
*'True:"discount" is unusual enough to be challenged, especially with the change of heart as well', 'True:Verification is needed to support the statement about why the majority grew', 'True:descrive aspetti molto importanti del soggetto, sarebbe utile avere una fonte a supporto', "True:l'information n'est pas notoire et mériterait de pouvoir étre vérifiée par une personne qui en douterait", 'True:un rapport de causalité qui peut en fait être une supposition', "True:aucune source dans le paragraphe entier. Cette phrase devrait être car elle exprime une position d'une personne"
2) Little explanation:
*'True:citation', 'True:citations', 'True:citation', 'True:citation', 'True:citation'
3) Not common knowledge:
*'True:body fact without current source and not common knowledge', 'True:Anything with this much certainty (using works like every and more often than any other) should have a citation', 'True:This paragraph needs a citation overall, but not quite sure that this exact statement requires one', 'True:body fact without current source and not common knowledge', 'True:body fact without current source and not common knowledge', 'True:body fact without current source and not common knowledge', 'True:body fact without current source and not common knowledge', 'True:body fact without current source and not common knowledge'
4) Factual statements:
*'True:This statement asserts that McClintock discovered something and therefore should include a citation. Actually all three statements should have a reference. Sentence 2 reference to her theories and sentence 3, the comment about the role of skepticism and her decision to stop publishing could be controversial.', 'True:It contains a number of factual statements. They could each of have independent citations or I guess they could have all come from one source and thus the citation could be applied to the whole paragraph. ', 'True:This seems to include a kind of citation to the mention in the Times, but there should be an actual citation to the quoted Times information.', 'True:describes the intent and choices of USCG: that should be reflected in the source, or ddescribed by another source', "True:Nonostante la voce abbia un'ampia bibliografia non si sa se l'informazione sia vera (anche perché é piuttosto particolare, forse andrebbe segnato tutto il paragrafo), servono comunque dei riferimenti bibliografici sottoforma di nota per poter risalire alla singola informazione,", "True:innanzitutto, si tratta di un paragrafo che dovrebbe essere riscritto in modo tale da poter essere più neutrale e comunque si parla di un episodio importante della vita del soggetto, quindi c'è bisogno di verificare i giudizi contenuti"
5) Historical statements:
*'True:donnees (historiques) a sourcer etc.'
6) Bug: 2 different sentences:
*"True:caveat: two sentences, the second one doesn't need a source", 'True:citation (should be verbatim and verifiable though a source) ; beggining of the part badly detected (should take the whole citation, not just the last part)', 'True:sources are here but not explicitely linked', 'True:"officialy named" so a source is available and should be provided', 'True:important fact and place of mariage should be easy to source (+ caveat, there is two different sentences).'
7) Subjective and private statements:
*'True:vie privee + romance + supputation', 'True:données (historiques) etc... + affirmation subjective'
8) Original Research:
*'True:Servono fonti sulle presunte voci', 'True:données (historiques) a sourcer comme les autres', 'True:données (sportives) a sourcer avec les autres', 'True:données (historiques) a sourcer comme les autres + citation inclue', 'True:Assertion asourcer et phrase a récrire.', 'True:donnees factuelles (sportives) a sourcer comme les autres'
*"True:It's a quote", 'True:sounds like a quote', 'True:direct quote ', 'True:This feels like speculation', 'True:Feels like speculation ', "True:It's a quote"
11) Private life of subject:
12) General explanation:
*'True:Assertion a sourcer.'
13) Complex Sentence:
*'True:affirmation complexe', 'True:affirmation complexe'
14) Detailed information on subject needing citation:
*'True:References to provide verifications to the various honors and their dates would be appreciated. ', 'True:contiene dettagli della trama del film che andrebbero supportati da fonte', 'True:Date e dettagli che necessitano di supporto esterno', 'True:Opinioni estremamente personali e soggettive del personaggio che andrebbero confermate da qualcosa di ufficiale', 'True:données (historiques) a sourcer comme les autres + il est indiqué ensuite que les sources sont rares !', 'True:fait précis et important (les sources ne doivent pas manquer)'
Reasons for NOT adding a citation
1) Summary statement
*"False:The thesis statement doesn't need a citation, but the subsequent statements do. Summary statements should be able to be written without a citation. ", "False:The thesis statement doesn't need a citation, but the subsequent statements do. Summary statements should be able to be written without a citation. ", "False:The thesis statement doesn't need a citation, but the subsequent statements do. Summary statements should be able to be written without a citation. ", "False:The thesis statement doesn't need a citation, but the subsequent statements do. Summary statements should be able to be written without a citation. ", 'False:it would be good for the overall paragraph to carry a reference rather than this specific sentence', 'False:THis seems like a straightforward statement, but verfication of the date would be appreciated'
*'False:Bug ?', 'False:Bug ?', 'False:quoi ?', 'False:Bug ?', 'False:Bug ?', 'False:Bug ?'
*'False:Guillemet'
4/5) Lead Section / Intro
*'False:lead', 'False:lead', 'False:lead', 'False:lead'
*'False:Intro'
6) Citation not needed if contextual sentences have one
*'False:effectively already cites the Domesday Book which would be the only possible citation', "False:a citation is always useful, but Brod's initiative after Kafka's death is well known and covered in the Franz Kafka article", 'False:no specific need for a citation in the context of the wider narrative', 'False:content of the book is sourced to the book itself and needs no cite', 'False:This feels like a string of biographical facts. However, some of the statements could/should be linked to citations (e.g. part of the Fab Five)', "False:le paragraphe entier manque de sources ... La phrase selectionnee n'est pas forcement celle qui necessite le plus une source (une source generale sur l'atmosphere politique a cette epoque permettrait de sourcer en bloc une partie du paragraphe)"