Research talk:Identification of Unsourced Statements

Assumption on direction of mistakes

The text seems to assume that, if anything, we need more "citation needed" templates, but in reality we might just as well have too many. The same is true for the warnings at the top of articles. A common reason is that people who add sources, citations and footnotes, or even remove questionable text, don't remove the corresponding warnings. The test being proposed here is very complicated, while it would be rather simple to test existing templates and find how many would need to be removed. You can then probably find some reliable correlations (just like an article marked as a stub is unlikely to really be a stub if it has increased tenfold in size since being marked so). Nemo 19:00, 18 December 2017 (UTC)

I believe the assumption is that we need a more scalable and dynamic way to be able to identify sentences that probably need citations. The potential applications of the model (which are beyond the scope of the current project) could include a tool that lets people go around adding citation needed tags all over the place, but that isn't the point of any of the applications that have been sketched out so far. Jmorgan (WMF) (talk) 22:47, 20 December 2017 (UTC)

Thanks for this comment! This is a great observation. And possibly, the tool could be used to help reduce the backlog of sentences flagged with a 'citation needed' tag. If successful, the tool would consist of a classifier that, given a sentence, can output 1) a positive (needing citation) / negative (not needing citation) 'citation_needed' label and 2) a confidence score reflecting how likely it is that the statement actually needs a citation. When running the classifier on sentences already flagged as 'citation needed', we could then a) recommend tag removal when the classifier does not detect the sentence as needing a citation and b) rank the 'positive' sentences according to the confidence score, thus surfacing the sentences that definitely need to be sourced. --Miriam (WMF) (talk) 11:26, 21 December 2017 (UTC)
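The triage described in this reply could be sketched roughly as follows. This is only an illustration, not the project's implementation: the `Prediction` type, the `triage_flagged` function, the 0.5 threshold and the hard-coded toy scores are all hypothetical stand-ins for a trained classifier.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Prediction:
    sentence: str
    needs_citation: bool  # positive / negative 'citation_needed' label
    confidence: float     # estimated likelihood that a citation is needed


def triage_flagged(
    sentences: List[str],
    score_fn: Callable[[str], float],
    threshold: float = 0.5,  # assumed cut-off, not taken from the project
) -> Tuple[List[Prediction], List[Prediction]]:
    """Split already-flagged sentences into (keep_ranked, removal_candidates)."""
    preds = []
    for sentence in sentences:
        score = score_fn(sentence)
        preds.append(Prediction(sentence, score >= threshold, score))
    # b) rank the positive sentences, most confident first
    keep_ranked = sorted(
        (p for p in preds if p.needs_citation),
        key=lambda p: p.confidence,
        reverse=True,
    )
    # a) sentences the classifier does not detect become tag-removal candidates
    removal_candidates = [p for p in preds if not p.needs_citation]
    return keep_ranked, removal_candidates


# Toy stand-in for a trained model: invented scores for invented sentences.
toy_scores = {
    "Critics praised the film.": 0.71,
    "The sky is blue.": 0.12,
    "The city has 2 million inhabitants.": 0.93,
}
keep_ranked, removal_candidates = triage_flagged(
    list(toy_scores), lambda s: toy_scores[s]
)
```

With these toy scores, the two sentences above the threshold are kept and ordered by confidence, while the remaining one is only recommended for tag removal, leaving the final decision to an editor.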

English Wikipedia

Is this about the English Wikipedia only? I read "Wikipedia" but I only see English Wikipedia links. Nemo 19:00, 18 December 2017 (UTC)

See Research:Identification_of_Unsourced_Statements#Proposed_Solution. The model is intended to work across languages. Research:Identification_of_Unsourced_Statements/Labeling_Pilot_planning has more information. Jmorgan (WMF) (talk) 22:42, 20 December 2017 (UTC)

Please also see Research:Identification_of_Unsourced_Statements/Labeling_Pilot_planning for more details on how to annotate multilingual data for this task. --Miriam (WMF) (talk) 11:29, 21 December 2017 (UTC)

On the blog post

At Can machine learning uncover Wikipedia’s missing “citation needed” tags? I read "As Wikipedia’s verifiability policy mandates, …" Nope, this is not "Wikipedia", it is "English Wikipedia". I have said several times (Wikimedia-l: Core content policy) that we need centralized policies, but nothing has happened. — Jeblad 21:26, 28 April 2019 (UTC)

Found the statement "The resulting model can correctly classify sentences in need of citation with an accuracy of up to 90%." I would claim (totally unsourced) that this number is flawed. Sampling sentences will give sentences closer to the mean, while the [manually] marked sentences [in Wikipedia] are more difficult than the mean, so you tend to solve an easier problem. The result is probably quite good anyhow. — Jeblad 21:43, 28 April 2019 (UTC)

Jeblad You're right about the blog post; I didn't specify which Wikipedia (we're more careful with our definitions in the paper). Miriam (WMF) is better equipped to address your second point. Thanks for the feedback! Cheers, Jmorgan (WMF) (talk) 20:51, 1 May 2019 (UTC)

Jeblad I somehow missed this comment, sorry about that! Thank you for your feedback. While in the blog post we talk about a few main experiments only, in the full paper, and now in the updated version of this page, we report more extensive results on less "easy" datasets, including the "citation needed" dataset, which considers as positive instances only those sentences with a [citation needed] tag. Miriam (WMF) (talk) 16:49, 18 June 2019 (UTC)

Miriam (WMF) All those papers I really should have dug into in all their nerdy details… — Jeblad 21:02, 18 June 2019 (UTC)

Miriam (WMF) Some points (Nice paper!)

You should probably spell out what each of eq 1-4 is on page 6.
You should probably repeat what FA, LQN, and RND mean at Table 3. Probably just me, I read the first part and then picked it up again much later.
Not quite sure why point-biserial correlation is used; there is nothing continuous here? That is not really an argument against using the method.
When you optimize for accuracy you will get a restrictive classifier. (Optimizing for accuracy will maximize the larger one of true positive or true negative.) The consensus process has the same effect. You may say the process erodes the finer points, only captured by experts, leading to less use of “citation needed”.
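The last point, that optimizing for accuracy favours a restrictive classifier, can be illustrated with a toy calculation. This is a sketch under an explicit assumption: the sample is imbalanced (10 of 100 sentences need a citation). All labels and predictions below are invented and say nothing about the datasets used in the paper.

```python
def accuracy(y_true, y_pred):
    """Fraction of sentences whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred):
    """Fraction of truly citation-needing sentences that were detected."""
    hits = [p for t, p in zip(y_true, y_pred) if t == 1]
    return sum(hits) / len(hits)


# Hypothetical sample: 10 sentences need a citation (1), 90 do not (0).
y_true = [1] * 10 + [0] * 90

# Restrictive classifier: never asks for a citation.
always_negative = [0] * 100

# More permissive classifier: finds 8 of the 10 positives,
# at the cost of 15 false alarms among the 90 negatives.
permissive = [1] * 8 + [0] * 2 + [1] * 15 + [0] * 75

acc_restrictive = accuracy(y_true, always_negative)
rec_restrictive = recall(y_true, always_negative)
acc_permissive = accuracy(y_true, permissive)
rec_permissive = recall(y_true, permissive)
```

In this invented sample the restrictive classifier scores higher on accuracy even though it detects none of the sentences that need a citation, which is why recall or a similar measure is usually reported alongside accuracy when classes are imbalanced.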

It isn't clear from the paper if you used transfer learning from existing citations, but you can use transfer learning from sentences with references, and just tune the dense layer with your training data for citation needed and citation reason. Because users tend to change the preceding sentences slightly when adding references, they will not be quite the same. They should still be similar enough that transfer learning works properly. I would guess it will generalize better due to the larger training set.

An RNN is nothing more than a fancy feature extractor with memory, which makes it able to infer a single statement. It will make a global statement over the memorized current state. The additional layers make the global statement over several local statements. It is like reasoning over a single premise vs reasoning over several premises. Still, you are reasoning over a single sentence, and it is limited how much additional information you can reuse from other sentences.

I'm not quite sure, but I wonder if you could get somewhat better results if you use a two-layer RNN. To do that you probably need more training data, that is, use transfer learning and train on existing citations. — Jeblad 23:50, 29 June 2019 (UTC)

Jeblad Thank you SO MUCH for your feedback! A few quick replies:

Thanks for the comments - we will revise the manuscript!
We use point-biserial as one variable is binary (citation need) and the other (word frequency) is continuous.
We did use transfer learning exactly in this way :) Still, the data is small, and that is why results on citation reason prediction are not so satisfying :(
The fact that we reason over a single sentence is definitely a limitation - we are not quite able to capture typical article patterns such as references cited elsewhere in the article. In the paper, we decided to go for the simplified task, as a lot of work has already been done. In the interface/API we are thinking of building around these models (see this page), we are thinking of returning not only the "citation need" score, but also the most similar sentences in the article, so that citations from similar sentences can be transferred where possible, or the citation needed flag can be discarded! Miriam (WMF) (talk)
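For readers following the point-biserial exchange above, here is a minimal sketch of that coefficient for one binary variable (citation need) and one continuous variable (word frequency). The function names and the toy numbers are hypothetical; the sketch also checks the standard fact that the point-biserial coefficient equals Pearson's r computed on the same data with the binary variable coded as 0/1.

```python
import math


def point_biserial(binary, continuous):
    """r_pb = (M1 - M0) / s_n * sqrt(n1 * n0 / n^2), with s_n the population std."""
    n = len(binary)
    group1 = [x for b, x in zip(binary, continuous) if b == 1]
    group0 = [x for b, x in zip(binary, continuous) if b == 0]
    n1, n0 = len(group1), len(group0)
    mean1 = sum(group1) / n1
    mean0 = sum(group0) / n0
    mean_all = sum(continuous) / n
    std_all = math.sqrt(sum((x - mean_all) ** 2 for x in continuous) / n)
    return (mean1 - mean0) / std_all * math.sqrt(n1 * n0 / (n * n))


def pearson(xs, ys):
    """Plain Pearson correlation, used here only as a cross-check."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Invented toy data: 1 = sentence needs a citation, 0 = it does not;
# word_frequency is a made-up relative frequency of some word per sentence.
needs_citation = [1, 1, 1, 0, 0, 0]
word_frequency = [0.9, 0.7, 0.8, 0.2, 0.4, 0.3]

r_pb = point_biserial(needs_citation, word_frequency)
r_pearson = pearson(needs_citation, word_frequency)
```

A positive value here would mean the word tends to be more frequent in sentences labelled as needing a citation; the toy data is constructed so that this is the case.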