Grants talk:IEG/StrepHit: Wikidata Statements Validation via References/Final
Thank you for submitting this exceptional Final Report. Your contribution to the Wikimedia movement with this project is significant, and we're so glad you're interested in continuing. Congratulations on several major achievements of this report: the corpus, the relations set, the code, and the datasets. It's a sign of your dedication that in addition to these major outputs, you still made time for bonus work, like a prototype dataset for Wiki Loves Monuments in Italy. What you have achieved in the scale of this project is really amazing. Thank you for all of your hard work!
I'm capturing below some of the discussion we covered in our meeting with Leila Zia, in case it might be of interest to others reading your report. Of course, you should feel free to edit it.
- Why do these datasets need to go through the primary sources tool? What stops us from using them directly in Wikidata?
- There are different datasets from StrepHit and I believe there is one that could be used as it is. Some come from automatic classification approaches and are not 100% correct, so they definitely need to undergo a human validation step.
- Given that we have bots on Wikipedia that don't have a 100% accuracy rate, why does yours have to undergo human validation? Is the accuracy/performance not high enough? What error rate can we tolerate for a bot to go ahead?
- It really depends on a lot of different things. First, from a technical/machine learning perspective, there is an evaluation based on classical standard evaluation measures. The performance of supervised classification depends on the ambiguity of the lexical units that must be classified in free text. So this is one part, but I don't think it’s the most important, in the end. Yes, we may have wrong classifications, but this varies a lot depending on what we want to extract. You can see the classification output section of the final report. The best lexical units already work very well, and we may isolate them so they could go a separate way, maybe directly into Wikidata through a bot. This is the strict machine learning part. The most difficult challenge is finding a suitable way to map this kind of lexical-semantic knowledge, coming from verbs in the corpus, to Wikidata properties. I’ve been using the frame approach and I’ve found that it’s hard to find preexisting properties in Wikidata that would be perfect fits for the frames and frame elements that I extract from free text. This is the main reason for the need for human validation. There is some discussion on the renewal talk page about this kind of mapping: some results look like wrong statements, but they are actually not wrong from an automatic point of view. Rather, they are incorrect from a knowledge representation point of view. So this is the key point. Of course, we need to improve the classification performance in general; I would say I just need more training data. As I said, the most critical point was what I called the lexical database. This is a language resource which was built from scratch, and I think there is a lot of work to be done on it to improve the final production of the dataset.
- You sampled 48 claims and manually assessed whether the supervised and rule-based models do well on them or not. Out of how many?
- Out of 1 million from the supervised model and 900,000 from the rule-based one.
- Why did you sample 48?
- I actually sampled 50, and had to remove 2. I don’t remember why. There is no specific meaning in that 48.
- You chose to sample 50 because of the manual work?
- Yes. This is a final claim evaluation related only to the classification output. The data is not yet translated into Wikidata language. It's just free text that gets classified into something more structured. There is a different evaluation that was also carried out, so there are two completely different evaluations. There is no standard way to evaluate correctness other than splitting up the different modules responsible for the final output and evaluating each one.
- Number of articles added or improved: 2.6 million claims. These are claims that can potentially be added or improved, right? These are the things that need to be fed to the primary sources tool and go to Wikidata, right?
- This is the final output of StrepHit, and it may contain either claims that don't yet exist in Wikidata or claims that already exist in Wikidata. If a claim is already there, that is a good signal that StrepHit produced a correct statement.
- References have not been added, correct?
- They are in the primary sources tool, too, but not added to Wikidata.
- So 2.6 million is the upper bound of what we can add, if everything goes through?
- We generally report the number of claims with missing references. Have you compared to see how much the gap can close?
- There is a specific service that counts these statistics. To compute this kind of thing, it would make sense to do it on a per-property basis. Each property has a number of unsourced claims, then you just sum them. I didn’t do that. I did it the opposite way: at the beginning of the project, I looked at unsourced properties and I tried to use the ones that were the most unsourced. This is reported in the monthly reports, where there is a big table in which I discuss which could be the most useful properties. I have produced references for a set of properties that had a high percentage of unsourced claims.
- If you were to repeat the past six months, what support would you want to see from WMF research team?
- The most helpful support would be community-oriented. I have struggled to try to teach people to use the primary sources tool. There is a steep learning curve for this tool, I would say. This is the big thing. It's an issue of human-computer interaction or user interfaces in general. Basically, StrepHit won't be useful if it doesn’t get into the target knowledge base. I have great datasets, but if they don’t get adopted into Wikidata, they can't be used. That’s why I would like my renewal request to shift a lot of the focus to the primary sources tool. This is one of the main rationales behind the renewal request.
It's been a pleasure to work with you from start to finish, Hjfocs. I am approving this report now and will be in touch with you about next steps with the renewal. Our grants administrator will follow up about closing out this grant.