Research:Discovering content inconsistencies between Wikidata and Wikipedia/First explorations

From Meta, a Wikimedia project coordination wiki

There has been little research on the consistency between Wikidata and Wikipedia pages, and no existing solution compares Wikipedia content with the corresponding Wikidata information. Our goal is to create a mapping between Wikidata and Wikipedia content and to measure the consistency between the two. This page summarizes initial results from using Wikipedia content and the corresponding Wikidata information to detect inconsistencies.

Data & Methods

Data

Wikipedia covers many topics; for this research we collected Wikipedia and Wikidata information about biographies. For Wikipedia we have the full page contents, and for Wikidata we have the corresponding properties and values. Because no labeled data exists for the consistency between Wikidata and Wikipedia content, and we want to apply supervised learning to this task, some data preprocessing is needed first.

Data Preprocessing

  • First, because Wikidata has numerous properties, we keep only the five most frequent meaningful properties: occupation, educated at, place of birth, member of sports team, and country of citizenship. We sampled 2000 Wikidata claims for each property to build our datasets.
  • Second, since a whole Wikipedia page is too large to compare with a single Wikidata statement, we work at the sentence level. We focus on whether one Wikidata claim and one Wikipedia sentence from the same page are consistent.
  • Third, we label the pairs automatically. This is a two-class classification problem (consistent or inconsistent). Examples are presented in Table 1, and the dataset statistics are presented in Table 2.

Label=0: Consistent. If all the words of a Wikidata value, or of one of its aliases, are mentioned in the Wikipedia sentence, we consider the pair consistent.

Label=1: Inconsistent. For each label=0 pair, we sample other values of the same property to generate twice as many inconsistent pairs.
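The labeling rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code used in the project: the function names `is_consistent` and `sample_inconsistent` are hypothetical, the tokenization is a simple word split, and skipping sampled values that happen to appear in the sentence is our own assumption.

```python
import random
import re

def is_consistent(sentence, value, aliases=()):
    """Label-0 rule: every word of the value, or of one alias, occurs in the sentence."""
    tokens = set(re.findall(r"\w+", sentence.lower()))
    for name in (value, *aliases):
        words = re.findall(r"\w+", name.lower())
        if words and all(word in tokens for word in words):
            return True
    return False

def sample_inconsistent(pair, value_pool, k=2, rng=random):
    """Label-1 rule: pair the same sentence with k other values of the same property."""
    sentence, prop, value = pair
    # Assumption: skip values that are themselves mentioned in the sentence.
    candidates = [v for v in value_pool
                  if v != value and not is_consistent(sentence, v)]
    return [(sentence, prop, v, 1) for v in rng.sample(candidates, k)]

sentence = "Biography: He was born in Pavlovsky Posad near Moscow."
positive = (sentence, "place of birth", "Pavlovsky Posad")
pool = ["Pavlovsky Posad", "Atlanta", "Fuzhou", "Hamburg"]
negatives = sample_inconsistent(positive, pool, rng=random.Random(0))
```

Each consistent pair yields two inconsistent pairs, which matches the 1:2 ratio between Label=0 and Label=1 in the dataset statistics.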


Table 1: Examples of consistent and inconsistent data
Label             Examples
Consistent: 0     Sentence: 'Biography: He was born in Pavlovsky Posad near Moscow.'
                  Wikidata claim: 'place of birth Pavlovsky Posad'
Inconsistent: 1   Sentence: 'At the age of thirteen, he entered Dulwich College.'
                  Wikidata claim: 'educated at Shippensburg University of Pennsylvania'

Table 2: Statistics of the dataset for each property
Label     Place of birth   Occupation   Country of citizenship   Educated at   Member of sports team
Label=0   871              315          1676                     2524          1654
Label=1   1742             630          3352                     5048          3308
Total     2613             945          5028                     7572          4962

Table 2 shows that the property “occupation” is mentioned far less often in the Wikipedia content (315 consistent pairs), while the properties “educated at”, “country of citizenship”, and “member of sports team” are mentioned more often in their corresponding Wikipedia content (2524, 1676, and 1654 consistent pairs, respectively).

Methods

Because our task is to discover the consistency between Wikidata and Wikipedia, we develop a Co-attention+NSMN deep learning model to detect it. NSMN (Neural Semantic Matching Network) is a model from prior work on the FEVER dataset. Our model consists of four components. The first is Wikidata and sentence encoding, which generates representations of the words in the Wikidata claim and in the sentence. The second is the co-attention mechanism, which captures the correlation between the Wikidata claim and the sentence. The third is the Neural Semantic Matching Network, another way to capture the correlation between the Wikidata claim and the sentence. The last is prediction, which generates the detection output by concatenating the representations learned by the co-attention mechanism and the neural semantic matching network. The code of the model is provided on GitHub: https://github.com/l852888/Wikipedia-Wikidata-Alignment

Figure: Model Framework

Wikidata and Sentence Encoding

The given Wikidata statement and sentence are represented by a word-level encoder. We use pre-trained GloVe word embeddings to obtain the initial vector representations of words. Let $X^c = [x^c_1, \ldots, x^c_m]$ and $X^s = [x^s_1, \ldots, x^s_n]$ be the input vectors of the Wikidata claim and the sentence, in which $x^c_m$ and $x^s_n$ are the pre-trained word embeddings of the m-th and n-th word, respectively. We then apply an LSTM layer to learn the correlation between the words and obtain new word embeddings. We denote the resulting Wikidata and sentence representations as $C \in \mathbb{R}^{d \times m}$ and $S \in \mathbb{R}^{d \times n}$, where d is the dimensionality of the word embeddings.

Co-attention Mechanism

We think consistency can be revealed by investigating which parts of the Wikidata statement are addressed by which parts of the sentence, and that a deep learning method can learn this correlation between the Wikidata claim and the sentence. Therefore, we develop a co-attention mechanism to model the mutual influence between the Wikidata claim (i.e., $C$) and the sentence (i.e., $S$).

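We first compute a proximity matrix $F \in \mathbb{R}^{m \times n}$ as

$$F = \tanh(C^\top W_{cs} S),$$

where $C \in \mathbb{R}^{d \times m}$ and $S \in \mathbb{R}^{d \times n}$ are the Wikidata claim and sentence representations, and $W_{cs} \in \mathbb{R}^{d \times d}$ is a matrix of learnable parameters. By treating the proximity matrix as a feature, we can learn to predict the Wikidata and sentence attention maps, given by

$$H^c = \tanh\big(W_c C + (W_s S) F^\top\big), \qquad H^s = \tanh\big(W_s S + (W_c C) F\big),$$

where $W_c, W_s \in \mathbb{R}^{k \times d}$ are matrices of learnable parameters. The proximity matrix $F$ and its transpose can be thought of as transforming the Wikidata claim attention space into the sentence attention space, and vice versa. We then generate the attention weights of the Wikidata claim and the sentence through the softmax function:

$$a^c = \mathrm{softmax}(w_{hc}^\top H^c), \qquad a^s = \mathrm{softmax}(w_{hs}^\top H^s),$$

where $a^c \in \mathbb{R}^{1 \times m}$ and $a^s \in \mathbb{R}^{1 \times n}$ are the vectors of attention probabilities for each word in the Wikidata statement and each word in the sentence, respectively, and $w_{hc}, w_{hs} \in \mathbb{R}^{k}$ are learnable weight vectors. Finally, we generate the attention vectors of the Wikidata claim and the sentence as weighted sums using the derived attention weights:

$$\hat{c} = \sum_{i=1}^{m} a^c_i \, c_i, \qquad \hat{s} = \sum_{j=1}^{n} a^s_j \, s_j,$$

where $c_i$ and $s_j$ are the i-th and j-th columns of $C$ and $S$, and $\hat{c}, \hat{s} \in \mathbb{R}^{d}$ are the learned co-attention feature vectors.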
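The four steps described above (proximity matrix, attention maps, softmax attention weights, weighted sums) can be sketched as follows. This is a minimal illustration in plain Python, assuming the commonly used co-attention formulation with tanh activations; it is not the project's implementation, which would use a tensor library. All names, shapes, and the random parameters are hypothetical.

```python
import math
import random

def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def tanh_of_sum(A, B):
    """Element-wise tanh(A + B) for two matrices of the same shape."""
    return [[math.tanh(a + b) for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def softmax(v):
    top = max(v)  # subtract the max for numerical stability
    exps = [math.exp(x - top) for x in v]
    total = sum(exps)
    return [e / total for e in exps]

def co_attention(C, S, W_cs, W_c, W_s, w_hc, w_hs):
    """C is d x m (claim words as columns), S is d x n (sentence words as columns)."""
    # Proximity matrix F = tanh(C^T W_cs S), shape m x n.
    F = [[math.tanh(x) for x in row]
         for row in matmul(matmul(transpose(C), W_cs), S)]
    WcC, WsS = matmul(W_c, C), matmul(W_s, S)            # k x m and k x n
    H_c = tanh_of_sum(WcC, matmul(WsS, transpose(F)))    # claim attention map, k x m
    H_s = tanh_of_sum(WsS, matmul(WcC, F))               # sentence attention map, k x n
    a_c = softmax(matmul([w_hc], H_c)[0])                # one weight per claim word
    a_s = softmax(matmul([w_hs], H_s)[0])                # one weight per sentence word
    # Attention vectors: weighted sums over the word columns.
    c_hat = [sum(a * x for a, x in zip(a_c, row)) for row in C]
    s_hat = [sum(a * x for a, x in zip(a_s, row)) for row in S]
    return a_c, a_s, c_hat, s_hat

def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
d, k, m, n = 4, 3, 2, 5  # toy sizes: embedding dim, attention dim, claim and sentence lengths
C, S = rand_matrix(d, m, rng), rand_matrix(d, n, rng)
params = (rand_matrix(d, d, rng), rand_matrix(k, d, rng), rand_matrix(k, d, rng),
          rand_matrix(1, k, rng)[0], rand_matrix(1, k, rng)[0])
a_c, a_s, c_hat, s_hat = co_attention(C, S, *params)
```

In a trained model the parameter matrices would be learned jointly with the rest of the network rather than drawn at random.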

Neural Semantic Matching Network

We also consider another deep learning method, NSMN, which was previously proposed for the FEVER dataset. This mechanism performs semantic matching between two textual sequences, so we use it to learn another representation of the relation between the Wikidata claim and the sentence.

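First, we compute an alignment matrix between the claim representation $C$ and the sentence representation $S$ as

$$E = C^\top S.$$

The model then computes, for each sequence, the relevant semantic component from the other sequence using a weighted sum:

$$\tilde{C} = S \cdot \mathrm{softmax}_{col}(E^\top), \qquad \tilde{S} = C \cdot \mathrm{softmax}_{col}(E),$$

where $\mathrm{softmax}_{col}$ is the column-wise softmax function. The aligned representations are then combined:

$$A = f([C, \tilde{C}, C - \tilde{C}, C \circ \tilde{C}]), \qquad B = f([S, \tilde{S}, S - \tilde{S}, S \circ \tilde{S}]),$$

where $f$ is one affine layer and $\circ$ indicates element-wise multiplication. After that, the matching sequences are produced by a recurrent network:

$$P = \mathrm{RNN}(A), \qquad Q = \mathrm{RNN}(B).$$

The two matching sequences are projected onto two compressed vectors by max-pooling, $p = \mathrm{MaxPool}(P)$ and $q = \mathrm{MaxPool}(Q)$, and these vectors are mapped to the final output $m$ by a function $h$: $m = h(p, q)$.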
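The alignment and pooling steps described above can be sketched in plain Python. This is a simplified illustration, assuming the alignment matrix is the dot product between word representations; the affine layer and the recurrent network are omitted, and the names `align` and `max_pool_rows` are hypothetical.

```python
import math

def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def softmax_columns(A):
    """Apply softmax to each column of A independently."""
    columns = []
    for col in zip(*A):
        top = max(col)
        exps = [math.exp(x - top) for x in col]
        total = sum(exps)
        columns.append([e / total for e in exps])
    return transpose(columns)

def align(C, S):
    """C is d x m (claim), S is d x n (sentence); columns are word vectors."""
    E = matmul(transpose(C), S)                           # alignment matrix, m x n
    C_tilde = matmul(S, softmax_columns(transpose(E)))    # sentence content per claim word, d x m
    S_tilde = matmul(C, softmax_columns(E))               # claim content per sentence word, d x n
    return E, C_tilde, S_tilde

def max_pool_rows(P):
    """Max over the sequence axis: one value per feature row."""
    return [max(row) for row in P]

# Toy input: two claim words (identity columns) and three identical sentence words.
C = [[1.0, 0.0],
     [0.0, 1.0]]
S = [[2.0, 2.0, 2.0],
     [3.0, 3.0, 3.0]]
E, C_tilde, S_tilde = align(C, S)
pooled = max_pool_rows(S_tilde)
```

Because every sentence word in the toy input is the same vector, each column of `C_tilde` equals that vector, which makes the weighted-sum behaviour easy to check by hand.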

Make Prediction

We aim to predict the consistency between the Wikidata statement and the sentence using the co-attention feature vectors $\hat{c}$, $\hat{s}$ and the NSMN feature vector $m$. Let $f = [\hat{c}, \hat{s}]$; $f$ and $m$ are each fed into a multi-layer feedforward neural network, and the outputs are concatenated into a vector $z$ to predict the label. We generate the binary prediction vector $\hat{y} = [\hat{y}_0, \hat{y}_1]$, where $\hat{y}_0$ and $\hat{y}_1$ indicate the predicted probabilities of the label being 0 and 1, respectively. It is derived through

$$\hat{y} = \mathrm{softmax}(W_f z + b_f),$$

where $W_f$ is the matrix of learnable parameters and $b_f$ is the bias term.
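The final step can be illustrated with a short sketch: the two feature vectors are concatenated, passed through one linear layer, and normalized with softmax. The function name `predict`, the toy weights, and the use of a single linear layer without further activation are assumptions made for illustration.

```python
import math

def softmax(v):
    top = max(v)  # subtract the max for numerical stability
    exps = [math.exp(x - top) for x in v]
    total = sum(exps)
    return [e / total for e in exps]

def predict(f_out, m_out, W, b):
    """Concatenate the two feature vectors and return [P(label=0), P(label=1)]."""
    z = list(f_out) + list(m_out)
    logits = [sum(w * x for w, x in zip(row, z)) + bias for row, bias in zip(W, b)]
    return softmax(logits)

# Toy values standing in for the outputs of the two feedforward networks.
y_hat = predict([0.2, -0.1], [0.7],
                W=[[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]], b=[0.0, 0.0])
label = max(range(2), key=lambda i: y_hat[i])  # 0 = consistent, 1 = inconsistent
```

The second entry of `y_hat` is the predicted probability that the pair is inconsistent, which is the score one would sort by for a ranking metric.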

Preliminary Experiments

We run experiments on each property and compare the performance of different models.

Metrics & Settings

The evaluation metrics include Accuracy, Precision, Recall, and F1. In this research we focus on performance on the inconsistent data, which Recall reflects. It is also essential to identify the pairs most likely to be inconsistent. We therefore sort the test data by predicted probability to check whether the Wikidata and sentence pairs with the highest predicted probability are indeed inconsistent. We treat this as a ranking problem and report Precision@50. We randomly choose 60% of the data for training, 20% for validation, and 20% for testing.
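The metrics and the data split described above can be sketched as follows. This is an illustrative sketch rather than the evaluation code used for the experiments: the function names are hypothetical, and we assume the inconsistent class (label 1) is treated as the positive class.

```python
import random

def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall, and F1 for the positive (inconsistent) class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def precision_at_k(y_true, prob_inconsistent, k=50):
    """Share of truly inconsistent pairs among the k highest-scored pairs."""
    ranked = sorted(zip(prob_inconsistent, y_true), key=lambda pair: pair[0], reverse=True)
    top = ranked[:k]
    if not top:
        return 0.0
    return sum(1 for _, label in top if label == 1) / len(top)

def split_60_20_20(items, seed=0):
    """Random 60% / 20% / 20% split into training, validation, and test sets."""
    items = list(items)
    random.Random(seed).shuffle(items)
    first, second = len(items) * 6 // 10, len(items) * 8 // 10
    return items[:first], items[first:second], items[second:]

precision, recall, f1 = precision_recall_f1([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
p_at_2 = precision_at_k([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1], k=2)
train, val, test = split_60_20_20(range(10))
```

As a check of the F1 formula against the reported numbers, the Co-attention+NSMN row for “place of birth” has Recall 79.00 and Precision 73.84, and 2 × 79.00 × 73.84 / (79.00 + 73.84) ≈ 76.33, which matches the reported F1.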

Main Results

The main results are shown in Table 3 to Table 7. The Co-attention+NSMN model achieves the highest accuracy on all five properties and the highest Precision@50 on four of them (it ties with Co-attention on “educated at”). The results suggest five insights:

  • The Co-attention+NSMN model achieves the best accuracy on every property and gives more balanced predictions. The two matching methods differ: NSMN uses fixed formulas to compute the relationship between the Wikidata claim and the sentence, whereas co-attention uses more learnable weights to learn the mutual influence within the pair automatically. Although the NSMN and co-attention models each perform reasonably well and have their own advantages, using the two matching methods together captures more complete information and improves performance.
  • The Precision@50 of the Co-attention+NSMN model ranges from 68% to 88%, which is at least as high as that of every other model on each property. This shows that when the predicted probability is very high, the Wikidata and sentence pair is very likely to be truly inconsistent.
  • The Recall of the co-attention model exceeds 95% on every property, but its accuracy is only between 62% and 69%. This shows that the co-attention model is very good at identifying inconsistent pairs but has difficulty classifying consistent pairs. NSMN is more balanced between Recall and Precision, which shows that it classifies consistent pairs better than the co-attention model. In sum, combining the information learned by the two methods gives the best Precision@50 and accuracy.
  • The properties “country of citizenship” and “occupation” show the largest gains in Precision@50 and accuracy when using the Co-attention+NSMN model, and all properties improve.
  • The current method addresses the limitations of relying only on string matching: it can capture some Wikidata and sentence pairs that do not match exactly but whose meanings are consistent.

Table 3: Main results of “place of birth” (all values in %)
Model                        F1      Rec     Pre     Acc     P@50
Co-attention+NSMN            76.33   79.00   73.84   67.87   68
NSMN                         70.20   63.55   78.41   64.62   65
Co-attention                 76.79   95.04   64.44   62.33   60
Self-attention               75.33   94.21   63.23   61.48   58
GRU attention                76.01   94.78   63.12   61.44   59
Bag of Words+random forest   71.33   80.17   64.25   57.74   58

Table 4: Main results of “occupation” (all values in %)
Model                        F1      Rec     Pre     Acc     P@50
Co-attention+NSMN            81.31   87.87   75.32   71.42   78
NSMN                         75.71   80.30   71.62   65.02   76
Co-attention                 81.26   96.26   69.94   68.78   70
Self-attention               79.35   93.18   69.10   66.13   66
GRU attention                80.23   95.09   67.33   67.23   67
Bag of Words+random forest   72.22   74.09   70.43   64.60   54

Table 5: Main results of “country of citizenship” (all values in %)
Model                        F1      Rec     Pre     Acc     P@50
Co-attention+NSMN            84.43   85.27   83.62   79.62   88
NSMN                         81.73   83.74   79.82   75.74   78
Co-attention                 78.16   98.00   65.00   64.51   66
Self-attention               72.41   83.12   64.14   58.94   66
GRU attention                73.78   84.63   62.93   58.86   65
Bag of Words+random forest   72.58   70.36   74.92   64.03   64

Table 6: Main results of “educated at” (all values in %)
Model                        F1      Rec     Pre     Acc     P@50
Co-attention+NSMN            77.65   79.52   75.88   69.76   72
NSMN                         76.20   79.52   73.16   67.19   64
Co-attention                 79.53   99.60   66.31   66.13   72
Self-attention               77.84   95.03   65.79   64.15   66
GRU attention                78.09   97.70   65.99   65.28   68
Bag of Words+random forest   68.44   70.51   66.47   60.75   64

Table 7: Main results of “member of sports team” (all values in %)
Model                        F1      Rec     Pre     Acc     P@50
Co-attention+NSMN            76.62   78.86   74.51   69.38   82
NSMN                         75.86   86.75   67.40   64.75   72
Co-attention                 77.34   98.26   63.76   63.24   78
Self-attention               76.85   96.37   63.91   62.94   75
GRU attention                80.62   97.79   63.50   62.27   75
Bag of Words+random forest   71.14   77.60   65.68   59.81   72

Here we also show some predicted examples to illustrate our predictions. First, we present some pairs that are truly inconsistent and are also predicted as inconsistent:

E1:

  • Sentence: Early life and career Lin was born in Houguan (modern Fuzhou, Fujian Province) towards the end of the Qianlong Emperor's reign.
  • Wikidata: place of birth Atlanta

E2:

  • Sentence: In 2004, Dohnányi returned to Hamburg, Germany where he maintained a residence for many years, to become chief conductor of the NDR Symphony Orchestra.
  • Wikidata: occupation filibuster

Next, we present some pairs that are truly consistent and are also predicted as consistent:

E1:

  • Sentence: Vaz Tê has previously played for English clubs Bolton Wanderers, Hull City, Barnsley and West Ham United, Greek club Panionios,Scottish Premier League club Hibernian and Turkish side Akhisar Belediyespor
  • Wikidata: member of sports team West Ham United F.C.

E2:

  • Sentence: She studied contemporary dance at Simon Fraser University and a earned Master's degree in Fine Arts specializing in Creative Writing from the University of British Columbia.
  • Wikidata: educated at University of British Columbia

Some Difficulties

  • First, although we provide an initial way to label the consistency between a Wikidata statement and a sentence without manual annotation, and most pairs are labeled correctly, this method does not yet yield a high-quality labeled dataset. Some sentences mention the value of a property without actually talking about that property. Improving the labeling method is therefore one difficulty.
  • Second, even though the Co-attention+NSMN model detects inconsistent and consistent pairs better than the other models, its performance can still be improved by using other techniques or considering additional information.

Summary

  • In this research, we provide a method to label the consistency between Wikidata and Wikipedia content, and propose a model to predict whether a Wikidata claim and a sentence are consistent.
  • Our method uses two different matching models, NSMN and co-attention, together to discover the consistency between Wikidata and Wikipedia pages, and achieves better accuracy and more balanced predictions than the other models we compared.
  • Because we also use alias information to generate the datasets, our method can capture some Wikidata and sentence pairs that do not match exactly but whose meanings are consistent.
  • Even though our proposed method detects consistency better, its performance can still be improved. Moreover, the labeling method can still be improved to produce a higher-quality dataset.