Research:Wikipedia Source Controversiality Metrics/work
Data Collection
...
Revision Selection
Consecutive revisions by the same user are treated as a single revision, regardless of the time elapsed between them.
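As a minimal sketch of this merging rule (assuming a revision is just a record with a "user" field and the list is already sorted by timestamp; keeping the last revision of each run is an illustrative choice, the page only states that the run counts as one revision):

```python
from itertools import groupby

def collapse_consecutive_revisions(revisions):
    """Treat each run of consecutive revisions by the same user as one revision.

    `revisions`: list of dicts with a "user" key, sorted by timestamp.
    Keeping the last revision of each run is an illustrative choice.
    """
    return [list(run)[-1] for _, run in groupby(revisions, key=lambda r: r["user"])]
```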
Metrics
URL-wise metrics
Metrics evaluated on each page & source (URL) combination found in the dataset. Each metric is computed from the addition and removal events of a source. An edit counts as a "removal" only if the source is not added back within one day; if the source is removed and re-added within one day, the in-between revisions are ignored.
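A sketch of how the addition/removal events could be extracted under this one-day rule; the revision representation (a "timestamp" and a set of "sources" per revision) is an assumption for illustration:

```python
from datetime import timedelta

REATTACH_WINDOW = timedelta(days=1)

def addition_removal_events(revisions, url):
    """Yield ("addition"/"removal", revision) events for one source URL.

    `revisions`: list of dicts with "timestamp" (datetime) and "sources"
    (set of URLs), sorted by timestamp. A disappearance only counts as a
    removal if the source is not re-added within one day; otherwise the
    in-between revisions are ignored.
    """
    events, present = [], False
    for i, rev in enumerate(revisions):
        has_url = url in rev["sources"]
        if not present and has_url:
            events.append(("addition", rev))
            present = True
        elif present and not has_url:
            readded_soon = any(
                url in later["sources"]
                and later["timestamp"] - rev["timestamp"] <= REATTACH_WINDOW
                for later in revisions[i + 1:]
            )
            if not readded_soon:
                events.append(("removal", rev))
                present = False
            # otherwise: re-added within one day -> this removal is ignored
    return events
```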
Age/Lifetime
lifetime
The time (in days) from a source's addition to its removal.
lifetime_nrev
The number of revisions from a source's addition to its removal.
age
The number of days between the first time the URL was added to the page and the collection time.
current
A value in $\{0,1\}$ indicating whether the source is present (1) or absent (0) on the page at the time of data collection.
age_nrev
The number of edits received by the page between the time the URL was first added and the collection time.
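Restated compactly, using $t^{add}$, $t^{rem}$, $t^{first}$ and $t^{coll}$ for the addition, removal, first-addition and collection times (notation introduced here for readability, not in the original):

$\text{lifetime} = t^{rem} - t^{add}$ (days), and $\text{lifetime\_nrev}$ counts the revisions in the same interval
$\text{age} = t^{coll} - t^{first}$ (days), and $\text{age\_nrev}$ counts the page edits in the same interval
$\text{current} \in \{0,1\}$: 1 if the source is on the page at $t^{coll}$, 0 otherwise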
Permanence
permanence
The total amount of time (in days) a source has been on a page; it is the sum of its lifetimes.
norm_permanence
The permanence (days) of a URL on a page, divided by the lifetime (days) of the page itself. It is a measure in $[0,1]$ of how long a source has been on a page, relative to the life of the page. The computation considers the collection time, the times when the page was created and the URL was added, and whether the URL was present on the page at collection time.
selfnorm_permanence
The permanence (days) of a URL on a page, divided by the age of that URL on that page. It is a measure in $[0,1]$ of how long a source has been on a page, relative to the life of the URL on that page. The computation considers the collection time, the time when the URL was added, and whether the URL was present on the page at collection time.
permanence_nrev
The number of revisions during which a source has been on a page; it is the sum of its lifetime_nrev values.
norm_permanence_nrev
The permanence_nrev (number of revisions) of a URL on a page, divided by the lifetime_nrev (number of revisions) of the page itself. It is a measure in $[0,1]$ of how many revisions a source has been on a page, relative to the total number of revisions of the page.
selfnorm_permanence_nrev
The permanence_nrev (number of revisions) of a URL on a page, divided by the number of revisions since the URL was first added (its age_nrev). It is a measure in $[0,1]$ of how many revisions a source has been on the page since its first addition.
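The permanence family follows directly from these definitions (with $t^{created}$ the page creation time; notation ours):

$\text{permanence} = \sum_i \text{lifetime}_i$ and $\text{permanence\_nrev} = \sum_i \text{lifetime\_nrev}_i$, summing over the lifetimes of the URL on the page
$\text{norm\_permanence} = \text{permanence} / (t^{coll} - t^{created}) \in [0,1]$
$\text{selfnorm\_permanence} = \text{permanence} / \text{age} \in [0,1]$
$\text{norm\_permanence\_nrev} = \text{permanence\_nrev} / \text{lifetime\_nrev}(\text{page}) \in [0,1]$
$\text{selfnorm\_permanence\_nrev} = \text{permanence\_nrev} / \text{age\_nrev} \in [0,1]$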
Involvement/Edits
inv_count
Involvement score. For a given page and URL, it counts how many times a sentence using that URL as a source has been edited. If a URL is present in our dataset but is not involved in any such revision, its inv_count is 0 (this might change in the future).
inv_count_normalized
Involvement score, "locally normalized": for each revision, the number of times a source has been edited is divided by the number of sources involved in that revision.
inv_score_globnorm
The involvement score of a URL on a page, divided by its permanence (days).
inv_score_globnorm_nrev
The involvement score of a URL on a page, divided by its permanence_nrev (number of revisions).
edit count
inv_count plus the number of lifetimes of a URL on a page. This measure accounts for both the edits received by [sentences containing] a source and the number of times it was added.
edit_score_globnorm
The edit count, divided by the number of days the source has been on the page.
edit_score_globnorm_nrev
The edit count, divided by the age_nrev (number of revisions) of the URL on that page.
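Restated as formulas (our reading of the definitions above; in particular, inv_count_normalized is read as the sum of the per-revision contributions):

$\text{inv\_count\_normalized} = \sum_r \frac{[\text{source edited in revision } r]}{\#\text{sources involved in } r}$
$\text{inv\_score\_globnorm} = \text{inv\_count} / \text{permanence}$ and $\text{inv\_score\_globnorm\_nrev} = \text{inv\_count} / \text{permanence\_nrev}$
$\text{edit\_count} = \text{inv\_count} + \#\text{lifetimes}$
$\text{edit\_score\_globnorm} = \text{edit\_count} / \text{permanence}$ and $\text{edit\_score\_globnorm\_nrev} = \text{edit\_count} / \text{age\_nrev}$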
Users/Editors_add/rem/start/end
Editors_start/end
Number of editors who start/end a lifetime.
Editors_add/rem
Number of editors who add/remove that URL.
Registerededitors_add/rem
Number of registered users who add/remove that URL.
Registerededitors_start/end
Number of registered users who start/end a lifetime.
norm_Editors_start/end
A measure of the variety of editors who start/end the lifetimes of a URL:
norm_Editors_add/rem
A measure of the variety of editors who add/remove that URL:
norm_Registerededitors_start/end
The probability that a lifetime of that URL is started/ended by a registered user:
norm_Registerededitors_add/rem
The probability that the URL is added/removed by a registered user:
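The formulas behind these normalized editor metrics did not survive the original formatting (hence the trailing colons above). A plausible reading, stated here only as an assumption, is:

$\text{norm\_Registerededitors\_add/rem} \approx \text{Registerededitors\_add/rem} / \text{Editors\_add/rem}$
$\text{norm\_Registerededitors\_start/end} \approx \text{Registerededitors\_start/end} / \text{Editors\_start/end}$
$\text{norm\_Editors\_add/rem} \approx \text{Editors\_add/rem} / \#\text{additions/removals}$ (distinct editors per event, as a proxy for editor variety)
$\text{norm\_Editors\_start/end} \approx \text{Editors\_start/end} / \#\text{lifetimes}$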
Domain-wise metrics
{current}_{url-wise metric}_{mean/median/sum}_urlwise
For each URL-wise metric there is a domain-aggregated version, obtained by aggregating each page & URL metric over the domain and computing the mean/median/sum. The "current" version computes the metric only on the URLs that are in use at collection time.
{current}_{url-wise metric}_{mean/median/sum}_pagewise
For each URL-wise metric there is also a domain-wise version, obtained in the same way but considering the page & domain pair during computation (using the domain of a URL instead of the URL itself). The values are then aggregated for each domain as the mean/median/sum over the pages. The "current" version computes the metric only on the pages where the domain is in use at collection time.
n_life
Number of lifetimes of URLs from a domain, across all pages.
{current}_n_page
Number of pages where the domain is used as a source. The "current" version counts the number of pages where it is used at collection time.
{current}_n_pageurls
Number of unique page & URL combinations for URLs from that domain. The "current" version counts the number of unique page & URL combinations in use at collection time.
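A sketch of the domain-wise aggregation with pandas; the toy table and column names are illustrative assumptions, and the per-(page, domain) reduction shown for the pagewise variant is only an approximation of the recomputation described above:

```python
import pandas as pd

# Toy (page, url) table with one URL-wise metric; the column names are
# illustrative assumptions, not the project's actual schema.
url_metrics = pd.DataFrame({
    "page":       ["Climate change", "Climate change", "Solar power"],
    "url":        ["https://a.org/x", "https://a.org/y", "https://b.com/z"],
    "domain":     ["a.org", "a.org", "b.com"],
    "current":    [1, 0, 1],
    "permanence": [120.0, 3.5, 40.0],
})
metric_cols = ["permanence"]

# {url-wise metric}_{mean/median/sum}_urlwise:
# aggregate every (page, url) value of the metric over the domain.
urlwise = url_metrics.groupby("domain")[metric_cols].agg(["mean", "median", "sum"])

# current_..._urlwise: same, restricted to URLs in use at collection time.
current_urlwise = (url_metrics[url_metrics["current"] == 1]
                   .groupby("domain")[metric_cols]
                   .agg(["mean", "median", "sum"]))

# ..._pagewise: one value per (page, domain) first, then aggregated over pages.
# In the described pipeline the metric is recomputed with the domain in place
# of the URL; summing the URL-level values is only an approximation used here.
pagewise = (url_metrics.groupby(["domain", "page"])[metric_cols].sum()
            .groupby(level="domain").agg(["mean", "median", "sum"]))

# {current}_n_page and {current}_n_pageurls
n_page     = url_metrics.groupby("domain")["page"].nunique()
n_pageurls = (url_metrics.drop_duplicates(["domain", "page", "url"])
              .groupby("domain").size())
```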
Normalized extensive features
Modelling approaches
We are testing (as of 18/10/23) different models, using different approaches with respect to data representation, data selection, preprocessing, and learning algorithms.
Target
We aim to predict the reliability/controversiality of URLs/domains using the metrics described above. To do so, we take a discriminative approach, using a ground truth provided by the perennial status and/or the MBFC status.
Perennial status
The perennial sources list classifies several domains as Generally Reliable, Generally Unreliable, Mixed, No consensus, Blacklisted, or Deprecated. For a clean target we use the Generally Reliable class as the positive class and the Generally Unreliable class as the negative class.
MBFC Status
We are not using (as of 18/10/23) MBFC statuses as a target, but we can train with the perennial status as the target and then validate (qualitatively) against the MBFC status.
Data Representation
We train on the domains that are classified as Generally Reliable (positive target) or Generally Unreliable (negative target) in the perennial sources list.
Domain Dataset
In the Domain dataset, each row represents the information we gathered about a single domain.
URL+Domain Dataset
In this dataset, each row represents information about a page & URL combination AND information about its domain (the same as in the Domain dataset); we thus train on the information we gathered about the usage of a URL on a given page, in addition to the information about its domain. As a consequence, part of the values in each row (the domain information) is repeated across different URLs with the same domain.
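A sketch of how such a dataset could be assembled, assuming one table of page & URL metrics and one table of domain metrics (names and columns are illustrative):

```python
import pandas as pd

# url_df: one row per (page, url); domain_df: one row per domain.
# Column names are assumptions for illustration.
url_df = pd.DataFrame({
    "page":       ["Climate change", "Solar power"],
    "url":        ["https://a.org/x", "https://a.org/y"],
    "domain":     ["a.org", "a.org"],
    "permanence": [120.0, 40.0],
})
domain_df = pd.DataFrame({
    "domain":          ["a.org"],
    "n_page":          [2],
    "perennial_label": [1],  # 1 = Generally Reliable, 0 = Generally Unreliable
})

# URL+Domain dataset: the domain columns are repeated for every URL of that domain.
url_domain_df = url_df.merge(domain_df, on="domain", how="left")
```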
Data selection
Subsampling
In the URL+Domain dataset, we subsample URLs from heavily used domains by applying a cutoff on the maximum number of URLs from the same domain; with this strategy we obtain a more balanced URL+Domain dataset. This is required to prevent the dataset/folds (during cross-validation) from being dominated by a few popular domains (a sketch of both data-selection steps follows the Balancing entry below).
Balancing
In order to have a balanced target for training, we subsample the URLs/domains so that we obtain the same number of positive (Generally Reliable) and negative (Generally Unreliable) entries.
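A sketch of both data-selection steps on a URL+Domain table; the cutoff value and column names are illustrative assumptions:

```python
import pandas as pd

def select_data(df: pd.DataFrame,
                max_urls_per_domain: int = 50,   # illustrative cutoff, not the study's value
                label_col: str = "perennial_label",
                seed: int = 0) -> pd.DataFrame:
    """Cap URLs per domain, then balance positive/negative classes (sketch)."""
    # Subsampling: keep at most `max_urls_per_domain` rows per domain so that
    # a few popular domains do not dominate the dataset or the CV folds.
    capped = (df.groupby("domain", group_keys=False)
                .apply(lambda g: g.sample(min(len(g), max_urls_per_domain),
                                          random_state=seed)))
    # Balancing: equal numbers of Generally Reliable (1) / Unreliable (0) rows.
    n = capped[label_col].value_counts().min()
    return (capped.groupby(label_col, group_keys=False)
                  .apply(lambda g: g.sample(n, random_state=seed)))
```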
Preprocessing
Normalization
If required (e.g. for the logistic regression model), each feature is normalized independently by ranking before being fed to the model. Each feature is thus mapped onto the uniform interval $[0,1]$.
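A sketch of the rank-based normalization, mapping each feature column independently to a uniform $[0,1]$ scale:

```python
import pandas as pd

def rank_normalize(features: pd.DataFrame) -> pd.DataFrame:
    """Map each feature column independently to [0, 1] by ranking.

    `rank(pct=True)` returns the fractional rank, so each feature becomes
    approximately uniform on (0, 1]; ties share the same value.
    """
    return features.rank(pct=True)
```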
VIF
If required, we compute VIF values across all features and iteratively remove the feature with the highest VIF, until every remaining feature has a VIF below a threshold (generally VIF < 5).
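A sketch of the iterative VIF-based selection, using statsmodels' variance_inflation_factor on a numeric feature DataFrame:

```python
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

def vif_select(features: pd.DataFrame, threshold: float = 5.0) -> pd.DataFrame:
    """Iteratively drop the feature with the highest VIF until all VIFs < threshold."""
    cols = list(features.columns)
    while len(cols) > 1:
        X = features[cols].values
        vifs = pd.Series(
            [variance_inflation_factor(X, i) for i in range(len(cols))],
            index=cols,
        )
        if vifs.max() < threshold:
            break
        cols.remove(vifs.idxmax())  # drop the most collinear feature
    return features[cols]
```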
Discriminative modelling
LogReg
Requires normalization and (VIF-based) feature selection.
XGBoost
Does not require normalization or feature selection; both are handled by the algorithm itself.
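A minimal sketch of the two setups (default hyperparameters; not the actual configuration used in the experiments):

```python
from sklearn.linear_model import LogisticRegression
from xgboost import XGBClassifier

# LogReg: rank normalization and VIF-based feature selection are applied
# beforehand (see the sketches above); only the classifier is shown here.
logreg = LogisticRegression(max_iter=1000)

# XGBoost: used on the raw features; no external feature selection needed.
# The "weighted" iterations could be implemented e.g. via scale_pos_weight.
xgb = XGBClassifier(eval_metric="logloss")
```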
Performance Evaluation Approaches
Iterations
We tested several strategies, changing:
- balanced/not balanced dataset
- Domain data/URL+Domain data
- data from the Climate change/COVID-19 project
- data from English pages/all other languages together/a specific language
- using the score features/not using them
- weighted/not weighted target
Evaluation Strategies
Early strategies (December 2023)
In-dataset validation
To assess the quality of model predictions on unobserved domains, we cross-validated the model within each dataset: we train on 4/5 of the domains and validate on the remaining 1/5, repeating the process 100 times. Since there are only a few domains, we used this strategy to avoid fold-dependent noise in the performance metrics. Over the 100 folds we measure the average and standard deviation of F1, precision, and recall, for both positive entries (Gen. Reliable) and negative entries (Gen. Unreliable).
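A minimal sketch of this repeated-split validation, assuming a numeric feature matrix X and binary labels y at the domain level (for the URL+Domain dataset the split would instead be grouped by domain, e.g. with GroupShuffleSplit):

```python
from sklearn.metrics import precision_recall_fscore_support
from sklearn.model_selection import ShuffleSplit
from xgboost import XGBClassifier

def in_dataset_validation(X, y, n_repeats=100, seed=0):
    """100 random 4/5 train, 1/5 validation splits over domains (sketch)."""
    splitter = ShuffleSplit(n_splits=n_repeats, test_size=0.2, random_state=seed)
    fold_scores = []
    for train_idx, valid_idx in splitter.split(X):
        model = XGBClassifier(eval_metric="logloss")
        model.fit(X[train_idx], y[train_idx])
        pred = model.predict(X[valid_idx])
        # per-class precision / recall / F1 for negative (0) and positive (1) entries
        p, r, f1, _ = precision_recall_fscore_support(
            y[valid_idx], pred, labels=[0, 1], zero_division=0)
        fold_scores.append({"neg_f1": f1[0], "pos_f1": f1[1],
                            "neg_precision": p[0], "pos_precision": p[1],
                            "neg_recall": r[0], "pos_recall": r[1]})
    return fold_scores  # report mean and std over the 100 folds
```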
Cross-dataset validation
To check whether the results are dataset-dependent, we trained and validated the model across different datasets (different projects and languages).
Current strategy
Tools
- F1 Macro: since we are interested in recognizing both negative and positive domains (consensus-reliable or consensus-unreliable), we measure model performance using the unweighted average of the F1 scores of the two classes.
- Leave-one-out: when training and testing on the same dataset, we use leave-one-out validation to measure the F1 scores. This also applies when training on a dataset that includes the test domains (e.g. training on all domains from all languages of the Climate change pages, and testing on the Climate change Spanish pages).
- Bootstrapping: to obtain a confidence interval on each dataset measure, we take the list of results and resample it 100 times with replacement, computing the performance metrics once per iteration. The result is a distribution of 100 values representing the model performance on a given dataset.
- Mann-Whitney: to compare the performance of two different models on the same dataset, we use Mann-Whitney p-values corrected with the Bonferroni approach. For each pair of models we thus have a p-value telling us whether the two models are comparable (p > 0.05) or statistically significantly different (p < 0.05). A sketch of the bootstrap and comparison steps follows this list.
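A sketch of the bootstrap confidence interval and of the model comparison; the leave-one-out predictions are assumed to be available as label arrays (names are illustrative):

```python
import numpy as np
from scipy.stats import mannwhitneyu
from sklearn.metrics import f1_score

def bootstrap_f1_macro(y_true, y_pred, n_boot=100, seed=0):
    """Resample the leave-one-out results 100 times with replacement and
    compute macro F1 once per iteration; returns the 100-value distribution."""
    rng = np.random.default_rng(seed)
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    scores = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y_true), len(y_true))
        scores.append(f1_score(y_true[idx], y_pred[idx], average="macro"))
    return np.array(scores)

def compare(dist_a, dist_b, n_comparisons=1, alpha=0.05):
    """Mann-Whitney test between two bootstrap distributions.

    Bonferroni correction is applied by dividing alpha by the number of
    model pairs compared (equivalent to multiplying the p-value by it).
    """
    p = mannwhitneyu(dist_a, dist_b, alternative="two-sided").pvalue
    return p, p < alpha / n_comparisons  # True -> significantly different
```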
Model Performances
- Random model: for each dataset (i.e. a combination of a set of pages and a language), we measure the performance of a random model predicting the reliability of the domains appearing on those pages. This is done by assigning random binary values to each domain and bootstrapping the results.
Results
Dataset statistics
URL+Domain
deprecated
Domain
Iterations of the datasets (by project and language): support sizes for positive entries (Generally Reliable) and negative entries (Generally Unreliable).
project | lang | pos size | neg size |
---|---|---|---|
COVID-19 | de | 96 | 52 |
COVID-19 | en | 136 | 125 |
COVID-19 | other | 133 | 134 |
COVID-19 | ru | 93 | 34 |
Climate change | de | 100 | 52 |
Climate change | en | 130 | 152 |
Climate change | other | 138 | 145 |
Climate change | ru | 87 | 43 |
Model performances
Model selection
Observations
- Overfitting is more present with the logistic regressor (deprecated), even though it can be mitigated by VIF selection. Testing on Climate change English shows that overfitting is still present (0.9 on training vs 0.8 on validation), although performance on the validation set is still good. We are using leave-one-out validation moving forward.
- The weighted model provides the best performance with no further assumptions, and without having to remove domains from the datasets.
- Scores provide no additional information to the model, while being difficult to compute on languages other than English. They are dropped from the features moving forward.
- URL info: if training is performed at the URL level, the F1 scores and the model are strongly biased by the size of some domains (up to thousands of URLs). Moving forward we will use domain-wise information only.
- XGBoost provides better F1 performance than the logistic regressor, without assumptions to make about feature selection.
- Other languages: aggregating all other languages together biases the resulting dataset in favor of the more active wikis (es, de, fr, ...), so aggregation may not be the best strategy for different languages. We also expect the model to perform worse on other languages due to:
- less importance given to the perennial sources list
- fewer revisions
- noise from low-resource languages
Early results (December 2023): URL+Domain dataset
Performance metrics are obtained following the strategy described in "In-dataset validation" (above).
Over the URL+Domain dataset, we tested the XGBoost model changing:
- Climate change/COVID-19
- English/other languages
- weighting/not weighting
- using/not using the score-based features
iteration_name | project | language | use_scores | weighted | train_f1_avg | train_f1_std | valid_f1_avg | valid_f1_std |
---|---|---|---|---|---|---|---|---|
covid-19_english | COVID-19 | english | False | False | 0.995 | 0.001 | 0.951 | 0.042 |
covid-19_english_weighted | COVID-19 | english | False | True | 0.995 | 0.002 | 0.951 | 0.038 |
covid-19_english_+scores | COVID-19 | english | True | False | 0.997 | 0.001 | 0.947 | 0.037 |
covid-19_english_+scores_weighted | COVID-19 | english | True | True | 0.995 | 0.002 | 0.945 | 0.036 |
covid-19_other_languages | COVID-19 | other_languages | False | False | 0.980 | 0.007 | 0.859 | 0.058 |
covid-19_other_languages_weighted | COVID-19 | other_languages | False | True | 0.980 | 0.004 | 0.844 | 0.065 |
climate_change_english | Climate change | english | False | False | 0.993 | 0.002 | 0.936 | 0.051 |
climate_change_english_weighted | Climate change | english | False | True | 0.988 | 0.002 | 0.939 | 0.049 |
climate_change_english_+scores | Climate change | english | True | False | 0.992 | 0.002 | 0.941 | 0.039 |
climate_change_english_+scores_weighted | Climate change | english | True | True | 0.986 | 0.003 | 0.936 | 0.035 |
climate_change_other_languages | Climate change | other_languages | False | False | 0.968 | 0.007 | 0.876 | 0.056 |
climate_change_other_languages_weighted | Climate change | other_languages | False | True | 0.966 | 0.007 | 0.842 | 0.062 |
Early results (December 2023): Domain dataset
Performance metrics are obtained following the strategy described in "In-dataset validation" (above). Over the Domain dataset, we tested the XGBoost model changing:
- Climate change/COVID-19
- English/all other languages/de/ru
- weighting/not weighting/balancing the dataset
- using/not using the score-based features
iteration_name | project | lang | use scores | balanced dataset | weighted | pos f1 | neg f1 | pos precision | neg precision | pos recall | neg recall |
---|---|---|---|---|---|---|---|---|---|---|---|
covid-19_other_languages_de | COVID-19 | de | False | False | False | 0.811 | 0.583 | 0.774 | 0.684 | 0.862 | 0.538 |
covid-19_other_languages_de_weighted | COVID-19 | de | False | False | True | 0.786 | 0.630 | 0.812 | 0.614 | 0.771 | 0.672 |
covid-19_other_languages_balanced_de | COVID-19 | de | False | True | False | 0.668 | 0.668 | 0.672 | 0.692 | 0.690 | 0.673 |
covid-19_english | COVID-19 | en | False | False | False | 0.847 | 0.827 | 0.848 | 0.832 | 0.851 | 0.827 |
covid-19_english_weighted | COVID-19 | en | False | False | True | 0.847 | 0.828 | 0.852 | 0.829 | 0.847 | 0.833 |
covid-19_english_balanced | COVID-19 | en | False | True | False | 0.849 | 0.842 | 0.848 | 0.848 | 0.855 | 0.842 |
covid-19_english_+scores | COVID-19 | en | True | False | False | 0.858 | 0.836 | 0.854 | 0.845 | 0.868 | 0.833 |
covid-19_english_+scores_weighted | COVID-19 | en | True | False | True | 0.860 | 0.838 | 0.858 | 0.844 | 0.866 | 0.838 |
covid-19_english_+scores_balanced | COVID-19 | en | True | True | False | 0.838 | 0.837 | 0.843 | 0.839 | 0.840 | 0.840 |
covid-19_other_languages | COVID-19 | other | False | False | False | 0.774 | 0.779 | 0.792 | 0.769 | 0.764 | 0.796 |
covid-19_other_languages_weighted | COVID-19 | other | False | False | True | 0.778 | 0.781 | 0.794 | 0.772 | 0.768 | 0.795 |
covid-19_other_languages_balanced | COVID-19 | other | False | True | False | 0.776 | 0.772 | 0.793 | 0.763 | 0.766 | 0.789 |
covid-19_other_languages_ru | COVID-19 | ru | False | False | False | 0.826 | 0.364 | 0.777 | 0.527 | 0.890 | 0.311 |
covid-19_other_languages_ru_weighted | COVID-19 | ru | False | False | True | 0.799 | 0.522 | 0.839 | 0.498 | 0.774 | 0.596 |
covid-19_other_languages_balanced_ru | COVID-19 | ru | False | True | False | 0.691 | 0.671 | 0.689 | 0.726 | 0.734 | 0.663 |
climate_change_other_languages_de | Climate change | de | False | False | False | 0.786 | 0.548 | 0.764 | 0.610 | 0.820 | 0.530 |
climate_change_other_languages_de_weighted | Climate change | de | False | False | True | 0.752 | 0.580 | 0.790 | 0.555 | 0.728 | 0.640 |
climate_change_other_languages_balanced_de | Climate change | de | False | True | False | 0.637 | 0.654 | 0.649 | 0.667 | 0.649 | 0.663 |
climate_change_english | Climate change | en | False | False | False | 0.781 | 0.818 | 0.792 | 0.814 | 0.778 | 0.828 |
climate_change_english_weighted | Climate change | en | False | False | True | 0.783 | 0.814 | 0.781 | 0.821 | 0.792 | 0.813 |
climate_change_english_balanced | Climate change | en | False | True | False | 0.784 | 0.778 | 0.792 | 0.781 | 0.786 | 0.786 |
climate_change_english_+scores | Climate change | en | True | False | False | 0.788 | 0.816 | 0.800 | 0.812 | 0.783 | 0.826 |
climate_change_english_+scores_weighted | Climate change | en | True | False | True | 0.790 | 0.810 | 0.785 | 0.821 | 0.801 | 0.804 |
climate_change_english_+scores_balanced | Climate change | en | True | True | False | 0.779 | 0.781 | 0.777 | 0.792 | 0.790 | 0.779 |
climate_change_other_languages | Climate change | other | False | False | False | 0.699 | 0.729 | 0.722 | 0.717 | 0.685 | 0.749 |
climate_change_other_languages_weighted | Climate change | other | False | False | True | 0.700 | 0.723 | 0.714 | 0.719 | 0.695 | 0.733 |
climate_change_other_languages_balanced | Climate change | other | False | True | False | 0.704 | 0.714 | 0.731 | 0.702 | 0.691 | 0.740 |
climate_change_other_languages_ru | Climate change | ru | False | False | False | 0.777 | 0.397 | 0.730 | 0.513 | 0.840 | 0.350 |
climate_change_other_languages_ru_weighted | Climate change | ru | False | False | True | 0.760 | 0.519 | 0.777 | 0.516 | 0.754 | 0.550 |
climate_change_other_languages_balanced_ru | Climate change | ru | False | True | False | 0.664 | 0.644 | 0.677 | 0.671 | 0.684 | 0.660 |