Machine learning models/Production/Serbian Wikipedia article topic

From Meta, a Wikimedia project coordination wiki


Model card
This page is an on-wiki machine learning model card.
A model card is a document about a machine learning model that seeks to answer basic questions about the model.
Model Information Hub
Model creator(s): Aaron Halfaker (User:EpochFail) and Amir Sarabadani
Model owner(s): WMF Machine Learning Team (ml@wikimediafoundation.org)
Model interface: ORES homepage
Code: drafttopic GitHub, ORES training data, and ORES model binaries
Uses PII: No
In production? Yes
Which projects? Serbian Wikipedia
This model uses article text to predict the likelihood that the article belongs to a set of topics.


Motivation

How can we predict what general topic an article is in? Answering this question is useful for various analyses of Wikipedia dynamics. However, it is difficult to group a very diverse range of Wikipedia articles into coherent, consistent topics manually.

This model, part of the ORES suite of models, analyzes an article to predict its likelihood of belonging to each of a set of topics. Similar models (though not necessarily with the same performance level or topics) are deployed across about a dozen other projects. There is also a language-agnostic article topic model.

This model may be useful for high-level analyses of Wikipedia dynamics (pageviews, article quality, edit trends) and filtering articles.

Users and uses

Use this model for
  • high-level analyses of Wikipedia dynamics such as pageviews, article quality, or edit trends, e.g. how do pageview dynamics differ between the physics and biology categories?
  • filtering to relevant articles, e.g. keeping only articles in the music category
Don't use this model for
  • definitively establishing what topic an article pertains to
  • automated editing of articles or topics without a human in the loop
Current uses

This model is part of ORES and is generally accessible via the ORES API. It is used for high-level analysis of Wikipedia, platform research, and other on-wiki tasks.

Example API call:
https://ores.wikimedia.org/v3/scores/srwiki/1234/articletopic
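
A minimal sketch of how the example call above could be built and issued from Python with only the standard library. The helper names `score_url` and `fetch_score` are hypothetical, the URL pattern (wiki, revision ID, model name) is generalized from the single example above, and the request only succeeds where the ORES endpoint is reachable.

```python
import json
from urllib.request import urlopen

# Host and path prefix copied from the example call above.
ORES_BASE = "https://ores.wikimedia.org/v3/scores"


def score_url(wiki: str, rev_id: int, model: str = "articletopic") -> str:
    """Build a scoring URL that follows the pattern of the example call (assumed general)."""
    return f"{ORES_BASE}/{wiki}/{rev_id}/{model}"


def fetch_score(wiki: str, rev_id: int, model: str = "articletopic",
                timeout: float = 10.0) -> dict:
    """Request a score and decode the JSON body. This performs a network call."""
    with urlopen(score_url(wiki, rev_id, model), timeout=timeout) as response:
        return json.load(response)
```

Calling `score_url("srwiki", 1234)` reproduces the example URL; `fetch_score` is kept separate so the URL logic can be checked without network access.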

Ethical considerations, caveats, and recommendations

  • This model was trained on data that is now several years old (from mid-2020). Drift in the underlying data since then may skew model outputs.
  • This model uses word2vec as a training feature. Word2vec, like other natural language embeddings, encodes the linguistic biases of its underlying datasets, for example along lines of gender, race, ethnicity, and religion. Since Wikipedia has known biases in its text, this model may encode and at times reproduce those biases.
  • This model's performance varies widely across topics. Consult the test statistics below for per-topic figures.

Model

Performance

Test data confusion matrix:

Label n True positive False negative False positive True negative
Culture.Biography.Biography* 13381 11913 1468 713 43322
Culture.Biography.Women 3703 2961 742 284 53429
Culture.Food and drink 1377 824 553 164 55875
Culture.Internet culture 2959 2514 445 228 54229
Culture.Linguistics 1601 1150 451 129 55686
Culture.Literature 5115 3833 1282 421 51880
Culture.Media.Books 1560 1238 322 97 55759
Culture.Media.Entertainment 2246 1192 1054 218 54952
Culture.Media.Films 2674 2333 341 106 54636
Culture.Media.Media* 12684 11238 1446 1123 43609
Culture.Media.Music 2694 2253 441 165 54557
Culture.Media.Radio 284 200 84 26 57106
Culture.Media.Software 2267 1929 338 250 54899
Culture.Media.Television 2400 2048 352 103 54913
Culture.Media.Video games 1405 1327 78 37 55974
Culture.Performing arts 1449 807 642 120 55847
Culture.Philosophy and religion 4226 2410 1816 453 52737
Culture.Sports 5082 4504 578 140 52194
Culture.Visual arts.Architecture 1985 1246 739 204 55227
Culture.Visual arts.Comics and Anime 1324 1122 202 72 56020
Culture.Visual arts.Fashion 646 341 305 50 56720
Culture.Visual arts.Visual arts* 4640 2945 1695 387 52389
Geography.Geographical 5060 3705 1355 796 51560
Geography.Regions.Africa.Africa* 3847 2602 1245 291 53278
Geography.Regions.Africa.Central Africa 537 274 263 58 56821
Geography.Regions.Africa.Eastern Africa 449 282 167 44 56923
Geography.Regions.Africa.Northern Africa 1505 1015 490 125 55786
Geography.Regions.Africa.Southern Africa 625 380 245 45 56746
Geography.Regions.Africa.Western Africa 150 102 48 24 57242
Geography.Regions.Americas.Central America 1314 602 712 86 56016
Geography.Regions.Americas.North America 6102 4272 1830 738 50576
Geography.Regions.Americas.South America 1547 1020 527 101 55768
Geography.Regions.Asia.Asia* 10008 7945 2063 777 46631
Geography.Regions.Asia.Central Asia 1120 756 364 70 56226
Geography.Regions.Asia.East Asia 2638 2062 576 103 54675
Geography.Regions.Asia.North Asia 1984 1456 528 274 55158
Geography.Regions.Asia.South Asia 1784 1277 507 64 55568
Geography.Regions.Asia.Southeast Asia 1588 1078 510 102 55726
Geography.Regions.Asia.West Asia 2753 1950 803 221 54442
Geography.Regions.Europe.Eastern Europe 4176 3084 1092 369 52871
Geography.Regions.Europe.Europe* 16664 13521 3143 2059 38693
Geography.Regions.Europe.Northern Europe 3404 2231 1173 244 53768
Geography.Regions.Europe.Southern Europe 6030 4566 1464 668 50718
Geography.Regions.Europe.Western Europe 4302 3277 1025 320 52794
Geography.Regions.Oceania 1836 1260 576 88 55492
History and Society.Business and economics 3143 1856 1287 280 53993
History and Society.Education 1669 857 812 113 55634
History and Society.History 6814 4435 2379 904 49698
History and Society.Military and warfare 5875 4303 1572 638 50903
History and Society.Politics and government 4679 2720 1959 462 52275
History and Society.Society 6974 3425 3549 613 49829
History and Society.Transportation 3236 2923 313 95 54085
STEM.Biology 3787 3111 676 159 53470
STEM.Chemistry 2144 1639 505 342 54930
STEM.Computing 2657 2252 405 255 54504
STEM.Earth and environment 1681 1018 663 155 55580
STEM.Engineering 2820 2183 637 147 54449
STEM.Libraries & Information 516 382 134 26 56874
STEM.Mathematics 573 351 222 53 56790
STEM.Medicine & Health 2450 1738 712 258 54708
STEM.Physics 1431 1039 392 160 55825
STEM.STEM* 18323 16602 1721 950 38143
STEM.Space 2151 2029 122 25 55240
STEM.Technology 4652 3536 1116 522 52242

Test data sample rates:

Label Sample rate Population rate
Culture.Biography.Biography* 0.233 0.123
Culture.Biography.Women 0.064 0.015
Culture.Food and drink 0.024 0.002
Culture.Internet culture 0.052 0.003
Culture.Linguistics 0.028 0.007
Culture.Literature 0.089 0.015
Culture.Media.Books 0.027 0.004
Culture.Media.Entertainment 0.039 0.004
Culture.Media.Films 0.047 0.011
Culture.Media.Media* 0.221 0.058
Culture.Media.Music 0.047 0.024
Culture.Media.Radio 0.005 0.002
Culture.Media.Software 0.039 0.001
Culture.Media.Television 0.042 0.009
Culture.Media.Video games 0.024 0.003
Culture.Performing arts 0.025 0.003
Culture.Philosophy and religion 0.074 0.011
Culture.Sports 0.089 0.071
Culture.Visual arts.Architecture 0.035 0.011
Culture.Visual arts.Comics and Anime 0.023 0.002
Culture.Visual arts.Fashion 0.011 0.001
Culture.Visual arts.Visual arts* 0.081 0.018
Geography.Geographical 0.088 0.024
Geography.Regions.Africa.Africa* 0.067 0.008
Geography.Regions.Africa.Central Africa 0.009 0.001
Geography.Regions.Africa.Eastern Africa 0.008 0
Geography.Regions.Africa.Northern Africa 0.026 0.001
Geography.Regions.Africa.Southern Africa 0.011 0.001
Geography.Regions.Africa.Western Africa 0.003 0.001
Geography.Regions.Americas.Central America 0.023 0.003
Geography.Regions.Americas.North America 0.106 0.064
Geography.Regions.Americas.South America 0.027 0.006
Geography.Regions.Asia.Asia* 0.174 0.045
Geography.Regions.Asia.Central Asia 0.02 0.001
Geography.Regions.Asia.East Asia 0.046 0.011
Geography.Regions.Asia.North Asia 0.035 0.001
Geography.Regions.Asia.South Asia 0.031 0.015
Geography.Regions.Asia.Southeast Asia 0.028 0.006
Geography.Regions.Asia.West Asia 0.048 0.011
Geography.Regions.Europe.Eastern Europe 0.073 0.013
Geography.Regions.Europe.Europe* 0.29 0.076
Geography.Regions.Europe.Northern Europe 0.059 0.031
Geography.Regions.Europe.Southern Europe 0.105 0.013
Geography.Regions.Europe.Western Europe 0.075 0.019
Geography.Regions.Oceania 0.032 0.015
History and Society.Business and economics 0.055 0.01
History and Society.Education 0.029 0.007
History and Society.History 0.119 0.011
History and Society.Military and warfare 0.102 0.014
History and Society.Politics and government 0.081 0.028
History and Society.Society 0.121 0.013
History and Society.Transportation 0.056 0.015
STEM.Biology 0.066 0.034
STEM.Chemistry 0.037 0.002
STEM.Computing 0.046 0.003
STEM.Earth and environment 0.029 0.005
STEM.Engineering 0.049 0.005
STEM.Libraries & Information 0.009 0.001
STEM.Mathematics 0.01 0
STEM.Medicine & Health 0.043 0.006
STEM.Physics 0.025 0.001
STEM.STEM* 0.319 0.069
STEM.Space 0.037 0.006
STEM.Technology 0.081 0.005

Test data performance:

Label Match rate Filter rate Recall Precision f1 Accuracy ROC AUC PR AUC
Culture.Biography.Biography* 0.124 0.876 0.89 0.885 0.888 0.972 0.981 0.94
Culture.Biography.Women 0.017 0.983 0.8 0.691 0.741 0.992 0.984 0.762
Culture.Food and drink 0.004 0.996 0.598 0.335 0.43 0.996 0.973 0.419
Culture.Internet culture 0.007 0.993 0.85 0.416 0.559 0.995 0.986 0.673
Culture.Linguistics 0.008 0.992 0.718 0.696 0.707 0.996 0.976 0.672
Culture.Literature 0.02 0.98 0.749 0.594 0.663 0.988 0.979 0.723
Culture.Media.Books 0.005 0.995 0.794 0.649 0.714 0.997 0.985 0.718
Culture.Media.Entertainment 0.006 0.994 0.531 0.326 0.404 0.994 0.972 0.331
Culture.Media.Films 0.011 0.989 0.872 0.827 0.849 0.997 0.986 0.887
Culture.Media.Media* 0.075 0.925 0.886 0.687 0.774 0.97 0.981 0.872
Culture.Media.Music 0.023 0.977 0.836 0.872 0.854 0.993 0.984 0.888
Culture.Media.Radio 0.002 0.998 0.704 0.77 0.735 0.999 0.945 0.563
Culture.Media.Software 0.006 0.994 0.851 0.2 0.324 0.995 0.987 0.382
Culture.Media.Television 0.009 0.991 0.853 0.802 0.827 0.997 0.985 0.851
Culture.Media.Video games 0.003 0.997 0.944 0.789 0.86 0.999 0.988 0.914
Culture.Performing arts 0.004 0.996 0.557 0.429 0.485 0.997 0.97 0.38
Culture.Philosophy and religion 0.014 0.986 0.57 0.419 0.483 0.987 0.955 0.464
Culture.Sports 0.066 0.934 0.886 0.962 0.923 0.989 0.981 0.953
Culture.Visual arts.Architecture 0.01 0.99 0.628 0.646 0.636 0.992 0.978 0.626
Culture.Visual arts.Comics and Anime 0.003 0.997 0.847 0.592 0.697 0.998 0.988 0.767
Culture.Visual arts.Fashion 0.001 0.999 0.528 0.327 0.404 0.999 0.966 0.243
Culture.Visual arts.Visual arts* 0.019 0.981 0.635 0.617 0.626 0.986 0.967 0.651
Geography.Geographical 0.032 0.968 0.732 0.538 0.62 0.979 0.973 0.637
Geography.Regions.Africa.Africa* 0.011 0.989 0.676 0.495 0.571 0.992 0.973 0.564
Geography.Regions.Africa.Central Africa 0.001 0.999 0.51 0.24 0.327 0.999 0.972 0.171
Geography.Regions.Africa.Eastern Africa 0.001 0.999 0.628 0.27 0.378 0.999 0.971 0.167
Geography.Regions.Africa.Northern Africa 0.003 0.997 0.674 0.27 0.386 0.997 0.976 0.337
Geography.Regions.Africa.Southern Africa 0.002 0.998 0.608 0.474 0.533 0.999 0.969 0.42
Geography.Regions.Africa.Western Africa 0.001 0.999 0.68 0.526 0.593 0.999 0.908 0.414
Geography.Regions.Americas.Central America 0.003 0.997 0.458 0.497 0.477 0.997 0.963 0.385
Geography.Regions.Americas.North America 0.058 0.942 0.7 0.77 0.733 0.967 0.97 0.803
Geography.Regions.Americas.South America 0.006 0.994 0.659 0.698 0.678 0.996 0.977 0.671
Geography.Regions.Asia.Asia* 0.052 0.948 0.794 0.698 0.743 0.975 0.974 0.827
Geography.Regions.Asia.Central Asia 0.002 0.998 0.675 0.32 0.434 0.998 0.979 0.333
Geography.Regions.Asia.East Asia 0.011 0.989 0.782 0.828 0.804 0.996 0.984 0.853
Geography.Regions.Asia.North Asia 0.006 0.994 0.734 0.121 0.207 0.995 0.985 0.174
Geography.Regions.Asia.South Asia 0.012 0.988 0.716 0.906 0.8 0.995 0.979 0.84
Geography.Regions.Asia.Southeast Asia 0.006 0.994 0.679 0.692 0.685 0.996 0.975 0.659
Geography.Regions.Asia.West Asia 0.012 0.988 0.708 0.659 0.683 0.993 0.974 0.679
Geography.Regions.Europe.Eastern Europe 0.016 0.984 0.739 0.581 0.65 0.99 0.979 0.697
Geography.Regions.Europe.Europe* 0.108 0.892 0.811 0.57 0.669 0.939 0.959 0.771
Geography.Regions.Europe.Northern Europe 0.024 0.976 0.655 0.821 0.729 0.985 0.973 0.787
Geography.Regions.Europe.Southern Europe 0.023 0.977 0.757 0.435 0.552 0.984 0.974 0.621
Geography.Regions.Europe.Western Europe 0.021 0.979 0.762 0.712 0.736 0.99 0.979 0.804
Geography.Regions.Oceania 0.012 0.988 0.686 0.869 0.767 0.994 0.975 0.814
History and Society.Business and economics 0.011 0.989 0.591 0.538 0.563 0.991 0.961 0.517
History and Society.Education 0.006 0.994 0.513 0.652 0.575 0.994 0.963 0.551
History and Society.History 0.025 0.975 0.651 0.285 0.397 0.979 0.96 0.439
History and Society.Military and warfare 0.023 0.977 0.732 0.458 0.563 0.984 0.972 0.629
History and Society.Politics and government 0.025 0.975 0.581 0.658 0.617 0.98 0.957 0.657
History and Society.Society 0.018 0.982 0.491 0.34 0.402 0.982 0.929 0.365
History and Society.Transportation 0.015 0.985 0.903 0.887 0.895 0.997 0.987 0.935
STEM.Biology 0.03 0.97 0.821 0.906 0.862 0.991 0.98 0.9
STEM.Chemistry 0.007 0.993 0.764 0.162 0.267 0.993 0.987 0.244
STEM.Computing 0.007 0.993 0.848 0.329 0.474 0.995 0.988 0.577
STEM.Earth and environment 0.006 0.994 0.606 0.498 0.547 0.995 0.971 0.526
STEM.Engineering 0.007 0.993 0.774 0.602 0.677 0.996 0.981 0.723
STEM.Libraries & Information 0.001 0.999 0.74 0.502 0.598 0.999 0.967 0.466
STEM.Mathematics 0.001 0.999 0.613 0.215 0.318 0.999 0.967 0.148
STEM.Medicine & Health 0.009 0.991 0.709 0.493 0.582 0.993 0.98 0.657
STEM.Physics 0.003 0.997 0.726 0.178 0.285 0.997 0.983 0.293
STEM.STEM* 0.085 0.915 0.906 0.735 0.811 0.971 0.98 0.914
STEM.Space 0.006 0.994 0.943 0.927 0.935 0.999 0.99 0.957
STEM.Technology 0.014 0.986 0.76 0.284 0.413 0.989 0.978 0.467
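
The precision, match rate, and accuracy figures cannot be reproduced from the raw test counts alone, because the test set over-samples most topics relative to their share of all articles (compare the sample and population rates above). The sketch below shows one reading that is consistent with the reported numbers: recall and the false-positive rate are taken from the counts and then reweighted by the population rate. This is an assumption checked here against the Biography row only, not a procedure stated in this card, and the function name is illustrative. It treats n as the number of test articles carrying the label, so the count that sums with the true positives to n is taken as the false negatives.

```python
def population_scaled_metrics(tp, fn, fp, tn, population_rate):
    """Reweight test-set rates by the label's assumed share of all articles."""
    recall = tp / (tp + fn)      # share of labelled articles that were found
    fpr = fp / (fp + tn)         # share of unlabelled articles that were flagged
    # Expected share of all articles the model would flag for this label.
    match_rate = recall * population_rate + fpr * (1 - population_rate)
    precision = recall * population_rate / match_rate
    accuracy = recall * population_rate + (1 - fpr) * (1 - population_rate)
    f1 = 2 * precision * recall / (precision + recall)
    return {
        "match_rate": match_rate,
        "filter_rate": 1 - match_rate,
        "recall": recall,
        "precision": precision,
        "f1": f1,
        "accuracy": accuracy,
    }


# Culture.Biography.Biography*: n = 13381 = 11913 + 1468, population rate 0.123
metrics = population_scaled_metrics(tp=11913, fn=1468, fp=713, tn=43322,
                                    population_rate=0.123)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

For this row the sketch reproduces the reported match rate, filter rate, recall, precision, F1, and accuracy to three decimals. Other rows can differ in the last digit because the population rates in the table are themselves rounded.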

Implementation

Model architecture
{
    "type": "GradientBoosting",
    "params": {
        "scale": false,
        "center": false,
        "labels": [
            "Culture.Biography.Biography*",
            "Culture.Biography.Women",
            "Culture.Food and drink",
            "Culture.Internet culture",
            "Culture.Linguistics",
            "Culture.Literature",
            "Culture.Media.Books",
            "Culture.Media.Entertainment",
            "Culture.Media.Films",
            "Culture.Media.Media*",
            "Culture.Media.Music",
            "Culture.Media.Radio",
            "Culture.Media.Software",
            "Culture.Media.Television",
            "Culture.Media.Video games",
            "Culture.Performing arts",
            "Culture.Philosophy and religion",
            "Culture.Sports",
            "Culture.Visual arts.Architecture",
            "Culture.Visual arts.Comics and Anime",
            "Culture.Visual arts.Fashion",
            "Culture.Visual arts.Visual arts*",
            "Geography.Geographical",
            "Geography.Regions.Africa.Africa*",
            "Geography.Regions.Africa.Central Africa",
            "Geography.Regions.Africa.Eastern Africa",
            "Geography.Regions.Africa.Northern Africa",
            "Geography.Regions.Africa.Southern Africa",
            "Geography.Regions.Africa.Western Africa",
            "Geography.Regions.Americas.Central America",
            "Geography.Regions.Americas.North America",
            "Geography.Regions.Americas.South America",
            "Geography.Regions.Asia.Asia*",
            "Geography.Regions.Asia.Central Asia",
            "Geography.Regions.Asia.East Asia",
            "Geography.Regions.Asia.North Asia",
            "Geography.Regions.Asia.South Asia",
            "Geography.Regions.Asia.Southeast Asia",
            "Geography.Regions.Asia.West Asia",
            "Geography.Regions.Europe.Eastern Europe",
            "Geography.Regions.Europe.Europe*",
            "Geography.Regions.Europe.Northern Europe",
            "Geography.Regions.Europe.Southern Europe",
            "Geography.Regions.Europe.Western Europe",
            "Geography.Regions.Oceania",
            "History and Society.Business and economics",
            "History and Society.Education",
            "History and Society.History",
            "History and Society.Military and warfare",
            "History and Society.Politics and government",
            "History and Society.Society",
            "History and Society.Transportation",
            "STEM.Biology",
            "STEM.Chemistry",
            "STEM.Computing",
            "STEM.Earth and environment",
            "STEM.Engineering",
            "STEM.Libraries & Information",
            "STEM.Mathematics",
            "STEM.Medicine & Health",
            "STEM.Physics",
            "STEM.STEM*",
            "STEM.Space",
            "STEM.Technology"
        ],
        "multilabel": true,
        "population_rates": null,
        "ccp_alpha": 0.0,
        "criterion": "friedman_mse",
        "init": null,
        "learning_rate": 0.1,
        "loss": "deviance",
        "max_depth": 5,
        "max_features": "log2",
        "max_leaf_nodes": null,
        "min_impurity_decrease": 0.0,
        "min_impurity_split": null,
        "min_samples_leaf": 1,
        "min_samples_split": 2,
        "min_weight_fraction_leaf": 0.0,
        "n_estimators": 150,
        "n_iter_no_change": null,
        "presort": "deprecated",
        "random_state": null,
        "subsample": 1.0,
        "tol": 0.0001,
        "validation_fraction": 0.1,
        "verbose": 0,
        "warm_start": false,
        "label_weights": {}
    }
}
Output schema
{
    "title": "Scikit learn-based classifier score with probability",
    "type": "object",
    "properties": {
        "prediction": {
            "description": "The most likely labels predicted by the estimator",
            "type": "array",
            "items": {
                "type": "string"
            }
        },
        "probability": {
            "description": "A mapping of probabilities onto each of the potential output labels",
            "type": "object",
            "properties": {
                "Culture.Biography.Biography*": {
                    "type": "number"
                },
                "Culture.Biography.Women": {
                    "type": "number"
                },
                "Culture.Food and drink": {
                    "type": "number"
                },
                "Culture.Internet culture": {
                    "type": "number"
                },
                "Culture.Linguistics": {
                    "type": "number"
                },
                "Culture.Literature": {
                    "type": "number"
                },
                "Culture.Media.Books": {
                    "type": "number"
                },
                "Culture.Media.Entertainment": {
                    "type": "number"
                },
                "Culture.Media.Films": {
                    "type": "number"
                },
                "Culture.Media.Media*": {
                    "type": "number"
                },
                "Culture.Media.Music": {
                    "type": "number"
                },
                "Culture.Media.Radio": {
                    "type": "number"
                },
                "Culture.Media.Software": {
                    "type": "number"
                },
                "Culture.Media.Television": {
                    "type": "number"
                },
                "Culture.Media.Video games": {
                    "type": "number"
                },
                "Culture.Performing arts": {
                    "type": "number"
                },
                "Culture.Philosophy and religion": {
                    "type": "number"
                },
                "Culture.Sports": {
                    "type": "number"
                },
                "Culture.Visual arts.Architecture": {
                    "type": "number"
                },
                "Culture.Visual arts.Comics and Anime": {
                    "type": "number"
                },
                "Culture.Visual arts.Fashion": {
                    "type": "number"
                },
                "Culture.Visual arts.Visual arts*": {
                    "type": "number"
                },
                "Geography.Geographical": {
                    "type": "number"
                },
                "Geography.Regions.Africa.Africa*": {
                    "type": "number"
                },
                "Geography.Regions.Africa.Central Africa": {
                    "type": "number"
                },
                "Geography.Regions.Africa.Eastern Africa": {
                    "type": "number"
                },
                "Geography.Regions.Africa.Northern Africa": {
                    "type": "number"
                },
                "Geography.Regions.Africa.Southern Africa": {
                    "type": "number"
                },
                "Geography.Regions.Africa.Western Africa": {
                    "type": "number"
                },
                "Geography.Regions.Americas.Central America": {
                    "type": "number"
                },
                "Geography.Regions.Americas.North America": {
                    "type": "number"
                },
                "Geography.Regions.Americas.South America": {
                    "type": "number"
                },
                "Geography.Regions.Asia.Asia*": {
                    "type": "number"
                },
                "Geography.Regions.Asia.Central Asia": {
                    "type": "number"
                },
                "Geography.Regions.Asia.East Asia": {
                    "type": "number"
                },
                "Geography.Regions.Asia.North Asia": {
                    "type": "number"
                },
                "Geography.Regions.Asia.South Asia": {
                    "type": "number"
                },
                "Geography.Regions.Asia.Southeast Asia": {
                    "type": "number"
                },
                "Geography.Regions.Asia.West Asia": {
                    "type": "number"
                },
                "Geography.Regions.Europe.Eastern Europe": {
                    "type": "number"
                },
                "Geography.Regions.Europe.Europe*": {
                    "type": "number"
                },
                "Geography.Regions.Europe.Northern Europe": {
                    "type": "number"
                },
                "Geography.Regions.Europe.Southern Europe": {
                    "type": "number"
                },
                "Geography.Regions.Europe.Western Europe": {
                    "type": "number"
                },
                "Geography.Regions.Oceania": {
                    "type": "number"
                },
                "History and Society.Business and economics": {
                    "type": "number"
                },
                "History and Society.Education": {
                    "type": "number"
                },
                "History and Society.History": {
                    "type": "number"
                },
                "History and Society.Military and warfare": {
                    "type": "number"
                },
                "History and Society.Politics and government": {
                    "type": "number"
                },
                "History and Society.Society": {
                    "type": "number"
                },
                "History and Society.Transportation": {
                    "type": "number"
                },
                "STEM.Biology": {
                    "type": "number"
                },
                "STEM.Chemistry": {
                    "type": "number"
                },
                "STEM.Computing": {
                    "type": "number"
                },
                "STEM.Earth and environment": {
                    "type": "number"
                },
                "STEM.Engineering": {
                    "type": "number"
                },
                "STEM.Libraries & Information": {
                    "type": "number"
                },
                "STEM.Mathematics": {
                    "type": "number"
                },
                "STEM.Medicine & Health": {
                    "type": "number"
                },
                "STEM.Physics": {
                    "type": "number"
                },
                "STEM.STEM*": {
                    "type": "number"
                },
                "STEM.Space": {
                    "type": "number"
                },
                "STEM.Technology": {
                    "type": "number"
                }
            }
        }
    }
}
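
A minimal sketch of checking a decoded score object against the schema above without a JSON Schema library. The function name `check_score` is hypothetical. The check covers only the types the schema states (`prediction` is an array of strings, `probability` maps labels to numbers) and, as an added assumption, treats both properties as required even though the schema does not say so.

```python
from numbers import Real


def check_score(score: dict) -> list:
    """Return a list of problems; empty when the score has the types the schema describes."""
    problems = []

    prediction = score.get("prediction")
    if not isinstance(prediction, list) or not all(isinstance(p, str) for p in prediction):
        problems.append("prediction must be an array of strings")

    probability = score.get("probability")
    if not isinstance(probability, dict):
        problems.append("probability must be an object")
    else:
        for label, value in probability.items():
            # bool is a subclass of int in Python, so reject it explicitly.
            if isinstance(value, bool) or not isinstance(value, Real):
                problems.append(f"probability[{label!r}] must be a number")

    return problems
```

An empty return value means the object is usable by downstream code that indexes `prediction` and `probability`; any other result lists what to inspect.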
Example input and output
Input:
https://ores.wikimedia.org/v3/scores/srwiki/1234/articletopic

Output:

{
    "srwiki": {
        "models": {
            "articletopic": {
                "version": "1.4.0"
            }
        },
        "scores": {
            "1234": {
                "articletopic": {
                    "score": {
                        "prediction": [
                            "Geography.Regions.Europe.Europe*"
                        ],
                        "probability": {
                            "Culture.Biography.Biography*": 0.11040104750780567,
                            "Culture.Biography.Women": 0.13470931374358366,
                            "Culture.Food and drink": 0.001061695465685582,
                            "Culture.Internet culture": 0.0011464255449539539,
                            "Culture.Linguistics": 0.0015135201814895013,
                            "Culture.Literature": 0.2372278579844666,
                            "Culture.Media.Books": 0.03278657586800802,
                            "Culture.Media.Entertainment": 0.013682928395192407,
                            "Culture.Media.Films": 0.0019281923818669927,
                            "Culture.Media.Media*": 0.21500396074659422,
                            "Culture.Media.Music": 0.006457968301838956,
                            "Culture.Media.Radio": 5.324926349730714e-05,
                            "Culture.Media.Software": 0.0016168087230442878,
                            "Culture.Media.Television": 0.0008242222680444264,
                            "Culture.Media.Video games": 0.00015369555726823047,
                            "Culture.Performing arts": 0.019747694393684182,
                            "Culture.Philosophy and religion": 0.022521115478092234,
                            "Culture.Sports": 0.0006649569081514491,
                            "Culture.Visual arts.Architecture": 0.0014359201662917907,
                            "Culture.Visual arts.Comics and Anime": 0.0008204187925402861,
                            "Culture.Visual arts.Fashion": 0.003749338967508995,
                            "Culture.Visual arts.Visual arts*": 0.01239033143459835,
                            "Geography.Geographical": 0.0020590554403184342,
                            "Geography.Regions.Africa.Africa*": 0.009497069772931837,
                            "Geography.Regions.Africa.Central Africa": 0.00019369006146811776,
                            "Geography.Regions.Africa.Eastern Africa": 0.0015356466692226847,
                            "Geography.Regions.Africa.Northern Africa": 0.005336316892577839,
                            "Geography.Regions.Africa.Southern Africa": 0.00033412592938185316,
                            "Geography.Regions.Africa.Western Africa": 8.839117892978597e-06,
                            "Geography.Regions.Americas.Central America": 0.005021089209446801,
                            "Geography.Regions.Americas.North America": 0.005561597081477266,
                            "Geography.Regions.Americas.South America": 0.011508840556108442,
                            "Geography.Regions.Asia.Asia*": 0.04194057214043255,
                            "Geography.Regions.Asia.Central Asia": 0.004133622146411763,
                            "Geography.Regions.Asia.East Asia": 0.0010214193696603039,
                            "Geography.Regions.Asia.North Asia": 0.006348550312943161,
                            "Geography.Regions.Asia.South Asia": 0.0021167821977180414,
                            "Geography.Regions.Asia.Southeast Asia": 0.002635200732815336,
                            "Geography.Regions.Asia.West Asia": 0.005865910117755206,
                            "Geography.Regions.Europe.Eastern Europe": 0.24693488970957647,
                            "Geography.Regions.Europe.Europe*": 0.5617995769783409,
                            "Geography.Regions.Europe.Northern Europe": 0.007852167426943286,
                            "Geography.Regions.Europe.Southern Europe": 0.14698569063311887,
                            "Geography.Regions.Europe.Western Europe": 0.030190015010303623,
                            "Geography.Regions.Oceania": 0.006497928041576333,
                            "History and Society.Business and economics": 0.006484597136562253,
                            "History and Society.Education": 0.005663187639018815,
                            "History and Society.History": 0.4909693515351464,
                            "History and Society.Military and warfare": 0.020405610349156465,
                            "History and Society.Politics and government": 0.030881580068143677,
                            "History and Society.Society": 0.1450441646414019,
                            "History and Society.Transportation": 0.0004588823814014062,
                            "STEM.Biology": 0.002932193824382193,
                            "STEM.Chemistry": 0.00022682623322946885,
                            "STEM.Computing": 0.0009886969673890645,
                            "STEM.Earth and environment": 0.0005888236822300564,
                            "STEM.Engineering": 0.0009121741948055364,
                            "STEM.Libraries & Information": 0.00021451702119693527,
                            "STEM.Mathematics": 0.0006612514517849881,
                            "STEM.Medicine & Health": 0.00045552016112214335,
                            "STEM.Physics": 0.0003670794434625032,
                            "STEM.STEM*": 0.06676165100619627,
                            "STEM.Space": 0.00010645677663643897,
                            "STEM.Technology": 0.0016211512153451234
                        }
                    }
                }
            }
        }
    }
}
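
A short sketch of reading a response shaped like the example above: it walks down to the score object and ranks labels by probability. The `top_topics` helper is illustrative, and the `response` literal is abbreviated to four of the probabilities from the example.

```python
def top_topics(response, wiki, rev_id, model="articletopic", k=3):
    """Return the predicted labels and the k most probable (label, probability) pairs."""
    # Revision IDs appear as string keys in the decoded JSON.
    score = response[wiki]["scores"][str(rev_id)][model]["score"]
    ranked = sorted(score["probability"].items(), key=lambda item: item[1], reverse=True)
    return score["prediction"], ranked[:k]


response = {
    "srwiki": {
        "models": {"articletopic": {"version": "1.4.0"}},
        "scores": {"1234": {"articletopic": {"score": {
            "prediction": ["Geography.Regions.Europe.Europe*"],
            "probability": {
                "Geography.Regions.Europe.Europe*": 0.5617995769783409,
                "History and Society.History": 0.4909693515351464,
                "Geography.Regions.Europe.Eastern Europe": 0.24693488970957647,
                "STEM.Space": 0.00010645677663643897,
            },
        }}}},
    }
}

predicted, ranked = top_topics(response, "srwiki", 1234, k=2)
print(predicted)
for label, probability in ranked:
    print(f"{label}: {probability:.3f}")
```

In the example, the only predicted label is also the only one with probability above 0.5, while the runner-up at about 0.49 is not predicted. That is consistent with a per-label cutoff near 0.5, though this card does not state the threshold, so ranking by probability is the safer way to inspect near misses.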

Data

Data pipeline
The training data was fetched for a set of revision IDs. Various pieces of information about each revision were then extracted using automated processes, and the revision text was fed into word2vec to obtain an article embedding. Finally, labels were derived from the mid-level WikiProject categories that the article is associated with.
Training data
Training data was automatically and randomly separated from test data during training using the drafttopic git repository (which trains both drafttopic and articletopic models).
Test data
Test data was automatically and randomly split off from the training data using the drafttopic git repository (which trains both drafttopic and articletopic models). The model then made predictions on that data, which were compared to the underlying ground truth to calculate performance statistics.
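
The card does not say how word vectors are combined into an article embedding or how the split is drawn. As an illustration only, the sketch below averages per-word vectors (a common way to embed a document, assumed here) and performs a seeded random train/test split. The toy `VECTORS` table, the 20% test fraction, and both function names are hypothetical and are not taken from the drafttopic repository.

```python
import random

# Toy 3-dimensional word vectors standing in for a trained word2vec model.
VECTORS = {
    "reka": [0.2, 0.1, 0.7],
    "grad": [0.6, 0.3, 0.1],
    "most": [0.4, 0.2, 0.4],
}


def article_embedding(tokens, vectors):
    """Mean of the vectors of known tokens; a zero vector when no token is known."""
    dim = len(next(iter(vectors.values())))
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return [0.0] * dim
    # zip(*known) groups the values of each dimension across all known tokens.
    return [sum(column) / len(known) for column in zip(*known)]


def random_split(items, test_fraction=0.2, seed=0):
    """Shuffle with a fixed seed, then cut off the first part as the test set."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


embedding = article_embedding(["reka", "grad", "nepoznato"], VECTORS)
train, test = random_split(range(10))
```

Unknown tokens such as `nepoznato` are skipped rather than mapped to zeros, so they do not pull the average toward the origin. The fixed seed makes the split repeatable between runs.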

Licenses

Citation

Cite this model card as:

@misc{
  Triedman_Bazira_2023_Serbian_Wikipedia_article_topic,
  title={ Serbian Wikipedia article topic model card },
  author={ Triedman, Harold and Bazira, Kevin },
  year={ 2023 },
  url={ https://meta.wikimedia.org/wiki/Machine_learning_models/Production/Serbian_Wikipedia_article_topic }
}