Machine learning models/Production/Arabic Wikipedia article topic

From Meta, a Wikimedia project coordination wiki


Model card
This page is an on-wiki machine learning model card.
A model card is a document about a machine learning model that seeks to answer basic questions about the model.
Model Information Hub
Model creator(s): Aaron Halfaker (User:EpochFail) and Amir Sarabadani
Model owner(s): WMF Machine Learning Team (ml@wikimediafoundation.org)
Model interface: ORES homepage
Code: drafttopic GitHub, ORES training data, and ORES model binaries
Uses PII: No
In production? Yes
Which projects? Arabic Wikipedia
This model uses article text to predict the likelihood that the article belongs to a set of topics.


Motivation

How can we predict what general topic an article is in? Answering this question is useful for various analyses of Wikipedia dynamics. However, it is difficult to group a very diverse range of Wikipedia articles into coherent, consistent topics manually.

This model, part of the ORES suite of models, analyzes an article to predict its likelihood of belonging to a set of topics. Similar models (though not necessarily with the same performance level or topics) are deployed across about a dozen other projects. There is also a language-agnostic article topic model.

This model may be useful for high-level analyses of Wikipedia dynamics (pageviews, article quality, edit trends) and filtering articles.

Users and uses

Use this model for
  • high-level analyses of Wikipedia dynamics such as pageviews, article quality, or edit trends (e.g., how do pageview dynamics differ between the physics and biology categories?)
  • filtering to relevant articles (e.g., keeping only articles in the music category)
Don't use this model for
  • definitively establishing what topic an article pertains to
  • automated editing of articles or topics without a human in the loop
Current uses

This model is part of ORES and is accessible via the ORES API. It is used for high-level analysis of Wikipedia, platform research, and other on-wiki tasks.

Example API call:
https://ores.wikimedia.org/v3/scores/arwiki/1234/articletopic
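
The call above can be sketched in Python as follows. This is a minimal illustration, not an official client: the helper names `score_url` and `fetch_score` are hypothetical, and the URL pattern is inferred from the single example call on this page.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

# Base path taken from the example API call on this page.
ORES_BASE = "https://ores.wikimedia.org/v3/scores"


def score_url(wiki, rev_id, model="articletopic"):
    """Build a scoring URL for one revision (hypothetical helper)."""
    return f"{ORES_BASE}/{quote(wiki)}/{int(rev_id)}/{quote(model)}"


def fetch_score(wiki, rev_id, model="articletopic", timeout=10.0):
    """Request the score and decode the JSON body (needs network access)."""
    with urlopen(score_url(wiki, rev_id, model), timeout=timeout) as response:
        return json.load(response)


# Building the URL does not touch the network.
print(score_url("arwiki", 1234))
```

Only `fetch_score` performs a network request, so the URL construction can be checked offline.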

Ethical considerations, caveats, and recommendations

  • This model was trained on data that is now several years old (from mid-2020). Underlying data drift may skew model outputs.
  • This model uses word2vec as a training feature. Word2vec, like other natural language embeddings, encodes the linguistic biases of its underlying datasets along lines such as gender, race, ethnicity, and religion. Since Wikipedia has known biases in its text, this model may encode and at times reproduce those biases.
  • This model's performance varies widely across topics. In the test statistics below, F1 ranges from 0.237 (STEM.Physics) to 0.904 (Culture.Sports), so consult the per-topic figures before relying on a given label.

Model

Performance

Test data confusion matrix:

Label n True positive False negative False positive True negative
Culture.Biography.Biography* 27484 26431 1053 865 35015
Culture.Biography.Women 6106 4346 1760 917 56341
Culture.Food and drink 1375 870 505 102 61887
Culture.Internet culture 3354 2676 678 226 59784
Culture.Linguistics 1435 911 524 91 61838
Culture.Literature 5701 4106 1595 596 57067
Culture.Media.Books 1581 1176 405 129 61654
Culture.Media.Entertainment 2112 980 1132 223 61029
Culture.Media.Films 2053 1574 479 115 61196
Culture.Media.Media* 14001 11521 2480 1662 47701
Culture.Media.Music 2678 1991 687 329 60357
Culture.Media.Radio 1123 463 660 96 62145
Culture.Media.Software 1989 1402 587 319 61056
Culture.Media.Television 2317 1324 993 259 60788
Culture.Media.Video games 2203 1973 230 62 61099
Culture.Performing arts 1539 735 804 155 61670
Culture.Philosophy and religion 3317 1609 1708 357 59690
Culture.Sports 8914 8162 752 385 54065
Culture.Visual arts.Architecture 1813 1141 672 186 61365
Culture.Visual arts.Comics and Anime 1876 1456 420 101 61387
Culture.Visual arts.Fashion 1360 1005 355 109 61895
Culture.Visual arts.Visual arts* 5891 4148 1743 443 57030
Geography.Geographical 3687 2614 1073 421 59256
Geography.Regions.Africa.Africa* 6437 5143 1294 431 56496
Geography.Regions.Africa.Central Africa 1126 837 289 50 62188
Geography.Regions.Africa.Eastern Africa 968 660 308 60 62336
Geography.Regions.Africa.Northern Africa 2008 1465 543 213 61143
Geography.Regions.Africa.Southern Africa 1196 912 284 49 62119
Geography.Regions.Africa.Western Africa 721 495 226 44 62599
Geography.Regions.Americas.Central America 1356 837 519 75 61933
Geography.Regions.Americas.North America 8321 5547 2774 2239 52804
Geography.Regions.Americas.South America 1643 1230 413 103 61618
Geography.Regions.Asia.Asia* 12468 10325 2143 902 49994
Geography.Regions.Asia.Central Asia 1166 808 358 69 62129
Geography.Regions.Asia.East Asia 2811 2022 789 268 60285
Geography.Regions.Asia.North Asia 2444 1725 719 217 60703
Geography.Regions.Asia.South Asia 1839 1402 437 60 61465
Geography.Regions.Asia.Southeast Asia 1594 1142 452 65 61705
Geography.Regions.Asia.West Asia 3729 2967 762 296 59339
Geography.Regions.Europe.Eastern Europe 3523 2569 954 241 59600
Geography.Regions.Europe.Europe* 13759 10696 3063 2025 47580
Geography.Regions.Europe.Northern Europe 4058 2640 1418 635 58671
Geography.Regions.Europe.Southern Europe 2814 2020 794 294 60256
Geography.Regions.Europe.Western Europe 3997 2921 1076 584 58783
Geography.Regions.Oceania 2313 1864 449 81 60970
History and Society.Business and economics 3556 1924 1632 473 59335
History and Society.Education 1988 854 1134 188 61188
History and Society.History 4321 1808 2513 642 58401
History and Society.Military and warfare 3987 2540 1447 412 58965
History and Society.Politics and government 5373 3181 2192 663 57328
History and Society.Society 4393 1070 3323 363 58608
History and Society.Transportation 3249 2777 472 153 59962
STEM.Biology 3007 2282 725 212 60145
STEM.Chemistry 1472 1003 469 159 61733
STEM.Computing 2390 1775 615 390 60584
STEM.Earth and environment 1693 962 731 178 61493
STEM.Engineering 2891 2054 837 240 60233
STEM.Libraries & Information 1184 748 436 60 62120
STEM.Mathematics 1144 739 405 63 62157
STEM.Medicine & Health 2210 1493 717 226 60928
STEM.Physics 1446 885 561 206 61712
STEM.STEM* 17887 15502 2385 1108 44369
STEM.Space 1720 1499 221 56 61588
STEM.Technology 4341 2892 1449 607 58416

Test data sample rates:

Label Sample rate Population rate
Culture.Biography.Biography* 0.434 0.12
Culture.Biography.Women 0.096 0.015
Culture.Food and drink 0.022 0.003
Culture.Internet culture 0.053 0.004
Culture.Linguistics 0.023 0.008
Culture.Literature 0.09 0.015
Culture.Media.Books 0.025 0.004
Culture.Media.Entertainment 0.033 0.004
Culture.Media.Films 0.032 0.012
Culture.Media.Media* 0.221 0.055
Culture.Media.Music 0.042 0.021
Culture.Media.Radio 0.018 0.002
Culture.Media.Software 0.031 0.001
Culture.Media.Television 0.037 0.009
Culture.Media.Video games 0.035 0.003
Culture.Performing arts 0.024 0.003
Culture.Philosophy and religion 0.052 0.01
Culture.Sports 0.141 0.06
Culture.Visual arts.Architecture 0.029 0.011
Culture.Visual arts.Comics and Anime 0.03 0.002
Culture.Visual arts.Fashion 0.021 0.001
Culture.Visual arts.Visual arts* 0.093 0.018
Geography.Geographical 0.058 0.021
Geography.Regions.Africa.Africa* 0.102 0.008
Geography.Regions.Africa.Central Africa 0.018 0.001
Geography.Regions.Africa.Eastern Africa 0.015 0.001
Geography.Regions.Africa.Northern Africa 0.032 0.001
Geography.Regions.Africa.Southern Africa 0.019 0.001
Geography.Regions.Africa.Western Africa 0.011 0.001
Geography.Regions.Americas.Central America 0.021 0.003
Geography.Regions.Americas.North America 0.131 0.063
Geography.Regions.Americas.South America 0.026 0.007
Geography.Regions.Asia.Asia* 0.197 0.052
Geography.Regions.Asia.Central Asia 0.018 0.001
Geography.Regions.Asia.East Asia 0.044 0.012
Geography.Regions.Asia.North Asia 0.039 0.006
Geography.Regions.Asia.South Asia 0.029 0.016
Geography.Regions.Asia.Southeast Asia 0.025 0.006
Geography.Regions.Asia.West Asia 0.059 0.012
Geography.Regions.Europe.Eastern Europe 0.056 0.018
Geography.Regions.Europe.Europe* 0.217 0.081
Geography.Regions.Europe.Northern Europe 0.064 0.029
Geography.Regions.Europe.Southern Europe 0.044 0.014
Geography.Regions.Europe.Western Europe 0.063 0.02
Geography.Regions.Oceania 0.037 0.016
History and Society.Business and economics 0.056 0.01
History and Society.Education 0.031 0.008
History and Society.History 0.068 0.011
History and Society.Military and warfare 0.063 0.015
History and Society.Politics and government 0.085 0.028
History and Society.Society 0.069 0.008
History and Society.Transportation 0.051 0.016
STEM.Biology 0.047 0.034
STEM.Chemistry 0.023 0.002
STEM.Computing 0.038 0.003
STEM.Earth and environment 0.027 0.005
STEM.Engineering 0.046 0.006
STEM.Libraries & Information 0.019 0.001
STEM.Mathematics 0.018 0
STEM.Medicine & Health 0.035 0.006
STEM.Physics 0.023 0.001
STEM.STEM* 0.282 0.065
STEM.Space 0.027 0.004
STEM.Technology 0.069 0.005

Test data performance:

Label Match rate Filter rate Recall Precision F1 Accuracy ROC AUC PR AUC
Culture.Biography.Biography* 0.137 0.863 0.962 0.845 0.899 0.974 0.986 0.951
Culture.Biography.Women 0.026 0.974 0.712 0.403 0.514 0.98 0.978 0.492
Culture.Food and drink 0.003 0.997 0.633 0.495 0.556 0.997 0.975 0.542
Culture.Internet culture 0.007 0.993 0.798 0.442 0.569 0.995 0.983 0.663
Culture.Linguistics 0.007 0.993 0.635 0.777 0.699 0.996 0.97 0.712
Culture.Literature 0.021 0.979 0.72 0.512 0.598 0.986 0.973 0.685
Culture.Media.Books 0.005 0.995 0.744 0.607 0.668 0.997 0.98 0.651
Culture.Media.Entertainment 0.005 0.995 0.464 0.335 0.389 0.994 0.964 0.369
Culture.Media.Films 0.011 0.989 0.767 0.826 0.795 0.995 0.981 0.821
Culture.Media.Media* 0.077 0.923 0.823 0.586 0.685 0.958 0.971 0.787
Culture.Media.Music 0.021 0.979 0.743 0.746 0.745 0.989 0.979 0.767
Culture.Media.Radio 0.002 0.998 0.412 0.383 0.397 0.997 0.966 0.32
Culture.Media.Software 0.006 0.994 0.705 0.151 0.248 0.994 0.981 0.222
Culture.Media.Television 0.009 0.991 0.571 0.546 0.559 0.992 0.974 0.593
Culture.Media.Video games 0.004 0.996 0.896 0.72 0.798 0.999 0.992 0.881
Culture.Performing arts 0.004 0.996 0.478 0.368 0.416 0.996 0.967 0.335
Culture.Philosophy and religion 0.011 0.989 0.485 0.46 0.472 0.989 0.942 0.435
Culture.Sports 0.062 0.938 0.916 0.892 0.904 0.988 0.981 0.922
Culture.Visual arts.Architecture 0.01 0.99 0.629 0.694 0.66 0.993 0.976 0.662
Culture.Visual arts.Comics and Anime 0.003 0.997 0.776 0.531 0.63 0.998 0.984 0.651
Culture.Visual arts.Fashion 0.002 0.998 0.739 0.273 0.399 0.998 0.98 0.381
Culture.Visual arts.Visual arts* 0.02 0.98 0.704 0.627 0.663 0.987 0.968 0.699
Geography.Geographical 0.022 0.978 0.709 0.684 0.696 0.987 0.975 0.746
Geography.Regions.Africa.Africa* 0.014 0.986 0.799 0.474 0.595 0.991 0.977 0.607
Geography.Regions.Africa.Central Africa 0.001 0.999 0.743 0.394 0.515 0.999 0.979 0.462
Geography.Regions.Africa.Eastern Africa 0.001 0.999 0.682 0.262 0.379 0.999 0.977 0.313
Geography.Regions.Africa.Northern Africa 0.004 0.996 0.73 0.221 0.339 0.996 0.98 0.271
Geography.Regions.Africa.Southern Africa 0.002 0.998 0.763 0.558 0.645 0.999 0.977 0.514
Geography.Regions.Africa.Western Africa 0.001 0.999 0.687 0.421 0.522 0.999 0.979 0.388
Geography.Regions.Americas.Central America 0.003 0.997 0.617 0.638 0.628 0.997 0.973 0.594
Geography.Regions.Americas.North America 0.08 0.92 0.667 0.524 0.587 0.941 0.953 0.609
Geography.Regions.Americas.South America 0.007 0.993 0.749 0.755 0.752 0.997 0.978 0.728
Geography.Regions.Asia.Asia* 0.06 0.94 0.828 0.721 0.771 0.974 0.971 0.841
Geography.Regions.Asia.Central Asia 0.002 0.998 0.693 0.332 0.449 0.999 0.977 0.354
Geography.Regions.Asia.East Asia 0.013 0.987 0.719 0.666 0.692 0.992 0.977 0.715
Geography.Regions.Asia.North Asia 0.007 0.993 0.706 0.527 0.604 0.995 0.975 0.654
Geography.Regions.Asia.South Asia 0.014 0.986 0.762 0.929 0.837 0.995 0.983 0.878
Geography.Regions.Asia.Southeast Asia 0.006 0.994 0.716 0.81 0.76 0.997 0.977 0.778
Geography.Regions.Asia.West Asia 0.014 0.986 0.796 0.654 0.718 0.993 0.981 0.743
Geography.Regions.Europe.Eastern Europe 0.017 0.983 0.729 0.771 0.75 0.991 0.975 0.796
Geography.Regions.Europe.Europe* 0.1 0.9 0.777 0.626 0.693 0.945 0.958 0.774
Geography.Regions.Europe.Northern Europe 0.029 0.971 0.651 0.644 0.647 0.979 0.966 0.696
Geography.Regions.Europe.Southern Europe 0.015 0.985 0.718 0.674 0.695 0.991 0.977 0.742
Geography.Regions.Europe.Western Europe 0.025 0.975 0.731 0.608 0.664 0.985 0.975 0.686
Geography.Regions.Oceania 0.015 0.985 0.806 0.91 0.855 0.996 0.98 0.872
History and Society.Business and economics 0.013 0.987 0.541 0.401 0.461 0.988 0.957 0.416
History and Society.Education 0.006 0.994 0.43 0.528 0.474 0.992 0.956 0.471
History and Society.History 0.015 0.985 0.418 0.297 0.348 0.983 0.936 0.273
History and Society.Military and warfare 0.017 0.983 0.637 0.587 0.611 0.988 0.967 0.632
History and Society.Politics and government 0.028 0.972 0.592 0.598 0.595 0.977 0.951 0.624
History and Society.Society 0.008 0.992 0.244 0.248 0.246 0.988 0.893 0.162
History and Society.Transportation 0.016 0.984 0.855 0.847 0.851 0.995 0.982 0.897
STEM.Biology 0.03 0.97 0.759 0.885 0.817 0.988 0.974 0.86
STEM.Chemistry 0.004 0.996 0.681 0.31 0.426 0.997 0.984 0.418
STEM.Computing 0.008 0.992 0.743 0.247 0.371 0.993 0.984 0.38
STEM.Earth and environment 0.006 0.994 0.568 0.483 0.522 0.995 0.969 0.48
STEM.Engineering 0.008 0.992 0.71 0.508 0.592 0.994 0.976 0.657
STEM.Libraries & Information 0.001 0.999 0.632 0.309 0.415 0.999 0.972 0.323
STEM.Mathematics 0.001 0.999 0.646 0.227 0.336 0.999 0.978 0.375
STEM.Medicine & Health 0.008 0.992 0.676 0.542 0.601 0.994 0.973 0.659
STEM.Physics 0.004 0.996 0.612 0.147 0.237 0.996 0.982 0.18
STEM.STEM* 0.079 0.921 0.867 0.712 0.782 0.969 0.974 0.883
STEM.Space 0.005 0.995 0.872 0.804 0.836 0.999 0.991 0.877
STEM.Technology 0.014 0.986 0.666 0.251 0.364 0.988 0.97 0.351
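
The three tables above appear to be linked through the sample and population rates. The sketch below is an inference from the arithmetic, not a documented formula: it assumes each test item is reweighted so that a label's share matches its population rate rather than its sample rate. The function name `scaled_stats` and its parameter names are hypothetical. It uses the Culture.Biography.Biography* row, where 865 is the count of test items predicted as that topic without carrying the label (the last-but-one count in its confusion-matrix row).

```python
def scaled_stats(labeled, hits, false_alarms, true_negatives,
                 sample_rate, population_rate):
    """Recompute per-label statistics under assumed population-rate weighting."""
    misses = labeled - hits  # labeled items the model did not predict

    # Assumed weights: shrink or grow each class to its population share.
    w_pos = population_rate / sample_rate
    w_neg = (1 - population_rate) / (1 - sample_rate)

    tp = hits * w_pos
    fn = misses * w_pos
    fp = false_alarms * w_neg
    tn = true_negatives * w_neg
    total = tp + fn + fp + tn

    recall = tp / (tp + fn)  # weights cancel, so this equals hits / labeled
    precision = tp / (tp + fp)
    return {
        "match_rate": (tp + fp) / total,
        "recall": recall,
        "precision": precision,
        "f1": 2 * precision * recall / (precision + recall),
        "accuracy": (tp + tn) / total,
    }


# Culture.Biography.Biography*: counts and rates copied from the tables above.
stats = scaled_stats(labeled=27484, hits=26431, false_alarms=865,
                     true_negatives=35015,
                     sample_rate=0.434, population_rate=0.12)
for name, value in stats.items():
    print(f"{name}: {value:.3f}")
```

Under these assumptions the results should land close to the reported row (recall 0.962, precision 0.845, accuracy 0.974). Small differences on other rows are expected because the published rates are rounded to three decimals.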

Implementation

Model architecture
{
    "type": "GradientBoosting",
    "params": {
        "scale": false,
        "center": false,
        "labels": [
            "Culture.Biography.Biography*",
            "Culture.Biography.Women",
            "Culture.Food and drink",
            "Culture.Internet culture",
            "Culture.Linguistics",
            "Culture.Literature",
            "Culture.Media.Books",
            "Culture.Media.Entertainment",
            "Culture.Media.Films",
            "Culture.Media.Media*",
            "Culture.Media.Music",
            "Culture.Media.Radio",
            "Culture.Media.Software",
            "Culture.Media.Television",
            "Culture.Media.Video games",
            "Culture.Performing arts",
            "Culture.Philosophy and religion",
            "Culture.Sports",
            "Culture.Visual arts.Architecture",
            "Culture.Visual arts.Comics and Anime",
            "Culture.Visual arts.Fashion",
            "Culture.Visual arts.Visual arts*",
            "Geography.Geographical",
            "Geography.Regions.Africa.Africa*",
            "Geography.Regions.Africa.Central Africa",
            "Geography.Regions.Africa.Eastern Africa",
            "Geography.Regions.Africa.Northern Africa",
            "Geography.Regions.Africa.Southern Africa",
            "Geography.Regions.Africa.Western Africa",
            "Geography.Regions.Americas.Central America",
            "Geography.Regions.Americas.North America",
            "Geography.Regions.Americas.South America",
            "Geography.Regions.Asia.Asia*",
            "Geography.Regions.Asia.Central Asia",
            "Geography.Regions.Asia.East Asia",
            "Geography.Regions.Asia.North Asia",
            "Geography.Regions.Asia.South Asia",
            "Geography.Regions.Asia.Southeast Asia",
            "Geography.Regions.Asia.West Asia",
            "Geography.Regions.Europe.Eastern Europe",
            "Geography.Regions.Europe.Europe*",
            "Geography.Regions.Europe.Northern Europe",
            "Geography.Regions.Europe.Southern Europe",
            "Geography.Regions.Europe.Western Europe",
            "Geography.Regions.Oceania",
            "History and Society.Business and economics",
            "History and Society.Education",
            "History and Society.History",
            "History and Society.Military and warfare",
            "History and Society.Politics and government",
            "History and Society.Society",
            "History and Society.Transportation",
            "STEM.Biology",
            "STEM.Chemistry",
            "STEM.Computing",
            "STEM.Earth and environment",
            "STEM.Engineering",
            "STEM.Libraries & Information",
            "STEM.Mathematics",
            "STEM.Medicine & Health",
            "STEM.Physics",
            "STEM.STEM*",
            "STEM.Space",
            "STEM.Technology"
        ],
        "multilabel": true,
        "population_rates": null,
        "ccp_alpha": 0.0,
        "criterion": "friedman_mse",
        "init": null,
        "learning_rate": 0.1,
        "loss": "deviance",
        "max_depth": 5,
        "max_features": "log2",
        "max_leaf_nodes": null,
        "min_impurity_decrease": 0.0,
        "min_impurity_split": null,
        "min_samples_leaf": 1,
        "min_samples_split": 2,
        "min_weight_fraction_leaf": 0.0,
        "n_estimators": 150,
        "n_iter_no_change": null,
        "presort": "deprecated",
        "random_state": null,
        "subsample": 1.0,
        "tol": 0.0001,
        "validation_fraction": 0.1,
        "verbose": 0,
        "warm_start": false,
        "label_weights": {}
    }
}
Output schema
{
    "title": "Scikit learn-based classifier score with probability",
    "type": "object",
    "properties": {
        "prediction": {
            "description": "The most likely labels predicted by the estimator",
            "type": "array",
            "items": {
                "type": "string"
            }
        },
        "probability": {
            "description": "A mapping of probabilities onto each of the potential output labels",
            "type": "object",
            "properties": {
                "Culture.Biography.Biography*": {
                    "type": "number"
                },
                "Culture.Biography.Women": {
                    "type": "number"
                },
                "Culture.Food and drink": {
                    "type": "number"
                },
                "Culture.Internet culture": {
                    "type": "number"
                },
                "Culture.Linguistics": {
                    "type": "number"
                },
                "Culture.Literature": {
                    "type": "number"
                },
                "Culture.Media.Books": {
                    "type": "number"
                },
                "Culture.Media.Entertainment": {
                    "type": "number"
                },
                "Culture.Media.Films": {
                    "type": "number"
                },
                "Culture.Media.Media*": {
                    "type": "number"
                },
                "Culture.Media.Music": {
                    "type": "number"
                },
                "Culture.Media.Radio": {
                    "type": "number"
                },
                "Culture.Media.Software": {
                    "type": "number"
                },
                "Culture.Media.Television": {
                    "type": "number"
                },
                "Culture.Media.Video games": {
                    "type": "number"
                },
                "Culture.Performing arts": {
                    "type": "number"
                },
                "Culture.Philosophy and religion": {
                    "type": "number"
                },
                "Culture.Sports": {
                    "type": "number"
                },
                "Culture.Visual arts.Architecture": {
                    "type": "number"
                },
                "Culture.Visual arts.Comics and Anime": {
                    "type": "number"
                },
                "Culture.Visual arts.Fashion": {
                    "type": "number"
                },
                "Culture.Visual arts.Visual arts*": {
                    "type": "number"
                },
                "Geography.Geographical": {
                    "type": "number"
                },
                "Geography.Regions.Africa.Africa*": {
                    "type": "number"
                },
                "Geography.Regions.Africa.Central Africa": {
                    "type": "number"
                },
                "Geography.Regions.Africa.Eastern Africa": {
                    "type": "number"
                },
                "Geography.Regions.Africa.Northern Africa": {
                    "type": "number"
                },
                "Geography.Regions.Africa.Southern Africa": {
                    "type": "number"
                },
                "Geography.Regions.Africa.Western Africa": {
                    "type": "number"
                },
                "Geography.Regions.Americas.Central America": {
                    "type": "number"
                },
                "Geography.Regions.Americas.North America": {
                    "type": "number"
                },
                "Geography.Regions.Americas.South America": {
                    "type": "number"
                },
                "Geography.Regions.Asia.Asia*": {
                    "type": "number"
                },
                "Geography.Regions.Asia.Central Asia": {
                    "type": "number"
                },
                "Geography.Regions.Asia.East Asia": {
                    "type": "number"
                },
                "Geography.Regions.Asia.North Asia": {
                    "type": "number"
                },
                "Geography.Regions.Asia.South Asia": {
                    "type": "number"
                },
                "Geography.Regions.Asia.Southeast Asia": {
                    "type": "number"
                },
                "Geography.Regions.Asia.West Asia": {
                    "type": "number"
                },
                "Geography.Regions.Europe.Eastern Europe": {
                    "type": "number"
                },
                "Geography.Regions.Europe.Europe*": {
                    "type": "number"
                },
                "Geography.Regions.Europe.Northern Europe": {
                    "type": "number"
                },
                "Geography.Regions.Europe.Southern Europe": {
                    "type": "number"
                },
                "Geography.Regions.Europe.Western Europe": {
                    "type": "number"
                },
                "Geography.Regions.Oceania": {
                    "type": "number"
                },
                "History and Society.Business and economics": {
                    "type": "number"
                },
                "History and Society.Education": {
                    "type": "number"
                },
                "History and Society.History": {
                    "type": "number"
                },
                "History and Society.Military and warfare": {
                    "type": "number"
                },
                "History and Society.Politics and government": {
                    "type": "number"
                },
                "History and Society.Society": {
                    "type": "number"
                },
                "History and Society.Transportation": {
                    "type": "number"
                },
                "STEM.Biology": {
                    "type": "number"
                },
                "STEM.Chemistry": {
                    "type": "number"
                },
                "STEM.Computing": {
                    "type": "number"
                },
                "STEM.Earth and environment": {
                    "type": "number"
                },
                "STEM.Engineering": {
                    "type": "number"
                },
                "STEM.Libraries & Information": {
                    "type": "number"
                },
                "STEM.Mathematics": {
                    "type": "number"
                },
                "STEM.Medicine & Health": {
                    "type": "number"
                },
                "STEM.Physics": {
                    "type": "number"
                },
                "STEM.STEM*": {
                    "type": "number"
                },
                "STEM.Space": {
                    "type": "number"
                },
                "STEM.Technology": {
                    "type": "number"
                }
            }
        }
    }
}
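
A score can be checked against the shape this schema describes with a few lines of standard-library Python. This is a rough sketch rather than a full JSON Schema validator; `validate_score` is a hypothetical helper, and the check that every predicted label also has a probability is an added assumption that the schema itself does not state.

```python
from numbers import Real


def validate_score(score):
    """Return a list of problems found in one score object (empty if none)."""
    problems = []

    prediction = score.get("prediction")
    if not isinstance(prediction, list) or not all(
        isinstance(label, str) for label in prediction
    ):
        problems.append("prediction must be an array of strings")
        prediction = []

    probability = score.get("probability")
    if not isinstance(probability, dict):
        problems.append("probability must be an object")
        probability = {}

    for label, value in probability.items():
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, Real):
            problems.append(f"probability for {label!r} must be a number")

    # Extra consistency check (assumption, not part of the schema).
    for label in prediction:
        if label not in probability:
            problems.append(f"predicted label {label!r} has no probability")

    return problems


example = {
    "prediction": ["Culture.Sports"],
    "probability": {"Culture.Sports": 0.91, "STEM.Space": 0.02},
}
print(validate_score(example))
```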
Example input and output
Input:
https://ores.wikimedia.org/v3/scores/arwiki/1234/articletopic

Output:

{
    "arwiki": {
        "models": {
            "articletopic": {
                "version": "1.3.0"
            }
        },
        "scores": {
            "1234": {
                "articletopic": {
                    "score": {
                        "prediction": [
                            "Culture.Philosophy and religion"
                        ],
                        "probability": {
                            "Culture.Biography.Biography*": 0.0018775771968361505,
                            "Culture.Biography.Women": 0.0022559171278080403,
                            "Culture.Food and drink": 0.02911020063825202,
                            "Culture.Internet culture": 0.00013396466832571897,
                            "Culture.Linguistics": 0.0002016623350064864,
                            "Culture.Literature": 0.01630206474244034,
                            "Culture.Media.Books": 0.0019739903099149016,
                            "Culture.Media.Entertainment": 0.005776333050595081,
                            "Culture.Media.Films": 0.0005357906321037945,
                            "Culture.Media.Media*": 0.01873311881808571,
                            "Culture.Media.Music": 0.0006598789501761831,
                            "Culture.Media.Radio": 0.0008711772320624666,
                            "Culture.Media.Software": 0.00045306840360182074,
                            "Culture.Media.Television": 0.004563663906527891,
                            "Culture.Media.Video games": 0.00014238944031148178,
                            "Culture.Performing arts": 0.00031857030709468275,
                            "Culture.Philosophy and religion": 0.9915133947766572,
                            "Culture.Sports": 0.0032536645394025034,
                            "Culture.Visual arts.Architecture": 0.003171696122120992,
                            "Culture.Visual arts.Comics and Anime": 0.0030041266177882604,
                            "Culture.Visual arts.Fashion": 0.008690660472179179,
                            "Culture.Visual arts.Visual arts*": 0.015999157238911713,
                            "Geography.Geographical": 0.0059701879622301715,
                            "Geography.Regions.Africa.Africa*": 0.20140713747028768,
                            "Geography.Regions.Africa.Central Africa": 0.0005659981357134575,
                            "Geography.Regions.Africa.Eastern Africa": 0.004211961235703654,
                            "Geography.Regions.Africa.Northern Africa": 0.012708751878181768,
                            "Geography.Regions.Africa.Southern Africa": 0.0006662932204457024,
                            "Geography.Regions.Africa.Western Africa": 0.0005140083680583202,
                            "Geography.Regions.Americas.Central America": 0.0016675740255396078,
                            "Geography.Regions.Americas.North America": 0.0062834206734540595,
                            "Geography.Regions.Americas.South America": 0.0006909355854571832,
                            "Geography.Regions.Asia.Asia*": 0.01339558309316005,
                            "Geography.Regions.Asia.Central Asia": 0.0006353555918342474,
                            "Geography.Regions.Asia.East Asia": 0.001418334017189674,
                            "Geography.Regions.Asia.North Asia": 0.000575145968948881,
                            "Geography.Regions.Asia.South Asia": 0.003160871781527632,
                            "Geography.Regions.Asia.Southeast Asia": 0.002178204210680102,
                            "Geography.Regions.Asia.West Asia": 0.014274257947124423,
                            "Geography.Regions.Europe.Eastern Europe": 0.0013954670948905467,
                            "Geography.Regions.Europe.Europe*": 0.011614035833480698,
                            "Geography.Regions.Europe.Northern Europe": 0.0018661121742356134,
                            "Geography.Regions.Europe.Southern Europe": 0.007193423419921128,
                            "Geography.Regions.Europe.Western Europe": 0.0034747000735290425,
                            "Geography.Regions.Oceania": 0.0014825584617553515,
                            "History and Society.Business and economics": 0.009040864250628715,
                            "History and Society.Education": 0.0034818655626133728,
                            "History and Society.History": 0.015335759410860367,
                            "History and Society.Military and warfare": 0.0030978850596784535,
                            "History and Society.Politics and government": 0.029165387410026597,
                            "History and Society.Society": 0.031041994032148346,
                            "History and Society.Transportation": 0.00022066416987785007,
                            "STEM.Biology": 0.005349680305325293,
                            "STEM.Chemistry": 0.0027381500340171645,
                            "STEM.Computing": 0.0006550726961037265,
                            "STEM.Earth and environment": 0.0008943785135649393,
                            "STEM.Engineering": 0.0013233564634350919,
                            "STEM.Libraries & Information": 7.807094663224667e-05,
                            "STEM.Mathematics": 0.002929987189174986,
                            "STEM.Medicine & Health": 0.061891530372885306,
                            "STEM.Physics": 0.0026454984695803902,
                            "STEM.STEM*": 0.098436433397069,
                            "STEM.Space": 0.001050419037453933,
                            "STEM.Technology": 0.004580541565463146
                        }
                    }
                }
            }
        }
    }
}
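
A response shaped like the example above can be reduced to a ranked list of topics as follows. This is a minimal sketch: `topics_above` is a hypothetical helper, the threshold of 0.15 is an arbitrary value chosen for illustration, and the response dictionary is trimmed to three of the labels from the example output with rounded probabilities.

```python
def topics_above(response, wiki, rev_id, model="articletopic", threshold=0.5):
    """List (label, probability) pairs at or above threshold, highest first."""
    score = response[wiki]["scores"][str(rev_id)][model]["score"]
    ranked = sorted(
        score["probability"].items(), key=lambda item: item[1], reverse=True
    )
    return [(label, p) for label, p in ranked if p >= threshold]


# Trimmed copy of the example output (three labels, rounded probabilities).
response = {
    "arwiki": {
        "models": {"articletopic": {"version": "1.3.0"}},
        "scores": {
            "1234": {
                "articletopic": {
                    "score": {
                        "prediction": ["Culture.Philosophy and religion"],
                        "probability": {
                            "Culture.Philosophy and religion": 0.9915,
                            "Geography.Regions.Africa.Africa*": 0.2014,
                            "STEM.STEM*": 0.0984,
                        },
                    }
                }
            }
        },
    }
}

for label, p in topics_above(response, "arwiki", 1234, threshold=0.15):
    print(f"{label}: {p:.3f}")
```

Working from the `probability` mapping rather than the `prediction` list lets a caller choose a cutoff suited to the analysis, which fits the guidance above that predictions are not definitive.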

Data

Data pipeline
The training data was built from a set of revision IDs. Automated processes extracted various pieces of information about each revision, and the revision text was fed into word2vec to produce an article embedding. Finally, labels were derived from the mid-level WikiProject categories that each article is associated with.
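
One common way to turn word vectors into a single article embedding is to average the vectors of the words in the text. The sketch below illustrates that idea only: averaging is an assumption about how the embedding is formed, and the function `article_embedding`, the toy two-dimensional vectors, and the tokens are all made up for the example.

```python
def article_embedding(tokens, vectors, dim):
    """Average the vectors of known tokens; return zeros if none are known."""
    known = [vectors[token] for token in tokens if token in vectors]
    if not known:
        return [0.0] * dim
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


# Toy word vectors (hypothetical; real word2vec vectors have many more dimensions).
toy_vectors = {
    "river": [1.0, 0.0],
    "city": [0.0, 1.0],
}

# "unknownword" has no vector and is skipped.
print(article_embedding(["river", "city", "unknownword"], toy_vectors, dim=2))
```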
Training data
Training data was automatically and randomly separated from test data during training using the drafttopic git repository (which trains both drafttopic and articletopic models).
Test data
Test data was automatically and randomly split off from the training data using the drafttopic git repository (which trains both drafttopic and articletopic models). The model's predictions on the test data are compared with the ground-truth labels to calculate the performance statistics.
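
A random train/test split of this kind can be sketched as follows. This is not the drafttopic repository's code: the function `split_train_test`, the 20% test fraction, and the fixed seed are assumptions chosen for illustration.

```python
import random


def split_train_test(observations, test_fraction=0.2, seed=0):
    """Shuffle a copy of the observations and split off a test portion."""
    shuffled = list(observations)
    random.Random(seed).shuffle(shuffled)  # seeded for a repeatable split
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Ten placeholder revision IDs (hypothetical).
train, test = split_train_test(range(10))
print(len(train), len(test))
```

Every observation ends up in exactly one of the two sets, so performance statistics are calculated on articles the model did not see during training.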

Licenses

Citation

Cite this model card as:

@misc{
  Triedman_Bazira_2023_Arabic_Wikipedia_article_topic,
  title={ Arabic Wikipedia article topic model card },
  author={ Triedman, Harold and Bazira, Kevin },
  year={ 2023 },
  url={ https://meta.wikimedia.org/wiki/Machine_learning_models/Production/Arabic_Wikipedia_article_topic }
}