Machine learning models/Production/Korean Wikipedia article topic

From Meta, a Wikimedia project coordination wiki


Model card
This page is an on-wiki machine learning model card.
A model card is a document about a machine learning model that seeks to answer basic questions about the model.
Model Information Hub
Model creator(s): Aaron Halfaker (User:EpochFail) and Amir Sarabadani
Model owner(s): WMF Machine Learning Team (ml@wikimediafoundation.org)
Model interface: ORES homepage
Code: drafttopic GitHub, ORES training data, and ORES model binaries
Uses PII: No
In production? Yes
Which projects? Korean Wikipedia
This model uses article text to predict the likelihood that the article belongs to a set of topics.


Motivation

How can we predict what general topic an article is in? Answering this question is useful for various analyses of Wikipedia dynamics. However, it is difficult to group a very diverse range of Wikipedia articles into coherent, consistent topics manually.

This model, part of the ORES suite of models, analyzes an article to predict its likelihood of belonging to each of a set of topics. Similar models (though not necessarily with the same performance level or topics) are deployed across about a dozen other projects. There is also a language-agnostic article topic model.

This model may be useful for high-level analyses of Wikipedia dynamics (pageviews, article quality, edit trends) and filtering articles.

Users and uses

Use this model for
  • high-level analyses of Wikipedia dynamics such as pageviews, article quality, or edit trends (e.g., how do pageview dynamics differ between the physics and biology categories?)
  • filtering to relevant articles (e.g., keeping only articles in the music category)
Don't use this model for
  • definitively establishing what topic an article pertains to
  • automated editing of articles or topics without a human in the loop
Current uses

This model is part of ORES and is generally accessible via the ORES API. It is used for high-level analysis of Wikipedia, platform research, and other on-wiki tasks.

Example API call:
https://ores.wikimedia.org/v3/scores/kowiki/1234/articletopic
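
The call above follows the pattern `/v3/scores/<wiki>/<revision ID>/<model>`. A minimal Python sketch of building and requesting such a URL is shown here; the helper names `score_url` and `fetch_score` are hypothetical, and the sketch does not check whether the endpoint is currently being served.

```python
import json
from urllib.request import urlopen

ORES_BASE = "https://ores.wikimedia.org/v3/scores"


def score_url(wiki: str, rev_id: int, model: str = "articletopic") -> str:
    """Build a scoring URL following the pattern of the example call."""
    return f"{ORES_BASE}/{wiki}/{rev_id}/{model}"


def fetch_score(wiki: str, rev_id: int, model: str = "articletopic") -> dict:
    """Request a score and return the decoded JSON body (needs network access)."""
    with urlopen(score_url(wiki, rev_id, model), timeout=30) as response:
        return json.load(response)


# Building the URL needs no network access.
print(score_url("kowiki", 1234))
```

Keeping URL construction separate from the request makes the first part easy to check offline.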

Ethical considerations, caveats, and recommendations

  • This model was trained on data that is now several years old (from mid-2020). Underlying data drift may skew model outputs.
  • This model uses word2vec embeddings as a training feature. Word2vec, like other natural language embeddings, encodes the linguistic biases of its underlying datasets along lines such as gender, race, ethnicity, and religion. Since Wikipedia has known biases in its text, this model may encode and at times reproduce those biases.
  • This model has highly variable performance across topics; consult the test statistics below for per-topic performance.

Model

Performance

Test data confusion matrix:

Label n True positive False negative False positive True negative
Culture.Biography.Biography* 15801 14235 1566 827 44113
Culture.Biography.Women 4424 3274 1150 430 55887
Culture.Food and drink 1776 1367 409 88 58877
Culture.Internet culture 3463 2729 734 221 57057
Culture.Linguistics 1688 1152 536 100 58953
Culture.Literature 5858 4323 1535 505 54378
Culture.Media.Books 1614 1169 445 121 59006
Culture.Media.Entertainment 2343 1229 1114 239 58159
Culture.Media.Films 2950 2446 504 109 57682
Culture.Media.Media* 15181 13011 2170 1546 44014
Culture.Media.Music 3247 2609 638 275 57219
Culture.Media.Radio 702 429 273 66 59973
Culture.Media.Software 2364 1802 562 377 58000
Culture.Media.Television 2480 1803 677 162 58099
Culture.Media.Video games 2305 2021 284 85 58351
Culture.Performing arts 1527 871 656 142 59072
Culture.Philosophy and religion 3856 2050 1806 334 56551
Culture.Sports 5207 4500 707 171 55363
Culture.Visual arts.Architecture 2141 1419 722 218 58382
Culture.Visual arts.Comics and Anime 2404 2041 363 153 58184
Culture.Visual arts.Fashion 1355 948 407 76 59310
Culture.Visual arts.Visual arts* 6401 4521 1880 538 53802
Geography.Geographical 3907 2541 1366 538 56296
Geography.Regions.Africa.Africa* 4310 3234 1076 243 56188
Geography.Regions.Africa.Central Africa 847 614 233 42 59852
Geography.Regions.Africa.Eastern Africa 475 336 139 77 60189
Geography.Regions.Africa.Northern Africa 1517 1089 428 111 59113
Geography.Regions.Africa.Southern Africa 690 496 194 58 59993
Geography.Regions.Africa.Western Africa 280 195 85 34 60427
Geography.Regions.Americas.Central America 1359 860 499 68 59314
Geography.Regions.Americas.North America 6585 4572 2013 1182 52974
Geography.Regions.Americas.South America 1532 1117 415 90 59119
Geography.Regions.Asia.Asia* 15107 12516 2591 1540 44094
Geography.Regions.Asia.Central Asia 1328 938 390 86 59327
Geography.Regions.Asia.East Asia 7562 6091 1471 823 52356
Geography.Regions.Asia.North Asia 1823 1372 451 197 58721
Geography.Regions.Asia.South Asia 1867 1404 463 83 58791
Geography.Regions.Asia.Southeast Asia 1913 1355 558 91 58737
Geography.Regions.Asia.West Asia 2393 1749 644 182 58166
Geography.Regions.Europe.Eastern Europe 3409 2597 812 316 57016
Geography.Regions.Europe.Europe* 13405 10882 2523 1627 45709
Geography.Regions.Europe.Northern Europe 3936 2782 1154 405 56400
Geography.Regions.Europe.Southern Europe 3303 2386 917 284 57154
Geography.Regions.Europe.Western Europe 4084 3070 1014 427 56230
Geography.Regions.Oceania 1829 1341 488 62 58850
History and Society.Business and economics 3915 2140 1775 403 56423
History and Society.Education 1907 1131 776 127 58707
History and Society.History 5526 3175 2351 660 54555
History and Society.Military and warfare 5276 3611 1665 477 54988
History and Society.Politics and government 5159 2870 2289 534 55048
History and Society.Society 6675 3126 3549 588 53478
History and Society.Transportation 3768 3259 509 167 56806
STEM.Biology 3464 2662 802 192 57085
STEM.Chemistry 1495 1150 345 135 59111
STEM.Computing 2753 2123 630 459 57529
STEM.Earth and environment 1903 1247 656 136 58702
STEM.Engineering 2710 1879 831 184 57847
STEM.Libraries & Information 753 508 245 39 59949
STEM.Mathematics 1179 875 304 97 59465
STEM.Medicine & Health 1993 1265 728 173 58575
STEM.Physics 1467 937 530 183 59091
STEM.STEM* 17902 15664 2238 1087 41752
STEM.Space 1703 1499 204 44 58994
STEM.Technology 4764 3350 1414 669 55308
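
The counts in each row can be cross-checked with a few lines of arithmetic. The sketch below assumes each row reads n, true positives, false negatives, false positives, true negatives, which is the order for which the first two counts sum to n. The variable names are illustrative.

```python
# Two rows copied from the table above, read as (n, TP, FN, FP, TN).
# Assumption: the second count after n is the false-negative count, since
# TP plus that count equals n in both rows checked here.
rows = {
    "Culture.Biography.Biography*": (15801, 14235, 1566, 827, 44113),
    "Culture.Biography.Women": (4424, 3274, 1150, 430, 55887),
}

for label, (n, tp, fn, fp, tn) in rows.items():
    total = tp + fn + fp + tn      # size of the test set
    recall = tp / (tp + fn)        # share of labelled articles that were found
    print(f"{label}: n={n}, total={total}, recall={recall:.3f}")
```

Both rows sum to 60741 articles, and the recall values round to the 0.901 and 0.74 reported in the performance table.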

Test data sample rates:

Label Sample rate Population rate
Culture.Biography.Biography* 0.26 0.123
Culture.Biography.Women 0.073 0.015
Culture.Food and drink 0.029 0.002
Culture.Internet culture 0.057 0.003
Culture.Linguistics 0.028 0.007
Culture.Literature 0.096 0.015
Culture.Media.Books 0.027 0.004
Culture.Media.Entertainment 0.039 0.004
Culture.Media.Films 0.049 0.011
Culture.Media.Media* 0.25 0.058
Culture.Media.Music 0.053 0.024
Culture.Media.Radio 0.012 0.002
Culture.Media.Software 0.039 0.001
Culture.Media.Television 0.041 0.009
Culture.Media.Video games 0.038 0.003
Culture.Performing arts 0.025 0.003
Culture.Philosophy and religion 0.063 0.011
Culture.Sports 0.086 0.071
Culture.Visual arts.Architecture 0.035 0.011
Culture.Visual arts.Comics and Anime 0.04 0.002
Culture.Visual arts.Fashion 0.022 0.001
Culture.Visual arts.Visual arts* 0.105 0.018
Geography.Geographical 0.064 0.024
Geography.Regions.Africa.Africa* 0.071 0.008
Geography.Regions.Africa.Central Africa 0.014 0.001
Geography.Regions.Africa.Eastern Africa 0.008 0
Geography.Regions.Africa.Northern Africa 0.025 0.001
Geography.Regions.Africa.Southern Africa 0.011 0.001
Geography.Regions.Africa.Western Africa 0.005 0.001
Geography.Regions.Americas.Central America 0.022 0.003
Geography.Regions.Americas.North America 0.108 0.064
Geography.Regions.Americas.South America 0.025 0.006
Geography.Regions.Asia.Asia* 0.249 0.045
Geography.Regions.Asia.Central Asia 0.022 0.001
Geography.Regions.Asia.East Asia 0.124 0.011
Geography.Regions.Asia.North Asia 0.03 0.001
Geography.Regions.Asia.South Asia 0.031 0.015
Geography.Regions.Asia.Southeast Asia 0.031 0.006
Geography.Regions.Asia.West Asia 0.039 0.011
Geography.Regions.Europe.Eastern Europe 0.056 0.013
Geography.Regions.Europe.Europe* 0.221 0.076
Geography.Regions.Europe.Northern Europe 0.065 0.031
Geography.Regions.Europe.Southern Europe 0.054 0.013
Geography.Regions.Europe.Western Europe 0.067 0.019
Geography.Regions.Oceania 0.03 0.015
History and Society.Business and economics 0.064 0.01
History and Society.Education 0.031 0.007
History and Society.History 0.091 0.011
History and Society.Military and warfare 0.087 0.014
History and Society.Politics and government 0.085 0.028
History and Society.Society 0.11 0.013
History and Society.Transportation 0.062 0.015
STEM.Biology 0.057 0.034
STEM.Chemistry 0.025 0.002
STEM.Computing 0.045 0.003
STEM.Earth and environment 0.031 0.005
STEM.Engineering 0.045 0.005
STEM.Libraries & Information 0.012 0.001
STEM.Mathematics 0.019 0
STEM.Medicine & Health 0.033 0.006
STEM.Physics 0.024 0.001
STEM.STEM* 0.295 0.069
STEM.Space 0.028 0.006
STEM.Technology 0.078 0.005
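
The first rate column appears to be the share of test articles carrying each label, while the second is the estimated share among all articles. A minimal sketch, assuming a test set of 60741 articles (the sum of the four outcome counts in a confusion-matrix row) and using illustrative names, shows how strongly the test set oversamples a label:

```python
TEST_SET_SIZE = 60741  # assumption: sum of the outcome counts in one confusion-matrix row

# label: (n from the confusion matrix, population rate from the table above)
labels = {
    "Culture.Biography.Biography*": (15801, 0.123),
    "Culture.Media.Software": (2364, 0.001),
}

for label, (n, population_rate) in labels.items():
    sample_rate = n / TEST_SET_SIZE
    oversampling = sample_rate / population_rate
    print(f"{label}: sample={sample_rate:.3f}, oversampled x{oversampling:.1f}")
```

The computed sample rates round to the 0.26 and 0.039 listed above. Population rates are given to three decimals only, so the oversampling factor for rare labels is a rough estimate.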

Test data performance:

Label Match rate Filter rate Recall Precision f1 Accuracy ROC AUC PR AUC
Culture.Biography.Biography* 0.127 0.873 0.901 0.873 0.887 0.972 0.981 0.949
Culture.Biography.Women 0.018 0.982 0.74 0.589 0.656 0.989 0.978 0.646
Culture.Food and drink 0.003 0.997 0.77 0.56 0.648 0.998 0.984 0.668
Culture.Internet culture 0.007 0.993 0.788 0.418 0.546 0.995 0.983 0.629
Culture.Linguistics 0.007 0.993 0.682 0.748 0.714 0.996 0.976 0.734
Culture.Literature 0.02 0.98 0.738 0.558 0.635 0.987 0.975 0.708
Culture.Media.Books 0.005 0.995 0.724 0.589 0.649 0.997 0.98 0.701
Culture.Media.Entertainment 0.006 0.994 0.525 0.315 0.394 0.994 0.966 0.37
Culture.Media.Films 0.011 0.989 0.829 0.824 0.826 0.996 0.982 0.804
Culture.Media.Media* 0.082 0.918 0.857 0.611 0.713 0.96 0.975 0.832
Culture.Media.Music 0.024 0.976 0.804 0.804 0.804 0.991 0.982 0.825
Culture.Media.Radio 0.002 0.998 0.611 0.546 0.576 0.998 0.964 0.424
Culture.Media.Software 0.007 0.993 0.762 0.136 0.23 0.993 0.984 0.295
Culture.Media.Television 0.009 0.991 0.727 0.699 0.713 0.995 0.979 0.716
Culture.Media.Video games 0.004 0.996 0.877 0.612 0.721 0.998 0.988 0.817
Culture.Performing arts 0.004 0.996 0.57 0.408 0.476 0.996 0.966 0.4
Culture.Philosophy and religion 0.011 0.989 0.532 0.494 0.512 0.989 0.946 0.481
Culture.Sports 0.064 0.936 0.864 0.956 0.908 0.987 0.978 0.942
Culture.Visual arts.Architecture 0.011 0.989 0.663 0.655 0.659 0.993 0.976 0.698
Culture.Visual arts.Comics and Anime 0.004 0.996 0.849 0.416 0.558 0.997 0.985 0.608
Culture.Visual arts.Fashion 0.002 0.998 0.7 0.307 0.426 0.998 0.979 0.372
Culture.Visual arts.Visual arts* 0.023 0.977 0.706 0.571 0.631 0.985 0.968 0.692
Geography.Geographical 0.025 0.975 0.65 0.624 0.637 0.983 0.971 0.662
Geography.Regions.Africa.Africa* 0.01 0.99 0.75 0.578 0.653 0.994 0.975 0.662
Geography.Regions.Africa.Central Africa 0.001 0.999 0.725 0.395 0.511 0.999 0.982 0.425
Geography.Regions.Africa.Eastern Africa 0.002 0.998 0.707 0.201 0.313 0.999 0.97 0.22
Geography.Regions.Africa.Northern Africa 0.003 0.997 0.718 0.32 0.443 0.998 0.978 0.327
Geography.Regions.Africa.Southern Africa 0.002 0.998 0.719 0.467 0.566 0.999 0.963 0.472
Geography.Regions.Africa.Western Africa 0.001 0.999 0.696 0.459 0.553 0.999 0.951 0.378
Geography.Regions.Americas.Central America 0.003 0.997 0.633 0.647 0.64 0.998 0.973 0.622
Geography.Regions.Americas.North America 0.065 0.935 0.694 0.686 0.69 0.96 0.964 0.733
Geography.Regions.Americas.South America 0.006 0.994 0.729 0.753 0.741 0.997 0.978 0.759
Geography.Regions.Asia.Asia* 0.07 0.93 0.828 0.539 0.653 0.96 0.966 0.75
Geography.Regions.Asia.Central Asia 0.002 0.998 0.706 0.297 0.418 0.998 0.979 0.336
Geography.Regions.Asia.East Asia 0.024 0.976 0.805 0.375 0.512 0.982 0.977 0.568
Geography.Regions.Asia.North Asia 0.004 0.996 0.753 0.172 0.28 0.996 0.984 0.336
Geography.Regions.Asia.South Asia 0.013 0.987 0.752 0.892 0.816 0.995 0.98 0.867
Geography.Regions.Asia.Southeast Asia 0.006 0.994 0.708 0.735 0.721 0.997 0.975 0.742
Geography.Regions.Asia.West Asia 0.011 0.989 0.731 0.721 0.726 0.994 0.978 0.726
Geography.Regions.Europe.Eastern Europe 0.015 0.985 0.762 0.642 0.697 0.992 0.98 0.738
Geography.Regions.Europe.Europe* 0.094 0.906 0.812 0.661 0.728 0.954 0.966 0.799
Geography.Regions.Europe.Northern Europe 0.029 0.971 0.707 0.758 0.731 0.984 0.974 0.787
Geography.Regions.Europe.Southern Europe 0.014 0.986 0.722 0.659 0.689 0.992 0.975 0.724
Geography.Regions.Europe.Western Europe 0.022 0.978 0.752 0.661 0.703 0.988 0.979 0.737
Geography.Regions.Oceania 0.012 0.988 0.733 0.914 0.814 0.995 0.979 0.864
History and Society.Business and economics 0.013 0.987 0.547 0.44 0.487 0.988 0.96 0.446
History and Society.Education 0.007 0.993 0.593 0.671 0.629 0.995 0.972 0.635
History and Society.History 0.018 0.982 0.575 0.345 0.431 0.984 0.953 0.438
History and Society.Military and warfare 0.018 0.982 0.684 0.532 0.598 0.987 0.971 0.665
History and Society.Politics and government 0.025 0.975 0.556 0.627 0.589 0.978 0.95 0.617
History and Society.Society 0.017 0.983 0.468 0.355 0.404 0.983 0.926 0.372
History and Society.Transportation 0.016 0.984 0.865 0.819 0.841 0.995 0.984 0.876
STEM.Biology 0.029 0.971 0.768 0.889 0.824 0.989 0.977 0.881
STEM.Chemistry 0.003 0.997 0.769 0.345 0.476 0.997 0.986 0.567
STEM.Computing 0.01 0.99 0.771 0.208 0.328 0.991 0.985 0.349
STEM.Earth and environment 0.005 0.995 0.655 0.564 0.606 0.996 0.973 0.591
STEM.Engineering 0.007 0.993 0.693 0.535 0.604 0.995 0.976 0.626
STEM.Libraries & Information 0.001 0.999 0.675 0.392 0.496 0.999 0.972 0.437
STEM.Mathematics 0.002 0.998 0.742 0.16 0.263 0.998 0.981 0.301
STEM.Medicine & Health 0.007 0.993 0.635 0.581 0.607 0.995 0.971 0.569
STEM.Physics 0.004 0.996 0.639 0.15 0.242 0.997 0.98 0.222
STEM.STEM* 0.084 0.916 0.875 0.719 0.789 0.968 0.974 0.88
STEM.Space 0.006 0.994 0.88 0.877 0.879 0.999 0.988 0.906
STEM.Technology 0.015 0.985 0.703 0.233 0.35 0.987 0.971 0.398

Implementation

Model architecture
{
    "type": "GradientBoosting",
    "params": {
        "init": null,
        "presort": "deprecated",
        "min_impurity_decrease": 0.0,
        "warm_start": false,
        "label_weights": {},
        "random_state": null,
        "min_samples_split": 2,
        "learning_rate": 0.1,
        "n_estimators": 150,
        "population_rates": null,
        "center": false,
        "criterion": "friedman_mse",
        "max_depth": 5,
        "labels": [
            "Culture.Biography.Biography*",
            "Culture.Biography.Women",
            "Culture.Food and drink",
            "Culture.Internet culture",
            "Culture.Linguistics",
            "Culture.Literature",
            "Culture.Media.Books",
            "Culture.Media.Entertainment",
            "Culture.Media.Films",
            "Culture.Media.Media*",
            "Culture.Media.Music",
            "Culture.Media.Radio",
            "Culture.Media.Software",
            "Culture.Media.Television",
            "Culture.Media.Video games",
            "Culture.Performing arts",
            "Culture.Philosophy and religion",
            "Culture.Sports",
            "Culture.Visual arts.Architecture",
            "Culture.Visual arts.Comics and Anime",
            "Culture.Visual arts.Fashion",
            "Culture.Visual arts.Visual arts*",
            "Geography.Geographical",
            "Geography.Regions.Africa.Africa*",
            "Geography.Regions.Africa.Central Africa",
            "Geography.Regions.Africa.Eastern Africa",
            "Geography.Regions.Africa.Northern Africa",
            "Geography.Regions.Africa.Southern Africa",
            "Geography.Regions.Africa.Western Africa",
            "Geography.Regions.Americas.Central America",
            "Geography.Regions.Americas.North America",
            "Geography.Regions.Americas.South America",
            "Geography.Regions.Asia.Asia*",
            "Geography.Regions.Asia.Central Asia",
            "Geography.Regions.Asia.East Asia",
            "Geography.Regions.Asia.North Asia",
            "Geography.Regions.Asia.South Asia",
            "Geography.Regions.Asia.Southeast Asia",
            "Geography.Regions.Asia.West Asia",
            "Geography.Regions.Europe.Eastern Europe",
            "Geography.Regions.Europe.Europe*",
            "Geography.Regions.Europe.Northern Europe",
            "Geography.Regions.Europe.Southern Europe",
            "Geography.Regions.Europe.Western Europe",
            "Geography.Regions.Oceania",
            "History and Society.Business and economics",
            "History and Society.Education",
            "History and Society.History",
            "History and Society.Military and warfare",
            "History and Society.Politics and government",
            "History and Society.Society",
            "History and Society.Transportation",
            "STEM.Biology",
            "STEM.Chemistry",
            "STEM.Computing",
            "STEM.Earth and environment",
            "STEM.Engineering",
            "STEM.Libraries & Information",
            "STEM.Mathematics",
            "STEM.Medicine & Health",
            "STEM.Physics",
            "STEM.STEM*",
            "STEM.Space",
            "STEM.Technology"
        ],
        "min_samples_leaf": 1,
        "validation_fraction": 0.1,
        "loss": "deviance",
        "verbose": 0,
        "max_features": "log2",
        "scale": false,
        "subsample": 1.0,
        "n_iter_no_change": null,
        "ccp_alpha": 0.0,
        "max_leaf_nodes": null,
        "min_weight_fraction_leaf": 0.0,
        "tol": 0.0001,
        "min_impurity_split": null,
        "multilabel": true
    }
}
Output schema
{
    "title": "Scikit learn-based classifier score with probability",
    "type": "object",
    "properties": {
        "prediction": {
            "description": "The most likely labels predicted by the estimator",
            "type": "array",
            "items": {
                "type": "string"
            }
        },
        "probability": {
            "description": "A mapping of probabilities onto each of the potential output labels",
            "type": "object",
            "properties": {
                "Culture.Media.Books": {
                    "type": "number"
                },
                "Geography.Regions.Asia.North Asia": {
                    "type": "number"
                },
                "Culture.Media.Films": {
                    "type": "number"
                },
                "History and Society.Business and economics": {
                    "type": "number"
                },
                "Culture.Visual arts.Architecture": {
                    "type": "number"
                },
                "STEM.STEM*": {
                    "type": "number"
                },
                "Culture.Visual arts.Visual arts*": {
                    "type": "number"
                },
                "Geography.Regions.Asia.Southeast Asia": {
                    "type": "number"
                },
                "STEM.Computing": {
                    "type": "number"
                },
                "STEM.Mathematics": {
                    "type": "number"
                },
                "History and Society.Transportation": {
                    "type": "number"
                },
                "STEM.Space": {
                    "type": "number"
                },
                "Culture.Media.Music": {
                    "type": "number"
                },
                "Culture.Media.Video games": {
                    "type": "number"
                },
                "History and Society.Politics and government": {
                    "type": "number"
                },
                "Geography.Regions.Africa.Eastern Africa": {
                    "type": "number"
                },
                "Geography.Regions.Europe.Western Europe": {
                    "type": "number"
                },
                "Culture.Sports": {
                    "type": "number"
                },
                "Geography.Regions.Africa.Africa*": {
                    "type": "number"
                },
                "Culture.Media.Software": {
                    "type": "number"
                },
                "STEM.Chemistry": {
                    "type": "number"
                },
                "Culture.Linguistics": {
                    "type": "number"
                },
                "STEM.Libraries & Information": {
                    "type": "number"
                },
                "Culture.Biography.Women": {
                    "type": "number"
                },
                "Geography.Regions.Americas.South America": {
                    "type": "number"
                },
                "Geography.Regions.Asia.West Asia": {
                    "type": "number"
                },
                "STEM.Physics": {
                    "type": "number"
                },
                "Culture.Visual arts.Fashion": {
                    "type": "number"
                },
                "Geography.Regions.Africa.Northern Africa": {
                    "type": "number"
                },
                "History and Society.Education": {
                    "type": "number"
                },
                "Geography.Regions.Americas.North America": {
                    "type": "number"
                },
                "Geography.Regions.Europe.Eastern Europe": {
                    "type": "number"
                },
                "Geography.Regions.Americas.Central America": {
                    "type": "number"
                },
                "Geography.Regions.Asia.Central Asia": {
                    "type": "number"
                },
                "Culture.Biography.Biography*": {
                    "type": "number"
                },
                "Culture.Media.Entertainment": {
                    "type": "number"
                },
                "Geography.Regions.Asia.Asia*": {
                    "type": "number"
                },
                "Geography.Geographical": {
                    "type": "number"
                },
                "Culture.Internet culture": {
                    "type": "number"
                },
                "Geography.Regions.Asia.East Asia": {
                    "type": "number"
                },
                "Geography.Regions.Europe.Southern Europe": {
                    "type": "number"
                },
                "Geography.Regions.Africa.Western Africa": {
                    "type": "number"
                },
                "History and Society.Society": {
                    "type": "number"
                },
                "Geography.Regions.Asia.South Asia": {
                    "type": "number"
                },
                "Geography.Regions.Africa.Southern Africa": {
                    "type": "number"
                },
                "STEM.Engineering": {
                    "type": "number"
                },
                "Culture.Performing arts": {
                    "type": "number"
                },
                "Culture.Media.Media*": {
                    "type": "number"
                },
                "STEM.Biology": {
                    "type": "number"
                },
                "Geography.Regions.Europe.Northern Europe": {
                    "type": "number"
                },
                "Culture.Media.Radio": {
                    "type": "number"
                },
                "Geography.Regions.Europe.Europe*": {
                    "type": "number"
                },
                "Culture.Media.Television": {
                    "type": "number"
                },
                "History and Society.Military and warfare": {
                    "type": "number"
                },
                "Geography.Regions.Oceania": {
                    "type": "number"
                },
                "Culture.Food and drink": {
                    "type": "number"
                },
                "STEM.Technology": {
                    "type": "number"
                },
                "Culture.Philosophy and religion": {
                    "type": "number"
                },
                "Geography.Regions.Africa.Central Africa": {
                    "type": "number"
                },
                "STEM.Medicine & Health": {
                    "type": "number"
                },
                "History and Society.History": {
                    "type": "number"
                },
                "STEM.Earth and environment": {
                    "type": "number"
                },
                "Culture.Visual arts.Comics and Anime": {
                    "type": "number"
                },
                "Culture.Literature": {
                    "type": "number"
                }
            }
        }
    }
}
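
A client can check a returned score against this schema before using it. The sketch below is a hand-written check rather than a full JSON Schema validator, and the function name `validate_score` is hypothetical. It is stricter than the schema in two places, both of which are assumptions: it treats both keys as required and expects probabilities to lie in [0, 1].

```python
def validate_score(score: dict) -> list[str]:
    """Return a list of problems found; an empty list means the score passed."""
    problems = []
    prediction = score.get("prediction")
    if not isinstance(prediction, list) or not all(isinstance(p, str) for p in prediction):
        problems.append("prediction must be an array of strings")
    probability = score.get("probability")
    if not isinstance(probability, dict):
        problems.append("probability must be an object")
    else:
        for label, value in probability.items():
            # bool is a subclass of int in Python, so exclude it explicitly.
            if isinstance(value, bool) or not isinstance(value, (int, float)):
                problems.append(f"probability[{label!r}] must be a number")
            elif not 0.0 <= value <= 1.0:
                problems.append(f"probability[{label!r}] is outside [0, 1]")
    return problems


ok = {"prediction": ["STEM.Computing"], "probability": {"STEM.Computing": 0.88}}
bad = {"prediction": "STEM.Computing", "probability": {"STEM.Computing": "high"}}
print(validate_score(ok))
print(validate_score(bad))
```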
Example input and output
Input:
https://ores.wikimedia.org/v3/scores/kowiki/1234/articletopic

Output:

{
    "kowiki": {
        "models": {
            "articletopic": {
                "version": "1.3.0"
            }
        },
        "scores": {
            "1234": {
                "articletopic": {
                    "score": {
                        "prediction": [
                            "Culture.Media.Software",
                            "STEM.Computing",
                            "STEM.STEM*",
                            "STEM.Technology"
                        ],
                        "probability": {
                            "Culture.Biography.Biography*": 0.0034228232602912688,
                            "Culture.Biography.Women": 0.0009212944875142359,
                            "Culture.Food and drink": 0.00016616038714377659,
                            "Culture.Internet culture": 0.2867922233340401,
                            "Culture.Linguistics": 0.002654510795031825,
                            "Culture.Literature": 0.0012054160435867756,
                            "Culture.Media.Books": 0.00036196456970389446,
                            "Culture.Media.Entertainment": 0.0006637877545336003,
                            "Culture.Media.Films": 0.0005868003736939404,
                            "Culture.Media.Media*": 0.45082914781367006,
                            "Culture.Media.Music": 0.001082295801151109,
                            "Culture.Media.Radio": 0.0011221650601441191,
                            "Culture.Media.Software": 0.6106288410598574,
                            "Culture.Media.Television": 0.0009180753067480871,
                            "Culture.Media.Video games": 0.0005296792414704426,
                            "Culture.Performing arts": 0.0004068675009092034,
                            "Culture.Philosophy and religion": 0.001721456315217673,
                            "Culture.Sports": 0.0014508720045588101,
                            "Culture.Visual arts.Architecture": 0.0027564300681295873,
                            "Culture.Visual arts.Comics and Anime": 0.00011604420903405569,
                            "Culture.Visual arts.Fashion": 0.0004264809451367489,
                            "Culture.Visual arts.Visual arts*": 0.04328520557676202,
                            "Geography.Geographical": 0.0014977017906365317,
                            "Geography.Regions.Africa.Africa*": 0.0010947982572026048,
                            "Geography.Regions.Africa.Central Africa": 0.0001076128906272681,
                            "Geography.Regions.Africa.Eastern Africa": 0.0001031875796218527,
                            "Geography.Regions.Africa.Northern Africa": 0.00011686970514296082,
                            "Geography.Regions.Africa.Southern Africa": 0.00024862773913410214,
                            "Geography.Regions.Africa.Western Africa": 2.762890992972462e-06,
                            "Geography.Regions.Americas.Central America": 0.0007010831066025544,
                            "Geography.Regions.Americas.North America": 0.091460199158989,
                            "Geography.Regions.Americas.South America": 0.00028786083315990665,
                            "Geography.Regions.Asia.Asia*": 0.0058671426735626385,
                            "Geography.Regions.Asia.Central Asia": 0.00015090794615980674,
                            "Geography.Regions.Asia.East Asia": 0.003846219220546423,
                            "Geography.Regions.Asia.North Asia": 0.0003808596131831929,
                            "Geography.Regions.Asia.South Asia": 0.001023733486019144,
                            "Geography.Regions.Asia.Southeast Asia": 0.0008401133079025592,
                            "Geography.Regions.Asia.West Asia": 0.0004181003521755211,
                            "Geography.Regions.Europe.Eastern Europe": 0.0007729484916529774,
                            "Geography.Regions.Europe.Europe*": 0.0076809813869878445,
                            "Geography.Regions.Europe.Northern Europe": 0.0015344738768241725,
                            "Geography.Regions.Europe.Southern Europe": 0.0008720581124561569,
                            "Geography.Regions.Europe.Western Europe": 0.0009963839426511923,
                            "Geography.Regions.Oceania": 0.0006521764424870279,
                            "History and Society.Business and economics": 0.194483737383016,
                            "History and Society.Education": 0.0024200094358826025,
                            "History and Society.History": 0.002813042961808109,
                            "History and Society.Military and warfare": 0.0010156644267182038,
                            "History and Society.Politics and government": 0.0031610030502964293,
                            "History and Society.Society": 0.008022766658664051,
                            "History and Society.Transportation": 0.0009338602686515991,
                            "STEM.Biology": 0.001981476985079679,
                            "STEM.Chemistry": 0.0006977183151301637,
                            "STEM.Computing": 0.8835030881350452,
                            "STEM.Earth and environment": 0.000649715914923719,
                            "STEM.Engineering": 0.0037725577248181254,
                            "STEM.Libraries & Information": 0.00031103568572905425,
                            "STEM.Mathematics": 0.006915553934201882,
                            "STEM.Medicine & Health": 0.0024120927108435735,
                            "STEM.Physics": 0.00022156409216132937,
                            "STEM.STEM*": 0.9830731017933537,
                            "STEM.Space": 0.00013360703642841977,
                            "STEM.Technology": 0.7863607248919928
                        }
                    }
                }
            }
        }
    }
}
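
In this example the `prediction` list coincides with the labels whose probability exceeds 0.5. Whether that threshold holds for every response is an assumption. A short sketch of pulling the score out of the nested response and ranking the topics follows; the function name `top_topics` is hypothetical, and the dictionary is an abbreviated copy of the output above.

```python
def top_topics(response: dict, wiki: str, rev_id: str, threshold: float = 0.5):
    """Return (label, probability) pairs at or above threshold, most likely first."""
    score = response[wiki]["scores"][rev_id]["articletopic"]["score"]
    picked = [(label, p) for label, p in score["probability"].items() if p >= threshold]
    return sorted(picked, key=lambda pair: pair[1], reverse=True)


# Abbreviated copy of the example output (most probabilities omitted).
example = {
    "kowiki": {
        "models": {"articletopic": {"version": "1.3.0"}},
        "scores": {
            "1234": {
                "articletopic": {
                    "score": {
                        "prediction": [
                            "Culture.Media.Software",
                            "STEM.Computing",
                            "STEM.STEM*",
                            "STEM.Technology",
                        ],
                        "probability": {
                            "Culture.Media.Media*": 0.45082914781367006,
                            "Culture.Media.Software": 0.6106288410598574,
                            "STEM.Computing": 0.8835030881350452,
                            "STEM.STEM*": 0.9830731017933537,
                            "STEM.Technology": 0.7863607248919928,
                        },
                    }
                }
            }
        },
    }
}

for label, p in top_topics(example, "kowiki", "1234"):
    print(f"{label}: {p:.3f}")
```

Note that the revision ID is a string key in the response, even though it is numeric in the URL.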

Data

Data pipeline
The training data was fetched for a set of revision IDs. Automated processes then extracted various pieces of information about each revision, and the revision text was fed into word2vec to produce an article embedding. Finally, labels were derived from the mid-level WikiProject categories with which the article is associated.
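
One common way to turn per-word word2vec vectors into a single article embedding is to average them. The sketch below assumes that approach and uses a tiny made-up vocabulary in place of a trained word2vec model. The pipeline's actual tokenization and aggregation may differ.

```python
def article_embedding(tokens, word_vectors, dim):
    """Average the vectors of known tokens; unknown tokens are skipped."""
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    if not known:
        return [0.0] * dim  # no known words: fall back to a zero vector
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


# Hypothetical 3-dimensional vectors standing in for a trained word2vec model.
toy_vectors = {
    "software": [0.9, 0.1, 0.0],
    "program": [0.7, 0.3, 0.0],
    "river": [0.0, 0.2, 0.8],
}

embedding = article_embedding(["software", "program", "unknownword"], toy_vectors, dim=3)
print(embedding)
```

The resulting fixed-length vector is the kind of feature a gradient boosting classifier can consume, regardless of article length.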
Training data
Training data was automatically and randomly separated from test data during training using the drafttopic git repository (which trains both drafttopic and articletopic models).
Test data
Test data was automatically and randomly split off from the training data using the drafttopic git repository (which trains both drafttopic and articletopic models). The model then makes predictions on the test data, which are compared to the underlying ground truth to calculate performance statistics.

Licenses

Citation

Cite this model card as:

@misc{
  Triedman_Bazira_2023_Korean_Wikipedia_article_topic,
  title={ Korean Wikipedia article topic model card },
  author={ Triedman, Harold and Bazira, Kevin },
  year={ 2023 },
  url={ https://meta.wikimedia.org/wiki/Machine_learning_models/Production/Korean_Wikipedia_article_topic }
}