Abstract Wikipedia/Updates/2021-02-10

Abstract Wikipedia Updates

The goal of Abstract Wikipedia is to generate natural language texts from an abstract representation of the content to be presented. To do so, we will use lexicographic data from Wikidata. Although we are still quite far from being able to generate texts, one thing we want to encourage everyone to help with is the coverage and completeness of the lexicographic data in Wikidata.

Today we want to present prototypes of two tools that could help people visualize, exemplify, and better guide our understanding of the coverage of lexicographic data in Wikidata.

Annotation interface

The first prototype is an annotation interface that allows users to annotate sentences in any language, associating each word or expression with a Lexeme from Wikidata, including picking its Form and Sense.

You can see an example in the following screenshot.

Each “word” of the sentence is annotated here with a Lexeme (the Lexeme ID L31818 is given just under the word), followed by the lemma, the language, and the part of speech. Then comes, if selected, the specific Form that is being used in context; for example, on “dignity” we see the Form ID L31818#F1, which is the singular Form of the Lexeme. Lastly comes the Sense, which is assigned Sense ID L31818#S1 and defined by a gloss.
At any time, you can remove any of the annotations, or add new annotations. Some of the options will take you directly to Wikidata. For example, if you want to add a Sense to a given Lexeme, because it has no Senses or is missing the one you need, it will take you to Wikidata and let you do that there in the normal fashion. Once added there, you can come back and select the newly added Sense.
The user interface of the prototype is a bit slow, so please give it a few seconds when you initiate an action. It should work out of the box in different languages. The Universal Language Selector is available (at the top of the page), which you can use to change the language. Note that glosses of Senses are frequently only available in the language of the Lexeme, and the UI doesn’t yet do language fallback, so if you look at English sentences with a German UI you might often find missing glosses.
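The identifiers shown in the annotations above follow a simple pattern: a Lexeme ID such as L31818, optionally followed by "#F" plus a number for a Form or "#S" plus a number for a Sense. The following is a minimal sketch of how such identifiers could be taken apart, written only from the three examples in the text; the names `LexemeRef` and `parse_ref` are hypothetical and are not part of the prototype or of any Wikidata API.

```python
import re
from typing import NamedTuple, Optional

# Assumption: the pattern is inferred from the examples in the text
# (L31818, L31818#F1, L31818#S1); other notations are not handled.
_ID_PATTERN = re.compile(r"(L\d+)(?:#([FS])(\d+))?")


class LexemeRef(NamedTuple):
    """Hypothetical container for a parsed Lexeme, Form, or Sense ID."""
    lexeme_id: str          # e.g. "L31818"
    kind: str               # "lexeme", "form", or "sense"
    index: Optional[int]    # number after F or S, None for a bare Lexeme ID


def parse_ref(text: str) -> LexemeRef:
    """Split an ID such as 'L31818#F1' into its Lexeme part and its suffix."""
    match = _ID_PATTERN.fullmatch(text)
    if match is None:
        raise ValueError(f"not a Lexeme, Form, or Sense ID: {text!r}")
    lexeme_id, marker, number = match.groups()
    if marker is None:
        return LexemeRef(lexeme_id, "lexeme", None)
    kind = "form" if marker == "F" else "sense"
    return LexemeRef(lexeme_id, kind, int(number))
```

For example, `parse_ref("L31818#S1")` yields the Lexeme ID "L31818" together with the kind "sense" and the index 1, so a Form or Sense can always be traced back to its Lexeme.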

Technologically, this is a prototype entirely implemented in JavaScript and CSS on top of a vanilla MediaWiki installation. This is likely not the best possible technical solution for such a system, but it should help to determine whether there is any user interest in the tool, for a potential reimplementation. Also, it would be a fascinating task to agree on an API which can be implemented by other groups to provide the selection of Lexemes, Senses, and Forms for input sentences. The current baseline here is extremely simple, and would not be good enough for an automated tagging system. Having this available for many sentences in many languages could provide a great corpus for training natural language understanding systems. There is a lot that could be built upon that.
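To make the idea of a shared API for selecting Lexemes, Forms, and Senses a little more concrete, here is one possible shape for it, sketched in Python. Everything in it is an assumption made for illustration: the names `Annotation`, `Annotator`, and `ExactMatchAnnotator` are hypothetical, and the exact-match lookup is not the prototype's actual baseline, which the text does not describe.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Protocol, Tuple


@dataclass(frozen=True)
class Annotation:
    """One word of the input with the IDs chosen for it (None if unknown)."""
    surface: str
    lexeme_id: Optional[str] = None
    form_id: Optional[str] = None
    sense_id: Optional[str] = None


class Annotator(Protocol):
    """Hypothetical interface that different groups could implement."""

    def annotate(self, sentence: str, language: str) -> List[Annotation]:
        ...


class ExactMatchAnnotator:
    """Naive illustration: look each word up by exact string match.

    It never picks a Sense and cannot disambiguate homographs, which is
    why a lookup like this would not suffice for automated tagging.
    """

    def __init__(self, form_index: Dict[Tuple[str, str], Tuple[str, str]]):
        # Maps (language, lowercased surface form) -> (Lexeme ID, Form ID).
        self._form_index = form_index

    def annotate(self, sentence: str, language: str) -> List[Annotation]:
        annotations = []
        for word in sentence.split():
            surface = word.strip(".,;:!?\"'()")
            if not surface:
                continue
            hit = self._form_index.get((language, surface.lower()))
            if hit is None:
                annotations.append(Annotation(surface))
            else:
                lexeme_id, form_id = hit
                annotations.append(Annotation(surface, lexeme_id, form_id))
        return annotations
```

Keeping the interface down to a single `annotate` call means a dictionary lookup, a statistical tagger, or a human-in-the-loop tool could all sit behind the same front end.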

The goal of this prototype is to make more tangible the Wikidata community's progress regarding the coverage of the lexicographical data. You can take a sentence in any written language, put it into this system, and find out how complete you can get with your annotations. It's a way to showcase and create anecdotal experience of the lexicographic data in Wikidata.

The prototype annotation interface is at: annotation.wmcloud.org.
You can discuss it here: annotation.wmcloud.org/wiki/Discussion (you will need to create a new account on that wiki).

Corpus coverage dashboard

The second prototype tool is a dashboard that shows the coverage of the data compared to the Wikipedia corpus in each of forty languages.

Last year, while in my previous role at Google Research, I co-authored a publication in which we built and published language models from the cleaned-up text of about forty Wikipedia language editions.[1] Besides the language models, we also published the raw data: this text has been cleaned up by the pre-processing system that Google uses on Wikipedia text in order to integrate the text in several of its features. So while this dataset consists of relatively clean natural language text (certainly compared to the raw wiki text), it still contains plenty of artefacts. If you know of better large-scale encyclopaedic text corpora we can use, maybe better cleaned-up versions of Wikipedia, or ones covering more languages, please let us know.

We extracted these texts from the TensorFlow models. We provide the extracted texts for download. We split the text into tokens, counted the occurrences of words, and compared how many of these tokens appear in the Forms on Lexemes of the given language in Wikidata’s lexicographic data. If this proves useful, we could move the cleaned-up text to a more permanent home.

A screenshot of the current state for English is given here.

We see how many Forms for this language are available in Wikidata, and we see how many different Forms are attested in Wikipedia (i.e., how many different words, or word types, are in the Wikipedia of the given language). The number of tokens is the total number of words in the given language corpus. Covered forms says how many of the forms in the corpus are also in Wikidata's Lexeme set, and covered tokens tells us how many of the occurrences that covers (so, if the word “time” appears 100 times in English Wikipedia, it would be counted as one covered form, but 100 covered tokens). The two pie charts visualize the coverage of forms and tokens respectively.
Finally, there is a link to the thousand most frequent forms that are not yet in Wikidata. This can help communities prioritise ramping up coverage quickly. Note though, the progress report is manual and does not automatically update. I plan to run an update from time to time for now.
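The numbers on the dashboard can be reproduced in principle with a few lines of Python. The sketch below is not the code behind the dashboard; it only illustrates the definitions given above. The tokenizer (lowercasing and splitting on word characters) is an assumption, as are the function names `tokenize` and `coverage`, and the set of Wikidata Forms is expected to be supplied by the caller in the same lowercased form.

```python
import re
from collections import Counter
from typing import Dict, Iterable, List, Set


def tokenize(text: str) -> List[str]:
    # Assumption: lowercase and keep runs of word characters. Real corpora,
    # and languages where case or whitespace behave differently, need more care.
    return re.findall(r"\w+", text.lower())


def coverage(corpus_texts: Iterable[str],
             wikidata_forms: Set[str],
             top_n: int = 1000) -> Dict[str, object]:
    """Compute the dashboard-style numbers for one language."""
    counts: Counter = Counter()
    for text in corpus_texts:
        counts.update(tokenize(text))

    # A form is "covered" if the same string is a Form in Wikidata.
    covered = {form: n for form, n in counts.items() if form in wikidata_forms}
    missing = Counter(
        {form: n for form, n in counts.items() if form not in wikidata_forms}
    )
    return {
        "forms_in_wikidata": len(wikidata_forms),
        "forms_in_corpus": len(counts),           # distinct words (types)
        "tokens": sum(counts.values()),           # all word occurrences
        "covered_forms": len(covered),            # each type counted once
        "covered_tokens": sum(covered.values()),  # each occurrence counted
        "most_frequent_missing": missing.most_common(top_n),
    }
```

With this definition, a word such as “time” that occurs 100 times in the corpus and is present in Wikidata contributes 1 to `covered_forms` and 100 to `covered_tokens`, matching the example above.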

The prototype corpus coverage dashboard is at: Wikidata:Lexicographical coverage
You can discuss it here: Wikidata talk:Lexicographical coverage

Help wanted

Both prototype tools are exactly that: prototypes, not real products. We are not committing to supporting and developing these prototypes further. At the same time, all of the code and data is of course open sourced. If anyone would like to pick up the development or maintenance of these prototypes, you would be more than welcome; please let us know (on my talk page, or via e-mail, or on the Tool ideas page).

Also, if someone likes the idea but thinks that a different implementation would be better, please move ahead with that — I am happy to support and talk with you. There is much to improve here, but we hope that these two prototypes will lead to more development of content and tools in the space of lexicographic data.

Notes

[1] Mandy Guo, Zihang Dai, Denny Vrandečić, Rami Al-Rfou: Wiki-40B: Multilingual Language Model Dataset, LREC 2020.