OmegaWiki data design
This page is the working documentation for the data design of OmegaWiki.
As a convention, table names are capitalised. Conversion from the current Wiktionary data structures to the new data design is being developed as a pywikipediabot script. See its SourceForge CVS repository and the latest updates (search for polyglot or wiktionary) for details, but note that there is no documentation yet for the wiktionary script.
- 1 Gemet, the implementation of a thesaurus in a subset of OmegaWiki
- 2 Dogfood
- 3 Table and Pronunciation
- 4 Meaning
- 5 Attestations
- 6 Etymology and references to an old wiktionary
- 7 Inflection presentation
- 8 Labels
- 9 Charactersets and fonts
- 10 Big overhaul following Wikimania
- 11 Changes after Berlin
- 12 Changes after I revisited Berlin
- 13 Sign feature states
- 14 Relations and homophones
- 15 User
- 16 Issues
- 17 External links
Gemet, the implementation of a thesaurus in a subset of OmegaWiki
In the plans for OmegaWiki, we have always included the implementation of the GEMET thesaurus. For a long time we thought we should keep things simple by creating a Wikidata implementation of the existing GEMET database. We have reconsidered this; by creating this subset of the UW, we create a model that can be extended one functionality at a time.
Dogfood
OmegaWiki will use its own content for its labels. This is why you find relations from Gender, Language and Wordtype to Meaning.
Table and Pronunciation
Pronunciation can differ depending on whether it belongs to a Spelling, a Word or a Meaning, so for each pronunciation you specify the level at which the data applies.
Meaning
The table Meaning is a "technical" table. It is needed to enable a single point of reference for all kinds of functionality. When a word is given a new meaning, it will be implicitly shared by all the words that are given as a translation or a synonym.
When there are Meanings that have explicitly the same content, they are to be merged.
Attestations
Attestations are examples of actual usage of a word in a particular meaning. As attestations are used in the English Wiktionary, I have added a "RelationText" field in the table "Relation". Whether this field may be used is governed by a true/false field in WordRelation called "AllowRelationField".
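The rule for "RelationText" can be sketched as follows. This is only an illustration under assumptions: the class layout, the Python attribute names and the choice to raise an error are mine; only "Relation", "WordRelation", "RelationText" and "AllowRelationField" come from the design.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class WordRelation:
    """A relation type; allow_relation_field mirrors "AllowRelationField"."""
    name: str
    allow_relation_field: bool


@dataclass
class Relation:
    """One relation row; relation_text mirrors "RelationText"."""
    relation_type: WordRelation
    relation_text: Optional[str] = None

    def __post_init__(self) -> None:
        # Reject free text when the relation type does not allow it.
        if self.relation_text is not None and not self.relation_type.allow_relation_field:
            raise ValueError(
                f"relation type {self.relation_type.name!r} does not allow RelationText"
            )


# Hypothetical relation types, for illustration only.
attestation = WordRelation("attestation", allow_relation_field=True)
synonym = WordRelation("synonym", allow_relation_field=False)

quoted = Relation(attestation, "An example sentence quoted from a source.")
plain = Relation(synonym)
```

Creating `Relation(synonym, "some text")` would raise a `ValueError` in this sketch, because the synonym type has the flag switched off.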
Etymology and references to an old wiktionary
- I have added a file that allows you to add more than one word when defining an etymology.
- I have added a table OldWiktionaryData that allows us to save the permalink of an article that was converted from one of the old wiktionaries. By saving the permalink it is possible to check if the Wiktionary has a later change.
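The permalink check could look roughly like this. It is a sketch that assumes the saved permalink is a MediaWiki URL carrying an `oldid` revision parameter and that the latest revision id of the article is obtained elsewhere; the function names and the example URL are hypothetical.

```python
from urllib.parse import parse_qs, urlparse


def permalink_revision(permalink: str) -> int:
    """Extract the revision id from the oldid parameter of a permalink."""
    query = parse_qs(urlparse(permalink).query)
    return int(query["oldid"][0])


def has_later_change(permalink: str, latest_revision_id: int) -> bool:
    """True when the old wiktionary article changed after the saved revision."""
    return latest_revision_id > permalink_revision(permalink)


# Hypothetical permalink saved in OldWiktionaryData at conversion time.
saved = "https://en.wiktionary.org/w/index.php?title=tsunami&oldid=12345"
```

With this in place, `has_later_change(saved, 12400)` reports a newer revision, while `has_later_change(saved, 12345)` does not.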
Inflection presentation
I have added rows and columns to the Inflection table; they are to be filled in with numbers. As most inflections are currently in table-like structures, it makes sense to help the presentation in this way. This is why the table Inflectionbox has been defined. You can define a row/column either for an inflection or for a text.
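To show how numbered rows and columns could drive the presentation, here is a small sketch. It assumes 1-based row and column numbers and that each cell holds either an inflection or a plain text; the function name and the sample data are made up for the example.

```python
from typing import Dict, List, Tuple


def build_inflection_box(cells: Dict[Tuple[int, int], str]) -> List[List[str]]:
    """Lay out cells keyed by (row, column) as a rectangular grid.

    Positions without a cell are filled with an empty string.
    """
    if not cells:
        return []
    row_count = max(row for row, _ in cells)
    column_count = max(column for _, column in cells)
    return [
        [cells.get((row, column), "") for column in range(1, column_count + 1)]
        for row in range(1, row_count + 1)
    ]


# Row 1 holds header texts, row 2 holds the inflections themselves.
cells = {
    (1, 1): "singular",
    (1, 2): "plural",
    (2, 1): "mouse",
    (2, 2): "mice",
}
box = build_inflection_box(cells)
```

Here `box` is a list with one inner list per row, ready to be rendered as a table.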
Labels
I found that sometimes a word or a meaning needs an additional label. These labels can be required on many levels and they add to the richness of the information. I needed this to allow for eponyms. Eponyms are often nouns, and they should be labeled as such; being an eponym is extra information. I constructed this by creating two new tables, LabelType and Label. LabelType governs which table the labels relate to and what the name of the label is. It also defines whether you are allowed to add a text to go with the label. Label is the label itself.
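The interplay between LabelType and Label might be sketched like this. The attribute names, the idea of returning a list of problems and the sample values are assumptions of the sketch; the design only states that LabelType fixes the related table, the label name and whether a text is allowed.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class LabelType:
    name: str           # name of the label, e.g. "eponym"
    target_table: str   # table the label relates to
    allow_text: bool    # whether a text may accompany the label


@dataclass
class Label:
    label_type: LabelType
    target_table: str   # table of the labeled row
    target_id: int      # id of the labeled row (hypothetical key)
    text: Optional[str] = None


def label_problems(label: Label) -> List[str]:
    """Return the rules of the LabelType that this Label violates."""
    problems = []
    if label.target_table != label.label_type.target_table:
        problems.append("label attached to the wrong table")
    if label.text is not None and not label.label_type.allow_text:
        problems.append("text not allowed for this label type")
    return problems


# Assumed for the example: eponym labels belong on Meaning and may carry a text.
eponym = LabelType("eponym", target_table="Meaning", allow_text=True)
good = Label(eponym, "Meaning", 42, text="free text accompanying the label")
bad = Label(eponym, "Word", 7)
```

In this sketch `label_problems(good)` is empty, while `label_problems(bad)` reports that the label sits on the wrong table.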
Charactersets and fonts
Written languages are expressed in character sets; these are defined in ISO 15924. To show them on the screen, you need fonts. As many people do not have the fonts to show all the characters that are used, they need some assistance. This is why the ERD registers the character sets and fonts that can be used. Font support from the Wikimedia Foundation is about helping users with one-stop shopping.
Big overhaul following Wikimania
Wikimania was great, but it did change certain things radically. The purpose of the change is to allow for the inclusion of non-written languages.
Spelling will now be Expression
As a signed language does not have one orthography, the name Spelling could not be maintained. The record now needs to have either the "Spelling" or the "MediafileID" field filled. Which one depends on whether the language is a sign language, a written language or a spoken language.
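A possible check for this rule is sketched below. The mapping is an assumption: I take written languages to require "Spelling" and sign or spoken languages to require "MediafileID"; the design itself only says that one of the two must be filled depending on the kind of language.

```python
from typing import Optional

LANGUAGE_KINDS = ("written", "spoken", "sign")


def expression_is_valid(
    language_kind: str,
    spelling: Optional[str],
    mediafile_id: Optional[int],
) -> bool:
    """Check that the field required for this kind of language is filled."""
    if language_kind not in LANGUAGE_KINDS:
        raise ValueError(f"unknown language kind: {language_kind!r}")
    if language_kind == "written":
        # Assumption: a written language needs a non-empty Spelling.
        return bool(spelling)
    # Assumption: sign and spoken languages need a Mediafile reference.
    return mediafile_id is not None
```

For example, `expression_is_valid("sign", None, 17)` passes, while a written Expression with an empty Spelling does not.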
Pronunciation will now be Mediafile
The Mediafile may contain movies or sound clips, so the name Pronunciation no longer describes it well; "Mediafile" does it more justice. As both sounds and signs can be described, I have added a separate table for this: "Representation". "Representationtype" identifies them, for instance as IPA.
When an authorised glossary or thesaurus is included, it is important to state for which languages the authorisation is valid. When a glossary is authorised only for en, fr, de, it and nl, all other languages can be added by anyone.
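As a tiny illustration of the language rule for authorised glossaries, assuming the authorised languages are stored as a set of language codes (the function name and storage form are hypothetical):

```python
from typing import FrozenSet

# Languages for which the example glossary is authorised.
AUTHORISED_LANGUAGES: FrozenSet[str] = frozenset({"en", "fr", "de", "it", "nl"})


def open_for_everyone(language_code: str, authorised: FrozenSet[str]) -> bool:
    """True when the language falls outside the authorisation and is open to all."""
    return language_code not in authorised
```

So `open_for_everyone("es", AUTHORISED_LANGUAGES)` holds, whereas content in `"nl"` stays under the authority of the glossary.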
I saw these cool animations on how to create the strokes for Chinese characters; I now reserve room to include these as well.
Changes after Berlin
I have been to Berlin to discuss the development of OmegaWiki; this resulted in several changes.
Meaning becomes DefinedMeaning
The Meaning table has been renamed to reflect that only one Word/language defines this lemma. This means that there are a few situations to deal with: alternative definitions for the same meaning, and definitions that are not synonymous.
- Alternative definitions are fine to have; it is just the primary meaning that should be true for all translations and synonyms.
- When a translation does not share the meaning, the flag for "Endemic Meaning" should be set to off. This means that the translation can be used one way, but this needs to be done with caution.
- How the near synonymy is to be understood needs to be indicated by adding a Relation.
Improved support for Inflections
The SynTrans has been extended with the InflectionWordID field. As there can be many inflections of a word, we do want to show the MeaningText associated with the Headword. We do not want to show the inflections as being synonymous or translations because they are not. Inflections have their own translations.
Complications and timeline
Erik found that both Wikidata and OmegaWiki are more complicated than originally thought. The consequence is that we have to concentrate our efforts. The first thing is to get key parts of Wikidata committed to CVS. This will probably be to the 1.7 branch.
Changes after I revisited Berlin
I have been back to Berlin, initially to speak to Alan Melby, who is a member of the OSCAR committee that is responsible for the TBX and TMX standards. It proved to be exceedingly stimulating: not only did Alan prove a really relevant contact for UW, I was also invited to sit in on the conference. It resulted in probable cooperation with contacts in standards organisations, governments and businesses. The data design has been scrutinized by some heavy hitters, and the result is that conceptually I am even more confident that UW can serve a pent-up lexicological demand.
- I have added the field Radical to the Expression table. This is necessary for indexing Oriental languages. When the value Strokes_Y/N is set to Y it should be available for input.
- I have renamed the Word table to LexicalItem
- I have renamed the Label table to Attributes
- I have added a table called AlternateRepresentation. This is to link "Plague, Bubonic" to "bubonic plague"
- I have added a table called Pattern. This table contains info on how a LexicalItem can be found in a sentence. I am not completely sure about Patterns yet. There may be multiple ways of describing these Patterns, which would result in alternate descriptions.
- I have added a table called LexicalItem-Pattern. This table indicates where a particular LexicalItem can be found in a sentence.
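The purpose of AlternateRepresentation from the list above can be illustrated with a simple lookup. This is only a sketch: a dictionary stands in for the table, and the direction of the link (inverted form to natural form) is an assumption.

```python
from typing import Dict

# Stand-in for the AlternateRepresentation table.
ALTERNATE_REPRESENTATION: Dict[str, str] = {
    "Plague, Bubonic": "bubonic plague",
}


def linked_expression(text: str) -> str:
    """Follow the link of an alternate representation, if one is recorded."""
    return ALTERNATE_REPRESENTATION.get(text, text)
```

A search for "Plague, Bubonic" would then end up at "bubonic plague", and any expression without an alternate is returned unchanged.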
Managing translations for Wikidata and other fixed texts
This functionality is very much outside of OmegaWiki, but it is closely associated with tables that are presently part of OmegaWiki. The consequence may be that they need to become part of MediaWiki proper.
The functionality I try to model is one whereby one version of a text is translated into multiple languages. I have included the use of computer-assisted translation tools. To make efficient use of these tools you need to store and share translation memories. As these are associated with the target text, they are modelled to be in the same table.
- I have added the "Translator" field to the UW User table; this indicates that a user is willing to translate. I also added a table to indicate that a translator is willing to help on a particular project.
Localisation is not fully considered in MediaWiki ??
Wikipedia articles are typically not translated; they are new articles, often about the same subject. Many of these articles present the same data in an "infobox". In a way it does not make sense to have the same data stored in many places. Internationalization and localization are the processes that allow for presentation based on locale information.
It is unclear to me to what extent we support different locales. Given that Unicode acknowledges Wikipedia as one of the important implementers of their standard, we might adopt the Unicode Common Locale Data Repository and help out creating locale data for the CLDR for those locales that do not exist in CLDR yet. It would serve us well, because this is something that we need if we want to be a truly internationally oriented organisation.
Sign feature states
As one of the intentions of the UW is to include sign languages, there is a need to be able to retrieve the signs out of OmegaWiki. There are several ways of expressing sign languages in a written way; they are in a way closer to Chinese characters than they are to Latin script. There is also the method used in Wolfgang Georgdorf's research, where the signs are described. As Wolfgang is one of the promoters of the inclusion of sign languages in UW, it is fitting to include the possibility to describe signs in this way.
- NDH - non-dominant hand
- SOS - start of sign
- COS - course of sign
- EOS - end of sign
Relations and homophones
With homophones I learned that Relations not only exist for Meanings but also for Words. This resulted in different relations. For the first milestone with GEMET I can change the data design to make it the same. I wonder what to do. On the one hand this layout is so much clearer.
Some small improvements
In the Misspelling table I still referred to a Spelling table; the table ValidSpelling is now called ValidExpression. As a Word can get a new meaning, it should be possible to include the date for this meaning as well (in Dutch, "tsunami" got a new meaning after the recent big one).
User
Users will be given the option to indicate their proficiency in languages. This will show texts like the "Babel" templates.
Users will be able to switch off the languages they do NOT want to see.
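Both user options could be modelled along these lines. This is a sketch under assumptions: the class, its field names and the "code-level" text format are invented for the example and are not part of the design.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple


@dataclass
class UserLanguagePrefs:
    # Language code -> self-declared level (format assumed for this sketch).
    proficiency: Dict[str, str] = field(default_factory=dict)
    # Language codes the user has switched off.
    hidden: Set[str] = field(default_factory=set)


def babel_lines(prefs: UserLanguagePrefs) -> List[str]:
    """Short Babel-like texts, one per declared language, sorted by code."""
    return [f"{code}-{level}" for code, level in sorted(prefs.proficiency.items())]


def visible_translations(
    translations: List[Tuple[str, str]], prefs: UserLanguagePrefs
) -> List[Tuple[str, str]]:
    """Drop translations in languages the user does not want to see."""
    return [(code, text) for code, text in translations if code not in prefs.hidden]


prefs = UserLanguagePrefs(proficiency={"nl": "N", "en": "3"}, hidden={"fr"})
shown = visible_translations([("en", "water"), ("fr", "eau"), ("nl", "water")], prefs)
```

In this example the French translation is filtered out of `shown`, and `babel_lines(prefs)` yields one short text per declared language.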
User owned Collection
A Collection represents a thesaurus or a glossary. When a glossary is maintained by an authority and when it is vital that it is absolutely clear that content is authorised by this authority, then the Collection is Owned by a User. This user might be called "Vatican" if it represents the Roman Catholic church or "GEMET" when it is the database of European Community Ecological data.
There are interwiki links and interproject links. Interwiki links are links to the old-style Wiktionaries. The links to Commons for sound files are all within the Pronunciation file, and the interproject links are, with the exception of those to old-style Wiktionaries, to be associated on the Meaning level.
Pictures seem to be the first attribute that can go on Meaning, because they are universal.
When a word has a link to a Wiktionary, it makes sense to show the interwiki link as long as the content has not been integrated into OmegaWiki. To make that possible it is important to have a switch that can be turned on or off. A change on an old-style Wiktionary should turn on the display for all Spellings with identical spelling.
The problem is that there is not only the pronunciation, there is also the local way of pronouncing. At this moment we cannot readily record local variations, and it is not something we actively pursue. However, the way Chinese is spoken definitely differs from place to place, and this is thought to be more than just local variation.