Metalingo is a proposal for a way of representing relational data within Wikipedia which:
- looks like, and is actually a tiny subset of, English (or any other language)
- is easy to edit by imitation
- isn't too clever
- will include a subset of simple sentence/phrase forms which might actually turn up naturally in the Wikipedia, and therefore:
- could be used to data-mine common idioms from raw Wikipedia text written by users who are unaware of its existence?
- some of the knowledge is already there
- lots of stereotypical grammar in an encyclopedia: we can mine that
- writing a 'dumb' natural language parser is easy, provided you are willing to parse only a tiny fragment of the English language and to accept a certain error rate: think "blocks worlds" and chat-bots like Eliza
- the language pattern bank could even be edited in the wiki way: think about the mis-spellings page.
- the language pattern bank could even be easily translated from one language to another, if we do it right
- would be a start towards implementing the simple ideology of Wikitax, by making the target constructs of the person DTD, spacetime DTD and ecoregion DTD easy to find.
- won't find information expressed in many quite normal complex sentences
- might encourage Basic English sentences: Henry VIII was the king of England. England is a place in Europe. Europe is a continent.... etc.
- will have false hits in some places
The underlying data representation is in general of the form
x R y
where x and y are Wikipedia links, and R is a relation. For example, the perfectly reasonable, but very simple, English sentence
should be valid metalingo for the relation triple
which in turn can be used, given some inference rules, to generate the set of triples:
Henry VII of England relation:parent-of Henry VIII of England
Henry VII of England relation:is-a male
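As a rough illustration of how an inference rule could produce the two triples above, here is a minimal Python sketch. The source relation name `relation:father-of` and the function `infer` are assumptions made for this example only; the proposal does not fix the rule set or its notation.

```python
from typing import Iterable

# A triple is (subject, relation, object), mirroring "x R y".
Triple = tuple[str, str, str]


def infer(triples: Iterable[Triple]) -> set[Triple]:
    """Apply one assumed rule: x father-of y => x parent-of y, x is-a male."""
    derived: set[Triple] = set()
    for subject, relation, obj in triples:
        # "relation:father-of" is a hypothetical relation name.
        if relation == "relation:father-of":
            derived.add((subject, "relation:parent-of", obj))
            derived.add((subject, "relation:is-a", "male"))
    return derived


facts = [("Henry VII of England", "relation:father-of", "Henry VIII of England")]
derived = infer(facts)
for triple in sorted(derived):
    print(" ".join(triple))
```

A real rule bank would presumably be data (editable in the wiki way) rather than hard-coded `if` statements; the hard-coded rule here just keeps the sketch short.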
Note: Ordinary end users should never have to see these representations. They are only intended to be seen by computer programs and expert users who want to be able to see, or hack with, the internal representation.
Given enough facts in this form, the Wikipedia could be enhanced by automatically generated genealogies, category classifications ("Physics is a field of science"), indices, and James Burke-style chains of connections.
These could also be used to generate RDF, for analysis by third-party tools, as well as RSS feeds.
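One possible shape for the RDF export, sketched in Python as N-Triples lines. The base URIs `PAGE_BASE` and `RELATION_BASE` and the function name are placeholders invented for this example, and the relation is passed without its `relation:` prefix; none of this is part of the proposal.

```python
from urllib.parse import quote

# Placeholder namespaces, assumed for illustration only.
PAGE_BASE = "http://example.org/wiki/"
RELATION_BASE = "http://example.org/relation/"


def to_ntriples_line(subject: str, relation: str, obj: str) -> str:
    """Render one x R y triple as an N-Triples statement."""
    # Page titles become URIs the usual wiki way: spaces to underscores,
    # then percent-encoding of anything unsafe.
    s = PAGE_BASE + quote(subject.replace(" ", "_"))
    p = RELATION_BASE + quote(relation)
    o = PAGE_BASE + quote(obj.replace(" ", "_"))
    return f"<{s}> <{p}> <{o}> ."


line = to_ntriples_line(
    "Henry VII of England", "parent-of", "Henry VIII of England"
)
print(line)
```

This treats both x and y as pages, which matches the "x and y are Wikipedia links" form; literal values such as dates would need a different encoding.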
It might be possible to have a "template bank" of simple language patterns, allowing this to be ported to other languages. A bit of googling seems to show some promising candidates.
The English Wikipedia metadata would be in Basic-English-subset, the French one would be in Basic-French-subset, and so on... there would be a translation/parser file for each language, with the corresponding patterns.
We could even do very simple machine translation between the Wikipedias for these statements. Remember, they should not be more complex than things like "THE BLUE CUBE IS ON TOP OF THE RED BLOCK". Indeed, SHRDLU is exactly the level of competence I'm looking for: it worked very well within its tiny world. This is not an attempt to solve the general machine translation or natural language understanding problems!
The level of complexity of the parser should be similar to that of a chatbot or an adventure game interpreter: simple pattern matching with variables.
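To give a feel for "simple pattern matching with variables", here is a small Python sketch. The convention that a single capital letter in a template is a variable, and the names `compile_template` and `match_template`, are assumptions for this example rather than anything the proposal specifies.

```python
import re
from typing import Optional


def compile_template(template: str) -> re.Pattern:
    """Turn a template such as 'X is on top of Y' into a regex.

    Assumed convention: a single capital letter is a variable; every
    other word must appear literally (case-insensitively).
    """
    parts = []
    for token in template.split():
        if len(token) == 1 and token.isupper():
            parts.append(f"(?P<{token}>.+?)")  # variable: lazy capture
        else:
            parts.append(re.escape(token))     # literal word
    # Allow flexible whitespace and an optional trailing full stop.
    return re.compile(r"\s+".join(parts) + r"\.?", re.IGNORECASE)


def match_template(template: str, sentence: str) -> Optional[dict]:
    """Return the variable bindings, or None if the sentence does not fit."""
    m = compile_template(template).fullmatch(sentence.strip())
    return m.groupdict() if m else None


print(match_template("X is a Y", "Europe is a continent."))
print(match_template("X is on top of Y",
                     "THE BLUE CUBE IS ON TOP OF THE RED BLOCK"))
```

Because the templates are plain strings, a bank of them could live on an ordinary wiki page, one per line.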
Ideally, we drag in as little linguistic technology as possible: the principle should be "50% of the goodness, for 1% of the effort".
- What is the appropriate level of power? Simple string matching? Regexps? BNF? Something more exotic?
- what would be a start on a good-enough BNF grammar and/or set of regex templates?
- do we want to drag in all of first-order logic? (No! Not yet, anyway).
- or simply negation, so we can say NOT (x R y)? (Perhaps).
- should we use "becomes, remains, equals" instead of "is" as link type semantics? See E Prime.
- "X is a Y" -> X is-a Y
- "X are Y" -> X is-a Y, X is-a property:plural-thing
- "X is a field of Y" -> X subset-of Y
- "X is a county in Y" -> X subset-of Y, X is-a county
- A problem: conceptual metaphor is common in English. Truly weird things might result, e.g. "Love is a drug" -> "Love is-a drug", or even "The door is ajar" -> "Door is-a jar". These would rapidly lead to self-contradicting links and a corrupt database. It's safer to use "becomes, remains, equals" rather than "is" for this reason.
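The four example patterns above can be tried out with a handful of regular expressions. The following Python sketch is only one way to do it: the rule ordering (most specific pattern first), the `RULES` table and the `parse` function are choices made for this example. It also shows the metaphor problem directly, since "Love is a drug" matches the plain "X is a Y" rule.

```python
import re

# Each rule pairs a regex with a function that builds the triples.
# More specific patterns come first so that "X is a county in Y" is
# not swallowed by the generic "X is a Y" rule.
RULES = [
    (re.compile(r"^(?P<x>.+?) is a county in (?P<y>.+?)\.?$"),
     lambda x, y: [(x, "subset-of", y), (x, "is-a", "county")]),
    (re.compile(r"^(?P<x>.+?) is a field of (?P<y>.+?)\.?$"),
     lambda x, y: [(x, "subset-of", y)]),
    (re.compile(r"^(?P<x>.+?) is a (?P<y>.+?)\.?$"),
     lambda x, y: [(x, "is-a", y)]),
    (re.compile(r"^(?P<x>.+?) are (?P<y>.+?)\.?$"),
     lambda x, y: [(x, "is-a", y), (x, "is-a", "property:plural-thing")]),
]


def parse(sentence: str) -> list:
    """Return the triples for the first matching rule, or an empty list."""
    for pattern, build in RULES:
        m = pattern.match(sentence.strip())
        if m:
            return build(m.group("x"), m.group("y"))
    return []


print(parse("Physics is a field of science"))
print(parse("Kent is a county in England."))
print(parse("Love is a drug"))  # a false hit: the metaphor is taken literally
```

Nothing in this sketch can tell a metaphor from a literal statement, so any filtering would have to come from the choice of patterns or from human review of the extracted triples.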