Talk:The need for XML re: wiktionary

From Meta, a Wikimedia project coordination wiki

Ray Saintonge wrote:

> The current format is adequate. Your proposal makes no mention of the possible sacrifices in terms of ease of editing, a key feature in all the wikis. How will flexibility of format be maintained?


XML is not meant to be edited by hand; you are absolutely right about that. However, the current format is not without its problems. At the moment an English word cannot easily be re-used in other wiktionaries, because entries are free-form text. It would be a good thing if we started thinking about creating some database structures for use within wiktionary. That would rid us of these dratted templates like (English) and Template:-en-. They work, and they are the best thing we have right now, but they are ugly.

What I propose at this time is that we start thinking about importing and exporting in an XML format, and about changes that would enhance functionality both within all the wiktionaries and towards the outside world.

One of the aims of Wikimedia is to create open content. By keeping our data in a format that only we use, we do not achieve what could be achieved.

Thanks, GerardM 11:47, 4 Sep 2004 (UTC)


Andrew Dunbar wrote:

> I disagree. But it depends on *how much* structure you want.

> Actually, if we only wanted to structure these parts it would work OK. Many other properties of words and phrases, such as part of speech, are a lot more difficult.

> I do think a dictionary requires structure which an encyclopedia does not. A very loose structure like you have described would be a benefit for Wiktionary. The problems I see are these: 1. Once we have some structure, people will push for more structure such as part of speech, not realizing how difficult that is to get right in a multilingual dictionary. 2. To work with the wiki software we would need a tool/script/routine which maps from internal XML into wiki/HTML so it can be displayed. 3. People will have to input XML, or we need a friendly interface which can take input from non-expert users and turn it into correct XML.

> Number 3 would mean a *lot* of work for developers.

Andrew (hippietrail).
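
As a rough illustration of point 2 in the quoted comment (a routine that maps internal XML to wiki markup for display), here is a minimal Python sketch. The element names (entry, language, definition, translation) and the output layout are assumptions made up for this example; no XML format has been agreed on this page.

```python
import xml.etree.ElementTree as ET

# Hypothetical entry format: every element and attribute name here is an
# assumption for illustration, not an agreed Wiktionary schema.
SAMPLE_ENTRY = """
<entry word="cat">
  <language name="English">
    <definition>A small domesticated feline.</definition>
    <translation lang="nl">kat</translation>
    <translation lang="fr">chat</translation>
  </language>
</entry>
"""


def entry_to_wikitext(xml_text):
    """Render one loosely structured XML entry as wiki markup."""
    entry = ET.fromstring(xml_text)
    lines = []
    for language in entry.findall("language"):
        # One level-2 heading per language section.
        lines.append("==%s==" % language.get("name"))
        for definition in language.findall("definition"):
            lines.append("# " + definition.text.strip())
        translations = language.findall("translation")
        if translations:
            lines.append("===Translations===")
            for item in translations:
                lines.append("* %s: [[%s]]" % (item.get("lang"), item.text.strip()))
    return "\n".join(lines)


print(entry_to_wikitext(SAMPLE_ENTRY))
```

Going in this direction (XML to wiki markup) is the easy half; point 3, turning free-form user input into correct XML, is where most of the work would be.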


I don't really understand the technical details, but the intent of what is wanted seems to be very good. There is much needless duplication on Wiktionary right now, and that is simply very bad database design. --Daniel Mayer

Wiktionary to XML converter

For anybody who's interested, I am currently experimenting with a parser which can read articles from the en.wiktionary SQL dump file and output XML.

So far not a lot has been done, as my knowledge of XML is quite rudimentary, and my knowledge of CSS and XSL, and of how well various browsers support them, is much worse.

So far I have two main tools both written in Perl:

  1. scanwiki.pl can look for various problems in an entire SQL dump. It can also dump all the article titles and all the article text in a readable format. Another function creates an article index that makes it possible to quickly gather random samples of articles.
  2. grokwiki.pl uses the index file and attempts to parse a number of randomly selected articles into XML and log certain "oddities" to various log files.
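
The index-then-sample idea behind the two tools can be sketched roughly as follows. This is Python rather than Perl, and it is not the code of scanwiki.pl or grokwiki.pl: the one-article-per-line dump layout (title, a tab, then the text) and the function names are assumptions chosen to keep the example short, and a real SQL dump is considerably messier.

```python
import io
import random


def build_index(dump):
    """Return (title, byte_offset) pairs for a seekable binary dump.

    Assumes one article per line: the title, a tab, then the article text.
    This layout is hypothetical.
    """
    index = []
    while True:
        offset = dump.tell()
        line = dump.readline()
        if not line:
            break
        title = line.split(b"\t", 1)[0].decode("utf-8")
        index.append((title, offset))
    return index


def sample_articles(dump, index, count, seed=None):
    """Pick random articles via the index, seeking straight to each one."""
    rng = random.Random(seed)
    chosen = rng.sample(index, min(count, len(index)))
    articles = []
    for title, offset in chosen:
        dump.seek(offset)
        line = dump.readline().rstrip(b"\n")
        text = line.split(b"\t", 1)[1].decode("utf-8")
        articles.append((title, text))
    return articles


# A tiny in-memory stand-in for the dump file.
dump = io.BytesIO(b"cat\t==English==\ndog\t==English==\nkat\t==Dutch==\n")
index = build_index(dump)
for title, text in sample_articles(dump, index, 2, seed=1):
    print(title, text)
```

The point of the index is that sampling never has to re-read the whole dump: each chosen article costs one seek and one read.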

At present I only handle en.wiktionary, because each wiktionary has quite a different standard for articles and a different degree of variation from that standard.

Please feel free to ask any questions or offer suggestions. — Hippietrail 14:53, 19 Nov 2004 (UTC)

Do you have a definition somewhere of the XML specification that you use? GerardM 18:50, 20 Nov 2004 (UTC)
No, that's the most trivial aspect, so it changes frequently. The hard part is the parsing. The eventual XML structure will reflect the wiki formatting and semantic structure. Headings of higher levels will create blocks nested inside those representing the lower levels. The tags for each pronunciation will be within the tags for the overall pronunciation section, etc. If you give me an example article from en.wiktionary I can post an example of the current parser output. -- Hippietrail 15:25, 26 Nov 2004 (UTC)
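
To make the nesting rule in the reply above concrete, here is a minimal Python sketch that turns wiki headings into nested XML blocks, so that a level-3 heading ends up inside the preceding level-2 block. It is not the parser being discussed: the element names (article, section, line) and the sample text are assumptions, and it ignores everything that makes real articles hard to parse.

```python
import re
import xml.etree.ElementTree as ET

# Matches wiki headings such as "==English==" or "===Noun===".
HEADING = re.compile(r"^(=+)\s*(.*?)\s*\1\s*$")

SAMPLE = """==English==
===Pronunciation===
* IPA: /kæt/
===Noun===
# A small domesticated feline.
"""


def wikitext_to_xml(title, wikitext):
    """Nest each heading's block inside the nearest shallower heading."""
    root = ET.Element("article", {"title": title})
    stack = [(0, root)]  # (heading depth, element); the root has depth 0
    for line in wikitext.splitlines():
        match = HEADING.match(line)
        if match:
            depth = len(match.group(1))
            # Close every open block at the same depth or deeper.
            while stack[-1][0] >= depth:
                stack.pop()
            section = ET.SubElement(
                stack[-1][1],
                "section",
                {"heading": match.group(2), "level": str(depth)},
            )
            stack.append((depth, section))
        elif line.strip():
            # Non-heading text is attached to the innermost open block.
            ET.SubElement(stack[-1][1], "line").text = line
    return root


root = wikitext_to_xml("cat", SAMPLE)
print(ET.tostring(root, encoding="unicode"))
```

A real converter would replace the generic section and line elements with semantic tags, for example one tag per pronunciation inside a pronunciation block, as described in the reply.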
This template points from the discussion pages of all the different proposals for a single Wiktionary database to the one page where all discussion of a single Wiktionary database is conducted, so that the discussion addresses that goal as a whole rather than each proposal separately. User:Aliter