Talk:Wikimedia Fellowships/Project Ideas/Tools for text synthesis

From Meta, a Wikimedia project coordination wiki

One model for this

I personally think that the basic idea of creating short articles containing the most important basic information would be valuable, even though I know that many people oppose that way of creating articles. They are against it mainly because of the limited volunteer interest in the areas where articles have not yet been created (the objection being that if no one is interested in creating them, no one will keep an eye on them), and the risk that those articles would then be left un-updated and filled with hoax material. Do you have a possible solution to this problem in mind?

Also, I must admit that I am a bit confused about what this project will really do. As I understand it, the project is about handling text synthesis, but could you expand a bit on what exactly will be done, in layman's terms if possible? What type of code will be produced in the end? Also, which language versions will you focus on? Perhaps it would be good to start by contacting the different communities to see whether they want this or not, so that you can get the help you need. For example, there is currently a discussion on the Swedish language version about bot-generated articles (though that discussion is about small villages), and I am one of the few who seem to support it (the discussion was started without any connection to this project, but I have informed people about this project idea). Jopparn 19:31, 11 January 2012 (UTC)

It's about how to automate the necessary grammatical rewrites of generated text. If successful, it will be possible to build advanced content templates, not only templates for infoboxes. Often there are parts of a template that can't easily be written as fixed text, even if the content is simple to write manually for someone fluent in the language. In some of those cases it might be possible to use methods from this project to automate the text generation.
One way to describe the tools would be as a kind of guided grammatical rewrite of very short phrases or single words into a new form that can be used in a text where most of the text is fixed and only some parts must be adapted; for example, changing gender, time, location, or singular to plural.
Because we don't have any such tools today, we often end up with bot stubs that are one-liners, like the stub at the Norwegian (Bokmål) Wikipedia about the Arctic Warbler [1]:
Lappsanger (vitenskapelig navn Phylloscopus borealis) er en fugl i sangerfamilien.
The Swedish article has a lot more info; for example, it says "Nordsångaren är cirka 12 cm lång." ("The Arctic Warbler is about 12 cm long."), which would be "Lappsangeren er cirka 12 cm lang." in Norwegian Bokmål. The problem is how to write a generic template to do such a transformation. We could just add -en to the name, but that would not always work. One example where it fails is when we talk about ducks: we can say "anden" (the duck) in Bokmål (one of the two major written forms of Norwegian), but usually we use "anda". In Nynorsk (the other major form of Norwegian) we always use "anda", even though we use "Lappsongaren".
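The gender dependence described here can be sketched as a simple lookup. The following is a minimal, illustrative Python sketch; the two-entry lexicon and the function name are invented for this example, and a real tool would consult a full morphological dictionary instead of a hand-made table:

```python
# A plain "add -en" rule fails because the definite singular suffix in
# Nynorsk depends on grammatical gender. This toy lexicon covers only
# the two nouns from the discussion.
LEXICON = {
    # lemma: grammatical gender ("m" = masculine, "f" = feminine)
    "lappsongar": "m",
    "toppand": "f",
}

def definite_singular_nn(lemma: str) -> str:
    """Return the Nynorsk definite singular form of a known noun."""
    gender = LEXICON[lemma]
    if gender == "f":
        return lemma + "a"   # feminine: toppand -> toppanda
    return lemma + "en"      # masculine: lappsongar -> lappsongaren

print(definite_singular_nn("lappsongar"))  # lappsongaren
print(definite_singular_nn("toppand"))     # toppanda
```

The point of the sketch is only that the suffix choice is data-driven, which is exactly the data a per-template solution would be unlikely to have.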
That is, we could write it like these two examples in Nynorsk:
Lappsongaren er omlag 12 cm lang.
Toppanda er omlag 40-45 cm lang.
These are actually two different grammatical genders: the name "lappsongar" is masculine and "toppand" is feminine. Encoding such changes in each template could lead to very complex wikicode, and the necessary data will probably not be available at all. When we write bot code we usually try to avoid the problem by using the same grammatical form everywhere. Imagine instead being able to easily change the form, as in the following (here Sg denotes grammatical singular number and marks a transform, an inflection):
{{{name}}}+Sg er omlag {{{length}}} cm lang.
This is just one possible form; perhaps it should instead use a parser function, and perhaps we need additional constructs to control the transformation.
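One way to read the `+Sg` marker above is as a post-processing pass over the expanded template text. A rough Python sketch follows; the marker syntax, the transform table, and the toy gender lexicon are assumptions made for illustration, not a worked-out design:

```python
import re

# Toy gender lexicon, as in the discussion above.
GENDER = {"lappsongar": "m", "toppand": "f"}

def to_definite_sg(word: str) -> str:
    # Feminine nouns take -a, masculine nouns -en (Nynorsk, toy rule).
    return word + ("a" if GENDER.get(word) == "f" else "en")

# Hypothetical transform table: maps a marker like "Sg" to a function
# that inflects the word it is attached to.
TRANSFORMS = {"Sg": to_definite_sg}

def expand(template: str, params: dict) -> str:
    # First substitute {{{param}}} values, then apply word+Marker transforms.
    text = re.sub(r"\{\{\{(\w+)\}\}\}",
                  lambda m: str(params[m.group(1)]), template)
    return re.sub(r"(\w+)\+(\w+)",
                  lambda m: TRANSFORMS[m.group(2)](m.group(1)), text)

print(expand("{{{name}}}+Sg er omlag {{{length}}} cm lang.",
             {"name": "toppand", "length": "40-45"}))
# toppanda er omlag 40-45 cm lang.
```

A parser-function variant would do the same substitution server-side instead of as a textual post-pass, but the data dependency (word to gender) is the same either way.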
Note that there is already a solution in mw:Manual:$wgGrammarForms, but that is a little too simple to use for anything but a few fixed cases.
The project will not focus on any specific language, except to demonstrate specific functions; it will focus on how to interact with external libraries like lttoolbox, with preexisting dictionaries, from a MediaWiki instance like Wikipedia. — Jeblad 01:04, 12 January 2012 (UTC)
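For the lttoolbox side, the usual Apertium convention is that a compiled generator reads analyses of the form `^lemma<tag1><tag2>...$` and emits surface forms (via `lt-proc -g`). A hedged sketch of what a thin wrapper could look like; the dictionary filename and the tag set used here are placeholders, not taken from any real dictionary:

```python
import subprocess

def generation_input(lemma: str, tags: list[str]) -> str:
    """Build an Apertium-style generation string, e.g. ^toppand<n><f><sg><def>$."""
    return "^" + lemma + "".join(f"<{t}>" for t in tags) + "$"

def generate_form(lemma: str, tags: list[str],
                  generator_bin: str = "nn.autogen.bin") -> str:
    """Ask lt-proc to generate a surface form.

    Requires lttoolbox to be installed; generator_bin and the exact tag
    names depend entirely on the dictionary actually used, so both are
    assumptions in this sketch.
    """
    result = subprocess.run(
        ["lt-proc", "-g", generator_bin],
        input=generation_input(lemma, tags),
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()

print(generation_input("toppand", ["n", "f", "sg", "def"]))
# ^toppand<n><f><sg><def>$
```

The interesting design question the proposal raises is where this call sits: inside a parser function, a Lua module, or an external service the wiki queries.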

Ram-man comparison

If I understand correctly, you are talking about populating underserved Wikipedia projects in minority languages with an article base of stubs by starting with some structured set of data and using an automated tool to create a set of stub articles. This sounds like what English Wikipedia's user Ram-man did with those city articles starting with demographic information. Of course this is a good idea, but I would ask what tools you will use as a base for your work. Blue Rasberry (talk) 20:38, 13 January 2012 (UTC)

It's not so much about the data itself (that's part of the Wikidata project); it's more about how to generate the text. Most of it is simply wrappers for libraries (lttoolbox, xfst, foma) that will transform words and phrases from one form into another. — Jeblad 16:23, 15 January 2012 (UTC)

The nlwp bot.

I think that it is natural and relevant to mention the nl-wikipedia animal species stub maker bot here; cf. e.g. nl:Wikipedia:Wikiproject/Dieren/Botgids. It surely does not do everything that Jeblad would like this tool to do. Still, I'm impressed by what it does do, and I think that it might be a good basis for constructing better stub maker bots.

(Side remark: I do not quite understand the reference to "minority languages" above. If it was used in its ordinary meaning, none of Swedish, Norwegian Bokmål, or Dutch is a minority language in Sweden, Norway, or the Netherlands, respectively. I doubt that Jeblad was considering the majority/minority standing within a country as a criterion for where to use the bot. If the user meant e.g. "all languages except English" or "all languages with relatively few active wp editors", it would be better to write so.)

Best, JoergenB 20:39, 30 January 2012 (UTC)

This is not really about a bot; it is more about extending the parser vocabulary to facilitate some simple morphing of words from one form to another, thereby making it possible to create on-the-fly content from the data stored in the infoboxes. A bot usually creates static wikicode from data stored in files or databases and uploads the text to Wikipedia. This is especially useful for small languages like Northern Sami in Norway. By developing a small set of advanced stub templates, a large set of articles could be seeded from infoboxes in the Norwegian (Bokmål) Wikipedia or a future Wikidata project. This is probably the most interesting part of the project, but it is not the only use of the concept. It is also possible to create smart templates that adapt to values; it's like the language-dependent word conversions [2] on steroids. — Jeblad 14:47, 31 January 2012 (UTC)
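The seeding idea in this reply could look roughly like the pipeline below: take structured values (from an infobox or, later, Wikidata), pick a language-specific sentence pattern, and run the inflection step over it. Everything in this sketch (the field names, the pattern, the toy inflection rule) is invented for illustration:

```python
# Hypothetical infobox record for a bird species article.
infobox = {"name": "toppand", "length_cm": "40-45"}

# Invented stub pattern for Nynorsk; {name_def} is the definite singular.
PATTERN = "{name_def} er omlag {length_cm} cm lang."

# Toy gender lexicon from the discussion above.
GENDER = {"toppand": "f", "lappsongar": "m"}

def definite_sg(lemma: str) -> str:
    # Feminine -a, masculine -en (toy rule; a real system would use a
    # morphological dictionary or a library such as lttoolbox).
    return lemma + ("a" if GENDER.get(lemma) == "f" else "en")

def make_stub(record: dict) -> str:
    """Render one stub sentence from an infobox-like record."""
    return PATTERN.format(
        name_def=definite_sg(record["name"]).capitalize(),
        length_cm=record["length_cm"],
    )

print(make_stub(infobox))  # Toppanda er omlag 40-45 cm lang.
```

The difference from a classic bot is exactly the one the reply points out: with the inflection step available in the parser, the text can be produced on page render from infobox data instead of being uploaded as static wikicode.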

Link to the thread at "Bybrunnen"

There is a discussion at the Swedish Bybrunnen (village pump): Substubbar om orter med ett hundratal invånare - vill vi ha dem och hur bör de då skapas? ("Sub-stubs about localities with about a hundred inhabitants - do we want them, and if so, how should they be created?"). Jeblad 20:31, 19 February 2012 (UTC)

wikidata

Are you familiar with the Wikidata project that is being started up in WMDE? Seems related. -- phoebe | talk 23:49, 11 March 2012 (UTC)

Yes, I know about it! :) — Jeblad 16:50, 13 March 2012 (UTC)

Unified Wiktionary API

Hi!

You may be interested in taking a look at the following link, which seems somewhat related:

Helder 19:07, 13 March 2012 (UTC)

Timing and Connection with WMDE and Wikidata

Great idea, but it seems like this project would be most likely to succeed if coordinated with Wikidata. Because the Wikidata project is just getting started, and because that work is primarily being led by WMDE, the Fellowship Program believes it could be too early to begin work on a project like this. It also may make more sense to think about applying for a grant via WMF or WMDE to complete this work, rather than as a Wikimedia Fellowship, if coordination with the WMDE Wikidata team would be of most value (I expect they would be more helpful to advise on this project than WMF staff at this point). The Fellowships Program will not be funding this fellowship idea at this time, but I do encourage you to continue to develop your ideas and reapply in the future! Siko Bouterse (WMF) (talk) 18:40, 30 March 2012 (UTC)

It would not be possible for me to do any full-time follow-up for now, but perhaps in the future. The proposal does not depend on Wikidata, but that project makes this proposal more important. Perhaps it could be reworked when more is known about the Wikidata project, and/or if anyone with better skills in natural language processing wants to do a follow-up. — Jeblad 06:39, 31 March 2012 (UTC)
I'm archiving the idea for now, but anyone interested in developing it further would be welcome to reopen it in a future open call for fellows. Thanks! Siko Bouterse (WMF) (talk) 05:51, 31 May 2012 (UTC)