Talk:Abstract Wikipedia/Archive 3

From Meta, a Wikimedia project coordination wiki
Jump to navigation Jump to search
Archive This is an archive of past discussions. Do not edit the contents of this page. If you wish to start a new discussion or revive an old one, please do so on the current talk page.

I don't know if a potential logo has been discussed, but I propose the Wikipedia Logo with Pictograms replacing the various characters from different languages. I might mock one up if I feel like it. If you have any ideas, or want to inform me there already is a logo, please ping me below. Cheers, WikiMacaroons (talk) 18:00, 26 July 2020 (UTC)

The logo discussion has not started yet. We're going to have the naming discussion first, and once the name is decided, the logo will follow up. Feel free to start a subpage for the logo, maybe here if you want, to collect ideas. --DVrandecic (WMF) (talk) 01:46, 5 August 2020 (UTC)
Thanks, DVrandecic (WMF), perhaps I shall :) WikiMacaroons (talk) 21:57, 7 August 2020 (UTC)
Note that the existing logo of wikimedia Commons nearly has the underlying concept of composition/fusion to generate a result that can be borrowed and imported to other projects (not just Wikipedia, just like the wiki of functions).
We could derive the logo of the future wiki of functions, from the logo for Wikidata (barcodes) placed at top with arrows pointing to a central point of processing, and then arrows outgoing from it to the set of Wikimedia projects (placed at bottom, represented by the tri-sector ring, reduced to an half circle at bottom. If you look beside at the logo for Meta, the top red segment of ring would be the Wikidata barcode, the two blue segments are kept, the green Earth is replaced by a central green point; the arrows are drawn inside the ring, with the same colors as the segments they are collected to. I would avoid any logo that represents a mathematical concept (such as the lambda letter, unless it is part of the chosen name, or parentheses representing classic maths functions: the functions in the new wiki will not really be mathematical functions, as they are polymorphic, and don't necessarily return a result but could return errors, or a "variant type" to represent actual values or errors returned). verdy_p (talk) 09:36, 29 October 2020 (UTC)
Note that the formal request is still not done, but a draft page allows you to prepare suggestions. This draft.Logo subpage needs to be updated before it is actually translated: I added a first proposal there. verdy_p (talk) 09:20, 17 November 2020 (UTC)

Some questions and concerns

Depending on the eventual implementation details, Abstract Wikipedia may need support from the WMF Search Platform team. So a couple of us from our team had a conversation with Denny about Abstract Wikipedia, and we had some questions and concerns. Denny suggested sharing them in a more public forum, so here we are! I've added some sub-headings to give it a little more structure for this forum. In no particular order...

Grammar/renderer development vs writing articles, and who will do the work

Each language renderer, in order to be reasonably complete, could be a significant fraction of the work required to just translate or write the "core" articles in that language, and the work can only be done by a much smaller group of people (linguists or grammarians of the language, not just any speaker). With a "medium-sized" language like Guarani (~5M speakers, but only ~30 active users on the Guarani Wikipedia in the last month), it could be hard to find interested people with the technical and linguistic skills necessary. Though, as Denny pointed out, perhaps graduate students in relevant fields would be interested in developing a grammar—though I'm still worried about how large the pool of people with the requisite skills is.

I share this concern, but I think we need to think of it as a two-edged sword. Maybe people will be more interested in contributing to both grammar and content than to either one alone. I certainly hope that this distinction will become very blurred; our goal is content and our interest in grammar is primarily to support the instantiation of content in a particular natural language (which, you know, is what people actually do all the time). We need to downplay the "technical and linguistic skills" by focusing on what works. People love fixing bad grammar or poor word choices (it's a parent–child thing), so perhaps there are two separate challenges here: how to get a rendering process that produces intelligible content versus one that produces correctly phrased content. Native speakers will certainly have a key role in identifying incorrectly phrased content; they may even be tempted to fix a problem they understand. They won't necessarily be existing editors, however; they may even have been intrigued by the slightly quirky Wikipedia idiolect they've been hearing about! Ultimately, though, their community will need the maturity to allow the more difficult linguistic problems they identify to percolate upwards or outwards to more specialist contributors, and these may be non-native grammarians, programmers, contributors of any stripe.
What about the initial "intelligible" renderers? We need to explore this as we develop language-neutral content. We will see certain very common content frameworks, which should already be apparent from Wikidata properties. So we will be asking whether there is a general (default) way to express a instance of (P31), for example, how it varies according to the predicate and (somewhat more problematically) the subject. We will also observe how certain Wikidata statements are linguistically subordinate (being implied or assumed). So, <person> is (notable for) <role in activity>, rather than instance of (P31) human (Q5)... To the extent that such observations are somewhat universal, they serve as a useful foundation for each successive renderer: how does the new language follow the rules and exceptions derived for previous languages (specifically for the language-neutral content so far developed; we never need a complete grammar of any language except, perhaps, the language-neutral synthetic language that inevitably emerges as we proceed).
Who will do the work? Anyone and everyone! Involvement from native speakers would be a pre-requisite for developing any new renderer, but the native speakers will be supported by an enthusiastic band of experienced linguistic problem-solvers, who (will by then) have already contributed to the limited success of several renderers for an increasing quantity of high-quality language-neutral content. --GrounderUK (talk) 12:59, 25 August 2020 (UTC)
@GrounderUK: "People love fixing bad grammar or poor word choices (it's a parent–child thing), so perhaps there are two separate challenges here: how to get a renderer that produces intelligible content versus one that produces correctly phrased content. Native speakers will certainly have a key role in identifying incorrectly phrased content; they may even be tempted to fix a problem they understand. They won't necessarily be existing editors, however; they may even have been intrigued by the slightly quirky Wikipedia idiolect they've been hearing about!" This is a great notion, and hopefully it will be one the project eventually benefits from. --Chris.Cooley (talk) 00:04, 29 August 2020 (UTC)
@Chris.Cooley: Thanks, Chris, one can but hope! Denny's crowdsourcing ideas (below) are a promising start. Just for the record, your quoting me prompted me to tweak the original; I've changed "a renderer" to "a rendering process" (in my original but not in your quoting of it).
@TJones (WMF): One workflow we have been brainstorming was to:
  1. crowdsource the necessary lexical knowledge, which can be done without particular expertise beyond language knowledge
  2. crowdsource how sentences for specific simple constructors would look like (from, e.g. bilingual speakers, who are shown simple sentences in another language, e.g. Spanish, and then asked to write it down in Guarani)
  3. then even non-Guarani speakers could try to actually build renderers, using the language input as test sentences
  4. now verify the renderers again with Guarani speakers for more examples, and gather feedback
It would be crucial that the end result would allow Guarani speakers to easily mark up issues (as GrounderUK points out), so that these can be addressed. But this is one possible workflow that would omit the necessity to have deep linguistic and coding language available in each language community, and can spread the workload in a way that could help with filling that gap.
This workflow would be build on top of Abstract Wikipedia and not require much changes to the core work. Does this sound reasonable or entirely crazy?
One even crazier idea, but that's well beyond what I hope for, is that we will find out that the Renderers across languages are in many areas rather uniform, and that there are a small number and that we can actually share a lot of the renderer code across languages and that languages are basically defined through a small number of parameters. There are some linguists who believe such things possible. But I don't dare bet on it. --DVrandecic (WMF) (talk) 21:50, 28 August 2020 (UTC)

The "Grammatical Framework" grammatical framework

Denny mentioned Grammatical Framework and I took a look at it. I think it is complex enough to represent most grammatical phenomena, but I don’t see very much that is actually developed. The examples and downloadable samples I see are all small toy systems. It isn’t 100% clear to me that any grammatical framework can actually capture all grammatical phenomena—and with certain kinds of edge cases and variation in dialects, it may be a lost cause—and linguists still argue over the right way to represent phenomena in major languages in major grammatical frameworks. Anyway, it looks like the Grammatical Framework still leaves a lot of grammar development to be done; assuming it’s possible (which I’m willing to assume), it doesn’t seem easy or quick, especially as we get away from major world languages.

Yes. I keep meaning to take another look but, basically, GF itself is not something I would expect many people to enjoy using for very long (I may become an exception) ...and "very long" certainly seems to be required. I'm usually pretty open-minded when it comes to possible solutions but I've not added GF to my list of things that might work. That's not to say that there can be no place for it in our solution landscape; it's just that we do need to keep focusing on the user experience, especially for the user who knows the under-served language but has no programming experience and no formal training in grammar. I can hardly remember what that feels like (ignoring the fact that my first language is hardly under-served)! Apart from committing to engaging with such stakeholders, it's not clear what more we can usefully do, at this stage, when it comes to evaluating alternative solutions. That said, I am 99.9% sure that we can find a somewhat satisfactory solution for a lot of encyclopedic content in a large number of languages; the principal constraint will always be the availability of willing and linguistically competent contributor communities.
One thing GF has going for it is that it is intentionally multilingual. As I understand it, our current plan is to begin natural-language generation with English renderers. I'm hoping we'll change our minds about that. Sooner rather than later, in any event, I would hope that renderer development would be trilingual (as a minimum). Some proportion of renderer functions may be monolingual, but I would like to see those limited to the more idiosyncratic aspects of the language. Or, perhaps, if there are good enough reasons to develop a function with only a single language in mind, we should also consider developing "equivalent provision" in our other current languages. What that means in practice is that the test inputs into our monolingual functions must also produce satisfactory results for the other languages, whether by using pre-existing functions or functions developed concurrently (more or less). --GrounderUK (talk) 13:02, 26 August 2020 (UTC)
I agree, we shouldn't start development based on a single language, and particularly not English. Starting in three languages, ideally across at least two main families, sounds like a good idea. --DVrandecic (WMF) (talk) 21:55, 28 August 2020 (UTC)
@DVrandecic (WMF): That's very good to hear! Is that an actual change of our collective minds, or just your own second thoughts? P2.7 doesn't seem to have a flexible interpretation. In my reading of P2.8 to P2.10, there also appears to be a dependency on P2.7 being fairly advanced, but maybe that's just my over-interpretation.--GrounderUK (talk) 07:52, 1 September 2020 (UTC)
It is true that a lot of development would be required for GF, but even more development would be required for any other framework. GF being open source, any development would flow back into the general public, too, so working with GF and its developers is likely the correct solution.
The GF community is interested in helping out. See , where Aarne Ranta suggests a case study for some area with adequate data, like mathematics. —The preceding unsigned comment was added by Inariksit (talk) 18:44, 28 August 2020
I am a big fan of reusing as much of GF as possible, but my reference to GF was not to mean that we necessarily have to use it as is, but rather that it shows that such a project is at all possible. We should feel free to either follow GF, or to divert from it, or to use it as an inspiration - but what it does show is that the task we are aiming for has been achieved by others in some form.
Having said that, I am very happy to see Aarne react positively by the idea! That might be the beginning of a beautiful collaboration. --DVrandecic (WMF) (talk) 21:54, 28 August 2020 (UTC)

Incompatible grammatical frameworks across language documentation

I worry that certain grammatical traditions associated with given languages may make it difficult to work compatibly across languages. A lot of Western European languages have traditionally followed the grammatical model of Latin, even when it doesn’t make sense—though there are of course many grammatical frameworks for the major languages. But it’s easy to imagine that the best grammar for a given medium-sized language was written by an old-fashioned European grammarian, based on a popular grammatical model from the 1970s. Reconciling that with whatever framework that has been developed up until that point may create a mess.

Speaking as an old-fashioned European grammarian... "a mess" is inevitable! It should be as messy as necessary but no messier (well, not much messier). I'm not sure that there's much "reconciling" involved, however. Given our framing of the problem, I don't see how "our mess" can be anything other than "interlingual" (as Grammatical Framework is). This is why I would prefer not to start with English; the first few languages will (inevitably?) colour and constrain our interlingua. So we need to be very careful, here. To set against that is our existing language-neutral content in Wikidata. Others must judge whether Wikidata is already "too European", but we must take care that we do not begin by constructing a "more English" representation of Wikidata content, or coerce it into a "more interlingual" representation, where the interlingua is linguistically more Eurocentric than global. So, first we must act to counter first-mover advantage and pre-existing bias, which means making things harder for ourselves, initially. At the same time, all language communities can be involved in refining our evolving language-neutral content, which will be multi-lingually labelized (like Wikidata). If some labelized content seems alien in some language, this can be flagged at an early stage (beginning now, for Wikidata). What this means is that all supported languages can already be reconciled, to an extent, with our foundational interlingua (Wikidata), and any extensions we come up with can also be viewed through our multi-lingual labelization. I suppose this is a primitive version of the "intelligible" content I mentioned earlier. When it comes to adding a further language (one that we currently support with labelization in Wikidata), we may hope to find that we are already somewhat reconciled, because linguistic choices have already been made in the labelization and our new target consumers can say what they dislike and what they might prefer; they do not need to consult some dusty grammar tome. In practice, they will already have given their opinions because that is how we will know that they are ready to get started (that is, theirs is a "willing and linguistically competent" community). In the end, though, we have to accept that the previous interlingual consensus will not always work and cannot always be fixed. This is when we fall back on the "interlingual fork" (sounds painful!). That just means adding an alternative language-neutral representation of the same (type of) encyclopedic content. I say "just" even though it might require rather more effort than I imagine (trust me, it will!) because it does seem temptingly straightforward. I say we must resist the temptation, but not stubbornly; a fallback is a tactical withdrawal, not a defeat; it is messy, but not too messy.--GrounderUK (talk) 12:54, 27 August 2020 (UTC)
Agree with GrounderUK here - I guess that we will have some successes and some dead ends during the implementation of certain languages, and that this is unavoidable. And that might partially stem from the state of the art in linguistic research in some languages.
On the other side, my understanding is that we won't be aiming for 7000 languages but for 400, and that these 400 languages will in general be better described and have more current research work about them than the other 6600. So I have a certain hope that for most languages that we are interested in we do have more current and modern approaches, that are more respectful of the language itself.
And finally, the grammarians lens of the 1970s Europeans will probably not matter that much in the end anyway - if we have access to a large enough number of contributors native in the given language. That should be a much more important voice in shaping the renderers of a language than dated grammar books. --DVrandecic (WMF) (talk) 22:01, 28 August 2020 (UTC)

Introducing new features into the framework needed by a newly-added language

In one of Denny's papers (PDF) on Abstract Wikipedia, he discusses (Section “5 Particular challenges”, p.8) how different languages make different distinctions that complicate translation—in particular needing to know whether an uncle is maternal or paternal (in Uzbek), and whether or not a river flows to the ocean (in French). I am concerned about some of the implications of this kind of thing.

Recoding previously known information with new features

One problem I see is that when you add a new language with an unaccounted-for feature, you may have to go back and recode large swathes of the information in the Abstract Wikipedia, both at the fact level, and possibly at some level of the common renderer infrastructure.

Suppose we didn’t know about the Uzbek uncle situation and we add Uzbek. Suddenly, we have to code maternal/paternal lineage on every instance of uncle everywhere. Finding volunteers to do the new uncle encoding seems like it could be difficult. In some cases the info will be unknown and either require research, or it is simply unknowable. In other cases, if it hasn’t been encoded yet, you could get syntactically correct but semantically ill-formed constructions.

@TJones (WMF): Yes, that is correct, but note that going back and recoding won't actually have effect on the existing text.
So assume we have an abstract corpus that uses the "uncle" construct, and that renders fine in all languages supported at that time. Now we add Uzbek, and we need to refine the "uncle" construct into either "maternal-uncle" or "paternal-uncle" in order to render appropriately in Uzbek - but both these constructs would be (basically) implemented as "use the previous uncle construct unless Uzbek". So all existing text in all supported languages would continue to be fine.
Now when we render Uzbek, though, then the corpus need to be retrofitted. But that merely blocks the parts of Uzbek renderings that are dealing with the construct "uncle". It has no impact on other languages. And since Uzbek didn't have any text so far (that's why we are discovering this issue now), it also won't reduce the amount of Uzbek generated text.
So, yes, you are completely right, but we still have a monotonously growing generated text corpus. And as we backfill the corpus, more and more of the text now becomes also available in Uzbek - but there would be no losses on the way. --DVrandecic (WMF) (talk) 22:09, 28 August 2020 (UTC)

Semantically ill-formed defaults

For example, suppose John is famous enough to be in Abstract Wikipedia. It is known that John’s mother is Mary, and that Mary’s brother is Peter, hence Peter is John’s uncle. However, the connection from John to Peter isn’t specifically encoded yet, and we assume patrilineal links by default. We could then generate a sentence like “John’s paternal uncle Peter (Mary’s brother) gave John $100.” Alternatively, if you try to compute some of these values, you are building an inference engine and a) you don’t want to do that, b) you really don’t want to accidentally do it by building ad hoc rules or bots or whatever, c) it’s a huge undertaking, and d) did I mention that you don’t want to do that?

Agreed. --DVrandecic (WMF) (talk) 22:10, 28 August 2020 (UTC)

Incompatible cultural "facts"

In the river example, I can imagine further complications because of cultural facts, which may require language-specific facts to be associated with entities. Romanian makes the same distinction as French rivière/fleuve with râu/fluviu. Suppose, for example, that River A has two tributaries, River B and River C. For historical reasons, the French think of all three as separate rivers, while the Romanians consider A and B to be the same river, with C as tributary. It’s more than just overlapping labels—which is complex enough. In French, B is a rivière because it flows into A, which is a fleuve. In Romanian, A and B are the same thing, and hence a fluviu. So, a town on the banks of River B is situated on a rivière in French, and a fluviu in Romanian, even though the languages make the same distinctions, because they carve up the entity space differently.

Yes, that is an interesting case. Similar situations happen with the usage of articles in names of countries and cities and whether you refer to a region as a country, a state, a nation, etc., which may differ from language to language. For these fun examples I could imagine that we have information in Wikidata connecting the Q ID for A and B to the respective L items. But that is a bit more complex indeed.
Fortunately, these cultural differences happen particularly on topics that are important for a given language community. Which increases the chance that the given language community has already an article about this topic in its own language, and thus will not depend on Abstract Wikipedia to provide baseline content. That's the nice thing: we are not planning to replace the existing content, but only to fill in the currently existing gaps. And these gaps will, more likely than not, cover topics that will not have these strong linguistic-cultural interplays. --DVrandecic (WMF) (talk) 22:15, 28 August 2020 (UTC)

Having to code information you don't really understand

In both of the uncle and river cases, an English speaker trying to encode information is either going to be creating holes in the information store, or they are going to be asked to specify information they may not understand or have ready access to.

A far-out case that we probably won’t actually have to deal with is still illustrative. The Australian Aboriginal language Guugu Yimithirr doesn’t have relative directions like left and right. Instead, everything is in absolute geographic terms; i.e., “there is an ant on your northwest knee”, or “she is standing to the east-northeast of you.” In addition to requiring all sorts of additional coding of information (“In this image, Fred is standing to the left/south-southwest of Jane”) depending on the implementation details of the encoding and the renderers, it may require changing how information is passed along in various low-level rendering functions. Obviously, it makes sense to make data/annotation pipelines as flexible as possible. (And again, the necessary information may not be known because it isn’t culturally relevant—e.g., in an old photo from the 1928 Olympics, does anyone know which way North is?)

@TJones (WMF): Yes, indeed, we have to be able to deal with holes. My assumption is that a contributor creates a first draft of an abstract article, their main concern will be the languages they speak. The UI may nudge them to fill in further holes, but it shouldn't require it. And the contributor can save the content now and reap benefit for all languages where the abstract content has all necessary information.
Now if there is a language that requires some information that is not available in the abstract content, where there is such a hole, then the rendering of this sentence for this language will fail.
We should have workflows that find all holes for a given language and then contributors can go through those and try to fill them, thereby increasing the amount of content that gets rendered in their languages - something that might be amenable to micro-contributions (or not, depending on the case). But all of this will be a gradual, iterative process. --DVrandecic (WMF) (talk) 22:21, 28 August 2020 (UTC)

Impact on grammar and renderers

Other examples that are more likely but less dramatic are clusivity, evidentiality, and ergativity which are more common. If these features also require agreement in verbs, adjectives, pronouns, etc., the relevant features will have to be passed along to all the right places in the sentence being generated. Some language features seem pretty crazy if you don't know those languages—like Salishan nounlessness and Guarani noun tenses—and may make it necessary to radically rethink the organization of information in statements and how information is transmitted through renderers.

Yes, that is something where I look to Grammatical Framework for inspiration, as it solved these cases pretty neatly by having an abstract and several concrete grammars and the passing through from one to the other and still allowing the different flows of agreement. --DVrandecic (WMF) (talk) 22:22, 28 August 2020 (UTC)

Grammatical concepts that are similar, but not the same

I’m also concerned about grammatical concepts that are similar across languages but differ in the details. English and French grammatical concepts often map to each other more-or-less, but the details where they disagree cause consistent errors in non-native speakers. For example, French says “Je travaille ici depuis trois ans” in the present tense, while English uses (arguably illogically, but that’s language!) the perfect progressive “I have been working here for three years”. Learners in both directions tend to do direct translations because those particular tenses and aspects usually line up.

Depending on implementation, I can see it being either very complex to represent this kind of thing or the representation needing to be language-specific (which could be a disaster—or at least a big mess). Neither a monolingual English speaker nor a monolingual French speaker will be able to tag the tense and aspect of this information in a way that allows it to be rendered correctly in both languages. Similarly, the subjunctive in Spanish and French do not cover the same use cases, and as an English speaker who has studied both I still have approximately zero chance of reliably encoding either correctly, much less both at the same time—though it’s unclear, unlike uncle, where such information should be encoded. If it’s all in the renderers, maybe it’ll be okay—but it seems that some information will have to be encoded at the statement level.

The hope would be that the exact tense and mood would indeed not be encoded in the abstract representation at all, but added to it by the individual concrete renderers. So the abstract representation of the example might be something like works_at(1stPSg, here, years(3)), and it would be up to the renderers to either render that using the present or the perfect progressive. The abstract representation would need to abstract from the language-specificities. These high-level abstract representations would probably break down first into lower level concrete representations, such as French clause(1stPSg, work, here, depuis(years(3)), present tense, positive) or English clause(1stPSg, work, here, for(years(3)), perfect progressive, positive), so that we have several layers from the abstract representation slowly working down its way to a string with text in a given language. --DVrandecic (WMF) (talk) 22:30, 28 August 2020 (UTC)

“Untranslatable” concepts and fluent rendering

Another random thought I had concerns “untranslatable” concepts, one of the most famous being Portuguese saudade. I don’t think anything is actually untranslatable, but saudade carries a lot more inherent cultural associations than, say, a fairly concrete concept like Swedish mångata. Which sense/translation of saudade is best to use in English—nostalgia, melancholy, longing, etc.—is not something a random Portuguese speaker is going to be able to encode when they try to represent saudade. On the other hand, if saudade is not in the Abstract Wikipedia lexicon, Portuguese speakers may wonder why; it’s not super common, but it’s not super rare, either—they use it on Portuguese Wikipedia in the article about Facebook, for example.

Another cultural consideration is when to translate/paraphrase a concept and when to link it (or both)—saudade, rickroll, lolcats. Dealing with that seems far off, but still complicated, since the answer is language-specific, and may also depend on whether a link target exists in the language (either in Abstract Wikipedia or in the language’s “regular” Wikipedia).

Yes! And that's exactly how we deal with it! So the sausade construct might be represented as a single word in Portuguese, but as a paraphrase in other languages. There is no requirement that each language has to have each clause built of components of the same kind. The same thing would allow us to handle some sentence qualifiers as adjectives or adverbs if a language can do that (say if there is a adjective stating that something happened yesterday evening), and use a temporal phrase otherwise. The abstract content can break apart in very different concrete renderers. --DVrandecic (WMF) (talk) 22:35, 28 August 2020 (UTC)

Discourse-level encoding?

I’m also unclear how things are going to be represented at a discourse level. Biographies seem to be more schematic than random concrete nouns or historical events, but even then there are very different types of details to represent about different biographical subjects. Or is every page going to basically be a schema of information that gets instantiated into a language? Will there be templates like BiographyIntro(John Smith, 1/1/1900, Paris, 12/31/1980, Berlin, ...) ? I don’t know whether that is a brilliant idea or a disaster in the making.

It's probably neither. So, particularly for bio-intros? That's what, IIRC, Italian Wikipedia is already doing, in order to increase the uniformity of their biographies. But let's look at other examples: I expect that for tail entities that will be the default representation, a more or less large template that takes some data from Wikidata and generates a short text. That is similar to the way LSJbot is working for a large number of articles - but we're removing the bus factor from LSJbot and we are putting it into a collaboratively controllable space.
Now, I would be rather disappointed if things stopped there: so, when we go beyond tail entities, I hope that we will have much more individually crafted abstract contents, where the simple template is replaced by diverse, more specific constructors, and only in case these are not available for rendering in a given language, we fall back to the simple template. And these would indeed need to also represent discourse-level conjunctions such as "however", "in the meantime", "in contrast to", etc. --DVrandecic (WMF) (talk) 22:40, 28 August 2020 (UTC)

Volume, Ontologies, and Learning Curves, Oh My!

Finally, I worry about the volume of information to encode (though Wikipedia has conquered that) and the difficulty of getting the ontology right (which Wikidata has done well, though on a less ambitious scale than Wikilambda requires, I think), and the learning curve for the more complex grammatical representations from editors.

I (and others) like to say that Wikipedia shouldn't work in theory, but in practice it does—so maybe these worries are overblown, or maybe worrying about them is how they get taken care of.

Trey Jones (WMF) (talk) 21:59, 24 August 2020 (UTC)

I don't think the worries you stated are overblown - these are all well-grounded worries, and for some of them I had specific answers, and for some I am fishing for hope that we will be able to overcome it when we get to it. One advantage is that I don't have to have all answers yet, but that I can rely on the ingenuity of the community to get some of these blockers out of the way, as long as we create an environment and a project that attracts enough good people.
And this also means that this learning curve you are describing here is one of my main worries. The goal is to design a system that allows contributors who want to do so to dive deep into the necessary linguistic theories and in the necessary computer science background to really dig through the depths of Wikilambda and Abstract Wikipedia - and I expect to rely on these contributors sooner than later. But at the same time it must remain possible to effectively contribute if you don't do that, or else the project will fail. We must provide contribution channels where confirming or correcting lexicographic forms works without the contributor having to fully understand all the other parts, where abstract content can be created and maintained without having to have a degree in computational linguistics. I still aim for a system where, in case of an event (say, a celebrity marries, an author publishes a new book, or a new mayor gets elected) an entirely fresh contributor can figure out in less than five minutes how to actually make the change on Abstract Wikipedia. This will be a major challenge, and not even because I think the individual user experiences for those things will be incredibly hard, but rather because it is unclear how to steer the contributor to the appropriate UX.
But yes, the potential learning curve is a major obstacle, and we need to address that effectively, or we will fall short of the potential that this project has. --DVrandecic (WMF) (talk) 22:50, 28 August 2020 (UTC)


@All: @TJones (WMF):

These are all serious issues that need to be addressed, which is why I proposed a more rigorous development part for "producing a broad spectrum of ideas and alternatives from which a program for natural language generation from abstract descriptions can be selected" at

With respect to your first three sections, I assumed Denny only referred to Grammatical Framework as a system with similar goals to Abstract Wikipedia. I also assumed that the idea was that only those grammatical phenomena needed to somewhat sensibly express the abstract content in a target language need to be captured, and that hopefully generating those phenomena will require much less depth and sophistication than constructing a grammar.

With respect to the section "Recoding previously known information with new features," the system needs to be typologically responsible from the outset. I think we should be aware of the features of all of the languages we are going to want to eventually support.

In your example, I think coding "maternal/paternal lineage on every instance of uncle everywhere" would be great, but not required. A natural language generation system will not (soon) be perfect.

With respect to the section "Semantically ill-formed defaults": in other words, what should we do when we want to make use of contextual neutralization (e.g., using paternal uncle in a neutral context where the paternal/maternal status of the uncle is unknown) but we cannot ensure the neutral context?[1] I would argue that there are certain forms that we should prefer because they appear less odd when the neutral context is supposedly violated than others (e.g., they in The man was outside, but they were not wearing a coat). There is also an alternative in circumlocution: for example, we could reduce specificity and use relative instead of parental or maternal uncle and hope that any awkward assumptions in the text of a specifically uncle-nephew/niece relationship are limited.

Unfortunately, it does seem that to get a really great system, you need a useful inference engine ...

With respect to the section "Incompatible cultural 'facts'," this is the central issue to me, but I would rely less on the notion of "cultural facts." We are going to make a set of entity distinctions in the chosen set of Abstract Wikipedia articles, but the generated natural language for these articles needs to respect how the supported languages construe conceptual space (speaking in cognitive linguistics terms for a moment).[2] I am wondering if there is a typologically responsible way of ordering these construals (perhaps as a "a lattice-like structure of hierarchically organized, typologically motivated categories"[3]) that could be helpful to this project.

With respect to the section "Having to code information you don't really understand," as above, I think there are sensible ways in which we can handle "holes in the information store." For example, in describing an old photo from the 1928 Olympics, can we set North to an arbitrary direction, and what are the consequences if there is context that implies a different North?

With respect to the section "Discourse-level encoding?," I do hope we can simulate the "art" of writing a Wikipedia article to at least some degree.

  1. Haspelmath, Martin. "Against markedness (and what to replace it with)." Journal of linguistics (2006): 39.; Croft, William. Typology and universals, 100-101. Cambridge University Press, 2002.
  2. Croft, William, and Esther J. Wood. "Construal operations in linguistics and artificial intelligence." Meaning and cognition: A multidisciplinary approach 2 (2000): 51.
  3. Van Gysel, Jens EL, Meagan Vigus, Pavlina Kalm, Sook-kyung Lee, Michael Regan, and William Croft. "Cross-lingual semantic annotation: Reconciling the language-specific and the universal." ACL 2019 (2019): 1.

--Chris.Cooley (talk) 11:45, 25 August 2020 (UTC)

Thanks Chris, and thanks for starting that page! I think this will indeed be an invaluable resource for the community as we get to these questions. And yes, as you said, I referred to GF as a similar project and as an inspiration and as a demonstration of what is possible, not necessarily as the actual implementation to use. Also, you will find that my answers above don't always fully align with what you said, but I think that they are in general compatible. --DVrandecic (WMF) (talk) 22:56, 28 August 2020 (UTC)
@DVrandecic (WMF): Thanks, and I apologize for sounding like a broken record with respect to a part PP2! With respect to the above, it does sound like we disagree on some things, but I am sure there will be time to get into them later! --Chris.Cooley (talk) 00:29, 29 August 2020 (UTC)
Some of these issues are I think avoided by the restricted language domain we are aiming for - encyclopedic content. There should be relatively limited need for constructions in the present or future tenses. No need to worry about the current physical location or orientation of people, etc. If something is unknown then a fall-back construction ("brother of a parent" rather than a specific sort of "uncle") should be fine, if possibly inelegant. We don't need to capture every possible aspect of language, just sufficient to convey the meanings intended. ArthurPSmith (talk) 18:56, 25 August 2020 (UTC)
Thanks Arthur! Yes, I very much appreciate the pragmatic approach here - and I think a combination of the pragmatic getting things done with the aspirational wanting everything to be perfect will lead to the best tension to get this project catapulted forward! --DVrandecic (WMF) (talk) 22:56, 28 August 2020 (UTC)
@ArthurPSmith: "the restricted language domain we are aiming for - encyclopedic content" I wish this was so, but I am having trouble thinking of an encyclopedic description of a historical event — for example — as significantly restricted. I could easily see the physical location or orientation of people becoming relevant in such a description. "If something is unknown then a fall-back construction ('brother of a parent' rather than a specific sort of 'uncle') should be fine, if possibly inelegant. We don't need to capture every possible aspect of language, just sufficient to convey the meanings intended." I totally agree, and I think this could be another great fallback alternative. --Chris.Cooley (talk) 00:29, 29 August 2020 (UTC)

Translatable modules


This is not exactly about Abstract Wikipedia, but it's quite closely related, so it may interest the people who care about Abstract Wikipedia.

The Wikimedia Foundation's Language team started the Translatable modules initiative. Its goal is to find a good way to localize Lua modules as conveniently as it is done for MediaWiki extensions.

This project is related to task #2 in Abstract Wikipedia/Tasks, "A cross-wiki repository to share templates and modules between the WMF projects". The relationship to Abstract Wikipedia is described in much more detail on the page mw:Translatable modules/Engineering considerations.

Everybody's feedback about this project is welcome. In particular, if you are a developer of templates or Lua (Scribunto) modules, your user experience will definitely be affected sooner or later, so make your voice heard! The current consultation stage will go on until the end of September 2020.

Thank you! --Amir E. Aharoni (WMF) (talk) 12:45, 6 September 2020 (UTC)

@Aaharoni-WMF: Thanks, that is interesting. I do wonder whether there is more overlap in Phase 1 than your link suggests. Although it is probably correct to say that the internalization of ZObjects will be centralized initially, there was some uncertainty about which multi-lingual labelization and documentation solution should be pursued. Phab:T258953 is now resolved but I don't understand it well enough to see whether it aligns with any of your solution options. As for phase 2, I simply encourage anyone with an interest in language-neutral encyclopedic content to take a look at the link you provided and the associated discussion. Thanks again.--GrounderUK (talk) 19:43, 7 September 2020 (UTC)

Response from the Grammatical Framework community

The Grammatical Framework (GF) community has been following the development of Abstract Wikipedia with great interest. This summary is based on a thread at GF mailing list and my (inariksit) personal additions.


GF has a Resource Grammar Library (RGL) for 40 or so languages, and 14 of them have large-scale lexical resources and extensions for wide-coverage parsing. The company Digital Grammars (my employer) has been using GF in commercial applications since 2014.

To quote GF's inventor Aarne Ranta on the previously linked thread:

My suggestion would have have a few items:

  • that we develop a high-level API for the purpose, as done in many other NLG projects
  • that we make a case study on an area or some areas where there is adequate data. For instance from OpenMath
  • that we propagate this as a community challenge
  • Digital Grammars can sponsor this with some tools, since we have gained experience from some larger-scale NLG projects

Morphological resources from Wiktionary inflection tables

With work of Forsberg and Hulden and Kankainen, it's possible to extract GF resources from Wiktionary inflection tables.

Quoting Kristian Kankainen's message:

Since the Wiktionary is a popular place for inflection tables, these could be used for boot-strapping GF resources for those languages. Moreover, but not related to GF nor Abstract Wikipedia, the master's thesis generates also FST code and integrates the language into the Giella platform which provides an automatically derived simple spell-checker for the language contained in the inflection tables. Coupling or "boot-strapping" the GF development using available data on Wiktionary could be seen as a nice touch and would maybe be seen as a positive inter-coupling of different Wikimedia projects.

Division of labour

Personally, I would love to spend the next couple of years reading grammar books and encoding basic morphological and syntactic structures of languages like Guarani or Greenlandic into the GF RGL. With those in place, a much wider audience can write application grammars, using the RGL via a high-level API.

Of course, for this to be a viable solution, more people than just me need to join in. I believe that if the GF people know that their grammars will be used, the motivation to write them is much higher. To kickstart the resource and API creation, we could make Abstract Wikipedia as a special theme of the next GF summer school, whenever that is organised (live or virtually).

Addressing some concerns from this talk page

Some of the concerns on the talk page are definitely valid.

  • It is going to take a lot of time. Developing the GF Resource Grammar Library has taken 20 calendar years and (at least) 20 person years. I think everyone who has a say in the choice of renderer implementation should get familiar with the field---check out other grammar formalisms, like HPSG, you'll see similar coverage and timelines to GF[1].
  • The "Uzbek uncle" situation happens often with GF grammars when adding a new language or new concepts. Since this happens often, we are prepared for it. There are constructions in the GF language and module system that make dealing with this manageable.
  • "Incompatible cultural facts" is a minefield of its own, far beyond the scope of NLG. I personally think we should start with a case study for a limited domain.

On the other hand, worrying about things like ergativity or when to use subjunctive tells me that the commenters haven't understood just how abstract an abstract syntax can be. To illustrate this, let me quote the GF Best Practices document on page 9:

Linguistic knowledge. Even the most trivial natural language grammars involve expert linguistic knowledge. In the current example, we have, for instance, word inflection and gender agreement shown in French: le bar est ouvert (“the bar is open”, masculine) vs. la gare est ouverte (“the station is open”, feminine). As Step 3 in Figure 3 shows, the change of the noun (bar to gare) causes an automatic change of the definite article (le to la) and the adjective (ouvert to ouverte). Yet there is no place in the grammar code (Figure 2) that says anything about gender or agreement, and no occurrence of the words la, le, ouverte! The reason is that such linguistic details are inherited from a library, the GF Resource Grammar Library (RGL). The RGL guarantees that application programmers can write their grammars on a high level of abstraction, and with a confidence of getting the linguistic details automatically right.

Language differences. The RGL takes care of the rendering of linguistic structures in different languages. [--] The renderings are different in different languages, so that e.g. the French definition of the constant the_Det produces a word whose form depends on the noun, whereas Finnish produces no article word at all. These variations, which are determined by the grammar of each language, are automatically created by the RGL. However, the example also shows another kind of variation: English and French use adjectives to express “open” and “closed”, whereas Finnish uses adverbs. This variation is chosen by the grammarian, by picking different RGL types and categories for the same abstract syntax concepts.

Obviously the GF RGL is far from covering all possible things people might want to say in a wikipedia article. But an incomplete tool that covers the most common use cases, or covers a single domain well, is still very useful.


Non-European and underrepresented languages

Regarding discussions such as, I'm happy to see that you are interested in underrepresented languages. The GF community has members in South Africa, Uganda and Kenya, doing or having previously done work on Bantu languages. At the moment (September 2020), there is ongoing development in Zulu, Xhosa and Northern Sotho.

This grammar work has been used in a healthcare application, and you can find a link to a paper describing the application in this message.

If any of these sounds interesting to you, we can start a direct dialogue with the people involved.

Concluding words

Whatever system Abstract Wikipedia will choose, that will follow the evolutionary path of GF (and no doubt other similar systems), so it's better to learn from that regardless of whether GF is chosen or not. We are willing to help, whether it's actual GF programming or sharing experiences---what we have tried, what has worked, what hasn't.

On behalf of the GF community,

inariksit (talk) 08:19, 7 September 2020 (UTC)


I second what inariksit says about the interest from the GF community - if GF were to be used by AW, it would give a great extra motivation for writing resource grammars, and it would also benefit the GF community by giving the opportunity to test and find remaining bugs in the grammars for smaller languages.

Skarpsill (talk) 08:34, 7 September 2020 (UTC)

Thanks for this summary! Nemo 09:10, 13 September 2020 (UTC)

@Inariksit:, @Skarpsill: - thank you for your message, and thank you for reaching out. I am very happy to see the interest and the willingness to cooperate from the Grammatical Framework community. In developing the project, I have read the GF book at least three times (I wish I was exaggerating), and have taken inspiration in how GF has solved a problem many times when I got stuck. In fact, the whole idea that Abstract Wikipedia can be built on top of a functional library being collected in the wiki of functions can be traced back to GF being built as a functional language itself.

I would love for us to find ways to cooperate. I think it would be a missed opportunity not to use learn or even directly use the RGLs.

I keep this answer short, and just want to check a few concrete points:

  • when Aarne mentioned to "develop a high-level API for the purpose, as done in many other NLG projects", what kind of API is he thinking of? An API to GF, or an abstract grammar for encyclopaedic knowledge?
  • whereas math would be a great early domain, given that you already have experience in the medical domain, and several people have raised the importance of medical knowledge for closing gaps in Wikipedia, could that be an interesting early focus domain?
  • regarding our timeline, we plan to work on the wiki of functions in 2021, and start working on the NLG functionalities in late 2021 and throughout 2022. Given that timeline, what engagement would make sense from your side?

I plan to come back to this and answer a few more points, but I have been sitting on this for too long already. Thank you for reaching out, I am very excited! --DVrandecic (WMF) (talk) 00:44, 26 September 2020 (UTC)

@DVrandecic (WMF): Thanks for your reply! I'm not quite sure what Aarne means with the high-level API: an application grammar in GF, or some kind of API outside GF that builds GF trees from some structured data. If it's the former, it would just look like any other GF application grammar, if the latter, it could look something like this toy example in my blog post (ignore that the black box says "nobody understands this code"): the non-GF users would interact with the GF trees on some higher level, like in that picture, first choosing the dish pizza and then choosing pizza toppings, generates the GF tree for "Your pizza has (the chosen toppings)".
The timeline starting in late 2021 is ideal for me personally. I think that medical domain is a good domain as well, but I don't know enough to discuss details. I hope that our South African community gets interested (initial interest is there, as confirmed by the thread on GF mailing list).
I would love to have an AW-themed GF summer school in 2022. It would be a great occasion to introduce GF people to AW, and AW people to GF, in a 2-week intensive course. If you (as in the whole AW team) think this is a good idea, then it would be appropriate to start organising it already in early 2021. If we want to target African languages, we could e.g. try to organise it in South Africa. I know this may seem premature to bring it up now, but if we want to do something like this in a larger scale, it's good to start early.
Btw, I joined the IRC channel #wikipedia-abstract, so we can talk more there if you want. --inariksit (talk) 17:42, 14 October 2020 (UTC)

Might deep learning-based NLP be more practical?

First of all, I'd like to state that Abstract Wikipedia is a very good idea. I applaud Denny and everyone else who has worked on it.

This is kind of a vague open-ended question, but I didn't see it discussed so I'll write it anyway. The current proposal for Wikilambda is heavily based on a generative-grammar type view of linguistics; you formulate a thought in some formal tree-based syntax, and then explicitly programmed transformational rules are applied to convert the output to a given natural language. I was wondering whether it would make any sense to make use of connectionist models instead of / in addition to explicitly programmed grammatical rules. Deep learning based approaches (most impressively, Transformer models like GPT-3) have been steadily improving over the course of the last few years, vindicating connectionism at least in the context of NLP. It seems like having a machine learning model generate output text would be less precise than the framework proposed here, but it would also drastically reduce the amount of human labor needed to program lots and lots of translational rules.

A good analogy here would be Apertium vs. Google Translate/DeepL. Apertium, as far as I understand it, consists of a large number of rules programmed manually by humans for translating between a given pair of languages. Google Translate and DeepL are just neural networks trained on a huge corpus of input text. Apertium requires much more human labor to maintain, and its output is not as good as its ML-based competitors. On the other hand, Apertium is much more "explainable". If you want to figure out why a translation turned out the way it did (for example, to fix it), you can find the rules that caused it and correct them. Neural networks are famously messy and Transformer models are basically impossible to explain.

Perhaps it would be possible to combine the two approaches in some way. I'm sure there's a lot more that could be said here, but I'm not an expert on NLP so I'll leave it at that. PiRSquared17 (talk) 23:46, 25 August 2020 (UTC)

@PiRSquared17: thank you for this comment, and it is a question that I get asked a lot.
First, yes, the success of ML systems in the last decade have been astonishing, and I am amazed by how much the field has developed. I had one prototype that was built around an ML system, but even more than the Abstract Wikipedia proposal it exhibited a Matthew effect - languages that were already best represented benefitted from that architecture the most, whereas the languages that needed most help would get least of it.
Another issue is, as you point out, that within a Wikimedia project I would expect the ability for contributors to go in and fix errors. This is considerably easier with the symbolic approach chosen for Abstract Wikipedia than with an ML-based approach.
Having said that, there are certain areas I will rely on ML-based solutions in order to get them working. This includes an improved UX to create content, and this includes analysis of the existing corpora as well as of the generated corpora. There is even the possibility of using an ML-based system to do the surface cleanup of the text to make it more fluent - basically, to have an ML-based system do copy on top of the symbolically generated text, which could have the potential to reduce the complexity of the renderers considerably and yet get good fluency - but all of these are ideas.
In fact, I am planning to write a page here where I outline possible ML tasks in more detail.
@DVrandecic (WMF): Professor Reiter ventured an idea or two on this topic a couple of weeks ago: " may be possible to use GPT3 as an authoring assistant (eg, for developers who write NLG rules and templates), for example suggesting alternative wordings for NLG narratives. This seems a lot more plausible to me than using GPT3 for end-to-end NLG."--GrounderUK (talk) 18:22, 31 August 2020 (UTC)
@GrounderUK: With respect to GPT-3, I am personally more interested in things like for the (cross-linguistic) purposes of Abstract Wikipedia. You might be able to imagine strengthening Wikidata and/or assisting abstract content editing. --Chris.Cooley (talk) 22:33, 31 August 2020 (UTC)
@Chris.Cooley: I can certainly imagine such a thing. However it happens, the feedback into WikidataPlusPlus is kinda crucial. I think I mentioned the "language-neutral synthetic language that inevitably emerges" somewhere (referring to Wikidata++) as being the only one we might have a complete grammar for. Translations (or other NL-type renderings) into that interlingua from many Wikipedias could certainly generate an interesting pipeline of putative data. Which brings us back to #Distillation of existing content (2nd paragraph)...--GrounderUK (talk) 23:54, 31 August 2020 (UTC)
@DVrandecic (WMF): May we go ahead and create such an ML page, if you haven't already? James Salsman (talk) 19:00, 16 September 2020 (UTC)
Now it could be that ML and AI will develop with such a speed to make Abstract Wikipedia superfluous. But to be honest (and that's just my point of view), given the development of the field in the last ten years, I don't see that moment be considerably closer than it was five years ago (but also, I know a number of teams working on a more ML-based solution to this problem, and I honestly wish them success). So personally I think there is a window of opportunity for Abstract Wikipedia to help billions of people for quite a few years, and to allow many more to contribute to the world's knowledge sooner. I think that's worth it.
Amusingly, if Abstract Wikipedia succeeds, I think we'll actually accelerate the moment where we make it's existent unnecessary. --DVrandecic (WMF) (talk) 03:55, 29 August 2020 (UTC)
I will oppose introducing any deep learning technique as it is 1. difficult to develop 2. difficult to train 3. difficult to generalize 4. difficult to maintain.--GZWDer (talk) 08:06, 29 August 2020 (UTC)
It could be helpful to use ML-based techniques for auxiliary features, such as parsing natural language into potential abstract contents for users to choose/modify, but using such techniques for rendered text might not be a good idea, even if it's just used on top of symbolically generated text. For encyclopedic content, accuracy/preciseness is much more important than naturalness/fluency. As suggested above, "brother of a parent" could be an acceptable fallback solution for the "uncle" problem, even it doesn't sound natural in a sentence. While a ML-based system will make sentences more fluent, it could potentially turn a true statement into a false one, which would be unacceptable. Actually, many concerns raised above, including the "uncle" problem, could turn out to be advantages of current rule-based approach over ML-based approach. Although those are challenging issues we need to address, it would be more challenging for ML/AI to resolve those issues. --Stevenliuyi (talk) 23:41, 12 September 2020 (UTC)

Hello everyone. Somewhat late to the party, but I recently proposed a machine learning data curation that is germane to this discussion I think. I propose using machine learning to re-index Wikipedias to paraphrases. FrameNet and WordNet have been working on the semantics problem that Abstract is trying to address. The problem they face is the daunting task of manually building out their semantic nuggets. Not to mention the specter over every decision's shoulder, opinion. With research telling us that sentences follow Zipf's Law paraphrases become the sentence equivalent to words' synonyms. It also means our graph is completely human readable, as we retain context for each sentence member of the paraphrase node. You could actually read a Wikipedia article directly from the graph. This way, we can verify the machine learning algorithms. Just like a thesaurus lists all of the words with the same meaning, our graph would illustrate all of the concepts with the same meaning, and the historic pathways used to got to related concepts. Further, by mapping Wikidata meta to the paraphrase nodes, we would have a basic knowledge representation. The graph is directed, which will also assist function logic for translating content. Extending this curation to the larger Internet brings new capabilities into play. I came at this from a decision support for work management perspective. My goal was to help identify risk and opportunities in our work. I realized that I needed to use communication information to get to predictive models like Polya's Urn with Innovation Triggering. I realized along the way that this data curation is foundational, and must be under the purview of a collective rather than some for profit enterprise. So here I am. I look forward to discussing this concept while learning more about Abstract. Please excuse my ignorance as I get up to speed on this wonderfully unique culture.--DougClark55 (talk) 21:33, 11 January 2021 (UTC)

Parsing Word2Vec models and generally

Moved to Talk:Abstract Wikipedia/Architecture#Constructors. James Salsman (talk) 20:36, 16 September 2020 (UTC)