Talk:Abstract Wikipedia

From Meta, a Wikimedia project coordination wiki
This page is for discussions related to the Abstract Wikipedia page.



Distillation of existing content

I wonder whether we have hold of the right end of the stick. From the little I have so far read of this project, the big idea seems to be to help turn data into information. The goal, in practice, seems to be a satisfactory natural language presentation of previously defined subjects and predicates (content). Or, rather, the goal is to develop the capability to present content in any natural language. That, I say, is a worthy ambition.

As a contributor who mainly edits articles written by others, I feel the need to look at this from the opposite direction. Missing content is, of course, a problem. But so are inconsistent content, misleading content and confusing content. For me, an "abstract wikipedia" distils the content of an article, identifying the subjects, their qualities and the more or less subtle inter-relationships between them. That is, I imagine we begin with natural language content and automatically analyse it into an abstract form. I believe this is one way that non-technical contributors can really help us move towards our goal. If the person editing an article can see a "gloss" of the article in its target language, it is easy to imagine how they might adjust the original text so as to nudge the automated abstract towards a more accurate result. At the same time, the editor could be providing hints for more natural "rendering" (or re-rendering) of the abstract content into the natural language. In practice, this is what we already do when we provide alternative text in a link.

In my view, this sort of dialogue between editor and machine abstraction will help both. The editor gets objective feedback about what is ambiguous or unclear. I imagine some would prefer to "explain" to the machine how it should interpret the language ("give hints") while others might prefer to change the language so that it is more easily interpreted by the machine. Either way, the editor and the machine converge, just as collaborative editors already do, over time.

The crucial point here, I suppose, is that the re-rendering of abstracted content can be more reliably assessed at the editing stage. To the editor, it is just another tool, like "Show preview" or "Show changes" (or the abstract results appear with the text when those tools are used). Giving hints becomes as natural to the editor as fixing redlinks; how natural taking hints might become, time alone can tell.

Congratulations on getting this exciting project started.--GrounderUK (talk) 01:21, 5 July 2020 (UTC)

tl;dr: Once the content becomes abstract, it will be a technical burden for the community to contribute.
That is why the content should be maintained as part of the wikipedia editing process, with the editor only (and optionally) guiding the way the content is mapped back into Wikidata or verifying that it would be rendered correctly back into the source language WikiText (which is still fully concrete and always remains so).--GrounderUK (talk) 22:37, 5 July 2020 (UTC)

I do not think automated abstraction is a good thing to do: it would mean introducing a machine translation system that generates an interlingua with unclear semantics and that cannot be reused easily. I am also very skeptical about making all articles fully abstract.--GZWDer (talk) 21:22, 5 July 2020 (UTC) [signature duplicated from new topic added below]

I would be sceptical too. In fact, I would strongly oppose any such proposal. My suggestion (in case you have misunderstood me) is for a human-readable preview of the machine's interpretation of the natural language WikiText.--GrounderUK (talk) 22:37, 5 July 2020 (UTC)
@GrounderUK: Yes! Thank you for the congratulations, and yes, I agree with the process you describe. The Figure on Page 6 sketches a UI for that: the contributor enters natural language text, the system tries to automatically guess the abstraction, and at the same time displays the result in different languages the contributor chooses. The automatic guess in the middle can be directly modified, and the result of the modification is displayed immediately.
I also very much like how you describe the process of contributors fixing up ambiguities (giving hints). I also hope for such a workflow, where failed renderings get into a queue and allow contributors to go through them and give the appropriate hints, filtered by language.
But in my view none of this happens fully automatically; it always involves the human in the middle. We don't try to automatically ingest existing articles, but rather let the contributors go and slowly build and grow the content.
Thank you for the very inspiring description of the process, I really enjoyed reading it. --DVrandecic (WMF) (talk) 05:15, 14 July 2020 (UTC)
@DVrandecic (WMF): You are very welcome, of course! Thank you for the link to your paper. You say, "The content has to be editable in any of the supported languages. Note that this does not mean that we need a parser that can read arbitrary input in any language." I think we agree. I would go further, however, and suggest that we do need a parser for every language for which there is a renderer. I invite you to go further still and view a parser as the precise inverse function of a renderer. If that sounds silly, it may be because we think of rendering as a "lossy" conversion. But the lesson from this should be that rendering (per se) should be constrained to be "lossless", meaning neither more nor less than that its output can be the input to the inverse function (the "parser"), returning as output the exact original input to the renderer. Such an approach implies that there will be subsequent lossy conversion required to achieve the required end result, but we need to think about ways in which the "final losses" can be retained (within comments, for example) so that even the end result can be parsed reliably back into its pre-rendered form. More importantly, an editor can modify the end result with a reasonable expectation that the revised version can be parsed back into what (in the editor's opinion) the pre-rendered form should have been.
To relate this back to your proposed interaction, where you say, "The automatic guess in the middle can be directly modified", I hesitate. I would prefer to think of the editor changing the original input or the implied output, rather than the inferred (abstract) content (although I doubt this will always be possible). We can explore the user experience later, but I would certainly expect to see a clear distinction between novel claims inferred from the input and pre-existing claims with which the input is consistent. Suppose the editor said that San Francisco was in North Carolina, for example, or that it is the largest city in Northern California.
I agree about always involving the human in the middle. However... If we develop renderers and parsers as inverses of each other, we can parse masses of existing content with which to test the renderers and understand their current limitations.--GrounderUK (talk) 23:11, 19 July 2020 (UTC)
@GrounderUK: Grammatical Framework has grammars that are indeed bi-directional, and some of the other solutions have that too. To be honest, I don't buy it - it seems that all renderings are always lossy. Just go from a language such as German that uses gendered professions to a language such as Turkish or English that does not. There is a necessary loss of information in the English and Turkish translation. --DVrandecic (WMF) (talk) 00:47, 5 August 2020 (UTC)
@DVrandecic (WMF): Yes, that's why I said "we think of rendering as a "lossy" conversion. But the lesson from this should be that rendering (per se) should be constrained to be "lossless" [...which...] implies that there will be subsequent lossy conversion required to achieve the required end result, but we need to think about ways in which the "final losses" can be retained ... so that ... an editor can modify the end result with a reasonable expectation that the revised version can be parsed back..." [emphasis added]. In your example, we (or she) may prefer "actor" to "actress", but the rendering is more like "actor"<f> (with the sex marker not displayed). In the same way, the renderer might deliver something like "<[[Judi Dench|Dame >Judi<th Olivia> Dench< CH DBE FRSA]]>" to be finally rendered in text as "Judi Dench", or "<[[Judi Dench|>Dame Judi<th Olivia Dench CH DBE FRSA]]>" for "Dame Judi", or just "<[[Judi Dench|>she<]]>" for "she". (No, that's not pretty, but you get the idea. "...we need to think about ways in which the "final losses" can be retained..." ["she", "[[Judi Dench|", "]]"]?)--GrounderUK (talk) 20:48, 6 August 2020 (UTC)
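The marker notation in the comment above can be made concrete with a minimal sketch. It assumes, as the examples suggest, that text inside angle brackets is retained but not displayed and that everything outside them is shown to the reader. The function names `split_segments`, `displayed` and `restored` are hypothetical and not part of any proposal.

```python
import re

# Assumption taken from the examples above: "<...>" wraps text that is kept
# for the parser but hidden from the reader; everything else is displayed.
SEGMENT = re.compile(r"<([^<>]*)>|([^<>]+)")


def split_segments(marked):
    """Return (text, hidden) pairs in source order."""
    segments = []
    for match in SEGMENT.finditer(marked):
        hidden_text, shown_text = match.group(1), match.group(2)
        if shown_text is not None:
            segments.append((shown_text, False))
        else:
            segments.append((hidden_text, True))
    return segments


def displayed(marked):
    """The lossy end result: only the visible segments."""
    return "".join(text for text, hidden in split_segments(marked) if not hidden)


def restored(marked):
    """The lossless form: visible and hidden segments merged back together."""
    return "".join(text for text, _ in split_segments(marked))


short_form = "<[[Judi Dench|Dame >Judi<th Olivia> Dench< CH DBE FRSA]]>"
formal_form = "<[[Judi Dench|>Dame Judi<th Olivia Dench CH DBE FRSA]]>"

print(displayed(short_form))   # the text a reader would see
print(restored(short_form))    # the wikitext a parser could work from
```

Both example strings restore to the same underlying wikitext, which is the property the comment is after: the displayed text differs, but nothing needed by the parser is lost.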
@GrounderUK: Ah, OK, yes, that would work. That seems to require some UX to ensure that these hidden fields get entered when the user writes the content? --DVrandecic (WMF) (talk) 22:11, 6 August 2020 (UTC)
@DVrandecic (WMF): Not sure... I'm talking about rendered output that might be tweaked after it is realised as wikitext. If the wikitext is full of comments, a quick clean-up option would be great. But if the final lossy conversion has already happened, then it's a question of automatically getting back the "final losses" from wherever we put them and merging them back into our wikitext before we start the edit. You might even create the page with the heavily commented wikitext (which looks fine until you want to change it) and then "replace" it with the cleaned up version, so an editor can go back to the original through the page history, if the need arises.
If you're talking about creating the language-neutral content from natural-language input in the first place, that's a whole other UX. But with templates and functions and what-have-you, we could go quite a way with just the familiar wikitext, if we had to (and I have to say that I do usually end up not using the visual editor when it's available, for one reason or another).
Either way, there might be a generic template wrapping a generic function (or whatever) to cover the cases where nothing better is available. Let's call it "unlossy" with (unquoted) arguments being, say, the end-result string and (all optional) the page name, the preceding lost-string, the following lost-string and a link switch. So, {{unlossy|she|Judi Dench| | |unlinked}} or {{unlossy|Dame Judi|Judi Dench| |th Olivia Dench CH DBE FRSA|linked}}. (There are inversions and interpolations etc to consider, but first things first.)
In general, though, (and ignoring formatting) I'd expect there would be something more specific, like {{pronoun|Judi Dench|subject}} or {{pronoun|{{PAGENAME}}|subject}} or {{UKformal|{{PAGENAME}}|short}}. If {{PAGENAME}} and subject are the defaults, it could just be {{pronoun}}. Now, if humans have these templates (or function wrappers) available, why wouldn't the rendering decide to use them in the first place? Instead of rendering, say, {"type":"function call", "function":"anaphora", "wd":"Q28054", "lang":"en", "case":"nominative"} as "she", it renders (also?) as "{{pronoun|Judi Dench}}", which is implicitly {{pronoun|Judi Dench|subject}}, which (given enwiki as the context) maps to something like {"type":"function call", "function":"anaphora", "wd":"Q28054", "lang":"en", "case":"nominative"} (Surprise!). As an editor, I can now just change "pronoun" to "first name" or "UKformal"... and all is as I would expect. That's really not so painful, is it?--GrounderUK (talk) 03:24, 7 August 2020 (UTC)
Okay, so maybe it really is that painful! It's still not 100% clear to me exactly what an editor would type into Wikipedia to kick off a function. Say I have my stripped down pronoun function, which returns by default the nominative pronoun for the subject of the page ("he", "she", "it" or "they"). I guess it looks like pronoun() in the wiki of functions (with its English label)... but, then, where is it getting its default parameters ("Language":"English", "case":"nominative", "number":"singular", "gender":"feminine") from? Number and gender are a function of "Q28054", so we're somehow defaulting that function in; the case is an internal default or fallback, for the sake of argument; and the language is (magically) provided from the context of the function's invocation. Sounds okay, -ish. Is it too early to say whether I'm close? We don't need to think of this as NLG-specific. What would age("Judi Dench") or today() look like?--GrounderUK (talk) 16:46, 1 September 2020 (UTC)
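One way to picture where the defaults come from is the following sketch. Everything in it is an assumption made for illustration: the `Context` class, the `SUBJECT_FEATURES` table (standing in for a Wikidata lookup on Q28054) and the tiny pronoun table are hypothetical and do not reflect how the wiki of functions actually works.

```python
from dataclasses import dataclass

# Hypothetical stand-in for a Wikidata lookup: number and gender are
# treated as a function of the item, as suggested in the comment above.
SUBJECT_FEATURES = {
    "Q28054": {"number": "singular", "gender": "feminine"},
}

# Deliberately tiny table, covering only what the example needs.
PRONOUNS = {
    "en": {
        ("singular", "feminine", "nominative"): "she",
        ("singular", "feminine", "accusative"): "her",
        ("singular", "masculine", "nominative"): "he",
        ("singular", "masculine", "accusative"): "him",
    },
}


@dataclass(frozen=True)
class Context:
    lang: str          # supplied by the wiki that invokes the function
    page_subject: str  # item for the current page, the role {{PAGENAME}} plays


def pronoun(context, subject=None, case="nominative"):
    """Return a pronoun, filling missing arguments from the context."""
    subject = subject or context.page_subject      # default: subject of the page
    features = SUBJECT_FEATURES[subject]           # number and gender from the item
    key = (features["number"], features["gender"], case)
    return PRONOUNS[context.lang][key]             # language from the invocation


enwiki_page = Context(lang="en", page_subject="Q28054")
print(pronoun(enwiki_page))
```

In this reading the editor types only the function name; the language and the subject arrive from the invocation context, and the case falls back to an internal default.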


I don't know if a potential logo has been discussed, but I propose the Wikipedia Logo with Pictograms replacing the various characters from different languages. I might mock one up if I feel like it. If you have any ideas, or want to inform me there already is a logo, please ping me below. Cheers, WikiMacaroons (talk) 18:00, 26 July 2020 (UTC)

The logo discussion has not started yet. We're going to have the naming discussion first, and once the name is decided, the logo will follow up. Feel free to start a subpage for the logo, maybe here if you want, to collect ideas. --DVrandecic (WMF) (talk) 01:46, 5 August 2020 (UTC)
Thanks, DVrandecic (WMF), perhaps I shall :) WikiMacaroons (talk) 21:57, 7 August 2020 (UTC)

Technical discussion and documentation

  • We need a page for documenting how development is going on. I have drafted Abstract Wikipedia/ZObject, but this is obviously not enough.
  • We need a dedicated page to discuss issues related to development. For example:
    • phab:T258894 does not say how non-string data is handled. How should numbers be stored? (Storing a floating-point number as a string is not a good approach.) What about associative arrays (aka dict/map/object)?
    • We need a reference type to distinguish the object Z123 from the string "Z123", especially when we have functions that accept arbitrary types.

--GZWDer (talk) 14:21, 29 July 2020 (UTC)

@GZWDer: I am not sure what you mean about the page. We have the function model that describes a lot. I will update the ZObject page you have created. We also have the task list and the phases; the last one I am still working on and trying to connect to the respective entries in Phabricator. Let me know what is uncovered.
Yes, we should have pages (or Phabricator tickets) where we discuss issues related to development, like "How to store number?". That's a great topic. Why not discuss that here?
Yes. The suggestion I have for that is to continue to use the reference and the string types, as suggested in the function model and implemented in AbstractText. Does that resolve the issue? --DVrandecic (WMF) (talk) 21:20, 29 July 2020 (UTC)
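The distinction between the object Z123 and the string "Z123" can be illustrated with a short sketch. The class names `Reference` and `String` and the `STORE` dictionary are hypothetical; this is not the serialization used by the function model or by AbstractText, only a picture of why two separate types remove the ambiguity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reference:
    zid: str      # points at a stored object


@dataclass(frozen=True)
class String:
    value: str    # plain text, even if it happens to look like a ZID


# Hypothetical store of persistent objects, keyed by ZID.
STORE = {"Z123": String("the stored value of Z123")}


def resolve(obj):
    """Follow references; leave every other object untouched."""
    if isinstance(obj, Reference):
        return STORE[obj.zid]
    return obj


print(resolve(Reference("Z123")))  # the stored object
print(resolve(String("Z123")))     # still just the text "Z123"
```

A function that accepts arbitrary types can then tell the two apart by type rather than by inspecting the characters of the value.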

Some questions and ideas

(more will be added)

@GZWDer: Thank you for these! They are awesome! Sorry for getting to them slowly, but there's a lot of substance here! --DVrandecic (WMF) (talk) 02:13, 5 August 2020 (UTC)
And if a function is paired with its inverse, you can automatically check that the final output is the same as the initial input. As under #Distillation of existing content, this is an important consideration for a rendering sub-function, whose inverse is a parser sub-function.--GrounderUK (talk) 23:26, 29 July 2020 (UTC)
I really like the idea of using the inverse to auto-generate tests! That should be possible, agreed. Will you create a phab-ticket? --DVrandecic (WMF) (talk) 02:11, 5 August 2020 (UTC)
Eventually, yes. phab:T261460--GrounderUK (talk) 21:34, 27 August 2020 (UTC)
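The idea of using an inverse to generate tests can be sketched as a round-trip check. The renderer and parser below are toy functions invented for illustration (as are the place name and numbers); the only point is that any sample for which `parse(render(x))` differs from `x` is reported automatically.

```python
import re


def round_trip_failures(render, parse, samples):
    """Return (sample, rendered_text) pairs where parse(render(x)) != x."""
    failures = []
    for sample in samples:
        text = render(sample)
        if parse(text) != sample:
            failures.append((sample, text))
    return failures


# Toy renderer/parser pair that is lossless by construction.
def render_population(claim):
    place, population = claim
    return f"{place} has {population} inhabitants."


def parse_population(sentence):
    match = re.fullmatch(r"(.+) has (\d+) inhabitants\.", sentence)
    return (match.group(1), int(match.group(2)))


# Toy lossy renderer: rounding to the nearest thousand discards information.
def render_rounded(claim):
    place, population = claim
    return f"{place} has {round(population, -3)} inhabitants."


samples = [("Exampleville", 12345), ("Sampletown", 7000)]
print(round_trip_failures(render_population, parse_population, samples))
print(round_trip_failures(render_rounded, parse_population, samples))
```

The lossless pair produces no failures, while the rounding renderer is flagged for the sample whose population is not a multiple of a thousand.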
On second thought, we can fold Z20 into Z7 and eliminate Z20 (the Z7 will be required to return a boolean true value to pass).--GZWDer (talk) 07:47, 30 July 2020 (UTC)
Maybe. My thought is that by separating the code that creates the result and the code that checks the result, wouldn't it be easier to be sure that we are testing the right thing? --DVrandecic (WMF) (talk) 02:11, 5 August 2020 (UTC)
@DVrandecic (WMF): I don't think there will be a clear distinction between call and check error.--GZWDer (talk) 11:14, 13 August 2020 (UTC)
  • I propose a new key for Z1 (Z1K2 "quoted"). A "quoted" object (and all of its subobjects) is never evaluated and will be left as is unless unquoted using the Zxxx/unquote function (exception: an argument reference should be replaced if necessary even inside a quoted object, unless the reference is itself quoted). (There will also be a Zyyy/quote function.) Z3 will have an argument Z3K5/quoted (default: false) to specify the behavior of the constructor (for example, the Z4K1 should have Z3K5=true). Similarly we have a Z17K3. In addition we add a field K6/feature for all functions, which may take a list of ZObjects including Zxxx/quote_all.--GZWDer (talk) 08:13, 30 July 2020 (UTC)
    Yes, that's a great idea! I was thinking along similar ways. My current working idea was to introduce a new type, "Z99/Quoted object", and additionally on the Z3/Key have a marker, just as you say, to state that this is "auto-quoted", and then have an unquote function. Would that be sufficient for all use cases, or do I miss something? All the keys marked as identity would be autoquoted. But yes, I really like this idea, thanks for capturing it. We will need that. Let's create a ticket! --DVrandecic (WMF) (talk) 02:11, 5 August 2020 (UTC)
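A minimal sketch of the quoting behaviour discussed above, assuming a toy evaluator in which function calls are dictionaries with `call` and `args` entries and a quoted object is a dictionary with a single `quoted` entry. These shapes, and the `FUNCTIONS` table, are hypothetical and much simpler than real ZObjects.

```python
# Hypothetical built-in functions for the toy evaluator.
FUNCTIONS = {"add": lambda a, b: a + b}


def quote(obj):
    """Wrap an object so that the evaluator leaves it untouched."""
    return {"quoted": obj}


def evaluate(obj):
    if isinstance(obj, dict) and "quoted" in obj:
        return obj                                   # quoted: never evaluated
    if isinstance(obj, dict) and "call" in obj:
        if obj["call"] == "unquote":
            (argument,) = obj["args"]
            wrapper = evaluate(argument)             # still the quoted wrapper
            return evaluate(wrapper["quoted"])       # now evaluate its content
        arguments = [evaluate(arg) for arg in obj["args"]]
        return FUNCTIONS[obj["call"]](*arguments)
    return obj                                       # literals evaluate to themselves


call = {"call": "add", "args": [1, 2]}
print(evaluate(call))
print(evaluate(quote(call)))
print(evaluate({"call": "unquote", "args": [quote(call)]}))
```

The quoted call survives evaluation unchanged, and only the explicit unquote step turns it back into a value.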
  • I propose to introduce a Z1K3/refid field (originally proposed as Z2K4) that will hold the ZID of a persistent object. Similar to Z2K1 the value is always a string, but it is held in the value, like

The ID will be removed when a new ZObject is created on the fly. This will support a ZID() function which returns the ZID of a specific object (e.g. ZID(one)="Z382" and ZID(type(one))="Z70"). Any object created on the fly has an empty string as its ZID. (Note this is not the same as Z4K1/(identity or call), as Z1K3 may only be a string and may only be found in the value of all kinds of Z2s (not ad-hoc created objects) regardless of type.--GZWDer (talk) 15:50, 30 July 2020 (UTC))--GZWDer (talk) 08:23, 30 July 2020 (UTC)

I'm not sure what your "one" represents. Are you saying ZID(Z382) returns the string "Z382"? So, the same as ZID(Z2K4(Z382))?--GrounderUK (talk) 10:52, 30 July 2020 (UTC)
Yes, ZID(Z382) will return the string "Z382". For the second question: a key is not intended to be used as a function, so we need to define a builtin function such as getvalue(Z382, "Z2K4"), which will return an ad-hoc string "Z382" that does not have a ZID.--GZWDer (talk) 11:03, 30 July 2020 (UTC)
Ok, thanks. So, ZID() just returns the string value of whatever is functioning as the key, whether it's a Z2K1 (like "Z382") or a Z1K1 (like "Z70")...? But "Z70" looks like the Z1K1/type of the Z2K2/value rather than the Z1K1/type of Z382 (which is "Z2", as I interpret it). --GrounderUK (talk) 12:34, 30 July 2020 (UTC)
"A Z9/Reference is a reference to the Z2K2/value of the ZObject with the given ID, and means that this Z2K2/value should be inserted here." and the parameter of ZID is implicitly a reference (so only Z2K2 is passed to the function). The ZID function can not see other part of a Z2 like label. Second thought, the key should be part of Z1 instead of Z2.--GZWDer (talk) 15:14, 30 July 2020 (UTC)
Thank you for bearing with me; I missed the bit about Z9/reference! To be clear, the argument/parameter to the proposed function is or resolves to "just a string", which might be Z9/reference. If it's not a Z9/reference, the function returns an empty string ("")? When it is a Z9/reference, it is implicitly the Z2K2/value of the referenced object. The function then returns the Z1K3/refid of the referenced object (as a string) or, if there is no such Z1K3/refid (it is a non-existent object or a transient object), it returns an empty string. I'm not sure, now, whether the intent is for the Z1K3/refid to remain present in a Z2/persistent object (and equal to the Z2K1/id if the object is well-formed). Your new note above (underlined) does not say that the Z1K3 must be a string and must be present in every Z2/persistent object and must not be present in a transient object (although a transient object does not support Z9/reference, as I see it).--GrounderUK (talk) 08:47, 31 July 2020 (UTC)
Ah, I missed this discussion before I made this edit today. Would that solve the same use cases? I am a bit worried that the Z1K3 will cause issues such as "sometimes when I get a number 2 it has a pointer to the persistent object with its labels, and sometimes it doesn't", which crept up repeatedly. --DVrandecic (WMF) (talk) 02:11, 5 August 2020 (UTC)
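One possible reading of the refid proposal is sketched below. The `PERSISTENT` dictionary, the key `Z70K1` and the helper names are assumptions made for illustration; only the behaviour (a stored object reports its ZID, an object built on the fly reports an empty string) is taken from the discussion.

```python
# Hypothetical store of persistent values; each carries its own Z1K3/refid.
PERSISTENT = {
    "Z382": {"Z1K1": "Z70", "Z1K3": "Z382", "Z70K1": "1"},  # "one"
    "Z70": {"Z1K1": "Z4", "Z1K3": "Z70"},                    # the type of "one"
}


def resolve(reference):
    """Follow a reference to the stored value (the Z2K2/value in the thread)."""
    return PERSISTENT[reference]


def zid(obj):
    """Return the refid if the object has one, otherwise an empty string."""
    if isinstance(obj, dict):
        return obj.get("Z1K3", "")
    return ""


def derive(obj, **changes):
    """Create a new object on the fly; the refid is dropped, not inherited."""
    new_obj = {key: value for key, value in obj.items() if key != "Z1K3"}
    new_obj.update(changes)
    return new_obj


one = resolve("Z382")
print(zid(one))
print(zid(resolve(one["Z1K1"])))
print(repr(zid(derive(one, Z70K1="2"))))
```

The sketch also shows the concern raised in the reply: whether a value carries its pointer back to the persistent object depends on how it was obtained.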
  • Ambiguity of Local Keys: In this document it is unclear how local keys are resolved.
The de facto practice is that if the Z1K1 of a ZObject is Z7, then local keys refer to its Z7K1; otherwise they refer to its Z1K1.--GZWDer (talk) 14:58, 30 July 2020 (UTC)
Sorry, maybe I am missing something, but in the given example it seems clear for each K1 and K2 what they mean? Or is your point that over the whole document, the two different K1s and K2s mean different things? The latter is true, but not problematic, right? Within each level of a ZObject, it is always unambiguous what K1 and K2 mean, I hope (if not, that would indeed be problematic). There's an even simpler case where K1 and K2 have different meanings, e.g. every time you have embedded function calls with positional arguments, e.g. add(K1=multiply(K1=1, K2=1), K2=0). That seems OK? (Again, maybe I am just being dense and missing something). --DVrandecic (WMF) (talk) 02:22, 5 August 2020 (UTC)
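The nested example add(K1=multiply(K1=1, K2=1), K2=0) can be evaluated with a small sketch in which each K1 and K2 is resolved against the function named at its own level. The dictionary layout and the `FUNCTIONS` table are hypothetical simplifications, not the ZObject format.

```python
# Hypothetical functions whose parameters are named by local keys.
FUNCTIONS = {
    "add": lambda K1, K2: K1 + K2,
    "multiply": lambda K1, K2: K1 * K2,
}


def evaluate(obj):
    """Evaluate a call; local keys belong to the function at the same level."""
    if not isinstance(obj, dict):
        return obj
    function = FUNCTIONS[obj["function"]]
    arguments = {
        key: evaluate(value) for key, value in obj.items() if key != "function"
    }
    return function(**arguments)


expression = {
    "function": "add",
    "K1": {"function": "multiply", "K1": 1, "K2": 1},
    "K2": 0,
}
print(evaluate(expression))
```

The inner K1 and K2 are bound to multiply and the outer ones to add, so the same key names never collide within a single level.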
  • I propose a new type of ZObject, Zxxx/module. A module has a key ZxxxK1 with a list of Z2s as its value. The Z2s may have the same IDs as global objects. We introduce a new key Z9K2/module and a new ZObject Zyyy/ThisModule, which make ZObjects in a module able to refer to ZObjects in other modules. This will make functions portable and prevent polluting the global namespace. We may also introduce a modulename.zobjectname (or modulename:zobjectname) syntax for referring to an individual ZObject in a module in a composition expression. (Note a module may use ZObjects from another module, but it would be better to create a function that takes the required module as a parameter, so that a module will not rely on global ZObjects other than builtins.)--GZWDer (talk) 19:13, 30 July 2020 (UTC)
    I understand your usecase and your solution. I think having a single flat namespace is conceptually much easier. Given the purely functional model, I don't see much advantage to this form of information hiding - besides for modules being more portable between installations (a topic that I intentionally dropped for now). But couldn't portability be solved by a namespacing model similar to the way RDF and XML does it? --DVrandecic (WMF) (talk) 02:25, 5 August 2020 (UTC)
  • Placeholder type: We introduce a "placeholder" type, to provide a (globally or locally) unique localized identifier. Every placeholder object is different. See below for the usecase.--GZWDer (talk) 03:56, 31 July 2020 (UTC)
  • Class member: I propose to add
    • a new type Zxxx/member; an instance of member has two keys, ZxxxK1/key and ZxxxK2/value, both arbitrary ZObjects. If the key does not need to be localized, it may simply be a string. If it needs to be localized, it is recommended to use a placeholder object for it.
    • a new key Z4K4/members, whose value is a list of members. Note all members are static and not related to a specific instance.
    • new builtin functions member_property and member_function: member_property(Zaaa, Zbbb) will return the ZxxxK2/value of the member of the type of Zaaa (i.e. the Z1K1 of Zaaa, not Zaaa itself) whose ZxxxK1/key equals Zbbb. member_function is similar to member_property, but only works if the member found is a function, and the result is a new function with its first parameter bound to Zaaa. Therefore it returns a function related to a specific instance.

--GZWDer (talk) 03:56, 31 July 2020 (UTC)
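The two proposed builtins can be sketched as follows. For brevity the members are held in a dictionary rather than the proposed list of member objects, and the `TYPES` table, the `Circle` type and its members are invented for illustration only.

```python
import math
from functools import partial

# Hypothetical type table: members are static and attached to the type.
TYPES = {
    "Circle": {
        "members": {
            "sides": 0,
            "area": lambda instance: math.pi * instance["radius"] ** 2,
        },
    },
}


def member_property(instance, key):
    """Look the member up on the type of the instance, not on the instance."""
    return TYPES[instance["type"]]["members"][key]


def member_function(instance, key):
    """Like member_property, but bind the instance as the first parameter."""
    member = member_property(instance, key)
    if not callable(member):
        raise TypeError(f"member {key!r} is not a function")
    return partial(member, instance)


circle = {"type": "Circle", "radius": 2}
print(member_property(circle, "sides"))
print(member_function(circle, "area")())
```

The static member is shared by every instance of the type, while the result of `member_function` is tied to the one instance it was bound to.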

  • Inherit type: I propose to add a new key Z4K5/inherit, the value is a list of ZObjects.
    • An inherited type will have all members of the parent type(s), and also all keys. (The parent type should be persistent, so that it will be possible to create an instance with specific keys - the keys may consist of those defined in the child type and those defined in the parent type. This may be overcome - I have two schemes to define a relative key, but both have some weak points. One solution is discussed in "Temporary ZID" below--GZWDer (talk) 18:11, 4 August 2020 (UTC)) Members defined in the child type will override any member defined in the parent type(s).
    • We introduce a builtin function isderivedfrom() to query whether a type is a child type of another.
    • This will make it possible to build functions for arbitrary types derived from a specific interface (interfaces being types themselves, with no keys), such as Serializable or Iterator.
      • An iterator (a type derived from the Iterable type) is simply any type with a next function, which generates a "next state" from the current state. Some examples are a generator of all (infinitely many) prime numbers, or a cursor over database query results.
      • We would be able to create a general filter function for arbitrary iterators, which will itself return an iterator.

--GZWDer (talk) 03:56, 31 July 2020 (UTC) A generalized filter for an Iterable will generate an Iterable
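The iterator idea (a type with a next function that derives the next state from the current one) and the generalized filter can be sketched as below. The `Iterator` class and the helpers `naturals`, `filtered`, `is_prime` and `take` are hypothetical names chosen for this example; states are plain integers to keep it short.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Iterator:
    state: int
    step: Callable[[int], int]   # derives the next state from the current one

    def next(self):
        return Iterator(self.step(self.state), self.step)


def naturals(start=0):
    return Iterator(start, lambda n: n + 1)


def filtered(iterator, keep):
    """Return a new iterator that visits only the states satisfying `keep`."""
    def step(state):
        candidate = iterator.step(state)
        while not keep(candidate):
            candidate = iterator.step(candidate)
        return candidate

    first = iterator.state
    while not keep(first):
        first = iterator.step(first)
    return Iterator(first, step)


def is_prime(n):
    return n > 1 and all(n % d for d in range(2, int(n ** 0.5) + 1))


def take(iterator, count):
    """Collect the first `count` states (needed because iterators may be infinite)."""
    values = []
    for _ in range(count):
        values.append(iterator.state)
        iterator = iterator.next()
    return values


primes = filtered(naturals(), is_prime)
print(take(primes, 5))
```

Because `filtered` returns an `Iterator`, filters compose: `filtered(primes, some_predicate)` is again an iterator over an infinite sequence, evaluated only as far as it is consumed.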

  • @GZWDer: I raised inheritance with Denny a while back; I agree there needs to be a mechanism for it, and it's already implicit in the way Z1 keys work. But I wonder if it needs to be declared... If a ZObject has keys that belong to another, doesn't that implicitly suggest it inherits the meaning of those keys? Or there could be subdomains of ZObjects under some of which typing is stricter than for others (i.e. inherit from "typed object" vs inherit from plain "ZObject")? Typing and inheritance can get pretty complicated though and perhaps is only necessary for some purposes. ArthurPSmith (talk) 17:27, 31 July 2020 (UTC)
    @GZWDer: Agreed, that would be a way to implement OO mechanisms in the suggested model. And I won't stop anyone from doing it. My understanding is that this would all work *on top* of the existing model. I hope to be able to avoid putting inheritance and subtyping into the core system, as it makes it a much simpler system. But it should be powerful enough to implement it on top. Fortunately, this would not require a change in the current plans, if I see this correctly. Sounds right? --DVrandecic (WMF) (talk) 02:33, 5 August 2020 (UTC)
  • I propose a new key Z2K4/Serialization version to mitigate breaking changes to the serialization format. For example, "X123abcK1" is not a valid key, but I propose to use such keys below.--GZWDer (talk) 18:11, 4 August 2020 (UTC)
    This might be needed. The good thing is that we can assume this to be already here, have a default value of "version 1", and introduce that key and the next value whenever we need it. So, yes, will probably be needed at some point. (I hope to push this point as far to the future as possible :) ) --DVrandecic (WMF) (talk) 02:36, 5 August 2020 (UTC)
  • Temporary ZID: we introduce a new key Z1K4/Temporary ZID. A temporary ZID may have a format Xabc123 where abc123 is a random series of hexadecimal digits (alternatively we can use only decimal digits). For a transient object without a Z1K4 specified, a random Temporary ZID will be generated (which is not stable). Use cases:
    • As the key of a basic unit of ZObject used by evaluators; i.e. when evaluating, we use a pool of ZObjects to reduce redundant evaluation.
    • One of the solutions for "relative keys" (see above): the XID may easily be used to form a key like X123abcK1.
    • A new serialized format to reduce duplication: a ZObject may have a large number of identical embedded ZObjects.
    Some other notes:
    • For easier evaluation, the temporary ZID should be globally unique. However, this is not easy to guarantee, especially if the temporary ZID is editable.
    • When a ZObject is changed it should get a new temporary ZID. But similarly, this is not easy to guarantee.
    • We introduce a new function is() to check whether two objects have the same temporary ZID (ZObjects created on the fly have a random temporary ZID, so they are not equivalent to other ZObjects).
    • This may make it possible to have ZObjects with subobjects relying on each other (such as a pair whose first element points to the pair itself). We should discuss whether this should be allowed. Such objects do not have a finite (traditional) serialization and may break other functions such as equals_to. If such objects are not allowed, an alternative is to use a "hash" to refer to a specific ZObject.
      • Note that how equal_to will work for two custom objects is itself an epic.

--GZWDer (talk) 18:04, 4 August 2020 (UTC)

I agree with the use cases for Z1K4. In AbstractText I solved these use cases by either taking a hash of the object, or a string serialization - i.e. the object representation is its own identity. Sometimes, the evaluator internally added such a temporary ID and used that, IIRC. But that all seems to be something that is only interesting within the confines of an evaluation engine, right? And an evaluation engine should be free to do these modifications (and much more) as it wishes. And there such solutions will be very much needed - but why would we add those to the function model and to Z1 in general? We wouldn't store those in the wiki, they would just be used in the internal implementation of the evaluator - or am I missing something? So yes, you are right, this will be required - but if I understand it correctly, it is internal, right? --DVrandecic (WMF) (talk) 02:42, 5 August 2020 (UTC)
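The alternative mentioned in the reply above, where the object representation is its own identity, can be sketched with a content hash over a canonical serialization. The function names `content_id` and `intern_object`, the choice of SHA-256 and the example keys are assumptions for illustration; nothing here is taken from the AbstractText implementation.

```python
import hashlib
import json


def content_id(zobject):
    """Hash a canonical serialization so equal content gets an equal ID."""
    canonical = json.dumps(
        zobject, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def intern_object(pool, zobject):
    """Keep one shared copy per distinct content inside an evaluator's pool."""
    return pool.setdefault(content_id(zobject), zobject)


pool = {}
first = intern_object(pool, {"Z1K1": "Z70", "Z70K1": "1"})
second = intern_object(pool, {"Z70K1": "1", "Z1K1": "Z70"})  # same content, other key order
third = intern_object(pool, {"Z1K1": "Z70", "Z70K1": "2"})

print(first is second)
print(len(pool))
```

Such identifiers stay internal to the evaluator: they need not be stored in the wiki, and they change automatically whenever the content changes.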

Some questions and concerns

Depending on the eventual implementation details, Abstract Wikipedia may need support from the WMF Search Platform team. So a couple of us from our team had a conversation with Denny about Abstract Wikipedia, and we had some questions and concerns. Denny suggested sharing them in a more public forum, so here we are! I've added some sub-headings to give it a little more structure for this forum. In no particular order...

Grammar/renderer development vs writing articles, and who will do the work

Each language renderer, in order to be reasonably complete, could be a significant fraction of the work required to just translate or write the "core" articles in that language, and the work can only be done by a much smaller group of people (linguists or grammarians of the language, not just any speaker). With a "medium-sized" language like Guarani (~5M speakers, but only ~30 active users on the Guarani Wikipedia in the last month), it could be hard to find interested people with the technical and linguistic skills necessary. Though, as Denny pointed out, perhaps graduate students in relevant fields would be interested in developing a grammar—though I'm still worried about how large the pool of people with the requisite skills is.

I share this concern, but I think we need to think of it as a two-edged sword. Maybe people will be more interested in contributing to both grammar and content than to either one alone. I certainly hope that this distinction will become very blurred; our goal is content and our interest in grammar is primarily to support the instantiation of content in a particular natural language (which, you know, is what people actually do all the time). We need to downplay the "technical and linguistic skills" by focusing on what works. People love fixing bad grammar or poor word choices (it's a parent–child thing), so perhaps there are two separate challenges here: how to get a rendering process that produces intelligible content versus one that produces correctly phrased content. Native speakers will certainly have a key role in identifying incorrectly phrased content; they may even be tempted to fix a problem they understand. They won't necessarily be existing editors, however; they may even have been intrigued by the slightly quirky Wikipedia idiolect they've been hearing about! Ultimately, though, their community will need the maturity to allow the more difficult linguistic problems they identify to percolate upwards or outwards to more specialist contributors, and these may be non-native grammarians, programmers, contributors of any stripe.
What about the initial "intelligible" renderers? We need to explore this as we develop language-neutral content. We will see certain very common content frameworks, which should already be apparent from Wikidata properties. So we will be asking whether there is a general (default) way to express an instance of (P31), for example, and how it varies according to the predicate and (somewhat more problematically) the subject. We will also observe how certain Wikidata statements are linguistically subordinate (being implied or assumed). So, <person> is (notable for) <role in activity>, rather than instance of (P31) human (Q5)... To the extent that such observations are somewhat universal, they serve as a useful foundation for each successive renderer: how does the new language follow the rules and exceptions derived for previous languages (specifically for the language-neutral content so far developed; we never need a complete grammar of any language except, perhaps, the language-neutral synthetic language that inevitably emerges as we proceed).
Who will do the work? Anyone and everyone! Involvement from native speakers would be a pre-requisite for developing any new renderer, but the native speakers will be supported by an enthusiastic band of experienced linguistic problem-solvers, who (will by then) have already contributed to the limited success of several renderers for an increasing quantity of high-quality language-neutral content. --GrounderUK (talk) 12:59, 25 August 2020 (UTC)
@GrounderUK: "People love fixing bad grammar or poor word choices (it's a parent–child thing), so perhaps there are two separate challenges here: how to get a renderer that produces intelligible content versus one that produces correctly phrased content. Native speakers will certainly have a key role in identifying incorrectly phrased content; they may even be tempted to fix a problem they understand. They won't necessarily be existing editors, however; they may even have been intrigued by the slightly quirky Wikipedia idiolect they've been hearing about!" This is a great notion, and hopefully it will be one the project eventually benefits from. --Chris.Cooley (talk) 00:04, 29 August 2020 (UTC)
@Chris.Cooley: Thanks, Chris, one can but hope! Denny's crowdsourcing ideas (below) are a promising start. Just for the record, your quoting me prompted me to tweak the original; I've changed "a renderer" to "a rendering process" (in my original but not in your quoting of it).
@TJones (WMF): One workflow we have been brainstorming was to:
  1. crowdsource the necessary lexical knowledge, which can be done without particular expertise beyond language knowledge
  2. crowdsource what sentences for specific simple constructors would look like (from, e.g., bilingual speakers who are shown simple sentences in another language, e.g. Spanish, and then asked to write them down in Guarani)
  3. then even non-Guarani speakers could try to actually build renderers, using the language input as test sentences
  4. now verify the renderers again with Guarani speakers for more examples, and gather feedback
It would be crucial that the end result allows Guarani speakers to easily mark up issues (as GrounderUK points out), so that these can be addressed. But this is one possible workflow that would remove the need to have deep linguistic and coding knowledge available in each language community, and it can spread the workload in a way that could help with filling that gap.
This workflow would be built on top of Abstract Wikipedia and would not require many changes to the core work. Does this sound reasonable or entirely crazy?
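To make steps 2 to 4 of the workflow above a bit more concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the `instance_of` constructor name, the `Example` record, and the two Spanish sentences, which merely stand in for whatever bilingual speakers would actually contribute (Spanish is used here only because the examples are easy to check). The point is that a renderer written by a non-speaker can be checked mechanically against crowdsourced sentences, and the mismatches become the feedback list for speakers.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Example:
    """One crowdsourced sentence for a simple constructor (step 2)."""
    constructor: str
    args: Tuple[str, ...]
    expected: str


# Hypothetical stand-ins for sentences that bilingual speakers would supply.
CROWDSOURCED: List[Example] = [
    Example("instance_of", ("Marte", "planeta"), "Marte es un planeta."),
    Example("instance_of", ("Madrid", "ciudad"), "Madrid es una ciudad."),
]


def draft_renderer(constructor: str, args: Tuple[str, ...]) -> str:
    """Step 3: a first attempt that knows nothing about grammatical gender."""
    if constructor != "instance_of":
        raise ValueError(f"no renderer for {constructor}")
    subject, noun = args
    return f"{subject} es un {noun}."


def verify(renderer: Callable[[str, Tuple[str, ...]], str],
           examples: List[Example]) -> List[Tuple[Example, str]]:
    """Step 4: collect every example whose rendering differs from the speaker's sentence."""
    failures = []
    for ex in examples:
        actual = renderer(ex.constructor, ex.args)
        if actual != ex.expected:
            failures.append((ex, actual))
    return failures


failures = verify(draft_renderer, CROWDSOURCED)
for ex, actual in failures:
    print(f"expected {ex.expected!r}, got {actual!r}")
```

In this toy run the draft renderer gets the masculine noun right and the feminine one wrong, which is exactly the kind of issue a speaker could flag without having to read the renderer code.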
One even crazier idea, well beyond what I hope for, is that we will find that the renderers are rather uniform across languages in many areas, that we can actually share a lot of the renderer code across languages, and that languages are basically defined through a small number of parameters. There are some linguists who believe such things possible. But I don't dare bet on it. --DVrandecic (WMF) (talk) 21:50, 28 August 2020 (UTC)

The "Grammatical Framework" grammatical framework

Denny mentioned Grammatical Framework and I took a look at it. I think it is complex enough to represent most grammatical phenomena, but I don’t see very much that is actually developed. The examples and downloadable samples I see are all small toy systems. It isn’t 100% clear to me that any grammatical framework can actually capture all grammatical phenomena—and with certain kinds of edge cases and variation in dialects, it may be a lost cause—and linguists still argue over the right way to represent phenomena in major languages in major grammatical frameworks. Anyway, it looks like the Grammatical Framework still leaves a lot of grammar development to be done; assuming it’s possible (which I’m willing to assume), it doesn’t seem easy or quick, especially as we get away from major world languages.

Yes. I keep meaning to take another look but, basically, GF itself is not something I would expect many people to enjoy using for very long (I may become an exception) ...and "very long" certainly seems to be required. I'm usually pretty open-minded when it comes to possible solutions but I've not added GF to my list of things that might work. That's not to say that there can be no place for it in our solution landscape; it's just that we do need to keep focusing on the user experience, especially for the user who knows the under-served language but has no programming experience and no formal training in grammar. I can hardly remember what that feels like (ignoring the fact that my first language is hardly under-served)! Apart from committing to engaging with such stakeholders, it's not clear what more we can usefully do, at this stage, when it comes to evaluating alternative solutions. That said, I am 99.9% sure that we can find a somewhat satisfactory solution for a lot of encyclopedic content in a large number of languages; the principal constraint will always be the availability of willing and linguistically competent contributor communities.
One thing GF has going for it is that it is intentionally multilingual. As I understand it, our current plan is to begin natural-language generation with English renderers. I'm hoping we'll change our minds about that. Sooner rather than later, in any event, I would hope that renderer development would be trilingual (as a minimum). Some proportion of renderer functions may be monolingual, but I would like to see those limited to the more idiosyncratic aspects of the language. Or, perhaps, if there are good enough reasons to develop a function with only a single language in mind, we should also consider developing "equivalent provision" in our other current languages. What that means in practice is that the test inputs into our monolingual functions must also produce satisfactory results for the other languages, whether by using pre-existing functions or functions developed concurrently (more or less). --GrounderUK (talk) 13:02, 26 August 2020 (UTC)
I agree, we shouldn't start development based on a single language, and particularly not English. Starting in three languages, ideally across at least two main families, sounds like a good idea. --DVrandecic (WMF) (talk) 21:55, 28 August 2020 (UTC)
@DVrandecic (WMF): That's very good to hear! Is that an actual change of our collective minds, or just your own second thoughts? P2.7 doesn't seem to have a flexible interpretation. In my reading of P2.8 to P2.10, there also appears to be a dependency on P2.7 being fairly advanced, but maybe that's just my over-interpretation.--GrounderUK (talk) 07:52, 1 September 2020 (UTC)
It is true that a lot of development would be required for GF, but even more development would be required for any other framework. GF being open source, any development would flow back into the general public, too, so working with GF and its developers is likely the correct solution.
The GF community is interested in helping out. See , where Aarne Ranta suggests a case study for some area with adequate data, like mathematics. —The preceding unsigned comment was added by Inariksit (talk) 18:44, 28 August 2020
I am a big fan of reusing as much of GF as possible, but my reference to GF was not to mean that we necessarily have to use it as is, but rather that it shows that such a project is at all possible. We should feel free to either follow GF, or to diverge from it, or to use it as an inspiration - but what it does show is that the task we are aiming for has been achieved by others in some form.
Having said that, I am very happy to see Aarne react positively to the idea! That might be the beginning of a beautiful collaboration. --DVrandecic (WMF) (talk) 21:54, 28 August 2020 (UTC)

Incompatible grammatical frameworks across language documentation

I worry that certain grammatical traditions associated with given languages may make it difficult to work compatibly across languages. A lot of Western European languages have traditionally followed the grammatical model of Latin, even when it doesn’t make sense—though there are of course many grammatical frameworks for the major languages. But it’s easy to imagine that the best grammar for a given medium-sized language was written by an old-fashioned European grammarian, based on a popular grammatical model from the 1970s. Reconciling that with whatever framework that has been developed up until that point may create a mess.

Speaking as an old-fashioned European grammarian... "a mess" is inevitable! It should be as messy as necessary but no messier (well, not much messier). I'm not sure that there's much "reconciling" involved, however. Given our framing of the problem, I don't see how "our mess" can be anything other than "interlingual" (as Grammatical Framework is). This is why I would prefer not to start with English; the first few languages will (inevitably?) colour and constrain our interlingua. So we need to be very careful, here. To set against that is our existing language-neutral content in Wikidata. Others must judge whether Wikidata is already "too European", but we must take care that we do not begin by constructing a "more English" representation of Wikidata content, or coerce it into a "more interlingual" representation, where the interlingua is linguistically more Eurocentric than global. So, first we must act to counter first-mover advantage and pre-existing bias, which means making things harder for ourselves, initially. At the same time, all language communities can be involved in refining our evolving language-neutral content, which will be multi-lingually labelized (like Wikidata). If some labelized content seems alien in some language, this can be flagged at an early stage (beginning now, for Wikidata). What this means is that all supported languages can already be reconciled, to an extent, with our foundational interlingua (Wikidata), and any extensions we come up with can also be viewed through our multi-lingual labelization. I suppose this is a primitive version of the "intelligible" content I mentioned earlier. 
When it comes to adding a further language (one that we currently support with labelization in Wikidata), we may hope to find that we are already somewhat reconciled, because linguistic choices have already been made in the labelization and our new target consumers can say what they dislike and what they might prefer; they do not need to consult some dusty grammar tome. In practice, they will already have given their opinions because that is how we will know that they are ready to get started (that is, theirs is a "willing and linguistically competent" community). In the end, though, we have to accept that the previous interlingual consensus will not always work and cannot always be fixed. This is when we fall back on the "interlingual fork" (sounds painful!). That just means adding an alternative language-neutral representation of the same (type of) encyclopedic content. I say "just" even though it might require rather more effort than I imagine (trust me, it will!) because it does seem temptingly straightforward. I say we must resist the temptation, but not stubbornly; a fallback is a tactical withdrawal, not a defeat; it is messy, but not too messy.--GrounderUK (talk) 12:54, 27 August 2020 (UTC)
Agree with GrounderUK here - I guess that we will have some successes and some dead ends during the implementation of certain languages, and that this is unavoidable. And that might partially stem from the state of the art in linguistic research in some languages.
On the other side, my understanding is that we won't be aiming for 7000 languages but for 400, and that these 400 languages will in general be better described and have more current research work about them than the other 6600. So I have a certain hope that for most languages that we are interested in we do have more current and modern approaches, that are more respectful of the language itself.
And finally, the lens of the 1970s European grammarians will probably not matter that much in the end anyway, if we have access to a large enough number of contributors native in the given language. That should be a much more important voice in shaping the renderers of a language than dated grammar books. --DVrandecic (WMF) (talk) 22:01, 28 August 2020 (UTC)

Introducing new features into the framework needed by a newly-added language

In one of Denny's papers (PDF) on Abstract Wikipedia, he discusses (Section “5 Particular challenges”, p.8) how different languages make different distinctions that complicate translation—in particular needing to know whether an uncle is maternal or paternal (in Uzbek), and whether or not a river flows to the ocean (in French). I am concerned about some of the implications of this kind of thing.

Recoding previously known information with new features

One problem I see is that when you add a new language with an unaccounted-for feature, you may have to go back and recode large swathes of the information in the Abstract Wikipedia, both at the fact level, and possibly at some level of the common renderer infrastructure.

Suppose we didn’t know about the Uzbek uncle situation and we add Uzbek. Suddenly, we have to code maternal/paternal lineage on every instance of uncle everywhere. Finding volunteers to do the new uncle encoding seems like it could be difficult. In some cases the info will be unknown and either require research, or it is simply unknowable. In other cases, if it hasn’t been encoded yet, you could get syntactically correct but semantically ill-formed constructions.

@TJones (WMF): Yes, that is correct, but note that going back and recoding won't actually have an effect on the existing text.
So assume we have an abstract corpus that uses the "uncle" construct, and that renders fine in all languages supported at that time. Now we add Uzbek, and we need to refine the "uncle" construct into either "maternal-uncle" or "paternal-uncle" in order to render appropriately in Uzbek - but both these constructs would be (basically) implemented as "use the previous uncle construct unless Uzbek". So all existing text in all supported languages would continue to be fine.
Now, when we render Uzbek, the corpus needs to be retrofitted. But that merely blocks the parts of Uzbek renderings that are dealing with the construct "uncle". It has no impact on other languages. And since Uzbek didn't have any text so far (that's why we are discovering this issue now), it also won't reduce the amount of Uzbek generated text.
So, yes, you are completely right, but we still have a monotonically growing generated text corpus. And as we backfill the corpus, more and more of the text now becomes also available in Uzbek - but there would be no losses on the way. --DVrandecic (WMF) (talk) 22:09, 28 August 2020 (UTC)
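The "use the previous uncle construct unless Uzbek" idea can be sketched roughly as follows. This is only an illustration under stated assumptions: the function names, the `lineage` field, and the three-statement corpus are hypothetical, and the Uzbek entries are placeholder tokens rather than real vocabulary.

```python
from typing import Dict, List, Optional

# Languages supported before Uzbek was added: one word, lineage is irrelevant.
UNCLE: Dict[str, str] = {"en": "uncle", "de": "Onkel", "fr": "oncle"}

# Placeholder tokens, NOT real Uzbek vocabulary.
UZBEK_UNCLE: Dict[str, str] = {
    "paternal": "<uz:paternal-uncle>",
    "maternal": "<uz:maternal-uncle>",
}


def render_uncle(lang: str, lineage: Optional[str] = None) -> Optional[str]:
    """Refined construct: behaves like the old one unless the language is Uzbek."""
    if lang != "uz":
        return UNCLE[lang]          # lineage is simply ignored
    if lineage is None:
        return None                 # not yet backfilled: no Uzbek rendering
    return UZBEK_UNCLE[lineage]


def coverage(lang: str, corpus: List[Optional[str]]) -> int:
    """Count how many 'uncle' statements can currently be rendered in lang."""
    return sum(render_uncle(lang, lineage) is not None for lineage in corpus)


# Three existing statements, none of which records lineage yet.
corpus: List[Optional[str]] = [None, None, None]
before = {lang: coverage(lang, corpus) for lang in ("en", "de", "uz")}

# A contributor backfills one statement.
corpus[0] = "maternal"
after = {lang: coverage(lang, corpus) for lang in ("en", "de", "uz")}

print(before)
print(after)
```

In this sketch the counts for English and German never change, and the Uzbek count can only go up as statements are backfilled, which is the "no losses on the way" property described above.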

Semantically ill-formed defaults

For example, suppose John is famous enough to be in Abstract Wikipedia. It is known that John’s mother is Mary, and that Mary’s brother is Peter, hence Peter is John’s uncle. However, the connection from John to Peter isn’t specifically encoded yet, and we assume patrilineal links by default. We could then generate a sentence like “John’s paternal uncle Peter (Mary’s brother) gave John $100.” Alternatively, if you try to compute some of these values, you are building an inference engine and a) you don’t want to do that, b) you really don’t want to accidentally do it by building ad hoc rules or bots or whatever, c) it’s a huge undertaking, and d) did I mention that you don’t want to do that?

Agreed. --DVrandecic (WMF) (talk) 22:10, 28 August 2020 (UTC)

Incompatible cultural "facts"

In the river example, I can imagine further complications because of cultural facts, which may require language-specific facts to be associated with entities. Romanian makes the same distinction as French rivière/fleuve with râu/fluviu. Suppose, for example, that River A has two tributaries, River B and River C. For historical reasons, the French think of all three as separate rivers, while the Romanians consider A and B to be the same river, with C as tributary. It’s more than just overlapping labels—which is complex enough. In French, B is a rivière because it flows into A, which is a fleuve. In Romanian, A and B are the same thing, and hence a fluviu. So, a town on the banks of River B is situated on a rivière in French, and a fluviu in Romanian, even though the languages make the same distinctions, because they carve up the entity space differently.

Yes, that is an interesting case. Similar situations happen with the usage of articles in names of countries and cities and whether you refer to a region as a country, a state, a nation, etc., which may differ from language to language. For these fun examples I could imagine that we have information in Wikidata connecting the Q ID for A and B to the respective L items. But that is a bit more complex indeed.
Fortunately, these cultural differences happen particularly on topics that are important for a given language community. That increases the chance that the given language community already has an article about this topic in its own language, and thus will not depend on Abstract Wikipedia to provide baseline content. That's the nice thing: we are not planning to replace the existing content, but only to fill in the currently existing gaps. And these gaps will, more likely than not, cover topics that will not have these strong linguistic-cultural interplays. --DVrandecic (WMF) (talk) 22:15, 28 August 2020 (UTC)

Having to code information you don't really understand

In both of the uncle and river cases, an English speaker trying to encode information is either going to be creating holes in the information store, or they are going to be asked to specify information they may not understand or have ready access to.

A far-out case that we probably won’t actually have to deal with is still illustrative. The Australian Aboriginal language Guugu Yimithirr doesn’t have relative directions like left and right. Instead, everything is in absolute geographic terms; i.e., “there is an ant on your northwest knee”, or “she is standing to the east-northeast of you.” In addition to requiring all sorts of additional coding of information (“In this image, Fred is standing to the left/south-southwest of Jane”) depending on the implementation details of the encoding and the renderers, it may require changing how information is passed along in various low-level rendering functions. Obviously, it makes sense to make data/annotation pipelines as flexible as possible. (And again, the necessary information may not be known because it isn’t culturally relevant—e.g., in an old photo from the 1928 Olympics, does anyone know which way North is?)

@TJones (WMF): Yes, indeed, we have to be able to deal with holes. My assumption is that when a contributor creates a first draft of an abstract article, their main concern will be the languages they speak. The UI may nudge them to fill in further holes, but it shouldn't require it. And the contributor can save the content now and reap the benefit for all languages where the abstract content has all necessary information.
Now if there is a language that requires some information that is not available in the abstract content, where there is such a hole, then the rendering of this sentence for this language will fail.
We should have workflows that find all holes for a given language and then contributors can go through those and try to fill them, thereby increasing the amount of content that gets rendered in their languages - something that might be amenable to micro-contributions (or not, depending on the case). But all of this will be a gradual, iterative process. --DVrandecic (WMF) (talk) 22:21, 28 August 2020 (UTC)
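A workflow that finds all holes for a given language could look roughly like the sketch below. The requirement table, the feature names (`lineage`, `flows_to_ocean`), and the statement records are all hypothetical; only the two facts that Uzbek needs lineage for "uncle" and French needs the ocean distinction for "river" come from the discussion, and the remaining table entries are assumptions made so the example runs.

```python
from typing import Dict, List, Set, Tuple

# Features a language needs before it can render a constructor.
# Illustrative table only; empty sets are assumptions, not linguistic claims.
REQUIRED: Dict[Tuple[str, str], Set[str]] = {
    ("uncle", "en"): set(),
    ("uncle", "fr"): set(),
    ("uncle", "uz"): {"lineage"},
    ("river", "en"): set(),
    ("river", "fr"): {"flows_to_ocean"},
    ("river", "uz"): set(),
}

# A tiny abstract corpus; 'features' holds whatever contributors have filled in.
CORPUS: List[dict] = [
    {"id": "s1", "constructor": "uncle", "features": {"lineage": "maternal"}},
    {"id": "s2", "constructor": "uncle", "features": {}},
    {"id": "s3", "constructor": "river", "features": {}},
]


def missing(statement: dict, lang: str) -> Set[str]:
    """Features that lang requires for this statement but that are not recorded."""
    needed = REQUIRED[(statement["constructor"], lang)]
    return needed - set(statement["features"])


def holes_for(lang: str, corpus: List[dict]) -> List[Tuple[str, List[str]]]:
    """Work list for contributors: statements that cannot yet be rendered in lang."""
    result = []
    for statement in corpus:
        gap = missing(statement, lang)
        if gap:
            result.append((statement["id"], sorted(gap)))
    return result


for lang in ("en", "fr", "uz"):
    print(lang, holes_for(lang, CORPUS))
```

Each entry in the resulting list is small and self-contained, which is why this kind of output might be a reasonable basis for micro-contributions.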

Impact on grammar and renderers

Other examples that are less dramatic but more common are clusivity, evidentiality, and ergativity. If these features also require agreement in verbs, adjectives, pronouns, etc., the relevant features will have to be passed along to all the right places in the sentence being generated. Some language features seem pretty crazy if you don't know those languages (like Salishan nounlessness and Guarani noun tenses) and may make it necessary to radically rethink the organization of information in statements and how information is transmitted through renderers.

Yes, that is something where I look to Grammatical Framework for inspiration, as it solved these cases pretty neatly by having one abstract grammar and several concrete grammars, passing information from one to the other while still allowing the different flows of agreement. --DVrandecic (WMF) (talk) 22:22, 28 August 2020 (UTC)

Grammatical concepts that are similar, but not the same

I’m also concerned about grammatical concepts that are similar across languages but differ in the details. English and French grammatical concepts often map to each other more-or-less, but the details where they disagree cause consistent errors in non-native speakers. For example, French says “Je travaille ici depuis trois ans” in the present tense, while English uses (arguably illogically, but that’s language!) the perfect progressive “I have been working here for three years”. Learners in both directions tend to do direct translations because those particular tenses and aspects usually line up.

Depending on implementation, I can see it being either very complex to represent this kind of thing or the representation needing to be language-specific (which could be a disaster—or at least a big mess). Neither a monolingual English speaker nor a monolingual French speaker will be able to tag the tense and aspect of this information in a way that allows it to be rendered correctly in both languages. Similarly, the subjunctive in Spanish and French do not cover the same use cases, and as an English speaker who has studied both I still have approximately zero chance of reliably encoding either correctly, much less both at the same time—though it’s unclear, unlike uncle, where such information should be encoded. If it’s all in the renderers, maybe it’ll be okay—but it seems that some information will have to be encoded at the statement level.

The hope would be that the exact tense and mood would indeed not be encoded in the abstract representation at all, but added to it by the individual concrete renderers. So the abstract representation of the example might be something like works_at(1stPSg, here, years(3)), and it would be up to the renderers to either render that using the present or the perfect progressive. The abstract representation would need to abstract from the language-specificities. These high-level abstract representations would probably break down first into lower level concrete representations, such as French clause(1stPSg, work, here, depuis(years(3)), present tense, positive) or English clause(1stPSg, work, here, for(years(3)), perfect progressive, positive), so that we have several layers from the abstract representation slowly working down its way to a string with text in a given language. --DVrandecic (WMF) (talk) 22:30, 28 August 2020 (UTC)
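The layering described above can be sketched as two steps: lowering the abstract content to a language-specific clause, then realizing that clause as a string. The class names, function names, and lookup tables below are hypothetical and cover only this one example; a real renderer would need actual morphology and a lexicon instead of hard-coded forms.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorksAt:
    """Abstract content: works_at(1stPSg, here, years(3)). No tense or mood here."""
    subject: str
    place: str
    years: int


@dataclass(frozen=True)
class Clause:
    """Lower-level, language-specific representation."""
    subject: str
    verb: str
    place: str
    duration_marker: str
    years: int
    tense: str
    polarity: str


def lower(content: WorksAt, lang: str) -> Clause:
    """Each language chooses its own duration marker and tense."""
    if lang == "fr":
        marker, tense = "depuis", "present"
    elif lang == "en":
        marker, tense = "for", "perfect progressive"
    else:
        raise ValueError(f"no lowering rule for {lang}")
    return Clause(content.subject, "work", content.place,
                  marker, content.years, tense, "positive")


# Toy lookup tables standing in for real morphology and lexicon.
VERB_GROUP = {
    ("fr", "1stPSg", "work", "present"): "Je travaille",
    ("en", "1stPSg", "work", "perfect progressive"): "I have been working",
}
PLACE = {("fr", "here"): "ici", ("en", "here"): "here"}
NUMBER = {("fr", 3): "trois", ("en", 3): "three"}
YEARS = {"fr": "ans", "en": "years"}


def realize(clause: Clause, lang: str) -> str:
    """Turn a concrete clause into a string of text in the given language."""
    verb_group = VERB_GROUP[(lang, clause.subject, clause.verb, clause.tense)]
    return " ".join([
        verb_group,
        PLACE[(lang, clause.place)],
        clause.duration_marker,
        NUMBER[(lang, clause.years)],
        YEARS[lang],
    ])


content = WorksAt("1stPSg", "here", 3)
french = realize(lower(content, "fr"), "fr")
english = realize(lower(content, "en"), "en")
print(french)
print(english)
```

The contributor only ever writes the `WorksAt` value; the choice between present and perfect progressive lives entirely in `lower`, so neither a monolingual English nor a monolingual French speaker has to tag it.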

“Untranslatable” concepts and fluent rendering

Another random thought I had concerns “untranslatable” concepts, one of the most famous being Portuguese saudade. I don’t think anything is actually untranslatable, but saudade carries a lot more inherent cultural associations than, say, a fairly concrete concept like Swedish mångata. Which sense/translation of saudade is best to use in English—nostalgia, melancholy, longing, etc.—is not something a random Portuguese speaker is going to be able to encode when they try to represent saudade. On the other hand, if saudade is not in the Abstract Wikipedia lexicon, Portuguese speakers may wonder why; it’s not super common, but it’s not super rare, either—they use it on Portuguese Wikipedia in the article about Facebook, for example.

Another cultural consideration is when to translate/paraphrase a concept and when to link it (or both)—saudade, rickroll, lolcats. Dealing with that seems far off, but still complicated, since the answer is language-specific, and may also depend on whether a link target exists in the language (either in Abstract Wikipedia or in the language’s “regular” Wikipedia).

Yes! And that's exactly how we deal with it! So the saudade construct might be represented as a single word in Portuguese, but as a paraphrase in other languages. There is no requirement that each language has to have each clause built of components of the same kind. The same thing would allow us to handle some sentence qualifiers as adjectives or adverbs if a language can do that (say if there is an adjective stating that something happened yesterday evening), and use a temporal phrase otherwise. The abstract content can break apart in very different concrete renderers. --DVrandecic (WMF) (talk) 22:35, 28 August 2020 (UTC)
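As a very small illustration of one construct coming out as different kinds of components, consider the sketch below. The table and function are hypothetical, and the English paraphrase is just one arbitrary choice among the candidates mentioned earlier (nostalgia, melancholy, longing), not a recommended translation.

```python
from typing import Dict, Tuple

# (concept, language) -> (kind of component, surface text).
# The English entry is an assumed paraphrase chosen only for illustration.
LEXICALIZATION: Dict[Tuple[str, str], Tuple[str, str]] = {
    ("saudade", "pt"): ("noun", "saudade"),
    ("saudade", "en"): ("noun_phrase", "a nostalgic longing"),
}


def render_concept(concept: str, lang: str) -> Tuple[str, str]:
    """Return the component kind and text; the kind may differ per language."""
    return LEXICALIZATION[(concept, lang)]


for lang in ("pt", "en"):
    kind, text = render_concept("saudade", lang)
    print(lang, kind, text)
```

The surrounding renderer then has to accept either kind in that position, which is what "no requirement that each language has to have each clause built of components of the same kind" amounts to in practice.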

Discourse-level encoding?

I’m also unclear how things are going to be represented at a discourse level. Biographies seem to be more schematic than random concrete nouns or historical events, but even then there are very different types of details to represent about different biographical subjects. Or is every page going to basically be a schema of information that gets instantiated into a language? Will there be templates like BiographyIntro(John Smith, 1/1/1900, Paris, 12/31/1980, Berlin, ...) ? I don’t know whether that is a brilliant idea or a disaster in the making.

It's probably neither. So, particularly for bio-intros? That's what, IIRC, Italian Wikipedia is already doing, in order to increase the uniformity of their biographies. But let's look at other examples: I expect that for tail entities that will be the default representation, a more or less large template that takes some data from Wikidata and generates a short text. That is similar to the way LSJbot is working for a large number of articles - but we're removing the bus factor from LSJbot and we are putting it into a collaboratively controllable space.
Now, I would be rather disappointed if things stopped there: so, when we go beyond tail entities, I hope that we will have much more individually crafted abstract contents, where the simple template is replaced by diverse, more specific constructors, and only in case these are not available for rendering in a given language, we fall back to the simple template. And these would indeed need to also represent discourse-level conjunctions such as "however", "in the meantime", "in contrast to", etc. --DVrandecic (WMF) (talk) 22:40, 28 August 2020 (UTC)
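The fallback from individually crafted constructors to the simple template could work along these lines. This is a sketch under assumptions: the constructor names `Birth` and `Contrast`, the argument keys, and the renderer table are invented for illustration; `BiographyIntro` is borrowed from the question above, and the dates are reduced to years to keep the strings short.

```python
from typing import Callable, Dict, List, Optional, Tuple

Constructor = Tuple[str, Dict[str, str]]  # (constructor name, arguments)

# Hypothetical renderer table: (language, constructor name) -> function.
RENDERERS: Dict[Tuple[str, str], Callable[[Dict[str, str]], str]] = {
    ("en", "BiographyIntro"): lambda a: (
        f"{a['name']} was born in {a['birthplace']} in {a['birth_year']} "
        f"and died in {a['deathplace']} in {a['death_year']}."),
    ("de", "BiographyIntro"): lambda a: (
        f"{a['name']} wurde {a['birth_year']} in {a['birthplace']} geboren "
        f"und starb {a['death_year']} in {a['deathplace']}."),
    ("en", "Birth"): lambda a: (
        f"{a['name']} was born in {a['birthplace']} in {a['birth_year']}."),
    ("en", "Contrast"): lambda a: f"However, {a['clause']}.",
}


def render(lang: str, crafted: List[Constructor],
           fallback: Constructor) -> Optional[str]:
    """Use the crafted constructors if all of them render in lang, else the template."""
    def available(constructor: Constructor) -> bool:
        return (lang, constructor[0]) in RENDERERS

    if crafted and all(available(c) for c in crafted):
        chosen = crafted
    elif available(fallback):
        chosen = [fallback]
    else:
        return None  # nothing can be rendered in this language yet
    return " ".join(RENDERERS[(lang, name)](args) for name, args in chosen)


facts = {"name": "John Smith", "birth_year": "1900", "birthplace": "Paris",
         "death_year": "1980", "deathplace": "Berlin"}
template: Constructor = ("BiographyIntro", facts)
crafted: List[Constructor] = [
    ("Birth", facts),
    ("Contrast", {"clause": "he died in Berlin"}),
]

print(render("en", crafted, template))
print(render("de", crafted, template))
print(render("gn", crafted, template))
```

Here English gets the crafted text with its discourse-level "However", German falls back to the template because it has no `Contrast` renderer yet, and a language with no renderers at all gets nothing rather than a broken sentence.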

Volume, Ontologies, and Learning Curves, Oh My!

Finally, I worry about the volume of information to encode (though Wikipedia has conquered that) and the difficulty of getting the ontology right (which Wikidata has done well, though on a less ambitious scale than Wikilambda requires, I think), and the learning curve that the more complex grammatical representations impose on editors.

I (and others) like to say that Wikipedia shouldn't work in theory, but in practice it does—so maybe these worries are overblown, or maybe worrying about them is how they get taken care of.

Trey Jones (WMF) (talk) 21:59, 24 August 2020 (UTC)

I don't think the worries you stated are overblown - these are all well-grounded worries, and for some of them I had specific answers, and for some I am fishing for hope that we will be able to overcome them when we get to them. One advantage is that I don't have to have all answers yet, but that I can rely on the ingenuity of the community to get some of these blockers out of the way, as long as we create an environment and a project that attracts enough good people.
And this also means that this learning curve you are describing here is one of my main worries. The goal is to design a system that allows contributors who want to do so to dive deep into the necessary linguistic theories and the necessary computer science background to really dig through the depths of Wikilambda and Abstract Wikipedia - and I expect to rely on these contributors sooner rather than later. But at the same time it must remain possible to effectively contribute if you don't do that, or else the project will fail. We must provide contribution channels where confirming or correcting lexicographic forms works without the contributor having to fully understand all the other parts, where abstract content can be created and maintained without having to have a degree in computational linguistics. I still aim for a system where, in case of an event (say, a celebrity marries, an author publishes a new book, or a new mayor gets elected) an entirely fresh contributor can figure out in less than five minutes how to actually make the change on Abstract Wikipedia. This will be a major challenge, and not even because I think the individual user experiences for those things will be incredibly hard, but rather because it is unclear how to steer the contributor to the appropriate UX.
But yes, the potential learning curve is a major obstacle, and we need to address that effectively, or we will fall short of the potential that this project has. --DVrandecic (WMF) (talk) 22:50, 28 August 2020 (UTC)


@All: @TJones (WMF):

These are all serious issues that need to be addressed, which is why I proposed a more rigorous development part for "producing a broad spectrum of ideas and alternatives from which a program for natural language generation from abstract descriptions can be selected" at

With respect to your first three sections, I assumed Denny only referred to Grammatical Framework as a system with similar goals to Abstract Wikipedia. I also assumed that the idea was that only those grammatical phenomena needed to somewhat sensibly express the abstract content in a target language need to be captured, and that hopefully generating those phenomena will require much less depth and sophistication than constructing a grammar.

With respect to the section "Recoding previously known information with new features," the system needs to be typologically responsible from the outset. I think we should be aware of the features of all of the languages we are going to want to eventually support.

In your example, I think coding "maternal/paternal lineage on every instance of uncle everywhere" would be great, but not required. A natural language generation system will not (soon) be perfect.

With respect to the section "Semantically ill-formed defaults": in other words, what should we do when we want to make use of contextual neutralization (e.g., using paternal uncle in a neutral context where the paternal/maternal status of the uncle is unknown) but we cannot ensure the neutral context?[1] I would argue that there are certain forms that we should prefer because they appear less odd when the neutral context is supposedly violated than others (e.g., they in The man was outside, but they were not wearing a coat). There is also an alternative in circumlocution: for example, we could reduce specificity and use relative instead of paternal or maternal uncle and hope that any awkward assumptions in the text of a specifically uncle-nephew/niece relationship are limited.

Unfortunately, it does seem that to get a really great system, you need a useful inference engine ...

With respect to the section "Incompatible cultural 'facts'," this is the central issue to me, but I would rely less on the notion of "cultural facts." We are going to make a set of entity distinctions in the chosen set of Abstract Wikipedia articles, but the generated natural language for these articles needs to respect how the supported languages construe conceptual space (speaking in cognitive linguistics terms for a moment).[2] I am wondering if there is a typologically responsible way of ordering these construals (perhaps as "a lattice-like structure of hierarchically organized, typologically motivated categories"[3]) that could be helpful to this project.

With respect to the section "Having to code information you don't really understand," as above, I think there are sensible ways in which we can handle "holes in the information store." For example, in describing an old photo from the 1928 Olympics, can we set North to an arbitrary direction, and what are the consequences if there is context that implies a different North?

With respect to the section "Discourse-level encoding?," I do hope we can simulate the "art" of writing a Wikipedia article to at least some degree.

  1. Haspelmath, Martin. "Against markedness (and what to replace it with)." Journal of linguistics (2006): 39.; Croft, William. Typology and universals, 100-101. Cambridge University Press, 2002.
  2. Croft, William, and Esther J. Wood. "Construal operations in linguistics and artificial intelligence." Meaning and cognition: A multidisciplinary approach 2 (2000): 51.
  3. Van Gysel, Jens EL, Meagan Vigus, Pavlina Kalm, Sook-kyung Lee, Michael Regan, and William Croft. "Cross-lingual semantic annotation: Reconciling the language-specific and the universal." ACL 2019 (2019): 1.

--Chris.Cooley (talk) 11:45, 25 August 2020 (UTC)

Thanks Chris, and thanks for starting that page! I think this will indeed be an invaluable resource for the community as we get to these questions. And yes, as you said, I referred to GF as a similar project and as an inspiration and as a demonstration of what is possible, not necessarily as the actual implementation to use. Also, you will find that my answers above don't always fully align with what you said, but I think that they are in general compatible. --DVrandecic (WMF) (talk) 22:56, 28 August 2020 (UTC)
@DVrandecic (WMF): Thanks, and I apologize for sounding like a broken record with respect to a part PP2! With respect to the above, it does sound like we disagree on some things, but I am sure there will be time to get into them later! --Chris.Cooley (talk) 00:29, 29 August 2020 (UTC)
Some of these issues are I think avoided by the restricted language domain we are aiming for - encyclopedic content. There should be relatively limited need for constructions in the present or future tenses. No need to worry about the current physical location or orientation of people, etc. If something is unknown then a fall-back construction ("brother of a parent" rather than a specific sort of "uncle") should be fine, if possibly inelegant. We don't need to capture every possible aspect of language, just sufficient to convey the meanings intended. ArthurPSmith (talk) 18:56, 25 August 2020 (UTC)
Thanks Arthur! Yes, I very much appreciate the pragmatic approach here - and I think a combination of the pragmatic getting things done with the aspirational wanting everything to be perfect will lead to the best tension to get this project catapulted forward! --DVrandecic (WMF) (talk) 22:56, 28 August 2020 (UTC)
@ArthurPSmith: "the restricted language domain we are aiming for - encyclopedic content" I wish this was so, but I am having trouble thinking of an encyclopedic description of a historical event — for example — as significantly restricted. I could easily see the physical location or orientation of people becoming relevant in such a description. "If something is unknown then a fall-back construction ('brother of a parent' rather than a specific sort of 'uncle') should be fine, if possibly inelegant. We don't need to capture every possible aspect of language, just sufficient to convey the meanings intended." I totally agree, and I think this could be another great fallback alternative. --Chris.Cooley (talk) 00:29, 29 August 2020 (UTC)
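
The fallback discussed in this thread (use the specific kinship term when the lineage is known, and a circumlocution such as "brother of a parent" when it is not) can be sketched roughly as follows. This is only an illustration under stated assumptions: the `Uncle` class, the `LEXICON` table, the placeholder language code `xx` and its capitalised placeholder word forms are all hypothetical, and nothing here reflects an actual Abstract Wikipedia renderer.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Uncle:
    """Abstract content: a parent's brother, whose lineage may be unknown."""
    side: Optional[str] = None  # "maternal", "paternal", or None if unknown


# Hypothetical lexicon. "xx" stands for any language that lexically splits
# the concept; its word forms are placeholders, not real words.
LEXICON = {
    "en": {"any": "uncle"},
    "xx": {
        "maternal": "MATERNAL-UNCLE",
        "paternal": "PATERNAL-UNCLE",
        "fallback": "BROTHER-OF-A-PARENT",
    },
}


def render_uncle(concept: Uncle, lang: str) -> str:
    """Pick the most specific term available, else fall back to a paraphrase."""
    entry = LEXICON[lang]
    if "any" in entry:             # language does not distinguish lineage
        return entry["any"]
    if concept.side in entry:      # lineage is known and lexicalised
        return entry[concept.side]
    return entry["fallback"]       # lineage unknown: inelegant but accurate
```

With this sketch, `render_uncle(Uncle(), "xx")` yields the paraphrase rather than guessing a lineage, which is the trade-off of accuracy over fluency raised above.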

Might deep learning-based NLP be more practical?

First of all, I'd like to state that Abstract Wikipedia is a very good idea. I applaud Denny and everyone else who has worked on it.

This is kind of a vague open-ended question, but I didn't see it discussed so I'll write it anyway. The current proposal for Wikilambda is heavily based on a generative-grammar type view of linguistics; you formulate a thought in some formal tree-based syntax, and then explicitly programmed transformational rules are applied to convert the output to a given natural language. I was wondering whether it would make any sense to make use of connectionist models instead of / in addition to explicitly programmed grammatical rules. Deep learning based approaches (most impressively, Transformer models like GPT-3) have been steadily improving over the course of the last few years, vindicating connectionism at least in the context of NLP. It seems like having a machine learning model generate output text would be less precise than the framework proposed here, but it would also drastically reduce the amount of human labor needed to program lots and lots of translational rules.

A good analogy here would be Apertium vs. Google Translate/DeepL. Apertium, as far as I understand it, consists of a large number of rules programmed manually by humans for translating between a given pair of languages. Google Translate and DeepL are just neural networks trained on a huge corpus of input text. Apertium requires much more human labor to maintain, and its output is not as good as its ML-based competitors. On the other hand, Apertium is much more "explainable". If you want to figure out why a translation turned out the way it did (for example, to fix it), you can find the rules that caused it and correct them. Neural networks are famously messy and Transformer models are basically impossible to explain.

Perhaps it would be possible to combine the two approaches in some way. I'm sure there's a lot more that could be said here, but I'm not an expert on NLP so I'll leave it at that. PiRSquared17 (talk) 23:46, 25 August 2020 (UTC)

@PiRSquared17: thank you for this comment, and it is a question that I get asked a lot.
First, yes, the success of ML systems in the last decade has been astonishing, and I am amazed by how much the field has developed. I had one prototype that was built around an ML system, but even more than the Abstract Wikipedia proposal it exhibited a Matthew effect - languages that were already best represented benefitted from that architecture the most, whereas the languages that needed most help would get least of it.
Another issue is, as you point out, that within a Wikimedia project I would expect the ability for contributors to go in and fix errors. This is considerably easier with the symbolic approach chosen for Abstract Wikipedia than with an ML-based approach.
Having said that, there are certain areas where I will rely on ML-based solutions in order to get them working. This includes an improved UX to create content, and this includes analysis of the existing corpora as well as of the generated corpora. There is even the possibility of using an ML-based system to do the surface cleanup of the text to make it more fluent - basically, to have an ML-based system do copy editing on top of the symbolically generated text, which could have the potential to reduce the complexity of the renderers considerably and yet get good fluency - but all of these are ideas.
In fact, I am planning to write a page here where I outline possible ML tasks in more detail.
@DVrandecic (WMF): Professor Reiter ventured an idea or two on this topic a couple of weeks ago: " may be possible to use GPT3 as an authoring assistant (eg, for developers who write NLG rules and templates), for example suggesting alternative wordings for NLG narratives. This seems a lot more plausible to me than using GPT3 for end-to-end NLG."--GrounderUK (talk) 18:22, 31 August 2020 (UTC)
@GrounderUK: With respect to GPT-3, I am personally more interested in things like for the (cross-linguistic) purposes of Abstract Wikipedia. You might be able to imagine strengthening Wikidata and/or assisting abstract content editing. --Chris.Cooley (talk) 22:33, 31 August 2020 (UTC)
@Chris.Cooley: I can certainly imagine such a thing. However it happens, the feedback into WikidataPlusPlus is kinda crucial. I think I mentioned the "language-neutral synthetic language that inevitably emerges" somewhere (referring to Wikidata++) as being the only one we might have a complete grammar for. Translations (or other NL-type renderings) into that interlingua from many Wikipedias could certainly generate an interesting pipeline of putative data. Which brings us back to #Distillation of existing content (2nd paragraph)...--GrounderUK (talk) 23:54, 31 August 2020 (UTC)
@DVrandecic (WMF): May we go ahead and create such an ML page, if you haven't already? James Salsman (talk) 19:00, 16 September 2020 (UTC)
Now it could be that ML and AI will develop with such a speed to make Abstract Wikipedia superfluous. But to be honest (and that's just my point of view), given the development of the field in the last ten years, I don't see that moment being considerably closer than it was five years ago (but also, I know a number of teams working on a more ML-based solution to this problem, and I honestly wish them success). So personally I think there is a window of opportunity for Abstract Wikipedia to help billions of people for quite a few years, and to allow many more to contribute to the world's knowledge sooner. I think that's worth it.
Amusingly, if Abstract Wikipedia succeeds, I think we'll actually accelerate the moment where we make its existence unnecessary. --DVrandecic (WMF) (talk) 03:55, 29 August 2020 (UTC)
I will oppose introducing any deep learning technique as it is 1. difficult to develop 2. difficult to train 3. difficult to generalize 4. difficult to maintain.--GZWDer (talk) 08:06, 29 August 2020 (UTC)
It could be helpful to use ML-based techniques for auxiliary features, such as parsing natural language into potential abstract contents for users to choose/modify, but using such techniques for rendered text might not be a good idea, even if it's just used on top of symbolically generated text. For encyclopedic content, accuracy/preciseness is much more important than naturalness/fluency. As suggested above, "brother of a parent" could be an acceptable fallback solution for the "uncle" problem, even if it doesn't sound natural in a sentence. While an ML-based system will make sentences more fluent, it could potentially turn a true statement into a false one, which would be unacceptable. Actually, many concerns raised above, including the "uncle" problem, could turn out to be advantages of the current rule-based approach over an ML-based approach. Although those are challenging issues we need to address, it would be more challenging for ML/AI to resolve those issues. --Stevenliuyi (talk) 23:41, 12 September 2020 (UTC)

Translatable modules


This is not exactly about Abstract Wikipedia, but it's quite closely related, so it may interest the people who care about Abstract Wikipedia.

The Wikimedia Foundation's Language team started the Translatable modules initiative. Its goal is to find a good way to localize Lua modules as conveniently as it is done for MediaWiki extensions.

This project is related to task #2 in Abstract Wikipedia/Tasks, "A cross-wiki repository to share templates and modules between the WMF projects". The relationship to Abstract Wikipedia is described in much more detail on the page mw:Translatable modules/Engineering considerations.

Everybody's feedback about this project is welcome. In particular, if you are a developer of templates or Lua (Scribunto) modules, your user experience will definitely be affected sooner or later, so make your voice heard! The current consultation stage will go on until the end of September 2020.

Thank you! --Amir E. Aharoni (WMF) (talk) 12:45, 6 September 2020 (UTC)

@Aaharoni-WMF: Thanks, that is interesting. I do wonder whether there is more overlap in Phase 1 than your link suggests. Although it is probably correct to say that the internationalization of ZObjects will be centralized initially, there was some uncertainty about which multi-lingual labelization and documentation solution should be pursued. Phab:T258953 is now resolved but I don't understand it well enough to see whether it aligns with any of your solution options. As for phase 2, I simply encourage anyone with an interest in language-neutral encyclopedic content to take a look at the link you provided and the associated discussion. Thanks again.--GrounderUK (talk) 19:43, 7 September 2020 (UTC)

Response from the Grammatical Framework community

The Grammatical Framework (GF) community has been following the development of Abstract Wikipedia with great interest. This summary is based on a thread at GF mailing list and my (inariksit) personal additions.


GF has a Resource Grammar Library (RGL) for 40 or so languages, and 14 of them have large-scale lexical resources and extensions for wide-coverage parsing. The company Digital Grammars (my employer) has been using GF in commercial applications since 2014.

To quote GF's inventor Aarne Ranta on the previously linked thread:

My suggestion would have a few items:

  • that we develop a high-level API for the purpose, as done in many other NLG projects
  • that we make a case study on an area or some areas where there is adequate data. For instance from OpenMath
  • that we propagate this as a community challenge
  • Digital Grammars can sponsor this with some tools, since we have gained experience from some larger-scale NLG projects

Morphological resources from Wiktionary inflection tables

With the work of Forsberg and Hulden, and of Kankainen, it's possible to extract GF resources from Wiktionary inflection tables.

Quoting Kristian Kankainen's message:

Since the Wiktionary is a popular place for inflection tables, these could be used for boot-strapping GF resources for those languages. Moreover, but not related to GF nor Abstract Wikipedia, the master's thesis generates also FST code and integrates the language into the Giella platform which provides an automatically derived simple spell-checker for the language contained in the inflection tables. Coupling or "boot-strapping" the GF development using available data on Wiktionary could be seen as a nice touch and would maybe be seen as a positive inter-coupling of different Wikimedia projects.

Division of labour

Personally, I would love to spend the next couple of years reading grammar books and encoding basic morphological and syntactic structures of languages like Guarani or Greenlandic into the GF RGL. With those in place, a much wider audience can write application grammars, using the RGL via a high-level API.

Of course, for this to be a viable solution, more people than just me need to join in. I believe that if the GF people know that their grammars will be used, the motivation to write them is much higher. To kickstart the resource and API creation, we could make Abstract Wikipedia a special theme of the next GF summer school, whenever that is organised (live or virtually).

Addressing some concerns from this talk page

Some of the concerns on the talk page are definitely valid.

  • It is going to take a lot of time. Developing the GF Resource Grammar Library has taken 20 calendar years and (at least) 20 person years. I think everyone who has a say in the choice of renderer implementation should get familiar with the field---check out other grammar formalisms, like HPSG, you'll see similar coverage and timelines to GF[1].
  • The "Uzbek uncle" situation happens often with GF grammars when adding a new language or new concepts. Since this happens often, we are prepared for it. There are constructions in the GF language and module system that make dealing with this manageable.
  • "Incompatible cultural facts" is a minefield of its own, far beyond the scope of NLG. I personally think we should start with a case study for a limited domain.

On the other hand, worrying about things like ergativity or when to use subjunctive tells me that the commenters haven't understood just how abstract an abstract syntax can be. To illustrate this, let me quote the GF Best Practices document on page 9:

Linguistic knowledge. Even the most trivial natural language grammars involve expert linguistic knowledge. In the current example, we have, for instance, word inflection and gender agreement shown in French: le bar est ouvert (“the bar is open”, masculine) vs. la gare est ouverte (“the station is open”, feminine). As Step 3 in Figure 3 shows, the change of the noun (bar to gare) causes an automatic change of the definite article (le to la) and the adjective (ouvert to ouverte). Yet there is no place in the grammar code (Figure 2) that says anything about gender or agreement, and no occurrence of the words la, le, ouverte! The reason is that such linguistic details are inherited from a library, the GF Resource Grammar Library (RGL). The RGL guarantees that application programmers can write their grammars on a high level of abstraction, and with a confidence of getting the linguistic details automatically right.

Language differences. The RGL takes care of the rendering of linguistic structures in different languages. [--] The renderings are different in different languages, so that e.g. the French definition of the constant the_Det produces a word whose form depends on the noun, whereas Finnish produces no article word at all. These variations, which are determined by the grammar of each language, are automatically created by the RGL. However, the example also shows another kind of variation: English and French use adjectives to express “open” and “closed”, whereas Finnish uses adverbs. This variation is chosen by the grammarian, by picking different RGL types and categories for the same abstract syntax concepts.

Obviously the GF RGL is far from covering all possible things people might want to say in a Wikipedia article. But an incomplete tool that covers the most common use cases, or covers a single domain well, is still very useful.
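
To make the quoted passages more concrete, here is a minimal sketch of the same separation between an abstract tree and per-language rendering. It is written in Python, not GF, and none of the names below come from the RGL: the tables, `is_open` and the `linearize_*` functions are hypothetical stand-ins for what a resource library would supply. The French forms are the ones in the quote; the Finnish word forms are my own addition to illustrate the "no article, adverb instead of adjective" point.

```python
# Language-specific details live in "library" tables, not in application code.
FRENCH_NOUNS = {"bar": ("bar", "m"), "station": ("gare", "f")}
FRENCH_ARTICLE = {"m": "le", "f": "la"}
FRENCH_OPEN = {"m": "ouvert", "f": "ouverte"}
FINNISH_NOUNS = {"bar": "baari", "station": "asema"}


def is_open(place):
    """Abstract syntax: a language-neutral tree with no gender or article."""
    return ("IsOpen", place)


def linearize_fr(tree):
    """French: article and adjective agree with the gender of the noun."""
    _, place = tree
    noun, gender = FRENCH_NOUNS[place]
    return f"{FRENCH_ARTICLE[gender]} {noun} est {FRENCH_OPEN[gender]}"


def linearize_en(tree):
    """English: invariant article and adjective."""
    _, place = tree
    return f"the {place} is open"


def linearize_fi(tree):
    """Finnish: no article word, and an adverb ("auki") instead of an adjective."""
    _, place = tree
    return f"{FINNISH_NOUNS[place]} on auki"
```

The application author only writes `is_open("station")`; swapping "bar" for "station" changes the French article and adjective without the caller ever mentioning gender.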


Non-European and underrepresented languages

Regarding discussions such as, I'm happy to see that you are interested in underrepresented languages. The GF community has members in South Africa, Uganda and Kenya, doing or having previously done work on Bantu languages. At the moment (September 2020), there is ongoing development in Zulu, Xhosa and Northern Sotho.

This grammar work has been used in a healthcare application, and you can find a link to a paper describing the application in this message.

If any of these sounds interesting to you, we can start a direct dialogue with the people involved.

Concluding words

Whatever system Abstract Wikipedia chooses, it will follow the evolutionary path of GF (and no doubt other similar systems), so it's better to learn from that regardless of whether GF is chosen or not. We are willing to help, whether it's actual GF programming or sharing experiences---what we have tried, what has worked, what hasn't.

On behalf of the GF community,

inariksit (talk) 08:19, 7 September 2020 (UTC)


I second what inariksit says about the interest from the GF community - if GF were to be used by AW, it would give a great extra motivation for writing resource grammars, and it would also benefit the GF community by giving the opportunity to test and find remaining bugs in the grammars for smaller languages.

Skarpsill (talk) 08:34, 7 September 2020 (UTC)

Thanks for this summary! Nemo 09:10, 13 September 2020 (UTC)

@Inariksit:, @Skarpsill: - thank you for your message, and thank you for reaching out. I am very happy to see the interest and the willingness to cooperate from the Grammatical Framework community. In developing the project, I have read the GF book at least three times (I wish I was exaggerating), and have taken inspiration in how GF has solved a problem many times when I got stuck. In fact, the whole idea that Abstract Wikipedia can be built on top of a functional library being collected in the wiki of functions can be traced back to GF being built as a functional language itself.

I would love for us to find ways to cooperate. I think it would be a missed opportunity not to learn from or even directly use the RGLs.

I keep this answer short, and just want to check a few concrete points:

  • when Aarne mentioned to "develop a high-level API for the purpose, as done in many other NLG projects", what kind of API is he thinking of? An API to GF, or an abstract grammar for encyclopaedic knowledge?
  • whereas math would be a great early domain, given that you already have experience in the medical domain, and several people have raised the importance of medical knowledge for closing gaps in Wikipedia, could that be an interesting early focus domain?
  • regarding our timeline, we plan to work on the wiki of functions in 2021, and start working on the NLG functionalities in late 2021 and throughout 2022. Given that timeline, what engagement would make sense from your side?

I plan to come back to this and answer a few more points, but I have been sitting on this for too long already. Thank you for reaching out, I am very excited! --DVrandecic (WMF) (talk) 00:44, 26 September 2020 (UTC)

@DVrandecic (WMF): Thanks for your reply! I'm not quite sure what Aarne means with the high-level API: an application grammar in GF, or some kind of API outside GF that builds GF trees from some structured data. If it's the former, it would just look like any other GF application grammar, if the latter, it could look something like this toy example in my blog post (ignore that the black box says "nobody understands this code"): the non-GF users would interact with the GF trees on some higher level, like in that picture, first choosing the dish pizza and then choosing pizza toppings, generates the GF tree for "Your pizza has (the chosen toppings)".
The timeline starting in late 2021 is ideal for me personally. I think that medical domain is a good domain as well, but I don't know enough to discuss details. I hope that our South African community gets interested (initial interest is there, as confirmed by the thread on GF mailing list).
I would love to have an AW-themed GF summer school in 2022. It would be a great occasion to introduce GF people to AW, and AW people to GF, in a 2-week intensive course. If you (as in the whole AW team) think this is a good idea, then it would be appropriate to start organising it already in early 2021. If we want to target African languages, we could e.g. try to organise it in South Africa. I know this may seem premature to bring it up now, but if we want to do something like this in a larger scale, it's good to start early.
Btw, I joined the IRC channel #wikipedia-abstract, so we can talk more there if you want. --inariksit (talk) 17:42, 14 October 2020 (UTC)
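
The second reading of "high-level API" mentioned above (something outside GF that turns structured choices into a tree, which is then rendered) might look roughly like the sketch below. It is only an illustration of the idea: `pizza_tree`, `render_en` and the constructor names `YourDishHas` and `ToppingList` are hypothetical, and no real GF grammar or GF API is involved.

```python
def pizza_tree(toppings):
    """Build a tree from structured input (hypothetical constructor names)."""
    if not toppings:
        raise ValueError("at least one topping is required")
    return ("YourDishHas", "Pizza", ("ToppingList", tuple(toppings)))


def render_en(tree):
    """Render the tree in English; a real system would delegate to a grammar."""
    _, dish, (_, toppings) = tree
    if len(toppings) == 1:
        listed = toppings[0]
    else:
        listed = ", ".join(toppings[:-1]) + " and " + toppings[-1]
    return f"Your {dish.lower()} has {listed}."
```

A non-GF user would only pick the dish and the toppings; the tree and its rendering stay behind the interface.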

Parsing Word2Vec models and generally

Moved to Talk:Abstract Wikipedia/Architecture#Constructors. James Salsman (talk) 20:36, 16 September 2020 (UTC)

Naming the wiki of functions

We've started the next steps of the process for selecting the name for the "wiki of functions" (currently known as Wikilambda), at Abstract Wikipedia/Wiki of functions naming contest. We have more than 130 proposals already in, which is far more than we expected. Thank you for that!

On the talk page some of you have raised the issue that this doesn’t allow for effective voting, because most voters will not go through 130+ proposals. We've adjusted the process based on the discussion. We're trying an early voting stage, to hopefully help future voters by emphasizing the best candidates.

If you'd like to help highlight the best options, please start (manually) adding your Support votes to the specific proposals, within the "Voting" sub-sections, using: * {{support}} ~~~~

Next week we'll split the list into 2, emphasizing the top ~20+ or so, and continue wider announcements for participation, along with hopefully enabling the voting-button gadget for better accessibility. Cheers, Quiddity (WMF) (talk) 00:18, 23 September 2020 (UTC)

Merging the Wikipedia Kids project with this

I also proposed a Wikipedia Kids project, and somebody said that I should merge it with Abstract Wikipedia. Is this possible? Eshaan011 (talk) 16:50, 1 October 2020 (UTC)

Hi, @Eshaan011: Briefly: Not at this time. In more detail: That's an even bigger goal than our current epic goal, and whilst it would theoretically become feasible to have multiple-levels of reading-difficulty (albeit still very complicated, both technically and socially) once the primary goal is fully implemented and working successfully, it's not something we can commit to, so we cannot merge your proposal here. However, I have updated the list of related proposals at Childrens' Wikipedia to include your proposal and some other older proposals that were missing, so you may wish to read those, and potentially merge your proposal into one of the others. I hope that helps. Quiddity (WMF) (talk) 20:06, 1 October 2020 (UTC)