Talk:Abstract Wikipedia/Template Language for Wikifunctions


General

This is an excellent proposal, thank you. There is not much that I can disagree with, but there is some thinking that I am reluctant to accept. First, I’d like to avoid having templates as the only form of renderer. Secondly, I’d like to imagine that there would be the option for some renderers to be shared across “languages”. Of particular concern is the status of “Simple English”. Something else to consider is how to link appropriate sources to the rendered content, particularly where reliable sources in the target language are scarce (or not identified). GrounderUK (talk) 13:46, 11 August 2022 (UTC)

First, thanks for the compliment! Regarding your remarks:
  • Templates will not necessarily be the only form of renderers. Rather, they would be one way to specify renderers without doing any involved programming. However, as long as one obeys the API contract of the NLG system (which may need to be defined precisely), one could write renderers in code and they would still be compatible with the overall NLG architecture.
  • Templates can be shared across languages using the language-dispatch mechanism. The idea is that one can specify a template for more or less specific language codes, including no language code at all. I have clarified this now in the proposal text. So one could define Simple English to fall back to the normal English grammar and lexemes whenever needed (see the sketch below).
  • I'm not entirely sure I understand your concern about linking "appropriate sources". The template itself could mention a source, e.g. "According to {source} ...", or use a footnote notation. Of course, the source would have to be mentioned in the Abstract Content. In general, the source may well be written in a different language than the rendered one, but we could add a mechanism to prefer native-language sources, if available.
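For illustration, here is a minimal sketch of the dispatch idea in the proposal's notation (the template names, the en-simple code and the fallback chain are assumptions for this example, not part of the proposal):
Age_renderer_en(Entity, Age_in_years): "{Person(Entity)} is {Cardinal(Age_in_years)} years old."
Age_renderer_en-simple: (none defined) → the dispatch falls back to Age_renderer_en
Age_renderer: (no language code) → a last-resort template, used when no language-specific one exists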
AGutman-WMF (talk) 16:08, 19 August 2022 (UTC)
Further to the previous comments,
  • For the notion of 'languages' mentioned, Simple English could be treated as a separate language (a CNL of full English), and one may consider adding a 'formality' property to lexemes or to templates or both, to generate more or less formal sentences. Is that a feature you (also) want? In addition, we had discussed -- but not added to the template language itself, and therewith not elaborated on in the proposal -- the notion of equivalent alternate templates for the same content, with the aim of generating different sentences so that the output does not look stale across articles; e.g., for the same example on the age of a person, to add the person's title in another variant, or only the surname, or to append 'old' or 'young'. Would you like to have that option, and do you expect it embedded in the template language?
  • I'm also not entirely sure what you mean by "appropriate sources", but if it concerns provenance of the content data fetched from Wikidata, then the Wikidata item would be the place to record that, and it could then be fetched as content to be added to the template with, say, a {source} slot (see the sketch below). The same can be done with the lexicographic data, since that also resides in Wikidata and can have its provenance recorded. In both cases, a further note on the reliability of the sources can be added, and that too could be fetched with a query/function and added to the template, and therewith to the text of the sentence or paragraph it would generate.
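For instance, a hedged sketch of such a template (the Source function fetching the recorded provenance from Wikidata, and the source argument, are assumptions for illustration):
Age_with_source_renderer_en(Entity, Age_in_years, source): "According to {Source(source)}, {Person(Entity)} is {Cardinal(Age_in_years)} years old."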
Keet10 (talk) 10:50, 1 September 2022 (UTC)

@AGutman-WMF and Keet10: I thank you both for your replies. I suspect you are right to think that there will generally be one template per “language”, with some flexibility around how we define that in practice. I think the updates clarify this (thank you), although I am not sure what the Lexemes would be if we end up with no language code at all.

Simple English is an exceptional case because it already has its own Wikipedia. It is not properly “controlled” (in the CNL sense) because more complex lexemes can be explained within the article, and subsequent use will not necessarily refer back to the first use. I recall some previous discussion of style and lexicon varying more generally according to reader preference. I’ll see if I can find a link for that. I shall also expand on the difficulties with sources, but this is a multi-faceted problem and it’s hard to remain on topic. GrounderUK (talk) 13:38, 1 September 2022 (UTC)

Talk:Abstract_Wikipedia/Archive_3#Merging_the_Wikipedia_Kids_project_with_this refers back to Talk:Abstract_Wikipedia/Archive_2#Hybrid_article which is rather long but includes my comment

.. given a particular Q, we can return a derived or pre-defined infoText (speaking hypothetically) but I doubt we'll have a separate pre-defined result for every conceivable use case. We can imagine what a "full" set might look like, and maybe a "minimal" set ("short description"), but less-than-full or more-than-minimal...? Pre-defined, explicitly, yes: just go ahead and define it as if it were an article. Derived from style guidelines and editorial policy but not specific to a particular Q, maybe: I guess our "full" set would respect some express cross-topic guidelines, which should be "adjustable" (level of language, level of subject expertise etc). Let's see what people want.--GrounderUK (talk) 17:19, 29 July 2020 (UTC)

I suppose now that “adjustable by level of subject expertise” is principally an Abstract Content challenge, although it is reasonable to expect that simpler content would avoid specialist terminology and might also use more straightforward sentences. I don’t have a clear idea of how level of language might be adjustable within a particular template or across “equivalent alternate templates”. I am inclined to view it as varying independently from level of detail, even though there is a clear tendency for additional detail to result in language that is harder to understand. Consider a real-world example:

Marie Curie (Q7186)

English Wikipedia: "Marie Salomea Skłodowska–Curie (/ˈkjʊəri/ KURE-ee, French pronunciation: [maʁi kyʁi], Polish pronunciation: [ˈmarja skwɔˈdɔfska kʲiˈri]; born Maria Salomea Skłodowska, Polish: [ˈmarja salɔˈmɛa skwɔˈdɔfska]; 7 November 1867 – 4 July 1934) was a Polish and naturalized-French physicist and chemist who…"

French Wikipedia: "Marie Skłodowska-Curie, ou simplement Marie Curie, née Maria Salomea Skłodowska (prononcé [ˈmarja salɔˈmɛa skwɔˈdɔfska]) le 7 novembre 1867 à Varsovie (royaume de Pologne, sous domination russe) et morte le 4 juillet 1934 à Passy, dans le sanatorium de Sancellemoz (Haute-Savoie), est une physicienne et chimiste polonaise, naturalisée française par son mariage avec le physicien Pierre Curie en 1895."

There is more information in the French version, so is it reasonable to view the English version as “simpler”? To what extent should the omission of detail in the English version be considered a stylistic, editorial choice (given that the details are present later in the article)? What about omission of the French pronunciation in the French version? If we had an English version of the French information, would it seem odd that it doesn’t mention that Haute-Savoie is in France? (Why, in any event, is it not “...à Passy (Haute-Savoie), dans...”?) Naturally enough, perhaps, the Polish article doesn’t (initially) tell Polish readers that Warsaw is in Poland (or that Poland is in Europe).

I am assuming, though, that there would be a single Abstract Content and that the natural-language template would embody only to a limited extent the cultural assumptions that imply that certain details are extraneous (like French pronunciations of French names) even if that is not true for all users of the language. But if the templates can filter out details, the option should exist for the details to be rendered in full, or more fully. More generally, however, I would expect the Abstract Content itself to be adjustable by user preferences, so that a single Abstract Content can have many potential (simpler or expanded) variants. I imagine that many variants can be handled by a single template (per target language), possibly through sub-templates. That is, I expect a high degree of re-use of templates across many different occurrences of Abstract Content, some occurrences more detailed than others; the elective filtering (or expansion) of any particular occurrence would generally result in Abstract Content that can still be rendered by the same templates as the unfiltered version. GrounderUK (talk) 16:03, 2 September 2022 (UTC)

Abstract Content variants

I refer above to “style guidelines”, “editorial policy”, “cross-topic guidelines”, “adjustable level of content”, “adjustable level of language”, “level of detail”, “stylistic, editorial choice”, “cultural assumptions”, “user preferences”, “simpler or expanded variants”, “elective filtering (or expansion)”. As I said before, “I doubt we'll have a separate pre-defined result for every conceivable use case”. What I meant by “result”, there, is equivalent to an Abstract Content variant. In other words, we won’t have a separately defined occurrence of Abstract Content for each variation. Instead, I suggest, we shall have ways (functions in Wikifunctions, say) to transform a single Abstract Content into any one of many Abstract Content variants, which we can then proceed to render into any available natural language. By extension, if the required template is not available for a language, we might transform the variant again into a different variant for which there is a template available. More likely, the availability of templates would constrain the degree of variation, given any target language (or set of languages).
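As a hedged sketch of what such a transforming function might look like (the Variant function, the constructor name and its fields are all invented for illustration; Q7186 is Marie Curie and Q270 is Warsaw):
Person_intro(entity: Q7186, born: 1867-11-07, birthplace: Q270, pronunciation_fr: "[maʁi kyʁi]")
Variant(content, omit: [pronunciation_fr]) →
Person_intro(entity: Q7186, born: 1867-11-07, birthplace: Q270)
The same templates could render either form, with the omitted field's slot simply left unfilled in the filtered variant.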

Using such a mechanism, we can deal with many of the difficulties that might arise with our sources. Where there are multiple sources for the same information, we would have the option of seeing all or none of them in the rendered result. We would also support an editorial policy (at Wikipedia version level) preferring sources in some specified languages, or preferring to omit information for which “appropriate” sources are missing. We might also support similar user preferences. My assumption is that none of this should have much effect on the natural-language templates, although “appropriate” is likely to prove problematic.

Consider the English Wikipedia policy “If you quote a non-English reliable source (whether in the main text or in a footnote), a translation into English should accompany the quote. Translations published by reliable sources are preferred over translations by Wikipedians, but translations by Wikipedians are preferred over machine translations.” How would we construct Abstract Content that could be rendered into English that complies with this policy? We might just add an English source, but what about for minority languages where that community wishes to minimise the presence of foreign languages in their Wikipedia?

Verifiability

Let’s go back to the basic principle of Wikipedia:Verifiability (Q79951), summarized as “Readers must be able to check that any of the information within Wikipedia articles is not just made up. This means all material must be attributable to reliable, published sources. Additionally, quotations and any material challenged or likely to be challenged must be supported by inline citations.” (en:Wikipedia:Verifiability)

Adapting that for Abstract Wikipedia gives us something like: “Readers proficient in the language in which it is presented must be able to check that the information is not just made up. This means that all Abstract Content must be attributable to reliable, published sources. Additionally, quotations and any material challenged or likely to be challenged must be supported by inline citations that are reasonably intelligible to a reader of the language in which the material is presented.”

To me, this suggests that every Abstract Wikipedia source has to be treated as a foreign-language source. In general, that would mean providing a verbatim quote from the source and some kind of “translation”. In some cases, full translations into some languages will be available. In such cases, we should (have the option to) prefer the translation provided. In other cases, we may be able to render an abstract form of the quotation, but that pre-supposes that we have a representation of the quotation as Abstract Content. Presumably the contributor who creates Abstract quotations will validate their rendering into some natural language(s). It may be appropriate to capture the verbatim result of an acceptable rendering, marked as a validated translation, and then to allow the contributor to enhance the translation. The result is then available to pass through to the eventual reader (as above), according to their preferences. This is a special case of the kind of workflow envisaged in Talk:Abstract Wikipedia/Archive 2#Distillation of existing content. GrounderUK (talk) 11:43, 3 September 2022 (UTC)

Abstract Content with attribution

There is much more to be said about sources within Abstract Wikipedia, but our focus here is on how they will be rendered. Although there are many textual options, as summarized in Ariel’s initial response, I focus here on “inline citations”, which are the most common in Wikipedia. It seems to me that alternatives to this would require separate rendering templates. In any event, I assume that the form of the Abstract Content itself will be broadly equivalent to inline citations (but see WikiCite/Shared Citations, a stalled proposal). In the general case, encyclopaedic information in Abstract Wikipedia articles will consist of a series of Abstract Content constructors. Any of these may contain attribution to sources, which are themselves Abstract Content constructors (“Abstract attribution Content”). A single Abstract attribution Content itself refers to or contains Abstract Content about the source artefact (referring to a Wikidata item, for the sake of argument) and more precise data about where the supporting information can be found including (in general) a verbatim quotation. Such a quotation (static text) itself implies a further instance of a different type of Abstract Content, which may contain full translations into some languages and/or a satisfactory abstract paraphrase. This “abstract paraphrase” may itself be considered to be a separate instance of a different type of Abstract Content. It should be noted that any quotation can, of course, itself include a quotation (or more than one), and such a quotation may be in a different language (or languages…). In brief, the general form of Abstract Content is a nested composition of (instances of) different types of Abstract Content.
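A hedged sketch of that nesting (every constructor and field name here is invented purely for illustration, and Q12345 stands in for some hypothetical source item):
Statement(
  content: Person_intro(entity: Q7186, born: 1867-11-07),
  attribution: Attribution(
    source: Q12345,
    locator: "p. 37",
    quote: Quotation(
      text: "née le 7 novembre 1867 à Varsovie",
      language: fr,
      translations: [Translation(language: en, text: "born on 7 November 1867 in Warsaw")]
    )
  )
)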

Rendering citations

When a concrete version of the Abstract Content is required (in a specified language), we first transform the nested composition into an Abstract Content variant (set) that best aligns to the requirements expressed in or implied by the request. In particular, this will filter out extraneous translations from citations and determine the required form for them (at least to the extent that this affects the selection of the rendering template). Most particularly, we need to distinguish between presenting references as natural-language text and populating Wikimedia templates. The latter is the general case for inline citations. It is not clear to me how, in general, populated Wikimedia templates are emitted in the proposed architecture. Templates like Template:Citation and en:Template:Cite Q exist in multiple languages and editing contributors are expected to interact directly with the template arguments. Apart from quote=, we may expect all arguments to be sourced from Wikidata or the Abstract attribution Content. The parameter names vary by language, so substitution is required according to the version of the template to be used in the target. This could be performed before the language-specific rendering template is selected, in which case it may be appropriate to render the quote into the target language first, so that it can be treated as static text while the full content is rendered. As this approach is consistent with the case where a full translation already exists (or none is required), it is the approach that I assume we shall prefer.
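Concretely, the rendering might ultimately emit wikitext along these lines (the Q-id and quotation are hypothetical; I assume here that Cite Q accepts a quote parameter, as en:Template:Citation does):
<ref>{{Cite Q|Q12345|quote=née le 7 novembre 1867 à Varsovie}}</ref>
with quote= substituted by the corresponding parameter name in the target-language version of the template.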

So, as “luck” would have it, we may thereafter treat inline citations as static text strings that can populate a {source} slot (as envisaged by Maria), although we may want to adjust the final position of this downstream, when punctuation is finalized. This is because inline citations are typically placed after the end of the sentence, whereas the Abstract Content will be agnostic about its eventual division into sentences, not least because of the expected transformation into variants that may exclude some attributed information and (most often) the associated attribution along with it. On the other hand, a citation may need to remain attached to a particular part of the text if, for example, it supports only one of a series of values (see en:Wikipedia:Citing_sources#Text–source integrity; this is less robust than I am expecting Abstract Wikipedia’s equivalent to be, so we are losing precision through the rendering process).

Multiple use of a source

One final complication to consider is the use of the same source multiple times in a single article. In general, there would be a single full citation for a source, with other references to the same source using named references or short citations. I am not convinced that an Abstract Wikipedia article will follow this pattern precisely. Instead, each attribution within Abstract Content would simply reference the supporting Abstract attribution Content, which (ultimately) references the source, as outlined above. When the required Abstract Content variant is derived for a particular realization, it may be that not all references to the same source will be carried forward. This will not matter because the general placement of full citations will now have been determined and all other references to the same source can use the appropriate linking mechanism (as detailed in the Wikipedia guides referenced above). However, I assume that the required content is a batch of Abstract Content variants, to be rendered in sequence, and we may need to inspect the whole batch to ensure that the full citation occurs only once. Alternatively, we could retain full citations as static text (possibly just a unique reference) throughout the rendering stage and convert to named references or short citations at roughly the same time as we remove duplicated wikilinks.
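In wikitext terms, this would come down to MediaWiki's standard named-reference mechanism, along these lines (the reference name and Q-id are hypothetical):
<ref name="curie-bio">{{Cite Q|Q12345}}</ref> (first occurrence: carries the full citation)
<ref name="curie-bio" /> (every later occurrence: links back to it)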

Deferred rationalization

One reason to prefer deferred handling of duplicates is that the rendered content may be destined to be an addition to an existing page. Such an addition will not merge seamlessly into the target, particularly where it uses short citations (when the text of the reference should be separated from the text of the realized Content). Named references are less problematic, but there is a risk of duplicating existing sources or, more avoidably, duplicated reference names. In either case, however, the best result is achieved by processing the wikitext for the whole article, after the addition of the freshly rendered, formerly Abstract, content. Although that is beyond the end of the NLG pipeline, a solution that sits close to the end of the pipeline is more likely to be re-usable beyond its end than is one that sits at the start of the pipeline.

Summary

I summarize my position as follows. A single Abstract Wikipedia article will contain a sequence of Abstract Content constructors. The article will not always be rendered in its entirety; it will first be transformed into a sequence of Abstract Content variants, including Abstract attribution Content that ultimately references reliable sources. Quotations in Abstract attribution Content are static text in the original language. Translations of quotations into different languages (possibly including renderings of Abstract paraphrases) will occur before the attribution Content is rendered, so they, too, are effectively static text. In general, the existing Wikimedia citation templates are populated in the language appropriate to the target. Otherwise, a natural-language rendition is provided. When producing a full article, duplicate citations will be rationalized. For additions to existing articles, rationalization might usefully be deferred but, with limitations, it can still be performed on the additional content in isolation.

Equivalent alternate templates

Returning, finally, to the question of “equivalent alternate templates”, I’d like to think there would be no alternates without a rationale. In the case where the alternates are strictly equivalent, we have no reason to prefer one over another. It is perhaps more likely that we arrive at some particular case where more than one template can render the same Abstract Content variant with similar but not necessarily identical results (with similar resource utilization). Even then, I’m not sure whether we should prefer consistency or variety. That said, I am rather fond of the way that DeepL provides alternative translations, so perhaps we could think along those lines? GrounderUK (talk) 22:27, 4 September 2022 (UTC)

Thanks, again, for your very detailed analysis and suggestions!
You are right, of course, that different language editions of Wikipedia typically put an emphasis on certain details, or conversely omit them, depending on a variety of cultural and geographical factors. Modeling such variation is quite difficult in general, as it requires a good formalization of these factors (an almost impossible task). In practice, some aspects of it could be modeled within the templates, and some would require a dedicated planner module to run between the Constructor and the templatic renderer (not currently portrayed in the architecture proposal).
Relatively simple variation, such as omitting the country name of a locality given the rendering language, could be modeled in a single function used in a template slot. For instance, we could envisage a function named Location taking care of this. More complex "editorial decisions", such as deferring or emphasizing certain details (deemed important or unimportant in different languages), would require a planner module (e.g. a Wikifunctions function) which would transform a language-agnostic Constructor into a "language-appropriate" Constructor, which would then be rendered using templates. Given the complexity of defining such planners (or even defining the requirements for them), we haven't included them in the current proposal. Once the project matures (i.e. once we have a system that produces some content), it would be a good time to return to this question, in my opinion.
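As a sketch of the simple case (the per-language behaviour of the envisaged Location function is invented for illustration; Q90 is Paris):
... {Location(Q90)} ...
French rendering: "Paris" (the country is omitted for a French-reading audience)
English rendering: "Paris, France" (the country is added)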
Similarly for the question of sources, your discussion covers many of the desired outcomes of the project. Yet in this early stage, I think it would suffice to assume that the rendered article will simply mention the source(s), as given in the Abstract Content, either in a parenthetic expression or as a footnote (as is typical in many Wikipedia articles).
A side question you mentioned was whether the rendered article will be able to make use of wikitext templates. A priori, since the rendering of the article can already specify all necessary markup, the use of wikitext templates is not really necessary. In fact, every wikitext template could in principle be replaced by an NLG template as proposed here, as far as I see. However, if we do see that authors wish to use wikitext templates as part of the rendering output, it would be relatively simple to integrate them, as another "slot" specification. AGutman-WMF (talk) 13:43, 5 September 2022 (UTC)
@AGutman-WMF: Thanks for the reply. I quite understand your position. Although I would hesitate to label the missing architectural component “planner”, it certainly needs to appear in the architecture. Of course, in the initial implementation there may be very little for it to do, and (pragmatically, in the short term) we may tolerate allowing en:Content determination to occur in language-specific templates (just don’t quote me on that!). We must be open and honest about this, if we go down this route (but I really can’t see that there is any advantage in doing so).
To be clear, though, I fully agree that a “good formalization” is a fairly hopeless ambition, and quite unnecessary for our purposes. We don’t need to define “planners” or their requirements, we simply need to accept that optionality at the content level is inevitable, and plan to implement it in a language-neutral context (as and when the requirements emerge). There are two important consequences. First, we don’t need to worry about it here. Secondly, we can assume that we won’t ever need to support language-specific content determination (since we can implement any such requirement as language-neutral determination). It may well be that one particular filter or format is appropriate in only one language, but we do not need to know this (however interesting it might be to a linguist), or spend any time speculating on it; if someone builds it, anyone can use it.
In passing, I suggest we avoid using the term ‘“language appropriate” constructor’. Content determination will be according to community or individual preferences (and current capabilities, per language). The community may well be the users of and contributors to a particular monolingual Wikipedia edition, but we do not need to distinguish between the linguistic, cultural, habitual, aesthetic or other preferences of the community; we just need to understand their priorities (or what they have built).
As for sources, well... I think the requirements are pretty much non-negotiable, as far as WMF is concerned, but don’t take my word for it! Of course the current situation across all the WMF Projects leaves a lot to be desired, but in the short to medium term, I don’t believe there is any viable alternative to targeting the citation templates. And, as you suggest, this should also be “relatively simple”. (But we should only target them, not integrate their arguments directly into Abstract Content. And we can probably limit supported sources to those that exist on Wikidata, so most of the arguments will be sourced directly by the Cite Q template’s module, with quote being the most noteworthy exception.) I have no objection to simpler pure-text alternatives also being an available option (as well as omitting attribution), but then we are back to content determination or en:Document structuring, since this cannot credibly be considered a language-specific feature. GrounderUK (talk) 01:57, 6 September 2022 (UTC)
Thanks for the extensive comments. Regarding the “I’m not sure whether we should prefer consistency or variety” on the “equivalent alternate templates”: what we know so far from user studies on NLG is that it depends on the context whether one or the other is preferred, such as, notably, the purpose of the text generation, the intended reader, and the [more or less formal] setting of consumption of the text. For instance, for UML class diagram verbalisation for information validation, consistency is preferred, but for, say, soccer match reports for soccer team fans, variety is much preferred. I’d assume Wikipedia articles would be more on the end-user information provision level rather than validation of Wikidata content, and so that hints at possible support for equivalent alternate templates.
Then, the various responses above mix several components of the ‘alternate/alternative templates’, which I'll try to disentangle here in random order:
  1. The ‘saying the exact same thing’ really equivalent ones with the same constructor: e.g., “each dog is an animal” and “all dogs are animals” render the same constructor (e.g., isA(dog, animal)) with the exact same level of formality, but in different ways because of the different text strings and sg/pl. This would need a template chooser as an additional step in the realiser pipeline, and perhaps also some tag/attribute on the template to indicate equivalence (see the sketch after this list).
  2. The ‘saying the exact same thing’ wrt the same constructor, but not fully equivalent: for instance, there may be different levels of formality in the sentence to be generated. This may need a marker on the template and on the lexeme, since it may affect either the template specification or the lexeme selection (a more or less formal word, like Sie vs du in German, etc.), or both. We haven’t figured that out yet, and some examples may help to determine those requirements better; otherwise put it on both, just in case.
  3. The ‘using the same constructor fully’ but one is in the simpleX and the other in the full X language. This can be indicated with the language tag of the template, since they may be deemed, or are de facto or even de jure, different languages.
  4. The ‘using the same constructor but not all of it’: skipping some of the content of the constructor. That’s also simplified, but not in the same way as simpleX in case 3 above (that still states all information). There’s no way to indicate that now in the template language, and the proposal didn’t intend to cover that at this stage, as Ariel mentioned in his comment above. There’s the design principle of graceful degradation for when the language to render doesn’t have all that’s needed to verbalise it, but that’s different from skipping parts of a constructor for other information hiding/editorial reasons (like ‘uninteresting’ for the assumed target readers, ‘entry blurb’ at top of the page, or whatever). Perhaps with an annotation field on the template.
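A minimal sketch of case 1 (the template names, the Class function, the plural annotation and the equivalence tag are all invented for illustration; the a/an choice would fall to the phonotactics step):
isA_renderer_en_a(Sub, Super): "each {Class(Sub)} is a {Class(Super)}." → "each dog is an animal."
isA_renderer_en_b(Sub, Super): "all {plural:Class(Sub)} are {plural:Class(Super)}." → "all dogs are animals."
Both would carry the same equivalence tag, and the template chooser would pick one of them.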
Then one could push it further and mix either of them, like cases 2+4 or 1+4 or 1+3. We’ll update the proposal draft in some way to be more specific on this. Keet10 (talk) 15:37, 6 September 2022 (UTC)
Thank you, @Keet10:, that sounds like the basis of a worthwhile enhancement. I had a few too many thoughts to write down but basically, yes.
1. Actually, I don’t see the examples as equivalent. But the distinction or equivalence can be made explicit by grouping near-equivalents together and surfacing contrastive features. In a context where the contrastive features do not apply, the forms are equivalent. The general rule is that there are no synonyms, but, yeah, “on a scale of synonymy”.
2. Near synonyms with contrastive feature “formality” (on a scale of familiarity). I guess we need to distinguish between slots and (lexeme) text. Use of featured lexemes in text propagates their types into the template (in some fashion) and I guess that the resultant implicit type restricts the valid population of the slots. Conversely, the typing of the slot arguments may demand appropriate slots and, by implication, (potentially) different templates.
3. I think this is just a special case of 2 but it’s useful to consider it separately. On a scale of simplicity.
4. We could, but I don’t think we should do so here. That is, it is prior content determination operating over Abstract Content. There may be some feedback in the graceful degradation scenario. That is, if the degradation can be characterized it might behave like content determination and drive the template selection. But I am happy to assume not.
I look forward to seeing the updates but feel free to raise any further points in the meantime. GrounderUK (talk) 23:32, 6 September 2022 (UTC)

Example for Breton

Hi @AGutman-WMF:,

I didn't have time to read everything thoroughly, but this seems to be a good document.

Here is a remark for the example of Breton for the age of someone. For your example, it would be "25 bloaz eo Malala Yousafzai".

But there are some tricks: first, in Breton, words stay singular after a number (so it's "bloaz" and not "bloazioù"). Then, the word "bloaz" (year, d:L:L45068) has a mutation pattern specific to it (only for it): it is softened everywhere except after the numbers 1, 3, 4, 5, 9 and 1000. I'm guessing we need an Age_renderer_br() function but also a Year_br (a bit like in Hebrew?). VIGNERON * discut. 10:34, 13 August 2022 (UTC)

Thanks!
If I understand correctly, the alternation bloaz / vloaz is purely phonological: you get the soft version following certain words (or digits/numbers, in this case). This kind of alternation doesn't actually need any special annotation in the template language itself, since it should be taken care of by the phonotactics module of the NLG architecture. It does require, however, that Breton have an implementation of the Cardinal function which correctly annotates the numbers as potentially triggering softening (or not) on the token following them. The fact that the word for "year" is moreover always singular in this context means that we don't need to link it with any dependency relation. Assuming that singular forms would anyhow be chosen by default, the template for Breton would look like this, as far as I understand, without a need for a "Year" subtemplate:
Age_renderer_br(Entity, Age_in_years): "{Cardinal(Age_in_years)} {Lexeme(L45068)} eo {Person(Entity)}."
Let me know if this looks plausible, and I'll happily add the example with the explanation to the main document. AGutman-WMF (talk) 16:20, 19 August 2022 (UTC)
@AGutman-WMF:,
Yes, here the mutation is purely phonological (if I understood you right).
And "year" has a specific mutation pattern but a word being in the singular following a number is always true.
Overall, it does look good (again, if I understood you right).
Cheers, VIGNERON * discut. 08:45, 21 August 2022 (UTC)
Thanks! I've added the example to the main document. AGutman-WMF (talk) 13:05, 29 August 2022 (UTC)

Errors with TemplateText function

Hello @AGutman-WMF:. Overall I am very impressed by this specification; evidently it has been given a lot of thought and I think it is going in the right direction.

But there are some mistakes in the examples related to the TemplateText function, which I think can only be used for a string which is totally invariable. As I understand it, the template function Constructor_xx creates the function composition, which will then be executed to generate the final lemma list/tree, which can be transformed into the final rendered string by a function like Render. TemplateText will not be called until the function composition is executed. Its task is to generate a lexeme object which in general includes the information mentioned in the section "Output type of the evaluation", especially part of speech, grammatical features, etc. as well as orthography. It cannot do this for a lexeme which may need grammatical dependency enforcement.
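Schematically, my reading of the evaluation order is as follows (the arrow layout is only illustration; the stage and function names are those used in the proposal):
Constructor_fr(template text) → function composition (TemplateText, Lexeme, Person, ... calls, wrapped by det, amod, ...)
execute the composition → lemma list/tree of lexeme objects (part of speech, grammatical features, orthography, ...)
Render(lemma tree) → final rendered string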

Please look at your example below

Bonjour {det:"le"} {amod:"petit"} {root:Person(entity}! → 
amod(3, 4, det(2, 4, Template(4, [TemplateText("Bonjour")  TemplateText("le")  TemplateText("petit")  Person(entity)  TemplateText("!") ]))) 

When 'TemplateText("petit")' is executed, the grammatical information cannot be included in the lexeme. But when the 'amod(3, 4, ...)' call is executed the grammatical information will be needed. I suppose that 'TemplateText("petit")' should be replaced by 'Lexeme_fr(L10098)'. Also I think the lexeme "L10098" should be given instead of the string "petit" in the template itself. Similarly 'TemplateText("le")' is not good for the article here, perhaps this should be replaced by the lexeme call instead.

Also I think in the Zulu example, 'TemplateText("na")' needs to be replaced by a lexeme call.

Best Wishes, Strobilomyces (talk) 17:23, 21 August 2022 (UTC)

Thank you for the careful reading! Regarding the TemplateText function, that indeed just passes on the string unmodified. This, however, does not mean that nothing can ever happen with it, or that it must appear as such in the final text. The template language resides within the context of the proposed NLG architecture for AW: after the "templatic renderer" box, there are still (optional) steps of morphology and phonology. This is also why the "na" is there in the isiZulu template rather than another function; it becomes 'ne' due to the phonological conditioning (-a + i- => -e-), which is handled by the "phonotactics" step in that architecture.
Kind regards, Keet10 (talk) 13:36, 22 August 2022 (UTC)
Thank you for answering. I don't know about the isiZulu case, and I should not have mentioned it as I do not understand it properly.
In the example above the adjective "petit" may need to be replaced by "petite" if the person in question has feminine gender, which is done by the function "amod". amod will enforce the agreement in gender and number between the adjective and the person (I think this is referred to as "morphology"). For this it needs the lexeme information in the argument, as defined in the section "Output type of the evaluation" which specifies the fields; here it would need to find the feminine singular form "petite" of the lexeme L10098. It will need to find the "unifiable grammatical feature" on the form "petite" with "Grammatical category" = gender/"Grammatical feature" = feminine and also "Grammatical category" = number/"Grammatical feature" = singular. This is very important and very much at the heart of the proposal. Remember that function "amod" has to work for all French adjectives and their antecedents. It could not work just based on the string and if it did, that would make a mockery of all the work which is being done on lexemes and forms. The function TemplateText cannot add the lexeme information with only a string as argument and the amod function cannot work without the lexeme information.
I know that some text modifications may be proposed just based on strings for phonological or some other reasons, but for grammatical agreement in languages like French or English, that does not work. The example at present is wrong and seriously misleading.
Best regards, Strobilomyces (talk) 20:48, 22 August 2022 (UTC)
I think you're correct and that it should be Lexeme(L10098), so that it can fetch the lexeme's forms recorded in Wikidata, and then with amod come to the agreement with the gender of the person that's fetched (and then select 'petite' if the recorded gender of the person is female). Keet10 (talk) 12:13, 28 August 2022 (UTC)
Thanks for your comments!
The question of whether the French example provided above is correct depends on the (language-specific) implementation of TemplateText. There are several possibilities for how it may be implemented:
  1. Do nothing (just pass the string as static text).
  2. Pass on the string as static text but enrich it with necessary phonological features, in order to ensure correct phonological logic for neighboring words. This could for instance take care of the phonotactics of na in Zulu.
  3. For a select list of words, fetch a lexeme, together with its various inflected forms, to be processed by the rest of the pipeline. This is mostly intended for specific function words, such as the French article le (which would be expanded to all its possible forms depending on the grammar), but it could also possibly include a select list of frequent nouns/verbs/adjectives such as petit (see the sketch after this list).
  4. For any string, look up Wikidata for a corresponding lexeme (in the given language) and fetch it (and if not found, treat it as static text). This for instance would allow replacing an explicit Lexeme(L10098) call by just giving the string petit.
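A minimal sketch of option #3 (the built-in lookup list is an assumption, and L2770 is assumed here to be the lexeme of the article le):
TemplateText_fr("le") → Lexeme(L2770) (found in the built-in list of function words)
TemplateText_fr("Bonjour") → static text, possibly with phonological annotations as in #2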
So for the example above to work, one has to assume that TemplateText is implemented either as #3 or #4. Personally I would opt for #3, which allows using TemplateText with the function words as well as possibly some very frequent content words. I'll add some clarification of this in the text. AGutman-WMF (talk) 09:20, 29 August 2022 (UTC)
@AGutman-WMF:. Thanks for your reply. I haven't answered immediately due to being away - now I have more time.
But possibilities #3 and #4 do not work (in a general case). You cannot reliably determine a lexeme from a string. To get the lexeme information, TemplateText would need more information than just the string, and it would be a complicated and error-prone process. But it is completely unnecessary and goes against the logic of the template text proposal. The TemplateText function should only work according to #1 and #2 (#1 is a special case of #2). Adding lexeme information is the job of the Lexeme function and that is what is needed here. Lexemes always need to be selected (directly or indirectly) by humans; they cannot be selected by functions based on strings.
There are three ways in which lexeme fields can enter the generated text.
  1. The field can be added as part of the template. I think that that is the case in the French example "Bonjour la petite female person" etc. That is easy, but the writer of the template must provide the lexeme code such as 'Lexeme_fr(L10098)' for 'petite' here. The template author absolutely has to understand how to do that.
  2. The field may come through from the abstract text. Then the field will have been defined by its item number in Wikidata; then the lexeme will be found using the P5137 "item for this sense" link or similar (maybe done by function Name). The human who composes the abstract text has to select the Wikidata item and some Wikidata editor had to create the link from the lexeme (in the given language) to the item.
  3. The wikifunctions are allowed to "know" some lexemes internally, but then they must be constant lexemes, not words selected from outside like "petit", which might have been any adjective.
Please note that lexemes (or the corresponding Wikidata items) have to be selected by people, not automatically. Going from lexeme codes or item Q-numbers to strings is a standard "easy" computing problem. Going from strings to lexemes etc. is an AI problem and if you go down that road you will fail. This is really basic to the whole project and in fact the whole virtue of Denny's Abstract Wikipedia idea is that we only go from structured information to text, and we don't try to go from text to structured information, which would be language understanding, and which would be a "hard" AI problem. Well, I think Denny suggested at one point that some language understanding could be included as an optional add-on feature, but in my view that is a mistake as it is too complicated and it will fail. The original "Template Language for Wikifunctions" main text is completely consistent with this principle (and all I am saying here), so that is very good. But your possibilities #3 and #4 violate the principle, will not work, and they go against the basic idea of the template language proposal.
Remember that if the adjective "petit" could be included in a template, so could any other adjective. In fact "petit" corresponds to two different lexemes (one a noun); doubtless other strings often correspond to many lexemes. It makes no sense for the user to specify this word with a string; the template writer simply has to specify 'Lexeme_fr(L10098)' and all the problems are solved. Neither #3 nor #4 can work here in the general case (where any adjective is possible). And you should be giving an example which is as general as possible, and which shows the main process. A human must select the lexeme and the TemplateText function (with only a string argument) cannot sensibly do this job.
It is not clear to me just how the definite article "le" should be handled (I suppose that function det already "knows" that it is the definite article anyway). Sending the string "le" here might actually work computationally, but I think 'Lexeme_fr(L2770)' would be better as it would be more consistent with the general methodology.
Your new sentence under the definition of TemplateText(string) is a terrible idea. By contrast, the definition higher up under "Construction of the composed function" is fine. Your new sentence says "Depending on the exact implementation, this may just add some annotation on top of the given text, or alternatively it may fetch from Wikidata (or from a given list of lexemes built in to the function implementation) a lexeme with the given lemma (making it effectively a more readable Lexeme function)." But it cannot and should not be fetching lexemes based on a string. This cannot work in the example of "petit" because if the template writer can select the string "petit", there are thousands of other adjectives which could have been selected instead. It makes no sense to do this AI-type work of translating backwards from the string to the lexeme, and in general it is not possible. Only humans can choose lexemes! And they have to do that by specifying the L codes in templates (or by specifying the Q codes in the abstract text, from which L codes can be derived programmatically).
As I say, the main original text of "Template Language for Wikifunctions" is consistent and good. Part of the methodology has to be that any lexeme word defined in the template or the abstract text must be defined as a lexeme code (in the template) or an item code (in the abstract text) by a person. Please change the French example and the definition of TemplateText so that TemplateText is not used to select a lexeme based on a string. Best wishes, Strobilomyces (talk) 17:45, 3 September 2022 (UTC)
Thanks again for your keen discussion here and below!
I agree with you that in general lexemes need to be fixed within the template (insofar as they are not arguments coming from the constructor), yet I slightly disagree with you about the feasibility or necessity of specifying such lexemes using strings, through some heuristics (not necessarily complex AI). The main motivation of using strings instead of L-ids is to maximize readability of the templates, so that the reader can get a glimpse of the rendered content without needing to look up the L-ids. Ideally, this could be handled through a good UI (showing L-ids as their lemma, or replacing typed words with L-ids), but assuming for the time being that this is not available, I see usability in allowing at least a fixed set of lexemes to be inferred from certain pre-defined strings (my proposal #3). This wouldn't require any AI, as the list will be completely deterministic and known to the template authors through documentation. Moreover, some heuristics can allow the inflection of these lexemes according to neighboring words, removing the need to write them in slots at all (e.g. we could have a template such as "le {Entity(item)}" in which le, even though appearing as static text, inflects according to the following slot). I consider this most useful for frequent function words. The exact list of such strings (or its complete absence) would be determined by the volunteers developing the needed functions for each language.
My proposal #4 tried to take this idea even further (again, in the interest of readability). I agree with you that it is not always possible to determine unequivocally a lexeme from a string. Yet, if we look up the string petit in Wikidata for French, we get only two exact matches: a noun and an adjective. These can be further disambiguated by the usage of the dependency label amod, which requires an adjective. That said, I agree with you that this may introduce more problems than it solves, so it is probably not worth the extra complications (especially if we have a good UI which can make the authoring experience much more fluid).
In line with the above considerations, I will change the example so that petit is referred to using its L-id, while le is still present textually, and I will add some clarification.
Thinking about this led me to the idea that an invocation such as {Lexeme(L10098)} can be unequivocally contracted to be just {L10098} allowing the parser to implicitly invoke the Lexeme function. What do you think about this? AGutman-WMF (talk) 14:22, 5 September 2022 (UTC)
@AGutman-WMF:. Again thanks for your quick reply. I am glad that you only disagree with me slightly. The main reason that I proposed comments in the following section was precisely so that people could read the template without having to look up the L-ids. Inside templates I think that only one language is relevant. I agree with you that a front end could help the template authors and users, but it is very important first to define a consistent working system based on the template text, which will be consulted in case of problems.
For #3 you refer to "a fixed set of lexemes to be inferred from certain pre-defined strings", and if it just means a very short list of special lexemes like the definite article, I can understand that. But "petit" ("little") isn't admissible for one of those lexeme cases, is it? If you do it with "petit", you need to do it for all the thousands of other French adjectives (and similarly other parts of speech). The code of TemplateText_fr will be a duplicate of much of the French lexeme content of Wikidata, which makes no sense. We have a standard method of handling this (always from lexeme to rendered string, not backwards) and that is what we need to concentrate on.
If you must offer proposal #3, please don't use TemplateText_xx, but use a function with another name, such as SpecialLexeme_xx with string argument. That would already be a significant improvement - at least someone calling TemplateText_xx would know what they were getting. By the way they could still be getting some phonological postprocessing, etc., but not lexeme-style processing. The string given to SpecialLexeme_xx would be like multiple-choice selection.
I don't think I understand your comments about "some heuristics can allow the inflection of these lexemes according to neighboring words, omitting the need of writing them in slots at all". You could certainly have a template of the form "Le(noun phrase)", but what you have written is not that, it is two lemmas. What function will expand "le" in "le {Entity(item)}"? I suppose it has to be done by Constructor_fr. Functions like TemplateText and Lexeme cannot refer to neighbouring lemmas, can they? All they have is simple strings, they have no way of getting to the lemma tree. Please could you give a fuller example of how that could work? Have you really thought out what is happening here?
In your paragraph on proposal #4 you say "These can be further disambiguated by the usage of the dependency label amod", but TemplateText_fr has no access to the dependency label; that could only be done by Constructor_fr. I think you have not worked out logically what is happening.
Anyway you sum it up well when you say "this may introduce more problems than it solves, so it is probably not worth the extra complications".
I don't mind about replacing {Lexeme(L10098)} by {L10098}; that could work. Actually I am not enthusiastic about that suggestion because {Lexeme(L10098)} or {Lexeme_fr(L10098)} makes evident the corresponding function call and makes it clearer what is going on - and it is easy to lose focus on exactly what is happening. I think with this template proposal you have a good rigorous methodology which can actually work. But it is already very complicated and difficult to understand. I think at this stage you should be promoting and illustrating the basic normal methodology, and avoid anything which complicates the specification. Exception: there should be some way of adding a comment string whenever an L-id appears.
Thank you for making changes to the proposal text to make it more consistent. But now, according to the definition, Lexeme only accepts a Q-id, not an L-id, doesn't it? For lexemes introduced in a template it needs to accept an L-id. Strobilomyces (talk) 17:49, 5 September 2022 (UTC)
Hi again. As said, I agree with you that the petit example may not have been the most fortunate, but in principle, it would be up to the language-community to decide what words should act as "magic words" which summon up lexemes. Indeed, function words are the natural candidates, but if some adjectives, nouns or verbs are very frequently used, they may also be integrated in this way (verbs which come to mind are auxiliary verbs, such as is or have).
"Magic words" mixed in with text are a bad idea which is likely to lead to ambiguity problems, for instance "lé", "la" and "là" are all French words with meaninngs different from the articles "le" and "la". It is introducing a problem where none is necessary; the template author should specify what things are lexical strings and what need other handling, then all is clear and simple. Strobilomyces (talk) 11:23, 8 November 2022 (UTC)[reply]
Again, as mentioned above, it will be up to the contributors of the renderers of a specific language to decide whether and how to use this functionality. The system provides the possibility, but it is up to the contributors to decide if they wish to use it or not. AGutman-WMF (talk) 10:08, 9 November 2022 (UTC)
Regarding your other points, it seems you overlooked a crucial part in the design of the system. All pieces of text which appear outside slots are fed behind the scenes to the TemplateText function. So a template of the form le {Entity(item)} is equivalent to {TemplateText("le")} {Entity(item)} (as well as to {"le"} {Entity(item)}). For this reason it is not possible to use a specialized function called SpecialLexeme.
The "Bonjour" example shows that the template writer determines the Core functions (such as "Lexeme", "Person", etc.), which are defined in a section just below. I am proposing that SpecialLexeme would be another of those core functions. It is perfectly possible. Your idea of putting everything through TemplateText is awful, and just makes difficulties. And I think the actual document is in line with my view of how it should work. It is the template writer who should specify whether something is a literal string or wants some particular form of processing.Strobilomyces (talk) 11:23, 8 November 2022 (UTC)[reply]
The use of TemplateText is a technical necessity of the system, since the subsequent stages of the pipeline operate on internally-represented lexemes, not on strings of text. As such, everything must be transformed into lexemes. By default, the TemplateText function does exactly that, but it also provides a hatch where you could add extra functionality, if desired. AGutman-WMF (talk) 10:13, 9 November 2022 (UTC)
Now, the king's way to share features between lexemes is through dependency relations. However, the phonotactics module will allow sharing features between neighboring lexemes (in the linear ordering of the text). While this is mostly intended to resolve phonotactics, one could use a similar mechanism to allow, for instance, the inflection of a determiner according to the neighboring noun, if there is no dependency relation annotation on it. Whether such a mechanism will be implemented is still open to discussion.
As for the point about using the label amod for disambiguation - a key component of the system is that the exact lexeme form to be rendered is only chosen in the pipeline once the entire dependency tree has been constructed. Thus, in principle, it should be possible in the system to populate a lexeme with "inflected" forms of different parts-of-speech (e.g. adjectival and nominal forms of petit). The actual form to use is selected by the morpho-syntax component which has access to the dependency label.
The phonotactics module is for completely different types of linguistic changes than lexemes. I totally disagree that you can sensibly use the phonotactic mechanism to make lexeme-based changes, and if there were another similar mechanism, that would need to be specified. The specification is logical: it states that first the lexeme-related processing will be done to produce a relevant lower-level structure, then phonotactic processing (dependent only on the generated text/sounds) will take place. You are proposing going backwards from less structured generated data to find the lexemes, which are more structured. That is really terrible, perverse design. The only sensible solution is what you call the "king's way", and that is what you should be promoting. My comment revealed a bad logical error in your example. Strobilomyces (talk) 11:23, 8 November 2022 (UTC)
The phonotactics module does operate on lexemes (see the example of the English module in the Scribunto prototype). You're right that conceptually it should be reserved for phonological changes only. However, a similar mechanism (of relying on neighboring words) can be used for morphological inflection, as a heuristic. As I wrote, whether it is a good idea or not is an open question, so I leave it for the contributors of the renderers of each language to decide. AGutman-WMF (talk) 10:22, 9 November 2022 (UTC)
Finally, I'm not sure why you write that Lexeme can now only take Q-ids. The Lexeme function, by its very definition, only operates on L-ids. There would be another function (e.g. Entity or Person) which would take Q-ids. Please note that the language of realization (e.g. French or Zulu) would be accessible as a global state argument to all the functions, including the Lexeme function. AGutman-WMF (talk) 13:00, 7 September 2022 (UTC)[reply]
I think the Lexeme function should absolutely accept an L-id, not a Q-id. But the definition of Lexeme(entity) under "Core functions" states that it "Fetches a Wikidata lexeme associated with a Q-id (in a given language), transforming it to the singleton lexeme", so I thought that its input was a Q-id. The system will need to handle a lexeme defined by a Q-id (coming from a slot in the abstract text) as well as by an L-id (if coming from the template). In my opinion this should be made clearer in the document. There should be two versions of the function. Strobilomyces (talk) 11:23, 8 November 2022 (UTC)[reply]
You're right. There was an error in the document. Now corrected. AGutman-WMF (talk) 10:26, 9 November 2022 (UTC)[reply]

@AGutman-WMF: Hello. Sorry, I have been away for a while. I wish you all the best with your future projects.

I am very disappointed with your answers. The document itself implies a coherent system, but you are not conforming to it and you are making it much worse. I hope it is OK to respond in detail to your points in your text above. Best wishes for the future again, Strobilomyces (talk) 11:23, 8 November 2022 (UTC)[reply]

Thanks again for your comments, though I would have appreciated it if you had used milder language in some of them. In general, it is important to keep in mind that the system is designed as a quite flexible framework which allows for somewhat different implementations according to the wishes of the contributors for each language. I should also mention that @Keet10 and I are working on a revision of the document which should clarify and correct some aspects of the proposal. In the meantime, I suggest you have a look at my Scribunto-based prototype implementation of the specification, which can give a more concrete idea of how all the pieces work together. AGutman-WMF (talk) 10:35, 9 November 2022 (UTC)[reply]
Thank you for your quick answers. OK, perhaps I was too strident in some of my opinions, and if so, I apologize. I will try to look at the Scribunto-based prototype. Strobilomyces (talk) 20:10, 9 November 2022 (UTC)[reply]
Apology accepted. Please note that in the meantime we have created a revision of the document, in which we incorporated some of your feedback (for instance, your critique of the example I used was well taken, and I have changed the example accordingly in the appropriate section, which now lives in a separate sub-page). The Scribunto prototype has also been given its own documentation sub-page, which may be of interest. AGutman-WMF (talk) 15:46, 22 November 2022 (UTC)[reply]

Comments[edit]

Hello. I think that perhaps the template language should support comments.

In my understanding, the templates will be items in Wikifunctions. I am not too sure exactly how that will work, but in any case there should be a place for an explanation of the whole template, a mini user manual, etc., so that is good. But I think it would also be useful to have template-author comments within the template code itself.

One simple kluge might be to add a function Comment with two string arguments which just returns its first argument, so that the second argument could be used as a comment wherever a literal string might appear (this kluge is illustrated after the examples below). But that is not very aesthetic. I propose instead that each slot or function-invocation argument should be permitted a comment at the end, just before the closing brace, parenthesis or comma. The comment could be delimited by /* ... */. So in the template text, instead of

... {amod:Lexeme_fr(L10098)} ...

the template author could write

... {amod:Lexeme_fr(L10098) /* L10098 = petit */} ...

or

... {amod:Lexeme_fr(L10098 /* petit */)} ...
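For comparison, the Comment-function kluge mentioned above might look like this (Comment being a hypothetical function, not part of the proposal):

... {Comment("le", "definite article")} ...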

What do you think of this suggestion? Thank you for your attention. Strobilomyces (talk) 13:37, 4 September 2022 (UTC)[reply]

@AGutman-WMF: P.S. I am making some corrections to my examples. I have seen that the argument of Lexeme should not have quotes around it, so I will change that in my text. Also, the function Lexeme is language-independent and its argument should be a Q-id, coming from the abstract text. If the lexeme is introduced within the template with an L-id argument, the function name should have the language code as a suffix. If the argument is a Q-id there should be no suffix, and if the argument is an L-id there should be a suffix. That is right in the French example of the template-language definition, but it is wrong in the Hebrew, Zulu and Breton examples. For instance, "Lexeme(L45068)" in Breton should be replaced by "Lexeme_br(L45068)". Strobilomyces (talk) 11:45, 5 September 2022 (UTC)[reply]
The templates should be stored in Wikifunctions as functions, and have some dedicated UI for them. As with any other function, they would have some documentation attached to them. As for commenting the internals of a template, while I don't object to it (and we could certainly allow the parser to support it), it would go against the idea that template authoring (like the authoring of any other function in Wikifunctions) is multilingual, i.e. the names of the invocations and relations can change according to the UI language chosen by the user. I would prefer to have some UI hatch which would allow specifying multilingual comments on specific parts of the templates. Specifically for L-ids, I would hope that the UI would give a visual hint of the lemma behind the id. Anyhow, we can see how that develops and add this functionality as needed.
Regarding your second comment, my apologies for not being consistent on this point. The basic idea behind not mentioning the language code is that there should be, behind the scenes, a dynamic dispatch to a language-specific implementation if needed (for the Lexeme function it remains to be seen whether this is needed, since its logic should be generalizable across languages). I will fix the inconsistency. AGutman-WMF (talk) 14:31, 5 September 2022 (UTC)[reply]
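A rough Lua sketch of such behind-the-scenes dispatch (purely illustrative; the lookup scheme and implementation names are assumptions):

local function genericLexeme(id) return "generic lookup of " .. id end
local function frenchLexeme(id) return "French-specific lookup of " .. id end

-- Resolve a core-function name to its most specific implementation:
-- try the language-suffixed variant first, then fall back to the
-- generic one.
local implementations = {
  Lexeme    = genericLexeme,
  Lexeme_fr = frenchLexeme,
}

local function dispatch(name, languageCode)
  return implementations[name .. "_" .. languageCode]
      or implementations[name]
end

-- dispatch("Lexeme", "fr") --> frenchLexeme
-- dispatch("Lexeme", "br") --> genericLexeme (no Breton override)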
But surely template authoring is for one specific language, so these functions do not need to be multilingual? In any case, the string which defines a lexeme should be in the target language. I really think this comment feature would be useful. I see that in one example under "Construction of the composed function" you included a link to the lexeme in question, which is good, but I presume that cannot be used here instead of an explanation. Strobilomyces (talk) 18:06, 5 September 2022 (UTC)[reply]
Maybe they do not necessarily “need” to be multilingual, but they easily can be, and are therefore permitted to be: the language of the specification of the template may or may not be the same language that the sentence will be generated in, as is the case in the examples in the document, where all of them use English terms and function names (but, e.g., “Year_zu()” could also be called “Unyaka_zu()”).
Comment/annotation features may be useful, though not so much to accommodate the petit example, since an interface would ideally take care of displaying the lexeme (e.g., on hovering over it), so that one does not have to put up with the opaque identifiers all the time. Perhaps for commenting on a design decision of the template, or something else? Keet10 (talk) 16:04, 6 September 2022 (UTC)[reply]

Details of CFG formalization[edit]

Just six points:

  • I suggest we use “non-terminal symbol” and “terminal symbol” throughout.
  • Do we want to say that lexeme can or cannot be a Wikidata reference (or, more generally, a unique identifier functioning as a lemma)? Can it refer directly into the Lexeme namespace (as an Lnnn-style reference), or only via a Qnnn-style reference? What about Pnnn-style references?
  • I’m not sure “shorthand notation” is quite correct. I think a symbol simply “denotes”. What lexeme formally denotes is not yet clear to me, I’m afraid.
  • Do we have inconsistent definitions of string in the third and the last row? We shall need escaped equivalents of excluded codepoints.
  • “If the template corresponds to a constructor, these are names of the the fields of that constructor” [interpolation] should be “…these are the names of the fields…” (although I’m not sure we would say that constructors have “fields”; I think they are actually named arguments in the form of key–value pairs, except that the keys appear as labels, so “…these are the (labelized) keys for the arguments to that constructor”?)
  • When does a template not correspond to a constructor?

GrounderUK (talk) 12:43, 7 September 2022 (UTC)[reply]

The idea with the rule Text → lexeme | punctuation | string is the following: Text can in fact be any string (any combination of non-space characters, with the notable exception of { and }). However, some of these strings may receive special treatment and are thus called out here. One such group is punctuation marks, and the other is a set of "magic words" which represent a full inflection table (a lexeme). That set is at the discretion of the developers of the renderers of each language. For instance, in English, such "magic words" may include a or is. Specifically, these are not L-ids. I agree this is somewhat confusing, and I will review it with @Keet10.
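For concreteness, such a "magic word" might expand to a small inflection table along these lines (a Lua sketch in the spirit of the Scribunto prototype; the field names are assumptions):

-- The English magic word "is" standing in for a fuller inflection
-- table of the copula, from which the pipeline later picks the
-- agreeing form.
local is = {
  lemma = "be",
  forms = {
    { representation = "is",  features = { "singular", "present" } },
    { representation = "are", features = { "plural", "present" } },
  },
}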
string is indeed inconsistently defined. In the first instance it excludes spaces and { }, and in the second it doesn't (since it is marked by quotation marks). We will revise this.
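One possible way to make the two rows consistent (a sketch only, not the revised wording) would be:

bare-string → one or more codepoints, excluding spaces, "{", "}" and '"'
quoted-string → '"' ( codepoint except '"' or "\" | "\" codepoint )* '"'
string → bare-string | quoted-string

This would also provide the escaped equivalents of the excluded codepoints that GrounderUK asked about.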
The term "fields" is just another term, as far as I know, to elements in a structured record/map/dictionary/key-value pairs (pick your preferred terminology). "Field names" are indeed the "keys" of key-value pairs.
For your final point, while every constructor should have corresponding templates (at least one per language), there could also be templates which are not tied to any specific constructor but are rather used as sub-templates in other templates. Think of these templates as similar to "library functions" in a standard programming language. AGutman-WMF (talk) 13:23, 7 September 2022 (UTC)[reply]
Thanks, Ariel.
  • Maybe we could just use “lemma” instead of “lexeme”? That is, the symbol is the lemma and it denotes the (“inflection table” containing the forms of the internally defined) lexeme? Anyway, it is quite clear to me now, thank you.
  • I’m sure you’re right about the use of the word “field” in the world at large. Personal preferences aside, I would incline to the existing project terminology (or what we expect it to be).
  • I think of sub-templates as constructors too. Do you mean to restrict your comment to the interface with Content constructors? I was misled by “corresponding”, I’m afraid. Hmmm, yes, I see what you mean…
  • You left the s on “terminals”, by the way.
GrounderUK (talk) 00:29, 8 September 2022 (UTC)[reply]

Phonology of Breton[edit]

@VIGNERON Could you please provide more information about the phonological alternations of Breton, and how they are modeled currently in Wikidata lexemes? AGutman-WMF (talk) 07:35, 16 September 2022 (UTC)[reply]

@AGutman-WMF: for the general context, if you don't know it, you could start with en:Breton mutations (and en:Consonant mutation; these articles are not great, but at least they give an idea).
For the modeling on Wikidata, I chose to enter each mutated form as a form, recording the mutation effect as a grammatical feature. You could take a look at d:L:L2784 for a simple example.
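Following that description, a rough sketch (a Lua-style illustration, not the actual Wikidata data model) of such a lexeme, using "tad" ("father") as an illustrative example; see d:L:L2784 for a real one:

-- Each mutated form is stored as an ordinary form, with the
-- mutation effect recorded as a grammatical feature.
local tad = {
  lemma = "tad",
  forms = {
    { representation = "tad", features = { "unmutated" } },
    { representation = "dad", features = { "soft-mutation" } },
    { representation = "zad", features = { "spirant-mutation" } },
  },
}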
Cheers, VIGNERON * discut. 07:46, 16 September 2022 (UTC)[reply]