Talk:Abstract Wikipedia/Architecture

Renderers

What follows was transcluded into Talk:Abstract_Wikipedia until it was archived and subst:ed.

"Solution 4 has the disadvantage that many functions can be shared between the different languages, and by moving the Renderers and functions to the local Wikipedias we forfeit that possibility. Also, by relegating the Renderers to the local Wikipedias, we miss out on the potential that an independent catalog of functions could achieve."

It seems that we are not neutral on this question! It is not obvious that we cannot have the functions in Wikilambda, the data in Wikidata and the implementation in the local Wikipedias. I don't propose that we should, but would such an architecture be "relegating" Renderers? Or is it not a form of Solution 4 at all?

I think we need a broader view of the architecture and how it supports the "Content journey". In this context, Content means the Wikipedia (or other WikiCommunity) articles and their components. Consumption of Content is primarily by people reading articles in a language of their choice, in accordance with the project's first Primary Goal. I take this to be the end of the "Content journey". I tend to presume that it is also the start of the journey: Editors enter text into articles in a language of their choice. In practice, this means that they edit articles in the Wikipedia for their chosen language. This seems to be the intent of the project's second Primary Goal but it is not clear how the architecture supports this.

"We think it is advantageous for communication and community building to introduce a new project, Wikilambda, for a new form of knowledge assets, functions, which include Renderers. This would speak for Solution 2 and 3."

Clearly, Renderers are functions and, as such, they should reside with other global functions in what we are calling Wikilambda. However, Renderers are not pure functions; they are functions plus "knowledge". Since some of that knowledge resides within editors of the natural language Wikipedias, whose primary focus may well be the creation and improvement of content in their chosen language, I am inclined to conclude that natural language knowledge should be acquired for the Wikilambda project from the natural language Wikipedias and their editors' contributions. As with encyclopedic content, there may well be a journey into Wikidata, with the result that the Renderers are technically fully located within Wikilambda and Wikidata (which is not quite Solution 2).

"Because of these reasons, we favor Solution 2 and assume it for the rest of the proposal. If we switch to another, the project plan can be easily accommodated (besides for Solution 4, which would need quite some rewriting)."

I'd like to understand what re-writing Solution 4 would demand. I take for granted that foundational rendering functions are developed within Wikilambda and are aligned to content in Wikidata, but is there some technical constraint that inhibits a community-specific fork of their natural language renderer that uses some of the community's locally developed functionality? --GrounderUK (talk) 18:31, 6 July 2020 (UTC)

Hmm. Your point is valid. If we merely think about overwriting some of the functions locally, then that could be a possibility. But I am afraid that would lead to difficulties in maintaining the system, and possibly also hamper external reuse. It would also require adding the functionality to maintain, edit, and curate functions to all existing projects. Not impossible, but much more intrusive than adding it to a single project dedicated to it. So yes, it might be workable. What I don't see though, and maybe you can help me with that: what would be the advantage of that solution? --DVrandecic (WMF) (talk) 01:22, 5 August 2020 (UTC)

@DVrandecic (WMF): I'm happy to help where I can, but I'm not sure it's a fair question. I don't see this as a question of which architecture is better or worse. As I see it, it is a matter of explaining our current assumptions, as they are documented in the main page. What I find concerning is not that Solution 2 is favored, nor the implication that Solution 4 is the least favored, it is the vagueness of "quite some rewriting". It sounds bad, but I've no idea how bad. This is more about the journey than the destination, or about planning the journey... It's like you're saying "let's head for 2, for now, we can always decide later if we'd rather go to 1 or 3; so long as we're all agreed that we're not going to 4!"

Not to strain the analogy too much(?), my reply is something like, "With traffic like this we'll probably end up in 4 anyway!" The "traffic" here is a combination of geopolitics and human nature, the same forces that drove us to many Wikipedias, and a Wiktionary in every language for all the words in all the languages ...translated. Wikidata has been a great help (Thank You!) and I'm certainly hoping for better with added NLG. But NLG brings us closer to human hearts and may provoke "irrational" responses. If a Wikipedia demands control over "its own" language (and renderer functions), or a national government does so, how could WMF respond?

In any event (staying upbeat), I'm not sure that "Renderer" is best viewed as an "architectural" component. I see it as more distributed functionality. Some of the more "editorial" aspects (style, content, appropriate jargon...) are currently matters of project autonomy. How these policies and guidelines can interact with a "renderer's" navigation of language-neutral encyclopedic and linguistic content is, of course, a fascinating topic for future elaboration. --GrounderUK (talk) 13:49, 5 August 2020 (UTC)

@GrounderUK: Ah, to explain what I mean by "quite some rewriting": I literally mean that parts of the proposal would need to be rewritten, because the proposal is written with Solution 2 in mind. So no, that wouldn't be a real blocker and it wouldn't be that bad; we are talking merely about the proposal.

I think I understand your point, and it's a good one, and here's my reply to that point: if we speak of the renderers, it sounds like this is a monolithic thing, but in fact, they are built from many different pieces. The content, too, is not a monolith, but a complex beast with parts and sections. The whole thing is not an all-or-nothing thing: a local Wikipedia will have the opportunity either to pull in everything from the common abstract repository, or to pull in only certain parts. They can also create alternative renderers and functions in Wikilambda, and call these instead of the (standard?) renderers. In the end, the local Wikipedia decides which renderer to call with which content and with which parameters.

So there really should be no need to override individual renderers in their local Wikipedia, as they can create alternative renderers in Wikilambda and use those instead. And again, I think there is an opportunity to collaborate: if two language communities have a common concern around some type of content, they can develop their alternative renderers in Wikilambda and share the work there. I hope that makes sense. --DVrandecic (WMF) (talk) 22:01, 6 August 2020 (UTC)
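The arrangement described in the comment above can be illustrated with a minimal sketch. It is not the project's actual design: the shared catalog stands in for Wikilambda, and every name in it (`CATALOG`, `register`, `LOCAL_CHOICES`, `render_for`, the renderer identifiers, and the wiki names used as keys) is hypothetical. The only point it shows is that alternative renderers live side by side in one shared place, while each local Wikipedia decides which one to call and with which content.

```python
from typing import Callable, Dict

# Hypothetical shared catalog standing in for Wikilambda:
# renderer identifier -> function turning abstract content into text.
Renderer = Callable[[dict], str]
CATALOG: Dict[str, Renderer] = {}


def register(name: str) -> Callable[[Renderer], Renderer]:
    """Add a renderer to the shared catalog under the given identifier."""
    def decorator(fn: Renderer) -> Renderer:
        CATALOG[name] = fn
        return fn
    return decorator


@register("en/birth/standard")
def birth_standard(content: dict) -> str:
    return f"{content['person']} was born in {content['place']}."


@register("en/birth/alternative")
def birth_alternative(content: dict) -> str:
    # An alternative renderer lives in the same shared catalog,
    # so other communities can reuse or improve it.
    return f"{content['place']} is the birthplace of {content['person']}."


# Hypothetical per-wiki configuration: each local Wikipedia chooses
# which shared renderer to call for a given kind of content.
LOCAL_CHOICES: Dict[str, Dict[str, str]] = {
    "wiki_a": {"birth": "en/birth/standard"},
    "wiki_b": {"birth": "en/birth/alternative"},
}


def render_for(wiki: str, content_type: str, content: dict) -> str:
    """Render content using the renderer the local wiki has chosen."""
    renderer_name = LOCAL_CHOICES[wiki][content_type]
    return CATALOG[renderer_name](content)


if __name__ == "__main__":
    fact = {"person": "Marie Curie", "place": "Warsaw"}
    print(render_for("wiki_a", "birth", fact))
    print(render_for("wiki_b", "birth", fact))
```

In this sketch nothing is overridden locally: switching a wiki to a different renderer is a change to its entry in `LOCAL_CHOICES`, and both renderers stay available to every other community.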

@DVrandecic (WMF): What a terrible tool this natural-language can be! Thanks, Denny. That is a whole lot less worrying and more understandable (well, I think we basically agree about all of this except, perhaps, "standard?" would be "existing"). --GrounderUK (talk) 22:29, 6 August 2020 (UTC)

ML tasks

“... there are certain areas I will rely on ML-based solutions in order to get them working. This includes an improved UX to create content, and this includes analysis of the existing corpora as well as of the generated corpora. There is even the possibility of using an ML-based system to do the surface cleanup of the text to make it more fluent - basically, to have an ML-based system do copy-editing on top of the symbolically generated text, which could have the potential to reduce the complexity of the renderers considerably and yet get good fluency - but all of these are ideas.

In fact, I am planning to write a page here where I outline possible ML tasks in more detail.[1]”

Constructors

Moved from Talk:Abstract Wikipedia#Parsing Word2Vec models and generally. James Salsman (talk) 20:37, 16 September 2020 (UTC)

... Would you please comment on the relationship between poor sentiment models generally in NLU, and the propriety of using [2] for parsing to aggregate Word2Vec representations? For those to whom there appears to be little relationship, I suggest that the difficulty of correctly determining the hugely consequential sentiment of an expression may or may not have implications for decisions around multiple prepositional phrase attachments, antecedent reference, and other very consequential matters in parsing which may or may not impact interlingua more than purely natural language translation with current technology. Thank you in advance. James Salsman (talk) 20:40, 14 September 2020 (UTC)

I'd like to re-formulate this question in light of the Characteristica universalis paragraph in the architecture paper ("We solely focus on the part of expression, and only as far as needed to be able to render it into natural language, not to capture the knowledge in any deeper way. This gives the project a well defined frame. (We understand the temptation of Leibniz’ desire to also create the calculus, and regard Abstract Wikipedia as a necessary step towards it, on which, in further research, the calculus —i.e. a formal semantics to the representation— can be added.)") and the mention of natural semantic metalanguage ("Results in natural semantic metalanguage [94] even offer hope for a much smaller number of necessary constructors."). James Salsman (talk) 13:38, 15 September 2020 (UTC)

@DVrandecic (WMF): So, after the benefit of your Wikimedia Clinic #010 talk, it seems that the values instantiating a constructor require knowledge of sentiment. For example, when a chemical compound is described as acidic, the extent of such acidity is a form of sentiment, and presumably that extent derived from natural language could be a slot in a constructor. How are the constructors instantiated from their natural language source text? By hand? Symbolic parser (e.g. link parsing or traditional parser) functions? In Lua? Keyphrase overlap parsers based on word2vec-style embeddings? Using transformer models [3][4][5][6] or LSTMs? Any function in, say, Python? Some combination? If a generalized approach is planned, what kind of parsing do you want to support? To be clear, this is an ML proposal pertinent to the Figure 4 data entry UI on page six of the architecture working paper. James Salsman (talk) 18:52, 16 September 2020 (UTC)

@Chris.Cooley: I was very impressed with [7] that you cited, and find the arguments to avoid schema engineering compelling. How do you feel about word2vec embeddings and keyphrase overlap parsing of them? James Salsman (talk) 20:32, 16 September 2020 (UTC)

@James Salsman: By hand, basically. We don't plan to build on top of ML models such as word2vec for generating or representing the abstract content to be used in the Wikipedias. (For parsing, this might look different, but parsing will be a "nice to have" to make editing easier, not an integral feature.) I don't think that ML models allow for the kind of editability that our contributors require. --DVrandecic (WMF) (talk) 00:48, 26 September 2020 (UTC)
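As a rough illustration of what "by hand" could mean here, the sketch below fills the slots of a constructor explicitly and passes it to a small symbolic renderer, with no ML model involved. It is an assumption-laden toy, not the project's data model: the `Constructor` class, the `HasProperty` kind, the slot names (`subject`, `property`, `degree`) and `render_english` are all hypothetical, and the acidity example is borrowed from the question above.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Constructor:
    """Hypothetical abstract-content unit: a kind plus named slots."""
    kind: str
    slots: Dict[str, str] = field(default_factory=dict)


def render_english(c: Constructor) -> str:
    """Toy symbolic renderer for one hypothetical constructor kind."""
    if c.kind != "HasProperty":
        raise ValueError(f"no English renderer for kind {c.kind!r}")
    degree = c.slots.get("degree")  # optional slot
    prop = c.slots["property"]
    phrase = f"{degree} {prop}" if degree else prop
    sentence = f"{c.slots['subject']} is {phrase}."
    # Capitalize only the first character, leaving the rest untouched.
    return sentence[0].upper() + sentence[1:]


# Slots are filled in by an editor, not derived from source text by a parser.
acidity = Constructor(
    kind="HasProperty",
    slots={"subject": "citric acid", "property": "acidic", "degree": "weakly"},
)

if __name__ == "__main__":
    print(render_english(acidity))
```

Because every slot value is plain, inspectable data, a contributor can correct a single slot directly, which is the kind of editability the reply above refers to.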