Talk:Abstract Wikipedia/Updates/2021-09-17

Calculation vs Querying


Do you have an overview of how many resources are needed to generate the form of a word each time it is required, compared with fetching the needed form through a query from a database, in this case probably the Lexeme namespace in Wikidata? For daily use I prefer the way that needs fewer resources. From my point of view it is helpful if many words and their forms are generated and stored in Wikidata, and the functions that do this are interesting, so I am interested in understanding them. This also helps people outside the Wikimedia projects who are interested in understanding language. Usually it should be possible to check beforehand which forms can occur in a specific generated sentence, and then make sure that these forms are available in Wikidata. --Hogü-456 (talk) 20:19, 23 September 2021 (UTC)

I agree. I would also like the generated forms, if possible, to be stored and materialized in Wikidata, because then they also become available through SPARQL, etc. I don't know the differences in resources, and would expect that it depends on the language. Given that I expect we will always need to look up every Lexeme, even just to check whether it is regular or not, I would expect that the lookup might need fewer resources. On the other hand, I think most morphological functions are so simple that the overhead of running a function from Wikifunctions will easily be the more expensive part.

So, in short, I think the answer to your question depends more on the respective overheads of calling Wikifunctions or querying Wikidata than on any feature of the natural language. And in that case I hope that most of it will be cached and thus be reasonably cheap. -- DVrandecic (WMF) (talk) 23:56, 24 September 2021 (UTC)
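The trade-off discussed above, stored forms versus computed forms with caching to keep repeated requests cheap, can be sketched roughly as follows. This is only an illustration under stated assumptions: `STORED_FORMS`, `generate_form` and `get_form` are hypothetical names, the dictionary stands in for forms materialized on Wikidata Lexemes, and the generator stands in for a Wikifunctions call. Neither reflects a real API.

```python
from functools import lru_cache

# Hypothetical stand-in for forms stored explicitly on Wikidata Lexemes.
STORED_FORMS = {
    ("ser", "imperfect-indicative-1sg"): "era",  # irregular, so stored
}


def generate_form(lemma, feature):
    """Hypothetical stand-in for calling a morphological function."""
    if feature == "imperfect-indicative-1sg" and lemma.endswith("ar"):
        return lemma[:-2] + "aba"
    raise LookupError(f"no rule for {lemma!r} with feature {feature!r}")


@lru_cache(maxsize=None)
def get_form(lemma, feature):
    """Prefer a stored form; otherwise compute it. Results are memoized,
    so a repeated request costs neither a lookup nor a function call."""
    stored = STORED_FORMS.get((lemma, feature))
    if stored is not None:
        return stored
    return generate_form(lemma, feature)


print(get_form("ser", "imperfect-indicative-1sg"))
print(get_form("cantar", "imperfect-indicative-1sg"))
```

Whether the stored lookup or the function call is cheaper in practice is exactly the open question in this thread; the sketch only shows that a cache in front of both makes the choice matter less for repeated requests.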

I think you are going wrong here and it is a bad idea to have all possible forms stored under each lexeme in Wikidata. This is enormously redundant, clumsy and inelegant. For instance in Spanish, considering only fully conjugated simple tenses, there are present, past, imperfect and future tenses in the indicative and subjunctive moods, each with six forms, so 4 * 2 * 6 = 48 forms, all of which can typically be predicted reliably if the last three letters of the infinitive are known. Instead of listing all these forms, rules should be given. For instance, if the infinitive ends in -ar, the six person/number forms of the imperfect indicative can be found by deleting the final -ar of the infinitive and adding -aba, -abas, -aba, -ábamos, -abais and -aban respectively. The present tense is more complicated, but normally if you know the first person singular and the infinitive there is an easy rule to derive the other five person/number forms, so you only need to give the first person singular. And since there is a simple rule to derive it from the infinitive in most cases, it should only be given if irregular. In case of doubt the explicit forms could always be given.
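As a minimal sketch of the -ar imperfect indicative rule described above (the function name and constant are hypothetical and not part of Wikifunctions or Wikidata):

```python
from __future__ import annotations

# Endings for the six person/number forms, in the order
# 1sg, 2sg, 3sg, 1pl, 2pl, 3pl, as listed in the comment above.
IMPERFECT_AR_ENDINGS = ("aba", "abas", "aba", "ábamos", "abais", "aban")


def imperfect_indicative_ar(infinitive: str) -> list[str]:
    """Return the six imperfect indicative forms of a regular Spanish
    verb whose infinitive ends in -ar (hypothetical helper)."""
    if not infinitive.endswith("ar"):
        raise ValueError(f"expected an infinitive ending in -ar, got {infinitive!r}")
    stem = infinitive[:-2]  # delete the final -ar
    return [stem + ending for ending in IMPERFECT_AR_ENDINGS]


print(imperfect_indicative_ar("hablar"))
# ['hablaba', 'hablabas', 'hablaba', 'hablábamos', 'hablabais', 'hablaban']
```

Irregular verbs would still need their forms given explicitly, as the comment notes; the sketch covers only the regular case.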

To formulate a set of rules which would enormously reduce the number of forms which have to be listed explicitly isn't trivial, but it is not terribly difficult and it would be an important resource in itself. It would be vital to formulate rules to be used in Abstract Wikipedia 100% rigorously for each language; they would then be implemented in the wikifunctions which would render the text in that language (the rules would be part of the specification). I propose that these rules could be kept in a sub-page of the talk page of the Wikidata item for the language in question (written in the language in question).

I am reluctant to add large quantities of lexeme forms in Wikidata because I feel it is monkey-work which will result in an inappropriate data structure. A scheme like the one I propose would be elegant and would serve everyone much better. But it is urgent to agree on the way of working so that work on implementation can start. Strobilomyces (talk) 20:58, 8 January 2022 (UTC)

@Strobilomyces: Thanks, and I mostly agree with you. But there are a few further considerations: we might want to have other information connected with the form, e.g. a usage example, the pronunciation, where it is attested, etc., which might make it more interesting to have that form explicitly materialized in Wikidata.

In the end, I think the decision will be rather granular, per language, per lexical category, etc. I think it would be great to have an early decision on these questions, but I would be surprised if we manage to get to a decision before we actually refresh the interface for lexicographic data in Wikidata (which WMDE has now started working on), and before we have Wikifunctions launched.

Putting as much as possible into Wikifunctions functions certainly sounds like a good approach to me! But my caveat is the "as much as possible", and I am very flexible with materializing the forms for many different use cases.

The ultimate answer, though, is that it really is up to the community, and my voice regarding that is just one voice. The system we are building at the Foundation and at WMDE should support any of the possible decisions. --DVrandecic (WMF) (talk) 21:36, 9 February 2022 (UTC)

@DVrandecic (WMF): Thank you for your reply. I agree that these rules will be very different, sometimes not needed, depending on the language (and part of speech). I do not think it is desirable to attach information at the form level as it will normally be better to describe the same information (or rules for deriving it) at the lexeme level. It is true that there is a very close relationship between these rules and the specific-language-generating functions - so perhaps the lexeme forms and the functions should be developed together. I am not criticizing the structure of the software or data models, but I am worried that perhaps it is not a good idea to add millions of lexeme forms when they may not really be wanted.

Also I think that the vision for Abstract Wikipedia needs to be more concrete even at this early stage. Although there are a few good example proposals of Abstract Wikipedia code, I think there needs to be much more of that material at a more detailed level. In fact, for me what AW needs more than anything is some detailed examples of AW code together with an analysis of how the generator functions could work in the cases of specific languages, while tracing where all the requisite data would come from. I think you are too optimistic if you think that "the community", or many individual language communities, will come together and set up a workable system without a tremendous amount of central leadership. It is difficult for Wikimedians to agree on anything; will the decision process be similar to last year's evaluation of proposals to improve the "Request for Admin" process? It will take ages, anyway. Although Wikifunctions is not yet there, I think there is already a lot that could be done to make Abstract Wikipedia into a well-defined project (with a specification of how the work for each language can be organized). Strobilomyces (talk) 18:13, 10 February 2022 (UTC)

Indeed, and that is a great point!

Whereas I do think that some of these steps would be premature right now, I sure will not stop anyone from working on this. Mahir, for example, is a user who is already dashing ahead and working on code and on how to connect it with the lexicographic data in Wikidata in order to create phrases and sentences with his projects Udiron and Ninai. It is just that the development team and I are currently too busy focusing on Wikifunctions to be of much help at the current stage.

But your point is valid. It might well be that assuming the community will build all of this by itself is overly optimistic. I do plan to provide more guidance than for most of the other Wikimedia projects. This is something we should discuss soon. Thank you for reminding me; I will put it on my schedule to discuss this in one of the upcoming newsletters.

Thank you so much for your ideas! I enjoyed reading your suggestions. --DVrandecic (WMF) (talk) 00:03, 5 March 2022 (UTC)