Grants talk:Project/XML Boiler for Wikipedia

From Meta, a Wikimedia project coordination wiki

JATS query

Would the proposed XML markup be able to be JATS-compliant (Journal Article Tag Suite)? This would be highly valuable in bringing wiki pages in line with relevant ISO standards. T.Shafee(Evo﹠Evo)talk 03:36, 22 January 2020 (UTC)

I propose to use the full features of XML Boiler to process different XML namespaces. So, yes, certainly we can support JATS as well. Moreover, taking JATS as the first format to support seems viable. In simple words, the answer to your question is "yes". --VictorPorton (talk) 05:33, 29 January 2020 (UTC)

Several comments and questions

Hi User:VictorPorton,

Thank you for submitting this proposal. I have a few follow-up questions and comments for you:

  • The problem description states that MediaWiki markup language is problematic. You propose to switch to an XML format. Would this require changing content on some existing MediaWiki pages, and if so, how would this be approached? When Tidy was replaced with RemexHtml, a lot of small bits and pieces in the software stack needed to be worked on (see the subtasks of phab:T89331), and the content on many pages needed to be adjusted, so mw:Extension:Linter was written and deployed to list such problematic pages on each wiki. Or do you "only" propose to have an additional parser backend, not deployed on Wikimedia sites? Would the latter not depend on resolving phab:T114194 first?
>> I propose to use multiple markup formats. Changing content on existing pages happens in the usual way: somebody edits it (and inserts XML instead of the old markup). This is meant to be done mostly through a WYSIWYG GUI. I want to make clear that, as a transition to XML, we first write XML parsers that transform XML to the old MediaWiki markup. So there should be no problem with adjustment. No, I want to deploy this parser backend on (at least some of) Wikimedia sites. --VictorPorton (talk) 05:33, 29 January 2020 (UTC)
Hi User:VictorPorton, if you "want to deploy this parser backend on (at least some of) Wikimedia sites", then it is unclear to me which actual existing problem such a deployment would solve (and why you would go for XML, which is not explained in your proposal), as Parsoid already solves the problem of machine-readable output. --AKlapper (WMF) (talk) 19:30, 3 February 2020 (UTC)
Parsoid produces HTML. HTML is not quite machine readable. It is machine readable in the sense that a machine is able to read it (just as a machine is able to read any file), but not in the sense that matters here: it is very hard to process HTML properly, and HTML is not a truly semantic markup. What I propose is to store information in a semantic markup syntax. The lack of (advanced enough) semantic markup in Wikipedia is the problem I am trying to solve. --VictorPorton (talk) 16:12, 14 February 2020 (UTC)
  • You have identified a problem you want to solve. For any problem, there are often several possible technical implementations to consider. Can you also share what other approaches you have considered and why you ultimately chose this implementation and technology? Maybe this is a naive question, but, for example, when it comes to the format, which reasons speak against implementing Markdown instead (see phab:T105068), which might be simpler for editors than entering XML tags? When it comes to XML itself, there are other pieces of software out there which I would say also allow "flexible, extensible XML processing". Which other pieces have you looked at and what were the reasons against them?
>> Markdown is not appropriate for this purpose because it is not extensible. Many existing MediaWiki syntax features cannot be expressed in Markdown. So Markdown is completely ruled out for this purpose, as it is impossible to translate some pages into this format. Moreover, Markdown parsing is not as standardized and as widely supported by various kinds of software as XML. Entering XML tags should usually happen through a WYSIWYG GUI. I know of absolutely no other flexible, extensible XML processing software than mine, at least at the same "scale" (supporting, e.g., pluggable addition of XML namespaces without tweaking existing ones). If you know of such software, please give me a link. AFAIK, only my software supports seamlessly adding new namespaces into processed XML documents. I looked into XSLT, XPath, etc. But my software is an upper-level wrapper over these, allowing them to be combined in non-trivial ways. XSLT alone, without a wrapper, would not be enough, as it provides no seamless way to plug new namespaces into the system. --VictorPorton (talk) 05:33, 29 January 2020 (UTC)
  • Currently, "How will you know if you have met your goals?" lists the same bullet points as "Project goals". Could you please instead describe criteria for verifying that your goals are met?
>> The criteria are: all or most Wikimedia markup is translated into XML syntax; this syntax is fully supported and automatically convertible to the old markup; and adding new namespaces by end users (not MediaWiki programmers) has become easy. --VictorPorton (talk) 05:33, 29 January 2020 (UTC)
  • Where to find the current code base of "XML Boiler"? I do not see any link provided to existing source code.
>> https://github.com/vporton/xml-boiler is the existing, already working version (Python), and https://github.com/vporton/xml-boiler-dlang is an ongoing rewrite in D. --VictorPorton (talk) 05:33, 29 January 2020 (UTC)
  • Integrating this with MediaWiki will probably require a good understanding of the parser architecture. See for example https://www.youtube.com/watch?v=lQGfuLP9MqA for current Parsoid plans and status. Have you had any conversations already with people working on the MediaWiki parser and Parsoid? If so, can you share anything about those conversations? It would be great to see people involved with the parser stack engaging directly on this talk page so we can hear their thoughts about the proposal.
>> Not true, because my first implementation would just transform XML to the old markup and pass it to the old parser. This does not require all this knowledge. More direct usage of XML is to be added in the future (probably not even by me). --VictorPorton (talk) 05:33, 29 January 2020 (UTC)
  • Furthermore, I have strong doubts that all of this can be done within 12 months by one person. To me it sounds like a project that will take several people several years.
>> I think I can succeed in doing it in 12 months. I already have my XML Boiler. What remains is to plug it into MediaWiki, which is maybe hard but not a very big amount of work, and then just to write a converter (in XSLT, for example) from an XML format into the old markup. The converter is not expected to be extremely complex, because MediaWiki syntax is not that big. The first version may omit some MediaWiki features in the XML backend and include just the most important ones. So, I think it is quite likely that I can do it in 12 months. --VictorPorton (talk) 05:33, 29 January 2020 (UTC)

Thanks in advance for your feedback.

Regards, --AKlapper (WMF) (talk) 22:59, 24 January 2020 (UTC)

It looks like there is feedback by the proposal author above in lines that start with ">>", between the lines that I wrote in my comment above, which is a rather uncommon (and hard-to-read) way of replying. --AKlapper (WMF) (talk) 19:26, 3 February 2020 (UTC)

Oppose

This is a solution in search of a problem. I strongly recommend that this grant be rejected. Turning MW markup into a computer-readable syntax is an already solved problem (Parsoid HTML5 output; if you're in love with XML, there is an XML serialization of HTML5). XML syntax is not an easy language for humans to author or read. Even if you accepted the problem statement of this grant, the proposed solution of inventing an RDF-based middleware layer for invoking XSLT (or other) transforms is not going to solve that problem. Finally, even ignoring all that, this is a technical direction that would be extremely unpopular with users (assuming that the plan is eventually for more and more of the syntax to be displayed as XML when editing wiki page source). Not to mention the bit of the grant about hiring SEO. Oh, and no technical buy-in on the WMF side. Bawolff (talk) 07:21, 1 February 2020 (UTC)

The problem exists: Wikipedia does not support an advanced enough semantic markup. It is not only a real problem, but a big problem. As I said above, HTML5 is not quite a semantic markup, and that's a problem. I propose not just any XML (like HTML5 in XML) but semantic XML! It has not yet been done and needs to be done. I remind you that I propose to add a WYSIWYG interface to edit that XML markup (so your "XML syntax is not an easy language for humans to author or read" is not quite relevant). By "that problem", do you mean its being hard for humans to read and write? Certainly, editing it directly in XML won't be popular among users, because there will eventually be a WYSIWYG interface. Wikimedia can remove the SEO part from my proposal. --VictorPorton (talk) 16:12, 14 February 2020 (UTC)
I think that you seriously underestimate the complexity of your proposal. You are essentially proposing to invent a new XML-compliant markup language and fully integrate it into MediaWiki (which itself is a mess). And all this will be done for only $28800 in just one year? Do you really understand that it is just impossible? Ruslik (talk) 18:20, 17 February 2020 (UTC)
And on top of that it is going to include a WYSIWYG editor that actually convinces users to use semantic markup (something that is a semi-unsolved problem in UI design; users almost never use semantic markup in WYSIWYG editors unless the semantic model is trivially simple). This proposal is an unrealistic solution to a problem statement that doesn't even exist. Bawolff (talk) 07:45, 20 February 2020 (UTC)
The WYSIWYG editor would serve to encourage users to start using the new markup format, not to provide very advanced features. Semantics are to be added later, maybe not during the grant period. So it is not as complex as you expect. --VictorPorton (talk) 17:58, 2 March 2020 (UTC)
It is easy to invent an XML language having the features of wiki markup, because MediaWiki markup is not so complex. The hard part was to invent automatic transformation of XML namespaces. Writing an XSLT script that transforms this XML to MediaWiki markup is not hard. Fully integrate? No, just to start the integration. It is planned to be transformed by XSLT into the regular wiki markup and processed by the usual MediaWiki parser, not "fully integrated". --VictorPorton (talk) 17:58, 2 March 2020 (UTC)
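
For readers following the thread, the conversion step described above (XML turned into regular wiki markup, which is then handed to the usual MediaWiki parser) could look roughly like the sketch below. This is only an illustration under stated assumptions: it is not XML Boiler's code and not the proposer's actual format. The namespace URI http://example.org/wiki-xml and the element names page, heading, p, b, i and link are invented for this example, and it uses Python's standard xml.etree.ElementTree module instead of XSLT so that it runs without extra libraries.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace URI and element names, invented for this sketch.
NS = "{http://example.org/wiki-xml}"


def inline_to_wikitext(elem):
    """Render an element's text and inline children as wikitext."""
    parts = [elem.text or ""]
    for child in elem:
        inner = inline_to_wikitext(child)
        if child.tag == NS + "b":
            parts.append("'''" + inner + "'''")  # bold
        elif child.tag == NS + "i":
            parts.append("''" + inner + "''")  # italics
        elif child.tag == NS + "link":
            target = child.get("target", inner)
            if target == inner:
                parts.append("[[" + target + "]]")
            else:
                parts.append("[[" + target + "|" + inner + "]]")  # piped link
        else:
            parts.append(inner)  # unknown element: keep its text only
        parts.append(child.tail or "")
    return "".join(parts)


def xml_to_wikitext(xml_source):
    """Convert the hypothetical XML page format into wikitext blocks."""
    root = ET.fromstring(xml_source)
    blocks = []
    for block in root:
        if block.tag == NS + "heading":
            marks = "=" * int(block.get("level", "2"))
            text = inline_to_wikitext(block).strip()
            blocks.append(marks + " " + text + " " + marks)
        elif block.tag == NS + "p":
            blocks.append(inline_to_wikitext(block).strip())
        # Other block-level elements are ignored in this sketch.
    return "\n\n".join(blocks)


SAMPLE = """<page xmlns="http://example.org/wiki-xml">
  <heading level="2">History</heading>
  <p>The <b>parser</b> links to <link target="XSLT">a transform language</link>.</p>
</page>"""

if __name__ == "__main__":
    print(xml_to_wikitext(SAMPLE))
```

The sketch covers only headings, paragraphs, bold, italics and links. Templates, tables, references and the other wikitext constructs raised in this discussion are not handled, so it says nothing either way about how much work a full converter would be.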

DBpedia and wikitext parsing

It sounds like the proposer intends to parse wikitext to extract structured data from it and then express it in some structured format, which of course is going to be able to be stored in XML. We already have this: it's called DBpedia. Have you considered contributing to their efforts?

Also, what I'm going to call the 0th law of wikitext parsing is that Whenever you find yourself re-implementing the MediaWiki parser, you can be sure you're doing something wrong (already the second time in less than 24 hours that I'm using this to save someone's life). Nemo 12:49, 23 February 2020 (UTC)

Comments from I JethroBT (WMF)

Hello VictorPorton, and thank you for your proposal on improving markup for MediaWiki pages, as well as for your responses to community feedback and questions. Having reviewed the feedback on the discussion page and consulted with technical staff on the feasibility of this proposal as well as the lack of a specific problem this deployment would address, it is not eligible for further review. I would recommend following up on Nemo's comment to explore and review projects that use XML in other contexts in relation to Wikimedia projects, as these are initiatives where your expertise and background would be valuable. If you are interested in connecting with the team coordinating DBpedia, please let me know, and I'd be happy to put you in touch with them. With thanks, I JethroBT (WMF) (talk) 11:18, 2 March 2020 (UTC)