Wikimedia monthly activities meetings/Quarterly reviews/Parsoid/March 2014

From Meta, a Wikimedia project coordination wiki

The following are notes from the Quarterly Review meeting with the Wikimedia Foundation's Parsoid team, March 28, 2014, 12:30pm-2:00pm.


Present:

  • Trevor Parscal
  • James Forrester
  • Roan Kattouw
  • Erik Möller
  • Gabriel Wicke
  • Tilman Bayer (taking minutes)
  • David Chan
  • Howie Fung
  • Terry Chay (taking minutes)

Participating remotely:

  • Arlo Breault
  • Maggie Dennis
  • Marc Ordinas i Llopis
  • Subramanya Sastry
  • C. Scott Ananian

Please keep in mind that these minutes are mostly a rough transcript of what was said at the meeting, rather than a source of authoritative information. Consider referring to the presentation slides, blog posts, press releases and other official material.

Presentation slides


  • Our objectives
  • Progress Q3
  • Tasks Q4
  • Questions & discussion

Gabriel: welcome, agenda


Still: Deal with wikitext, so you don’t have to™

  1. Faithful bidirectional conversion without dirty wikitext diffs
  2. Enable new applications and architectural improvements for MediaWiki in general, with the HTML+RDFa interface and conversion service. e.g. display of structured data as table
  3. Research better templating, widget and diffing solutions

Progress Q3

Ongoing unglamorous work

  • Continuous tweaking of edge cases (e.g. invalid content moved between table cells, lots of headaches of that kind), lots of bugs
    • hacking away at the long tail
    • template encapsulation, foster parenting, nested links, …
  • informed by testing, bug reports from VE usage
  • Continuous: deploys. Probably the only WMF team that deploys 2x/week regularly. We are basically alpha testers for Trebuchet ;)

Testing infrastructure

infrastructure saw a lot of work

  • /_rtselser/ endpoint for quick checks, quite clever now
  • improved Parser Tests
    • Reduced false failures, improved selser testing (several automated edit modes + manual edits)
  • Round-Trip testing
    • Full-stack testing e.g. including HTTP API (spawn new server for each roundtrip)
    • Moved to hardware (from VMs) for perf. tracking
    • Cassandra storage backend in development by Facebook Open Academy (FB OA) students, so we don't overload Labs any more
    • OPW intern Be Birchall improved our RT test server code: better interface - handlebars templates, new diff library (which doesn’t crash on large diffs), …
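
The round-trip idea above can be sketched as follows. This is a minimal illustration, not Parsoid's test code: `toy_wt2html` and `toy_html2wt` are hypothetical stand-in converters that only understand bold markup, and `round_trip_diff` is an assumed helper name. It converts wikitext to HTML and back, then reports any difference from the original as a "dirty diff".

```python
import difflib
import re


def toy_wt2html(wikitext: str) -> str:
    """Hypothetical stand-in converter: only handles '''bold''' markup."""
    return re.sub(r"'''(.+?)'''", r"<b>\1</b>", wikitext)


def toy_html2wt(html: str) -> str:
    """Hypothetical stand-in serializer: turns <b> back into '''bold'''."""
    return re.sub(r"<b>(.+?)</b>", r"'''\1'''", html)


def round_trip_diff(wikitext, wt2html=toy_wt2html, html2wt=toy_html2wt):
    """Round-trip the wikitext and return unified diff lines.

    An empty list means the round trip was clean; anything else is a
    dirty diff that a test server could record for later inspection.
    """
    result = html2wt(wt2html(wikitext))
    return list(difflib.unified_diff(
        wikitext.splitlines(), result.splitlines(),
        fromfile="original", tofile="round-tripped", lineterm=""))


if __name__ == "__main__":
    # Clean round trip: prints nothing.
    for line in round_trip_diff("Some '''bold''' text"):
        print(line)
    # Literal <b> in the source is normalized to ''' on the way back.
    for line in round_trip_diff("A <b>raw</b> tag"):
        print(line)
```

The second call shows why the long tail is hard: the toy converter cannot tell a literal `<b>` tag from one it generated itself, so the source is rewritten even though nobody edited it.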


Image handling saw a lot of improvements recently, in parallel with work on VE

  • Images and parameters
    • basics mostly done
    • no shortage of edge cases, crazy stuff, found lots of bugs, big mess in MW (CScott). Roan: e.g. behaviour can differ depending on whether it's an SVG or a PNG ;) (Gabriel:)
  • initiatives to clean up image handling in the longer term
    • semantic image formats
    • square bounding boxes - CScott has put up an RfC about that. Trevor, James: French WP has a separate template for that
    • less confusing options, better uniform styling in skin / view. CScott: "upright" parameter is broken in that the way it works isn't the way the docs say. Wouldn't recommend using it because it is misleading.

DOM parameter editing

  • Parse transclusion parameters to DOM
    • originally planned for Q4
    • enables visual parameter editing - you don't drop to wikitext once you enter the transclusion dialog
    • parse to DOM ongoing, close to merge
    • next step: editing support. Once that is completed, VE can start to include nested editing

Public API

  • now exposed at public URL ,
  • already lots of internal users:
    • VisualEditor
    • Flow - using it to make sure HTML is rendered (discovered some bugs & missing features like the bad image block list that we'll have to support for normal views)
    • PDF renderer - CScott has been using it, not deployed yet (Erik: they're just provisioning the box) (CScott: The Ops issues are being resolved, and we are almost ready to turn it on for real people. From the Parsoid perspective, it's a nice use of Parsoid and it's fairly robust; it queued up some image-related bugs to fix now that this is coming along. PDF support looks better with more semantic options for images, whereas the wikitext people write is very specific. Now the PDF renderer can ignore most of those user hints and render better on the page.)
    • ContentTranslation project starting up now, will use Parsoid
    • Mobile - not yet, but soon
  • Quickly found external users: mw:Parsoid/Users

API user: Kiwix used it very early on, overloaded our test servers ;)
Offline reader for desktop + mobile
Uses Parsoid for HTML dumps of all wikis, also some non-Wikipedia projects like Wikivoyage
This is a milestone. They reported lots of bugs; together with PDF renderer this represents broader use of this export infrastructure

API user: EPH edit gadget
Edit-protected helper gadget. Uses Parsoid to retrieve the DOM, looks for the template and gives options to respond to the request; sends the result back to Parsoid
Erik: that looks... hmm ;)
Gabriel: users are starting to do this on their own.
more potential for microcontributions
currently slow because it loads a separate page, but once this is inline, it will become a non-issue
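
A gadget of this kind has to locate templates in Parsoid's HTML. The sketch below shows one way to do that in Python, assuming the markup convention of the Parsoid DOM spec as I understand it: template output carries `typeof="mw:Transclusion"` and a JSON `data-mw` attribute with a `parts` list. The sample HTML, the class name `TransclusionFinder` and the helper `find_templates` are made up for illustration and are not taken from the gadget itself.

```python
import json
from html.parser import HTMLParser


class TransclusionFinder(HTMLParser):
    """Collects template names and parameters from Parsoid-style HTML."""

    def __init__(self):
        super().__init__()
        self.templates = []

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        # typeof can hold several space-separated types.
        if "mw:Transclusion" not in (attr.get("typeof") or "").split():
            return
        data_mw = json.loads(attr.get("data-mw") or "{}")
        for part in data_mw.get("parts", []):
            # Parts may also be plain strings (wikitext between templates).
            if isinstance(part, dict) and "template" in part:
                tpl = part["template"]
                self.templates.append({
                    "target": tpl.get("target", {}).get("wt", "").strip(),
                    "params": {name: value.get("wt", "")
                               for name, value in tpl.get("params", {}).items()},
                })


def find_templates(html):
    """Return a list of {'target': ..., 'params': {...}} dicts."""
    finder = TransclusionFinder()
    finder.feed(html)
    finder.close()
    return finder.templates


# Hypothetical sample in the assumed Parsoid output shape.
SAMPLE = (
    '<p><span about="#mwt1" typeof="mw:Transclusion" '
    'data-mw=\'{"parts":[{"template":{"target":{"wt":"edit protected"},'
    '"params":{"answered":{"wt":"no"}}}}]}\'>Request</span></p>'
)

if __name__ == "__main__":
    print(find_templates(SAMPLE))
```

Because the template call is available as structured data, a client can change a parameter in `data-mw` and post the HTML back for serialization instead of patching wikitext with string operations.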

API user: translation gadget on Tool Labs

  • Translation Czech <-> Slovak
    • Only learned about it when Labs had issues.
    • machine-translates Parsoid HTML & adjusts templates, pipes it back to Parsoid
    • presents translated wikitext to user for copyediting. used for like a third(?) of articles on Slovak Wikipedia
    • will be generalized by Content translation project
  • Google Knowledge Graph started using it, but slowly (still using normal API as well)


  • worked on Rashomon revision storage - store data and metadata separately
    • Part of larger move to SOA (service-oriented architecture)
    • RFC found lots of support at architecture summit
      • make new developers more effective
      • avoid duplication of effort
      • enable independent innovation
      • security, monitoring, scaling, testing
      • Will be pushed further along by service team
    • Not yet in production, but close (other stuff like HTML templating got in between)
      • Erik: hardware provisioning? Gabriel: not much needed, can likely reuse boxes we currently use for Cassandra testing.
        Erik: so all users will get same result from Parsoid with Rashomon in the future? Gabriel: yes, right now we're using caching but the plan is for them to get Rashomon in the future
  • Debian packaging for Parsoid
    • make it easy to install MediaWiki along with the collection of services it is using (addressing a previous argument against services)

HTML templating

  • Originally a Q4 research project
  • Developed DOM-based library for UI and i18n messages (two frontends: KnockOff / TAssembly, one backend). balanced tags, uses escaping
    • secure and convenient, yet very fast
  • Prototype platform for high-performance content templating
  • Cooperation with Flow & Mobile
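
To illustrate "balanced tags, uses escaping": the sketch below is not KnockOff or TAssembly and does not reproduce their APIs. It is a hypothetical Python miniature in which a template is a tree of `(tag, attributes, children)` tuples, so tags are balanced by construction, and every value passes through escaping at render time, so the template author never escapes by hand. `Var` and `render` are assumed names.

```python
from dataclasses import dataclass
from html import escape


@dataclass(frozen=True)
class Var:
    """Placeholder that is filled from the render context."""
    name: str


def _resolve(value, context):
    """Look up a Var in the context; pass literal values through."""
    if isinstance(value, Var):
        return str(context.get(value.name, ""))
    return str(value)


def render(node, context):
    """Render a template tree to an HTML string.

    Text nodes and attribute values are always escaped, and each
    element emits its own closing tag, so output stays well-formed.
    """
    if isinstance(node, (str, Var)):
        return escape(_resolve(node, context), quote=False)
    tag, attrs, children = node
    attr_text = "".join(
        f' {name}="{escape(_resolve(value, context), quote=True)}"'
        for name, value in attrs.items())
    body = "".join(render(child, context) for child in children)
    return f"<{tag}{attr_text}>{body}</{tag}>"


if __name__ == "__main__":
    template = ("a", {"href": Var("url")}, ["Hello, ", Var("user")])
    print(render(template, {"url": "/wiki/Main_Page",
                            "user": "<script>alert(1)</script>"}))
```

The sketch escapes text and attribute values only. It does not check URL schemes in attributes such as `href`, which a real content templating system would also have to handle.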

people threatened to use things that are less secure ;)
Erik: ...or only UI side?
Gabriel: started on UI, but meant to be used on content too
Erik: so it could become part of a next-gen templating system
Gabriel: e.g. Flow, combining lots of parts on one page..., backend is relatively fast, but two frontend pieces (see above).
porting this to PHP (Matt Walker helping with that)
Erik: only concern: if we fork KnockOff, maintenance burden for us
Gabriel: KnockOff is actually complex, no intention to port it (reactivity); TAssembly is a small library (500 lines for TAssembly, 1500 lines KnockOff - we just spent a few days on it)
CScott: spacebars also lightweight library
secure and structured, not the usual HTML soup
I did it more as proof of concept
Gabriel: proved that it's possible to have secure solution without manual escaping
Trevor: what is the demand that is currently tipping balance towards UI?
Gabriel: Mobile, Flow, others too
Trevor: concerned: superseding demand for better content templating(?), funny that Parsoid team basically providing UI ;)
Gabriel: work on that too
CScott: Parsoid team better in touch with that stuff (the JavaScript community vis-a-vis PHP community)
Gabriel: bigger picture: in the long term move to HTML-only wiki, solve templating for that
UI does templating very similar to content
Erik: ultimately we are all on the same page, VE was a strong advocate of moving towards HTML templates, potentially secured server-side JS as language
Trevor: very much on board for content, apprehensive about UI
Erik: Can you be a part of that conversation? (there was a Wikitech thread..)
Trevor: I'll try to inject myself.
Gabriel: I can cc: you
CScott: That went a bit off track, might benefit from a more focused conversation.

Structured logging

OPW project: Maria Pacana

  • Generic logging framework
    • App (Parsoid) generates logging events
    • Event stream filtered and dispatched to interested subscribers on backend, similar to what's going on in MW core
  • Structured log format
    • will be async (avoid going down if disks fill up,...)
    • can now log performance metrics. Erik: backend for logstash too? yes
    • wikitext fixup information (GSoC ‘14 proposal)
  • Also useful for development:
    • tracing wikitext parsing actions, debugging, ...
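
The event-stream design described above can be sketched in a few lines of Python. This is an illustration under assumptions, not the Parsoid logger: the class name `EventLogger`, the slash-separated event types (`perf/parse`, `trace/wikitext`) and prefix matching are all hypothetical, and dispatch here is synchronous whereas the notes say the real backend will be async.

```python
class EventLogger:
    """Dispatches structured log events to subscribers by type prefix."""

    def __init__(self):
        self._subscribers = []  # list of (prefix, callback) pairs

    def subscribe(self, prefix, callback):
        """Register a callback for events whose type is under the prefix."""
        self._subscribers.append((prefix, callback))

    def log(self, event_type, **fields):
        """Build a structured event and hand it to every matching subscriber."""
        event = {**fields, "type": event_type}
        for prefix, callback in self._subscribers:
            if event_type == prefix or event_type.startswith(prefix + "/"):
                callback(event)
        return event


if __name__ == "__main__":
    logger = EventLogger()
    perf_events = []
    # A metrics backend only cares about performance events.
    logger.subscribe("perf", perf_events.append)
    # A developer console wants parser traces.
    logger.subscribe("trace", print)

    logger.log("perf/parse", page="Main_Page", ms=12)
    logger.log("trace/wikitext", msg="entering template expansion")
    print(perf_events)
```

Keeping events as dictionaries rather than formatted strings is what lets the same stream feed performance metrics, a log aggregator and debugging traces without each consumer parsing text.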

Deferred Tasks

Support switching between HTML and Wikitext within one edit

  • hard / low priority in VE
  • switching between wikitext and HTML: VE didn't ask very hard for that

Erik: James, what's your take on the roadmap?
JamesF: We have proper switching after Parsoid perspective. From the user perspective, there is some cleanup that is blocked on this.
Erik: The designs are basically one edit tab and a simple mode switch, if you are a current user you never have to think about VE again because you stay in that mode, and that is dependent on it.
Gabriel: Issues with dirty diffing and issues with the DOM.
Erik: It would be good if we have a conversation. Didn't Wikia do a naïve implementation?
JamesF: Wikia was planning on ignoring dirty diffs and move it back and forth. Haven't seen an in-production implementation. Ultimately they'll have non-semantic changes and say that we're normalizing wikitext.
Erik: Only happens when moving from VE to wikitext AND BACK and the dirtiness is some stuff added to the wikitext.
Gabriel: Yes. We have to deal with this information getting lost and find a way to preserve it as much as possible. Roundtripping is first concern, but longer term needs to preserve associations. If you feel it is a high priority…
Erik: Can we do that in one fell swoop (mode switch + new UI for wikitext/VE in one tab)?
James: There are two options.
Subbu: I'm not sure it is that difficult to do. If we imagine the serialized version to be a new revision, if we are willing to do the switch by reparsing the wikitext, it is possible. so, the issue is more that it's hard to do it efficiently. Is that correct?
Gabriel: stretch goal for next Q, can promote it to real goal
Erik: overlap of edit buttons part of user annoyance
James: see it as improvement rather than blocker
it's a lot of change - getting rid of experience people are used to
second edit tab less disruptive, unless the new one is really beautiful and polished
Erik: OK, so you want to do the minimal approach
Trevor: worked on lot of things we didn't actually deploy yet
In cases where VE not available, could still have similar look and feel, bring UXes closer together
Erik: Parsoid building preliminary support
Trevor: VE team hasn't worked on wikitext editor, but it's kind of in our realm
Gabriel: so leave it as stretch goal for next Q

Language variant support

  • Basic editing support deferred to Q4
  • full support questionable

(see discussion below)

Enforce proper nesting of transclusions

  • blocked on TemplateData and current template usage statistics
  • GSoC ‘14 proposal (Lintoid/LintTrap) will collect statistics
  • probably not before Q1 next fiscal

Q4 Tasks

Get ready for using Parsoid HTML for all page views

  • actually using it as stretch goal for Q4/Q1 (not strictly required now, but want to get it ready)
  • in collaboration with VE (e.g. redlinks), Flow (they need this soon), Mobile, Platform
  • needs client-side content user pref implementation & metadata storage service / API. Need to know percentage of non-JS users, implementation depends on that. James: for redlinks, either need to replicate or... I worry about code complexity. Gabriel: actually, client-side JS could be used for both editing and reading
    • Rashomon in production
    • some API work will be taken over by new services team (Q1); general idea is to push for a common implementation for multiple teams to use.
  • requires automated testing to compare PHP parser view and Parsoid views. improve quality of rendering.

Erik: scenario for rollout of Parsoid in read mode? small scale use case?
Gabriel: Flow.
Erik: that's for new content though.
Gabriel: first want to get all the preliminaries ready, also want to do visual comparisons between old and new renderings, e.g. CSS regressions. currently e.g. overlaying screenshots automatically. That will probably uncover more bugs.
Erik: after that, still heavily dependent on MW code for parts like templates, do callbacks for that? yes.
James: e.g. could start with
Gabriel: or provide faster logged-in browsing with caching/edge-side include as benefit for trying it out ;)
CScott: with PDF and Kiwix, Parsoid output is already getting more visibility, gradually.
James: with Mobile, both web and apps? yes

Basic language variant support

think about longer-term strategy for variant wikis at some point
goal: don't break existing content but don't yet add much support for extra stuff like language variant gadgets
CScott: just getting the markup into the editable and getting the metadata into the DOM
David: even when editing inconsistent text, not worse than existing situation
CScott: ...[?]
Gabriel: other option: not support variants any more, split wikis
Erik: has anyone looked at what UI would look like for zhwiki variant editing?
CScott, David: Hebrew WP, text region in different languages, inline annotations
CScott: on zhwiki, still need to be "bilingual" when editing a section, but info about variant is preserved
Gabriel: also keep in mind relation to content translation
CScott: first get variant/language metadata into DOM, then think about long term. Current use cases of language variants vary a lot. Sometimes very political, sometimes just minor differences like US/UK English.
Erik: had conversation around six months ago on how far down the rabbit hole we want to go. Should have conversation with experts from e.g. Language Engineering. Could turn out that e.g. current approach on zhwiki harms the project's growth, needs to be changed.
David: political issue, e.g. there is a lot of content that zh-hans readers would not have access to otherwise.
CScott: yes, political issue (ignore Taiwan in favor of PR China?)

HTML templating

porting to PHP. side project, but important long term

  • support other teams in UI projects
  • continue integration with i18n
    • research conversion of i18n messages to HTML templates using Parsoid
  • research use for content

Content widgets

problematic uses like multi-row templates in football tables

  • research canned widget solutions for specific tasks (alternative to wikitext hacks)
    • data tables
    • infoboxes

try to do it in an automated way
James: There is a proposal for infoboxes being standard skin feature, widget pulling in data from Wikidata. But every city that writes on enwiki has a nasty table, and turning that into a standardized locally updateable client-side widget would be awesome
Erik: very early stage now? (yes)
There's a proposed change in the skin to make text area narrower, use some of added sidebar space (if available on device) for additional info, also e.g. Commons link
basically, human-authored content in one column, automatically-generated stuff in the other
e.g. with authority control ID in bios can pull data like literature list
how much should editors still bother with e.g. positioning specific templates for things that get pulled from Wikidata anyway?
Gabriel, CScott: from Parsoid's perspective, main goal to separate layout from content
template could be seen as layout info
templates are a weird mix of code and data now

Stretch goals for Q4

  • Performance (haven't gotten to work much on this): More efficient template updates
    • lower priority than HTML storage
    • possible stretch goal for Q4
  • Research: Support switching between HTML and Wikitext within one edit, closely related to preserving information in between
    • hard / low priority in VE
  • Non-Wikipedia projects
    • Parsoid enabled on all public WMF projects
    • biggest issue for editing seems to be labeled section transclusion (primarily Wikisource)
    • potential to fix complexity issues in projects like Wiktionary (GSoC projects)
    • no time spent on this so far

James: VE is live on Wiktionary, Wikiversity, Wikibooks (some, which asked for it)
Gabriel: some Wiktionary editors have hopes for simplification as it became really hard to edit with very complex templating systems, Parsoid could help extract current data
Wikisource: need to look at labeled section transclusion


Erik: you mentioned need for data-driven referencing system.
James: Dario setting up meeting with community members who run DOI bot
Erik: looking at Zotero?
James: yes
Erik (explains): Zotero browser plugin, JS scraping bibliographic data
James: longterm view: much better way to do referencing
track which part of sentence ref refers to? hard in wikitext
Erik: in any case, good use case for building services
suggest to work on this with Matt F when he comes into the team
David: issue how refs will be handled in content translation
James: current cite... template has about 90 fields on enwiki, including for source text version in other langs, but on other wikis might get even more complicated
e.g. could display which refs are used more than 10 times in all lang versions of one article
CScott: ref templates are horrible about maintaining directionality, an issue in PDF renderer
Erik: Gabriel, other issues?
Gabriel: well covered in general,
need outside help integrating CSS, hopefully get it from Flow / teams using it
also need help on services side, e.g. Rashomon

Erik: thanks everyone