Wikimedia monthly activities meetings/Quarterly reviews/Parsoid/June 2014
Present: Lila Tretikov, Erik Moeller, James Forrester, Gabriel Wicke, Max Semenik, Tilman Bayer (taking minutes), Terry Chay (taking some minutes too), Jared Zimmerman, Roan Kattouw, Rachel diCerbo, Juliusz Gonera (from 1:10pm)
Participating remotely: Subramanya Sastry, Maggie Dennis, Tomasz Finc (until 1:10pm), Arlo Breault
Please keep in mind that these minutes are mostly a rough transcript of what was said at the meeting, rather than a source of authoritative information. Consider referring to the presentation slides, blog posts, press releases and other official material instead.
- Our objectives
- Progress Q4
- Outlook 2014/15
- Tasks Q1
- Questions and discussion
We are 5 people now
Subbu will take over Parsoid, I'll move to Services
All of the Parsoid team will be remote soon
- Our objectives
- Q4 Progress (Gabriel)
- Next Year (Subbu)
1. faithful bidirectional conversion between wikitext and HTML5 + RDFa (semantic markup) without dirty wikitext diffs (introduce only wikitext changes that actually make a difference to readers). This is very difficult; lots of algorithms and heuristics to make it work. We are quite far down that road now
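As an illustration of the semantic markup, a hand-written sketch (not actual Parsoid output; attribute details simplified) of how a wikilink and a template call look in Parsoid's HTML5 + RDFa format:

```html
<!-- Hypothetical sketch of Parsoid-style output, attribute details simplified -->
<p>
  <a rel="mw:WikiLink" href="./Main_Page">Main Page</a>
  <span about="#mwt1" typeof="mw:Transclusion"
        data-mw='{"parts":[{"template":{"target":{"wt":"echo"},"params":{"1":{"wt":"hi"}}}}]}'>hi</span>
</p>
```

The `rel`/`typeof` attributes carry the semantics, and `data-mw` records the original template call so the wikitext can be reconstructed without dirty diffs.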
Ultimate goal: replace wikitext with HTML as primary content format
Lila: when will we be able to do that?
Gabriel: will also start storing this HTML, request that directly from storage without further massaging
soon, but need a lot of infrastructure
Need to build/port a lot of long tail stuff like spam filtering before we can really make it the primary format
Longterm goal: mediawiki just based on HTML
Erik: we will already soon reduce dependence on the old parser dramatically
2. improve performance & enable new features by moving our primary content representation to HTML5 + RDFa
3. Research better templating, widget, and diffing solutions
Widen the scope and think more generally to improve our content
What we got done:
- Continued to improve rendering
- continued to improve RTing (roundtripping)
- Improved performance
- Parsoid-specific CSS
- improved testing
- wrapped up image editing support
- parsing of transclusion parameters to DOM
work that was done without having been planned:
- PDF rendering
- work on services infrastructure took up some of Gabriel's time
Q4: Continuous iteration
- Perfect round-tripping & rendering accuracy
- Continuous deploys
long tail of compatibility fixes
Lila: a lot of this looks like accommodating for eccentricities of existing content
Do we look at removing/ reformatting this?
Gabriel: we can't do this over night; first need to record all these issues
GSoC project to put all this in a database ( https://www.mediawiki.org/wiki/User:Hardik95/GSoC_2014_Application )
then work with community to clean this up in a way that works for them
Lila: we probably do not want to do that manually though?
address one problem at a time
will be much easier once moved to HTML, when wikitext will just be an editing format
then can do diffing in HTML, etc.
Lila: so wikitext will look different?
Gabriel: it will be more standardized according to community practices
Lila: we will want to do that pretty soon?
Trevor, James: be aware we need to normalize and render historical versions too
Gabriel: would be stretch to complete in 14/15
Erik: more important for end user: HTML storage will enable switching between VE and wikitext editing more easily
e.g. remember performance issues discussed in VE qr yesterday
Gabriel: perf gains for that particular case will come from reducing the network transfer, but won't speed up client-side VE initialization CPU time
Erik: but we agree that switch to HTML storage will be the most significant change soon?
Gabriel: async saves will probably be biggest gain
(continuing with Q4 progress:)
invested in testing infrastructure
Parser tests: HTML Tidy integration in progress, simple edit mode for testing selser (selective serialization) in roundtrip testing
Q4: Parser test results
- Better coverage
- more passing tests
Lila: so what should be taken out of this table? 
Gabriel: errors are going down
Subbu: also, wt2wt will probably never hit 100%; we implemented normalization to eliminate false positives due to whitespace/quotes, but it still won't be perfect.
Q4: Round-trip test results
Gabriel: parser tests are first line of defense
roundtrips are second line, more expensive
classify: syntactic difference only (resulting in same DOM) - yellow
semantic difference - red
with selser (selective serialization) and trivial edits: only 9 pages of 160,000 carry any difference (mostly just a newline)
this is what the VE folks care about
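The classification above can be sketched as follows. This is a simplified illustration, not the actual test harness; the normalization rules here are hypothetical stand-ins for the real whitespace/quote normalization:

```javascript
// Simplified sketch of round-trip result classification (not the real test code).
// A "syntactic" difference changes the wikitext but not the rendered DOM,
// e.g. ''italic'' vs <i>italic</i>; a "semantic" difference changes the DOM too.

// Hypothetical normalization: unify quote-style markup and collapse whitespace
// runs so that purely syntactic variations compare equal.
function normalize(wikitext) {
  return wikitext
    .replace(/<i>(.*?)<\/i>/g, "''$1''")   // treat <i> and '' as equivalent
    .replace(/<b>(.*?)<\/b>/g, "'''$1'''") // treat <b> and ''' as equivalent
    .replace(/[ \t]+/g, ' ')               // collapse whitespace runs
    .trim();
}

// Classify an original/round-tripped wikitext pair:
//   'green'  - identical output
//   'yellow' - syntactic difference only (same content after normalization)
//   'red'    - semantic difference
function classify(original, roundTripped) {
  if (original === roundTripped) return 'green';
  if (normalize(original) === normalize(roundTripped)) return 'yellow';
  return 'red';
}
```

In these terms, the yellow bucket is what selser mostly eliminates, and the red bucket is what the VE team needs driven to zero.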
Q4 Histogram of roundtrip results
Most pages with issues have only a single issue
Q4: Get ready for HTML5 page views
- Goals: ... (see slide)
- Developed Parsoid-specific CSS
still need some small tweaks
Erik: trying to emulate enwiki CSS faithfully, with all the infobox etc. code?
Gabriel: we are actually directly linking that using ResourceLoader, just adding a ~150-line module for Parsoid
Roan: remember the Arsenal FC template which generates football t-shirts :-o
James: with lots of hacks, and it's now used on 20-30,000 pages :/
Roan: but it now works in VE ;)
Gabriel: also, train lines template
- Next steps: Visual diffing, infrastructure
Tables were a performance issue
wrote a profiling tool that helped a lot: nodegrind
Performance progress on en:Barack Obama (now parses in 10 seconds assuming cold cache, faster with expansion reuse; a long time ago it was 30), w:San Francisco, enwiki main page
probably comparable to PHP parser now
HTML retrieval times:
HTML retrieval time very short already, not likely to change much (static HTML)
not much change from previous quarter
James: recall that logged-out users already get cached version
so this will mostly benefit logged-in users
Provide API that offers rendered content
Erik: so if we were to switch to Parsoid rendered HTML ...
James: template data, image metadata (?) will probably still not be part of Parsoid payload. but for 99% yes
- No shortage of edge cases, big mess in MediaWiki
- now basically bug-for-bug compatible with old parser
- started initiatives to clean up image handling in the long term
- semantic image formats
- square bounding boxes
Max: Square bounding boxes caused issues with community
Gabriel: resolution is need to separate the two use cases
Erik: community didn't want images to be all over the place in one article, understandably
Jared: community probably doesn't have a stronger opinion about it than we (Design team) have ;)
Roan: we (VE) were grateful for Parsoid tackling this
Erik: yes, we mainly need to be careful about changing default appearance
- less confusing options, better uniform styling in skin/view
CScott did a lot of heroic work on images
Q4: DOM template parameter editing
- Parse transclusion parameters to DOM. template parameters are still in wikitext, which surprises people who are not used to wikitext
- not yet in production (not in VE yet)
- Performance implications:
- parse times
- html size
most of this is offline, i.e. just a capacity issue
but might become a user-noticeable issue with VE/wikitext switching
Erik: so I would be getting the full annotated HTML with all transclusions when I click edit?
Gabriel: yes, but transclusions would be separate request
it's a REST API, so it can be cached
I don't think it will affect performance, as metadata request can be async
save time is dominated by network transfer (no native gzip compression on POST, slower upload speeds)
size issue: escape quotes in HTML JSON attributes (data-mw)
Parsoid has clever way to minimize escaping by picking single / double quoted attributes dynamically
but there won't be a need for entity quote escaping in pure JSON, so total size will be smaller once data-mw moved out of DOM
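A minimal sketch of the quote-picking idea (a hypothetical helper, not Parsoid's actual serializer): since the JSON in `data-mw` is full of double quotes, wrapping the attribute value in single quotes usually avoids escaping entirely, and we only fall back to double quotes when the value contains more single quotes.

```javascript
// Sketch of dynamic quote selection for HTML attributes holding JSON (data-mw).
// Pick whichever quote character needs fewer escapes in the value.
function serializeAttribute(name, value) {
  const singles = (value.match(/'/g) || []).length;
  const doubles = (value.match(/"/g) || []).length;
  if (singles <= doubles) {
    // single-quoted wrapper: only single quotes need escaping
    return name + "='" + value.replace(/'/g, '&#39;') + "'";
  }
  // double-quoted wrapper: only double quotes need escaping
  return name + '="' + value.replace(/"/g, '&quot;') + '"';
}

const dataMw = JSON.stringify({ parts: [{ template: { target: { wt: 'echo' } } }] });
const attr = serializeAttribute('data-mw', dataMw);
// JSON contains only double quotes, so the wrapper is single-quoted, no escapes
```

Once `data-mw` moves out of the DOM into pure JSON, none of this entity escaping is needed, which is why the total size shrinks.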
Erik: so VE will try to implement parameters in Q1?
James: yes, once it's available from Parsoid
Erik: any parameters we can't do?
Gabriel: some assumptions about e.g. tags being closed
Scott did some grepping, IIRC this case was really rare, decided to go ahead
Erik: launch as beta feature first
Q4: Structured Logging
- GELF logging using bunyan
- stdout-based logging caused outage at one point when disks were full, so now will hook up logstash, making it easier to analyse issues & usage
Matt just set up a GELF sink feeding into logstash, it's working but we don't use it yet in production.
- Logging important: Even in current form, has exposed errors in production that have since been fixed
Q4: Language variants
- Work just started
- Next: Full editing support
need to simplify interface - very baroque
Q4: Wikitext linter GSoC project: Linttrap / Lintbridge
- GSOC '14 project (Hardik Juneja)
collect issues or just usage patterns from the wikitext content on the projects
collect statistics on which templates are part of such structures (e.g. unbalanced markup)
already collected a lot of such cases, but don't have good stats yet
cleanup for e.g. unbalanced would create dirty diffs, so would like to do this as a separate pass (to make clear why it changes)
- presents results with HTML & web service interface
- Future Plan
had some contact with en:WikiProject Check Wikipedia
e.g. foster parenting, font size
example: table in https://es.wikipedia.org/wiki/El_%C3%BAltimo_vals_(canci%C3%B3n)#Trayectoria_en_las_listas
Gabriel: quite a few can be fixed by Parsoid, by selectively disabling logic that currently un-does such fix-ups
Erik: so do you guys need CL support on such standardization?
Erik: have you worked with others at WMF on this outside the team, Subbu?
Hardik did some outreach to community
Erik: need to plan on this
need a technical Wikipedian for this
scheduled for Q1?
Subbu: probably start in Q1, complete in Q2
Q4: HTML Templating
- implemented Knockoff compatible templates
long term use for UI messages, and content templating: https://www.mediawiki.org/wiki/HTML_content_templating
Gabriel: interested in templating because UI messaging is a big part of render time.
both secure, balancing DOM, escaping attributes
Erik: so this is replacement for wikitext templates
Gabriel: The language for data access is different. Obviously JS is attractive because server-side client-side the same. Lua is also attractive because of data separation. We are not sure because the standard "Lua templating" library isn't that large a body of code.
Erik: Existing wikitext templates to target? What is success?
Gabriel: In the next quarter or two, nothing. Probably 2nd quarter will start. Might play on the side, but other things are higher priority. We've written an RFC with some goals (performance, functionality, …)
Erik: want to be able to express very complex logic in these?
Gabriel: yes, in code modules and data accessors, e.g. citation service...
aiming to separate data from presentation
in wikitext, this is quite embedded
Tassembly: JSON based assembly language https://github.com/gwicke/tassembly
Q4: Helped content translation team
they use Parsoid HTML
(earlier, there was actually a gadget someone wrote for Czech / Slovak)
Lila: who is the team?
Language Engineering (Roan: also David, who is part of VE too)
it's their big project currently
- Gabriel transitioning into services team
- hiring starting
- goal setting: Authentication RFC, REST API front-end, Storage services
- Infrastructure work
- pushing (Debian) packaging
- CScott has also been working on PDF rendering (with Matt)
- leverages semantic info in Parsoid HTML for massaging conversion into LaTeX on backend
- ready to deploy
Q4: Things we didn't get done
not done yet:
- language variants
- visual diffing (for HTML pageview testing)
- performance: more efficient template updates
- research: e.g. support for switching between HTML and wikitext edits
- research: content widgets
Q4: Stretch goals Q4
Parsoid enabled on all public WMF projects
(this slide is actually from the last qr ;)
- continue iterating on RT and render quality
- language variants
- Parsoid HTML page views
- Stable element IDs
- support for HTML wikis
language variants; parser support first, then editing
2014/15: more of the same:
long tail compatibility with the (old) PHP parser & Tidy combo
maintenance, node.js upgrade to 0.10, 0.12
GSoC, talks, conferences, etc.
2014/15 Q1: Language variants
CScott made progress
- use a bot to fix nesting issues that are hard to work with/render
Scott worked with User:Liangent from zhwiki
- finished editing support
- evolve longer term strategy
2014/15: Parsoid HTML pageviews
- HTML storage and content API (Services team)
- user agnostic HTML + client-side customization, e.g. for redlinks (Services, platform teams)
- any team willing to experiment with this
- Prototype by end of Q1
- Move to beta in Q2
- Work with Mobile on specific skinning
Parsoid HTML: pageviews
Erik: so only customer right now is VE (and offline exports)?
Subbu: And Flow
Subbu: not determined yet
Max: I'm here (from Mobile team) because of this
we are open to experimenting with this, probably starting with the Mobile view API
Erik: Is the app using Parsoid already?
Gabriel: intended several times, but never happened
Erik: so Mobile view API is the first candidate, and it's not affecting Mobile web view?
Max: There is an alpha mode… or had one but we killed it.
Erik: so it will affect either that alpha mode or nothing?
Juliusz: yes, but we also use the API to rerender the page after an edit
Max: which we probably shouldn't do...
Erik: but you see the app as the first use case?
Max: yes, also because with web, could easily bring cluster down
Erik: OK, but need to scope it a bit more (which app when)
Subbu: we can talk
Gabriel: also, get a simple desktop demo to see how it works
specific tasks for Parsoid HTML pageviews:
reduce HTML footprint
Max: how large is the difference when gzipped?
Gabriel: still significant, but the HTML can be stripped for views, ending up about 10% smaller than the current desktop HTML when uncompressed
long tail of rendering diffs, for example some infobox templates which emit table attributes & content, but not the table tag
Gabriel: linter work is kind of complementary to this, because it might allow to fix some hard issues in wikitext
2014/15: Q1-Q3: Stable element IDs
When content is changed, can we assign the same ID?
lets us associate metadata with content elements. Use cases:
- authorship maps (content provenance)
- efficient diffs
- inline comments
- content translation tracking
- Slim down HTML size but still support switching between HTML and wikitext
2014/15: Q2-Q4 support for HTML (-only) wikis
- needs HTML content templates
- content widgets for scenarios like navboxes, data tables
- HTML diffs (for revisions)
- abuse filters
2014/15: Q1 Tasks
- HTML views: work toward prototype
- Tidy support in parser tests
- visual diffs and reuse testing framework
- rendering accuracy, e.g. mixed-content templates used heavily in some infoboxes
- move data attribs to metadata storage
- PDF rendering
- Language variant support - should be done soon
- HTML for template parameters
- Wrap up wikitext linting GSoC project
- Resume work on stable ID support
- Hook up logging with logstash
- bugfixes, deploys
Erik: so team is Subbu, Arlo, Marc, Scott now? yes
I think you guys are making good progress for that team size
Thanks for the detailed update!