Wikimedia monthly activities meetings/Quarterly reviews/Parsoid/October 2014

From Meta, a Wikimedia project coordination wiki

The following are notes from the Quarterly Review meeting with the Wikimedia Foundation's Parsoid, Services and OCG (Offline content generator) teams, October 3, 2014, 12:30PM - 2:00PM PDT.

Attending

Present (in the office): Gabriel Wicke, Damon Sicore, Jared Zimmerman, Tomasz Finc, Tilman Bayer (taking minutes), James Forrester, Toby Negrin, Erik Moeller, Trevor Parscal, Terry Chay; participating remotely: Subramanya "Subbu" Sastry, Marc Ordinas i Llopis, C. Scott Ananian, Roan Kattouw, Arlo Breault, Hardik Juneja (from 1:40)

Please keep in mind that these minutes are mostly a rough transcript of what was said at the meeting, rather than a source of authoritative information. Consider referring to the presentation slides, blog posts, press releases and other official material.

Parsoid

Presentation slides from the meeting


Subbu:
Welcome

[slide 2]

Parsoid team: core and extended
help from others
teams that use Parsoid: also CX

[slide 3]

Agenda

[slide 4]

our objectives

[slide 5]

development context
came a long way since 2011-12

[slide 6]

why this is hard (skipped)

[slide 7]

Progress Q1
strive for similarity with [old] PHP parser's rendering
robustness: handle pathological cases
testing: compare HTML from PHP parser and Parsoid
visual diffing [compare rendered HTML pixel by pixel]
not implemented in Q1:
language variant support - didn't have Scott

[slide 8]

Continuous iteration
lots of edge cases

[slide 9]

Parser tests
tests run on every commit

[slide 10]

Round-trip test results
on 160k pages
now 0.16% - good progress
That's still 7k pages on ENWP though
Gabriel: but .. normally hides even these completely
with that, only 7 pages in that 160k have discrepancies, and only really small ones (an extraneous newline or such)
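
As a rough illustration of the round-trip idea discussed above, the following sketch classifies the difference between a page's original wikitext and the wikitext obtained after converting to HTML and back. It is not Parsoid's actual test code: the function names and the three categories are assumptions made for this example, and treating all whitespace changes as harmless is a simplification (newlines can be significant in wikitext). The helper at the end only reproduces the arithmetic behind the figures quoted in the minutes.

```python
def classify_round_trip(original: str, round_tripped: str) -> str:
    """Classify a wikitext -> HTML -> wikitext round trip (hypothetical categories)."""
    if original == round_tripped:
        return "identical"
    # Same tokens once runs of whitespace are ignored, e.g. an extra newline.
    if original.split() == round_tripped.split():
        return "whitespace_only"
    return "other"


def pages_affected(total_pages: int, rate_per_10k: int) -> int:
    """Pages hit by a failure rate given in units of 0.01% (integer math only)."""
    return total_pages * rate_per_10k // 10_000


if __name__ == "__main__":
    print(classify_round_trip("== A ==\n\ntext", "== A ==\ntext"))
    # 0.16% of the 160k-page test set
    print(pages_affected(160_000, 16))
    # Page count implied by "0.16% is still 7k pages on ENWP"
    print(7_000 * 10_000 // 16)
```

With the numbers from the minutes, 0.16% of the 160k test pages is 256 pages, and the "7k pages on ENWP" remark implies a total of roughly 4.4 million pages.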

[slide 11]

visual diffs

[slide 12]

visual diffs http://parsoid-tests.wikimedia.org/visualdiff/
repurposed testing infrastructure for visual diffs
on subset, because it's still quite expensive
Damon: why pixel-perfect accuracy?
(team: it's for detecting larger problems[?])
Trevor: differences come from use of, say, div vs. <image> tag[?]
Parsoid specific CSS to achieve that
Jared: template rendering?
Trevor: mostly images
[slide example: ...tournament (demos diff)]
Trevor: e.g. wrapping in paragraph vs. not
Damon: testing on different browsers? just Webkit
Gabriel: here, cross-browser differences aren't much of an issue usually
CScott: hoping I can reuse some of that for PDF rendering testing
Toby: scope of this testing?
Gabriel: currently only desktop, mobile might be next
Damon: (on relation with language support)
Damon: what would be the next thing you would do for testing if you had the resources to do something cool? (liked CScott's PDF idea)
Subbu: not sure
.. all langs
but visual diffing only for enwiki currently
CScott: this is most important to catch regressions
Damon: any HTML5 elements not covered by this?
Subbu: audio and video not supported yet
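
To make the pixel-by-pixel comparison mentioned in this discussion concrete, here is a minimal sketch, not the team's actual visual diff tooling. It assumes the two renderings (PHP parser output and Parsoid output) have already been captured as screenshots and decoded into equally sized grids of RGB tuples; the type aliases and the function name `diff_ratio` are made up for this example.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]   # (red, green, blue)
Image = List[List[Pixel]]      # rows of pixels


def diff_ratio(a: Image, b: Image) -> float:
    """Return the fraction of pixels that differ between two same-sized images."""
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        raise ValueError("images must have the same dimensions")
    total = sum(len(row) for row in a)
    if total == 0:
        return 0.0
    differing = sum(
        pa != pb
        for row_a, row_b in zip(a, b)
        for pa, pb in zip(row_a, row_b)
    )
    return differing / total


if __name__ == "__main__":
    white, black = (255, 255, 255), (0, 0, 0)
    php_render = [[white, white], [white, white]]
    parsoid_render = [[white, white], [white, black]]
    print(diff_ratio(php_render, parsoid_render))
```

A score like this is useful mainly as a regression signal across many pages: a sudden jump for a page points at a rendering change worth inspecting, which matches the "detecting larger problems" use described above.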

[slide 14]

Preparing for HTML5 page views

[slide 15]

still need to fix site CSS
e.g. citations [on enwiki]
can use gadget (en:User:Jackmcbarn/parsoidview.js) to check what Parsoid HTML looks like in production

[slide 16]

remove mw-data...
also to enable asynchronous saves

[slide 17]

make it more robust

[slide 18]

more robust
fixes after huwiki table incident last month

[slide 19]

performance
(shows load/memory graphs)

[slide 20]

arwiki/enwiki/dewiki parsing times in seconds
Gabriel: other direction (HTML -> wiki conversion) is much faster, 100ms avg

[slide 21]

(Subbu:)
Other
Gabriel: collaboration with WikiProject Check Wikipedia [to fix corner cases]
Damon: looked into fuzzing?
Gabriel: basically we use production data for fuzzing, but not randomly generated input
Subbu: no
[someone else:] we have real people fuzzing us ;)
Damon: fuzzing could help find issues like the large table issue on huwiki
CScott: last crash issue was a while ago
lot of the remaining issues are bugs from PHP parser [emulate or not]
Damon: any security issues?
JamesF: use it on some private wikis, but not an issue
Trevor: because we work on a DOM, escaping is more likely to be done right
moving away from string stuff with e.g. templates
Erik: things get scary with user-created templates
in wiki-markup, this is combination of wiki templates and Lua, both well sanitized[?]
looking into better ways for templates
Gabriel: (on sanitizing)
we tracked all PHP security issues
Damon: sounds good

[slide 22]

Subbu: Wikitext linter GSOC project by Hardik Juneja
[examples: [1]]
(slide 23: PDF renderer - see CScott's part below)
(slide 24: Things we didn’t get done in Q1)

[slide 25]

Broad areas for Q2

[slide 26]

Parsoid HTML views
cite CSS (Marc works on this)
HTML Tidy vs. HTML5 differences: matter in case of broken wikitext
mixed content style templates (e.g. one template opens a tag, another closes it - doesn't match with DOM)

[slide 27]

Q2 tasks: supporting clients
language variant support

[slide 28]

stable IDs (for elements, persisting across wikitext edits) - e.g. for authorship maps, inline comments, switching from VE to wikitext editor

[slide 29]

new applications
Linttrap...

[slide 30]

perf + maintenance
Thank you! (end of Parsoid slides)
Damon: do we have specific perf goals? like "<20millisec for ..."
Gabriel: has not been an issue so far because it's not [directly affecting users]
Damon: this looks great, I'm still learning... generally, I like to ask about: testing, defining what winning means
Erik: make Parsoid HTML view gadget into Beta Feature?
Subbu: in a few weeks
Erik: any really horrible examples left?
Subbu: some templates, infoboxes and such, would look bad
Erik: do we need cross lang visual diffing before that?
JamesF: cite extension's styling is a hack
Parsoid output is munged to look like enwiki for that
frwiki doesn't have [ ] around refs [but in Parsoid HTML they would look like on enwiki]
Erik: ...
CScott: currently using PhantomJS, old, might not translate well to e.g. non-latin languages
Erik: so let's restrict Beta feature to latin languages
Roan: Timo worked on unit tests
CScott: PhantomJS might have new release [which fixes that]


Services

Services QR slides Q2

Gabriel:
started at beginning of [Q1]

[slide 2]

had discussion in January, resolved to move to services model
scale storage - some parts like external store have been band-aid, hitting limits

[slide 3]

(perf graphs)
PHP API, uncached, we have to ask people not to do expensive things
can tie up one instance [for a longer time]
but is powerful for editing

[slide 4]

most of our app servers are busy processing cache misses
edit rate relatively low (~15/s avg, 50/s peak across all projects)
24 Parsoid boxes can keep up with updates (but use PHP API)
Damon: that's enwiki? no, all projects
Toby: 95% or 99%?
Gabriel: might be >95, yes
we have a long history of caching
it was a successful strategy
Trevor: but doesn't work well for people logging in
more strategy for personalization
Damon: ...
Erik: might shift some of that [personalization] to client, with async [modification]
[personalization is] hard with any caching
JS is a good mechanism for UI customization
example:...
Gabriel: also, stable IDs enable some customization
Roan: also, logged-in users always get data from Virginia instead of from caching center that's closest to them
Damon: every app server is on the same cluster? yes
Erik: Dallas is almost exact replica of Virginia, aim for failover of app server cluster
but there's interest in using it for load balancing etc. too
Damon: are we deploying with an eye to isolation and anonymization?
hearing Virginia is a bit distressing...
(discussion about hosting)
Gabriel:
distribution means some security challenges. don't trust the code; have had remote code execution & SQL injection bugs... e.g. should image scaling servers have full access to the user table?

[slide 5]

Resourcing: Matt left in July
Hardik
hiring

[slide 6]

Original plan for Q1

[slide 7]

RESTBase

[slide 8]

misc backend services
Erik: do we have citoid.wikimedia.org now? no, it's internal
Roan (on chat): Citoid isn't deployed yet, Alex -1ed my patch. But it is deployed in labs and getting it deployed in prod is just a question of me splitting up my patch into 3 parts to make things less scary for Alex
(Gabriel:)
Mathoid:
JamesF: had bit written in OCaml, scary!
CScott: I actually removed it recently
Gabriel: working with MathJax people
Damon: so we use Latex how?
Gabriel: it's the source, within wikitext
Erik: has been supported in wikitext for a long time, used to generate PNG
issue: client-side latency
they (volunteers) implemented server-side version of MathJax
Gabriel: open question: who owns MediaWiki integration?
Erik: how about you ;)
Gabriel: Need front-end expertise.
Damon: we should support the math folks
--break for hydrant breakage--

[slide 10]

Gabriel:
templating: Matt and myself, make DOM-based templating with context-sensitive escaping fast
idea is to complement existing reactive client-side solution
Knockoff: fast on server side; complements KnockoutJS on client
well suited for server-side pre-rendering followed by client-side dynamic updates
did benchmarks across languages, it's fast
front-end standardization group now working on this
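
For readers unfamiliar with context-sensitive escaping, the idea mentioned above can be sketched as follows. This is not Knockoff's implementation or API; it only illustrates why the position of a value in the document (text node, ordinary attribute, URL attribute) determines how it must be escaped. All function names and the scheme allow-list are assumptions for the example, and a real implementation would need to cover more contexts.

```python
import html
from urllib.parse import urlsplit

# Assumed allow-list for link targets; "" covers relative URLs like /wiki/Foo.
SAFE_SCHEMES = {"http", "https", ""}


def escape_text(value: str) -> str:
    """Escape a value placed inside an element's text content."""
    return html.escape(value, quote=False)


def escape_attr(value: str) -> str:
    """Escape a value placed inside a double-quoted attribute."""
    return html.escape(value, quote=True)


def escape_href(value: str) -> str:
    """Escape a link target, replacing URLs with unexpected schemes by '#'."""
    scheme = urlsplit(value.strip()).scheme.lower()
    if scheme not in SAFE_SCHEMES:
        return "#"
    return escape_attr(value)


def render_link(href: str, label: str) -> str:
    """Render an anchor, escaping each value according to its context."""
    return '<a href="{}">{}</a>'.format(escape_href(href), escape_text(label))


if __name__ == "__main__":
    print(render_link("/wiki/Foo", "Foo & <b>bar</b>"))
    print(render_link("javascript:alert(1)", "click me"))
```

A string-based template would typically apply one escaping function everywhere. Knowing the context of each value lets the template engine pick the right rule automatically, which is the property Trevor alludes to in the Parsoid discussion ("because we work on a DOM, escaping is more likely to be done right").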

[slide 11]

Q2
RESTBase, hiring

[slide 12]

Q2 RESTBase
Erik: is this the first use of Swagger? yes
did you consider this for RC stream too? yes
Erik: does Moiz know?
Gabriel: yes
Wikia uses it too
build generic & basic monitoring for backend services (latency, errors)
new entry points, not systematically exposed in API currently
variant production on edits (produce offline and store)
need reliable queueing / events
for purging, etc.
Damon: is there also unreliable ...
Gabriel:
we have a homegrown job queue system, tied closely to PHP
forces you to write your own job runner, doesn't let you do reliable pub/sub
was in bash, recently rewritten in PHP
Analytics team has a lot of experience with Kafka, might use that

[slide 13]

Q2: perf

Toby: this is on SSDs?
Gabriel: will deploy to SSDs, when it matters
Cassandra
eliminate cache misses
for Parsoid, also e.g. old version
VE fast saves - that's mostly in VE land
JamesF: we have some ideas ;)
Gabriel: send HTML, but don't wait for rendering to come back, instead use HTML you already have
logged in page views faster
just some customization, e.g. links underlined

[slide 13]

security
for most read requests, don't need to go to database for auth
first phase implementation in cookies
also consumed by MediaWiki
Erik: has Wikia done any work on auth?
Gabriel: yes, recently talked with them

[slide 15]

Tooling
takes several days per service, lots of bugs in deployed versions of salt & trebuchet
duplication of init scripts etc. between packages and different puppet setups
should see with RelEng & Ops teams if we can streamline the process for services

OCG (PDF rendering)

Presentation slides from the meeting


CScott: OCG (Offline Content Generator)

[slide 3]

Global South

[slide 4]

history

[slide 5]

mwlib: not well maintained
lot of bugs, especially non-latin

[slide 6]

2014 so far
in production this Monday

[slide 7]

Q2 goals
bus factor 1.5
tables and infoboxes missing
Indic languages
had to turn off ePub and ZIM
but already have plaintext as beta

[slide 8]

next-gen PDF renderer
current version of PhantomJS outdated
Print CSS?

[slide 9]

[slide 10]

in production now, ~110k req / day

[slide 11]

cache hit rate still only 25%
Thursday: Indic languages turned on

[slide 12]

graph
cache clearing

Erik: let's sync up on roadmap, defining what we can provide and what will need to be taken up by community
suggest base commitment: reasonable rendering quality for all, extra effort for some large services like Kiwix, rely on volunteers for the rest
CScott: all my work recently was on production issues