Wikimedia monthly activities meetings/Quarterly reviews/Infrastructure and CTO, October 2015
Please keep in mind that these minutes are mostly a rough paraphrase of what was said at the meeting, rather than a source of authoritative information. Consider referring to the presentation slides, blog posts, press releases and other official material
Present (in the office): Abbey Ripstra, Leila Zia, Kevin Leduc, Terry Gilbey, Gabriel Wicke, Chris Steipp, Wes Moran, Greg Grossmeier, Tilman Bayer (taking minutes), Grace Gellerman, Daisy Chen, Guillaume Paumier, Ori Livneh, Chad Horohoe, Boryana Dineva, Geoff Brigham (from 9am), Lila Tretikov (from 9:17); participating remotely (via BlueJeans): Marcel Ruiz Forns, Mark Bergsma, Giuseppe Lavagetto, Joseph Allemandou, Dan Andreescu, Marko Obrovac, Nuria Ruiz, Eric Evans, Luis Villa, Antoine Musso, Faidon Liambotis and others (up to 28 remote attendees overall)
Several huge tasks worked on in first two months, but only concluded in September - hence higher velocity for September
Pageview API will be deployed this week
reasons for miss: one DBA left, etc.
Eventlogging uses Kafka
Terry: does this mean it's now infinitely scalable by just throwing more hardware on it?
Marcel: no, there is probably still some limit
Terry: do we know what that limit might be
(Marcel:) no we have not tested that, but our tests confirm it is more than 10k ev/sec
miss: we could not test load in our testing environment
Terry: do you have what you need to do a load test (for Kafka) next time?
Marcel: yes, decided to get test environment that allows some minimal load testing
Terry: budget needed?
Terry: is there one thing ask from org to make you more successful?
Marcel: would need to think about that question
Kevin: we get a lot of ad hoc requests for data
this is already in progress, but: transferring these to Reading (for traffic data) and Editing (for editing metrics) would help a lot
the three categories (strengthen/focus/experiment) didn't quite make sense for our team, so I added "Maintenance/interrupt"
KPI good enough for us now
errors went up, which is not good, but it means that others created buggy code, that we have to deal with
Continuous Integration wait time: time CI jobs spend in queue before running
Terry: so the red line going down is good; what about the others?
Greg: red is just busy (resources)
Restbase not migrated by eoq - but all the other things got done
did not get Gerrit migrated, but prepared the basic stuff
Ops and us working on rest
Skill matrix: distribute knowledge better within team
Wes: this idea is really cool
Ori: this is self-assessment?
I guess we could do group assessment
Terry: this is a great map. even if self-assessment is not perfect, it's a great starting point
Greg: also "what we do (that we can help you with)" table on mediawiki.org
Terry: so this is the start of a menu of services
will also help avoid SPOF
--> Terry: figure out action items for Phabricator improvements, and see to fund them
KPI for us is API usage
moderate increase, no major endpoints that have switched over yet
Terry: req = 1 transaction through the API?
Gabriel: means HTTP request
a lot of usage is also on services level, e.g. by Google
mobile team used to do a lot of content massaging (themselves)
now moved into service
both for iOS and Android
also, it's a lot quicker
proper caching prepared, not enabled yet
discovered that mobile web has actually very similar needs,
for work on supporting slow (2G) connections
scale restbase, across data centers
change / event propagation
e.g. Wikidata has to work around this, can't check (directly) where a Wikidata item is actually used
Wes: have you synced with Discovery team on this?
Gabriel: we will, not firmed up yet
I don't expect this to happen before 2nd half of coming quarter
HTML save: not used yet, but will hopefully enable a lot of innovations like microcontributions
test coverage and deploy process helped avoid deployment outages (which would be really bad, e.g. edit corruption for VE)
good reuse of node.js service
Wes: how do you know [Cassandra sizes] got too large?
(Gabriel:) monitoring now
Gabriel: elevated latency over several days
Wes: how do we communicate that (kind of issue)?
Gabriel: worked with Ops - it did not actually stop working, but higher error rate
went to discussion pages for VE etc. told people there was an issue and being fixed
Mark: KPI: basically same as last q
want to point out that all of Engineering has part in that - please help us to keep it up ;)
successful failover of eqiad cache traffic, no one noticed ;) had to do some procurement for that
so goal technically missed, but 99% there
leasing is a new process, collaboration with Finance, took longer
that made things a bit hectic in the end (but wasn't the eventual cause of missing this goal)
learn: allow time for dependent work
Terry: so did we get over the learning hump for leasing?
Mark: should have that sorted out now, it was due to initial contract review only.
Mail tech debt
TLS encryption for email requested by Legal team
redefined OTRS goal after reviewing users' workflows etc with CA team
did meet redefined goal
there are some servers where we don't actually want/need firewalls, e.g. because of perf or because they are protected by other means
Terry: do we hav pen test scheduled?
we had an internal one last q
Mark: and had an external one at end of last year
--> Lila: would be good to have a policy to do pen tests once or twice a year
Mark: (Labs reliability)
last quarter, tried to improve uptime of Labs and failed disastrously
better this quarter
NFS was unnecessary for many projects
ToolLabs is on cluster environment (Gridengine) that was not quite made for this use case
evaluated two different environments and moved two tools over for testing
Mark: see appendix, also for Victor's new server photos
--> Lila: with the [complex] engineering projects that are coming along, I see need to think more about planning further ahead
sync earlier [between Ops and other teams] on complex scenarios
Abbey: need to work on KPI, suggestions are welcome
--> Lila: I will try to help (to come up with KPI)
Abbey: collaborate with other teams
4 examples: ...
hard to provide numbers as we often depend on other teams
pragmatic personas build on existing data, e.g. from surveys, academic research
Lila: how many personas do we get out of this?
Abbey: e.g. new editors,... [as example of personas]
Terry: did we only invite 12 editors?
Abbey: no, many more
Terry: show up rate?
would need to look up, but low
Wes: could CL help with that?
Abbey: yes, working with them
participant recruiter helps us with processes like mass emailing etc., brings a lot of experience with that
Voice and Tone: lots of Phabricator tickets on how UI communicates with users
e.g. "edit" vs. "edit source" - means different things to community members and others
Lila: this is really important
this language is anchoring (expectations), just like (graphical) design elements do
Abbey: misses: switching costs are high for us when product teams reprioritize
Wes: was the annual roadmap (from offsite) a good tool?
Abbey: it was really useful for seeing what other teams are going to be doing
but not pertinent to this issue here
Terry: so message you earlier?
Abbey: yes, e.g. don't want to start recruiting, build prototype etc and then have to shelve it because of plan changes
planning work with U Washington students
Leila: I'm filling in for Dario
no KPI, open to suggestions, but team does very diverse work
Lila: is ownership moving to editing team?
Leila: need to check with Aaron, but it's productionized
--> Lila: I worry about impact of revscoring, need sit down and see how it's going to be used
Dario and Aaron probably should talk to Trevor etc.
KPIs should be same as [or related to] top level KPIs
e.g. productivity of editors
so maybe multiple KPIs
Wes: Discovery also wants to use it, started to talk to Aaron
Lila: maximum impact for editors; should be able to say e.g. "caught 95% of spam on enwiki thanks to this tool"
Article recommendation tests
CX needed bug fixes for PL WP, PL community asked to hold off on test for that reason
measuring value added
decided to defer this, because revscoring needed a lot of help
article recommendation improvements (stretch goal)
Misses: data analysis requests
--> Lila: for everyone in this room: clarify who owns what
e.g. Analytics team's job is not to provide data for everybody
Wes: Dario did good work on that project
Lila: great job on editor impact
would great to also work on reader engagement
Leila: planned for Q2
Terry: Ori has the only KPI that I like a minus in front of ;)
Ori: page save time
attained goal in mid August, but hit by regression then
Lila: do we have test runs in CI?
that stuff should be caught on nightly basis
Ori: first paint time < 900ms
this was a stretch goal
saw presentation by Paul Irish on importance of being under 1 sec
so aimed for that
Lila: this is awesome
--> Lila: blink of an eye is 200ms, we should aim for that [for first paint time]
Ori: first paint time vs. signups per second; interesting, should look closer
proud of hires
multi dc project, RfC - I think Ops enjoys working with us on that
Mark: absolutely, thanks for picking this up
Chris: security bugs are not public, so looking for better KPI
Automated Dynamic Scanning
Metrics for Security Bugs
other successes and misses:
not enough time scheduled for Analytics cluster assessment, but found several relevant things already
most of our work is core
working with Comms to better communicate post incident
--> Lila: consider #security incidents as KPIs
Lila: thanks everyone