Wikimedia monthly activities meetings/Quarterly reviews/Architecture, Operations, Release Engineering, Services, and Security, April 2016

Notes from the Quarterly Review meeting with the Wikimedia Foundation's Technology II: Architecture, Operations, Release Engineering, Services, Security teams, April 14, 10:00 - 11:30 AM PT.

Please keep in mind that these minutes are mostly a rough paraphrase of what was said at the meeting, rather than a source of authoritative information. Consider referring to the presentation slides, blog posts, press releases and other official material.

Architecture

Slide 2 - Architecture cover slide

Rob: I am the architecture dept, running the virtual team responsible for the arch committee. We do not have any KPIs for the arch group. I think that's OK, everybody has a lot of their own KPIs they are trying to achieve.

Slide 3 - Lending ArchCom authority

aka piloting Rust process

Rob: 1st goal: Pilot the Rust-style process. (Rust is a programming language by Mozilla; it was a tool to rewrite Firefox.) Gabriel has been observing the Rust process and thought, why can't we do something like that?

I have been trying to work out what it would mean for the foundation

ArchCom said yes, we should give this a shot. We do not have a sub-team yet.

On the plus side, the process has the idea that each RfC has a shepherd. Using this we have a better way to talk about what the next steps are and whether something is blocked or only on the back burner.

This collab is super complicated.

Slide 4 Improve RFC documentation

Rob: Having better documentation, people understand what to expect, that meetings are happening and there is process.

The work will continue (it will never be done).

I had one more goal about renaming but that was not done.

Katherine: I'm interested in whether there are next steps from the roadmap. What is this work preparing for?

Rob: This is leading to consensus based conceptual integrity (see: <https://www.mediawiki.org/wiki/User:RobLa-WMF/CBCI>)

Conceptual integrity is making a system that makes sense. Conventional wisdom is that you have a single person responsible as visionary to make sure system fits together. To do that in a consensus oriented way is challenging but necessary.

How do we do this in a group way where no single person is laying down the law? That's the long term goal.

Immediate steps: when someone has a good idea, how do they get support?

What we are looking at is "what does it mean to be consensus based?" Are we looking for an advice process? We've been talking about it in the ArchCom but still haven't decided if it is advice or authority that we are looking for.

Geoff: who is in the ArchCom?

Rob: Is MediaWiki the center of what the ArchCom is about, or is it Wikimedia software specifically? It's not clear.

The formal authority of the ArchCom is +2 access - the ability to commit code that goes live on site quickly. They also have the ability to take away someone's +2 rights. It's very rarely been done, but they have the trust to do it. (after meeting note: see <https://www.mediawiki.org/wiki/plus2>)

Geoff: who is on the committee?

Rob: Gabriel, Tim S, Daniel Kinzler from WMDE

Mostly staff, dominated by staff (answer: https://www.mediawiki.org/wiki/Architecture_committee#Members )

Operations

Slide 5

Mark: last qtr was fundraising, uptime is fairly similar to last qtr.

Slide 6 - DR testing

Mark: Biggest goal was to test the Dallas datacenter. The secondary datacenter should be able to be primary. We have had data backed up there for a while and the equipment was there, but it wasn't tested and fully ready. We haven't actually served MW traffic from there, and there was still a fair amount of work needed. The goal this quarter was to do that. Largely completed, but the timing fell at the end of the quarter. Planned originally for March. We chose the week before Easter, then hit a few unexpected setbacks (e.g. job queue, and some security infrastructure). The backup date was used: it is planned for next week. We did switch over all of the other services in a test. MW is the most complicated part. One thing we should have done: planned the exact date before planning the goal. Also, we should have planned the comms plan. Sherry Snyder jumped on it (many thanks to her).

We will work on completing the goal next week and follow up on learnings on the process.

Slide 7 Labs dashboard

Mark: Move dashboard from homegrown solution to Horizon (used by upstream). Most important parts moved to Horizon. Users can now create new systems, can access DNS, http proxies. All done by Horizon: we almost didn't make it, but Alex Monk (Krenair) stepped in and gave his front-end support for the work.

Slide 8 Migrated from Ubuntu 12.04 to Debian Jessie

Mark: we have 1400 servers running multiple versions of Linux. We have to keep getting deprecated versions out. Some systems were harder to migrate. Some need to be done one by one. We were able to meet goal to do over 60. We will continue to work on this but will make a KPI: systems running on deprecated SW.

Slide 9 - Monitoring

Mark: This is a follow-up of a goal from last quarter which we had missed. So we started with what was needed, which depended on work not completed. Monitoring doesn't scale anymore and needs to be modernized.

Since this was a follow-up goal, we decided mid qtr to redefine and reduce scope, which we did reach (evaluate solutions).

We're not going to make it a goal for next qtr because we don't have resources, but hope to rectify with hiring.

Slide 10 Varnish/caching

Stretch goal. Upgrade our caching. We've been running on Varnish 3 for a long time. Finally upgrading to Varnish 4. Lots of custom code to support Wikipedia Zero and analytics infrastructure. There's a lot of tech debt to solve before upgrading. This was on the roadmap for over a year. We have a new ops eng hired who did the entire goal. We also got support from Luca - new hire on analytics team - to help unblock issues on analytics infrastructure. It got done in the end. We are following up on the goal with more migrations. Next qtr we should have everything running on Varnish 4.

Slide 11 other successes and misses (1 of 2)

We tend to have a lot of work outside of our goals so we listed accomplishments.

Slide 12 other successes and misses (2 of 2)

More accomplishments

Slide 13 Metrics

Main KPI: availability. We need everyone's help to keep this number up. Slightly lower this quarter. Several audiences. Strategy process was about connecting to them. Reader, Contributors, Partners, Donors.

Katherine: difference between partners?

Mark: movement partners are people accessing our xml dumps (people working with our data)

external partners - rest of the world

It's admittedly hazy, we could make that distinction clearer

Questions?

Wes: Mark did a lot of work around annual planning this quarter while keeping the team working and 99% uptime. He did a really good job of making this work.

Mark: thank you. there's a lot to keep track of.

Geoff: You have service readers, and content contributors. How do you distinguish these groups?

Mark: e.g next week, we're going to have to make the site read-only. Editing stuff is downtime for editors, but not for readers. There could be
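The per-audience distinction Mark describes can be put into numbers. The following is only a rough sketch: the outage and read-only durations are hypothetical figures chosen for illustration, not values from the meeting or the slides, and the `availability` helper is not an existing tool.

```python
def availability(total_minutes, downtime_minutes):
    """Fraction of the period during which the service was usable."""
    return (total_minutes - downtime_minutes) / total_minutes


QUARTER_MINUTES = 91 * 24 * 60  # a 91-day quarter, in minutes

# Hypothetical figures, for illustration only.
full_outage = 30       # minutes when nobody could use the site
read_only_window = 45  # minutes when reading worked but editing did not

# Readers are only affected by the full outage.
reader_availability = availability(QUARTER_MINUTES, full_outage)

# Editors are affected by the full outage and by the read-only window.
editor_availability = availability(QUARTER_MINUTES, full_outage + read_only_window)

print(f"readers: {reader_availability:.4%}")
print(f"editors: {editor_availability:.4%}")
```

Under these assumptions the same quarter produces two different availability figures, which is why a single uptime number may need to name the audience it applies to.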

Katherine: Thank you, Mark. I understand work on switching over to Dallas is significant and it's important to call out judicious decision making and flexibility when it came to delaying the switchover test. Great planning and judicious decision making.

Release Engineering

Slide 14 Release Engineering

KPI - amount of time it takes to get stuff merged. YoY still going down.

Slide 15 Consolidate deploy tools

Consolidate deployment tools. Scap3. We didn't complete this. We realized it would be easier to migrate non-MW services, so this quarter we implemented a few features needed for migrating things. This should continue into next quarter.

Slide 16 retiring Gerrit

Not completed. We wanted to integrate Differential into CI.

A lot of consensus was built at the dev summit, but the RFC is still in progress.

Stretch goal: get 1 early adopter per team. We didn't reach this but got one early adopter.

Overall goals were too large for one qtr. Retiring gerrit is not a single qtr project.

Slide 17 reduce CI wait time

Nodepool migration. Not done for _all_ CI jobs, but we did it for the npm ones. Other ones are in progress. User:Paladox helps a lot with this stuff.

Newer version of HHVM

Slide 18 other successes and misses

(read off long list on slide)

scap3 - necessary for large binaries

Slide 19 metrics

Phab upgrades every other week. Most people don't notice downtime from that.

MW SWAT deployments Mon-Thu. Rolling responsibility - less stress, we can shift that responsibility.

Slide 20 metrics slide 2

CI - trying to make this as self-serve as possible. A lot of churn in this. MW Selenium - two releases of that this quarter; test cleanup.

Slide 21 SPOF tracking / skill matrix

We keep track of all of the things we are responsible for. Trend is very positive. There are still areas where we have SPOFs.

Questions for RelEng

Katherine: thank you for supporting your team. still learning this area. I liked the skill matrix, I really appreciated that.

Chad: we had an offsite prior to the Lyon hackathon. Pairing, making things a rolling . Greg? came up with it. just tracking

Rob: that has Greg's fingerprints all over it :-)

Katherine: are there plans to roll this over into next quarter?

Chad: scap and differential migration. lots of unknown unknowns. we have a lot of good momentum

Security

Slide 22 Security

KPI 2.25 people. (Chris & Darian) Brian Wolff is a contractor working for us as well.

We made a little progress over last qtr on our KPI.

Slide 23 2FA for CentralAuth

Chris: We had one goal: 2 factor auth on wikis.

The easy way to do this (we thought) was to build off Reading Infrastructure's AuthManager. AuthManager was not ready in time, so we moved ahead with integrating with CentralAuth directly.

We had more security bugs reported this qtr, so we were interrupted more.

Goal: will be rolled out soon.

Slide 24 - other successes and misses

RFC on meta to up password requirements to 8 characters.

Missed one security review

Slide 25 metrics

Chris: we do a lot. we can't narrow focus because it's Security.

We're either getting worse at writing SW or better at reporting security flaws.

We're finding more issues that are of less severity. We did find one big security flaw.

Geoff: I know you all are working hard on this, but do we really know if we should be worried?

Chris: yes, we don't really know for sure

We could do better with security testing tools, we could have bug bounty programs. It's hard to say how much our threat is increasing. The amount of data we are storing is exploding so the impact of compromise is increasing.

Gabriel: do you have plans to reduce the impact of compromises?

Chris: yes, working next quarter on Auth Service, to better protect sensitive data if mediawiki is compromised (T120484)

Katherine: Do we have more eyeballs on code? (we're getting more bugs)

Chris: I don't know.

Rob: One thing that would be nice is to do postmortems after a security problem is found so teams learn how to write SW without security issues. Right now Chris does review at the end of the process. By incorporating learning from mistakes, we'll get better.

Geoff: Do designers have security knowledge?

Rob: not necessarily. all teams would tell you that having more people involved early would be better. Should we have more people involved at the start, yes but that's a balance.

Wes: some of these things are addressed in the annual plan.

Chris: some PMs I talk with regularly, some I don't, it varies drastically.

Chris: We're only able to complete a portion of the security reviews requested each quarter. Sometimes what we have to drop is issues from the community.

We work with a lot of other teams as well.

Slide 26 metrics part 2

(redacted)

Slide 27 bug counts

(redacted)

Q&A

Geoff: we seem to need more here

Chris: training, scanning, we need people in the middle

Gabriel: changing infrastructure can help

Ori: we're going to be tied to MW for the foreseeable future

Chris: there are things we can do

Ori: two suggestions: bug bounty programs? There should be strong collective voice to discourage unnecessary collection of data. we could get by with a lot less data.

Chris: Bug bounties: totally agree it could be a great method (redacted conversation).

Services

Slide 28 - Services

Gabriel: 4 people. Core tasks. Usually we experiment more. KPIs: total requests to REST api. Refined the metric.

Mark: you are only testing the Varnish layer at the moment

Gabriel: yes, we need to look at that

Slide 29 REST API request rates

Gabriel: Increasing traffic: Increase due to Android app rollout

Slide 30 - Goal: REST API expansion and documentation

Gabriel: Documentation index page created. Service template created, used in mobile content services. API policies

Second sub goal: building out the API, targeting high traffic endpoints.

Integrating better with caching. Created RFC for versioning

Issue we ran into repeatedly: lots of APIs request a specific size for thumbnails. We need to let the client select the size. We have an RfC on this.

Slide 31 - REST API documentation (Swagger) screenshot

Gabriel: Swagger specs drive this. Automatically generates this, creates a sandbox.

Slide 32 - Scaling storage

Gabriel: We are storing HTML for all Parsoid. It takes a lot of space and cost benefit algorithm is not there. We noticed people edit one line at a time. There is a lot of repetition. We experimented with compression and found one algorithm - Brotli - that has a large window and compresses by a factor of 5. It also saves CPU, but increases mem usage. We built a patch for Cassandra and are working on upstreaming it. So we could store the entire history in HTML.
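Why a large window matters when revisions differ by one line can be shown with a small experiment. This is a sketch under stated assumptions, not the Cassandra patch: Brotli is not in Python's standard library, so `lzma` (large dictionary) stands in for a large-window compressor and `zlib` (32 KB window) for a small-window one. The synthetic "revisions" are random hex lines, not real Parsoid HTML, so the sizes printed say nothing about the factor of 5 mentioned in the meeting.

```python
import lzma
import random
import zlib


def make_revisions(n_revisions=10, n_lines=2000, seed=0):
    """Build synthetic page revisions that each change a single line."""
    rng = random.Random(seed)
    lines = ["%032x" % rng.getrandbits(128) for _ in range(n_lines)]
    revisions = []
    for _ in range(n_revisions):
        # Simulate an edit: replace one line, keep everything else.
        lines[rng.randrange(n_lines)] = "%032x" % rng.getrandbits(128)
        revisions.append("\n".join(lines).encode("ascii"))
    return revisions


revisions = make_revisions()
history = b"".join(revisions)  # all revisions stored back to back

# zlib can only look back 32 KB, less than one revision (about 66 KB here),
# so it cannot reuse the previous revision's content.
small_window = len(zlib.compress(history, 9))

# lzma's default dictionary is far larger than the whole history,
# so every later revision is mostly a reference to the earlier one.
large_window = len(lzma.compress(history))

print("raw bytes:         ", len(history))
print("small window bytes:", small_window)
print("large window bytes:", large_window)
```

The design point is only that the compressor's window has to span at least one full revision before the repetition between adjacent revisions becomes visible to it.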

Second was moving latest Content API to the edge (closer to user). We introduced a new storage format and increased throughput of API.

Right now when you load an older version of a page, it's rendered on demand because it is not stored. Storing this would reduce the latency to get this.

Maggie: what about being able to view the history because the community is interested in attribution?

Gabriel: this will help.

Slide 33 - 3rd objective: Reliable event production & change propagation

Gabriel: Propagating events when editing. Orchestrating everything that needs to change as the result of a change being made.

Job queue is not the most reliable system right now.

This qtr we finished EventBus and made it multi datacenter ready. We can also handle failover.

The second half, making event production reliable, is delayed.

Slide 34 Other successes and misses

Multi-DC support

A lot was built for this already. It was straightforward to do the failover testing. Latency increased, but by less than 100ms. There were no user issues.

Reliable deploys and testing

Good test coverage and a good deploy scheme. So we didn't have any outages that we're aware of.

API result format versioning

Lets us move forward without breaking things.
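One way result format versioning can avoid breakage is for the client to state which format version it understands and for the server to answer with a compatible one. The notes do not describe the actual scheme, so the sketch below assumes semver-style version strings and a hypothetical `pick_format` helper; it is an illustration of the idea, not the REST API's mechanism.

```python
def parse_version(text):
    """Turn 'major.minor.patch' into a tuple of ints for comparison."""
    major, minor, patch = (int(part) for part in text.split("."))
    return major, minor, patch


def pick_format(requested, available):
    """Return the newest available version compatible with the request.

    Assumption: a version is compatible when it has the same major number
    and is at least as new as the requested one. Returns None otherwise.
    """
    want = parse_version(requested)
    candidates = [parse_version(v) for v in available]
    compatible = [v for v in candidates if v[0] == want[0] and v >= want]
    if not compatible:
        return None
    return ".".join(str(n) for n in max(compatible))


# Hypothetical set of format versions a service could still produce.
available = ["1.2.0", "1.3.1", "2.0.0"]

print(pick_format("1.2.0", available))  # an older client stays on major version 1
print(pick_format("3.0.0", available))  # no compatible format is available
```

Under this assumption, a new major version can ship alongside the old one, and existing clients keep receiving output they can parse.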

Slide 35 workflows & metrics

Lots of guiding and mentoring. e.g. Math work is close to ready (already an option)

Trying to help other teams get the skills they need. Citoid. Cassandra pageview API.