Wikimedia monthly activities meetings/Quarterly reviews/Architecture, Operations, Release Engineering, Services, and Security, July 2016

Notes from the Quarterly Review meeting with the Wikimedia Foundation's Technology II: Architecture, Operations, Release Engineering, Services, Security teams, July 14, 8:00 - 9:30 AM PT.

Please keep in mind that these minutes are mostly a rough paraphrase of what was said at the meeting, rather than a source of authoritative information. Consider referring to the presentation slides, blog posts, press releases and other official material.

Present (in the office): Heather, Gabriel, Michelle, Ori, RobLa, Jaime, Madhu, Katherine, Joady; participating remotely: Darian, Faidon, Greg, Maggie, Mark, Nuria, Chad, Chase, Wes

Backup Datacenter

  • Wes: Great rollout. We had to make adjustments and moved it out a quarter to be prepared and do this well. Good learning and improvements.
  • Katherine: reiterating last session; huge accomplishment for the team, community members were very positive; confidence in the results; board appreciated the work

Architecture

Implement ArchCom Subteams

  • Rob: Goal is to scale ArchCom with sub-teams. Hoped to spin up at least one (security), but not fully done yet.

Document RFC

  • Improving process. No single big improvement to call out. Status page for RFC shepherding.

Develop Fellowships

  • Developing fellow program. Did not create any controversy, but also wasn't really discussed widely.
  • Background: Idea is to free up senior engineers by giving them a tenure-like status, with a lot of freedom.

Technical operations

  • 17 FTE, plus some ops engineers embedded in other teams
  • Main KPI is availability, improved slightly to 99.987%.
  • Katherine: congratulations on a job well done (fail-over). Presented at Wikimania, well received there. Choice to delay was the right choice.
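
To put the availability figure in perspective, the arithmetic below converts 99.987% into implied minutes of unavailability. The minutes do not say which window the KPI was measured over, so the 91-day quarter and 365-day year used here are assumptions, and the function name is only illustrative.

```python
def allowed_downtime_minutes(availability_pct: float, window_days: float) -> float:
    """Minutes of unavailability implied by an availability percentage over a window."""
    unavailable_fraction = 1.0 - availability_pct / 100.0
    return unavailable_fraction * window_days * 24 * 60


# Assumed windows: the notes only give the percentage, not the measurement period.
quarter_minutes = allowed_downtime_minutes(99.987, 91)
year_minutes = allowed_downtime_minutes(99.987, 365)

print(f"{quarter_minutes:.1f} minutes per 91-day quarter")
print(f"{year_minutes:.1f} minutes per 365-day year")
```

Under these assumptions, 99.987% corresponds to roughly 17 minutes of unavailability per quarter, or about 68 minutes per year.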

HTTP/2 support

  • Needed to move to HTTP/2, as major browsers like Chrome dropped SPDY support.
  • Switched successfully on May 4th.

Objective: Varnish 4 migration

  • Moving away from Varnish 3. Complicated transition.
  • Encountered issues, but solved them.

Objective: Tools on k8s

  • Target is Tool Labs, the community project running a lot of small projects.
  • Replaced legacy system with Kubernetes. Needed to learn a lot about how users use our platform.
  • Katherine: is this aligned around bd808's work?
  • Chase: yes, he's basically an integrated member of the team. Ideally transparent to the team. Explains things to users. Honorary member of the team.
  • Katherine: great to see cross-team work
  • Gabriel: K8s work is providing us stuff that we might want to use in production
  • Katherine: Looks like a very busy quarter. Seems to have been a very satisfying way of coming into the quarter with everything that was in play last quarter

Other successes and misses

  • As usual, lots of large & small things that didn't make it to official goal status.

Other successes and misses

Core workflows and metrics

  • No major changes, slight improvements.
  • enwiki test didn't seem to trigger problem; may need to talk to vendor about this

Core workflows and metrics

  • Faidon: Labs has more than 700 instances supporting 1000 tools.
  • Chase: Hard to get a count on the tools
  • Wes: thank you for all your efforts over the quarter, Faidon and Mark. A lot of communication and coordination; at the same time, upgraded a lot of equipment that needed a refresh.
  • Jaime: echo that on capital side
  • Katherine: thank you for taking advantage of the opportunity we had at the end of the last fiscal year

Release engineering

  • Greg presenting.
  • Team size: effectively 5
  • KPIs related to CI merge times slightly improved (-4.5%)

Time spent

  • Bulk in maintenance

Consolidate deploy tools

  • Hoped to move to Scap3
  • Finished about 50% of repos

Retire Gerrit in favor of Phabricator

  • Gerrit still in use, but preparation done.
  • Rob: Gerrit isn't gone yet, correct?
  • Greg: (explaining slide reporting process)
  • Katherine: What is the timeline?
  • Greg: another few quarters / hard to tell due to our current focus on technical debt (analysis starting in Q1, follow-ups prioritized accordingly after)

JavaScript Browser Testing

  • Wes: how has participation been from engineering overall?
  • Greg: goal is to reduce burden on others. Survey: almost 30 responses. Working closely with Ed Sanders from VE team.

Other successes and misses

  • Retired gitblit. Thanks to Danny_B and Paladox (community) and @mutante
  • Released MW LTS 1.27
  • Phab: thanks to Quim Gil for figuring out collaboration. Graph stuff really nice
  • CI server failure. Large scramble to fix. Thanks to ops. Doing tech debt analysis

Core workflows and metrics

  • Upgrade Phab biweekly
  • Daily SWAT deploys

Core workflows and metrics

  • CI config changes. Metric isn't perfect, but gives you a good idea. Will soon go away; making it so teams can change their own configs.
  • Selenium: 2 releases
  • Malu: pre-releases
  • Browser tests: title doesn't really make sense. Defining how browser tests are run and where they are run

Skill tracking matrix

  • we do this every quarter to make sure bus factor is healthy. obvious bus factor issues
  • Katherine: with security releases, we have an obvious problem. What's the plan?
  • Greg: this started with our team offsite in May last year. RelEng made up of many parts. Started with deployments. now we can migrate to a new focus area.
  • Katherine: really impressed you are doing this skill mapping. it feels like healthy team maintenance
  • Wes: I'd like to add: you had some time away as well and people stepped in accordingly. Good coverage with a much reduced team.
  • Katherine: echoing that

Services

Slide 24

  • Task distribution is very approximate. Large increase in REST API traffic. Cache misses only increased marginally, so caching is very effective. Increase mostly driven by math switching to SVG and MathML served through the REST API. Other big change: Android switched to using new app. We don't have full data
  • Big change was the SVG math. Lower graph is cache misses

RestAPI Buildout

  • Second goal was improving support for devs. Made sense to integrate with the k8s effort in ops. Effect of Android app
  • Reading team is working on the app, so they're in charge. This quarter, they've exposed several APIs. Definition endpoint; Wikimania.
  • k8s discussion at Wikimania. Lots of support; lots of people rallying around it, even 3rd party users.

Eventbus and Change Propagation

  • Eventbus is exposing events in a stream. Change propagation uses this to execute actions. Rules can be configured. Migrated several job queue items. Now working on a better Kafka binding. That blocked multi-DC; we need to upgrade to Kafka 0.9
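
The pattern described here (events arriving on a stream, configurable rules deciding which actions run) can be illustrated with a small in-memory sketch. It makes no claim about how the actual change propagation service is implemented or configured: the `Rule` class, the `propagate` function, and the topic and field names are all hypothetical, and a plain function call stands in for consuming from Kafka.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Event = Dict[str, str]


@dataclass
class Rule:
    """Hypothetical rule: run `action` for events on `topic` whose fields match."""
    topic: str
    match: Dict[str, str]
    action: Callable[[Event], str]

    def applies_to(self, topic: str, event: Event) -> bool:
        # A rule applies when the topic is equal and every `match` field agrees.
        return topic == self.topic and all(
            event.get(key) == value for key, value in self.match.items()
        )


def propagate(rules: List[Rule], topic: str, event: Event) -> List[str]:
    """Return the result of every action whose rule matches the event."""
    return [rule.action(event) for rule in rules if rule.applies_to(topic, event)]


# Illustrative topic and field names, not taken from the real event schemas.
rules = [
    Rule("page_edit", {}, lambda e: f"rerender {e['title']}"),
    Rule("page_edit", {"namespace": "Template"},
         lambda e: f"update pages using {e['title']}"),
]

results = propagate(rules, "page_edit", {"title": "Infobox", "namespace": "Template"})
print(results)
```

Keeping the rules as data rather than code is what makes them configurable: adding a follow-up action for an event means adding a rule, not changing the dispatcher.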

Successes and Misses

  • Math rendering: Moritz Schubotz worked on getting this work over the line. Community demanded this be made the default.
  • Katherine: it's fascinating to see this was a challenge
  • Wes: improving the caching
  • Gabriel: community developed service that is used by 3rd parties
  • Katherine: what I find interesting is that it was driven by outside needs
  • Wes: note: as we went into annual planning, this had direct impact on operations
  • Introduced rate limiting for expensive API endpoints like pageview stats
  • Katherine: are you seeing a notable increase in impact?
  • Gabriel: A few heavy users had to throttle their requests, but were cooperative and understanding. Enforcing limits has made the API more reliable and performant for everybody despite temporarily limited backend capacity esp. in pageview API
  • Wes: API rate limits?
  • Gabriel: terms of use: we talked about this as documentation for entry point. Devs are actually happy to have concrete limits spelled out, as they can work with those. Otherwise, it's hard to guess for users what something vague like "moderate use" means in practice.
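
For readers unfamiliar with how a concrete, spelled-out limit can be enforced, a token bucket is one common approach. The notes do not say which algorithm or limits the REST API actually uses, so the sketch below is only an illustration: the class, its parameters, and the numbers are assumptions. The clock is passed in explicitly to keep the example deterministic.

```python
class TokenBucket:
    """Allow up to `burst` requests at once, refilled at `rate_per_sec` tokens/second."""

    def __init__(self, rate_per_sec: float, burst: int) -> None:
        self.rate = rate_per_sec
        self.capacity = float(burst)
        self.tokens = float(burst)  # start with a full bucket
        self.last = 0.0             # timestamp of the previous call, in seconds

    def allow(self, now: float) -> bool:
        """Return True if a request arriving at time `now` is within the limit."""
        elapsed = max(0.0, now - self.last)
        # Refill for the time that has passed, never beyond the bucket capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


# Hypothetical limit: 1 request per second with a burst of 2.
bucket = TokenBucket(rate_per_sec=1.0, burst=2)
decisions = [bucket.allow(0.0), bucket.allow(0.0), bucket.allow(0.0), bucket.allow(1.0)]
print(decisions)
```

In this run the first two requests use up the burst, the third is rejected, and one more is allowed after a second has passed. A published limit of this shape gives heavy users a number to throttle against, unlike a vague "moderate use".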

Successes and Misses

  • Library: title normalization library used in RESTBase, Mobile Content Service & Parsoid. Good to see cross-team sharing.
  • Migration of services to Jessie. Parsoid is almost over the line.
  • Cassandra outage during upgrade: We need to improve testing, more automation.

Core Workflows

  • workflows
  • Katherine: are you seeing more interest in this?
  • Gabriel: it's very project based. We want to help people help themselves. The amount of handholding we need to do is diminishing over time

Q1 Preview

  • Preview Q1
  • Firewalling off authentication and sessions. Looking into authentication, rollout Q2
  • Eventbus and change propagation. Use cases

Security

  • Darian presenting
  • Chris's departure impacted the team, but Brian has stepped up. Brian is fluent in community issues. Very little experimentation this quarter.

CentralAuth

  • We had hoped to complete high level design. Chris and Darian completed eval. We settled on nodejs. We did not complete design. Services taking the lead; we are taking the secondary role.

Successes and Misses

  • Goal from last quarter was to complete 2FA. We haven't formally notified. Once authmgr became ready, redeployed. Already working through functionality bugs; one remaining outstanding. We'll follow up with survey
  • Katherine: working with OIT for staff rollout?
  • Darian: I hadn't been thinking about that
  • Jaime: I've been talking to Wes about having a session
  • Katherine: I heard from Reading this was a goal. Congrats, especially where team was short staffed
  • Darian: missed a couple of reviews. Django app review for bd808; we didn't have appropriate security controls and what we should be looking for.

Core Workflows

  • added a row
  • Wes: Thanks for support and diligence in transition this quarter to Darian and Brian and RobLa. Next quarter: we have a number of headcount to begin pursuing
  • Katherine: thanks Darian for stepping up. Just got sent a referral today

Session wrapup

  • Katherine: I appreciate seeing how these things work from quarter to quarter
  • Jaime: a lot going on in the org, and it's often quite complicated
  • Katherine: wouldn't be here without you