Wikimedia monthly activities meetings/Quarterly reviews/Architecture, Operations, Release Engineering, Services, and Security, July 2016

Notes from the Quarterly Review meeting with the Wikimedia Foundation's Technology II: Architecture, Operations, Release Engineering, Services, Security teams, July 14, 8:00 - 9:30 AM PT.

Please keep in mind that these minutes are mostly a rough paraphrase of what was said at the meeting, rather than a source of authoritative information. Consider referring to the presentation slides, blog posts, press releases and other official material.

Present (in the office): Heather, Gabriel, Michelle, Ori, RobLa, Jaime, Madhu, Katherine, Joady; participating remotely: Darian, Faidon, Greg, Maggie, Mark, Nuria, Chad, Chase, Wes

Technology Quarterly Review - Q4 FY15-16- Architecture, Technical Operations, Release Engineering, Services, Security.pdf

Backup Datacenter

  • Wes: Great rollout. We had to make adjustments and moved it out a quarter to be prepared and do this well. Good learning and improvements.
  • Katherine: reiterating last session; huge accomplishment for the team, community members were very positive; confidence in the results; board appreciated the work

Architecture

Implement ArchCom Subteams

  • Rob: Goal is to scale ArchCom with sub-teams. Hoped to spin up at least one (security), but not fully done yet.

Document RFC

  • Improving process. No single big improvement to call out. Status page for RFC shepherding.

Develop Fellowships

  • Developing a fellowship program. It did not create any controversy, but also wasn't really discussed widely.
  • Background: The idea is to give senior engineers a tenure-like status, with a lot of freedom.

Technical operations

  • 17 FTE, plus some ops engineers embedded in other teams
  • Main KPI is availability, improved slightly to 99.987%.
  • Katherine: congratulations on a job well done (fail-over). Presented at Wikimania, well received there. Choice to delay was the right choice.
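
To put the availability KPI in perspective, here is a rough sketch of the downtime that a given availability percentage implies. It was not part of the presentation: the function name is hypothetical, and the 91-day quarter length (April through June) is an assumption.

```python
def downtime_minutes(availability_percent: float, days: int) -> float:
    """Return the minutes of unavailability implied by an availability figure.

    availability_percent: e.g. 99.987 for 99.987%.
    days: length of the measurement period in days (assumed, not from the notes).
    """
    total_minutes = days * 24 * 60
    unavailable_fraction = (100.0 - availability_percent) / 100.0
    return unavailable_fraction * total_minutes


if __name__ == "__main__":
    # Assumption: a 91-day quarter (April, May, June).
    print(round(downtime_minutes(99.987, 91), 1), "minutes per quarter")
```

Under these assumptions, 99.987% corresponds to roughly 17 minutes of unavailability over the quarter.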

HTTP/2 support

  • Needed to move to HTTP/2, as major browsers like Chrome dropped SPDY support.
  • Switched successfully on May 4th.

Objective: Varnish 4 migration

  • Moving away from Varnish 3. Complicated transition.
  • Encountered issues, but solved them.

Objective: Tools on k8s

  • Target is Tool Labs, the community platform running a lot of small projects.
  • Replaced legacy system with Kubernetes. Needed to learn a lot about how users use our platform.
  • Katherine: is this aligned around bd808's work?
  • Chase: yes, he's basically an integrated member of the team. Ideally transparent to the team. Explains things to users. Honorary member of the team.
  • Katherine: great to see x-team work
  • Gabriel: K8s work is providing us stuff that we might want to use in production
  • Katherine: Looks like a very busy quarter. Seems to have been a very satisfying way of coming into the quarter with everything that was in play last quarter

Other successes and misses

  • As usual, lots of large & small things that didn't make it to official goal status.

Core workflows and metrics

  • No major changes, slight improvements.
  • enwiki test didn't seem to trigger problem; may need to talk to vendor about this

Core workflows and metrics

  • Faidon: Labs has more than 700 instances supporting 1000 tools.
  • Chase: Hard to get a count on the tools
  • Wes: thank you for all your efforts over the quarter, Faidon and Mark. A lot of communication and coordination; at the same time, upgraded a lot of equipment that needed a refresh.
  • Jaime: echo that on capital side
  • Katherine: thank you for taking advantage of the opportunity we had at the end of the last fiscal year

Release engineering

  • Greg presenting.
  • Team size: effectively 5
  • KPIs related to CI merge times slightly improved (-4.5%)

Time spent

  • Bulk in maintenance

Consolidate deploy tools

  • Hoped to move to Scap3
  • Finished about 50% of repos

Retire Gerrit in favor of Phabricator

  • Gerrit still in use, but preparation done.
  • Rob: Gerrit isn't gone yet, correct?
  • Greg: (explaining slide reporting process)
  • Katherine: What is the timeline?
  • Greg: another few quarters / hard to tell due to our current focus on technical debt (analysis starting in Q1, follow-ups prioritized accordingly after)

JavaScript Browser Testing

  • Wes: how has participation been from overall eng?
  • Greg: goal is to reduce burden on others. Survey: almost 30 responses. Working closely with Ed Sanders from VE team.

Other successes and misses

  • Retired gitblit. thanks to Danny_B and Paladox (community) and @mutante
  • Released MW LTS 1.27
  • Phab: thanks to Quim Gil for figuring out collab. Graph stuff really nice
  • CI server failure. large scramble to fix. thanks to ops. doing tech debt analysis

Core workflows and metrics

  • upgrade phab biweekly
  • daily swat deploys

Core workflows and metrics

  • CI config changes. metric isn't perfect, but gives you a good idea. Soon go away; making it so teams can change their own configs.
  • Selenium: 2 rels
  • Malu: pre releases
  • Browser tests: title doesn't really make sense. defining how browser tests are run and where they are run

Skill tracking matrix

  • we do this every quarter to make sure bus factor is healthy. obvious bus factor issues
  • Katherine: with security releases, we have obvious problem. what's the plan?
  • Greg: this started with our team offsite in May last year. RelEng made up of many parts. Started with deployments. now we can migrate to a new focus area.
  • Katherine: really impressed you are doing this skill mapping. it feels like healthy team maintenance
  • Wes: I'd like to add: you had some time away as well and people stepped in accordingly. Good coverage with a much reduced team.
  • Katherine: echoing that

Services

Slide 24

  • task distribution is very approx. Large increase in REST API traffic. cache misses only increased marginally, caching very effective. Increase mostly driven by math switching to SVG and MathML served through REST API. Other big change, Android switched to using new app. We don't have full data
  • big change was the SVG math. lower graph is cache misses

RestAPI Buildout

  • second goal was improving support for devs. made sense to integrate with k8s effort in ops. effect of android app
  • reading team is working on app, so they're in charge. this quarter, several APIs they've exposed. definition endpoint; wikimania.
  • k8s discussion at Wikimania. lot of support; lots of people rallying around it, even 3rd party users.

Eventbus and Change Propagation

  • Eventbus is exposing events in a stream. change prop uses this to execute actions. rules can be configured. migrated several job queue items. now working on a better Kafka binding. that blocked multi-dc, we need to upgrade to Kafka 0.9
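
The idea of configurable rules that turn stream events into actions can be sketched roughly as follows. This is an illustration only, not the actual change propagation service or its configuration format: the rule layout, the topic names, and the `propagate` and `matches` helpers are all hypothetical.

```python
from typing import Any, Callable, Dict, Iterable, List

Event = Dict[str, Any]
Rule = Dict[str, Any]  # hypothetical shape: {"topic": str, "match": dict, "action": callable}


def matches(rule: Rule, event: Event) -> bool:
    """A rule applies when the topic is equal and every 'match' field is equal."""
    if event.get("topic") != rule["topic"]:
        return False
    return all(event.get(key) == value for key, value in rule.get("match", {}).items())


def propagate(events: Iterable[Event], rules: List[Rule]) -> List[Any]:
    """Run the action of every matching rule for every event, in stream order."""
    results = []
    for event in events:
        for rule in rules:
            if matches(rule, event):
                action: Callable[[Event], Any] = rule["action"]
                results.append(action(event))
    return results


if __name__ == "__main__":
    # Hypothetical rule: purge a cached page whenever an edit event arrives for enwiki.
    rules = [{
        "topic": "page_edit",
        "match": {"wiki": "enwiki"},
        "action": lambda e: "purge:" + e["title"],
    }]
    stream = [
        {"topic": "page_edit", "wiki": "enwiki", "title": "Foo"},
        {"topic": "page_edit", "wiki": "dewiki", "title": "Bar"},
    ]
    print(propagate(stream, rules))
```

In a real deployment the events would come from Kafka rather than an in-memory list, and the actions would be requests to other services instead of returned strings.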

Successes and Misses

  • math rendering: Moritz Schubotz worked on getting this over the line. community demanded this be made the default.
  • Katherine: it's fascinating to see this was a challenge
  • Wes: improving the caching
  • Gabriel: community developed service that is used by 3rd parties
  • Katherine: what I find interesting is that it was driven by outside needs
  • Wes: note: as we went into annual planning, this had direct impact on operations
  • Introduced rate limiting for expensive API end points like page view stats
  • Katherine: are you seeing a notable increase in impact?
  • Gabriel: A few heavy users had to throttle their requests, but were cooperative and understanding. Enforcing limits has made the API more reliable and performant for everybody despite temporarily limited backend capacity esp. in pageview API
  • Wes: API rate limits?
  • Gabriel: terms of use: we talked about this as documentation for entry point. Devs are actually happy to have concrete limits spelled out, as they can work with those. Otherwise, it's hard to guess for users what something vague like "moderate use" means in practice.
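
As an illustration of what a concrete, spelled-out limit can look like, here is a minimal token-bucket sketch. The notes do not say which algorithm or numbers the REST API uses, so the class name, the rate, and the burst size are assumptions, not the production implementation.

```python
import time
from typing import Callable


class TokenBucket:
    """Allow up to `burst` requests at once, refilled at `rate_per_sec` tokens per second."""

    def __init__(self, rate_per_sec: float, burst: int,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = float(rate_per_sec)
        self.capacity = float(burst)
        self.clock = clock              # injectable so the limiter can be tested
        self.tokens = float(burst)      # start with a full bucket
        self.last = clock()

    def allow(self) -> bool:
        """Return True and consume a token if the request is within the limit."""
        now = self.clock()
        # Refill in proportion to elapsed time, never above capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


if __name__ == "__main__":
    # Hypothetical limit: 2 requests per second with bursts of up to 3.
    bucket = TokenBucket(rate_per_sec=2, burst=3)
    print([bucket.allow() for _ in range(5)])
```

A client that knows the rate and burst size can pace its requests to stay under the limit, which is harder to do against a vague "moderate use" policy.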

Successes and Misses

  • library: title normalization library used in RESTBase, Mobile Content Service & Parsoid. Good to see cross team sharing.
  • Migration of services to Jessie. Parsoid is almost over the line.
  • Cassandra outage during upgrade: We need to improve testing, more automation.

Core Workflows

  • workflows
  • Katherine: are you seeing more interest in this?
  • Gabriel: it's very project based. we want to help people help themselves. amount of handholding we need to do diminishing over time

Q1 Preview

  • Preview Q1
  • firewalling off authentication and sessions. looking into authentication, rollout Q2
  • eventbus and change propagation. use cases

Security

  • Darian presenting
  • Chris's departure impacted, but Brian has stepped up. Brian is fluent in community issues. very little experimentation this quarter.

CentralAuth

  • we had hoped to complete high level design. Chris and Darian completed eval. we settled on nodejs. we did not complete design. Services taking the lead, we are taking the secondary role.

Successes and Misses

  • goal from last quarter was to complete 2FA. we haven't formally notified. once authmgr became ready, redeployed. already working through functionality bugs; one remaining outstanding. we'll follow up with survey
  • Katherine: working with OIT for staff rollout?
  • Darian: I hadn't been thinking about that
  • Jaime: I've been talking to Wes about having a session
  • Katherine: I heard from Reading this was a goal. Congrats, especially where team was short staffed
  • Darian: missed a couple of reviews. Django app review for bd808; we didn't have appropriate security controls and what we should be looking for.

Core Workflows

  • added a row
  • Wes: Thanks for support and diligence in transition this quarter to Darian and Brian and RobLa. Next quarter: we have a number of headcount to begin pursuing
  • Katherine: thanks Darian for stepping up. Just got sent a referral today

session wrapup

  • Katherine: I appreciate seeing how these things work from quarter to quarter
  • Jaime: a lot going on in the org, and it's often quite complicated
  • Katherine: wouldn't be here without you