Wikimedia monthly activities meetings/Quarterly reviews/Architecture, Operations, Release Engineering, Services, Security, January 2016
Notes from the Quarterly Review meeting with the Wikimedia Foundation's Technology II: Architecture, Operations, Release Engineering, Services, Security teams, January 22, 2016, 9:45 - 11:15 AM PT.
Please keep in mind that these minutes are mostly a rough paraphrase of what was said at the meeting, rather than a source of authoritative information. Consider referring to the presentation slides, blog posts, press releases and other official material.
Attendees: Greg G., Gabriel, Ori, RobLa, Kevin L., Madu, Chris S., Abbey, Marko, Faidon, Mark, Wes, Lila
Architecture
Slide 1
RobLa: Architecture is RobLa, P/T through Nov./Dec.
Slide 2
RobLa: 3 goals. Prepare for summit: lots of help from many folks. We didn't plan a lot, and a lot of great people stepped up and made it a great event.
Slide 3
RobLa: This one is going a lot better. Brian Wolff notes that the Architecture Committee is going so well that it might be a problem that it is going too fast (quote is in slides). That is a good problem to have. We may be guilty of too much hand-wringing, but we are trusted arbiters in the messy consensus process.
Slide 4
RobLa: Figure out some of the naming issues. RfC is a loaded term (it means IETF RFC in a lot of contexts), but in this world RfC often means an editor community discussion. That is one thing we penciled in last quarter as something we should discuss.
Wes: It would be helpful to have a link to some of the learnings on MWDS. We had a lot of feedback.
Lila: I heard this was the best one!
RobLa: Thank you everyone for doing your part; it was a good exercise in collaboration. Ref Slide 3: A link to the Wikidev 2016 page would do the trick on this?
Wes: Yes. What we are trying to do with these slides is include some learnings from the event on the slides, etc.
Technical Operations
Slide 5
Mark B: Fiscal Q2: 17 people. We are still hiring. At the start of the year we added a new member. Our KPI is availability for readers and we had a pretty good quarter. It was up by 0.037% from last Q.
Lila: You were also going through the fundraiser, which puts on more pressure.
Mark B: Yes. Some of the bigger changes have been waiting until after fundraiser.
Lila: ? based on Ori's presentations. If we do have the multi-DC setup, this number will significantly improve because you will have a hot-hot situation.
Mark B: It is more for disaster recovery right now, but we will see if we can get to hot-hot in the future.
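To put the availability KPI in more tangible terms, the following is a minimal sketch that converts an availability percentage, or a change in it, into minutes per quarter. It assumes the 0.037% figure is a gain in percentage points and that a quarter has 92 days; the function names are hypothetical and not taken from the slides.

```python
def downtime_minutes(availability_pct: float, period_days: int) -> float:
    """Minutes of unavailability implied by an availability percentage."""
    period_minutes = period_days * 24 * 60
    return (1 - availability_pct / 100) * period_minutes


def minutes_recovered(gain_pct_points: float, period_days: int) -> float:
    """Minutes of extra uptime implied by a gain in percentage points."""
    period_minutes = period_days * 24 * 60
    return gain_pct_points / 100 * period_minutes


if __name__ == "__main__":
    # Assumption: 92-day quarter, 0.037 percentage-point improvement.
    print(round(minutes_recovered(0.037, 92), 1))  # about 49 minutes
```

Under these assumptions, a gain of 0.037 percentage points corresponds to roughly 49 additional minutes of availability over the quarter.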
Slide 6
Mark B: We did a lot of work on updating our MariaDB databases and improving encryption. We were looking at possibly encrypting Varnish but Varnish does not support this. Right now we don't have a problem because there is no traffic between Varnish and backend servers. Basically most cross-DC traffic is encrypted at this time.
We had some learnings about routing hardware. Several backdoors were found, which shows that relying on it for encryption is not a good strategy. We will continue to use open source for encryption.
Slide 7
Mark B: We have been doing a push on security over the course of the year. This Q we did a push on user identification. This was a shared goal. We were looking for a solid process for on-boarding and off-boarding staff. This is mostly for off-boarding. We worked with HR and this should be solid now.
We also looked at 2-step verification for all root users. We have implemented and experimented with YubiKeys. We will continue this through the next Q. We will eventually enforce one-time passwords. We have had help from Office IT for this.
LDAP service: it was running on an older install which was set up by a former staff member and was not maintained; we migrated to newer software, which is now maintained.
Monitoring infrastructure: instant monitoring in case something goes wrong. We want to spend some additional resources on this. It is running on just one server. We want it in more data centers. It is not open to the public at the moment. We started work on this and evaluated options. We went with Shinken, which has some advantages for better scaling and multiple data centers. It is all under way. Part of our goal: sometimes our server gets overloaded; we want to work on better abstractions. We did not finish because we had less time than we anticipated and we were helping other teams. It is close.
Lila: Are we using any external services?
Mark B: For our external monitoring we use external services (for availability statistics). Internally we use our own.
Lila: It sounds like you have a good two-level setup.
Mark B: Yes.
Slide 8
Mark B: Start supporting bare metal servers. We have only been able to deploy VM instances. Occasionally we get performance testing requests. In some cases we have users who need this for their work. In some cases it would be nice to offload this to servers or give access to external folks. We have multiple use cases. We never really had time to investigate. As a smaller project we investigated whether we could do this easily. We tried to make it run on OpenStack Ironic. It depends heavily on a newer component, so it was not feasible. We will hold until we migrate to new servers. It is possible but too manual. In the future we hope to automate this more.
Slide 9
Major learning: Testing cluster for labs. We did not have a test environment for labs.
Lila: Did you look at containers as well?
Mark: Yes - this goal is one of 4 major projects going on in the labs team. Maybe next quarter we will follow up and possibly try for production.
Lila: If you want to get connected with OpenStack, I can get you in there. If you need help, I can help.
Mark: Okay.
Lila: That is good news for labs.
Slide 10
List of hits or misses. We have helped a lot of staff, done a lot of maintenance work.
Slide 11
Mark B: Catchpoint data on availability. We have spread this out from a year ago. Overall numbers are positive. Donor data from the fundraiser was not complete; some of the data is missing. We will figure it out.
Lila: What is the difference from last quarter, Dec 1 / Dec 31?
Mark B.: Just tracking.
Greg: Is there a place defining what the audiences are?
Mark B: Yes....
Fun fact: Our most stable service according to Catchpoint is Gerrit. :-) Possibly related to lack of checks rather than actual reliability.
Release Engineering
Slide 12
Greg: Team size 6. Time spent a bit different from last quarter. Last slide has more info. KPI: Time to merge in MW Core. Went from 11 min to 6 min.
Slide 13
Greg: Nodepool - This is now fully up and working. We have a portion of jobs migrated to it. We did not get a set of JS / MPM tests migrated as of quarter break. We started as of yesterday.
Learning: Still mostly a one-person job right now. Silo effect is strong. Other priorities were higher. We could not spare time for training.
Slide 14
Greg: Time to Merge
Lila: Does the team also run all the automation testing?
Greg: Yes. End-to-end tests are not all part of the merge process.
Lila: It would be interesting to see some level of KPIs around testing.
Greg: We have long been toying with whether the testing should be our KPI or other teams'.
Lila: Still useful to measure.
Slide 15
Greg: Not completed - goal was to migrate all the services. We did some but not all. There was some movement before the holidays. Dev Summit discussion was useful and hopefully this will be done by this quarter. We should be able to do this and MediaWiki deploys at the same time.
Slide 16
Greg: Migrate from Gerrit to Differential. Gitblit (just for viewing) is still alive. Blockers are redirects.
Lila: So you just need buy-in.
Greg: Actual migration will depend on RfC. We discussed at dev summit. Positive feedback. We need to see how this will work day to day, then put it out in front of users and gather feedback.
Slide 17
Greg: MediaWiki 1.26 release. Happened.
Learning: Releases for 3rd parties are still somewhat subpar because it is not an internal priority.
Lila: Someone approached me at dev summit about the whole MediaWiki foundation.
RobLa: The stakeholders have meetings. There are a number of leaders. It is about social dynamics vs technical. There are a number of people and interesting convos at dev summit. Still something we need to explore.
Ori: We should think it through.
Quim: I think it will come from CL side.
Lila: Please keep me in the loop.
Slide 18
Greg: Exploring non-Ruby browser tests: there have been calls from devs for adding them; +1 to the VE team for having that. Continued positive work with the Phab team; they have been responsive and have done custom code for our feature requests. Miss is the WMF log errors graph: interpretation is suboptimal. There are issues with how we do tracking; there could be logs that aren't due to errors.
Slide 19
Greg: Changes to CI configs. Call-out to a volunteer who has been doing a lot of work.
Greg: MediaWiki Selenium automation.
Greg: Deploys are explicitly rotated throughout the team. The benefit is people's time but it gets things in the forefront of people's minds.
Lila: Do changes get pushed by your team?
Greg: The majority of MW changes come from our team, but some come from others. In the appendix we have graphs and a skill matrix; we are keeping up with that.
Slide 20
Services
Slide 21
Marko: We were 4 this year. We spent most time on strengthening and focus goals. Main KPI: popularity of the REST API, and uptime of actual services.
Faidon: Does the figure of 250 requests include all of our requests, from the jobrunner for example?
Marko: Yes. It includes all of them; it is around 50/50 between our own requests and external users.
Faidon: It may be useful to put the number for external requests.
Slide 22
Marko: 1st goal is a perpetual goal for us. We were successful in that we have a couple of new entry points. Page summary (Reading uses it): they should see a big performance improvement. We also store renders much more quickly than before.
Learning: there is a need to expand the API, make it cacheable and store results. We started discussion of layout, and of which features to include and which not to include. At dev summit we talked about putting as much functionality as possible in the REST API so it can be cacheable.
Slide 23
Marko: EventBus is operational; done with the Analytics team. As of a few days ago MW is producing events to it. However, the 2nd part of the goal has not been realized yet due to delays in the sequence of events that happened.
Learning: At the beginning of the Q we did not talk about responsibilities between teams, which caused some friction and misunderstanding. This will be deployed this Q.
Gabriel: There was a dev summit discussion about this among many of the teams.
Lila: Does it mean we will be able to roll to pages more modularly?
Gabriel: Eventually. Right now change propagation is rule-based and does not have a separate dependency graph, but that is something we are interested in.
Lila: This is typical for modern sites.
Ori: I'd like to see more effort put into getting more teams on board. We have some solutions but it is often ad hoc. What makes or breaks this is whether individual teams come up with their own systems rather than using one across teams.
Lila: One of the goals for the next Q may be to...
Gabriel: It is not super trivial to replace all of the job queue, but there are pieces we can gradually migrate.
Lila: I'm sure that it is extremely complex. Not saying go get it done. I'm saying have a plan.
Andrew O: We should talk about this more in some of our team meetings.
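As an illustration of the distinction Gabriel draws above, here is a minimal sketch contrasting rule-based change propagation with a traversal of an explicit dependency graph. All event types, resource names and function names are hypothetical and are not taken from EventBus or any Wikimedia service.

```python
from collections import deque

# Rule-based: each rule maps an event type to the follow-up actions it triggers.
# (Hypothetical event types and action strings, for illustration only.)
RULES = {
    "page_edit": lambda event: [f"rerender:{event['title']}"],
    "template_edit": lambda event: [f"rerender-transclusions-of:{event['title']}"],
}


def apply_rules(event):
    """Return the follow-up actions for an event, or an empty list if no rule matches."""
    rule = RULES.get(event["type"])
    return rule(event) if rule else []


def dependents_of(graph, start):
    """Dependency-graph approach: breadth-first walk over explicit edges
    from a changed resource to everything that depends on it."""
    seen, order = {start}, []
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for dependent in graph.get(node, ()):
            if dependent not in seen:
                seen.add(dependent)
                order.append(dependent)
                queue.append(dependent)
    return order


if __name__ == "__main__":
    graph = {
        "Template:Infobox": ["Page:A", "Page:B"],
        "Page:A": ["Summary:A"],
    }
    print(apply_rules({"type": "page_edit", "title": "Foo"}))
    print(dependents_of(graph, "Template:Infobox"))
```

With rules, each event type carries its own hard-coded knowledge of what to update; with a graph, the set of things to update is derived from recorded dependencies, which is what would allow pages to be refreshed piece by piece.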
Slide 24
Marko: Prototyped service worker front end. Declared a success. Worked with Reading team. Discussed at dev summit; people perceived this as a good way to move forward. Next steps are for the Reading team to move forward.
Lila: Major architectural shift we have to make.
Gabriel: Driving the uncoupling of front end from backend; data / presentation.
Lila: Good.
Slide 25
Marko: In VE you can switch from Wikitext and back. We built the storage logic behind that feature. We helped Analytics build their pageview API. (See slide for all.)
Gabriel: Ryan Lane is interested in the Docker-based mediawiki-containers distribution solution.
Marko: There is progress in RelEng using Scap3. Content working group has a lot of support.
Slide 26
Marko: Core workflows: this Q we mentored new service development (maintenance and operational mentorship). We had a lot of reactive work last Q. We expanded Cassandra capacity. Upgraded to 2 new versions. Keeping an eye on it. Investigated issues, proposed fixes, monitored on a daily basis.
Lila: You are doing interesting and forward thinking work. You probably need more cross-functional convos.
Marko: One goal for this Q is community engagement. In general people outside our team. More cross team discussions. Even if we want to move quickly we need support.
Lila: Discussions, plans, documentation
Gabriel: Especially documentation, scale processes.
RobLa: People will want to install VE on their wikis, and this is not supported by default in the standard install. The Docker solution is one solution.
Lila: This may be a driver.
Gabriel: We worked a lot with Reading, but with Discovery we have not had a lot of contact.
Abbey: Sherah is interested in improving the experience of installing / using VE. She may be good to talk to.
Security
Slide 27
- 2.1 = csteipp, dpatrick, bawolff (very part-time contract)
- Core work is >80% of effort; of the <20% strategic work (goals that we worked on), approx.
Chris: 80% Core Work; of the remainder, 25% Strengthening and 75% Experimental. We are understaffed.
Slide 28
Chris: Automate security review. Completed goal in that we chose Veracode. It couldn't do some tracking across extensions. Hoping this platform will improve over the next year.
Lila: Do we use Black Duck?
Chris: That is a library scanning tool, which checks libraries for vulnerabilities. This is for stack analysis.
Learning: we could be leaders in this space if we had the time.
Slide 29
Chris: Miss - give training around security process for other teams. We developed materials but did not present. Thinking: Let's give more training curriculum to our users. We developed and presented a different topic but did not meet our goal. We still need to work towards it in the future.
Slide 30
Chris: Improve metrics. We did not meet our stretch goal.
Lila: You had a busy quarter.
Slide 31
Chris: Learnings (see slide for list). Some personal projects were put aside to assist Reading.
Slide 32
Chris: Core workflows / Metrics (see slide)
Slide 33
Chris: Core workflows / Metrics (see slide). Questions
Lila: How can we best help you?
Chris: Through annual planning we will ask for more staffing. Brian Wolff has been a huge help; we have asked for additional money to extend. We are looking into contractors. Other teams are also so helpful, e.g. Editing found and fixed some of their own bugs. Fixing bugs which are assigned would be great.
RobLa: What if we had a retrospective/postmortem when a security vulnerability is found? We could do that within the team which owns the development where the vulnerability is found.
Lila: Good idea. Security is everyone's responsibility. https://phabricator.wikimedia.org/T123753
Wes: Chris, can you write up a summary of your asks?
Chris: We have written another document about how we would like other teams to interact with security. https://www.mediawiki.org/wiki/Wikimedia_Security_Team/SDLC
Lila: This is something we need to be constantly reminding folks of.
Lila: Don't sweat it if you don't get your greens. Most important is that we expect you to learn along the way that there is a better way, and to change direction sometimes. Every single red I see is great learning and great progress; don't get stuck on not getting a green. Really good work last quarter. Thank you.