Wikimedia monthly activities meetings/Quarterly reviews/Research, Design Research, Analytics, and Performance, April 2016

Notes from the Quarterly Review meeting with the Wikimedia Foundation's Technology I: Research, Design Research, Analytics, Special Projects, Performance teams, April 14, 8:00 - 9:30 AM PT.

Please keep in mind that these minutes are mostly a rough paraphrase of what was said at the meeting, rather than a source of authoritative information. Consider referring to the presentation slides, blog posts, press releases and other official material.

Present (in the office): Abbey, Dario, Maggie, Daisy, Stephen L, Tilman, Gabriel, Chris Steipp, Gabriel, Rob L, Kevin Leduc, Geoff Brigham, Ori, Zhou, Katherine

Participating remotely: Aaron Halfaker, aotto, Dan, Jonathan M, Joseph Allemandou, Luca Toscano, Marcel Ruiz Forns, Mark Bergsma, mviswanathan, Nuria Ruiz, Samantha, Wes

Research and Data

Slide 1

Slide 2

Dario: new addition to the team: Nathaniel (full stack engineer). Still working with 2 research fellows, but many of our volunteer collaborators left in the quarter, which affected the team's productivity.

Slide 3 - Objective: Revision scoring

Dario read through goals on slides

Edit typing: working on an edit taxonomy for this goal. Will continue in Q4
Partially achieved, second half is still in progress
Papers submitted or on track for submission

Learnings - extending revision scoring to other languages requires hand coding and substantial community support, which makes it harder than initially expected. We are also still working with Ops on removing the blockers to productizing the service.

Slide 4 - other achievements

Swagger-based documentation

Service now reports its own performance metrics, e.g. filter rate, which measures how much work we're saving for our curation community

Slide 5 - Article recommendations

Dario: we wanted to run a campaign, didn't get there, but improved usability. It's ready now for a campaign. Hoping to test this in the coming days, working on an announcement with Comms.

Slide 6 - Reader segmentation

3rd major goal - qualitative plus quantitative research. Gave an early presentation to the Reading team, and we expect to publish the results more broadly after Q4.

Slide 7 - Other successes and misses

Organized WikiCite in Berlin

Organized a joint research workshop at 2 major conferences, to be held in Q4

Deployed referrer policy in collaboration with Ops

Slide 8 - Core workflows and metrics

Hosted public showcases.

Maggie: Community Engagement really appreciates the work of the team

Wes: good progress toward actual application

Katherine: agree, great to see move towards applications; glad to see increased collaboration, excited to see readership data

Slide 9 - Appendix

Design Research

Slide 10

Slide 11 - Production Work

Abbey: making sure we build what is usable. Exploratory research with non-technical users. Daisy and Sherah did workflow testing with the iOS app. Findings are being used to iterate on the app. Daisy did research on the new editor experience, covering the new editor education flow in VE.

Contextual inquiry in Mexico. Learned about doing contextual inquiries in the field, and a lot from and about the people who participated in the research. See monthly metrics from March for a high-level description of our findings.

Slide 12 - Production Work

Slide 13 - Mentoring

Abbey: mentoring people. Sherah has been working on research for the Reading team. Seeing how the mentoring works, but may do it with others.

Worked with May and Volker to understand the needs of people who build UI on wiki around UI libraries. (Focus groups)

Currently working on the program toolkits with Jaime, Maria, Edward and Subha on the CE team, to better understand how our toolkits work for people who organize and run programs. There will be iteration on the toolkits based on what we learn.

Slide 14 - Personas

Personas - this will be an ongoing thing. Didn't achieve our current goal, because we didn't get the analysis done that we had hoped to.

Slide 15 - External Collaborations

Jonathan has been leading the collab with UW: 200 responses. We're going to see what we can learn about students seeking information. Generalizing experience from Mexico and UW. (In general, as we do more exploratory research, we will be able to compare the various groups of people (participants) we learn from.)

Slide 16 - Other successes and misses

Reboot = consultancy which has done many contextual inquiries.

On methodology of inquiry: readers in Nigeria and India. Working on building a database to find commonalities and differences in research across regions.

Research (Design Research and Research and Data) team offsite - good for figuring out how we're going to work together

Slide 17 - Core workflows

Samantha really helped recruiting in Mexico, as well as her ongoing recruiting for production work with product teams.

Monthly metrics about Mexico deep dive.

CSCW workshop "Breaking into new data-spaces"

Slide 18 - Appendix

Geoff: product design work and this are theoretically all attached to Product. How much... what are the efficiencies that make it into product? Do you see misses?

Abbey: we do research, and Product does design. We work closely with Product. We do have misses (when product goes out the door before research). Not that everything needs research to be done, but we do want there to be no usability issues when product goes into production. We're getting better and better at iterating and working with teams. VE was working well for a long time, but now there is no designer on the team, and we are seeing the difference.

Jonathan: often when our work doesn't make it into product, the reasons have more to do with organizational changes that affect product teams' priorities.

E.g. at times in Collab and the UX standardization team, our work was not integrated or was put on hold because the product changed

Geoff: let's have a side conversation

Katherine: regarding the generative value of your research, how do you prioritize? Do people come to you, or do you have a backlog?

Abbey: with each official product team, we have biweekly meetings. We always have topics for discussion, and a prioritized list of projects. We have a hit list of things that need research work. We work with the teams on what we have bandwidth for. We also have our Phab board, and people can add projects. Weekly triage (backlog grooming). You can look at that board and see how we're working. We also had a design workshop. All service depts (teams that build user-facing functionality but do not have design support, e.g. Security, Legal, ...) have some user-facing functionality they are building, so to the degree that we can (with our bandwidth) we work with those teams to iterate on their user-facing functionality.

We're working with Sarah on publishing reports (e.g. on wiki) and better communicating our work more widely.

Analytics Engineering

Slide 19

Nuria: we would like to figure out how to track velocity better. Likely next quarter we will be reporting ops work separately. We have 6 full-time engineers and 1 full-time manager.

Slide 20 - Uniques

Nuria: spent quite a bit of time on this this quarter. We were able to deliver unique devices daily and monthly, per country and per project (desktop and mobile). We now publish the unique devices data as downloadable files. Started January 2016. We have the data split by country internally; publishing that externally is quite tricky for privacy reasons.

Slide 21 (graph)

Mobile: over half of our uniques. in Indonesia, over 80%

Slide 22 - Wikistats 2.0

This is part of Wikistats 2.0, our project to replace work that Erik Z. has been doing on http://stats.wikimedia.org for the better part of a decade

Browser data: https://browser-reports.wmflabs.org/#all-sites-by-os Soon to be deployed on new production domain. Broken down by OS, by browser, and by combination. Analytics would greatly benefit from working with a designer: UI projects take longer because we don't spend enough time on design cycles (building mocks, for example).

Slide 23 - User Agent Breakdowns (graph)

One goal we missed (though 99% there). Sanitizing data is very difficult. We got the OK from Research, but we need to work on this with Security.

Need to work with security so that we know what we're doing works.

Slide 24 - sanitization

Chris: I didn't realize you were waiting for me. Do you need help?

Nuria: right now, it's on us. We have some research to do. We want to make sure we have something to propose, and then want Chris to review.

Slide 25 - Operational Excellence

A lot of time spent working on operations. We maintain several systems: EventLogging, pageview API, cluster.

We need more DBA help on the team.

Mark: hope to have that sorted out within a month

Slide 26

Piwik - kinda like Google Analytics. (demo of the dashboard) Self-hosted; smart platform for small websites, and it works well for our small sites, but not with large amounts of traffic

Slide 27 - Piwik (example screenshot)

Slide 28

public by default

removed (hashed) IPs from EventLogging data to collect as little PII as possible.
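As a rough illustration of what hashing an IP before storage can look like: a minimal sketch, assuming a keyed hash (HMAC-SHA256). The notes do not say which scheme EventLogging actually uses, and the helper name `pseudonymize_ip` and the example key are hypothetical.

```python
import hashlib
import hmac


def pseudonymize_ip(ip_address: str, secret_key: bytes) -> str:
    """Return a keyed SHA-256 digest of an IP address (hypothetical helper)."""
    # A keyed hash cannot be recomputed without the key, so the small IPv4
    # address space is harder to brute-force than with a plain unsalted hash.
    return hmac.new(secret_key, ip_address.encode("utf-8"), hashlib.sha256).hexdigest()


if __name__ == "__main__":
    # Assumption: a real deployment would keep this key secret.
    key = b"example-key"
    # 192.0.2.1 is a reserved documentation address, used here as dummy input.
    print(pseudonymize_ip("192.0.2.1", key))
```

The same address always maps to the same digest under one key, so events can still be grouped without storing the raw IP.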

We're trying to make more data sanitized from the start. Group editing data. Pageview stats - our homework is to have very good data for edits, too.

Katherine: thank you for helping the whole organization. I hope we'll be able to get the design support and the security help

Personally (from Comms) thanks for your help with Piwik, especially when we melted it down ;)

Nuria: glad we're able to deploy it for the right size projects

Special Projects

Slide 29

Slide 30 - Begin a Community Consultation

Kevin: I led a virtual team including Michelle, Tiffany, Juliet, Johan, Edward and David S. with a goal to start a consultation on uniques

Out of our measures of success, which were milestones:

Hosted an internal brownbag
We did not run a survey
We did not start a consultation

Analytics team had been debating using unique tokens for years.

Unique tokens: postponing it for at least half a year, and then only if the new ED wants to pick up the issue. We avoided a consultation that would have been costly both in resources and in community goodwill. Feb-March would not have been a good time to consult the community about this.

Slide 31 - Other successes and misses

What did we learn?

Having unique device counts tipped the scales away from implementing a unique token. We have a very valuable dataset now. It takes a long time to develop metrics, and to understand and use them. Dumping a whole bunch of new metrics wasn't something we could do. There are alternatives to unique tokens and they should be used. Reading team now has a way of instrumenting their code. If we did want to do unique tokens, we would need more support from staff, and it was clear that it didn't have the staff support. We're handing over the metrics to the Reading team.

Geoff: Is there a process to know whether our current data collection is giving us what we need?

Kevin: we don't have a process. We don't know the gaps.

Dario: there needs to be a phase of data analysis. We need to know what we have before we iterate further.

Tilman: You said the Reading team *now* has a way to instrument, what did that refer to?

Kevin: I'm referring to session metrics. The Reading team can implement that on the client side and report back. [unclear] has to schedule that implementation.

Katherine: thank you. Socially and technically, we know this is complex. Thank you and the team for making the tradeoff. To be able to explore something and then step back from it is a sign of a mature decision-making process.

Geoff: good learning about communicating with staff. It felt like there were feedback loops

Kevin: yes from execs to product managers

Performance

Slide 32 (team)

Ori: 5 full time employees

Slide 33 - KPI: first paint time (graphs and charts)

Ori: the time from navigating to a page to the user seeing content rendered. We've been flat this quarter... slight regression. We have a good idea why, but it's not something we can do anything about at the moment. Box plot: whiskers are at the extremes (10th/90th percentile), line in the middle is the median. On the whole our 1 year graph looks pretty good.
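For readers unfamiliar with the box plot summary described here, a minimal sketch of how the whiskers (10th/90th percentile) and the median could be computed from raw first paint samples. It assumes linear interpolation between ranks; the notes do not say how the dashboard computes them, and the function names and sample values are made up.

```python
import math


def percentile(values, p):
    """Percentile p (0-100) of values, using linear interpolation between ranks."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    # Fractional position of the requested percentile in the sorted list.
    rank = (p / 100) * (len(ordered) - 1)
    lower = math.floor(rank)
    upper = math.ceil(rank)
    fraction = rank - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])


def box_plot_summary(samples_ms):
    """Whiskers at the 10th/90th percentile, middle line at the median."""
    return {
        "p10": percentile(samples_ms, 10),
        "median": percentile(samples_ms, 50),
        "p90": percentile(samples_ms, 90),
    }


if __name__ == "__main__":
    # Made-up first paint samples in milliseconds, for illustration only.
    samples = [620, 710, 850, 900, 1100, 1300, 1750, 2400, 3100]
    print(box_plot_summary(samples))
```

Using the 10th/90th percentile for the whiskers rather than the minimum and maximum keeps a handful of extreme outliers from stretching the plot.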

Slide 34 - SPDY usage vs time

Investigated the regression in first paint. It correlates inversely with the % of client connections that use SPDY.

(Google rolled out SPDY, a protocol that improves the efficiency of connections; it was standardized and rechristened as HTTP/2. The standard adopted is slightly different. Browsers are dropping SPDY and adding HTTP/2. Clients on older browsers won't get the benefit of either.)

Right now we only support SPDY. HTTP/2 is planned for later this month.

I.e. long story short: largely due to changes in browser support.

Slide 35 - KPI: Page save time

Page save time

Save->edited article loading

Better news: we're getting gains. Wide array of gains. Still fairly significant gains. Fastest connections editing hardest pages are seeing the biggest benefit.

Slide 36 - Performance inspector

Tool

Semi-hidden debugging tool picked up by editors

Our perf characteristics are not just a function of code; there is a high degree of variance between projects. Local admins have the ability to load gadgets by default by modifying CSS and JS.

I.e. differing CSS and JavaScript code on different wikis. We've had bad regressions on various wikis because of this. Idea was to make information available to those that can use it. Make it compelling and actionable to the right people. Indication of how long it would take to load on a 2G connection (for example).
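A back-of-the-envelope version of the "how long would this take on 2G" indication can be sketched as below. This is only an illustration, not the inspector's actual logic: the function name, the 50 kbit/s bandwidth and the round-trip figures are assumptions, and the page weights simply echo the 1.8 MB and 900k figures quoted in the discussion later in these notes.

```python
def estimate_transfer_seconds(page_bytes, bandwidth_bits_per_second,
                              round_trips=0, round_trip_seconds=0.0):
    """Rough load time: payload transfer time plus a fixed latency cost."""
    if bandwidth_bits_per_second <= 0:
        raise ValueError("bandwidth must be positive")
    transfer = (page_bytes * 8) / bandwidth_bits_per_second
    latency = round_trips * round_trip_seconds
    return transfer + latency


if __name__ == "__main__":
    # Assumption: 50 kbit/s stands in for a slow 2G link.
    slow_2g_bps = 50_000
    for label, size in [("before", 1_800_000), ("after", 900_000)]:
        seconds = estimate_transfer_seconds(size, slow_2g_bps,
                                            round_trips=4, round_trip_seconds=0.5)
        print(f"{label}: about {seconds:.0f} s")
```

The estimate ignores compression, caching and parallel connections, so it is best read as a relative indicator between page versions rather than a prediction.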

Have an early prototype (see short video in this slide), but development is ongoing.

Carrying over to next quarter

Slide 37 - High availability for MediaWiki / Leaner mobile web

Two added goals

1. Narrower and deeper subset. Taking MediaWiki, which was not written for multi-concurrency, and chasing down bugs and race conditions that prevent us from geographically distributing servers (data centers). April 19 switchover. Read-only test that simulates some conditions.

2. Goal of making the mobile website leaner. Segment based on bandwidth availability.

Made a lot of interesting progress, but ultimate feedback from designers was that even on high speed connections, high density solutions weren't worth it. so we disabled it.

Slide 38 - Granular Performance Dashboards

Substantially more data

Geo granularity

(demo of Grafana) NavigationTiming by geolocation

We can compare site performance on a per-country basis. Cool off the shelf tool we were able to deploy. Ongoing work on ...

Slide 39 - Contributors

MediaWiki availability. Aaron and Timo are prolific.

Slide 40 - Problems and Prospects

Problems:

Number 9 :-)
We can't always anticipate what is going to be the problem
Too much knowledge is bundled in too few people. Provide guidance, e.g. Timo is part of ArchCom, gives input to Reading team on lazy loading of images. Team members have a lot of expertise. Productize or projectize our work. Still don't know the impact of performance on wider

RobLa: compared to before, when there used to be a lot of lore trapped in a few experts' heads, you've done a good job of socializing this, giving people the ability to see the impact of what they do. That's huge.

Dario: often work not supported by designers

Geoff: because we don't have enough designers?

Dario: no designer assigned to Research, Perf, ... so we ask designer from Product team to help out

Geoff: is the reason you want design that you want more people to use things, more accessible?

Abbey: yes, to enable users to find and use it: findable, usable

Abbey: there's expertise.

Nuria: teams work better when they work on their core expertise. E.g. in Analytics, we are engineers. If you have to do UI and you don't know that, you step out of your core expertise, which is not efficient.

Dario: it makes sense that .... we have low designer->dev for a lot of our work

Ori: at least, here there is no risk of doing actual harm when doing experiments [as opposed to end user facing work]

Bigger risk: we try an idea, and reject it in the absence of design research. We toss ideas for "not having value".

Abbey: if you have something you already built, we can do heuristic eval, or a usability test. For a designer..

Wes: Do you use OOjs UI?

Ori: yes... what you saw in the prototype was based on OOUI.

Wes: Volker is working with the Editing team, good to get that solidified

Kevin: what has been the community reaction to reducing image quality?

Ori: there hasn't been a negative reaction I'm aware of. Rollout correlated with a lot of other things. Wasn't advertised as loudly or compellingly.

Tilman: it was on Tech News, that was the only time we highlighted it publicly

Wes: might be good to follow up on

Dario: it is noticeable. On graphs rendered as thumbs it's notable.

Ori: doesn't apply to PNGs. If you just consider impact on pages, 1.8 MB to 900k on w:Barack Obama article

Wes: Design Research assessment?

Ori: not done here, but (by others). lazy loading related but not entirely

Gabriel: lazy loading infrastructure should provide an opportunity to properly support high/low speed targeting in JS

Ori: good point...may allow for reintroducing in some specific instances

Gabriel: browsers don't take bandwidth into account, internal mechanism for that is kind of poor

Maggie: could you give slides?

Katherine: thank you. not sure about goal of making Performance team obsolete, but agree with the move to empower people

Design comments have been noted & related conversations with Heather and Arthur