Community Wishlist Survey 2022/Miscellaneous/Centralized Incident Management

From Meta, a Wikimedia project coordination wiki

Centralized Incident Management

  • Problem: When reports about production issues are opened on WMF projects, it is difficult for those reporting the issue to know when the production issue is purportedly resolved.
  • Proposed solution: Expand the use case for Phabricator to include incidents and problems, instead of only tasks for solutions.
  • Who would benefit: Users
  • More comments: Phabricator states are really only used to indicate that software changes have been created or merged, not that they have actually been deployed. People with issues are generally trying to open "incident reports", not "software requests". They are telling a user story, but their story doesn't end until the problem no longer occurs for them.
  • Phabricator tickets:
  • Proposer: — xaosflux Talk 18:22, 10 January 2022 (UTC)

Discussion

We currently use Phabricator for bug tracking, feature requests, and other items. A common scenario is that contributors from a WMF project find something malfunctioning, possibly on multiple WMF projects at once. What happens next: one or more Phabricator tasks get opened to report instances. This works OK, and bug wranglers will merge duplicate reports. Next, if the problem is confirmed, someone may decide to work on it. Commonly the problem is code in need of improvement, so someone working on it may write new code, and that code may be released. At this point all of the tracking comes to an end. Notably, however, this does not mean that the original reporters' actual production problem is resolved. Why? Because releasing new code doesn't mean that WMF servers are actually running it, and it certainly doesn't mean that the reporters have accepted the resolution in production. What is lacking here? Tracking of the actual incident and/or problem from report to resolution. Additionally, no information about the priority (the impact and urgency) of the incident or problem is gathered (a priority field is tracked, but it can be misleading, because it records the priority that software developers declare). The primary place where incident tracking may be occurring is on disparate wiki pages across all projects. So what is lacking is a process and system to manage and track actual incident reports. This could be Phabricator; however, it is going to require more of a human element and a change in mindset than just technology. Should this be a call for help to the volunteer community, or a request that WMF assign some of these "service desk" functions to staff? I'm not sure which is best, but additional ideas are welcome below! xaosflux Talk 18:22, 10 January 2022 (UTC)

I think this is often implemented by having two different states, like To Deploy/Done. This way, things that are fixed in code can be moved to To Deploy once they are committed, and to Done only when the change is live. It seems a very easy fix in technical terms, but probably a big change for developer workflows. MarioGom (talk) 08:18, 12 January 2022 (UTC)
Even a workflow like that may help; a user story or incident isn't "done" until the fix for the situation leading to it is actually usable. — xaosflux Talk 14:36, 12 January 2022 (UTC)
I do note that we have deploy labels on tickets WITH dates of when something starts entering production. It's just that this does not match where normal users expect to look. Then again, closed is a special state, which has all kinds of workflow effects that are generally rather important for developers, so splitting it in two in the software is not that easy. But I agree that it is confusing, especially when some WMF teams deviate from what happens in most of the other tickets. —TheDJ (talk • contribs) 16:22, 12 January 2022 (UTC)
I also agree it's awfully confusing. Those deploy labels can be wrong, such as this week when the deployment train was halted. For those who don't know, the most exact way to tell is to: (1) go to the Phab task; (2) look at the linked Gerrit patches; (3) on the Gerrit patch, find the "INCLUDED IN" link on the right, above the list of files, which tells you what branches the patch lives on; (4) go to toolforge:versions and see if the branch you need is on your wiki. Easy, right? :) Making that process easier is something worthy of a proposal on its own. Maybe a Phabricator extension or something that talks to whatever it needs to in order to tell us definitively whether a change is live on a given wiki.
@Xaosflux There are a lot of good ideas here. At the very least, I ask that we expand on the "Problem statement" (e.g. "I have no way of knowing whether the fix for T12345 has been deployed"), and then maybe expand on the proposed solutions. This year proposals get marked for translation, but for obvious reasons the comments do not. So non-English readers won't be able to infer much from what you have now :) Thanks for starting this conversation, and for participating in the survey! MusikAnimal (WMF) (talk) 05:09, 14 January 2022 (UTC)
@MusikAnimal (WMF): see above - better? I mostly understand the current phab process - but I'm certain 99%+ of users of our system do not - which is why I know this is a problem :) — xaosflux Talk 11:07, 14 January 2022 (UTC)
Better, yes! I may make a few tweaks for translatability but this otherwise looks good and is a great proposal. Thanks, MusikAnimal (WMF) (talk) 03:38, 19 January 2022 (UTC)

T280 and T88136 are somewhat related. --Tgr (talk) 00:02, 30 January 2022 (UTC)

Voting