An incident is any unplanned event that causes a service disruption. Identifying and resolving incidents helps us improve: we make Sourcegraph a higher-quality product, we improve the processes that led to or surrounded the incident, and we reduce friction in identifying future incidents.

Some examples of incidents:

Additionally, on big announcement days (funding, product launch, campaign launch, etc.), all incidents warrant more immediate attention from marketing so that we can hold off on planned activities or be prepared to respond to issues. On these days, the person in marketing leading the announcement is responsible for looping in #customer-support and engineering/product ahead of time to ensure they are aware of planned activities. The person leading the announcement will work with #customer-support on the ad-hoc plan for incidents (which may involve an on-call rotation).

All incidents are announced automatically in the #announce-incidents channel through incident.io, and past incidents are available in the incident.io dashboard and in our past incident postmortems.

Prioritization

With the exception of security incidents, responders should treat the restoration of service as a higher priority than resolving the underlying technical issues. That is not to say that we don’t pursue the underlying causes of an incident; rather, we should bias in favor of restoring service over gaining a deep understanding of, or confidence in our understanding of, the underlying causes of the problem.

Process

Identification

Incidents can be identified by anyone (e.g. customers, Sourcegraph teammates) via incident.io.

Even incidents that might turn out to be false positives should be reported, to ensure that they are responded to and investigated with the same rigor as any incident, and that any lessons can be learnt.

The first Sourcegraph teammate (regardless of their role) who becomes aware of an incident, or suspects there may be an incident, is responsible for taking the following actions:

  1. If the incident was reported by someone outside of Sourcegraph, acknowledge that the incident is being handled.
  2. Start a new incident with the incident.io Slack bot: /incident
  3. If you are not a member of product, engineering, or customer support, type the following into the Slack channel to page someone who can complete the rest of this list (otherwise proceed to the next step): /genie alert we have an incident, please help for customer-support