Incidents

An incident is any unplanned event that causes a service disruption. Identifying and resolving incidents helps us improve: we make Sourcegraph a higher-quality product, we improve the processes that led up to or surrounded the incident, and we reduce friction in identifying future incidents.

Some examples of incidents:

- sourcegraph.com is down or a critical feature is broken (e.g. sign-in, search, code intel).
- sourcegraph.com is down for more than 5 minutes, a critical feature is down for more than 5 minutes, or more than 5 users have reported a service degradation issue. If you're unsure whether the incident's impact qualifies, ask @cs in Slack for advice.
- We have an issue (per our standard SLA definition) that impacts all or many self-hosted instances, all or many managed instances, or all or many Cloud/SaaS users.
- There is a security issue with Sourcegraph (if so, please also follow our security disclosure process).
- A Sourcegraph team member suspects an incident might be present, but isn't certain or can't confirm it on their own.
- We need to send critical, proactive 1-to-many communication to all self-hosted customers (for example, making them aware of something they need to do for a certain upgrade, like the prep needed before upgrading to 3.31). Over time, as we do more of this, we will likely create a separate process for it.
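
As a rough illustration only, the numeric thresholds in the examples above (more than 5 minutes of downtime, more than 5 users reporting degradation) could be sketched as a simple check. The `Observation` type, its field names, and the `qualifies_as_incident` function are hypothetical and not part of any Sourcegraph or incident.io tooling. The sketch covers only the threshold-based examples; security issues, SLA-level issues, and a teammate's suspicion still qualify on their own.

```python
from dataclasses import dataclass

# Thresholds taken from the examples above; both comparisons are strict ("more than").
DOWNTIME_THRESHOLD_MINUTES = 5
USER_REPORT_THRESHOLD = 5


@dataclass
class Observation:
    """Hypothetical summary of what a teammate has observed so far."""
    site_down_minutes: float = 0.0
    critical_feature_down_minutes: float = 0.0
    users_reporting_degradation: int = 0


def qualifies_as_incident(obs: Observation) -> bool:
    """Return True if any of the threshold-based examples is met."""
    return (
        obs.site_down_minutes > DOWNTIME_THRESHOLD_MINUTES
        or obs.critical_feature_down_minutes > DOWNTIME_THRESHOLD_MINUTES
        or obs.users_reporting_degradation > USER_REPORT_THRESHOLD
    )
```

A result of `False` here does not mean "do nothing": if the impact is unclear, the guidance above still applies (ask @cs in Slack, or report the suspected incident).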

Additionally, on big announcement days (funding, product launch, campaign launch, etc.), all incidents warrant more immediate attention from marketing so we can hold off on planned activities or be prepared to respond to issues. On these days, the person in marketing leading the announcement is responsible for looping in #customer-support and engineering/product ahead of time so they are aware of planned activities. The person leading the announcement will work with #customer-support on the ad-hoc plan for incidents (which may involve an on-call rotation).

All incidents are announced in the #announce-incidents channel automatically through incident.io, and past incidents are available in the incident.io dashboard and our past incident postmortems.

Prioritization

With the exception of security incidents, incident response should treat the restoration of services as a higher priority than resolving underlying technical issues. That is not to say that we don't pursue the underlying causes of an incident; rather, we should favor restoring service over gaining a deep or confident understanding of the underlying causes of the problem.

Process

Identification

Incidents can be identified by anyone (e.g. customers, Sourcegraph teammates) via incident.io.

Even incidents that might turn out to be false positives should be reported, to ensure that they are responded to and investigated with the same rigor as any incident, and that any lessons can be learnt.

The first Sourcegraph teammate (regardless of their role) who becomes aware of an incident, or suspects there may be an incident, is responsible for taking the following actions:

If the incident was reported by someone outside of Sourcegraph, acknowledge that the incident is being handled.
Start a new incident with the incident.io Slack bot: /incident
- Set the description and severity in the modal that opens in Slack.
- This will create a new Slack channel where all other communication should occur.
If you are not a member of product, engineering, or customer support, type the following into the Slack channel to page someone who can complete the rest of this list (otherwise proceed to the next step): /genie alert we have an incident, please help for customer-support
Identify folks to serve in the following roles:
- Incident Lead
- Messenger