On-call

We have an ops on-call rotation managed through OpsGenie. This rotation covers the operation of our public site, sourcegraph.com.

Responsibilities

Know when you are going to be on call. If you can't fulfill the responsibilities of on-call for any reason that week, swap with a teammate.
Acknowledge pages promptly. If you do not acknowledge within 10 minutes, someone else will get paged.
Identify the issue and collect information that might be useful for preventing the problem in the future (e.g. if a disk was full, what was it full with?).
- This frequently involves running kubectl commands against our production cluster.
- Make sure you have set up access to Kubernetes and know how to perform operations like looking at logs for a service, restarting a service, and getting a command shell in a running pod (e.g. to look at what is on disk).
If about, /search, or docs is fully unreachable, page Customer Support so they can help with customer and broader internal communication. You can do this from Slack with /genie alert "______ is down" for customer-support. Having Support handle communication lets you focus on solving the issue.
If you can, take steps to resolve the issue (e.g. if a disk was full, delete any data that is safe to delete to resolve the immediate issue).
- Don't mark pages as "resolved". Wait for the underlying alert to auto-resolve.
- If you have tried and failed to resolve a page, the page hasn't auto-resolved (many do), and the issue appears important (e.g. it impacts users), then get help from someone else. Prefer to contact people you know are already awake and working.
Add an entry to the ops incidents log.
File issues for any follow-up work that needs to happen.
If alerts are too noisy or unactionable, take action to fix or disable them.
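
The kubectl operations mentioned above (looking at logs, restarting a service, getting a shell in a pod, checking what is on disk) could look roughly like the sketch below. This is an illustration only, not our documented procedure: the namespace `prod`, the deployment `frontend`, and the helper function names are hypothetical placeholders, so substitute the real values for the cluster you are working on. The helpers default to a dry run that only prints each command, so nothing touches the cluster until you set `DRY_RUN=0`.

```shell
#!/usr/bin/env bash
# Hypothetical placeholders: "prod" and "frontend" are NOT taken from this page.
NAMESPACE="${NAMESPACE:-prod}"
DEPLOYMENT="${DEPLOYMENT:-frontend}"

# Print the command when DRY_RUN=1 (the default); execute it when DRY_RUN=0.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Look at the most recent log lines for a service.
service_logs() {
  run kubectl -n "$NAMESPACE" logs "deployment/$DEPLOYMENT" --tail=200
}

# Restart a service by triggering a rolling restart of its deployment.
restart_service() {
  run kubectl -n "$NAMESPACE" rollout restart "deployment/$DEPLOYMENT"
}

# Open a command shell in a running pod. $1 is the pod name.
pod_shell() {
  run kubectl -n "$NAMESPACE" exec -it "$1" -- sh
}

# Summarize disk usage one level deep inside a pod. $1 is the pod name, $2 the path.
pod_disk_usage() {
  run kubectl -n "$NAMESPACE" exec "$1" -- du -h -d 1 "$2"
}
```

For example, after sourcing the file, `pod_disk_usage some-pod /data` prints the command it would run, and `DRY_RUN=0 pod_disk_usage some-pod /data` actually runs it (the pod name and path here are made up). Keeping dry run as the default is a deliberate choice so that a restart is never triggered by accident during an incident.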

Ops incidents log

All significant incidents that occur on Sourcegraph.com are recorded in the ops incidents log. This helps us keep track of what has happened historically and discuss follow-up work, and it gives insight into the types of incidents we see.

False incidents (flaky alerts, etc.) should be tracked directly in GitHub issues and do not need a log entry.

Slack channels

Most of the work you do as the on-call engineer should be discussed in #dev-ops, but check our other Slack channels for alert and troubleshooting channels that you might find useful.

Getting help

In some cases you may not know how to resolve an incident on your own, or the incident may be due to something you are not familiar with. In such cases, it is your responsibility to pull in more people or teams as needed to resolve the incident.

If the incident is not urgent (use your best judgement) and can be handled asynchronously over the course of a few days, file a GitHub issue and pull in the relevant teams by cc'ing them (e.g. @sourcegraph/cloud or @sourcegraph/distribution).

If the incident is urgent, start a new thread in the #dev-ops Slack channel and cc the on-call person from the relevant team. You can find out who is on call from Slack using the /genie command:

/genie whoisoncall