We have an ops on-call rotation managed through OpsGenie. This rotation covers operation of our public site sourcegraph.com.
If about, /search, or docs is fully unreachable, page Customer Support so they can help with customer and broader internal communication. You can do this from Slack with /genie alert "______ is down" for customer-support. Support assisting with communication will let you focus on solving the issue.
All significant incidents that occur on Sourcegraph.com are recorded in the ops incidents log. This helps us keep track of what has happened historically, discuss follow-up work, and gain insight into the types of incidents we see.
False incidents (flaky alerts, etc.) should be tracked directly in GitHub issues and do not need a log entry.
Most of the work you do as the on-call engineer should be discussed in #dev-ops, but check our Slack channels for alert and troubleshooting channels that you might find useful.
In some cases you may not know how to resolve an incident on your own, or the incident may be due to something you are not familiar with. In such cases, it is your responsibility to pull in more people or teams as needed to resolve the incident.
If the incident is not urgent (use your best judgement) and can be handled asynchronously over the course of a few days, file a GitHub issue and pull in teams that way by cc'ing them with e.g. @sourcegraph/cloud or @sourcegraph/distribution.
If the incident is urgent, start a new thread in the #dev-ops Slack channel and cc the on-call person from the relevant team. You can find out who that is in Slack using the /genie command:
/genie whoisoncall
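
The escalation rules above can be summarized as a small decision function. This is only an illustrative sketch, not tooling that exists in our repositories: the `Escalation` type and the `escalate` helper are hypothetical names, and the `team` argument is assumed to be a GitHub team slug such as `cloud` or `distribution`. Judging urgency is still up to the on-call engineer.

```python
from dataclasses import dataclass


@dataclass
class Escalation:
    """Hypothetical record of where to raise an incident and what to do there."""
    channel: str
    action: str


def escalate(urgent: bool, team: str) -> Escalation:
    """Map the urgency of an incident to the escalation path described above.

    `team` is assumed to be a GitHub team slug, e.g. "cloud" or "distribution".
    """
    if urgent:
        # Urgent: synchronous escalation in Slack, cc'ing the team's on-call person.
        return Escalation(
            channel="#dev-ops",
            action=f"start a thread and cc the {team} on-call "
                   "(look them up with /genie whoisoncall)",
        )
    # Not urgent: asynchronous escalation through a GitHub issue.
    return Escalation(
        channel="GitHub issue",
        action=f"file an issue and cc @sourcegraph/{team}",
    )


if __name__ == "__main__":
    print(escalate(urgent=False, team="distribution"))
```

For example, `escalate(urgent=True, team="cloud")` points to a thread in #dev-ops, while the non-urgent case points to a GitHub issue that cc's the team.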