Bug #1079
Status page (closed)
90%
Description
Hi,
When something goes wrong with GitHub, they have an externally hosted status.github.com where they point out two things:
1) Status of the system (green/red)
2) In case of red, some short comment posted by the sysadmins, e.g. an estimate of when the system will be back, what the problem is, etc.
Something like that would be wonderful to have (e.g. as status.documentfoundation.org), so that people noticing infra problems would stop coming to #tdf-infra and asking the same question again and again.
At FOSDEM 2015 Alex and Christian mentioned that such an external monitoring system (or something similar) is already in place, but it's not public. Hence this bug report. :-)
Thanks a lot,
Miklos
Updated by Florian Effenberger over 9 years ago
- Related to Bug #519: rework monitoring added
Updated by Miklos Vajna about 9 years ago
Copying from today's #tdf-infra IRC log:
14:11 <@cloph> this is in progress re the monitoring, although no ETA as to when there's a public facing status page
Updated by Beluga Beluga over 8 years ago
I think this could be closed as this exists now: http://status.documentfoundation.org/
Updated by Miklos Vajna over 8 years ago
I beg to disagree. The status.documentfoundation.org is nice to have, but how do I check there if e.g. the TDF sysadmins are aware that dev-downloads is down (just an example, currently it's up & running)? Check e.g. status.github.com: you can see the output of automatic monitoring there, and in case something is red, they usually post some "we're aware of it" type message by the time you notice the breakage.
Updated by Beluga Beluga over 8 years ago
Maybe something for the dashboard, then.
Updated by Jean Spiteri almost 8 years ago
Is this still valid/active? Will (or is) the dashboard public facing? Has a status site been implemented?
Updated by Beluga Beluga almost 8 years ago
Jean Spiteri wrote:
Is this still valid/active? Will (or is) the dashboard public facing? Has a status site been implemented?
See these:
https://fosdem.org/2017/schedule/event/development_dashboard_deployment/
https://fosdem.org/2017/schedule/event/development_dashboard/
Updated by Florian Effenberger almost 8 years ago
IMHO you're talking about http://status.documentfoundation.org ;-)
Updated by Beluga Beluga almost 8 years ago
Florian Effenberger wrote:
IMHO you're talking about http://status.documentfoundation.org ;-)
See Miklos's comment 4.
Updated by Florian Effenberger over 7 years ago
- Assignee set to Guilhem Moulin
- Target version set to Qlater
Ok, I see, and the answer is yes and no ;-)
The idea behind status.documentfoundation.org is to have a timely update in case something breaks, but I agree it's a bit outdated, will talk through this in today's team call
What's missing right now is the automated connection to our monitoring system
I'll add this to Guilhem's pile, but without a concrete target atm
This could also be something for infra newcomers as EasyHack maybe, to provide some automated output from our monitoring
I think the only difference between your idea and the status quo is the integration of the monitoring with a list of hosts - or am I missing something?
Updated by Miklos Vajna over 7 years ago
Correct, the status.tdf blog is nice, but if there is some private automatic monitoring already, then having part of that publicly visible would be great. This way if e.g. gerrit is down, developers can just go to that status page, and in case it's red there, then they know sysadmins are aware of the problem. (It adds to the confidence if sysadmins can write some one-liner status text next to a red item, but it's a smaller thing.)
Thanks. :-)
Updated by Florian Effenberger over 7 years ago
- Target version changed from Qlater to Q4/2017
Updated by Florian Effenberger over 6 years ago
- Related to Task #2210: monitoring notifications added
Updated by Florian Effenberger over 6 years ago
- Related to deleted (Bug #519: rework monitoring)
Updated by Florian Effenberger over 6 years ago
- Target version changed from Q4/2017 to Q2/2018
Updated by Christian Lohmaier over 6 years ago
Unfortunately the status page is on GitHub, and as the main infra guy doesn't have a GitHub account it is not really used for setting up maintenance info.
As for service status info, we now have Prometheus blackbox monitoring, but we currently lack a way to share that info publicly (we don't want to expose all metrics to the public). While Grafana (as visualization for the Prometheus data) supports sharing dashboards, those are either static or would need an account to restrict access.
We (Brett) are working on configuring alerting for Prometheus, so we can create notifications and status pages automatically in case something goes wrong.
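One way to share service status without exposing all metrics is to reduce the monitoring data to a plain up/down summary before publishing it. The following is only a sketch under assumptions, not what the infra team runs: it assumes the data comes from a Prometheus instant query for the blackbox exporter's `probe_success` metric, and the host names in the usage example are hypothetical.

```python
def public_status(query_response):
    """Reduce a Prometheus instant-query response to {instance: "up" | "down"}.

    Only the instance label and the probe outcome are kept, so no other
    labels or metrics leak into the public summary.
    """
    if query_response.get("status") != "success":
        raise ValueError("Prometheus query did not succeed")
    summary = {}
    for sample in query_response["data"]["result"]:
        instance = sample["metric"].get("instance", "unknown")
        _timestamp, value = sample["value"]  # the value arrives as a string
        summary[instance] = "up" if float(value) == 1.0 else "down"
    return summary


# Hypothetical response for the query `probe_success`
example = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"instance": "https://gerrit.example.org"},
             "value": [1520000000.0, "1"]},
            {"metric": {"instance": "https://downloads.example.org"},
             "value": [1520000000.0, "0"]},
        ],
    },
}
summary = public_status(example)
```

The resulting dictionary could then be rendered as a static page or pushed to whatever status page is chosen.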
Updated by Florian Effenberger over 6 years ago
Unfortunately the status page is on GitHub, and as the main infra guy
doesn't have a GitHub account it is not really used for setting up
maintenance info.
Then we should fix that ASAP. If our sysadmin can't update the page it's
totally worthless.
Can we maybe host it ourselves on the filoo machine, or otherwise, what is
blocking Guilhem from getting a GitHub account? We should solve this ASAP.
Updated by Guilhem Moulin over 6 years ago
status.tdf doesn't fit the requirement anyway. We need an API (to at least retrieve the status codes, possibly also some time metrics) to update the page in real time, and somewhere where admins can write down their findings, whether they're working on it, etc. Cf. the Feb 20 infra call minutes.
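To illustrate the kind of data meant here (a status code plus a time metric), a minimal probe could look like the sketch below. It is an assumption-laden example using only the Python standard library; the `probe` function, its `opener` parameter and the `ProbeResult` type are hypothetical names, not part of any tool mentioned in this ticket.

```python
import time
import urllib.error
import urllib.request
from typing import NamedTuple, Optional


class ProbeResult(NamedTuple):
    status_code: Optional[int]  # None when no HTTP response was received
    seconds: float              # wall-clock duration of the attempt


def probe(url, timeout=10.0, opener=urllib.request.urlopen):
    """Fetch url once and report the HTTP status code and elapsed time.

    `opener` is injectable so the function can be exercised without network.
    """
    start = time.monotonic()
    try:
        with opener(url, timeout=timeout) as response:
            code = response.status
    except urllib.error.HTTPError as err:
        code = err.code   # the server answered, but with an error status
    except OSError:
        code = None       # DNS failure, refused connection, timeout, ...
    return ProbeResult(code, time.monotonic() - start)
```

Calling `probe("https://status.documentfoundation.org/")` would return the status code and response time for that page; both values are the sort of thing a status page API would need to accept.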
Alternatives include https://hund.io/ (demo page https://status.lineageos.org/) for a third-party (hosted) solution and https://cachethq.io/ (demo page https://demo.cachethq.io/) or https://www.netlify.com/status-pages/ for free software solutions we can host ourselves. We didn't have time to test these alternatives yet, but I'd much rather have something we can host ourselves.
Updated by Florian Effenberger over 6 years ago
All fine for me, and I think things can be scaled and/or extended later on
What is crucial to me is that we inform people in time about issues and
that they have a reliable page where they can look things up if something
goes wrong - whether automatically via API or via a manually edited HTML page :)
Updated by Guilhem Moulin about 6 years ago
- Status changed from New to Feedback
- % Done changed from 0 to 90
For a few weeks now we have had a prod CachetHQ instance hosted at https://status.documentfoundation.org/ . The info is pretty basic and not meant to replace the more comprehensive infra metrics; HTTP status code and response time are automatically added using the Cachet API. Upon downtime the state of the relevant component is automatically changed to “outage”, and back to “operational” when it's back up.
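The automatic state change described above could be sketched roughly as follows. This is not the script actually used here (it is not shown in this ticket); it assumes the Cachet 2.x REST API, where a component is updated with a PUT to `/api/v1/components/{id}` authenticated by an `X-Cachet-Token` header, and where status 1 means operational and 4 means major outage. The function names, base URL, component id and token are placeholders.

```python
import json
import urllib.request

# Cachet component status values (assumed from the Cachet 2.x API)
OPERATIONAL = 1
MAJOR_OUTAGE = 4


def component_status(http_code):
    """Map a probed HTTP status code (or None for no answer) to a Cachet status."""
    if http_code is not None and 200 <= http_code < 400:
        return OPERATIONAL
    return MAJOR_OUTAGE


def build_component_update(base_url, component_id, token, status):
    """Build (but do not send) the request that sets a component's status."""
    body = json.dumps({"status": status}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/v1/components/{component_id}",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json", "X-Cachet-Token": token},
    )


# Placeholder values; sending would be urllib.request.urlopen(request)
request = build_component_update(
    "https://status.example.org/", 3, "secret-token", component_status(503)
)
```

Keeping request construction separate from sending makes the mapping easy to check without touching a live instance.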
Infra team members can also add scheduled outages (reboot, upgrades, etc.), and inform the community live about our progress in solving issues. The Atom feed is added to https://planet.documentfoundation.org already; individuals can also subscribe to a specific component and receive e-mail notifications on status change.
I plan to announce the service at LibOCon so I'm not closing this just yet, but I don't expect non-cosmetic changes now.
Updated by Guilhem Moulin about 6 years ago
- Status changed from Feedback to Closed