Bug #1079

Status page

Added by Miklos Vajna over 3 years ago. Updated 3 days ago.

Status:
Closed
Priority:
Normal
Category:
-
Target version:
Team - Q2/2018
% Done:
90%


Description

Hi,

When something goes wrong with GitHub, they have an externally hosted status.github.com that shows two things:

1) Status of the system (green/red)

2) In case of red, a short comment posted by the sysadmins, e.g. an estimate of when the system will be back, what the problem is, etc.

Something like that would be wonderful to have (e.g. as status.documentfoundation.org), so that people noticing infra problems would stop coming to #tdf-infra and asking the same question again and again.

At FOSDEM 2015 Alex and Christian mentioned that such an external monitoring system (or something similar) is already in place, but it's not public. Hence this bug report. :-)

Thanks a lot,

Miklos


Related issues

Related to Infrastructure - Task #2210: monitoring notifications (In Progress)

History

#1 Updated by Florian Effenberger over 3 years ago

  • Related to Bug #519: rework monitoring added

#2 Updated by Miklos Vajna almost 3 years ago

Copying from today's #tdf-infra IRC log:

14:11 <@cloph> this is in progress re the monitoring, although no ETA as to when there's a public facing status page

#3 Updated by Beluga Beluga over 2 years ago

I think this could be closed as this exists now: http://status.documentfoundation.org/

#4 Updated by Miklos Vajna over 2 years ago

I beg to disagree. status.documentfoundation.org is nice to have, but how do I check there whether e.g. the TDF sysadmins are aware that dev-downloads is down (just an example, currently it's up & running)? Check e.g. status.github.com: you can see the output of automatic monitoring there, and in case something is red, they usually post a "we're aware of it" type message by the time you notice the breakage.

#5 Updated by Beluga Beluga over 2 years ago

Maybe something for the dashboard, then.

#6 Updated by Jean Spiteri over 1 year ago

Is this still valid/active? Will (or is) the dashboard public facing? Has a status site been implemented?

#7 Updated by Beluga Beluga over 1 year ago

Jean Spiteri wrote:

Is this still valid/active? Will (or is) the dashboard public facing? Has a status site been implemented?

See these:
https://fosdem.org/2017/schedule/event/development_dashboard_deployment/
https://fosdem.org/2017/schedule/event/development_dashboard/

#8 Updated by Florian Effenberger over 1 year ago

IMHO you're talking about http://status.documentfoundation.org ;-)

#9 Updated by Beluga Beluga over 1 year ago

Florian Effenberger wrote:

IMHO you're talking about http://status.documentfoundation.org ;-)

See Miklos's comment 4.

#10 Updated by Florian Effenberger over 1 year ago

  • Assignee set to Guilhem Moulin
  • Target version set to Qlater

Ok, I see, and the answer is yes and no ;-)
The idea behind status.documentfoundation.org is to have a timely update in case something breaks, but I agree it's a bit outdated; I will talk this through in today's team call.

What's missing right now is the automated connection to our monitoring system.
I'll add this to Guilhem's pile, but without a concrete target atm.
This could also be something for infra newcomers as an EasyHack maybe, to provide some automated output from our monitoring.

I think the only thing missing between your idea and the status quo is the integration of the monitoring with a list of hosts - or am I missing something?

#11 Updated by Miklos Vajna over 1 year ago

Correct, the status.tdf blog is nice, but if there is some private automatic monitoring already, then having part of that publicly visible would be great. This way if e.g. gerrit is down, developers can just go to that status page, and in case it's red there, then they know sysadmins are aware of the problem. (It adds to the confidence if sysadmins can write some one-liner status text next to a red item, but it's a smaller thing.)

Thanks. :-)

#12 Updated by Florian Effenberger over 1 year ago

  • Target version changed from Qlater to Q4/2017

#13 Updated by Florian Effenberger 5 months ago

  • Related to Task #2210: monitoring notifications added

#14 Updated by Florian Effenberger 5 months ago

  • Related to deleted (Bug #519: rework monitoring)

#15 Updated by Florian Effenberger 5 months ago

  • Target version changed from Q4/2017 to Q2/2018

#16 Updated by Christian Lohmaier 4 months ago

unfortunately the status page is on github, and as the main infra guy doesn't have github account it is not really used for setting up maintenance info.

As for service status info, we now have Prometheus blackbox monitoring, but we currently lack a way to share that info publicly (we don't want to expose all metrics to the public). While Grafana (as visualization for the Prometheus data) supports sharing dashboards, those are either static or would need accounts to restrict access.

We (Brett) are working on configuring alerting for Prometheus, so we can create notifications and status pages automatically in case something goes wrong.
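
One way to narrow the gap described above (an editor's sketch, not something the infra team is known to have deployed) is to reduce the blackbox probe results to a plain up/down summary before publishing anything. The example assumes the usual JSON shape of a Prometheus instant query for the blackbox exporter's `probe_success` metric; the hostnames, the sample data and the `public_status` helper are hypothetical.

```python
import json


def public_status(prometheus_response: str) -> dict:
    """Reduce an instant-query response for probe_success to
    {target: "up" | "down"}, dropping every other label and metric."""
    payload = json.loads(prometheus_response)
    if payload.get("status") != "success":
        raise ValueError("Prometheus query did not succeed")

    summary = {}
    for sample in payload["data"]["result"]:
        # "instance" is the probed target in a typical blackbox setup
        # (assumption: relabelling follows the common example config).
        target = sample["metric"].get("instance", "unknown")
        _timestamp, value = sample["value"]  # value arrives as a string
        summary[target] = "up" if float(value) == 1.0 else "down"
    return summary


# Hypothetical response, shaped like GET /api/v1/query?query=probe_success
SAMPLE = json.dumps({
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"__name__": "probe_success",
                        "instance": "https://gerrit.example.org",
                        "job": "blackbox"},
             "value": [1519120000.0, "1"]},
            {"metric": {"__name__": "probe_success",
                        "instance": "https://dev-downloads.example.org",
                        "job": "blackbox"},
             "value": [1519120000.0, "0"]},
        ],
    },
})

if __name__ == "__main__":
    for target, state in public_status(SAMPLE).items():
        print(f"{target}: {state}")
```

Only the reduced summary would be published, so the full metric set stays private.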

#17 Updated by Florian Effenberger 4 months ago

unfortunately the status page is on github, and as the main infra guy
doesn't have github account it is not really used for setting up
maintenance info.

Then we should fix that ASAP. If our sysadmin can't update the page, it's
totally worthless.

Can we maybe host it ourselves on the filoo machine, or otherwise what is
blocking Guilhem from getting a GitHub account? We should solve this ASAP.

#18 Updated by Guilhem Moulin 4 months ago

status.tdf doesn't fit the requirement anyway. We need an API (to at least retrieve the status codes, possibly also some time metrics) to update the page in real time, and somewhere where admins can write down their findings, whether they're working on it, etc. Cf. the Feb 20 infra call minutes.

Alternatives include https://hund.io/ (demo page https://status.lineageos.org/) for a third-party (hosted) solution and https://cachethq.io/ (demo page https://demo.cachethq.io/) or https://www.netlify.com/status-pages/ for free software solutions we can host ourselves. We haven't had time to test these alternatives yet, but I'd much rather have something we can host ourselves.

#19 Updated by Florian Effenberger 4 months ago

All fine for me, and I think things can be scaled and/or extended later on.
What is crucial to me is that we inform people in time about issues and
that they have a reliable page where they can look things up if something goes
wrong - whether automatically via API or a manually edited HTML page :)

#20 Updated by Guilhem Moulin about 1 month ago

  • Status changed from New to Feedback
  • % Done changed from 0 to 90

For a few weeks now we have had a production CachetHQ instance hosted at https://status.documentfoundation.org/ . The info is pretty basic and not meant to replace the more comprehensive infra metrics; HTTP status code and response time are automatically added using the Cachet API. Upon downtime the state of the relevant component is automatically changed to “outage”, and back to “operational” when it's back up.
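
The automatic state change can be pictured roughly as follows. This is an illustrative sketch only, not the script actually used by the infra team: it assumes Cachet's v1 REST API (a `PUT` to `/api/v1/components/{id}` authenticated with an `X-Cachet-Token` header, where status 1 means operational and 4 means major outage), and the base URL, component ID, token and the rule "2xx/3xx means operational" are all hypothetical. The request is only built, never sent.

```python
import json
import urllib.request
from typing import Optional

# Cachet component status codes (per the Cachet v1 API, as assumed here).
OPERATIONAL = 1
MAJOR_OUTAGE = 4


def component_status(http_status: Optional[int]) -> int:
    """Map a probe result to a Cachet status.

    None means the probe got no HTTP response at all (timeout, DNS, ...).
    Treating 2xx/3xx as healthy is an assumption of this sketch.
    """
    if http_status is not None and 200 <= http_status < 400:
        return OPERATIONAL
    return MAJOR_OUTAGE


def build_update(base_url: str, component_id: int,
                 http_status: Optional[int], token: str) -> urllib.request.Request:
    """Build (but do not send) the request that updates one component."""
    body = json.dumps({"status": component_status(http_status)}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/api/v1/components/{component_id}",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json",
                 "X-Cachet-Token": token},
    )


if __name__ == "__main__":
    # Hypothetical values; a real run would pass the request to urlopen().
    req = build_update("https://status.example.org", 3, 503, "secret-token")
    print(req.get_method(), req.full_url, req.data.decode("utf-8"))
```

Keeping the mapping separate from the request building makes it easy to adjust what counts as an outage without touching the API call.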

Infra team members can also add scheduled outages (reboot, upgrades, etc.), and inform the community live about our progress in solving issues. The Atom feed is added to https://planet.documentfoundation.org already; individuals can also subscribe to a specific component and receive e-mail notifications on status change.

I plan to announce the service at LibOCon so I'm not closing this just yet, but I don't expect non-cosmetic changes now.

#21 Updated by Miklos Vajna about 1 month ago

Thanks a lot! :-)

#22 Updated by Guilhem Moulin 4 days ago

  • Status changed from Feedback to Closed

#23 Updated by Florian Effenberger 3 days ago

Good news indeed! :)
