Task #2145

Frequent 504 Gateway timeouts in Bugzilla upon bug change

Added by Aron Budea 11 months ago. Updated 4 months ago.

Status:
In Progress
Priority:
Normal
Category:
-
Target version:
Team - Qlater
Start date:
Due date:
% Done:

0%

Estimated time:
Tags:
URL:

Description

When adding a comment/changing a bug we often hit the following error: "504 Gateway Time-out" / "nginx/1.2.1"
The comment/change goes through, but the user could incorrectly think it didn't, goes back and comments again.

This is somewhat of a recurring issue, sometimes I don't see it for weeks, other times hit it multiple times.

History

#1 Updated by Aron Budea 11 months ago

It's also coupled with extreme Bugzilla slowness :/.

#2 Updated by Florian Effenberger 11 months ago

  • Assignee set to Guilhem Moulin
  • Target version set to Pool

Guilhem, can you have a look?

#3 Updated by Guilhem Moulin 11 months ago

  • Assignee deleted (Guilhem Moulin)
  • Target version deleted (Pool)

Yes I'm not it. I restarted PostgreSQL with some tweaks; it's usually ~instant, but now it seems insanely slow at starting up… bugs.tdf has been down for 20mins already :-( :-(

#4 Updated by Guilhem Moulin 11 months ago

  • Assignee set to Guilhem Moulin

Sorry, I didn't see that I had removed myself as Assignee; I guess it's because I had the page loaded before I got your message.

And of course I meant "I'm on it", sorry for the confusion.

#5 Updated by Guilhem Moulin 11 months ago

Restarting PostgreSQL and forcing a VACUUM seem to have significantly improved auto-completion. I also tweaked the config (which was the reason for the restart in the first place), which should improve write queries.

I'll leave the bug priority at High in the meantime, as this isn't a proper fix; I'll keep investigating.
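For the record, the kind of manual maintenance described above can be run like this (a sketch only; connecting via `psql` and the exact options are assumptions, not necessarily what was run here):

```sql
-- Sketch: force a vacuum plus a statistics refresh on the Bugzilla database
-- (run from a psql session connected to that database; options are illustrative)
VACUUM (VERBOSE, ANALYZE);
```

`ANALYZE` refreshes the planner statistics at the same time, which is what typically helps slow read queries such as auto-completion.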

#6 Updated by Florian Effenberger 10 months ago

  • Target version set to Pool

Any update? Have the problems been solved?

#7 Updated by Beluga Beluga 10 months ago

Florian Effenberger wrote:

Any update? Have the problems been solved?

More investigation is needed, as the problem has reappeared several times since this was filed. It should be noted that these have always appeared with our self-hosted BZ. Not sure if the cause has always been the same.

#8 Updated by Florian Effenberger 10 months ago

More investigation is needed, as the problem has reappeared several times since this was filed. It should be noted that these have always appeared with our self-hosted BZ. Not sure if the cause has always been the same.

Do you have any timestamps, so we could look into the logs?

#9 Updated by Florian Effenberger 8 months ago

Any updates, or can we close this ticket?

#10 Updated by Florian Effenberger 8 months ago

Florian Effenberger wrote:

Any updates, or can we close this ticket?

Ping?

#11 Updated by Guilhem Moulin 8 months ago

I'm still doing regular manual vacuums for now. I think it's best to keep the ticket open until we find a decent autovacuum configuration.
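A "decent autovacuum configuration" for a write-heavy Bugzilla database might look something like the following (a sketch only; the thresholds are illustrative assumptions, not the values actually deployed here):

```sql
-- Sketch: make autovacuum trigger earlier and work faster on large,
-- frequently updated tables (values are illustrative assumptions)
ALTER SYSTEM SET autovacuum_vacuum_scale_factor = 0.05;   -- default 0.2
ALTER SYSTEM SET autovacuum_analyze_scale_factor = 0.02;  -- default 0.1
ALTER SYSTEM SET autovacuum_vacuum_cost_limit = 1000;     -- more work per cycle
SELECT pg_reload_conf();                                  -- apply without a restart
```

Lowering the scale factors makes autovacuum kick in after a smaller fraction of a table has changed, which is usually what removes the need for regular manual vacuums.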

#12 Updated by Florian Effenberger 5 months ago

Guilhem Moulin wrote:

I'm still doing regular manual vacuums for now. I think it's best to keep the ticket open until we find a decent autovacuum configuration.

Any updates on the situation?

#13 Updated by Guilhem Moulin 4 months ago

During the past 52 days we've had "only" 40 of these Gateway Time-outs, for a total of just under 1.4M requests to the fastcgi server (incl. 64k requests to the REST API). So while we could probably tune PostgreSQL better, I'm now tempted to close this, or at least downgrade the severity.

Moreover, 12 of these 40 failed requests came from our own infra (the wiki querying the REST API). Looking at the timestamps, they mostly come in batches, and I could correlate 2 batches with the following gluster heals (dates are UTC):

- 6x on 2017-07-14 from 10:10 to 10:15 [freeze+reboot of charly]
- 9x on 2017-08-03 from 15:30 to 16:00 [corruption of delta volume]

#14 Updated by Florian Effenberger 4 months ago

During the past 52 days we've had "only" 40 of these Gateway Time-outs, for a total of just under 1.4M requests to the fastcgi server (incl. 64k requests to the REST API). So while we could probably tune PostgreSQL better, I'm now tempted to close this, or at least downgrade the severity.

I heard no complaints either, so how about setting it to normal priority and Qlater, so we can revisit later in the year?

#15 Updated by Guilhem Moulin 4 months ago

  • Priority changed from High to Normal
  • Target version changed from Pool to Qlater

Florian Effenberger wrote:

I heard no complaints either, so how about setting it to normal priority and Qlater, so we can revisit later in the year?

Sure, done.

#16 Updated by Guilhem Moulin 4 months ago

  • Status changed from New to In Progress
