Task #2145

closed

Frequent 504 Gateway timeouts in Bugzilla upon bug change

Added by Aron Budea almost 8 years ago. Updated almost 7 years ago.

Status:
Closed
Priority:
Normal
Category:
-
Target version:
Team - Qlater
Start date:
Due date:
% Done:
0%

Tags:

Description

When adding a comment or changing a bug, we often hit the following error: "504 Gateway Time-out" / "nginx/1.2.1".
The comment/change actually goes through, but the user may incorrectly think it didn't, go back, and comment again.

This is somewhat of a recurring issue: sometimes I don't see it for weeks, other times I hit it multiple times.

Actions #1

Updated by Aron Budea almost 8 years ago

It's also coupled with extreme Bugzilla slowness :/.

Actions #2

Updated by Florian Effenberger almost 8 years ago

  • Assignee set to Guilhem Moulin
  • Target version set to Pool

Guilhem, can you have a look?

Actions #3

Updated by Guilhem Moulin almost 8 years ago

  • Assignee deleted (Guilhem Moulin)
  • Target version deleted (Pool)

Yes I'm not it. I restarted PostgreSQL with some tweaks; it's usually ~instant, but now it seems insanely slow at starting up… bugs.tdf has been down for 20mins already :-( :-(

Actions #4

Updated by Guilhem Moulin almost 8 years ago

  • Assignee set to Guilhem Moulin

Sorry, I didn't notice I had removed myself as Assignee; I guess it's because I had the page loaded before I got your message.

And of course I meant "I'm on it", sorry for the confusion.

Actions #5

Updated by Guilhem Moulin almost 8 years ago

Restarting PostgreSQL and forcing a VACUUM seem to have significantly improved auto-completion. I also tweaked the config (which was the reason for the restart in the first place), which should improve write queries.

I'm leaving the ticket priority at High in the meantime, though, since this isn't a proper fix; I'll keep investigating.
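
For reference, a manual VACUUM of the kind described above can be scripted. The following is a minimal sketch using psycopg2, assuming direct access to the Bugzilla database; the database name and table list are placeholders, not the actual TDF setup.

    #!/usr/bin/env python3
    """Minimal sketch: force VACUUM ANALYZE on a few Bugzilla tables.

    Assumes local access to the database; the database name and the
    table list below are placeholders, not the actual configuration.
    """
    import psycopg2

    TABLES = ["bugs", "longdescs", "bugs_activity"]  # hypothetical selection

    conn = psycopg2.connect(dbname="bugzilla")  # placeholder connection details
    conn.autocommit = True  # VACUUM cannot run inside a transaction block

    with conn.cursor() as cur:
        for table in TABLES:
            # ANALYZE refreshes planner statistics at the same time
            cur.execute(f"VACUUM (VERBOSE, ANALYZE) {table};")

    conn.close()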

Actions #6

Updated by Florian Effenberger almost 8 years ago

  • Target version set to Pool

Any update? Have the problems been solved?

Actions #7

Updated by Beluga Beluga almost 8 years ago

Florian Effenberger wrote:

Any update? Have the problems been solved?

More investigation is needed, as the problem has reappeared several times after this was filed. It should be noted that these have always appeared with our self-hosted BZ. Not sure if the cause has always been the same.

Actions #8

Updated by Florian Effenberger almost 8 years ago

More investigation is needed, as the problem has reappeared several times
after this was filed. It should be noted that these have always appeared
with our self-hosted BZ. Not sure if the cause has always been the same.

Do you have any timestamps, so we could look into the logs?

Actions #9

Updated by Florian Effenberger over 7 years ago

Any updates, or can we close this ticket?

Actions #10

Updated by Florian Effenberger over 7 years ago

Florian Effenberger wrote:

Any updates, or can we close this ticket?

Ping?

Actions #11

Updated by Guilhem Moulin over 7 years ago

I'm still doing regular manual vacuums for now. I think it's best to keep the ticket open until we find a decent autovacuum configuration.
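
As an illustration of what tuning autovacuum involves, dead-tuple counts and last (auto)vacuum times can be read from pg_stat_user_tables. The sketch below is only an assumed way to monitor this, not the procedure actually used here; connection details are placeholders.

    #!/usr/bin/env python3
    """Sketch: list the tables with the most dead tuples and their last
    vacuum times, to help judge whether autovacuum is keeping up."""
    import psycopg2

    QUERY = """
    SELECT relname, n_dead_tup, n_live_tup, last_vacuum, last_autovacuum
    FROM   pg_stat_user_tables
    ORDER  BY n_dead_tup DESC
    LIMIT  10;
    """

    conn = psycopg2.connect(dbname="bugzilla")  # placeholder connection details
    cur = conn.cursor()
    cur.execute(QUERY)
    for relname, dead, live, last_vac, last_auto in cur.fetchall():
        print(f"{relname}: {dead} dead / {live} live tuples, "
              f"last VACUUM {last_vac}, last autovacuum {last_auto}")
    cur.close()
    conn.close()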

Actions #12

Updated by Florian Effenberger over 7 years ago

Guilhem Moulin wrote:

I'm still doing regular manual vacuums for now. I think it's best to keep the ticket open until we find a decent autovacuum configuration.

Any updates on the situation?

Actions #13

Updated by Guilhem Moulin over 7 years ago

During the past 52 days we've had "only" 40 of these Gateway Time-outs, out of a total of just under 1.4M requests to the fastcgi server (incl. 64k requests to the REST API). So while we could probably tune PostgreSQL better, I'm now tempted to close this, or at least downgrade the severity.

Moreover, 12 of these 40 failed requests came from our own infra (the wiki querying the REST API). Looking at the timestamps, they mostly come in batches, and I could correlate 2 batches with the following gluster heals (dates are UTC):

- 6x on 2017-07-14 from 10:10 to 10:15 [freeze+reboot of charly]
- 9x on 2017-08-03 from 15:30 to 16:00 [corruption of delta volume]
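
The counts above presumably come from the web server logs. As a rough illustration, the sketch below tallies 504 responses against total requests in an nginx access log and keeps their timestamps for correlation; the log path and the assumption of the default "combined" format are placeholders, not the actual TDF configuration.

    #!/usr/bin/env python3
    """Rough sketch: count 504 responses vs. total requests in an nginx
    access log, assuming the default "combined" format."""
    import re
    from collections import Counter

    # combined format: ... [time_local] "REQUEST" STATUS BYTES ...
    LINE_RE = re.compile(r'\[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" (?P<status>\d{3}) ')

    total = 0
    statuses = Counter()
    timeouts = []

    with open("/var/log/nginx/access.log") as log:  # placeholder path
        for line in log:
            m = LINE_RE.search(line)
            if not m:
                continue
            total += 1
            statuses[m.group("status")] += 1
            if m.group("status") == "504":
                timeouts.append(m.group("time"))  # keep timestamps for correlation

    print(f"{statuses['504']} gateway timeouts out of {total} requests")
    for ts in timeouts:
        print("504 at", ts)
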
Actions #14

Updated by Florian Effenberger over 7 years ago

During the past 52 days we've had "only" 40 of these Gateway Time-outs,
out of a total of just under 1.4M requests to the fastcgi server (incl. 64k
requests to the REST API). So while we could probably tune PostgreSQL
better, I'm now tempted to close this, or at least downgrade the severity.

I heard no complaints either, so how about setting a normal priority and
Qlater, so we can revisit later in the year?

Actions #15

Updated by Guilhem Moulin over 7 years ago

  • Priority changed from High to Normal
  • Target version changed from Pool to Qlater

Florian Effenberger wrote:

I heard no complaints either, so how about setting a normal priority and
Qlater, so we can revisit later in the year?

Sure, done.

Actions #16

Updated by Guilhem Moulin over 7 years ago

  • Status changed from New to In Progress

Actions #17

Updated by Florian Effenberger almost 7 years ago

Any update? Can this be closed?

Actions #18

Updated by Guilhem Moulin almost 7 years ago

  • Status changed from In Progress to Closed

Closing indeed. I still see a handful of timeouts in the logs, but that was about .0005% of all CGI/REST requests issued during the past 2 months. And we haven't heard any further complaints.
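
For scale, assuming traffic stayed at roughly the earlier level of 1.4M requests per 52 days, .0005% over two months works out to about 8 requests, which is consistent with "a handful":

    # Back-of-the-envelope check, assuming the earlier traffic level held steady
    requests_per_day = 1_400_000 / 52      # ~27k requests/day (52-day figure above)
    two_months = 61 * requests_per_day     # ~1.64M requests in two months
    timeouts = two_months * 0.0005 / 100   # .0005% of those requests
    print(round(timeouts))                 # ~8 requests, i.e. "a handful"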
