Task #2981


Help Online: Optionally compress static pages

Added by Guilhem Moulin about 4 years ago. Updated over 3 years ago.

Target version:
Team - Q1/2020

Help online packages are rather large, at 1.8GiB per release. Xapian indices (for full-text search, cf. #2555), including spelling/stemming suggestions, add another 1.9GiB per release on top of that, which raises some concerns in terms of scalability and sustainability from an infrastructure perspective.

6.4 has 164735 files right now; here are the extensions weighing ≥4MiB:

ext   #files  avg. size   tot size
----  ------  ---------  ---------
svg     4341    1.16kiB    4.91MiB
ods      329   20.53kiB    6.59MiB
png      724   13.88kiB    9.81MiB
js       204  276.81kiB   55.96MiB
html  159061    9.43kiB 1464.93MiB
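
For reference, a breakdown like the one above can be approximated with find and awk. This is only a sketch, not necessarily how the numbers in the table were produced: ext_sizes is a hypothetical helper name, it assumes GNU find (for -printf) and file names without tabs or newlines, and it reports sizes in bytes for every extension instead of filtering on the 4MiB threshold.

```shell
# Hypothetical helper: ext_sizes DIR
# Prints one tab-separated line per extension:
#   extension, number of files, average size (bytes), total size (bytes)
# sorted by total size, smallest first. Assumes GNU find (-printf).
ext_sizes() {
    find "$1" -type f -printf '%f\t%s\n' |
    awk -F '\t' '
        {
            # The extension is whatever follows the last dot in the file name.
            n = split($1, parts, ".")
            ext = (n > 1) ? parts[n] : "(none)"
            count[ext]++
            bytes[ext] += $2
        }
        END {
            for (ext in count)
                printf "%s\t%d\t%.2f\t%d\n", ext, count[ext], bytes[ext] / count[ext], bytes[ext]
        }' |
    sort -t "$(printf '\t')" -k4,4n
}
```

Usage would be along the lines of: ext_sizes /path/to/6.4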

Some of these files (html, js, svg, css) have a fairly high compression ratio. In fact all modern browsers send Accept-Encoding: gzip headers in their requests, causing the HTTPd to compress the payload on the fly; the client then decompresses it on reception. This saves traffic, but not space (and causes the HTTPd some overhead due to the extra processing).

Instead, I would like to store these files gzipped on the server. Aside from saving space, this has a number of advantages:
  • compression is done once and for all on Olivier Hallot's workstation, meaning less work to be done on the HTTPd side (hence faster processing time);
  • since compression isn't done on the fly one can safely use more aggressive options (compression level) without risk of DoS'ing ourselves; and
  • the HTTPd can safely add a Content-Length header to the response (this is not possible for pipelined compression since the server doesn't know the size of the payload by the time it writes the header part).

For the few browsers not supporting gzip or not sending Accept-Encoding: gzip in the request, the requested file, stored compressed on the server, would be decompressed on the fly by the server, and the decompressed payload is served as is (without Content-Length header). So pretty much the opposite of what's performed right now.

Concretely, what I request is a flag to optionally run

find /path/to/6.4 -type f \
\( -name "*.css" -o -name "*.html" -o -name "*.js" -o -name "*.svg" \) \
\! -size -128c \
-print0 | xargs -r0 gzip -n

after a successful build. Symbolic links require some extra care: if the target is compressed, the link should be removed and replaced by one with a .gz suffix that targets the .gz counterpart.

I.e., compress these files with gzip(1)'s default options, but only those of 128 bytes or larger.
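
The symbolic link handling mentioned above could look roughly like the following, run once the find/gzip command has finished. It is a sketch under assumptions: relink_compressed is a hypothetical name, and it only touches links that became dangling because their target now exists solely as a .gz file.

```shell
# Hypothetical helper: relink_compressed DIR
# For every dangling symlink whose target now exists only as "<target>.gz",
# remove the link and create "<link>.gz" pointing at "<target>.gz".
relink_compressed() {
    find "$1" -type l -exec sh -c '
        for link do
            # Links that still resolve were not affected by the compression.
            [ -e "$link" ] && continue
            target=$(readlink "$link")
            # Resolve relative targets against the directory holding the link.
            case $target in
                /*) resolved=$target ;;
                *)  resolved=$(dirname "$link")/$target ;;
            esac
            # Leave the link alone unless a compressed counterpart exists.
            [ -f "$resolved.gz" ] || continue
            rm -- "$link"
            ln -s -- "$target.gz" "$link.gz"
        done' sh {} +
}
```

Usage would be along the lines of: relink_compressed /path/to/6.4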

Maybe the list of extensions to compress and the compression threshold (128 bytes) could be specified by the flag.
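
One possible shape for such a flag is a small wrapper that takes the threshold and the extension list as arguments. compress_static and its argument order are made up for illustration; the find/xargs/gzip pipeline is the one from the command above, and the sketch assumes GNU find and xargs and at least one extension.

```shell
# Hypothetical wrapper: compress_static DIR MIN_BYTES EXT...
# Compresses (gzip defaults, -n) every regular file under DIR that has one of
# the given extensions and is at least MIN_BYTES bytes long.
compress_static() {
    dir=$1
    min_bytes=$2
    shift 2
    # Turn "html js" into: -name "*.html" -o -name "*.js"
    first=1
    for ext do
        if [ "$first" -eq 1 ]; then
            set --      # the loop already captured the extension list
            first=0
        else
            set -- "$@" -o
        fi
        set -- "$@" -name "*.$ext"
    done
    # "! -size -Nc" excludes files smaller than N bytes.
    find "$dir" -type f \( "$@" \) ! -size "-${min_bytes}c" -print0 |
        xargs -r0 gzip -n
}
```

With the values from this ticket, the call would be: compress_static /path/to/6.4 128 css html js svg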

I'll take care of the server configuration. (In fact I already have a PoC for 6.4.) That requires a new location{} block, and since we already had to add one for 6.4 (for #2555) it's best for the infra team if that flag were added to 6.4 as well.

Actions #1

Updated by Guilhem Moulin about 4 years ago

  • Subject changed from Optionally compress static pages to Help Online: Optionally compress static pages

Actions #2

Updated by Guilhem Moulin about 4 years ago

Some compression metrics for 6.4 (considering only CSS, HTML, JavaScript and SVG files of 128 bytes or larger), testing 3 compression levels:

 ext  #files  uncompressed     --fast   default      --best
----  ------  ------------  ---------  ---------  ---------
 css       3       0.03MiB    0.01MiB    0.01MiB    0.01MiB
html  159061    1464.93MiB  489.27MiB  446.23MiB  444.87MiB
  js     207      55.96MiB    7.24MiB    5.95MiB    5.79MiB
 svg    4312       4.91MiB    2.26MiB    2.17MiB    2.17MiB
----  ------  ------------  ---------  ---------  ---------
      163583          100%     32.69%     29.78%     29.67%

I think the default compression level (-6), as suggested in the above command, is good enough. It's only marginally slower than --fast and has a better compression ratio. Higher compression levels give only marginal gain on these files.
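
Totals like the ones in the table can be reproduced approximately by compressing each file to stdout at the three levels and summing the sizes. A sketch only, not necessarily the script used for the numbers above: gzip_totals is a hypothetical helper that prints, in bytes, the uncompressed total followed by the totals for gzip -1 (--fast), -6 (default) and -9 (--best) for one extension, considering files of 128 bytes or larger. The files themselves are left untouched.

```shell
# Hypothetical helper: gzip_totals DIR EXT
# Prints "raw fast default best": total bytes uncompressed and after
# gzip -1, -6 and -9, over files *.EXT of 128 bytes or larger under DIR.
gzip_totals() {
    find "$1" -type f -name "*.$2" ! -size -128c -exec sh -c '
        for f do
            # One line per file; gzip -c writes to stdout and keeps the file.
            printf "%d %d %d %d\n" \
                "$(wc -c < "$f")" \
                "$(gzip -1 -nc "$f" | wc -c)" \
                "$(gzip -6 -nc "$f" | wc -c)" \
                "$(gzip -9 -nc "$f" | wc -c)"
        done' sh {} + |
    awk '{ raw += $1; fast += $2; dflt += $3; best += $4 }
         END { printf "%d %d %d %d\n", raw, fast, dflt, best }'
}
```

Usage would be along the lines of: gzip_totals /path/to/6.4 html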

(FWIW, I also measured gains obtained by running the JavaScript and CSS through a minifier: after compression the difference is really negligible so IMHO it's really not worth the trouble if we have gzip.)

Note that it's possible to perform the compression step as post-processing after each upload, just like sitemap generation. (In fact I might do so for old frozen releases.) However that complicates the update process and makes it racy, and I think it's much nicer if our documents match exactly the ones built and uploaded.

Actions #3

Updated by Guilhem Moulin about 4 years ago

Guilhem Moulin wrote:

For the few browsers not supporting gzip or not sending Accept-Encoding: gzip in the request

To put that in perspective, >99% of successful (status 200) GET requests with extension .css/.js/.html/.svg are currently gzipped on the fly by the server. The remaining <1% do not advertise gzip support. So if these files were stored compressed on our end, then we could serve the overwhelming majority of requests “as is” (pre-compressed).

There are other compression algorithms out there, and I'm not especially attached to gzip. But to avoid keeping the original (uncompressed) copy of each file alongside the compressed version [0], we need the ability to decompress files in order to serve all requests. Otherwise the requests that aren't advertising support for compression in their header (e.g., from older browsers) would not be supported. AFAICT gzip is the only algorithm for which there are nginx modules for 1/ serving static pre-compressed files, and 2/ on the fly decompression.

[0] The point here is more to save storage space than reduce network traffic or load on the HTTPd.

Actions #4

Updated by Florian Effenberger about 4 years ago

  • Target version set to Q4/2019

Setting target to Q4 - Olivier, let me know if this is feasible or if this takes more time

Actions #5

Updated by Olivier Hallot about 4 years ago

Feasible by year-end 2019.

Actions #6

Updated by Florian Effenberger almost 4 years ago

Any update?

Actions #7

Updated by Florian Effenberger almost 4 years ago

  • Target version changed from Q4/2019 to Q1/2020

Still relevant/on the agenda?

Actions #8

Updated by Guilhem Moulin almost 4 years ago

Still relevant from an infrastructure point of view.

Actions #9

Updated by Guilhem Moulin over 3 years ago

  • Status changed from New to Closed

Closing, fixed tooling and processes with Olivier Hallot. Right now compressed uploads take about 1.9GiB per release (incl. xapian database and sitemaps), so ~50% gain.

