Task #2981
Help Online: Optionally compress static pages (Closed)
Description
Help Online packages are rather large, at 1.8GiB per release. Xapian indices (for full-text search, cf. #2555), including spelling/stemming suggestions, add another 1.9GiB per release on top of that, which raises some concerns in terms of scalability and sustainability from an infrastructure perspective.
6.4 has 164735 files right now; here are the extensions weighing ≥4MiB:
ext    #files   avg. size   tot. size
----   ------   ---------   ---------
svg      4341     1.16kiB     4.91MiB
ods       329    20.53kiB     6.59MiB
png       724    13.88kiB     9.81MiB
js        204   276.81kiB    55.96MiB
html   159061     9.43kiB  1464.93MiB
Some of these files (html, js, svg, css) have a fairly high compression ratio. In fact all modern browsers send an Accept-Encoding: gzip header in their requests, causing the HTTPd to compress the payload on the fly; on reception it is decompressed by the client. That saves traffic, but not space. (And it costs the HTTPd some overhead due to the extra processing.)
Storing these files pre-compressed on disk would have several advantages:
- compression is done once and for all on Olivier Hallot's workstation, meaning less work to be done on the HTTPd side (hence faster processing);
- since compression isn't done on the fly, one can safely use more aggressive options (higher compression levels) without risk of DoS'ing ourselves; and
- the HTTPd can safely add a Content-Length header to the response (this is not possible for on-the-fly compression, since the server doesn't know the size of the payload by the time it writes the header part).
For the few browsers not supporting gzip or not sending Accept-Encoding: gzip in the request, the requested file, stored compressed on the server, would be decompressed on the fly by the server, and the decompressed payload served as is (without a Content-Length header). So pretty much the opposite of what's performed right now.
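The proposed storage/serving scheme can be sketched with gzip(1) itself (file name and content are illustrative):

```shell
# Minimal sketch: store the page compressed, then either serve the .gz
# directly (gzip-capable clients) or decompress it on the fly (the rest).
echo '<html><body>hello</body></html>' > page.html
gzip -n -6 page.html      # only page.html.gz remains on disk
zcat page.html.gz         # what the server would do for non-gzip clients
```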
Concretely, what I request is a flag to optionally run
find /path/to/6.4 -type f \
\( -name "*.css" -o -name "*.html" -o -name "*.js" -o -name "*.svg" \) \
\! -size -128c \
-print0 | xargs -r0 gzip -n
after a successful build (symbolic links require some extra care: if the target is compressed, then the link should be removed and recreated with a .gz suffix, pointing at the target's .gz counterpart). I.e., compress these files with gzip(1)'s default options, but only when they exceed 128 bytes.
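The symlink fix-up described above could look like this sketch (file names are illustrative, not the actual tooling):

```shell
# Demo setup: a page and a symbolic link to it.
echo '<html><body>x</body></html>' > target.html
ln -s target.html link.html
gzip -n target.html                 # compress the target; link.html now dangles
# If a link's target was compressed, recreate the link with a .gz suffix
# pointing at the .gz counterpart of its target.
if [ -L link.html ] && [ ! -e link.html ]; then
    dest=$(readlink link.html)
    rm link.html
    ln -s "$dest.gz" link.html.gz
fi
```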
Maybe the list of extensions to compress and the compression threshold (128 bytes) could be made configurable via the flag.
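A hedged sketch of what such a configurable flag could expand to (ROOT, EXTS and MIN_SIZE are illustrative names, not existing tooling; a throwaway directory stands in for /path/to/6.4):

```shell
ROOT=$(mktemp -d)                         # stand-in for /path/to/6.4
printf '%01000d' 0 > "$ROOT/big.html"     # 1000 bytes: above the threshold
printf 'x' > "$ROOT/tiny.html"            # 1 byte: left uncompressed

EXTS="css html js svg"                    # configurable extension list
MIN_SIZE=128c                             # configurable size threshold

set --                                    # build the -name clauses from $EXTS
for e in $EXTS; do set -- "$@" -o -name "*.$e"; done
shift                                     # drop the leading -o

find "$ROOT" -type f \( "$@" \) \! -size "-$MIN_SIZE" -print0 \
    | xargs -r0 gzip -n
```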
I'll take care of the server configuration. (In fact I already have a PoC for 6.4.) That requires a new location{} block, and since we already had to add one for 6.4 (for #2555), it's best for the infra team if that flag were added for 6.4 as well.
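For reference, the server side can rely on two stock nginx modules: ngx_http_gzip_static_module (serve a stored .gz directly when the client accepts gzip) and ngx_http_gunzip_module (decompress on the fly otherwise). A minimal sketch of such a location{} block, assuming only the .gz files exist on disk (the path and details are assumptions, not the actual PoC):

```nginx
location /6.4/ {
    gzip_static always;   # always serve the on-disk .gz
    gunzip      on;       # decompress responses for clients without gzip support
}
```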
Updated by Guilhem Moulin about 5 years ago
- Subject changed from Optionally compress static pages to Help Online: Optionally compress static pages
Updated by Guilhem Moulin about 5 years ago
Some compression metrics for 6.4 (considering only CSS, HTML, JavaScript and SVG files of 128 bytes or larger), testing 3 compression levels:
ext    #files   uncompressed      --fast     default      --best
----   ------   ------------   ---------   ---------   ---------
css         3        0.03MiB     0.01MiB     0.01MiB     0.01MiB
html   159061     1464.93MiB   489.27MiB   446.23MiB   444.87MiB
js        207       55.96MiB     7.24MiB     5.95MiB     5.79MiB
svg      4312        4.91MiB     2.26MiB     2.17MiB     2.17MiB
----   ------   ------------   ---------   ---------   ---------
       163583           100%      32.69%      29.78%      29.67%
I think the default compression level (-6), as used in the above command, is good enough. It's only marginally slower than --fast and has a noticeably better compression ratio, while higher compression levels give only marginal gains on these files.
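The trade-off between levels can be reproduced on any sample file, e.g. (sample content is illustrative):

```shell
# Compare gzip output sizes at three compression levels.
seq 1 5000 > sample.txt
for lvl in --fast -6 --best; do
    printf '%-6s %s bytes\n' "$lvl" "$(gzip -c "$lvl" sample.txt | wc -c)"
done
```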
(FWIW, I also measured the gains obtained by running the JavaScript and CSS through a minifier: after compression the difference is really negligible, so IMHO it's not worth the trouble given we have gzip.)
Note that it's possible to perform the compression step as post-processing after each upload, just like sitemap generation. (In fact I might do so for old frozen releases.) However that complicates the update process and makes it racy, and I think it's much nicer if our documents match exactly the ones built and uploaded.
Updated by Guilhem Moulin about 5 years ago
Guilhem Moulin wrote:
For the few browsers not supporting gzip or not sending Accept-Encoding: gzip in the request
To put that in perspective, >99% of successful (status 200) GET requests with extension .css/.js/.html/.svg are currently gzipped on the fly by the server. The remaining <1% do not advertise gzip support. So if these files were stored compressed on our end, then we could serve the overwhelming majority of requests “as is” (pre-compressed).
There are other compression algorithms out there, and I'm not especially attached to gzip. But to avoid keeping the original (uncompressed) copy of each file alongside the compressed version [0], we need the ability to decompress files in order to serve all requests. Otherwise the requests that don't advertise support for compression in their headers (e.g., from older browsers) could not be served. AFAICT gzip is the only algorithm for which there are nginx modules for 1/ serving pre-compressed static files, and 2/ on-the-fly decompression.
[0] The point here is more to save storage space than reduce network traffic or load on the HTTPd.
Updated by Florian Effenberger about 5 years ago
- Target version set to Q4/2019
Setting target to Q4 - Olivier, let me know if this is feasible or if this takes more time
Updated by Florian Effenberger almost 5 years ago
- Target version changed from Q4/2019 to Q1/2020
Still relevant/on the agenda?
Updated by Guilhem Moulin almost 5 years ago
Still relevant from an infrastructure point of view.
Updated by Guilhem Moulin over 4 years ago
- Status changed from New to Closed
Closing; the tooling and processes were fixed with Olivier Hallot. Right now compressed uploads take about 1.9GiB per release (incl. the Xapian database and sitemaps), so a ~50% gain.