Task #3236

Slow download speed on dev-www.libreoffice.org / dev-downloads.libreoffice.org / bibisect.libreoffice.org

Added by W.J. Harkink 7 months ago. Updated 6 months ago.

Status:
In Progress
Priority:
Normal
Category:
-
Target version:
-
Start date:
Due date:
% Done:

0%

Tags:
URL:

Description

In essence https://redmine.documentfoundation.org/issues/3018

The https://dev-downloads.libreoffice.org/bibisect/linux/bibisect-43all.tar.xz is around 250 kbit/s.. should be 5500 kbit/s

Same issue with gerrit.libreoffice.org: downloading a bibisect repo is a bit faster (900 - 1200 kbit/s), but still slow..

The daily builds download is for some reason faster.. still slow, but 1,6 Mb/s

Tried with VPN (different locations; Belgium/Germany/Netherlands) and without VPN.. no substantial difference.

And it has been like this for a year or more..


Files

Clipboard02.png (115 KB) Clipboard02.png Speedtest W.J. Harkink, 2020-06-12 19:45

History

#1

Updated by Guilhem Moulin 7 months ago

  • Priority changed from High to Normal

The https://dev-downloads.libreoffice.org/bibisect/linux/bibisect-43all.tar.xz is around 250 kbit/s..

I just downloaded the first 1GiB of that file at >4.5 MBytes/sec (>35Mbit/sec) avg from three random datacenters in Europe…

Same issue with gerrit.libreoffice.org: downloading a bibisect repo is a bit faster (900 - 1200 kbit/s), but still slow..

The canonical URLs for active bibisect repos are https://bibisect.libreoffice.org/$OS-$VERSION.git , please use those. Right now they are hosted alongside gerrit, but uploads are CPU-bound by git-pack-objects(1) processes; even internal clones don't have stellar speed. We'll move bibisect.libreoffice.org to a separate box (and remove support for git:// , see #1481), but we need to wait for 6.4 to be frozen so we don't break existing setups.

I guess your provider has bad connectivity to our infrastructure. If you consistently observe worse performance for dev-downloads than bibisect/gerrit, this might be because v6 connectivity is better than v4. dev-downloads wasn't dual-stack until a few minutes ago.

#2

Updated by W.J. Harkink 7 months ago

-> I guess your provider has bad connectivity to our infrastructure.
The easy answer.. blame it on the ISP. However.. same problem with VPN and with an online browser:

Try: https://www.browserling.com/browse/win/8.1/chrome/57/
Go to speedtest.de first.. 100 MB/s
Next go to: https://dev-downloads.libreoffice.org/bibisect/linux/ and download something.. same slowness

Is there some direct path to 89.238.68.201 / gimli.documentfoundation.org for the bibisect/linux/ repo?

#3

Updated by Guilhem Moulin 7 months ago

The easy answer.. blame it on the ISP. However.. same problem with VPN

Well you give 2-3 slow examples and I give 3 reasonably fast ones. These are all examples.

Try: https://www.browserling.com/browse/win/8.1/chrome/57/
Go to speedtest.de first.. 100 MB/s
Next go to: https://dev-downloads.libreoffice.org/bibisect/linux/ and download something.. same slowness

Uh, what are you trying to show here? You're comparing apples and oranges. speedtest.de is backed by a CDN as close to the client as possible, as its whole point is to saturate the downstream connection. Outside our mirror network we can't compete with that and never claimed to. dev-downloads.libreoffice.org is in Germany, and using www.browserling.com initiates the request from North America. Of course that won't saturate your link.

Is there some direct path to 89.238.68.201 / gimli.documentfoundation.org for the bibisect/linux/ repo?

I don't understand the question. What are you trying to do?

#4

Updated by W.J. Harkink 6 months ago

Guilhem Moulin wrote:
Is there some direct path to 89.238.68.201 / gimli.documentfoundation.org for the bibisect/linux/ repo?

I don't understand the question. What are you trying to do?

Probably something stupid. Searching for another way to connect to the server, but it doesn't make much sense.

Asked Buovjaga: bibisect-43all.tar.xz gives me average speed of 1500 kB/s. Same speed I get. So there's quite a difference between 1500 kBytes/s and 4.5 MBytes/sec (>35Mbit/sec). Though I consider 35Mbit/sec still rather slowish for large files. I personally consider a 50Mbit line to be the minimum; 100 Mbit average.

Git repo cloning always cycles around 900-1200 kBytes/s. Is that the currently achievable speed, given the CPU-bound processing?

#5

Updated by Guilhem Moulin 6 months ago

W.J. Harkink wrote:

Guilhem Moulin wrote:
Is there some direct path to 89.238.68.201 / gimli.documentfoundation.org for the bibisect/linux/ repo?

I don't understand the question. What are you trying to do?

Probably something stupid. Searching for another way to connect to the server, but it doesn't make much sense.

gimli, gerrit, and dev-download are in the same datacenter.

Asked Buovjaga: bibisect-43all.tar.xz gives me average speed of 1500 kB/s. Same speed I get.

Your report states “250 kbit/s … 900-1200 kbit/s”, so 10 to 50× slower. You even said “should be 5500 kbit/s” which, well, is less than half of what you claim to have now :-)

So there's quite a difference between 1500 kBytes/s and 4.5 MBytes/sec (>35Mbit/sec). Though I consider 35Mbit/sec still rather slowish for large files. I personally consider a 50Mbit line to be the minimum; 100 Mbit average.

So with 12 concurrent downloads we'd saturate our upstream connection. There is no way we can offer that speed without a geographically distributed CDN and a much more complex architecture than what we have now. We're no content streaming site, and while QA matters, are the bibisect repos used enough to justify a complete revamp of the infrastructure? For what it's worth, in the past 2 weeks there were only 3(!) GET requests to https://dev-downloads.libreoffice.org/bibisect/… with a response size of ≥100MiB, one of which was mine and another maybe buovjaga's. For git-upload-pack commands at https://bibisect.libreoffice.org , that's 41 responses of size ≥100MiB from 24 different IPs.

Git repo cloning always cycles around 900-1200 kBytes/s. Is that the currently achievable speed, given the CPU-bound processing?

(You wrote bits earlier. 1200kB/s is close enough to 1500kB/s, which you seem to have now on the other box, to suggest that it's not resource-bound on the VM.) Anyway, that depends on the load on the box. It's shared with gerrit, which is greedy in terms of resources. Weekends and nights will likely achieve faster speeds. I seem to exceed 3MiB/s right now at 3am from my home connection in Sweden (without particularly good peering to TDF's provider). It also appears some of these repositories aren't fully repacked, which might further affect speed (although it is most likely not causing the bottleneck here).

The plan is to move https://bibisect.libreoffice.org repo to a dedicated host once 6.4 ceases being active.

#6

Updated by W.J. Harkink 6 months ago

Your report states “250 kbit/s … 900-1200 kbit/s”, so 10 to 50× slower. You even said “should be 5500 kbit/s” which, well, is less than half of what you claim to have now :-)

-> Sorry for the confusion.

So with 12 concurrent downloads we'd saturate our upstream connection. There is no way we can offer that speed without a geographically distributed CDN and a much more complex architecture than what we have now. We're no content streaming site, and while QA matters, are the bibisect repos used enough to justify a complete revamp of the infrastructure?

-> I do not know much about the infra. Final releases are downloaded from mirror servers.
downloadarchive.documentfoundation.org is on a separate server

Gimli is hosting:
api.libreoffice.org
dev-builds.libreoffice.org
dev-www.libreoffice.org
eclipse-plugins.libreoffice.org
tinderbox.libreoffice.org

-> So with 12 concurrent downloads we'd saturate our upstream connection
Depends on how many people are on the server at the same time.. Maybe you throttle with more.. but I have no clue what all those services need. I don't think there are that many users at the same time.

-> The bibisect-43all.tar.xz is of course not the most relevant repo. Not often used. However, it is still slow if you happen to download it incidentally (I'm currently storing all bibisect repos locally, just in case, as a quick download is a no-go).
However, the master builds are located there, as are the builds including symbols. So one ends up constantly waiting (for people who do not tend to build stuff themselves). Again, not that many people (so not much total traffic or peak bandwidth required), so it must be doable to find some rather easy-to-implement faster solution with a CDN. Would be nice if some more people did bibisects at QA.
dev-builds.libreoffice.org doesn't contain that much data either (as far as I can see). If it could be hosted on downloadarchive.documentfoundation.org, or the stuff could be mirrored somewhere else (let's hypothetically say https://ftp.snt.utwente.nl/pub/software/tdf/libreoffice/), it would be fine already.

There is probably some reason why this isn't the case..

> The plan is to move https://bibisect.libreoffice.org repo to a dedicated host once 6.4 ceases being active.
Please let it be faster :-). You currently can't advise somebody to do a bibisect, even if you would be able to get them crazy enough to do it.
I'm currently intending to download the bibisect MacOS repo for a single bibisect. Waiting 3 or 4 hours is pretty long. And of course my MacBook goes to sleep sometimes.. failing again. Or I'm walking somewhere.. out of WiFi range.

-> Can't those old "inactive" bibisect repos - everything except 7.1 (for Linux/Mac/Win) - be packed similar to bibisect-43all.tar.xz? Not seeing the advantage of git-pack-objects processes for those; they get no updates anymore. OK, you have to extract, but it looks more efficient to me; but what do I know..

#7

Updated by Guilhem Moulin 6 months ago

W.J. Harkink wrote:

Your report states “250 kbit/s … 900-1200 kbit/s”, so 10 to 50× slower. You even said “should be 5500 kbit/s” which, well, is less than half of what you claim to have now :-)

-> Sorry for the confusion.

It's still unclear to me what speed you get and which you consider acceptable. Reading the original post and then message #3, it seems this issue is no longer relevant.

So with 12 concurrent downloads we'd saturate our upstream connection. There is no way we can offer that speed without a geographically distributed CDN and a much more complex architecture than what we have now. We're no content streaming site, and while QA matters, are the bibisect repos used enough to justify a complete revamp of the infrastructure?

-> I do not know much about the infra. Final releases are downloaded from mirror servers.
downloadarchive.documentfoundation.org is on a separate server

Correct. The mirror network acts as a CDN. They also receive vastly more than 3 download requests per day on average.

Gimli is hosting:
api.libreoffice.org
dev-builds.libreoffice.org
dev-www.libreoffice.org
eclipse-plugins.libreoffice.org
tinderbox.libreoffice.org

The list is not exhaustive, but yes. No idea where you're going with that though.

-> So with 12 concurrent downloads we'd saturate our upstream connection
Depends on how many people are on the server at the same time.. Maybe you throttle with more.. but I have no clue what all those services need. I don't think there are that many users at the same time.

I'm talking about the upstream link for the entire infrastructure (used by gerrit, downloadarchive, gimli and the rest), not only gimli's.

-> The bibisect-43all.tar.xz is of course not the most relevant repo. Not often used. However, it is still slow if you happen to download it incidentally (I'm currently storing all bibisect repos locally, just in case, as a quick download is a no-go).

The ellipsis in what I wrote above is a wildcard. That's only 3 hits (of which one is mine and another likely buovjaga's) in a two-week period with a ≥100MB response for any file under https://dev-downloads.libreoffice.org/bibisect/ . That's not specific to bibisect-43all.tar.xz .

However, the master builds are located there, as are the builds including symbols. So one ends up constantly waiting (for people who do not tend to build stuff themselves). Again, not that many people (so not much total traffic or peak bandwidth required), so it must be doable to find some rather easy-to-implement faster solution with a CDN. Would be nice if some more people did bibisects at QA.
dev-builds.libreoffice.org doesn't contain that much data either (as far as I can see). If it could be hosted on downloadarchive.documentfoundation.org, or the stuff could be mirrored somewhere else (let's hypothetically say https://ftp.snt.utwente.nl/pub/software/tdf/libreoffice/), it would be fine already.

There is probably some reason why this isn't the case..

Yes… downloadarchive is not backed by a CDN, and shares the same upstream link, so you wouldn't get faster speeds there. Commercial CDNs are expensive and definitely not worth it for not even 10 hits per day on average. We could deploy some of this on our volunteer-driven mirror network, however that'd need to be opt-in (the release downloads are mostly fixed in size, but the bibisect stuff is much larger and ever-increasing; that's also why downloadarchive is not backed by a CDN). Most likely few mirrors would be interested in this, not just because of the extra space needs, but also because universities and ISPs prefer to mirror stuff that their users/customers use (and they're vastly more likely to have LibreOffice users than QA team members among them).

-> Can't those old "inactive" bibisect repos - everything except 7.1 (for Linux/Mac/Win) - be packed similar to bibisect-43all.tar.xz? Not seeing the advantage of git-pack-objects processes for those; they get no updates anymore. OK, you have to extract, but it looks more efficient to me; but what do I know..

It's the QA team, not the infra team, who builds these. I personally don't care how these repositories are bundled. (That said, if the repository is properly packed you shouldn't have git-pack-objects processes, only git-send-pack.) While both git clones and file-based HTTP downloads are able to handle partial downloads (hence avoid restarting from scratch in case of a broken connection), it might be easier with git shallow clones than with an HTTP client (depending on the tooling at hand).
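As an aside, one common server-side interpretation of "properly packed" is a single pack plus a reachability bitmap, which lets git-upload-pack reuse the pack largely as-is instead of recomputing object sets for every clone. A minimal sketch on a throwaway repository (this is illustrative only, not a description of the actual setup on gimli):

```shell
set -e
tmp=$(mktemp -d)
cd "$tmp"

# A throwaway repo with a few commits, accumulating loose objects.
git init -q repo
for i in 1 2 3; do
    echo "build $i" > repo/build.txt
    git -C repo add build.txt
    git -C repo -c user.name=qa -c user.email=qa@example.org commit -qm "build $i"
done

# Consolidate everything into one pack (-a), drop the now-redundant
# loose objects and old packs (-d), and ask for a reachability bitmap
# (-b) to speed up object counting on later clones/fetches.
git -C repo repack -a -d -b -q

# Afterwards the object store holds a single pack file.
ls repo/.git/objects/pack/
```
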

#8

Updated by W.J. Harkink 6 months ago

Guilhem Moulin wrote:

W.J. Harkink wrote:

Your report states “250 kbit/s … 900-1200 kbit/s”, so 10 to 50× slower. You even said “should be 5500 kbit/s” which, well, is less than half of what you claim to have now :-)

-> Sorry for the confusion.

It's still unclear to me what speed you get and which you consider acceptable. Reading the original post and then message #3, it seems this issue is no longer relevant.

Today between 1,8 MB/s (14.4 Mbit) and 2,5 MB/s (20 Mbit). The 14 Mbit part is too slow; 20 Mbit is acceptable. My target is around 30 Mbit.

https://dev-builds.libreoffice.org/daily/master/Win-x86_64@tb77-TDF/2020-07-12_04.53.31/LibreOfficeDev_7.1.0.0.alpha0_Win_x64.msi

So with 12 concurrent downloads we'd saturate our upstream connection. There is no way we can offer that speed without a geographically distributed CDN and a much more complex architecture than what we have now. We're no content streaming site, and while QA matters, are the bibisect repos used enough to justify a complete revamp of the infrastructure?

-> I do not know much about the infra. Final releases are downloaded from mirror servers.
downloadarchive.documentfoundation.org is on a separate server

Correct. The mirror network acts as a CDN. They also receive vastly more than 3 download requests per day on average.

Gimli is hosting:
api.libreoffice.org
dev-builds.libreoffice.org
dev-www.libreoffice.org
eclipse-plugins.libreoffice.org
tinderbox.libreoffice.org

The list is not exhaustive, but yes. No idea where you're going with that though.

Nowhere, except that I have some impression of everything that's running..

-> So with 12 concurrent downloads we'd saturate our upstream connection
Depends on how many people are on the server at the same time.. Maybe you throttle with more.. but I have no clue what all those services need. I don't think there are that many users at the same time.

I'm talking about the upstream link for the entire infrastructure (used by gerrit, downloadarchive, gimli and the rest), not only gimli's.

-> The bibisect-43all.tar.xz is of course not the most relevant repo. Not often used. However, it is still slow if you happen to download it incidentally (I'm currently storing all bibisect repos locally, just in case, as a quick download is a no-go).

The ellipsis in what I wrote above is a wildcard. That's only 3 hits (of which one is mine and another likely buovjaga's) in a two-week period with a ≥100MB response for any file under https://dev-downloads.libreoffice.org/bibisect/ . That's not specific to bibisect-43all.tar.xz .

(A) Everything in https://dev-downloads.libreoffice.org/bibisect/ is 5 years old; so surely not an active area
(B) How many people download a bibisect repo at all? Most regulars have them on their drive already. The point is more that if you do want to download one (for whatever reason), it's pretty slow. How many people download an old repo from the 43-70 range? Maybe 40 people a year for Linux. The MacOS repo 2-3? The Windows repo 40. [From a QA perspective; not sure what the DEV department does.] So all in all, not very often. So surely not optimizing for the number of users, but more being somewhat user-friendly for the people who do download them. And the files are rather big..

However, the master builds are located there, as are the builds including symbols. So one ends up constantly waiting (for people who do not tend to build stuff themselves). Again, not that many people (so not much total traffic or peak bandwidth required), so it must be doable to find some rather easy-to-implement faster solution with a CDN. Would be nice if some more people did bibisects at QA.
dev-builds.libreoffice.org doesn't contain that much data either (as far as I can see). If it could be hosted on downloadarchive.documentfoundation.org, or the stuff could be mirrored somewhere else (let's hypothetically say https://ftp.snt.utwente.nl/pub/software/tdf/libreoffice/), it would be fine already.

There is probably some reason why this isn't the case..

Yes… downloadarchive is not backed by a CDN, and shares the same upstream link, so you wouldn't get faster speeds there. Commercial CDNs are expensive and definitely not worth it for not even 10 hits per day on average. We could deploy some of this on our volunteer-driven mirror network, however that'd need to be opt-in (the release downloads are mostly fixed in size, but the bibisect stuff is much larger and ever-increasing; that's also why downloadarchive is not backed by a CDN). Most likely few mirrors would be interested in this, not just because of the extra space needs, but also because universities and ISPs prefer to mirror stuff that their users/customers use (and they're vastly more likely to have LibreOffice users than QA team members among them).

The downloadarchive is fast for me.. ;-) 3,3 MB/s - 5,5 MB/s. Tried 3 files. https://dev-builds.libreoffice.org/ is clearly slower at the same time.

-> Can't those old "inactive" bibisect repos - everything except 7.1 (for Linux/Mac/Win) - be packed similar to bibisect-43all.tar.xz? Not seeing the advantage of git-pack-objects processes for those; they get no updates anymore. OK, you have to extract, but it looks more efficient to me; but what do I know..

It's the QA team, not the infra team, who builds these. I personally don't care how these repositories are bundled. (That said, if the repository is properly packed you shouldn't have git-pack-objects processes, only git-send-pack.) While both git clones and file-based HTTP downloads are able to handle partial downloads (hence avoid restarting from scratch in case of a broken connection), it might be easier with git shallow clones than with an HTTP client (depending on the tooling at hand).

So the one who wrote this (below) at https://wiki.documentfoundation.org/QA/Bibisect/Windows is wrong? I surely don't know how to resume

"Please be aware that the original download may be several gigabytes, so please make sure you're on a fast Internet connection and are reasonably certain that you will not experience network interruption. Because of the mechanics of git repositories, if the initial clone is interrupted, I believe that you'll have to start the clone all over again :-("

I have no experience with (git) repositories (or what can be optimized). I only know that downloading a git bibisect repo is (very) slow. First it starts 'preparing', and when it starts downloading it's at 14 Mbit/s or less. What causes the issue, no clue: git, CPU-bound stuff, uplink, improper (git) packing. The only thing I want is some decent download speed. Will ask Xisco about the git repo.

#9

Updated by Guilhem Moulin 6 months ago

W.J. Harkink wrote:

Gimli is hosting:
api.libreoffice.org
dev-builds.libreoffice.org
dev-www.libreoffice.org
eclipse-plugins.libreoffice.org
tinderbox.libreoffice.org

The list is not exhaustive, but yes. No idea where you're going with that though.

Nowhere, except that I have some impression of everything that's running..

Everything running on gimli? It's not overloaded. Most vhosts are just small static files that don't take any resources. It doesn't make sense to host api.lo, eclipse-plugins.lo, etc. on dedicated VMs. That being said, we mostly group services per target users, and since the main target for dev-builds is QA AFAICT, I guess it would make sense to move that (along with https://dev-downloads.libreoffice.org/bibisect/ ) onto the soon-to-be bibisect box.

-> The bibisect-43all.tar.xz is of course not the most relevant repo. Not often used. However, it is still slow if you happen to download it incidentally (I'm currently storing all bibisect repos locally, just in case, as a quick download is a no-go).

The ellipsis in what I wrote above is a wildcard. That's only 3 hits (of which one is mine and another likely buovjaga's) in a two-week period with a ≥100MB response for any file under https://dev-downloads.libreoffice.org/bibisect/ . That's not specific to bibisect-43all.tar.xz .

(A) Everything in https://dev-downloads.libreoffice.org/bibisect/ is 5 years old; so surely not an active area

I only picked that prefix because that's what you were complaining about in your report :-)

(B) How many people download a bibisect repo at all?

FWIW I also gave numbers for https://bibisect.libreoffice.org, which is all I can see and what's relevant to decide whether to use a CDN-backed solution or not. (In the case of a volunteer-driven CDN, mirror operators will also be interested to know these metrics.)

So the one who wrote this (below) at https://wiki.documentfoundation.org/QA/Bibisect/Windows is wrong? I surely don't know how to resume

If you can get a shallow clone at depth 1 (the command is atomic though, so if it fails you can't resume), you should then be able to make it unshallow little by little, and the process should survive interruption. Of course, for that to work the connection must not drop while fetching one level of depth, but that shouldn't be too many objects.
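Roughly, that clone-then-deepen trick looks like the sketch below, played out against a throwaway local repository (for a real bibisect repo you would clone the corresponding https://bibisect.libreoffice.org/$OS-$VERSION.git URL instead, and pick a larger `--deepen` step):

```shell
set -e
tmp=$(mktemp -d)
cd "$tmp"

# Build a small stand-in "remote" with 5 commits.
git init -q upstream
for i in 1 2 3 4 5; do
    echo "state $i" > upstream/file.txt
    git -C upstream add file.txt
    git -C upstream -c user.name=qa -c user.email=qa@example.org commit -qm "commit $i"
done

# Step 1: shallow clone at depth 1. The command is atomic, but only one
# commit's worth of objects is transferred, so retrying is cheap.
git clone -q --depth 1 "file://$tmp/upstream" clone
echo "after clone: $(git -C clone rev-list --count HEAD) commit(s)"

# Step 2: deepen the history little by little. Each step is a small
# fetch, so a dropped connection only costs the current increment.
# git removes .git/shallow once the history is complete.
i=0
while [ -f clone/.git/shallow ] && [ "$i" -lt 10 ]; do
    git -C clone fetch -q --deepen=2
    i=$((i + 1))
done
echo "after deepening: $(git -C clone rev-list --count HEAD) commit(s)"
```

The `file://` prefix matters: a plain local path would make git copy the repository directly and ignore `--depth`.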

That being said, now that I spelled out the above, I have to take back what I said about the git tooling making it easier to resume downloads :-) It's certainly not user-friendly, and `wget`, which is part of the LibreOffice build baseline IIRC, supports resuming downloads out of the box. Browsers might do it natively too.

Xisco Fauli Tarazona: Rather than shipping tarballs, how about git bundles, like the folks at kernel.org: https://www.kernel.org/cloning-linux-from-a-bundle.html . (IIRC this was even brought up in the #tdf-infra channel, or perhaps an infra call.) Like a tarball it's an archive format, but it's perhaps more “git-ty”, with verification and reference listing built in, etc. I wrote some wrapper around `git bundle create` to that effect. Another advantage is that it makes updating without nuking one's own tree easier. Do you have any objection to replacing the git repositories with static bundles for frozen bibisect repositories? For active repositories we could even keep both in parallel (the bundle will need to be updated each time the repo is touched though, either automatically or manually); the size of a bundle matches that of the repo, so it won't blow up too much.
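For reference, a kernel.org-style bundle workflow can be sketched like this on a throwaway local repository (the `bibisect.bundle` file name is made up for the example; the real bundle would be a static file served over HTTPS):

```shell
set -e
tmp=$(mktemp -d)
cd "$tmp"

# A stand-in for a bibisect repository.
git init -q repo
echo v1 > repo/doc.txt
git -C repo add doc.txt
git -C repo -c user.name=qa -c user.email=qa@example.org commit -qm "initial"
branch=$(git -C repo symbolic-ref --short HEAD)

# Server side: pack the whole repository, HEAD included, into one
# static file that any dumb HTTP server can serve (and wget can resume).
git -C repo bundle create "$tmp/bibisect.bundle" HEAD "$branch"

# Client side: cloning from the bundle file works like cloning a URL.
git clone -q "$tmp/bibisect.bundle" clone

# Integrity and refs can be checked from within any repository.
git -C clone bundle verify "$tmp/bibisect.bundle" >/dev/null
git -C clone bundle list-heads "$tmp/bibisect.bundle"

# When upstream grows, a refreshed bundle updates the clone in place;
# no need to nuke the tree and re-download everything.
echo v2 >> repo/doc.txt
git -C repo add doc.txt
git -C repo -c user.name=qa -c user.email=qa@example.org commit -qm "update"
git -C repo bundle create "$tmp/bibisect.bundle" HEAD "$branch"
git -C clone pull -q "$tmp/bibisect.bundle" "$branch"
```

Including `HEAD` in the bundle lets `git clone` pick the right branch to check out without any extra flags.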

#10

Updated by Guilhem Moulin 6 months ago

Guilhem Moulin wrote:

If you can get a shallow clone at depth 1 (the command is atomic though, so if it fails you can't resume), you should then be able to make it unshallow little by little, and the process should survive interruption. Of course, for that to work the connection must not drop while fetching one level of depth, but that shouldn't be too many objects.

FWIW I see someone added that trick to https://wiki.documentfoundation.org/QA/Bibisect/Linux#Using_Git . But it's nothing specific to Linux, and nothing specific to QA/bibisecting either. It's a good trick to know when the need arises to download a large repo on a flaky line :-) Even if it might not be as effective on bibisect repositories, with their rather short history and few but large objects.

That being said, now that I spelled out the above, I have to take back what I said about the git tooling making it easier to resume downloads :-) It's certainly not user-friendly, and `wget`, which is part of the LibreOffice build baseline IIRC, supports resuming downloads out of the box. Browsers might do it natively too.

#11

Updated by Guilhem Moulin 6 months ago

Coordinated offline with Xisco Fauli Tarazona: the inactive repositories are now available via bundle only, while active ones are available both as a bundle and via `git-fetch-pack`. It won't speed things up, but it might make resuming easier. Detailed individual instructions are available at the https://bibisect.libreoffice.org/$OS-$VERSION.git links found at https://wiki.documentfoundation.org/QA/Bibisect/{Linux,macOS,Windows} (just point your browser at the `git clone` URL). Feel free to update the wiki with some further hints :-)

#12

Updated by Florian Effenberger 6 months ago

  • Status changed from New to In Progress

Will happen when 6.4 bibisect is frozen
Will check back with Xisco on the date, and adjust the due date accordingly
