Task #598

closed

Git + Bibisect

Added by Joel Madero over 9 years ago. Updated over 8 years ago.

Status: Closed
Priority: Normal
Category: QA
Target version: -
Start date:
Due date:
% Done: 0%
Tags:

Description

The task to be carried out is to get one repository for bibisect, available via git, that provides everything available for the specific platform, from the most recent branch point (currently 5.1) up to current master. The rationale is to have a mechanism to check the most recent regressions on a daily basis.

Then, once the 5.2 branch point occurs, the git repository would follow that range (5.2 to master), just like we have a "master branch" for building.

Actions #1

Updated by Bjoern Michaelsen over 9 years ago

  • Project changed from 49 to Infrastructure

Moving to the Infrastructure project as per https://redmine.documentfoundation.org/issues/599#note-2

Actions #2

Updated by Bjoern Michaelsen over 9 years ago

  • Category set to QA

Actions #3

Updated by Christian Lohmaier over 9 years ago

A single repository covering the whole range would be rather large, so I'm personally not a big fan of that approach. Having separate bibisect repositories covering the time between branch points is more manageable in my opinion.

Or do you mean creating the bibisect repository from the tinderbox builds? This is done for Windows already and can be extended to more tinderboxes. The drawback of those repositories is that they also grow large rather quickly, as the compression is not as effective as when done over the whole repository.

Actions #4

Updated by Florian Effenberger over 9 years ago

  • Status changed from New to Feedback
  • Assignee set to Christian Lohmaier

Assigning to Cloph, as he or Robinson are the most likely takers ;-)
Joel, can you give feedback what precisely is required?

Actions #5

Updated by Joel Madero about 9 years ago

Florian Effenberger wrote:

Assigning to Cloph, as he or Robinson are the most likely takers ;-)
Joel, can you give feedback what precisely is required?

Well, it really depends on whether it can be done at all. The idea is just that currently you have to manually download multiple packages for bibisect, and each only goes up to the last commit in that individual package. It would be nice if instead we could do a git pull of some kind, be completely up to date, and have it all in one bibisect (instead of the current two, which at some point will turn into 3 and 4, etc.).

So ideally:
1. Ability to git pull daily so we are up to date always on bibisect - with our current QA team I think that we could catch regressions and have them bibisected within days;
2. Ability to download a subset of the bibisect for those who do not want to have 15+ gigs, for instance download from 4.0 branch point - present, and use git pull to add to it daily;

Let me know if that helps. I'm not sure if it's feasible; Robinson likely has some thoughts on whether it is. If this were done, it would also save Robinson the effort of having to package bibisect every few weeks (or whatever time frame he does it).

Actions #6

Updated by Florian Effenberger almost 9 years ago

Cc'ing Robinson
Robinson, can you give some feedback with your thoughts, and also let me know if you can work on this?

Actions #7

Updated by Robinson Tryon almost 9 years ago

Florian Effenberger wrote:

Cc'ing Robinson
Robinson, can you give some feedback with your thoughts, and also let me know if you can work on this?

I've talked briefly with mjfrancis about some of our bibisect goals. He's working on some advanced packing, and calculating/testing what a more complete coverage of commits would take (i.e. builds of each commit), which would make the results of bibisect more accurate and provide a more direct answer akin to 'git blame'.

Well, it really depends on whether it can be done at all. The idea is just that currently you have to manually download multiple packages for bibisect, and each only goes up to the last commit in that individual package. It would be nice if instead we could do a git pull of some kind, be completely up to date, and have it all in one bibisect (instead of the current two, which at some point will turn into 3 and 4, etc.).

I think it's definitely doable from a technical perspective; we just need to make sure we understand how much of a data diff people will have to pull down each time (whether they're 1 day or 1 week out of date).

So ideally:
1. Ability to git pull daily so we are up to date always on bibisect - with our current QA team I think that we could catch regressions and have them bibisected within days;
2. Ability to download a subset of the bibisect for those who do not want to have 15+ gigs, for instance download from 4.0 branch point - present, and use git pull to add to it daily;

Yep

Actions #8

Updated by Robinson Tryon almost 9 years ago

2. Ability to download a subset of the bibisect for those who do not want to have 15+ gigs, for instance download from 4.0 branch point - present, and use git pull to add to it daily;

So what's a reasonable chunk-size? Conceptually, it'd make sense to chunk on each release branch (4.4, 4.5, etc..), however I don't know how much this might increase the overall sum of sizes...

Actions #9

Updated by Christian Lohmaier almost 9 years ago

mjfrancis created a version of the 4.4 Linux bibisect that covers every buildable commit in the range (if I understood correctly), and that is a 7.5 GB tarball. This size is manageable, and fits on an 8GB USB stick...

And with that many commits covered, the git overhead should be rather slim; I don't expect noteworthy changes in disk space when combining that with another bibisect repo. You cannot use huge comparison windows or long diff chains when gc'ing/repacking the repository anyway: even if you have enough RAM to process it when creating the repo, larger values increase the time to switch between versions, and a much larger repo also prevents keeping the repo data in the local disk cache while performing the bibisect.

I think for the individual repositories, we should aim for 8GB as the upper boundary in size. Do we have a rough feeling about what percentage of requests require switching between the bibisect repositories? I mean right now the backlog of bibisect requests/old issues has been reduced significantly, leaving only "newly introduced" regressions that should be covered by either master or by the 44 repo?

But I might be completely wrong with that assumption. My gut feeling is that the by-branch chunks are reasonable. You know when the range they cover starts and ends, and it is also easy to figure out with which to start by reading a bugreport.

Actions #10

Updated by Robinson Tryon almost 9 years ago

Christian Lohmaier wrote:

But I might be completely wrong with that assumption. My gut feeling is that the by-branch chunks are reasonable. You know when the range they cover starts and ends, and it is also easy to figure out with which to start by reading a bugreport.

So do we want to have bibisect ranges that span release branches, or just have bibisect coverage for the master/dev-branches? Either way, I guess that we'd get coverage of similar sets of commits. The benefit of building bibisect repositories off of the master branch is that we get all of the commits sooner than if we waited until they landed in a release branch. The downside is that it's possible that we might find a bug in a release branch that isn't reproducible in master, however I'd assume the probability there is pretty small...

If we do build off of master, then what we're going to get is a 4.5(.0.0.alpha0+) branch repo, then a 4.6(.0.0.alpha0+) branch repo, etc, etc..

Actions #11

Updated by Matthew Francis almost 9 years ago

Christian Lohmaier wrote:

mjfrancis created a version of the 4.4 Linux bibisect that covers every buildable commit in the range (if I understood correctly), and that is a 7.5 GB tarball. This size is manageable, and fits on an 8GB USB stick...

Every buildable commit, minus those which are clearly irrelevant for Linux-x64 (i.e. commits which only touch files in directories specific to other architectures) or are only for features that aren't built in the repo (translation, help content updates).

4.5 (5.0?) should be smaller than 7.5G, possibly quite a bit smaller, for two reasons:
- Unless there's a huge last-minute burst, the number of commits is rather smaller than in 4.4 (11000 in 4.4, vs 6000 at >2/3 of the way through current master. I wonder why that is, incidentally)
- A bug was fixed in mid-late 4.4 which significantly reduced the entropy of each build (for want of a "sort", the generated parts of oox were built non-deterministically, adding unnecessarily to the compressed/delta size of each of the thousands of commits)
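
As a hedged illustration of the second point (this is not the actual oox generator, whose code is not shown in this thread): if a generator emits its entries in whatever order it happens to receive them, two builds of identical input can produce different bytes, and sorting first makes the output stable. The function and entry names below are hypothetical.

```python
import hashlib


def generate(entries, sort_first):
    """Render a hypothetical generated source file from a collection of entries."""
    ordered = sorted(entries) if sort_first else list(entries)
    return "\n".join(f"register({name!r})" for name in ordered).encode()


def digest(data):
    """Hex digest used to compare two generated files byte for byte."""
    return hashlib.sha256(data).hexdigest()


names = ["chart", "drawing", "table", "shape", "theme"]

# Simulate two builds that see the same inputs in a different order.
build_a = names[:]
build_b = names[::-1]

unsorted_differs = digest(generate(build_a, False)) != digest(generate(build_b, False))
sorted_matches = digest(generate(build_a, True)) == digest(generate(build_b, True))
print(unsorted_differs, sorted_matches)
```

Without the sort, every commit can look "changed" to git's delta compression even when nothing relevant changed, which is the extra entropy described above.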

On a couple of other topics above:
"What's the right scope for a repository"
One per master period seems fine to me. I've never been particularly troubled by the process of switching between two or three repositories to find the region of a bug. In terms of extra keystrokes, it's barely more than an extra "git bisect good/bad", which would be required anyway with a larger repository. And, according to my stats (from a spreadsheet I used to decide that 4.4 was the earliest it was worthwhile to do a full build from), generally at any one time, 80-90% of newly reported bugs were introduced in the 3 most recent master periods, so it's not like there are usually going to be a huge number of repositories to work from. If anyone is that bothered, we could always keep culling a single, coarser resolution repo to help determine the region of a bug.
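
A rough sketch of why repository size barely changes the number of bisect answers needed, assuming a linear history and the usual halving behaviour of bisection (real bibisect repositories may differ slightly). The 11000 figure is the 4.4 commit count quoted above; the merged three-period repository is hypothetical.

```python
import math


def bisect_steps(commit_count):
    """Approximate worst-case number of good/bad answers needed to isolate one commit."""
    return math.ceil(math.log2(commit_count)) if commit_count > 1 else 0


single_period = bisect_steps(11000)      # one master period, using the 4.4 commit count
three_periods = bisect_steps(3 * 11000)  # hypothetical merged repo spanning three such periods
print(single_period, three_periods)
```

Tripling the number of commits adds only a step or two, which is roughly what it costs to pick between three separate repositories by hand.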

"Does it increase the overall size much to have separate repos for each master period"
I concur with cloph on that one, when you're already at such coarse granularity, merging more together won't save you any worthwhile amount of disk space. And, much more than 8GB in size will make it exponentially harder to get good results from a full "git gc" / "git repack" - I was already maxing out 32GB of RAM recompressing it at that size. You can tune it to use less RAM of course, but then the overall compression starts to suffer.
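
For reference, a minimal sketch of the knobs involved, written as a helper that only builds a `git repack` command line and does not run it. The flags are standard `git repack` options; the specific numbers are illustrative assumptions, not the settings actually used for the 4.4 repository.

```python
def repack_command(window, depth, window_memory, threads):
    """Build a `git repack` invocation; the values passed in are illustrative only."""
    return [
        "git", "repack",
        "-a", "-d",                          # repack everything into one pack, drop redundant packs
        "-f",                                # recompute deltas instead of reusing existing ones
        f"--window={window}",                # how many candidate objects to compare per delta
        f"--depth={depth}",                  # maximum delta chain length
        f"--window-memory={window_memory}",  # cap on delta window memory per thread (0 = unlimited)
        f"--threads={threads}",              # number of delta search threads
    ]


aggressive = repack_command(250, 50, "0", 8)   # better compression, far more RAM
low_memory = repack_command(50, 50, "1g", 2)   # less RAM, compression may suffer
print(" ".join(low_memory))
```

A smaller window and a memory cap reduce peak RAM, at the cost of the weaker overall compression mentioned above.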

"Are there ever bugs that could usefully be bisected in release branches"
Rare as hens' teeth. Out of all the regressions I've ever investigated, I can think of just one or two times I've had to look in a release branch for one (or for where one was fixed, which is occasionally relevant). They're so uncommon I'd suggest just carrying on investigating them manually, rather than carrying around infrastructure for it.

"How much data would need to be pulled daily from a real time, full coverage bibisect repo"
I don't have an answer on that yet - need to make some time to model it. However, I would add that while inclusivity, i.e. the number of people able to take part, is a wonderful goal, the full coverage bibisect repo makes slicing and dicing regressions so fast that it would only take one or two people working even part time to deal with every regression that comes in. Even with a fast machine, the old way - coarse bibisect first, then source bisect if (as is often the case) there's no obvious culprit or the answer isn't 100% certain - took several hours per bug, while now, regressions in 4.4 can be dealt with in 5-15 minutes each from start to finish. With enough full coverage, when there are 70-100 reported regressions a month at 15 minutes each, that's only a couple of person-days of effort each month in total.
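
The effort estimate above can be checked with a few lines of arithmetic. The 8-hour person-day is an assumption (the thread does not state one), and 15 minutes is the upper end of the quoted 5-15 minute range.

```python
MINUTES_PER_REGRESSION = 15   # upper end of the 5-15 minute range quoted above
HOURS_PER_PERSON_DAY = 8      # assumption: not stated in the thread


def person_days(regressions_per_month):
    """Monthly effort in person-days for a given number of reported regressions."""
    hours = regressions_per_month * MINUTES_PER_REGRESSION / 60
    return hours / HOURS_PER_PERSON_DAY


low = person_days(70)
high = person_days(100)
print(f"{low:.1f} to {high:.1f} person-days per month")
```

Under these assumptions the range works out to roughly two to three person-days per month, consistent with "a couple of person-days".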

And apropos of nothing:
"Can bibisect builds be usefully distributed over several machines/VMs"
A mistake I made and had to spend some time clearing up is that unless you build each time in exactly the same directory location, the reproducibility of various libraries, mostly included externals, goes down the toilet. They do (to my mind, pointless) things like embedding the build location in the library itself. Git delta compression obviously sucks back most of the cost of that, but when you have a dozen libraries doing it in each commit, and thousands of commits, the marginal cost adds up to quite a lot. So in summary you can do it, but the configuration of each had better be as identical as can be.
However, it's certainly much easier to do it on one machine. I used two to accelerate the building of 4.4, but in fact just one of them, an Intel 4790k (a very cost effective solution at present, as I see one or two other developers have also found) can build at 4-5 times real time in terms of average daily commits, which is handily enough to keep up even with the occasional surge.
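
To make the build-location problem concrete, here is a toy sketch (not taken from any real library's build system) in which a stand-in "library" embeds the directory it was built in. The paths and the `fake_library` helper are hypothetical.

```python
import hashlib


def fake_library(build_dir):
    """Stand-in for a compiled library that embeds its build location."""
    machine_code = b"identical machine code for the same source commit"
    return machine_code + b"\0built in " + build_dir.encode() + b"\0"


def digest(data):
    """Hex digest used to compare two build outputs byte for byte."""
    return hashlib.sha256(data).hexdigest()


same_dir = digest(fake_library("/build/lo")) == digest(fake_library("/build/lo"))
other_dir = digest(fake_library("/build/lo")) == digest(fake_library("/home/vm2/lo"))
print(same_dir, other_dir)
```

The source is the same in both cases, yet the binaries only match when the build directory matches, so builds spread over several machines should all use the same path.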

Actions #12

Updated by Florian Effenberger over 8 years ago

  • Tracker changed from Feature to Task

Actions #13

Updated by Florian Effenberger over 8 years ago

  • Description updated (diff)

Updated the description.

Actions #14

Updated by Bjoern Michaelsen over 8 years ago

  • Assignee changed from Christian Lohmaier to Joel Madero

Edit: Urgh, I was misreading this one -- I wasn't paying attention. Marking as resolved.
@Joel: Feel free to reopen, if you disagree.

Actions #15

Updated by Bjoern Michaelsen over 8 years ago

  • Status changed from Feedback to Resolved
  • Assignee changed from Joel Madero to Christian Lohmaier

Actions #16

Updated by Florian Effenberger over 8 years ago

As a side note, since this might be related: Q4 milestone planning foresees (based on insight I received during the Hamburg meeting) repo building by TDF staff again, so if that is the only thing missing, this can be closed.

Actions #17

Updated by Florian Effenberger over 8 years ago

  • Status changed from Resolved to Closed