Implemented --cache-verify and --cache-force now overwrites #1908
Could at least the changed behavior of `--cache-force`, i.e. that it overwrites existing files, be considered for acceptance?

Background: We (a team of about 100 developers) have used SCons to build our software for over a decade and are very happy with it. We use a large shared cache that dramatically speeds up building of the various code branches and variants we have to build. However, in rare cases builds are broken because some of the files in the cache are corrupted. That might be due to the reasons listed above or other unknown reasons. A `scons --cache-force` with the proposed changed behavior quickly fixes that issue, whereas deleting the cache would cause days of rebuilds.
I've played with this patch in the past, and have a branch where it's applied (it no longer rebases, so it's probably better to start over if there's agreement to proceed), and it seemed reasonable to me. What I gathered from the discussion was that there was a reluctance to put a band-aid over a problem rather than find a root cause, some reluctance to add more options (which is a little silly), and then the people who cared enough to keep pushing it left the project. If you've been using the code successfully for that long, it seems worth reconsidering.
Thanks for the quick reply and for considering this enhancement. In our case we need the "band-aid" to quickly get the builds working again. That gives us time for a root cause analysis. Sometimes the root cause even turned out to be out of our control (e.g. hardware issues), so there is nothing we could do to prevent it. Our cache has a size of 4 TB with 1.5 million files. 100+ developers and multiple automated build systems are using it. A lot of things can happen in a system of that scale. Thanks,
Great... thanks for scaring me half to death :-) |
@pweseloh - which versions of Python and SCons are you using? We did some work on minimizing the possibilities for cache corruption in the last few years.
@bdbaddog we are using Python 3.11.4 and SCons v4.5.2.120fd4f633e9ef3cafbc0fec35306d7555ffd1db. The last cases of cache corruption that I analyzed had nothing to do with SCons itself. One was caused by a missing dependency in the SConscript of a new component; another was due to a misconfigured build machine (wrong version of a system header). What we need is a quick way to repair the cache in such cases, and the proposed behavior change of `--cache-force` has proven to work well for us for many years.
This issue was originally created at: 2008-02-12 08:41:52.
This issue was reported by: belley.

Hi everyone,
I have implemented the following command-line option:

I have also slightly modified the behavior of `--cache-force` and `--cache-populate`. As discussed on the mailing list, the change in behavior should not affect the current usage of these options.

(Submitted against changeset 2659 of branches/core.)

Benoit
Created an attachment (id=303)
Implementation + regression tests
Benoit - What's the use model for this? Central cache, flags used by any and/or specified user to update cache from other build tree?
Bug party triage: Jim to research use case and see if there's a base issue (e.g., asynchronous cache updates) that's the root cause of this fix, and whether the root cause might be fixable.
(Attached previous e-mail to this bug description to help tracking.)
Hi William,
A user might introduce a bug in one of their SConscripts in such a way that it causes cache corruption. These options are useful for identifying and fixing such bugs in your SConscripts.
Let me expand a bit on this.
SCons uses build signatures to fetch elements out of the cache. But a build signature can sometimes be erroneous. I have found a few instances of this at my own workplace. For example:
Please note that caching a file under the wrong build signature is EXTREMELY dangerous. It will be re-fetched from the cache over and over again. Your build will fail mysteriously. You try cleaning (i.e. `scons --clean`) and the next build fails again. The worst part is that the build does not crash at the point where the erroneous file is fetched from the cache, but at the point where the file is used later in the build. That makes it hard to debug.

The `--cache-verify` option is very handy for tracking down these issues and fixing the faulty SConscripts. It is very important that the bugs in the SConscripts be fixed, because otherwise your cache will get corrupted quickly. You'll be erasing your cache so often that you might as well not have a cache... ;-(

The `--cache-force` option is used to fix a broken cache once you have identified the bug in your SConscript. You could also just erase your cache, but selectively using `--cache-force` to purge the stale files out of the cache is faster. This aspect is probably more important for central caches, where erasing the cache means hours and hours of recompilation.

Benoit
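As a rough illustration of the kind of check a cache-verify pass performs, here is a hypothetical sketch. It assumes a simplified content-addressed cache where each file is stored under the MD5 digest of its own contents; SCons's real cache keys are build signatures, not content hashes, so this is only an analogy, and the function names are invented for illustration:

```python
import hashlib
import os

def verify_cache(cache_dir: str, purge: bool = False) -> list[str]:
    """Scan a simplified content-addressed cache and report entries whose
    contents no longer match the name they are stored under.  With
    purge=True, corrupt entries are deleted so they will be rebuilt and
    re-pushed (the spirit of the proposed --cache-force behavior)."""
    corrupt = []
    for root, _dirs, files in os.walk(cache_dir):
        for name in files:
            path = os.path.join(root, name)
            with open(path, "rb") as f:
                digest = hashlib.md5(f.read()).hexdigest()
            if digest != name:
                corrupt.append(path)
                if purge:
                    os.remove(path)  # stale entry: force a rebuild/re-push
    return corrupt
```

The point of the sketch is that a verify pass only needs to read the cache, while a force/purge pass selectively removes the bad entries instead of erasing the whole cache.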
Greg wrote:
Hi Greg,
As explained in my previous e-mail (I have just attached it to this bug report), an error in a user-written SConscript may cause corruption of the cache. The `--cache-verify` option is therefore useful for debugging SConscripts and, IMO, should be available to any user.

Does that make sense?
BTW, I am currently rebasing the patch to the changeset 2773 and verifying that it works with Python 2.5.1.
Benoit
Benoit, there's no doubt that this fix helps in the short term; the question was really if it's more productive to go after the root causes so that a short-term fix is not required. SCons already has too many confusing command-line options; if we can avoid creating any more, we're ahead. If there's a way to be sure the problem doesn't arise in the first place or if the cache becomes self-correcting, these options may not be needed. If they will only be needed in the short term, they could be documented as for debugging only and to be withdrawn when the root causes are fixed.
Another potential use case to add to the ones mentioned by Benoit would be to help track down problems caused by buggy network hardware and/or software.
I wouldn't be surprised if the problems I'm having are due to some network flakiness, but I'm not really 100% sure.
Greg wrote:
Hi Greg,
I agree that SCons already has too many command-line options. Unfortunately, I do not see how we could entirely eliminate the need for the `--cache-verify` option. Let's examine the various cases/suggestions:

Bugs in SCons

I just submitted a patch for a bug in SCons that could cause cache corruption (due to the erroneous re-use of a corrupted file). See issue "SCons shoud double-check the node's signature before assumming a file is up-to-date" #2014. I feel that this is not an isolated case. The signature system is complex. Bugs will be introduced. The `--cache-verify` option gives an effective way of tracking down these types of bugs (in addition to `--tree` and sconsign).

Bugs in SConscripts

This is an error that a user makes while writing their SConscript that leads to an erroneous signature. I have already encountered a few examples of this: an action that generates different targets on different OSes, machines, versions of Python, etc.; a signature that is incomplete because a user-written scanner or emitter fails to list all dependencies; a FunctionAction that does not list every accessed construction variable in its varlist.

I could see that the varlist case could be fixed by dynamically recording every accessed construction variable using a proxy of some sort. But for the other cases, this is not obvious at all.
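The proxy idea mentioned above could look something like the following sketch. `RecordingEnv` is a hypothetical name, not an SCons class; it simply wraps a dict-like environment and records every construction variable an action reads, so a varlist could in principle be derived automatically instead of maintained by hand:

```python
class RecordingEnv:
    """Hypothetical wrapper around an SCons-like environment dict that
    records every construction variable an action reads.  The recorded
    set is a candidate for the action's varlist."""

    def __init__(self, env: dict):
        self._env = env
        self.accessed: set[str] = set()

    def __getitem__(self, key: str):
        self.accessed.add(key)  # remember which variables were consulted
        return self._env[key]
```

For example, after building a command string with `env['CC']` and `env['CFLAGS']`, `env.accessed` would contain exactly those two names.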
Corrupted disk and/or network

Not much we can do about this. At least the `--cache-verify` and `--cache-force` options let you fix your cache without having to recompile the entire world to rebuild your caches.

Self-correcting cache

I fail to see how this could work. For example, a build fetches a file from the cache because the build signature matches, but the fetched file is, for some reason, actually different from the one that you would have built. How could you determine that the file you just fetched differs from the one you would have built, without actually building it?
Benoit
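One partial answer to the self-correction question (a hypothetical sketch, not SCons's implementation; the function names are invented) is to record a checksum alongside each entry at push time and re-check it at fetch time. This catches bit-rot from a flaky disk or network, though it cannot catch a file that was pushed under the wrong build signature in the first place:

```python
import hashlib
import shutil

def cache_push(src: str, entry: str) -> None:
    """Copy src into the cache and record a sidecar checksum."""
    shutil.copyfile(src, entry)
    with open(src, "rb") as f:
        digest = hashlib.md5(f.read()).hexdigest()
    with open(entry + ".md5", "w") as f:
        f.write(digest)

def cache_fetch(entry: str, dst: str) -> bool:
    """Fetch only if the entry still matches its recorded checksum.
    Detects bit-rot after the push, but not a file pushed under the
    wrong signature to begin with."""
    with open(entry, "rb") as f:
        digest = hashlib.md5(f.read()).hexdigest()
    with open(entry + ".md5") as f:
        expected = f.read().strip()
    if digest != expected:
        return False  # corrupted entry: caller should rebuild and re-push
    shutil.copyfile(entry, dst)
    return True
```

So the cache can self-correct against corruption-in-storage, but Benoit's point stands for wrong-signature entries: those look internally consistent and can only be detected by rebuilding.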
The first patch was broken on Windows because `os.rename()` fails on Windows if the target already exists. This was not detected before because the cache-write-error warning is turned off by default.

I have also realized that built targets could be copied back to the cache multiple times because:

- `push()` and `push_if_cached()` are called for targets that are actually built.
- The `visited()` method was called twice, once in `make_ready()` and a second time in `executed()`!

This was all fixed.

(Is it too late to check this in before 1.0? It does not seem too risky to me.)
Note that the newer patch was submitted against the changeset 3019.
Benoit
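On modern Python (3.3+), the Windows rename problem described above has a standard-library answer: `os.replace()` overwrites an existing target on both POSIX and Windows, whereas `os.rename()` raises on Windows when the target exists. A hedged sketch of how a cache write could use it (the function name is invented for illustration, not SCons code):

```python
import os
import shutil
import tempfile

def atomic_cache_write(src: str, entry: str) -> None:
    """Write a cache entry via a temporary file in the same directory,
    then os.replace(), which overwrites an existing target on POSIX
    *and* Windows.  Staging in the same directory keeps the final
    replace on one filesystem, so readers never see a partial file."""
    cache_dir = os.path.dirname(entry)
    fd, tmp = tempfile.mkstemp(dir=cache_dir)
    os.close(fd)
    try:
        shutil.copyfile(src, tmp)
        os.replace(tmp, entry)  # atomic overwrite on the same filesystem
    finally:
        if os.path.exists(tmp):
            os.remove(tmp)  # clean up if the copy or replace failed
```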
There may yet be an asynchronous cache issue, but if so, it's rare enough that it's hard to know if there's an actual problem or whether it's people killing builds at inopportune times. Certainly the code looks ok, for whatever that's worth.
Our error rate for the cache is pretty darn low now (or people aren't telling me any more, but I presume it's the former :)
What type of network filesystem is this?
The description sounds like a CIFS share with Windows clients. If that is the case it would be interesting to known what the OS version and service pack level for the server and clients are.
You are correct, it is a samba server. The clients are mostly XP and a few Vista, mostly up-to-date. The server is FreeBSD 6.
Bug party triage: If `--cache-verify` is strictly for debugging, the name should be changed to something with "debug" in it.

Benoit, if you want to do it, assign the issue to yourself and go for it.
Implementation + regression tests