Implemented --cache-verify and --cache-force now overwrites #1908

bdbaddog · 2018-01-02T12:00:32Z

This issue was originally created at: 2008-02-12 08:41:52.
This issue was reported by: belley.

belley said at 2008-02-12 08:41:52

Hi everyone,

I have implemented the following command-line option:

  --cache-verify
         When  using CacheDir(), verify that already-existing, up-to-date
         derived files and files built by this invocation match the  ones
         present  in  the cache. This is done by performing a binary com-
         parison between each derived file and  its  corresponding  cache
         file. A warning message is printed if they do not match. This is
         useful to identify files that are corrupted in the cache.

         One has to be careful when interpreting the --cache-verify warn-
         ings.  Derived files can hash to same cache entry if their build
         signature is incomplete.

         For example, a scanner that forgets to list all of its  implicit
         dependencies  could  lead  to  an  action  producing a different
         derived file on different machines and yet they would  all  hash
         to  the same cache file. In that case, --cache-verify would com-
         plain about files not matching, but the problem is  not  a  cor-
         rupted cache.

I have also modified slightly the behavior of --cache-force and --cache-populate. As discussed on the mailing list, the change in behavior should not affect the current usage of these options.

  --cache-force
         When  using CacheDir(), populate a cache by copying any already-
         existing, up-to-date derived files to the cache, in addition  to
         files  built  by  this invocation.  This is useful to populate a
         new cache with all the current derived files, or to add  to  the
         cache any derived files recently built with caching disabled via
         the --cache-disable option.

         When  using  the  --cache-disable  and   --cache-force   options
         together, scons will not retrieve files from the cache, but will
         still copy derived files to the cache.

         When using the --cache-force option, derived files  will  always
         be copied to the cache, overwriting the corresponding file if it
         is already present. This behavior is useful for getting  rid  of
         corrupted files that might exist in the cache.


  --cache-populate
         When  using CacheDir(), populate a cache by copying any already-
         existing, up-to-date derived files to the cache, in addition  to
         files  built  by  this invocation.  This is useful to populate a
         new cache with all the current derived files, or to add  to  the
         cache any derived files recently built with caching disabled via
         the --cache-disable option.

         When using  the  --cache-disable  and  --cache-populate  options
         together, scons will not retrieve files from the cache, but will
         still copy derived files to the cache.

         When using the --cache-populate option, derived files  will  not
         be  copied  to  the  cache  if the corresponding file is already
         present.

(Submitted against the changeset 2659 of branches/core.)

Benoit

belley said at 2008-02-12 08:42:23

Created an attachment (id=303)
Implementation + regression tests

bdbaddog said at 2008-04-08 23:12:22

Benoit - What's the use model for this? Central cache, flags used by any and/or specified user to update cache from other build tree?

gregnoel said at 2008-04-14 13:23:17

Bug party triage: Jim to research use case and see if there's a base issue (e.g., asynchronous cache updates) that's the root cause of this fix, and whether the root cause might be fixable.

belley said at 2008-04-14 13:25:32

(Attached previous e-mail to this bug description to help tracking.)

Hi William,

A user might introduce a bug in one of its SConscript in such a way that it causes cache corruption. These options are useful to identify these bugs in your SConscripts and fix them.

Let me expand a bit on this.

SCons uses build signatures to fetch element out of the cache. But, a build signature can sometime be erroneous. I have found a few instances of these at my very own work place. For example:

A user might forget to properly specify the varlist of an action. The result is that different targets get stored under the same cache key (i.e. build signature).
A builder might produce a different target on different OSes even if its build signature is the same. The different handling of CR-LF under Windows and UNIX makes it very easy to have automatically generated headers files to have the same build signature but still be different.
There could be a bug in SCons itself. For example, issue Dont use KeyboardInterrupt to stop a build! #1907 is related to a corrupted cache after interrupting an SCons build with a Ctrl-C.

Please note that caching a file under the wrong build signature is EXTREMELY dangerous. It will be re-fetched from the cache over and over again. Your build will fail mysteriously. You try cleaning (i.e. scons --clean) and the next build fails again. The worst part is that the build does not crash at the point where the erroneous file is fetched from the cache but at the point where the file is used later in the build. That's make it hard to debug.

The option --cache-verify is very handy to tracked down these issues and fix the faulty SConscripts. It is very important that the bugs in the SConscript's be fixed because otherwise your cache will get corrupted quickly. You'll be erasing your cache so often that you might as well not have a cache... ;-(

The --cache-force option is used to fix a broken cache once you have identify the bug in your SConscript. You could also just erase your cache, but selectively using --cache-force to purge the stale file out of the cache is faster. This aspect is probably more important for central caches where erasing the cache means hours and hours of recompilation.

Benoit

belley said at 2008-04-14 13:29:44

Greg wrote:

Bug party triage: Jim to research use case and see if there's a base issue? (e.g., asynchronous cache updates) that's the root cause of this fix, and whether the root cause might be fixable.

Hi Greg,

As explained in my previous e-mail (I have just attached it to this bug report), an error in a user-written SConscript my cause a corruption of the cache. The --cache-verify is therefore useful to debug SConscript and therefore IMO should be available to any user.

Does it make sense ?

BTW, I am currently rebasing the patch to the changeset 2773 and verifying that it works with Python 2.5.1.

Benoit

gregnoel said at 2008-04-14 19:48:41

Benoit, there's no doubt that this fix helps in the short term; the question was really if it's more productive to go after the root causes so that a short-term fix is not required. SCons already has too many confusing command-line options; if we can avoid creating any more, we're ahead. If there's a way to be sure the problem doesn't arise in the first place or if the cache becomes self-correcting, these options may not be needed. If they will only be needed in the short term, they could be documented as for debugging only and to be withdrawn when the root causes are fixed.

mightyllamas said at 2008-04-14 20:22:56

Another potential use case to add to the ones mentioned by Benoit would be to help track down problems caused by buggy network hardware and/or software.

I wouldn't be surprised if the problems I'm having is due to some network flakiness, but I'm not really 100% sure.

belley said at 2008-04-15 07:55:48

Greg wrote:

Benoit, there's no doubt that this fix helps in the short term; the question was really if it's more productive to go after the root causes so that a short-term fix is not required. SCons already has too many confusing command-line options; if we can avoid creating any more, we're ahead. If there's a way to be sure the problem doesn't arise in the first place or if the cache becomes self-correcting, these options may not be needed. If they will only be needed in the short term, they could be documented as for debugging only and to be withdrawn when the root causes are fixed.

Hi Greg,

I agree that SCons already has too many command-line options. Unfortunately, I do not see how could entirely eliminate the need for the --cache-verify option. Let's examine the various cases/suggestions:

Bugs in SCons

I just submitted a patch for a bug in SCons that could cause cache corruption (due the erronous re-use of a corrupted file). See issue SCons shoud double-check the node's signature before assumming a file is up-to-date #2014. I feel that this is not an isolated case. The signature system is complex. Bugs will be introduced. The --cache-verify option gives an effective way for tracking down these types of bugs (in addition to --tree and sconsign).
Bugs in SConscript

This is an error that a user makes while writting its SConscript that leads to an erronous signature.

I have already encountered a few examples of this. An action that generate different targets on different OS, machines, versions of Python, etc. A signature might be incomplete because a user-written scanner or emitter fails to list all dependencies. A FunctionAction does not list every accessed construction variables in its varlist.

I could see that the varlist one could be fixed by dynamically listing every accessed construction variables using an proxy of some sort. But for the other cases, this is not obvious at all.
Corrupted disk and or network

Not much we can do about this. At least the --cache-verify and --cache-force options let you fix your cache without having to recompile the entire world to rebuild your caches.
Self-correcting cache

I fail to see how this could work. For example, a build fetches a file from the cache because the build signature matches, but the file just fetched, for some reason, it actually different than the one that you have built. How could you determine that the file that you just fetched is actually different that the one that you would have built without building it ?

Benoit

belley said at 2008-06-02 12:02:21

The first patch was broken on Windows because os.rename() fails on Windows if the target already exists. This was not detected before because the cache-write-error warning is turned off by default.

I have also realized that built targets could be copied multiple times back to the cache because:

Both push() and push_if_cached() are called for targets that are actually built.
For up-to-date targets, the visited() method was called twice, once in make_ready() and a second time in executed() !!!

This was all fixed.

(Is it too late to check this in before 1.0 ? It does not seem too risky to me.)

Note that the newer patch was submitted against the changeset 3019.

Benoit

mightyllamas said at 2009-03-04 07:46:55

There may yet be an asynchronous cache issue, but if so, it's rare enough that it's hard to know if there's an actual problem or whether it's people killing builds at inopportune times. Certainly the code looks ok, for whatever that's worth.

Our error rate for cache is pretty darn low now (or people aren't telling me any more, but I presume it's the former :)

azverkan said at 2009-03-17 23:12:33

What type of network filesystem is this?

The description sounds like a CIFS share with Windows clients. If that is the case it would be interesting to known what the OS version and service pack level for the server and clients are.

mightyllamas said at 2009-03-18 17:10:54

You are correct, it is a samba server. The clients are mostly XP and a few Vista, mostly up-to-date. The server is FreeBSD 6.

gregnoel said at 2009-03-19 00:50:32

Bug party triage. If --cache-verify is strictly for debugging, the name should be changed to something with "debug" in it.

Benoit, if you want to do it, assign the issue to yourself and go for it.

belley attached cache-verify.patch at 2008-02-12 08:42:23.

Implementation + regression tests

The text was updated successfully, but these errors were encountered:

pweseloh · 2025-02-20T16:08:24Z

Could at least the changed behavior of '--cache-force', i.e. that it overwrites exiting files be considered to be accepted?
I applied that specific patch to our scons installation many years ago and it has helped us a lot when our large shared cache contained corrupted files.

Background: We (a team of about 100 developers) use scons to build our software for over a decade and are very happy with it. We use a large shared cache that speeds up building of the various code branches and variants we have to build dramatically. However in rare cases builds are broken because some of the files in the cache are corrupted. That might be due to the reasons listed above or other unknown reasons. A "scons --cache-force" with the proposed changed behavior quickly fixes that issue whereas deleting the cache would cause days of rebuilds.

mwichmann · 2025-02-20T16:33:21Z

I've played with this patch in the past, and have a branch where it's applied (it no longer rebases, so it's probably better to start over if there's agreement to proceed), and it seemed reasonable to me. What I've gathered from the discussion was there was a reluctance to put a band-aid over a problem rather than find a root cause, and some reluctance to add more options (which is a little silly), and then the people who cared enough to keep pushing it left the project. If you've been using the code successfully for that long, it seems it's worth reconsidering.

pweseloh · 2025-02-21T08:35:21Z

Thanks for the quick reply and considering this enhancement.

In our case we need the "band-aid" to quickly get the builds working again. That gives us time for a root cause analysis. Sometimes the root cause even turned out to be out of our control (e.g. HW issues) so there is nothing we could do to prevent it.

Our cache has a size of 4 TB with 1.5 million files. 100+ developers and multiple automated build system are using it. A lot of things can happen with a system of that scale.

Thanks,
Peter

mwichmann · 2025-02-21T14:04:51Z

Our cache has a size of 4 TB with 1.5 million files.

Great... thanks for scaring me half to death :-)

bdbaddog · 2025-02-21T18:45:09Z

@pweseloh - which versions of python and SCons are you using? We did some work on minimizing cache corruption possibilities issues in the last few years.

pweseloh · 2025-02-24T07:08:11Z

@bdbaddog we are using Python 3.11.4 and SCons v4.5.2.120fd4f633e9ef3cafbc0fec35306d7555ffd1db

The last cases of cache corruption, that I analyzed, had nothing to do with SCons itself. One was caused by a missing dependency in the SConscript of a new component, another was due to a misconfigured build machine (wrong version of a system header). What we need is a quick way to repair the cache in such cases and the proposed behavior change of '--cache-force' has proven to work well for us for many years.

bdbaddog added this to the 2.x milestone Jan 2, 2018

bdbaddog added the P3 label Jan 2, 2018

mwichmann mentioned this issue Apr 20, 2020

Generate identical cache MD5 for Linux and Windows #2666

Open

bdbaddog removed this from the 2.x milestone Apr 21, 2020

mwichmann added the has patch Issue has attached Patch label Jan 9, 2021

mwichmann added the cachedir label Jan 13, 2022

mwichmann mentioned this issue Apr 4, 2022

scons caches corrupted files if cache directory volume has no space left #2625

Open

mwichmann removed the P3 label May 17, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Implemented --cache-verify and --cache-force now overwrites #1908

Implemented --cache-verify and --cache-force now overwrites #1908

bdbaddog commented Jan 2, 2018 •

edited by mwichmann

Loading

pweseloh commented Feb 20, 2025

mwichmann commented Feb 20, 2025

pweseloh commented Feb 21, 2025

mwichmann commented Feb 21, 2025

bdbaddog commented Feb 21, 2025

pweseloh commented Feb 24, 2025

Implemented --cache-verify and --cache-force now overwrites #1908

Implemented --cache-verify and --cache-force now overwrites #1908

Comments

bdbaddog commented Jan 2, 2018 • edited by mwichmann Loading

pweseloh commented Feb 20, 2025

mwichmann commented Feb 20, 2025

pweseloh commented Feb 21, 2025

mwichmann commented Feb 21, 2025

bdbaddog commented Feb 21, 2025

pweseloh commented Feb 24, 2025

bdbaddog commented Jan 2, 2018 •

edited by mwichmann

Loading