test flake in test_mem.sh #1569

Open
leftwo opened this issue Nov 21, 2024 · 2 comments

Comments

leftwo commented Nov 21, 2024

A few times in the past two weeks, the test_mem.sh test has not finished before hitting a timeout.

Here is an example of the last failure:
https://github.com/oxidecomputer/crucible/pull/1566/checks?check_run_id=33265882305

The GitHub job was updated in #1563 to reduce the timeout and to gather more data in the event of a failure.

leftwo commented Nov 21, 2024

This whole test takes two and a half minutes on Atrium:

alan@atrium:crucible$ time ./tools/test_mem.sh 
/home/alan/ws/crucible
Memory usage test begins at November 21, 2024 at 12:55:30 PM UTC
Memory usage values in kilobytes unless specified otherwise
Region with ES:16384 EC:16 BS:4096  Size: 1 GiB  Extent Size: 64 MiB
   PID     RSS RSS/EC     VSZ VSZ/EC    HEAP HEAP/EC   TOTAL TOTAL/EC     EC
 17111  104468   6529  129876   8117   82356    5147  129876     8117     16
 17125  106384   6649  132792   8299   85248    5328  132792     8299     16
 17123  104444   6527  129840   8115   82292    5143  129840     8115     16
Region:1 GiB  Extent:64 MiB  Total downstairs (pmap -x): 383 MiB
Size on disk of all region dirs: 3.07G or 3296823584

Region with ES:16384 EC:160 BS:4096  Size: 10 GiB  Extent Size: 64 MiB
   PID     RSS RSS/EC     VSZ VSZ/EC    HEAP HEAP/EC   TOTAL TOTAL/EC     EC
  8318  113624    710  140164    876   92632     578  140164      876    160
  8324  112504    703  139044    869   91524     572  139044      869    160
  8320  113316    708  139856    874   92296     576  139856      874    160
Region:10 GiB  Extent:64 MiB  Total downstairs (pmap -x): 409 MiB
Size on disk of all region dirs: 30.7G or 32968229075

Region with ES:16384 EC:1600 BS:4096  Size: 100 GiB  Extent Size: 64 MiB
   PID     RSS RSS/EC     VSZ VSZ/EC    HEAP HEAP/EC   TOTAL TOTAL/EC     EC
 12688  178052    111  216140    135  168508     105  216140      135   1600
 12690  176852    110  214936    134  167304     104  214936      134   1600
 12689  175952    109  214036    133  166404     104  214036      133   1600
Region:100 GiB  Extent:64 MiB  Total downstairs (pmap -x): 629 MiB
Size on disk of all region dirs: 307G or 329682283958

Memory usage test finished on November 21, 2024 at 12:58:01 PM UTC

real    2m31.320s
user    3m27.535s
sys     4m35.862s

So, the current timeout of an hour should be more than enough.
A recent buildomat run showed this test taking ~20 minutes:

buildomat / test-memory
succeeded 3 days ago in 20m 40s

But I believe that time is measured from when the job starts and includes waiting for the dependencies to finish. The time spent waiting on dependencies does not count against the script-specific timeout of one hour.

leftwo commented Nov 21, 2024

If I look at the ps output that is gathered after the test hits the timeout, I don't see a crutest process.

When I compare a passing test run to a failed timeout run, I notice that the passing run has crutest output from the third loop, where we test the 100 GiB region:

Region: sv:1 bs:4096  ts:107374182400  tb:26214400  max_io:256 or 1048576
Fill test

However, on the failing run, the crutest log shows results from what I think is the second loop of the test, where we test only 10 GiB:

Disk: sv:1 bs:4096  ts:10737418240  tb:2621440  max_io:256 or 1048576
Fill test
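
A quick bc check (assuming ts here is the total size in bytes and tb is ts divided by the block size, which matches the numbers) confirms which loop each of those lines came from:

    # Assumed reading: ts = total size in bytes, tb = total blocks (ts / bs)
    $ echo "107374182400 / 1024 / 1024 / 1024" | bc   # passing run: 100 GiB
    100
    $ echo "10737418240 / 1024 / 1024 / 1024" | bc    # failing run: 10 GiB
    10
    $ echo "10737418240 / 4096" | bc                  # matches tb:2621440
    2621440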

From this I suspect that in the timeout case we never get around to starting that third loop.
The last thing we see in the failing run is this line:

	2024-11-20T14:31:54.613Z	Size on disk of all region dirs: 30.7G or 32968229075

Looking at the test_mem.sh script, that line is printed from the mem_test function, which ends with this:

    region_summary=$(du -sAh "$region_dir" | awk '{print $1}')
    region_size=$(du -sA "$region_dir" | awk '{print $1}')
    echo "Size on disk of all region dirs: $region_summary or $region_size"
        
    set +o errexit
    "$dsc" cmd shutdown >> "$test_mem_log" 2>&1
    wait $dsc_pid
    unset dsc_pid
    set -o errexit
    echo ""
} 

The next thing the test will do is call the mem_test() function again with larger sizes.
mem_test() begins with this code:

function mem_test() {
    if [[ $# -ne 2 ]]; then
        echo "Missing EC and ES for mem_test()" >&2
        exit 1
    fi
    es=$1 
    ec=$2
        
    total_size=$(echo "$es * $ec * $block_size" | bc)
    size_mib=$(echo "$total_size / 1024 / 1024" | bc)
    size_gib=$(echo "$size_mib / 1024" | bc)

    echo -n "Region with ES:$es EC:$ec BS:$block_size  "

And we never see the "Region with ..." line printed. (There is more code after this point, and the function does eventually reach an echo without the -n, which would have completed that line.)
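
For reference, plugging the third loop's values into the same bc arithmetic shows what that missing header would have reported (es=16384, ec=1600, block_size=4096, taken from the passing run's output above):

    $ echo "16384 * 1600 * 4096" | bc                 # total_size in bytes
    107374182400
    $ echo "107374182400 / 1024 / 1024 / 1024" | bc   # size_gib
    100

That 107374182400 also matches the ts value crutest reported for the passing run's third loop.
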
Since the dsc shutdown looks to be where we are stuck, I'm starting to suspect this may be another case of #1498.
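
If the hang really is in the dsc shutdown, one way to surface the failure sooner would be to bound the wait at the end of mem_test rather than relying on the job-level timeout. A minimal sketch, assuming a 60-second limit and a kill fallback (neither is in the current script):

    # Sketch: give dsc up to 60 seconds to exit after the shutdown command,
    # then fail the test instead of hanging until the outer timeout.
    "$dsc" cmd shutdown >> "$test_mem_log" 2>&1
    deadline=$((SECONDS + 60))
    while kill -0 "$dsc_pid" 2>/dev/null; do
        if (( SECONDS >= deadline )); then
            echo "dsc (pid $dsc_pid) did not exit after shutdown" >&2
            kill -9 "$dsc_pid"
            exit 1
        fi
        sleep 1
    done
    wait "$dsc_pid"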

test_mem.sh does not have the same checks that other tests do, where we first verify all downstairs have started. Without that check, the other way to know you have hit #1498 is by not being able to shut down dsc.
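
For completeness, a sketch of what a startup check could look like before crutest runs; the crucible-downstairs process name, the count of three downstairs, and the 30-second limit are assumptions here, not taken from the other test scripts:

    # Sketch: wait up to 30 seconds for all three downstairs to be running
    # before starting crutest, so a stuck startup (#1498) fails fast.
    expected=3
    deadline=$((SECONDS + 30))
    while (( $(pgrep -f crucible-downstairs | wc -l) < expected )); do
        if (( SECONDS >= deadline )); then
            echo "only $(pgrep -f crucible-downstairs | wc -l) of $expected downstairs running" >&2
            exit 1
        fi
        sleep 1
    done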
