separate_asset_per_band gives some empty tiffs #329
Comments
Status: Disabling the fuse mount and using S3 directly might cause issues with export_workspace.
…s before the executor gets closed. #329
The job got through with the file-move approach. Executors ran out of memory (OOM) a few times, which might be the original cause of the incomplete output files.
Curious observation, the file permissions in the fusemount changed over time:
…ults to the output folder from driver, and clean up afterwards. #329
emile@emile-Precision-7680:~$ kubectl --kubeconfig ~/.kube/cdse_dev.yml -n spark-jobs-dev exec -it a-8dec3512508e410e878e3742c07ef718-driver -c spark-kubernetes-driver -- /bin/bash
bash-4.4$ cd /batch_jobs/j-241022e02b9745ce8438c66bbffbb5f9
bash-4.4$ date --iso-8601=seconds && ls -al
2024-10-22T15:05:50+00:00
total 667033
drwxr-xr-x. 2 spark spark 4096 Oct 22 15:05 .
drwxr-xr-x. 1 spark spark 48 Oct 22 14:59 ..
-rw-rw-r--. 1 spark spark 110510198 Oct 22 15:04 LCFM_LSF-ANNUAL-GAMMA0_V100_2020_44RNT_FEATURES.tif_VHVV-RATIO_P75.tif
-rw-rw-r--. 1 spark spark 115616258 Oct 22 15:05 LCFM_LSF-ANNUAL-GAMMA0_V100_2020_44RNT_FEATURES.tif_VH_P10.tif
-rw-rw-r--. 1 spark spark 114191269 Oct 22 15:04 LCFM_LSF-ANNUAL-GAMMA0_V100_2020_44RNT_FEATURES.tif_VH_P75.tif
-rw-rw-r--. 1 spark spark 115283524 Oct 22 15:04 LCFM_LSF-ANNUAL-GAMMA0_V100_2020_44RNT_FEATURES.tif_VV_P10.tif
-rw-rw-r--. 1 spark spark 113842171 Oct 22 15:04 LCFM_LSF-ANNUAL-GAMMA0_V100_2020_44RNT_FEATURES.tif_VV_P75.tif
-rw-rw-r--. 1 spark spark 113548381 Oct 22 15:04 LCFM_LSF-ANNUAL-GAMMA0_V100_2020_44RNT_FEATURES.tif_VV_P90.tif
-rw-rw-r--. 1 spark spark 40783 Oct 22 15:05 job_metadata.json
-rw-rw-r--. 1 spark spark 2487 Oct 22 14:59 job_specification.json
bash-4.4$ date --iso-8601=seconds && ls -al
2024-10-22T15:05:53+00:00
total 667055
drwxr-xr-x. 2 spark spark 4096 Oct 22 15:05 .
drwxr-xr-x. 1 spark spark 48 Oct 22 14:59 ..
-rw-r--r--. 1 spark spark 635 Oct 22 15:05 LCFM_LSF-ANNUAL-GAMMA0_V100_2020_44RNT_FEATURES.tif_VHVV-RATIO_P10.tif.json
-rw-r--r--. 1 spark spark 635 Oct 22 15:05 LCFM_LSF-ANNUAL-GAMMA0_V100_2020_44RNT_FEATURES.tif_VHVV-RATIO_P25.tif.json
-rw-r--r--. 1 spark spark 635 Oct 22 15:05 LCFM_LSF-ANNUAL-GAMMA0_V100_2020_44RNT_FEATURES.tif_VHVV-RATIO_P50.tif.json
-rw-rw-r--. 1 spark spark 110510198 Oct 22 15:04 LCFM_LSF-ANNUAL-GAMMA0_V100_2020_44RNT_FEATURES.tif_VHVV-RATIO_P75.tif
-rw-r--r--. 1 spark spark 389 Oct 22 15:05 LCFM_LSF-ANNUAL-GAMMA0_V100_2020_44RNT_FEATURES.tif_VHVV-RATIO_P75.tif.aux.xml
-rw-r--r--. 1 spark spark 834 Oct 22 15:05 LCFM_LSF-ANNUAL-GAMMA0_V100_2020_44RNT_FEATURES.tif_VHVV-RATIO_P75.tif.json
-rw-r--r--. 1 spark spark 635 Oct 22 15:05 LCFM_LSF-ANNUAL-GAMMA0_V100_2020_44RNT_FEATURES.tif_VHVV-RATIO_P90.tif.json
-rw-rw-r--. 1 spark spark 115616258 Oct 22 15:05 LCFM_LSF-ANNUAL-GAMMA0_V100_2020_44RNT_FEATURES.tif_VH_P10.tif
-rw-r--r--. 1 spark spark 391 Oct 22 15:05 LCFM_LSF-ANNUAL-GAMMA0_V100_2020_44RNT_FEATURES.tif_VH_P10.tif.aux.xml
-rw-r--r--. 1 spark spark 796 Oct 22 15:05 LCFM_LSF-ANNUAL-GAMMA0_V100_2020_44RNT_FEATURES.tif_VH_P10.tif.json
-rw-r--r--. 1 spark spark 603 Oct 22 15:05 LCFM_LSF-ANNUAL-GAMMA0_V100_2020_44RNT_FEATURES.tif_VH_P25.tif.json
-rw-r--r--. 1 spark spark 603 Oct 22 15:05 LCFM_LSF-ANNUAL-GAMMA0_V100_2020_44RNT_FEATURES.tif_VH_P50.tif.json
-rw-rw-r--. 1 spark spark 114191269 Oct 22 15:04 LCFM_LSF-ANNUAL-GAMMA0_V100_2020_44RNT_FEATURES.tif_VH_P75.tif
-rw-r--r--. 1 spark spark 392 Oct 22 15:05 LCFM_LSF-ANNUAL-GAMMA0_V100_2020_44RNT_FEATURES.tif_VH_P75.tif.aux.xml
-rw-r--r--. 1 spark spark 797 Oct 22 15:05 LCFM_LSF-ANNUAL-GAMMA0_V100_2020_44RNT_FEATURES.tif_VH_P75.tif.json
-rw-r--r--. 1 spark spark 603 Oct 22 15:05 LCFM_LSF-ANNUAL-GAMMA0_V100_2020_44RNT_FEATURES.tif_VH_P90.tif.json
-rw-rw-r--. 1 spark spark 115283524 Oct 22 15:04 LCFM_LSF-ANNUAL-GAMMA0_V100_2020_44RNT_FEATURES.tif_VV_P10.tif
-rw-r--r--. 1 spark spark 390 Oct 22 15:05 LCFM_LSF-ANNUAL-GAMMA0_V100_2020_44RNT_FEATURES.tif_VV_P10.tif.aux.xml
-rw-r--r--. 1 spark spark 795 Oct 22 15:05 LCFM_LSF-ANNUAL-GAMMA0_V100_2020_44RNT_FEATURES.tif_VV_P10.tif.json
-rw-r--r--. 1 spark spark 603 Oct 22 15:05 LCFM_LSF-ANNUAL-GAMMA0_V100_2020_44RNT_FEATURES.tif_VV_P25.tif.json
-rw-r--r--. 1 spark spark 603 Oct 22 15:05 LCFM_LSF-ANNUAL-GAMMA0_V100_2020_44RNT_FEATURES.tif_VV_P50.tif.json
-rw-rw-r--. 1 spark spark 113842171 Oct 22 15:04 LCFM_LSF-ANNUAL-GAMMA0_V100_2020_44RNT_FEATURES.tif_VV_P75.tif
-rw-r--r--. 1 spark spark 389 Oct 22 15:05 LCFM_LSF-ANNUAL-GAMMA0_V100_2020_44RNT_FEATURES.tif_VV_P75.tif.aux.xml
-rw-r--r--. 1 spark spark 794 Oct 22 15:05 LCFM_LSF-ANNUAL-GAMMA0_V100_2020_44RNT_FEATURES.tif_VV_P75.tif.json
-rw-rw-r--. 1 spark spark 113548381 Oct 22 15:04 LCFM_LSF-ANNUAL-GAMMA0_V100_2020_44RNT_FEATURES.tif_VV_P90.tif
-rw-r--r--. 1 spark spark 389 Oct 22 15:05 LCFM_LSF-ANNUAL-GAMMA0_V100_2020_44RNT_FEATURES.tif_VV_P90.tif.aux.xml
-rw-r--r--. 1 spark spark 794 Oct 22 15:05 LCFM_LSF-ANNUAL-GAMMA0_V100_2020_44RNT_FEATURES.tif_VV_P90.tif.json
-rw-r--r--. 1 spark spark 2313 Oct 22 15:05 collection.json
-rw-rw-r--. 1 spark spark 42573 Oct 22 15:05 job_metadata.json
-rw-rw-r--. 1 spark spark 2487 Oct 22 14:59 job_specification.json
bash-4.4$ command terminated with exit code 137
However, Kibana shows that the driver did not find the path, even after the fuse mount in the same pod showed that it existed. The two events are only 1 second apart, so the timestamps may be slightly offset.
Will try with a wait loop now.
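A wait loop of this kind could look roughly like the sketch below. It is only an illustration, not the code that was committed: the function name `wait_for_path` and the `timeout_s` / `poll_interval_s` parameters are hypothetical, and the check assumes that `os.path.exists` on the fuse mount eventually reflects the written file.

```python
import os
import time


def wait_for_path(path, timeout_s=30.0, poll_interval_s=0.5):
    """Poll until `path` exists or `timeout_s` elapses.

    Returns True if the path appeared in time, False otherwise.
    Hypothetical helper: name and defaults are assumptions.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if os.path.exists(path):
            return True
        if time.monotonic() >= deadline:
            return False
        # Give the fuse mount some time to catch up before checking again.
        time.sleep(poll_interval_s)
```

A caller would typically raise or log an error when the function returns False, rather than continue with a missing file.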
Retrying did work on CDSE dev:
TODO: deduplicate time_machine test code
FYI a test in this integration test run failed (Kibana) with:
There's also this one, where two executors apparently attempted to write the same output asset (Kibana), but it ultimately went missing.
OK, then writing per executor followed by a move/copy is really needed.
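The write-then-move idea can be sketched as follows. This is a minimal illustration under assumptions, not the project's implementation: `write_then_move` is a hypothetical helper, the scratch file is placed in the default local temp directory, and the final destination is assumed to be a path on the mounted output folder.

```python
import os
import shutil
import tempfile


def write_then_move(data: bytes, final_path: str) -> str:
    """Write `data` to a local scratch file first, then move it to `final_path`.

    Hypothetical helper: the file only shows up at its final location
    after it has been written completely.
    """
    final_dir = os.path.dirname(final_path) or "."
    os.makedirs(final_dir, exist_ok=True)
    # Local scratch file (default temp dir), not on the output mount.
    fd, tmp_path = tempfile.mkstemp(suffix=".part")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
        # shutil.move renames when possible and falls back to copy + delete
        # when source and destination are on different filesystems.
        shutil.move(tmp_path, final_path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
    return final_path
```

Because each call uses its own scratch file, two executors never write into the same partially written file; which of two complete files ends up at the destination would still need to be decided separately.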
Observed errors with the
This might be due to a flaky S3 connection or a race condition between executors.
…e results. (This is also implicitly tested in CDSE integration tests) Open-EO/openeo-geotrellis-extensions#329
Example graph that uses separate_asset_per_band and has empty TIFF files: j-241009a45a764383a3a3db1453b9881f
Making the batch job write to S3 directly instead of to the fuse mount avoids this issue.
Need to check if fsync also avoids the issue: https://github.com/yandex-cloud/geesefs/blob/master/README.md?plain=1#L279-L299
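For the fsync check, a test write could be done along these lines. This is a sketch only: `write_and_fsync` is a hypothetical helper, and whether an fsync on the fuse mount really forces the data to be uploaded is exactly the open question, so the code shows the call sequence and not a confirmed fix.

```python
import os


def write_and_fsync(path: str, data: bytes) -> None:
    """Write `data` to `path` and request that it is flushed to storage.

    Hypothetical helper for testing whether fsync avoids the empty files.
    """
    with open(path, "wb") as f:
        f.write(data)
        f.flush()             # push Python's userspace buffer to the OS
        os.fsync(f.fileno())  # ask the OS / filesystem to persist the file
```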
https://teams.microsoft.com/l/message/19:[email protected]/1728034200593?tenantId=9e2777ed-8237-4ab9-9278-2c144d6f6da3&groupId=8c9c739d-2544-4def-8cd4-b65970551b70&parentMessageId=1728034200593&teamName=Unit%20TAP&channelName=openEO-users&createdTime=1728034200593