separate_asset_per_band gives some empty tiffs #329

Closed
EmileSonneveld opened this issue Oct 8, 2024 · 9 comments · Fixed by #335 or #337

@EmileSonneveld
Contributor

EmileSonneveld commented Oct 8, 2024

Example graph that uses separate_asset_per_band and has empty tiff files: j-241009a45a764383a3a3db1453b9881f
Making the batchjob write to S3 directly instead of the fuse mount avoids this issue.
Need to check if fsync also avoids the issue: https://github.com/yandex-cloud/geesefs/blob/master/README.md?plain=1#L279-L299

https://teams.microsoft.com/l/message/19:[email protected]/1728034200593?tenantId=9e2777ed-8237-4ab9-9278-2c144d6f6da3&groupId=8c9c739d-2544-4def-8cd4-b65970551b70&parentMessageId=1728034200593&teamName=Unit%20TAP&channelName=openEO-users&createdTime=1728034200593

@EmileSonneveld EmileSonneveld self-assigned this Oct 8, 2024
EmileSonneveld added a commit that referenced this issue Oct 8, 2024
@EmileSonneveld
Contributor Author

Status:
Disabling the fuse mount makes all the files get written completely.
CDSE staging still uses the fuse mount, so the fuse mount could be used as a fallback.
It can be quickly enabled or disabled by changing these lines: https://git.vito.be/projects/TPT/repos/os_creodias_openeo_k8s/browse/kube_resources/applications/openeo/values_cdse-prod.yaml#4-7
and running the promote job again.

Disabling the fuse mount and using S3 directly might cause issues with export_workspace.

EmileSonneveld added a commit that referenced this issue Oct 14, 2024
@EmileSonneveld
Contributor Author

FileChannel.open(Path.of(path)).force(...) is an example taken from this library: https://github.com/eclipse-rdf4j/rdf4j/blob/main/core/common/io/src/main/java/org/eclipse/rdf4j/common/io/NioFile.java#L164C5-L164C7
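
The Java call above forces a file's content out to the underlying storage. A rough Python analogue is sketched below for illustration only. The helper name `write_and_fsync` is hypothetical, and whether the fuse mount honours the flush in this setup is exactly the open question in this issue.

```python
import os


def write_and_fsync(path: str, data: bytes) -> None:
    """Write data to path, then ask the OS to persist it (hypothetical helper)."""
    with open(path, "wb") as f:
        f.write(data)
        f.flush()             # push Python's userspace buffer to the OS
        os.fsync(f.fileno())  # ask the OS (or FUSE driver) to flush to storage
```

On a FUSE mount, what `fsync` does depends on how the driver implements it, so this only helps if the mount actually uploads on flush.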

@EmileSonneveld
Contributor Author

The job got through with the file-move approach. Executors ran out of memory (OOM) a few times, which might be the original reason for the incomplete output files.
With a file move, the files are first written to the pod's /tmp directory, though.

Oct 15, 2024 @ 16:05:23.716	INFO	stitchAndWriteToTiff writeGeoTiff done. filePath: /batch_jobs/j-241015d4f747427882375efb47c311db/openEO_VH_on_VV_P75.tif	package.scala
Oct 15, 2024 @ 16:05:21.903	INFO	FileAlreadyExistsException. Will overwrite file: /batch_jobs/j-241015d4f747427882375efb47c311db/openEO_VH_on_VV_P90.tif	package.scala
Oct 15, 2024 @ 16:05:18.156	INFO	FileAlreadyExistsException. Will overwrite file: /batch_jobs/j-241015d4f747427882375efb47c311db/openEO_VV_P75.tif	package.scala
Oct 15, 2024 @ 16:05:17.448	INFO	FileAlreadyExistsException. Will overwrite file: /batch_jobs/j-241015d4f747427882375efb47c311db/openEO_VV_P50.tif	package.scala
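
The file-move approach described above could look roughly like the following Python sketch. The real implementation is in Scala (package.scala); the helper name `write_then_move` and every detail here are assumptions for illustration, not the project's code.

```python
import os
import shutil
import tempfile


def write_then_move(data: bytes, destination: str) -> None:
    """Write to local temp storage first, then move the finished file into place."""
    fd, tmp_path = tempfile.mkstemp(suffix=".tif")  # lands in the local temp dir, e.g. /tmp
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())
        # Comparable to the "Will overwrite file" log lines: drop an existing target first.
        if os.path.exists(destination):
            os.remove(destination)
        # shutil.move falls back to copy + delete when crossing filesystems.
        shutil.move(tmp_path, destination)
    finally:
        # Clean up the temp file if the move did not happen.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
```

The trade-off noted in the comment above still applies: the full file has to fit in the pod's local temp storage before it is moved.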

Curious observation: the file permissions on the fuse mount changed over time.

kubectl *** -- /bin/bash
bash-4.4$ cd /batch_jobs/j-241015d4f747427882375efb47c311db/
drwxr-xr-x. 2 spark spark      4096 Oct 15 14:04 .
drwxr-xr-x. 1 spark spark        48 Oct 15 14:02 ..
-rw-rw-r--. 1 spark spark     13973 Oct 15 14:01 job_metadata.json
-rw-rw-r--. 1 spark spark      2234 Oct 15 14:00 job_specification.json
-rw-------. 1 spark spark 114825818 Oct 15 14:04 openEO_VH_P10.tif
-rw-r--r--. 1 spark spark 114858844 Oct 15 14:04 openEO_VH_P25.tif
-rw-r--r--. 1 spark spark 113986724 Oct 15 14:04 openEO_VH_P90.tif
-rw-r--r--. 1 spark spark 111953331 Oct 15 14:04 openEO_VH_on_VV_P10.tif
-rw-r--r--. 1 spark spark 111785039 Oct 15 14:04 openEO_VH_on_VV_P25.tif
-rw-r--r--. 1 spark spark 111524860 Oct 15 14:04 openEO_VH_on_VV_P50.tif
-rw-r--r--. 1 spark spark 114393944 Oct 15 14:04 openEO_VV_P10.tif
-rw-r--r--. 1 spark spark 114498596 Oct 15 14:04 openEO_VV_P25.tif
-rw-r--r--. 1 spark spark 114298964 Oct 15 14:04 openEO_VV_P90.tif
bash-4.4$ ls -al
total 1110032
drwxr-xr-x. 2 spark spark      4096 Oct 15 14:05 .
drwxr-xr-x. 1 spark spark        48 Oct 15 14:02 ..
-rw-r--r--. 1 spark spark      1628 Oct 15 14:05 collection.json
-rw-rw-r--. 1 spark spark     23811 Oct 15 14:05 job_metadata.json
-rw-rw-r--. 1 spark spark      2234 Oct 15 14:00 job_specification.json
-rw--w----. 1 spark spark 114825818 Oct 15 14:04 openEO_VH_P10.tif
-rw-r--r--. 1 spark spark       395 Oct 15 14:05 openEO_VH_P10.tif.aux.xml
-rw-r--r--. 1 spark spark       468 Oct 15 14:05 openEO_VH_P10.tif.json
-rw--w----. 1 spark spark 114858844 Oct 15 14:04 openEO_VH_P25.tif
-rw-r--r--. 1 spark spark       395 Oct 15 14:05 openEO_VH_P25.tif.aux.xml
-rw-r--r--. 1 spark spark       468 Oct 15 14:05 openEO_VH_P25.tif.json
-rw-r--r--. 1 spark spark       468 Oct 15 14:05 openEO_VH_P50.tif.json
-rw-r--r--. 1 spark spark       468 Oct 15 14:05 openEO_VH_P75.tif.json
-rw--w----. 1 spark spark 113986724 Oct 15 14:04 openEO_VH_P90.tif
-rw-r--r--. 1 spark spark       395 Oct 15 14:05 openEO_VH_P90.tif.aux.xml
-rw-r--r--. 1 spark spark       468 Oct 15 14:05 openEO_VH_P90.tif.json
-rw--w----. 1 spark spark 111953331 Oct 15 14:04 openEO_VH_on_VV_P10.tif
-rw-r--r--. 1 spark spark       394 Oct 15 14:05 openEO_VH_on_VV_P10.tif.aux.xml
-rw-r--r--. 1 spark spark       492 Oct 15 14:05 openEO_VH_on_VV_P10.tif.json
-rw--w----. 1 spark spark 111785039 Oct 15 14:04 openEO_VH_on_VV_P25.tif
-rw-r--r--. 1 spark spark       394 Oct 15 14:05 openEO_VH_on_VV_P25.tif.aux.xml
-rw-r--r--. 1 spark spark       492 Oct 15 14:05 openEO_VH_on_VV_P25.tif.json
-rw--w----. 1 spark spark 111524860 Oct 15 14:04 openEO_VH_on_VV_P50.tif
-rw-r--r--. 1 spark spark       394 Oct 15 14:05 openEO_VH_on_VV_P50.tif.aux.xml
-rw-r--r--. 1 spark spark       492 Oct 15 14:05 openEO_VH_on_VV_P50.tif.json
-rw-r--r--. 1 spark spark       492 Oct 15 14:05 openEO_VH_on_VV_P75.tif.json
-rw-r--r--. 1 spark spark       492 Oct 15 14:05 openEO_VH_on_VV_P90.tif.json
-rw--w----. 1 spark spark 114393944 Oct 15 14:04 openEO_VV_P10.tif
-rw-r--r--. 1 spark spark       392 Oct 15 14:05 openEO_VV_P10.tif.aux.xml
-rw-r--r--. 1 spark spark       468 Oct 15 14:05 openEO_VV_P10.tif.json
-rw--w----. 1 spark spark 114498596 Oct 15 14:04 openEO_VV_P25.tif
-rw-r--r--. 1 spark spark       390 Oct 15 14:05 openEO_VV_P25.tif.aux.xml
-rw-r--r--. 1 spark spark       468 Oct 15 14:05 openEO_VV_P25.tif.json
-rw--w----. 1 spark spark 114498158 Oct 15 14:05 openEO_VV_P50.tif
-rw-r--r--. 1 spark spark       391 Oct 15 14:05 openEO_VV_P50.tif.aux.xml
-rw-r--r--. 1 spark spark       468 Oct 15 14:05 openEO_VV_P50.tif.json
-rw-r--r--. 1 spark spark       468 Oct 15 14:05 openEO_VV_P75.tif.json
-rw--w----. 1 spark spark 114298964 Oct 15 14:04 openEO_VV_P90.tif
-rw-r--r--. 1 spark spark       392 Oct 15 14:05 openEO_VV_P90.tif.aux.xml
-rw-r--r--. 1 spark spark       468 Oct 15 14:05 openEO_VV_P90.tif.json
bash-4.4$ command terminated with exit code 137

@EmileSonneveld EmileSonneveld linked a pull request Oct 17, 2024 that will close this issue
EmileSonneveld added a commit that referenced this issue Oct 17, 2024
…ults to the output folder from driver, and clean up afterwards. #329
EmileSonneveld added a commit to Open-EO/openeo-geopyspark-driver that referenced this issue Oct 22, 2024
EmileSonneveld added a commit to Open-EO/openeo-geopyspark-driver that referenced this issue Oct 22, 2024
@EmileSonneveld
Contributor Author

emile@emile-Precision-7680:~$ kubectl --kubeconfig ~/.kube/cdse_dev.yml -n spark-jobs-dev exec -it a-8dec3512508e410e878e3742c07ef718-driver -c spark-kubernetes-driver -- /bin/bash
bash-4.4$ cd /batch_jobs/j-241022e02b9745ce8438c66bbffbb5f9

bash-4.4$ date --iso-8601=seconds && ls -al
2024-10-22T15:05:50+00:00
total 667033
drwxr-xr-x. 2 spark spark      4096 Oct 22 15:05 .
drwxr-xr-x. 1 spark spark        48 Oct 22 14:59 ..
-rw-rw-r--. 1 spark spark 110510198 Oct 22 15:04 LCFM_LSF-ANNUAL-GAMMA0_V100_2020_44RNT_FEATURES.tif_VHVV-RATIO_P75.tif
-rw-rw-r--. 1 spark spark 115616258 Oct 22 15:05 LCFM_LSF-ANNUAL-GAMMA0_V100_2020_44RNT_FEATURES.tif_VH_P10.tif
-rw-rw-r--. 1 spark spark 114191269 Oct 22 15:04 LCFM_LSF-ANNUAL-GAMMA0_V100_2020_44RNT_FEATURES.tif_VH_P75.tif
-rw-rw-r--. 1 spark spark 115283524 Oct 22 15:04 LCFM_LSF-ANNUAL-GAMMA0_V100_2020_44RNT_FEATURES.tif_VV_P10.tif
-rw-rw-r--. 1 spark spark 113842171 Oct 22 15:04 LCFM_LSF-ANNUAL-GAMMA0_V100_2020_44RNT_FEATURES.tif_VV_P75.tif
-rw-rw-r--. 1 spark spark 113548381 Oct 22 15:04 LCFM_LSF-ANNUAL-GAMMA0_V100_2020_44RNT_FEATURES.tif_VV_P90.tif
-rw-rw-r--. 1 spark spark     40783 Oct 22 15:05 job_metadata.json
-rw-rw-r--. 1 spark spark      2487 Oct 22 14:59 job_specification.json

bash-4.4$ date --iso-8601=seconds && ls -al
2024-10-22T15:05:53+00:00
total 667055
drwxr-xr-x. 2 spark spark      4096 Oct 22 15:05 .
drwxr-xr-x. 1 spark spark        48 Oct 22 14:59 ..
-rw-r--r--. 1 spark spark       635 Oct 22 15:05 LCFM_LSF-ANNUAL-GAMMA0_V100_2020_44RNT_FEATURES.tif_VHVV-RATIO_P10.tif.json
-rw-r--r--. 1 spark spark       635 Oct 22 15:05 LCFM_LSF-ANNUAL-GAMMA0_V100_2020_44RNT_FEATURES.tif_VHVV-RATIO_P25.tif.json
-rw-r--r--. 1 spark spark       635 Oct 22 15:05 LCFM_LSF-ANNUAL-GAMMA0_V100_2020_44RNT_FEATURES.tif_VHVV-RATIO_P50.tif.json
-rw-rw-r--. 1 spark spark 110510198 Oct 22 15:04 LCFM_LSF-ANNUAL-GAMMA0_V100_2020_44RNT_FEATURES.tif_VHVV-RATIO_P75.tif
-rw-r--r--. 1 spark spark       389 Oct 22 15:05 LCFM_LSF-ANNUAL-GAMMA0_V100_2020_44RNT_FEATURES.tif_VHVV-RATIO_P75.tif.aux.xml
-rw-r--r--. 1 spark spark       834 Oct 22 15:05 LCFM_LSF-ANNUAL-GAMMA0_V100_2020_44RNT_FEATURES.tif_VHVV-RATIO_P75.tif.json
-rw-r--r--. 1 spark spark       635 Oct 22 15:05 LCFM_LSF-ANNUAL-GAMMA0_V100_2020_44RNT_FEATURES.tif_VHVV-RATIO_P90.tif.json
-rw-rw-r--. 1 spark spark 115616258 Oct 22 15:05 LCFM_LSF-ANNUAL-GAMMA0_V100_2020_44RNT_FEATURES.tif_VH_P10.tif
-rw-r--r--. 1 spark spark       391 Oct 22 15:05 LCFM_LSF-ANNUAL-GAMMA0_V100_2020_44RNT_FEATURES.tif_VH_P10.tif.aux.xml
-rw-r--r--. 1 spark spark       796 Oct 22 15:05 LCFM_LSF-ANNUAL-GAMMA0_V100_2020_44RNT_FEATURES.tif_VH_P10.tif.json
-rw-r--r--. 1 spark spark       603 Oct 22 15:05 LCFM_LSF-ANNUAL-GAMMA0_V100_2020_44RNT_FEATURES.tif_VH_P25.tif.json
-rw-r--r--. 1 spark spark       603 Oct 22 15:05 LCFM_LSF-ANNUAL-GAMMA0_V100_2020_44RNT_FEATURES.tif_VH_P50.tif.json
-rw-rw-r--. 1 spark spark 114191269 Oct 22 15:04 LCFM_LSF-ANNUAL-GAMMA0_V100_2020_44RNT_FEATURES.tif_VH_P75.tif
-rw-r--r--. 1 spark spark       392 Oct 22 15:05 LCFM_LSF-ANNUAL-GAMMA0_V100_2020_44RNT_FEATURES.tif_VH_P75.tif.aux.xml
-rw-r--r--. 1 spark spark       797 Oct 22 15:05 LCFM_LSF-ANNUAL-GAMMA0_V100_2020_44RNT_FEATURES.tif_VH_P75.tif.json
-rw-r--r--. 1 spark spark       603 Oct 22 15:05 LCFM_LSF-ANNUAL-GAMMA0_V100_2020_44RNT_FEATURES.tif_VH_P90.tif.json
-rw-rw-r--. 1 spark spark 115283524 Oct 22 15:04 LCFM_LSF-ANNUAL-GAMMA0_V100_2020_44RNT_FEATURES.tif_VV_P10.tif
-rw-r--r--. 1 spark spark       390 Oct 22 15:05 LCFM_LSF-ANNUAL-GAMMA0_V100_2020_44RNT_FEATURES.tif_VV_P10.tif.aux.xml
-rw-r--r--. 1 spark spark       795 Oct 22 15:05 LCFM_LSF-ANNUAL-GAMMA0_V100_2020_44RNT_FEATURES.tif_VV_P10.tif.json
-rw-r--r--. 1 spark spark       603 Oct 22 15:05 LCFM_LSF-ANNUAL-GAMMA0_V100_2020_44RNT_FEATURES.tif_VV_P25.tif.json
-rw-r--r--. 1 spark spark       603 Oct 22 15:05 LCFM_LSF-ANNUAL-GAMMA0_V100_2020_44RNT_FEATURES.tif_VV_P50.tif.json
-rw-rw-r--. 1 spark spark 113842171 Oct 22 15:04 LCFM_LSF-ANNUAL-GAMMA0_V100_2020_44RNT_FEATURES.tif_VV_P75.tif
-rw-r--r--. 1 spark spark       389 Oct 22 15:05 LCFM_LSF-ANNUAL-GAMMA0_V100_2020_44RNT_FEATURES.tif_VV_P75.tif.aux.xml
-rw-r--r--. 1 spark spark       794 Oct 22 15:05 LCFM_LSF-ANNUAL-GAMMA0_V100_2020_44RNT_FEATURES.tif_VV_P75.tif.json
-rw-rw-r--. 1 spark spark 113548381 Oct 22 15:04 LCFM_LSF-ANNUAL-GAMMA0_V100_2020_44RNT_FEATURES.tif_VV_P90.tif
-rw-r--r--. 1 spark spark       389 Oct 22 15:05 LCFM_LSF-ANNUAL-GAMMA0_V100_2020_44RNT_FEATURES.tif_VV_P90.tif.aux.xml
-rw-r--r--. 1 spark spark       794 Oct 22 15:05 LCFM_LSF-ANNUAL-GAMMA0_V100_2020_44RNT_FEATURES.tif_VV_P90.tif.json
-rw-r--r--. 1 spark spark      2313 Oct 22 15:05 collection.json
-rw-rw-r--. 1 spark spark     42573 Oct 22 15:05 job_metadata.json
-rw-rw-r--. 1 spark spark      2487 Oct 22 14:59 job_specification.json
bash-4.4$ command terminated with exit code 137

But Kibana shows that the driver did not find the path even after the fuse mount in the same pod showed that it existed. The two events are only 1 second apart, so the timestamps may be slightly offset.

Oct 22, 2024 @ 17:05:54.290	ERROR	OpenEO batch job failed: "[Errno 2] No such file or directory: '/batch_jobs/j-241022e02b9745ce8438c66bbffbb5f9/LCFM_LSF-ANNUAL-GAMMA0_V100_2020_44RNT_FEATURES.tif_VHVV-RATIO_P10.tif'"
job_id	j-241022e02b9745ce8438c66bbffbb5f9
kubernetes.pod_name	a-8dec3512508e410e878e3742c07ef718-driver

Will try with a wait loop now

@EmileSonneveld
Contributor Author

EmileSonneveld commented Oct 22, 2024

Retrying did work on CDSE dev:

Oct 22, 2024 @ 18:44:28.795	INFO	Waiting for path to be available. Try 2/5: /batch_jobs/j-2410227f4b0d4af6876341faba4c976d/openEO_2023-06-04Z_B02.tif
Oct 22, 2024 @ 18:44:18.795	INFO	Waiting for path to be available. Try 1/5: /batch_jobs/j-2410227f4b0d4af6876341faba4c976d/openEO_2023-06-04Z_B02.tif
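
A wait loop that produces log lines like the ones above could be sketched as follows. The function name is taken from this thread, but the signature and defaults are assumptions: 5 tries is read from the "Try 1/5" message and the 10 second delay from the gap between the two timestamps.

```python
import logging
import os
import time

logger = logging.getLogger(__name__)


def wait_till_path_available(path: str, tries: int = 5, delay_seconds: float = 10.0) -> None:
    """Poll until path exists; raise FileNotFoundError if it never shows up."""
    for attempt in range(1, tries + 1):
        if os.path.exists(path):
            return
        logger.info(f"Waiting for path to be available. Try {attempt}/{tries}: {path}")
        time.sleep(delay_seconds)
    # One last check after the final sleep.
    if not os.path.exists(path):
        raise FileNotFoundError(path)
```

Polling only works around the delayed visibility on the mount; it does not explain why the path is missing in the first place.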

@EmileSonneveld
Contributor Author

EmileSonneveld commented Oct 23, 2024

TODO: dedup time_machine test code
Open-EO/openeo-geopyspark-driver#916 (comment)

EmileSonneveld added a commit to Open-EO/openeo-geopyspark-driver that referenced this issue Oct 24, 2024
EmileSonneveld added a commit to Open-EO/openeo-geopyspark-driver that referenced this issue Oct 25, 2024
@bossie
Collaborator

bossie commented Oct 29, 2024

FYI, a test in this integration test run failed (Kibana) with:

java.nio.file.FileSystemException: /data/projects/OpenEO/j-241028ce05c64facb4a37ce5b4241fdc/openEO_2018-01-01Z.tif: Stale file handle

There's also this one where apparently two executors attempted to write the same output asset (Kibana) but it ultimately went missing.

@EmileSonneveld
Contributor Author

OK, then writing per executor followed by a move/copy is really needed.
The driver side in Scala will then need a wait_till_path_available function too.

@EmileSonneveld
Contributor Author

Observed errors with the moveOverwriteWithRetries implementation:

  • Found by Peter:
Stage error: Job aborted due to stage failure: Task 4 in stage 38.0 failed 4 times, most recent failure: Lost task 4.3 in stage 38.0 (TID 1186) (10.42.7.154 executor 2): java.io.IOException: Resource temporarily unavailable
  at java.base/sun.nio.ch.FileDispatcherImpl.force0(Native Method)
  at java.base/sun.nio.ch.FileDispatcherImpl.force(FileDispatcherImpl.java:82)
  at java.base/sun.nio.ch.FileChannelImpl.force(FileChannelImpl.java:461)
  at org.openeo.geotrellis.geotiff.package$.writeGeoTiff(package.scala:862)
  at org.openeo.geotrellis.geotiff.package$.writeTiff(package.scala:602)
  at org.openeo.geotrellis.geotiff.package$.$anonfun$saveRDDTemporalAllowAssetPerBand$4(package.scala:191)

This might be due to a flaky S3 connection or a race condition between executors.
It might be good to put the Scala fsync inside a retry and check whether the error occurs again.

  • In test_load_collection_references_correct_batch_process_id:
    sun.nio.fs.UnixException: No such file or directory
    Task error: ExceptionFailure(java.nio.file.FileSystemException,/data/projects/OpenEO/j-2411054cbc7d4d0c868e3698acac18d3/openEO_2018-01-01Z.tif: Stale file handle,[Ljava.lang.StackTraceElement;@5292477a,java.nio.file.FileSystemException: /data/projects/OpenEO/j-2411054cbc7d4d0c868e3698acac18d3/openEO_2018-01-01Z.tif: Stale file handle
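
The idea of putting the fsync under a retry, mentioned for the first error above, could be sketched in Python as below. This is not the Scala code. The helper name `fsync_with_retries`, its parameters, and the choice to retry only on `EAGAIN` (the errno behind "Resource temporarily unavailable") are assumptions.

```python
import errno
import os
import time


def fsync_with_retries(path, tries=5, delay_seconds=1.0, fsync=os.fsync):
    """fsync an existing file, retrying on EAGAIN. Returns the attempt that succeeded."""
    for attempt in range(1, tries + 1):
        fd = os.open(path, os.O_RDWR)
        try:
            fsync(fd)
            return attempt
        except OSError as e:
            # Give up on any other error, or when the retries are used up.
            if e.errno != errno.EAGAIN or attempt == tries:
                raise
        finally:
            os.close(fd)
        time.sleep(delay_seconds)
```

The `fsync` parameter exists only so the retry logic can be exercised with a stand-in that fails a few times. A "Stale file handle" error would not be retried by this sketch.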

EmileSonneveld added a commit to Open-EO/openeo-geopyspark-driver that referenced this issue Nov 6, 2024
@EmileSonneveld EmileSonneveld reopened this Nov 7, 2024
EmileSonneveld added a commit to Open-EO/openeo-geopyspark-driver that referenced this issue Nov 12, 2024
…e results. (This is also implicitly tested in CDSE integration tests) Open-EO/openeo-geotrellis-extensions#329