From 2e4d5b5b21c8872e347d7f86bc6c72c3484eb7fc Mon Sep 17 00:00:00 2001
From: Manu Zhang
Date: Sat, 14 Sep 2024 12:56:21 +0800
Subject: [PATCH] Docs: Fix missing options for remove_orphan_files procedure (#11080)

---
 docs/docs/spark-procedures.md | 38 +++++++++++++++++++++++++++++++++++
 1 file changed, 38 insertions(+)

diff --git a/docs/docs/spark-procedures.md b/docs/docs/spark-procedures.md
index 1cd14dd1888e..0953e729a77b 100644
--- a/docs/docs/spark-procedures.md
+++ b/docs/docs/spark-procedures.md
@@ -312,6 +312,10 @@ Used to remove files which are not referenced in any metadata files of an Iceber
 | `location` | | string | Directory to look for files in (defaults to the table's location) |
 | `dry_run` | | boolean | When true, don't actually remove files (defaults to false) |
 | `max_concurrent_deletes` | | int | Size of the thread pool used for delete file actions (by default, no thread pool is used) |
+| `file_list_view` | | string | Dataset to look for files in (skipping the directory listing) |
+| `equal_schemes` | | map | Mapping of file system schemes to be considered equal. Key is a comma-separated list of schemes and value is a scheme (defaults to `map('s3a,s3n','s3')`). |
+| `equal_authorities` | | map | Mapping of file system authorities to be considered equal. Key is a comma-separated list of authorities and value is an authority. |
+| `prefix_mismatch_mode` | | string | Action behavior when location prefixes (schemes/authorities) mismatch: <ul><li>ERROR - throw an exception. (default)</li><li>IGNORE - no action.</li><li>DELETE - delete files.</li></ul> |
 
 #### Output
 
@@ -331,6 +335,40 @@ Remove any files in the `tablelocation/data` folder which are not known to the t
 CALL catalog_name.system.remove_orphan_files(table => 'db.sample', location => 'tablelocation/data');
 ```
 
+Remove any files in the `files_view` view which are not known to the table `db.sample`.
+```java
+Dataset<Row> compareToFileList =
+    spark
+        .createDataFrame(allFiles, FilePathLastModifiedRecord.class)
+        .withColumnRenamed("filePath", "file_path")
+        .withColumnRenamed("lastModified", "last_modified");
+String fileListViewName = "files_view";
+compareToFileList.createOrReplaceTempView(fileListViewName);
+```
+```sql
+CALL catalog_name.system.remove_orphan_files(table => 'db.sample', file_list_view => 'files_view');
+```
+
+When a file matches references in metadata files except for location prefix (scheme/authority), an error is thrown by default.
+The error can be ignored and the file will be skipped by setting `prefix_mismatch_mode` to `IGNORE`.
+```sql
+CALL catalog_name.system.remove_orphan_files(table => 'db.sample', prefix_mismatch_mode => 'IGNORE');
+```
+
+The file can still be deleted by setting `prefix_mismatch_mode` to `DELETE`.
+```sql
+CALL catalog_name.system.remove_orphan_files(table => 'db.sample', prefix_mismatch_mode => 'DELETE');
+```
+
+The file can also be deleted by considering the mismatched prefixes equal.
+```sql
+CALL catalog_name.system.remove_orphan_files(table => 'db.sample', equal_schemes => map('file', 'file1'));
+```
+
+```sql
+CALL catalog_name.system.remove_orphan_files(table => 'db.sample', equal_authorities => map('ns1', 'ns2'));
+```
+
 ### `rewrite_data_files`
 
 Iceberg tracks each data file in a table. More data files leads to more metadata stored in manifest files, and small data files causes an unnecessary amount of metadata and less efficient queries from file open costs.