[GLUTEN-6840][CH] Enable cache files for hdfs #6841

Merged on Aug 21, 2024 (5 commits)

@loneylee loneylee commented Aug 14, 2024

What changes were proposed in this pull request?

(Fixes: #6840)

New command

CACHE FILES ASYNC? SELECT selectedColumns=selectedColumnNames
    FROM (path=STRING)
    (CACHEPROPERTIES cacheProps=propertyList)?

Examples

# Cache a directory
CACHE FILES SELECT * FROM 'hdfs://127.0.0.1:8020/runtpchtest/tpch10/parquet/lineitem';

# Cache a directory and its subdirectories
CACHE FILES SELECT * FROM 'hdfs://127.0.0.1:8020/runtpchtest/tpch10/parquet' 
     CACHEPROPERTIES (recursive=true) ;
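
The grammar and examples above can be turned into a small statement builder. This is a minimal sketch in Python: build_cache_files_sql is a hypothetical helper, not part of Gluten, and it assumes that column names are comma-separated and that properties are rendered as key=value pairs, as in the recursive=true example.

```python
def build_cache_files_sql(path, columns=None, asynchronous=False, properties=None):
    """Build a CACHE FILES statement following the grammar above.

    Hypothetical helper for illustration only; it is not part of Gluten.
    """
    if "'" in path:
        # The path is emitted as a single-quoted string literal.
        raise ValueError("path must not contain single quotes")
    parts = ["CACHE FILES"]
    if asynchronous:
        parts.append("ASYNC")
    # '*' selects every column, as in the examples above.
    parts.append("SELECT " + (", ".join(columns) if columns else "*"))
    parts.append(f"FROM '{path}'")
    if properties:
        rendered = ", ".join(f"{key}={value}" for key, value in properties.items())
        parts.append(f"CACHEPROPERTIES ({rendered})")
    return " ".join(parts)
```

The resulting string could then be passed to spark.sql(...) on a session that has the Gluten ClickHouse backend enabled; that part is not shown here because it depends on the deployment.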

Notes

  • Files whose names start with "_" or "." are ignored.
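
The ignore rule can be sketched as a simple filter. The function names are hypothetical, and the sketch assumes the rule applies to the file name alone (not the full path) and means "starts with an underscore or a dot", which matches the usual Hadoop convention for hidden files such as _SUCCESS.

```python
def is_ignored_by_cache_files(file_name):
    """Return True for names the note above says are skipped.

    Assumption: the check looks only at the file name, not the full path.
    """
    return file_name.startswith(("_", "."))


def cacheable_files(file_names):
    """Keep only the names that would be considered for caching."""
    return [name for name in file_names if not is_ignored_by_cache_files(name)]
```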

Once caching is enabled, every query populates the cache by default. To avoid caching at the SQL level, set:

set spark.gluten.sql.columnar.backend.ch.runtime_settings.read_from_filesystem_cache_if_exists_otherwise_bypass_cache=true

S3 cache configuration options

spark.gluten.sql.columnar.backend.ch.runtime_config.s3.local_cache.enabled=true
spark.gluten.sql.columnar.backend.ch.runtime_config.s3.local_cache.max_size=107374182400
spark.gluten.sql.columnar.backend.ch.runtime_config.s3.local_cache.cache_path=/shuffle/s3_local_cache

Unified S3 and HDFS parameters
The following configuration options are added:

spark.gluten.sql.columnar.backend.ch.runtime_config.gluten_cache.local.enabled=true
spark.gluten.sql.columnar.backend.ch.runtime_config.gluten_cache.local.name=gluten_cache
spark.gluten.sql.columnar.backend.ch.runtime_config.gluten_cache.local.path=/tmp/test_cache
spark.gluten.sql.columnar.backend.ch.runtime_config.gluten_cache.local.max_size=10Gi

# Other optional settings
# max_elements
# max_file_segment_size
# cache_on_write_operations
# enable_filesystem_query_cache_limit
# cache_hits_threshold
# enable_bypass_cache_with_threshold
# bypass_cache_threshold
# boundary_alignment
# background_download_threads
# background_download_queue_size_limit
# load_metadata_threads
# cache_policy (default LRU; SLRU recommended)
# slru_size_ratio
# write_cache_per_user_id_directory (has no effect)
# keep_free_space_size_ratio
# keep_free_space_elements_ratio
# keep_free_space_remove_batch
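
The unified settings share one key prefix, so they can be generated rather than typed by hand. The sketch below uses hypothetical helper names and assumes that the optional settings listed above take the same gluten_cache.local. prefix as the four required ones; the PR text lists them together but does not state this explicitly.

```python
PREFIX = "spark.gluten.sql.columnar.backend.ch.runtime_config.gluten_cache.local."


def gluten_cache_conf(path, max_size, name="gluten_cache", **optional):
    """Return the unified cache settings as a dict of Spark conf entries.

    Extra keyword arguments (for example cache_policy="SLRU") are assumed
    to use the same prefix as the required settings.
    """
    settings = {"enabled": "true", "name": name, "path": path, "max_size": max_size}
    settings.update(optional)
    return {PREFIX + key: str(value) for key, value in settings.items()}


def as_spark_submit_args(conf):
    """Render a conf dict as repeated '--conf key=value' arguments."""
    args = []
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args
```

For example, gluten_cache_conf("/tmp/test_cache", "10Gi", cache_policy="SLRU") reproduces the four settings shown above plus the recommended cache policy.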

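The existing S3 settings remain supported. By default, once the new parameters are configured, the old settings no longer take effect.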

New metrics

Number of times the read from filesystem cache hit the cache: number of cache hits during a query
Number of times the read from filesystem cache miss the cache: number of cache misses during a query
Bytes read from filesystem cache: bytes read from the cache
Bytes read from filesystem cache source (from remote fs, etc): bytes read from the original source
Time reading from filesystem cache: time spent reading from the cache
Time reading from filesystem cache source (from remote filesystem, etc): time spent reading from the source table
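
These counters are easiest to interpret as ratios. The following sketch shows one way to derive them; the function names are hypothetical, and the inputs are assumed to be the raw values of the hit/miss and byte metrics listed above, read for example from the Spark UI.

```python
def cache_hit_ratio(hits, misses):
    """Fraction of cache lookups that were hits; 0.0 when nothing was read."""
    total = hits + misses
    return hits / total if total else 0.0


def cached_bytes_fraction(cache_bytes, source_bytes):
    """Fraction of all bytes read that were served from the local cache."""
    total = cache_bytes + source_bytes
    return cache_bytes / total if total else 0.0
```

A hit ratio near 1.0 on a repeated query suggests the CACHE FILES command or an earlier run has already populated the cache.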

Behavior changes

  • hdfs.enable_async_io, which affects prefetch:
    previously it defaulted to true; now, when the cache is enabled, it is forced to false.
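
The rule can be written as a one-line decision. This is only a model of the behavior described in this PR, not the actual C++ code; the function and parameter names are hypothetical.

```python
def effective_hdfs_async_io(cache_enabled, configured=True):
    """Model of the rule above: an enabled cache overrides the configured value.

    'configured' stands for the user's hdfs.enable_async_io setting,
    which defaults to true.
    """
    return False if cache_enabled else configured
```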

How was this patch tested?

Tested by unit tests.


Run Gluten Clickhouse CI

@loneylee loneylee marked this pull request as ready for review August 20, 2024 06:32
@@ -85,6 +85,16 @@ DB::ContextMutablePtr QueryContextManager::currentQueryContext()
return query_map.get(id)->query_context;
}

std::shared_ptr<DB::ThreadGroup> QueryContextManager::currentThreadGroup()

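This does not need to be so complicated. Just return CurrentThread::getGroup and report an error if it does not exist; all threads here are guaranteed to be attached to a thread group.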

@liuneng1994 liuneng1994 left a comment


LGTM


@loneylee loneylee merged commit 371d448 into apache:main Aug 21, 2024
9 checks passed
sharkdtu pushed a commit to sharkdtu/gluten that referenced this pull request Nov 11, 2024