
Chinese dedup memory error #65

Open
hyeinhyun opened this issue Apr 29, 2023 · 1 comment

@hyeinhyun

There is a memory error when deduplicating Chinese data.

23/04/19 19:44:17 WARN MemoryStore: Not enough space to cache rdd_7_0 in memory! (computed 176.2 MiB so far)
23/04/19 19:44:17 WARN BlockManager: Block rdd_7_0 could not be removed as it was not found on disk or in memory
23/04/19 19:44:17 WARN BlockManager: Putting block rdd_7_0 failed
23/04/19 19:44:17 WARN MemoryStore: Not enough space to cache rdd_7_2 in memory! (computed 176.2 MiB so far)
23/04/19 19:44:17 WARN BlockManager: Block rdd_7_2 could not be removed as it was not found on disk or in memory
23/04/19 19:44:17 WARN BlockManager: Putting block rdd_7_2 failed
23/04/19 19:44:17 WARN MemoryStore: Not enough space to cache rdd_7_9 in memory! (computed 176.3 MiB so far)
23/04/19 19:44:17 WARN BlockManager: Block rdd_7_9 could not be removed as it was not found on disk or in memory
23/04/19 19:44:17 WARN BlockManager: Putting block rdd_7_9 failed
23/04/19 19:44:19 WARN MemoryStore: Not enough space to cache rdd_7_7 in memory! (computed 176.6 MiB so far)
23/04/19 19:44:19 WARN BlockManager: Block rdd_7_7 could not be removed as it was not found on disk or in memory
23/04/19 19:44:19 WARN BlockManager: Putting block rdd_7_7 failed
23/04/19 19:44:41 WARN BlockManager: Block rdd_7_6 could not be removed as it was not found on disk or in memory
23/04/19 19:44:42 ERROR Executor: Exception in task 6.0 in stage 1.0 (TID 30545)
java.lang.OutOfMemoryError: Java heap space
@Taekyoon
Collaborator

Taekyoon commented Jul 6, 2023

The memory space for each executor is smaller than the input partition data size.
You can try increasing the number of partitions.
Also, could you check the Spark history server to see which executors have data skew and spill issues?
This deduplication logic still needs improvement for these problems.
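A minimal sketch of that workaround, assuming the dedup job is launched with spark-submit (the script name and every flag value below are illustrative assumptions, not tuned settings from this project):

```shell
# Sketch only: values are assumptions, tune them for your cluster and data size.
# More, smaller partitions shrink the per-task footprint that overflowed the
# MemoryStore above; a larger executor heap raises the OutOfMemoryError ceiling.
spark-submit \
  --conf spark.executor.memory=8g \
  --conf spark.executor.memoryOverhead=2g \
  --conf spark.sql.shuffle.partitions=2000 \
  --conf spark.default.parallelism=2000 \
  your_dedup_job.py
```

Equivalently, inside the job you can call `df.repartition(n)` (or `rdd.repartition(n)`) before the dedup stage so each partition, and therefore each cached `rdd_7_*` block, stays well under the executor's storage memory.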
