
[VL] Use conf to control C2R occupied memory #5952

Merged
merged 30 commits into apache:main on Aug 6, 2024

Conversation

XinShuoWang (Contributor)

What changes were proposed in this pull request?

In the current design, the Column2Row operation is completed in a single pass over the whole batch, which consumes a lot of memory and can cause the program to OOM. In this commit, I split the C2R operation into multiple passes, which greatly reduces its peak memory usage. In addition, there should be some performance benefit from memory reuse.
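In other words, instead of materializing all rows of a batch at once, the conversion can walk the batch in slices bounded by a memory threshold. A minimal sketch of that idea, using simplified stand-in types rather than Gluten's actual converter:

```cpp
// Illustrative sketch only: convert a columnar batch to rows in several
// passes so the row buffer never grows beyond memoryThreshold bytes.
// FakeBatch and the fixed bytesPerRow are simplified stand-ins, not
// Gluten's real ColumnarBatch or row serialization.
#include <algorithm>
#include <cstdint>
#include <vector>

struct FakeBatch {
  int64_t numRows;
  int64_t bytesPerRow;  // pretend every row serializes to the same width
};

std::vector<std::vector<uint8_t>> convertInChunks(const FakeBatch& batch,
                                                  int64_t memoryThreshold) {
  std::vector<std::vector<uint8_t>> chunks;
  int64_t rowId = 0;  // where the previous pass stopped
  while (rowId < batch.numRows) {
    // How many rows fit in one buffer without crossing the threshold?
    int64_t rowsThisPass =
        std::max<int64_t>(1, memoryThreshold / batch.bytesPerRow);
    rowsThisPass = std::min(rowsThisPass, batch.numRows - rowId);

    // Size the buffer for this slice only; a real implementation could
    // reuse one buffer across passes, which is where the memory-reuse
    // benefit mentioned above would come from.
    std::vector<uint8_t> buffer(
        static_cast<size_t>(rowsThisPass * batch.bytesPerRow));
    // ... serialize rows [rowId, rowId + rowsThisPass) into buffer ...

    chunks.push_back(std::move(buffer));
    rowId += rowsThisPass;
  }
  return chunks;
}
```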

How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)

(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)

github-actions bot commented Jun 2, 2024

Thanks for opening a pull request!

Could you open an issue for this pull request on Github Issues?

https://github.com/apache/incubator-gluten/issues

Could you also update the commit message and pull request title to follow this format?

[GLUTEN-${ISSUES_ID}][COMPONENT]feat/fix: ${detailed message}


github-actions bot commented Jun 2, 2024: Run Gluten Clickhouse CI

@Yohahaha (Contributor) left a comment

> In the current design, the Column2Row operation is completed in a single pass over the whole batch, which consumes a lot of memory and can cause the program to OOM

Do you mean the current implementation converts all columnar batches into unsafe rows in one run? That seems hard to believe; could you double-check and give a case where the C2R conversion's peak memory is extremely high?

cpp/core/jni/JniWrapper.cc (outdated review thread, resolved)
Comment on lines 34 to 35
virtual void
convert(std::shared_ptr<ColumnarBatch> cb = nullptr, int64_t rowId = 0, int64_t memoryThreshold = INT64_MAX) = 0;

Could we configure memoryThreshold for ColumnarToRowConverter when initializing it?
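For illustration only, the suggestion might look roughly like this (a hypothetical class shape, not the real ColumnarToRowConverter interface): the threshold is taken once at construction, so convert() keeps a small per-call signature.

```cpp
// Hypothetical sketch of the suggestion: fix the memory threshold when the
// converter is created instead of passing it on every convert() call.
// Class and member names here are illustrative, not Gluten's real code.
#include <cstdint>
#include <memory>

class ColumnarBatch;  // opaque in this sketch

class ColumnarToRowConverterSketch {
 public:
  explicit ColumnarToRowConverterSketch(int64_t memoryThreshold)
      : memoryThreshold_(memoryThreshold) {}
  virtual ~ColumnarToRowConverterSketch() = default;

  // rowId marks where the previous partial conversion stopped.
  virtual void convert(std::shared_ptr<ColumnarBatch> cb, int64_t rowId) = 0;

 protected:
  int64_t memoryThreshold_;  // budget used to size each conversion pass
};
```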


@zhztheplayer can you fix it?

public native NativeColumnarToRowInfo nativeColumnarToRowConvert(long batchHandle, long c2rHandle)
throws RuntimeException;
public native NativeColumnarToRowInfo nativeColumnarToRowConvert(
long batchHandle, long c2rHandle, long rowId) throws RuntimeException;

Can we pass memoryThreshold through the JNI call? It should be easy to track the config on the Java side.
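A hypothetical sketch of that route on the native side (the function, class, and package names below are illustrative, not the actual JniWrapper.cc code): the Java caller reads the config once and passes it down as an extra jlong.

```cpp
// Hypothetical JNI entry point showing memoryThreshold arriving from Java
// as an extra jlong. Names are illustrative, not Gluten's real wrapper.
#include <jni.h>
#include <cstdint>

// Pretend converter for the sketch; stands in for the real C2R converter.
struct DemoConverter {
  void convert(int64_t batchHandle, int64_t rowId, int64_t memoryThreshold) {
    // ... chunked conversion bounded by memoryThreshold ...
  }
};

extern "C" JNIEXPORT void JNICALL
Java_org_example_DemoJniWrapper_nativeColumnarToRowConvert(
    JNIEnv* /*env*/, jobject /*obj*/, jlong batchHandle, jlong c2rHandle,
    jlong rowId, jlong memoryThreshold) {
  auto* converter = reinterpret_cast<DemoConverter*>(c2rHandle);
  converter->convert(batchHandle, rowId, memoryThreshold);
}
```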

@FelixYBW (Contributor)

@XinShuoWang Can you update?

github-actions bot: Run Gluten Clickhouse CI (4 similar comments)

@FelixYBW (Contributor)

@XinShuoWang can you fix the UT?

github-actions bot: Run Gluten Clickhouse CI

cpp/core/jni/JniWrapper.cc (outdated review thread, resolved)
@yma11 (Contributor) commented Jun 26, 2024

@XinShuoWang The UT fails because the default buffer size is 64 MB, which is too large when the task memory is small. You may want to change the threshold value for these UTs.

github-actions bot: Run Gluten Clickhouse CI (3 similar comments, latest Jul 1, 2024)

@FelixYBW (Contributor) commented Jul 2, 2024

@XinShuoWang can you fix the UT? The bug caused many queries to fail in our tests.

github-actions bot commented Jul 4, 2024: Run Gluten Clickhouse CI

@FelixYBW (Contributor)

@XinShuoWang Will you continue working on this fix? If not, we can pick it up and fix it ourselves. The bug is blocking our adoption.

github-actions bot commented Aug 1, 2024: Run Gluten Clickhouse CI (2 similar comments)

github-actions bot added the CORE (works for Gluten Core) and VELOX labels on Aug 1, 2024
github-actions bot: Run Gluten Clickhouse CI (9 similar comments, Aug 1–2, 2024)

@zhztheplayer (Member)

Error still occurs in the case SPARK-16995: flat mapping on Dataset containing a column created with lit/expr; we should investigate.

Stack: [0x00007f9cf43da000,0x00007f9cf44db000],  sp=0x00007f9cf44d89e0,  free space=1018k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
V  [libjvm.so+0xb04f5a]  Unsafe_GetLong+0x7a
J 7081  sun.misc.Unsafe.getLong(Ljava/lang/Object;J)J (0 bytes) @ 0x00007f9f6d53728e [0x00007f9f6d5371c0+0xce]
j  org.apache.spark.unsafe.Platform.getLong(Ljava/lang/Object;J)J+5
j  org.apache.spark.unsafe.bitset.BitSetMethods.isSet(Ljava/lang/Object;JI)Z+66
j  org.apache.spark.sql.catalyst.expressions.UnsafeRow.isNullAt(I)Z+14
j  org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering.compare(Lorg/apache/spark/sql/catalyst/InternalRow;Lorg/apache/spark/sql/catalyst/InternalRow;)I+27
j  org.apache.spark.sql.execution.GroupedIterator.fetchNextGroupIterator()Z+127
j  org.apache.spark.sql.execution.GroupedIterator.hasNext()Z+8
J 8209 C2 scala.collection.Iterator$$anon$11.hasNext()Z (35 bytes) @ 0x00007f9f6e6bf538 [0x00007f9f6e6bf340+0x1f8]
j  org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext()V+42
j  org.apache.spark.sql.execution.BufferedRowIterator.hasNext()Z+11
j  org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext()Z+4
j  org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(ZILscala/collection/Iterator;)Lscala/collection/Iterator;+207
j  org.apache.spark.sql.execution.SparkPlan$$Lambda$3173.apply(Ljava/lang/Object;)Ljava/lang/Object;+12
j  org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(Lscala/Function1;Lorg/apache/spark/TaskContext;ILscala/collection/Iterator;)Lscala/collection/Iterator;+2
j  org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(Lscala/Function1;Lorg/apache/spark/TaskContext;Ljava/lang/Object;Lscala/collection/Iterator;)Lscala/collection/Iterator;+7
j  org.apache.spark.rdd.RDD$$Lambda$2870.apply(Ljava/lang/Object;Ljava/lang/Object;Ljava/lang/Object;)Ljava/lang/Object;+13
j  org.apache.spark.rdd.MapPartitionsRDD.compute(Lorg/apache/spark/Partition;Lorg/apache/spark/TaskContext;)Lscala/collection/Iterator;+27
j  org.apache.spark.rdd.RDD.computeOrReadCheckpoint(Lorg/apache/spark/Partition;Lorg/apache/spark/TaskContext;)Lscala/collection/Iterator;+24
j  org.apache.spark.rdd.RDD.iterator(Lorg/apache/spark/Partition;Lorg/apache/spark/TaskContext;)Lscala/collection/Iterator;+40
j  org.apache.spark.scheduler.ResultTask.runTask(Lorg/apache/spark/TaskContext;)Ljava/lang/Object;+201
j  org.apache.spark.TaskContext.runTaskWithListeners(Lorg/apache/spark/scheduler/Task;)Ljava/lang/Object;+2
j  org.apache.spark.scheduler.Task.run(JILorg/apache/spark/metrics/MetricsSystem;ILscala/collection/immutable/Map;Lscala/Option;)Ljava/lang/Object;+254
j  org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Lorg/apache/spark/executor/Executor$TaskRunner;Lscala/runtime/BooleanRef;)Ljava/lang/Object;+43
j  org.apache.spark.executor.Executor$TaskRunner$$Lambda$2820.apply()Ljava/lang/Object;+8
j  org.apache.spark.util.Utils$.tryWithSafeFinally(Lscala/Function0;Lscala/Function0;)Ljava/lang/Object;+4
j  org.apache.spark.executor.Executor$TaskRunner.run()V+457
j  java.util.concurrent.ThreadPoolExecutor.runWorker(Ljava/util/concurrent/ThreadPoolExecutor$Worker;)V+95
j  java.util.concurrent.ThreadPoolExecutor$Worker.run()V+5
j  java.lang.Thread.run()V+11
v  ~StubRoutines::call_stub
V  [libjvm.so+0x69fb72]  JavaCalls::call_helper(JavaValue*, methodHandle*, JavaCallArguments*, Thread*)+0xe32
V  [libjvm.so+0x69d183]  JavaCalls::call_virtual(JavaValue*, KlassHandle, Symbol*, Symbol*, JavaCallArguments*, Thread*)+0x263
V  [libjvm.so+0x69d787]  JavaCalls::call_virtual(JavaValue*, Handle, KlassHandle, Symbol*, Symbol*, Thread*)+0x57
V  [libjvm.so+0x73de8c]  thread_entry(JavaThread*, Thread*)+0x6c
V  [libjvm.so+0xad7b0b]  JavaThread::thread_main_inner()+0xdb
V  [libjvm.so+0xad7dd9]  JavaThread::run()+0x299
V  [libjvm.so+0x9651f2]  java_start(Thread*)+0xe2
C  [libpthread.so.0+0x8609]  start_thread+0xd9

github-actions bot commented Aug 3, 2024: Run Gluten Clickhouse CI

@FelixYBW (Contributor) commented Aug 3, 2024

I fixed it a bit. Now if the batch is small, we don't allocate the full 64 MB; we allocate only the memory it actually needs.
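Roughly, the fix described here amounts to sizing the buffer from the batch itself whenever that is smaller than the configured threshold (illustrative constant and helper below, not the actual code):

```cpp
// Illustrative: allocate only what a small batch needs; cap large batches
// at the threshold and convert them in multiple passes instead.
#include <algorithm>
#include <cstdint>

constexpr int64_t kDefaultC2RThreshold = 64LL * 1024 * 1024;  // 64 MB default

int64_t pickBufferSize(int64_t estimatedBatchBytes,
                       int64_t memoryThreshold = kDefaultC2RThreshold) {
  return std::min(estimatedBatchBytes, memoryThreshold);
}
```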

@FelixYBW (Contributor) commented Aug 3, 2024

> Error still occurs in the case SPARK-16995: flat mapping on Dataset containing a column created with lit/expr; we should investigate.

Is it related? I can't see how.

github-actions bot commented Aug 6, 2024: Run Gluten Clickhouse CI

@zhztheplayer merged commit f7b4027 into apache:main on Aug 6, 2024
44 checks passed
Labels: CORE (works for Gluten Core), VELOX
Projects
None yet

6 participants