
[VL] Use conf to control C2R occupied memory #5952

Merged
merged 30 commits into apache:main on Aug 6, 2024

Conversation

XinShuoWang (Contributor)

What changes were proposed in this pull request?

In the current design, the Column2Row operation is completed in a single pass over the whole batch, which consumes a lot of memory and can cause the program to OOM. In this commit, I split the C2R operation into multiple passes, which greatly reduces its peak memory usage. In addition, there should be some performance benefit from memory reuse.
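In other words, instead of materializing all rows of a batch at once, the conversion can walk the batch in slices bounded by a memory threshold. A minimal sketch of that idea, using simplified stand-in types rather than Gluten's actual converter:

```cpp
// Illustrative sketch only: convert a columnar batch to rows in several
// passes so the row buffer never grows beyond memoryThreshold bytes.
// FakeBatch and the fixed bytesPerRow are simplified stand-ins, not
// Gluten's real ColumnarBatch or row serialization.
#include <algorithm>
#include <cstdint>
#include <vector>

struct FakeBatch {
  int64_t numRows;
  int64_t bytesPerRow;  // pretend every row serializes to the same width
};

std::vector<std::vector<uint8_t>> convertInChunks(const FakeBatch& batch,
                                                  int64_t memoryThreshold) {
  std::vector<std::vector<uint8_t>> chunks;
  int64_t rowId = 0;  // where the previous pass stopped
  while (rowId < batch.numRows) {
    // How many rows fit in one buffer without crossing the threshold?
    int64_t rowsThisPass =
        std::max<int64_t>(1, memoryThreshold / batch.bytesPerRow);
    rowsThisPass = std::min(rowsThisPass, batch.numRows - rowId);

    // Size the buffer for this slice only; a real implementation could
    // reuse one buffer across passes, which is where the memory-reuse
    // benefit mentioned above would come from.
    std::vector<uint8_t> buffer(
        static_cast<size_t>(rowsThisPass * batch.bytesPerRow));
    // ... serialize rows [rowId, rowId + rowsThisPass) into buffer ...

    chunks.push_back(std::move(buffer));
    rowId += rowsThisPass;
  }
  return chunks;
}
```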

How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)

(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)

github-actions bot commented Jun 2, 2024

Thanks for opening a pull request!

Could you open an issue for this pull request on Github Issues?

https://github.com/apache/incubator-gluten/issues

Could you also update the commit message and pull request title to follow this format?

[GLUTEN-${ISSUES_ID}][COMPONENT]feat/fix: ${detailed message}


github-actions bot commented Jun 2, 2024: Run Gluten Clickhouse CI

@Yohahaha (Contributor) left a comment

> In the current design, the Column2Row operation is completed in a single pass over the whole batch, which consumes a lot of memory and can cause the program to OOM

Do you mean the current implementation converts all columnar batches into unsafe rows in one run? That seems hard to believe; could you double-check and give a case where the C2R conversion's peak memory is extremely high?

cpp/core/jni/JniWrapper.cc (outdated review thread, resolved)
Comment on lines 34 to 35
virtual void
convert(std::shared_ptr<ColumnarBatch> cb = nullptr, int64_t rowId = 0, int64_t memoryThreshold = INT64_MAX) = 0;

Could we configure memoryThreshold for ColumnarToRowConverter when initializing it?
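For illustration only, the suggestion might look roughly like this (a hypothetical class shape, not the real ColumnarToRowConverter interface): the threshold is taken once at construction, so convert() keeps a small per-call signature.

```cpp
// Hypothetical sketch of the suggestion: fix the memory threshold when the
// converter is created instead of passing it on every convert() call.
// Class and member names here are illustrative, not Gluten's real code.
#include <cstdint>
#include <memory>

class ColumnarBatch;  // opaque in this sketch

class ColumnarToRowConverterSketch {
 public:
  explicit ColumnarToRowConverterSketch(int64_t memoryThreshold)
      : memoryThreshold_(memoryThreshold) {}
  virtual ~ColumnarToRowConverterSketch() = default;

  // rowId marks where the previous partial conversion stopped.
  virtual void convert(std::shared_ptr<ColumnarBatch> cb, int64_t rowId) = 0;

 protected:
  int64_t memoryThreshold_;  // budget used to size each conversion pass
};
```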


@zhztheplayer can you fix it?

public native NativeColumnarToRowInfo nativeColumnarToRowConvert(long batchHandle, long c2rHandle)
throws RuntimeException;
public native NativeColumnarToRowInfo nativeColumnarToRowConvert(
long batchHandle, long c2rHandle, long rowId) throws RuntimeException;

Can we pass memoryThreshold through the JNI call? It should be easy to track the config on the Java side.
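A hypothetical sketch of that route on the native side (the function, class, and package names below are illustrative, not the actual JniWrapper.cc code): the Java caller reads the config once and passes it down as an extra jlong.

```cpp
// Hypothetical JNI entry point showing memoryThreshold arriving from Java
// as an extra jlong. Names are illustrative, not Gluten's real wrapper.
#include <jni.h>
#include <cstdint>

// Pretend converter for the sketch; stands in for the real C2R converter.
struct DemoConverter {
  void convert(int64_t batchHandle, int64_t rowId, int64_t memoryThreshold) {
    // ... chunked conversion bounded by memoryThreshold ...
  }
};

extern "C" JNIEXPORT void JNICALL
Java_org_example_DemoJniWrapper_nativeColumnarToRowConvert(
    JNIEnv* /*env*/, jobject /*obj*/, jlong batchHandle, jlong c2rHandle,
    jlong rowId, jlong memoryThreshold) {
  auto* converter = reinterpret_cast<DemoConverter*>(c2rHandle);
  converter->convert(batchHandle, rowId, memoryThreshold);
}
```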

@FelixYBW (Contributor)

@XinShuoWang Can you update?

github-actions bot: Run Gluten Clickhouse CI (4 similar comments)

@FelixYBW (Contributor)

@XinShuoWang can you fix the UT?

github-actions bot: Run Gluten Clickhouse CI

cpp/core/jni/JniWrapper.cc (outdated review thread, resolved)
@yma11 (Contributor) commented Jun 26, 2024

@XinShuoWang The UT fails because the default buffer size is 64 MB, which is too large when the task memory is small. You may want to change the threshold value for these UTs.

github-actions bot: Run Gluten Clickhouse CI (3 similar comments, latest Jul 1, 2024)

@FelixYBW (Contributor) commented Jul 2, 2024

@XinShuoWang can you fix the UT? The bug caused many queries to fail in our tests.

github-actions bot commented Jul 4, 2024: Run Gluten Clickhouse CI

@FelixYBW (Contributor)

@XinShuoWang Will you continue working on this fix? If not, we can pick it up and fix it ourselves. The bug is blocking our adoption.

github-actions bot commented Aug 1, 2024: Run Gluten Clickhouse CI (2 similar comments)

github-actions bot added the CORE (works for Gluten Core) and VELOX labels on Aug 1, 2024
github-actions bot: Run Gluten Clickhouse CI (9 similar comments, Aug 1–2, 2024)

@zhztheplayer (Member)

Error still occurs in the case SPARK-16995: flat mapping on Dataset containing a column created with lit/expr; we should investigate.

Stack: [0x00007f9cf43da000,0x00007f9cf44db000],  sp=0x00007f9cf44d89e0,  free space=1018k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
V  [libjvm.so+0xb04f5a]  Unsafe_GetLong+0x7a
J 7081  sun.misc.Unsafe.getLong(Ljava/lang/Object;J)J (0 bytes) @ 0x00007f9f6d53728e [0x00007f9f6d5371c0+0xce]
j  org.apache.spark.unsafe.Platform.getLong(Ljava/lang/Object;J)J+5
j  org.apache.spark.unsafe.bitset.BitSetMethods.isSet(Ljava/lang/Object;JI)Z+66
j  org.apache.spark.sql.catalyst.expressions.UnsafeRow.isNullAt(I)Z+14
j  org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering.compare(Lorg/apache/spark/sql/catalyst/InternalRow;Lorg/apache/spark/sql/catalyst/InternalRow;)I+27
j  org.apache.spark.sql.execution.GroupedIterator.fetchNextGroupIterator()Z+127
j  org.apache.spark.sql.execution.GroupedIterator.hasNext()Z+8
J 8209 C2 scala.collection.Iterator$$anon$11.hasNext()Z (35 bytes) @ 0x00007f9f6e6bf538 [0x00007f9f6e6bf340+0x1f8]
j  org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext()V+42
j  org.apache.spark.sql.execution.BufferedRowIterator.hasNext()Z+11
j  org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext()Z+4
j  org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(ZILscala/collection/Iterator;)Lscala/collection/Iterator;+207
j  org.apache.spark.sql.execution.SparkPlan$$Lambda$3173.apply(Ljava/lang/Object;)Ljava/lang/Object;+12
j  org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(Lscala/Function1;Lorg/apache/spark/TaskContext;ILscala/collection/Iterator;)Lscala/collection/Iterator;+2
j  org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(Lscala/Function1;Lorg/apache/spark/TaskContext;Ljava/lang/Object;Lscala/collection/Iterator;)Lscala/collection/Iterator;+7
j  org.apache.spark.rdd.RDD$$Lambda$2870.apply(Ljava/lang/Object;Ljava/lang/Object;Ljava/lang/Object;)Ljava/lang/Object;+13
j  org.apache.spark.rdd.MapPartitionsRDD.compute(Lorg/apache/spark/Partition;Lorg/apache/spark/TaskContext;)Lscala/collection/Iterator;+27
j  org.apache.spark.rdd.RDD.computeOrReadCheckpoint(Lorg/apache/spark/Partition;Lorg/apache/spark/TaskContext;)Lscala/collection/Iterator;+24
j  org.apache.spark.rdd.RDD.iterator(Lorg/apache/spark/Partition;Lorg/apache/spark/TaskContext;)Lscala/collection/Iterator;+40
j  org.apache.spark.scheduler.ResultTask.runTask(Lorg/apache/spark/TaskContext;)Ljava/lang/Object;+201
j  org.apache.spark.TaskContext.runTaskWithListeners(Lorg/apache/spark/scheduler/Task;)Ljava/lang/Object;+2
j  org.apache.spark.scheduler.Task.run(JILorg/apache/spark/metrics/MetricsSystem;ILscala/collection/immutable/Map;Lscala/Option;)Ljava/lang/Object;+254
j  org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Lorg/apache/spark/executor/Executor$TaskRunner;Lscala/runtime/BooleanRef;)Ljava/lang/Object;+43
j  org.apache.spark.executor.Executor$TaskRunner$$Lambda$2820.apply()Ljava/lang/Object;+8
j  org.apache.spark.util.Utils$.tryWithSafeFinally(Lscala/Function0;Lscala/Function0;)Ljava/lang/Object;+4
j  org.apache.spark.executor.Executor$TaskRunner.run()V+457
j  java.util.concurrent.ThreadPoolExecutor.runWorker(Ljava/util/concurrent/ThreadPoolExecutor$Worker;)V+95
j  java.util.concurrent.ThreadPoolExecutor$Worker.run()V+5
j  java.lang.Thread.run()V+11
v  ~StubRoutines::call_stub
V  [libjvm.so+0x69fb72]  JavaCalls::call_helper(JavaValue*, methodHandle*, JavaCallArguments*, Thread*)+0xe32
V  [libjvm.so+0x69d183]  JavaCalls::call_virtual(JavaValue*, KlassHandle, Symbol*, Symbol*, JavaCallArguments*, Thread*)+0x263
V  [libjvm.so+0x69d787]  JavaCalls::call_virtual(JavaValue*, Handle, KlassHandle, Symbol*, Symbol*, Thread*)+0x57
V  [libjvm.so+0x73de8c]  thread_entry(JavaThread*, Thread*)+0x6c
V  [libjvm.so+0xad7b0b]  JavaThread::thread_main_inner()+0xdb
V  [libjvm.so+0xad7dd9]  JavaThread::run()+0x299
V  [libjvm.so+0x9651f2]  java_start(Thread*)+0xe2
C  [libpthread.so.0+0x8609]  start_thread+0xd9

github-actions bot commented Aug 3, 2024: Run Gluten Clickhouse CI

@FelixYBW (Contributor) commented Aug 3, 2024

I fixed it a bit. Now if the batch is small, we don't allocate the full 64 MB; we allocate only the memory it actually needs.
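Roughly, the fix described here amounts to sizing the buffer from the batch itself whenever that is smaller than the configured threshold (illustrative constant and helper below, not the actual code):

```cpp
// Illustrative: allocate only what a small batch needs; cap large batches
// at the threshold and convert them in multiple passes instead.
#include <algorithm>
#include <cstdint>

constexpr int64_t kDefaultC2RThreshold = 64LL * 1024 * 1024;  // 64 MB default

int64_t pickBufferSize(int64_t estimatedBatchBytes,
                       int64_t memoryThreshold = kDefaultC2RThreshold) {
  return std::min(estimatedBatchBytes, memoryThreshold);
}
```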

@FelixYBW (Contributor) commented Aug 3, 2024

> Error still occurs in the case SPARK-16995: flat mapping on Dataset containing a column created with lit/expr; we should investigate.

Is it related? I can't see how.

github-actions bot commented Aug 6, 2024: Run Gluten Clickhouse CI

@zhztheplayer merged commit f7b4027 into apache:main on Aug 6, 2024
44 checks passed
Labels: CORE (works for Gluten Core), VELOX
Projects
None yet

6 participants