
[VL] Add helper function ColumnarBatches.toString and InternalRow toString #6458

Merged: 4 commits into apache:main on Aug 20, 2024

Conversation

@jinchengchenghh (Contributor) commented on Jul 15, 2024:

For testing purposes, add this helper function.
Also refactor the columnarToRow and rowToColumnar functions so they can be reused elsewhere.

A bot commented:

Thanks for opening a pull request!

Could you open an issue for this pull request on GitHub Issues?

https://github.com/apache/incubator-gluten/issues

Then could you also update the commit message and pull request title to the following format?

[GLUTEN-${ISSUES_ID}][COMPONENT]feat/fix: ${detailed message}

See also:

@zhztheplayer (Member) commented:

What's the purpose of the API? Is it for testing use?

Also please fill in the PR description. Thanks.

@zhztheplayer (Member) commented:

Hi @jinchengchenghh,

If it's for testing purposes on the Java side, my suggestion is not to propagate the call down to the C++ code. We can add a Java API ColumnarBatches#toString which converts the input batch to an Arrow batch (via #ensureLoaded), then use some regular way to stringify it. This could be the simplest way to make the API support both Velox and Arrow data.
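For illustration, a minimal sketch of what that suggestion could look like, written here in Scala; the `batchToString` name, the allocator and schema parameters, and the exact `ensureLoaded` signature are assumptions for the sketch, not the merged API:

```scala
import org.apache.arrow.memory.BufferAllocator
import org.apache.gluten.columnarbatch.ColumnarBatches // package path may differ by Gluten version
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.vectorized.ColumnarBatch

// Sketch only: materialize the (possibly native) batch as Arrow-backed
// vectors first, then stringify it row by row with ordinary Spark accessors.
def batchToString(allocator: BufferAllocator, schema: StructType, batch: ColumnarBatch): String = {
  val loaded = ColumnarBatches.ensureLoaded(allocator, batch) // assumed signature
  val sb = new StringBuilder
  val rows = loaded.rowIterator()
  while (rows.hasNext) {
    sb.append(rows.next().toSeq(schema).mkString("|")).append('\n')
  }
  sb.toString
}
```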

@jinchengchenghh (Contributor, author) commented:

ArrowWritableColumnVector does not have a print function on the Java side.
We could also change it to an InternalRow iterator, but we don't have a print function for that either.

@zhztheplayer (Member) commented:

> ArrowWritableColumnVector does not have a print function on the Java side. We could also change it to an InternalRow iterator, but we don't have a print function for that either.

If we can get the rows, there must be a way to stringify them, since Spark requires this to implement df.show.

@jinchengchenghh (Contributor, author) commented:

> > ArrowWritableColumnVector does not have a print function on the Java side. We could also change it to an InternalRow iterator, but we don't have a print function for that either.
>
> If we can get the rows, there must be a way to stringify them, since Spark requires this to implement df.show.

Yes, Spark uses ToPrettyString to show results in a DataFrame. We can only reuse part of that code, so I have implemented our own version.
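For reference, a minimal sketch of evaluating Spark's `ToPrettyString` expression directly (it exists in newer Spark versions, where `df.show` builds on it; the wiring below is illustrative, not the code this PR adds):

```scala
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.{BoundReference, ToPrettyString}
import org.apache.spark.sql.internal.SQLConf
import org.apache.spark.sql.types.IntegerType
import org.apache.spark.unsafe.types.UTF8String

// Bind ToPrettyString to ordinal 0 of an InternalRow, then evaluate it
// in interpreted mode to get the display string.
val expr = ToPrettyString(
  BoundReference(0, IntegerType, nullable = true),
  Some(SQLConf.get.sessionLocalTimeZone))
val pretty = expr.eval(InternalRow(123)).asInstanceOf[UTF8String]
println(pretty) // prints: 123 (a null value renders as NULL)
```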

Run Gluten Clickhouse CI

(5 similar CI-trigger comments omitted)

@jinchengchenghh (Contributor, author) commented:

Can you help review this PR again? Thanks! @zhztheplayer

@jinchengchenghh (Contributor, author) commented on Jul 24, 2024:

Can you help review? Thanks! @zhztheplayer

@jinchengchenghh changed the title from "[VL] Add helper function ColumnarBatches.toString" to "[VL] Add helper function ColumnarBatches.toString and UnsafeRow toString" on Jul 24, 2024
@zhztheplayer (Member) commented:

@jinchengchenghh Will take a look as soon as possible. Thanks.

@jinchengchenghh (Contributor, author) commented:

Could I take some of your time? @zhztheplayer

@zhztheplayer (Member) commented:

Reviewing now. Thank you for the reminder.

@zhztheplayer (Member) commented on Aug 2, 2024:

Hi @jinchengchenghh, I have been thinking about whether the code can be simplified to ease maintenance.

Could you take a look at the following example, which pretty-prints a row iterator with much less code than ToStringUtil:

test("UnsafeRow to string 2") {
  val util = ToStringUtil(Option.apply(SQLConf.get.sessionLocalTimeZone))
  val row1 =
    InternalRow.apply(UTF8String.fromString("hello"), UTF8String.fromString("world"), 123)
  val rowWithNull = InternalRow.apply(null, null, 4)
  val row2 = UnsafeProjection
    .create(Array[DataType](StringType, StringType, IntegerType))
    .apply(rowWithNull)
  val it = List(row1, row2, row1, row1, row2).toIterator
  val struct = new StructType().add("a", StringType).add("b", StringType).add("c", IntegerType)

  val encoder = RowEncoder(struct).resolveAndBind() // `ExpressionEncoder(struct).resolveAndBind()` for newer version of Spark
  val deserializer = encoder.createDeserializer()
  it.map(deserializer).foreach(r => println(r.mkString("|")))
}
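For reference, the snippet prints each row pipe-delimited; the expected output (reconstructed by hand from the rows above, not captured from an actual run) would be:

```
hello|world|123
null|null|4
hello|world|123
hello|world|123
null|null|4
```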

@jinchengchenghh (Contributor, author) commented:

For Iterator[UnsafeRow], that works.
But then the problem occurs again: a loaded Arrow ColumnarBatch cannot invoke rowIterator, because the ColumnarBatchRow column is an IndicatorVector. @zhztheplayer

@zhztheplayer (Member) commented:

> a loaded Arrow ColumnarBatch cannot invoke rowIterator, because the ColumnarBatchRow column is an IndicatorVector

Then we should fix that first... it sounds like a bug. I'll take another look at the issue.

github-actions bot added the VELOX label on Aug 16, 2024
@jinchengchenghh changed the title from "[VL] Add helper function ColumnarBatches.toString and UnsafeRow toString" to "[VL] Add helper function ColumnarBatches.toString and InternalRow toString" on Aug 16, 2024
github-actions bot added the "CORE (works for Gluten Core)" label on Aug 16, 2024
Run Gluten Clickhouse CI

(1 similar CI-trigger comment omitted)

@jinchengchenghh (Contributor, author) commented:

```
Executing SQL query from resource path /tpcds-queries/q24b.sql...
24/08/16 09:15:21 WARN Runtime: WholeStageIterator Reservation listener org.apache.gluten.memory.listener.ManagedReservationListener@4523caff still reserved non-zero bytes, which may cause memory leak, size: 2.0 MiB.
24/08/16 09:15:21 ERROR Executor: Exception in task 0.0 in stage 477.0 (TID 1749)
org.apache.spark.SparkException: Managed memory leak detected; size = 2097152 bytes, task 0.0 in stage 477.0 (TID 1749)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:516)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1500)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:750)
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007f372fa0c9a6, pid=11701, tid=0x00007f372e207640
#
# JRE version: OpenJDK Runtime Environment (8.0_422-b05) (build 1.8.0_422-8u422-b05-1~22.04-b05)
# Java VM: OpenJDK 64-Bit Server VM (25.422-b05 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# C  [libvelox.so+0x18049a6]  facebook::velox::memory::ScopedMemoryPoolArbitrationCtx::~ScopedMemoryPoolArbitrationCtx()+0x6
#
# Core dump written. Default location: /__w/incubator-gluten/incubator-gluten/tools/gluten-it/core or core.11701
#
# An error report file with more information is saved as:
# /__w/incubator-gluten/incubator-gluten/tools/gluten-it/hs_err_pid11701.log
24/08/16 09:15:21 WARN TaskSetManager: Lost task 0.0 in stage 477.0 (TID 1749) (051b84fb203f executor driver): org.apache.spark.SparkException: Managed memory leak detected; size = 2097152 bytes, task 0.0 in stage 477.0 (TID 1749)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:516)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1500)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:750)
```

@jinchengchenghh (Contributor, author) commented:

@zhztheplayer Thanks for your suggestion; it really makes the code clean and easy to maintain. Can you help review again?

Comment on lines 155 to 172:

```java
private static ColumnarBatch newArrowBatch(int numRows) {
  String schema = "a boolean, b int";
  final ArrowWritableColumnVector[] columns =
      ArrowWritableColumnVector.allocateColumns(numRows, StructType.fromDDL(schema));
  ArrowWritableColumnVector col1 = columns[0];
  ArrowWritableColumnVector col2 = columns[1];
  for (int j = 0; j < numRows; j++) {
    col1.putBoolean(j, j % 2 == 0);
    col2.putInt(j, 15 - j);
  }
  col2.putNull(numRows - 1);
  for (ArrowWritableColumnVector col : columns) {
    col.setValueCount(numRows);
  }
  final ColumnarBatch batch = new ColumnarBatch(columns);
  batch.setNumRows(numRows);
  return batch;
}
```
@zhztheplayer (Member) commented:

Can we remove this method? We could fill in the vectors in the test case's code.

E.g.,

```java
final int numRows = 100;
final ColumnarBatch batch = newArrowBatch("a boolean, b int", numRows);
final ArrowWritableColumnVector col0 = (ArrowWritableColumnVector) batch.column(0);
final ArrowWritableColumnVector col1 = (ArrowWritableColumnVector) batch.column(1);
for (int j = 0; j < numRows; j++) {
  col0.putBoolean(j, j % 2 == 0);
  col1.putInt(j, 15 - j);
}
col1.putNull(numRows - 1);
```

@zhztheplayer (Member) left a review comment:

Thanks!

@jinchengchenghh merged commit 6d1de90 into apache:main on Aug 20, 2024 (42 checks passed)
sharkdtu pushed a commit to sharkdtu/gluten that referenced this pull request on Nov 11, 2024