
[VL] Add helper function ColumnarBatches.toString and InternalRow toString #6458

Merged: 4 commits into apache:main on Aug 20, 2024

Conversation

@jinchengchenghh (Contributor) commented on Jul 15, 2024:

For testing purposes, add this helper function.
Also refactor the columnarToRow and rowToColumnar functions so they can be reused elsewhere.

A bot commented:

Thanks for opening a pull request!

Could you open an issue for this pull request on GitHub Issues?

https://github.com/apache/incubator-gluten/issues

Then could you also update the commit message and pull request title to the following format?

[GLUTEN-${ISSUES_ID}][COMPONENT]feat/fix: ${detailed message}

See also:

@zhztheplayer (Member) commented:

What's the purpose of the API? Is it for testing use?

Also please fill in the PR description. Thanks.

@zhztheplayer (Member) commented:

Hi @jinchengchenghh,

If it's for testing purposes on the Java side, my suggestion is not to propagate the call down to the C++ code. We can add a Java API ColumnarBatches#toString which converts the input batch to an Arrow batch (via #ensureLoaded), then use some regular way to stringify it. This could be the simplest way to make the API support both Velox and Arrow data.
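For illustration, a minimal sketch of what that suggestion could look like, written here in Scala; the `batchToString` name, the allocator and schema parameters, and the exact `ensureLoaded` signature are assumptions for the sketch, not the merged API:

```scala
import org.apache.arrow.memory.BufferAllocator
import org.apache.gluten.columnarbatch.ColumnarBatches // package path may differ by Gluten version
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.vectorized.ColumnarBatch

// Sketch only: materialize the (possibly native) batch as Arrow-backed
// vectors first, then stringify it row by row with ordinary Spark accessors.
def batchToString(allocator: BufferAllocator, schema: StructType, batch: ColumnarBatch): String = {
  val loaded = ColumnarBatches.ensureLoaded(allocator, batch) // assumed signature
  val sb = new StringBuilder
  val rows = loaded.rowIterator()
  while (rows.hasNext) {
    sb.append(rows.next().toSeq(schema).mkString("|")).append('\n')
  }
  sb.toString
}
```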

@jinchengchenghh (Contributor, author) commented:

ArrowWritableColumnVector does not have a print function on the Java side.
We could also change it to an InternalRow iterator, but we don't have a print function for that either.

@zhztheplayer (Member) commented:

> ArrowWritableColumnVector does not have a print function on the Java side. We could also change it to an InternalRow iterator, but we don't have a print function for that either.

If we can get the rows, there must be a way to stringify them, since Spark requires this to implement df.show.

@jinchengchenghh (Contributor, author) commented:

> > ArrowWritableColumnVector does not have a print function on the Java side. We could also change it to an InternalRow iterator, but we don't have a print function for that either.
>
> If we can get the rows, there must be a way to stringify them, since Spark requires this to implement df.show.

Yes, Spark uses ToPrettyString to show results in a DataFrame. We can only reuse part of that code, so I have implemented our own version.
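For reference, a minimal sketch of evaluating Spark's `ToPrettyString` expression directly (it exists in newer Spark versions, where `df.show` builds on it; the wiring below is illustrative, not the code this PR adds):

```scala
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.{BoundReference, ToPrettyString}
import org.apache.spark.sql.internal.SQLConf
import org.apache.spark.sql.types.IntegerType
import org.apache.spark.unsafe.types.UTF8String

// Bind ToPrettyString to ordinal 0 of an InternalRow, then evaluate it
// in interpreted mode to get the display string.
val expr = ToPrettyString(
  BoundReference(0, IntegerType, nullable = true),
  Some(SQLConf.get.sessionLocalTimeZone))
val pretty = expr.eval(InternalRow(123)).asInstanceOf[UTF8String]
println(pretty) // prints: 123 (a null value renders as NULL)
```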

Run Gluten Clickhouse CI

(5 similar CI-trigger comments omitted)

@jinchengchenghh (Contributor, author) commented:

Can you help review this PR again? Thanks! @zhztheplayer

@jinchengchenghh (Contributor, author) commented on Jul 24, 2024:

Can you help review? Thanks! @zhztheplayer

@jinchengchenghh changed the title from "[VL] Add helper function ColumnarBatches.toString" to "[VL] Add helper function ColumnarBatches.toString and UnsafeRow toString" on Jul 24, 2024
@zhztheplayer (Member) commented:

@jinchengchenghh Will take a look as soon as possible. Thanks.

@jinchengchenghh (Contributor, author) commented:

Could I take some of your time? @zhztheplayer

@zhztheplayer (Member) commented:

Reviewing now. Thank you for the reminder.

@zhztheplayer (Member) commented on Aug 2, 2024:

Hi @jinchengchenghh, I have been thinking about whether the code can be simplified to ease maintenance.

Could you take a look at the following example, which pretty-prints a row iterator with much less code than ToStringUtil:

test("UnsafeRow to string 2") {
  val util = ToStringUtil(Option.apply(SQLConf.get.sessionLocalTimeZone))
  val row1 =
    InternalRow.apply(UTF8String.fromString("hello"), UTF8String.fromString("world"), 123)
  val rowWithNull = InternalRow.apply(null, null, 4)
  val row2 = UnsafeProjection
    .create(Array[DataType](StringType, StringType, IntegerType))
    .apply(rowWithNull)
  val it = List(row1, row2, row1, row1, row2).toIterator
  val struct = new StructType().add("a", StringType).add("b", StringType).add("c", IntegerType)

  val encoder = RowEncoder(struct).resolveAndBind() // `ExpressionEncoder(struct).resolveAndBind()` for newer version of Spark
  val deserializer = encoder.createDeserializer()
  it.map(deserializer).foreach(r => println(r.mkString("|")))
}
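For reference, the snippet prints each row pipe-delimited; the expected output (reconstructed by hand from the rows above, not captured from an actual run) would be:

```
hello|world|123
null|null|4
hello|world|123
hello|world|123
null|null|4
```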

@jinchengchenghh (Contributor, author) commented:

For Iterator[UnsafeRow], that works.
But then the problem occurs again: a loaded Arrow ColumnarBatch cannot invoke rowIterator, because the ColumnarBatchRow column is an IndicatorVector. @zhztheplayer

@zhztheplayer (Member) commented:

> a loaded Arrow ColumnarBatch cannot invoke rowIterator, because the ColumnarBatchRow column is an IndicatorVector

Then we should fix that first... it sounds like a bug. I'll take another look at the issue.

github-actions bot added the VELOX label on Aug 16, 2024
@jinchengchenghh changed the title from "[VL] Add helper function ColumnarBatches.toString and UnsafeRow toString" to "[VL] Add helper function ColumnarBatches.toString and InternalRow toString" on Aug 16, 2024
github-actions bot added the "CORE (works for Gluten Core)" label on Aug 16, 2024
Run Gluten Clickhouse CI

(1 similar CI-trigger comment omitted)

@jinchengchenghh (Contributor, author) commented:

```
Executing SQL query from resource path /tpcds-queries/q24b.sql...
24/08/16 09:15:21 WARN Runtime: WholeStageIterator Reservation listener org.apache.gluten.memory.listener.ManagedReservationListener@4523caff still reserved non-zero bytes, which may cause memory leak, size: 2.0 MiB.
24/08/16 09:15:21 ERROR Executor: Exception in task 0.0 in stage 477.0 (TID 1749)
org.apache.spark.SparkException: Managed memory leak detected; size = 2097152 bytes, task 0.0 in stage 477.0 (TID 1749)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:516)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1500)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:750)
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007f372fa0c9a6, pid=11701, tid=0x00007f372e207640
#
# JRE version: OpenJDK Runtime Environment (8.0_422-b05) (build 1.8.0_422-8u422-b05-1~22.04-b05)
# Java VM: OpenJDK 64-Bit Server VM (25.422-b05 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# C  [libvelox.so+0x18049a6]  facebook::velox::memory::ScopedMemoryPoolArbitrationCtx::~ScopedMemoryPoolArbitrationCtx()+0x6
#
# Core dump written. Default location: /__w/incubator-gluten/incubator-gluten/tools/gluten-it/core or core.11701
#
# An error report file with more information is saved as:
# /__w/incubator-gluten/incubator-gluten/tools/gluten-it/hs_err_pid11701.log
24/08/16 09:15:21 WARN TaskSetManager: Lost task 0.0 in stage 477.0 (TID 1749) (051b84fb203f executor driver): org.apache.spark.SparkException: Managed memory leak detected; size = 2097152 bytes, task 0.0 in stage 477.0 (TID 1749)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:516)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1500)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:750)
```

@jinchengchenghh (Contributor, author) commented:

@zhztheplayer Thanks for your suggestion; it really makes the code clean and easy to maintain. Can you help review again?

Comment on lines 155 to 172:

```java
private static ColumnarBatch newArrowBatch(int numRows) {
  String schema = "a boolean, b int";
  final ArrowWritableColumnVector[] columns =
      ArrowWritableColumnVector.allocateColumns(numRows, StructType.fromDDL(schema));
  ArrowWritableColumnVector col1 = columns[0];
  ArrowWritableColumnVector col2 = columns[1];
  for (int j = 0; j < numRows; j++) {
    col1.putBoolean(j, j % 2 == 0);
    col2.putInt(j, 15 - j);
  }
  col2.putNull(numRows - 1);
  for (ArrowWritableColumnVector col : columns) {
    col.setValueCount(numRows);
  }
  final ColumnarBatch batch = new ColumnarBatch(columns);
  batch.setNumRows(numRows);
  return batch;
}
```
@zhztheplayer (Member) commented:

Can we remove this method? We could fill in the vectors in the test case's code.

E.g.,

```java
final int numRows = 100;
final ColumnarBatch batch = newArrowBatch("a boolean, b int", numRows);
final ArrowWritableColumnVector col0 = (ArrowWritableColumnVector) batch.column(0);
final ArrowWritableColumnVector col1 = (ArrowWritableColumnVector) batch.column(1);
for (int j = 0; j < numRows; j++) {
  col0.putBoolean(j, j % 2 == 0);
  col1.putInt(j, 15 - j);
}
col1.putNull(numRows - 1);
```

@zhztheplayer (Member) left a review comment:

Thanks!

@jinchengchenghh merged commit 6d1de90 into apache:main on Aug 20, 2024 (42 checks passed)
sharkdtu pushed a commit to sharkdtu/gluten that referenced this pull request on Nov 11, 2024