[GLUTEN-5320][VL] Reduce driver memory footprint by postpone the creation and serialization of LocalFilesNode #5321
Run Gluten Clickhouse CI
There are still some cases to fix, for example:
@@ -44,6 +42,7 @@ import org.apache.spark.util.ExecutorManager
import java.lang.{Long => JLong}
import java.nio.charset.StandardCharsets
import java.time.ZoneOffset
import java.util
Don't introduce this package. Just use `JArrayList`.
public List<String> preferredLocations() {
  return Arrays.asList(filePartition.preferredLocations());
}
`val preferredLocations = SoftAffinity.getFilePartitionLocations(f)`
Please keep the original logic.
@@ -91,4 +91,6 @@ trait IteratorApi {
numOutputRows: SQLMetric,
numOutputBatches: SQLMetric,
scanTime: SQLMetric): RDD[ColumnarBatch]

def toLocalFilesNodeByteArray(p: GlutenRawPartition): Array[Array[Byte]]
Could we add a new `SplitInfo` object file and move this method into it as `toSplitInfoByteArray`? Then other backends could use it more easily, and we would avoid adding this method to `IteratorApi`, which seems unrelated.
Thank you for the improvements. This idea works for me; just a few comments.
This PR is stale because it has been open 45 days with no activity. Remove stale label or comment or this will be closed in 10 days.
This PR was auto-closed because it has been stalled for 10 days with no activity. Please feel free to reopen if it is still valid. Thanks.
@WangGuangxin are you still working on this PR?
@Yohahaha I'll rework this PR this week.
Hi @WangGuangxin, feel free to request that my PR be closed once yours is ready for review.
What changes were proposed in this pull request?
Currently, the driver generates `GlutenPartition` objects from Spark's `FilePartitions`, then converts them to `LocalFilesNode` and serializes them to byte arrays in protobuf format.
This doubles driver memory usage, because the `FilePartitions` are not released after being converted to `LocalFilesNodes`.
When there are many file splits (file statuses), the impact is significant.
For example, in one of our cases there are 48 HDFS paths to list, with 7039474 files under them in total. Vanilla Spark handles this with 20G of driver memory, but Gluten fails.
From the GC log, we can see that Gluten has more `String` and `Byte[]` objects than vanilla Spark.
[Screenshot: Vanilla Spark Full GC objects]
[Screenshot: Gluten Full GC objects (before this patch)]
[Screenshot: Gluten Full GC objects (after this patch)]
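The memory effect described above can be illustrated with a minimal, self-contained sketch. It is not the code from this PR: `FilePart`, `serialize`, and `DeferredPartition` are hypothetical stand-ins for Spark's file partitions, the protobuf `LocalFilesNode` serialization, and a Gluten partition, and where exactly the real patch performs the conversion is not shown in this excerpt. The point is only the contrast between serializing every partition up front (source objects and byte arrays alive together) and keeping just the source until the bytes are needed.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class LazySplitSketch {
    /** Hypothetical stand-in for a file partition (not the Spark class). */
    static final class FilePart {
        final List<String> paths;
        FilePart(List<String> paths) { this.paths = paths; }
    }

    /** Hypothetical serializer; the real code builds a protobuf LocalFilesNode. */
    static byte[] serialize(FilePart p) {
        return String.join("\n", p.paths).getBytes(StandardCharsets.UTF_8);
    }

    /** Eager: all partitions are serialized at planning time, so the
     *  caller holds the FilePart list and the byte arrays simultaneously. */
    static List<byte[]> eager(List<FilePart> parts) {
        List<byte[]> out = new ArrayList<>();
        for (FilePart p : parts) {
            out.add(serialize(p));
        }
        return out;
    }

    /** Deferred: the partition keeps only its source and produces the
     *  bytes on demand; the result is intentionally not cached. */
    static final class DeferredPartition {
        private final FilePart source;
        DeferredPartition(FilePart source) { this.source = source; }
        byte[] splitInfoBytes() { return serialize(source); }
    }

    public static void main(String[] args) {
        List<FilePart> parts = new ArrayList<>();
        parts.add(new FilePart(Arrays.asList("hdfs://nn/t/part-0", "hdfs://nn/t/part-1")));

        // Planning step: wrap each partition, no serialization yet.
        List<DeferredPartition> deferred = new ArrayList<>();
        for (FilePart p : parts) {
            deferred.add(new DeferredPartition(p));
        }
        // Later step: bytes are created one partition at a time.
        for (DeferredPartition d : deferred) {
            System.out.println(d.splitInfoBytes().length + " bytes");
        }
    }
}
```

Both paths yield the same bytes for a given partition. The deferred variant trades repeated serialization work for a smaller peak footprint, which is the trade-off the description above argues for when there are millions of files.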
(Fixes: #5320)