[SPARK-50806][SQL] Support InputRDDCodegen interruption on task cancellation #49501
base: master
Conversation
InternalRow $row = (InternalRow) $input.next();
$numOutputRows++;
if ($numOutputRows % 1000 == 0) {
1000 looks like a magic number. So, we don't need to interrupt when the number of rows is less than 1000 because it will finish quickly anyway?
Yea, it's by intuition. And yes, we expect 1000 rows to finish quickly.
val logAppender = new LogAppender("")
withLogAppender(logAppender, level = Some(Level.INFO)) {
  spark.sparkContext.setJobGroup("SPARK-50806", "SPARK-50806", false)
  val slowDF = spark.range(1, 100000).rdd.mapPartitions { iter =>
Could you add a comment about this test case's assumption: this should be 100000 (> 1000) because we check interruption once per 1000 rows?
Updated.
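As context for the reviewer's point above, the test only works because the slow partition yields more than 1000 rows, since the generated code samples the interruption flag once per 1000 output rows. A minimal, hypothetical sketch of that kind of slow partition iterator (illustrative names only, not the actual suite code):

```java
import java.util.Iterator;

// Hypothetical sketch: wrap a partition iterator so each row takes ~1ms,
// giving a cancellation request time to land mid-scan. The source must
// still produce > 1000 rows, because interruption is only polled per 1000.
class SlowIterator implements Iterator<Long> {
    private final Iterator<Long> underlying;

    SlowIterator(Iterator<Long> underlying) {
        this.underlying = underlying;
    }

    @Override
    public boolean hasNext() {
        return underlying.hasNext();
    }

    @Override
    public Long next() {
        try {
            Thread.sleep(1); // slow down each row
        } catch (InterruptedException e) {
            // restore the interrupt status so callers can observe it
            Thread.currentThread().interrupt();
            throw new RuntimeException(e);
        }
        return underlying.next();
    }
}
```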
@Ngone51 the new test failed
@cloud-fan is there a way to configure the parallelize for
Do you mean
Gentle ping, @Ngone51 ~
@dongjoon-hyun Thanks, this is what I'm looking for. Let me try it first. (The test generates 2 tasks locally but 3 in GitHub Actions. So I'd like to explicitly set the partitions to fix the stability of the test.)
Maybe, could you rebase this PR to
@dongjoon-hyun Thanks, will update. But this PR's test failure still exists. I'll take a closer look.

2025-01-23T05:14:27.8423862Z [info] - SPARK-50806: HashAggregate codegen should be interrupted on task cancellation *** FAILED *** (1 minute, 5 seconds)
2025-01-23T05:14:27.8426704Z [info]   The code passed to eventually never returned normally. Attempted 3956 times over 1.0001835922166666 minutes. Last failure message: 3 did not equal 2. (HashAggregateCodegenInterruptionSuite.scala:91)
Oh. Got it.
The failure is not related to
However, these are the local Spark logs, where the job
The logs look almost the same. It's confusing why the job can't be found in the GitHub Actions case.
What changes were proposed in this pull request?
This PR adds a new mutable state `taskInterrupted` in `InputRDDCodegen`. The state indicates whether the task thread is interrupted or not. `InputRDDCodegen` would check the state each time it produces a record and stop immediately if `taskInterrupted=true`. `taskInterrupted` would be updated by `TaskContext.isInterrupted()` per 1000 output rows processed.

Why are the changes needed?
This facilitates the resource release when there is a task killing request.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Added unit test.
Was this patch authored or co-authored using generative AI tooling?
No.
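The pattern this PR adds to the generated input-consumption loop can be sketched as follows. This is a hedged, illustrative sketch, not the actual generated code: `isTaskInterrupted` here stands in for Spark's `TaskContext.isInterrupted()`, and the class and method names are invented for the example.

```java
import java.util.Iterator;

// Illustrative sketch of the interruption pattern: poll an interruption
// flag once per 1000 rows and stop producing records when it is set.
final class InterruptiblePull {
    // Stand-in for TaskContext.isInterrupted(); in Spark the real task
    // context would be consulted instead of the current thread's flag.
    static boolean isTaskInterrupted() {
        return Thread.currentThread().isInterrupted();
    }

    static long consume(Iterator<Long> input) {
        long numOutputRows = 0;
        boolean taskInterrupted = false;
        while (!taskInterrupted && input.hasNext()) {
            long row = input.next();
            numOutputRows++;
            // Polling on every row would add a check per record, so the PR
            // samples the flag once per 1000 rows (a bound chosen by
            // intuition, per the review discussion above).
            if (numOutputRows % 1000 == 0) {
                taskInterrupted = isTaskInterrupted();
            }
        }
        return numOutputRows;
    }
}
```

The trade-off is that a killed task may still produce up to 999 more rows before noticing the interruption, which the discussion above accepts as quick enough in practice.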