[SPARK-50858][PYTHON] Add configuration to hide Python UDF stack trace #49535
Conversation
Force-pushed from c46ecf5 to 174fdea.
I think we should show the UDF exception. Otherwise users won't know what's going on and why their job fails.
@HyukjinKwon I agree, and this PR only makes it configurable (with the default value set to false, i.e. the stack trace is shown by default). There are many user-friendly errors on the Python side, but they are often buried in a long Python-side stack trace. This change optionally hides those stack traces to improve the user experience.
@allisonwang-db @ueshin Could you review this PR, which adds a configuration to hide the Python stack trace from `analyze_udtf`?
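For readers following along, a minimal sketch of how the flag discussed here could be toggled from PySpark; the conf name comes from this PR, while the session setup is assumed:

```py
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Default behavior (this PR's default): the full Python-side stack trace
# is kept in UDF error messages.
spark.conf.set("spark.sql.execution.pyspark.udf.hideTraceback.enabled", "false")

# Opting in: only the exception class and message are surfaced.
spark.conf.set("spark.sql.execution.pyspark.udf.hideTraceback.enabled", "true")
```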
Resolved review thread on sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala (outdated).
@@ -122,6 +122,7 @@ private[spark] abstract class BasePythonRunner[IN, OUT](
   protected val authSocketTimeout = conf.get(PYTHON_AUTH_SOCKET_TIMEOUT)
   private val reuseWorker = conf.get(PYTHON_WORKER_REUSE)
   protected val faultHandlerEnabled: Boolean = conf.get(PYTHON_WORKER_FAULTHANLDER_ENABLED)
+  protected val hideTraceback: Boolean = false
Can we use `conf.get(PYSPARK_HIDE_TRACEBACK)` here so that we don't need to override every subclass?
The config is defined in `org.apache.spark.sql.internal.SQLConf`, which seems to be inaccessible from here. For reference, `PYSPARK_SIMPLIFIED_TRACEBACK` is also defined in `SQLConf`, so `BasePythonRunner` subclasses have to override it. Is there an advantage to putting it in `SQLConf` rather than e.g. `org.apache.spark.internal.config.Python`?
The conf in `SQLConf` is a session-based conf that can also be set at runtime, whereas any conf in the `core` module or `StaticSQLConf` is a cluster-wide conf that can't be changed while the cluster is running.
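To illustrate the session-scoped vs. cluster-wide distinction described above, a hedged PySpark sketch; `spark.sql.warehouse.dir` is just a well-known static conf used as an example, and the exact error text may differ:

```py
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# A SQLConf entry is session-scoped and can be flipped at runtime:
spark.conf.set("spark.sql.execution.pyspark.udf.hideTraceback.enabled", "true")

# A static / cluster-wide conf cannot be modified on a running session;
# Spark rejects it with an error along the lines of
# "Cannot modify the value of a static config".
try:
    spark.conf.set("spark.sql.warehouse.dir", "/tmp/other-warehouse")
except Exception as e:
    print(type(e).__name__, e)
```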
"hiding the stack trace.") | ||
.version("4.0.0") | ||
.booleanConf | ||
.createWithDefault(false) |
Another way is to create this conf as an int and show the max depth of the stack trace, but I don't feel strongly.
Is there a use case where we only want to show the last k frames of the stack? I'm under the impression that we want to show the full stack trace for most exceptions, and completely hide the stack trace for specific library exceptions where the message is sufficient to identify the reason.
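A rough, self-contained Python sketch of the two behaviors being weighed here (full traceback vs. class-plus-message only); `format_udf_error` is a hypothetical helper for illustration, not the PR's actual code in `python/pyspark/util.py`:

```py
import traceback


def format_udf_error(exc: BaseException, hide_traceback: bool) -> str:
    """Hypothetical helper: render either the full traceback or only the
    exception class and message."""
    if hide_traceback:
        # e.g. "RuntimeError: [XXX] My PySpark runtime error."
        return f"{type(exc).__name__}: {exc}"
    # Full rendering, including every frame. An int conf, as floated above,
    # could instead pass limit=-k here to keep only the last k frames.
    return "".join(traceback.format_exception(type(exc), exc, exc.__traceback__))


try:
    raise RuntimeError("[XXX] My PySpark runtime error.")
except RuntimeError as e:
    print(format_udf_error(e, hide_traceback=True))   # one line
    print(format_udf_error(e, hide_traceback=False))  # full traceback
```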
…onf.scala Co-authored-by: Allison Wang <[email protected]>
Force-pushed from c412da6 to c4f40b0.
Could you show an example of the error messages, with this enabled and disabled, in the PR description?
It's in the collapsible section "Example that illustrates the difference".
ah, I didn't notice that. thanks!
Thanks! Merging to master and branch-4.0.
### What changes were proposed in this pull request?

This PR adds a new configuration `spark.sql.execution.pyspark.udf.hideTraceback.enabled`. If set, when handling an exception from a Python UDF, only the exception class and message are included. The configuration is turned off by default.

This PR also adds a new optional parameter `hide_traceback` for `handle_udf_exception` to override the configuration.

Suggested review order:
1. `python/pyspark/util.py`: logic changes
2. `python/pyspark/tests/test_util.py`: unit tests
3. other files: adding the new configuration

### Why are the changes needed?

This allows library-provided UDFs to show only the relevant message without an unnecessary stack trace.

### Does this PR introduce _any_ user-facing change?

If the configuration is turned off, no user-facing change. Otherwise, the stack trace is not included in the error message when handling an exception from a Python UDF.

<details>
<summary>Example that illustrates the difference</summary>

```py
from pyspark.errors.exceptions.base import PySparkRuntimeError
from pyspark.sql.types import IntegerType, StructField, StructType
from pyspark.sql.udtf import AnalyzeArgument, AnalyzeResult
from pyspark.sql.functions import udtf

@udtf()
class PythonUDTF:
    @staticmethod
    def analyze(x: AnalyzeArgument) -> AnalyzeResult:
        raise PySparkRuntimeError("[XXX] My PySpark runtime error.")

    def eval(self, x: int):
        yield (x,)

spark.udtf.register("my_udtf", PythonUDTF)
spark.sql("select * from my_udtf(1)").show()
```

With the configuration turned off, the last line gives:

```
...
pyspark.errors.exceptions.captured.AnalysisException: [TABLE_VALUED_FUNCTION_FAILED_TO_ANALYZE_IN_PYTHON] Failed to analyze the Python user defined table function: Traceback (most recent call last):
  File "<stdin>", line 7, in analyze
pyspark.errors.exceptions.base.PySparkRuntimeError: [XXX] My PySpark runtime error. SQLSTATE: 38000; line 1 pos 14
```

With the configuration turned on, the last line gives:

```
...
pyspark.errors.exceptions.captured.AnalysisException: [TABLE_VALUED_FUNCTION_FAILED_TO_ANALYZE_IN_PYTHON] Failed to analyze the Python user defined table function: pyspark.errors.exceptions.base.PySparkRuntimeError: [XXX] My PySpark runtime error. SQLSTATE: 38000; line 1 pos 14
```

</details>

### How was this patch tested?

Added a unit test in `python/pyspark/tests/test_util.py`, testing two cases with the configuration turned on and off respectively.

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #49535 from wengh/spark-50858-hide-udf-stack-trace.

Authored-by: Haoyu Weng <[email protected]>
Signed-off-by: Allison Wang <[email protected]>
(cherry picked from commit d259132)
Signed-off-by: Allison Wang <[email protected]>
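As a rough illustration of the two-case testing approach described under "How was this patch tested?", a hedged sketch; `format_udf_error` is a hypothetical stand-in, not the real helper exercised by `python/pyspark/tests/test_util.py`:

```py
import traceback
import unittest


def format_udf_error(exc: BaseException, hide_traceback: bool) -> str:
    # Hypothetical stand-in for the PR's real logic in python/pyspark/util.py.
    if hide_traceback:
        return f"{type(exc).__name__}: {exc}"
    return "".join(traceback.format_exception(type(exc), exc, exc.__traceback__))


class HideTracebackSketchTest(unittest.TestCase):
    def _raise_and_catch(self) -> BaseException:
        try:
            raise RuntimeError("[XXX] My PySpark runtime error.")
        except RuntimeError as e:
            return e

    def test_traceback_hidden(self):
        msg = format_udf_error(self._raise_and_catch(), hide_traceback=True)
        self.assertNotIn("Traceback", msg)
        self.assertIn("[XXX] My PySpark runtime error.", msg)

    def test_traceback_shown(self):
        msg = format_udf_error(self._raise_and_catch(), hide_traceback=False)
        self.assertIn("Traceback", msg)


if __name__ == "__main__":
    unittest.main()
```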
Hi @allisonwang-db, @ueshin, @HyukjinKwon. Although this passed CI, it seems that this causes Python linter failures on both the master and branch-4.0 branches. Could you check them, please?
@dongjoon-hyun Thanks for letting us know. cc @wengh could you take a look?
Oh, never mind. @HyukjinKwon's revert seems to have fixed it in both branches already.
Thank you for the reply anyway, @allisonwang-db.
sorry it was my bad 😢