-
Notifications
You must be signed in to change notification settings - Fork 456
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[GLUTEN-6368] Redact sensitive configs when calling gluten::printConfig
#6793
Conversation
Run Gluten Clickhouse CI |
Run Gluten Clickhouse CI |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ArnavBalyan, thanks for your work! One comment is posted.
@xushichangdesmond, could you take a review?
@@ -673,7 +674,8 @@ object GlutenConfig { | |||
// gcs config | |||
SPARK_GCS_STORAGE_ROOT_URL, | |||
SPARK_GCS_AUTH_TYPE, | |||
SPARK_GCS_AUTH_SERVICE_ACCOUNT_JSON_KEYFILE | |||
SPARK_GCS_AUTH_SERVICE_ACCOUNT_JSON_KEYFILE, | |||
GLUTEN_REGEX_LOG_REDACTION |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ArnavBalyan, I think we should just use the existing redaction config key of Spark. Thus, this Spark config will be respected in Gluten.
See the below example to pass Spark's session timezone:
https://github.com/apache/incubator-gluten/pull/6793/files#diff-21453c657b1ccfba45172821a434e5e7cdf007a49b440155514e6641b837d77cL651
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, I was trying to do this, then realized some compatibility problems with C++ regex
library vs Scala matching
library used by Spark.
For example: Spark typically uses something like (?i)secret\|password\|token\|access[.]key
, but this fails in C++ regex library because (?i)
is not supported and users need to represent the regex in a different manner. To prevent such conflicts, for now I had created a different config specific to gluten. I agree, this is not the most elegant way, can switch to single config as well, let me know thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ArnavBalyan, can we use boost::regex instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes we can use boost, there are still be parity differences, but these would be less common. Also just noticed the regex redact property is present in spark core
config, it seems currently the SQL
configs flow in, we might have to wire up the core configs to flow into gluten.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ArnavBalyan, yes, we cannot guarantee the compatibility for all regex pattern. But at least, this default regex should be supported in Gluten.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes theres differences in the c++ regex library, we could probably use a fallback to something sensible as a compromise.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the review @PHILO-HE @xushichangdesmond, I'll wire up the spark core config and use that by default, maybe we can improve this to fallback to gluten specific configs if needed in the future
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For example: Spark typically uses something like (?i)secret|password|token|access[.]key, but this fails in C++ regex library because (?i) is not supported and users need to represent the regex in a different manner.
could we overwrite Spark's default pattern with a valid regex patten in native?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are alot of incompatibilities, it's just one of the things where c++ and java don't have complete parity, I would rather, we provide limited support for the same config, or allow users to provide regex supported by c++ using a different properties. Option 1 is better for now and later we can add support
cpp/core/config/GlutenConfig.cc
Outdated
#include "compute/ProtobufUtils.h" | ||
#include "config.pb.h" | ||
#include "jni/JniError.h" | ||
|
||
namespace gluten { | ||
|
||
const std::string REDACTED_VALUE = "*********(redacted)"; | ||
const std::string REGEX_REDACT_KEY = "spark.gluten.redaction.regex"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
cpp/core/config/GlutenConfig.cc
Outdated
@@ -34,12 +39,28 @@ parseConfMap(JNIEnv* env, const uint8_t* planData, const int32_t planDataLength) | |||
return sparkConfs; | |||
} | |||
|
|||
std::optional<std::regex> getRedactionRegex(const std::unordered_map<std::string, std::string>& conf) { | |||
auto it = conf.find(REGEX_REDACT_KEY); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please move this one to anonymous namespace.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done thanks!
Thanks. Just several nits. |
Run Gluten Clickhouse CI |
Run Gluten Clickhouse CI |
thanks for this; think it looks good though I dont have approval rights for this repo. |
Ack thanks, I will incorporate @PHILO-HE's feedback and use the gluten specific configs, should not be much effort once the Spark core-configs can be exposed to the GlutenConfig, I'll update this shortly thanks! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thank you, left some comments.
cpp/core/config/GlutenConfig.cc
Outdated
#include "compute/ProtobufUtils.h" | ||
#include "config.pb.h" | ||
#include "jni/JniError.h" | ||
|
||
namespace { | ||
|
||
const std::string REGEX_REDACT_KEY = "spark.gluten.redaction.regex"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
move config key definition to GlutenConfig.h
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and follow others naming.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will be changed, we will no longer rely on this config
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All configs from GlutenConfig.scala should mapping to GlutenConfig.h/VeloxConfig.h, let's keep config consistent.
cpp/core/config/GlutenConfig.cc
Outdated
namespace gluten { | ||
|
||
const std::string REDACTED_VALUE = "*********(redacted)"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto.
@@ -673,7 +674,8 @@ object GlutenConfig { | |||
// gcs config | |||
SPARK_GCS_STORAGE_ROOT_URL, | |||
SPARK_GCS_AUTH_TYPE, | |||
SPARK_GCS_AUTH_SERVICE_ACCOUNT_JSON_KEYFILE | |||
SPARK_GCS_AUTH_SERVICE_ACCOUNT_JSON_KEYFILE, | |||
GLUTEN_REGEX_LOG_REDACTION |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For example: Spark typically uses something like (?i)secret|password|token|access[.]key, but this fails in C++ regex library because (?i) is not supported and users need to represent the regex in a different manner.
could we overwrite Spark's default pattern with a valid regex patten in native?
cc @PHILO-HE, could you please review, wired up to existing spark confs thanks! |
@ArnavBalyan, thanks for your updating! Did you verify that Spark's this config property takes effect in Gluten in a test? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thank you!
gluten::printConfig
Yes this was tested, have switched to regex boost, we are able to see redaction with more flexible regex expressions thanks! |
That sounds great. And does that mean we can use the same config key with vanilla Spark ? |
Yes, the same config key (that vanilla spark supports) will now also work with Gluten, no additional changes needed for Gluten thanks |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good!
What changes were proposed in this pull request?
spark.redaction.regex
which allows redacting sensitive info like passwords, tokens etc.Fixes: #6368
How was this patch tested?
Before:
After: (Additional regex was supplied to match
spark.gluten.ugi.tokens
)