Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[GLUTEN-6368] Redact sensitive configs when calling gluten::printConfig #6793

Merged
merged 7 commits into from
Aug 19, 2024

Conversation

ArnavBalyan
Copy link
Contributor

@ArnavBalyan ArnavBalyan commented Aug 12, 2024

What changes were proposed in this pull request?

  • Currently Spark supports spark.redaction.regex which allows redacting sensitive info like passwords, tokens etc.
  • However gluten does not enforce this and logs in plaintext when debug mode is enabled.
  • Added support to redact in printconfig which is being used in the codebase to log the config at various places.
  • User can provide a regex which will be matched and a redacted string will be visitble.

Fixes: #6368

How was this patch tested?

  • Local Spark shell
    Before:
image

After: (Additional regex was supplied to match spark.gluten.ugi.tokens)
image

@github-actions github-actions bot added CORE works for Gluten Core VELOX labels Aug 12, 2024
Copy link

#6368

Copy link

Run Gluten Clickhouse CI

Copy link

Run Gluten Clickhouse CI

Copy link
Contributor

@PHILO-HE PHILO-HE left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ArnavBalyan, thanks for your work! One comment is posted.
@xushichangdesmond, could you take a review?

@@ -673,7 +674,8 @@ object GlutenConfig {
// gcs config
SPARK_GCS_STORAGE_ROOT_URL,
SPARK_GCS_AUTH_TYPE,
SPARK_GCS_AUTH_SERVICE_ACCOUNT_JSON_KEYFILE
SPARK_GCS_AUTH_SERVICE_ACCOUNT_JSON_KEYFILE,
GLUTEN_REGEX_LOG_REDACTION
Copy link
Contributor

@PHILO-HE PHILO-HE Aug 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ArnavBalyan, I think we should just use the existing redaction config key of Spark. Thus, this Spark config will be respected in Gluten.

See the below example to pass Spark's session timezone:
https://github.com/apache/incubator-gluten/pull/6793/files#diff-21453c657b1ccfba45172821a434e5e7cdf007a49b440155514e6641b837d77cL651

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I was trying to do this, then realized some compatibility problems with C++ regex library vs Scala matching library used by Spark.
For example: Spark typically uses something like (?i)secret\|password\|token\|access[.]key, but this fails in C++ regex library because (?i) is not supported and users need to represent the regex in a different manner. To prevent such conflicts, for now I had created a different config specific to gluten. I agree, this is not the most elegant way, can switch to single config as well, let me know thanks!

Copy link
Contributor

@PHILO-HE PHILO-HE Aug 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ArnavBalyan, can we use boost::regex instead?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes we can use boost, there are still be parity differences, but these would be less common. Also just noticed the regex redact property is present in spark core config, it seems currently the SQL configs flow in, we might have to wire up the core configs to flow into gluten.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ArnavBalyan, yes, we cannot guarantee the compatibility for all regex pattern. But at least, this default regex should be supported in Gluten.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes theres differences in the c++ regex library, we could probably use a fallback to something sensible as a compromise.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the review @PHILO-HE @xushichangdesmond, I'll wire up the spark core config and use that by default, maybe we can improve this to fallback to gluten specific configs if needed in the future

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For example: Spark typically uses something like (?i)secret|password|token|access[.]key, but this fails in C++ regex library because (?i) is not supported and users need to represent the regex in a different manner.

could we overwrite Spark's default pattern with a valid regex patten in native?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are alot of incompatibilities, it's just one of the things where c++ and java don't have complete parity, I would rather, we provide limited support for the same config, or allow users to provide regex supported by c++ using a different properties. Option 1 is better for now and later we can add support

#include "compute/ProtobufUtils.h"
#include "config.pb.h"
#include "jni/JniError.h"

namespace gluten {

const std::string REDACTED_VALUE = "*********(redacted)";
const std::string REGEX_REDACT_KEY = "spark.gluten.redaction.regex";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

@@ -34,12 +39,28 @@ parseConfMap(JNIEnv* env, const uint8_t* planData, const int32_t planDataLength)
return sparkConfs;
}

std::optional<std::regex> getRedactionRegex(const std::unordered_map<std::string, std::string>& conf) {
auto it = conf.find(REGEX_REDACT_KEY);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please move this one to anonymous namespace.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done thanks!

@jinchengchenghh
Copy link
Contributor

Thanks. Just several nits.

Copy link

Run Gluten Clickhouse CI

Copy link

Run Gluten Clickhouse CI

@xushichangdesmond
Copy link

thanks for this; think it looks good though I dont have approval rights for this repo.

@ArnavBalyan
Copy link
Contributor Author

ArnavBalyan commented Aug 15, 2024

thanks for this; think it looks good though I dont have approval rights for this repo.

Ack thanks, I will incorporate @PHILO-HE's feedback and use the gluten specific configs, should not be much effort once the Spark core-configs can be exposed to the GlutenConfig, I'll update this shortly thanks!

Copy link
Contributor

@Yohahaha Yohahaha left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thank you, left some comments.

#include "compute/ProtobufUtils.h"
#include "config.pb.h"
#include "jni/JniError.h"

namespace {

const std::string REGEX_REDACT_KEY = "spark.gluten.redaction.regex";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

move config key definition to GlutenConfig.h

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and follow others naming.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will be changed, we will no longer rely on this config

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All configs from GlutenConfig.scala should mapping to GlutenConfig.h/VeloxConfig.h, let's keep config consistent.

namespace gluten {

const std::string REDACTED_VALUE = "*********(redacted)";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto.

@@ -673,7 +674,8 @@ object GlutenConfig {
// gcs config
SPARK_GCS_STORAGE_ROOT_URL,
SPARK_GCS_AUTH_TYPE,
SPARK_GCS_AUTH_SERVICE_ACCOUNT_JSON_KEYFILE
SPARK_GCS_AUTH_SERVICE_ACCOUNT_JSON_KEYFILE,
GLUTEN_REGEX_LOG_REDACTION
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For example: Spark typically uses something like (?i)secret|password|token|access[.]key, but this fails in C++ regex library because (?i) is not supported and users need to represent the regex in a different manner.

could we overwrite Spark's default pattern with a valid regex patten in native?

@ArnavBalyan
Copy link
Contributor Author

cc @PHILO-HE, could you please review, wired up to existing spark confs thanks!

@PHILO-HE
Copy link
Contributor

@ArnavBalyan, thanks for your updating! Did you verify that Spark's this config property takes effect in Gluten in a test?
BTW, let's use boost regex, as mentioned before. The boost lib has been included in some cmake files. So I guess you just need to add its header and create boost regex parser in the code.

Yohahaha
Yohahaha previously approved these changes Aug 19, 2024
Copy link
Contributor

@Yohahaha Yohahaha left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thank you!

@zhztheplayer zhztheplayer changed the title [GLUTEN-6368] Redact sensitive configs during logging [GLUTEN-6368] Redact sensitive configs when calling gluten::printConfig Aug 19, 2024
@ArnavBalyan
Copy link
Contributor Author

ArnavBalyan commented Aug 19, 2024

@ArnavBalyan, thanks for your updating! Did you verify that Spark's this config property takes effect in Gluten in a test? BTW, let's use boost regex, as mentioned before. The boost lib has been included in some cmake files. So I guess you just need to add its header and create boost regex parser in the code.

Yes this was tested, have switched to regex boost, we are able to see redaction with more flexible regex expressions thanks!
image

@zhztheplayer
Copy link
Member

Yes this was tested, have switched to regex boost, we are able to see redaction with more flexible regex expressions thanks!

That sounds great. And does that mean we can use the same config key with vanilla Spark ?

@ArnavBalyan
Copy link
Contributor Author

Yes this was tested, have switched to regex boost, we are able to see redaction with more flexible regex expressions thanks!

That sounds great. And does that mean we can use the same config key with vanilla Spark ?

Yes, the same config key (that vanilla spark supports) will now also work with Gluten, no additional changes needed for Gluten thanks

Copy link
Contributor

@PHILO-HE PHILO-HE left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good!

@Yohahaha Yohahaha merged commit 5f7bb92 into apache:main Aug 19, 2024
44 checks passed
sharkdtu pushed a commit to sharkdtu/gluten that referenced this pull request Nov 11, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CORE works for Gluten Core security VELOX
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Velox backend does not apply/enforce spark.redaction.regex config property when logging out spark config
6 participants