
Case sensitive column identifiers #107

Open
giacomochiarella opened this issue May 4, 2022 · 4 comments

Comments

@giacomochiarella commented May 4, 2022

Hi everyone! I'm new to Redshift and I'm using pyspark to load tables. The spark-redshift jar works very well except for one thing. I'm loading tables that unfortunately have some columns whose names differ only in case, like username and userName. Unfortunately I do not have any control over this, as they are external sources.
I've tried Redshift itself and it seems it can handle the case sensitivity. I've been able to create the following table:
CREATE TABLE blabla.a ( unapplied numeric(38, 18) NULL, "unApplied" numeric(38, 18) NULL );
after enabling the feature with SET enable_case_sensitive_identifier TO true;.
The RedshiftWriter does not seem to account for this here.
It would be very useful if we could enable/disable this feature via an option.

Do you know any workaround?
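For illustration, the requested knob might look like this from the write side (sketched in Scala; the pyspark call chain is the same). The case_sensitive_identifiers option name is hypothetical and does not exist today, and the data source name and connection options depend on the spark-redshift version and setup:

```scala
import org.apache.spark.sql.{DataFrame, SaveMode}

// Sketch only: "case_sensitive_identifiers" is a hypothetical option name, and the
// format/connection options depend on your spark-redshift build and cluster setup.
def writeWithCaseSensitiveColumns(df: DataFrame): Unit = {
  df.write
    .format("io.github.spark_redshift_community.spark.redshift")
    .option("url", "jdbc:redshift://example-cluster:5439/dev?user=...&password=...")
    .option("dbtable", "blabla.a")
    .option("tempdir", "s3a://my-bucket/spark-redshift-tmp/")
    .option("case_sensitive_identifiers", "true") // hypothetical new writer option
    .mode(SaveMode.Append)
    .save()
}
```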

@jsleight (Collaborator)

From the link you provided, it seems like spark-redshift would need a PR to optionally skip the check you linked. If you're up for submitting a PR, I'm happy to help review and ship it 😄. Probably the best approach is to add a new writer option that enables case-sensitive identifiers.

As a workaround, I think you could load the table into Redshift with different column names and then use AWS tooling (e.g. boto3, the AWS CLI, etc.) to alter the Redshift table afterwards (AWS examples).
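For example, that rename step could be scripted against the cluster once the Spark load finishes. A rough sketch, shown here over a plain JDBC connection rather than boto3 or the AWS CLI; the temporary column name, table, and connection details are examples only:

```scala
import java.sql.DriverManager

// Sketch of the post-load rename: the Spark job writes the colliding column under a
// distinct lowercase name (e.g. unapplied_2), then a case-sensitive session renames it.
object RenameCaseSensitiveColumn {
  def main(args: Array[String]): Unit = {
    val conn = DriverManager.getConnection(
      "jdbc:redshift://example-cluster:5439/dev", "user", "password")
    try {
      val stmt = conn.createStatement()
      // Make identifiers case sensitive for this session only.
      stmt.execute("SET enable_case_sensitive_identifier TO true")
      // Restore the original mixed-case column name.
      stmt.execute("ALTER TABLE blabla.a RENAME COLUMN unapplied_2 TO \"unApplied\"")
    } finally {
      conn.close()
    }
  }
}
```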

@giacobbino

Good to hear. What I did so far is remove the toLowerCase here, which makes things work.
I can make the change. I have only one question: is it enough to add the parameter to Parameter.scala and pass it to the RedshiftWriter.unloadData function? Or do I need to do anything else to get the parameter passed from the Spark session into the MergedParameters input map?
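For reference, the shape of the change being discussed, sketched against a simplified stand-in for the real Parameters.scala; the option name and default are hypothetical, not the actual source:

```scala
// Simplified stand-in, not the actual spark-redshift source: the merged parameter
// map gets a typed accessor with a safe default, and the writer consults it instead
// of unconditionally lowercasing column identifiers.
case class MergedParameters(parameters: Map[String, String]) {
  /** Hypothetical option: keep the original case of column identifiers when writing. */
  def caseSensitiveIdentifiers: Boolean =
    parameters.getOrElse("case_sensitive_identifiers", "false").toBoolean
}
```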

@jsleight (Collaborator)

Is it enough to add the parameter to Parameter.scala and pass it to the RedshiftWriter.unloadData function?

I think that is enough. IIRC, the params get passed in to the writer options in DefaultSource via subclassing one of the org.apache.spark.sql.sources interfaces.
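Concretely, the write-path interface in org.apache.spark.sql.sources is CreatableRelationProvider: every key/value pair set via .option(...) on the writer arrives in its parameters map, so a new option reaches the connector without extra plumbing on the Spark side. A minimal sketch, not the actual spark-redshift DefaultSource:

```scala
import org.apache.spark.sql.{DataFrame, SQLContext, SaveMode}
import org.apache.spark.sql.sources.{BaseRelation, CreatableRelationProvider}

// Minimal sketch of the Spark write-path entry point: Spark hands every
// .option(...) key/value pair to createRelation via the `parameters` map.
class DefaultSource extends CreatableRelationProvider {
  override def createRelation(
      sqlContext: SQLContext,
      mode: SaveMode,
      parameters: Map[String, String], // all writer options land here
      data: DataFrame): BaseRelation = {
    // A new writer option is already present in `parameters`; it just needs to be
    // merged into MergedParameters and passed down to RedshiftWriter.
    ???
  }
}
```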

@giacobbino

PR open #108
