
RedshiftException: Something unusual has occurred to cause the driver to fail when writing to Redshift via PySpark #125

Open
RusabKhan opened this issue Apr 3, 2023 · 3 comments


@RusabKhan

RusabKhan commented Apr 3, 2023

Below is my code:

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('redshift')\
    .config("spark.jars.packages","com.google.guava:guava:31.1-jre,com.eclipsesource.minimal-json:minimal-json:0.9.5,com.amazon.redshift:redshift-jdbc42:2.1.0.12,com.google.guava:guava:31.1-jre,com.amazonaws:aws-java-sdk:1.12.439,org.apache.spark:spark-avro_2.12:3.3.2,io.github.spark-redshift-community:spark-redshift_2.12:5.1.0,org.apache.hadoop:hadoop-aws:3.2.2")\
    .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem") \
    .config("spark.hadoop.fs.s3a.access.key", "etc") \
    .config('spark.hadoop.fs.s3a.aws.credentials.provider', 'org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider')\
    .config("spark.hadoop.fs.s3a.secret.key", "etc") \
    .getOrCreate()
    
import os 

os.environ["AWS_ACCESS_KEY_ID"] = "etc"
os.environ["AWS_SECRET_ACCESS_KEY"] = "etc"

df.write \
  .format("io.github.spark_redshift_community.spark.redshift") \
  .option("url", "jdbc:redshift:iam://host:5439/dev?user=user&password=pass") \
  .option("dbtable", "demo") \
  .option("aws_iam_role", "arn:aws:iam::etc") \
  .option("tempdir", "s3a://etc") \
  .mode("append") \
  .save()

This throws the following exception:

py4j.protocol.Py4JJavaError: An error occurred while calling o62.save.
: com.amazon.redshift.util.RedshiftException: Something unusual has occurred to cause the driver to fail. Please report this exception:Unable to load AWS credentials from any provider in the chain: [EnvironmentVariableCredentialsProvider: Unable to load AWS credentials from environment variables (AWS_ACCESS_KEY_ID (or AWS_ACCESS_KEY) and AWS_SECRET_KEY (or AWS_SECRET_ACCESS_KEY)), SystemPropertiesCredentialsProvider: Unable to load AWS credentials from Java system properties (aws.accessKeyId and aws.secretKey), WebIdentityTokenCredentialsProvider: You must specify a value for roleArn and roleSessionName, com.amazonaws.auth.profile.ProfileCredentialsProvider@2c690608: profile file cannot be null, com.amazonaws.auth.EC2ContainerCredentialsProviderWrapper@708c38d5: Failed to connect to service endpoint: ]

I have also tried setting forward_spark_s3_credentials to true. The role has full permissions on Redshift and S3. I have written multiple files and run COPY commands manually via the psycopg2 library; with Spark, however, it doesn't work.

Environment
Spark 3.2.3
Scala 2.12
Pyspark 3.2.3
Java 11
Ubuntu
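
For reference, the provider-chain error above is raised by the Redshift JDBC driver itself: with a jdbc:redshift:iam:// URL, the driver walks the AWS default credential chain to fetch temporary cluster credentials, independently of the S3A settings on the Spark session. A minimal sketch of handing the keys straight to the driver as URL parameters (AccessKeyID, SecretAccessKey, and DbUser are IAM options the Amazon Redshift JDBC driver accepts, if I read its documentation correctly; host, keys, and user names are placeholders carried over from the snippet above):

jdbc_url = (
    "jdbc:redshift:iam://host:5439/dev"
    "?DbUser=user"
    "&AccessKeyID=etc"
    "&SecretAccessKey=etc"
)

df.write \
  .format("io.github.spark_redshift_community.spark.redshift") \
  .option("url", jdbc_url) \
  .option("dbtable", "demo") \
  .option("aws_iam_role", "arn:aws:iam::etc") \
  .option("tempdir", "s3a://etc") \
  .mode("append") \
  .save()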

@88manpreet
Collaborator

What happens when you try com.amazonaws.auth.EnvironmentVariableCredentialsProvider instead?

@RusabKhan
Author

.config('spark.hadoop.fs.s3a.aws.credentials.provider', 'com.amazonaws.auth.EnvironmentVariableCredentialsProvider')

With the above setting, at first it wasn't loading the environment variables set through code:

os.environ["AWS_ACCESS_KEY_ID"] = 'etc'
os.environ["AWS_SECRET_KEY"] = 'etc'

Then I set the variables with the export command in the terminal, retried, and got the error below.

23/04/03 16:44:35 WARN AbstractS3ACommitterFactory: Using standard FileOutputCommitter to commit work. This is slow and potentially unsafe.
23/04/03 16:45:39 ERROR RedshiftWriter: Exception thrown during Redshift load; will roll back transaction
java.sql.SQLException: Exception thrown in awaitResult: 
        at io.github.spark_redshift_community.spark.redshift.JDBCWrapper.executeInterruptibly(RedshiftJDBCWrapper.scala:133)
        at io.github.spark_redshift_community.spark.redshift.JDBCWrapper.executeInterruptibly(RedshiftJDBCWrapper.scala:109)
.......
Caused by: com.amazon.redshift.util.RedshiftException: ERROR: Query cancelled.
        at com.amazon.redshift.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2601)
        at com.amazon.redshift.core.v3.QueryExecutorImpl.processResultsOnThread(QueryExecutorImpl.java:2269)
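
Since authentication got further this time and the COPY itself was cancelled, Redshift may have recorded the underlying load error on its side. A small sketch for pulling the most recent entries from stl_load_errors (a standard Redshift system table) with psycopg2; connection parameters are placeholders:

import psycopg2

# Connection parameters are placeholders; reuse whatever worked for
# your manual COPY runs
conn = psycopg2.connect(host="host", port=5439, dbname="dev",
                        user="user", password="pass")

with conn, conn.cursor() as cur:
    # stl_load_errors keeps per-file reasons for failed COPY loads
    cur.execute("""
        SELECT starttime, filename, err_reason
        FROM stl_load_errors
        ORDER BY starttime DESC
        LIMIT 5;
    """)
    for row in cur.fetchall():
        print(row)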

The database is accessible: I can write to it via the psycopg2 library, and I have even run COPY commands after writing to S3 via boto3.
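
One likely reason the environment-variable provider initially saw nothing: os.environ assignments made after getOrCreate() never reach the JVM, which Spark launches as a child of the Python process, so they have to be set before the session is built (or exported in the shell, as you ended up doing). A minimal sketch under that assumption, which also switches to the connector's forward_spark_s3_credentials option with plain password auth in the URL (df as in the original snippet; the connector documents forward_spark_s3_credentials and aws_iam_role as mutually exclusive, so only one is set):

import os

# Must run before the SparkSession (and its JVM) is created so the
# driver process inherits these variables; values are placeholders
os.environ["AWS_ACCESS_KEY_ID"] = "etc"
os.environ["AWS_SECRET_ACCESS_KEY"] = "etc"

from pyspark.sql import SparkSession

# (plus the spark.jars.packages and s3a configs from the original snippet)
spark = SparkSession.builder.appName("redshift").getOrCreate()

# The connector forwards Spark's S3 credentials to Redshift for the COPY
df.write \
  .format("io.github.spark_redshift_community.spark.redshift") \
  .option("url", "jdbc:redshift://host:5439/dev?user=user&password=pass") \
  .option("dbtable", "demo") \
  .option("tempdir", "s3a://etc") \
  .option("forward_spark_s3_credentials", "true") \
  .mode("append") \
  .save()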

@smoy
Member

smoy commented Aug 31, 2023

@RusabKhan I just turned on secret scanning in the repository, and GitHub reported a possible secret leak from when you initially opened the issue. Please make sure you rotate your tokens if they are still active. Even though you edited the issue, GitHub keeps logs of the original content.
