You can use {{}} to externalize metadata outside the {{}} Spark cluster.
Create an {{}} instance. See {{}}.
Choose the configurations based on your requirements. Make sure to choose Both public & private network for the endpoint configuration. After you have created the instance and the service instance credentials, make a note of the database name, port, user name, password and certificate.
Upload the {{}} certificate to an {{}} bucket where you are maintaining your application code.
To access {{}}, you need to provide a client certificate. Get the Base64 decoded certificate from the service credentials of the {{}} instance and upload the file (name it, say, `postgres.cert`) to a {{}} bucket in a specific IBM Cloud location. Later, you will need to download this certificate and make it available to the Spark workloads of the {{}} instance so that they can connect to the metastore.
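Decoding the certificate can be scripted. The following is a minimal sketch, not an official procedure. It assumes that the service credentials JSON exposes the certificate under `connection.postgres.certificate.certificate_base64` (verify the key path against your own credentials) and that the credentials were saved to a local file. The function names and file paths are hypothetical.

```python
import base64
import json
import sys
from pathlib import Path


def extract_certificate(credentials: dict) -> bytes:
    """Return the decoded certificate bytes from a service credentials dict.

    Assumption: the certificate is stored Base64 encoded under
    connection.postgres.certificate.certificate_base64.
    """
    cert_b64 = credentials["connection"]["postgres"]["certificate"]["certificate_base64"]
    return base64.b64decode(cert_b64)


def write_certificate(credentials_path: str, cert_path: str = "postgres.cert") -> Path:
    """Read the credentials JSON file and write the decoded certificate."""
    credentials = json.loads(Path(credentials_path).read_text())
    out = Path(cert_path)
    out.write_bytes(extract_certificate(credentials))
    return out


if __name__ == "__main__":
    # Usage (hypothetical file name): python decode_cert.py credentials.json
    if len(sys.argv) > 1:
        print(write_certificate(sys.argv[1]))
```

The resulting `postgres.cert` file is the one to upload to the bucket.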
Customize the {{}} instance to include the {{}} certificate. See Script based customization.
This step customizes the {{}} instance to make the {{}} certificate available to all Spark workloads run against the instance through the library set.
Upload the customization script from the page in Script based customization to a {{}} bucket.

Submit the following payload, which uses the spark-submit REST API to customize the instance. Note that the payload references `postgres.cert`, the certificate that you uploaded to {{}}.

```json
{
  "application_details": {
    "application": "/opt/ibm/customization-scripts/",
    "arguments": ["{\"library_set\":{\"action\":\"add\",\"name\":\"certificate_library_set\",\"script\":{\"source\":\"py_files\",\"params\":[\"<CHANGEME>\",\"<CHANGEME_BUCKET_NAME>\",\"postgres.cert\",\"<CHANGEME_ACCESS_KEY>\",\"<CHANGEME_SECRET_KEY>\"]}}}"],
    "py-files": "cos://CHANGEME_BUCKET_NAME.mycosservice/"
  }
}
```
{: codeblock}
Note that the library set name, `certificate_library_set`, must match the value of the {{}} metastore connection parameter `ae.spark.librarysets` that you specified.
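Because the `arguments` entry of the customization payload is itself a JSON document encoded as a string, escaping it by hand is error prone. The sketch below builds the payload with `json.dumps` and checks that the library set name agrees with `ae.spark.librarysets`. The function names are hypothetical, and treating the first script parameter as the Object Storage endpoint is an assumption. The application path and script location are passed in rather than guessed.

```python
import json

LIBRARY_SET_NAME = "certificate_library_set"


def build_customization_payload(application, script_uri, cos_endpoint, bucket,
                                access_key, secret_key,
                                cert_name="postgres.cert",
                                library_set=LIBRARY_SET_NAME):
    """Build the customization payload as a Python dict."""
    library_set_spec = {
        "library_set": {
            "action": "add",
            "name": library_set,
            "script": {
                "source": "py_files",
                # Assumption: the first parameter is the Object Storage endpoint.
                "params": [cos_endpoint, bucket, cert_name, access_key, secret_key],
            },
        }
    }
    return {
        "application_details": {
            "application": application,
            # The nested specification travels as a single JSON-encoded string.
            "arguments": [json.dumps(library_set_spec, separators=(",", ":"))],
            "py-files": script_uri,
        }
    }


def library_set_matches(payload, spark_conf):
    """Return True if the payload's library set name equals ae.spark.librarysets."""
    spec = json.loads(payload["application_details"]["arguments"][0])
    return spec["library_set"]["name"] == spark_conf.get("ae.spark.librarysets")
```

Serializing the returned dict with `json.dumps` yields a request body with the same shape as the payload above, with the inner quotes escaped automatically.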
Specify the following {{}} metastore connection parameters as part of the Spark application payload or as instance defaults. Make sure that you use the private endpoint for the `"spark.hadoop.javax.jdo.option.ConnectionURL"` parameter below:

```json
"spark.hadoop.javax.jdo.option.ConnectionDriverName": "org.postgresql.Driver",
"spark.hadoop.javax.jdo.option.ConnectionUserName": "ibm_cloud_<CHANGEME>",
"spark.hadoop.javax.jdo.option.ConnectionPassword": "<CHANGEME>",
"spark.sql.catalogImplementation": "hive",
"spark.hadoop.hive.metastore.schema.verification": "false",
"spark.hadoop.hive.metastore.schema.verification.record.version": "false",
"spark.hadoop.datanucleus.schema.autoCreateTables": "true",
"spark.hadoop.javax.jdo.option.ConnectionURL": "jdbc:postgresql://<CHANGEME>.databases.appdomain.CHANGEME/ibmclouddb?sslmode=verify-ca&sslrootcert=/home/spark/shared/user-libs/certificate_library_set/custom/postgres.cert&socketTimeout=30",
"ae.spark.librarysets": "certificate_library_set"
```
{: codeblock}
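The connection parameters can also be assembled programmatically, which keeps the certificate path and the library set name consistent. This is a minimal sketch: `build_metastore_conf` is a hypothetical helper, and including an explicit port in the JDBC URL is an assumption based on the port noted with the service credentials.

```python
def build_metastore_conf(private_host, port, user, password,
                         database="ibmclouddb",
                         library_set="certificate_library_set",
                         cert_name="postgres.cert"):
    """Return the metastore connection parameters as a dict of Spark settings."""
    # The library set places the certificate under this directory.
    cert_path = f"/home/spark/shared/user-libs/{library_set}/custom/{cert_name}"
    url = (
        f"jdbc:postgresql://{private_host}:{port}/{database}"
        f"?sslmode=verify-ca&sslrootcert={cert_path}&socketTimeout=30"
    )
    return {
        "spark.hadoop.javax.jdo.option.ConnectionDriverName": "org.postgresql.Driver",
        "spark.hadoop.javax.jdo.option.ConnectionUserName": user,
        "spark.hadoop.javax.jdo.option.ConnectionPassword": password,
        "spark.sql.catalogImplementation": "hive",
        "spark.hadoop.hive.metastore.schema.verification": "false",
        "spark.hadoop.hive.metastore.schema.verification.record.version": "false",
        "spark.hadoop.datanucleus.schema.autoCreateTables": "true",
        "spark.hadoop.javax.jdo.option.ConnectionURL": url,
        "ae.spark.librarysets": library_set,
    }
```

Pass the private endpoint host name as `private_host`. The returned dict can be merged into the configuration section of the application payload or set as instance defaults.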
Set up the Hive metastore schema in the {{}} instance, because the public schema of the {{}} database contains no tables when you create the instance. This step runs the Hive schema DDL, which creates the tables that store the metastore data. After running the following Spark application, you will see the Hive metadata tables created in the `public` schema of the instance.

```python
from pyspark.sql import SparkSession
import time


def init_spark():
    spark = SparkSession.builder.appName("postgres-create-schema").getOrCreate()
    sc = spark.sparkContext
    return spark, sc


def create_schema(spark, sc):
    # Accessing the metastore triggers creation of the Hive schema tables,
    # because datanucleus.schema.autoCreateTables is set to "true".
    tablesDF = spark.sql("SHOW TABLES")
    tablesDF.show()
    time.sleep(30)


def main():
    spark, sc = init_spark()
    create_schema(spark, sc)


if __name__ == '__main__':
    main()
```
{: codeblock}
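To confirm that the schema was created, you can list the tables in the `public` schema with any PostgreSQL client and compare the result against the tables you expect. The sketch below contains only the query text and a pure comparison helper. The default expected names (`DBS`, `TBLS`) are an assumed sample of Hive metastore tables, not a complete list, and which tables exist after this step may vary. The helper name is hypothetical.

```python
# Query to run with a PostgreSQL client of your choice.
TABLES_QUERY = (
    "SELECT table_name FROM information_schema.tables "
    "WHERE table_schema = 'public'"
)

# Assumption: a small sample of Hive metastore table names, not a full list.
EXPECTED_SAMPLE = frozenset({"DBS", "TBLS"})


def missing_metastore_tables(found, expected=EXPECTED_SAMPLE):
    """Return the expected table names that are absent from `found`.

    The comparison ignores case, because identifier case depends on
    how the tables were created.
    """
    found_upper = {name.upper() for name in found}
    return sorted(name for name in expected if name.upper() not in found_upper)
```

An empty list from `missing_metastore_tables` means every expected table name was found in the query result.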
Now run the following script to create a Parquet table with metadata from {{}} in the {{}} database.

```python
from pyspark.sql import SparkSession
import time


def init_spark():
    spark = SparkSession.builder.appName("postgres-create-parquet-table-test").getOrCreate()
    sc = spark.sparkContext
    return spark, sc


def generate_and_store_data(spark, sc):
    data = [("1", "Romania", "Bucharest", "81"), ("2", "France", "Paris", "78"),
            ("3", "Lithuania", "Vilnius", "60"), ("4", "Sweden", "Stockholm", "58"),
            ("5", "Switzerland", "Bern", "51")]
    columns = ["Ranking", "Country", "Capital", "BroadBandSpeed"]
    df = spark.createDataFrame(data, columns)
    # Write the sample data as Parquet files to the bucket.
    df.write.parquet("cos://<CHANGEME-BUCKET>.mycosservice/broadbandspeed")


def create_table_from_data(spark, sc):
    # Register the Parquet files as a table; its metadata is stored in the metastore.
    spark.sql("CREATE TABLE MYPARQUETBBSPEED (Ranking STRING, Country STRING, Capital STRING, BroadBandSpeed STRING) STORED AS PARQUET location 'cos://<CHANGEME-BUCKET>.mycosservice/broadbandspeed/'")
    df2 = spark.sql("SELECT * from MYPARQUETBBSPEED")
    df2.show()  # an action is needed for the query to run and print rows


def main():
    spark, sc = init_spark()
    generate_and_store_data(spark, sc)
    create_table_from_data(spark, sc)
    time.sleep(30)


if __name__ == '__main__':
    main()
```
{: codeblock}
Run the following PySpark script to access this Parquet table, using the metadata in the metastore, from another Spark workload:

```python
from pyspark.sql import SparkSession
import time


def init_spark():
    spark = SparkSession.builder.appName("postgres-select-parquet-table-test").getOrCreate()
    sc = spark.sparkContext
    return spark, sc


def select_data_from_table(spark, sc):
    # The table definition is resolved through the external metastore.
    df = spark.sql("SELECT * from MYPARQUETBBSPEED")
    df.show()  # an action is needed for the query to run and print rows


def main():
    spark, sc = init_spark()
    select_data_from_table(spark, sc)
    time.sleep(60)


if __name__ == '__main__':
    main()
```
{: codeblock}