Note: If you are using a TiSpark version earlier than 2.0, please read this document instead.
pytispark is no longer necessary for TiSpark version 2.0 and later.
There are currently two ways to use TiSpark with Python:
This is the simplest way: a working Spark environment is all you need.

- Make sure you have the latest version of TiSpark and a jar with all of TiSpark's dependencies.
- Remember to add the needed configurations listed in the README to your `$SPARK_HOME/conf/spark-defaults.conf`.
- Copy `./resources/session.py` to `$SPARK_HOME/python/pyspark/sql/session.py`.
- Run this command in your `$SPARK_HOME` directory:

  ```
  ./bin/pyspark --jars /where-ever-it-is/tispark-core-{$version}-jar-with-dependencies.jar
  ```
- To use TiSpark, run these commands:

  ```python
  # Query as you would in spark-shell
  sql("show databases").show()
  sql("use tpch_test")
  sql("show tables").show()
  sql("select count(*) from customer").show()

  # Result:
  # +--------+
  # |count(1)|
  # +--------+
  # |     150|
  # +--------+
  ```
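The needed configurations mentioned above normally include at least the TiExtensions entry and the PD addresses. Below is a minimal sketch of `$SPARK_HOME/conf/spark-defaults.conf`, assuming a single PD instance listening locally on port 2379:

```
spark.sql.extensions        org.apache.spark.sql.TiExtensions
spark.tispark.pd.addresses  127.0.0.1:2379
```

Adjust `spark.tispark.pd.addresses` to a comma-separated list of your actual PD endpoints.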
This way is useful when you want to execute your own Python scripts.
Because of an open issue in Spark 2.3 ([SPARK-25003]), using spark-submit for Python files only supports the following API:
- Create a Python file named `test.py` as below:

  ```python
  from py4j.java_gateway import java_import
  from pyspark.context import SparkContext
  from pyspark.sql import SparkSession

  # Get or create a SparkSession
  spark = SparkSession.builder.getOrCreate()

  # Get a reference to the py4j Java Gateway
  gw = SparkContext._gateway

  # Import TiExtensions
  java_import(gw.jvm, "org.apache.spark.sql.TiExtensions")

  # Inject TiExtensions, and get a TiContext
  ti = gw.jvm.TiExtensions.getInstance(spark._jsparkSession).getOrCreateTiContext(spark._jsparkSession)

  # Map the database, as the old API does
  ti.tidbMapDatabase("tpch_test", False, True)

  # spark.sql("use tpch_test")
  spark.sql("select count(*) from customer").show(20, 200, False)
  ```
- Prepare your TiSpark environment as above and execute:

  ```
  ./bin/spark-submit --jars /where-ever-it-is/tispark-core-{$version}-jar-with-dependencies.jar test.py
  ```

- Result:

  ```
  +--------+
  |count(1)|
  +--------+
  |     150|
  +--------+
  ```
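The jar names in the commands above follow a fixed `tispark-core-<version>-jar-with-dependencies.jar` convention. As a small illustration, here is a hypothetical helper (not part of TiSpark) that assembles the spark-submit argument list for a given version; the `jar_dir` default is a placeholder, as in the examples above:

```python
def spark_submit_args(version, script, jar_dir="/where-ever-it-is"):
    """Build a spark-submit argument list for a TiSpark job.

    The jar name follows the tispark-core-<version>-jar-with-dependencies.jar
    convention used above; jar_dir is a placeholder path.
    """
    jar = "{}/tispark-core-{}-jar-with-dependencies.jar".format(jar_dir, version)
    return ["./bin/spark-submit", "--jars", jar, script]

# Example (the version number here is only illustrative):
print(spark_submit_args("2.1.2", "test.py"))
```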
See pytispark for more information.