Note: If you are using a TiSpark version earlier than 2.0, please read this document instead.
pytispark is no longer necessary for TiSpark version 2.0 and later.
There are currently two ways to use TiSpark with Python:
This is the simplest way: a working Spark environment is all you need.

- Make sure you have the latest version of TiSpark and a jar with all of TiSpark's dependencies.
- Remember to add the needed configurations listed in the README to your `$SPARK_HOME/conf/spark-defaults.conf`.
- Copy `./resources/session.py` to `$SPARK_HOME/python/pyspark/sql/session.py`.
- Run this command in your `$SPARK_HOME` directory:

  ```
  ./bin/pyspark --jars /where-ever-it-is/tispark-core-{$version}-jar-with-dependencies.jar
  ```
- To use TiSpark, run these commands:

  ```python
  # Query as you would in spark-shell
  sql("show databases").show()
  sql("use tpch_test")
  sql("show tables").show()
  sql("select count(*) from customer").show()

  # Result:
  # +--------+
  # |count(1)|
  # +--------+
  # |     150|
  # +--------+
  ```
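The needed configurations mentioned above normally include at least the TiExtensions entry and the PD addresses. Below is a minimal sketch of `$SPARK_HOME/conf/spark-defaults.conf`, assuming a single PD instance listening locally on port 2379:

```
spark.sql.extensions        org.apache.spark.sql.TiExtensions
spark.tispark.pd.addresses  127.0.0.1:2379
```

Adjust `spark.tispark.pd.addresses` to a comma-separated list of your actual PD endpoints.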
This way is useful when you want to execute your own Python scripts.
Because of an open issue in Spark 2.3 ([SPARK-25003]), using spark-submit for Python files only supports the following API:
- Create a Python file named `test.py` as below:

  ```python
  from py4j.java_gateway import java_import
  from pyspark.context import SparkContext
  from pyspark.sql import SparkSession

  # Get or create a SparkSession
  spark = SparkSession.builder.getOrCreate()

  # Get a reference to the py4j Java Gateway
  gw = SparkContext._gateway

  # Import TiExtensions
  java_import(gw.jvm, "org.apache.spark.sql.TiExtensions")

  # Inject TiExtensions, and get a TiContext
  ti = gw.jvm.TiExtensions.getInstance(spark._jsparkSession).getOrCreateTiContext(spark._jsparkSession)

  # Map the database, as the old API does
  ti.tidbMapDatabase("tpch_test", False, True)

  # spark.sql("use tpch_test")
  spark.sql("select count(*) from customer").show(20, 200, False)
  ```
- Prepare your TiSpark environment as above and execute:

  ```
  ./bin/spark-submit --jars /where-ever-it-is/tispark-core-{$version}-jar-with-dependencies.jar test.py
  ```

- Result:

  ```
  +--------+
  |count(1)|
  +--------+
  |     150|
  +--------+
  ```
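The jar names in the commands above follow a fixed `tispark-core-<version>-jar-with-dependencies.jar` convention. As a small illustration, here is a hypothetical helper (not part of TiSpark) that assembles the spark-submit argument list for a given version; the `jar_dir` default is a placeholder, as in the examples above:

```python
def spark_submit_args(version, script, jar_dir="/where-ever-it-is"):
    """Build a spark-submit argument list for a TiSpark job.

    The jar name follows the tispark-core-<version>-jar-with-dependencies.jar
    convention used above; jar_dir is a placeholder path.
    """
    jar = "{}/tispark-core-{}-jar-with-dependencies.jar".format(jar_dir, version)
    return ["./bin/spark-submit", "--jars", jar, script]

# Example (the version number here is only illustrative):
print(spark_submit_args("2.1.2", "test.py"))
```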
See pytispark for more information.