allisonwang-db/pyspark-data-sources

pyspark-data-sources

This repository showcases custom Spark data sources built with the new Python Data Source API, introduced for the upcoming Apache Spark 4.0 release. For an in-depth understanding of the API, refer to the API source code. This repository is for demonstration only and is not intended for production use. Contributions and feedback are welcome to help improve the examples.

Installation

pip install "pyspark-data-sources[all]"

Usage

Install the PySpark 4.0 preview version: https://pypi.org/project/pyspark/4.0.0.dev1/

pip install "pyspark[connect]==4.0.0.dev1"

Or use Databricks Runtime 15.2 or above.

Try the data sources!

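from pyspark.sql import SparkSession
from pyspark_datasources.github import GithubDataSource

# Reuse the active session (e.g. on Databricks) or create a local one
spark = SparkSession.builder.getOrCreate()

# Register the data source
spark.dataSource.register(GithubDataSource)

# Load data for the apache/spark repository and display it
spark.read.format("github").load("apache/spark").show()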
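
To give a feel for what a data source built on the Python Data Source API looks like, here is a minimal sketch. It is not taken from this repository: `CounterDataSource`, `CounterReader`, the `counter` format name, and the `n` option are all hypothetical names chosen for illustration. The sketch assumes the `DataSource` and `DataSourceReader` base classes from `pyspark.sql.datasource` in Spark 4.0, and falls back to small stand-in classes when that module is unavailable so the reader logic can still be run on its own.

```python
try:
    from pyspark.sql.datasource import DataSource, DataSourceReader
except ImportError:
    # Stand-in base classes (assumption: only used when PySpark 4.0 is not
    # installed) so the reader logic below can be exercised without Spark.
    class DataSource:
        def __init__(self, options):
            self.options = options

    class DataSourceReader:
        pass


class CounterDataSource(DataSource):
    """Hypothetical source that emits the integers 0..n-1 and their squares."""

    @classmethod
    def name(cls):
        # Short name used with spark.read.format(...)
        return "counter"

    def schema(self):
        # DDL string describing the output columns
        return "id int, squared int"

    def reader(self, schema):
        # Hand the user-supplied options to the reader
        return CounterReader(self.options)


class CounterReader(DataSourceReader):
    def __init__(self, options):
        # Option values arrive as strings, so convert explicitly
        self.n = int(options.get("n", 3))

    def read(self, partition):
        # Yield one tuple per row, in the same order as the schema columns
        for i in range(self.n):
            yield (i, i * i)
```

With a Spark 4.0 session available, a source like this would be registered the same way as above, with spark.dataSource.register(CounterDataSource), and then read with spark.read.format("counter").option("n", 4).load().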

See more here: https://allisonwang-db.github.io/pyspark-data-sources/.

Contributing

We welcome and appreciate any contributions to enhance and expand the custom data sources. If you're interested in contributing:

  • Add New Data Sources: Want to add a new data source using the Python Data Source API? Submit a pull request or open an issue.
  • Suggest Enhancements: If you have ideas to improve a data source or the API, we'd love to hear them!
  • Report Bugs: Found something that doesn't work as expected? Let us know by opening an issue.

Need help or have questions? Don't hesitate to open a new issue, and we'll do our best to assist you.

Development

poetry shell

Build docs

mkdocs serve
