Create script to produce schema yaml file.
* Use s3cmd to walk ceph files
* Determine partitions by key structure (contains "=")
* Download a parquet file from each leaf directory
* Load parquet file to obtain columns
* Produce table definition yaml file from extracted data
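The partition-detection and leaf-directory steps above can be sketched roughly as follows. This is a minimal illustration, not the code in `gen_table_defs.py`: the helper names `partition_columns` and `leaf_directories` are hypothetical, and it assumes object keys shaped like `data/source=aws/year=2021/file.parquet`, where a "leaf directory" is any key prefix that directly holds a parquet file.

```python
def partition_columns(key: str) -> list[str]:
    """Return partition column names found in an object key.

    A path segment counts as a partition when it contains "=",
    e.g. "year=2021" yields the column "year".
    """
    directories = key.split("/")[:-1]  # drop the file name
    return [seg.split("=", 1)[0] for seg in directories if "=" in seg]


def leaf_directories(keys):
    """Map each directory that holds parquet files to one sample key.

    Only one file per directory is needed, because it is used solely
    to read the column names.
    """
    leaves = {}
    for key in keys:
        if not key.endswith(".parquet"):
            continue
        directory = key.rsplit("/", 1)[0] if "/" in key else ""
        leaves.setdefault(directory, key)  # keep the first file seen
    return leaves


if __name__ == "__main__":
    sample_keys = [
        "data/source=aws/year=2021/a.parquet",
        "data/source=aws/year=2021/b.parquet",
        "data/source=gcp/year=2021/c.parquet",
    ]
    for directory, key in leaf_directories(sample_keys).items():
        print(directory, partition_columns(key))
```

In the actual script the sample file from each leaf directory would then be downloaded and opened to read its columns; that part is omitted here.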
chambridge committed Sep 17, 2021
1 parent 6657a10 commit c5bac39
Showing 6 changed files with 414 additions and 0 deletions.
7 changes: 7 additions & 0 deletions .env.example
@@ -0,0 +1,7 @@
S3_ENDPOINT=endpoint
AWS_ACCESS_KEY=AWS_ACCESS_KEY
AWS_SECRET_KEY=AWS_SECRET_KEY
S3_BUCKET=bucket
S3_BUCKET_PREFIX=data
SCHEMA_NAME=myschema
OUTPUT_FILE=out.yaml
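As a rough sketch of how a script could consume these settings once `.env` is loaded into the environment (pipenv does this for `pipenv shell` and `pipenv run`): the function name `load_settings`, the choice of which variables are required, and the fallback values are all assumptions made for illustration, not taken from `gen_table_defs.py`.

```python
import os

# Assumed to be mandatory; the real script may treat these differently.
REQUIRED = ("S3_ENDPOINT", "AWS_ACCESS_KEY", "AWS_SECRET_KEY", "S3_BUCKET")


def load_settings(env=None):
    """Collect connection settings from environment variables."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise KeyError("missing settings: " + ", ".join(missing))
    return {
        "endpoint": env["S3_ENDPOINT"],
        "access_key": env["AWS_ACCESS_KEY"],
        "secret_key": env["AWS_SECRET_KEY"],
        "bucket": env["S3_BUCKET"],
        # Fallbacks below are illustrative defaults only.
        "prefix": env.get("S3_BUCKET_PREFIX", ""),
        "schema": env.get("SCHEMA_NAME", "default"),
        "output": env.get("OUTPUT_FILE", "out.yaml"),
    }
```

Failing early on a missing value gives a clearer message than a connection error later in the run.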
3 changes: 3 additions & 0 deletions .gitignore
@@ -127,3 +127,6 @@ dmypy.json

# Pyre type checker
.pyre/

# output file
cost-management.yaml
16 changes: 16 additions & 0 deletions Pipfile
@@ -0,0 +1,16 @@
[[source]]
url = "https://pypi.org/simple"
verify_ssl = true
name = "pypi"

[packages]
minio = "*"
pandas = "*"
pyarrow = "*"
s3cmd = "*"
pyyaml = "*"

[dev-packages]

[requires]
python_version = "3.9"
222 changes: 222 additions & 0 deletions Pipfile.lock
50 changes: 50 additions & 0 deletions README.md
@@ -1,2 +1,52 @@
# ceph_trino_schema_gen
Generate a Trino schema from a Ceph S3 bucket


# Getting Started

Start by cloning the repository:
```
git clone https://github.com/chambridge/ceph_trino_schema_gen.git
```

Switch to the new directory:
```
cd ceph_trino_schema_gen
```

Create a Python 3.9 virtual environment:
```
pipenv --python 3.9
pipenv install
```

Copy the example environment file, then edit `.env` with the connection details for your Ceph bucket:
```
cp .env.example .env
```

Enter the virtual environment:
```
pipenv shell
```

Run the Python script:
```
python gen_table_defs.py
```

_Note:_ With Python 3.9 you may encounter the following error if the installed *s3cmd* release still calls `Element.getchildren()`, a method that Python 3.9 removed:
```
AttributeError: 'xml.etree.ElementTree.Element' object has no attribute 'getchildren'
```

To resolve the problem, remove the `.getchildren()` method calls from your local copy of *s3cmd*.
To do this, find the location of `s3cmd` in your virtual environment:
```
which s3cmd
```
The path printed typically ends in `bin/s3cmd`. Open a terminal in the virtual environment directory that contains `bin`, then change to the `S3` site-packages directory:
```
cd lib/python3.9/site-packages/S3/
```
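Remove all occurrences of `.getchildren()` from the code; an element can be iterated, indexed, and measured with `len()` directly, so the calls are not needed. The Python script should now run properly.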
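
The snippet below illustrates why dropping `.getchildren()` is safe. It is a standalone sketch using only the standard library; the XML document is made up for the example and is not real *s3cmd* output.

```python
import xml.etree.ElementTree as ET

# Made-up listing response, for illustration only.
root = ET.fromstring(
    "<ListBucketResult>"
    "<Contents><Key>a.parquet</Key></Contents>"
    "<Contents><Key>b.parquet</Key></Contents>"
    "</ListBucketResult>"
)

# Before (fails on Python 3.9+): root.getchildren()
# After: use the element itself as a sequence of its children.
children = list(root)
keys = [child.find("Key").text for child in root]
first = root[0]

print(len(root), keys)
```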