A package to manage Google Cloud Data Catalog Fileset scripts.
Disclaimer: This is not an officially supported Google product.
- Executing in Cloud Shell
- 1. Environment setup
- 2. Create Filesets from CSV file
# Set your service account key file; for instructions, see 1.3. Auth credentials
# This file name is just a suggestion; feel free to follow your own naming conventions
export GOOGLE_APPLICATION_CREDENTIALS=~/datacatalog-fileset-processor-sa.json
# Install datacatalog-fileset-processor
pip3 install datacatalog-fileset-processor --user
# Add to your PATH
export PATH=~/.local/bin:$PATH
# Look for available commands
datacatalog-fileset-processor --help
Using virtualenv is optional, but strongly recommended unless you use Docker.
git clone https://github.com/mesmacosta/datacatalog-fileset-processor
cd ./datacatalog-fileset-processor
All paths starting with ./ in the next steps are relative to the datacatalog-fileset-processor folder.
pip install --upgrade virtualenv
python3 -m virtualenv --python python3 env
source ./env/bin/activate
pip install --upgrade .
Docker may be used as an alternative to run the script. In this case, please disregard the Virtualenv setup instructions.
- Data Catalog Admin
This file name is just a suggestion; feel free to follow your own naming conventions:
./credentials/datacatalog-fileset-processor-sa.json
This step may be skipped if you're using Docker.
export GOOGLE_APPLICATION_CREDENTIALS=~/credentials/datacatalog-fileset-processor-sa.json
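A quick way to confirm that the variable is set and points at a real file before running any command is sketched below. This is not part of the package: `credentials_problem` is a hypothetical helper name, and it only checks that the file exists, not that the key is valid or has the required role.

```python
import os
from pathlib import Path


def credentials_problem(env=os.environ):
    """Return a short problem description, or None if the key file is found.

    Hypothetical helper, not provided by datacatalog-fileset-processor.
    """
    value = env.get("GOOGLE_APPLICATION_CREDENTIALS")
    if not value:
        return "GOOGLE_APPLICATION_CREDENTIALS is not set"
    # Expand a leading ~ in case the value was stored unexpanded.
    path = Path(value).expanduser()
    if not path.is_file():
        return f"no file found at {path}"
    return None
```

Calling `print(credentials_problem() or "credentials file found")` from the same shell session that ran the `export` reports whether the path resolves.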
Filesets are composed of as many lines as required to represent all of their fields. The columns are described as follows:
Column | Description | Mandatory |
---|---|---|
entry_group_name | Entry Group Name. | Y |
entry_group_display_name | Entry Group Display Name. | N |
entry_group_description | Entry Group Description. | N |
entry_id | Entry ID. | Y |
entry_display_name | Entry Display Name. | Y |
entry_description | Entry Description. | N |
entry_file_patterns | Entry File Patterns. | Y |
schema_column_name | Schema column name. | N |
schema_column_type | Schema column type. | N |
schema_column_description | Schema column description. | N |
schema_column_mode | Schema column mode. | N |
Please note that schema_column_type is an open string field and accepts any value. If you want to use your fileset with Dataflow SQL, follow the data types listed in the official docs.
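The column layout above can be produced with Python's standard `csv` module. The sketch below is illustrative only: every value (entry group, entry ID, file pattern, column names and types) is a placeholder, and it assumes that the fileset-level values are repeated on each schema row. The files under sample-input/create-filesets remain the authority on the exact value formats the tool expects.

```python
import csv
import io

# Column order taken from the table above.
COLUMNS = [
    "entry_group_name", "entry_group_display_name", "entry_group_description",
    "entry_id", "entry_display_name", "entry_description",
    "entry_file_patterns", "schema_column_name", "schema_column_type",
    "schema_column_description", "schema_column_mode",
]

# Placeholder values (hypothetical); only the mandatory columns are filled in.
FILESET = {
    "entry_group_name": "my_entry_group",
    "entry_id": "my_fileset",
    "entry_display_name": "My fileset",
    "entry_file_patterns": "gs://my_bucket/*.csv",
}
SCHEMA = [
    {"schema_column_name": "first_name", "schema_column_type": "STRING"},
    {"schema_column_name": "age", "schema_column_type": "INT64"},
]


def to_csv(fileset, schema):
    """Write one row per schema column, or a single row if there is no schema."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=COLUMNS, restval="", lineterminator="\n"
    )
    writer.writeheader()
    for column in schema or [{}]:
        # Assumption: fileset-level values are repeated on every row.
        writer.writerow({**fileset, **column})
    return buffer.getvalue()


CSV_TEXT = to_csv(FILESET, SCHEMA)
```

Writing `CSV_TEXT` to a file gives a header line plus one line per schema column; optional columns that were not supplied are left empty.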
- Python + virtualenv
datacatalog-fileset-processor filesets create --csv-file CSV_FILE_PATH
- Python + virtualenv
datacatalog-fileset-processor filesets delete --csv-file CSV_FILE_PATH
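Before passing a CSV file to the create command above, it may be worth checking that the columns marked mandatory in the table are filled in. The following is a minimal sketch, not something the package provides: `missing_mandatory` is a hypothetical name, and it assumes that every row must carry the mandatory values, which the tool itself may not require.

```python
import csv
import io

# Columns marked "Y" in the Mandatory column of the table above.
MANDATORY = [
    "entry_group_name",
    "entry_id",
    "entry_display_name",
    "entry_file_patterns",
]


def missing_mandatory(csv_file):
    """Return (line_number, column) pairs for blank or absent mandatory values.

    Line numbers assume one physical line per record, with the header on line 1.
    """
    problems = []
    reader = csv.DictReader(csv_file)
    for line_number, row in enumerate(reader, start=2):
        for column in MANDATORY:
            # row.get() returns None when the header lacks the column.
            if not (row.get(column) or "").strip():
                problems.append((line_number, column))
    return problems
```

For a file on disk, `missing_mandatory(open(CSV_FILE_PATH, newline=""))` returns an empty list when nothing is missing.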
TIPS
- See sample-input/create-filesets for reference.
- To create filesets without a schema, see sample-input/create-filesets/fileset-entry-opt-1-all-metadata-no-schema.csv for reference.