Automate pipeline #85

Open
pmayd opened this issue Oct 11, 2023 · 3 comments

pmayd commented Oct 11, 2023

Idea:

  • automate the pipeline when files are uploaded
  • uploading a file to a bucket will trigger a function/container to process this file (see the sketch below)
  • (new) data is automatically ingested into the database
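
A minimal sketch of how the bucket trigger could be wired up with gcloud; the function name, entry point, bucket, runtime, and region below are placeholders, not settings taken from our project:

```bash
# Hypothetical wiring: deploy a Cloud Function that fires whenever an object
# is finalized (uploaded) in the bucket. All names here are placeholders.
gcloud functions deploy process-upload \
  --runtime=python311 \
  --entry-point=process_file \
  --trigger-bucket=a4d-raw-uploads \
  --region=europe-west3
```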
@tasosbada

I think that the automation process should be done in at least two steps.

  1. A container for cleaning the data, based on our data cleaning pipeline in R.
  2. A container that runs a bash script for the data upload and runs BigQuery queries. This script could be stored in GCS, so we could modify it.

For the first container, the files can be found at https://storage.cloud.google.com/a4d-315220-documents/docker-a4d-data-extraction/docker-a4d-data-extraction.zip; documentation is in the readme.md file. The problem is that although it worked for me locally, it could not be deployed on GCP Cloud Run: it crashed due to "devtools". I have not tried it with the latest R version and our current code. A possible solution could be to install the dependencies without "devtools", or to try it on a Kubernetes cluster.
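
One possible way around the "devtools" crash, sketched under the assumption that the package ships a standard DESCRIPTION file: install the dependencies with the lighter "remotes" package in the image build step. The /app path is a placeholder for wherever the source is copied in the Dockerfile.

```bash
# Illustrative replacement for the devtools-based install (e.g. inside the
# Dockerfile's RUN instruction); the /app path is a placeholder.
Rscript -e 'install.packages("remotes", repos = "https://cloud.r-project.org")'
Rscript -e 'remotes::install_deps("/app", dependencies = TRUE)'  # deps from DESCRIPTION
```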

The second container could even be a Cloud Function, but it needs access to our GCS bucket for the bash script. I intend to build a container to test this approach. By keeping the bash script in our GCS bucket, we have the flexibility to adapt it and use this container only as a runtime.
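
A rough sketch of what that runtime could look like; the bucket path, script name, and the example bq commands are made up for illustration, not taken from the actual repository:

```bash
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical entrypoint for the second container: pull the current upload
# script from GCS so it can be changed without rebuilding the image.
gsutil cp gs://a4d-315220-scripts/upload.sh /tmp/upload.sh   # placeholder bucket/object
chmod +x /tmp/upload.sh
/tmp/upload.sh

# The kind of steps upload.sh itself might contain (illustrative only):
#   bq load --autodetect --source_format=CSV my_dataset.my_table gs://my-bucket/clean.csv
#   bq query --use_legacy_sql=false 'SELECT COUNT(*) FROM my_dataset.my_table'
```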

@tasosbada

The Docker image template and the repository for the second step (the container that runs a bash script for the data upload and runs BigQuery queries), together with the instructions, can be found in our bucket. I zipped it and stored it in our bucket in case you want to use it in the future. Information and the step-by-step process can be found in the readme.md file.
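
For reference, building and pushing that image could look roughly like this; the archive name and image tag are placeholders, and the project ID is only inferred from the bucket name above:

```bash
# Hedged sketch: fetch the archive from the bucket, build the image, push it.
# Archive name, project ID, and image tag are placeholders, not the real ones.
gsutil cp gs://a4d-315220-documents/upload-step.zip .
unzip upload-step.zip -d upload-step && cd upload-step
docker build -t gcr.io/a4d-315220/upload-step:latest .
docker push gcr.io/a4d-315220/upload-step:latest
```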

@tasosbada

The zip file contains the necessary files and documentation for building and deploying the data cleaning pipeline on GCP Cloud Run. The problem mentioned above is fixed and the pipeline now runs on Cloud Run. Since I do not have any real input data files it complains about that, but otherwise it generates the log file properly, which means there is no problem in execution and all the packages load correctly.
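
For anyone re-deploying it later, a hedged sketch of the kind of commands involved; the thread does not say whether this runs as a Cloud Run service or a job, and a job would fit a batch pipeline like this, so the job name, image path, and region below are placeholders:

```bash
# Hedged sketch: run the cleaning container as a Cloud Run job and inspect the
# logs it produces; job name, image path, and region are placeholders.
gcloud run jobs create clean-data \
  --image=gcr.io/a4d-315220/clean-data:latest \
  --region=europe-west3
gcloud run jobs execute clean-data --region=europe-west3
gcloud logging read 'resource.type="cloud_run_job"' --limit=20
```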

Please feel free to contact me if you have any questions or need any help.

Best regards.
