Feature/update table #86

Merged
zacdezgeo merged 10 commits into feature/h3ronpy from feature/update-table
Nov 21, 2024

Conversation

zacdezgeo
Collaborator

What I Changed

  1. Added Update Table Logic:
    Implemented a workflow that updates an existing PostgreSQL table with new data from a Parquet file. The workflow:
  • Creates a temporary table from the Parquet file data.
  • Uses PostgreSQL’s ALTER TABLE to add any columns that are not already present in the main table.
  • Runs an UPDATE that synchronizes columns between the temporary and main tables, matching rows on the hex_id column.
    This process minimizes network overhead by transferring only new columns and rows from the Parquet file.
  2. Error Handling for Column Addition:
    Added logic that reverts the new columns in the main table if the update fails, which keeps the data consistent and prevents unintended schema changes.
  3. Column Verification:
    Added checks in verify_columns to ensure the hex_id column exists in the incoming Parquet file, since it is required to match records in the update operation.
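
For reviewers who want a concrete picture of the three changes above, here is a minimal sketch of the SQL such a workflow might issue. It is not the code in this PR: `plan_update`, `check_hex_id_present`, the table name `space2stats_tmp`, and the column names `sum_pop_2020` and `new_metric` are all hypothetical, and the sketch assumes that only the newly added columns are written by the UPDATE. It only builds statement strings, so it runs without a database.

```python
from typing import Dict, List

KEY_COLUMN = "hex_id"  # join key named in this PR


def check_hex_id_present(parquet_columns: List[str]) -> None:
    """Hypothetical stand-in for the hex_id check done in verify_columns."""
    if KEY_COLUMN not in parquet_columns:
        raise ValueError(f"Parquet file must contain a '{KEY_COLUMN}' column")


def plan_update(
    main_table: str,
    temp_table: str,
    main_columns: List[str],
    parquet_columns: Dict[str, str],
) -> Dict[str, List[str]]:
    """Build the PostgreSQL statements for one update run.

    parquet_columns maps column name -> PostgreSQL type. Identifiers are
    interpolated directly for readability; real code should compose them
    with the database driver's identifier-quoting helpers instead.
    """
    check_hex_id_present(list(parquet_columns))

    # Columns present in the Parquet file but missing from the main table.
    new_cols = {n: t for n, t in parquet_columns.items() if n not in main_columns}

    add = [
        f'ALTER TABLE {main_table} ADD COLUMN IF NOT EXISTS "{n}" {t}'
        for n, t in new_cols.items()
    ]

    update: List[str] = []
    if new_cols:
        # Assumption: only the new columns are copied from the temp table.
        set_clause = ", ".join(f'"{n}" = t."{n}"' for n in new_cols)
        update.append(
            f"UPDATE {main_table} AS m SET {set_clause} "
            f"FROM {temp_table} AS t "
            f'WHERE m."{KEY_COLUMN}" = t."{KEY_COLUMN}"'
        )

    # Statements to run if the update fails, undoing the schema change.
    revert = [
        f'ALTER TABLE {main_table} DROP COLUMN IF EXISTS "{n}"' for n in new_cols
    ]
    return {"add": add, "update": update, "revert": revert}


plan = plan_update(
    "space2stats",
    "space2stats_tmp",
    main_columns=["hex_id", "sum_pop_2020"],
    parquet_columns={
        "hex_id": "TEXT",
        "sum_pop_2020": "DOUBLE PRECISION",
        "new_metric": "DOUBLE PRECISION",
    },
)
for statement in plan["add"] + plan["update"]:
    print(statement)
```

Keeping the revert statements next to the add statements makes the rollback path easy to inspect in a test. Since PostgreSQL DDL is transactional, running the ALTER TABLE and UPDATE inside a single transaction would be another way to get the same guarantee.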

How to Test It

  1. Run Unit Tests:
    The test suite now includes unit tests in test_ingest.py that cover:
  • Basic ingestion of data when the table does not exist.
  • Update operations with new columns.
  • Behavior when columns already exist in the base table.
  • Enforcement of the mandatory hex_id column.
  • Rollback behavior if the update fails mid-operation.
  2. Manual Verification:
    The following steps test the update process manually by ingesting two different datasets into the database:
  • Spin up database with docker:
docker-compose up
  • Download the initial dataset:
aws s3 cp s3://wbg-geography01/Space2Stats/parquet/GLOBAL/space2stats.parquet .
download: s3://wbg-geography01/Space2Stats/parquet/GLOBAL/space2stats.parquet to ./space2stats.parquet
  • Upload initial dataset:
space2stats-ingest <connection_string> ./space2stats_ingest/METADATA/stac/space2stats/space2stats_population_2020/space2stats_population_2020.json space2stats.parquet
  • Generate the second dataset:
python space2stats_ingest/METADATA/generate_test_data.py 
  • Upload the second dataset:
space2stats-ingest <connection_string> ./space2stats_ingest/METADATA/stac/space2stats/space2stats_population_2020/space2stats_reupload_test.json space2stats_test.parquet 

Other Notes

  • Database-Specific Update Tuning: The performance of this update process is highly dependent on the database configuration and environment. Different configurations across machines and network setups can significantly impact ingestion and update performance.
  • Remote Development Database: We set up a remote development database to limit the impacts of database tuning and streamline testing. This setup would simplify development by allowing everyone to work with the same database configuration, reducing the need for local database setup and tuning.

Remove download commands as we move to multiple files approach. Update validation of STAC metadata.
Using database approach to handle merge causes issues
Still has issues with reading database table which implies some performance issues
@zacdezgeo zacdezgeo requested review from bitner and alukach November 6, 2024 22:29
@zacdezgeo zacdezgeo self-assigned this Nov 6, 2024
@zacdezgeo zacdezgeo merged commit 8588b2e into feature/h3ronpy Nov 21, 2024
2 checks passed
@zacdezgeo zacdezgeo deleted the feature/update-table branch November 21, 2024 10:14
@zacdezgeo zacdezgeo restored the feature/update-table branch November 22, 2024 09:51