Sk ez pp/hist data load #2730

Merged: 65 commits, Nov 9, 2023
Changes shown are from 49 of the 65 commits.

Commits
8e387fb  Initial commit  (Nov 2, 2023)
cd7dc1e  Initial commit  (Nov 2, 2023)
cd4b87b  Initial commit  (Nov 2, 2023)
1adc362  Initial commit  (Nov 2, 2023)
54ba1df  Initial commit  (Nov 3, 2023)
ff307ec  Initial commit  (Nov 3, 2023)
b8498e5  Initial commit  (Nov 3, 2023)
1138a7c  Initial commit  (Nov 3, 2023)
3c77f09  Removed comments  (Nov 3, 2023)
2547ef4  Added census_to_gsafac  (Nov 3, 2023)
9173ba6  Initial commit  (Nov 3, 2023)
352456e  Initial commit  (Nov 3, 2023)
988a111  Initial commit  (Nov 3, 2023)
71ad2cc  Initial commit  (Nov 3, 2023)
d6438e6  Initial commit  (Nov 3, 2023)
ae5ebc1  Added procedure to load test Census data to postgres  (Nov 3, 2023)
356a3d3  Excluding workbook loader  (Nov 3, 2023)
47d034f  Excluding workbook loader  (Nov 3, 2023)
4f4f90c  Excluding load_raw  (Nov 3, 2023)
8db3fe2  Updates  (Nov 3, 2023)
2efada7  Added c2g-db  (Nov 3, 2023)
9532f1e  Added c2g-db  (Nov 3, 2023)
31c4b7d  Merge branch 'main' of github.com:GSA-TTS/FAC into sk_ez_pp/hist_data…  (Nov 6, 2023)
984bff5  Replaced c2g with census_to_gsafac, renamed raw_to_pg.py as csv_to_po…  (Nov 6, 2023)
5d45686  Replaced c2g with census_to_gsafac, renamed raw_to_pg.py as csv_to_po…  (Nov 6, 2023)
f132d64  Replaced c2g with census_to_gsafac, renamed raw_to_pg.py as csv_to_po…  (Nov 6, 2023)
cf9435b  Replaced c2g with census_to_gsafac, renamed raw_to_pg.py as csv_to_po…  (Nov 6, 2023)
3343124  Replaced c2g with census_to_gsafac, renamed raw_to_pg.py as csv_to_po…  (Nov 6, 2023)
5e3e744  Replaced c2g with census_to_gsafac, renamed raw_to_pg.py as csv_to_po…  (Nov 6, 2023)
2d19a46  Replaced c2g with census_to_gsafac, renamed raw_to_pg.py as csv_to_po…  (Nov 6, 2023)
047a451  Apply suggestions from code review  (purvinptl, Nov 6, 2023)
19f633f  Added census-to-gsafac database  (Nov 6, 2023)
3914831  Merge branch 'sk_ez_pp/hist_data_load' of github.com:GSA-TTS/FAC into…  (Nov 6, 2023)
dce6318  Replaced c2g with census-to-gsafac  (Nov 6, 2023)
294e5c5  Fix linting  (Nov 6, 2023)
cb940dc  Fix linting  (Nov 6, 2023)
1db2f1e  Fix linting  (Nov 6, 2023)
5a47524  Fix linting  (Nov 6, 2023)
6648f0a  Fix linting  (Nov 6, 2023)
e681aff  Reformatted with black  (Nov 6, 2023)
67bb9a8  Reformatted with black  (Nov 6, 2023)
89624c4  Reformatted with black  (Nov 6, 2023)
e8b91e0  Updated S3 bucket name and filename  (Nov 6, 2023)
d96004f  Updated S3 bucket name and filename  (Nov 6, 2023)
2e28f14  Updates  (Nov 7, 2023)
9b0beb4  Consolidated census_to_gsafac and census_historical_migration apps  (sambodeme, Nov 7, 2023)
927a80d  Django migration  (sambodeme, Nov 7, 2023)
a2aaffe  Telling mypy to ignore django migration files  (sambodeme, Nov 7, 2023)
a2a58d3  Linting  (sambodeme, Nov 7, 2023)
b11107c  Incorporated chunking capabilities from Alternative suggestion for lo…  (Nov 7, 2023)
414cbc0  Incorporated chunking capabilities from Alternative suggestion for lo…  (Nov 7, 2023)
a89982e  Moving fac_s3.py to support/management/commands/  (Nov 7, 2023)
a21e55b  Moving fac_s3.py to support/management/commands/  (Nov 7, 2023)
b09bd22  Added load_data function  (gsa-suk, Nov 8, 2023)
e38b395  Tested load_data  (Nov 8, 2023)
983f7de  Merge branch 'main' of github.com:GSA-TTS/FAC into sk_ez_pp/hist_data…  (gsa-suk, Nov 8, 2023)
6003868  Removed import botocore  (gsa-suk, Nov 8, 2023)
d8ab3cb  Removed import botocore  (gsa-suk, Nov 8, 2023)
f951ea2  refactored csv_to_postgres.py  (Nov 8, 2023)
4f03d7a  added chunk-size arguments  (Nov 8, 2023)
13736ad  added help comments for load_data  (Nov 8, 2023)
e3f90a1  Code cleaning  (sambodeme, Nov 8, 2023)
653ec97  Renamed chunk-size to chunksize  (gsa-suk, Nov 9, 2023)
7245368  Added chunksize argument  (gsa-suk, Nov 9, 2023)
7396ca8  Merge branch 'main' of github.com:GSA-TTS/FAC into sk_ez_pp/hist_data…  (gsa-suk, Nov 9, 2023)
11 changes: 7 additions & 4 deletions backend/Apple_M1_Dockerfile
@@ -21,10 +21,13 @@ RUN \

 RUN \
     apt-get update -yq && \
-    apt install curl -y && \
-    apt-get install -y gcc && \
-    curl -fsSL https://deb.nodesource.com/setup_16.x | bash - && \
-    apt-get install -y nodejs && \
+    apt install build-essential curl -y && \
+    apt-get install -y gcc ca-certificates gnupg && \
+    mkdir -p /etc/apt/keyrings && \
+    curl -fsSL https://deb.nodesource.com/gpgkey/nodesource-repo.gpg.key | gpg --dearmor -o /etc/apt/keyrings/nodesource.gpg && \
+    NODE_MAJOR=18 && \
+    echo "deb [signed-by=/etc/apt/keyrings/nodesource.gpg] https://deb.nodesource.com/node_$NODE_MAJOR.x nodistro main" | tee /etc/apt/sources.list.d/nodesource.list && \
+    apt-get install nodejs -y && \
     apt-get install -y npm && \
     npm i -g npm@^8

58 changes: 56 additions & 2 deletions backend/census_historical_migration/README.md
@@ -1,4 +1,58 @@
# Census Historical Migration
# Census to FAC data migration

## Overview

This is implemented as a Django app to leverage existing management commands and settings. It includes Python and shell scripts to:

* Load raw census data as CSV files into an S3 bucket
* Create Postgres tables from these CSV files
* Perform any data cleanup required to create a table from a CSV file
* Run the historic data migrator
* Run the historic workbook generator

## Infrastructure changes

* Create a new S3 bucket in Cloud.gov spaces as well as in the local environment
* Create a new Postgres instance, both in Cloud.gov and locally (see the settings sketch below)

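A minimal sketch of what the local settings for these additions might look like. Only `AWS_CENSUS_TO_GSAFAC_BUCKET_NAME` and the bucket name `fac-census-to-gsafac-s3` appear elsewhere in this PR; the database alias, host, and other values below are illustrative assumptions.

```python
# settings.py (sketch only; alias, host, and credential values are assumptions)
DATABASES = {
    # Existing FAC application database
    "default": {
        "ENGINE": "django.db.backends.postgresql",
        "NAME": "postgres",
        "HOST": "db",
        "PORT": 5432,
    },
    # New instance that holds the raw Census tables
    "census-to-gsafac-db": {
        "ENGINE": "django.db.backends.postgresql",
        "NAME": "postgres",
        "HOST": "census-to-gsafac-db",
        "PORT": 5432,
    },
}

# Bucket read and written by the fac_s3 and csv_to_postgres commands;
# assumed to match the bucket name used in the examples below.
AWS_CENSUS_TO_GSAFAC_BUCKET_NAME = "fac-census-to-gsafac-s3"
```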
## Utilities

* fac_s3.py - Uploads folders or files to an S3 bucket.

```bash
python manage.py fac_s3 fac-census-to-gsafac-s3 --upload --src census_historical_migration/data
```

* csv_to_postgres.py - Inserts data into Postgres tables using the contents of the CSV files in the S3 bucket. The first row of each file is assumed to hold the column names (which we convert to lowercase). The target table is determined by examining the name of the file (see the sketch after the usage example below). The sample source files do not have delimiters for empty fields at the end of a line, so we assume these are nulls.

```bash
python manage.py csv_to_postgres --folder data
python manage.py csv_to_postgres --clean True
```
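As a condensed sketch of the name matching described above (the CSV file name and model names here are hypothetical examples; the full management command appears later in this PR):

```python
# Sketch of how csv_to_postgres picks a model/table for an S3 key.
def get_model_name(key, model_names):
    # "data/ELECAUDITHEADER.csv" -> "ELECAUDITHEADER"
    file_name = key.split("/")[-1].split(".")[0]
    for model_name in model_names:
        # Model names are lowercase, so compare against the lowercased file name.
        if file_name.lower().startswith(model_name):
            return model_name
    return None


print(get_model_name("data/ELECAUDITHEADER.csv", ["elecauditheader", "elecaudits"]))
# -> elecauditheader
```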

* models.py - Defines the models corresponding to the incoming CSV files.
* routers.py - Tells Django to use a different Postgres instance for this app's models (see the router sketch below).
* data - A folder that contains sample data that we can use for development.

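A rough illustration of what a database router for this app can look like. The class name and database alias below are assumptions, not the actual contents of routers.py in this PR:

```python
# Hypothetical sketch of a Django database router; names are illustrative.
class CensusToGsafacRouter:
    app_label = "census_historical_migration"
    db_alias = "census-to-gsafac-db"  # assumed alias of the second Postgres instance

    def db_for_read(self, model, **hints):
        # Route reads for this app's models to the census-to-gsafac database.
        return self.db_alias if model._meta.app_label == self.app_label else None

    def db_for_write(self, model, **hints):
        return self.db_alias if model._meta.app_label == self.app_label else None

    def allow_migrate(self, db, app_label, model_name=None, **hints):
        # Keep this app's tables in the second database and out of the default one.
        if app_label == self.app_label:
            return db == self.db_alias
        return None
```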
## Prerequisites

* A Django app that reads the tables created here as unmanaged models and populates SF-SAC tables by creating workbooks, etc., to simulate a real submission (see the unmanaged-model sketch below)

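For illustration, such an app could read one of these tables through an unmanaged model. The model, table, and column names below are assumptions; only `dbkey` appears elsewhere in this document:

```python
from django.db import models


class CensusGen(models.Model):
    # Hypothetical unmanaged model over a table loaded by csv_to_postgres;
    # the table and column names are assumptions for illustration.
    dbkey = models.TextField(blank=True, null=True)
    audityear = models.TextField(blank=True, null=True)
    auditeename = models.TextField(blank=True, null=True)

    class Meta:
        managed = False  # Django never creates, migrates, or drops this table
        db_table = "census_gen"
```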
## How to load test Census data into Postgres

1. Download test Census data from https://drive.google.com/drive/folders/1TY-7yWsMd8DsVEXvwrEe_oWW1iR2sGoy into the census_historical_migration/data folder.
NOTE: Never check the census_historical_migration/data folder into GitHub.

2. In the FAC/backend folder, run the following to load the CSV files from the census_historical_migration/data folder into the fac-census-to-gsafac-s3 bucket.
```bash
docker compose run web python manage.py fac_s3 fac-census-to-gsafac-s3 --upload --src census_historical_migration/data
```

3. In the FAC/backend folder, run the following to read the CSV files from the fac-census-to-gsafac-s3 bucket and load them into Postgres (a quick way to spot-check the result follows the command).
```bash
docker compose run web python manage.py csv_to_postgres --folder data
```
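One way to spot-check the load afterwards (not part of this PR, just standard Django) is to print row counts from a Django shell; the app label is the one used by the command code later in this PR:

```python
# docker compose run web python manage.py shell
from django.apps import apps

# Iterate over every model in the app and print its row count.
for model in apps.get_app_config("census_historical_migration").get_models():
    print(model._meta.model_name, model.objects.count())
```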

### How to run the historic data migrator:
```
@@ -17,4 +71,4 @@ docker compose run web python manage.py historic_workbook_generator
--dbkey 100010
```
- `year` is optional and defaults to `22`.
- The `output` directory will be created if it doesn't already exist.
115 changes: 115 additions & 0 deletions backend/census_historical_migration/management/commands/csv_to_postgres.py
@@ -0,0 +1,115 @@
import logging
import boto3
import csv


from django.core.management.base import BaseCommand
from django.conf import settings
from django.apps import apps


logger = logging.getLogger(__name__)
logger.setLevel(logging.WARNING)

census_to_gsafac_models = list(
apps.get_app_config("census_historical_migration").get_models()
)
census_to_gsafac_model_names = [m._meta.model_name for m in census_to_gsafac_models]
s3_client = boto3.client(
"s3",
aws_access_key_id=settings.AWS_PRIVATE_ACCESS_KEY_ID,
aws_secret_access_key=settings.AWS_PRIVATE_SECRET_ACCESS_KEY,
endpoint_url=settings.AWS_S3_ENDPOINT_URL,
)
census_to_gsafac_bucket_name = settings.AWS_CENSUS_TO_GSAFAC_BUCKET_NAME
DELIMITER = ","


class Command(BaseCommand):
help = """
Populate Postgres database from csv files
Usage:
manage.py csv_to_postgres --folder <folder_name> --clean <True|False>
"""

def add_arguments(self, parser):
parser.add_argument("--folder", help="S3 folder name")
parser.add_argument("--clean")
parser.add_argument("--sample")
parser.add_argument("--load")

def handle(self, *args, **options):
if options.get("clean") == "True":
self.delete_data()
return
if options.get("sample") == "True":
self.sample_data()
return

folder = options.get("folder")
if not folder:
print("Please specify a folder name")
return

items = s3_client.list_objects(
Bucket=census_to_gsafac_bucket_name,
Prefix=folder,
)["Contents"]
for item in items:
if item["Key"].endswith("/"):
continue
model_name = self.get_model_name(item["Key"])
if model_name:
model_obj = census_to_gsafac_models[
census_to_gsafac_model_names.index(model_name)
]
response = s3_client.get_object(
Bucket=census_to_gsafac_bucket_name, Key=item["Key"]
)
print("Obtained response from S3")
lines = response["Body"].read().decode("utf-8").splitlines(True)
print("Loaded Body into 'lines'")
rows = [row for row in csv.DictReader(lines)]
print("Completed processing 'lines'")
self.load_table(model_obj, rows)

for mdl in census_to_gsafac_models:
row_count = mdl.objects.all().count()
print(f"{row_count} in ", mdl)

def delete_data(self):
for mdl in census_to_gsafac_models:
print("Deleting ", mdl)
mdl.objects.all().delete()

def sample_data(self):
for mdl in census_to_gsafac_models:
print("Sampling ", mdl)
rows = mdl.objects.all()[:1]
for row in rows:
for col in mdl._meta.fields:
print(f"{col.name}: {getattr(row, col.name)}")

def get_model_name(self, name):
print("Processing ", name)
file_name = name.split("/")[-1].split(".")[0]
for model_name in census_to_gsafac_model_names:
if file_name.lower().startswith(model_name):
print("model_name = ", model_name)
return model_name
print("Could not find a matching model for ", name)
return None

def load_table(self, model_obj, rows):
print("Loading data for model_obj ", model_obj)
for i in range(0, len(rows)):
model_instance = model_obj()

for column_name, value in rows[i].items():
if column_name == "id":
continue
setattr(model_instance, column_name, value)
model_instance.save()
if i % 1000 == 0:
print(f"Loaded {i} of {len(rows)} rows to ", model_obj)
print(f"Loaded {len(rows)} rows to ", model_obj)
81 changes: 81 additions & 0 deletions backend/census_historical_migration/management/commands/fac_s3.py
@@ -0,0 +1,81 @@
from os import path
import os

import boto3

from django.core.management.base import BaseCommand

from django.conf import settings


class Command(BaseCommand):
help = """
Alternative to aws s3 as the cli is not available in production.
Usage:
manage.py fac_s3 <bucket_name> --upload --src SRC [--tgt TGT]
manage.py fac_s3 <bucket_name> --download --src SRC [--tgt TGT]
    manage.py fac_s3 <bucket_name> --rm --tgt TGT
manage.py fac_s3 <bucket_name> --ls [--tgt TGT]
"""

def add_arguments(self, parser):
parser.add_argument("bucket_name", type=str, help="The S3 bucket name.")
parser.add_argument("--src", help="local file name.")
parser.add_argument("--tgt", help="s3 file name.")
parser.add_argument("--ls", action="store_true", help="List all files.")
parser.add_argument(
"--upload", action="store_true", help="Copy local src to S3 tgt."
)
parser.add_argument(
"--download", action="store_true", help="Copy S3 tgt to local src."
)
parser.add_argument("--rm", action="store_true", help="Delete tgt.")

def handle(self, *args, **options):
bucket_name = options["bucket_name"]
src_path = options["src"]
tgt_path = options["tgt"]

s3_client = boto3.client(
"s3",
aws_access_key_id=settings.AWS_PRIVATE_ACCESS_KEY_ID,
aws_secret_access_key=settings.AWS_PRIVATE_SECRET_ACCESS_KEY,
endpoint_url=settings.AWS_S3_ENDPOINT_URL,
)

if options["ls"]:
items = s3_client.list_objects(
Bucket=bucket_name,
Prefix=tgt_path or "",
).get("Contents")
if not items:
print("Target is empty")
return
for item in items:
print(item["Key"], item["Size"], item["LastModified"])
return

if options["upload"]:
file_path = path.join(settings.BASE_DIR, src_path)
tgt_name = tgt_path or os.path.basename(file_path)
tgt_name_offset = len(str(file_path))
for subdir, dir, files in os.walk(file_path):
object_name = tgt_name + str(subdir)[tgt_name_offset:] + "/"
print(subdir, dir, object_name, files)
for file in files:
full_path = os.path.join(subdir, file)
s3_client.upload_file(full_path, bucket_name, object_name + file)
print(f"Copied {full_path} to {bucket_name} {object_name+file}.")
return

if options["download"]:
file_path = path.join(settings.BASE_DIR, src_path)
object_name = tgt_path
s3_client.download_file(bucket_name, object_name, file_path)
return

if options["rm"]:
s3_client.delete_object(
Bucket=bucket_name,
Key=tgt_path,
)