Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

(#102) Create script to migrate data to S3 bucket. #106

Merged
merged 1 commit into from
Nov 12, 2024

Conversation

blairlearn
Copy link
Contributor

Expects CSV file as exported by MS SQL Server Management Studio. (See https://github.com/NCIOCPL/clinical-trials-search-print/wiki/Deployment#data-migration)

Closes #102

raise RuntimeError("Please provide the input file name as a command line argument.")


if (
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you can condense to:

not os.environ.get("CTS_BUCKET_NAME", "").strip()

With the caveat that a None os.environ object (I don't think that's possible?) will throw an error.

loadedCount = 0
errorCount = 0
totalCount = 0
with open(sys.argv[1], encoding="utf-8-sig") as datafile:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For clarity I would put sys.argv[1] into a variable filename up at line 32.

errorCount = 0
totalCount = 0
with open(sys.argv[1], encoding="utf-8-sig") as datafile:
csv.field_size_limit(sys.maxsize)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just curious--why is it needed?

Copy link
Member

@jfrank-nih jfrank-nih Oct 29, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The sys.maxsize specifically.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The default is too small for some of the larger multi-trial pages.


for row in reader:

key = row[0]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Purely a "if you want to minimize your code" option. I believe this works:

key, cacheDate, trialIDs, searchParams, content = row[0:4]

searchParams = row[3]
content = row[4]

metadata = {}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would find this cleaner:

metadata = {
    "migrated-data": "True",
    "originally-generated": cacheDate,
    "search-criteria": searchParams,
    "trial-id-list": trialIDs
}

for row in reader:

key = row[0]
cacheDate = row[1]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typically python would do cache_date instead of cacheDate, per PEP8.


loadedCount = 0
errorCount = 0
totalCount = 0
Copy link
Member

@jfrank-nih jfrank-nih Oct 29, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Technically totalCount isn't needed and could just be len(reader), right?

Copy link
Member

@jfrank-nih jfrank-nih left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Linter has comments:

************* Module migrate
migrate.py:43:0: C0103: Constant name "loadedCount" doesn't conform to UPPER_CASE naming style (invalid-name)
migrate.py:44:0: C0103: Constant name "errorCount" doesn't conform to UPPER_CASE naming style (invalid-name)
migrate.py:45:0: C0103: Constant name "totalCount" doesn't conform to UPPER_CASE naming style (invalid-name)
migrate.py:85:15: W0718: Catching too general exception Exception (broad-exception-caught)

The constants aren't actually constants, so they could be ignored. Generally, if you were to use a default function __main__ then this wouldn't pop.

I see the other is a non-AWS catch-all-problems, which is fair.

So for both you could add an exclusion comment for black to make those go away.

@blairlearn blairlearn merged commit 830f0d0 into develop Nov 12, 2024
2 checks passed
@blairlearn blairlearn deleted the ticket/102-data-migrate branch November 12, 2024 19:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Enabler: Migrate data to S3
2 participants