Clean up incoming ID3C data #32

trvrb · 2019-12-15T00:01:35Z

There are a small handful of upstream fixes we need to make to the shipping views.

1. The `date` field in `v2/shipping/augur-build-metadata` was formatted as `2019-09-25T19:37:35.483+00:00`. This should just read `2019-09-25`. I've fixed this on the augur side here: https://github.com/seattleflu/augur-build/blob/master/scripts/download_sfs_metadata.py#L25 for the time being.
2. Our strain names should match those used by the rest of the world rather than just being a long UUID. I'd like to match existing format as closely as possible. Strains in the US are geographically labeled by state, like B/Washington/2/2019. This means that sample UUID fe1a1206-21ef-45ff-8be0-9d7643eef879 would be strain A/Washington/43eef879/2019, ie taking A or B depending on flu A or flu B and taking year from date.
3. We need neighborhood (within Seattle proper) / puma (outside Seattle proper) for location. I believe that @kairstenfay may have started on this already in ID3C.
4. Include `age_range_coarse` as a field in the shipping view.
5. Restrict rows in `shipping.augur-build-metadata` to only those samples that have sequencing data.

Edited to update format for strain name in item 2 and to include items 4 and 5.
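
For item 1, a minimal sketch of the date truncation in Python, assuming the value arrives as an ISO 8601 string like the one quoted above. The function name `to_date_only` is hypothetical and is not taken from `download_sfs_metadata.py`.

```python
from datetime import datetime


def to_date_only(timestamp: str) -> str:
    """Reduce an ISO 8601 timestamp to its YYYY-MM-DD date part.

    Hypothetical helper for illustration. No timezone conversion is done:
    the calendar date is read in the timestamp's own UTC offset.
    """
    return datetime.fromisoformat(timestamp).date().isoformat()


print(to_date_only("2019-09-25T19:37:35.483+00:00"))
```

Parsing the value, rather than slicing the first ten characters, means a malformed timestamp raises a `ValueError` instead of passing through silently.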

tsibley · 2019-12-16T21:56:25Z

> This means that sample UUID fe1a1206-21ef-45ff-8be0-9d7643eef879 would be strain A/Washington/SFS-43eef879/2019

I would really prefer to keep the entire UUID in the strain name. The whole reason for using the UUIDs in the first place is that they are universally unique; a property that we lose if we truncate them. If we don't use the UUID, then we've lost all its benefits and shouldn't have used them from the start.

I would also caution against using opaque acronyms like SFS, since they're meaningless outside of the study. Can we use something like

A/Washington/seattleflu.org/fe1a1206-21ef-45ff-8be0-9d7643eef879/2019

instead?

trvrb · 2019-12-17T21:41:27Z

@tsibley --- I'm afraid I don't agree. We should aim to be as consistent as possible with how the entire flu field treats strain names. It will be super weird if there are canonical names like B/Washington/2/2019 while we name things like B/Washington/seattleflu.org/fe1a1206-21ef-45ff-8be0-9d7643eef879/2019. It's far outside standard naming.

The strain name itself is meant to be unique, but short enough to be usable. Even A/Singapore/Infimh-16-0019/2016 was quite unwieldy. Keep in mind that each strain is tied to a unique accession provisioned by GenBank or by GISAID that gives detailed provenance information. Strain names are meant to:

1. Provide broad virus information, ie A vs B
2. Provide broad geo information, ie Washington
3. Provide a short disambiguation string (traditionally 1, 2, 3)
4. Provide broad time information, ie 2019

(Field order is important too, extra slashes are non-standard and would break parsing)

I might even say to just name this as A/Washington/43eef879/2019. There is no way that the 8-digit hex will conflict with the CDC's 1, 2, 3 naming. (The SFS- was there for additional disambiguation, not for provenance)
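
A minimal Python sketch of the short form proposed here, assuming the sample ID is a standard UUID string and the date is already in YYYY-MM-DD form. The function and parameter names are hypothetical. Truncating to 8 hex digits gives up the UUID's uniqueness guarantee, as raised earlier in the thread, so a real pipeline would probably also check for collisions.

```python
from uuid import UUID


def strain_name(flu_type: str, state: str, sample_id: str, date: str) -> str:
    """Build a four-field name such as A/Washington/43eef879/2019.

    Hypothetical helper for illustration, not the project's actual code.
    """
    if flu_type not in ("A", "B"):
        raise ValueError(f"expected flu type 'A' or 'B', got {flu_type!r}")
    if "/" in state:
        # An extra slash would add a field and break name parsing.
        raise ValueError(f"state must not contain '/': {state!r}")
    # UUID() validates the ID; .hex is the 32 hex digits without hyphens.
    short_id = UUID(sample_id).hex[-8:]
    year = date[:4]
    return f"{flu_type}/{state}/{short_id}/{year}"


# The date below is a made-up example value for the sample in this thread.
print(strain_name("A", "Washington", "fe1a1206-21ef-45ff-8be0-9d7643eef879", "2019-09-25"))
```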

tsibley · 2019-12-18T00:04:36Z

> …but short enough to be usable. Even A/Singapore/Infimh-16-0019/2016 was quite unwieldy.

Ok! It seems like I don't understand how these names are used in practice, if that's considered unwieldy. (It doesn't, from my naive, outside perspective, seem unwieldy to me.)

Are these names regularly spoken, as opposed to copied/programmatically processed?

trvrb · 2019-12-18T16:47:16Z

Yes. Regularly spoken aloud and used to point people around a tree or around a titer table.

If you'd like to keep UUID, we can provide this as a "sample ID" in flat file data download that's paired with strain name.

tsibley · 2019-12-18T22:16:49Z

I think it would be smart to keep the full UUID linked one way or another. It is an identifier equivalent in utility to the GenBank accession.

joverlee521 · 2019-12-23T20:03:23Z

Add a more general identifier for each genome

joverlee521 · 2019-12-27T22:23:47Z

Update format for date in shipping.augur-build-metadata

trvrb · 2019-12-31T19:41:57Z

One additional request here: just using age_category eg adult vs child is too coarse of an analysis. I'd like to additionally have age_range_coarse, eg ["5 years","18 years"). I think age range coarse will be the right resolution for the genomic work and we won't be able to use age range fine.

I've added this as request number 4 above.
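
To illustrate the notation, here is a small Python sketch that reads a value like `["5 years","18 years")` as a half-open interval: lower bound included, upper bound excluded. It assumes every value has exactly this shape; unbounded or differently bracketed ranges are not handled, and the function names are hypothetical.

```python
import re
from typing import Tuple

# Matches only the half-open form shown above, e.g. ["5 years","18 years")
_RANGE = re.compile(r'^\["(\d+) years?","(\d+) years?"\)$')


def parse_age_range(text: str) -> Tuple[int, int]:
    """Return (lower, upper) in years for a half-open age range string."""
    match = _RANGE.match(text)
    if match is None:
        raise ValueError(f"unsupported age range: {text!r}")
    lower, upper = match.groups()
    return int(lower), int(upper)


def age_in_range(age_years: float, text: str) -> bool:
    """True if lower <= age < upper for the given range string."""
    lower, upper = parse_age_range(text)
    return lower <= age_years < upper
```

With this reading, a 5-year-old falls inside `["5 years","18 years")` and an 18-year-old does not.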

trvrb · 2020-01-02T17:07:24Z

Yet one more request. Can we restrict rows in shipping.augur-build-metadata to only those samples that have sequencing data? There are two reasons for this:

1. We want to protect data privacy in these shipping views, so rather than downloading a dataset of ~20k rows with all encounters, it's safer to download a dataset of ~2k rows with just encounters that were sequenced.
2. Dealing with the extra large metadata table is somewhat unwieldy given how scripts like select_strains.py are written.

I've added this as request number 5 above.
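
The intended result can be sketched in Python as a filter over metadata rows. This is only an illustration of the row restriction: the `sample` key and the `sequenced_ids` set are assumptions, not names from the actual view, and filtering after download would not address the privacy point above, which needs the restriction to happen upstream in the view itself.

```python
from typing import Dict, Iterable, List, Set


def only_sequenced(
    rows: Iterable[Dict[str, str]], sequenced_ids: Set[str]
) -> List[Dict[str, str]]:
    """Keep only metadata rows whose sample ID has sequencing data.

    Hypothetical helper; "sample" is an assumed column name.
    """
    return [row for row in rows if row.get("sample") in sequenced_ids]


rows = [
    {"sample": "fe1a1206-21ef-45ff-8be0-9d7643eef879", "date": "2019-09-25"},
    {"sample": "00000000-0000-0000-0000-000000000000", "date": "2019-09-26"},
]
sequenced_ids = {"fe1a1206-21ef-45ff-8be0-9d7643eef879"}
kept = only_sequenced(rows, sequenced_ids)
```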

kairstenfay · 2020-01-22T00:25:10Z

> Yet one more request. Can we restrict rows in shipping.augur-build-metadata to only those samples that have sequencing data?

@trvrb do you still only want the new shipping.metadata_for_augur_build to include samples with sequencing data? If so, is there a separate desire for a view similar to what Mike requested that contains all samples regardless of encounter or sequence data?

In a [GitHub issue](seattleflu/augur-build#32), Trevor requested that encountered date no longer be formatted as a timestamp but rather a date in YYYY-MM-DD format for the `shipping.metadata_for_augur_build_v*` views.

In seattleflu/augur-build#32, Trevor requested that we include `age_range_coarse` as a column in the view.

kairstenfay · 2020-02-05T00:13:32Z

> There are a small handful of upstream fixes we need to make to the shipping views.
>
> 1. The `date` field in `v2/shipping/augur-build-metadata` was formatted as `2019-09-25T19:37:35.483+00:00`. This should just read `2019-09-25`. I've fixed this on the augur side here: https://github.com/seattleflu/augur-build/blob/master/scripts/download_sfs_metadata.py#L25 for the time being.

This is now fixed on master.

> 4. Include `age_range_coarse` as a field in the shipping view.

This column is now present on master.

trvrb assigned joverlee521 Dec 15, 2019

kairstenfay mentioned this issue Jan 22, 2020

Update metadata for augur build view seattleflu/id3c-customizations#32

Merged

kairstenfay added a commit to seattleflu/id3c-customizations that referenced this issue Jan 22, 2020

views: Add age_range_coarse to augur metadata

7a71e58

In seattleflu/augur-build#32, Trevor requested that we include `age_range_coarse` as a column in the view.

kairstenfay added a commit to seattleflu/id3c-customizations that referenced this issue Jan 23, 2020

views: Add age_range_coarse to augur metadata

d4f6289

In seattleflu/augur-build#32, Trevor requested that we include `age_range_coarse` as a column in the view.

tsibley pushed a commit to seattleflu/id3c-customizations that referenced this issue Jan 25, 2020

views: Add age_range_coarse to augur metadata

af632f1

In seattleflu/augur-build#32, Trevor requested that we include `age_range_coarse` as a column in the view.

kairstenfay added a commit to seattleflu/id3c-customizations that referenced this issue Jan 28, 2020

views: Add age_range_coarse to augur metadata

e48685c

In seattleflu/augur-build#32, Trevor requested that we include `age_range_coarse` as a column in the view.
