Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Chore/schema constraints #81

Open
wants to merge 16 commits into
base: develop
Choose a base branch
from
Open

Conversation

MurphyMarkW
Copy link

This is a slew of small changes to the schema, primarily around imposing type and range constraints. I've kept the commits broken down so we can cherry pick certain ones into separate prs if desired. But mostly just want to get the conversation going.

Most of these add a minimum to number types, with notable exceptions being age_at_diagnosis and days_to_birth which have maximums set. The CDE pages on age_at_diagnosis and days_to_birth specify that the former is in a range [0,150] and that the latter is explicitly negative. For everything else, it seems like the definition of the value range is more vague, but does not explicitly state negative values.

year fields were changed to be integer type, following the CDE description of a year value being a four-character field w/ 0 decimal places.

The other two major changes are in md5sum where I've added a string regex to match the MD5 format, and restricted the file_size to be an integer type with minimum value of 1. This was discussed w/ @allisonheath and, unless there's a compelling reason not to, this should restrict us from accepting 0-byte files.

@NCI-GDC/ucdevs thoughts?

@allisonheath
Copy link
Contributor

cc @dmiller15 as well

@dmiller15
Copy link
Contributor

Just some things:

  • age_at_diagnosis is of type: number. shouldn't this be an integer or are we storing partial years?
  • a lot of integer parameters like all of the days_to* properties have type: number. is this intended?
  • do we want to add maximum percentages? we shouldn't be getting values over 100 after all.

@MurphyMarkW
Copy link
Author

So you're in good company w/ the number vs integer comments. I had originally gone through and converted most of them to integers, since it seemed odd that they would be recording partial days. But once I started trying to dig into the CDE descriptions that are linked for a good number of them, it seems like most are floating point values. year fields for the most part appear to be unsigned integers, but day fields appear to all be floats. I'll double check the age_at_diagnosis.

I'll also double check on the value ranges for the percentage fields, but I'd suspect that yeah, we should be capping them.

@MurphyMarkW
Copy link
Author

@dmiller15 looks like you're right about the age_at_diagnosis type.

I grabbed one of the percent fields, percent_tumor_cells and found it on the CDE website. It looks like it's intended to be 0% to 100% given the minimum length 1 and maximum length 3 for its representation, and no decimal place specification. But then, this could also mean up to 999% if we're only relying on that information...

In general, how have you guys been finding these value ranges? Maybe I'm missing something here, but it seems like the CDE pages on this stuff is rather lacking...

@MurphyMarkW
Copy link
Author

@dmiller15 and @allisonheath I've added upper bound constraints for the aforementioned percentage fields, as well as converted age_at_diagnosis over to an integer. Given the conversations we've had, I think what remains to be done w/ this is to modify the tests themselves to match what would be considered valid. Should I go ahead and do so, or does someone else want to put together a list of test cases?

@tlicht3
Copy link
Collaborator

tlicht3 commented Apr 6, 2016

Hi there. @allisonheath sent me this conversation in case I am able to contribute. I looked at the restrictions that have been added and have a couple of comments. If you're interested in building more validation for additional elements, i'm happy to share some suggestions. Looks like @dmiller15 is also working on these – let me know if I can help in any way.

  • For all “year” elements, are you sure you want the minimum value to be 0? Could be helpful to restrict it to a 4-digit value if that's possible.
  • For _terms.yaml#/age_at_diagnosis, it looks like you’re restricting it to min of 1 and max of 150. This makes sense for adult studies like TCGA, but for studies where children were included, it seems like this could be less than 1.

@MurphyMarkW
Copy link
Author

@tlicht3 sounds great, suggestion away.

@MurphyMarkW
Copy link
Author

@dmiller15 @allisonheath any thoughts on modifying tests to match?

@MurphyMarkW
Copy link
Author

Discussed w/ @dmiller15 offline and I think we're going to go ahead with modifying the tests to match these changes + suggestions from @tlicht3.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants