Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pandasschema #467

Merged
merged 10 commits into from
Aug 17, 2020
Merged

pandasschema #467

merged 10 commits into from
Aug 17, 2020

Conversation

Maarten-vd-Sande
Copy link
Member

@Maarten-vd-Sande Maarten-vd-Sande commented Aug 11, 2020

What problem is the PR solving / What's new?
Added pandasschema for column parsing and slightly prettier/clearer/user-friendliness log for bad public samples (in the sense they all get logged).

@siebrenf I think the same check was done twice or do I not understand?

Checklist

  • I made a PR to develop (not master)
  • If applicable: I updated the docs
  • I updated the CHANGELOG
  • These changes are covered by the tests

@Maarten-vd-Sande Maarten-vd-Sande linked an issue Aug 11, 2020 that may be closed by this pull request
Copy link
Member

@siebrenf siebrenf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't like the leading-/trailing- white space validations. Why would we bother users with something so trivial?

P.S. the (previous) double asserts were separate for column names and column content. That was indeed a bit clunky.

seq2science/rules/configuration_generic.smk Show resolved Hide resolved
Comment on lines 123 to 131
column_schema["sample"] = [LeadingWhitespaceValidation(),
TrailingWhitespaceValidation(),
IsDistinctValidation(),
MatchesPatternValidation(allowed_pattern)]

column_schema["descriptive_name"] = [LeadingWhitespaceValidation(),
TrailingWhitespaceValidation(),
IsDistinctValidation(),
MatchesPatternValidation(allowed_pattern)]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

all columns are checked in the for loop after this, so maybe make a for loop for these two columns to check the IsDistinceValidation()?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep makes sense!

@Maarten-vd-Sande
Copy link
Member Author

Waiting for this to be accepted: multimeric/PandasSchema#37

Comment on lines 125 to 126
[Column(col, [MatchesPatternValidation(allowed_pattern),
IsDistinctValidation(ignore_nan=True)] if col in distinct_columns else [MatchesPatternValidation(allowed_pattern)], allow_empty=True) for col in
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ignore_nan doesnt exist now right?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch! The cat 🐱 didnt let me sleep last night

@Maarten-vd-Sande Maarten-vd-Sande merged commit ca2dd23 into develop Aug 17, 2020
@Maarten-vd-Sande Maarten-vd-Sande deleted the pandasschema branch August 17, 2020 14:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

FR: Check samples.tsv with PandasSchema
2 participants