Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Changes in line with gff2gff cgat scripts changes: preserving all contigs in gtf sanitisation #331

Merged
merged 2 commits into from
May 17, 2017

Conversation

jscaber
Copy link
Member

@jscaber jscaber commented May 17, 2017

pipeline_annotations now requires the "assembly report" file from NCBI for mapping of ENSEMBL contigs to UCSC. This enables the user to keep all contigs if they wish. The previous "genome" method would have discarded/skipped all contigs that do not have a known chromosomal location, as they would not map using the "getToken" method.

For custom contigs, e.g. if custom contigs are appended to the sequencing file (e.g. sequins, transgenic organism), these are coerced into the desired nomenclature. No contains will be skipped or lost during this process, unless specifically required.

pipeline_annotations no requires assembly file from NCBI for mapping of ENSEMBL contigs to UCSC
@sebastian-luna-valero
Copy link
Member

Thanks @jscaber

I agree we should have empty input parameters in the pipeline.ini to enforce the user to configure them correctly; otherwise the pipeline should break (this can now be easily spotted or prevented with the basic --input-validation option). However, can we please comment the old examples instead of removing them completely so they can be used as a reference for what is expected?

Please let me know your thoughts.

@Acribbs
Copy link
Member

Acribbs commented May 17, 2017

@sebastian-luna-valero Charlie suggested in one of the meetings that we have a pipeline.ini that is completely blank and a example_pipeline.ini that is an example of parameters that can be used. This would force users to look at the ini in more detail before running. If this is something we want to do I can change piepline.py accordingly and then change our production pipelines.

What do you think?

@sebastian-luna-valero
Copy link
Member

Thanks @Acribbs. Great, I am happy with that!

Then, should we just merge?

Also, @jscaber we are running these targets with pipeline_testing.py: assembly, geneset, annotations, ucsc, summary. Could you please confirm whether I need to update the tests to reflect the updates in this PR?

@IanSudbery
Copy link
Member

IanSudbery commented May 17, 2017 via email

@jscaber
Copy link
Member Author

jscaber commented May 17, 2017

I will put back examples of the paths that default to the CGAT locations.

The testing will not fail, but requires an additional input file and a change in pipeline.ini: I will update this.

Regarding pre-filled vs non prefilled pipeline.ini: difficult balance, but given that people have run the pipelines with parameters that are wrong, it may be reasonable to ask users to populate an empty pipeline.ini, that has suggestions. What I always find annoying is that emacs doesn't autocomplete paths, but I think there may be a plugin for that.

Also, we were discussing whether to transition to full paths (or relative paths that can be validated) for everything in pipeline.ini, to enable reliable input validation. I will create an issue explaining that.

@Acribbs
Copy link
Member

Acribbs commented May 17, 2017

I agree with @jscaber with regards to running pipeline with wrong parameters, iv done it before and having a pre-filled pipeline.ini file enforces bad practice. I would rather spend that little bit more time filling in the blank ini.

@IanSudbery
Copy link
Member

IanSudbery commented May 17, 2017 via email

@Charlie-George
Copy link
Member

Charlie-George commented May 17, 2017 via email

@jscaber
Copy link
Member Author

jscaber commented May 17, 2017

I have run pipeline_testing locally, and have provided the new ini and the genome summary file. It runs fine. There are no changes to the gtf since it is prefiltered for chr19 without any random chromosome names, so no changes to previous output.

Removing defaults from the default pipeline.ini has also been checked and has not made any difference here - thanks @Charlie-George, I hadn't thought about that. Testing should probably be re-run when making changes to pipeline.ini.

@jscaber jscaber changed the title Changes in line with gff2gff cgat scripts changes: preserving all contains in gtf sanitisation Changes in line with gff2gff cgat scripts changes: preserving all contigs in gtf sanitisation May 17, 2017
@sebastian-luna-valero sebastian-luna-valero merged commit c65865e into master May 17, 2017
@sebastian-luna-valero sebastian-luna-valero deleted the gff2gff branch May 17, 2017 13:35
@sebastian-luna-valero
Copy link
Member

Follow up: #332

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants