Changes in line with gff2gff cgat scripts changes: preserving all contigs in gtf sanitisation #331

jscaber · 2017-05-17T08:55:35Z

pipeline_annotations now requires the "assembly report" file from NCBI for mapping of ENSEMBL contigs to UCSC. This enables the user to keep all contigs if they wish. The previous "genome" method would have discarded/skipped all contigs that do not have a known chromosomal location, as they would not map using the "getToken" method.

Custom contigs, e.g. those appended to the sequencing file (sequins, transgenic organisms), are coerced into the desired nomenclature. No contigs will be skipped or lost during this process, unless specifically required.
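As a rough illustration of the mapping described above, here is a minimal Python sketch. It is not the gff2gff implementation: the function names `load_contig_map` and `to_ucsc` are hypothetical, the report excerpt uses only a subset of the assembly report columns with illustrative rows, and the `chr` prefix fallback for custom contigs is an assumption about what "coerced into the desired nomenclature" might look like.

```python
import io

# Illustrative excerpt in the tab-separated layout of an NCBI assembly report.
# Only a subset of the real columns is shown; rows are for demonstration only.
REPORT = (
    "# Sequence-Name\tSequence-Role\tGenBank-Accn\tUCSC-style-name\n"
    "1\tassembled-molecule\tCM000663.2\tchr1\n"
    "HSCHRUN_RANDOM_CTG1\tunplaced-scaffold\tKI270302.1\tchrUn_KI270302v1\n"
)


def load_contig_map(handle):
    """Map sequence names and GenBank accessions to UCSC-style names."""
    header = None
    mapping = {}
    for line in handle:
        line = line.rstrip("\n")
        if line.startswith("#"):
            # The last comment line before the data holds the column names.
            header = line.lstrip("# ").split("\t")
            continue
        if not line:
            continue
        row = dict(zip(header, line.split("\t")))
        ucsc = row["UCSC-style-name"]
        if ucsc == "na":
            # No UCSC name recorded for this contig; leave it to the fallback.
            continue
        mapping[row["Sequence-Name"]] = ucsc
        mapping[row["GenBank-Accn"]] = ucsc
    return mapping


def to_ucsc(contig, mapping, prefix="chr"):
    """Translate a contig name, keeping contigs that are not in the report."""
    if contig in mapping:
        return mapping[contig]
    # Assumed fallback for custom contigs: add the prefix rather than drop them.
    return contig if contig.startswith(prefix) else prefix + contig


contig_map = load_contig_map(io.StringIO(REPORT))
for name in ["1", "KI270302.1", "sequin_1"]:
    print(name, "->", to_ucsc(name, contig_map))
```

The point of the fallback branch is that an unknown contig is renamed rather than discarded, which is the behavioural difference from the previous "genome" method.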

pipeline_annotations now requires assembly file from NCBI for mapping of ENSEMBL contigs to UCSC

sebastian-luna-valero · 2017-05-17T09:03:03Z

Thanks @jscaber

I agree we should have empty input parameters in the pipeline.ini to force the user to configure them correctly; otherwise the pipeline should break (this can now be easily spotted or prevented with the basic --input-validation option). However, can we please comment out the old examples instead of removing them completely, so they can be used as a reference for what is expected?

Please let me know your thoughts.

Acribbs · 2017-05-17T09:10:51Z

@sebastian-luna-valero Charlie suggested in one of the meetings that we have a pipeline.ini that is completely blank and an example_pipeline.ini that shows example parameters that can be used. This would force users to look at the ini in more detail before running. If this is something we want to do, I can change pipeline.py accordingly and then change our production pipelines.

What do you think?

sebastian-luna-valero · 2017-05-17T09:15:31Z

Thanks @Acribbs. Great, I am happy with that!

Then, should we just merge?

Also, @jscaber we are running these targets with pipeline_testing.py: assembly, geneset, annotations, ucsc, summary. Could you please confirm whether I need to update the tests to reflect the updates in this PR?

IanSudbery · 2017-05-17T09:16:24Z

I do like this idea, but we might need to think carefully about the parameters: lots of parameters call for full paths. That’s a lot of typing if you have to fill out `/filesystem/partition/directory/genome/file` for a list of, say, 20 contaminants in the readqc ini, for example (even more for us, where we have `/shared/filesystem/group/partition/directory/genome/file`).

jscaber · 2017-05-17T09:26:44Z

I will put back examples of the paths that default to the CGAT locations.

The testing will not fail, but requires an additional input file and a change in pipeline.ini: I will update this.

Regarding pre-filled vs non-prefilled pipeline.ini: it is a difficult balance, but given that people have run the pipelines with wrong parameters, it may be reasonable to ask users to populate an empty pipeline.ini that contains suggestions. What I always find annoying is that emacs doesn't autocomplete paths, but I think there may be a plugin for that.

Also, we were discussing whether to transition to full paths (or relative paths that can be validated) for everything in pipeline.ini, to enable reliable input validation. I will create an issue explaining that.

Acribbs · 2017-05-17T09:46:56Z

I agree with @jscaber with regard to running pipelines with wrong parameters. I've done it before, and having a pre-filled pipeline.ini file encourages bad practice. I would rather spend that little bit more time filling in the blank ini.
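To make the blank-ini idea concrete, here is a minimal sketch of a check that lists parameters left empty, so a run could stop before doing any work. This is not how the existing --input-validation option is implemented; the function `find_blank_parameters` and the section and key names in the example ini are hypothetical.

```python
import configparser

# Hypothetical ini content: two parameters have been left blank on purpose.
EXAMPLE_INI = """
[general]
genome = hg38
genome_dir =

[ncbi]
assembly_report =
"""


def find_blank_parameters(ini_text):
    """Return 'section.key' for every parameter whose value is empty."""
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    blanks = []
    for section in parser.sections():
        # raw=True avoids interpolation errors on values containing '%'.
        for key, value in parser.items(section, raw=True):
            if not value.strip():
                blanks.append(f"{section}.{key}")
    return blanks


blank = find_blank_parameters(EXAMPLE_INI)
if blank:
    print("Please fill in:", ", ".join(blank))
```

A check like this only catches parameters that are missing a value; whether a filled-in value is correct still needs path or content validation.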

IanSudbery · 2017-05-17T10:11:33Z

I was more wondering if parameters could be specified in a more sensible way: i.e. instead of having to fill in full paths for all the contaminants, you would have a contaminants path and then just fill in the file names for each contaminant. It has always seemed strange to me that we ask for an annotations_dir and then don’t make the annotations db relative to that. And other such ways of making things easier to fill in.
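A minimal sketch of that suggestion, assuming a single base directory parameter plus a comma-separated list of file names. The function `resolve_paths` and the example paths are hypothetical and do not correspond to existing pipeline.ini options.

```python
from pathlib import PurePosixPath


def resolve_paths(base_dir, file_list):
    """Expand a comma-separated list of file names against a base directory.

    Entries that are already absolute paths are returned unchanged, so users
    can still point at a file that lives elsewhere.
    """
    base = PurePosixPath(base_dir)
    resolved = []
    for name in file_list.split(","):
        name = name.strip()
        if not name:
            continue
        # Joining with an absolute path yields that absolute path unchanged.
        resolved.append(str(base / name))
    return resolved


# Hypothetical values, as they might be read from an ini file.
for path in resolve_paths("/data/contaminants", "adapters.fa, phix.fa, /other/rrna.fa"):
    print(path)
```

With this layout the long directory prefix is typed once, and each resolved path can still be validated as a full path before the pipeline starts.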

Charlie-George · 2017-05-17T10:15:50Z

If we do this we might need to be a bit careful with the pipeline testing. Currently I think only minimal pipeline.ini files are supplied for the pipelines in pipeline testing, and I've got a feeling that any params that aren't specified in the pipeline.ini for the tests will be picked up from the default pipeline.ini files. An example is the readqc test: I think the file paths for the adapters for fastqc and for trimmomatic are specified in the default ini file, and when pipeline testing is run it picks them up from there. So I think we might need to update all the pipeline testing ini files with the values from the default pipeline ini files, otherwise tests will fail.
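One way to find the parameters at risk is to compare each test ini against the default ini and list what the test would silently inherit. The sketch below is an assumption about how such a check could look, not the actual CGAT configuration loader; `inherited_from_default` and the ini contents are hypothetical.

```python
import configparser

# Hypothetical default ini shipped with a pipeline.
DEFAULT_INI = """
[general]
genome = hg19

[fastqc]
contaminants = /default/path/adapters.tsv
"""

# Hypothetical minimal ini supplied with the test data.
TEST_INI = """
[general]
genome = hg19
"""


def inherited_from_default(default_text, test_text):
    """List 'section.key' entries the test ini does not set itself."""
    default = configparser.ConfigParser()
    default.read_string(default_text)
    test = configparser.ConfigParser()
    test.read_string(test_text)
    inherited = []
    for section in default.sections():
        for key in default.options(section):
            # has_option returns False when the section itself is missing.
            if not test.has_option(section, key):
                inherited.append(f"{section}.{key}")
    return inherited


print(inherited_from_default(DEFAULT_INI, TEST_INI))
```

Any key reported here would become empty in the tests once the default ini is blanked, so it would need to be copied into the test ini first.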


jscaber · 2017-05-17T10:22:13Z

I have run pipeline_testing locally, and have provided the new ini and the genome summary file. It runs fine. There are no changes to the gtf since it is prefiltered for chr19 without any random chromosome names, so no changes to previous output.

Removing defaults from the default pipeline.ini has also been checked and has not made any difference here - thanks @Charlie-George, I hadn't thought about that. Testing should probably be re-run when making changes to pipeline.ini.

sebastian-luna-valero · 2017-05-17T13:36:59Z

Follow up: #332

Changes in line with gff2gff changes:

a3e27b5

pipeline_annotations now requires assembly file from NCBI for mapping of ENSEMBL contigs to UCSC

Examples added to pipeline.ini

11def78

jscaber changed the title ~~Changes in line with gff2gff cgat scripts changes: preserving all contains in gtf sanitisation~~ Changes in line with gff2gff cgat scripts changes: preserving all contigs in gtf sanitisation May 17, 2017

jscaber mentioned this pull request May 17, 2017

Changes to pipeline.ini files (global) #332

Open

sebastian-luna-valero merged commit c65865e into master May 17, 2017

sebastian-luna-valero deleted the gff2gff branch May 17, 2017 13:35
