Changes in line with gff2gff cgat scripts changes: preserving all contigs in gtf sanitisation #331
pipeline_annotations now requires the assembly file from NCBI for mapping of ENSEMBL contigs to UCSC
Thanks @jscaber I agree we should have empty input parameters in the Please let me know your thoughts. |
@sebastian-luna-valero Charlie suggested in one of the meetings that we have a pipeline.ini that is completely blank and an example_pipeline.ini that gives examples of parameters that can be used. This would force users to look at the ini in more detail before running. If this is something we want to do, I can change pipeline.py accordingly and then change our production pipelines. What do you think? |
I do like this idea, but we might need to think carefully about the
parameters: lots of parameters call for full paths.
That’s a lot of typing if you have to fill out
`/filesystem/partition/directory/genome/file` for a list of, say, 20
contaminants in the readqc ini, for example (even more for us, where we have
`/shared/filesystem/group/partition/directory/genome/file`).
|
I will put back examples of the paths that default to the CGAT locations. The testing will not fail, but it requires an additional input file and a change in pipeline.ini: I will update this. Regarding pre-filled vs non-pre-filled pipeline.ini: it is a difficult balance, but given that people have run the pipelines with wrong parameters, it may be reasonable to ask users to populate an empty pipeline.ini that has suggestions. What I always find annoying is that emacs doesn't autocomplete paths, but I think there may be a plugin for that. Also, we were discussing whether to transition to full paths (or relative paths that can be validated) for everything in pipeline.ini, to enable reliable input validation. I will create an issue explaining that. |
I agree with @jscaber with regard to running pipelines with wrong parameters. I've done it before, and having a pre-filled pipeline.ini file encourages bad practice. I would rather spend that little bit more time filling in the blank ini. |
I was more wondering if parameters could be specified in a more sensible
way:
i.e. instead of having to fill in full paths for all the contaminants, you
had a contaminants path, and then just filled in the file names for each
contaminant.
It has always seemed strange to me that we ask for an annotations_dir, and
then don’t make the annotations db relative to that.
And other such ways of making things easier to fill in.
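
A minimal sketch of the "base directory plus file names" idea, purely illustrative and not taken from the pipelines: the helper names `resolve_paths` and `missing_files` and the example directory are hypothetical. Relative names are joined onto a single base directory, absolute paths are left alone, and the result can be checked before a run.

```python
import os


def resolve_paths(base_dir, names):
    """Join relative names onto base_dir; leave absolute paths untouched."""
    return [
        name if os.path.isabs(name) else os.path.join(base_dir, name)
        for name in names
    ]


def missing_files(paths):
    """Return the paths that do not point at an existing file."""
    return [path for path in paths if not os.path.isfile(path)]


if __name__ == "__main__":
    # Hypothetical values: one directory entry plus short file names.
    contaminants = resolve_paths("/data/contaminants", ["phix.fa", "adapters.fa"])
    for path in missing_files(contaminants):
        print("not found:", path)
```

With something like this, the ini would only need one directory entry and a list of short file names, and the same check could back the input validation mentioned earlier in the thread.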
|
If we do this we might need to be a bit careful with the pipeline testing. Currently I think only minimal pipeline.ini files are supplied for the pipelines in pipeline testing, and I've got a feeling that any params that aren't specified in the pipeline.ini for the tests will be picked up from the default pipeline.ini files. An example is the readqc test: I think the file paths for the adapters for fastqc and for trimmomatic are specified in the default ini file, and when pipeline testing is run it picks them up from there. So I think we might need to update all the pipeline testing ini files with everything from the default pipeline ini files, otherwise tests will fail.
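
To illustrate the concern, here is a small sketch using Python's standard `configparser`. It assumes the test ini is layered on top of the default ini in the way `ConfigParser.read` layers several files (later files override earlier ones); whether the pipelines load their parameters exactly like this is an assumption, and the function names are hypothetical.

```python
import configparser


def load_layered(paths):
    """Read ini files in order; values in later files override earlier ones."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read(paths)
    return parser


def missing_in_override(default_path, override_path):
    """List (section, key) pairs set in the default ini but not in the override.

    These are the parameters a test would silently inherit from the defaults,
    and would lose if the default ini were blanked.
    """
    defaults = configparser.ConfigParser(interpolation=None)
    defaults.read(default_path)
    override = configparser.ConfigParser(interpolation=None)
    override.read(override_path)

    missing = []
    for section in defaults.sections():
        for key in defaults[section]:
            if not override.has_option(section, key):
                missing.append((section, key))
    return missing
```

Running `missing_in_override` over each default/test ini pair would give a list of parameters to copy into the test ini files before the defaults are emptied.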
Dr Charlotte L. George
On 17 May 2017 at 10:15, Sebastian Luna-Valero wrote:
Thanks @Acribbs. Great, I am happy with that!
Then, should we just merge?
Also, @jscaber we are running these targets with pipeline_testing.py: assembly, geneset, annotations, ucsc, summary. Could you please confirm whether I need to update the tests to reflect the updates in this PR?
|
I have run pipeline_testing locally, and have provided the new ini and the genome summary file. It runs fine. There are no changes to the gtf since it is prefiltered for chr19 without any random chromosome names, so no changes to previous output. Removing defaults from the default pipeline.ini has also been checked and has not made any difference here - thanks @Charlie-George, I hadn't thought about that. Testing should probably be re-run when making changes to pipeline.ini. |
Follow up: #332 |
pipeline_annotations now requires the "assembly report" file from NCBI for mapping of ENSEMBL contigs to UCSC. This enables the user to keep all contigs if they wish. The previous "genome" method would have discarded/skipped all contigs that do not have a known chromosomal location, as they would not map using the "getToken" method.
For custom contigs, e.g. if custom contigs are appended to the sequencing file (e.g. sequins, transgenic organism), these are coerced into the desired nomenclature. No contigs will be skipped or lost during this process, unless specifically required.
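
As a rough illustration of the mapping step described above (not the code in this PR), the sketch below builds an ENSEMBL-to-UCSC lookup from an NCBI assembly report. It assumes the standard ten-column, tab-separated report layout with `#` comment lines, and it assumes ENSEMBL names match either the `Sequence-Name` or the `GenBank-Accn` column. The function names are hypothetical, and unknown contigs are simply passed through unchanged rather than coerced.

```python
# Column order of an NCBI assembly report (assumed standard layout).
COLUMNS = [
    "Sequence-Name", "Sequence-Role", "Assigned-Molecule",
    "Assigned-Molecule-Location/Type", "GenBank-Accn", "Relationship",
    "RefSeq-Accn", "Assembly-Unit", "Sequence-Length", "UCSC-style-name",
]


def read_contig_map(lines):
    """Build a lookup from sequence name / GenBank accession to UCSC name."""
    mapping = {}
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header comments and blank lines
        row = dict(zip(COLUMNS, line.rstrip("\r\n").split("\t")))
        ucsc = row.get("UCSC-style-name", "na")
        if ucsc in ("na", ""):
            continue  # the report gives no UCSC name for this sequence
        for key in ("Sequence-Name", "GenBank-Accn"):
            name = row.get(key, "na")
            if name not in ("na", ""):
                mapping.setdefault(name, ucsc)
    return mapping


def rename_contig(name, mapping):
    """Return the UCSC-style name, keeping unknown (e.g. custom) contigs as-is."""
    return mapping.get(name, name)
```

Because `rename_contig` falls back to the original name, contigs absent from the report (for example spike-ins) are kept rather than dropped; applying a particular naming convention to them would be a separate step.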