Input csv fields: redundancy ok? #143
Comments
I'm not an expert on FASTQ headers, but it seems like the
I had the same question in the uclahs-cds/pipeline-germline-somatic pipeline. We just need to make sure that the correct sample name is used in call-gSNP, but it might be good to use our internal ID? @tyamaguchi-ucla
Hi guys, @zhuchcn @alkaZeltser. As discussed in the last NF WG (https://confluence.mednet.ucla.edu/display/BOUTROSLAB/2021-10-20+Nextflow+Working+Group+Meeting+Notes), I suggest that we use something like
So, here we probably want to update the
Yeah, I was thinking about this as well. I think it would be nice to have both external/internal ID in the BAM header so I'm thinking of using
For lanes, I think we may want to standardize the field and use
https://samtools.github.io/hts-specs/SAMv1.pdf
@tyamaguchi-ucla Hi, I have a problem with the read_group_identifier convention: there is a case where I don't get unique read_group_identifiers if I use it. Example:
less /hot/data/unregistered/Movember-Hiyari-Bone-GAP62/2021-9126/BE-1-Blood_L001_ds.018ae688d10d49f7bdca5bb4932df2ab/BE-1-Blood_S3_L001_R1_001.fastq.gz | head -n1
Using this library_identifier + '.' + lane # convention, I will end up with two non-unique read_group_identifiers:
@graceooh isn't it the same case we saw in the PRESTO dataset (the same library sequenced twice using the same lane)? Maybe you can check with Sarah and see how the samples were processed?
Yes that's right! OK I'll check and add -01 like we did for PRESTO then. Thanks!
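For anyone who hits the same collision, here is a minimal Python sketch of the idea discussed above: build each ID as library_identifier + '.' + lane, and add a numeric suffix only when the same library and lane pair shows up more than once. The function name, the sample records, and the exact placement and zero-padding of the suffix are assumptions for illustration; the thread only says "-01" was added for PRESTO, not where.

```python
from collections import Counter


def build_read_group_ids(records):
    """Build IDs as library_identifier + '.' + lane.

    When the same (library, lane) pair occurs more than once, append a
    zero-padded suffix (-01, -02, ...) so that every ID is unique.
    The suffix format is an assumption, not the pipeline's convention.
    """
    totals = Counter(records)  # how often each (library, lane) pair occurs
    seen = Counter()           # how many of each pair we have emitted so far
    ids = []
    for library, lane in records:
        read_group_id = f"{library}.{lane}"
        if totals[(library, lane)] > 1:
            seen[(library, lane)] += 1
            read_group_id = f"{read_group_id}-{seen[(library, lane)]:02d}"
        ids.append(read_group_id)
    return ids


# Hypothetical records: the same library sequenced twice on the same lane.
records = [("LIB-A", "L001"), ("LIB-A", "L001"), ("LIB-B", "L002")]
read_group_ids = build_read_group_ids(records)
print(read_group_ids)
```

Pairs that occur only once keep the plain library.lane form, so existing IDs would not change unless there is an actual collision.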
I'm constructing a csv input file for a sample (from an ILLUMINA sequencer) and am a little confused about some of the fields.
Referencing gatk, it seems that there is a lot of redundancy in the input fields. For example, both the read_group_identifier (ID) and platform_unit (PU) fields are constructed using the flowcell ID and lane number (for ILLUMINA reads). Then the lane number is provided separately as another field, to be concatenated with the ID field. Therefore the ID field should really just be the flowcell in my case?
Also, for the sample field, would I use the internal sample ID or the original (external) sample ID?
For example, an input csv for a sample with the following fastq filename:
FD00123067_S14_L001_R1_001.fastq.gz
And the following fastq header:
@A00817:312:HKTWMDRXY:1:1101:3106:1016 1:N:0:GATAGGCCGA+GCCATGTGCG
parsed using the following ILLUMINA header format:
@<instrument>:<run number>:<flowcell ID>:<lane>:<tile>:<x-pos>:<y-pos> <read>:<is filtered>:<control number>:<sample number>
Would look like this:
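As a side note on the header format quoted above, the flowcell ID and lane can be pulled out of the header line with a few lines of Python. This is only a sketch: the function and field names are made up for illustration, and joining flowcell and lane with a '.' is an assumption about how the ID would be built, not something the pipeline is confirmed to do.

```python
def parse_illumina_header(header):
    """Split an ILLUMINA FASTQ header line into named fields.

    Follows the format quoted in this issue:
    @<instrument>:<run number>:<flowcell ID>:<lane>:<tile>:<x-pos>:<y-pos>
    <read>:<is filtered>:<control number>:<sample number>
    """
    if not header.startswith("@"):
        raise ValueError("FASTQ header lines start with '@'")
    # The two colon-separated groups are divided by a single space.
    left, _, right = header[1:].strip().partition(" ")
    left_names = ["instrument", "run_number", "flowcell_id", "lane",
                  "tile", "x_pos", "y_pos"]
    right_names = ["read", "is_filtered", "control_number", "sample_number"]
    left_values = left.split(":")
    right_values = right.split(":")
    if len(left_values) != len(left_names) or len(right_values) != len(right_names):
        raise ValueError(f"Unexpected header layout: {header!r}")
    fields = dict(zip(left_names, left_values))
    fields.update(zip(right_names, right_values))
    return fields


header = "@A00817:312:HKTWMDRXY:1:1101:3106:1016 1:N:0:GATAGGCCGA+GCCATGTGCG"
fields = parse_illumina_header(header)
# Assumption: flowcell and lane joined with '.', for illustration only.
flowcell_lane = f"{fields['flowcell_id']}.{fields['lane']}"
print(flowcell_lane)  # HKTWMDRXY.1
```

Note that in this particular header the last field holds the index sequences (GATAGGCCGA+GCCATGTGCG) rather than a sample number, so the sketch simply stores whatever string is in that position.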