Reconsider rgsm value in fastq list rows #606
Comments
Ouch. I can see cases where the same 'sample' was used to generate both tumor and normal DNA (in the sense of a tissue specimen) although in this particular case I would argue that the various mixes should have different sample identifiers - they are distinct mixes of two other samples. I.e., for this specific case I'd change names upstream in the tracking sheet, demultiplex again and process from there. If we want to future-proof things option 3 feels best to me; the
Given there were other samples on this run, we were considering just updating the labmetadata and the fastq list row rgsm values for these entries. Saves a demultiplex.
Demultiplexing is stable, I don't mind re-generating all FASTQs if it saves some extra work...
I would also recommend this, and re-converting the affected samples. I can help with data cleanup and re-running the SOP on the Portal side, if needed. As far as I am aware, these samples are for the validation lot, so there is future value in re-running them to validate analysis pipeline changes, etc. So I reckon it is worth the effort to get it as right as we can.
Also worth noting, as Florian pointed out yesterday, we may come across a situation where the tumor and normal having the same sample id is actually appropriate (same tube, and some magical process separates the two - I don't know what this process is called). So it is inevitable we will have to move away from the current setup at some point. If you're happy to rerun the demux, Victor, and manage that, then I'm happy; I just thought it would be easier for you if we updated the fastq-list-rows.
Let me assess the situation and impact a bit further, please...
I (strongly) vote for number 1.
Note that we'd still be running samples in T/N mode with the same library ID at some point (maybe not automatically, but the overall workflow should not fail if we run the same library against its copy).
Huh? A library against itself? I can't see a reason why the orchestration should have an issue with that, but how would we distinguish between them in workflow results (not sure if all tools would support that)? We could possibly work out a naming convention, e.g. a second (virtual) library with a different ID pointing to the same FASTQs that could be used as the copy?
Part of the accreditation process - sequence the same library twice at different coverage levels and analyse, anything called is a false positive. Might ignore that for the portal and co.
After discussing with Victor, I don't think demultiplexing is the right way forward; for this sample alone we have 204 'libraryrun' entries.

SELECT sequencerun.instrument_run_id, limsrow_metadata.phenotype, fastqlistrow.*
FROM silver.limsrow_metadata
INNER JOIN portal.libraryrun
ON limsrow_metadata.library_id = libraryrun.library_id
INNER JOIN portal.sequencerun
ON sequencerun.instrument_run_id = libraryrun.instrument_run_id
INNER JOIN portal.fastqlistrow
ON libraryrun.library_id = fastqlistrow.rglb
AND sequencerun.id = fastqlistrow.sequence_run_id
AND libraryrun.lane = fastqlistrow.lane
WHERE
limsrow_metadata.subject_id = 'SBJ00480'
AND
fastqlistrow.rgsm = 'PTC_TsqN200511'
ORDER BY sequencerun.instrument_run_id

Gives us the following: Output

Summarised as

SELECT sequencerun.instrument_run_id, limsrow_metadata.phenotype, COUNT(DISTINCT fastqlistrow.rglb) as libraries_in_run
FROM silver.limsrow_metadata
INNER JOIN portal.libraryrun
ON limsrow_metadata.library_id = libraryrun.library_id
INNER JOIN portal.sequencerun
ON sequencerun.instrument_run_id = libraryrun.instrument_run_id
INNER JOIN portal.fastqlistrow
ON libraryrun.library_id = fastqlistrow.rglb
AND sequencerun.id = fastqlistrow.sequence_run_id
AND libraryrun.lane = fastqlistrow.lane
WHERE
limsrow_metadata.subject_id = 'SBJ00480'
AND
fastqlistrow.rgsm = 'PTC_TsqN200511'
GROUP BY limsrow_metadata.phenotype, sequencerun.instrument_run_id
ORDER BY sequencerun.instrument_run_id
Instead, the short-term approach is to update these fastq list rows' rgsm values to contain the sample name and the library id.
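A minimal sketch of that update, assuming each fastq list row is available as a dict with `rgsm` and `rglb` keys (the column names used in the queries above) and that the two parts are joined with an underscore. The separator, the helper name `with_library_suffix`, and the library ids in the example are assumptions for illustration, not anything the portal defines.

```python
def with_library_suffix(row: dict, sep: str = "_") -> dict:
    """Return a copy of a fastq list row whose rgsm is '<sample><sep><library>'.

    Idempotent: a row whose rgsm already ends with the library id
    is returned unchanged, so re-running the update is harmless.
    """
    suffix = f"{sep}{row['rglb']}"
    if row["rgsm"].endswith(suffix):
        return dict(row)
    return {**row, "rgsm": f"{row['rgsm']}{suffix}"}


# Hypothetical rows: same sample name, made-up library ids.
rows = [
    {"rgsm": "PTC_TsqN200511", "rglb": "L0000001", "lane": 1},
    {"rgsm": "PTC_TsqN200511", "rglb": "L0000002", "lane": 1},
]
updated = [with_library_suffix(r) for r in rows]
```

The original rows are left untouched, which would make it easy to diff old and new values before writing anything back.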
As highlighted by Stephen before running the validation samples, there are some cases where the tumor sample id and the normal sample id are the same value.
See Subject SBJ00480 as an example:
This is a bit of a problem.
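To get a feel for how widespread this is, a check along the following lines could flag affected subjects. It is only a sketch: it assumes the metadata can be exported as records with `subject_id`, `phenotype` and `rgsm` fields, and that phenotype takes the values `"tumor"` and `"normal"`. The function name and the second subject in the example are made up.

```python
from collections import defaultdict


def find_rgsm_collisions(records):
    """Map subject_id -> sorted rgsm values shared by a tumor and a normal."""
    seen = defaultdict(lambda: {"tumor": set(), "normal": set()})
    for rec in records:
        # Ignore any other phenotype values (assumption: only T/N matter here).
        if rec["phenotype"] in ("tumor", "normal"):
            seen[rec["subject_id"]][rec["phenotype"]].add(rec["rgsm"])

    collisions = {}
    for subject, by_phenotype in seen.items():
        shared = by_phenotype["tumor"] & by_phenotype["normal"]
        if shared:
            collisions[subject] = sorted(shared)
    return collisions


# Illustrative records only; SBJ_EXAMPLE and its sample names are hypothetical.
records = [
    {"subject_id": "SBJ00480", "phenotype": "tumor", "rgsm": "PTC_TsqN200511"},
    {"subject_id": "SBJ00480", "phenotype": "normal", "rgsm": "PTC_TsqN200511"},
    {"subject_id": "SBJ_EXAMPLE", "phenotype": "tumor", "rgsm": "SAMPLE_A"},
    {"subject_id": "SBJ_EXAMPLE", "phenotype": "normal", "rgsm": "SAMPLE_B"},
]
collisions = find_rgsm_collisions(records)
```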
Some proposed solutions (to be discussed on Tuesday):

Note that changing the rgsm values won't impact file names, as these are determined by the output-prefix parameter for dragen runs. However, it will impact the sample names in vcfs and bams.

1. Replace rgsm and use the library id instead.
   The tumor and normal names in vcfs and bams would instead be the library id.
2. Append _N or _T based on phenotype to the RGSM.
   This would mean that vcfs and bams would have similar names to the existing ones.
3. Use sample name + library id.
   Would help if users are relying on the sample name in vcfs and bams for tumor and normal samples.
4. Do nothing at all and ensure that sample names differ between tumors and normals for a given subject.
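The four naming options can be compared side by side with a small sketch. The underscore separator, the `"tumor"`/`"normal"` phenotype values, the function name `rgsm_for` and the library ids are all assumptions made for illustration.

```python
def rgsm_for(option: int, sample: str, library: str, phenotype: str) -> str:
    """Return the rgsm value a library would get under each proposed option."""
    if option == 1:  # library id only
        return library
    if option == 2:  # sample name plus _T / _N
        return f"{sample}_{'T' if phenotype == 'tumor' else 'N'}"
    if option == 3:  # sample name plus library id
        return f"{sample}_{library}"
    if option == 4:  # leave as-is
        return sample
    raise ValueError(f"unknown option: {option}")


# A tumor/normal pair sharing one sample name (library ids are made up).
pair = [
    ("PTC_TsqN200511", "L0000001", "tumor"),
    ("PTC_TsqN200511", "L0000002", "normal"),
]
names = {opt: [rgsm_for(opt, *lib) for lib in pair] for opt in (1, 2, 3, 4)}
distinct = {opt: len(set(vals)) == len(vals) for opt, vals in names.items()}
```

Under these assumptions only option 4 leaves the pair with identical names. Option 2 would still collide if two libraries of the same phenotype shared a sample name, which options 1 and 3 avoid as long as library ids are unique.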