Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

create_db error in docker container - Custom db not working #414

Open
chan-98 opened this issue Dec 9, 2024 · 12 comments
Open

create_db error in docker container - Custom db not working #414

chan-98 opened this issue Dec 9, 2024 · 12 comments

Comments

@chan-98
Copy link

chan-98 commented Dec 9, 2024

I had created a docker container and I'm trying to create db for a custim amplicon regions BED file.
I'm getting this error:-

tb-profiler create_db --prefix MY_TBDB --amplicon_primers /data/MTB/amplicons.sorted.bed --match_ref /data/MTB/NC000962.3_complete_genome.fasta
Traceback (most recent call last):
  File "/opt/conda/bin/tb-profiler", line 675, in <module>
    args.func(args)
  File "/opt/conda/bin/tb-profiler", line 267, in main_create_db
    version_string = json.load(open('variables.json'))['tb-profiler-version']
                               ^^^^^^^^^^^^^^^^^^^^^^
FileNotFoundError: [Errno 2] No such file or directory: 'variables.json'
Cleaning up after failed run
[06:57:22] ERROR                                                                                                                tb-profiler:74
                                                                                                                                              
                    ################################# ERROR #######################################                                           
                                                                                                                                              
                    This run has failed. Please check all arguments and make sure all input files                                             
                    exist. If no solution is found, please open up an issue at                                                                
                    https://github.com/jodyphelan/TBProfiler/issues/new and paste or attach the                                               
                    contents of the error log (GT_TBDB.errlog.txt)                                                                            
                                                                                                                                              
                    ###############################################################################                                           
@chan-98
Copy link
Author

chan-98 commented Dec 9, 2024

This is my dockerfile:

FROM ubuntu:22.04

ENV DEBIAN_FRONTEND=noninteractive

RUN apt-get update && apt-get install -y \
    wget \
    curl \
    bzip2 \
    ca-certificates \
    libglib2.0-0 \
    libxext6 \
    libsm6 \
    libxrender1 \
    git \
    && rm -rf /var/lib/apt/lists/*

RUN wget --quiet https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O /miniconda.sh && \
    bash /miniconda.sh -b -p /opt/conda && \
    rm /miniconda.sh

ENV PATH="/opt/conda/bin:$PATH"

RUN conda init

RUN conda config --add channels defaults
RUN conda config --add channels bioconda
RUN conda config --add channels conda-forge
# Install TB-Profiler v6.5.0
RUN conda install -y -c bioconda tb-profiler

RUN tb-profiler update_tbdb
RUN tb-profiler update_tbdb --branch who

CMD ["/bin/bash"]

@jodyphelan
Copy link
Owner

Are you running this command in a cloned copy of https://github.com/jodyphelan/tbdb? It is looking for this file but it can't seem to find it.

@chan-98
Copy link
Author

chan-98 commented Dec 10, 2024

@jodyphelan Seems I was running it outside tbdb, not inside the folder.
Thank you, but I am facing a different error.

I had built a custom db using my amplicons regions BED file and the reference I used, using this command:
tb-profiler create_db --prefix gt_tbdb --amplicon_primers /amplicons.sorted.bed --match_ref /NC000962.3_complete_genome.fasta

When I run the profile command I get this error during varcalling. However it is working with the default tbdb and WHO databases but not with my custom db. Is this an issue with my data or the create database command I used?

(base) root@ee5e30768c4b:/data/MTB/3-12-MTB_Analysis/TB-PROFILER# tb-profiler profile --read1 /data/MTB/MTB_RUN2/RAWDATA/barcode01.fastq --platform nanopore --db /tbdb/gt_tbdb --prefix TBP_BARCODE01 --csv --txt --docx  --mapper minimap2 --caller bcftools --coverage_tool bedtools --threads 12 --dir TEST1
[06:28:41] INFO     Using ref file: /tbdb/gt_tbdb.fasta                                                                              db.py:614
           INFO     Using gff file: /tbdb/gt_tbdb.gff                                                                                db.py:614
           INFO     Using bed file: /tbdb/gt_tbdb.bed                                                                                db.py:614
           INFO     Using json_db file: /tbdb/gt_tbdb.dr.json                                                                        db.py:614
           INFO     Using variables file: /tbdb/gt_tbdb.variables.json                                                               db.py:614
           INFO     Using spoligotype_spacers file: /tbdb/gt_tbdb.spoligotype_spacers.txt                                            db.py:614
           INFO     Using spoligotype_annotations file: /tbdb/gt_tbdb.spoligotype_list.csv                                           db.py:614
           INFO     Using bedmask file: /tbdb/gt_tbdb.mask.bed                                                                       db.py:614
           INFO     Using barcode file: /tbdb/gt_tbdb.barcode.bed                                                                    db.py:614
           INFO     Using rules file: /tbdb/gt_tbdb.rules.txt                                                                        db.py:614
           INFO     Mapping to reference genome                                                                                    fastq.py:53
Calling variants: 0it [00:00, ?it/s]
Indexing variants: 0it [00:00, ?it/s]
           ERROR                                                                                                                  utils.py:535
                    About:   Concatenate or combine VCF/BCF files. All source files must have the same sample                                 
                             columns appearing in the same order. The program can be used, for example, to                                    
                             concatenate chromosome VCFs into one VCF, or combine a SNP VCF and an indel                                      
                             VCF into one. The input files must be sorted by chr and position. The files                                      
                             must be given in the correct order to produce sorted VCF on output unless                                        
                             the -a, --allow-overlaps option is specified. With the --naive option, the files                                 
                             are concatenated without being recompressed, which is very fast.                                                 
                    Usage:   bcftools concat [options] <A.vcf.gz> [<B.vcf.gz> [...]]                                                          
                                                                                                                                              
                    Options:                                                                                                                  
                       -a, --allow-overlaps           First coordinate of the next file can precede last record of the current                
                    file.                                                                                                                     
                       -c, --compact-PS               Do not output PS tag at each site, only at the start of a new phase set                 
                    block.                                                                                                                    
                       -d, --rm-dups STRING           Output duplicate records present in multiple files only once:                           
                    <snps|indels|both|all|exact>                                                                                              
                       -D, --remove-duplicates        Alias for -d exact                                                                      
                       -f, --file-list FILE           Read the list of files from a file.                                                     
                       -G, --drop-genotypes           Drop individual genotype information.                                                   
                       -l, --ligate                   Ligate phased VCFs by matching phase at overlapping haplotypes                          
                           --ligate-force             Ligate even non-overlapping chunks, keep all sites                                      
                           --ligate-warn              Drop sites in imperfect overlaps                                                        
                           --no-version               Do not append version and command line to the header                                    
                       -n, --naive                    Concatenate files without recompression, a header check compatibility is                
                    performed                                                                                                                 
                           --naive-force              Same as --naive, but header compatibility is not checked. Dangerous, use                
                    with caution.                                                                                                             
                       -o, --output FILE              Write output to a file [standard output]                                                
                       -O, --output-type u|b|v|z[0-9] u/b: un/compressed BCF, v/z: un/compressed VCF, 0-9: compression level [v]              
                       -q, --min-PQ INT               Break phase set if phasing quality is lower than <int> [30]                             
                       -r, --regions REGION           Restrict to comma-separated list of regions                                             
                       -R, --regions-file FILE        Restrict to regions listed in a file                                                    
                           --regions-overlap 0|1|2    Include if POS in the region (0), record overlaps (1), variant overlaps (2)             
                    [1]                                                                                                                       
                           --threads INT              Use multithreading with <int> worker threads [0]                                        
                       -v, --verbose 0|1              Set verbosity level [1]                                                                 
                       -W, --write-index[=FMT]        Automatically index the output files [off]                                              
                                                                                                                                              
                    Failed to read from standard input: unknown file type                                                                     
                                                                                                                                              
Traceback (most recent call last):
  File "/opt/conda/bin/tb-profiler", line 675, in <module>
    args.func(args)
  File "/opt/conda/bin/tb-profiler", line 120, in wrapper
    func(args)
  File "/opt/conda/bin/tb-profiler", line 139, in main_profile
    variants_profile = pp.run_profiler(args)
                       ^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.12/site-packages/pathogenprofiler/cli.py", line 197, in run_profiler
    get_vcf_file(args)
  File "/opt/conda/lib/python3.12/site-packages/pathogenprofiler/cli.py", line 179, in get_vcf_file
    args.vcf = get_vcf_from_bam(args)
               ^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.12/site-packages/pathogenprofiler/cli.py", line 153, in get_vcf_from_bam
    vcf_obj = bam.call_variants(conf["ref"], caller=args.caller, filters = conf['variant_filters'], bed_file=conf["bed"], threads=args.threads, calling_params=args.calling_params, samclip = args.samclip, cli_args=vars(args))
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.12/site-packages/pathogenprofiler/bam.py", line 130, in call_variants
    return caller.call_variants()
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.12/site-packages/pathogenprofiler/variant_calling.py", line 87, in call_variants
    return self.run_calling(self.calling_cmd)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.12/site-packages/pathogenprofiler/variant_calling.py", line 72, in run_calling
    run_cmd(f"bcftools concat -aD {temp_vcf_files} | bcftools view -Oz -o {self.vcf_file}")
  File "/opt/conda/lib/python3.12/site-packages/pathogenprofiler/utils.py", line 537, in run_cmd
    raise ValueError("Command Failed:\n%s\nstderr:\n%s" % (cmd,result.stderr.decode()))
ValueError: Command Failed:
/bin/bash -c set -o pipefail; bcftools concat -aD  | bcftools view -Oz -o /data/MTB/3-12-MTB_Analysis/TB-PROFILER/9ccae0ee-711f-49a8-a4bc-0c7e8dbe13ff.short_variants.targets.vcf.gz
stderr:

About:   Concatenate or combine VCF/BCF files. All source files must have the same sample
         columns appearing in the same order. The program can be used, for example, to
         concatenate chromosome VCFs into one VCF, or combine a SNP VCF and an indel
         VCF into one. The input files must be sorted by chr and position. The files
         must be given in the correct order to produce sorted VCF on output unless
         the -a, --allow-overlaps option is specified. With the --naive option, the files
         are concatenated without being recompressed, which is very fast.
Usage:   bcftools concat [options] <A.vcf.gz> [<B.vcf.gz> [...]]

Options:
   -a, --allow-overlaps           First coordinate of the next file can precede last record of the current file.
   -c, --compact-PS               Do not output PS tag at each site, only at the start of a new phase set block.
   -d, --rm-dups STRING           Output duplicate records present in multiple files only once: <snps|indels|both|all|exact>
   -D, --remove-duplicates        Alias for -d exact
   -f, --file-list FILE           Read the list of files from a file.
   -G, --drop-genotypes           Drop individual genotype information.
   -l, --ligate                   Ligate phased VCFs by matching phase at overlapping haplotypes
       --ligate-force             Ligate even non-overlapping chunks, keep all sites
       --ligate-warn              Drop sites in imperfect overlaps
       --no-version               Do not append version and command line to the header
   -n, --naive                    Concatenate files without recompression, a header check compatibility is performed
       --naive-force              Same as --naive, but header compatibility is not checked. Dangerous, use with caution.
   -o, --output FILE              Write output to a file [standard output]
   -O, --output-type u|b|v|z[0-9] u/b: un/compressed BCF, v/z: un/compressed VCF, 0-9: compression level [v]
   -q, --min-PQ INT               Break phase set if phasing quality is lower than <int> [30]
   -r, --regions REGION           Restrict to comma-separated list of regions
   -R, --regions-file FILE        Restrict to regions listed in a file
       --regions-overlap 0|1|2    Include if POS in the region (0), record overlaps (1), variant overlaps (2) [1]
       --threads INT              Use multithreading with <int> worker threads [0]
   -v, --verbose 0|1              Set verbosity level [1]
   -W, --write-index[=FMT]        Automatically index the output files [off]

Failed to read from standard input: unknown file type

Cleaning up after failed run
           ERROR                                                                                                                tb-profiler:74
                                                                                                                                              
                    ################################# ERROR #######################################                                           
                                                                                                                                              
                    This run has failed. Please check all arguments and make sure all input files                                             
                    exist. If no solution is found, please open up an issue at                                                                
                    https://github.com/jodyphelan/TBProfiler/issues/new and paste or attach the                                               
                    contents of the error log (TBP_BARCODE01.errlog.txt)                                                                      
                                                                                                                                              
                    ###############################################################################                                           

@chan-98
Copy link
Author

chan-98 commented Dec 10, 2024

Seems the bed file gt_tbdb.bed is empty

(base) root@ee5e30768c4b:/data/MTB/3-12-MTB_Analysis/TB-PROFILER# head /tbdb/who.bed 
Chromosome	1	1524	Rv0001	dnaA	isoniazid
Chromosome	4933	7267	Rv0005	gyrB	levofloxacin,moxifloxacin
Chromosome	7068	9818	Rv0006	gyrA	levofloxacin,moxifloxacin
Chromosome	13133	13911	Rv0010c	Rv0010c	isoniazid
Chromosome	490545	491793	Rv0407	fgd1	clofazimine,delamanid,pretomanid
(base) root@ee5e30768c4b:/data/MTB/3-12-MTB_Analysis/TB-PROFILER# head /tbdb/who.barcode.bed 
Chromosome	1130	1131	lineage4.2.2.1	A	Euro-American	LAM7-TUR	None
Chromosome	4205	4206	lineage4.1.3	T	Euro-American	T;X;H	None
Chromosome	9943	9944	lineage4.2.2	C	Euro-American (Ural)	T;LAM7-TUR	None
Chromosome	10359	10360	lineage1.2.2.1	C	Indo-Oceanic	NA	RD239
Chromosome	10726	10727	La1.8	G	M.bovis	None	None
(base) root@ee5e30768c4b:/data/MTB/3-12-MTB_Analysis/TB-PROFILER# head /tbdb/gt_tbdb.bed 
(base) root@ee5e30768c4b:/data/MTB/3-12-MTB_Analysis/TB-PROFILER# head /tbdb/gt_tbdb.barcode.bed 
NC_000962.3	1130	1131	lineage4.2.2.1	A	Euro-American	LAM7-TUR	None
NC_000962.3	4205	4206	lineage4.1.3	T	Euro-American	T;X;H	None
NC_000962.3	9943	9944	lineage4.2.2	C	Euro-American (Ural)	T;LAM7-TUR	

Not sure how this happened? I am sure my amplicon regions have resistant mutations

(base) root@ee5e30768c4b:/data/MTB/3-12-MTB_Analysis/TB-PROFILER# head /amplicons.sorted.bed 
NC_000962.3	5079	6235	gyrB_a1
NC_000962.3	6097	7255	gyrB_a2
NC_000962.3	7217	8407	gyrA_a1
NC_000962.3	8248	9435	gyrA_a2
NC_000962.3	8748	9915	gyrA_a3
NC_000962.3	528536	529742	groEL2_a1

@chan-98 chan-98 changed the title create_db error in docker container create_db error in docker container - Custom db not working Dec 10, 2024
@jodyphelan
Copy link
Owner

jodyphelan commented Dec 10, 2024

The primers file that you supply is not actually a bed file but should be a three column, tab separated, text file with

  1. the amplicon name
  2. the forward sequence
  3. the reverse sequence

E.g.:

gyrB	AGAGTTGGTGCGGCGTAA	GCCACTTGAGTTTGTACAGC
gyrA	GTTGACATCGAGCAGGAGAT	AAATCGACTGTCTCCTCGTC

@chan-98
Copy link
Author

chan-98 commented Dec 10, 2024

The primers file that you supply is not actually a bed file but should be a three column, tab separated, text file with

1. the amplicon name

2. the forward sequence

3. the reverse sequence

E.g.:

gyrB	AGAGTTGGTGCGGCGTAA	GCCACTTGAGTTTGTACAGC
gyrA	GTTGACATCGAGCAGGAGAT	AAATCGACTGTCTCCTCGTC

Ah, thank you so much. I have multiple primers within a single gene, some overlapping. I hope this works without errors. I will try this and update here.

@jodyphelan
Copy link
Owner

Yes you can have multiple overlapping primers, you'll just have to make sure to give them a unique name

@chan-98
Copy link
Author

chan-98 commented Dec 10, 2024

The .bed created inside the tbdb folder is still empty

(base) root@22a11c22bb98:/data/MTB/MTB_RUN2/VARIANTS# head -n5 /tbdb/tbdb.bed
NC_000962.3	1	1524	Rv0001	dnaA	isoniazid
NC_000962.3	4933	7267	Rv0005	gyrB	levofloxacin,moxifloxacin
NC_000962.3	7068	9818	Rv0006	gyrA	levofloxacin,moxifloxacin
NC_000962.3	13133	13911	Rv0010c	Rv0010c	isoniazid
NC_000962.3	490545	491793	Rv0407	fgd1	clofazimine,delamanid,pretomanid
(base) root@22a11c22bb98:/data/MTB/MTB_RUN2/VARIANTS# head /tbdb/gt.bed  
(base) root@22a11c22bb98:/data/MTB/MTB_RUN2/VARIANTS# 

Primers file:

gyrB_a1	GGGACGCACCAGGAAGAAAG	CGGTGGTGAACAAGTACGCC
gyrB_a2	GGTGGAGATCGCGATGCAAT	ACGCCAAGGATGTTCGGTTC
gyrA_a1	CGCAGCTTTATCACCCGCA	CCGCTATTACGTTGACCACCA
gyrA_a2	TCGAGATCAAGCGCGATGC	CAATATCGACGACCGGCTGC	
gyrA_a3	CTCGCCGAAATCGTGGACAG	CGAGTGTTAGGAGTCGGGGT
groEL_a1	CAGCGTAAGTAGCGGGGTTG	GGTGTCGCGGTGATCAAGG
groEL_a2	TCATCACCGTCGAGGAGTCC	TCCTCTGGTTGGGAGCTACG
rpoB_a1	GCGGCTCAGCGGTTTAGTT	CGAGGGTCAGACCACGATGA
rpoB_a2	CAAGGAGAAGCGCTACGACC	GACGGTCCCTGTACTGACGA
rpoB_a3	GTACCTACCGGATGCGCAAG	CTCGATGATCACCCAGCAGC
...
...

Command for create_db: tb-profiler create_db --prefix gt --amplicon_primers /primers.tsv --match_ref /NC000962.3_complete_genome.fasta

@jodyphelan
Copy link
Owner

Internally the pipeline uses seqkit amplicon to find the location of the amplicon using the primers. Can you verify that
seqkit amplicon gt.fasta -p primers.tsv --bed produces amplicon hits? It the output is blank it might indicate that it hasn't found the primers in the reference genome.

@chan-98
Copy link
Author

chan-98 commented Dec 11, 2024

@jodyphelan

It's giving no stdout or output file

(base) root@1a873959019b:/tbdb# rm gt.bed 
(base) root@1a873959019b:/tbdb# seqkit amplicon gt.fasta -p /primers.tsv --bed
[INFO] 46 primer pair loaded
(base) root@1a873959019b:/tbdb#

But the primers, their targets regions have already been prepared (in my amplicons.sorted.bed file mentioned earlier) from the same reference. I'm not sure why the output is blank.

@jodyphelan
Copy link
Owner

Hmm, yeah it looks as though the primers aren't found. At the moment this is the only way to create an amplicon-specific database so I would look more into why the primers are not found. Perhaps also try blast to see if there are potential mismatches between the primers and the reference genome.

@chan-98
Copy link
Author

chan-98 commented Dec 12, 2024

@jodyphelan Nope, the primers have exact matches against the ref.

query id	s. start	s. end	query id	s. start	s. end
atpEFP	1460795	1460814	atpERP	1461567	1461586
ethAFP1	4325996	4326016	ethARP1	4326856	4326876
ethAFP2	4326727	4326709	ethARP2	4325840	4325823
ethAFP3	4326536	4326517	ethARP3	4325658	4325640
fbiBFP1	3641380	3641396	fbiBRP1	3642320	3642338
fbiBFP2	3642151	3642169	fbiBRP2	3643146	3643164
ddnFP1	3986555	3986573	ddnRP1	3987464	3987482
inhAFP	1674169	1674188	inhARP	1675042	1675057
rpsLFP	781062	781080	rpsLRP	781983	782002
Rv678FP1	778791	778810	Rv678RP1	779659	779678
FbiAFP1	3640416	3640435	FbiARP1	3641559	3641578
tlyA	1917917	1917935	tlyA	1918756	1918775
EISFP	2714112	2714131	Eis	2715389	2715408
pCNAFP	2289420	2289401	pcnA	2288559	2288544
fabG1FP	1673250	1673269	FabG1	1674194	1674213
ahpCFP	2726080	2726100	ahpC	2726939	2726958

I will try to work with seqkit amplicons separately and test out the primer seqs one by one. Will update here if I can figure out what the issue was.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants