reference panel chr1 is not fully imported #32

Closed
roman-tremmel opened this issue Oct 11, 2023 · 6 comments

roman-tremmel commented Oct 11, 2023

I downloaded the hg19 reference panel with the parameters All and vcf, like this:

get1KGGRCh37.sh All 20 vcf

Then I used the command line to score the GWAS data. Before scoring, the reference panel is imported.

REF=~/PascalX/resource/All.1KG.GRCh37
GENE=~/PascalX/resource/gene_GRCh37.tsv
pascalx  -g False -w 10000 -m 0.05 -n True -p 20 ${GENE} ${REF} ${OUT} genescoring -sh False -cr 0 -cp 1 ${IN}

This command produced All.1KG.GRCh37.chr*.db files for all chromosomes, which can then be used for scoring. For chr1, however, the following error interrupts the import after 2 hours. Of note, the same error occurs when using a Python script instead of the command-line tool.

Reference panel data not imported. Trying to import...
multiprocessing.pool.RemoteTraceback: 
"""
Traceback (most recent call last):
  File "~/anaconda3/envs/pascal/lib/python3.9/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "~/anaconda3/envs/pascal/lib/python3.9/site-packages/PascalX-0.0.4-py3.9-linux-x86_64.egg/PascalX/refpanel.py", line 270, in _import_reference_thread_vcf
    counter[int(geno[2])] += 1
ValueError: invalid literal for int() with base 10: '|'
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "run_pascal.py", line 4, in <module>
    Scorer.load_refpanel("~/PascalX/resource/All.1KG.GRCh37",parallel=10, chrlist=[1])
  File "~/anaconda3/envs/pascal/lib/python3.9/site-packages/PascalX-0.0.4-py3.9-linux-x86_64.egg/PascalX/genescorer.py", line 95, in load_refpanel
    self._ref.set_refpanel(filename=filename,parallel=parallel,keepfile=keepfile,qualityT=qualityT,SNPonly=SNPonly,chrlist=chrlist)
  File "~/anaconda3/envs/pascal/lib/python3.9/site-packages/PascalX-0.0.4-py3.9-linux-x86_64.egg/PascalX/refpanel.py", line 120, in set_refpanel
    self._import_reference(chrs=NF,parallel=parallel,keepfile=keepfile,qualityT=qualityT,SNPonly=SNPonly,regEx=regEx,nobar=nobar)
  File "~/anaconda3/envs/pascal/lib/python3.9/site-packages/PascalX-0.0.4-py3.9-linux-x86_64.egg/PascalX/refpanel.py", line 365, in _import_reference
    r.get()
  File "~/anaconda3/envs/pascal/lib/python3.9/multiprocessing/pool.py", line 771, in get
    raise self._value
ValueError: invalid literal for int() with base 10: '|'

Dan-RAI (Collaborator) commented Oct 14, 2023

Can you confirm that the data for chr1 was downloaded successfully? Sometimes the download aborts, which leaves the import file broken. I recommend downloading the chr1 data again (look in the shell script for the FTP URL).
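
One way to check for an aborted download, independent of PascalX, is to decompress the whole file once and see whether the gzip stream ends cleanly. The following is only a minimal sketch using the Python standard library; the function name gzip_is_complete is made up for illustration. A clean result rules out truncation only, not problems inside individual VCF records.

```python
import gzip
import zlib


def gzip_is_complete(path, chunk_size=1 << 20):
    """Return True if the whole gzip stream at `path` decompresses cleanly."""
    try:
        with gzip.open(path, "rb") as handle:
            # Read to the end; a truncated download raises EOFError here.
            while handle.read(chunk_size):
                pass
    except (EOFError, OSError, zlib.error):
        # OSError also covers gzip.BadGzipFile (not a gzip file at all).
        return False
    return True
```

It could be called as gzip_is_complete("All.1KG.GRCh37.chr1.vcf.gz"), using the chr1 file name from this thread.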

roman-tremmel (Author) commented Oct 16, 2023

I'm pretty sure. I tested the db import again using this 1.2GB file:

wget ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20130502/ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz

Unfortunately, I get the same error.

1216886729 Oct 16 09:00 All.1KG.GRCh37.chr1.vcf.gz

Dan-RAI (Collaborator) commented Oct 16, 2023

OK, I am going to investigate this particular vcf. In the meantime, you can experiment with the other chromosomes by loading them selectively in the Python API:

from PascalX import genescorer

G = genescorer.chi2sum()
G.load_refpanel('ALL', chrlist=['2','3','4'])
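
To narrow down which records in the chr1 file might trip the importer, one could scan the VCF for genotype fields that do not look like the common three-character form. This is a rough sketch, not PascalX code: the function name find_unusual_genotypes is hypothetical, and the assumption that the importer expects genotypes such as 0|1 is only inferred from the geno[2] index in the traceback above.

```python
import gzip
import re

# Assumption: a "simple" genotype is one single-digit allele, a separator,
# and another single-digit allele (for example 0|1).
SIMPLE_GT = re.compile(r"^[0-9][|/][0-9]$")


def find_unusual_genotypes(path, limit=10):
    """Return up to `limit` (line_number, sample_column, genotype) tuples."""
    hits = []
    with gzip.open(path, "rt") as handle:
        for line_number, line in enumerate(handle, start=1):
            if line.startswith("#"):
                continue  # skip meta and header lines
            fields = line.rstrip("\n").split("\t")
            # Sample columns follow the 9 fixed VCF columns (CHROM..FORMAT).
            for column, sample in enumerate(fields[9:], start=10):
                genotype = sample.split(":")[0]  # GT is the first subfield
                if not SIMPLE_GT.match(genotype):
                    hits.append((line_number, column, genotype))
                    if len(hits) >= limit:
                        return hits
    return hits
```

The returned line numbers count from the top of the decompressed file, header lines included, so they can be looked up directly with standard text tools.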

Dan-RAI (Collaborator) commented Oct 17, 2023

I can confirm that there is an issue with this particular .vcf. I will try to find some time later today to push a change to refpanel.py so that exceptions are caught and lines with vcf inconsistencies are simply skipped, with a warning message and the corresponding line number printed. As an immediate band-aid, you can add a try: in line 234 and an except Exception: in line 310, followed by a pass. That should skip the problematic lines.
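
The skip-and-warn idea can be illustrated outside of PascalX. The sketch below is not the code in refpanel.py and not the eventual fix; count_alleles is a hypothetical stand-in for a per-line import loop, and it assumes phased genotypes separated by |. It only shows the pattern: wrap the per-line parsing in try/except, report the line number, and continue.

```python
import warnings


def count_alleles(lines):
    """Count allele indices per VCF line, skipping lines that fail to parse.

    Hypothetical stand-in for an importer's inner loop, not PascalX code.
    """
    results = []
    for line_number, line in enumerate(lines, start=1):
        if line.startswith("#"):
            continue  # skip meta and header lines
        try:
            fields = line.rstrip("\n").split("\t")
            counter = {}
            for sample in fields[9:]:
                genotype = sample.split(":")[0]
                # Raises ValueError unless there are exactly two alleles.
                left, right = genotype.split("|")
                for allele in (int(left), int(right)):
                    counter[allele] = counter.get(allele, 0) + 1
        except ValueError as error:
            # Report the offending line and move on instead of aborting.
            warnings.warn(f"Skipping VCF line {line_number}: {error}")
        else:
            results.append((line_number, counter))
    return results
```

Catching only ValueError, rather than every Exception as in the band-aid, keeps unrelated bugs visible while still tolerating malformed genotype fields.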

Dan-RAI (Collaborator) commented Oct 18, 2023

Please try the new vcf_fix branch (see pull request #33).

roman-tremmel (Author) commented

Yes, using the fix worked for me. Thanks
