Custom database questions #45

Open
mbhall88 opened this issue Aug 29, 2022 · 4 comments

Comments

@mbhall88

I'm having some issues trying to create a custom database.

My understanding from the documentation is that I clone this repo, and then replace/change the tbdb.csv file to have the mutations I want, then I run parse_db.py in the main directory?

It seems a file is missing, and I can't find it documented anywhere:

$ python parse_db.py -c tbdb.csv --custom
Traceback (most recent call last):
  File "/Users/michaelhall/Projects/drprg/paper/tmp/tbdb/parse_db.py", line 281, in <module>
    args.func(args)
  File "/Users/michaelhall/Projects/drprg/paper/tmp/tbdb/parse_db.py", line 202, in main
    gene_info = load_gene_info("genes.txt")
  File "/Users/michaelhall/Projects/drprg/paper/tmp/tbdb/parse_db.py", line 187, in load_gene_info
    for l in open(filename):
FileNotFoundError: [Errno 2] No such file or directory: 'genes.txt'

I then tried running the following from the tbdb main directory instead:

$ tb-profiler create_db --custom --include_original_mutation

This completes successfully, but I have a further issue with its output.

As per the docs, the mutations must follow HGVS nomenclature. But it seems tb-profiler only accepts a subset of this nomenclature.

For example, I have the mutation c.196_198delinsTAG, which describes an MNP at position 196 (TCG>TAG). Looking at the tbdb.conversion.log, this gets (incorrectly) converted as:

Converted pncA c.196_198delinsTAG to c.196_198delTCG

Are you able to clarify (here and in the docs) what subset you support?
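
To see which HGVS forms a custom mutation list actually uses before running create_db, a small tally by edit type can help. The following is a minimal sketch, not tb-profiler code: the function names `classify_edit` and `summarise` are hypothetical, the patterns cover only simple coding-DNA (`c.`) notation, and anything else is reported as "other".

```python
import re
from collections import Counter

# Simple coding-DNA HGVS forms only; this is an illustrative subset,
# not the set of forms that tb-profiler supports.
_EDIT_PATTERNS = [
    ("substitution", re.compile(r"^c\.-?\d+[ACGT]>[ACGT]$")),
    ("delins", re.compile(r"^c\.-?\d+(_-?\d+)?delins[ACGT]+$")),
    ("deletion", re.compile(r"^c\.-?\d+(_-?\d+)?del[ACGT]*$")),
    ("duplication", re.compile(r"^c\.-?\d+(_-?\d+)?dup[ACGT]*$")),
    ("insertion", re.compile(r"^c\.-?\d+_-?\d+ins[ACGT]+$")),
]


def classify_edit(mutation):
    """Return the edit type of a c. HGVS string, or "other" if unrecognised."""
    for name, pattern in _EDIT_PATTERNS:
        if pattern.match(mutation):
            return name
    return "other"


def summarise(mutations):
    """Count how many mutations fall into each edit type."""
    return Counter(classify_edit(m) for m in mutations)


if __name__ == "__main__":
    examples = ["c.196_198delinsTAG", "c.196_198delTCG", "c.643dup", "c.643dupC"]
    for mutation in examples:
        print(mutation, classify_edit(mutation))
```

Any delins or bare dup entries found this way are worth checking against the conversion log after the database is built.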

@mbhall88
Author

I've also noticed that you don't accept duplications in the recommended format, i.e. c.643dup must specify the duplicated base at the end, e.g. c.643dupC.
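
As a workaround on the user side, the duplicated base(s) can be appended from the gene's coding sequence before the database is built. This is a minimal sketch under stated assumptions: `expand_dup` is a hypothetical helper, `coding_seq` is assumed to be the gene's coding sequence on the coding strand with c.1 as its first base, and only positive coding positions are handled (no promoter or intronic offsets).

```python
import re

# Matches c.643dup, c.643dupC, c.643_645dup and c.643_645dupACG.
_DUP = re.compile(r"^c\.(\d+)(?:_(\d+))?dup([ACGT]*)$")


def expand_dup(mutation, coding_seq):
    """Rewrite a c. duplication so that it names the duplicated base(s).

    Strings that are not simple duplications are returned unchanged.
    """
    match = _DUP.match(mutation)
    if match is None:
        return mutation
    start = int(match.group(1))
    end = int(match.group(2)) if match.group(2) else start
    if not (1 <= start <= end <= len(coding_seq)):
        raise ValueError(f"{mutation}: position outside the coding sequence")
    # HGVS c. positions are 1-based, Python slices are 0-based.
    bases = coding_seq[start - 1:end].upper()
    stated = match.group(3)
    if stated and stated != bases:
        raise ValueError(f"{mutation}: stated bases disagree with the reference")
    position = f"{start}_{end}" if match.group(2) else str(start)
    return f"c.{position}dup{bases}"


if __name__ == "__main__":
    toy_seq = "ATGGCC"  # toy sequence for illustration, not a real gene
    print(expand_dup("c.4dup", toy_seq))
    print(expand_dup("c.4_6dup", toy_seq))
```

Raising an error when the stated bases disagree with the reference is a deliberate choice here, since a silent rewrite could hide a coordinate mistake in the input file.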

@jodyphelan
Owner

Hi @mbhall88 ,

Sorry, I need to update the documentation. You are right to use tb-profiler create_db instead.

> As per the docs, the mutations must follow HGVS nomenclature. But it seems tb-profiler only accepts a subset of this nomenclature.
> For example, I have the mutation c.196_198delinsTAG, which describes an MNP at position 196 TCG>TAG. Looking at the tbdb.conversion.log this (incorrectly) gets converted.

Yes, at the moment it accepts only a subset. The pipeline uses snpEff to annotate variants in new samples, and snpEff represents each variant in only one way (e.g. c.643dupC instead of c.643dup). To simplify the variant lookup step, the create_db function tries to standardise all variants to the snpEff format using regex, but currently I've only added support for the variant types that are in tbdb.csv. Over the next few days I'll try to update the docs and look into adding compatibility for more types, such as the one you listed.
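
The reason a single representation matters can be illustrated with an exact-string lookup. This is only a sketch of the general idea, not tb-profiler's actual data structure: `build_lookup`, the gene name `geneX` and the drug name `drugA` are all hypothetical placeholders.

```python
def build_lookup(rows):
    """Index (gene, mutation) pairs to the set of associated drugs.

    Keys are compared as exact strings, so two spellings of the same
    variant are treated as different entries.
    """
    lookup = {}
    for gene, mutation, drug in rows:
        lookup.setdefault((gene, mutation), set()).add(drug)
    return lookup


if __name__ == "__main__":
    # Hypothetical database row, stored in the form the annotator emits.
    lookup = build_lookup([("geneX", "c.643dupC", "drugA")])
    # The annotator-style spelling is found.
    print(lookup.get(("geneX", "c.643dupC")))
    # The bare HGVS spelling of the same duplication is not.
    print(lookup.get(("geneX", "c.643dup")))
```

Standardising every database entry to the annotator's spelling at build time keeps the per-sample lookup a plain dictionary access.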

Thanks for raising the issue!

@mbhall88
Author

Thanks for the clarification. Supporting all of HGVS would likely be difficult and would probably require developing a library. I just noticed https://github.com/biocommons/hgvs though! I haven't used it before, but it looks like it might make your life a little easier.

Anyway, I got a custom db working and thought this issue might be helpful for some docs changes.

Thanks for the quick response.

@jodyphelan
Owner

Oh, I hadn't seen that before. I'll check it out, thanks!
And I'll have a go at updating the docs asap.
