Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

get nucleotide sequence by gene ID #47

Open
plasmid02 opened this issue Oct 27, 2019 · 12 comments
Open

get nucleotide sequence by gene ID #47

plasmid02 opened this issue Oct 27, 2019 · 12 comments

Comments

@plasmid02
Copy link

I would like to download the nucleotide sequence associated with a gene ID e.g for gene id 17363256, the nucleotide sequence if I clicked "fasta" link on this page
https://www.ncbi.nlm.nih.gov/gene/?term=17363256

@vkkodali
Copy link

You can use something like this:

esummary -db gene -id 17363256 \
  | xtract -pattern DocumentSummary -if GenomicInfoType -element Id \
    -block GenomicInfoType -element ChrAccVer ChrStart ChrStop \
  | while read -r gene_id chr_acc chr_start chr_stop ; do 
    efetch -db nuccore -id $chr_acc -chr_start $chr_start -chr_stop $chr_stop -format fasta ; 
  done

@junhuili
Copy link

It works for gene ID from NC_** Reference assembly. Is there a way to get nucleotide sequence by geneID from NZ_** Reference assembly?

@vkkodali
Copy link

@junhuili -- can you please provide an example?

@junhuili
Copy link

gene ID: 43296344, 39846661
Thanks, @vkkodali !

@vkkodali
Copy link

vkkodali commented Jan 11, 2020

You can probably do something like this (sequence truncated for brevity):

efetch -db gene -id 43296344 -format native -mode xml \
  | xtract -pattern Entrezgene-Set \
    -group Gene-commentary \
    -if Gene-commentary_type@value -equals "genomic" \
    -element Gene-commentary_accession, Gene-commentary_version \
    -block Gene-commentary_seqs \
    -element Seq-interval_from,Seq-interval_to,Na-strand@value \
  | awk 'BEGIN{FS="\t";OFS="\t"}{print $1"."$2,$3,$4,$5}' \
  | while read -r chrom start stop strand ; do 
    efetch -db nuccore -id $chrom -chr_start $start -chr_stop $stop -strand $strand -format fasta ; 
    done
>NZ_FQYS01000011.1:137784-139226 Pseudomonas zeshuii strain KACC 15471, whole genome shotgun sequence
TTGAAGCAACTAACGCGCCTGGAAAGTGGAATCGAGGGGCTGGATACCTTATTACGGGGCGGCTTTGTGG
CCGGCGCGTCATATGTCATTCAGGGCCGCCCGGGTTCTGGCAAGACAATTCTGGCCAATCAGATTGCCTT
CAACCATGTGCGCAAGGGTGAGCGTGTCCTGTTCGCCACGTTGTTGTCCGAATCCCATGAGCGGATGTTC

@junhuili
Copy link

Cool!

@shigdon
Copy link

shigdon commented Feb 10, 2022

I am interested in doing something similar but instead starting from a protein accession id, WP_XXXXXXXXX. @vkkodali is this possible to do? I have been spending a lot of time trying to figure out how to do this but haven't gotten to the finish line yet.

What I'm trying to do is obtain nucleotide fasta files by extracting the protein CDS(gene) as a range of nucleotides from the nucleotide accession number (NZ_XXXXXXXX) that corresponds to the protein WP id from a specific genome assembly.

I've been trying to modify the code above but am unsuccessful thus far in producing the right combination of pipes, targets, etc. Any insight/guidance will be greatly appreciated.

@vkkodali
Copy link

@shigdon -- for your purposes, I suggest using the new NCBI tool Datasets. You can use the command-line tool to download the sequences as shown below.

$ head -n3 wpaccs.txt 
WP_221685177.1
WP_194155123.1
WP_192827711.1
$ datasets download gene accession --inputfile wpaccs.txt 
Downloading: ncbi_dataset.zip    12.5kB done
$ unzip ncbi_dataset.zip 
Archive:  ncbi_dataset.zip
  inflating: README.md               
  inflating: ncbi_dataset/data/data_report.jsonl  
  inflating: ncbi_dataset/data/annotation_report.jsonl  
  inflating: ncbi_dataset/data/gene.fna  
  inflating: ncbi_dataset/data/protein.faa  
  inflating: ncbi_dataset/data/dataset_catalog.json 
$ grep '>' ncbi_dataset/data/gene.fna | head -n3
>NZ_UEGE01000347.1:c944-1 ampC [protein_accession=WP_164705932.1] [organism=Escherichia coli] [name=CMY2/MIR/ACT/EC family class C beta-lactamase] [gene=ampC]
>NZ_QOZZ01000625.1:c4246-3295 ampC [protein_accession=WP_162814514.1] [organism=Escherichia coli] [name=class C beta-lactamase] [gene=ampC]
>NZ_WJGH01000003.1:105977-107080 ampC [protein_accession=WP_153673307.1] [organism=Escherichia coli] [name=class C beta-lactamase] [gene=ampC]

@ShwetaaPandey
Copy link

Can you tell me how to download gene sequences with 2500 gene ids?

@vkkodali
Copy link

@ShwetaaPandey -- I suggest using NCBI Datasets for this. You can upload the list of NCBI GeneIDs here, browse the data and download sequence/metadata directly from the web browser. Alternately, you can use the NCBI Datasets tool to do the same from the command line.

@KczCAF
Copy link

KczCAF commented Sep 16, 2023

You can use something like this:

esummary -db gene -id 17363256 \
  | xtract -pattern DocumentSummary -if GenomicInfoType -element Id \
    -block GenomicInfoType -element ChrAccVer ChrStart ChrStop \
  | while read -r gene_id chr_acc chr_start chr_stop ; do 
    efetch -db nuccore -id $chr_acc -chr_start $chr_start -chr_stop $chr_stop -format fasta ; 
  done

For the nucleotide sequence, could you please explain why efetch -db is nuccore? Is nucleotide same as nuccore? Thanks, @vkkodali !

@KczCAF
Copy link

KczCAF commented Sep 17, 2023

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants