get nucleotide sequence by gene ID #47

plasmid02 · 2019-10-27T20:34:09Z

I would like to download the nucleotide sequence associated with a gene ID. For example, for gene ID 17363256, I want the nucleotide sequence I would get by clicking the "fasta" link on this page:
https://www.ncbi.nlm.nih.gov/gene/?term=17363256

vkkodali · 2019-10-27T21:19:28Z

You can use something like this:

esummary -db gene -id 17363256 \
  | xtract -pattern DocumentSummary -if GenomicInfoType -element Id \
    -block GenomicInfoType -element ChrAccVer ChrStart ChrStop \
  | while read -r gene_id chr_acc chr_start chr_stop ; do
    efetch -db nuccore -id "$chr_acc" -chr_start "$chr_start" -chr_stop "$chr_stop" -format fasta
  done
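
A note on the coordinates used above: as far as I know, ChrStart and ChrStop in the gene document summary are 0-based, and ChrStart is larger than ChrStop for genes on the minus strand, which is what the -chr_start/-chr_stop options of efetch expect. If 1-based coordinates are needed instead (for example for -seq_start/-seq_stop), a small helper along these lines could do the conversion. The function name to_one_based is made up for this sketch, and the strand codes 1 and 2 are the values efetch -strand takes for forward and reverse complement.

```shell
# Hypothetical helper: convert a 0-based ChrStart/ChrStop pair into a
# 1-based, ascending range plus a strand code (1 = forward, 2 = reverse).
to_one_based() {
  local start=$1 stop=$2
  if [ "$start" -le "$stop" ]; then
    # Plus strand: shift both ends by one.
    printf '%s\t%s\t1\n' "$((start + 1))" "$((stop + 1))"
  else
    # Minus strand: start > stop, so swap the ends after shifting.
    printf '%s\t%s\t2\n' "$((stop + 1))" "$((start + 1))"
  fi
}

# Possible use (assumes EDirect is installed; not run here):
# read -r from to strand < <(to_one_based "$chr_start" "$chr_stop")
# efetch -db nuccore -id "$chr_acc" -seq_start "$from" -seq_stop "$to" -strand "$strand" -format fasta
```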

junhuili · 2020-01-10T15:38:50Z

It works for gene IDs from NC_** reference assemblies. Is there a way to get the nucleotide sequence by gene ID from NZ_** reference assemblies?

vkkodali · 2020-01-10T16:10:27Z

@junhuili -- can you please provide an example?

junhuili · 2020-01-10T16:35:53Z

Gene IDs: 43296344, 39846661
Thanks, @vkkodali !

vkkodali · 2020-01-11T13:33:09Z

You can probably do something like this (sequence truncated for brevity):

efetch -db gene -id 43296344 -format native -mode xml \
  | xtract -pattern Entrezgene-Set \
    -group Gene-commentary \
    -if Gene-commentary_type@value -equals "genomic" \
    -element Gene-commentary_accession, Gene-commentary_version \
    -block Gene-commentary_seqs \
    -element Seq-interval_from,Seq-interval_to,Na-strand@value \
  | awk 'BEGIN{FS="\t";OFS="\t"}{print $1"."$2,$3,$4,$5}' \
  | while read -r chrom start stop strand ; do
    efetch -db nuccore -id "$chrom" -chr_start "$start" -chr_stop "$stop" -strand "$strand" -format fasta
  done
>NZ_FQYS01000011.1:137784-139226 Pseudomonas zeshuii strain KACC 15471, whole genome shotgun sequence
TTGAAGCAACTAACGCGCCTGGAAAGTGGAATCGAGGGGCTGGATACCTTATTACGGGGCGGCTTTGTGG
CCGGCGCGTCATATGTCATTCAGGGCCGCCCGGGTTCTGGCAAGACAATTCTGGCCAATCAGATTGCCTT
CAACCATGTGCGCAAGGGTGAGCGTGTCCTGTTCGCCACGTTGTTGTCCGAATCCCATGAGCGGATGTTC
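
To make the awk step in the middle of that pipeline easier to follow, here it is on its own. xtract emits the accession and the version as separate tab-delimited columns, and awk joins them into accession.version before the loop reads the row. The function name join_accession_version and the sample row are illustrative only. The row was not captured from a real run; its coordinates are inferred from the FASTA header above, assuming the interval values are 0-based.

```shell
# The same awk command as in the pipeline, wrapped in a function so it can
# be tried without calling NCBI.
join_accession_version() {
  awk 'BEGIN{FS="\t";OFS="\t"}{print $1"."$2,$3,$4,$5}'
}

# Illustrative input row: accession, version, from, to, strand.
printf 'NZ_FQYS01000011\t1\t137783\t139225\tplus\n' | join_accession_version
```

The first two columns come out joined as NZ_FQYS01000011.1, followed by the start, stop, and strand columns unchanged.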

junhuili · 2020-01-13T05:23:02Z

Cool!

shigdon · 2022-02-10T13:23:45Z

I am interested in doing something similar, but starting from a protein accession (WP_XXXXXXXXX). @vkkodali, is this possible? I have spent a lot of time trying to figure this out but haven't gotten to the finish line yet.

What I'm trying to do is obtain nucleotide FASTA files by extracting the protein's CDS (gene) as a nucleotide range from the nucleotide accession (NZ_XXXXXXXX) that corresponds to the WP protein ID in a specific genome assembly.

I've been trying to modify the code above but have not yet found the right combination of pipes, targets, etc. Any insight or guidance will be greatly appreciated.

vkkodali · 2022-02-10T14:43:01Z

@shigdon -- for your purposes, I suggest using the new NCBI tool Datasets. You can use the command-line tool to download the sequences as shown below.

$ head -n3 wpaccs.txt 
WP_221685177.1
WP_194155123.1
WP_192827711.1
$ datasets download gene accession --inputfile wpaccs.txt 
Downloading: ncbi_dataset.zip    12.5kB done
$ unzip ncbi_dataset.zip 
Archive:  ncbi_dataset.zip
  inflating: README.md               
  inflating: ncbi_dataset/data/data_report.jsonl  
  inflating: ncbi_dataset/data/annotation_report.jsonl  
  inflating: ncbi_dataset/data/gene.fna  
  inflating: ncbi_dataset/data/protein.faa  
  inflating: ncbi_dataset/data/dataset_catalog.json 
$ grep '>' ncbi_dataset/data/gene.fna | head -n3
>NZ_UEGE01000347.1:c944-1 ampC [protein_accession=WP_164705932.1] [organism=Escherichia coli] [name=CMY2/MIR/ACT/EC family class C beta-lactamase] [gene=ampC]
>NZ_QOZZ01000625.1:c4246-3295 ampC [protein_accession=WP_162814514.1] [organism=Escherichia coli] [name=class C beta-lactamase] [gene=ampC]
>NZ_WJGH01000003.1:105977-107080 ampC [protein_accession=WP_153673307.1] [organism=Escherichia coli] [name=class C beta-lactamase] [gene=ampC]
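
Since each header in gene.fna carries a [protein_accession=...] tag, one way to check which WP accessions came back is to pull that tag out of the deflines. This is a minimal sketch that assumes the header layout shown above. The function name list_protein_accessions is made up for illustration.

```shell
# Hypothetical helper: print the protein_accession value from every FASTA
# header in the given file, one per line, in file order.
list_protein_accessions() {
  grep '^>' "$1" | sed -n 's/.*\[protein_accession=\([^]]*\)\].*/\1/p'
}

# Possible use after unzipping the download:
# list_protein_accessions ncbi_dataset/data/gene.fna | sort -u > returned.txt
```

Comparing that list against wpaccs.txt would show whether any requested accession is missing from the download.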

ShwetaaPandey · 2023-02-27T10:18:44Z

Can you tell me how to download gene sequences for 2,500 gene IDs?

vkkodali · 2023-02-27T14:27:59Z

@ShwetaaPandey -- I suggest using NCBI Datasets for this. You can upload the list of NCBI GeneIDs here, browse the data, and download sequences and metadata directly from the web browser. Alternatively, you can use the NCBI Datasets tool to do the same from the command line.
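
For the command-line route, a list of a few thousand IDs is worth tidying before it is passed to the tool. The sketch below strips whitespace and Windows line endings, drops anything that is not a plain number, and removes duplicates. The function name clean_gene_ids and the file names are placeholders, and the datasets call in the comment assumes the gene-id subcommand accepts --inputfile in the same way the accession subcommand does in the example above.

```shell
# Hypothetical helper: normalize a file of gene IDs to one unique numeric
# ID per line, sorted numerically.
clean_gene_ids() {
  tr -d ' \t\r' < "$1" | grep -E '^[0-9]+$' | sort -un
}

# Possible use (assumes the datasets CLI is installed; not run here):
# clean_gene_ids raw_ids.txt > geneids.txt
# datasets download gene gene-id --inputfile geneids.txt
```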

KczCAF · 2023-09-16T14:02:53Z

You can use something like this:

esummary -db gene -id 17363256 \
  | xtract -pattern DocumentSummary -if GenomicInfoType -element Id \
    -block GenomicInfoType -element ChrAccVer ChrStart ChrStop \
  | while read -r gene_id chr_acc chr_start chr_stop ; do
    efetch -db nuccore -id "$chr_acc" -chr_start "$chr_start" -chr_stop "$chr_stop" -format fasta
  done

For the nucleotide sequence, could you please explain why the efetch -db value is nuccore? Is nucleotide the same as nuccore? Thanks, @vkkodali!

KczCAF · 2023-09-17T04:03:10Z

I found the answer at https://www.ncbi.nlm.nih.gov/books/NBK25497/table/chapter2.T._entrez_unique_identifiers_ui/?report=objectonly. @vkkodali
