-
Notifications
You must be signed in to change notification settings - Fork 53
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
get nucleotide sequence by gene ID #47
Comments
You can use something like this: esummary -db gene -id 17363256 \
| xtract -pattern DocumentSummary -if GenomicInfoType -element Id \
-block GenomicInfoType -element ChrAccVer ChrStart ChrStop \
| while read -r gene_id chr_acc chr_start chr_stop ; do
efetch -db nuccore -id $chr_acc -chr_start $chr_start -chr_stop $chr_stop -format fasta ;
done |
It works for gene ID from NC_** Reference assembly. Is there a way to get nucleotide sequence by geneID from NZ_** Reference assembly? |
@junhuili -- can you please provide an example? |
gene ID: 43296344, 39846661 |
You can probably do something like this (sequence truncated for brevity): efetch -db gene -id 43296344 -format native -mode xml \
| xtract -pattern Entrezgene-Set \
-group Gene-commentary \
-if Gene-commentary_type@value -equals "genomic" \
-element Gene-commentary_accession, Gene-commentary_version \
-block Gene-commentary_seqs \
-element Seq-interval_from,Seq-interval_to,Na-strand@value \
| awk 'BEGIN{FS="\t";OFS="\t"}{print $1"."$2,$3,$4,$5}' \
| while read -r chrom start stop strand ; do
efetch -db nuccore -id $chrom -chr_start $start -chr_stop $stop -strand $strand -format fasta ;
done
>NZ_FQYS01000011.1:137784-139226 Pseudomonas zeshuii strain KACC 15471, whole genome shotgun sequence
TTGAAGCAACTAACGCGCCTGGAAAGTGGAATCGAGGGGCTGGATACCTTATTACGGGGCGGCTTTGTGG
CCGGCGCGTCATATGTCATTCAGGGCCGCCCGGGTTCTGGCAAGACAATTCTGGCCAATCAGATTGCCTT
CAACCATGTGCGCAAGGGTGAGCGTGTCCTGTTCGCCACGTTGTTGTCCGAATCCCATGAGCGGATGTTC |
Cool! |
I am interested in doing something similar but instead starting from a protein accession id, What I'm trying to do is obtain nucleotide fasta files by extracting the protein CDS(gene) as a range of nucleotides from the nucleotide accession number (NZ_XXXXXXXX) that corresponds to the protein WP id from a specific genome assembly. I've been trying to modify the code above but am unsuccessful thus far in producing the right combination of pipes, targets, etc. Any insight/guidance will be greatly appreciated. |
@shigdon -- for your purposes, I suggest using the new NCBI tool Datasets. You can use the command-line tool to download the sequences as shown below. $ head -n3 wpaccs.txt
WP_221685177.1
WP_194155123.1
WP_192827711.1
$ datasets download gene accession --inputfile wpaccs.txt
Downloading: ncbi_dataset.zip 12.5kB done
$ unzip ncbi_dataset.zip
Archive: ncbi_dataset.zip
inflating: README.md
inflating: ncbi_dataset/data/data_report.jsonl
inflating: ncbi_dataset/data/annotation_report.jsonl
inflating: ncbi_dataset/data/gene.fna
inflating: ncbi_dataset/data/protein.faa
inflating: ncbi_dataset/data/dataset_catalog.json
$ grep '>' ncbi_dataset/data/gene.fna | head -n3
>NZ_UEGE01000347.1:c944-1 ampC [protein_accession=WP_164705932.1] [organism=Escherichia coli] [name=CMY2/MIR/ACT/EC family class C beta-lactamase] [gene=ampC]
>NZ_QOZZ01000625.1:c4246-3295 ampC [protein_accession=WP_162814514.1] [organism=Escherichia coli] [name=class C beta-lactamase] [gene=ampC]
>NZ_WJGH01000003.1:105977-107080 ampC [protein_accession=WP_153673307.1] [organism=Escherichia coli] [name=class C beta-lactamase] [gene=ampC] |
Can you tell me how to download gene sequences with 2500 gene ids? |
@ShwetaaPandey -- I suggest using NCBI Datasets for this. You can upload the list of NCBI GeneIDs here, browse the data and download sequence/metadata directly from the web browser. Alternately, you can use the NCBI Datasets tool to do the same from the command line. |
For the nucleotide sequence, could you please explain why efetch -db is nuccore? Is nucleotide same as nuccore? Thanks, @vkkodali ! |
I would like to download the nucleotide sequence associated with a gene ID e.g for gene id 17363256, the nucleotide sequence if I clicked "fasta" link on this page
https://www.ncbi.nlm.nih.gov/gene/?term=17363256
The text was updated successfully, but these errors were encountered: