Important information about NCBI GEO data accessioning

Kadoch Lab NCBI GEO Accessioning Guide

NCBI Gene Expression Omnibus (GEO) is a public functional genomics data repository supporting MIAME-compliant data submissions. Because RNA-seq, ATAC-seq, ChIP-seq, CUT&TAG, and CUT&RUN data are all functional genomics data types commonly generated by the lab, all data associated with published research papers must be archived with NCBI GEO upon publication. This is an opinionated guide for conducting and documenting experiments and computational analyses so that the experimental details necessary for GEO accessioning are properly captured and reported, making the accessioning process straightforward and less error-prone.

The general submission information page at GEO provides a lot of useful information, and I encourage everyone to read through it. GEO accepts data in the form of high-throughput sequencing (HTS) datasets, such as the Illumina data we commonly generate, and this will be the most common type of submission we perform. However, microarray data are also accessioned in GEO, and we occasionally handle microarray data in the lab too. The details below are tailored to HTS datasets, but much of the information applies more generally.

NCBI generally uses Excel spreadsheet templates to gather important data and experimental details, which are parsed by their systems to derive the information needed for the GEO record. Here is the spreadsheet template for HTS data submissions. Instructions and examples are provided therein (see the Instructions and various example tabs) for completing the forms properly, and I outline some important details below. Instructions appear both in-line and as pop-up messages on cells marked with a red triangle in the top-right corner - simply hover over the cell to see the message. Please format your text carefully to avoid poorly formatted records or submission issues.

Example GEO Submission Templates

As part of this repository, I have included three example GEO submission templates from a recent, successful GEO submission: one each for RNA-seq, ATAC-seq, and CUT&TAG (a good proxy for ChIP-seq and CUT&RUN). These should be helpful in completing your own submission template.

Completing the GEO Submission Template

In practice, the computational biologist working on the project will be responsible for accessioning data to GEO. However, they rely heavily on details from the experimentalists, and the computationalist must also keep track of important data processing details. Therefore, I make the following recommendations for how you should approach your work to ensure a smooth process.

Experimentalists

I highly recommend filling out the GEO template for each genomics experiment and submission you perform. Ideally, this is done at the same time as the experiment to ensure important details are properly captured. Once you have completed the fields below, send the file to the computationalist you are working with (if applicable), as that person will fill in additional fields related to the computational analysis. Recommendation: Send this form over before you perform your sequencing - the computationalist may have helpful suggestions for formatting sample names and the like, making everything clearer and easier to work with later, and it also gives them a heads up that data will be coming soon! Once the computationalist has processed the data, they should fill in the additional details and send the file back. At that point, both people should save a copy of the completed submission template in a safe place until the time comes to perform the GEO submission, when final details can be added or modifications made.

In the HTS GEO submission template, you will insert experimental details under the Metadata tab - this is the only place you will need to provide information (the computationalist will deal with everything else). Below is a list of the fields you will need to provide information for - all are mandatory.

  1. STUDY --> title: A title for your study. Usually we end up using the paper's title, so it is fine to use an informative placeholder at first, which can get modified later.
  2. STUDY --> summary (abstract): An abstract for your study. Usually we end up using the paper's abstract, but for now, you can add some important details as a placeholder.
  3. STUDY --> experimental design: Details of the experimental design - see the pop-up instructions. You should fill this out completely when conducting your experiment, as these are important details. Modifications can always be made later.
  4. STUDY --> contributor: You should list yourself, other experimentalists involved with the work, the computationalist, and Cigall as contributors (potentially others, but at least these people). As instructed, ensure you have a comma between your names/initials so the data is properly parsed.
  5. SAMPLES --> library name: This is a unique library name for each sample that was sequenced. Generally, the Sample Name field we record on our sequencing spreadsheet is what you will use here. Standardize the formatting as much as possible and stay away from special characters - that is, any characters other than alphanumerics, "-", "_", or ".".
  6. SAMPLES --> title: This is a unique title for each sample that was sequenced. It will probably take some of the information from the library name field, but could include additional information, and it should be written in a more human-readable format, like a sentence.
  7. SAMPLES --> organism: Self explanatory - the genus and species for the organism. Homo sapiens, Mus musculus, etc.
  8. SAMPLES --> combination of one or more of tissue, cell line, cell type, genotype, or treatment, depending on the context of the samples/cells used and experimental details.
  9. SAMPLES --> molecule: Select the appropriate molecule type from the drop-down menu.
  10. SAMPLES --> single or paired-end: Select single or paired-end from the drop-down menu.
  11. SAMPLES --> instrument model: Select the instrument and model of the sequencer used in the experiment from the drop-down menu.
  12. PROTOCOLS --> growth protocol: Describe the details of how organisms or cells were maintained prior to or during the experiment. You should fill this out completely when conducting your experiment, as these are important details. Modifications can always be made later.
  13. PROTOCOLS --> treatment protocol: Describe the treatments applied in the experiment performed. You should fill this out completely when conducting your experiment, as these are important details. Modifications can always be made later.
  14. PROTOCOLS --> extract protocol: Describe the protocols used to extract and prepare the material for sequencing. You should fill this out completely when conducting your experiment, as these are important details. Modifications can always be made later.
  15. PROTOCOLS --> library construction protocol: Describe the details of the genomics library preparation you performed. You should fill this out completely when conducting your experiment, as these are important details. Modifications can always be made later.
  16. PROTOCOLS --> library strategy: Identify the library strategy: RNA-seq, ATAC-seq, ChIP-seq, CUT&TAG, or CUT&RUN.
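As a quick sanity check on the library name field above, you can screen your sample names for characters outside the GEO-safe set before filling in the template. This is a hypothetical sketch - the file name and library names below are made-up examples, not real lab samples:

```shell
# Made-up example file: one library name per line.
printf 'HEK293T_WT_ATAC_rep1\nHEK293T KO ATAC rep2\n' > library_names.txt

# Flag any name containing characters other than alphanumerics, "-", "_", or ".".
# grep -n prints each offending line with its line number.
grep -nE '[^A-Za-z0-9._-]' library_names.txt || echo "All names look GEO-safe."
```

Here the second name is flagged because it contains spaces; renaming such samples before sequencing saves headaches later.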

Computationalists

I recommend filling out the computational portions of the submission spreadsheet as soon as possible. In most cases, these fields are meant to capture the basic data processing steps - QC, mapping, peak calling, etc. - rather than downstream analysis details, so it should be possible to complete them very soon after the data are processed. Most downstream analyses rely on a matrix of raw or normalized counts, so you should describe all analysis steps that take the raw data through count matrix generation: raw read QC, read quality trimming, read mapping, mapping QC, mapping filtering, peak calling, track generation, count generation, and count normalization. Below is a list of the fields you will need to fill out - all are mandatory. Once they are completed, send the updated submission spreadsheet back to the experimentalist so you both have a copy of the final document, which can be stored in a safe place until the data are archived upon manuscript submission.

  1. SAMPLES --> processed data file: You should provide the names of one or two processed data files. For RNA-seq datasets, you only need to provide a single count matrix text file, which will be the same for all samples (rows) and is placed under the first, required processed data file field. For ATAC-seq, ChIP-seq, CUT&TAG, and CUT&RUN, provide the peaks called for each sample (i.e., MACS broadPeak or narrowPeak files; unique for each sample/row) under the first, required processed data file field and the bigwig track file (unique for each sample/row) under the second processed data file field.
  2. SAMPLES --> raw file: You should provide the names of one or more raw data files - i.e., FASTQ files. When you have single-end data, simply provide the file names for the single-end read FASTQ files in the first, mandatory raw file field. When you have paired-end data, provide the read 1 FASTQ file name in the first, mandatory raw file field and the read 2 FASTQ file name in the second raw file field. If more data were generated for a sample (e.g., technical replicates that were combined for analysis), additional FASTQ files can be added to the third and fourth raw file fields.
  3. PROTOCOLS --> data processing step: As described above, fill in one or more of the data processing step fields to document the computational processing of the data from raw FASTQ to integrated count files, peaks, and tracks. For RNA-seq, I recommend placing information about the read trimming (if applicable) and read mapping in the first field and information about how counts were gathered and normalized in the second field. For ATAC-seq, ChIP-seq, CUT&TAG, and CUT&RUN, I recommend placing the details of the read trimming in the first field, the read mapping in the second field, the mapping filtering (e.g., PCR duplicate filtering) in the third field, the peak calling in the fourth field, the track (bigwig) file generation in the fifth field, and the count gathering and normalization in the sixth field. You can deviate from these recommendations if your experiment requires it, and you can manually add more data processing step lines to include more information. Important: Ensure you document the software and the exact software version used, as well as any non-default options (when only defaults were used, you can state that).
  4. PROTOCOLS --> genome build/assembly: Document the genome version that was mapped to - e.g., hg19 or hg38 for human.
  5. PROTOCOLS --> processed data files format and content: Fill in one or more of these fields to document the contents of the processed data file fields that were completed (see above). For RNA-seq, I recommend filling in a single field documenting the raw counts (e.g., "raw counts for each gene (rows) and sample (columns) in a tabular text format"). For ATAC-seq, ChIP-seq, CUT&TAG, and CUT&RUN, I recommend adding the details of the peak files to the first field (e.g., "Peaks for each treatment and replicate in broadPeak (BED) format") and the details of the track files to the second field (e.g., "Normalized coverage tracks (CPM) for each treatment and replicate in bigwig format.").
  6. PAIRED-END EXPERIMENTS: If you are including paired-end sequencing data in the submission, this section needs to be completed to properly associate the read 1 and read 2 files. Otherwise, it can be ignored. Ensure you place the read 1 file name in file name 1 and associated read 2 file name in file name 2 (these can probably be copied from the SAMPLES section). If you have single-cell data, you may also need to fill in the additional file name fields.
  7. MD5 Checksums tab: MD5 checksums, which are unique alphanumeric strings based on the contents of a file, need to be generated for all files being accessioned with NCBI. This ensures data quality and provenance during the transfer. You can generate the checksum for each file using the command `md5sum <file>`. A checksum must be provided for all of the files listed in the processed data file and raw file fields of the SAMPLES section (see above). Ensure the raw files are in the left table and the processed data files are in the right table.
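The checksums for step 7 can be generated for all files in one pass. A minimal sketch, assuming a hypothetical directory layout (fastq/ for raw files, processed/ for peaks, tracks, and counts) - point the command at wherever your files actually live:

```shell
# Toy stand-in files so the sketch is self-contained; in practice these are
# your real FASTQ and processed files.
mkdir -p fastq processed
printf '@read1\nACGT\n+\nFFFF\n' > fastq/sample1_R1.fastq
printf 'chr1\t100\t200\n' > processed/sample1_peaks.narrowPeak

# One checksum line per file; paste the values into the MD5 Checksums tab,
# with raw files in the left table and processed files in the right table.
md5sum fastq/* processed/* > md5_checksums.txt
cat md5_checksums.txt
```

Saving md5_checksums.txt alongside the submission spreadsheet also lets you re-verify the files just before upload.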

As alluded to above, the computationalist must also gather the data files named in the submission spreadsheet so that they can be uploaded. In general, our existing processes should work for these purposes, as we commonly keep archives of the raw FASTQ files, BAM mapping files, peak files, and bigwig track files. Ensure you properly deposit these files into the Kadoch data repository so that they are there when the time comes to accession the data. Also keep a copy of the count matrices generated from any genomics data, as these are important and will often be archived - we can work on a better system to add these files to our lab repository. I also highly recommend using scripts when processing data to keep track of everything you did, and keeping backups of those scripts - they use minimal space and allow you to regenerate the processed data if it is ever lost or corrupted.
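One lightweight way to follow the scripting recommendation above is to have each processing script record the exact versions of the tools it ran, so the data processing step fields can be filled in accurately later. A sketch under the assumption that versions are logged to a file kept with the processed data; grep and sort are generic, portable stand-ins here - substitute whatever you actually ran (e.g., `bowtie2 --version`, `macs2 --version`):

```shell
# Write the run date and tool versions to a log kept with the processed data.
# grep and sort stand in for your real pipeline tools in this sketch.
{
  echo "Run date: $(date)"
  grep --version | head -n 1
  sort --version | head -n 1
} > software_versions.txt

cat software_versions.txt
```

Committing this log (or the script that writes it) alongside your analysis code means the exact versions are still on hand at accessioning time, even years later.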

Finally, before submitting the spreadsheet or data to NCBI, ensure the information is stored in the most up-to-date version of the submission spreadsheet. NCBI occasionally updates it, and if you completed the template a while ago, it is possible that some minor changes have been made. Reconcile these changes and submit the most up-to-date version!
