Skip to content

Latest commit

 

History

History
96 lines (67 loc) · 7.34 KB

README.md

File metadata and controls

96 lines (67 loc) · 7.34 KB

Introduction

This repository is designed to hold crowdsourced genomics data from your new project. It is a generic data structure designed to hold multiple instances of genomics data and links to external sources of data too large to store on GitHub. You can fork it and start your own repository for your own projects.

If you want to create a new crowdsourced project just follow these instructions or get in touch through [email protected]

How to use this repository

Once you've forked your new repository for your new project, you'll need to know how to populate it.

The directory (folder) structure

The folders in this structure are designed to be semantically organised, so that you should be able to follow them and find what you want easily and intuitively. They start off like this:

.
|____project_name
	|____org_1
	|____org_2
	|____project.info
	|____README
	|____org.README
	|____strain.README
	|____reads.README
	|____assembly.README
	|____alignments.README
	|____annotations.README
	|____project.README

The base folder is called project_name, and it contains the project level information. Immediately inside this folder is a file called README (which is the file you are reading), project.info (which contains the project level metadata), a load of files ending in .README that define the metadata required for datatypes lower in the structure and one folder each for however many organisms you will have in this project (here we have two, org_1 and org_2). The names of the folders aren't important, only what they contain. So if you want to work on E.coli then you could rename one of the folders to e_coli.

###organism folders

The organism folder contains all information about just one organism, org_1 or e_coli or whatever. It looks like this:

org_x
|____org.info
|____strain_1
|____strain_2

Details about the organism are in the org.info metadata file, one of which is required for every organism in the project. org.README in the project folder contains details on the required metadata for the org.info file.

The organism folder also holds the strain folders, where strain_x is a particular strain or isolate and could be renamed to 0104_H4, K12 or whatever. strain.README is a file that specifies the metadata require for each strain.

###strain folders

These folders are where the actual data are. The enclosing folders exist really to help stop all the data sets from getting mixed up with themselves. Once in the strain folder we find specific datasets. Each strain folder contains two metadata files and 4 sub-folders.

strain_x
|____alignments
|____annotations
|____assemblies
|____reads
|____strain.info

The metadata file strain.info contains the metadata for the strain/isolate/ecotype analysed in this folder. The metadata are specified in the strain.README file in the project folder. One strain.info is required for every strain folder. Each sub-folder holds all the different datasets for the different types of genomic data in the repository. Just four types of dataset are identified, (alignments, annotations, assemblies and reads) but this could be extended by adding a new folder for a new datatype.

###data folders The four different types of data folder are able to hold the majority of datasets we expect a genomics project will generate. There is nothing stopping you from creating new ones of your own.

####reads This folder is designed to contain links to the raw read files, e.g fastq files at other locations around the web. These files are too large in general, to be hosted via GitHub, but text links to them aren't. If you want the actual reads you should follow instructions in the read_set.info file which should explain how to get them as well as containing the metadata specified in reads.README in the project folder. Each read set should have its own folder and its own read_set.info.

####assemblies. This folder is designed to contain the results of assemblies carried out with reads in the reads folder e.g fasta files of contigs or AGP files of assemblies. In most circumstances the fasta files will not be too big for GitHub so can be directly in the repository and not just linked. The assembly could be of genomic or transcriptomic assembly or anything else exotic along those lines. The details of how each of these assemblies was made should go in the assembly.info according to the specification in assembly.README.

####alignments This folder designed to contain the results of alignments of one data set against another. Usually this will be reads in the reads folder to assemblies in the assemblies folder, but can be some other combination as required. It may contain links to large and unwieldy files, such as BAM, or directly contain text based alignments of reasonable size, e.g SAM or even BLAST output files. The details of how each of these alignments was made should go in the alignment.info according to the specification in alignment.README.

####annotations This folder is designed to contain the results of annotations of assemblies in the assemblies folder. Annotations is used in a broad sense and could include gene calls in GFF format, or just SNP calls in a proprietary text format. It may contain links to very large files stored elsewhere on the web as well as those directly in the repository. As in the other folders, annotation.README in the project folder contains details on the required metadata for the annotation.info file. Every annotation folder should have one annotation.info file.

####other data types - extending the directory structure Extending the structure to hold data types we haven't thought of should be dead easy, just add a new folder in the strain folder. Make sure there is a .info file for the new datatype and a metadata definition in a .README file in the project_name folder

###File address

With this data structure, each data file has a unique, semantic and (hopefully) logical place and address. For example: project_name/org_1/strain_1/alignments/strain1_vs_strain2/my_bam_file or crowd_sourcing/e_coli/k12/reads/bgi_sequenced_100_paired_150_insert/lane1.fq

The .info files

The files ending in .info contain metadata describing the data according to the specification in .README files. Each .info should describe one discrete unit of data, so a read_set.info might describe one sequencing run or set of reads from a few lanes on a sequencer for which all the metadata are the same. Different read sets should be in different folders and therefore need different .info. For example, crowd_sourcing/e_coli/k12/reads/bgi_sequenced_100_paired_150_insert/ would be different from crowd_sourcing/e_coli/k12/reads/tgac_sequenced_150_single/ so would need a different .info though all the reads of that type would be in the same folder so would be covered by the same .info

The .README files and metadata standards

.README files are a place to specify what should be captured by the individual .info files and are very project specific, so are kept blank in this empty repository. You'll need to decide what is appropriate at a project level.

Citations

The idea for and layout of this repository borrows very heavily from the repository created for the E.coli 0104:H4 outbreak of 2011.