-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathtraining_test_regions.README
66 lines (57 loc) · 4.32 KB
/
training_test_regions.README
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
# Directory Structure Format
.
├── peaks.all_input_regions.encid.bed.gz # Peaks input to the bpnet training script
├── nonpeaks.all_input_regions.encid.bed.gz # Non peaks input to the bpnet training script
├── logs.training_test_regions.encid # folder containing log files for peak and nonpeak generation scripts
│
├── fold_0
│ ├── cv_params.fold_0.json # training, validation and test chromosomes used in fold 0
│ ├── peaks.trainingset.fold_0.encid.bed.gz # peaks used in training set of fold 0 model
│ ├── nonpeaks.trainingset.fold_0.encid.bed.gz # nonpeaks used in training set of fold 0 model
│ ├── peaks.validationset.fold_0.encid.bed.gz # peaks used in validation set of fold 0 model
│ ├── nonpeaks.validationset.fold_0.encid.bed.gz # nonpeaks used in validation set of fold 0 model
│ ├── peaks.testset.fold_0.encid.bed.gz # peaks used in test set of fold 0 model
│ ├── nonpeaks.testset.fold_0.encid.bed.gz # nonpeaks used in test set of fold 0 model
│ └── logs.training_test_regions.fold_0.encid # folder containing log files for training bpnet model on fold 0
│
├── fold_1
│ └── ... # similar directory structure as fold_0 directory above
│
├── fold_2
│ └── ... # similar directory structure as fold_0 directory above
│
├── fold_3
│ └── ... # similar directory structure as fold_0 directory above
│
└── fold_4
└── ... # similar directory structure as fold_0 directory above
# Bed File Format for Peaks
* All the bed files are in narrowpeak format with 10 columns.
1) chrom - Name of the chromosome (or contig, scaffold, etc.).
2) chromStart - The starting position of the feature in the chromosome or scaffold. The first base in a chromosome is numbered 0.
3) chromEnd - The ending position of the feature in the chromosome or scaffold. The chromEnd base is not included in the display of the feature. For example, the first 100 bases of a chromosome are defined as chromStart=0, chromEnd=100, and span the bases numbered 0-99.
4) name - Name given to a region (preferably unique). Use "." if no name is assigned.
5) weight - Used to indicate if the region is in a peak or background for model training purposes. '1' indicates a peak region, 0 indicates a gc-matched negative region.
6) strand - +/- to denote strand or orientation (whenever applicable). Use "." if no orientation is assigned.
7) signalValue - Measurement of overall (usually, average) enrichment for the region.
8) pValue - Measurement of statistical significance (-log10). Use -1 if no pValue is assigned.
9) qValue - Measurement of statistical significance using false discovery rate (-log10). Use -1 if no qValue is assigned.
10) peak - Point-source called for this peak; 0-based offset from chromStart. Use -1 if no point-source called.
# Bed File Format for Nonpeaks
* All the bed files are in narrowpeak format with 10 columns.
1) chrom - Name of the chromosome (or contig, scaffold, etc.).
2) chromStart - The starting position of the feature in the chromosome or scaffold. The first base in a chromosome is numbered 0.
3) chromEnd - The ending position of the feature in the chromosome or scaffold. The chromEnd base is not included in the display of the feature. For example, the first 100 bases of a chromosome are defined as chromStart=0, chromEnd=100, and span the bases numbered 0-99.
4) empty field - "."
5) weight - Used to indicate if the region is in a peak or background for model training purposes. '1' indicates a peak region, 0 indicates a gc-matched negative region.
6) strand - +/- to denote strand or orientation (whenever applicable). Use "." if no orientation is assigned.
7) empty field - 0
8) empty field - 0
9) empty field - 0
10) midpoint = (chromEnd-chromStart)/2
# Format of file `cv_params.fold_0.json`
A dictionary with following (key,value) pairs,
1) ("CV_type", "chr_holdout")
2) ("train", list_of_chrs_trainingset)
3) ("valid", list_of_chrs_validationset)
4) ("test", list_of_chrs_testset)