2. Input Format
Data from a cohort of patients is represented as a dataframe with 7 fields, where every row represents one genomic alteration annotated for the analysis.
Field | Type | Description |
---|---|---|
Misc | string | A custom annotation |
patientID | string | Patient ID, without spaces |
variantID | string | Alteration ID, without spaces or dash/hyphen (`-`) symbols |
cluster | string | Group ID (e.g., a clone, with CCF data) |
is.driver | logical | TRUE if the alteration is annotated as driver |
is.clonal | logical | TRUE if the group is clonal (truncal); there should be only one such clonal group |
CCF | string | A parsable format for storing input CCFs or binary data |
The input dataframe has the same structure for both CCF and binary data.
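As a minimal sketch of this layout, the following builds a toy input dataframe in base R. The patient, alteration and region names are made up for illustration and do not come from any real cohort.

```r
# Toy cohort with the seven required fields (all values are illustrative).
toy_input = data.frame(
  Misc      = c("UNKNOWN", "UNKNOWN", "UNKNOWN"),
  patientID = c("P1", "P1", "P2"),
  variantID = c("GENE_A", "GENE_B", "GENE_A"),
  cluster   = c("1", "2", "1"),
  is.driver = c(TRUE, FALSE, TRUE),
  is.clonal = c(TRUE, FALSE, TRUE),
  CCF       = c("R1:1;R2:1", "R1:0.4;R2:0", "R1:1;R2:1;R3:1"),
  stringsAsFactors = FALSE
)

# Inspect column names and types.
str(toy_input)
```

Note that the two patients carry a different number of regions in `CCF`, which is why that field is stored as a string.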
An alteration can be any SNV, larger chromosomal rearrangement, or other covariate that can be encoded in CCF or binary format. The `variantID` field of driver alterations (`is.driver = TRUE`) is matched to detect occurrences in multiple patients and to correlate trajectories; a `variantID` must be unique, and can appear only once in a patient.
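These constraints can be checked before running an analysis. The sketch below is one possible way to do so in base R; `check_variant_ids` is a hypothetical helper, not a REVOLVER function, and it assumes that uniqueness is meant within each patient.

```r
# Hypothetical helper: report variantID values that break the constraints.
check_variant_ids = function(d)
{
  # Identifiers must not contain spaces or hyphens.
  bad_chars = grepl("[ -]", d$variantID)

  # A variantID may appear at most once within each patient.
  repeated = duplicated(d[, c("patientID", "variantID")])

  list(bad_chars = d$variantID[bad_chars], repeated = d$variantID[repeated])
}

# Illustrative data: one repeated ID in P1 and one ID with a hyphen.
d = data.frame(
  patientID = c("P1", "P1", "P1", "P2"),
  variantID = c("BRAF", "BRAF", "TP53-mut", "BRAF"),
  stringsAsFactors = FALSE
)

res = check_variant_ids(d)
res
```

The same `variantID` in a different patient (`BRAF` in `P2`) is not flagged, since matching across patients is the intended use.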
The ID can be whatever you find most suitable for your analysis, for instance:
- a Hugo_Symbol (`BRAF`)
- a name for a well-known SNV (`BRAF_600E`)
- a reference to some cytoband (`3q26.32`)
- your custom annotation (`MyFavoritePathway`).
Alterations are also associated with groups (via `cluster`), which constitute the nodes of the computed trees. A group can have 0 drivers annotated, but every patient should have at least one driver to be analyzed with REVOLVER.
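A quick way to spot patients that do not meet the driver requirement is to count drivers per patient. This is a sketch on made-up data, not a check taken from the package.

```r
# Illustrative data: P1 has one driver, P2 has none.
d = data.frame(
  patientID = c("P1", "P1", "P2", "P2"),
  cluster   = c("1", "2", "1", "2"),
  is.driver = c(TRUE, FALSE, FALSE, FALSE),
  stringsAsFactors = FALSE
)

# Number of annotated drivers per patient.
drivers_per_patient = tapply(d$is.driver, d$patientID, sum)

# Patients with no driver at all.
no_driver = names(drivers_per_patient)[drivers_per_patient == 0]
no_driver
```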
See also Guidelines if you are interested in modelling parallel evolution.
The `CCF` field represents either:
- real-valued CCF values (in [0, 1]),
- or input binary values (either 0 or 1).
Since patients can have a different number of associated samples/regions, `CCF` is a general string. The format that we propose is simple and easy to parse: `R1:0.86;R2:1;R3:1` means a CCF value of 0.86 in region `R1`, 1 in `R2`, and so on. Binary data can be encoded in the same format, e.g. `R1:1;R2:1;R3:1`. If you use this format, a possible parsing function is:
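```r
CCF_parser = function(x)
{
  # Split the string into "region:value" tokens
  tk = strsplit(x, ';')[[1]]
  # Split each token, giving an alternating vector: region, value, region, ...
  tk = unlist(strsplit(tk, ':'))

  # Odd positions hold region names, even positions hold values
  samples = tk[seq(1, length(tk), 2)]
  values = tk[seq(2, length(tk), 2)]
  names(values) = samples

  # Values are returned as character strings, named by region
  return(values)
}
```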
This function is available as `revolver:::CCF_parser`.
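Since the parser works by splitting a string, the values it returns are character strings. The sketch below shows one way to obtain numbers and to check whether a string holds CCF or binary data; `ccf_to_numeric` is a hypothetical helper written for this example, not part of REVOLVER.

```r
# Hypothetical helper: parse a CCF string into a named numeric vector,
# using the same ";" and ":" splitting as the parser shown above.
ccf_to_numeric = function(x)
{
  tk = unlist(strsplit(strsplit(x, ';')[[1]], ':'))
  values = as.numeric(tk[seq(2, length(tk), 2)])
  names(values) = tk[seq(1, length(tk), 2)]
  values
}

v = ccf_to_numeric("R1:0.86;R2:1;R3:1")

# All values should lie in [0, 1]; binary data use only 0 and 1.
in_range  = all(v >= 0 & v <= 1)
is_binary = all(v %in% c(0, 1))
```

Here `in_range` is `TRUE` and `is_binary` is `FALSE`, because region `R1` carries a real-valued CCF.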
An example binary dataset is the following (we subset it to only driver alterations).
```
> head(dataset[dataset$is.driver, ])
Misc    patientID variantID cluster is.driver is.clonal CCF
UNKNOWN EV001     ABHD11    1       TRUE      FALSE     R1:1;R2:1;R3:1;R5:1;R8:1;R9:1;R4:0;M1:0;M2a:0;M2b:0
UNKNOWN EV001     ADAMTS10  2       TRUE      FALSE     R1:0;R2:0;R3:0;R5:0;R8:0;R9:1;R4:0;M1:0;M2a:0;M2b:0
UNKNOWN EV001     ADAMTSL4  3       TRUE      FALSE     R1:0;R2:0;R3:0;R5:0;R8:0;R9:0;R4:0;M1:1;M2a:1;M2b:0
UNKNOWN EV001     AKAP8     4       TRUE      FALSE     R1:1;R2:1;R3:0;R5:1;R8:1;R9:1;R4:0;M1:1;M2a:1;M2b:1
UNKNOWN EV001     AKAP9     5       TRUE      FALSE     R1:0;R2:0;R3:0;R5:0;R8:0;R9:0;R4:0;M1:1;M2a:1;M2b:1
UNKNOWN EV001     ALKBH8    6       TRUE      FALSE     R1:0;R2:0;R3:0;R5:0;R8:0;R9:0;R4:1;M1:1;M2a:1;M2b:1
UNKNOWN EV001     ALS2CR12  7       TRUE      FALSE     R1:1;R2:1;R3:0;R5:1;R8:1;R9:1;R4:1;M1:1;M2a:1;M2b:0
UNKNOWN EV001     ANKRD26   2       TRUE      FALSE     R1:0;R2:0;R3:0;R5:0;R8:0;R9:1;R4:0;M1:0;M2a:0;M2b:0
UNKNOWN EV001     ANO5      8       TRUE      TRUE      R1:1;R2:1;R3:1;R5:1;R8:1;R9:1;R4:1;M1:1;M2a:1;M2b:1
UNKNOWN EV001     ATXN1     8       TRUE      TRUE      R1:1;R2:1;R3:1;R5:1;R8:1;R9:1;R4:1;M1:1;M2a:1;M2b:1
UNKNOWN EV001     BCAS2     8       TRUE      TRUE      R1:1;R2:1;R3:1;R5:1;R8:1;R9:1;R4:1;M1:1;M2a:1;M2b:1
UNKNOWN EV001     BCL11A    9       TRUE      FALSE     R1:0;R2:0;R3:0;R5:0;R8:0;R9:0;R4:1;M1:0;M2a:0;M2b:0
```
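For a single patient, the `CCF` strings can be stacked into an alterations-by-regions matrix. The sketch below does this for two of the rows listed above; `to_row` is a hypothetical helper, and the approach assumes every string lists the same regions in the same order.

```r
# Two CCF strings copied from the example rows above.
ccf = c(
  ABHD11   = "R1:1;R2:1;R3:1;R5:1;R8:1;R9:1;R4:0;M1:0;M2a:0;M2b:0",
  ADAMTS10 = "R1:0;R2:0;R3:0;R5:0;R8:0;R9:1;R4:0;M1:0;M2a:0;M2b:0"
)

# Hypothetical helper: split one string into a named numeric vector.
to_row = function(x)
{
  tk = unlist(strsplit(strsplit(x, ';')[[1]], ':'))
  setNames(as.numeric(tk[seq(2, length(tk), 2)]), tk[seq(1, length(tk), 2)])
}

# One row per alteration, one column per region.
m = t(sapply(ccf, to_row))
m
```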