diff --git a/vignettes/data-requirement.Rmd b/vignettes/data-requirement.Rmd index 87e08b8..98181ee 100644 --- a/vignettes/data-requirement.Rmd +++ b/vignettes/data-requirement.Rmd @@ -24,21 +24,21 @@ knitr::include_graphics("figures/input_diagram.png") ## A single data frame input -The single data frame input or the data frame in the list needs to contain gene expression time-course data with all replicates. The illustrated diagram below shows the required structure of the `input`. +The single data frame input or the data frame in the list must contain gene expression time-course data with all replicates. The illustrative diagram below shows the required structure of the `input`. ```{r fig-input-table, echo=FALSE, fig.align='center', out.width='75%'} knitr::include_graphics("figures/input_table.png") ``` -This data frame must contain **reference** and **query** expression data which users wish to compare, and the following five columns: +This data frame must contain **reference** and **query** expression data, and the following five columns: - `gene_id`: locus name or unique ID of each gene. -- `accession`: accession or name of the reference and query data to compare. +- `accession`: accession or name of the reference and query data. - `timepoint`: time points of the gene expression data. -- `expression_value`: desired expression values or measure of the abundance of gene or transcripts which users wish to compare. This value can be RPM, RPKM, FPKM, TPM, TMM, DESeq, SCnorm, GeTMM, ComBat-Seq, and raw reads counts. +- `expression_value`: desired expression values or measure of the abundance of gene or transcripts. This value can be RPM, RPKM, FPKM, TPM, TMM, or raw read counts. - `replicate`: biological replicate ID for an expression value at a particular time point. -Below we can see a real example of how the `input` data should look like: +The example below shows how the `input` data should look like: ```{r brapa-data-kable, echo=FALSE} # Load a data frame from the sample data @@ -51,13 +51,13 @@ b_rapa_data[, .SD[1:2], by = accession][, .(gene_id, accession, timepoint, expre ## A list of data frames as an input +If users do not have the data **reference** and **query** joined with the IDs mapped into one single data frame, there is an option of having the input data of a list of data frames. As shown in the illustrative diagram below, the list must contain both **reference** and **query** data frames with the columns as required in the single data frame input (see previous section). + ```{r fig-input-list-tables, echo=FALSE, fig.align='center', out.width='85%'} knitr::include_graphics("figures/input_list_tables.png") ``` -If users do not have the data **reference** and **query** joined with the IDs mapped into one single data frame, there is an option of having the input data of a list of data frames. As shown in the illustrative diagram above, the list must contain both **reference** and **query** data frames with the columns as required in the single data frame input (see previous section). - -Below we can see an example of how the `input` list of data frames should look like: +The example below shows how the `input` list of data frames should look like: ```{r brapa-data-list-df-read, include=FALSE} # Load a data frame from the sample data @@ -108,14 +108,18 @@ list_df #> 6: AT2G45660 Col0 10 72.98476 ERR_ds_klepikova_SRR1661475 ``` -Note here that the elements of the list needs to be named **reference** and **query**, the order of the element will not effect the registration process. +Note here that the elements of the list must be named **reference** and **query**, the order of the element will not effect the registration process. ## A list of vectors as an input +As an alternative of a list of data frame as input data, as shown above, users can also have a list of numerical vectors as their input. The illustrative diagram below shows how the input should look like: + ```{r fig-input-list-vectors, echo=FALSE, fig.align='center', out.width='38%'} knitr::include_graphics("figures/input_list_vectors.png") ``` +The example below shows how the `input` list of vectors can be defined as an input: + ```{r brapa-data-list-num} # Define expression value vectors ref_expressions <- c(1.9, 3.1, 7.8, 31.6, 33.7, 31.5, 131.4, 107.5, 116.7, 112.5, 109.7, 57.4, 50.9) @@ -128,3 +132,5 @@ list_vector <- list( list_vector ``` + +Note that since there is no ID specified on the vectors, `greatR` will assign a unique ID for the reference and query pair data (see more in the [register data > Using other inputs ](https://ruthkr.github.io/greatR/articles/register-data.html#registration-results-1) article). diff --git a/vignettes/process-results.Rmd b/vignettes/process-results.Rmd index 6fa5bad..71e4eb9 100644 --- a/vignettes/process-results.Rmd +++ b/vignettes/process-results.Rmd @@ -39,8 +39,8 @@ registration_results <- system.file("extdata/brapa_arabidopsis_registration.rds" The function `summary()` returns a list with S3 class `summary.res_greatR` containing four different objects: - `summary` is a data frame containing the summary of the registration results (default S3 print). -- `registered_genes` is a vector of gene accessions or IDs which were successfully registered. -- `non_registered_genes` is a vector of non-registered gene accessions or IDs. +- `registered_genes` is a vector of gene IDs which are successfully registered. +- `non_registered_genes` is a vector of non-registered gene IDs. - `reg_params` is a data frame containing the distribution of registration parameters. ```{r get-summary-results, fig.align='center'} @@ -51,7 +51,7 @@ reg_summary$summary |> knitr::kable() ``` -The list of gene accessions which were registered or non-registered can be viewed by calling: +The list of gene IDs which are registered or non-registered can be viewed by calling: ```{r print-accession-of-registered-genes} reg_summary$registered_genes @@ -65,16 +65,24 @@ reg_summary$non_registered_genes The function `plot()` allows users to plot the bivariate distribution of the registration parameters. Non-registered genes can be ignored by selecting `type = "registered"` instead of the default `type = "all"`. Similarly, the marginal distribution type can be changed from `type_dist = "histogram"` (default) to `type_dist = "density"`. -```{r plot-summary-results, fig.align='center', fig.height=4, fig.width=4.5, warning=FALSE} +```r plot( reg_summary, type = "registered" ) ``` +```{r plot-summary-results, echo=FALSE, fig.align='center', fig.height=4, fig.width=4.5, warning=FALSE} +plot( + reg_summary, + type = "registered", + scatterplot_size = c(4, 3.5) +) +``` + ## Plotting registration results -The function `plot()` allows users to plot the registration results of the genes of interest. +The function `plot()` allows users to plot the registration results of the genes of interest (by default only up to the first 25 genes are shown, for more control over this, use the `genes_list` argument). ```{r plot-results, fig.align='center', fig.height=8, fig.width=7, warning=FALSE} # Plot registration result @@ -84,15 +92,15 @@ plot( ) ``` -Notice that the plot includes a label indicating if the particular genes are registered or non-registered, as well as the registration parameters in case the registration was successful. +Notice that the plot includes a label indicating if the particular genes are registered or non-registered, as well as the registration parameters in case the registration is successful. -For more details on the other function parameters, go to `plot()`. +For more details on the other function arguments, go to `plot()`. ## Analysing similarity of expression profiles over time before and after registering ### Calculate sample distance -After registering the data, users can compare the overall similarity between datasets before and after registering using the function `calculate_distance()`. +After registering the data, users can compare the overall similarity between datasets before and after registering using the function `calculate_distance()`. By default all genes are considered in this calculation, this can be changed by using the `genes_list` argument. ```{r get-sample-distance} sample_distance <- calculate_distance(registration_results) @@ -100,8 +108,8 @@ sample_distance <- calculate_distance(registration_results) The function `calculate_distance()` returns a list with S3 class `dist_greatR` of two data frames: -- `result` distance between scaled reference and query expressions using registered time points. -- `original` distance between scaled reference and query expressions using original time points. +- `result` is the distance between scaled reference and query expressions using time points after registration. +- `original` is the distance between scaled reference and query expressions using original time points before registration. ### Plot heatmap of sample distances diff --git a/vignettes/register-data-manually.Rmd b/vignettes/register-data-manually.Rmd index 78de0cf..71d1934 100644 --- a/vignettes/register-data-manually.Rmd +++ b/vignettes/register-data-manually.Rmd @@ -14,11 +14,11 @@ knitr::opts_chunk$set( ) ``` -This article will show users how to register data using some pre-defined shift and stretch parameters. This demo will use one of the genes from the sample data provided by the package. Users can check whether their sample data can be registered using the manually specified shift and stretch values. +This article will show users how to register data using some specified shift and stretch parameters. This demo will use one of the genes from the sample data provided by the package. ## Load sample data -`greatR` package provides an example of data frame containing two different species *A. thaliana* and *B. rapa* with two and three different replicates, respectively. This data frame can be read as follows: +`greatR` provides an example of data frame containing two different species *A. thaliana* and *B. rapa* with two and three different replicates, respectively. This data frame can be read as follows: ```{r load-greatR, message=FALSE} # Load the package @@ -34,11 +34,13 @@ b_rapa_data <- system.file("extdata/brapa_arabidopsis_data.csv", package = "grea ## Registering without optimisation +The illustrative table below shows the major differences between runing `register()` with and without optimisation. + ```{r reg-data-manual, echo=FALSE, fig.align='center', out.width='65%'} knitr::include_graphics("figures/registration_params_table.png") ``` -Here, we will only use a single gene with `gene_id = "BRAA03G023790.3C"` from the sample data. +Here, we will only use a single gene with `gene_id = "BRAA03G023790.3C"` from the sample data, but this feature can also be used when registering multiple genes. ```{r get-subset-data, message=FALSE, warning=FALSE} gene_BRAA03G023790.3C_data <- b_rapa_data[gene_id == "BRAA03G023790.3C"] @@ -87,11 +89,11 @@ As we can see, using the given stretch and shift parameter above, the *B. rapa* ## Registering multiple gene with different pre-defined registration parameters -Users can also register multiple different genes using their pre-defined parameters. Similar to registration process above, users need to set `use_optimisation = FALSE` to disable the automated optimisation process. +Users can also specify a list of parameters rather than a single value. Similar to the registration process above, users need to set `use_optimisation = FALSE` to disable the automated optimisation process. ```{r register-multiple-gene-manual, message=FALSE, warning=FALSE, eval=FALSE} registration_results <- register( - b_rapa_data, + gene_BRAA03G023790.3C_data, reference = "Ro18", query = "Col0", scaling_method = "z-score", @@ -100,13 +102,11 @@ registration_results <- register( use_optimisation = FALSE ) #> ── Validating input data ─────────────────────────────────────────────────────── -#> ℹ Will process 10 genes. +#> ℹ Will process 1 gene. #> ℹ Using estimated standard deviation, as no `exp_sd` was provided. #> ℹ Using `scaling_method` = "z-score". #> #> ── Starting manual registration ──────────────────────────────────────────────── #> ℹ Using `overlapping_percent` = 50% as a registration criterion. -#> ✔ Applying registration for genes (10/10) [12.4s] +#> ✔ Applying registration for genes (1/1) [1.3s] ``` - -Similar to the previous registration process, users can analyse further the registration result by following the [processing registration results](https://ruthkr.github.io/greatR/articles/process-results.html) vignette. diff --git a/vignettes/register-data.Rmd b/vignettes/register-data.Rmd index 3180e2a..11e8830 100644 --- a/vignettes/register-data.Rmd +++ b/vignettes/register-data.Rmd @@ -14,17 +14,17 @@ knitr::opts_chunk$set( ) ``` -This article will show users how to register data using the sample data provided by the package. Given an input data, users can directly register the data as illustrated below. - -<br> +This article will show users how to register data using different types of input, as illustrated below. ```{r reg-data, echo=FALSE, fig.align='center', out.width='100%'} knitr::include_graphics("figures/registration_opt_diagram.png") ``` -## Loading sample data +## Using data frame input + +### Loading sample data -`greatR` package provides an example of data frame containing two different species *A. thaliana* and *B. rapa* with two and three different replicates, respectively. This data frame can be read as follows: +`greatR` provides an example of data frame containing two different species *A. thaliana* and *B. rapa* with two and three different replicates, respectively. This data frame can be read as follows: ```{r load-greatR, message=FALSE} # Load the package @@ -32,7 +32,7 @@ library(greatR) library(data.table) ``` -```{r brapa-data, message=FALSE, warning=FALSE} +```{r load-brapa-data, message=FALSE, warning=FALSE} # Load a data frame from the sample data b_rapa_data <- system.file("extdata/brapa_arabidopsis_data.csv", package = "greatR") |> data.table::fread() @@ -41,7 +41,7 @@ b_rapa_data <- system.file("extdata/brapa_arabidopsis_data.csv", package = "grea Note that the data has all of five columns required by the package: ```{r brapa-data-kable-clean, eval=FALSE} -b_rapa_data[, .SD[1:6], by = accession] |> +b_rapa_data |> knitr::kable() ``` @@ -50,9 +50,9 @@ b_rapa_data[, .SD[1:6], by = accession][, .(gene_id, accession, timepoint, expre knitr::kable() ``` -## Registering the data +### Registering the data -To align gene expression time-course between *Arabidopsis* Col-0 and *B. rapa* Ro18, we can use function `register()`. By default, the best registration parameters are optimised via Nelder-Mead (`optimisation_method = "nm"`). When using the default `use_optimisation = TRUE`, the stretch and shift search space is automatically estimated. For more details on the other function parameters, go to `register()`. +To align gene expression time-course between *Arabidopsis* Col-0 and *B. rapa* Ro18, we can use the function `register()`. When using the default `use_optimisation = TRUE`, `greatR` will find the best stretch and shift parameters through optimisation. For more details on the other function arguments, go to `register()`. ```{r register-data-raw, message=FALSE, warning=FALSE, eval=FALSE} registration_results <- register( @@ -73,7 +73,7 @@ registration_results <- register( #> ✔ Optimising registration parameters for genes (10/10) [2s] ``` -When dealing with thousands of genes, the user may want to use the argument `num_cores` to run the registration using parallel processing and speed up the registration process. +When dealing with thousands of genes, users may speed up the registration process by using the argument `num_cores` to run the registration using parallel processing. ```{r register-data-raw-detect-cores, message=FALSE, warning=FALSE, eval=FALSE} parallel::detectCores() @@ -90,35 +90,11 @@ registration_results <- register( ) ``` -As noted in the [input data requirements](https://ruthkr.github.io/greatR/articles/data-requirement.html) vignette, `register()` also accepts a list of data frames or a list of reference and query vectors as `input`: - -```{r register-data-list-vectors, message=FALSE, warning=FALSE} -registration_results_vectors <- register( - list( - reference = c(1.9, 3.1, 7.8, 31.6, 33.7, 31.5, 131.4, 107.5, 116.7, 112.5, 109.7, 57.4, 50.9), - query = c(14, 12.1, 15.9, 47, 30.9, 50.5, 80.1, 67.4, 72.9, 61.7) - ), - reference = "Ro18", - query = "Col0", - scaling_method = "z-score" -) -#> ── Validating input data ─────────────────────────────────────────────────────── -#> ℹ Will process 1 gene. -#> ℹ Using estimated standard deviation, as no `exp_sd` was provided. -#> ℹ Using `scaling_method` = "z-score". -#> -#> ── Starting registration with optimisation ───────────────────────────────────── -#> ℹ Using L-BFGS-B optimisation method. -#> ℹ Using computed stretches and shifts search space limits. -#> ℹ Using `overlapping_percent` = 50% as a registration criterion. -#> ✔ Optimising registration parameters for genes (1/1) [170ms] -``` - -## Registration results +### Registration results The function `register()` returns a list with S3 class `res_greatR` containing three different objects: -- `data` is a data frame containing the scaled expression data and an additional `timepoint_reg` column which is a result of registered time points by applying the registration parameters to the query data. +- `data` is a data frame containing the expression data and an additional `timepoint_reg` column which is a result of registered time points by applying the registration parameters to the query data. - `model_comparison` is a data frame containing (a) the optimal stretch and shift for each `gene_id` and (b) the difference between Bayesian Information Criterion for the separate model and for the combined model (`BIC_diff`) after applying optimal registration parameters for each gene. If the value of `BIC_diff < 0`, then expression dynamics between reference and query data can be registered (`registered = TRUE`). (Default S3 print). - `fun_args` is a list of arguments used when calling the function (`reference`, `query`, `scaling_method`, ...). @@ -135,10 +111,51 @@ registration_results$model_comparison |> knitr::kable() ``` -From the sample data above, we can see that for nine of the ten genes, `registered = TRUE`, meaning that reference and query data between those ten genes can be aligned or registered. These data frame outputs can further be summarised and visualised; see the documentation on the [processing registration results](https://ruthkr.github.io/greatR/articles/process-results.html) article. +From the sample data above, we can see that for nine out of ten genes, `registered = TRUE`, meaning that reference and query data between those nine genes can be aligned or registered. These data frame outputs can further be summarised and visualised; see the documentation on the [processing registration results](https://ruthkr.github.io/greatR/articles/process-results.html) article. + +## Using other inputs + +### Loading sample data + +As noted in the [input data requirements](https://ruthkr.github.io/greatR/articles/data-requirement.html) article, `register()` also accepts a list of data frames or a list of reference and query vectors as `input`: + +```{r load-input-vector, message=FALSE, warning=FALSE} +# Define expression value vectors +ref_expressions <- c(1.9, 3.1, 7.8, 31.6, 33.7, 31.5, 131.4, 107.5, 116.7, 112.5, 109.7, 57.4, 50.9) +query_expressions <- c(14, 12.1, 15.9, 47, 30.9, 50.5, 80.1, 67.4, 72.9, 61.7) + +list_vector <- list( + reference = ref_expressions, + query = query_expressions +) +``` + +### Registering the data + +As previously shown, to register the input list, users can run the function `register()`: + +```{r register-data-list-vectors, message=FALSE, warning=FALSE} +registration_results_vectors <- register( + list_vector, + reference = "Ref", + query = "Query", + scaling_method = "z-score" +) +#> ── Validating input data ─────────────────────────────────────────────────────── +#> ℹ Will process 1 gene. +#> ℹ Using estimated standard deviation, as no `exp_sd` was provided. +#> ℹ Using `scaling_method` = "z-score". +#> +#> ── Starting registration with optimisation ───────────────────────────────────── +#> ℹ Using L-BFGS-B optimisation method. +#> ℹ Using computed stretches and shifts search space limits. +#> ℹ Using `overlapping_percent` = 50% as a registration criterion. +#> ✔ Optimising registration parameters for genes (1/1) [170ms] +``` +### Registration results -Note that when, using a list of vectors, the `gene_id` is automatically generated. This is done to ensure uniqueness of the results. +The registration result object will have the same structure as when using a data frame as an input. Since no ID is explicitly defined in the input vector list, a unique `gene_id` will be automatically generated for the reference and query pair. ```{r get-model-summary-data-vectors, warning=FALSE} registration_results_vectors$model_comparison |>