Input file problem #124
Hi Alla, After reviewing your error messages, I noticed the issue likely stems from your TSV file format. Looking at the error message, it seems the first line contains "# Constructed from biom file" which is causing the parsing issues. Could you try:
This should resolve the parsing error and allow ggpicrust2 to properly read your feature table. Let me know if you need any further assistance.
Best regards,
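A minimal sketch of one way to drop that first line, using only base R. The file contents below are a made-up stand-in for a biom-converted table, and the `#OTU ID` header is an assumption about what the converter wrote; the real file may differ.

```r
# Build a tiny stand-in for a biom-converted TSV (hypothetical contents).
tsv_path <- tempfile(fileext = ".tsv")
writeLines(c(
  "# Constructed from biom file",
  "#OTU ID\tS1\tS2",
  "K00001\t10\t0",
  "K00002\t3\t7"
), tsv_path)

# Drop the leading comment line only if it is actually present.
raw_lines <- readLines(tsv_path)
if (startsWith(raw_lines[1], "# Constructed from biom file")) {
  raw_lines <- raw_lines[-1]
}

# Write the cleaned table to a new file so the original stays untouched.
clean_path <- tempfile(fileext = ".tsv")
writeLines(raw_lines, clean_path)

# read.delim() does not treat "#" as a comment character by default,
# so the "#OTU ID" header survives; check.names = FALSE keeps it verbatim.
ko_table <- read.delim(clean_path, check.names = FALSE)
print(dim(ko_table))
```

The cleaned file (`clean_path` here) could then be passed on in place of the original path.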
Hi, but I still can't run the analysis. I got other errors after that:
# Run ggpicrust2 with input file path
Starting the ggpicrust2 analysis... Converting KO to KEGG... Loading data from file... ℹ Use Sample names extracted. Starting pathway annotation... The number of statistically significant pathways exceeds the database's query limit. Please consider breaking down the analysis into smaller queries or selecting a subset of pathways for further investigation. Returning DAA results filtered annotation data frame... The following pathways are missing annotations and have been excluded: ko05340, ko00564, ko00562, ko00563, ko03030, ko00561, ko00440, ko00250, ko04062, ko00740, ko00195, ko04650, ko03450, ko00920, ko00311, ko00310, ko04146, ko00600, ko04140, ko04142, ko00604, ko04260, ko05142, ko04540, ko04710, ko04712, ko00909, ko00513, ko05110, ko04974, ko04976, ko00450, ko01051, ko00565, ko00904, ko00524, ko00300, ko00905, ko00402, ko03440, ko00750, ko00950, ko05140, ko00592, ko00591, ko00590, ko00062, ko04662, ko03070, ko00253, ko03060, ko04370, ko04730, ko04740, ko00380, ko00500, ko05120, ko04666, ko04966, ko05322, ko04964, ko05320, ko04962, ko04960, ko04660, ko00625, ko00624, ko00627, ko00626, ko00623, ko00622, ko00270, ko04380, ko00941, ko00943, ko00100, ko00945, ko01057, ko01056, ko05016, ko01058, ko04145, ko00071, ko00072, ko04360, ko05219, ko05218, ko05216, ko05215, ko05213, ko05211, ko01055, ko00902, ko05330, ko00534, ko04910, ko00531, ko04916, ko00533, ko00532, ko00360, ko00633, ko00363, ko00364, ko05130, ko00121, ko04914, ko00130, ko03050, ko00361, ko00040, ko00730, ko00362, ko01040, ko00603, ko03018, ko04270, ko00281, ko00280, ko03013, ko04626, ko05200, ko00601, ko03015, ko00312, ko05143, ko00523, ko00520, ko00521, ko05146, ko00052, ko00051, ko00400, ko04020, ko00350, ko00480, ko00643, ko00640, ko00720, ko00120, ko00965, ko04614, ko04340, ko00980, ko00410, ko00983, ko05150, ko00791, ko05131, ko04711, ko00020, ko00710, ko00196, ko02060, ko00340, ko00785, ko00550, ko00650, ko03320, ko04744, ko04745, ko00522, ko04612, ko04621, ko04620, ko04623, ko04622, ko04971, 
ko00460, ko04970, ko00830, ko00780, ko00511, ko00970, ko00030, ko00232, ko00230, ko04120, ko04350, ko00540, ko03022, ko03020, ko00982, ko04630, ko03010, ko05100, ko00331, ko05310, ko00908, ko04930, ko04320, ko03430, ko00906, ko00901, ko04520, ko00903, ko00471, ko00472, ko00473, ko04510, ko00942, ko04810, ko04210, ko00240, ko04012, ko04011, ko00944, ko04113, ko04640, ko04310, ko03420, ko04912, ko00670, ko04672, ko04920, ko05160, ko04144, ko00930, ko04112, ko04720, ko04722, ko04075
I checked the metadata file and it looks OK to me: metadata$
Thank you,
Hi Alla, I see you've resolved the first issue but are now encountering errors with pathway annotations and visualizations. From the error messages, it seems the all-in-one ggpicrust2() function is where the analysis breaks down. I suggest using our step-by-step pipeline instead, which gives you more control over each stage of the analysis. You can follow these steps:
kegg_abundance <- ko2kegg_abundance(file = abundance_file)
daa_results_df <- pathway_daa(abundance = kegg_abundance,
                              metadata = metadata,
                              group = "Sample type",
                              daa_method = "LinDA",
                              select.taxa = NULL,
                              reference = "Controls")
pathway_annotation_df <- pathway_annotation(pathway = "KO",
                                            daa_results_df = daa_results_df,
                                            ko_to_kegg = TRUE)
pathway_errorbar_plot <- pathway_errorbar(abundance = kegg_abundance,
                                          daa_results_df = daa_results_df,
                                          pathway_annotation_df = pathway_annotation_df,
                                          group = "Sample type",
                                          p_values_threshold = 0.05,
                                          order = "pathway_class",
                                          select_pathway = NULL,
                                          p_value_bar = TRUE,
                                          x_lab = "pathway_name")
This approach will help you identify where exactly the analysis might be failing and give you more flexibility in adjusting parameters at each step. Let me know if you need any clarification or run into other issues.
Best regards,
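Before running the differential abundance step, it may help to confirm that the value passed as `reference` really is one of the levels in the grouping column, since base R's `relevel()` stops with "'ref' must be an existing level" when it is not. The sketch below uses a made-up metadata table; the sample IDs and group values are assumptions for illustration only.

```r
# Hypothetical metadata standing in for the real sample sheet.
metadata <- data.frame(
  SampleID = c("S1", "S2", "S3", "S4"),
  `Sample type` = c("Control", "Control", "Treated", "Treated"),
  check.names = FALSE
)

group_col <- "Sample type"
reference <- "Controls"  # trailing "s": deliberately not a level in this toy data

# Levels that relevel() would be allowed to use as the reference.
group_levels <- levels(factor(metadata[[group_col]]))
reference_ok <- reference %in% group_levels

if (!reference_ok) {
  message(
    "'", reference, "' is not a level of '", group_col,
    "'. Available levels: ", paste(group_levels, collapse = ", ")
  )
}
```

If the check fails on the real metadata, comparing the printed levels with the intended reference (spelling, case, trailing spaces) is a reasonable first step.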
Hello Chen, I have tried to follow your instructions and have encountered the following problems:
Error in pathway_daa(abundance = kegg_abundance, metadata = metadata, :
Sample names extracted.
Starting pathway annotation... The number of statistically significant pathways exceeds the database's query limit. Please consider breaking down the analysis into smaller queries or selecting a subset of pathways for further investigation. Returning DAA results filtered annotation data frame...
Starting pathway annotation... The number of statistically significant pathways exceeds the database's query limit. Please consider breaking down the analysis into smaller queries or selecting a subset of pathways for further investigation. Returning DAA results filtered annotation data frame...
If I understand correctly, I have too many pathways in the data frame and need to reduce them. Also, in pathway_annotation_df I have NA in the pathway_name, description and other columns, so I can't even show those names on a heatmap, for example. Is there a way to just get the pathway names and groups and visualise them against the sample names, without any statistics, taking only a chosen number of the top pathways?
Thank you for your time
Hi Alla, Thank you for providing the detailed information about your issues. Let me help address them one by one:
# After running pathway_daa, filter for top 100 pathways based on p-values
daa_results_df <- daa_results_df[order(daa_results_df$p_values), ] # Sort by p-values
daa_results_df <- head(daa_results_df, 100) # Keep top 100 pathways (fewer if fewer exist)
# Then proceed with pathway annotation
pathway_annotation_df <- pathway_annotation(
  pathway = "KO",
  daa_results_df = daa_results_df,
  ko_to_kegg = TRUE
)
# Check how many pathways have NA values
na_count <- sum(is.na(pathway_annotation_df$pathway_name))
total_count <- nrow(pathway_annotation_df)
# Print the results
print(paste0("Pathways with NA values: ", na_count, " out of ", total_count))
# Look at some examples of pathways with and without annotations
# Show first few rows of pathways with NA values
print("Examples of pathways with NA values:")
head(pathway_annotation_df[is.na(pathway_annotation_df$pathway_name), ])
# Show first few rows of pathways with annotations
print("Examples of pathways with annotations:")
head(pathway_annotation_df[!is.na(pathway_annotation_df$pathway_name), ])
After checking this, if you want to proceed with only the annotated pathways:
# Remove rows with NA values in pathway_name
pathway_annotation_df <- pathway_annotation_df[!is.na(pathway_annotation_df$pathway_name), ]
# Get top pathways by mean abundance
mean_abundances <- rowMeans(kegg_abundance)
# head() avoids NA names when there are fewer than 50 pathways
top_pathways <- names(head(sort(mean_abundances, decreasing = TRUE), 50)) # Top 50 pathways
# Subset the abundance data
filtered_abundance <- kegg_abundance[top_pathways, ]
# Create a basic heatmap
library(pheatmap)
pheatmap(filtered_abundance,
         scale = "row",
         clustering_distance_rows = "correlation",
         clustering_distance_cols = "correlation",
         show_rownames = TRUE,
         show_colnames = TRUE)
Please let me know:
Best regards,
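For the request to show pathway names rather than IDs on a heatmap, one possible approach is to relabel the rows of the top-N abundance table from the annotation table, falling back to the ID where no name is available. This is only a sketch on made-up data: the `feature` and `pathway_name` column names follow the snippets in this thread, while the abundance values and the toy annotation table are assumptions.

```r
# Hypothetical abundance matrix: pathways in rows, samples in columns.
abundance <- matrix(
  c(50, 60, 55,
     5,  4,  6,
    20, 10, 30,
     1,  2,  3),
  nrow = 4, byrow = TRUE,
  dimnames = list(c("ko00010", "ko00020", "ko00030", "ko00040"),
                  c("S1", "S2", "S3"))
)

# Hypothetical annotation table; one pathway deliberately has no name.
annotation <- data.frame(
  feature = c("ko00010", "ko00020", "ko00030", "ko00040"),
  pathway_name = c("Glycolysis / Gluconeogenesis", "Citrate cycle (TCA cycle)",
                   NA, "Pentose and glucuronate interconversions"),
  stringsAsFactors = FALSE
)

# Keep the top N pathways by mean abundance (head() copes with N > nrow).
top_n <- 3
mean_abundance <- sort(rowMeans(abundance), decreasing = TRUE)
top_ids <- names(head(mean_abundance, top_n))
top_abundance <- abundance[top_ids, , drop = FALSE]

# Use the pathway name as the row label, falling back to the ID when it is NA.
row_labels <- annotation$pathway_name[match(top_ids, annotation$feature)]
row_labels[is.na(row_labels)] <- top_ids[is.na(row_labels)]
rownames(top_abundance) <- row_labels

print(top_abundance)
```

The relabelled table could then be handed to a heatmap function such as `pheatmap()` in the same way as `filtered_abundance`, with no statistics involved.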
Hi Alla, Let me add some additional tips about pathway_annotation, especially for handling those unannotated pathways:
# First, identify unannotated pathways
# (using $feature keeps this a plain vector even if the annotation table is a tibble)
unannotated_features <- pathway_annotation_df$feature[is.na(pathway_annotation_df$pathway_name)]
unannotated_pathways <- daa_results_df[daa_results_df$feature %in% unannotated_features, ]
# Try to annotate these pathways again in smaller batches
batch_size <- 20
for (i in seq(1, nrow(unannotated_pathways), batch_size)) {
  end_idx <- min(i + batch_size - 1, nrow(unannotated_pathways))
  batch_pathways <- unannotated_pathways[i:end_idx, ]
  # Add some delay between batches to avoid overwhelming the KEGG API
  if (i > 1) Sys.sleep(2)
  batch_annotation <- pathway_annotation(
    pathway = "KO",
    daa_results_df = batch_pathways,
    ko_to_kegg = TRUE
  )
  # Merge successful annotations back
  if (!is.null(batch_annotation) && nrow(batch_annotation) > 0) {
    pathway_annotation_df <- rbind(
      pathway_annotation_df[!is.na(pathway_annotation_df$pathway_name), ],
      batch_annotation[!is.na(batch_annotation$pathway_name), ]
    )
  }
}
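The batching idea in the loop above can also be expressed with `split()`, which may be easier to reason about when the number of pathways is not a multiple of the batch size. The sketch below runs on made-up pathway IDs and only shows how the batches are formed; it does not call any ggpicrust2 function.

```r
# Hypothetical pathway IDs standing in for the unannotated features.
pathway_ids <- sprintf("ko%05d", 1:45)
batch_size <- 20

# Assign each ID to a batch number: 1 for the first 20, 2 for the next 20, ...
batch_index <- ceiling(seq_along(pathway_ids) / batch_size)
batches <- split(pathway_ids, batch_index)

# Each list element is one batch; the last one holds the remainder.
print(lengths(batches))

# With no IDs at all, split() returns an empty list, so a loop over
# the batches simply does nothing.
empty_batches <- split(character(0), ceiling(seq_along(character(0)) / batch_size))
```

Iterating with `for (batch in batches)` would then process one batch at a time, with a pause between batches if the remote service needs it.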
# Filter by adjusted p-value
significant_pathways <- daa_results_df[daa_results_df$p_adjust < 0.05, ]
# Or filter by absolute log fold change
if ("log2FoldChange" %in% colnames(daa_results_df)) {
  high_impact_pathways <- daa_results_df[abs(daa_results_df$log2FoldChange) > 1, ]
}
# Then annotate these filtered pathways
filtered_annotations <- pathway_annotation(
  pathway = "KO",
  daa_results_df = significant_pathways, # or high_impact_pathways
  ko_to_kegg = TRUE
)
# Save annotations after each successful batch
saveRDS(pathway_annotation_df, "pathway_annotations_progress.rds")
# Load saved progress if needed
pathway_annotation_df <- readRDS("pathway_annotations_progress.rds")
# Summary of annotation completeness
annotation_summary <- data.frame(
  total_pathways = nrow(pathway_annotation_df),
  annotated = sum(!is.na(pathway_annotation_df$pathway_name)),
  unannotated = sum(is.na(pathway_annotation_df$pathway_name))
)
# Check which columns have the most missing values
na_by_column <- sapply(pathway_annotation_df, function(x) sum(is.na(x)))
print(na_by_column)
# Example: Focus on metabolism-related pathways
metabolism_pathways <- pathway_annotation_df[
  grepl("metabolism|biosynthesis|degradation",
        pathway_annotation_df$pathway_name,
        ignore.case = TRUE), ]
Let me know if you need any clarification on these approaches or if you'd like to try a different strategy!
Best regards
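One detail worth keeping in mind with the adjusted p-value filter: in base R, a logical index containing `NA` produces all-`NA` rows, so pathways with a missing `p_adjust` would slip into the result as empty rows. Wrapping the condition in `which()` is one way around this. The sketch uses a made-up results table; only the `feature` and `p_adjust` column names are taken from this thread.

```r
# Hypothetical DAA results with one missing adjusted p-value.
daa_results <- data.frame(
  feature = c("ko00010", "ko00020", "ko00030", "ko00040"),
  p_adjust = c(0.01, NA, 0.20, 0.03),
  stringsAsFactors = FALSE
)

# Plain logical indexing keeps an all-NA row for the missing value.
naive_filter <- daa_results[daa_results$p_adjust < 0.05, ]

# which() drops NA comparisons, keeping only rows that truly pass.
safe_filter <- daa_results[which(daa_results$p_adjust < 0.05), ]

print(nrow(naive_filter))
print(safe_filter)
```

Whether missing p-values occur at all depends on the method and data, so this is a precaution rather than a required step.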
Hello,
I have the output file from an analysis of a 16S dataset. The picrust2 plugin was used in qiime2, and I then converted the biom file to TSV format. Now I am trying to visualise the results, but I have some issues. I have attached a picture of the TSV table; I think something is wrong with it.
First I tried:
library(readr)
library(ggpicrust2)
library(tibble)
library(tidyverse)
library(ggprism)
library(patchwork)
# Load necessary data: abundance data and metadata
abundance_file <- "/home/output/path_exported/ko_feature_table.biom.tsv"
metadata <- read_delim(
  "/home/sample-metadata.tsv",
  delim = "\t",
  escape_double = FALSE,
  trim_ws = TRUE
)
# Run ggpicrust2 with input file path
results_file_input <- ggpicrust2(file = abundance_file,
                                 metadata = metadata,
                                 group = "Sample type", # For example dataset, group = "Environment"
                                 reference = "Controls",
                                 pathway = "KO",
                                 daa_method = "LinDA",
                                 ko_to_kegg = TRUE,
                                 order = "pathway_class",
                                 p_values_bar = TRUE,
                                 x_lab = "pathway_name")
metadata$`Sample type` <- as.factor(metadata$`Sample type`)
levels(metadata$`Sample type`)
I got the following errors:
Starting the ggpicrust2 analysis...
Converting KO to KEGG...
Loading data from file...
Rows: 10556 Columns: 1
── Column specification ────────────────────────────────────────────────────────────────────────
Delimiter: "\t"
chr (1): # Constructed from biom file
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Loading KEGG reference data. This might take a while...
Performing KO to KEGG conversion. Please be patient, this might take a while...
|======================================================================================| 100%
KO to KEGG conversion completed. Time elapsed: 0.01 seconds.
Removing KEGG pathways with zero abundance across all samples...
KEGG abundance calculation completed successfully.
Performing pathway differential abundance analysis...
Sample names extracted.
Identifying matching columns in metadata...
Matching columns identified: #SampleID . This is important for ensuring data consistency.
Using all columns in abundance.
Converting abundance to a matrix...
Reordering metadata...
Converting metadata to a matrix and data frame...
Extracting group information...
Running LinDA analysis...
Error in relevel.factor(LinDA_metadata_df$Group_group_nonsense_, ref = reference) :
  'ref' must be an existing level
In addition: Warning message:
One or more parsing issues, call `problems()` on your data frame for details, e.g.:
  dat <- vroom(...)
  problems(dat)
I am not sure what this is about.
When I checked with problems(), there seemed to be no such issues. The input file is there and the paths are written correctly.
problems(metadata)
# A tibble: 0 × 5
# ℹ 5 variables: row, col, expected, actual, file
metadata <- read_delim("/home/sample-metadata.tsv", delim = "\t", escape_double = FALSE, trim_ws = TRUE)
Rows: 24 Columns: 6
── Column specification ────────────────────────────────────────────────────────────────────────
Delimiter: "\t"
chr (4): #SampleID, Composition, Freezing time, Sample type
dbl (2): count_reads, Layers
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Then:
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Loading KEGG reference data. This might take a while...
Performing KO to KEGG conversion. Please be patient, this might take a while...
|======================================================================================| 100%
KO to KEGG conversion completed. Time elapsed: 0.01 seconds.
Removing KEGG pathways with zero abundance across all samples...
KEGG abundance calculation completed successfully.
Warning message:
One or more parsing issues, call `problems()` on your data frame for details, e.g.:
  dat <- vroom(...)
  problems(dat)
Next I tried:
ko_abundance_file <- "/home/output/path_exported/ko_feature_table.biom.tsv"
Loading data from file...
Rows: 10556 Columns: 1
── Column specification ────────────────────────────────────────────────────────────────────────
Delimiter: "\t"
chr (1): # Constructed from biom file
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Loading KEGG reference data. This might take a while...
Performing KO to KEGG conversion. Please be patient, this might take a while...
|======================================================================================| 100%
KO to KEGG conversion completed. Time elapsed: 0.01 seconds.
Removing KEGG pathways with zero abundance across all samples...
KEGG abundance calculation completed successfully.
Warning message:
One or more parsing issues, call `problems()` on your data frame for details, e.g.:
  dat <- vroom(...)
  problems(dat)
Here I end up with an empty "kegg_abundance" variable in RStudio.
I think something is wrong with the input TSV table, but I can't understand what or how to fix it.
I would appreciate any help
Thank you for your time
Best,
Alla
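The "Rows: 10556 Columns: 1" line in the log above suggests the reader found only one column. A quick way to see why, sketched here in base R on a made-up file, is to count the tab-separated fields in the first few lines: a first line with a single field followed by lines with several points to an extra header line. The file contents below are assumptions for illustration, not the actual table.

```r
# Tiny stand-in for the exported table (hypothetical contents).
tsv_path <- tempfile(fileext = ".tsv")
writeLines(c(
  "# Constructed from biom file",
  "#OTU ID\tS1\tS2",
  "K00001\t10\t0"
), tsv_path)

# Count tab-separated fields on each of the first lines.
first_lines <- readLines(tsv_path, n = 3)
field_counts <- lengths(strsplit(first_lines, "\t", fixed = TRUE))
print(field_counts)

# A first line with fewer fields than the following lines is a hint that
# it is a comment or title line rather than the real header.
extra_first_line <- field_counts[1] < field_counts[2]
```

Running the same count on the real file path would show whether its first line has this shape.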