Proteogenomics/trackhub-creator

How to set up the application

First, check out the application code using the --recursive flag:

git clone --recursive <repo_url>

This repository uses submodules. Some of them require access rights to, at least, deploy the module code; without those rights, some of the pipelines shipped with this application will not work.

Once the source code has been checked out, a Makefile takes care of most of the DevOps-related heavy lifting.

There are two main installation targets:

  • install, the usual installation target, used when preparing a development setup of the application or a production installation that does not take into account the possible presence of an HPC environment. A Python virtual environment is prepared with all the application requirements, and the external tools needed by the application are collected and made available.
  • install_lsf, which does the same as install, but takes into account the possible presence of an HPC environment.

And two main cleaning targets:

  • clean_all, which removes everything (running sessions, logs, external tools...), leaving the application as it would be right after being cloned from its repository.
  • lsf_clean_all, which performs the same cleaning process as clean_all, but is aware of a possible HPC environment.

Using the Pipelines shipped with the application

To run any pipeline shipped with the application (or added to it), issue the following command from the root folder of the application:

time python_install/bin/python main_app.py -a pipeline_cmd_param1=pipeline_cmd_param1_value,...,pipeline_cmd_paramN=pipeline_cmd_paramN_value <pipeline_name>

This command will time the execution of the application, using the application's Python virtual environment to run the given pipeline with the given command line key=value parameters.
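
As an illustration of how the command above is assembled, the following is a minimal sketch that builds the argument list from a dictionary of parameters. The helper name build_pipeline_command is hypothetical and not part of the application; omitting -a when there are no parameters is also an assumption.

```python
import shlex


def build_pipeline_command(pipeline_name, params,
                           python_bin="python_install/bin/python"):
    """Return the argv list for running a pipeline through main_app.py.

    Hypothetical helper: it only mirrors the command line shown in this README.
    """
    # Parameters are passed as one comma-separated list of key=value pairs.
    args = ",".join(f"{key}={value}" for key, value in params.items())
    command = [python_bin, "main_app.py"]
    if args:  # assumption: '-a' is left out when there are no parameters
        command += ["-a", args]
    command.append(pipeline_name)
    return command


if __name__ == "__main__":
    cmd = build_pipeline_command(
        "ensembl_data_collector", {"ncbi_taxonomy_ids": "10090,9606"}
    )
    # Print the command as it would be typed in a shell.
    print(shlex.join(cmd))
```

The resulting list can be handed to subprocess.run from the root folder of the application; timing the run, as the time prefix does, is left to the caller.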

The following pipelines are shipped with the application:

  • ensembl_data_collector
  • pride_cluster_export
  • create_trackhub_for_project
  • publish_trackhub

Ensembl Data Collector Pipeline

Other pipelines shipped with this application, e.g. create_trackhub_for_project, use Ensembl protein sequence and genome reference files as part of the trackhub creation process. These files are mirrored locally in the application from the latest Ensembl release. Because the same application can be running different pipelines in parallel, it is recommended to run this pipeline beforehand in order to avoid race conditions when mirroring those files.

This pipeline mirrors protein sequence and genome reference files from Ensembl for the given list of NCBI Taxonomy IDs, e.g. Mouse and Human in the following command.

time python_install/bin/python main_app.py -a ncbi_taxonomy_ids=10090,9606 ensembl_data_collector 

Those files will be made locally available at

resources/ensembl/release-XX

within the application folder, where XX is the latest Ensembl Release Number.
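
A script that consumes the mirrored files may need to find the most recent release folder. The following is a minimal sketch, assuming only the release-XX naming described above; the function latest_release_dir is hypothetical and not part of the application.

```python
import re
from pathlib import Path

# Folder names are assumed to follow the 'release-XX' convention from this README.
RELEASE_PATTERN = re.compile(r"^release-(\d+)$")


def latest_release_dir(ensembl_root):
    """Return the release-XX folder with the highest number, or None."""
    root = Path(ensembl_root)
    if not root.is_dir():
        return None
    releases = []
    for entry in root.iterdir():
        match = RELEASE_PATTERN.match(entry.name)
        if match and entry.is_dir():
            # Compare release numbers as integers, so release-100 > release-99.
            releases.append((int(match.group(1)), entry))
    if not releases:
        return None
    return max(releases, key=lambda item: item[0])[1]


if __name__ == "__main__":
    print(latest_release_dir("resources/ensembl"))
```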

There is a launch script specific to PRIDE data that collects Ensembl data for all the taxonomies present in PRIDE. It can be found at

scripts/ensembl_data_collector

and it can be launched either straight away or as an HPC job

scripts/ensembl_data_collector/launch_pipeline_for_pride_taxonomies.sh 

PRIDE Cluster Export Pipeline

This pipeline creates and registers / updates a trackhub for PRIDE Cluster data.

It is launched by the following script

scripts/pride-cluster-export/ebi-lsf-launch-pipeline.sh

straight away or as a job on the HPC environment.

It creates a subfolder named 'YYYY-MM' in the PRIDE Cluster Trackhubs FTP, using the year and month of the trackhub creation, and updates a 'latest' link that points to the most recently created trackhub for PRIDE Cluster.
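
The folder layout described above can be sketched as follows. This is not the pipeline's actual code, only an illustration of the 'YYYY-MM' plus 'latest' convention; the function name prepare_dated_trackhub_dir and the temporary link name are assumptions, and the sketch relies on POSIX symbolic links.

```python
import os
from datetime import date
from pathlib import Path


def prepare_dated_trackhub_dir(base_dir, today=None):
    """Create <base_dir>/YYYY-MM and point <base_dir>/latest at it."""
    today = today or date.today()
    base = Path(base_dir)
    dated = base / today.strftime("%Y-%m")
    dated.mkdir(parents=True, exist_ok=True)

    # Create the new link under a temporary name, then rename it over
    # 'latest', so that 'latest' is never missing while it is being updated.
    tmp_link = base / "latest.tmp"  # hypothetical temporary name
    if tmp_link.is_symlink() or tmp_link.exists():
        tmp_link.unlink()
    tmp_link.symlink_to(dated.name, target_is_directory=True)
    os.replace(tmp_link, base / "latest")
    return dated


if __name__ == "__main__":
    import tempfile

    with tempfile.TemporaryDirectory() as tmp:
        created = prepare_dated_trackhub_dir(tmp)
        print(created.name, "<-", os.readlink(Path(tmp) / "latest"))
```

The link target is stored as a relative name, so the tree stays valid if the base folder is moved or mounted elsewhere.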

More information on the process of creating a trackhub for PRIDE Cluster can be found here.

PRIDE Project Trackhub Creation Pipeline

This pipeline creates a trackhub for the given PRIDE project. It is launched by the script

scripts/create_trackhub_for_project/launch_pipeline_for_project.sh

and the only parameter it needs is the absolute path to a JSON-formatted file that contains all the information related to the project being processed and to the trackhub that is going to be created, e.g. title, long and short description, etc.

The following is a sample project description file content passed to this pipeline as a parameter

{
  "trackHubName" : "PXD000625",
  "trackHubShortLabel" : "<a href=\"http://www.ebi.ac.uk/pride/archive/projects/PXD000625\">PXD000625</a> - Hepatoc...",
  "trackHubLongLabel" : "Experimental design For the label-free ...",
  "trackHubType" : "PROTEOMICS",
  "trackHubEmail" : "...",
  "trackHubInternalAbsolutePath" : "...",
  "trackhubCreationReportFilePath": "...",
  "trackMaps" : [ {
    "trackName" : "PXD000625_10090_Original",
    "trackShortLabel" : "<a href=\"http://www.ebi.ac.uk/pride/archive/projects/PXD000625\">PXD000625</a> - Mus musc...",
    "trackLongLabel" : "Experimental design For the label-free proteome analysis 17 mice were used composed of 5 ...",
    "trackSpecies" : "10090",
    "pogoFile" : "..."
  } ]
}
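
Before launching the pipeline, it can be useful to check a project description file against the sample above. The following is a minimal sketch; the key sets are copied from the sample, and treating all of them as required is an assumption, since this README does not say which keys are mandatory. The function names are hypothetical.

```python
import json

# Keys taken from the sample above; whether all are mandatory is an assumption.
HUB_KEYS = {
    "trackHubName", "trackHubShortLabel", "trackHubLongLabel", "trackHubType",
    "trackHubEmail", "trackHubInternalAbsolutePath",
    "trackhubCreationReportFilePath", "trackMaps",
}
TRACK_KEYS = {
    "trackName", "trackShortLabel", "trackLongLabel", "trackSpecies", "pogoFile",
}


def missing_keys(description):
    """Return the keys from the sample that are absent in the description."""
    missing = sorted(HUB_KEYS - set(description))
    for index, track in enumerate(description.get("trackMaps", [])):
        for key in sorted(TRACK_KEYS - set(track)):
            missing.append(f"trackMaps[{index}].{key}")
    return missing


def load_description(path):
    """Load a JSON project description file."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


if __name__ == "__main__":
    partial = {"trackHubName": "PXD000625", "trackMaps": [{"trackName": "t"}]}
    for key in missing_keys(partial):
        print("missing:", key)
```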

trackhubCreationReportFilePath points to a file where the pipeline, once it has finished running, writes a JSON-formatted report on the trackhub creation process, as in the following sample.

{
  "status": "SUCCESS",
  "success_messages": [],
  "warning_messages": [],
  "error_messages": [],
  "pipeline_session_working_dir": "...",
  "log_files": [],
  "hub_descriptor_file_path": "..."
}

where

  • status, one of three possible outcomes of the pipeline run
    • SUCCESS, all the project data has been successfully processed and the trackhub created.
    • WARNING, some project data failed to process but a trackhub was created with at least one track. More information can be found in the accompanying messages within this report.
    • ERROR, a trackhub could not be created for the given project data. More information can be found in the accompanying messages within this report.
  • success_messages, a list of informative messages about the creation of the trackhub for the given project.
  • warning_messages, a list of messages raising issues about the trackhub creation for the given project.
  • error_messages, a list of messages stating the errors that rendered the trackhub creation process for the given project impossible.
  • pipeline_session_working_dir, this is the working directory used by the application when running this pipeline.
  • log_files, the list of absolute paths to all the log files related to the pipeline run for the given project. As with the working directory, this information is included in the report for forensic purposes.
  • hub_descriptor_file_path, absolute path to the hub.txt file created as part of the trackhub for the given project.
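
A caller can act on the report once the pipeline has finished. The following is a minimal sketch that turns a report into an exit code and a few summary lines. The field names come from the sample above; the mapping of WARNING to exit code 0 and the function names are assumptions made for illustration.

```python
import json

# Assumption: a WARNING still produced a trackhub, so it is not a failure.
EXIT_CODES = {"SUCCESS": 0, "WARNING": 0, "ERROR": 1}


def summarize_report(report):
    """Return (exit_code, lines) for a trackhub creation report."""
    # A report without a status is treated as an error.
    status = report.get("status", "ERROR")
    lines = [f"status: {status}"]
    for key in ("success_messages", "warning_messages", "error_messages"):
        for message in report.get(key, []):
            lines.append(f"{key}: {message}")
    if status != "SUCCESS":
        # Point at the log files only when something went wrong.
        lines += [f"log file: {path}" for path in report.get("log_files", [])]
    return EXIT_CODES.get(status, 1), lines


def read_report(path):
    """Load the JSON report written at trackhubCreationReportFilePath."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


if __name__ == "__main__":
    sample = {"status": "WARNING", "warning_messages": ["one track failed"],
              "log_files": ["/path/to/run.log"]}
    code, summary = summarize_report(sample)
    print("\n".join(summary))
    raise SystemExit(code)
```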

Trackhub Publishing / Registering / Update Pipeline

This pipeline registers a trackhub at the Trackhub Registry. It can be launched by the script at

scripts/publish_trackhub/publish_trackhub.sh

providing the following parameters

  • user name, this is the user name to be used for registering the trackhub at the Trackhub Registry.
  • password, to be used for registering the trackhub at the Trackhub Registry.
  • trackhub description data file, this file describes the trackhub to be published by this pipeline, as shown in the following sample file content.
{
    "trackhubUrl": "http://host.com/hub.txt",
    "publicVisibility": "1",
    "type": "PROTEOMICS",
    "pipelineReportFilePath": "pipeline.report"
}

where

  • trackhubUrl is the public URL of the hub.txt file for the trackhub to be published.
  • publicVisibility, configures whether the trackhub being published is going to be public or private. If not included in the file, the default value is 'private'.
  • type, this is the 'type' information to be assigned to the trackhub being published. If not included in the file, the default value is 'PROTEOMICS'.
  • pipelineReportFilePath, absolute path to the file where the pipeline should write a report on the process. A sample of that report content is shown below.
{
  "status": "...",
  "success_messages": [],
  "warning_messages": [],
  "error_messages": [],
  "pipeline_session_working_dir": "...",
  "trackhub_url": "...",
  "log_files": [],
  "trackhub_registration_analysis": []
}

where

  • status, one of three possible outcomes of the pipeline run
    • SUCCESS, the trackhub was successfully published / updated.
    • WARNING, the trackhub was published / updated, but some errors occurred.
    • ERROR, the trackhub could not be published / updated.
  • success_messages, a list of informative messages about the trackhub publishing process.
  • warning_messages, a list of messages raising issues about the trackhub publishing process.
  • error_messages, a list of messages stating the errors that rendered the trackhub publishing process impossible.
  • pipeline_session_working_dir, this is the working directory used by the application when running this pipeline.
  • trackhub_url, URL of the hub.txt trackhub file.
  • log_files, the list of absolute paths to all the log files related to the pipeline run for the given project. As with the working directory, this information is included in the report for forensic purposes.
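
The trackhub description data file for this pipeline can be generated programmatically. The following is a minimal sketch that writes the optional fields only when they are given, so that the pipeline's documented defaults ('private' and 'PROTEOMICS') apply otherwise. The function names are hypothetical, and the paths in the usage example are placeholders.

```python
import json


def build_publish_description(trackhub_url, report_path,
                              public_visibility=None, hub_type=None):
    """Build the content of a trackhub description data file."""
    description = {
        "trackhubUrl": trackhub_url,
        "pipelineReportFilePath": report_path,
    }
    # Optional fields are left out unless set, so the pipeline defaults apply.
    if public_visibility is not None:
        description["publicVisibility"] = public_visibility
    if hub_type is not None:
        description["type"] = hub_type
    return description


def write_description(description, path):
    """Write the description as JSON to the given path."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(description, handle, indent=4)


if __name__ == "__main__":
    # Placeholder values, matching the sample file content in this README.
    data = build_publish_description(
        "http://host.com/hub.txt", "pipeline.report", public_visibility="1"
    )
    print(json.dumps(data, indent=4))
```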

Final Notes

The default Trackhub Registry service used by the pipelines is the one at www.trackhubregistry.org.

For more detailed documentation, please refer to the wiki pages of this repository.

Contact

Manuel Bernal Llinares
