This is a Python script to convert BED files to BigBed format, used by the UCSC Genome Browser.
At the moment, this script works only for the CAGE track files, because it relies on their specific format.
Usage:
python main.py <dir_path>
where dir_path is the path of the directory containing the BED files.
The original BED files submitted to us (novel_CAGE and annot_CAGE) do not comply with the UCSC rules for the BED format, so several changes were made to the files before they could be converted to BigBed.
After the conversion script is run on the directory path, two sub-directories are created inside dir_path: updated and bigbed
1) updated
This folder contains the modified BED files, which comply with the UCSC rules for the BED format.
The changes made are:
- inclusion of the name field. "." is used because the submitter did not provide a name, so the field is considered empty.
- moving the width column to the end, because it is a non-standard user-defined column and must come after all standard BED fields
- swapping the order of score and strand to match the BED field ordering
- removal of headers
- editing the chromEnd value from 16617 to 16616 because the bedToBigBed application threw an error. The chromEnd value provided by our submitter is 16617, while the size of NC_001941.1 is 16616. See the chrom.sizes file CF_002742125.1_Oar_rambouillet_v1.0.chrom.sizes
- restricting the score to the range 0 to 1000 required by BED. The score was converted to an integer, and where the value is greater than 1000, only the first 3 digits are kept as the score, assuming that the decimal point was misplaced by our submitter.
- an autosql file is used to describe the fields and include the non-standard fields to ensure that conversion to bigBed happens seamlessly
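
The row-level changes listed above can be sketched roughly as follows. This is not the code in main.py, only an illustration. The input column order (chrom, start, end, width, strand, score) and the function names normalize_score and convert_row are assumptions made for the example. Clamping chromEnd to the chromosome size generalises the single 16617 to 16616 edit described above.

```python
def normalize_score(raw):
    """Convert a score to an int within the 0 to 1000 range expected by BED.

    Values above 1000 keep only their first 3 digits, following the
    assumption that the decimal point was misplaced.
    """
    score = int(float(raw))
    if score > 1000:
        score = int(str(score)[:3])
    return score


def convert_row(fields, chrom_sizes):
    """Reorder one data row into BED6 plus a trailing width column.

    ASSUMPTION: the input order is chrom, start, end, width, strand, score.
    The real CAGE files may be laid out differently.
    """
    chrom, start, end, width, strand, score = fields
    # Clamp chromEnd so it never exceeds the chromosome size (if known).
    # The width column is passed through unchanged, not recomputed.
    end = min(int(end), chrom_sizes.get(chrom, int(end)))
    # "." fills the empty name field; score now precedes strand.
    return [chrom, start, str(end), ".", str(normalize_score(score)), strand, width]
```

For example, convert_row(["NC_001941.1", "16500", "16617", "118", "+", "25431.0"], {"NC_001941.1": 16616}) yields a row ending at 16616 with a score of 254.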
2) bigbed
This folder contains the successfully generated bigBed files, ready to be uploaded to UCSC.
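
As a rough illustration of the final conversion step, the sketch below builds a bedToBigBed command for one updated BED file. It assumes the layout described above (six standard BED fields plus width, hence -type=bed6+1). The autoSql table name, the field comments, and the build_command helper are hypothetical and are not taken from main.py.

```python
import subprocess
from pathlib import Path

# Hypothetical autoSql description of BED6 plus the extra width column.
AUTOSQL = '''table cageBed
"CAGE regions: BED6 plus a width column"
    (
    string chrom;      "Chromosome or scaffold name"
    uint   chromStart; "Start position"
    uint   chromEnd;   "End position"
    string name;       "Name (empty, given as .)"
    uint   score;      "Score from 0 to 1000"
    char[1] strand;    "+ or -"
    uint   width;      "Width (non-standard column)"
    )
'''


def build_command(bed_path, chrom_sizes_path, as_path, out_dir):
    """Return the bedToBigBed argument list for one BED file."""
    out_path = Path(out_dir) / (Path(bed_path).stem + ".bb")
    return [
        "bedToBigBed",
        f"-as={as_path}",   # autoSql file describing all fields
        "-type=bed6+1",     # 6 standard BED fields + 1 extra field
        str(bed_path),
        str(chrom_sizes_path),
        str(out_path),
    ]
```

The returned list can be passed to subprocess.run(cmd, check=True) once the AUTOSQL text has been written to as_path. bedToBigBed expects its input sorted by chromosome and then by start position.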