Support for the BED12 format #108

kepbod · 2014-06-18T02:25:59Z

I read the documentation and found that only a few functions support the BED12 format. Would it be possible to add more support for BED12? Thanks!

daler · 2014-06-18T13:09:37Z

Do you have something specific in mind?

kepbod · 2014-06-19T02:10:23Z

I'm not familiar with C code, so I don't know whether this is possible. For example, the Interval object only exposes the first 6 fields (chrom, start, end, name, score, strand), and it cannot efficiently parse the other BED12 fields such as cds_start, cds_end, exon_starts and exon_ends. Another issue is that bedtools has options like -split in getfasta to support the BED12 format, but in pybedtools this option was omitted.

daler · 2014-06-19T11:59:35Z

Ah, I see what you mean with respect to Interval parsing of BED12. For example, there is no Interval.blockStarts attribute that returns a list of integer start positions.

However, any field of an Interval object can always be accessed by indexing into the object. So if you were working with an Interval object, x, that represents a BED12 line, you could get the comma-separated string for exon starts like this:

block_starts = x[11]

And then handle the string yourself, maybe like:

block_starts = [int(i) for i in x[11].split(',') if i]  # "if i" skips the empty piece left by a trailing comma
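As a rough sketch of where that leads, the block starts in BED12 are relative to the start of the feature, so they usually need to be combined with the start (field 2) and the block sizes (field 11) to get chromosome coordinates. The helper names `parse_int_list` and `absolute_blocks` and the sample record below are made up for illustration; nothing here is part of pybedtools, and the code only assumes a sequence of string fields.

```python
def parse_int_list(field):
    """Turn a comma-separated BED12 field such as '0,150,' into [0, 150]."""
    # Some BED12 writers leave a trailing comma, so skip empty pieces.
    return [int(piece) for piece in field.split(",") if piece]


def absolute_blocks(fields):
    """Return (start, end) pairs in chromosome coordinates, one per block."""
    chrom_start = int(fields[1])          # field 2: feature start (0-based)
    sizes = parse_int_list(fields[10])    # field 11: block sizes
    starts = parse_int_list(fields[11])   # field 12: block starts, relative
    return [(chrom_start + s, chrom_start + s + size)
            for s, size in zip(starts, sizes)]


# A made-up BED12 record, used only for illustration.
line = "chr1\t1000\t2000\ttx1\t0\t+\t1100\t1900\t0\t3\t100,200,300,\t0,400,700,"
fields = line.split("\t")
blocks = absolute_blocks(fields)
print(blocks)
```

Because the helpers only index into `fields`, the same code should work on anything that supports integer indexing and returns strings, which is how the Interval indexing described above behaves.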

In general, pybedtools doesn't do anything special with BED files that have more than 3 fields. It just splits each line on the tab characters and provides aliases for the standard positions. For example, x.score just returns the string in the 5th field, the same as x[4]. Given the wide variety of BED-like formats in the wild, this design decision means that a non-standard BED file will not break pybedtools.

I think what you might want is some sort of object representing a transcript model, which would parse a BED12 line and make available all the exons, CDSs, and introns. Something like this might be better suited for gffutils, where @yarden has already proposed something similar at daler/gffutils#21.

Alternatively, you could write a class that accepts an Interval object representing a properly-formatted BED12 line and creates whatever data structure you find useful. If you find this strategy is performance-limiting in pure Python, I could try implementing your class in Cython to see if that speeds things up.
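A minimal sketch of what such a class might look like is below. The class name `Bed12Transcript`, its attribute names, and the sample record are all hypothetical and not part of pybedtools. It takes a plain list of string fields so that it has no dependency on pybedtools itself; with an Interval, passing its list of raw fields (assumed here to be available as `x.fields`) would be the natural way to feed it.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple

Span = Tuple[int, int]


def _ints(field: str) -> List[int]:
    # Tolerate the trailing comma that some BED12 files carry.
    return [int(p) for p in field.split(",") if p]


@dataclass(frozen=True)
class Bed12Transcript:
    """Hypothetical transcript model built from one BED12 record."""
    chrom: str
    start: int
    end: int
    name: str
    strand: str
    cds_start: int   # thickStart
    cds_end: int     # thickEnd
    exons: Tuple[Span, ...]

    @classmethod
    def from_fields(cls, fields: Sequence[str]) -> "Bed12Transcript":
        if len(fields) < 12:
            raise ValueError("expected at least 12 fields for BED12")
        start = int(fields[1])
        sizes, starts = _ints(fields[10]), _ints(fields[11])
        if not (len(sizes) == len(starts) == int(fields[9])):
            raise ValueError("blockCount does not match blockSizes/blockStarts")
        exons = tuple((start + s, start + s + n) for s, n in zip(starts, sizes))
        return cls(fields[0], start, int(fields[2]), fields[3], fields[5],
                   int(fields[6]), int(fields[7]), exons)

    @property
    def introns(self) -> List[Span]:
        # Gaps between consecutive exons.
        return [(a_end, b_start)
                for (_, a_end), (b_start, _) in zip(self.exons, self.exons[1:])]

    @property
    def cds(self) -> List[Span]:
        # Exon pieces that overlap the thickStart..thickEnd range.
        pieces = []
        for s, e in self.exons:
            lo, hi = max(s, self.cds_start), min(e, self.cds_end)
            if lo < hi:
                pieces.append((lo, hi))
        return pieces


# A made-up BED12 record, used only for illustration.
line = "chr1\t1000\t2000\ttx1\t0\t+\t1100\t1900\t0\t3\t100,200,300,\t0,400,700,"
tx = Bed12Transcript.from_fields(line.split("\t"))
print(tx.exons, tx.introns, tx.cds)
```

Coordinates stay 0-based and half-open, as in BED, and strand is stored but not used to reorder exons; whether that is the right choice depends on what you plan to do downstream.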

As for your second issue regarding -split: the option does work in pybedtools, and it's actually listed on the page you linked to. As long as your installed version of BEDTools supports it, pybedtools does too (Design principle 2).

kepbod · 2014-06-19T13:06:47Z

Thanks for your detailed explanation. In the past, I have always parsed BED12 files with plain Python code, and the performance has been good enough. I opened this issue just to find out whether pybedtools has a more convenient way to solve this problem.

kepbod closed this as completed Jun 19, 2014
