Support for the BED12 format #108

Closed
kepbod opened this issue Jun 18, 2014 · 4 comments

Comments

@kepbod

kepbod commented Jun 18, 2014

I read the documentation and found that only a few functions support the BED12 format. Would it be possible to add more support for BED12? Thanks!

@daler
Owner

daler commented Jun 18, 2014

Do you have something specific in mind?

@kepbod
Author

kepbod commented Jun 19, 2014

I'm not familiar with C code, so I don't know whether this is possible. For example, the Interval object only contains information for the first 6 fields (chrom, start, end, name, score, strand), and it cannot efficiently parse the other BED12 fields such as cds_start, cds_end, exon_starts and exon_ends. Another issue is that bedtools has options such as -split in getfasta to support the BED12 format, but in pybedtools this option was omitted.

@daler
Owner

daler commented Jun 19, 2014

Ah, I see what you mean with respect to Interval parsing of BED12. For example there is not an Interval.blockStarts attribute that returns a list of integer start positions.

However, any field of an Interval object can always be accessed by indexing into the object. So if you were working with an Interval object, x, that represents a BED12 line, you could get the comma-separated string for exon starts like this:

block_starts = x[11]

And then handle the string yourself, maybe like:

block_starts = [int(i) for i in x[11].split(',')]
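
One caveat with that one-liner: many BED12 files end the blockSizes and blockStarts fields with a trailing comma (for example "0,300,700,"), in which case split(',') yields an empty string and int('') raises ValueError. Below is a minimal sketch of a more tolerant helper. The name `parse_int_list` is hypothetical and not part of pybedtools, and a plain list of strings stands in for the Interval, on the assumption that positional indexing behaves the same way.

```python
def parse_int_list(field):
    """Convert a comma-separated BED12 field such as '0,300,700,' to a list of ints.

    Empty pieces (for example, the one produced by a trailing comma) are skipped.
    """
    return [int(piece) for piece in field.split(',') if piece.strip()]


# Stand-in for an Interval representing a BED12 line: only x[i] indexing is used.
x = ['chr1', '1000', '2000', 'tx1', '0', '+',
     '1050', '1800', '0', '3', '100,200,300,', '0,300,700,']

block_starts = parse_int_list(x[11])
print(block_starts)  # [0, 300, 700]
```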

In general, pybedtools doesn't do anything special with BED files over 3 fields. It just splits on the tab characters and provides aliases to each position. For example, x.score just returns the string in the 5th field -- same as if you just used x[4]. Given the wide variety of BED-like formats in the wild, this design decision means that a non-standard BED file will not break pybedtools.

I think what you might want is some sort of object representing a transcript model, which would parse a BED12 line and make available all the exons, CDSs, and introns. Something like this might be better suited for gffutils, where @yarden has already proposed something similar at daler/gffutils#21.

Alternatively, you could write a class that accepts an Interval object representing a properly-formatted BED12 line and creates whatever data structure you find useful. If you find this strategy is performance-limiting in pure Python, I could try implementing your class in Cython to see if that speeds things up.
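
As a rough illustration of that second approach, here is a minimal pure-Python sketch. The class name `BED12Transcript` and its attributes are hypothetical and not part of pybedtools. It only assumes that its argument supports positional indexing, which is true of a list of strings and, per the comment above, of an Interval. Coordinates are kept as 0-based, half-open (start, end) tuples, following the BED convention.

```python
class BED12Transcript:
    """Hypothetical wrapper around one BED12 record (not a pybedtools class)."""

    def __init__(self, fields):
        # `fields` can be any object indexable by column position.
        self.chrom = fields[0]
        self.start = int(fields[1])
        self.end = int(fields[2])
        self.name = fields[3]
        self.strand = fields[5]
        self.cds_start = int(fields[6])  # thickStart column
        self.cds_end = int(fields[7])    # thickEnd column
        sizes = self._ints(fields[10])   # blockSizes column
        starts = self._ints(fields[11])  # blockStarts column, relative to start
        if not (int(fields[9]) == len(sizes) == len(starts)):
            raise ValueError("blockCount does not match blockSizes/blockStarts")
        self.exons = [(self.start + s, self.start + s + n)
                      for s, n in zip(starts, sizes)]

    @staticmethod
    def _ints(field):
        # Tolerate the trailing comma that many BED12 files carry.
        return [int(p) for p in field.split(',') if p]

    @property
    def introns(self):
        # Gaps between consecutive exons.
        return [(prev_end, next_start)
                for (_, prev_end), (next_start, _)
                in zip(self.exons, self.exons[1:])]

    @property
    def cds(self):
        # Portions of each exon that fall inside [cds_start, cds_end).
        pieces = []
        for s, e in self.exons:
            lo, hi = max(s, self.cds_start), min(e, self.cds_end)
            if lo < hi:
                pieces.append((lo, hi))
        return pieces


line = ['chr1', '1000', '2000', 'tx1', '0', '+',
        '1050', '1800', '0', '3', '100,200,300,', '0,300,700,']
t = BED12Transcript(line)
print(t.exons)    # [(1000, 1100), (1300, 1500), (1700, 2000)]
print(t.introns)  # [(1100, 1300), (1500, 1700)]
print(t.cds)      # [(1050, 1100), (1300, 1500), (1700, 1800)]
```

Keeping the wrapper independent of pybedtools means it can be tested on plain lists and only later fed Interval objects while iterating over a file.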

As for your second issue regarding -split, the -split option works in pybedtools -- it's actually listed on the page you linked to. As long as your installed version of BEDTools supports it, pybedtools does too (Design principle 2).
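
For readers unfamiliar with what -split does: bedtools documents it for getfasta as extracting and concatenating the sequences of the BED12 blocks (for example, exons). The sketch below only illustrates that idea in pure Python. It is not how bedtools implements it, it ignores strand, and both `spliced_sequence` and the toy in-memory chromosome string are assumptions made for the example.

```python
def spliced_sequence(chrom_seq, fields):
    """Concatenate the block (exon) sequences of one BED12 record.

    chrom_seq: the full chromosome sequence as a string (toy stand-in for a FASTA).
    fields:    the 12 BED12 columns, indexable by position.
    """
    start = int(fields[1])
    sizes = [int(p) for p in fields[10].split(',') if p]
    starts = [int(p) for p in fields[11].split(',') if p]
    return ''.join(chrom_seq[start + s:start + s + n]
                   for s, n in zip(starts, sizes))


chrom_seq = 'AAAACCCCGGGGTTTT'
fields = ['chr1', '2', '14', 'tx1', '0', '+',
          '2', '14', '0', '2', '4,4,', '0,8,']
print(spliced_sequence(chrom_seq, fields))  # AACCGGTT
```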

@kepbod
Author

kepbod commented Jun 19, 2014

Thanks for your detailed explanation. In the past, I always parsed BED12 files with plain Python code, and the performance was good enough. I opened this issue just to find out whether pybedtools has a more convenient way to solve this problem.

@kepbod kepbod closed this as completed Jun 19, 2014