Add compression support to BSONFileInputFormat. #82
- Prevent BSONSplitter from trying to split compressed files.
- BSONFileRecordReader detects whether a file is compressed and, if so, uses the proper codec.

This code is heavily inspired by the one in TextInputFormat.
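For readers unfamiliar with the pattern, here is a minimal sketch of the idea using only the Java standard library. It is not the code from this pull request. In Hadoop, TextInputFormat resolves the codec through `CompressionCodecFactory.getCodec(Path)`; the suffix checks, the class name `CodecSketch`, and its helper methods below are hypothetical stand-ins for that lookup, and only gzip is handled because the JDK ships no bzip2 codec.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical illustration only; names are not taken from mongo-hadoop.
final class CodecSketch {
    /** Treats a file as compressed when its name carries a known suffix (an assumption of this sketch). */
    static boolean isCompressed(String fileName) {
        return fileName.endsWith(".gz") || fileName.endsWith(".bz2");
    }

    /** A compressed stream cannot be entered at an arbitrary byte offset, so it is read as one split. */
    static boolean isSplittable(String fileName) {
        return !isCompressed(fileName);
    }

    /** Wraps the raw stream in a decompressor when the suffix calls for one. */
    static InputStream open(String fileName, InputStream raw) throws IOException {
        if (fileName.endsWith(".gz")) {
            return new GZIPInputStream(raw);
        }
        if (isCompressed(fileName)) {
            throw new IOException("no codec available in this sketch for " + fileName);
        }
        return raw; // uncompressed: read the bytes as they are
    }

    /** Helper used to build gzip test data in memory. */
    static byte[] gzip(byte[] data) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(buf)) {
            out.write(data);
        }
        return buf.toByteArray();
    }
}
```

The splitter would consult something like `isSplittable` before computing splits, and the record reader would call something like `open` before decoding documents.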
Hi, what's the status of this pull request? Will it be available in the next release? What about LZO compression with BSON files?
I tested this pull request and it worked well, but a compressed BSON file isn't splittable. I created this project (https://github.com/alangalvino/BSON-Splitter) to split BSON files, because a large compressed BSON file isn't splittable in Hadoop MapReduce. My approach was to split large BSON files (> 1 GB) into small pieces (250 MB) and compress them. What do you think about it?
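A rough sketch of that split-then-compress approach, under stated assumptions, might look like the following. It is not the BSON-Splitter project's code. It relies only on the BSON rule that every document starts with a little-endian int32 holding its total size, so chunks can be cut on document boundaries. The class name `BsonChunker`, its methods, and the in-memory chunk list are hypothetical; a real tool would write each chunk to its own file.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical illustration only; not taken from BSON-Splitter or mongo-hadoop.
final class BsonChunker {
    /**
     * Groups whole BSON documents into chunks of at most maxChunkBytes
     * (uncompressed) and gzips each chunk. A single document larger than
     * the limit gets a chunk of its own.
     */
    static List<byte[]> splitAndCompress(InputStream in, int maxChunkBytes) throws IOException {
        List<byte[]> chunks = new ArrayList<>();
        ByteArrayOutputStream current = new ByteArrayOutputStream();
        byte[] header = new byte[4];
        while (true) {
            int got = in.readNBytes(header, 0, 4);
            if (got == 0) break; // clean end of input
            if (got < 4) throw new EOFException("truncated length prefix");
            int size = ByteBuffer.wrap(header).order(ByteOrder.LITTLE_ENDIAN).getInt();
            if (size < 5) throw new IOException("invalid BSON document size " + size);
            byte[] body = new byte[size - 4];
            if (in.readNBytes(body, 0, body.length) < body.length) {
                throw new EOFException("truncated document body");
            }
            // Close the current chunk before it would exceed the limit.
            if (current.size() > 0 && current.size() + size > maxChunkBytes) {
                chunks.add(gzip(current.toByteArray()));
                current.reset();
            }
            current.write(header);
            current.write(body);
        }
        if (current.size() > 0) chunks.add(gzip(current.toByteArray()));
        return chunks;
    }

    static byte[] gzip(byte[] data) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(buf)) {
            out.write(data);
        }
        return buf.toByteArray();
    }

    static byte[] gunzip(byte[] data) throws IOException {
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(data))) {
            return in.readAllBytes();
        }
    }
}
```

Because each chunk holds only complete documents, every compressed piece can be handed to a separate mapper without any document straddling two files.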
Running with a gzip file, I'm getting an error:
With bzip everything seems right.
With bz2 I'm losing data. For example, running a MapReduce job on the uncompressed BSON file I get 1006 records, while running it on the same BSON file compressed with bz2 I get 1000 records. The gzip error described above does not affect the MapReduce result: with gzip I get 1006 records, the same as with the plain BSON file.
I'll try to see what's wrong with bz2 as soon as I can, as well as updating the pull request to the latest mongo-hadoop version. Your approach with BSON-Splitter is similar to what I am doing :)
Great. I tried to figure out what's wrong with gzip (the exception described above). The result was that in the last decode the BSONDecoder was expecting an integer (4 bytes) but got only 1 byte and threw an EOF exception. Here is the code:
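The snippet from that comment is not reproduced in this thread. As a hypothetical illustration of the situation it describes, the sketch below reads length-prefixed BSON documents from a stream and distinguishes a clean end of stream (zero bytes available at a document boundary) from a truncated length prefix (1 to 3 bytes where 4 are expected). The class `BsonFrameReader` and its methods are assumptions made for this example and are not the driver's BSONDecoder; the sketch does not claim to identify the actual cause of the gzip error.

```java
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not the BSONDecoder from the Java driver.
final class BsonFrameReader {
    /** Returns the next raw document, or null when the stream ends exactly on a document boundary. */
    static byte[] nextDocument(InputStream in) throws IOException {
        byte[] header = new byte[4];
        int got = in.readNBytes(header, 0, 4);
        if (got == 0) {
            return null; // clean end of stream
        }
        if (got < 4) {
            // The case described above: fewer than 4 bytes where a length is expected.
            throw new EOFException("truncated length prefix: got " + got + " of 4 bytes");
        }
        // BSON stores the total document size as a little-endian int32.
        int size = ByteBuffer.wrap(header).order(ByteOrder.LITTLE_ENDIAN).getInt();
        if (size < 5) {
            throw new IOException("invalid BSON document size " + size);
        }
        byte[] doc = new byte[size];
        System.arraycopy(header, 0, doc, 0, 4);
        if (in.readNBytes(doc, 4, size - 4) < size - 4) {
            throw new EOFException("truncated document body");
        }
        return doc;
    }

    /** Reads documents until the stream ends cleanly. */
    static List<byte[]> readAll(InputStream in) throws IOException {
        List<byte[]> docs = new ArrayList<>();
        byte[] doc;
        while ((doc = nextDocument(in)) != null) {
            docs.add(doc);
        }
        return docs;
    }
}
```

With this shape, stray trailing bytes surface as an explicit exception after all complete documents have been read, which would be consistent with the record count staying correct in the gzip run.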
Splittable BSON compression is a feature I'd like to have in mongo-hadoop for the 1.5 release. @nlaveaucriteo, @alangalvino, have either of you gotten compression to work while keeping BSON files splittable? If so, would one of you like to update this pull request or make a new one that does this? I'm happy to implement this myself, but I don't want to undermine the efforts of anyone who's already completed this feature.
Hi Luke, I'm still splitting my BSON files (larger than 1 GB) with https://github.com/alangalvino/BSON-Splitter and compressing them with gzip (files of 256 MB). Before running my job I decompress and split again using this file, CompressBSONFileInputFormat.java (https://gist.github.com/alangalvino/8bd2842935cd43a536e4), with code borrowed from this pull request.