4mc support for distmap upload #403

Closed
robmaz opened this issue Feb 14, 2018 · 17 comments
Labels
Priority: Medium; Status: Completed (Finished and already included in the main code); Type: Enhancement (New features or improved behavior)

@robmaz

robmaz commented Feb 14, 2018

What would it take to support the splittable 4mc compression format for uploading and downloading? It would be much faster than bzip2, and although the compression ratio is much worse, it may be an acceptable compromise. (The basic implementation would be to pipe the uncompressed output through the command-line utility, but since this is a Java library, there is probably a more natural way to do it in Java.)

https://github.com/carlomedas/4mc

@magicDGS
Owner

I guess that you refer to the distmap pipeline (upload). I changed the title to make it clear.

@magicDGS magicDGS changed the title from "mc4 support" to "mc4 support for distmap upload" Feb 14, 2018
@magicDGS
Owner

@robmaz - after looking into 4mc compression, it seems that the Java library is not published on Maven Central or any other repository that a build system can pull from.

Although I can use the automatic artifact build from https://jitpack.io/, are you sure that the library is in active development? The latest release (2.0.0) was on Sep 11, 2016, and although there are recent commits, it does not look like releases will come on a regular basis.

The time needed to support automatic compression with an already implemented compressor is minimal: it just means handling another extension in a simple function. If some configuration is required, it would take a bit longer, but I guess that will only be necessary for the distmap uploader, no?

Anyway, let me know whether this is just for testing the compressor or something to implement for production. In the first case, I can open a PR and compile an unreleased copy of ReadTools for testing; if it works, I can then implement tests and do a point release.

@robmaz
Author

robmaz commented Feb 14, 2018

Well, there is a high chance that this was someone's PhD project or something similar and has been left as is. But I think it is just a very thin layer over the actual lz4 frame format

https://github.com/lz4/lz4/blob/dev/doc/lz4_Frame_format.md

so maybe it can be kept alive in the future. In the present anyway, it seems to work as is both on the command line and for Hadoop 2.7.x, which we will probably be using for the foreseeable future.

I was thinking of making it part of the transition to Hadoop 2.7.5 and the new distmap, which I have been itching to do for some time now. I think replacing the bzip2 compression on upload will have a huge impact on upload times, which are one of the main complaints people have. The alternative to 4mc is basically turning compression off entirely for now.

It could also be used for download, i.e., the mappers would generate parts in .sam.4mc format (or rather, Hadoop would compress the mapper output in this way) that you would have to handle.

@magicDGS
Owner

And what about using the Hadoop-BAM compressor for block-compressed files (bgzip)? This is a well-established compression standard in bioinformatics (tabix uses it, and BAM is compressed that way).

The BGZIP format is based on GZIP and is implemented in Hadoop-BAM as a SplittableCompressionCodec. I think this alternative is worth trying before going to the 4mc compressor.

Regarding the compression of the parts, that is a more difficult topic: it would require a major refactoring of the downloader, and since each part in BAM format is already compressed, an extra compression layer would decrease performance even more...

@robmaz
Author

robmaz commented Feb 14, 2018 via email

@robmaz robmaz changed the title from "mc4 support for distmap upload" to "4mc support for distmap upload" Feb 14, 2018
@robmaz
Author

robmaz commented Feb 14, 2018

To put some interpretation on these numbers, 4mc compresses at about the same rate as hadoop fs can push data over the network, so it is basically free compression, even if it is not the best.
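
The "free compression" argument can be made concrete with a rough model, assuming that compression and network transfer overlap as a pipeline so that the slower stage sets the total time. The class name, method names and all figures below are hypothetical and purely illustrative; they are not measurements from this thread.

```java
public class UploadTimeModel {

    /**
     * Estimated seconds to upload rawBytes when compression and transfer
     * run as a pipeline: the slower of the two stages dominates.
     * ratio is raw size divided by compressed size (e.g. 2.0 halves the data).
     */
    static double pipelinedSeconds(double rawBytes, double compressBytesPerSec,
                                   double networkBytesPerSec, double ratio) {
        double compressTime = rawBytes / compressBytesPerSec;
        double transferTime = (rawBytes / ratio) / networkBytesPerSec;
        return Math.max(compressTime, transferTime);
    }

    /** Estimated seconds to upload rawBytes with no compression at all. */
    static double uncompressedSeconds(double rawBytes, double networkBytesPerSec) {
        return rawBytes / networkBytesPerSec;
    }

    public static void main(String[] args) {
        double raw = 100e9;   // hypothetical 100 GB of reads
        double net = 100e6;   // hypothetical 100 MB/s network

        // A fast codec that keeps up with the network costs no extra time.
        System.out.println("no compression: " + uncompressedSeconds(raw, net) + " s");
        System.out.println("fast codec:     " + pipelinedSeconds(raw, 100e6, net, 2.0) + " s");

        // A slow codec with a better ratio is limited by the compressor.
        System.out.println("slow codec:     " + pipelinedSeconds(raw, 10e6, net, 4.0) + " s");
    }
}
```

Under this model, a codec that compresses at least as fast as the network transfers never lengthens the upload, whereas a slower codec does so regardless of how good its ratio is.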

@magicDGS
Owner

I am not sure if that will be the same in the distmap pipeline for several reasons:

  • The Java library won't use native implementations, and thus might be slower.
  • The network will be a major bottleneck, so I am not sure whether the compression will have a major effect. To check this properly, we can profile an upload run with Java and see in which step the JVM spends the most time.

On the other hand, I realized that the library cannot be used unless I add the jar file to our repository. I always try to avoid that kind of dependency management because it is error-prone. Nevertheless, I can open a PR, without merging it, to check whether the performance improvement also shows up in the upload. In the meantime, I'll wait for the author's response about the possibility of releasing to Maven Central (fingltd/4mc#33).

Other splittable options (found in https://blogs.oracle.com/datawarehousing/hadoop-compression-choosing-compression-codec-part2 and https://www.cloudera.com/documentation/enterprise/5-3-x/topics/admin_data_compression_performance.html) are:

  • Snappy
  • LZ4

@magicDGS
Owner

Ok, maybe this will be easier than I thought. The hadoop library provides a way to handle compression by using the extension of the file, and using installed codec providers (I think that they should be set on the configuration file).

If this works, then you can add these providers and test different compression algorithms, even if the provider is not bundled. But that means you need the 4mc jar file on the classpath. I will do the PR anyway, because it is needed for supporting other compressors... We can try tomorrow evening if that works for you. That way, HDFS compression is independent of ReadTools and depends only on the end user!
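
The idea of picking a compressor from the file extension can be sketched with the standard library alone. This is only an analogue of the mechanism described above, not Hadoop's implementation: in Hadoop the lookup would presumably go through `CompressionCodecFactory.getCodec(Path)`, while the `ExtensionCodecs` class, its registry and its method names here are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.GZIPOutputStream;

public class ExtensionCodecs {

    /** Wraps a raw stream in a compressing stream. */
    interface Wrapper {
        OutputStream wrap(OutputStream raw) throws IOException;
    }

    // Hypothetical registry: file extension -> stream wrapper.
    // A real setup would discover codecs from configuration instead.
    private static final Map<String, Wrapper> REGISTRY = new LinkedHashMap<>();
    static {
        REGISTRY.put(".gz", GZIPOutputStream::new);
        REGISTRY.put(".deflate", DeflaterOutputStream::new);
    }

    /** Returns the registered extension the file name ends with, or null. */
    static String matchExtension(String fileName) {
        for (String ext : REGISTRY.keySet()) {
            if (fileName.endsWith(ext)) {
                return ext;
            }
        }
        return null;
    }

    /** Opens a possibly compressing stream; unknown extensions stay uncompressed. */
    static OutputStream open(String fileName, OutputStream raw) throws IOException {
        String ext = matchExtension(fileName);
        return ext == null ? raw : REGISTRY.get(ext).wrap(raw);
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (OutputStream out = open("reads.fq.gz", sink)) {
            out.write("@read1\nACGT\n+\nIIII\n".getBytes(StandardCharsets.UTF_8));
        }
        System.out.println("wrote " + sink.size() + " compressed bytes");
    }
}
```

With this design the writing code never names a compressor; supporting another format only means registering another extension, which is what makes the choice of codec a matter for the end user.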

@robmaz
Author

robmaz commented Feb 14, 2018

That is also how I understood it. HDFS should handle it on the fly if it finds the codec on the classpath, the codec is set in the HDFS configuration, and you tell it to.

4mc uses lz4 in the same way that bgzip uses gzip. I am a bit confused by the discussion you linked, because it claims that they have had support for lz4 since 2011 or so, yet the 2.7.4 API mentions BZip2Codec as the only implementation of the SplittableCompressionCodec interface (http://hadoop.apache.org/docs/r2.7.4/api/), which is why I was looking outside the Hadoop distro in the first place.

@magicDGS
Owner

@robmaz - I think that the writing part of the issue will be solved by using the factory from Hadoop, but the reading part might be a real issue to support in ReadTools. If you are sure that the output of the new Distmap should be compressed in .4mc, can you open a different issue for reading HDFS files with that codec? We can discuss other approaches there, such as every part being a headerless BAM, and on download it will be generated on demand. Maybe distmap can output a header in HDFS that will be added every time one or more parts are downloaded with ReadTools... Anyway, feel free to open a new issue for discussing that.

@robmaz
Author

robmaz commented Feb 15, 2018 via email

@magicDGS
Owner

It will work with any codec implemented for HDFS. The ones bundled with ReadTools are:

  • Splittable: bzip2 (hadoop-library), bgzf (block-compressed gzip - hadoop-BAM library) and gzip handled as block-compressed (hadoop-BAM).
  • Non-splittable: deflate, lz4, normal gzip, snappy (all from hadoop-library).

In principle they should work out of the box; we should check if other compressors can be used by providing their distribution in the classpath when running ReadTools (that is blocked by #406, which is delegating the compression for HDFS files to the Hadoop library).

@magicDGS
Owner

The latest master branch has the PR for the Hadoop compressors merged. Maybe you can try to run it with a custom classpath after installing with the HEAD option through our brew formula. @robmaz - can you tell me if that works?

@magicDGS
Owner

For running with a custom classpath (assuming you only want the 4mc support):

java -cp ReadTools.jar:hadoop-4mc-2.0.0.jar org.magicdgs.readtools.Main

This is because the classpath is ignored with the -jar option. I should probably add a documentation section for advanced users: how to set a custom java.nio.Path provider and/or a Hadoop compressor, etc.

I think that if I want to support this behavior, I will definitely need a wrapper script sooner or later to provide an easier way to run it. Thanks for pointing out ways to improve ReadTools.
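
When experimenting with a custom classpath, a quick check that the codec class can actually be loaded may save some debugging. The following is a minimal sketch: the `CodecOnClasspath` class is hypothetical, and the default class name `com.hadoop.compression.fourmc.FourMcCodec` is an assumption about the 4mc jar that should be verified against the jar itself.

```java
public class CodecOnClasspath {

    /** Returns true if the named class can be found by this class's loader. */
    static boolean isLoadable(String className) {
        try {
            // initialize=false: only locate the class, do not run static initializers
            Class.forName(className, false, CodecOnClasspath.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Assumed 4mc codec class name; pass another name as the first argument.
        String codec = args.length > 0
                ? args[0]
                : "com.hadoop.compression.fourmc.FourMcCodec";
        System.out.println(codec + (isLoadable(codec) ? " is" : " is NOT")
                + " on the classpath");
    }
}
```

Assuming the compiled class sits in the current directory, it could be run with the same classpath as above, for example `java -cp .:ReadTools.jar:hadoop-4mc-2.0.0.jar CodecOnClasspath`.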

@magicDGS magicDGS added the Type: Enhancement, Status: Completed and Priority: Medium labels and removed the distmap label Apr 20, 2018
@magicDGS magicDGS removed the ~hadoop label Apr 23, 2018
@magicDGS
Owner

Have you tested this, @robmaz? If so, and it works, please close the issue.

@magicDGS
Owner

@robmaz - can you test whether the upload with the custom classpath works? I just released v1.3.0 with the changes included. Follow these instructions to run it: http://magicdgs.github.io/ReadTools/custom_java_classpath.html#example-usage-4mc-compression-for-distmap

@magicDGS
Owner

Closing this issue - it should work, although I haven't tested it.
