feat: add read support for v2016, v2020 chunkers
In addition to the FastCDC types in the v2016 and v2020 modules, there
are now StreamCDC structs that read from a boxed Read into a buffer sized
to fit the maximum chunk. While this is convenient for processing large
files, it is a bit slower than using memory-mapped files with a crate such
as memmap2. Added examples that demonstrate using the streaming chunkers.
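The read-into-a-buffer approach described above can be sketched roughly as follows. This is a simplified illustration only, not the crate's actual internals: the `fill_buffer` helper and the fixed pretend cut point are invented for the sketch, whereas the real StreamCDC derives each cut point from the FastCDC hash judgment.

```rust
use std::io::Read;

// Top up `buffer` from `source` until it holds `max_size` bytes or EOF
// is reached. Returns the number of bytes read on this call.
fn fill_buffer<R: Read>(
    source: &mut R,
    buffer: &mut Vec<u8>,
    max_size: usize,
) -> std::io::Result<usize> {
    let want = max_size.saturating_sub(buffer.len()) as u64;
    let before = buffer.len();
    source.take(want).read_to_end(buffer)?;
    Ok(buffer.len() - before)
}

fn main() -> std::io::Result<()> {
    // Stand-in for a boxed Read: 10,000 bytes of synthetic data.
    let data: Vec<u8> = (0..10_000).map(|i| (i % 251) as u8).collect();
    let mut source: &[u8] = &data;
    let mut buffer: Vec<u8> = Vec::with_capacity(4096);
    let mut offset = 0;
    loop {
        fill_buffer(&mut source, &mut buffer, 4096)?;
        if buffer.is_empty() {
            break;
        }
        // Pretend the chunker cut at 3000 bytes (or at EOF).
        let cut = buffer.len().min(3000);
        println!("chunk offset={} length={}", offset, cut);
        offset += cut;
        buffer.drain(..cut);
    }
    Ok(())
}
```

With these parameters the sketch emits four chunks of 3000, 3000, 3000, and 1000 bytes; the fill-then-drain cycle is what keeps the buffer bounded by the maximum chunk size.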

cargo test passes
nlfiedler committed Jan 27, 2023
1 parent 837ccb1 commit c41f3c1
Showing 12 changed files with 979 additions and 214 deletions.
3 changes: 3 additions & 0 deletions CHANGELOG.md
@@ -8,9 +8,12 @@ This file follows the convention described at
## [Unreleased]
### Changed
- **Breaking:** moved ronomon FastCDC implementation into `ronomon` module.
What was `fastcdc::FastCDC::new()` is now `fastcdc::ronomon::FastCDC::new()`.
### Added
- Canonical implementation of FastCDC from 2016 paper in `v2016` module.
- Canonical implementation of FastCDC from 2020 paper in `v2020` module.
- `Normalization` enum to set the normalized chunking for `v2016` and `v2020` chunkers.
- `StreamCDC`, streaming version of `FastCDC`, in `v2016` and `v2020` modules.

## [2.0.0] - 2023-01-14
### Added
4 changes: 2 additions & 2 deletions Cargo.toml
@@ -16,7 +16,7 @@ exclude = [
[dev-dependencies]
aes = "0.8.2"
byteorder = "1.4.3"
clap = { version = "4.0.32", features = ["cargo"] }
clap = { version = "4.1.4", features = ["cargo"] }
ctr = "0.9.2"
md-5 = "0.10.5"
memmap = "0.7.0"
memmap2 = "0.5.8"
24 changes: 16 additions & 8 deletions README.md
@@ -16,9 +16,7 @@ $ cargo test

## Example Usage

Examples can be found in the `examples` directory of the source repository,
which demonstrate reading files of arbitrary size into a memory-mapped buffer
and passing them through the different chunker implementations.
Examples that demonstrate finding chunk boundaries in a given file can be found in the `examples` directory of the source repository. There are both streaming and non-streaming examples; the non-streaming examples can read arbitrarily large files via the `memmap2` crate.

```shell
$ cargo run --example v2020 -- --size 16384 test/fixtures/SekienAkashita.jpg
@@ -47,24 +45,34 @@ assert_eq!(results[1].offset, 66549);
assert_eq!(results[1].length, 42917);
```

### Streaming

Both the `v2016` and `v2020` modules have a streaming version of FastCDC named `StreamCDC`, which takes a boxed `Read` and uses a byte vector with capacity equal to the specified maximum chunk size.

```rust
use std::fs::File;
use fastcdc::v2020::StreamCDC;
let source = File::open("test/fixtures/SekienAkashita.jpg").unwrap();
let chunker = StreamCDC::new(Box::new(source), 4096, 16384, 65535);
for result in chunker {
let chunk = result.unwrap();
println!("offset={} length={}", chunk.offset, chunk.length);
}
```

## Migration from pre-3.0

If you were using a release of this crate from before the 3.0 release, you will need to make a small adjustment to continue using the same implementation as before.

Before the 3.0 release:

```rust
use fastcdc::ronomon as fastcdc;
use std::fs;
let contents = fs::read("test/fixtures/SekienAkashita.jpg").unwrap();
let chunker = fastcdc::FastCDC::new(&contents, 8192, 16384, 32768);
```

After the 3.0 release:

```rust
use std::fs;
let contents = fs::read("test/fixtures/SekienAkashita.jpg").unwrap();
let chunker = fastcdc::ronomon::FastCDC::new(&contents, 8192, 16384, 32768);
```

25 changes: 10 additions & 15 deletions TODO.org
@@ -1,16 +1,11 @@
* Action Items
** TODO Rewrite
*** TODO incorporate some form of streaming support based on =Read=
**** c.f. https://gitlab.com/asuran-rs/asuran/ (asuran-chunker, uses =fastcdc= with =Read=)
**** basically just allocate a buffer 2*max and fill it as needed
**** c.f. https://github.com/jotfs/fastcdc-go/blob/master/fastcdc.go
**** c.f. https://github.com/wxiacode/Duplicacy-FastCDC/blob/master/src/duplicacy_chunkmaker.go
*** TODO test: check if ronomon version of fastcdc produces same results as rust version
**** if so, maybe make this a requirement of the rust version of ronomon
** timing on =MSEdge-Win10.ova= with 4mb chunks
*** run with =--release= flag 7 times, drop low/high, average remaining 5
| chunker | avg time |
|---------+----------|
| v2020 | 3.437 |
| ronomon | 4.085 |
| v2016 | 4.266 |
** Time for examples to chunk =MSEdge-Win10.ova= with 4mb chunks
*** use =time cargo run --release ...=, 7 times, drop low/high, average remaining 5
*** note that the non-streaming examples use =memmap2= to read from the file as a slice
| chunker | avg time |
|------------+----------|
| v2020 | 3.437 |
| ronomon | 4.085 |
| v2016 | 4.266 |
| stream2020 | 5.847 |
| stream2016 | 6.659 |
4 changes: 2 additions & 2 deletions examples/ronomon.rs
@@ -3,7 +3,7 @@
//
use clap::{arg, command, value_parser, Arg};
use fastcdc::ronomon::*;
use memmap::MmapOptions;
use memmap2::Mmap;
use std::fs::File;

fn main() {
@@ -26,7 +26,7 @@ fn main() {
let avg_size = *size as usize;
let filename = matches.get_one::<String>("INPUT").unwrap();
let file = File::open(filename).expect("cannot open file!");
let mmap = unsafe { MmapOptions::new().map(&file).expect("cannot create mmap?") };
let mmap = unsafe { Mmap::map(&file).expect("cannot create mmap?") };
let min_size = avg_size / 4;
let max_size = avg_size * 4;
let chunker = FastCDC::new(&mmap[..], min_size, avg_size, max_size);
38 changes: 38 additions & 0 deletions examples/stream2016.rs
@@ -0,0 +1,38 @@
//
// Copyright (c) 2023 Nathan Fiedler
//
use clap::{arg, command, value_parser, Arg};
use fastcdc::v2016::*;
use std::fs::File;

fn main() {
let matches = command!("Example of using v2016 streaming chunker.")
.about("Finds the content-defined chunk boundaries of a file.")
.arg(
arg!(
-s --size <SIZE> "The desired average size of the chunks."
)
.value_parser(value_parser!(u32)),
)
.arg(
Arg::new("INPUT")
.help("Sets the input file to use")
.required(true)
.index(1),
)
.get_matches();
let size = matches.get_one::<u32>("size").unwrap_or(&131072);
let avg_size = *size;
let filename = matches.get_one::<String>("INPUT").unwrap();
let file = File::open(filename).expect("cannot open file!");
let min_size = avg_size / 4;
let max_size = avg_size * 4;
let chunker = StreamCDC::new(Box::new(file), min_size, avg_size, max_size);
for result in chunker {
let entry = result.expect("failed to read chunk");
println!(
"hash={} offset={} size={}",
entry.hash, entry.offset, entry.length
);
}
}
38 changes: 38 additions & 0 deletions examples/stream2020.rs
@@ -0,0 +1,38 @@
//
// Copyright (c) 2023 Nathan Fiedler
//
use clap::{arg, command, value_parser, Arg};
use fastcdc::v2020::*;
use std::fs::File;

fn main() {
let matches = command!("Example of using v2020 streaming chunker.")
.about("Finds the content-defined chunk boundaries of a file.")
.arg(
arg!(
-s --size <SIZE> "The desired average size of the chunks."
)
.value_parser(value_parser!(u32)),
)
.arg(
Arg::new("INPUT")
.help("Sets the input file to use")
.required(true)
.index(1),
)
.get_matches();
let size = matches.get_one::<u32>("size").unwrap_or(&131072);
let avg_size = *size;
let filename = matches.get_one::<String>("INPUT").unwrap();
let file = File::open(filename).expect("cannot open file!");
let min_size = avg_size / 4;
let max_size = avg_size * 4;
let chunker = StreamCDC::new(Box::new(file), min_size, avg_size, max_size);
for result in chunker {
let entry = result.expect("failed to read chunk");
println!(
"hash={} offset={} size={}",
entry.hash, entry.offset, entry.length
);
}
}
4 changes: 2 additions & 2 deletions examples/v2016.rs
@@ -3,7 +3,7 @@
//
use clap::{arg, command, value_parser, Arg};
use fastcdc::v2016::*;
use memmap::MmapOptions;
use memmap2::Mmap;
use std::fs::File;

fn main() {
@@ -26,7 +26,7 @@ fn main() {
let avg_size = *size;
let filename = matches.get_one::<String>("INPUT").unwrap();
let file = File::open(filename).expect("cannot open file!");
let mmap = unsafe { MmapOptions::new().map(&file).expect("cannot create mmap?") };
let mmap = unsafe { Mmap::map(&file).expect("cannot create mmap?") };
let min_size = avg_size / 4;
let max_size = avg_size * 4;
let chunker = FastCDC::new(&mmap[..], min_size, avg_size, max_size);
4 changes: 2 additions & 2 deletions examples/v2020.rs
@@ -3,7 +3,7 @@
//
use clap::{arg, command, value_parser, Arg};
use fastcdc::v2020::*;
use memmap::MmapOptions;
use memmap2::Mmap;
use std::fs::File;

fn main() {
@@ -26,7 +26,7 @@ fn main() {
let avg_size = *size;
let filename = matches.get_one::<String>("INPUT").unwrap();
let file = File::open(filename).expect("cannot open file!");
let mmap = unsafe { MmapOptions::new().map(&file).expect("cannot create mmap?") };
let mmap = unsafe { Mmap::map(&file).expect("cannot create mmap?") };
let min_size = avg_size / 4;
let max_size = avg_size * 4;
let chunker = FastCDC::new(&mmap[..], min_size, avg_size, max_size);
14 changes: 12 additions & 2 deletions src/lib.rs
@@ -1,5 +1,5 @@
//
// Copyright (c) 2020 Nathan Fiedler
// Copyright (c) 2023 Nathan Fiedler
//

//! This crate implements multiple versions of the FastCDC content defined
@@ -55,7 +55,7 @@
//! For a canonical implementation of the algorithm as described in the 2020
//! paper, see the `v2020` module. This implementation produces the same cut
//! points as the 2016 version, but does so a bit faster.
//!
//!
//! If you are using this crate for the first time, the `v2020` implementation
//! would be the most appropriate. It uses 64-bit hash values and tends to be
//! faster than both the `ronomon` and `v2016` versions.
@@ -112,6 +112,16 @@
//! points that were determined by the maximum size rather than the data itself.
//! Ideally you want cut points that are determined by the input data. However,
//! this is application dependent and your situation may be different.
//!
//! ## Large Data
//!
//! If processing very large files, the streaming version of the chunkers in the
//! `v2016` and `v2020` modules may be a suitable approach. They both allocate a
//! byte vector equal to the maximum chunk size, draining and resizing the
//! vector as chunks are found. However, using a crate such as `memmap2` can be
//! significantly faster than the streaming chunkers. See the examples in the
//! `examples` directory for how to use the streaming versions as-is, versus the
//! non-streaming chunkers which read from a memory-mapped file.
pub mod ronomon;
pub mod v2016;
