Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Generic Resource Identifiers with ResourceIdTrait #32

Closed
wants to merge 5 commits into from

Conversation

tareknaser
Copy link
Collaborator

Description

This PR ports changes from ARK-Builders/arklib#90 into ark-rust.
It represents the first milestone of #29

Changes

  • Introduce ResourceIdTrait:
    • Hash type restrictions
    • Calculate hash values from path or bytes
  • Add compilation features for data-resource:
    • "non-cryptographic-hash" and "cryptographic-hash" (default) features
  • Implement blake3::ResourceId and crc32::ResourceId:
    • Include blake3::ResourceId under the "cryptographic-hash" feature
    • Include crc32::ResourceId under the "non-cryptographic-hash" feature
  • Define benchmarks for hash creation:
    • Create benchmarks for hash generation from paths and bytes
  • Update other crates in ark-rust workspace:
    • Transition them to use the "non-cryptographic-hash" feature for now, because previously they were using crc32 hash values directly

Copy link

Benchmark for 84bcdf7

Click to view benchmark
Test Base PR %
../test-assets/lena.jpg/compute_bytes 13.5±0.11µs N/A N/A
../test-assets/test.pdf/compute_bytes 110.0±1.08µs N/A N/A
compute_bytes_large/compute_bytes 136.8±0.91µs N/A N/A
compute_bytes_medium/compute_bytes 26.9±0.30µs N/A N/A
compute_bytes_small/compute_bytes 127.2±1.50ns N/A N/A
index_build/index_build/../test-assets/ 158.9±1.23µs 941.9±4.46µs +492.76%
resource_id_creation/compute_from_bytes:large 92.1±0.31µs N/A N/A
resource_id_creation/compute_from_bytes:medium 5.7±0.05µs N/A N/A
resource_id_creation/compute_from_bytes:small 96.4±0.41ns N/A N/A
resource_id_creation/compute_from_path:../test-assets/lena.jpg 56.7±0.36µs N/A N/A
resource_id_creation/compute_from_path:../test-assets/test.pdf 859.3±26.07µs N/A N/A

data-resource/Cargo.toml Outdated Show resolved Hide resolved
Deserialize,
Serialize,
)]
pub struct ResourceId {}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it stub for future implementation? We have "cryptographic-hash" enabled by default, so we should provide the default implementation.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The ResourceId struct doesn't hold any data because there is no need to. Instead, methods like from_path and from_bytes return hash values directly.

The ResourceIdTrait simply enforces some restrictions on types implementing it, to allow us to use it from other crates without worrying about the underlying hash function.

Deserialize,
Serialize,
)]
pub struct ResourceId {}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was like this:

pub struct ResourceId {
    pub data_size: u64,
    pub hash: u32,
}

By the way, could it be useful to have definition like this shared across implementations?

pub struct ResourceId<H> {
    pub data_size: u64,
    pub hash: H,
}

Then we would have ResourceId type everywhere in the client code without the need for aliases.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can use a generic type instead of an associated type

pub trait ResourceIdTrait<T>
where
    T: Debug
        + Display
        + FromStr
        + Clone
        + PartialEq
        + Eq
        + Ord
        + PartialOrd
        + Hash
        + Serialize,
{
    /// Computes the resource identifier from the given file path
    fn from_path<P: AsRef<Path>>(file_path: P) -> Result<T>;
    /// Computes the resource identifier from the given bytes
    fn from_bytes(data: &[u8]) -> Result<T>;
}

and in data-resource/src/crc32.rs, we can implement it

impl ResourceIdTrait<u32> for ResourceId {
    fn from_path<P: AsRef<Path>>(file_path: P) -> Result<u32> {

I'm not entirely sure how that change would affect our situation.

Btw, do you think we should keep data_size as a field of ResourceId?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another point to keep in mind:
With pub trait ResourceIdTrait<T> where T: Debug, we'd need to manually annotate types when importing ResourceId in other crates.

This isn't ideal as we want to switch between "non-cryptographic-hash" and "cryptographic-hash" implementations without manual code changes if possible.

ark-cli/src/main.rs Outdated Show resolved Hide resolved
Copy link

Benchmark for 9ee46bd

Click to view benchmark
Test Base PR %
../test-assets/lena.jpg/compute_bytes 13.3±0.10µs N/A N/A
../test-assets/test.pdf/compute_bytes 130.4±1.19µs N/A N/A
compute_bytes_large/compute_bytes 135.4±1.66µs N/A N/A
compute_bytes_medium/compute_bytes 27.5±0.46µs N/A N/A
compute_bytes_small/compute_bytes 129.1±2.50ns N/A N/A
index_build/index_build/../test-assets/ 158.9±0.83µs 943.3±4.58µs +493.64%
resource_id_creation/compute_from_bytes:large 91.7±0.44µs N/A N/A
resource_id_creation/compute_from_bytes:medium 5.7±0.01µs N/A N/A
resource_id_creation/compute_from_bytes:small 96.5±1.44ns N/A N/A
resource_id_creation/compute_from_path:../test-assets/lena.jpg 56.3±0.71µs N/A N/A
resource_id_creation/compute_from_path:../test-assets/test.pdf 856.6±12.88µs N/A N/A

Copy link

github-actions bot commented May 4, 2024

Benchmark for 88315ef

Click to view benchmark
Test Base PR %
../test-assets/lena.jpg/compute_bytes 13.3±0.06µs N/A N/A
../test-assets/test.pdf/compute_bytes 136.3±0.40µs N/A N/A
compute_bytes_large/compute_bytes 138.6±1.82µs N/A N/A
compute_bytes_medium/compute_bytes 27.5±0.36µs N/A N/A
compute_bytes_small/compute_bytes 129.9±1.32ns N/A N/A
index_build/index_build/../test-assets/ 159.1±0.64µs 948.3±3.69µs +496.04%
resource_id_creation/compute_from_bytes:large 91.9±0.26µs N/A N/A
resource_id_creation/compute_from_bytes:medium 5.7±0.05µs N/A N/A
resource_id_creation/compute_from_bytes:small 96.4±0.33ns N/A N/A
resource_id_creation/compute_from_path:../test-assets/lena.jpg 56.8±0.39µs N/A N/A
resource_id_creation/compute_from_path:../test-assets/test.pdf 861.6±12.73µs N/A N/A

@kirillt
Copy link
Member

kirillt commented May 11, 2024

We should test build of data-resource crate with both non-cryptographic-hash and cryptographic-hash features on CI.

Copy link

Benchmark for da50af1

Click to view benchmark
Test Base PR %
../test-assets/lena.jpg/compute_bytes 13.3±0.06µs N/A N/A
../test-assets/test.pdf/compute_bytes 110.2±0.55µs N/A N/A
compute_bytes_large/compute_bytes 138.9±0.86µs N/A N/A
compute_bytes_medium/compute_bytes 27.5±0.41µs N/A N/A
compute_bytes_small/compute_bytes 127.9±1.36ns N/A N/A
index_build/index_build/../test-assets/ 165.5±0.59µs 937.3±6.73µs +466.34%
resource_id_creation/compute_from_bytes:large 91.7±0.35µs N/A N/A
resource_id_creation/compute_from_bytes:medium 5.7±0.02µs N/A N/A
resource_id_creation/compute_from_bytes:small 96.5±1.16ns N/A N/A
resource_id_creation/compute_from_path:../test-assets/lena.jpg 57.2±0.55µs N/A N/A
resource_id_creation/compute_from_path:../test-assets/test.pdf 855.3±20.47µs N/A N/A

@tareknaser
Copy link
Collaborator Author

I've added two more tasks to the CI workflow. These tasks check if data-resource can be built with both the default feature (cryptographic-hash) and non-cryptographic-hash. Right now, these tasks are only part of the Linux job.

Once we reach the second milestone as discussed in #29 (comment), we can simplify things by using these commands to check compilation features:

cargo clippy --workspace
cargo clippy --workspace --features non-cryptographic-hash --no-default-features

and

cargo test --workspace
cargo test --workspace --features non-cryptographic-hash --no-default-features

These commands will handle linting and running tests for all relevant crates with different features.

@kirillt
Copy link
Member

kirillt commented May 11, 2024

So, after we finish #42 we should have approximately this layout:

  • data-resource will be really light crate, containing only the ResourceId trait
  • dev-hash, or data-resource-test defines reference implementations of ResourceId: CRC-32 and Blake3, later we'll extend this crate with other functions for benchmarking
  • fs-index depends on data-resource and operates with abstract ResourceId
  • fs-storage, fs-metadata, fs-properties,etc. ⏤ same
  • ark-cli depends on dev-hash because it's a Swiss Army Knife
  • benchmarks depend on dev-hash, is it possible in release profile? or only if the benchmarks are in separate crate?

Compilation features cryprographic-hash and non-cryptographic-hash could be moved into fs-index because only its code is conditioned, actually we can replace 2 flags with 1 flag (e.g. fast, or opposite secure, or cryptographic-hash). And CI would need to test build both flavors of fs-index.

@tareknaser
Copy link
Collaborator Author

  • fs-index depends on data-resource and operates with abstract ResourceId

But wouldn't that be an issue if data-resource only contains ResourceIdTrait?
fs-index and other workspace crates still need a ResourceId struct to operate.
As an example, in fs-index, the hash is a field stored in ResourceIndex.

  • ark-cli depends on dev-hash because it's a Swiss Army Knife

What sets ark-cli apart from other crates that require ResourceId? How does the usage of ResourceId differ between fs-index and ark-cli?

  • benchmarks depend on dev-hash, is it possible in release profile? or only if the benchmarks are in separate crate?

We could write benchmarks directly into the crate housing the implementations of ResourceIdTrait. This means that the hash function benchmarks would be located within the dev-hash crate itself for this proposal.

Compilation features cryprographic-hash and non-cryptographic-hash could be moved into fs-index because only its code is conditioned

I think it would be useful for the crate containing multiple implementations of ResourceId to include compilation flags representing the hash function used for each ResourceId implementation. For instance, the dev-hash or data-resource-test crate could have feature flags to export only the chosen ResourceId type.

Here's the reasoning:
Let's consider that dev-hash crate contains seven different implementations of ResourceIdTrait for seven different hash functions, some cryptographic and some not.
As a user, if I only want to use the blake3 implementation and have no interest in the others, having feature flags would allow me to select only the necessary crate dependencies, to avoid unnecessary bloat.

actually we can replace 2 flags with 1 flag (e.g. fast, or opposite secure, or cryptographic-hash).

I agree.
I don't see any rationale for having two feature flags. We could get the same results with just one.

@kirillt
Copy link
Member

kirillt commented May 13, 2024

fs-index depends on data-resource and operates with abstract ResourceId

But wouldn't that be an issue if data-resource only contains ResourceIdTrait?
fs-index and other workspace crates still need a ResourceId struct to operate.
As an example, in fs-index, the hash is a field stored in ResourceIndex.

I mean something like this:

pub trait ResourceId {

}

pub struct IndexEntry<Id: ResourceId> {
    pub modified: SystemTime,
    pub id: Id,
}

ark-cli depends on dev-hash because it's a Swiss Army Knife

What sets ark-cli apart from other crates that require ResourceId? How does the usage of ResourceId differ between fs-index and ark-cli?

The difference is that ark-cli is an app, not really a crate. Users of ARK framework could use ark-cli as a tool, or as a reference to implement something. But they will not depend on ark-cli as a crate. We could use only one hash function in ark-cli, but it seems that we'll need to cover at least several the most common scenarios of usage. This also means that the hash function would be selectable in ark-cli as a CLI parameter.

benchmarks depend on dev-hash, is it possible in release profile? or only if the benchmarks are in separate crate?

We could write benchmarks directly into the crate housing the implementations of ResourceIdTrait. This means that the hash function benchmarks would be located within the dev-hash crate itself for this proposal.

Agree, I've also came to this idea that we should move the benchmarks into dev-hash. We can also make dev-hash depend on fs-index and any other crates, so all benchmarks would be there. In fact, we could just use name benchmarks for this 😁

Compilation features cryprographic-hash and non-cryptographic-hash could be moved into fs-index because only its code is conditioned

I think it would be useful for the crate containing multiple implementations of ResourceId to include compilation flags representing the hash function used for each ResourceId implementation. For instance, the dev-hash or data-resource-test crate could have feature flags to export only the chosen ResourceId type.

The dev-hash or benchmarks should not export anything. It's more like a reference/mock app. Also, we would need all hash functions for benchmarking. So, compilation flags are not necessary here. By the way, calling this folder benchmarks could make it clear that it's not suitable as a dependency for the users. It'd still be possible to use it for quick prototyping 😉

- Introduce `ResourceIdTrait`:
	- Hash type restrictions
    - Calculate hash values from paths or bytes.
- Add compilation features for `data-resource`:
    - "non-cryptographic-hash" and "cryptographic-hash" (default) features.
- Implement `blake3::ResourceId` and `crc32::ResourceId`:
    - Include `blake3::ResourceId` under the "cryptographic-hash" feature.
    - Include `crc32::ResourceId` under the "non-cryptographic-hash" feature.
- Define benchmarks for hash creation:
    - Create benchmarks for hash generation from paths and bytes.
- Update other crates in `ark-rust` workspace:
    - Transition them to use the "non-cryptographic-hash" feature for now, because previously they were using crc32fast hash values directly.

Signed-off-by: Tarek <[email protected]>
The `data-resource` crate now defines two compilation features: "cryptographic-hash" and "non-cryptographic-hash". Each feature corresponds to a specific hash function.

This allows for more granular control over dependencies, as only the necessary dependencies for the selected hash function are exposed.

Signed-off-by: Tarek <[email protected]>
@tareknaser
Copy link
Collaborator Author

Moved to #42

@tareknaser tareknaser closed this May 19, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants