v3 Planning & Objectives #194

Closed
jeremy-visionaid opened this issue Oct 13, 2024 · 38 comments

@jeremy-visionaid
Collaborator

jeremy-visionaid commented Oct 13, 2024

Now that v2.4 looks like it's coming together, I was wondering what folks' thoughts were on objectives for v3?

If I understand correctly, the values/objectives for v2 are:

  • Pure dotnet implementation
  • Maximized client compatibility (i.e. netstandard2.0, net4.0)
  • Easy traversal of storages/streams
  • Easy manipulation of stream data

There are some goals I have in mind for v3:

  • Support 16 TB files (i.e. the maximum 0xFFFFFFFA sector count and therefore uint sector IDs)
  • Support transactions (e.g. scratch data rather than snapshot copy)
  • Support consolidation on commit (e.g. online rather than copy)
  • Revised API to follow dotnet conventions (e.g. CFStream by implementing Stream directly instead of via a decorator)
  • Idiomatic exception hierarchy (Review exception hierarchy for v3 #146)
  • Improved performance
  • Reduced memory usage
  • Nullable attributes/static analysis
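
As a rough check of the 16 TB figure in the first goal, the arithmetic can be sketched as below. This assumes the 4096-byte sectors of a version 4 file; the class and member names are made up for illustration and are not part of OpenMcdf.

```csharp
using System;

// Illustrative only: names here are hypothetical, not OpenMcdf API.
public static class CfbLimits
{
    // 0xFFFFFFFA is the limit quoted above. It exceeds int.MaxValue (0x7FFFFFFF),
    // which is why sector IDs need to be uint rather than int.
    public const uint MaxRegularSector = 0xFFFFFFFA;

    // Assumption: version 4 compound files use 4096-byte sectors.
    public const uint V4SectorSize = 4096;

    // Upper bound on addressable sector data, in bytes.
    // The cast to ulong avoids 32-bit overflow in the multiplication.
    public static ulong MaxDataBytes(uint sectorCount, uint sectorSize)
        => (ulong)sectorCount * sectorSize;
}
```

With these values the product is 17,592,186,019,840 bytes, i.e. just under 16 TiB.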

Other thoughts:

@jeremy-visionaid
Collaborator Author

As I began to understand the OpenMcdf code better, and more generally the CFB format, while investigating #184, I was thinking that it might be quite difficult to add the features I've previously mentioned to v2. So, at the end of last week I took the liberty of starting some proof of concept work on a new approach for version 3. I've adapted/written code to parse headers and directory entries, enumerate FAT sectors, traverse sector chains and read the contents of a stream. There's obviously quite a lot of functionality missing (substorage traversal, reading mini FAT, any kind of writing), but it is at least enough to spin up the equivalent of the "InMemory" benchmark:

Windows Structured Storage (ILockBytes over a MemoryStream)

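| Method | BufferSize | TotalStreamSize | Mean     | Error   | StdDev  | Allocated |
|------- |----------- |---------------- |---------:|--------:|--------:|----------:|
| Test   | 1048576    | 1048576         | 215.4 us | 2.25 us | 2.11 us |     440 B |
| Test   | 524288     | 1048576         | 215.6 us | 4.07 us | 4.36 us |     440 B |
| Test   | 262144     | 1048576         | 212.6 us | 4.06 us | 4.51 us |     440 B |
| Test   | 131072     | 1048576         | 211.4 us | 1.71 us | 1.42 us |     440 B |
| Test   | 4096       | 1048576         | 205.6 us | 1.78 us | 1.58 us |     440 B |
| Test   | 1024       | 1048576         | 237.9 us | 4.69 us | 4.39 us |     440 B |
| Test   | 512        | 1048576         | 307.4 us | 6.02 us | 9.72 us |     440 B |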

OpenMcdf v2.3.1

| Method | BufferSize | TotalStreamSize | Mean        | Error     | StdDev    | Gen0      | Gen1     | Gen2     | Allocated |
|------- |----------- |---------------- |------------:|----------:|----------:|----------:|---------:|---------:|----------:|
| Test   | 1048576    | 1048576         |    187.9 us |   3.71 us |   8.14 us |  137.6953 |  68.6035 |        - |    1.1 MB |
| Test   | 524288     | 1048576         |    196.9 us |   3.85 us |   5.39 us |  140.1367 |  55.1758 |        - |   1.12 MB |
| Test   | 262144     | 1048576         |    219.8 us |   2.81 us |   2.76 us |  144.2871 |  72.0215 |        - |   1.15 MB |
| Test   | 131072     | 1048576         |    276.2 us |   5.04 us |   4.47 us |  152.8320 |  76.1719 |        - |   1.22 MB |
| Test   | 4096       | 1048576         |  3,899.4 us |  74.65 us |  69.83 us |  671.8750 | 335.9375 |        - |   5.41 MB |
| Test   | 1024       | 1048576         | 15,455.3 us | 141.46 us | 125.40 us | 2281.2500 | 375.0000 | 156.2500 |  18.37 MB |
| Test   | 512        | 1048576         | 30,795.2 us | 614.23 us | 880.90 us | 4468.7500 | 437.5000 | 187.5000 |  35.66 MB |

OpenMcdf v3 Proof of Concept

| Method | BufferSize | TotalStreamSize | Mean     | Error    | StdDev   | Gen0   | Allocated |
|------- |----------- |---------------- |---------:|---------:|---------:|-------:|----------:|
| Test   | 1048576    | 1048576         | 59.28 us | 0.150 us | 0.125 us | 0.1221 |   1.96 KB |
| Test   | 524288     | 1048576         | 60.57 us | 0.117 us | 0.092 us | 0.1831 |   1.96 KB |
| Test   | 262144     | 1048576         | 59.48 us | 0.099 us | 0.083 us | 0.1831 |   1.96 KB |
| Test   | 131072     | 1048576         | 60.04 us | 0.790 us | 0.739 us | 0.1831 |   1.96 KB |
| Test   | 4096       | 1048576         | 62.56 us | 1.119 us | 1.046 us | 0.1221 |   1.96 KB |
| Test   | 1024       | 1048576         | 76.02 us | 0.265 us | 0.248 us | 0.1221 |   1.96 KB |
| Test   | 512        | 1048576         | 75.58 us | 0.997 us | 0.933 us | 0.1221 |   1.96 KB |

So, there's some pretty big performance (400x faster for short reads) and memory reduction (Gen0 GCs are drastically reduced, and Gen1/2 GCs are eliminated) wins to be had on reading, while also enforcing reasonably strict validation.

In the proof of concept, BinaryReader and BinaryWriter are extended to handle CFB types, and there is only one reader and one writer stored in a context (along with the header) that is shared across objects that need access to it. Sectors are lightweight structs mostly to record their ID and map to their position within the CFB stream/file. There are a couple of enumerators that do the main work:
FatSectorEnumerator: Enumerates the FAT sectors from the Header's DIFAT array and the DIFAT chain.
FatSectorChainEnumerator: Enumerates a chain of FAT sectors for an entry/directory/FAT
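
To illustrate the "lightweight struct" idea described above, here is a minimal sketch of a sector value that records its ID and maps it to a position in the underlying stream. It is not taken from the proof of concept; the type and member names are assumptions. The offset rule (sector N starts at (N + 1) × sector size, because the header occupies the first sector-sized block) comes from the CFB format.

```csharp
using System;

// Hypothetical sketch; not the type used in the proof-of-concept branch.
public readonly struct Sector
{
    public Sector(uint id, int length)
    {
        Id = id;
        Length = length;
    }

    // Sector ID as stored in the FAT.
    public uint Id { get; }

    // Sector size in bytes (512 for v3 files, 4096 for v4 files).
    public int Length { get; }

    // The header takes up the first sector-sized block of the file,
    // so sector 0 begins one sector length in. The long cast keeps
    // the arithmetic from overflowing for large IDs.
    public long Position => ((long)Id + 1) * Length;

    // First byte position after this sector.
    public long EndPosition => Position + Length;
}
```

Because the struct only holds an ID and a length, creating one per step of an enumeration does not allocate on the heap.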

Although I haven't done any code to write data yet, I'm thinking the enumerators might be converted to mutable iterators which should also be reasonably fast/efficient. I'll share some code when I've cleaned it up and progressed a bit further!

@jeremy-visionaid jeremy-visionaid changed the title v3 Planning v3 Planning & Objectives Oct 13, 2024
@ironfede
Owner

Honestly speaking... it's a perfect summary.
I think that the 2.4 target is almost reached. I would not introduce new features in this branch since it has reached a certain maturity level. v3 should take the good ideas from it and refactor them to allow a better separation of logic and to avoid the up-and-down runs needed to allocate and persist sector chains, since that is where the big performance penalties lie, even if the sector chain is a fairly compact representation and working unit for CFB handling.

@jeremy-visionaid
Collaborator Author

@ironfede Yesterday, I added a dedicated enumerator for directory entries in a FAT chain and another enumerator for the directory tree, so it can now traverse storages and streams as part of a tree (it could only traverse them as a list before). I also improved the enumerators so they're a bit easier to follow and improved validation (enumerators and sector offsets throw if you try to access something that's invalid/out of bounds). I'll have a look at implementing support for mini FAT sectors today, then I think I'll be at the point where I'll have something worth sharing for comments and feedback. But essentially, aside from some clean-up and further validation/testing, I think the POC already meets the following objectives:

  • Support 16 TB files (i.e. the maximum 0xFFFFFFFA sector count and therefore uint sector IDs)
  • Revised API to follow dotnet conventions (e.g. CFStream by implementing Stream directly instead of via a decorator)
  • Idiomatic exception hierarchy (Review exception hierarchy for v3 #146)
  • Improved performance
  • Reduced memory usage
  • Nullable attributes/static analysis

@jeremy-visionaid
Collaborator Author

@Numpsy Looks like you might have a particular interest in the OLE Property Set Data Structures? Do you have anything you'd like to see for v3? I can't say I know too much about it, so aside from some nit-pick refactoring work my only real comment is that perhaps OpenMcdf.Extensions should be renamed to OpenMcdf.Ole (i.e. explicitly about OLE only). Especially since we probably won't require a decorator for streams in v3.

@Numpsy
Contributor

Numpsy commented Oct 14, 2024

My current use case is reading and writing metadata (summary information etc) in Office documents, so things like really massive files aren't really an issue.
The recent changes have shaved a nice amount of memory allocations from reading said properties, but as the time taken is sub-millisecond it's hard to measure changes (though any gain is nice).

As far as the API goes, I think it would be nice to review how it presents the different property sets - as well as SummaryInformation/DocumentSummaryInformation, there is some amount of support for others, but it's not clear how far the intended support goes.
@farfilli has raised some issues in that area, so maybe he has some thoughts for possible improvements?

@Numpsy
Contributor

Numpsy commented Oct 14, 2024

There's also scope for a more complete set of functions for adding/updating/deleting properties (e.g. #190) - that's a higher level API than the changes to the storage part.

@farfilli

farfilli commented Oct 15, 2024

My use case is similar to @Numpsy's, but with Solid Edge (CAD) documents. Besides the standard SummaryInformation/DocumentSummaryInformation, some more app-specific metadata streams need to be accessed. Regular documents contain one part per file; however, some documents may contain more than one part. This means the file's structure becomes multilevel and each part has its own SummaryInformation/DocumentSummaryInformation and so on.
It would be nice to have methods to identify these situations.

If needed I can provide example files with an explanation of their structure.

@jeremy-visionaid
Collaborator Author

@farfilli Sure, I'd love to get some test documents from other apps for test purposes if they're not massive. My proof of concept also addresses #58, and I've been thinking about ways to cover #66 too which might also address your use case.

@farfilli

@jeremy-visionaid that's great, let me know where to upload the files along with brief descriptions of them

@jeremy-visionaid
Collaborator Author

@farfilli Maybe you could push them on a branch of a fork and put some descriptions in the commit message?

@farfilli

@jeremy-visionaid done in 0c8fc18

@jeremy-visionaid
Collaborator Author

jeremy-visionaid commented Oct 16, 2024

Okeydoke, I think I've progressed with the v3 proof of concept far enough that I'm happy to push it for some early feedback.
It's a fresh branch that doesn't share any common commits with the upstream (it helped me having the projects side-by-side):
https://github.com/Visionaid-International-Ltd/openmcdf/tree/3.0-poc

There are some notable things still missing:

  • Comments 😅
  • It's currently read-only
  • Missing some argument validation (e.g. ArgumentNullException)
  • Some data/format validation
  • A FileFormatException or equivalent class (esp. catching ArgumentExceptions and rethrowing for corrupt files)
  • Red-black tree search
  • Some things are constant that could/should be taken from the header (but would otherwise be considered corrupt according to the spec)
  • Project setup stuff (e.g. Licensing, spell checking, CI etc.)
  • EntryInfo is sparse (only the name is provided to the client)

I'll at least try and add some comments tomorrow, and see if I can add some additional features as above through the week, but comments and feedback would be appreciated!

@jeremy-visionaid
Collaborator Author

I've pushed some changes with some refactoring, improved terminology, filled out the summary comments and improved validation for corrupt CFBs. The reading part at least should be in pretty good shape if anyone wants to take it for a spin and try it out. I could bring the structured storage explorer across if it's of interest? Otherwise it's just unit tests right now.

@ironfede
Owner

ironfede commented Oct 17, 2024

Thank you @jeremy-visionaid for all your work. I'm already working on a base v3, including some deep structural changes to how sector chains and streams in general are manipulated, to lay the groundwork for a future SWMR scenario and for better performance in write operations. I hope to put a draft online shortly since I understand that there's (rightly) a little bit of pressure on this... :-)

@jeremy-visionaid
Collaborator Author

jeremy-visionaid commented Oct 18, 2024

@ironfede That's no worries, it's been quite fun! I did consider trying to work with v2.4 as a base, but I've handled sectors, chains and streams in a totally different way, changed and improved the error handling etc., and even the terminology for types and members is a bit different. Since I'm the one that wants most of the v3 features and robustness etc., I'm also happy to do the bulk of the work, but I think it was just more efficient for me to start afresh in order to get there.

I'd be curious as to what things you were thinking about with regards to SWMR? Although it's not a goal I had in mind, I think the architecture of what I've done in the proof of concept would likely lend itself quite well to that situation already. I have been thinking ahead a bit towards transactions, consolidation and scratch data, etc., but I only really have an idea about how they could be achieved with the POC.

It's not so much that there's a huge amount of pressure to get v3 out immediately. I can probably work around the transactions and consolidation stuff on the client side. Releasing v2.4.0 would be helpful though, since v2.3 currently crashes our app, so I can't yet merge the changes to use it. Updating to v3 I hope would improve the performance, enable larger files and reduce memory and disk space requirements, etc, but it otherwise might not be a release blocker. (Though before we go to RTM with our application, I do need to validate some things with 2.4 - or possibly even 3.0 instead depending on how things go on that front).

Though I've been somewhat busy with other tasks this week, I hope I might find some time to bring the POC up to feature parity with v2.4 around the middle of next week, but I'd also love to see your draft 3.0 to see how it compares.

@ironfede
Owner

There are some points in a version 3 that are honestly a little bit difficult to implement in a clean way. The first is supporting LARGE files for v4 CFB (16 TB streams, as the spec defines). This point alone means that a chain, before any data, could need more than 4 billion sector indexes, which is a huge number to support, so I'm thinking about how to load chains partially, for just the requested sections, in order to manipulate only the sections that need to be read or written. The natural "old" way to handle this kind of thing would be a memory-mapped file, but that means all structures would need to be rethought as reference-free structs, C-style, and this could be difficult to maintain in the long run. On the other hand, with a Stream-based approach, I think that some functions to obtain partial sections of chains should be introduced...

@molesmoke

That's actually largely the point of why I rewrote the sector/chain/stream handling in the proof of concept. The POC is designed/intended to handle arbitrarily long files/chains/streams, but I couldn't really see how to adapt v2.4 to handle that... I've made a start on handling writing, but it's limited to in-place alterations at the moment. Allowing arbitrarily large writes while deferring commits is still to do, but I think I might have that going in another day or two.

@ironfede
Owner

@jeremy-visionaid , @molesmoke , @Numpsy and everybody interested, I've added 3.0 branch.
If you want, please start pr-ing on this branch for POCs so we have a common ground on this repository.

Many thanks to everybody!

@ironfede ironfede added this to the 3.0 milestone Oct 20, 2024
@jeremy-visionaid
Collaborator Author

jeremy-visionaid commented Oct 20, 2024

@ironfede Thanks for making that branch. I'll keep pushing to #199 for now, and we can retarget it/rework it as appropriate once the investigation is complete.

I've pushed some more reading related improvements to #199 that I found while working on the writing part (though I'm aware that more can still be done). But here's a quick summary of the current POC state/design:

CfbBinaryReader/Writer: Read and write CFB types (e.g. GUIDs, DateTimes, Headers, DirectoryEntries)
DirectoryEntryEnumerator: Enumerates DirectoryEntries from a FAT sector chain (via a FatSectorChainEnumerator)
DirectoryTreeEnumerator: Enumerates the children of a DirectoryEntry from the Red-Black tree (via a DirectoryEntryEnumerator)
FatSectorChainEnumerator: Enumerates the sectors from a FAT sector chain (i.e. over a FatSectorEnumerator)
FatSectorEnumerator: Enumerates the sectors that contain the FAT (i.e. from the Header DIFAT array and DIFAT chain)
FatStream: Provides a Stream for a stream object for a FAT sector chain (via a FatSectorChainEnumerator)
MiniFatSectorChainEnumerator: Enumerates the MiniSectors in a mini FAT sector chain (i.e. over a MiniFatSectorEnumerator)
MiniFatSectorEnumerator: Enumerates the MiniSectors in a FAT sector chain (i.e. over a FatSectorChainEnumerator)
MiniFatStream: Provides a Stream for a stream object in the mini FAT stream (via a MiniFatSectorChainEnumerator)
RootStorage: The root storage of the CFB file/stream
Storage: Analogous to a file system directory
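
To make the chain enumeration idea concrete, here is a simplified sketch of walking a FAT sector chain lazily. It is not the POC's `FatSectorChainEnumerator`: it assumes the whole FAT is already in memory as a `uint[]` (the POC reads FAT sectors on demand), and the class name and validation choices are illustrative. The `0xFFFFFFFE` end-of-chain marker and `0xFFFFFFFA` maximum regular sector ID are from the CFB format.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Simplified illustration; assumes the FAT fits in a single array.
public static class FatChain
{
    public const uint EndOfChain = 0xFFFFFFFE;
    public const uint MaxRegularSector = 0xFFFFFFFA;

    // Lazily yields each sector ID in the chain starting at startId.
    // Only one ID is held at a time, so memory use does not grow with chain length.
    public static IEnumerable<uint> Enumerate(uint[] fat, uint startId)
    {
        uint id = startId;
        long visited = 0;
        while (id != EndOfChain)
        {
            // Reject special values (free sector, FAT sector, ...) and out-of-range IDs.
            if (id > MaxRegularSector || id >= (uint)fat.Length)
                throw new InvalidDataException($"Invalid sector ID {id} in chain.");

            // A valid chain can never visit more sectors than the FAT has entries.
            if (++visited > fat.Length)
                throw new InvalidDataException("Cycle detected in sector chain.");

            yield return id;
            id = fat[id]; // Each FAT entry holds the ID of the next sector.
        }
    }
}
```

Since this is an iterator method, the validation exceptions are thrown while enumerating, not when `Enumerate` is called.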

A lot of the initial shortcomings are still present, but I'd like to get writing basically finished before neating it up and filling it out.

I'm currently working on the following bits:
FatEnumerator: Enumerates all the sectors from the FAT (e.g. to find free sectors)
CfbStream: Provides a Stream over either a FatStream or MiniFatStream (and transitions between the two as/when appropriate)
OverlayStream: Implements scratch storage for dirty sectors as an overlay over the base stream
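
The overlay idea can be sketched roughly as follows. This is an assumption-laden illustration rather than the POC's `OverlayStream`: it keeps dirty sectors in an in-memory dictionary keyed by position (whereas scratch storage would more likely be a stream or temp file), it works on whole sectors only, and all names are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical sketch: dirty sectors shadow the base stream until committed.
public sealed class SectorOverlay
{
    private readonly Stream baseStream;
    private readonly int sectorSize;
    private readonly Dictionary<long, byte[]> dirty = new Dictionary<long, byte[]>();

    public SectorOverlay(Stream baseStream, int sectorSize)
    {
        this.baseStream = baseStream ?? throw new ArgumentNullException(nameof(baseStream));
        this.sectorSize = sectorSize;
    }

    // Records a modified sector without touching the base stream.
    public void WriteSector(long position, byte[] data)
    {
        if (data is null) throw new ArgumentNullException(nameof(data));
        if (data.Length != sectorSize) throw new ArgumentException("Must be one full sector.", nameof(data));
        dirty[position] = (byte[])data.Clone();
    }

    // Returns the dirty copy if one exists, otherwise reads from the base stream.
    public byte[] ReadSector(long position)
    {
        if (dirty.TryGetValue(position, out byte[] scratch))
            return (byte[])scratch.Clone();

        var buffer = new byte[sectorSize];
        baseStream.Position = position;
        int read = 0;
        while (read < sectorSize)
        {
            int n = baseStream.Read(buffer, read, sectorSize - read);
            if (n == 0) break; // Past the end of the base stream: rest stays zeroed.
            read += n;
        }
        return buffer;
    }

    // Copies every dirty sector into the base stream, then forgets them.
    public void Commit()
    {
        foreach (KeyValuePair<long, byte[]> pair in dirty)
        {
            baseStream.Position = pair.Key;
            baseStream.Write(pair.Value, 0, pair.Value.Length);
        }
        dirty.Clear();
    }

    // Discards all uncommitted changes.
    public void Revert() => dirty.Clear();
}
```

The base stream stays untouched until `Commit`, which is what makes a cheap `Revert` possible.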

The writing is obviously a fair bit more complex than the reading, but I'll push something when I've reached some reasonable milestone, if the above design works out OK. Although I can write a CFB file and read it back, it is still very much a work in progress.

@farfilli

What about implementing #202 in v3? Basically, I need to create / read / edit / delete any kind of available property streams; any advanced function that validates and/or helps accomplish these tasks would be very welcome.

@jeremy-visionaid
Collaborator Author

@farfilli I'm afraid I'm only really working on the core library, not the extensions. I don't want to speak for @ironfede, but IMHO, you and @Numpsy could reasonably continue to make changes to those parts on the v2 branch (regardless of whether it gets released or not)... It should be a trivial matter to port the extensions across to the v3 core (and I don't mind just handling the porting part to keep the extensions working).

@jeremy-visionaid
Collaborator Author

A quick status update on the POC. I didn't get as much time to work on it this week as I would have liked (I still need to implement transitions to/from mini FAT streams), and haven't really done much with transactions/scratch data at all. I also wrote some benchmarks. Suffice to say, the POC is about 30% faster at writing compared to v2.3, but is a fair bit slower than the reference implementation, so I'd like to clean it up and optimize it a bit more before sharing it. So, still a work in progress at this stage...
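
For context on the mini FAT transitions mentioned here: in the CFB format, streams shorter than the mini stream cutoff size (4096 bytes) live in the mini stream, and anything at or above it uses regular FAT sectors. A minimal sketch of the decision, with hypothetical names that are not from the POC:

```csharp
using System;

// Hypothetical names for illustration only.
public enum StreamStorageKind
{
    MiniFat, // Stored as mini sectors inside the mini stream
    Fat      // Stored as regular sectors
}

public static class StreamPlacement
{
    // The CFB header fixes the mini stream cutoff size at 0x1000 bytes.
    public const long MiniStreamCutoffSize = 4096;

    // Streams strictly below the cutoff belong in the mini stream.
    public static StreamStorageKind ForLength(long length)
        => length < MiniStreamCutoffSize ? StreamStorageKind.MiniFat : StreamStorageKind.Fat;

    // True when resizing a stream means its data has to move between the two stores.
    public static bool RequiresTransition(long oldLength, long newLength)
        => ForLength(oldLength) != ForLength(newLength);
}
```

A stream that grows from 4000 to 5000 bytes would therefore need its data copied out of the mini stream and into regular sectors.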

@jeremy-visionaid
Collaborator Author

jeremy-visionaid commented Oct 26, 2024

I had some optimizations that I wanted to get out of my brain and into the world, so I have some write benchmarks to share!

Windows Structured Storage (ILockBytes from MemoryStream)

| Method | BufferSize | TotalStreamSize | Mean       | Error    | StdDev    | Allocated |
|------- |----------- |---------------- |-----------:|---------:|----------:|----------:|
| Write  | 1048576    | 1048576         |   267.5 us |  4.02 us |   3.76 us |     440 B |
| Write  | 512        | 1048576         | 3,059.0 us | 60.71 us | 133.25 us |     442 B |

OpenMcdf v2.3.1

| Method | BufferSize | TotalStreamSize | Mean      | Error     | StdDev    | Gen0      | Gen1     | Gen2     | Allocated |
|------- |----------- |---------------- |----------:|----------:|----------:|----------:|---------:|---------:|----------:|
| Write  | 1048576    | 1048576         |  1.278 ms | 0.0350 ms | 0.1020 ms |  369.1406 | 326.1719 | 230.4688 |   2.12 MB |
| Write  | 512        | 1048576         | 27.609 ms | 0.4599 ms | 0.4077 ms | 6031.2500 | 437.5000 | 218.7500 |  48.13 MB |

OpenMcdf v3-poc

| Method | BufferSize | TotalStreamSize | Mean     | Error   | StdDev  | Gen0   | Allocated |
|------- |----------- |---------------- |---------:|--------:|--------:|-------:|----------:|
| Write  | 1048576    | 1048576         | 194.7 us | 1.49 us | 1.24 us | 0.2441 |   3.39 KB |
| Write  | 512        | 1048576         | 278.3 us | 2.84 us | 2.52 us | 9.7656 |  83.35 KB |

I have a couple of last optimizations I'd like to make still...

The small buffer writes are allocating more memory than they probably need to, I can probably bring that down a bit.

I think the speed of the large buffer writes can also be improved by reserving FAT sectors (for the additional FAT entries) and then reserving data sectors in one contiguous block. Currently FAT sectors are added on the fly, and writes are chunked by the sector size, but if they're chunked by contiguous runs of sectors instead, then that might give a bit of a performance boost.
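
The "chunk by contiguous runs" idea can be sketched like this: given the sector IDs of a chain, group consecutive IDs so each group could be read or written with a single I/O call. This is an illustration of the idea only, not code from the POC, and the names are made up.

```csharp
using System;
using System.Collections.Generic;

// Illustrative helper; not part of the proof of concept.
public static class SectorRuns
{
    // Groups a chain of sector IDs into (first ID, count) runs of consecutive IDs.
    public static List<(uint Start, int Count)> Group(IEnumerable<uint> chain)
    {
        var runs = new List<(uint Start, int Count)>();
        foreach (uint id in chain)
        {
            int last = runs.Count - 1;

            // Extend the current run if this ID directly follows its last sector.
            if (last >= 0 && runs[last].Start + (uint)runs[last].Count == id)
                runs[last] = (runs[last].Start, runs[last].Count + 1);
            else
                runs.Add((id, 1));
        }
        return runs;
    }
}
```

For a freshly written, unfragmented stream the whole chain collapses to a single run, which is the best case for this optimization.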

@jeremy-visionaid
Collaborator Author

So, I'm a little behind where I'd hoped to be as I've had other things I needed to work on.

I've pushed the direct write support to the poc branch. The write performance measurements in my previous comment were off since it wasn't actually using the mini stream. That's fixed now, along with a bunch of other fixes and improvements. It has dropped a bit of performance as well - I should be able to get some of that back, but it's still a fair bit faster than v2.3.

| Method | BufferSize | TotalStreamSize | Mean      | Error    | StdDev   | Gen0   | Allocated |
|------- |----------- |---------------- |----------:|---------:|---------:|-------:|----------:|
| Read   | 1048576    | 1048576         |  80.65 us | 0.514 us | 0.429 us | 0.2441 |   2.49 KB |
| Read   | 512        | 1048576         |  94.38 us | 0.708 us | 0.662 us | 0.2441 |   2.49 KB |
| Write  | 1048576    | 1048576         | 398.28 us | 5.912 us | 5.241 us |      - |   3.15 KB |
| Write  | 512        | 1048576         | 889.01 us | 4.006 us | 3.748 us |      - |   3.45 KB |

In terms of feature parity:

  • It still only constructs an all-black tree for directory entries
  • There's no delete/rename for storages/streams
  • No transactions
  • No consolidation

@jeremy-visionaid
Collaborator Author

jeremy-visionaid commented Nov 5, 2024

I've added transactions (i.e. commit/revert). The allocations for in-memory benchmarks look relatively high compared to v2.3 because the modifications get written to a stream rather than individual buffers for the Sectors (and the performance is accordingly worse since it needs to copy every time the capacity needs to be increased). It seems like an unfair comparison to use an in-memory benchmark for that, given the feature I'm interested in is using temp files for scratch/transaction data (which essentially requires streams). So, the following is for file streams:

| Method          | BufferSize | TotalStreamSize | Mean       | Error     | StdDev    | Allocated |
|---------------- |----------- |---------------- |-----------:|----------:|----------:|----------:|
| Read            | 1048576    | 1048576         |   6.355 ms |  1.227 ms | 0.0673 ms |   2.47 KB |
| Read            | 512        | 1048576         |   6.878 ms |  2.309 ms | 0.1266 ms |   2.47 KB |
| Write           | 1048576    | 1048576         |  49.508 ms | 30.384 ms | 1.6655 ms |   3.53 KB |
| Write           | 512        | 1048576         |  68.863 ms |  1.211 ms | 0.0664 ms |   3.85 KB |
| WriteTransacted | 1048576    | 1048576         | 100.084 ms | 19.425 ms | 1.0648 ms | 220.17 KB |
| WriteTransacted | 512        | 1048576         |  99.068 ms |  2.661 ms | 0.1459 ms | 220.51 KB |

Unfortunately, it's running a fair bit slower than the reference (Windows Structured Storage) at the moment:

| Method          | BufferSize | TotalStreamSize | Mean     | Error     | StdDev    | Allocated |
|---------------- |----------- |---------------- |---------:|----------:|----------:|----------:|
| WriteTransacted | 1048576    | 1048576         | 1.904 ms | 0.0370 ms | 0.0871 ms |     674 B |
| WriteTransacted | 512        | 1048576         | 4.643 ms | 0.1464 ms | 0.4082 ms |     677 B |

@jeremy-visionaid
Collaborator Author

jeremy-visionaid commented Nov 5, 2024

So, I found the main reason for the performance issue was a miscalculation when extending the scratch data stream, which made it a fair bit longer than it needed to be. I also added a cache for the FAT's last used sector to speed up FAT lookups too. That makes it faster than the native implementation again for straight reads and writes in memory, and somewhat close to parity with the native implementation for transactions.

| Method          | BufferSize | TotalStreamSize | Mean        | Error     | StdDev    | Gen0    | Gen1    | Gen2    | Allocated |
|---------------- |----------- |---------------- |------------:|----------:|----------:|--------:|--------:|--------:|----------:|
| Read            | 1048576    | 1048576         |    626.5 us |   3.42 us |   3.04 us |       - |       - |       - |   3.01 KB |
| Read            | 512        | 1048576         |    635.9 us |   5.67 us |   5.03 us |       - |       - |       - |   3.01 KB |
| Write           | 1048576    | 1048576         |  5,468.2 us | 109.22 us | 107.26 us |       - |       - |       - |   4.17 KB |
| Write           | 512        | 1048576         |  6,419.2 us |  44.51 us |  34.75 us |       - |       - |       - |   4.32 KB |
| WriteTransacted | 1048576    | 1048576         | 12,094.0 us | 104.34 us |  81.46 us | 15.6250 | 15.6250 | 15.6250 |  219.9 KB |
| WriteTransacted | 512        | 1048576         | 13,148.9 us |  96.54 us |  80.62 us |       - |       - |       - | 220.04 KB |

Still some room for improvement and still some features to implement... but looking OK.

@jeremy-visionaid
Collaborator Author

I realized at the end of last week that my DIFAT implementation was incorrect and that the reference implementation couldn't read the files that were being produced. Fixing that along with some various other improvements now makes the v3 POC faster than the reference implementation in all cases except small transacted writes.

File Stream Read:

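| Method                | BufferSize | StreamLength | Mean       | Error    | StdDev   | Median     | Ratio | RatioSD | Allocated |
|---------------------- |----------- |------------- |-----------:|---------:|---------:|-----------:|------:|--------:|----------:|
| Read                  | 512        | 1048576      |   672.1 μs | 21.93 μs | 32.82 μs |   659.7 μs |  0.59 |    0.03 |    7250 B |
| ReadStructuredStorage | 512        | 1048576      | 1,147.4 μs | 20.32 μs | 28.48 μs | 1,145.9 μs |  1.00 |    0.03 |     209 B |
| Read                  | 1048576    | 1048576      |   642.4 μs |  2.53 μs |  3.78 μs |   641.9 μs |  0.62 |    0.02 |    7249 B |
| ReadStructuredStorage | 1048576    | 1048576      | 1,039.4 μs | 18.32 μs | 25.68 μs | 1,024.6 μs |  1.00 |    0.03 |     209 B |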

File Stream Write:

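| Method                 | BufferSize | StreamLength | Mean        | Error       | StdDev      | Median     | Ratio | RatioSD | Allocated |
|----------------------- |----------- |------------- |------------:|------------:|------------:|-----------:|------:|--------:|----------:|
| Write                  | 512        | 1048576      |  4,709.6 μs |   250.56 μs |   351.25 μs | 4,615.8 μs |  0.53 |    0.22 |    7717 B |
| WriteStructuredStorage | 512        | 1048576      | 12,496.7 μs | 6,641.81 μs | 9,525.48 μs | 6,941.1 μs |  1.40 |    1.27 |     211 B |
| Write                  | 1048576    | 1048576      |    649.2 μs |    16.65 μs |    23.34 μs |   640.4 μs |  0.16 |    0.03 |    7713 B |
| WriteStructuredStorage | 1048576    | 1048576      |  4,212.5 μs | 1,033.44 μs | 1,414.59 μs | 3,633.9 μs |  1.07 |    0.42 |     210 B |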

File Stream Transacted Write:

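| Method                           | BufferSize | StreamLength | Mean     | Error     | StdDev    | Ratio | RatioSD | Allocated |
|--------------------------------- |----------- |------------- |---------:|----------:|----------:|------:|--------:|----------:|
| WriteTransacted                  | 512        | 1048576      | 5.296 ms | 0.1902 ms | 0.2604 ms |  1.03 |    0.11 |   14029 B |
| WriteStructuredStorageTransacted | 512        | 1048576      | 5.197 ms | 0.3628 ms | 0.5430 ms |  1.01 |    0.14 |     211 B |
| WriteTransacted                  | 1048576    | 1048576      | 1.259 ms | 0.0345 ms | 0.0505 ms |  0.71 |    0.04 |    9905 B |
| WriteStructuredStorageTransacted | 1048576    | 1048576      | 1.767 ms | 0.0486 ms | 0.0713 ms |  1.00 |    0.06 |     209 B |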

Memory Stream Read:

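| Method                | BufferSize | StreamLength | Mean      | Error     | StdDev   | Ratio | RatioSD | Gen0   | Allocated |
|---------------------- |----------- |------------- |----------:|----------:|---------:|------:|--------:|-------:|----------:|
| Read                  | 512        | 1048576      |  83.72 μs |  14.42 μs | 0.791 μs |  0.26 |    0.00 | 0.2441 |    2928 B |
| ReadStructuredStorage | 512        | 1048576      | 317.91 μs |  89.66 μs | 4.914 μs |  1.00 |    0.02 |      - |     264 B |
| Read                  | 1048576    | 1048576      |  82.36 μs |  94.74 μs | 5.193 μs |  0.33 |    0.02 | 0.2441 |    2928 B |
| ReadStructuredStorage | 1048576    | 1048576      | 246.97 μs | 157.78 μs | 8.649 μs |  1.00 |    0.04 |      - |     264 B |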

Memory Stream Write:

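| Method                 | BufferSize | StreamLength | Mean        | Error        | StdDev     | Ratio | RatioSD | Gen0   | Allocated |
|----------------------- |----------- |------------- |------------:|-------------:|-----------:|------:|--------:|-------:|----------:|
| Write                  | 512        | 1048576      |    31.62 μs |     9.902 μs |   0.543 μs |  0.01 |    0.00 | 0.3662 |    3352 B |
| WriteStructuredStorage | 512        | 1048576      | 2,297.26 μs | 3,499.622 μs | 191.826 μs |  1.00 |    0.10 |      - |     265 B |
| Write                  | 1048576    | 1048576      |    24.46 μs |     2.355 μs |   0.129 μs |  0.10 |    0.00 | 0.3967 |    3352 B |
| WriteStructuredStorage | 1048576    | 1048576      |   243.40 μs |    30.874 μs |   1.692 μs |  1.00 |    0.01 |      - |     264 B |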

Memory Stream Transacted Write:

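| Method          | BufferSize | StreamLength | Mean     | Error    | StdDev   | Gen0   | Allocated |
|---------------- |----------- |------------- |---------:|---------:|---------:|-------:|----------:|
| WriteTransacted | 512        | 1048576      | 33.86 μs | 3.939 μs | 0.216 μs | 1.0376 |   8.53 KB |
| WriteTransacted | 1048576    | 1048576      | 37.98 μs | 1.179 μs | 0.065 μs | 1.0376 |   8.53 KB |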

@jeremy-visionaid
Collaborator Author

I've just pushed the last feature that I needed for v3 - switching streams (i.e. a more general solution to "SaveAs"). Some more minor performance optimizations too of course :)

I've also now ported the Structured Storage Explorer and the OLE projects (as OpenMcdf.Ole) and refactored/modernized them both a bit.

@Numpsy @farfilli I think it's possible to notably improve the variant handling architecture, but I'm afraid I don't have cause to give it much attention myself as I have other fish to fry. However, the core proof of concept should be reasonably complete/robust. Did you want to take the branch for a spin and let me know how you get on? I'll port the OLE test project now - hopefully I haven't broken anything.

Some thoughts/comments on the implementation

  • Missing some argument validation (e.g. ArgumentNullException)
  • Missing a FileFormatException or equivalent class (esp. catching ArgumentExceptions and rethrowing for corrupt files)
  • Red-black tree balancing (directory tree is all-black, but there is binary search)
  • Project setup stuff (e.g. Licensing, Packaging, CI etc.)
  • EntryInfo is still sparse
  • There's no protection against situations like deleting a storage/stream that is currently open
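
On the binary search point: directory entry names in a CFB file are ordered by length first and then by upper-cased UTF-16 code units, and that ordering is what makes a binary search possible even when the tree is all-black. Below is a rough sketch of such a comparison. It is not the POC's code: `char.ToUpperInvariant` only approximates the case conversion the specification describes, and searching a sorted array stands in for walking the actual sibling tree.

```csharp
using System;
using System.Collections.Generic;

// Illustrative sketch; the names are hypothetical.
public static class DirectoryEntryNames
{
    // Shorter names sort first; equal-length names compare by upper-cased code unit.
    public static int Compare(string x, string y)
    {
        if (x is null) throw new ArgumentNullException(nameof(x));
        if (y is null) throw new ArgumentNullException(nameof(y));

        if (x.Length != y.Length)
            return x.Length.CompareTo(y.Length);

        for (int i = 0; i < x.Length; i++)
        {
            // Approximation of the spec's simple upper-case conversion.
            char cx = char.ToUpperInvariant(x[i]);
            char cy = char.ToUpperInvariant(y[i]);
            if (cx != cy)
                return cx.CompareTo(cy);
        }
        return 0;
    }

    // Binary search over names already sorted with Compare.
    // Returns the index of the match, or a negative number if there is none.
    public static int IndexOf(string[] sortedNames, string name)
        => Array.BinarySearch(sortedNames, name, Comparer<string>.Create(Compare));
}
```

Note that this ordering is not alphabetical: "Data" sorts before "1Table" purely because it is shorter.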

Other than that, I'll fill out the tests a bit more for corrupt files, but otherwise I think the test coverage might already be better than for v2.4 (16 TB files notwithstanding!).

@ironfede Aside from the project setup stuff to allow a release to be made, I think this is likely getting pretty close to complete now if you want to take it for a spin too and let me know what you think?

@jeremy-visionaid
Collaborator Author

@Numpsy @farfilli Seems I was a bit premature there... Although the port looks largely OK, the v3 POC is much stricter with validating arguments and state. So, the existing OLE code was doing bad things and happening to get away with it... It doesn't look too bad to fix up the problems, but I'm afraid I've run out of time for today.

@ironfede
Owner

I'm impressed by all of your work! I'm really out of time at the moment. I'll do my best to check it, but either way I think it's going to work great!

@jeremy-visionaid
Collaborator Author

@Numpsy @farfilli I've now fixed up the OLE library to account for the stricter validation in the proof of concept:

https://github.com/Visionaid-International-Ltd/openmcdf/tree/3.0-poc

@jeremy-visionaid
Collaborator Author

OK, I think we're pretty much all set for 3.0.0-preview1! 🚀

I've put the merge commit of the proof of concept as a draft here:
https://github.com/ironfede/openmcdf/tree/3.0-draft

The idea is that it will clobber the existing 3.0 branch with a merge that replaces it with the proof of concept, then 3.0 can be merged to master. Unfortunately nothing will then merge from 2.x to master since the histories are unrelated (3.0 is a complete rewrite). However, both histories will be maintained, and branches can be made for 2.3 and 2.4 for the old API if anybody requires changes/bug fixes for them.

So, unless anyone has any objections I'm thinking we could hit go on that tomorrow...

Massive thanks to @ironfede for granting me the privileges to be able to push forwards with such a major change.

@Numpsy
Contributor

Numpsy commented Nov 16, 2024

I've updated my project from version 2 to the current v3/master branch and the basic tests pass :-)

Question about the api (not sure if this was discussed elsewhere before) -

I had some code that tried to get optional streams (e.g. DocumentSummaryInformation) from a storage using CompoundFile.RootStorage.TryGetStream which I changed to RootStorage.OpenStream to get it to build, and then added local try/catches to ignore any FileNotFoundException as it's not actually an error.
I don't think having to do that makes for as friendly an API for streams that are expected to be optional?

@jeremy-visionaid
Collaborator Author

@Numpsy That's great, thanks for testing it out!

There's definitely scope for the API to be changed. Having a "Try" version of the Open calls would definitely be a good addition, since it avoids potentially traversing the directory tree twice for certain operations, and it's trivial to add. I'm happy to put in a PR for it now...
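As a rough sketch of the shape such a "Try" call could take (assumptions: `FakeStorage`, `OpenStream` and `TryOpenStream` below are hypothetical stand-ins backed by an in-memory dictionary, not the actual OpenMcdf types or signatures), the throwing variant can be layered over the Try variant so that each performs only one lookup:

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Diagnostics.CodeAnalysis;
using System.IO;

// Hypothetical stand-in for a storage; NOT the OpenMcdf API.
sealed class FakeStorage
{
    private readonly Dictionary<string, byte[]> streams =
        new(StringComparer.OrdinalIgnoreCase);

    public void Add(string name, byte[] data) => streams[name] = data;

    // Throwing variant: a missing stream is reported as an error.
    public Stream OpenStream(string name) =>
        TryOpenStream(name, out Stream? stream)
            ? stream
            : throw new FileNotFoundException($"Stream not found: {name}.", name);

    // Try variant: a single lookup, and no exception for the expected "absent" case.
    public bool TryOpenStream(string name, [NotNullWhen(true)] out Stream? stream)
    {
        if (streams.TryGetValue(name, out byte[]? data))
        {
            stream = new MemoryStream(data, writable: false);
            return true;
        }

        stream = null;
        return false;
    }
}

static class Program
{
    static void Main()
    {
        var storage = new FakeStorage();
        storage.Add("\u0005SummaryInformation", new byte[] { 1, 2, 3 });

        // The optional stream is absent here, and no try/catch is needed.
        if (storage.TryOpenStream("\u0005DocumentSummaryInformation", out Stream? optional))
        {
            using (optional)
                Console.WriteLine($"Optional stream length: {optional.Length}");
        }
        else
        {
            Console.WriteLine("Optional stream not present; skipping.");
        }
    }
}
```

With this shape, callers that treat a stream as optional avoid both the try/catch and a separate existence check followed by a second lookup.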

@Numpsy
Contributor

Numpsy commented Nov 17, 2024

Thanks, I'll have a look later

@Numpsy
Contributor

Numpsy commented Nov 17, 2024

To follow up on my previous question in #216 (comment) -

I wonder if there is actually a need for IProperty itself to be public. Instances of it are produced by the PropertySetFactory/PropertySetStream machinery (which is internal) and consumed by the internals of OlePropertiesContainer. If it could be made an internal implementation detail for v3, there might be more freedom to change the internals later without worrying about public interface changes?

e.g. Numpsy@99cbb23

@jeremy-visionaid
Collaborator Author

@Numpsy I'd agree with that. There doesn't seem to be much purpose to the interfaces if they aren't being used for things like mocking, etc. I think they could reasonably just be removed and take the wins on devirtualization and simplification of other refactoring (especially given it's still experimental).

Moreover though, given that the OLE library is experimental, I don't think any modifications there need to hold up the wider 3.0.0 milestone. We could just make changes there as they come... (and probably mark it explicitly in NuGet as experimental).

So, I think the usefulness of this particular thread has come to an end now. All the initial objectives have been met aside from async, which I don't really need and was optional anyway. So I might make a new issue for that and keep it on the back burner until someone has an explicit need for it.

Red-black tree balancing is also still TODO, but performance should still be significantly better than 2.4, which always loaded the full tree and didn't use the tree for searches. An all-black tree is still perfectly valid AFAIK; it just might not be as fast as possible for storages with large numbers of entries.
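To illustrate why an unbalanced tree still supports binary search, here is a minimal sketch (assumptions: `Node` and `DirectoryTree` are hypothetical names for illustration, not OpenMcdf types, and `ToUpperInvariant` only approximates the case conversion the CFB specification prescribes). Names are ordered by length first and then by upper-cased comparison, and insertion does no rebalancing:

```csharp
#nullable enable
using System;

// Hypothetical node type for illustration; not an OpenMcdf type.
sealed class Node
{
    public Node(string name) => Name = name;
    public string Name { get; }
    public Node? Left { get; set; }
    public Node? Right { get; set; }
}

static class DirectoryTree
{
    // CFB-style sibling ordering: shorter names sort first, then compare upper-cased.
    public static int Compare(string a, string b) =>
        a.Length != b.Length
            ? a.Length.CompareTo(b.Length)
            : string.CompareOrdinal(a.ToUpperInvariant(), b.ToUpperInvariant());

    // Plain binary search tree insert with no rebalancing or recoloring.
    public static Node Insert(Node? root, string name)
    {
        if (root is null)
            return new Node(name);

        int c = Compare(name, root.Name);
        if (c < 0) root.Left = Insert(root.Left, name);
        else if (c > 0) root.Right = Insert(root.Right, name);
        return root;
    }

    // Binary search: visits one node per level, so cost is proportional to depth.
    public static Node? Find(Node? root, string name)
    {
        Node? node = root;
        while (node is not null)
        {
            int c = Compare(name, node.Name);
            if (c == 0) return node;
            node = c < 0 ? node.Left : node.Right;
        }
        return null;
    }

    public static int Depth(Node? root) =>
        root is null ? 0 : 1 + Math.Max(Depth(root.Left), Depth(root.Right));
}

static class Program
{
    static void Main()
    {
        // Inserting in already-sorted order is the worst case for an unbalanced tree.
        Node? root = null;
        foreach (string name in new[] { "A", "BB", "CCC", "DDDD" })
            root = DirectoryTree.Insert(root, name);

        Console.WriteLine($"Depth: {DirectoryTree.Depth(root)}");
        Console.WriteLine($"Found: {DirectoryTree.Find(root, "ccc")?.Name ?? "(none)"}");
    }
}
```

Lookups stay correct either way. Balancing only bounds the depth, which matters mostly when entries are added in sorted order to storages with many children.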

There should technically be ArgumentNullExceptions in various places, but nullable analysis is enabled, and it's a pretty minor thing. I'll add it if I find time and someone else doesn't beat me to it!

Unit test coverage is currently around 90%, which mostly leaves only testing for various corrupt headers, directory entries and chain loops. I believe those things should work, but it's just a little time consuming to create tests for them.

There is #215, but performance still seems a lot better than baseline despite optimizations being technically possible. So I'd be more inclined to target that to a later milestone too.

So, thanks everyone for your input! Please make new tickets for any specific feedback that comes up with the preview!

@Numpsy
Contributor

Numpsy commented Nov 18, 2024

and probably mark it explicitly in NuGet as experimental

fwiw I've had a custom build of 2.3 in production and it all seems to be working ok (the API is quite 'low level', but it does everything I need and all the known bugs are fixed in the official 2.4 release now)
