
Allow more control over what to do with an archive's entries #20

Closed
muja opened this issue Dec 21, 2019 · 7 comments

muja (Owner) commented Dec 21, 2019

Currently, a single operation is applied to all entries, i.e. the operation is set for the entire archive. The RAR format can do better: to some extent it allows different operations on an entry-by-entry basis, so one could extract some entries and skip others.

One could also provide a different destination folder for each extracted entry. This would take some burden off the struct as well, since it wouldn't need to keep a record of these things.

Currently, when opening an archive, RAR needs only two pieces of information:

  1. filename - the path to the archive.
  2. OpenMode - this can be either LIST or EXTRACT (or LIST_SPLIT, which is like LIST but behaves differently for multipart archives). This, in essence, prepares the archive for a set of operations. You cannot, for example, extract from an archive that was opened in LIST mode.
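For illustration, the kind of per-entry control this issue asks for might look roughly like the sketch below. Every call in it (open_for_processing, read_header, filename, extract_to, skip) is hypothetical and not part of the current API:

// Hypothetical sketch only; none of these calls exist in the crate today.
fn selective_extract() -> UnrarResult<()> {
    let mut archive = Archive::new("archive.rar").open_for_processing()?;
    // read one header at a time and decide per entry what to do with it
    while let Some(entry) = archive.read_header()? {
        if entry.filename().ends_with(".txt") {
            entry.extract_to("out/")?; // process this entry
        } else {
            entry.skip()?;             // skip it, keeping the handle in a valid state
        }
    }
    Ok(())
}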

My idea is to have the iterator return an Entry produced by the native read_header call. The problem with normal iterators is that Iterator must return owned values; it cannot hand out a reference borrowed from one of its own fields.
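To make the limitation concrete: a next-like method that returns a reference into the iterator itself is fine as an inherent method, but it cannot be expressed through std::iter::Iterator, whose Item type has no way to borrow from the &mut self passed to next. A minimal self-contained sketch (Entry and Entries are stand-ins, not crate types):

struct Entry {
    filename: String,
}

struct Entries {
    // hypothetical: refreshed by each native read_header call
    current: Option<Entry>,
}

impl Entries {
    // Fine as an inherent method: the returned reference borrows from `self`.
    fn next_entry(&mut self) -> Option<&Entry> {
        // (read_header would refresh `self.current` here)
        self.current.as_ref()
    }
}

There is no way to write impl Iterator for Entries with type Item = &Entry, because Item cannot name the lifetime of that borrow; a regular Iterator has to hand out owned values.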

That approach would separate the Archive and the Entry. They cannot be fully separated, however, because the call sequence on a Handle has restrictions, e.g. calling RARReadHeader after a failed RARProcessFile is illegal.

Instead I've been investigating the streaming_iterator crate. I'm currently struggling with its interface, since it does not allow a mutable reference to be returned from next, so I cannot mutate the Entry and thus track whether the user has processed it. I need this information in order to know whether to call RARProcessFile(Operation::Skip) on the handle: calling it when the file has already been processed leads to weird behavior, and not calling it leads to an error in the next RARReadHeader step.
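For reference, the core of the trait in the streaming_iterator crate looks roughly like this (combinators omitted); note that get only ever hands out a shared reference, which is exactly what blocks mutating the Entry to record that it has been processed:

pub trait StreamingIterator {
    type Item: ?Sized;
    /// Advance to the next element.
    fn advance(&mut self);
    /// Borrow the current element (shared access only).
    fn get(&self) -> Option<&Self::Item>;
    /// Convenience wrapper: advance, then get.
    fn next(&mut self) -> Option<&Self::Item> {
        self.advance();
        self.get()
    }
}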

vjoki (Contributor) commented Dec 21, 2019

Oh, I have a functioning StreamingIterator implementation in my WIP branch vjoki@5d0238f, but it's quite ugly.

It works by iterating over UnprocessedEntry structs that provide the processing interface (skip etc.) and yield the inner Entry once processed. Unprocessed entries are tracked by keeping a reference in the iterator. Annoyingly, the UnprocessedEntry is forced to live until the next iteration, allowing industrious users to attempt processing it multiple times (I have it panic when that happens), because expiring the UnprocessedEntry is blocked by the lack of &mut in the StreamingIterator trait, as you've also noticed.
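Roughly, the shape being described is something like the following sketch (hypothetical names and stand-in types, not the actual code from the WIP branch):

use std::cell::Cell;

// Stand-in types, just to show the shape of the pattern described above.
struct Handle; // the native archive handle
struct Entry;  // the fully processed header data

struct UnprocessedEntry<'a> {
    _handle: &'a Handle,
    processed: Cell<bool>,
}

impl<'a> UnprocessedEntry<'a> {
    fn extract(&self) -> Entry {
        // A second processing attempt is a logic error, hence the panic
        // mentioned above; the type system alone cannot rule it out here.
        assert!(!self.processed.replace(true), "entry already processed");
        // ...RARProcessFile would run here...
        Entry
    }
}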

Overall I'm left feeling that using Iterators isn't the right answer here. I'm not really seeing the benefits; maybe it'd be better to define some sort of thin custom solution for iterating over the files.

muja (Owner, Author) commented Dec 22, 2019

Well, the clear benefit would be access to the usual combinators like map, filter, etc.

The cleanest solution would be to have an FSM and always move self on each call.

I took some time to draw (an approximation of) an FSM:

This would be strongly typed, and mistakes would be caught at compile time, making them almost impossible. But it's very unergonomic for users to loop or iterate over entries this way...

vjoki (Contributor) commented Dec 22, 2019

Well, that state machine (and the ergonomics that come with it) is basically what you get with StreamingIterator, particularly on the extract route, isn't it? With the exception that the FILE* ends up lingering until it goes out of scope, even after use.

For implementing listing you don't need a skip, since all you can do is read the header anyway. Also, is it really necessary to separate the list and list_split cases?

So the regular Iterator would probably work cleanly if restricted to just this case, for example by returning a ListOnlyOpenArchive (proper names to be decided later) from list() and list_split() that implements Iterator and yields just the headers. For the extract case you could then return an ExtractableOpenArchive that implements StreamingIterator, or just a next taking &mut directly (losing map et al. while gaining the right type to expire the unprocessed entry).

To get a clean (no UnsafeCell or RefCell) implementation of the streaming iterator you'd need to create a mutable StreamingIterator trait (assuming that's possible). But I'm just not convinced the ergonomic gains here are actually worth the effort, when all the iterator seems to buy us is some shorthand for loops. Doing the iteration in a loop using a built-in next, especially when you want to make decisions while iterating, should not be particularly unergonomic IMO.
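For what it's worth, such a trait is at least expressible. A minimal sketch (this is not something the streaming_iterator crate provided at the time):

// Minimal sketch of a "mutable" streaming iterator trait.
trait StreamingIteratorMut {
    type Item: ?Sized;
    fn advance(&mut self);
    fn get_mut(&mut self) -> Option<&mut Self::Item>;
    fn next_mut(&mut self) -> Option<&mut Self::Item> {
        self.advance();
        self.get_mut()
    }
}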

I think I'll try playing around with this idea later and see how it looks.

muja (Owner, Author) commented Jan 11, 2020

I tried to fork the streaming_iterator crate and make it mutable, but my Rust skills suck too hard: I had a lot of problems with some occurrences of FnOnce et al. which I failed to convert to mut, and I wasn't motivated enough to wrap my head around it. :D

Anyway, I thought of a very hacky way we could make it work with the streaming_iterator crate. Since we'd only need some kind of interior mutability, we could do something like in this playground:

fn manipulate_immutable(s: &String) {
    let s = s as *const String as *mut String;
    let s = unsafe { &mut *s };
    s.push('!');
}

fn main() {
    let s = "Hello World!".to_string();
    manipulate_immutable(&s);
    println!("{}", s);
}

Output: Hello World!!

Unless I'm missing some UB, this might also be an option.

I believe this wouldn't be thread safe...

vjoki (Contributor) commented Jan 11, 2020

Unless I'm missing something, that's what I did in my implementation by using UnsafeCell (which is just a wrapper for the cast) for interior mutability. Ultimately RefCell would probably be the better option though.
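For comparison, the same interior-mutability idea expressed with a RefCell, which enforces the borrow rules at runtime instead of bypassing them (a sketch, not the WIP code):

use std::cell::RefCell;

// Shared per-archive state; a RefCell allows mutation through `&Shared`.
struct Shared {
    processed: RefCell<bool>,
}

fn mark_processed(s: &Shared) {
    // mutation through a shared reference, checked at runtime by the RefCell
    *s.processed.borrow_mut() = true;
}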

The interior-mutability approach works, but the problem is that it allows the user to try to process the entry multiple times. Meaning something like the following would compile:

Archive::new("archive.rar").extract_to(".").unwrap()
    .for_each(|unprocessed_entry| {
        let result = unprocessed_entry.as_ref().unwrap().extract();
        let result = unprocessed_entry.as_ref().unwrap().extract();
    });

This could be mitigated by returning a suitable error when trying to process an already processed entry. Of course ideally the compiler should be able to prevent this from happening, and with the right type for next that would indeed be the case.
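As a concrete illustration of "the right type": if processing consumes the entry by value, the duplicated call above turns into a use-after-move error instead of a runtime problem. A sketch with stand-in types (not the crate's actual types):

struct Entry;            // stand-in for the processed result
struct UnprocessedEntry; // stand-in; would borrow from the archive in practice

impl UnprocessedEntry {
    // Consumes the entry, so it can only ever be processed once.
    fn extract(self) -> Entry {
        // ...RARProcessFile would run here...
        Entry
    }
}

// let first = entry.extract();
// let second = entry.extract(); // error[E0382]: use of moved value: `entry`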

Another caveat with the StreamingIterator is that users will need to import the trait to use it; we can re-export it so users won't have to deal with streaming_iterator directly, though.

I could submit a PR with my StreamingIterator implementation if you really want it and are OK with these caveats. Personally, however, I'd still prefer the alternative approach (implementation in my WIP branch), which avoids these caveats but requires the user to do their own iteration with a next function.

Also I don't think thread safety is a concern here?
At least I've been assuming that anything past OpenArchive is not thread safe, since I don't recall seeing any documentation regarding the thread safety of the underlying FFI functions.

There's also the additional question of whether a faster route for listing archive contents is desired. As in: should the Iterator stay around, be replaced by something else for this purpose, or just be removed?

muja (Owner, Author) commented Jan 12, 2020

Another caveat with the StreamingIterator is that users will need to import the trait to use it; we can re-export it so users won't have to deal with streaming_iterator directly, though.

Have you tried using the inherent-pub crate? I couldn't get that to work either.

Personally I'm also leaning more and more towards the FSM approach. I took a look at the WIP branch you mentioned and I see what you're doing, but I believe we can do even better to prevent invalid usage. Pseudo code of what I had imagined the API to look like:

// reader.rs
pub struct OpenArchiveReader { handle: native::Handle, ... }
pub struct OpenEntry { handle: native::Handle, ... }
impl OpenArchiveReader {
    fn next(self) -> UnrarResult<OpenEntry> { // note the moving of `self`
        let header = native::RARReadHeaderEx(&self.handle, ...)?;
        Ok(OpenEntry::new(self.handle, header))
    }
}

impl OpenEntry {
    pub fn process(self) -> UnrarResult<OpenArchiveReader> { // note the moving of `self`
        native::RARProcessFileW(&self.handle, ...)?;
        Ok(OpenArchiveReader::new(self.handle, ...))
    }
}

// drop impls
impl Drop for OpenEntry {
    fn drop(&mut self) { native::RARCloseFileW(self.handle).unwrap(); }
}

impl Drop for OpenArchiveReader {
    fn drop(&mut self) { native::RARCloseFileW(self.handle).unwrap(); }
}
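A caller would then drive this roughly as follows (a usage sketch based on the pseudo code above; distinguishing end-of-archive from a real error is glossed over):

fn walk(mut reader: OpenArchiveReader) -> UnrarResult<()> {
    loop {
        let entry = match reader.next() {
            Ok(entry) => entry,
            // `reader` has already been consumed by `next` here, so on an
            // error (or end of archive) all we can do is stop
            Err(_) => break,
        };
        // inspect `entry`, then get the reader back by processing it
        reader = entry.process()?;
    }
    Ok(())
}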

This way, we get rid of those if statements.

What do you think?

vjoki (Contributor) commented Jan 12, 2020

No, I haven't tried inherent-pub; I can give it a go if we go that way, though.

As for the pseudo code, I'm not seeing the benefits. To me it seems to only make iteration harder, since the iterator will keep moving around. Also, won't those Drop impls lead to calling RARCloseFileW multiple times? AFAIK the implementation in reader.rs already prevents invalid usage; was there something specific you were concerned about?

I was envisioning that the next function would live directly in OpenArchive. And instead of returning Entry, the processing functions could simply return () (or Vec<u8> for read_bytes) on success. This would remove the need for the trait gymnastics I have going on. As for the ugly if in next, it would move to the unprocessed entry's Drop impl once there's an error field in the shared data struct. (The error field is also required for returning errors back from the callback.)

muja closed this as completed on Feb 19, 2025