read tar file by requested block #13

Open
jtmoon79 opened this issue Aug 8, 2022 · 4 comments
Labels
difficult, enhancement, file parser

Comments

@jtmoon79
Owner

jtmoon79 commented Aug 8, 2022

Problem

A .tar file is read in its entirety during BlockReader::read_block_FileTar.
This may cause problems for very large archived files: the s4 program holds the entire unarchived file in memory, which may use too much memory.

This is due to the design of the tar crate. Because of inter-instance references and explicit lifetimes, the crate does not provide a way to store a tar::Archive<File> instance together with a tar::Entry<'a, R: 'a + Read> instance (or doing so is prohibitively complex; I made many attempts using various strategies involving references, lifetimes, pointers, etc.).

A tar::Entry holds a reference to data within the tar::Archive<File>. I found it impossible to store both related instances during new() or read_block_FileTar() and then later, during another call to read_block_FileTar(), utilize the same tar::Entry.
A new tar::Entry could be created per call to read_block_FileTar(). But then, to read the requested BlockOffset, the .tar file entry would have to be re-read from its start on every call. This means reading an entire file entry within a .tar file would be an O(n^2) algorithm.
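
To make the quadratic cost concrete, here is a minimal sketch using only the standard library (it is not the s4 code and does not use the tar crate). A byte slice stands in for a freshly created tar::Entry, which can only be read forward from its start. The names read_full, read_block_from_start, and total_bytes_consumed are hypothetical and exist only for this illustration.

```rust
use std::io::{self, Read};

/// Fill `buf` as far as possible; returns fewer bytes than `buf.len()` only at EOF.
fn read_full<R: Read>(reader: &mut R, buf: &mut [u8]) -> io::Result<usize> {
    let mut filled = 0;
    while filled < buf.len() {
        let n = reader.read(&mut buf[filled..])?;
        if n == 0 {
            break;
        }
        filled += n;
    }
    Ok(filled)
}

/// Read block `index` by starting a new forward-only reader and discarding
/// every block before it. Returns the block and the number of bytes consumed.
fn read_block_from_start(
    data: &[u8],
    index: usize,
    block_size: usize,
) -> io::Result<(Vec<u8>, usize)> {
    // Stand-in for creating a new tar::Entry on every call (an assumption).
    let mut reader: &[u8] = data;
    let mut block = vec![0u8; block_size];
    let mut consumed = 0;
    let mut len = 0;
    for _ in 0..=index {
        len = read_full(&mut reader, &mut block)?;
        consumed += len;
    }
    block.truncate(len);
    Ok((block, consumed))
}

/// Total bytes consumed when every block of the entry is requested once.
fn total_bytes_consumed(data: &[u8], block_size: usize) -> io::Result<usize> {
    let block_count = (data.len() + block_size - 1) / block_size;
    let mut total = 0;
    for index in 0..block_count {
        let (_block, consumed) = read_block_from_start(data, index, block_size)?;
        total += consumed;
    }
    Ok(total)
}
```

For a 1000-byte entry and a 100-byte block size, total_bytes_consumed reports 5500 bytes consumed, versus 1000 bytes for a single pass. The gap grows with the square of the number of blocks.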

Solution

Read a .tar file per block request, as done for normal files.
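
One possible shape for this, sketched with the standard library only: in an uncompressed .tar file a regular entry's data is normally stored contiguously, so if the entry's data offset and size are recorded once, each block request can seek directly into the underlying file and no tar::Entry needs to be kept alive. This is an assumption-laden sketch, not the s4 implementation. The function read_entry_block and its parameters are hypothetical, the offset and size are assumed to come from a single up-front pass over the archive (I believe the tar crate exposes these as Entry::raw_file_position() and Entry::size(), but check the crate documentation), and sparse entries would not fit this model.

```rust
use std::io::{self, Read, Seek, SeekFrom};

/// Read block `block_index` of an entry whose data occupies `entry_size`
/// bytes starting at byte `data_offset` of the archive file.
/// Both values are assumed to have been recorded once, up front.
fn read_entry_block<R: Read + Seek>(
    archive: &mut R,
    data_offset: u64,
    entry_size: u64,
    block_index: u64,
    block_size: u64,
) -> io::Result<Vec<u8>> {
    let start = block_index.saturating_mul(block_size);
    if start >= entry_size {
        // The requested block lies past the end of the entry.
        return Ok(Vec::new());
    }
    // The last block may be shorter than `block_size`.
    let len = block_size.min(entry_size - start);
    archive.seek(SeekFrom::Start(data_offset + start))?;
    let mut block = vec![0u8; len as usize];
    archive.read_exact(&mut block)?;
    Ok(block)
}
```

With this shape each block request costs one seek plus one read of at most block_size bytes, regardless of which block is requested.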


Meta Issue #182
Similar problem to Issue #12.

@jtmoon79 jtmoon79 added the enhancement New feature or request label Aug 8, 2022
@jtmoon79
Owner Author

jtmoon79 commented Aug 8, 2022

Not sure how to accomplish the Solution. Also, tar is the only mature Rust crate for reading .tar files.

@jtmoon79
Owner Author

jtmoon79 commented Aug 8, 2022

Crate basic_tar should be considered.

@jtmoon79 jtmoon79 added the difficult A difficult problem; a major coding effort or difficult algorithm to perfect label May 21, 2023
@jtmoon79 jtmoon79 reopened this May 31, 2024
@jtmoon79 jtmoon79 closed this as not planned May 31, 2024
@jtmoon79
Owner Author

jtmoon79 commented Jun 15, 2024

What is wanted is to create a tar::Entry and later make calls to Entry.read_exact within the same thread.

Creating a tar::Entry requires creating a tar::Archive<File>, a tar::Entries<File>, and a tar::Entry<'a, File>. This is greatly complicated because the tar::Entry borrows the tar::Archive and is also derived from the tar::Entries. So AFAICT, a later call to Entry.read_exact requires all three instances to remain in existence.

Here are the technical approaches I have tried:

  • storing a Box<...>
  • storing a Pin<Box<...>>; this was to avoid an error of the Archive becoming overwritten
    • storing the same but using unsafe blocks to read from Entry; Entry instance became corrupted
  • using thread_local! (lazy_static! requires Sync and Send to be implemented)
  • forcibly allocating on the heap with the help of the copyless crate; this was to attempt to avoid lifetime problem of Archive
  • storing Archive, Entries, Entry in a struct and annotating with ouroboros::self_referencing; ouroboros macros did not like the < symbol in the T
    • trying the same with self_cell::self_cell
  • trying Serialize, Deserialize from crate serde; tar::Archive does not support Sync and Send

Again, I tried many permutations of all of the prior.
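
One pattern not among the approaches listed above is to let a dedicated worker thread own the whole borrow chain on its stack and serve read requests over a channel, so nothing borrowed ever has to be stored in a struct or a static. The sketch below is only a standard-library analogy under stated assumptions: a Vec<u8> stands in for the tar::Archive, a borrowed slice stands in for the tar::Entry, and spawn_reader, read_next, and Request are hypothetical names. It moves the reads off the calling thread, and the reads stay strictly sequential, so it does not by itself give random access to blocks.

```rust
use std::sync::mpsc;
use std::thread;

/// A request: how many bytes to read next, and where to send them.
type Request = (usize, mpsc::Sender<Vec<u8>>);

/// Spawn a thread that owns the data and a borrow into it for its whole life.
fn spawn_reader(data: Vec<u8>) -> (mpsc::Sender<Request>, thread::JoinHandle<()>) {
    let (tx, rx) = mpsc::channel::<Request>();
    let handle = thread::spawn(move || {
        // `owner` stands in for tar::Archive (an assumption for illustration).
        let owner: Vec<u8> = data;
        // `remaining` borrows from `owner`, standing in for tar::Entry.
        // Both live on this thread's stack until the request channel closes.
        let mut remaining: &[u8] = &owner;
        for (len, reply) in rx {
            let n = len.min(remaining.len());
            let (chunk, rest) = remaining.split_at(n);
            remaining = rest;
            // Ignore the error if the requester has already gone away.
            let _ = reply.send(chunk.to_vec());
        }
    });
    (tx, handle)
}

/// Ask the reader thread for the next `len` bytes (empty at end of data).
fn read_next(requests: &mpsc::Sender<Request>, len: usize) -> Vec<u8> {
    let (reply_tx, reply_rx) = mpsc::channel();
    requests.send((len, reply_tx)).expect("reader thread exited");
    reply_rx.recv().expect("reader thread dropped the reply")
}
```

Because the owner and the borrower are created inside the thread's closure and never leave it, the borrow checker has no objection, and the owner type does not need to be Sync. Dropping the request sender ends the loop and lets the thread exit.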


The closest I got was

use std::env;
use std::io::prelude::*;
use std::fs::File;
use std::cell::RefCell;
use std::ops::DerefMut;
use std::pin::Pin;
use ::copyless;

use ::tar::Archive;
use ::tar::Entries;
use ::tar::Entry;

std::thread_local! {
    static MyArchive4: RefCell<Option<Box<Archive<File>>>> = {
        eprintln!("thread_local! MyArchive4");
        RefCell::new(None)
    };
    static MyEntry4: RefCell<Option<Box<Entry<'static, File>>>> = {
        eprintln!("thread_local! MyEntry4");
        RefCell::new(None)
    };
    static MyEntries4: RefCell<Option<Box<Entries<'static, File>>>> = {
        eprintln!("thread_local! MyEntries4");
        RefCell::new(None)
    };
}
fn main() {
    let args: Vec<String> = env::args().collect();
    let filename = &args[1];

    MyArchive4.with(|rca| {
        let file: File = File::open(filename).unwrap();
        unsafe {
            // https://stackoverflow.com/a/59368947/471376
            // forcibly allocate on the heap
            let mut bx = <Box<Archive<File>> as copyless::BoxHelper<Archive<File>>>::alloc();
            rca.borrow_mut().replace(
                copyless::BoxAllocation::init(
                    bx,
                    Archive::<File>::new(file)
                )
            );
        }
        MyEntries4.with(|rces| {
            unsafe {
                let mut bx = <Box<Entries<'_, File>> as copyless::BoxHelper<Entries<'_, File>>>::alloc();
                MyEntry4.with(|rce| {
                    let mut bx = <Box<Entry<'_, File>> as copyless::BoxHelper<Entry<'_, File>>>::alloc();
                    rce.borrow_mut().replace(
                        copyless::BoxAllocation::init(
                            bx,
                            rca.borrow_mut().as_mut().unwrap().entries().unwrap().nth(0).unwrap().unwrap()
                        )
                    );
                });
            }
        });
    });
}

but this results in the error:

298 |     MyArchive4.with(|rca| {
    |                      ---
    |                      |
    |                      `rca` is a reference that is only valid in the closure body
    |                      has type `&'1 RefCell<Option<Box<Archive<std::fs::File>>>>`
...
333 |                             rca.borrow_mut().as_mut().unwrap().entries().unwrap().nth(0).unwrap().unwrap()
    |                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    |                             |
    |                             `rca` escapes the closure body here
    |                             argument requires that `'1` must outlive `'static`

Another approach that violated the borrow checker:

use std::env;
use std::io::prelude::*;
use std::fs::File;
use std::cell::RefCell;
use std::ops::DerefMut;
use std::pin::Pin;
use ::copyless;

use ::tar::Archive;
use ::tar::Entries;
use ::tar::Entry;

std::thread_local! {
    static MyArchiveEntriesEntry8: RefCell<Option<Box<(Archive<File>, Entries<'static, File>, Entry<'static, File>)>>> = {
        eprintln!("thread_local! MyArchiveEntriesEntry8");
        RefCell::new(None)
    };
}
fn main() {
    let args: Vec<String> = env::args().collect();
    let filename = &args[1];

    MyArchiveEntriesEntry8.with(|rcae| {
        let file: File = File::open(filename).unwrap();
        let mut archive = Archive::new(file);
        let mut entries = archive.entries().unwrap();
        let mut entry = entries.nth(0).unwrap().unwrap();
        rcae.borrow_mut().replace(Box::new((archive, entries, entry)));
    });
}

results in

error[E0597]: `archive` does not live long enough
   --> src/main.rs:528:27
    |
525 |     MyArchiveEntriesEntry8.with(|rcae| {
    |                                  ---- has type `&RefCell<Option<Box<(Archive<std::fs::File>, Entries<'1, std::fs::File>, tar::Entry<'_, std::fs::File>)>>>`
526 |         let file: File = File::open(filename).unwrap();
527 |         let mut archive = Archive::new(file);
    |             ----------- binding `archive` declared here
528 |         let mut entries = archive.entries().unwrap();
    |                           ^^^^^^^----------
    |                           |
    |                           borrowed value does not live long enough
    |                           argument requires that `archive` is borrowed for `'1`
...
531 |     });
    |     - `archive` dropped here while still borrowed

Another permutation

use std::env;
use std::io::prelude::*;
use std::fs::File;
use std::cell::RefCell;
use std::ops::DerefMut;
use std::pin::Pin;
use ::copyless;

use ::tar::Archive;
use ::tar::Entries;
use ::tar::Entry;

std::thread_local! {
    static MyArchive: RefCell<Option<Pin<Box<*mut Archive<File>>>>> = {
        eprintln!("thread_local! MyArchive");
        RefCell::new(None)
    };
    static MyEntry: RefCell<Option<Pin<Box<*mut Entry<'static, File>>>>> = {
        eprintln!("thread_local! MyEntry");
        RefCell::new(None)
    };
    static MyEntries: RefCell<Option<Pin<Box<*mut Entries<'static, File>>>>> = {
        eprintln!("thread_local! MyEntries");
        RefCell::new(None)
    };
}
fn main() {
    let args: Vec<String> = env::args().collect();
    let filename = &args[1];

    MyArchive.with(|rca| {
        let file: File = File::open(filename).unwrap();
        let mut archive = &mut Archive::new(file);
        let mut parchive: Box<*mut Archive<File>>  = Box::new(archive);
        MyEntries.with(|rces| {
            unsafe {
                let mut es = parchive.deref_mut().as_mut().unwrap().entries().unwrap();
                let mut entries: Box<*mut Entries<File>> = Box::new(&mut es);
                MyEntry.with(|rce| {
                    let mut entry: Box<*mut tar::Entry<File>> = match es.nth(0) {
                        Some(Ok(ref mut e)) => {
                            eprintln!("nth(0) OK {:?}", e.header().path().unwrap());
                            Box::new(e)
                        }
                        Some(Err(e)) => {
                            panic!("{}", e);
                        }
                        None => {
                            panic!("None");
                        }
                    };
                    rce.borrow_mut().replace(Pin::new(entry));
                });
                rces.borrow_mut().replace(Pin::new(entries));
            }
        });
        rca.borrow_mut().replace(Pin::new(parchive));
    });
}

This compiles and runs, but match es.nth(0) often fails with a segmentation fault.

@jtmoon79 jtmoon79 reopened this Jun 15, 2024
jtmoon79 added a commit that referenced this issue Jun 23, 2024
Read all blocks from a .tar file. This reverts to prior behavior.

Issue #13
Issue #182