cp performance issue #7092

Open
hongxuchen opened this issue Jan 8, 2025 · 21 comments

@hongxuchen

As observed in nushell/nushell#14778, uu_cp seems to have performance issues when copying large directories.

> timeit ~/.cargo/bin/coreutils cp cymbol_out/ -r cymbol_out_bak
52sec 95ms 4µs 960ns
> timeit ~/.cargo/bin/coreutils rm -rf cymbol_out_bak
10ms 929µs 36ns
> ~/.cargo/bin/coreutils -h
coreutils 0.0.28 (multi-call binary)
...

In my case, the directory is ~2.5GB and contains a set of large JSON files. GNU coreutils cp takes <3s to complete the operation but uu_cp takes ~50s. I've also tried in zsh with similar results.

> time ~/.cargo/bin/coreutils cp cymbol_out -r cymbol_out_bak
real    50.43s
user    0.00s
sys     3.23s
> time ~/.cargo/bin/coreutils rm -r cymbol_out_bak
real    0.56s
user    0.00s
sys     0.56s
> time cp cymbol_out -r cymbol_out_bak
real    2.81s
user    0.00s
sys     2.81s
> time rm -r cymbol_out_bak
real    0.54s
user    0.00s
sys     0.54s
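
A quick way to read the `time` output above is to subtract CPU time (user + sys) from wall-clock time. The sketch below does that arithmetic for the two cp runs; `blocked_time` is a hypothetical helper name, and the interpretation assumes a single-threaded process, so whatever is left over is time spent waiting (most plausibly on I/O) rather than computing.

```python
def blocked_time(real_s, user_s, sys_s):
    """Wall-clock seconds not accounted for by CPU time (user + sys)."""
    return real_s - (user_s + sys_s)


# (real, user, sys) in seconds, copied from the `time` output in this comment
runs = {
    "uu_cp": (50.43, 0.00, 3.23),
    "GNU cp": (2.81, 0.00, 2.81),
}

for name, (real, user, sys_) in runs.items():
    waiting = blocked_time(real, user, sys_)
    print(f"{name}: {waiting:.2f}s of {real:.2f}s not spent on CPU")
```

Both runs spend a similar amount of system time (about 3 s), so the extra ~47 s for uu_cp shows up as waiting rather than as additional CPU work.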
@sylvestre
Contributor

please rerun the benchmarks with hyperfine. time isn't good enough for this.

@sylvestre
Contributor

and how did you get/build your ~/.cargo/bin/coreutils ?

@sylvestre
Contributor

I did a quick experiment, and while we are slower, the gap is nowhere near the scale you report:

$ hyperfine --export-markdown cp.md "/usr/bin/cp  cp-perf/cymbol_out/ -r cymbol_out_bak && rm -rf cymbol_out_bak"  "./target/release/coreutils cp cp-perf/cymbol_out/ -r cymbol_out_bak && rm -rf cymbol_out_bak"

Benchmark 1: /usr/bin/cp  cp-perf/cymbol_out/ -r cymbol_out_bak && rm -rf cymbol_out_bak
  Time (mean ± σ):      5.428 s ±  1.477 s    [User: 0.000 s, System: 1.479 s]
  Range (min … max):    2.268 s …  7.106 s    10 runs

Benchmark 2: ./target/release/coreutils cp cp-perf/cymbol_out/ -r cymbol_out_bak && rm -rf cymbol_out_bak
  Time (mean ± σ):      7.298 s ±  0.552 s    [User: 0.003 s, System: 1.593 s]
  Range (min … max):    6.307 s …  8.007 s    10 runs

Summary
  /usr/bin/cp  cp-perf/cymbol_out/ -r cymbol_out_bak && rm -rf cymbol_out_bak ran
    1.34 ± 0.38 times faster than ./target/release/coreutils cp cp-perf/cymbol_out/ -r cymbol_out_bak && rm -rf cymbol_out_bak

I created the test files with this script (random content, sizes as listed):

import os

# Directory to create files in
output_dir = "cymbol_out"
os.makedirs(output_dir, exist_ok=True)

# List of file names and their sizes (in bytes)
file_specs = [
    ("GenericClass.json", 5 * 1024 * 1024),
    ("cymbol_cc.json", 11 * 1024 * 1024),
    ("Location.json", 468 * 1024 * 1024),
    ("FunctionBody.json", 182 * 1024 * 1024),
    ("Attribute.json", 166 * 1024 * 1024),
    ("Method.json", 161 * 1024 * 1024),
    ("Access.json", 168 * 1024 * 1024),
    ("infusion_snapshot.json", 222),
    ("LocalVariable.json", 248 * 1024 * 1024),
    ("LocationWithReceiver.json", 431 * 1024 * 1024),
    ("GlobalFunction.json", 30 * 1024 * 1024),
    ("Package.json", 3 * 1024 * 1024),
    ("Call.json", 335 * 1024 * 1024),
    ("Module.json", 29 * 1024 * 1024),
    ("System.json", 218 * 1024),
    ("InheritanceRelation.json", 5 * 1024 * 1024),
    ("GlobalVariable.json", 12 * 1024 * 1024),
    ("File.json", 33 * 1024 * 1024),
    ("Namespace.json", 2 * 1024 * 1024),
    ("Class.json", 41 * 1024 * 1024),
    ("Parameter.json", 211 * 1024 * 1024),
    ("TypedefDecorator.json", 6 * 1024 * 1024),
    ("Union.json", 111 * 1024),
    ("Subsystem.json", 1.5 * 1024 * 1024),
    ("Primitive.json", 2.8 * 1024),
]

# Create files with the specified sizes and random content
for filename, size in file_specs:
    file_path = os.path.join(output_dir, filename)
    with open(file_path, "wb") as f:
        # Some sizes are floats (e.g. 1.5 MiB), so truncate to whole bytes
        f.write(os.urandom(int(size)))

@sylvestre
Contributor

The profile:
https://share.firefox.dev/3PtBRks
Generated with:
samply record ./target/profiling/coreutils cp cp-perf/cymbol_out/ -r cymbol_out_bak

@hongxuchen
Author

and how did you get/build your ~/.cargo/bin/coreutils ?

I installed with cargo install coreutils --locked.

> cargo --version
cargo 1.84.0-nightly (8c30ce536 2024-10-15)
> inxi -CMm
Machine:
  Type: Kvm System: OpenStack Foundation product: OpenStack Nova v: 13.2.1-20230116174715_590b25c
    serial: a2e7a571-64eb-471c-bbc6-d38d185b6d3e
  Mobo: N/A model: N/A serial: N/A BIOS: SeaBIOS
    v: rel-1.10.2-0-g5f4c7b1-20181220_000000-szxrtosci10000 date: 04/01/2014
Memory:
  System RAM: total: 64 GiB available: 62.79 GiB used: 4.84 GiB (7.7%)
  Array-1: capacity: 64 GiB slots: 4 modules: 4 EC: Multi-bit ECC
  Device-1: DIMM 0 type: RAM size: 16 GiB speed: N/A
  Device-2: DIMM 1 type: RAM size: 16 GiB speed: N/A
  Device-3: DIMM 2 type: RAM size: 16 GiB speed: N/A
  Device-4: DIMM 3 type: RAM size: 16 GiB speed: N/A
CPU:
  Info: 8-core model: Intel Xeon Gold 6266C bits: 64 type: MT MCP cache: L2: 8 MiB
  Speed (MHz): avg: 3000 min/max: N/A cores: 1: 3000 2: 3000 3: 3000 4: 3000 5: 3000 6: 3000
    7: 3000 8: 3000 9: 3000 10: 3000 11: 3000 12: 3000 13: 3000 14: 3000 15: 3000 16: 3000

@tertsdiepraam
Member

I also can't reproduce this. What kind of storage does your machine have? The only thing I can think of is that maybe you have slow storage but a filesystem that supports reflinking, and uu_cp isn't picking up on that. Maybe you could try running it with --reflink=always?

@hongxuchen
Author

@tertsdiepraam both GNU coreutils cp and uu_cp report Operation not supported when running with --reflink=always.
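
The "Operation not supported" result can also be checked independently of either cp. The sketch below is a Linux-only probe that attempts a FICLONE ioctl between two temporary files in a given directory; `supports_reflink` is a hypothetical helper name, and the request number is the value defined in `linux/fs.h`. It assumes that `--reflink=always` relies on this same ioctl, which is how I understand it to work on Linux but is not confirmed in this thread.

```python
import fcntl
import tempfile

# Linux ioctl request number for FICLONE, i.e. _IOW(0x94, 9, int) from linux/fs.h
FICLONE = 0x40049409


def supports_reflink(directory):
    """Return True if cloning one temp file onto another succeeds in `directory`."""
    with tempfile.NamedTemporaryFile(dir=directory) as src, \
            tempfile.NamedTemporaryFile(dir=directory) as dst:
        src.write(b"probe")
        src.flush()
        try:
            # Ask the filesystem to make dst share src's data blocks
            fcntl.ioctl(dst.fileno(), FICLONE, src.fileno())
            return True
        except OSError:
            # Typically EOPNOTSUPP or EINVAL when the filesystem cannot reflink
            return False


# Replace the temp directory with a directory on the volume being tested
print(supports_reflink(tempfile.gettempdir()))
```

A `False` result on the affected volume would be consistent with both tools refusing `--reflink=always`, and would suggest that reflink detection is not what separates them here.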

@hongxuchen
Author

@sylvestre These are my hyperfine results.

> hyperfine --export-markdown cp.md "/usr/bin/cp  cymbol_out -r cymbol_out_bak && rm -rf cymbol_out_bak"  "~/.cargo/bin/coreutils cp cymbol_out -r cymbol_out_bak && rm -rf cymbol_out_bak"
Benchmark 1: /usr/bin/cp  cymbol_out -r cymbol_out_bak && rm -rf cymbol_out_bak
  Time (mean ± σ):      2.985 s ±  0.210 s    [User: 0.002 s, System: 2.981 s]
  Range (min … max):    2.721 s …  3.296 s    10 runs

Benchmark 2: ~/.cargo/bin/coreutils cp cymbol_out -r cymbol_out_bak && rm -rf cymbol_out_bak
  Time (mean ± σ):     50.834 s ±  0.289 s    [User: 0.004 s, System: 3.571 s]
  Range (min … max):   50.412 s … 51.314 s    10 runs

Summary
  /usr/bin/cp  cymbol_out -r cymbol_out_bak && rm -rf cymbol_out_bak ran
   17.03 ± 1.20 times faster than ~/.cargo/bin/coreutils cp cymbol_out -r cymbol_out_bak && rm -rf cymbol_out_bak
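
The "17.03 ± 1.20" in the summary can be reproduced from the two means and standard deviations with first-order error propagation for a quotient. This is a sketch of that arithmetic, not hyperfine's actual code; `speed_ratio` is a hypothetical helper name.

```python
import math


def speed_ratio(mean_fast, sd_fast, mean_slow, sd_slow):
    """Ratio of two mean run times with a first-order propagated uncertainty."""
    ratio = mean_slow / mean_fast
    # Relative uncertainties of independent measurements add in quadrature
    rel = math.sqrt((sd_fast / mean_fast) ** 2 + (sd_slow / mean_slow) ** 2)
    return ratio, ratio * rel


# Means and standard deviations (seconds) from the two benchmarks above
ratio, err = speed_ratio(2.985, 0.210, 50.834, 0.289)
print(f"{ratio:.2f} ± {err:.2f}")  # prints "17.03 ± 1.20"
```

Almost all of the uncertainty comes from the GNU cp run (about 7% relative spread); the uu_cp timings are very stable at roughly 50.8 s, which points to a systematic slowdown rather than noise.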

@hongxuchen
Author

hongxuchen commented Jan 9, 2025

BTW, I'm using an AWS-like elastic compute cloud. Not sure whether this affects I/O much.

> sudo dmidecode -t 16,17
# dmidecode 3.5
Getting SMBIOS data from sysfs.
SMBIOS 2.8 present.

Handle 0x1000, DMI type 16, 23 bytes
Physical Memory Array
        Location: Other
        Use: System Memory
        Error Correction Type: Multi-bit ECC
        Maximum Capacity: 64 GB
        Error Information Handle: Not Provided
        Number Of Devices: 4

Handle 0x1100, DMI type 17, 40 bytes
Memory Device
        Array Handle: 0x1000
        Error Information Handle: Not Provided
        Total Width: Unknown
        Data Width: Unknown
        Size: 16 GB
        Form Factor: DIMM
        Set: None
        Locator: DIMM 0
        Bank Locator: Not Specified
        Type: RAM
        Type Detail: Other
        Speed: Unknown
        Manufacturer: QEMU
        Serial Number: Not Specified
        Asset Tag: Not Specified
        Part Number: Not Specified
        Rank: Unknown
        Configured Memory Speed: Unknown
        Minimum Voltage: Unknown
        Maximum Voltage: Unknown
        Configured Voltage: Unknown
...

> sudo lshw -c disk
  *-virtio3
       description: Virtual I/O device
       physical id: 0
       bus info: virtio@3
       logical name: /dev/vda
       size: 200GiB (214GB)
       capabilities: gpt-1.00 partitioned partitioned:gpt
       configuration: driver=virtio_blk guid=4be7b5ae-c649-47da-83c0-db611f0550b7 logicalsectorsize=512 sectorsize=512
  *-virtio4   # <============== cp is run on this device
       description: Virtual I/O device
       physical id: 0
       bus info: virtio@4
       logical name: /dev/vdb
       size: 400GiB (429GB)
       capabilities: partitioned partitioned:dos
       configuration: driver=virtio_blk logicalsectorsize=512 sectorsize=512 signature=b698daaa

@sylvestre
Contributor

Could you please run samply and share the profile of your run? Thanks

@hongxuchen
Author

hongxuchen commented Jan 9, 2025

Could you please run samply and share the profile of your run? Thanks

I'm sorry, but it seems that I cannot use samply record and start the web server on the remote machine due to my company's regulations. For now I can only provide the profile.json produced by samply record -s -n, but I'm afraid it is not very informative.

FYI, I tried a physical machine (running Windows, however), and uu_cp also takes <3s there.

@sylvestre
Contributor

Sure, please share the profile.json :)

@hongxuchen
Author

@sylvestre profile.json is embedded in last reply:)

@sylvestre
Contributor

sorry!

here is the profile: https://share.firefox.dev/4fTBcUr
you need to rebuild coreutils with debug info, as the profile isn't really actionable without it

@hongxuchen
Author

profile.json
Please see the profile where I built with cargo build against coreutils git commit 33ac583. (I changed clap to 4.5.23 and tempfile to 3.14.0 in Cargo.toml to avoid compilation failure, which I don't think should affect much.)

> rustc --version
rustc 1.86.0-nightly (243d2ca4d 2025-01-06)
> time samply record -s -n /root/OSS/rust/coreutils/target/debug/coreutils cp -r cymbol_out/ cymbol_out_bak
real    49.19s
user    0.10s
sys     3.24s

@sylvestre
Contributor

you will have to upload it yourself. the profile doesn't have the debug info, sorry

@hongxuchen
Author

you will have to upload it yourself. the profile doesn't have the debug info, sorry

I'm sorry, but I'm not allowed to upload files larger than a certain size (the coreutils binary certainly exceeds it) according to my company's regulations :(

@tertsdiepraam
Member

Maybe you could run it on just one file? Even if it doesn't take that long we might still be able to see which function takes up most of the time.

@hongxuchen
Author

Maybe you could run it on just one file? Even if it doesn't take that long we might still be able to see which function takes up most of the time.

I failed to upload the debug build of the single uu_cp binary (even its tar.bz2 is >5MB) because it exceeds the size limit; it seems that the company sets the limit at 100KB...

@sylvestre
Contributor

maybe try on some personal hardware :)

@hongxuchen
Author

@sylvestre I cannot reproduce the huge performance difference on my personal (physical) machines, which is why I suspect it results from I/O virtualization.
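
One way to probe the I/O virtualization hypothesis without uploading any binary is to time a plain read/write copy at different block sizes on the affected volume. The sketch below is only a diagnostic under stated assumptions: `timed_copy` is a hypothetical helper, it does not reproduce the copy strategy of either uu_cp or GNU cp, and it assumes that request size is a plausible factor on a virtio block device, which nothing in this thread confirms.

```python
import os
import tempfile
import time


def timed_copy(src, dst, block_size):
    """Copy src to dst using unbuffered reads/writes of block_size bytes; return seconds."""
    start = time.perf_counter()
    with open(src, "rb", buffering=0) as fin, open(dst, "wb", buffering=0) as fout:
        while True:
            chunk = fin.read(block_size)
            if not chunk:
                break
            view = memoryview(chunk)
            while view:
                # Raw writes may be partial, so keep writing the remainder
                written = fout.write(view)
                view = view[written:]
        # Flush to the device so the timing includes the actual write-out
        os.fsync(fout.fileno())
    return time.perf_counter() - start


# Demo on a small temporary file; for a meaningful measurement, point `src`
# at one of the large JSON files on the affected volume instead.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "src.bin")
    with open(src, "wb") as f:
        f.write(os.urandom(4 * 1024 * 1024))
    for block_size in (64 * 1024, 1024 * 1024, 4 * 1024 * 1024):
        seconds = timed_copy(src, os.path.join(tmp, "dst.bin"), block_size)
        print(f"{block_size // 1024:>5} KiB blocks: {seconds:.3f}s")
```

If the timings on the virtual disk vary strongly with block size while those on a physical machine do not, that would support the virtualization explanation; if they stay flat, the cause is more likely elsewhere.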
