
Google::Cloud::Storage is about twice slower than gsutil cp for downloading files #1897

Closed
casperisfine opened this issue Dec 21, 2017 · 29 comments
Assignees
Labels
api: storage Issues related to the Cloud Storage API. priority: p2 Moderately-important priority. Fix may not be included in next release. type: question Request for information or clarification. Not an issue.

Comments

@casperisfine
Contributor

Hi, I ran the following benchmark from a GKE container:

#!/usr/bin/env ruby

require 'benchmark'
require 'google/cloud/storage'

def measure(&block)
  duration = Benchmark.realtime(&block)
  puts "took: #{duration.round(1)}s"
end

puts "--- Generate 500MB file"
puts system('dd if=/dev/urandom of=/tmp/random-500M.bin bs=1048576 count=500')

url = "gs://<my-test-bucket>/test/random-500M.bin.#{rand}"
puts "--- Upload with gsutil"
measure do
  system("gsutil cp /tmp/random-500M.bin #{url}")
end


puts "--- Download with gsutil"
measure do
  system("gsutil cp #{url} /tmp/random-500M.bin.#{rand}")
end

puts "--- Upload with GCS-ruby"

bucket = Google::Cloud::Storage.new(
  project: '<my-test-project>',
  keyfile: ENV['GOOGLE_APPLICATION_CREDENTIALS'],
  timeout: 5,
  retries: 0,
).bucket('<my-test-bucket>', skip_lookup: true)

bucket = Buildkite::FSCache.send(:bucket)

file_name = "test/random-500M.bin.#{rand}"
measure do
  puts bucket.create_file('/tmp/random-500M.bin', file_name)
end

puts "--- Download with GCS-ruby"
measure do
  puts bucket.file(file_name).download("/tmp/random-500M.bin.#{rand}")
end

In short: upload and then download a random 500MB file, with the following results:

  • gsutil cp upload 7.1s
  • gsutil cp download 6.1s
  • G::C::S upload 4.8s
  • G::C::S download 13.9s

There's obviously a bit of variance between runs, but G::C::S#download is consistently about twice as slow as the rest.

Also note that gsutil doesn't use crcmod here (it's printing warnings about it).

I tried looking at gsutil's source to see what it's doing differently, but it's a bit like looking for a needle in a haystack.

Any ideas?

cc @DazWorrall @wvanbergen

@blowmage
Contributor

Are you disabling the verification check when downloading? You pass verify: :none to the download method to do this.

https://googlecloudplatform.github.io/google-cloud-ruby/#/docs/google-cloud-storage/v1.9.0/google/cloud/storage/file?method=download-instance

@casperisfine
Contributor Author

I'll try this, thanks!

@blowmage
Contributor

Ugh, the app I use for GitHub notifications didn’t show the benchmark code in the issue description. Now that I’m reading this in a browser I see that you were not disabling the verify check.

It is possible that this verification stuff is the cause for the difference. Calculating MD5 on a 500 MB file will take some time, but we think we are doing this as efficiently as possible.
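For context on what that verification costs, here's a minimal sketch (a generic technique, not the gem's actual implementation) of computing a file's MD5 incrementally with Ruby's stdlib `Digest`, which is how a large download can be checksummed without holding the whole file in memory:

```ruby
require 'digest'
require 'tempfile'

# Incremental MD5: feed the digest fixed-size chunks so the whole file
# never has to be held in memory at once. Generic sketch only; this is
# not the google-cloud-storage gem's internal code.
def streaming_md5(path, chunk_size: 1 << 20)
  md5 = Digest::MD5.new
  File.open(path, 'rb') do |io|
    while (chunk = io.read(chunk_size))
      md5.update(chunk)
    end
  end
  md5.hexdigest
end

Tempfile.create('md5demo') do |f|
  f.binmode
  f.write('a' * 3_000_000) # ~3 MB of data
  f.flush
  # Matches the one-shot stdlib helper:
  puts streaming_md5(f.path) == Digest::MD5.file(f.path).hexdigest # => true
end
```

Even done this way, hashing 500 MB is pure CPU work on top of the transfer, which is why `verify: :none` is worth testing in a benchmark.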

@casperisfine
Contributor Author

I ran the benchmark twice with verify: :none:

  • run 1: gsutil -> 8.0s, gcs-ruby -> 11.6s
  • run 2: gsutil -> 6.3s, gcs-ruby -> 13.0s

So I'm afraid disabling the check only improves things marginally :/

@blowmage
Contributor

K, I'll look into this. FWIW, google-cloud-storage uses google-api-client to access the API, which uses httpclient for the HTTP calls. So it will take a bit to dive through all the dependencies to see where we can pick up performance.

@blowmage
Copy link
Contributor

@casperisfine Oh! One more thing! Your benchmark includes creating the File object, which also makes an API call. Can you update it to something like the following?

  puts bucket.file(file_name, skip_lookup: true).download("/tmp/random-500M.bin.#{rand}", verify: :none)

@blowmage blowmage self-assigned this Dec 21, 2017
@blowmage blowmage added api: storage Issues related to the Cloud Storage API. type: question Request for information or clarification. Not an issue. priority: p2 Moderately-important priority. Fix may not be included in next release. labels Dec 21, 2017
@casperisfine
Contributor Author

Can you update it to something like the following?

Good point, but still no noticeable change (~6s vs ~12s). It's still nice to make one fewer call though, so I'll add this to our production system.

which uses httpclient for the HTTP calls. So it will take a bit to dive through all the dependencies to see where we can pick up performance.

Yeah, my hunch is that the performance difference is more likely at that layer: some non-optimal buffering, different compression, etc.

What is also interesting is that it only impacts download performance; upload performance is fine.

I'll do some more benchmarks to see if the perf diff is linear; let me know if there is anything else I can do to help you figure this out.

Also, maybe gsutil is doing something smart and its authors would know?

@casperisfine
Contributor Author

I did 3 more runs, this time with 100M, 500M and 1G (spreadsheet)

There is a lot of variance obviously, but it does appear to be a fairly linear progression:

[Screenshot, 2017-12-22: chart of download time vs. file size, showing roughly linear growth]

@quartzmo
Member

Hi @casperisfine, what is this line for?

bucket = Buildkite::FSCache.send(:bucket)

Can you share your entire configuration somewhere so that I can run this script in my own GKE deployment? I'm just trying to reproduce your results first, then I'll try to debug.

@DazWorrall

DazWorrall commented Dec 22, 2017

[context: I work with @casperisfine]

You can ignore that line I think, @quartzmo; it looks like leftover cruft from extracting this code out of the project we use it in. The result is the same as the code immediately above it: it returns a Google::Cloud::Storage::Bucket object.

@quartzmo
Member

quartzmo commented Dec 22, 2017

@DazWorrall, OK, sounds like maybe it isn't easy to share your deployment configuration.

@casperisfine I know my home workstation isn't your deployment target, but here's what I get locally:

--- Upload with gsutil
- [1 files][500.0 MiB/500.0 MiB]    8.7 MiB/s                                   
Operation completed over 1 objects/500.0 MiB.                                    
took: 65.2s
--- Download with gsutil
Copying gs://downloadtest_978004358714/test/random-500M.bin.0.5400586331450901...
\ [1 files][500.0 MiB/500.0 MiB]    6.7 MiB/s                                   
Operation completed over 1 objects/500.0 MiB.                                    
took: 75.3s
--- Upload with GCS-ruby
#<Google::Cloud::Storage::File:0x00007fd63a70d588>
took: 57.4s
--- Download with GCS-ruby
#<File:0x00007fd63a5c7368>
took: 61.9s

@byroot

byroot commented Dec 22, 2017

[context: I am @casperisfine alias]

OK, sounds like maybe it isn't easy to share your deployment configuration.

It's entirely my bad. I used this in my initial benchmark, but that method simply contains what I copy-pasted above:

bucket = Google::Cloud::Storage.new(
  project: '<my-test-project>',
  keyfile: ENV['GOOGLE_APPLICATION_CREDENTIALS'],
  timeout: 5,
  retries: 0,
).bucket('<my-test-bucket>', skip_lookup: true)

I know my home workstation isn't your deployment target, but here's what I get locally:

If it can help, I'm benchmarking from a GKE cluster.

@blowmage
Contributor

blowmage commented Jan 8, 2018

I'm still looking into this.

@blowmage
Contributor

blowmage commented Jan 9, 2018

I've tried to reproduce this on several different systems, with several different network connections, and in the vast majority of cases I've found google-cloud-storage to be faster than gsutil.

The fastest connection has been (not surprisingly) a Google Compute Engine virtual machine, the same platform GKE runs on. After installing the gcloud CLI, Ruby, and the google-cloud-storage gem, and running several attempts, I've collected the following measurements. (The full output log and modified benchmark script can be found here: https://gist.github.com/blowmage/b9e89cd710fcc73117cbc8a2c3ac2cce)

10MB File

| Action | Try 1 | Try 2 | Try 3 |
| --- | --- | --- | --- |
| gsutil upload | 1.8s | 2.0s | 2.0s |
| gsutil download | 1.1s | 1.1s | 1.0s |
| gcs-ruby upload | 1.2s | 1.0s | 1.1s |
| gcs-ruby download | 0.2s | 0.3s | 0.3s |

100MB File

| Action | Try 1 | Try 2 | Try 3 |
| --- | --- | --- | --- |
| gsutil upload | 2.4s | 2.6s | 3.7s |
| gsutil download | 2.6s | 2.9s | 3.0s |
| gcs-ruby upload | 1.7s | 2.0s | 1.8s |
| gcs-ruby download | 1.0s | 1.0s | 1.0s |

250MB File

| Action | Try 1 | Try 2 | Try 3 |
| --- | --- | --- | --- |
| gsutil upload | 4.2s | 3.6s | 3.8s |
| gsutil download | 3.0s | 2.8s | 3.0s |
| gcs-ruby upload | 3.5s | 3.6s | 2.7s |
| gcs-ruby download | 2.2s | 2.1s | 2.3s |

500MB File

| Action | Try 1 | Try 2 | Try 3 |
| --- | --- | --- | --- |
| gsutil upload | 10.5s | 6.4s | 6.0s |
| gsutil download | 5.5s | 4.8s | 5.3s |
| gcs-ruby upload | 4.9s | 5.5s | 5.6s |
| gcs-ruby download | 4.0s | 4.0s | 4.1s |

750MB File

| Action | Try 1 | Try 2 | Try 3 |
| --- | --- | --- | --- |
| gsutil upload | 8.6s | 7.2s | 9.3s |
| gsutil download | 8.0s | 7.5s | 8.0s |
| gcs-ruby upload | 7.8s | 8.0s | 7.6s |
| gcs-ruby download | 6.5s | 6.4s | 6.2s |

1GB File

| Action | Try 1 | Try 2 | Try 3 |
| --- | --- | --- | --- |
| gsutil upload | 13.3s | 8.9s | 10.0s |
| gsutil download | 9.3s | 9.0s | 9.2s |
| gcs-ruby upload | 9.9s | 10.1s | 10.4s |
| gcs-ruby download | 9.6s | 10.6s | 10.4s |

I have not found an appreciable difference between these two libraries, and I am unaware of any low-hanging fruit in how HTTPClient is used. Is there more information you can share to help us continue looking into this, or should we close this issue?

@casperisfine
Contributor Author

Could the bucket configuration have an impact here? For reference, I'm benchmarking with the shopify-ci/shopify-ci bucket, which is configured as multi-regional, and the benchmark was run from us-east1.

I can't think of any other information that could help pinpoint the issue, so if you think that's not it, I'll close this and look for issues elsewhere in the stack.

@blowmage
Contributor

blowmage commented Jan 9, 2018

I don’t know if bucket configuration would affect this, or why it would affect the gem only and not the cli. But I’m also not a Google employee and have no insight into the Storage service beyond the public documentation. Is there a support channel you can ask that question to?

@casperisfine
Contributor Author

Well, thanks for all the time you spent on this. I'll indeed try to go through support with this.

@blowmage
Contributor

blowmage commented Jan 9, 2018

I’m sorry we couldn’t repro the download speed you are seeing. If you discover the cause we would love to know so we can change the Ruby client to avoid it if possible.

@casperisfine
Contributor Author

No need to be sorry, that's actually helpful. And I'll make sure to report any further findings here, even if there are no modifications to make to the library.

@blowmage
Contributor

blowmage commented Jan 9, 2018

Sounds good. Best of luck!

@casperisfine
Contributor Author

I received an update from Google support, they were able to reproduce the issue:

Hello Jean, My initial tests do show a very significant (~7x) difference between downloading via gsutil and the ruby library via the benchmark. I still need to run a couple other tests, and if those also show similar results I will need to bring this up with Engineering. I'll have another update for you no later than Friday. For now I would keep using the gsutil command for large downloads. Sincerely,

🤞

@quartzmo
Member

Thanks for the update!

@casperisfine
Contributor Author

Another update:

Engineering is still looking into this issue. I ran a couple more tests and from my end it seems that the ruby download is drastically tied with the amount of CPU that the instance has. I've added this to the backed issue I've filed with engineering. I'll have another update for you next week with any progress that engineering may have.

The CPU part is interesting. Maybe the HTTP layer ends up doing write(read()) in userland instead of leveraging sendfile().
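To illustrate the speculation with a minimal, hypothetical sketch (on plain local files rather than real sockets): a readpartial loop pulls every chunk into a Ruby String before writing it back out, while IO.copy_stream performs the same transfer inside the interpreter's C code and can use sendfile(2) when the source is a regular file:

```ruby
require 'tempfile'

# Copy style 1: userland read/write loop, one Ruby String per chunk.
def copy_with_readpartial(src, dst, chunk_size = 16 * 1024)
  loop do
    dst.write(src.readpartial(chunk_size))
  end
rescue EOFError
  dst.flush # readpartial signals end of input by raising EOFError
end

payload = 'data' * 250_000 # ~1 MB
Tempfile.create('src') do |src|
  src.binmode
  src.write(payload)
  src.flush

  src.rewind
  out1 = Tempfile.create('rp') do |dst|
    copy_with_readpartial(src, dst)
    File.binread(dst.path)
  end

  # Copy style 2: IO.copy_stream does the same transfer in C.
  src.rewind
  out2 = Tempfile.create('cs') do |dst|
    IO.copy_stream(src, dst)
    File.binread(dst.path)
  end

  puts out1 == payload && out2 == payload # => true
end
```

Both produce identical output; the difference is purely in allocation and syscall overhead, which is what would show up as CPU time on the download path.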

@casperisfine
Contributor Author

So I reran the download benchmark with a (very dirty) patch to httpclient that uses IO.copy_stream (which leverages sendfile() when possible) instead of a readpartial loop.

For 500M, the bench went down from ~15s to ~5s, so that is definitely the source of the difference between gsutil and gcs-ruby.

The shitty patch: https://gist.github.com/casperisfine/f7830c20718813a570927f07d6357bfe
And the bench: https://gist.github.com/casperisfine/6e25454406322be4ecde976d15392ca0

There's an open issue in httpclient about using IO.copy_stream (nahi/httpclient#66), but it's very old, and from my investigation I really don't see how it could be done without breaking the current API.

@quartzmo / @blowmage I'd be willing to put time into this if there is interest, but I'd need to know to what lengths a patch could go. Is there a specific reason for using httpclient in the first place, or would migrating to another one that supports copy_stream be acceptable?

I see google-cloud-env has faraday as a dependency; are there plans to migrate to it? (I'm actually unsure how faraday performs on this particular issue.)

@blowmage
Contributor

The httpclient gem is used by google-api-client, the library that is actually making the API calls. It used to use faraday, but there were several issues with faraday. It eventually migrated to hurley, which solved many of the issues, but hurley never gained wide adoption and was eventually abandoned. Not wanting to revert on the many improvements that hurley made possible, the maintainers of google-api-client decided to switch to httpclient.

Around the same time that google-api-client was first started, the gem signet was created to perform the OAuth calls. Like google-api-client, signet was also built on top of faraday. Because the OAuth lifecycle is so simple it has never needed to move away from it.

All authentication for the google-cloud-* gems use the googleauth gem, which is built on top of signet, so we will always have faraday as a dependency. This is why google-cloud-env chose to use faraday for its calls. It has similar HTTP usage to signet, so it made sense to use a library that was already a known dependency.

Okay, history over. I would not expect faraday to bring improved download performance. I see three main options for addressing the performance:

  1. Fix httpclient. Continue to use google-api-client, which continues to use httpclient, and all users benefit from performance gains. But, as you stated, this would likely require an additional API or breaking API compatibility in httpclient. Not a quick solution, likely tied to a future httpclient 3.0 release.
  2. Switch google-api-client to use a library that performs better than httpclient. Unfortunately, while google-api-client is being maintained, it is also not under active development. Also, the last time google-api-client switched HTTP libraries it was necessary to break backwards compatibility.
  3. Have google-cloud-storage drop the use of google-api-client for downloads, and re-implement all functionality, including error handling, retry, incremental backoff, etc.

There are other options, such as maintaining the "shitty patch" somewhere in our code, but they don't seem to be as realistic or viable as the three above. @geigerj, can you offer any guidance on how to proceed? The tl;dr is downloads are CPU-intensive for very large files compared to other implementations.

@casperisfine
Contributor Author

Thanks for the history :), much appreciated.

I didn't try very hard to make that patch maintainable. We could certainly do much better.

Another possibility would be to expose a specific API for this in httpclient, since they have an open issue for it, they might be open to the idea. This specific call is likely the only one which would benefit from it.

@blowmage
Contributor

Another possibility would be to expose a specific API for this in httpclient

Yes, I listed that as my first option. I think adding a new API is a better solution than breaking backwards compatibility, most of the time. But I also think your solution of yielding io and offset block arguments is far superior to yielding String chunks, so as a disinterested observer I would prefer the new API to use that. Either solution would be an improvement for sure! :)
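To make the two block styles concrete, here is a toy sketch; every class and method name below is invented for illustration (this is not httpclient's real API), with StringIO standing in for the network socket:

```ruby
require 'stringio'

# Toy stand-in for an HTTP client's response-body reader. All names here
# are hypothetical; StringIO plays the role of the socket.
class FakeResponseBody
  def initialize(io)
    @io = io
  end

  # Current style: yield String chunks, one allocation per chunk.
  def each_chunk(chunk_size = 16 * 1024)
    while (chunk = @io.read(chunk_size))
      yield chunk
    end
  end

  # Proposed style: hand the underlying IO to the caller, who can pass it
  # straight to IO.copy_stream (no per-chunk Ruby Strings).
  def with_io
    yield @io
  end
end

payload = 'x' * 100_000

# Chunk-yielding copy:
chunked = StringIO.new
FakeResponseBody.new(StringIO.new(payload)).each_chunk { |c| chunked.write(c) }

# IO-yielding copy:
direct = StringIO.new
FakeResponseBody.new(StringIO.new(payload)).with_io { |io| IO.copy_stream(io, direct) }

puts chunked.string == payload && direct.string == payload # => true
```

The IO-yielding form is what lets the library skip materializing each chunk as a Ruby String; an offset argument could be added alongside the io for resumable reads, as suggested above.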

@casperisfine
Contributor Author

Yeah my bad. It was late -_-.

I'll try to see what's possible in terms of API, and submit a PR to httpclient if I have something decent.

@casperisfine
Contributor Author

I submitted a PR to httpclient: nahi/httpclient#383
