Add hashInHttpHeaders option for Coursier resolver #383

Open: wants to merge 2 commits into base: master

timothyg-stripe

Add a new resolver option hashInHttpHeaders for Coursier, which tells bazel-deps to use checksums in HTTP headers if possible, instead of downloading the jar artifact and computing hash digests locally. The option is read from the resolverOptions object in the input YAML file.

If this option is true, when computing checksums, bazel-deps will:

  1. First try to make a HEAD request to the artifact.
  2. If the HEAD request was successful:
    1. Save the headers as a JSON file in the Coursier cache directory
    2. If the headers include the necessary checksum, return that.
    3. If the headers don't contain the checksum, fall back to downloading the artifact itself.
  3. If the HEAD request was unsuccessful:
    1. If the status code is 404, but the .pom file exists for this artifact, assume that the artifact will never be published in the future. Cache the error status in the Coursier cache directory to avoid downloading this file in the future. This is the same heuristic that Coursier itself uses.
    2. Return error.
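
The steps above can be sketched as a pure decision function. This is only an illustration of the flow, not the code in this PR: the `Outcome` type, the `decide` function, and the assumption that "successful" means a 2xx status are all hypothetical, and header names are assumed to be lowercased already.

```scala
object HeadChecksumFlow {
  // Hypothetical outcome type, introduced here for illustration only.
  sealed trait Outcome
  final case class UseHeaderChecksum(hex: String) extends Outcome
  case object DownloadArtifact extends Outcome
  final case class Fail(status: Int, cacheError: Boolean) extends Outcome

  // status:         status code of the HEAD request
  // headers:        response headers, keys assumed lowercased
  // checksumHeader: header expected to carry the digest, e.g. "x-checksum-sha256"
  // pomExists:      whether the .pom file for this artifact exists
  def decide(
      status: Int,
      headers: Map[String, String],
      checksumHeader: String,
      pomExists: Boolean
  ): Outcome =
    if (status >= 200 && status < 300) {
      headers.get(checksumHeader) match {
        case Some(hex) => UseHeaderChecksum(hex) // step 2.2: checksum found in headers
        case None      => DownloadArtifact       // step 2.3: fall back to the full download
      }
    } else {
      // Step 3: always an error; only a 404 with an existing .pom is cached.
      Fail(status, cacheError = status == 404 && pomExists)
    }
}
```

Saving the headers to the cache directory (step 2.1) is a side effect and is left out of this sketch.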

This change reduces the fully-cached runtime of bazel-deps on our internal repo (which uses Artifactory, which supplies both SHA-1 and SHA-256 hashes via headers) from 80s to 7s, and reduces the size of the Coursier cache directory by 99%.

Co-authored-by: Keith Lea

cc @keithl-stripe

…f checksumming from network and the disk cache)
@timothyg-stripe
Author

Also note that this is the first bit of serious Scala that I've written, so please feel free to suggest anything stylistic also :)

Collaborator

@johnynek johnynek left a comment

This looks really exciting.

Thanks for sending the PR.

I had a few requests for changes or comments before merging.

case e: java.nio.file.FileAlreadyExistsException => ()
}
} else {
// println(s"not caching error for $artifact")
Collaborator

Can we remove the commented-out code or use a logging API here? (I think other files are using slf4j, if I remember correctly.)

Task.schedule(CoursierResolver.downloadPool) {
// Since we use atomic moves, we can guarantee that if the header file exists, it is complete.
if (!Files.exists(headersPath)) {
val tmp = CachePath.temporaryFile(headersPath.toFile).toPath
Collaborator

Could we factor lines 197-246 into a method such as fetchHeadersToPath(...) or something?
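
For readers following along, the write-to-temp-then-atomic-move pattern that the quoted comment relies on might look roughly like the sketch below. It is not the PR's code: `AtomicHeaderCache` and `writeIfAbsent` are hypothetical names, and `Files.createTempFile` stands in for Coursier's `CachePath.temporaryFile`. It assumes the target path has a parent directory on the same file system.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Path, StandardCopyOption}

object AtomicHeaderCache {
  // Writes `contents` to `target` only if `target` does not exist yet.
  // The data goes to a temporary file in the same directory first and is then
  // moved into place atomically, so a reader never sees a partial file.
  def writeIfAbsent(target: Path, contents: String): Unit =
    if (!Files.exists(target)) {
      val tmp = Files.createTempFile(target.getParent, target.getFileName.toString, ".part")
      try {
        Files.write(tmp, contents.getBytes(StandardCharsets.UTF_8))
        Files.move(tmp, target, StandardCopyOption.ATOMIC_MOVE)
      } finally {
        // No-op after a successful move; cleans up the temp file on failure.
        Files.deleteIfExists(tmp)
      }
    }
}
```

Because the move is atomic, "the file exists" implies "the file is complete", which is what lets the caller skip the HEAD request on later runs.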

Failure(Recoverable(new RuntimeException(s"failed to parse headers file $headersPath", error)))
case Right(obj) => Success(obj)
})
.map((headerMap) => HttpHeaders.of(headerMap.map { case (k, v) => (k, v.asJava) }.asJava, { (_, _) => true }))
Collaborator

nit: You don't need the parentheses around (headerMap); it is the same as .map(headerMap => HttpHeaders...

}
.map { case (sha, length) => (artifact, ShaValue(sha, digestType), length) }
))
} else Task.fail(Recoverable(new RuntimeException("skipped HEAD request")))
Collaborator

The if for this else is very far from here... maybe it will get closer if we take the body of the if and put it in a function or method.

.flatMap {
case Right(r) => Task.point(r)
case Left(e: Recoverable) =>
// println(s"falling back to downloading the whole file: $e")
Collaborator

can we remove the commented code please?

"hashInHttpHeaders",
hashInHttpHeaders.map(b => Doc.text(s"$b"))
),
).sortBy(_._1)
Collaborator

I guess this was copied, but we don't need to sort a list of length 1. You can leave it if you add a comment saying it is here so we don't forget to sort if we add more options.


val items = List(
(
"hashInHttpHeaders",
Collaborator

I'm not really sure why we wouldn't always want to enable this if it works. Can you think of a reason? If so, can you document it here so someone reading the code can remember?

Author

Unfortunately, it doesn't work with all Maven repository servers. E.g., https://repo1.maven.org/maven2/ only supplies SHA-1 checksums, so it would be a waste to enable this option if that's the repo that's getting used.

Collaborator

makes sense.

)
case g: ResolverType.Gradle =>
val ec = scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

lazy val coursierResolver =
- new CoursierResolver(model.getOptions.getResolvers, ec, 3600.seconds, resolverCachePath)
+ new CoursierResolver(model.getOptions.getResolvers, false, ec, 3600.seconds, resolverCachePath)
Collaborator

same comment as above: if this works, shouldn't true be the default? Why would you set false? Maybe if you know with certainty your http service doesn't give out the headers?

.firstValue(digestType match {
// See also https://maven.apache.org/resolver/expected-checksums.html#non-standard-x-headers
case DigestType.Sha1 => "x-checksum-sha1"
case DigestType.Sha256 => "x-checksum-sha256"
Collaborator

I wonder if this should be moved to a method on DigestType, so that this call becomes headers.firstValue(digestType.mavenHeader) or something?

Also what about x-goog-meta-checksum-sha1 and x-goog-meta-checksum-md5?

Maybe in fact we need to do something like:

sealed trait DigestType {
  def getDigestInstance: MessageDigest
  def expectedHexLength: Int
  def name: String
  def headerKeys: List[String]
}

then do:

Try {
  val digest = digestType
    .headerKeys
    .flatMap(key => headers.firstValue(key).asScala)
    .headOption
    .getOrElse(throw new Recoverable(new RuntimeException(s"no ${digestType} found in headers in $headersPath")))

  val len = headers
    .firstValueAsLong("Content-Length")
    .orElseThrow(() =>
      Recoverable(new RuntimeException(s"no Content-Length found in headers $headersPath"))
    )

  (digest, len)
}
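
A self-contained version of this idea, for anyone who wants to try it outside the PR, could look like the sketch below. It is an assumption-laden illustration rather than the eventual implementation: the trimmed-down `DigestType`, the `checksumAndLength` helper, and the use of `Either` in place of `Try`/`Recoverable` are all hypothetical. The header names are only the ones mentioned in this thread.

```scala
import java.net.http.HttpHeaders

object DigestHeaderLookup {
  // Trimmed-down stand-in for the project's DigestType (hypothetical).
  sealed trait DigestType {
    def name: String
    def headerKeys: List[String]
  }
  case object Sha1 extends DigestType {
    val name = "SHA-1"
    val headerKeys = List("x-checksum-sha1", "x-goog-meta-checksum-sha1")
  }
  case object Sha256 extends DigestType {
    val name = "SHA-256"
    val headerKeys = List("x-checksum-sha256")
  }

  // Returns the first checksum found under any known header key, together
  // with the Content-Length, or a message describing what is missing.
  def checksumAndLength(
      headers: HttpHeaders,
      digestType: DigestType
  ): Either[String, (String, Long)] = {
    val digest: Either[String, String] =
      digestType.headerKeys
        .flatMap(key => Option(headers.firstValue(key).orElse(null)))
        .headOption
        .toRight(s"no ${digestType.name} checksum found in headers")

    val lenOpt = headers.firstValueAsLong("Content-Length")
    val length: Either[String, Long] =
      if (lenOpt.isPresent) Right(lenOpt.getAsLong)
      else Left("no Content-Length found in headers")

    for {
      d <- digest
      l <- length
    } yield (d, l)
  }
}
```

Keeping the header names on `DigestType` means a new server convention only needs one more entry in `headerKeys`, not another branch at the call site.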

@johnynek
Collaborator

johnynek commented Jun 4, 2024

also it looks like the tests fail to build:

 test/scala/com/github/johnynek/bazel_deps/ModelGenerators.scala:126: error: type mismatch;
 found   : Option[Serializable]
 required: Option[com.github.johnynek.bazel_deps.ResolverType]
    resolverType,
    ^
one error found
Build failed

ideally we could generate Coursier with and without the option set to test the parsing and formatting code.
