Add hashInHttpHeaders option for Coursier resolver #383

timothyg-stripe · 2024-06-04T17:25:07Z

Add a new resolver option hashInHttpHeaders for Coursier, which tells bazel-deps to use checksums in HTTP headers if possible, instead of downloading the jar artifact and computing hash digests locally. The option is read from the resolverOptions object in the input YAML file.

If this option is true, when computing checksums, bazel-deps will:

- First try to make a HEAD request to the artifact.
- If the HEAD request was successful:
  1. Save the headers as a JSON file in the Coursier cache directory.
  2. If the headers include the necessary checksum, return that.
  3. If the headers don't contain the checksum, fall back to downloading the artifact itself.
- If the HEAD request was unsuccessful:
  1. If the status code is 404, but the .pom file exists for this artifact, assume that the artifact will never be published in the future. Cache the error status in the Coursier cache directory to avoid downloading this file in the future. This is the same heuristic that Coursier itself uses.
  2. Return an error.
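
The decision flow above can be sketched as a pure function. This is only an illustration, not the code in this PR: the `Outcome` names and the `decide` signature are hypothetical, "successful" is assumed to mean a 2xx status, and the step that saves the headers as JSON is left out.

```scala
object HeadChecksumFlow {
  // Hypothetical outcome type; the PR itself works with Coursier tasks instead.
  sealed trait Outcome
  final case class UseHeaderChecksum(hex: String) extends Outcome
  case object DownloadArtifact extends Outcome
  case object CacheNotFoundAndFail extends Outcome
  case object Fail extends Outcome

  // status: HTTP status of the HEAD request
  // headerChecksum: checksum found in the response headers, if any
  // pomExists: whether the .pom file for this artifact exists
  def decide(status: Int, headerChecksum: Option[String], pomExists: Boolean): Outcome =
    if (status >= 200 && status < 300) { // assumption: 2xx counts as success
      headerChecksum match {
        case Some(hex) => UseHeaderChecksum(hex) // checksum present: use it
        case None      => DownloadArtifact       // no checksum: download the artifact
      }
    } else if (status == 404 && pomExists) {
      CacheNotFoundAndFail // remember the 404 so it is not retried, then fail
    } else {
      Fail
    }
}
```

For example, `decide(200, None, pomExists = true)` yields `DownloadArtifact`, while `decide(404, None, pomExists = true)` yields `CacheNotFoundAndFail`.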

This change reduces the fully-cached runtime of bazel-deps on our internal repo (which uses Artifactory, which supplies both SHA-1 and SHA-256 hashes via headers) from 80s to 7s, and reduces the size of the Coursier cache directory by 99%.

Co-authored-by: Keith Lea [email protected]

cc @keithl-stripe

timothyg-stripe · 2024-06-04T17:25:46Z

Also note that this is the first bit of serious Scala that I've written, so please feel free to suggest anything stylistic also :)

johnynek

This looks really exciting.

Thanks for sending the PR.

I had a few requests for changes or comments before merging.

johnynek · 2024-06-04T19:52:48Z

src/scala/com/github/johnynek/bazel_deps/CoursierResolver.scala

+                            case e: java.nio.file.FileAlreadyExistsException => ()
+                          }
+                        } else {
+                          // println(s"not caching error for $artifact")


Can we remove the commented code or use a logging API here? (I think other files are using slf4j, if I remember correctly.)

johnynek · 2024-06-04T19:55:08Z

src/scala/com/github/johnynek/bazel_deps/CoursierResolver.scala

+            Task.schedule(CoursierResolver.downloadPool) {
+              // Since we use atomic moves, we can guarantee that if the header file exists, it is complete.
+              if (!Files.exists(headersPath)) {
+                val tmp = CachePath.temporaryFile(headersPath.toFile).toPath


Could we factor lines 197-246 into a method such as fetchHeadersToPath(...) or something?

johnynek · 2024-06-04T19:55:54Z

src/scala/com/github/johnynek/bazel_deps/CoursierResolver.scala

+                      Failure(Recoverable(new RuntimeException(s"failed to parse headers file $headersPath", error)))
+                    case Right(obj) => Success(obj)
+                  })
+                  .map((headerMap) => HttpHeaders.of(headerMap.map { case (k, v) => (k, v.asJava) }.asJava, { (_, _) => true }))


nit: You don't need the parentheses around (headerMap); it is the same as .map(headerMap => HttpHeaders...
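
A tiny illustration of the nit, using made-up values rather than the PR's code: a single-parameter lambda behaves the same with or without parentheses around the parameter.

```scala
object LambdaParens {
  val xs = List(1, 2, 3)

  // Parenthesized single parameter, as written in the PR.
  val withParens = xs.map((x) => x + 1)

  // Same function without the parentheses.
  val withoutParens = xs.map(x => x + 1)
}
```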

johnynek · 2024-06-04T19:57:41Z

src/scala/com/github/johnynek/bazel_deps/CoursierResolver.scala

+                  }
+                  .map { case (sha, length) => (artifact, ShaValue(sha, digestType), length) }
+              ))
+          } else Task.fail(Recoverable(new RuntimeException("skipped HEAD request")))


The if for this else is very far from here... maybe it will get closer if we take the body of the if and put it in a function or method.
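
A rough sketch of that shape, assuming stand-ins throughout: `Recoverable` is modeled here as a simple exception wrapper, `checksumFromHead` is a hypothetical placeholder for the long body, and `Either` replaces the Coursier `Task` used in the PR.

```scala
object SkipOrFetch {
  // Stand-in for the PR's Recoverable wrapper (assumed shape).
  final case class Recoverable(cause: Throwable) extends Exception(cause)

  // Hypothetical placeholder for the long body of the `if` branch.
  def checksumFromHead(url: String): Either[Recoverable, String] =
    Left(Recoverable(new RuntimeException(s"HEAD not implemented in this sketch: $url")))

  // With the body extracted, the `if` and its `else` sit on adjacent lines.
  def checksum(url: String, hashInHttpHeaders: Boolean): Either[Recoverable, String] =
    if (hashInHttpHeaders) checksumFromHead(url)
    else Left(Recoverable(new RuntimeException("skipped HEAD request")))
}
```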

johnynek · 2024-06-04T19:58:22Z

src/scala/com/github/johnynek/bazel_deps/CoursierResolver.scala

+          .flatMap {
+            case Right(r) => Task.point(r)
+            case Left(e: Recoverable) =>
+              // println(s"falling back to downloading the whole file: $e")


can we remove the commented code please?

johnynek · 2024-06-04T20:00:44Z

src/scala/com/github/johnynek/bazel_deps/DepsModel.scala

+          "hashInHttpHeaders",
+          hashInHttpHeaders.map(b => Doc.text(s"$b"))
+        ),
+      ).sortBy(_._1)


I guess this was copied, but we don't need to sort a list of length 1. You can leave it if you add a comment saying it is there so we don't forget to sort when we add more options.

johnynek · 2024-06-04T20:03:10Z

src/scala/com/github/johnynek/bazel_deps/DepsModel.scala

+
+      val items = List(
+        (
+          "hashInHttpHeaders",


I'm not really sure why we wouldn't always want to enable this if it works. Can you think of a reason? Can you document the reason here so someone reading the code can remember?

Unfortunately, it doesn't work with all Maven repository servers. E.g., https://repo1.maven.org/maven2/ only supplies SHA-1 checksums, so it would be a waste to enable this option if that's the repo that's getting used.

makes sense.

johnynek · 2024-06-04T20:03:49Z

src/scala/com/github/johnynek/bazel_deps/MakeDeps.scala

        )
      case g: ResolverType.Gradle =>
        val ec = scala.concurrent.ExecutionContext.Implicits.global
        import scala.concurrent.duration._

        lazy val coursierResolver =
-          new CoursierResolver(model.getOptions.getResolvers, ec, 3600.seconds, resolverCachePath)
+          new CoursierResolver(model.getOptions.getResolvers, false, ec, 3600.seconds, resolverCachePath)


same comment as above: if this works, shouldn't true be the default? Why would you set false? Maybe if you know with certainty your http service doesn't give out the headers?

johnynek · 2024-06-04T20:11:41Z

src/scala/com/github/johnynek/bazel_deps/CoursierResolver.scala

+                        .firstValue(digestType match {
+                          // See also https://maven.apache.org/resolver/expected-checksums.html#non-standard-x-headers
+                          case DigestType.Sha1 => "x-checksum-sha1"
+                          case DigestType.Sha256 => "x-checksum-sha256"


I wonder if this should be moved to a method on DigestType? so this function is: headers.firstValue(digestType.mavenHeader) or something?

Also what about x-goog-meta-checksum-sha1 and x-goog-meta-checksum-md5?

Maybe in fact we need to do something like:

    sealed trait DigestType {
      def getDigestInstance: MessageDigest
      def expectedHexLength: Int
      def name: String
      def headerKeys: List[String]
    }

then do:

    Try {
      val digest = digestType
        .headerKeys
        .flatMap(key => headers.firstValue(key).asScala)
        .headOption
        .getOrElse(throw new Recoverable(new RuntimeException(s"no ${digestType} found in headers in $headersPath")))
      val len = headers
        .firstValueAsLong("Content-Length")
        .orElseThrow(() => Recoverable(new RuntimeException(s"no Content-Length found in headers $headersPath")))
      (digest, len)
    }
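
A self-contained sketch of the headerKeys idea, for illustration only. It assumes Java 11+ for `java.net.http.HttpHeaders`; `Recoverable`, `headersOf`, and `digestAndLength` are hypothetical names, the `DigestType` here is trimmed to the two members the lookup needs, and the MD5 header is left out because no MD5 digest type appears in this PR.

```scala
import java.net.http.HttpHeaders

object HeaderDigests {
  // Stand-in for the PR's Recoverable wrapper (assumed shape).
  final case class Recoverable(cause: Throwable) extends Exception(cause)

  sealed trait DigestType {
    def name: String
    def headerKeys: List[String] // candidate header names, in priority order
  }
  case object Sha1 extends DigestType {
    val name = "SHA-1"
    val headerKeys = List("x-checksum-sha1", "x-goog-meta-checksum-sha1")
  }
  case object Sha256 extends DigestType {
    val name = "SHA-256"
    val headerKeys = List("x-checksum-sha256")
  }

  // Helper for building headers from single-valued pairs (hypothetical).
  def headersOf(pairs: (String, String)*): HttpHeaders = {
    val m = new java.util.HashMap[String, java.util.List[String]]()
    pairs.foreach { case (k, v) => m.put(k, java.util.Collections.singletonList(v)) }
    HttpHeaders.of(m, (_: String, _: String) => true)
  }

  // Returns the first matching checksum header together with Content-Length.
  def digestAndLength(
      headers: HttpHeaders,
      digestType: DigestType
  ): Either[Recoverable, (String, Long)] = {
    val digest = digestType.headerKeys
      .flatMap(key => Option(headers.firstValue(key).orElse(null)).toList)
      .headOption
    val rawLen = headers.firstValueAsLong("Content-Length")
    val len = if (rawLen.isPresent) Some(rawLen.getAsLong) else None
    (digest, len) match {
      case (Some(d), Some(l)) => Right((d, l))
      case (None, _) =>
        Left(Recoverable(new RuntimeException(s"no ${digestType.name} checksum found in headers")))
      case (_, None) =>
        Left(Recoverable(new RuntimeException("no Content-Length found in headers")))
    }
  }
}
```

With headers built by `headersOf("x-goog-meta-checksum-sha1" -> "abc", "Content-Length" -> "42")`, a lookup for `Sha1` returns `Right(("abc", 42L))` and a lookup for `Sha256` returns a `Left`, which the caller can treat as the signal to fall back to downloading.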

johnynek · 2024-06-04T20:13:32Z

also it looks like the tests fail to build:

 test/scala/com/github/johnynek/bazel_deps/ModelGenerators.scala:126: error: type mismatch;
 found   : Option[Serializable]
 required: Option[com.github.johnynek.bazel_deps.ResolverType]
    resolverType,
    ^
one error found
Build failed

ideally we could generate Coursier with and without the option set to test the parsing and formatting code.
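
One way to picture that round-trip check, as a sketch under heavy assumptions: the `ResolverType` below is a simplified stand-in for the real model, `format` and `parse` are hypothetical one-line encoders rather than the project's YAML code, and the cases are enumerated by hand instead of generated with ScalaCheck.

```scala
object ResolverOptionRoundTrip {
  // Simplified stand-in for the real ResolverType; only the shape matters here.
  sealed trait ResolverType
  case object Aether extends ResolverType
  final case class Coursier(hashInHttpHeaders: Option[Boolean]) extends ResolverType

  // Hypothetical formatter: one line per resolver type.
  def format(rt: ResolverType): String = rt match {
    case Aether            => "aether"
    case Coursier(None)    => "coursier"
    case Coursier(Some(b)) => s"coursier hashInHttpHeaders=$b"
  }

  // Hypothetical parser: the inverse of format.
  def parse(s: String): Option[ResolverType] = s.split(' ').toList match {
    case "aether" :: Nil                                => Some(Aether)
    case "coursier" :: Nil                              => Some(Coursier(None))
    case "coursier" :: "hashInHttpHeaders=true" :: Nil  => Some(Coursier(Some(true)))
    case "coursier" :: "hashInHttpHeaders=false" :: Nil => Some(Coursier(Some(false)))
    case _                                              => None
  }

  // The explicit element type pins the list to ResolverType instead of
  // leaving the element type to inference.
  val samples: List[ResolverType] =
    List(Aether, Coursier(None), Coursier(Some(true)), Coursier(Some(false)))

  // Every sample must survive format followed by parse unchanged.
  def roundTrips: Boolean = samples.forall(rt => parse(format(rt)).contains(rt))
}
```

Covering `Coursier(None)` as well as both `Some` values exercises the option when it is absent, enabled, and disabled.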

Speed up bazel-deps by using HEAD requests to fetch SHA256 (instead of checksumming from network and the disk cache) · ec9a954

prefer b over a · 71b453c
