
s3a URLs don't work in WarcLoader (Wrong FS: s3a://...) #3

Open
acruise opened this issue Feb 22, 2024 · 1 comment

Comments


acruise commented Feb 22, 2024

EDIT: this helped with the Wrong FS error; more tickets incoming ;)

sc.hadoopConfiguration.set("fs.defaultFS", "s3a://commoncrawl/")
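For reference, the Wrong FS workaround above can be combined with the S3A connector's own `fs.s3a.*` settings. A sketch of a fuller Spark-shell setup (untested in this thread; these are standard hadoop-aws property keys, and the anonymous credentials provider suits a public bucket like Common Crawl):

```scala
// Spark shell session fragment: configure the S3A connector explicitly,
// in addition to switching fs.defaultFS.
sc.hadoopConfiguration.set("fs.defaultFS", "s3a://commoncrawl/")

// Common Crawl is a public bucket, so anonymous credentials suffice:
sc.hadoopConfiguration.set(
  "fs.s3a.aws.credentials.provider",
  "org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider")

// For private buckets, supply keys instead:
// sc.hadoopConfiguration.set("fs.s3a.access.key", "...")
// sc.hadoopConfiguration.set("fs.s3a.secret.key", "...")
```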

Hey folks, I'm trying to read some Common Crawl data from S3. See archivesunleashed/aut#556, where I'm using the aut pattern, but I get the same symptom using Sparkling by itself:

bin/spark-shell --jars ~/dev/Sparkling/target/scala-2.12/sparkling-assembly-0.3.8-SNAPSHOT.jar --packages com.amazonaws:aws-java-sdk:1.12.662,org.apache.hadoop:hadoop-aws:3.3.4

To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://192.168.0.73:4040
Spark context available as 'sc' (master = local[*], app id = local-1708561219560).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.4.2
      /_/
         
Using Scala version 2.12.17 (OpenJDK 64-Bit Server VM, Java 11.0.21)
Type in expressions to have them evaluated.
Type :help for more information.

scala> import org.archive.webservices.sparkling._, org.archive.webservices.sparkling.warc._, org.archive.webservices.sparkling.io._
import org.archive.webservices.sparkling._
import org.archive.webservices.sparkling.warc._
import org.archive.webservices.sparkling.io._

scala> val warcs = WarcLoader.load("s3a://commoncrawl/crawl-data/CC-MAIN-2023-50/segments/1700679518883.99/wat/CC-MAIN-20231211210408-20231212000408-00881.warc.wat.gz")
24/02/21 16:20:25 WARN MetricsConfig: Cannot locate configuration: tried hadoop-metrics2-s3a-file-system.properties,hadoop-metrics2.properties
java.lang.IllegalArgumentException: Wrong FS: s3a://commoncrawl/crawl-data/CC-MAIN-2023-50/segments/1700679518883.99/wat/CC-MAIN-20231211210408-20231212000408-00881.warc.wat.gz, expected: file:///
  at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:807)
  at org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:105)
  at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:774)
  at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:1100)
  at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:769)
  at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:462)
  at org.apache.hadoop.fs.Globber.getFileStatus(Globber.java:115)
  at org.apache.hadoop.fs.Globber.doGlob(Globber.java:349)
  at org.apache.hadoop.fs.Globber.glob(Globber.java:202)
  at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:2124)
  at org.archive.webservices.sparkling.io.HdfsIO.files(HdfsIO.scala:163)
  at org.archive.webservices.sparkling.util.RddUtil$.loadFilesLocality(RddUtil.scala:74)
  at org.archive.webservices.sparkling.util.RddUtil$.loadBinary(RddUtil.scala:125)
  at org.archive.webservices.sparkling.warc.WarcLoader$.load(WarcLoader.scala:56)
  ... 53 elided
@acruise acruise changed the title s3a URLs don't work in WarcLoader s3a URLs don't work in WarcLoader (Wrong FS: s3a://...) Feb 22, 2024
@helgeho
Contributor

helgeho commented Feb 23, 2024

Hi Alex,

You're right: Sparkling was designed for HDFS in the first place, not S3. It might in fact work with the S3 adapters set up properly in Hadoop, but I'm not sure; this would be untested. According to your recent edit, it sounds like it actually did, though?

However, there's also an S3 client built into Sparkling; the pattern for loading WARCs with it is slightly different. (Also untested, but it should work this way. Here's an example that prints all URLs in a WARC file.)

import $ivy.`com.amazonaws:aws-java-sdk:1.7.4` // you'll need Amazon's AWS SDK 1.7.4 on your classpath; this is the directive if you run it in a Jupyter notebook with Almond, as I usually do

import org.archive.webservices.sparkling.io._
import org.archive.webservices.sparkling.warc._

S3Client(accessKey, secretKey).access { s3 =>
    s3.open("commoncrawl", "crawl-data/CC-MAIN-2023-50/segments/1700679518883.99/wat/CC-MAIN-20231211210408-20231212000408-00881.warc.wat.gz") { in =>
        WarcLoader.load(in).flatMap(_.url).foreach(println)
    }
}
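Note that `S3Client(...).access { s3 => s3.open(bucket, key) { ... } }` takes the bucket and key separately rather than an `s3a://` URL. If you're starting from such a URL, a small hypothetical helper (plain Scala, no Sparkling or AWS dependencies) can split it into the two arguments:

```scala
// Hypothetical helper: split an "s3a://bucket/key/path" URL into the
// (bucket, key) pair that the open call above expects.
def splitS3aUrl(url: String): (String, String) = {
  val withoutScheme = url.stripPrefix("s3a://")
  val slash = withoutScheme.indexOf('/')
  require(slash > 0, s"not a bucket/key URL: $url")
  (withoutScheme.take(slash), withoutScheme.drop(slash + 1))
}
```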

Also, please note that you're using Sparkling's WARC loader for WAT files here. This works because WAT uses WARC as its container format, but the payload is not an HTTP message, as you'd expect in "regular" WARC files.
