Hey folks, I'm trying to read some Common Crawl data from S3... See archivesunleashed/aut#556, where I'm using the aut pattern, but I get the same symptom using Sparkling by itself:
bin/spark-shell --jars ~/dev/Sparkling/target/scala-2.12/sparkling-assembly-0.3.8-SNAPSHOT.jar --packages com.amazonaws:aws-java-sdk:1.12.662,org.apache.hadoop:hadoop-aws:3.3.4
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://192.168.0.73:4040
Spark context available as 'sc' (master = local[*], app id = local-1708561219560).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.4.2
      /_/
Using Scala version 2.12.17 (OpenJDK 64-Bit Server VM, Java 11.0.21)
Type in expressions to have them evaluated.
Type :help for more information.
scala> import org.archive.webservices.sparkling._, org.archive.webservices.sparkling.warc._, org.archive.webservices.sparkling.io._
import org.archive.webservices.sparkling._
import org.archive.webservices.sparkling.warc._
import org.archive.webservices.sparkling.io._
scala> val warcs = WarcLoader.load("s3a://commoncrawl/crawl-data/CC-MAIN-2023-50/segments/1700679518883.99/wat/CC-MAIN-20231211210408-20231212000408-00881.warc.wat.gz")
24/02/21 16:20:25 WARN MetricsConfig: Cannot locate configuration: tried hadoop-metrics2-s3a-file-system.properties,hadoop-metrics2.properties
java.lang.IllegalArgumentException: Wrong FS: s3a://commoncrawl/crawl-data/CC-MAIN-2023-50/segments/1700679518883.99/wat/CC-MAIN-20231211210408-20231212000408-00881.warc.wat.gz, expected: file:///
at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:807)
at org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:105)
at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:774)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:1100)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:769)
at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:462)
at org.apache.hadoop.fs.Globber.getFileStatus(Globber.java:115)
at org.apache.hadoop.fs.Globber.doGlob(Globber.java:349)
at org.apache.hadoop.fs.Globber.glob(Globber.java:202)
at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:2124)
at org.archive.webservices.sparkling.io.HdfsIO.files(HdfsIO.scala:163)
at org.archive.webservices.sparkling.util.RddUtil$.loadFilesLocality(RddUtil.scala:74)
at org.archive.webservices.sparkling.util.RddUtil$.loadBinary(RddUtil.scala:125)
at org.archive.webservices.sparkling.warc.WarcLoader$.load(WarcLoader.scala:56)
... 53 elided
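My read of the stack trace (not verified against the Sparkling source) is that HdfsIO.files globs against Hadoop's default FileSystem rather than resolving one from the path's scheme, so the s3a:// URL is checked against file:/// and rejected. A minimal sketch of that distinction, assuming a plain Hadoop Configuration:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val conf = new Configuration()
val p = new Path("s3a://commoncrawl/crawl-data/CC-MAIN-2023-50/...") // path shortened here

// The default filesystem is whatever fs.defaultFS points at (file:/// out of the box),
// so its checkPath() rejects an s3a:// path with "Wrong FS".
val defaultFs: FileSystem = FileSystem.get(conf)

// Scheme-aware resolution instead picks the S3A filesystem from the path's URI
// (requires hadoop-aws on the classpath, as in the --packages flag above).
val s3aFs: FileSystem = p.getFileSystem(conf)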
acruise changed the title to "s3a URLs don't work in WarcLoader (Wrong FS: s3a://...)" on Feb 22, 2024
You're right, Sparkling wasn't designed for S3 in the first place, but for HDFS. It might in fact work with the S3 adapters set up properly in Hadoop, but I'm not sure, and this would be untested. According to your recent edit, it sounds like it actually did, though?
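If you do go the Hadoop route, "S3 adapters set up properly" would roughly mean pointing the s3a connector at credentials before touching the path. A minimal, untested sketch for the spark-shell session above (the key values are placeholders, and the anonymous provider is only an option if the bucket allows unauthenticated reads):

// Credentialed access (placeholder values):
sc.hadoopConfiguration.set("fs.s3a.access.key", "YOUR_ACCESS_KEY")
sc.hadoopConfiguration.set("fs.s3a.secret.key", "YOUR_SECRET_KEY")

// Or, for buckets that permit anonymous reads:
sc.hadoopConfiguration.set("fs.s3a.aws.credentials.provider",
  "org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider")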
However, there's also an S3 client built into Sparkling, and the pattern for loading WARCs with it would be slightly different (also untested, but it should work this way). Here's an example that prints all URLs in a WARC file:
import $ivy.`com.amazonaws:aws-java-sdk:1.7.4` // you'll need Amazon's AWS SDK 1.7.4 in your classpath; this is the directive if you run it in a Jupyter notebook with Almond, as I usually do

import org.archive.webservices.sparkling.io._
import org.archive.webservices.sparkling.warc._

S3Client(accessKey, secretKey).access { s3 =>
  s3.open("commoncrawl", "crawl-data/CC-MAIN-2023-50/segments/1700679518883.99/wat/CC-MAIN-20231211210408-20231212000408-00881.warc.wat.gz") { in =>
    WarcLoader.load(in).flatMap(_.url).foreach(println)
  }
}
Also, please note that you're using Sparkling's WARC loader for WAT files here. This works because WAT uses WARC as its container format, but the payload is not an HTTP message as you'd expect in "regular" WARC files.
EDIT: this helped with the Wrong FS error, more tickets incoming ;)

sc.hadoopConfiguration.set("fs.defaultFS", "s3a://commoncrawl/")
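The same setting should also be settable at shell startup, since Spark copies spark.hadoop.* properties into the Hadoop configuration; this should be equivalent to the sc.hadoopConfiguration call above:

bin/spark-shell --jars ~/dev/Sparkling/target/scala-2.12/sparkling-assembly-0.3.8-SNAPSHOT.jar \
  --packages com.amazonaws:aws-java-sdk:1.12.662,org.apache.hadoop:hadoop-aws:3.3.4 \
  --conf spark.hadoop.fs.defaultFS=s3a://commoncrawl/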