Add new Directory implementation for AWS S3 #13949
base: main
Conversation
Thanks for contributing this!
I'm surprised at the number of dependencies, does the S3 SDK actually have this many dependencies? E.g. https://central.sonatype.com/artifact/software.amazon.awssdk/s3/dependencies suggests that netty-transport is a test-only dependency? (or is it also a transitive dependency which is why you had to add it here?)
Regarding testing, a quick search suggests that there are mocks for S3 that avoid connecting to an actual S3 bucket, would it work? https://stackoverflow.com/questions/6615988/how-to-mock-amazon-s3-in-an-integration-test
I left some other comments from a quick pass over the code, it would be nice to make the code conform to some Lucene expectations, such as no logging.
moduleTestImplementation project(':lucene:test-framework')
moduleTestImplementation project(':lucene:analysis:common')
moduleTestImplementation project(':lucene:backward-codecs')
moduleTestImplementation project(':lucene:queryparser')
We shouldn't need to depend on the analysis-common, backward-codecs and queryparser modules for tests. For analysis, our tests usually use MockAnalyzer rather than whitespace or keyword analyzers.
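For example, a test in the new module could build its IndexWriterConfig with MockAnalyzer from the test framework, removing the analysis-common dependency. A sketch (the test class name and setup are illustrative):

```java
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.tests.analysis.MockAnalyzer;
import org.apache.lucene.tests.util.LuceneTestCase;

public class TestS3DirectoryIndexing extends LuceneTestCase {
  public void testBasicIndexing() throws Exception {
    // MockAnalyzer comes from lucene-test-framework, so the
    // analysis-common test dependency can be dropped entirely.
    IndexWriterConfig config = new IndexWriterConfig(new MockAnalyzer(random()));
    // ... open the directory under test and index documents with config ...
  }
}
```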
ramDirectory = new MMapDirectory(FileSystems.getDefault().getPath("target/index"));
fsDirectory = FSDirectory.open(FileSystems.getDefault().getPath("target/index"));
} catch (IOException ex) {
logger.log(Level.SEVERE, null, ex);
Let's throw this exception instead of swallowing it, we should fail the test suite if this throws an exception?
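A sketch of the suggested pattern with plain JDK types (the field and method names here are illustrative): declare the checked exception instead of catching it, so a setup failure surfaces as a test failure rather than a silent log line.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class SetupExample {
  Path indexPath;

  // Declaring "throws IOException" instead of catch-and-log lets the
  // test runner report the failure and fail the suite.
  public void setUp() throws IOException {
    indexPath = Files.createTempDirectory("index");
  }
}
```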
}
iwriter.forceMerge(1, true);
} catch (Exception e) {
logger.log(Level.SEVERE, null, e);
likewise here, let's not swallow exceptions in general
// Parse a simple query that searches for "text":
final QueryParser parser = new QueryParser("fieldname", analyzer);
final Query query = parser.parse("text");
Let's create the query manually instead and remove the dependency on the query parser module? This could be:
final Query query = new TermQuery(new Term("fieldname", "text"));
import org.junit.Assert;
import org.junit.Test;

public class TestS3Directory extends LuceneTestCase {
Could we make this test extend BaseDirectoryTestCase? It has a few tests for various things that directories are supposed to be able to do, like cloning and slicing.
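A sketch of what that could look like (the S3Directory constructor arguments here are hypothetical):

```java
import java.io.IOException;
import java.nio.file.Path;
import org.apache.lucene.store.Directory;
import org.apache.lucene.tests.store.BaseDirectoryTestCase;

public class TestS3Directory extends BaseDirectoryTestCase {
  @Override
  protected Directory getDirectory(Path path) throws IOException {
    // Hypothetical constructor; the inherited tests then exercise the
    // full Directory contract: cloning, slicing, pending deletions, etc.
    return new S3Directory("test-bucket", path.toString());
  }
}
```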
*/
public class S3Directory extends Directory {

private static final Logger logger = Logger.getLogger(S3Directory.class.getName());
As a library, Lucene never logs, so let's remove this logger and make sure we propagate exceptions when appropriate?
private void initialize(final String bucket, final String path, LockFactory lockFactory) {
this.s3 = S3Client.create();
this.bucket = bucket.toLowerCase(Locale.getDefault());
Nit: let's not rely on the default locale. I would expect our forbidden-apis build task to fail the build because of this.
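The reason forbidden-apis flags this: locale-sensitive lowercasing can corrupt ASCII identifiers. A minimal sketch of the fix using Locale.ROOT (the class and method names here are illustrative):

```java
import java.util.Locale;

public class BucketName {
  // In a Turkish locale, "I" lowercases to a dotless "ı", so a bucket
  // name like "MyInstance" would silently change. Locale.ROOT keeps
  // the conversion locale-independent.
  static String normalize(String bucket) {
    return bucket.toLowerCase(Locale.ROOT);
  }
}
```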
return true;
} catch (Exception e) {
logger.log(Level.WARNING, null, e);
return false;
E.g. here we should not swallow the exception and log, but rather propagate it, possibly wrapped inside of an IOException (if the method signature allows it) or UncheckedIOException otherwise.
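A sketch of both variants with plain JDK types (the method names and the probe() stand-in for an SDK call are hypothetical):

```java
import java.io.IOException;
import java.io.UncheckedIOException;

public class Exceptions {
  // When the signature allows IOException, wrap and rethrow:
  static boolean fileExists(String name) throws IOException {
    try {
      return probe(name);
    } catch (RuntimeException e) {
      throw new IOException("failed to check " + name, e);
    }
  }

  // When it doesn't (e.g. an override with no throws clause), use
  // UncheckedIOException so the failure still propagates to the caller:
  static boolean fileExistsUnchecked(String name) {
    try {
      return fileExists(name);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  // Hypothetical stand-in for an SDK call that may fail at runtime.
  private static boolean probe(String name) {
    if (name.isEmpty()) throw new IllegalStateException("no key");
    return true;
  }
}
```

UncheckedIOException is the standard JDK wrapper for exactly this situation, and callers can still unwrap the original IOException via getCause().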
.getS3()
.getObject(b -> b.bucket(s3Directory.getBucket()).key(getPath() + name));

synchronized (this) {
Synchronization shouldn't be needed as IndexInputs should be consumed from a single thread anyway?
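Lucene's threading model for IndexInput is that each instance is confined to one thread; concurrent readers call clone() instead of locking. A sketch (the file name and surrounding context are illustrative):

```java
// One IndexInput per thread; clone() is cheap and shares the
// underlying resources while keeping an independent file position.
IndexInput main = dir.openInput("_0.cfs", IOContext.DEFAULT);
IndexInput forOtherThread = main.clone();
forOtherThread.seek(0); // independent position, no synchronization needed
```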
* A simple base class that performs index output memory based buffering. The buffer size is
* configurable.
*/
// NEED TO BE MONITORED AGAINST LUCENE
This is Lucene now. :)
Have we explored trying a fuse-based mount instead of writing Java code? There seem to be several potentially viable approaches: https://github.com/s3fs-fuse/s3fs-fuse, and maybe the higher-performance https://github.com/kahing/goofys + https://github.com/kahing/catfs
Amazon S3 also offers its own solution for mounting S3 buckets, available at https://github.com/awslabs/mountpoint-s3/. There are throughput benchmarks at https://github.com/awslabs/mountpoint-s3/blob/main/doc/BENCHMARKING.md
@rmuir @josefschiefer27 I haven't yet tried experimenting with S3 fuse mounts, but I doubt the performance will be any better than using the S3 API directly. Also, there are some specific API calls around the legal hold locks used in this project as a locking mechanism; I don't know whether those are supported by the libraries you mentioned.
Just to share similar developments: we did an S3 filesystem plugin store prototype for OpenSearch based on the same library [1]. It works, but latency-wise it is not great (obviously, no caching etc. has been done). [1] https://github.com/reta/OpenSearch/tree/s3.fs/plugins/store-s3
Since we have an S3 Directory implemented already, let's run a comparison with the fuse mount approach?
Description
This PR adds a new module s3directory to Lucene, containing a new Directory implementation for AWS S3. The code was adapted from the lucene-s3-directory project, as requested by @jpountz in #13868
Currently, there are a few issues which cause the build to fail:
module-info.java