
Add new Directory implementation for AWS S3 #13949

Open · wants to merge 1 commit into main

Conversation

albogdano

Description

This PR adds a new module s3directory to Lucene, containing a new Directory implementation for AWS S3.
The code was adapted from the lucene-s3-directory project, as requested by @jpountz in #13868.

Currently, there are a few issues that cause the build to fail:

  • the AWS SDK module is not found or cannot be resolved in module-info.java
  • the tests require an AWS account and valid credentials to execute

@jpountz (Contributor) left a comment:

Thanks for contributing this!

I'm surprised at the number of dependencies. Does the S3 SDK actually have this many? E.g. https://central.sonatype.com/artifact/software.amazon.awssdk/s3/dependencies suggests that netty-transport is a test-only dependency (or is it also a transitive dependency, which is why you had to add it here)?

Regarding testing, a quick search suggests that there are mocks for S3 that avoid connecting to an actual S3 bucket; would that work? https://stackoverflow.com/questions/6615988/how-to-mock-amazon-s3-in-an-integration-test

I left some other comments from a quick pass over the code; it would be nice to make the code conform to some Lucene expectations, such as no logging.

moduleTestImplementation project(':lucene:test-framework')
moduleTestImplementation project(':lucene:analysis:common')
moduleTestImplementation project(':lucene:backward-codecs')
moduleTestImplementation project(':lucene:queryparser')
Contributor:

We shouldn't need to depend on the analysis-common, backward-codecs and queryparser modules for tests. For analysis, our tests usually use MockAnalyzer rather than whitespace or keyword analyzers.

ramDirectory = new MMapDirectory(FileSystems.getDefault().getPath("target/index"));
fsDirectory = FSDirectory.open(FileSystems.getDefault().getPath("target/index"));
} catch (IOException ex) {
logger.log(Level.SEVERE, null, ex);
Contributor:

Let's throw this exception instead of swallowing it; we should fail the test suite if this throws an exception?
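The fix being suggested is simply to let the checked exception escape the setup method so the test framework reports the failure, rather than catching and logging it. A minimal stdlib sketch of that pattern (the openIndexDir method and directory layout are hypothetical, not the PR's actual code):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class SetupPropagation {
  // Suggested pattern: declare `throws IOException` and let the caller
  // (here, the test framework) fail loudly instead of catch-and-log,
  // which would let the suite run against a broken directory.
  static Path openIndexDir(Path base) throws IOException {
    return Files.createDirectories(base.resolve("index"));
  }
}
```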

}
iwriter.forceMerge(1, true);
} catch (Exception e) {
logger.log(Level.SEVERE, null, e);
Contributor:

Likewise here; let's not swallow exceptions in general.

// Parse a simple query that searches for "text":

final QueryParser parser = new QueryParser("fieldname", analyzer);
final Query query = parser.parse("text");
Contributor:

Let's create the query manually instead and remove the dependency on the query parser module? This could be:

final Query query = new TermQuery(new Term("fieldname", "text"));

import org.junit.Assert;
import org.junit.Test;

public class TestS3Directory extends LuceneTestCase {
Contributor:

Could we make this test extend BaseDirectoryTestCase? It has a few tests for various things that directories are supposed to be able to do, like cloning and slicing.

*/
public class S3Directory extends Directory {

private static final Logger logger = Logger.getLogger(S3Directory.class.getName());
Contributor:

As a library, Lucene never logs, so let's remove this logger and make sure we propagate exceptions when appropriate?


private void initialize(final String bucket, final String path, LockFactory lockFactory) {
this.s3 = S3Client.create();
this.bucket = bucket.toLowerCase(Locale.getDefault());
Contributor:

Nit: let's not rely on the default locale. I would expect our forbidden-apis build task to fail the build because of this.
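The forbidden-apis concern here is concrete: String.toLowerCase with the default locale produces locale-dependent results. A small stdlib sketch of the classic Turkish-locale pitfall (the bucket name is made up for illustration):

```java
import java.util.Locale;

public class BucketLowercase {
  // Locale-sensitive lowercasing: under a Turkish locale, 'I' maps to
  // dotless 'ı', so the derived name no longer matches the bucket on S3.
  static String lowerTurkish(String s) {
    return s.toLowerCase(new Locale("tr"));
  }

  // Locale-independent lowercasing: Locale.ROOT always yields the ASCII
  // result, which is what forbidden-apis steers Lucene code toward.
  static String lowerRoot(String s) {
    return s.toLowerCase(Locale.ROOT);
  }
}
```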

return true;
} catch (Exception e) {
logger.log(Level.WARNING, null, e);
return false;
Contributor:

E.g. here we should not swallow the exception and log, but rather propagate it, possibly wrapped inside of an IOException (if the method signature allows it) or UncheckedIOException otherwise.
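The wrapping pattern being suggested looks roughly like this; the exists method and its underlying S3 call are hypothetical stand-ins, not the PR's actual code:

```java
import java.io.IOException;
import java.io.UncheckedIOException;

public class PropagateNotSwallow {
  // Hypothetical stand-in for an S3 HEAD request that may fail.
  static void headObject(boolean unreachable) throws IOException {
    if (unreachable) throw new IOException("bucket not reachable");
  }

  // Instead of catch-log-return-false, wrap the checked exception so the
  // caller still sees the failure even though the method signature has
  // no `throws IOException` clause.
  static boolean exists(boolean unreachable) {
    try {
      headObject(unreachable);
      return true;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```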

.getS3()
.getObject(b -> b.bucket(s3Directory.getBucket()).key(getPath() + name));

synchronized (this) {
Contributor:

Synchronization shouldn't be needed as IndexInputs should be consumed from a single thread anyway?
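For context, the convention being referenced (hedged summary) is that an IndexInput instance is consumed by a single thread, and concurrent readers obtain independent cursors by cloning. The idea can be sketched with a toy stdlib reader; the Reader class below is illustrative, not Lucene's API:

```java
public class CursorPerThread {
  // Toy positional reader over shared, immutable bytes. Each consumer
  // works on its own clone with an independent position, so no
  // synchronization of reads is needed.
  static class Reader {
    final byte[] data;
    int pos;

    Reader(byte[] data) {
      this.data = data;
    }

    byte readByte() {
      return data[pos++];
    }

    // A clone shares the bytes but carries its own cursor.
    Reader cloneReader() {
      Reader r = new Reader(data);
      r.pos = pos;
      return r;
    }
  }
}
```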

* A simple base class that performs index output memory-based buffering. The buffer size is
* configurable.
*/
// NEED TO BE MONITORED AGAINST LUCENE
Contributor:

This is Lucene now. :)

@rmuir
Member

rmuir commented Nov 2, 2024

Have we explored trying a fuse-based mount instead of writing Java code?

There seem to be several potentially viable approaches there: https://github.com/s3fs-fuse/s3fs-fuse, or possibly the higher-performance https://github.com/kahing/goofys + https://github.com/kahing/catfs.

@josefschiefer27

Amazon S3 also offers its own solution for mounting S3 buckets, available at https://github.com/awslabs/mountpoint-s3/.

There are benchmarks at https://github.com/awslabs/mountpoint-s3/blob/main/doc/BENCHMARKING.md

[Benchmark charts: throughput, throughput with cache, throughput on S3 Express; latency, latency on S3 Express]

@albogdano
Author

@rmuir @josefschiefer27 I haven't yet tried experimenting with S3 fuse mounts, but I doubt the performance would be any better than using the S3 API directly. Also, this project uses some specific API calls around legal hold locks as its locking mechanism; I don't know whether those are supported by the libraries you mentioned.
Right now I'm focused on minimizing dependencies by integrating an alternative S3 client (this lightweight AWS client). Ideally, IMHO, we should have a lightweight S3 Directory implementation that is suitable for serverless deployments.

@reta
Member

reta commented Nov 4, 2024

Just to share a similar development: we built a prototype S3 filesystem store plugin for OpenSearch based on the same library [1]. It works, but latency-wise it is not great (obviously, no caching etc. has been done).

[1] https://github.com/reta/OpenSearch/tree/s3.fs/plugins/store-s3

@jpountz
Contributor

jpountz commented Nov 13, 2024

Since we have an S3 Directory implemented already, let's run a comparison with the fuse-mount approach?

5 participants