-
Notifications
You must be signed in to change notification settings - Fork 8
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add support for reading files from s3 #5
base: master
Are you sure you want to change the base?
Conversation
@@ -27,15 +27,35 @@ def initialize(&block) | |||
block.call(self) | |||
end | |||
|
|||
def load_data_bucket | |||
return nil unless ENV['AWS_DATA_BUCKET'] | |||
s3 = AWS::S3.new(region: ENV['AWS_DATA_BUCKET_REGION']) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is aws-sdk v1, which we get from skillet. Not ideal, but I think better than adding an explicit dependency here since there aren't any in schlepp's gemfile.
paths = data_bucket ? object_glob([ path ]) : Dir.glob(path) | ||
|
||
paths.each do |path| | ||
path = copy_s3_to_tmp(path) if data_bucket && !ENV['NOIMAGES'] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think skipping the local copy will be fine if ENV['NOIMAGES']
is set. This isn't used anywhere that it won't be skipped anyway. We definitely don't want to download image files from s3 if we are skipping images.
@lyleunderwood This is ready for review. If we move the logic to copy s3 binaries to a tmp directory into here, then we can actually keep the changes entirely within schlepp. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So this won't help with any code in skillet that assumes that files exist in the data
directory right?
Like this: https://github.com/ElasticSuite/skillet/blob/6d5893e10f2816e6758e5b16e4fe55a0707f904a/lib/elastic/import/burdens/catalogs.rb#L109
def glob_to_regexp(glob) | ||
escaped = Regexp.escape(glob) | ||
.sub('\*\*', '.*') | ||
.sub('\*', '(?:(?!\\/).)*') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like this is only supporting *
and **
? We've got other glob patterns like this one:
https://github.com/ElasticSuite/spyder-spice/blob/04ed000f0fd3883c86e40ee1071428aff459d5fe/import/inventory_import.rb#L93-L97
I wonder if there's a general glob 2 regex library we can use here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess we only need to support what Dir.glob
supports: https://ruby-doc.org/core-2.1.0/Dir.html#method-c-glob
This might work, looks like it supports everything but []
(but that works the same as regex so it's probably ok): https://github.com/alexch/rerun/blob/master/lib/rerun/glob.rb
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, good call. Thought we could get away with just *
and **
. I saw a promising glob to regex written in node.
Doesn't the one you linked get *
and **
mixed up? I thought *
shouldn't match subdirectories infinitely, but that one's replacing it with .*
. If it did, we wouldn't be able to move images to imported
without them being picked up anyway.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I don't necessarily trust that guy's code. It's likely that it's full of bugs.
Doesn't everything expect that files are in |
Yes? Are we mounting the s3 bucket as a filesystem at |
Ooooooh I see what you're saying. Uhhhh, crap. I think we will need to make spice changes for those cases. I knew it was too good to be true. |
Wait, when those things are called are we still in the context of a burden? Can we add a method to the burden |
I think so. The sales program import expects a filename. Not sure if we can modify it to pass an IO object o something. There may be other places that deal with paths/filenames in spice burdens, but I don't know off the top of my head. |
Ah, sales program import doesn't use schlepp to parse. It looks like |
No description provided.