Add support for reading files from s3 #5

Open · cschauer wants to merge 2 commits into master

Conversation

cschauer (Author):

No description provided.

@@ -27,15 +27,35 @@ def initialize(&block)
    block.call(self)
  end

  def load_data_bucket
    return nil unless ENV['AWS_DATA_BUCKET']
    s3 = AWS::S3.new(region: ENV['AWS_DATA_BUCKET_REGION'])
cschauer (Author):

This is aws-sdk v1, which we get from skillet. Not ideal, but I think it's better than adding an explicit dependency here, since there aren't any in schlepp's gemfile.
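
For context, a rough sketch of the v1 vs. v2 difference (the bucket lookup lines are illustrative, not the PR's code):

    # aws-sdk v1, as provided transitively by skillet:
    s3 = AWS::S3.new(region: ENV['AWS_DATA_BUCKET_REGION'])
    bucket = s3.buckets[ENV['AWS_DATA_BUCKET']]

    # The same thing in aws-sdk v2 would use the Aws:: namespace instead:
    # s3 = Aws::S3::Resource.new(region: ENV['AWS_DATA_BUCKET_REGION'])
    # bucket = s3.bucket(ENV['AWS_DATA_BUCKET'])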

    paths = data_bucket ? object_glob([ path ]) : Dir.glob(path)

    paths.each do |path|
      path = copy_s3_to_tmp(path) if data_bucket && !ENV['NOIMAGES']
cschauer (Author) commented on Jan 31, 2019:

I think skipping the local copy will be fine if ENV['NOIMAGES'] is set; this isn't used anywhere that wouldn't be skipped anyway. We definitely don't want to download image files from s3 if we are skipping images.
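
A hypothetical sketch of what copy_s3_to_tmp could look like with the v1 SDK (the method name comes from the diff above; this body is an assumption, not the PR's actual implementation):

    require 'tempfile'

    def copy_s3_to_tmp(key)
      # Download the S3 object into a local tempfile and return its path,
      # so downstream code that expects a filesystem path keeps working.
      tmp = Tempfile.new(File.basename(key))
      tmp.binmode
      tmp.write(data_bucket.objects[key].read)  # aws-sdk v1: S3Object#read
      tmp.close
      tmp.path
    end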

cschauer (Author):

@lyleunderwood This is ready for review. If we move the logic that copies s3 binaries to a tmp directory in here, then we can actually keep the changes entirely within schlepp.

chuckbjones left a comment:

So this won't help with any code in skillet that assumes that files exist in the data directory, right?
Like this: https://github.com/ElasticSuite/skillet/blob/6d5893e10f2816e6758e5b16e4fe55a0707f904a/lib/elastic/import/burdens/catalogs.rb#L109

def glob_to_regexp(glob)
  escaped = Regexp.escape(glob)
    .sub('\*\*', '.*')
    .sub('\*', '(?:(?!\\/).)*')

chuckbjones:

Looks like this is only supporting * and **? We've got other glob patterns, like this one:
https://github.com/ElasticSuite/spyder-spice/blob/04ed000f0fd3883c86e40ee1071428aff459d5fe/import/inventory_import.rb#L93-L97

I wonder if there's a general glob-to-regex library we can use here.

chuckbjones:

I guess we only need to support what Dir.glob supports: https://ruby-doc.org/core-2.1.0/Dir.html#method-c-glob

This might work; it looks like it supports everything but [] (though that works the same as regex, so it's probably ok): https://github.com/alexch/rerun/blob/master/lib/rerun/glob.rb

cschauer (Author):

Ah, good call. I thought we could get away with just * and **. I saw a promising glob-to-regex library written in Node.

Doesn't the one you linked get * and ** mixed up? I thought * shouldn't match subdirectories infinitely, but that one replaces it with .*. If * did match across directories, we wouldn't be able to move images to imported without them being picked up anyway.
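
A quick sketch of the * vs. ** semantics in question, assuming Dir.glob's behavior (* stays within one path segment, ** may cross separators); the gsub calls and anchors are illustrative, not the PR's code:

    def glob_to_regexp(glob)
      escaped = Regexp.escape(glob)
                .gsub('\*\*', '.*')          # '**' may match across '/'
                .gsub('\*', '(?:(?!/).)*')   # '*' stops at the next '/'
      # NB: a real converter also needs '**/' to match zero directories,
      # which this sketch ignores.
      /\A#{escaped}\z/
    end

    r = glob_to_regexp('data/*.csv')
    r =~ 'data/foo.csv'      # => 0 (match)
    r =~ 'data/sub/foo.csv'  # => nil ('*' does not descend into subdirectories)
    glob_to_regexp('data/**/*.csv') =~ 'data/sub/foo.csv'  # => 0 ('**' does)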

chuckbjones:

Yeah I don't necessarily trust that guy's code. It's likely that it's full of bugs.

cschauer (Author) commented on Feb 1, 2019:

> So this won't help with any code in skillet that assumes that files exist in the data directory, right?

Doesn't everything expect that files are in data? I think we're going to have to prefix all our bucket paths with data. So, for example, ENV['AWS_DATA_BUCKET'] might be harley, but the sftp user's home directory is actually harley/data.

chuckbjones:

> Doesn't everything expect that files are in data?

Yes? Are we mounting the s3 bucket as a filesystem at data? If so, what's the point of this change? Performance?

cschauer (Author) commented on Feb 1, 2019:

Ooooooh I see what you're saying. Uhhhh, crap. I think we will need to make spice changes for those cases. I knew it was too good to be true.

cschauer (Author) commented on Feb 1, 2019:

Wait, when those things are called, are we still in the context of a burden? Can we add a method to the burden, data_file_exists?, that will check s3 or disk?
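
Something like this, perhaps (a hypothetical sketch; data_bucket is the aws-sdk v1 bucket object from this PR, and the method body is an assumption):

    def data_file_exists?(path)
      if data_bucket
        data_bucket.objects[path].exists?  # aws-sdk v1: S3Object#exists?
      else
        File.exist?(path)
      end
    end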

chuckbjones:

I think so.

The sales program import expects a filename. Not sure if we can modify it to pass an IO object or something.

There may be other places that deal with paths/filenames in spice burdens, but I don't know off the top of my head.

cschauer (Author) commented on Feb 4, 2019:

Ah, the sales program import doesn't use schlepp to parse. It looks like Roo::Spreadsheet.open only takes a path, so we probably need to copy the sales program file to a tmp file for processing.
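
A rough sketch of that workaround (the bucket key and filename are placeholders):

    require 'roo'
    require 'tempfile'

    # Copy the S3 object to a local tempfile so Roo can open it by path.
    # Keep the extension so Roo can detect the spreadsheet type.
    tmp = Tempfile.new(['sales_program', '.xlsx'])
    tmp.binmode
    tmp.write(data_bucket.objects['data/sales_program.xlsx'].read)
    tmp.close

    spreadsheet = Roo::Spreadsheet.open(tmp.path)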
