Add support for reading files from s3 #5

Open · cschauer wants to merge 2 commits into master

Conversation

cschauer (Author):

No description provided.

@@ -27,15 +27,35 @@ def initialize(&block)
    block.call(self)
  end

  def load_data_bucket
    return nil unless ENV['AWS_DATA_BUCKET']
    s3 = AWS::S3.new(region: ENV['AWS_DATA_BUCKET_REGION'])
cschauer (Author):

This is aws-sdk v1, which we get from skillet. Not ideal, but I think it's better than adding an explicit dependency here, since there aren't any in schlepp's gemfile.
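
For context, a rough sketch of the v1 vs. v2 difference (the bucket lookup lines are illustrative, not the PR's code):

    # aws-sdk v1, as provided transitively by skillet:
    s3 = AWS::S3.new(region: ENV['AWS_DATA_BUCKET_REGION'])
    bucket = s3.buckets[ENV['AWS_DATA_BUCKET']]

    # The same thing in aws-sdk v2 would use the Aws:: namespace instead:
    # s3 = Aws::S3::Resource.new(region: ENV['AWS_DATA_BUCKET_REGION'])
    # bucket = s3.bucket(ENV['AWS_DATA_BUCKET'])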

    paths = data_bucket ? object_glob([ path ]) : Dir.glob(path)

    paths.each do |path|
      path = copy_s3_to_tmp(path) if data_bucket && !ENV['NOIMAGES']
cschauer (Author) commented on Jan 31, 2019:

I think skipping the local copy will be fine if ENV['NOIMAGES'] is set; this isn't used anywhere that wouldn't be skipped anyway. We definitely don't want to download image files from s3 if we are skipping images.
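
A hypothetical sketch of what copy_s3_to_tmp could look like with the v1 SDK (the method name comes from the diff above; this body is an assumption, not the PR's actual implementation):

    require 'tempfile'

    def copy_s3_to_tmp(key)
      # Download the S3 object into a local tempfile and return its path,
      # so downstream code that expects a filesystem path keeps working.
      tmp = Tempfile.new(File.basename(key))
      tmp.binmode
      tmp.write(data_bucket.objects[key].read)  # aws-sdk v1: S3Object#read
      tmp.close
      tmp.path
    end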

cschauer (Author):

@lyleunderwood This is ready for review. If we move the logic that copies s3 binaries to a tmp directory in here, then we can actually keep the changes entirely within schlepp.

chuckbjones left a comment:

So this won't help with any code in skillet that assumes that files exist in the data directory, right?
Like this: https://github.com/ElasticSuite/skillet/blob/6d5893e10f2816e6758e5b16e4fe55a0707f904a/lib/elastic/import/burdens/catalogs.rb#L109

def glob_to_regexp(glob)
  escaped = Regexp.escape(glob)
    .sub('\*\*', '.*')
    .sub('\*', '(?:(?!\\/).)*')

chuckbjones:

Looks like this is only supporting * and **? We've got other glob patterns, like this one:
https://github.com/ElasticSuite/spyder-spice/blob/04ed000f0fd3883c86e40ee1071428aff459d5fe/import/inventory_import.rb#L93-L97

I wonder if there's a general glob-to-regex library we can use here.

chuckbjones:

I guess we only need to support what Dir.glob supports: https://ruby-doc.org/core-2.1.0/Dir.html#method-c-glob

This might work; it looks like it supports everything but [] (though that works the same as regex, so it's probably ok): https://github.com/alexch/rerun/blob/master/lib/rerun/glob.rb

cschauer (Author):

Ah, good call. I thought we could get away with just * and **. I saw a promising glob-to-regex library written in Node.

Doesn't the one you linked get * and ** mixed up? I thought * shouldn't match subdirectories infinitely, but that one replaces it with .*. If * did match across directories, we wouldn't be able to move images to imported without them being picked up anyway.
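
A quick sketch of the * vs. ** semantics in question, assuming Dir.glob's behavior (* stays within one path segment, ** may cross separators); the gsub calls and anchors are illustrative, not the PR's code:

    def glob_to_regexp(glob)
      escaped = Regexp.escape(glob)
                .gsub('\*\*', '.*')          # '**' may match across '/'
                .gsub('\*', '(?:(?!/).)*')   # '*' stops at the next '/'
      # NB: a real converter also needs '**/' to match zero directories,
      # which this sketch ignores.
      /\A#{escaped}\z/
    end

    r = glob_to_regexp('data/*.csv')
    r =~ 'data/foo.csv'      # => 0 (match)
    r =~ 'data/sub/foo.csv'  # => nil ('*' does not descend into subdirectories)
    glob_to_regexp('data/**/*.csv') =~ 'data/sub/foo.csv'  # => 0 ('**' does)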

chuckbjones:

Yeah I don't necessarily trust that guy's code. It's likely that it's full of bugs.

cschauer (Author) commented on Feb 1, 2019:

> So this won't help with any code in skillet that assumes that files exist in the data directory, right?

Doesn't everything expect that files are in data? I think we're going to have to prefix all our bucket paths with data. So, for example, ENV['AWS_DATA_BUCKET'] might be harley, but the sftp user's home directory is actually harley/data.

chuckbjones:

> Doesn't everything expect that files are in data?

Yes? Are we mounting the s3 bucket as a filesystem at data? If so, what's the point of this change? Performance?

cschauer (Author) commented on Feb 1, 2019:

Ooooooh I see what you're saying. Uhhhh, crap. I think we will need to make spice changes for those cases. I knew it was too good to be true.

cschauer (Author) commented on Feb 1, 2019:

Wait, when those things are called, are we still in the context of a burden? Can we add a method to the burden, data_file_exists?, that will check s3 or disk?
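
Something like this, perhaps (a hypothetical sketch; data_bucket is the aws-sdk v1 bucket object from this PR, and the method body is an assumption):

    def data_file_exists?(path)
      if data_bucket
        data_bucket.objects[path].exists?  # aws-sdk v1: S3Object#exists?
      else
        File.exist?(path)
      end
    end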

chuckbjones:

I think so.

The sales program import expects a filename. Not sure if we can modify it to pass an IO object or something.

There may be other places that deal with paths/filenames in spice burdens, but I don't know off the top of my head.

cschauer (Author) commented on Feb 4, 2019:

Ah, the sales program import doesn't use schlepp to parse. It looks like Roo::Spreadsheet.open only takes a path, so we probably need to copy the sales program file to a tmp file for processing.
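
A rough sketch of that workaround (the bucket key and filename are placeholders):

    require 'roo'
    require 'tempfile'

    # Copy the S3 object to a local tempfile so Roo can open it by path.
    # Keep the extension so Roo can detect the spreadsheet type.
    tmp = Tempfile.new(['sales_program', '.xlsx'])
    tmp.binmode
    tmp.write(data_bucket.objects['data/sales_program.xlsx'].read)
    tmp.close

    spreadsheet = Roo::Spreadsheet.open(tmp.path)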
