Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[core] Hash firehose stream name if it is too long #1191

Merged
merged 5 commits into from
Mar 17, 2020

Conversation

chunyong-lin
Copy link
Contributor

to: @airbnb/streamalert-maintainers
related to: #1115 #1188
resolves: #1190

Background

See issue #1190

Changes

  • Update firehose stream name in classifier and pass the new stream name to tf_kinesis_firehose_delivery_stream module during terraform generate time. The firehose stream name is based on schema name.
    • This change will generate a new stream name if it's longer than 64 characters. In another words, most of the firehose stream name will remain the same.
    • For the stream name longer than 64 characters, the first 58 characters remains unchanged and the rest characters will hash to 8 characters by hashlib.md5(). hashlib.md5() returns 32 characters (128-bit) and we will only take the first 8 characters. There is low chance is too reach the stream name limit, so I don't worry the hash conflict.
  • Remove _streamalert_ (13 chars) from firehose stream name. Firehose stream name is restrict to the name length is no longer than 64 characters. We want the schema names more descriptive and it will be easily hit the limit especially when we add prefix.
  • Use var.glue_catalog_table_name in the s3 file prefix instead of var.log_name, then latter may be changed if it is too long.

Testing

  • Add unit test cases to cover the code change.
  • Tested in staging environment with the data matched to schemas both with short and extremely long schema name, e.g. osquery_differential and cloudwatch:events.dots-test.crazy.long_name_yay_hahahahaha
    • classifier can process data and send data to both firehose
    • rule's engine can trigger alerts from both types of data.
    • alert process can send alerts to firehose correctly
    • athena partition can partition both types of data
    • both types of data and alerts can be searchable in the Athena in parquet format.

@coveralls
Copy link

coveralls commented Mar 14, 2020

Coverage Status

Coverage increased (+0.01%) to 95.434% when pulling 4011cff on firehose_log_stream_name_length into c4d2c12 on release-3-1-0.

@chunyong-lin chunyong-lin changed the title [core] Hash firehose stream name it is too long [core] Hash firehose stream name if it is too long Mar 14, 2020
@chunyong-lin chunyong-lin added this to the 3.1.0 milestone Mar 14, 2020
@@ -52,19 +53,32 @@ class FirehoseClient:
MAX_RECORD_SIZE = 1000 * 1000 - 2

# Default firehose name format, should be formatted with deployment prefix
DEFAULT_FIREHOSE_FMT = '{}streamalert_data_{}'
DEFAULT_FIREHOSE_FMT = '{}data_{}'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO I'd rather keep the streamalert portion of this and drop the data bit.. these are Firehose Data Delivery Streams and including data is pretty redundant. at least having streamalert in the name will help with things like filtering streams in console, etc

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mentioned this before, but I recommend explicitly stating the firehose names in the conf/ directory, and explicitly populating them during the generate step. I imagine the workflow is like:

  • Look in conf directory for explicit firehose name
  • If no explicit name, use use_prefix and the log type
  • During manage.py generate, explicitly populate into Terraform json files
  • Instead of generating them using locals {$prefix}_data_{$hash}, just use the explicit name

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Ryxias quick question.

Look in conf directory for explicit firehose name

Potentially, we may have hundreds of firehose up running due to one log schema using one firehose. The firehose conf will be quite long and add complexity to read all firehose names from the conf/global.json. IMO, I would take @ryandeivert 's suggestion. Thoughts?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do like Derek's idea, I'm just unsure where this would live.. in global.json? how do we map from logs.json to this reliably... or does it live in logs.json?? any suggestions for how this would actually look?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also still insist we drop _data_, even with the hashing route. really not sure why we're hung up on keeping it...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Ryxias again I'm not opposed to your way, I just want to hear more about "how" this will live in the config. does it go in logs.json or the firehose config of global.json?

Copy link
Contributor

@Ryxias Ryxias Mar 16, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can have it in conf/logs.json, or conf/schemas/*.json or conf/firehose.json. I don't really care either way.

I think the important one is that it's OPTIONAL. In most cases, you should not explicitly specify your firehoses. But they should be explicitly stated somewhere. I don't care much about the conf/****.json one, but I am very strongly in favor of explicitly generating them in the terraform/***.tf.json files, instead of dynamically creating them inside of the the Terraform modules.

Copy link
Contributor

@Ryxias Ryxias Mar 16, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be mean your manage.py generate step would just have a new function that's like:

def get_firehose_name(log_type):
  # Explicitly stated
  if config['whatever'].get(log_type, {}).get('firehose_name', None):
    return config['whatever'].get(log_type, {}).get('firehose_name', None)

  # Generated
  use_prefix = config['global']['whatever']['use_prefix']

  firehose_name = '{}{}_{}'.format(
    use_prefix ? '{}_'.format(prefix),
    'streamalert_data' # Or whatever
    log_schema
  )

  if len(firehose_name) > 64:
    # Truncate streamalert_data ... 

  if len(firehose_name) > 64:
    # then do some hashing stuff...

  return firehose_name

And the above would set the real firehose name into the **.tf.json files

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm drawing attention to the where on this because I think we're forgetting how this is currently performed. data retention is optional and does not need to be enabled.. but enabling means toggling it in global.json. having some settings in global.json and others in logs.json is confusing.. but on the other hand, having the name for an optional feature embedded in logs.json and other settings in global.json is also confusing. we're sorta asking a lot of users here to mind-map all of this together

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wrap up this thread. I took Ryan's suggestion and addressed it in this commit. We will support the custom firehose stream name later and I have open an issue #1193 to track the future work.

@@ -298,6 +312,46 @@ def firehose_log_name(cls, log_name):
"""
return re.sub(cls.SPECIAL_CHAR_REGEX, cls.SPECIAL_CHAR_SUB, log_name)

@classmethod
def generate_firehose_stream_name(cls, use_prefix, prefix, log_stream_name):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method's name seems to imply that it generates the full firehose name, but that's actually not true right? It generates the prefix-less firehose name.

I suggest renaming the function to generate_firehose_suffix reduce confusion

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point! Yes, you are right, the return result doesn't include prefix. Will rename this method.

Copy link
Contributor

@Ryxias Ryxias left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the changes. LGTM

@chunyong-lin chunyong-lin merged commit 666b4cd into release-3-1-0 Mar 17, 2020
@chunyong-lin chunyong-lin deleted the firehose_log_stream_name_length branch March 17, 2020 15:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants