[core] Hash firehose stream name if it is too long #1191
Conversation
```diff
@@ -52,19 +53,32 @@ class FirehoseClient:
     MAX_RECORD_SIZE = 1000 * 1000 - 2

     # Default firehose name format, should be formatted with deployment prefix
-    DEFAULT_FIREHOSE_FMT = '{}streamalert_data_{}'
+    DEFAULT_FIREHOSE_FMT = '{}data_{}'
```
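For context, a quick sketch of how this format string is applied; the prefix and log name values here are hypothetical examples, not taken from the PR:

```python
# Format strings before and after this change; 'acme_' and
# 'cloudwatch_events' are made-up example values for illustration.
OLD_FMT = '{}streamalert_data_{}'
NEW_FMT = '{}data_{}'

prefix = 'acme_'
log_name = 'cloudwatch_events'

old_name = OLD_FMT.format(prefix, log_name)  # 'acme_streamalert_data_cloudwatch_events'
new_name = NEW_FMT.format(prefix, log_name)  # 'acme_data_cloudwatch_events'
print(len(old_name), len(new_name))  # the shorter format saves 12 characters
```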
IMO I'd rather keep the `streamalert` portion of this and drop the `data` bit.. these are Firehose Data Delivery Streams and including `data` is pretty redundant. at least having `streamalert` in the name will help with things like filtering streams in the console, etc
I mentioned this before, but I recommend explicitly stating the firehose names in the `conf/` directory, and explicitly populating them during the `generate` step. I imagine the workflow is like:

- Look in the conf directory for an explicit firehose name
- If no explicit name, use `use_prefix` and the log type
- During `manage.py generate`, explicitly populate the names into the Terraform json files
- Instead of generating them using locals like `{$prefix}_data_{$hash}`, just use the explicit name
@Ryxias quick question.

> Look in conf directory for explicit firehose name

Potentially we may have hundreds of firehoses running, since each log schema uses its own firehose. The firehose conf would get quite long, and reading all of the firehose names from `conf/global.json` adds complexity. IMO, I would take @ryandeivert's suggestion. Thoughts?
I do like Derek's idea, I'm just unsure where this would live.. in `global.json`? how do we map from `logs.json` to this reliably... or does it live in `logs.json`?? any suggestions for how this would actually look?
I also still insist we drop `_data_`, even with the hashing route. really not sure why we're hung up on keeping it...
@Ryxias again I'm not opposed to your way, I just want to hear more about "how" this will live in the config. does it go in `logs.json` or the firehose config of `global.json`?
We can have it in `conf/logs.json`, or `conf/schemas/*.json`, or `conf/firehose.json`. I don't really care either way.

I think the important thing is that it's OPTIONAL. In most cases, you should not explicitly specify your firehoses. But they should be explicitly stated somewhere. I don't care much about which `conf/****.json` it goes in, but I am very strongly in favor of explicitly generating them in the `terraform/***.tf.json` files, instead of dynamically creating them inside of the Terraform modules.
It would mean your `manage.py generate` step would just have a new function that's like:

```python
def get_firehose_name(log_type):
    # Explicitly stated
    explicit_name = config['whatever'].get(log_type, {}).get('firehose_name')
    if explicit_name:
        return explicit_name

    # Generated
    use_prefix = config['global']['whatever']['use_prefix']
    firehose_name = '{}{}_{}'.format(
        '{}_'.format(prefix) if use_prefix else '',
        'streamalert_data',  # Or whatever
        log_schema
    )
    if len(firehose_name) > 64:
        # Truncate streamalert_data ...
        if len(firehose_name) > 64:
            # then do some hashing stuff...
            pass
    return firehose_name
```

And the above would set the real firehose name into the `**.tf.json` files.
I'm drawing attention to the *where* on this because I think we're forgetting how this is currently performed. data retention is optional and does not need to be enabled.. but enabling it means toggling it in `global.json`. having some settings in `global.json` and others in `logs.json` is confusing.. but on the other hand, having the name for an optional feature embedded in `logs.json` and other settings in `global.json` is also confusing. we're sorta asking a lot of users here to mind-map all of this together
Wrapping up this thread: I took Ryan's suggestion and addressed it in this commit. We will support custom firehose stream names later, and I have opened issue #1193 to track the future work.
```diff
@@ -298,6 +312,46 @@ def firehose_log_name(cls, log_name):
         """
         return re.sub(cls.SPECIAL_CHAR_REGEX, cls.SPECIAL_CHAR_SUB, log_name)

+    @classmethod
+    def generate_firehose_stream_name(cls, use_prefix, prefix, log_stream_name):
```
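As a rough sketch of the hashing behavior discussed in this PR: if the assembled name exceeds the 64-character Firehose limit, truncate it and append the first 8 characters of its `hashlib.md5()` digest. The format string, separator, and truncation point below are assumptions for illustration, not the exact implementation:

```python
import hashlib

AWS_FIREHOSE_NAME_MAX_LEN = 64  # AWS limit on delivery stream name length

def generate_firehose_stream_name(use_prefix, prefix, log_stream_name):
    """Illustrative sketch only: assemble the stream name, and when it
    exceeds the 64-char limit, keep a readable head and append the
    first 8 hex chars of the full name's md5 digest."""
    stream_name = 'data_{}'.format(log_stream_name)
    if use_prefix:
        stream_name = '{}_{}'.format(prefix, stream_name)
    if len(stream_name) <= AWS_FIREHOSE_NAME_MAX_LEN:
        return stream_name
    name_hash = hashlib.md5(stream_name.encode()).hexdigest()[:8]
    # 55 chars of readable head + '_' + 8-char hash = exactly 64 chars
    return '{}_{}'.format(stream_name[:AWS_FIREHOSE_NAME_MAX_LEN - 9], name_hash)
```

Short names pass through untouched, while over-long names collapse deterministically to a unique-enough 64-character name.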
This method's name seems to imply that it generates the full firehose name, but that's actually not true, right? It generates the prefix-less firehose name.

I suggest renaming the function to `generate_firehose_suffix` to reduce confusion.
Good point! Yes, you are right, the return result doesn't include the prefix. Will rename this method.
Thank you for the changes. LGTM
to: @airbnb/streamalert-maintainers
related to: #1115 #1188
resolves: #1190
Background
See issue #1190
Changes
- Generate the firehose stream name in the `classifier` and pass the new stream name to the `tf_kinesis_firehose_delivery_stream` module at terraform generate time. The firehose stream name is based on the schema name.
- Hash the stream name with `hashlib.md5()` if it is too long. `hashlib.md5()` returns 32 hex characters (128 bits) and we only take the first 8 of them. Stream names rarely reach the length limit, so I am not worried about hash collisions.
- Remove `_streamalert_` (13 chars) from the firehose stream name. Firehose restricts stream names to at most 64 characters. We want schema names to be more descriptive, and the limit is easily hit, especially once the prefix is added.
- Use `var.glue_catalog_table_name` in the S3 file prefix instead of `var.log_name`, since the latter may be changed if it is too long.

Testing

Tested with the log schemas `osquery_differential` and `cloudwatch:events.dots-test.crazy.long_name_yay_hahahahaha`.