[core] Hash firehose stream name if it is too long #1191
Changes from 3 commits
```diff
@@ -15,6 +15,7 @@
 """
 from collections import defaultdict
 import json
+import hashlib
 import re

 import backoff
```
```diff
@@ -52,19 +53,32 @@ class FirehoseClient:
     MAX_RECORD_SIZE = 1000 * 1000 - 2

     # Default firehose name format, should be formatted with deployment prefix
-    DEFAULT_FIREHOSE_FMT = '{}streamalert_data_{}'
+    DEFAULT_FIREHOSE_FMT = '{}data_{}'

     # Exceptions for which backoff operations should be performed
     EXCEPTIONS_TO_BACKOFF = (ClientError, BotocoreConnectionError, HTTPClientError)

     # Set of enabled log types for firehose, loaded from configs
     _ENABLED_LOGS = dict()

+    # The max length of the firehose stream name is 64. For the streamalert data
+    # firehose, we reserve 5 chars to have `data_` as part of the prefix. Please
+    # refer to terraform/modules/tf_kinesis_firehose_delivery_stream/main.tf
+    FIREHOSE_NAME_MAX_LEN = 59
+
+    FIREHOSE_NAME_HASH_LEN = 8
+
     def __init__(self, prefix, firehose_config=None, log_sources=None):
+        self._original_prefix = prefix
+        if firehose_config and firehose_config.get('use_prefix', True):
+            self._use_prefix = True
+        else:
+            self._use_prefix = False
+
         self._prefix = (
             '{}_'.format(prefix)
             # This default value must be consistent with the classifier Terraform config
-            if firehose_config and firehose_config.get('use_prefix', True)
+            if self._use_prefix
             else ''
         )
         self._client = boto3.client('firehose', config=boto_helpers.default_config())
```
```diff
@@ -298,6 +312,45 @@ def firehose_log_name(cls, log_name):
         """
         return re.sub(cls.SPECIAL_CHAR_REGEX, cls.SPECIAL_CHAR_SUB, log_name)

+    @classmethod
+    def generate_firehose_stream_name(cls, use_prefix, prefix, log_stream_name):
+        """Generate a stream name compliant with the Firehose naming restriction:
+        no longer than 64 characters
+
+        Args:
+            prefix (str): TODO
+            log_stream_name (str): TODO
+
+        Returns:
+            str: compliant stream name
+        """
+        reserved_len = cls.FIREHOSE_NAME_MAX_LEN
+
+        if use_prefix:
+            # the prefix will have a trailing '_' (underscore), which is why we
+            # deduct 1 at the end. Please refer to the terraform module for more
+            # detail: terraform/modules/tf_kinesis_firehose_delivery_stream/main.tf
+            reserved_len = reserved_len - len(prefix) - 1
+
+        # Don't change the stream name if its length is compliant
+        if len(log_stream_name) <= reserved_len:
+            return log_stream_name
+
+        # Otherwise keep the first 51 chars if there is no prefix and hash the rest
+        # of the string into 8 chars. With a prefix enabled, keep the first
+        # (51 - len(prefix)) chars and hash the rest of the string into 8 chars.
+        pos = reserved_len - cls.FIREHOSE_NAME_HASH_LEN
+        hash_part = log_stream_name[pos:]
+        hash_result = hashlib.md5(hash_part.encode()).hexdigest()  # nosec
+
+        # combine the first part and the first 8 chars of the hash result together
+        # as the new stream name, e.g. if use_prefix is set:
+        # 'very_very_very_long_log_stream_name_abcd_59_characters_long' may hash to
+        # 'very_very_very_long_log_stream_name_abcd_59_06ceefaa'
+        return ''.join([log_stream_name[:pos], hash_result[:cls.FIREHOSE_NAME_HASH_LEN]])
+
     @classmethod
     def enabled_log_source(cls, log_source_name):
         """Check that the incoming record is an enabled log source for Firehose
```

Comment (on `generate_firehose_stream_name`): This method's name seems to imply that it generates the full firehose name, but that's actually not true right? It generates the prefix-less firehose name. I suggest renaming the function to […]

Reply: Good point! Yes, you are right, the return result doesn't include the prefix. Will rename this method.
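As a standalone sketch of the truncate-and-hash scheme above (the constants mirror `FIREHOSE_NAME_MAX_LEN` and `FIREHOSE_NAME_HASH_LEN`; the helper name is invented for illustration):

```python
import hashlib

MAX_LEN = 59   # 64-char Firehose limit minus len('data_')
HASH_LEN = 8   # number of md5 hexdigest chars appended to the truncated name

def shorten_stream_name(use_prefix, prefix, name):
    """Names within the limit pass through unchanged; longer names keep a
    leading slice and append an 8-char hash of the remainder, so distinct
    long names remain distinct after truncation."""
    reserved = MAX_LEN - (len(prefix) + 1 if use_prefix else 0)
    if len(name) <= reserved:
        return name
    pos = reserved - HASH_LEN
    digest = hashlib.md5(name[pos:].encode()).hexdigest()  # nosec
    return name[:pos] + digest[:HASH_LEN]

print(len(shorten_stream_name(False, '', 'a' * 80)))  # 59
print(shorten_stream_name(False, '', 'short_name'))   # short_name
```

Because the hash covers the tail that truncation would otherwise discard, two long names that share the same first 51 characters still map to different stream names.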
@classmethod | ||
def enabled_log_source(cls, log_source_name): | ||
"""Check that the incoming record is an enabled log source for Firehose | ||
|
```diff
@@ -392,8 +445,12 @@ def send(self, payloads):
         for log_type, records in records.items():
             # This same substitution method is used when naming the Delivery Streams
             formatted_log_type = self.firehose_log_name(log_type)
-            stream_name = self.DEFAULT_FIREHOSE_FMT.format(self._prefix, formatted_log_type)
+
+            # the firehose stream name has a length limit, no longer than 64 characters
+            formatted_stream_name = self.generate_firehose_stream_name(
+                self._use_prefix, self._original_prefix, formatted_log_type
+            )
+            stream_name = self.DEFAULT_FIREHOSE_FMT.format(self._prefix, formatted_stream_name)
             # Process each record batch in the categorized payload set
             for record_batch in self._record_batches(records):
                 batch_size = len(record_batch)
```
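To see that the full delivery stream name stays within AWS's 64-character limit once the `{}data_{}` format is applied, here is a hypothetical end-to-end check (the helper name and the `acme` prefix are invented; the logic mirrors the diff above):

```python
import hashlib

FIREHOSE_LIMIT = 64  # max length of a Firehose DeliveryStreamName

def full_stream_name(use_prefix, prefix, log_name, max_len=59, hash_len=8):
    """Shorten the log-type portion the same way the diff does, then apply
    the '{prefix}data_{name}' format used by DEFAULT_FIREHOSE_FMT."""
    reserved = max_len - (len(prefix) + 1 if use_prefix else 0)
    if len(log_name) > reserved:
        pos = reserved - hash_len
        digest = hashlib.md5(log_name[pos:].encode()).hexdigest()  # nosec
        log_name = log_name[:pos] + digest[:hash_len]
    return '{}data_{}'.format('{}_'.format(prefix) if use_prefix else '', log_name)

name = full_stream_name(True, 'acme', 'x' * 100)
print(len(name))  # 64 -- exactly at the limit for a maximally long log name
```

This is why `FIREHOSE_NAME_MAX_LEN` is 59 rather than 64: the five characters of `data_` are always added afterwards, and the prefix plus its trailing underscore are deducted separately when `use_prefix` is set.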
Comment: IMO I'd rather keep the `streamalert` portion of this and drop the `data` bit.. these are Firehose Data Delivery Streams and including `data` is pretty redundant. at least having `streamalert` in the name will help with things like filtering streams in console, etc
Comment: I mentioned this before, but I recommend explicitly stating the firehose names in the `conf/` directory, and explicitly populating them during the `generate` step. I imagine the workflow is like:
- the default name is derived from `use_prefix` and the log type
- on `manage.py generate`, explicitly populate into the Terraform json files (`locals`)
- if an explicit name is given, instead of `{$prefix}_data_{$hash}`, just use the explicit name
Reply: @Ryxias quick question. Potentially, we may have hundreds of firehoses up and running, due to one log schema using one firehose. The firehose conf will be quite long and add complexity to read all firehose names from `conf/global.json`. IMO, I would take @ryandeivert's suggestion. Thoughts?
Comment: I do like Derek's idea, I'm just unsure where this would live.. in `global.json`? how do we map from `logs.json` to this reliably... or does it live in `logs.json`?? any suggestions for how this would actually look?
Comment: I also still insist we drop `_data_`, even with the hashing route. really not sure why we're hung up on keeping it...
Comment: @Ryxias again I'm not opposed to your way, I just want to hear more about "how" this will live in the config. does it go in `logs.json` or the firehose config of `global.json`?
Comment: We can have it in `conf/logs.json`, or `conf/schemas/*.json`, or `conf/firehose.json`. I don't really care either way. I think the important one is that it's OPTIONAL. In most cases, you should not explicitly specify your firehoses. But they should be explicitly stated somewhere. I don't care much about the `conf/****.json` one, but I am very strongly in favor of explicitly generating them in the `terraform/***.tf.json` files, instead of dynamically creating them inside of the Terraform modules.
Comment: It would mean your `manage.py generate` step would just have a new function that's like: […] And the above would set the real firehose name into the `**.tf.json` files.
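The reviewer's code snippet did not survive extraction. As a purely hypothetical sketch (both names are invented, not part of the PR), such a generate-step helper might prefer an explicitly configured name and fall back to the generated one:

```python
def resolve_firehose_name(log_conf, generated_name):
    """Hypothetical helper: honor an optional, explicitly configured stream
    name before falling back to the generated, possibly hashed, name."""
    return log_conf.get('firehose_stream_name', generated_name)

print(resolve_firehose_name({}, 'acme_data_cloudwatch'))  # acme_data_cloudwatch
print(resolve_firehose_name({'firehose_stream_name': 'custom_name'}, 'x'))  # custom_name
```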
Comment: I'm drawing attention to the "where" on this because I think we're forgetting how this is currently performed. data retention is optional and does not need to be enabled.. but enabling means toggling it in `global.json`. having some settings in global.json and others in logs.json is confusing.. but on the other hand, having the name for an optional feature embedded in logs.json and other settings in global.json is also confusing. we're sorta asking a lot of users here to mind-map all of this together
Reply: Wrapping up this thread. I took Ryan's suggestion and addressed it in this commit. We will support custom firehose stream names later, and I have opened issue #1193 to track the future work.