-
Notifications
You must be signed in to change notification settings - Fork 17
Datagen bns changes #466
base: bns
Are you sure you want to change the base?
Datagen bns changes #466
Conversation
adding the non spin bns prior
detector_frame_prior, | ||
) | ||
if signal_type == "bns": | ||
with ProcessPoolExecutor(140) as exe: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like the idea of parallelizing this, but I don't think this is doing what you expect: This will submit a single job that will generate all the requested waveforms in one process.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So this was a long discussion with Alec on a thread that incidentally did not include you. I am going to forward that thread to you and hopefully you can see the entire convo.
The crux was that, using concurrent.futures did reduce the waveform generation from days to under 46 min on the hanford box.
Let me know if you can access the following thread on slack:
https://fastml.slack.com/archives/C05EHNRU8AK/p1695772374684639
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ethan's right. This is only helpful if you submit a job for each choice of parameters, submitting one job that generates waveforms for all the parameters won't multiprocess anything, it will just generate all the waveforms in serial in one process.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, so I get your point. Not sure why it reduced the waveform generation time so drastically. Or could it be that the hanford box just behaved rather nicely at that particular run. It's kind of puzzling.
So I will remove the concurrent.futures for now. If we run into bottlenecks in generating BNS in future, we can revisit at that time.
waveform_approximant: str, | ||
detector_frame_prior: bool = False, | ||
): | ||
padding = 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the motivation for this? In general should avoid "magic numbers" - maybe make this a parameter with a default value
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So there is the issue where thebilby generated waveforms have coalescence at timestamp=0 sec and that's on purpose I think. We need to move the coalescence to the end of the waveform at time stamp = 16 sec (lets say). To do this I played with several waveforms and found that if we roll the waveforms by 200 datapoints to the left, then we do get most of the coalescence at the very end. There are some ring-down remnants in some cases and to deal with that, we chop off the first sec of the waveform. So the padding is set to 1 sec and the waveform is generated "longer "by an amount equal to padding. After rolling and chopping off the first sec, the resultant waveform is of the intended length and the coalescence is nicely at the end of it.
I will add some comments to the code to explain this properly.
So in the specific context that it is used, I don't think its value will be changed or can be used elsewhere. Given its limited scope, I don't think that we need to put this in pyproject.toml
|
||
# just shift the coalescence to the left by 200 datapoints | ||
# to cancel wraparound in the beginning | ||
dt = -200 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again, another magic number: why is this 200? Is there a first principles motivation for this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please see above
if i == (n_samples / 4): | ||
logging.info("Generated polarizations : {}".format(i)) | ||
elif i == (n_samples / 2): | ||
logging.info("Generated polarizations : {}".format(i)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would argue this reduces readability more than it helps with logging. If you wan't to keep track maybe something cleaner would be
# every 10th waveform
if not i % 10:
# note the logging.debug so that it's only called if verbose=True.
logging.debug(f"{i + 1} polarizations generated")
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
agree, will change
elif i == (n_samples / 2): | ||
logging.info("Generated polarizations : {}".format(i)) | ||
|
||
logging.info("Finished Generated polarizations") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"Finished Generated polarizations" --> "Finished generating polarizations"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
agree, will change
responses, swap_indices = self.swapper(responses) | ||
responses, mute_indices = self.muter(responses) | ||
X[mask] += responses | ||
if N > 0: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remind me when this edge case would be reached? In what instance would we not wan't to inject waveforms?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was also something that Alec had proposed. We ran into this issue when we had to decrease the batch size to like 8 for BNS and that would lead to N coming to 0 on some instances, with waveform prob set to 0.277. For smaller batch sizes, the chances of N coming to zero, even for reasonable values of waveform probs is pretty high. So Alec proposed to add this check to aframe
detector_frame_prior, | ||
) | ||
signals = future.result() | ||
else: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since this is going into a dedicated bns
branch, I think it might make sense to just assume we are generating bns waveforms, and not have this if else
. We can move to generalizing everything down the line.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we don't want to create a separate repo for BNS, it will be most ideal and efficient, if we take care of this now rather than later. And write code keeping in mind that it works seamlessly with the main pipeline. Otherwise it will be a huge issue later to merge bns branch into main aframe branch.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice stuff, left some comments and questions!
This reverts commit 474ddaa.
Following changes are implemented:
scripts/waveforms.py:
utils/injection.py:
projects/sandbox/pyproject.toml: