Concern around random number seeding #181

Closed
mhwilder opened this issue Dec 13, 2017 · 4 comments

Comments

@mhwilder
I ran into an issue while trying to implement class balancing using the np.random number generator: I was seeing a strange repeating sequence in the generated numbers, and it was causing poor sampling. I think the root cause is line 58 in utils/image.py, where np.random.randint is used to get a seed. When the Keras method image_data_generator.random_transform is then called, it calls np.random.seed(seed), which biases which seed will be selected next. The following code snippet illustrates the problem:

import numpy as np

# Count how often each possible seed value gets drawn when the global PRNG
# is reseeded with its own output on every iteration.
n = 10000
selected = np.zeros(n)
for i in range(100000):
    seed = np.random.randint(n)  # draw a "random" seed from the global PRNG
    selected[seed] += 1
    np.random.seed(seed)         # reseeding biases the next randint draw
print('%d of %d selected' % (np.sum(selected > 0), n))
print('Mean count of those selected is %0.1f' % np.mean(selected[selected > 0]))

This is probably not causing many problems in the existing codebase, because the Python random library is used in most other places (switching to that library in my class balancing code also seemed to fix my issue). Additionally, not much augmentation is done in random_transform, so repeating seeds may not matter much there. Still, it's probably worth considering alternatives, because it could be a gotcha down the line, as it was for me. I'm not sure what the best solution would be, though.
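For contrast, here is a minimal sketch (mine, not from the codebase) of the same counting loop with an isolated numpy.random.RandomState. Because the private generator is never reseeded from its own output, there is no feedback loop and nearly all n values end up being selected:

```python
import numpy as np

# Same experiment as above, but seeds are drawn from a private generator
# that is never reseeded, so no feedback loop biases the draws.
rng = np.random.RandomState(0)
n = 10000
selected = np.zeros(n)
for i in range(100000):
    seed = rng.randint(n)
    selected[seed] += 1
print('%d of %d selected' % (np.sum(selected > 0), n))  # nearly all of n
```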

As a side note, I wonder whether preprocessing/generator.py should call random.seed(seed) in addition to np.random.seed(seed), since the goal is to get replicable sequences and the random module is what drives the sequencing.
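If both libraries are in play, a small helper along these lines could seed them together (the name seed_everything is my own, purely illustrative):

```python
import random
import numpy as np

def seed_everything(seed):
    # Illustrative helper: seed both global PRNGs so that code using the
    # Python random module and code using np.random replay the same way.
    random.seed(seed)
    np.random.seed(seed)

seed_everything(42)
a, b = random.random(), np.random.rand()
seed_everything(42)
assert a == random.random() and b == np.random.rand()
```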

@de-vri-es
Contributor

Using a global pseudo random number generator (plain python or numpy) is not really nice in the first place. Different code using the same PRNG may have different expectations of when it is (re)seeded.

The best solution (I think) is to have a unique PRNG with its own state for each thing that needs one. That would allow anyone to seed their own PRNG for reproducible results without interfering with other code. I'm not sure how easy that would be to do though.

@de-vri-es
Contributor

de-vri-es commented Dec 13, 2017

Looks like (at least for numpy) this would be a good solution: https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.random.RandomState.html. The alternative would be fiddling with get_state/set_state, but that would be horrible.

I would recommend that you use numpy.random.RandomState for your own algorithm so that you're not affected by others messing with the global PRNG.

We should also switch to a similar approach, but that requires rewriting the random augmentation code. That should be done together with #68 and #150.
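A per-object approach could look roughly like the sketch below (the Augmenter class and its method are hypothetical, not the actual keras-retinanet API): each instance owns a RandomState, so seeding one instance never disturbs other code, and reseeding the global PRNG never disturbs the instance.

```python
import numpy as np

class Augmenter:
    """Hypothetical augmenter that owns its PRNG instead of using np.random."""

    def __init__(self, seed=None):
        self.prng = np.random.RandomState(seed)

    def random_transform(self):
        # Draw augmentation parameters from the private generator only.
        return self.prng.uniform(-0.1, 0.1)

a = Augmenter(seed=1)
b = Augmenter(seed=1)
np.random.seed(0)  # messing with the global PRNG...
assert a.random_transform() == b.random_transform()  # ...changes nothing here
```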

@mhwilder
Author

Thanks for the suggestion. It seems that RandomState would be a good solution for me and would be good to tie in elsewhere.

@de-vri-es
Contributor

de-vri-es commented Dec 16, 2017

I'm assuming the problem is solved by using numpy.random.RandomState, so I'm closing this issue. Additionally, once #190 is merged, keras-retinanet itself will no longer (re)seed any global PRNG.

If the problem is not solved, feel free to request a re-open.

Thanks for reporting the issue!
