Travis tests no longer pass. #1123

de-vri-es · 2019-10-02T09:06:09Z

The travis tests no longer pass as can be seen here: https://travis-ci.org/fizyr/keras-retinanet/builds/592460276 .

At a glance, it seems to be a problem with an incompatible version of tensorflow, but I didn't dig very deep.

hgaiser · 2019-10-02T09:08:22Z

Tensorflow recently updated to 2.0, probably related :)

de-vri-es · 2019-10-02T09:13:09Z

Sounds plausible ^^

anuar12 · 2019-10-02T15:15:04Z

Any chance necessary changes will be made for TF2.0 support?

hgaiser · 2019-10-03T08:41:16Z

Yes I expect this week or next week I'll push an update to fix compatibility with tensorflow 2.0.

de-vri-es · 2019-10-12T12:14:03Z

If I try to run the tests locally on a machine without CUDA the tests also hang forever. In my case, all 8GB of RAM was exhausted and the whole computer froze.

Perhabs we have a memory leak with TF2 somewhere, or perhaps TF2 or its TF1 compatibility layer is doing something very inefficiently. I didn't try to pinpoint it to anything specific. I also don't think I'll have time for that anywhere soon.

hgaiser · 2019-10-14T09:06:29Z

all 8GB of RAM

You need an upgrade :)

I was testing only tests/models/test_mobilenet.py and it was using a lot of memory, but it did finish. Interestingly, adding the following:

import tensorflow as tf
tf.compat.v1.disable_v2_behavior()

Seems to make everything "normal". I don't want to add that code everywhere, but so far it's the only solution I have. I don't have much time to look into this issue further, but I will see if I can find out the underlying issue. In the meantime, @de-vri-es could you verify that disabling v2 behavior fixes the tests on your system too?

hgaiser · 2019-10-14T10:08:46Z

I feel like this could be related: tensorflow/tensorflow#32052 . Sounds like there are some memory issues with tf2 and the travis build most likely gets killed because it uses too much memory. Maybe as a temporary fix we can disable the regression tests (the ones that actually train on a bit of data) and only test specific functions for now?

de-vri-es · 2019-10-14T19:32:14Z

all 8GB of RAM

You need an upgrade :)

It was actually 16 GB, though maybe I still should :)

Anyway, to set a baseline I tried running the tests again without any modifications, and python segfaulted in some multiprocessing operation 💃

tests/test_losses.py .                                                                                          [  1%]
tests/backend/test_common.py ..                                                                                 [  5%]
tests/bin/test_train.py Fatal Python error: Segmentation fault

Thread 0x00007ffa7d5fb700 (most recent call first):
  File "/usr/lib/python3.7/multiprocessinzsh: segmentation fault (core dumped)  pytest

Ran it again, no segfault, but memory usage crept up in tests/models/test_densenet.py until I killed it when it reached 14 GB. Then I added the change you suggested to that test, and the same test (also densenet) passed using "only" 4 GB of RAM.

However, then I copied the change into keras_retinanet/models/retinanet.py to get it everywhere, and the train.py tests started failing:

______________________________________________________ test_coco ______________________________________________________

    def test_coco():
        # ignore warnings in this test
        warnings.simplefilter('ignore')
    
        # run training / evaluation
        keras_retinanet.bin.train.main([
            '--epochs=1',
            '--steps=1',
            '--no-weights',
            '--no-snapshots',
            'coco',
>           'tests/test-data/coco',
        ])

tests/bin/test_train.py:44: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
keras_retinanet/bin/train.py:516: in main
    validation_data=validation_generator
/usr/lib/python3.7/site-packages/keras/legacy/interfaces.py:91: in wrapper
    return func(*args, **kwargs)
/usr/lib/python3.7/site-packages/keras/engine/training.py:1732: in fit_generator
    initial_epoch=initial_epoch)
/usr/lib/python3.7/site-packages/keras/engine/training_generator.py:260: in fit_generator
    callbacks.on_epoch_end(epoch, epoch_logs)
/usr/lib/python3.7/site-packages/keras/callbacks/callbacks.py:152: in on_epoch_end
    callback.on_epoch_end(epoch, logs)
/usr/lib/python3.7/site-packages/keras/callbacks/callbacks.py:1036: in on_epoch_end
    logs['lr'] = K.get_value(self.model.optimizer.lr)
/usr/lib/python3.7/site-packages/keras/backend/tensorflow_backend.py:2927: in get_value
    return x.numpy()
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <tf.Variable 'Adam/learning_rate:0' shape=() dtype=float32>

    def numpy(self):
      if context.executing_eagerly():
        return self.read_value().numpy()
      raise NotImplementedError(
>         "numpy() is only available when eager execution is enabled.")
E     NotImplementedError: numpy() is only available when eager execution is enabled.

/usr/lib/python3.7/site-packages/tensorflow_core/python/ops/resource_variable_ops.py:579: NotImplementedError

So it looks like we can't just do this for everything blindly.

de-vri-es · 2019-10-14T19:47:53Z

hmm, for some reason I'm now getting that error even without modifications. It may not be related. Besides that error, the test suite is passing with the suggested change added only to the densenet and mobilenet tests.

de-vri-es · 2019-10-14T19:54:26Z

Ah, I see, pytest isn't running each tests in a clean, isolated environment. It appears it already imported the densenet and mobilenet tests, which already called tensorflow.compat.v1.disable_v2_behavior(). So it is that call that's causing the train.py test to fail.

I glanced at pytest --help but I didn't immediately see an option to fork a clean python interpreter for each test.

hgaiser · 2019-10-14T20:02:51Z

Actually I think we had that a few commits ago with python-xdist but I thought at first that that was causing issues so I removed it to test in Travis but then it got merged a bit prematurely.

hgaiser · 2019-10-17T16:34:17Z

I put the forking of tests back, it stops the cumulating of memory usage, but it still doesn't work on travis (it does run fine locally for the record).

Also, on two comparable systems, but one running tensorflow 2 and the other running tensorflow 1.14, the times for test_mobilenet.py are quite different. 100s for tensorflow 2 and 46s for tensorflow 1.14 =\

Disabling eager execution in tensorflow reduces the time to 70s, disabling v2 behavior gives approximately the same time (47s) as 1.14.

Borda · 2019-10-17T18:08:55Z

So what is advantage of using FT 2.x?

hgaiser · 2019-10-24T10:23:52Z

So what is advantage of using FT 2.x?

It's going to be the version of tensorflow many people will use, so we have to support it. But the main advantage of tf 2 is eager execution (which is being annoying at the moment) and the move to tf.keras for their API.

Borda · 2019-10-29T14:20:17Z

according to experimenting in #1137 it is kind of magic circle, you disable v2 behaviour which contains eager execution which is needed for #1123 (comment) and enabling this execution overflow memory and time limit... so we would need to disable the behaviour and replace .numpy() calling...

hgaiser · 2019-10-31T08:24:35Z

Yep you're right :)

I'm hoping keras-team/keras#13476 will get merged soon, then we can just continue with disabling v2 behavior. Seems to me the most easy way forward.

Borda · 2019-10-31T09:58:50Z

Well, the time of being merged and being release can be different... 8-)

hgaiser · 2019-11-04T10:59:29Z

Well, the time of being merged and being release can be different... 8-)

Yes, but merging that will break usage when using tensorflow 2 :) At least now it should work with tensorflow 1.14/1.15.

Borda · 2019-11-04T11:26:04Z

should is not very convincing, I would merge #1137 even it not fully fixed, but it has tests for TF 1.x and 2.x so it is clear what passes and what not...

hgaiser · 2019-11-04T11:29:15Z

There's no point in merging it yet without having the fix in Keras to go with it.

Borda · 2019-11-04T11:40:44Z

let me clarify... what you expect from Keras fix, that the disable_v2_behavior will work fine with .numpy(), right? so then you do not expect any other change on side of this repo... so the mentioned PR after dropping enable_eager_execution is what should be the final solution, right?
Does it also mean that until Keras is fixed you do not allow merge any other PR because there is working CI to check if everything is running correctly?

ThanhNhann · 2019-12-15T16:13:43Z

Excuse me, when I read .travis.yml I see the line
travis_wait coverage run --source keras_retinanet -m py.test keras_retinanet tests --doctest-modules --forked --flake8
when I run, Its says travis_wait: command not found
and I also find travis_wait but don't see, can you explain to me about this.
Thanks!

Borda · 2019-12-15T16:22:20Z

travis_wait is a pure Travis command, which does not work elsewhere...
in the end, it does not have much effect, so you can omit it completely :]
there a work-around in this PR - #1137

ThanhNhann · 2019-12-16T02:02:05Z

Thanks for the quick reply @Borda, when I run the file train.py, I met this problem
File "keras_retinanet/bin/train.py", line 528, in main() File "keras_retinanet/bin/train.py", line 521, in main validation_steps = args.steps_for_validation, AttributeError: 'Namespace' object has no attribute 'steps_for_validation'
I dont see another steps_for_validation in train.py, can you explain for this ?
Thanks

Borda · 2019-12-16T08:19:18Z

You are probably missing a paremeter steps_for_validation while you call the training script... So there may be a bug since it should be mandatory paraneter or with a default value...

ThanhNhann · 2019-12-17T04:52:45Z

When I find the parameter steps_for_validation in the train.py, it's just called but not defined. Do you think this is an omission?

pendex900x · 2019-12-19T05:12:54Z

same error here

/usr/local/lib/python3.6/dist-packages/keras/callbacks/tensorboard_v2.py:92: UserWarning: The TensorBoard callback batch_size argument (for histogram computation) is deprecated with TensorFlow 2.0. It will be ignored.
warnings.warn('The TensorBoard callback batch_size argument '
Traceback (most recent call last):
File "/content/keras-retinanet/keras_retinanet/bin/train.py", line 528, in
main()
File "/content/keras-retinanet/keras_retinanet/bin/train.py", line 521, in main
validation_steps = args.steps_for_validation,
AttributeError: 'Namespace' object has no attribute 'steps_for_validation'

hgaiser · 2019-12-19T09:57:11Z

I just reverted some commits, can you try again?

Note: You still need keras-team/keras#13476 to pass the tests and in case you are using tensorflow 2.1, you also need tensorflow/tensorflow#34870 (or remove Tensorboard).

SalahAdDin · 2020-01-21T12:44:25Z

What's about the next warning?
warnings.warn('The TensorBoard callback batch_size argument '

de-vri-es changed the title ~~Travis tests no longer passes.~~ Travis tests no longer pass. Oct 2, 2019

hgaiser mentioned this issue Oct 14, 2019

fix build in Travis #1137

Closed

hgaiser added bug enhancement labels Nov 7, 2019

SalahAdDin mentioned this issue Jan 21, 2020

UserWarning: The TensorBoard callback batch_size argument (for histogram computation) is deprecated with TensorFlow 2.0. It will be ignored. #1251

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Travis tests no longer pass. #1123

Travis tests no longer pass. #1123

de-vri-es commented Oct 2, 2019

hgaiser commented Oct 2, 2019

de-vri-es commented Oct 2, 2019

anuar12 commented Oct 2, 2019

hgaiser commented Oct 3, 2019

de-vri-es commented Oct 12, 2019

hgaiser commented Oct 14, 2019

hgaiser commented Oct 14, 2019 •

edited

Loading

de-vri-es commented Oct 14, 2019

de-vri-es commented Oct 14, 2019

de-vri-es commented Oct 14, 2019 •

edited

Loading

hgaiser commented Oct 14, 2019

hgaiser commented Oct 17, 2019

Borda commented Oct 17, 2019

hgaiser commented Oct 24, 2019

Borda commented Oct 29, 2019

hgaiser commented Oct 31, 2019

Borda commented Oct 31, 2019 •

edited

Loading

hgaiser commented Nov 4, 2019

Borda commented Nov 4, 2019

hgaiser commented Nov 4, 2019

Borda commented Nov 4, 2019

ThanhNhann commented Dec 15, 2019

Borda commented Dec 15, 2019 •

edited

Loading

ThanhNhann commented Dec 16, 2019

Borda commented Dec 16, 2019

ThanhNhann commented Dec 17, 2019

pendex900x commented Dec 19, 2019

hgaiser commented Dec 19, 2019

SalahAdDin commented Jan 21, 2020

Travis tests no longer pass. #1123

Travis tests no longer pass. #1123

Comments

de-vri-es commented Oct 2, 2019

hgaiser commented Oct 2, 2019

de-vri-es commented Oct 2, 2019

anuar12 commented Oct 2, 2019

hgaiser commented Oct 3, 2019

de-vri-es commented Oct 12, 2019

hgaiser commented Oct 14, 2019

hgaiser commented Oct 14, 2019 • edited Loading

de-vri-es commented Oct 14, 2019

de-vri-es commented Oct 14, 2019

de-vri-es commented Oct 14, 2019 • edited Loading

hgaiser commented Oct 14, 2019

hgaiser commented Oct 17, 2019

Borda commented Oct 17, 2019

hgaiser commented Oct 24, 2019

Borda commented Oct 29, 2019

hgaiser commented Oct 31, 2019

Borda commented Oct 31, 2019 • edited Loading

hgaiser commented Nov 4, 2019

Borda commented Nov 4, 2019

hgaiser commented Nov 4, 2019

Borda commented Nov 4, 2019

ThanhNhann commented Dec 15, 2019

Borda commented Dec 15, 2019 • edited Loading

ThanhNhann commented Dec 16, 2019

Borda commented Dec 16, 2019

ThanhNhann commented Dec 17, 2019

pendex900x commented Dec 19, 2019

hgaiser commented Dec 19, 2019

SalahAdDin commented Jan 21, 2020

hgaiser commented Oct 14, 2019 •

edited

Loading

de-vri-es commented Oct 14, 2019 •

edited

Loading

Borda commented Oct 31, 2019 •

edited

Loading

Borda commented Dec 15, 2019 •

edited

Loading