I recently ran into a strange error.
Training the dla34 3D model with Keras/TensorFlow on my local machine works normally,
but running the same code on a DGX produces the following error:

 

Traceback (most recent call last):
  File "main_train.py", line 655, in <module>
    history = net_final.fit_generator(
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/keras/engine/training.py", line 1943, in fit_generator
    return self.fit(
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/keras/engine/training.py", line 1214, in fit
    val_logs = self.evaluate(
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/keras/engine/training.py", line 1489, in evaluate
    tmp_logs = self.test_function(iterator)
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/eager/def_function.py", line 889, in __call__
    result = self._call(*args, **kwds)
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/eager/def_function.py", line 956, in _call
    return self._concrete_stateful_fn._call_flat(
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/eager/function.py", line 1960, in _call_flat
    return self._build_call_outputs(self._inference_function.call(
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/eager/function.py", line 591, in call
    outputs = execute.execute(
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/eager/execute.py", line 59, in quick_execute
    tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
tensorflow.python.framework.errors_impl.InvalidArgumentError: side_input shape must be equal to input shape: [2,32,24,24,24] != [2,32,24,576]
     [[node model/base.level2.tree1.tree2.bn2/FusedBatchNormV3 (defined at main_train.py:655) ]] [Op:__inference_test_function_6241]

 

However, I have repeatedly confirmed that the input data is exactly the same,
the model code is identical,
and the package versions differ only slightly.
Yet the error still appears whenever the code runs on the DGX.

 

 

I then tried many ways to debug it,
but could not solve the problem.
What happens is that the Conv3D output feeding FusedBatchNormV3 ends up with one dimension fewer than expected (4-D instead of 5-D),
which is very strange.
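
To make the mismatch in the error message easier to read, here is a small sketch that compares the two reported shapes. The names shape_5d and shape_4d are my own labels (not TensorFlow's), and this only checks the arithmetic; it does not explain why the tensor gets reshaped on the DGX.

```python
from math import prod

# The two shapes copied from the InvalidArgumentError message.
shape_5d = [2, 32, 24, 24, 24]
shape_4d = [2, 32, 24, 576]

# Both shapes hold the same number of elements.
same_count = prod(shape_5d) == prod(shape_4d)

# Merging the last two axes of the 5-D shape (24 * 24 = 576)
# reproduces the 4-D shape exactly.
merged = shape_5d[:-2] + [shape_5d[-2] * shape_5d[-1]]

print(same_count)
print(merged == shape_4d)
```

So no data appears to be lost: the last two spatial axes look as if they were flattened into one on one side of the comparison.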

 


Then I thought of something to try.
In the DGX version of the code, I added:

tf.config.run_functions_eagerly(True)

 

It turned out that...
training then runs fine.
Well, OK.
So if you hit a similarly strange error,
you can try adding

 

tf.config.run_functions_eagerly(True)

 

This turns on eager (debug) execution first,
which may solve the problem.
For your reference.
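
As a minimal sketch of where the call goes, the example below enables eager execution before training a tiny Conv3D + BatchNormalization model. The toy model, the random data, and the names x_train and y_train are hypothetical stand-ins for the real dla34 3D network and its generator; only tf.config.run_functions_eagerly(True) comes from the fix described above.

```python
import numpy as np
import tensorflow as tf

# Make every tf.function (including Keras' internal train/test steps)
# run op by op instead of as a compiled graph. Call this before fit().
tf.config.run_functions_eagerly(True)
eager_enabled = tf.config.functions_run_eagerly()

# Hypothetical toy model standing in for the real 3D network.
inputs = tf.keras.Input(shape=(8, 8, 8, 1))
x = tf.keras.layers.Conv3D(4, 3, padding="same")(inputs)
x = tf.keras.layers.BatchNormalization()(x)
x = tf.keras.layers.GlobalAveragePooling3D()(x)
outputs = tf.keras.layers.Dense(1)(x)
model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="mse")

# Random data, only to exercise the training and validation steps.
x_train = np.random.rand(4, 8, 8, 8, 1).astype("float32")
y_train = np.random.rand(4, 1).astype("float32")

history = model.fit(
    x_train, y_train,
    validation_data=(x_train, y_train),
    epochs=1, batch_size=2, verbose=0,
)
print(eager_enabled)
print(sorted(history.history.keys()))

# Restore the default graph behaviour afterwards.
tf.config.run_functions_eagerly(False)
```

Eager execution is generally slower than graph execution, so this is probably best treated as a diagnostic step or a workaround rather than a permanent setting.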