I recently ran into a strange error.
When I train the dla34 3D model with Keras/TensorFlow on my local machine, training runs normally.
But when I run the same code on a DGX, the following error appears:
Traceback (most recent call last):
  File "main_train.py", line 655, in <module>
    history = net_final.fit_generator(
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/keras/engine/training.py", line 1943, in fit_generator
    return self.fit(
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/keras/engine/training.py", line 1214, in fit
    val_logs = self.evaluate(
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/keras/engine/training.py", line 1489, in evaluate
    tmp_logs = self.test_function(iterator)
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/eager/def_function.py", line 889, in __call__
    result = self._call(*args, **kwds)
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/eager/def_function.py", line 956, in _call
    return self._concrete_stateful_fn._call_flat(
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/eager/function.py", line 1960, in _call_flat
    return self._build_call_outputs(self._inference_function.call(
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/eager/function.py", line 591, in call
    outputs = execute.execute(
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/eager/execute.py", line 59, in quick_execute
    tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
tensorflow.python.framework.errors_impl.InvalidArgumentError: side_input shape must be equal to input shape: [2,32,24,24,24] != [2,32,24,576]
  [[node model/base.level2.tree1.tree2.bn2/FusedBatchNormV3 (defined at main_train.py:655) ]] [Op:__inference_test_function_6241]
I have repeatedly confirmed that:
The input data is exactly the same.
The model code is the same.
The package versions barely differ.
Still, the error only appears when running on the DGX.
I tried many ways to debug it, but could not solve the problem.
What the error says is that the Conv3D output in front of FusedBatchNormV3 ends up with one dimension fewer: [2,32,24,576] (4D) instead of [2,32,24,24,24] (5D).
Very strange.
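A quick check on the two shapes from the error message helps to read it. The sketch below uses only the numbers printed in the traceback; the variable names are mine. It shows that both shapes hold the same number of elements and that the 4D shape is the 5D one with its last two axes merged (24 * 24 = 576). That is consistent with the tensor being flattened somewhere on the DGX run, but it does not tell us where or why.

```python
from math import prod

# Shapes copied from the InvalidArgumentError message.
side_input_shape = [2, 32, 24, 24, 24]  # 5D, what a Conv3D output looks like
input_shape = [2, 32, 24, 576]          # 4D, one dimension fewer

# Both shapes describe the same number of elements.
same_elements = prod(side_input_shape) == prod(input_shape)

# Merging the last two axes of the 5D shape gives exactly the 4D shape.
merged = side_input_shape[:-2] + [side_input_shape[-2] * side_input_shape[-1]]

print("same element count:", same_elements)
print("5D with last two axes merged:", merged)
```

So no data is missing; only the layout of the dimensions differs between the two tensors.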
Then I thought of something to try.
In the DGX version of the code, I added:
tf.config.run_functions_eagerly(True)
And found that...
with this line, training works fine.
Well...
OK.
So if something strange like this goes wrong,
you can try adding
tf.config.run_functions_eagerly(True)
to turn on this debug mode (eager execution) first.
It might solve the problem.
For your reference.
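Since eager execution is slower than graph execution, it may be handy to switch it on only when needed. Here is a minimal sketch of one way to do that. The environment variable DEBUG_EAGER and both function names are my own hypothetical choices, not part of TensorFlow; only tf.config.run_functions_eagerly is the real API, the same call used above.

```python
import os


def eager_debug_requested(env=None) -> bool:
    """Return True when the hypothetical DEBUG_EAGER variable is set to "1"."""
    env = os.environ if env is None else env
    return env.get("DEBUG_EAGER", "0") == "1"


def configure_eager_debug(env=None) -> bool:
    """Turn on eager execution of tf.function code if requested.

    Returns whether eager debugging was enabled.
    """
    enabled = eager_debug_requested(env)
    if enabled:
        # Imported here so the flag check works without loading TensorFlow.
        import tensorflow as tf

        # Run tf.function bodies eagerly instead of as compiled graphs.
        # This is slower, so keep it as a debugging switch.
        tf.config.run_functions_eagerly(True)
    return enabled
```

Call configure_eager_debug() once near the top of the training script, before the model is built and fit, and start the script with DEBUG_EAGER=1 only on the machine that shows the problem.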