Description
System information
- Have I written custom code (as opposed to using a stock example script provided in TensorFlow): Yes
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Windows 10
- Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device:
- TensorFlow installed from (source or binary): binary
- TensorFlow version (use command below): 2.1.0
- Python version: 3.7.4
- Bazel version (if compiling from source):
- GCC/Compiler version (if compiling from source):
- CUDA/cuDNN version: 10.1/7.6.5
- GPU model and memory: Nvidia 1080Ti
Describe the current behavior
I am training a small network similar to the PNet of MTCNN, using a custom loss that I wrote to check whether training works. After several epochs, the model weights and the loss become 'nan'.
I tested the training procedure one epoch at a time and found that the last epoch that reports a normal loss (0.0814) already outputs a model whose parameters are all 'nan'. I therefore think that, while the reported loss is still normal, something goes wrong in backpropagation and gives the model a 'nan' update.
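The epoch-by-epoch check described above can be sketched roughly as follows. This is only an illustration: `find_non_finite` is a hypothetical helper name, and it works on a plain list of NumPy arrays. With a Keras model, such a list would typically come from `model.get_weights()` after each epoch.

```python
import numpy as np

def find_non_finite(weights):
    """Return the indices of the arrays that contain NaN or Inf values.

    `weights` is a list of NumPy arrays, for example the list returned by
    a Keras model's get_weights() (an assumption about how it is used).
    """
    return [i for i, w in enumerate(weights) if not np.all(np.isfinite(w))]

# Example with stand-in arrays instead of real model weights.
example_weights = [np.ones((3, 3)), np.array([0.5, np.nan]), np.zeros(4)]
bad = find_non_finite(example_weights)
```

Calling this after every epoch shows which epoch first produces non-finite parameters, and in which weight tensor.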
What I have done to rule out some other possibilities:
(1) Check & clean the data:
My data set is:
X: images of shape (12, 12, 3);
Y: the label, the box regression coords and the 6-landmark regression coords concatenated together, of shape (17, ).
The label can be 1, -1, 0 or -2; only labels 1 and 0 contribute to the custom loss I wrote.
The roi & landmark coords all lie in [-1, 1].
The image data is normalized as (x - 127.5) / 128. before being fed into training.
I tried both a TFRecords data pipeline and numpy arrays as the training input.
(2) Add BatchNormalization layers, add L2 regularization to the weights, use Xavier initialization and pick a smaller learning rate (from 0.001 to 0.0001) to avoid problems such as exploding gradients.
(3) Replace the custom loss I wrote myself with 'mse'.
None of these three changes fixed the 'nan' loss.
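For reference, the data checks in (1) could be written along these lines. This is a minimal NumPy sketch, not the code actually used for the report: `preprocess` and `check_dataset` are hypothetical names, and the shapes, label set and coordinate range are taken from the description above.

```python
import numpy as np

def preprocess(images_uint8):
    """Normalize raw pixel values as (x - 127.5) / 128."""
    return (images_uint8.astype(np.float32) - 127.5) / 128.0

def check_dataset(x, y):
    """Return a list of human-readable problems found in (x, y)."""
    problems = []
    if x.shape[1:] != (12, 12, 3):
        problems.append("unexpected image shape")
    if y.ndim != 2 or y.shape[1] != 17:
        problems.append("unexpected target shape")
        return problems  # the column checks below need 17 columns
    if not np.isfinite(x).all():
        problems.append("non-finite values in X")
    if not np.isfinite(y).all():
        problems.append("non-finite values in Y")
    if not np.isin(y[:, 0], [1, 0, -1, -2]).all():
        problems.append("label outside {1, 0, -1, -2}")
    if (np.abs(y[:, 1:]) > 1.0).any():
        problems.append("coordinate outside [-1, 1]")
    return problems

# Example with synthetic data in place of the real data set.
x = preprocess(np.full((4, 12, 12, 3), 255, dtype=np.uint8))
y = np.zeros((4, 17), dtype=np.float32)
y[:, 0] = [1, 0, -1, -2]
problems = check_dataset(x, y)
```

An empty list means that none of these particular checks failed; it does not rule out other data problems.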
Describe the expected behavior
The training procedure should work well.
Standalone code to reproduce the issue
Provide a reproducible test case that is the bare minimum necessary to generate
the problem. If possible, please share a link to Colab/Jupyter/any notebook.
# Imports are not shown in the original report; the tf.keras paths below are assumed.
import tensorflow as tf
from tensorflow.keras import backend as K
from tensorflow.keras.initializers import glorot_normal
from tensorflow.keras.layers import (BatchNormalization, Concatenate, Conv2D, Input,
                                     MaxPooling2D, PReLU, Reshape)
from tensorflow.keras.models import Model
from tensorflow.keras.regularizers import l2


# PNet-like model without BatchNormalization.
def pnet_train1(train_with_landmark = False):
    X = Input(shape = (12, 12, 3), name = 'Pnet_input')

    M = Conv2D(10, 3, strides = 1, padding = 'valid', kernel_initializer = glorot_normal,
               kernel_regularizer = l2(0.00001), name = 'Pnet_conv1')(X)
    M = PReLU(shared_axes = [1, 2], name = 'Pnet_prelu1')(M)
    M = MaxPooling2D(pool_size = 2, name = 'Pnet_maxpool1')(M) # default 'pool_size' is 2!!!

    M = Conv2D(16, 3, strides = 1, padding = 'valid', kernel_initializer = glorot_normal,
               kernel_regularizer = l2(0.00001), name = 'Pnet_conv2')(M)
    M = PReLU(shared_axes= [1, 2], name = 'Pnet_prelu2')(M)

    M = Conv2D(32, 3, strides = 1, padding = 'valid', kernel_initializer = glorot_normal,
               kernel_regularizer = l2(0.00001), name = 'Pnet_conv3')(M)
    M = PReLU(shared_axes= [1, 2], name = 'Pnet_prelu3')(M)

    Classifier_conv = Conv2D(1, 1, activation = 'sigmoid', name = 'Pnet_classifier_conv',
                             kernel_initializer = glorot_normal)(M)
    Bbox_regressor_conv = Conv2D(4, 1, name = 'Pnet_bbox_regressor_conv',
                                 kernel_initializer = glorot_normal)(M)
    Landmark_regressor_conv = Conv2D(12, 1, name = 'Pnet_landmark_regressor_conv',
                                     kernel_initializer = glorot_normal)(M)

    Classifier = Reshape((1, ), name = 'Pnet_classifier')(Classifier_conv)
    Bbox_regressor = Reshape((4, ), name = 'Pnet_bbox_regressor')(Bbox_regressor_conv)

    if train_with_landmark:
        Landmark_regressor = Reshape((12, ), name = 'Pnet_landmark_regressor')(Landmark_regressor_conv)
        Pnet_output = Concatenate()([Classifier, Bbox_regressor, Landmark_regressor])
        model = Model(X, Pnet_output)
    else:
        Pnet_output = Concatenate()([Classifier, Bbox_regressor])
        model = Model(X, Pnet_output)

    return model


# Same model with BatchNormalization after each convolution (see change (2) above).
def pnet_train2(train_with_landmark = False):
    X = Input(shape = (12, 12, 3), name = 'Pnet_input')

    M = Conv2D(10, 3, strides = 1, padding = 'valid', use_bias = False, kernel_initializer = glorot_normal,
               kernel_regularizer = l2(0.00001), name = 'Pnet_conv1')(X)
    M = BatchNormalization(axis = -1, name = 'Pnet_bn1')(M)
    M = PReLU(shared_axes = [1, 2], name = 'Pnet_prelu1')(M)
    M = MaxPooling2D(pool_size = 2, name = 'Pnet_maxpool1')(M) # default 'pool_size' is 2!!!

    M = Conv2D(16, 3, strides = 1, padding = 'valid', use_bias = False, kernel_initializer = glorot_normal,
               kernel_regularizer = l2(0.00001), name = 'Pnet_conv2')(M)
    M = BatchNormalization(axis = -1, name = 'Pnet_bn2')(M)
    M = PReLU(shared_axes= [1, 2], name = 'Pnet_prelu2')(M)

    M = Conv2D(32, 3, strides = 1, padding = 'valid', use_bias = False, kernel_initializer = glorot_normal,
               kernel_regularizer = l2(0.00001), name = 'Pnet_conv3')(M)
    M = BatchNormalization(axis = -1, name = 'Pnet_bn3')(M)
    M = PReLU(shared_axes= [1, 2], name = 'Pnet_prelu3')(M)

    Classifier_conv = Conv2D(1, 1, activation = 'sigmoid', name = 'Pnet_classifier_conv',
                             kernel_initializer = glorot_normal)(M)
    Bbox_regressor_conv = Conv2D(4, 1, name = 'Pnet_bbox_regressor_conv',
                                 kernel_initializer = glorot_normal)(M)
    Landmark_regressor_conv = Conv2D(12, 1, name = 'Pnet_landmark_regressor_conv',
                                     kernel_initializer = glorot_normal)(M)

    Classifier = Reshape((1, ), name = 'Pnet_classifier')(Classifier_conv)
    Bbox_regressor = Reshape((4, ), name = 'Pnet_bbox_regressor')(Bbox_regressor_conv)

    if train_with_landmark:
        Landmark_regressor = Reshape((12, ), name = 'Pnet_landmark_regressor')(Landmark_regressor_conv)
        Pnet_output = Concatenate()([Classifier, Bbox_regressor, Landmark_regressor])
        model = Model(X, Pnet_output)
    else:
        Pnet_output = Concatenate()([Classifier, Bbox_regressor])
        model = Model(X, Pnet_output)

    return model


# Here just check the first classify loss.
def custom_loss(y_true, y_pred):
    zero_index = K.zeros_like(y_true[:, 0])
    ones_index = K.ones_like(y_true[:, 0])

    # Column 0 holds the label (y_true) and the classifier output (y_pred).
    labels = y_true[:, 0]
    class_preds = y_pred[:, 0]
    bi_crossentropy_loss = -labels * K.log(class_preds) - (1 - labels) * K.log(1 - class_preds)

    # Samples with a negative label are masked out; 70% of the valid samples are kept.
    classify_valid_index = tf.where(K.less(y_true[:, 0], 0), zero_index, ones_index)
    classify_keep_num = K.cast(tf.cast(tf.reduce_sum(classify_valid_index), tf.float32) * 0.7, dtype = tf.int32)

    # Keep only the samples with the largest loss.
    classify_loss_sum = bi_crossentropy_loss * tf.cast(classify_valid_index, bi_crossentropy_loss.dtype)
    classify_loss_sum_filtered, _ = tf.nn.top_k(classify_loss_sum, k = classify_keep_num)
    classify_loss = tf.where(K.equal(classify_keep_num, 0), tf.constant(0, dtype = tf.float32),
                             K.mean(classify_loss_sum_filtered))

    loss = classify_loss
    return loss

Other info / logs
Include any logs or source code that would be helpful to
diagnose the problem. If including tracebacks, please include the full
traceback. Large logs and files should be attached.
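One numerical property of the mask-by-multiplication pattern used in `custom_loss` may be worth checking. The sketch below is a plain NumPy illustration, not TensorFlow and not a confirmed diagnosis (the report states that 'mse' also ended in 'nan'). `masked_bce` is a hypothetical re-implementation of just the per-sample part of the loss, and the `eps` clipping is an assumed mitigation, not something the original code does.

```python
import numpy as np

def masked_bce(labels, preds, eps=None):
    """Per-sample binary cross-entropy, multiplied by 0 where label < 0.

    With eps=None the predictions are used as-is, mirroring the pattern in
    custom_loss. With eps set, predictions are clipped to [eps, 1 - eps].
    """
    labels = np.asarray(labels, dtype=np.float64)
    preds = np.asarray(preds, dtype=np.float64)
    if eps is not None:
        preds = np.clip(preds, eps, 1.0 - eps)
    # Suppress NumPy warnings for log(0) and 0 * inf; the values are kept.
    with np.errstate(divide="ignore", invalid="ignore"):
        bce = -labels * np.log(preds) - (1.0 - labels) * np.log(1.0 - preds)
        valid = (labels >= 0).astype(np.float64)
        return bce * valid

# The third sample has an ignored label (-1) and a saturated prediction.
raw = masked_bce([1, 0, -1], [0.9, 0.1, 1.0])
clipped = masked_bce([1, 0, -1], [0.9, 0.1, 1.0], eps=1e-7)
```

In `raw`, the masked sample evaluates to inf * 0, which is 'nan' and not 0, so a single saturated sigmoid output on an ignored sample can contaminate the result. In `clipped`, every entry is finite and the masked entry is 0.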
