tensorflow/tensorflow (Public)

tf.keras: Model parameters suddenly updated to 'nan' during back propagation when training #38416

Closed

Labels

TF 2.1 (for tracking issues in 2.1 release), TF 2.2 (Issues related to TF 2.2), comp:keras (Keras related issues), type:bug (Bug)

Description

TMaysGGS opened on Apr 10, 2020

System information

Have I written custom code (as opposed to using a stock example script provided in TensorFlow): Yes
OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Windows 10
Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device:
TensorFlow installed from (source or binary): binary
TensorFlow version (use command below): 2.1.0
Python version: 3.7.4
Bazel version (if compiling from source):
GCC/Compiler version (if compiling from source):
CUDA/cuDNN version: 10.1/7.6.5
GPU model and memory: Nvidia 1080Ti

Describe the current behavior
I was training a small net similar to the PNet of MTCNN, with a custom loss I wrote to test whether it works. After several epochs of training, the model weights and the loss became 'nan'.
I then ran the training one epoch at a time and found that the last epoch reporting a normal loss (0.0814) also produced a model whose parameters were all 'nan'. So I think the loss itself was still normal at that point, but something went wrong in back propagation and the model received a 'nan' update.
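One way to pin down the epoch at which the parameters break is to scan the weight arrays for non-finite values after every epoch. The sketch below is a minimal illustration, assuming the weights are available as a list of NumPy arrays (the form `model.get_weights()` returns in tf.keras). `find_non_finite` and the toy arrays are hypothetical names introduced here, not part of the original report.

```python
import numpy as np


def find_non_finite(weights, names=None):
    """Return (name, nan_count, inf_count) for every array containing NaN or Inf."""
    if names is None:
        # Hypothetical default names; real layer names could be passed instead.
        names = [f"weight_{i}" for i in range(len(weights))]
    report = []
    for name, w in zip(names, weights):
        w = np.asarray(w)
        nan_count = int(np.isnan(w).sum())
        inf_count = int(np.isinf(w).sum())
        if nan_count or inf_count:
            report.append((name, nan_count, inf_count))
    return report


# Toy stand-ins for the arrays a small conv layer would hold (kernel and bias).
healthy = [np.ones((3, 3, 3, 10), dtype=np.float32), np.zeros(10, dtype=np.float32)]
broken = [np.full((3, 3, 3, 10), np.nan, dtype=np.float32), np.zeros(10, dtype=np.float32)]

print(find_non_finite(healthy))  # empty list: nothing suspicious
print(find_non_finite(broken))   # reports the kernel with 270 NaN entries
```

In a real run this check could be called on `model.get_weights()` at the end of each epoch, for example from the `on_epoch_end` hook of a `tf.keras.callbacks.Callback`, so that training stops at the first epoch that produces a non-empty report.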

What I have done to rule out other possibilities:
(1) Checked and cleaned the data.
My data set is:
X: images of shape (12, 12, 3);
Y: the label, box regression coords and 6-landmark regression coords concatenated into a vector of shape (17, ).
The label can be 1, -1, 0 or -2; only samples labeled 1 or 0 take part in the custom loss I wrote.
The roi and landmark coords all lie in [-1, 1].
The image data is scaled as (x - 127.5) / 128. before being fed into training.
I tried both a TFRecords data pipeline and NumPy arrays as the training input.
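The data checks described in (1) can be sketched roughly as follows. This is only an illustration under stated assumptions: the column order of Y (label at index 0, box coords at 1:5, landmark coords at 5:17) is inferred from the description and from the loss indexing `y_true[:, 0]`, and `normalize_images`, `split_targets` and `check_dataset` are hypothetical helper names. The random arrays stand in for the real data set.

```python
import numpy as np


def normalize_images(x_uint8):
    """Apply the (x - 127.5) / 128 scaling described above."""
    return (x_uint8.astype(np.float32) - 127.5) / 128.0


def split_targets(y):
    """Split an (N, 17) target array into label, box and landmark parts.

    The column order is an assumption, not confirmed by the report.
    """
    return y[:, 0], y[:, 1:5], y[:, 5:17]


def check_dataset(x, y):
    """Return a list of problems found; an empty list means none were detected."""
    problems = []
    if not np.isfinite(x).all():
        problems.append("images contain NaN/Inf")
    if not np.isfinite(y).all():
        problems.append("targets contain NaN/Inf")
    labels, boxes, landmarks = split_targets(y)
    if not np.isin(labels, [1, 0, -1, -2]).all():
        problems.append("unexpected label value")
    if (np.abs(boxes) > 1).any() or (np.abs(landmarks) > 1).any():
        problems.append("regression coords outside [-1, 1]")
    return problems


# Random stand-in data with the shapes given in the report.
rng = np.random.default_rng(0)
x = normalize_images(rng.integers(0, 256, size=(8, 12, 12, 3), dtype=np.uint8))
y = np.concatenate(
    [rng.choice([1, 0, -1, -2], size=(8, 1)), rng.uniform(-1, 1, size=(8, 16))],
    axis=1,
).astype(np.float32)

print(x.min(), x.max())     # stays within [-127.5/128, 127.5/128]
print(check_dataset(x, y))  # empty list for this clean toy data
```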

(2) Added BatchNormalization layers, added L2 regularization to the weights, used Xavier initialization and picked a smaller learning rate (from 0.001 to 0.0001) to avoid problems such as exploding gradients.

(3) Replaced my custom loss with 'mse'.

None of these three changes fixed the 'nan' loss.

Describe the expected behavior
Training should run normally, without the loss or the model weights becoming 'nan'.

Standalone code to reproduce the issue

# Imports are not shown in the original post; the ones below are assumed from the names used.
import tensorflow as tf
from tensorflow.keras import backend as K
from tensorflow.keras.initializers import glorot_normal
from tensorflow.keras.layers import (BatchNormalization, Concatenate, Conv2D, Input,
                                     MaxPooling2D, PReLU, Reshape)
from tensorflow.keras.models import Model
from tensorflow.keras.regularizers import l2


# Variant 1: plain PNet-like network without batch normalization.
def pnet_train1(train_with_landmark=False):

    X = Input(shape=(12, 12, 3), name='Pnet_input')

    M = Conv2D(10, 3, strides=1, padding='valid', kernel_initializer=glorot_normal,
               kernel_regularizer=l2(0.00001), name='Pnet_conv1')(X)
    M = PReLU(shared_axes=[1, 2], name='Pnet_prelu1')(M)
    M = MaxPooling2D(pool_size=2, name='Pnet_maxpool1')(M)  # default 'pool_size' is 2!!!

    M = Conv2D(16, 3, strides=1, padding='valid', kernel_initializer=glorot_normal,
               kernel_regularizer=l2(0.00001), name='Pnet_conv2')(M)
    M = PReLU(shared_axes=[1, 2], name='Pnet_prelu2')(M)

    M = Conv2D(32, 3, strides=1, padding='valid', kernel_initializer=glorot_normal,
               kernel_regularizer=l2(0.00001), name='Pnet_conv3')(M)
    M = PReLU(shared_axes=[1, 2], name='Pnet_prelu3')(M)

    # Three 1x1 heads: face/non-face score, box offsets, landmark offsets.
    Classifier_conv = Conv2D(1, 1, activation='sigmoid', name='Pnet_classifier_conv',
                             kernel_initializer=glorot_normal)(M)
    Bbox_regressor_conv = Conv2D(4, 1, name='Pnet_bbox_regressor_conv',
                                 kernel_initializer=glorot_normal)(M)
    Landmark_regressor_conv = Conv2D(12, 1, name='Pnet_landmark_regressor_conv',
                                     kernel_initializer=glorot_normal)(M)

    Classifier = Reshape((1, ), name='Pnet_classifier')(Classifier_conv)
    Bbox_regressor = Reshape((4, ), name='Pnet_bbox_regressor')(Bbox_regressor_conv)
    if train_with_landmark:
        Landmark_regressor = Reshape((12, ), name='Pnet_landmark_regressor')(Landmark_regressor_conv)
        Pnet_output = Concatenate()([Classifier, Bbox_regressor, Landmark_regressor])
        model = Model(X, Pnet_output)
    else:
        Pnet_output = Concatenate()([Classifier, Bbox_regressor])
        model = Model(X, Pnet_output)

    return model


# Variant 2: same network with bias-free convolutions followed by BatchNormalization.
def pnet_train2(train_with_landmark=False):

    X = Input(shape=(12, 12, 3), name='Pnet_input')

    M = Conv2D(10, 3, strides=1, padding='valid', use_bias=False, kernel_initializer=glorot_normal,
               kernel_regularizer=l2(0.00001), name='Pnet_conv1')(X)
    M = BatchNormalization(axis=-1, name='Pnet_bn1')(M)
    M = PReLU(shared_axes=[1, 2], name='Pnet_prelu1')(M)
    M = MaxPooling2D(pool_size=2, name='Pnet_maxpool1')(M)  # default 'pool_size' is 2!!!

    M = Conv2D(16, 3, strides=1, padding='valid', use_bias=False, kernel_initializer=glorot_normal,
               kernel_regularizer=l2(0.00001), name='Pnet_conv2')(M)
    M = BatchNormalization(axis=-1, name='Pnet_bn2')(M)
    M = PReLU(shared_axes=[1, 2], name='Pnet_prelu2')(M)

    M = Conv2D(32, 3, strides=1, padding='valid', use_bias=False, kernel_initializer=glorot_normal,
               kernel_regularizer=l2(0.00001), name='Pnet_conv3')(M)
    M = BatchNormalization(axis=-1, name='Pnet_bn3')(M)
    M = PReLU(shared_axes=[1, 2], name='Pnet_prelu3')(M)

    Classifier_conv = Conv2D(1, 1, activation='sigmoid', name='Pnet_classifier_conv',
                             kernel_initializer=glorot_normal)(M)
    Bbox_regressor_conv = Conv2D(4, 1, name='Pnet_bbox_regressor_conv',
                                 kernel_initializer=glorot_normal)(M)
    Landmark_regressor_conv = Conv2D(12, 1, name='Pnet_landmark_regressor_conv',
                                     kernel_initializer=glorot_normal)(M)

    Classifier = Reshape((1, ), name='Pnet_classifier')(Classifier_conv)
    Bbox_regressor = Reshape((4, ), name='Pnet_bbox_regressor')(Bbox_regressor_conv)
    if train_with_landmark:
        Landmark_regressor = Reshape((12, ), name='Pnet_landmark_regressor')(Landmark_regressor_conv)
        Pnet_output = Concatenate()([Classifier, Bbox_regressor, Landmark_regressor])
        model = Model(X, Pnet_output)
    else:
        Pnet_output = Concatenate()([Classifier, Bbox_regressor])
        model = Model(X, Pnet_output)

    return model


# Here we only check the first (classification) loss.
def custom_loss(y_true, y_pred):

    zero_index = K.zeros_like(y_true[:, 0])
    ones_index = K.ones_like(y_true[:, 0])

    # Column 0 holds the class label (y_true) and the sigmoid score (y_pred).
    labels = y_true[:, 0]
    class_preds = y_pred[:, 0]
    bi_crossentropy_loss = -labels * K.log(class_preds) - (1 - labels) * K.log(1 - class_preds)

    # Samples with a negative label are masked out; 70% of the remaining count is kept.
    classify_valid_index = tf.where(K.less(y_true[:, 0], 0), zero_index, ones_index)
    classify_keep_num = K.cast(tf.cast(tf.reduce_sum(classify_valid_index), tf.float32) * 0.7, dtype=tf.int32)

    # Keep the largest losses (hard examples) and average them.
    classify_loss_sum = bi_crossentropy_loss * tf.cast(classify_valid_index, bi_crossentropy_loss.dtype)
    classify_loss_sum_filtered, _ = tf.nn.top_k(classify_loss_sum, k=classify_keep_num)
    classify_loss = tf.where(K.equal(classify_keep_num, 0), tf.constant(0, dtype=tf.float32), K.mean(classify_loss_sum_filtered))

    loss = classify_loss

    return loss
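The classification term in `custom_loss` takes `K.log` of the raw sigmoid output. As a hedged illustration only (not a confirmed diagnosis, since the 'mse' run described above also ended in 'nan'), the NumPy sketch below mirrors that formula in float32. It shows that a saturated prediction of exactly 1.0 turns the term `0 * log(0)` into 'nan', and that clipping the predictions first keeps every term finite. `bce_unclipped`, `bce_clipped`, `hard_example_mean` and `EPS` are hypothetical names, and clipping is a common safeguard that is not part of the original code.

```python
import numpy as np

EPS = 1e-7  # assumed clip value, same magnitude as the default Keras backend epsilon


def bce_unclipped(labels, preds):
    """Binary cross-entropy written like the formula in custom_loss."""
    with np.errstate(divide="ignore", invalid="ignore"):
        return -labels * np.log(preds) - (1 - labels) * np.log(1 - preds)


def bce_clipped(labels, preds, eps=EPS):
    """Same formula, with predictions first limited to [eps, 1 - eps]."""
    preds = np.clip(preds, eps, 1 - eps)
    return -labels * np.log(preds) - (1 - labels) * np.log(1 - preds)


def hard_example_mean(losses, labels, keep_ratio=0.7):
    """Mean of the largest keep_ratio share of losses over samples with label >= 0."""
    valid = labels >= 0
    keep_num = int(valid.sum() * keep_ratio)
    if keep_num == 0:
        return 0.0
    masked = losses * valid.astype(losses.dtype)
    top = np.sort(masked)[::-1][:keep_num]
    return float(top.mean())


# A logit of 20 saturates the float32 sigmoid to exactly 1.0.
logits = np.array([20.0, -20.0, 0.5], dtype=np.float32)
preds = 1 / (1 + np.exp(-logits))
labels = np.array([1.0, 0.0, 1.0], dtype=np.float32)

raw = bce_unclipped(labels, preds)
safe = bce_clipped(labels, preds)
print(np.isnan(raw).any())       # True: the confident, correct sample gives 0 * log(0)
print(np.isfinite(safe).all())   # True: clipping keeps every term finite
print(hard_example_mean(safe, labels))
```

If this turns out to be relevant, the same idea could be tried in the TensorFlow loss by clipping `class_preds` before taking the logarithm. Whether that resolves the reported behaviour is not established by this issue.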

Metadata

Assignees

jvishnuvardhan
