tf.keras: Model parameters suddenly updated to 'nan' during back propagation when training #38416

Closed
Assignees: jvishnuvardhan
Labels: TF 2.1 (for tracking issues in 2.1 release), TF 2.2 (Issues related to TF 2.2), comp:keras (Keras related issues), type:bug (Bug)
@TMaysGGS

Description

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): Yes
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Windows 10
  • Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device:
  • TensorFlow installed from (source or binary): binary
  • TensorFlow version (use command below): 2.1.0
  • Python version: 3.7.4
  • Bazel version (if compiling from source):
  • GCC/Compiler version (if compiling from source):
  • CUDA/cuDNN version: 10.1/7.6.5
  • GPU model and memory: Nvidia 1080Ti

Describe the current behavior
I was trying to train a small network similar to the PNet of MTCNN, and I wrote a custom loss to test whether it works. During training, after several epochs, the model weights and the loss became 'nan'.
I ran the training one epoch at a time and found that the last epoch that reports a normal loss (0.0814) also produces a model whose parameters are all 'nan'. I therefore think that, even though the loss is normal, something goes wrong in back propagation and gives the model a 'nan' update.
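The epoch-by-epoch check described above could be automated along these lines. This is only a sketch: `find_non_finite_weights` is a hypothetical helper name, and it assumes the weights are passed as a list of numpy arrays, such as the list returned by `model.get_weights()`.

```python
import numpy as np


def find_non_finite_weights(weights, names=None):
    """Return (name, nan_count, inf_count) for every array holding NaN or inf.

    `weights` is assumed to be a list of numpy arrays, for example the
    list returned by `model.get_weights()`; `names` is optional.
    """
    report = []
    for i, w in enumerate(weights):
        w = np.asarray(w)
        nan_count = int(np.isnan(w).sum())
        inf_count = int(np.isinf(w).sum())
        if nan_count or inf_count:
            name = names[i] if names is not None else f"weight_{i}"
            report.append((name, nan_count, inf_count))
    return report
```

Calling `find_non_finite_weights(model.get_weights())` after each epoch returns an empty list while the model is healthy, so the first non-empty report marks the epoch whose update broke the parameters.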

What I have done to rule out other possibilities:
(1) Checked & cleaned the data:
My data set is:
X: images of shape (12, 12, 3);
Y: the label, box regression coords & 6-landmark regression coords concatenated together, of shape (17, ).
The label can be 1, -1, 0 or -2; only labels 1 and 0 participate in the custom loss I wrote.
The roi & landmark coords all lie in [-1, 1].
The image data is preprocessed as (x - 127.5) / 128. before being fed into training.
I tried both a TFRecords data pipeline & numpy arrays as the training input.
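For reference, the data check in (1) could look roughly like the sketch below. It is not the code used in the issue. The function names are hypothetical, and it assumes the data is held in numpy arrays with the label in column 0 of Y, which matches how `custom_loss` reads `y_true[:, 0]`.

```python
import numpy as np

ALLOWED_LABELS = (1, 0, -1, -2)


def normalize_images(images):
    """Apply the preprocessing from the issue: (x - 127.5) / 128."""
    return (np.asarray(images, dtype=np.float32) - 127.5) / 128.0


def check_dataset(x, y):
    """Return a list of human-readable problems found in (x, y)."""
    x = np.asarray(x)
    y = np.asarray(y)
    problems = []
    if x.shape[1:] != (12, 12, 3):
        problems.append(f"unexpected image shape {x.shape[1:]}")
    if y.ndim != 2 or y.shape[1] != 17:
        problems.append(f"unexpected target shape {y.shape}")
        return problems  # the column checks below need 17 columns
    if not np.isfinite(x).all():
        problems.append("images contain NaN or inf")
    if not np.isfinite(y).all():
        problems.append("targets contain NaN or inf")
    if not np.isin(y[:, 0], ALLOWED_LABELS).all():
        problems.append("labels outside {1, 0, -1, -2}")
    if (np.abs(y[:, 1:]) > 1).any():
        problems.append("box or landmark coords outside [-1, 1]")
    return problems
```

With 8-bit pixel values, `normalize_images` maps the range [0, 255] to roughly [-0.996, 0.996], so the network inputs stay inside [-1, 1].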

(2) Added BatchNormalization layers, added L2 regularization to the weights, used Xavier initialization and picked a smaller learning rate (0.0001 instead of 0.001) to avoid problems like exploding gradients.

(3) Replaced the custom loss with 'mse'.

None of these three changes fixed the 'nan' loss.

Describe the expected behavior
The training procedure should work well.

Standalone code to reproduce the issue
Provide a reproducible test case that is the bare minimum necessary to generate
the problem. If possible, please share a link to Colab/Jupyter/any notebook.

```python
# Imports are assumed; the original snippet omitted them.
import tensorflow as tf
from tensorflow.keras import backend as K
from tensorflow.keras.initializers import glorot_normal
from tensorflow.keras.layers import (BatchNormalization, Concatenate, Conv2D,
                                     Input, MaxPooling2D, PReLU, Reshape)
from tensorflow.keras.models import Model
from tensorflow.keras.regularizers import l2


# PNet-like model without BatchNormalization.
def pnet_train1(train_with_landmark = False):

    X = Input(shape = (12, 12, 3), name = 'Pnet_input')

    M = Conv2D(10, 3, strides = 1, padding = 'valid', kernel_initializer = glorot_normal, kernel_regularizer = l2(0.00001), name = 'Pnet_conv1')(X)
    M = PReLU(shared_axes = [1, 2], name = 'Pnet_prelu1')(M)
    M = MaxPooling2D(pool_size = 2, name = 'Pnet_maxpool1')(M) # default 'pool_size' is 2!!!

    M = Conv2D(16, 3, strides = 1, padding = 'valid', kernel_initializer = glorot_normal, kernel_regularizer = l2(0.00001), name = 'Pnet_conv2')(M)
    M = PReLU(shared_axes= [1, 2], name = 'Pnet_prelu2')(M)

    M = Conv2D(32, 3, strides = 1, padding = 'valid', kernel_initializer = glorot_normal, kernel_regularizer = l2(0.00001), name = 'Pnet_conv3')(M)
    M = PReLU(shared_axes= [1, 2], name = 'Pnet_prelu3')(M)

    Classifier_conv = Conv2D(1, 1, activation = 'sigmoid', name = 'Pnet_classifier_conv', kernel_initializer = glorot_normal)(M)
    Bbox_regressor_conv = Conv2D(4, 1, name = 'Pnet_bbox_regressor_conv', kernel_initializer = glorot_normal)(M)
    Landmark_regressor_conv = Conv2D(12, 1, name = 'Pnet_landmark_regressor_conv', kernel_initializer = glorot_normal)(M)

    Classifier = Reshape((1, ), name = 'Pnet_classifier')(Classifier_conv)
    Bbox_regressor = Reshape((4, ), name = 'Pnet_bbox_regressor')(Bbox_regressor_conv)
    if train_with_landmark:
        Landmark_regressor = Reshape((12, ), name = 'Pnet_landmark_regressor')(Landmark_regressor_conv)
        Pnet_output = Concatenate()([Classifier, Bbox_regressor, Landmark_regressor])
        model = Model(X, Pnet_output)
    else:
        Pnet_output = Concatenate()([Classifier, Bbox_regressor])
        model = Model(X, Pnet_output)

    return model


# Same model with BatchNormalization after each (bias-free) convolution.
def pnet_train2(train_with_landmark = False):

    X = Input(shape = (12, 12, 3), name = 'Pnet_input')

    M = Conv2D(10, 3, strides = 1, padding = 'valid', use_bias = False, kernel_initializer = glorot_normal, kernel_regularizer = l2(0.00001), name = 'Pnet_conv1')(X)
    M = BatchNormalization(axis = -1, name = 'Pnet_bn1')(M)
    M = PReLU(shared_axes = [1, 2], name = 'Pnet_prelu1')(M)
    M = MaxPooling2D(pool_size = 2, name = 'Pnet_maxpool1')(M) # default 'pool_size' is 2!!!

    M = Conv2D(16, 3, strides = 1, padding = 'valid', use_bias = False, kernel_initializer = glorot_normal, kernel_regularizer = l2(0.00001), name = 'Pnet_conv2')(M)
    M = BatchNormalization(axis = -1, name = 'Pnet_bn2')(M)
    M = PReLU(shared_axes= [1, 2], name = 'Pnet_prelu2')(M)

    M = Conv2D(32, 3, strides = 1, padding = 'valid', use_bias = False, kernel_initializer = glorot_normal, kernel_regularizer = l2(0.00001), name = 'Pnet_conv3')(M)
    M = BatchNormalization(axis = -1, name = 'Pnet_bn3')(M)
    M = PReLU(shared_axes= [1, 2], name = 'Pnet_prelu3')(M)

    Classifier_conv = Conv2D(1, 1, activation = 'sigmoid', name = 'Pnet_classifier_conv', kernel_initializer = glorot_normal)(M)
    Bbox_regressor_conv = Conv2D(4, 1, name = 'Pnet_bbox_regressor_conv', kernel_initializer = glorot_normal)(M)
    Landmark_regressor_conv = Conv2D(12, 1, name = 'Pnet_landmark_regressor_conv', kernel_initializer = glorot_normal)(M)

    Classifier = Reshape((1, ), name = 'Pnet_classifier')(Classifier_conv)
    Bbox_regressor = Reshape((4, ), name = 'Pnet_bbox_regressor')(Bbox_regressor_conv)
    if train_with_landmark:
        Landmark_regressor = Reshape((12, ), name = 'Pnet_landmark_regressor')(Landmark_regressor_conv)
        Pnet_output = Concatenate()([Classifier, Bbox_regressor, Landmark_regressor])
        model = Model(X, Pnet_output)
    else:
        Pnet_output = Concatenate()([Classifier, Bbox_regressor])
        model = Model(X, Pnet_output)

    return model


# Here just check the first classify loss.
def custom_loss(y_true, y_pred):

    zero_index = K.zeros_like(y_true[:, 0])
    ones_index = K.ones_like(y_true[:, 0])

    # Binary cross-entropy on the classification column (column 0).
    labels = y_true[:, 0]
    class_preds = y_pred[:, 0]
    bi_crossentropy_loss = -labels * K.log(class_preds) - (1 - labels) * K.log(1 - class_preds)

    # Only samples with a non-negative label (0 or 1) are valid; keep 70% of them.
    classify_valid_index = tf.where(K.less(y_true[:, 0], 0), zero_index, ones_index)
    classify_keep_num = K.cast(tf.cast(tf.reduce_sum(classify_valid_index), tf.float32) * 0.7, dtype = tf.int32)

    # Mask out invalid samples, then average the largest `classify_keep_num` losses.
    classify_loss_sum = bi_crossentropy_loss * tf.cast(classify_valid_index, bi_crossentropy_loss.dtype)
    classify_loss_sum_filtered, _ = tf.nn.top_k(classify_loss_sum, k = classify_keep_num)
    classify_loss = tf.where(K.equal(classify_keep_num, 0), tf.constant(0, dtype = tf.float32), K.mean(classify_loss_sum_filtered))

    loss = classify_loss

    return loss
```
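To make the arithmetic of `custom_loss` easier to inspect outside of TensorFlow, here is a rough numpy re-expression of the same steps. It is a sketch, not the code from the issue: `classify_loss_np` and its `eps` argument are hypothetical additions. It shows one way a non-finite value can arise: when the sigmoid output saturates to exactly 0 or 1 in float32, the unclipped cross-entropy evaluates `0 * log(0)`, which is NaN. Whether this is what happens in the issue is not confirmed, and since the 'mse' run also ends in 'nan', it may well not be the whole story.

```python
import numpy as np


def classify_loss_np(y_true, y_pred, keep_ratio=0.7, eps=None):
    """Numpy sketch of the classification loss (column 0 only).

    With eps=None the cross-entropy is computed exactly as in the issue;
    with a small eps the predictions are first clipped to [eps, 1 - eps].
    """
    labels = np.asarray(y_true)[:, 0].astype(np.float32)
    preds = np.asarray(y_pred)[:, 0].astype(np.float32)
    if eps is not None:
        preds = np.clip(preds, eps, 1.0 - eps)

    # Silence numpy warnings so that log(0) and 0 * inf show up as values.
    with np.errstate(divide="ignore", invalid="ignore"):
        bce = -labels * np.log(preds) - (1.0 - labels) * np.log(1.0 - preds)
        # Labels below 0 (-1 and -2) are masked out; 0 and 1 are kept.
        valid = (labels >= 0).astype(np.float32)
        masked = bce * valid

    keep_num = int(valid.sum() * keep_ratio)
    if keep_num == 0:
        return 0.0

    # Mean of the `keep_num` largest entries; np.sort places NaN last,
    # so after reversing, any NaN ends up among the selected entries.
    top = np.sort(masked)[::-1][:keep_num]
    return float(top.mean())
```

For example, with labels `[1, 0, 1, -1]` and predictions `[1.0, 0.2, 0.9, 0.5]`, the unclipped version returns NaN because of the saturated first prediction, while `eps=1e-7` gives a finite value.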

Other info / logs Include any logs or source code that would be helpful to
diagnose the problem. If including tracebacks, please include the full
traceback. Large logs and files should be attached.
