Disclosure of Invention
The invention provides a domain-adaptive method and system for detecting recording replay attacks, aiming to solve the problem of domain diversity in recording replay attacks. A shared voiceprint feature extraction module is designed: the acoustic features of the speech are input into the shared module to extract shared voiceprint features, which are then input into four sub-classification modules, namely a replay attack detection module, a replay device detection module, a replay environment detection module and a replay speaker detection module. The error gradient of the replay attack detection module is fed back directly to the replay attack detection module and to the shared voiceprint feature extraction module. The error gradients of the replay device detection module, the replay environment detection module and the replay speaker detection module are fed back to their respective modules, and are inverted before being fed back to the shared voiceprint feature extraction module. The method and system thereby enhance the domain adaptivity of the system and improve its replay attack detection capability.
The invention achieves this purpose through the following technical solution:
A domain-adaptive recording replay attack detection method comprises the following steps:
extracting acoustic features from at least one recording region in the recording, wherein the acoustic features comprise Mel-Frequency Cepstral Coefficients (MFCC) or Power-Normalized Cepstral Coefficients (PNCC);
extracting a shared voiceprint feature vector from the acoustic features;
and detecting, from the shared voiceprint feature vector and by a domain-adaptive method, whether the recording is a replayed recording.
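The acoustic feature step above can be illustrated with a minimal, standard-library-only sketch of MFCC computation for a single frame. The frame length, sample rate, number of mel filters and number of coefficients are illustrative assumptions, not values given in this document, and the sketch omits common refinements such as pre-emphasis and liftering.

```python
import math

def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def power_spectrum(frame):
    """Power spectrum (bins 0..N/2) of a Hamming-windowed frame via a naive DFT."""
    n = len(frame)
    win = [s * (0.54 - 0.46 * math.cos(2.0 * math.pi * i / (n - 1)))
           for i, s in enumerate(frame)]
    spec = []
    for k in range(n // 2 + 1):
        re = sum(x * math.cos(2.0 * math.pi * k * i / n) for i, x in enumerate(win))
        im = sum(x * math.sin(2.0 * math.pi * k * i / n) for i, x in enumerate(win))
        spec.append((re * re + im * im) / n)
    return spec

def mel_filterbank(num_filters, n_fft, sample_rate):
    """Triangular filters whose edges are evenly spaced on the mel scale."""
    top = hz_to_mel(sample_rate / 2.0)
    mel_points = [top * i / (num_filters + 1) for i in range(num_filters + 2)]
    bins = [int(math.floor((n_fft + 1) * mel_to_hz(m) / sample_rate))
            for m in mel_points]
    bank = []
    for j in range(1, num_filters + 1):
        left, center, right = bins[j - 1], bins[j], bins[j + 1]
        filt = [0.0] * (n_fft // 2 + 1)
        for k in range(left, center):      # rising edge
            filt[k] = (k - left) / (center - left)
        for k in range(center, right):     # falling edge
            filt[k] = (right - k) / (right - center)
        bank.append(filt)
    return bank

def mfcc_frame(frame, sample_rate=16000, num_filters=10, num_ceps=5):
    """MFCCs of one frame: power spectrum, mel filterbank, log, then DCT-II."""
    spec = power_spectrum(frame)
    bank = mel_filterbank(num_filters, len(frame), sample_rate)
    log_energy = [math.log(max(sum(w * p for w, p in zip(f, spec)), 1e-10))
                  for f in bank]
    return [sum(e * math.cos(math.pi * c * (j + 0.5) / num_filters)
                for j, e in enumerate(log_energy))
            for c in range(num_ceps)]

# Illustrative input: a 64-sample frame of a 1 kHz tone sampled at 16 kHz.
frame = [math.sin(2.0 * math.pi * 1000.0 * t / 16000.0) for t in range(64)]
coeffs = mfcc_frame(frame)
```

In practice the frame would be taken from a recording region and the per-frame coefficient vectors stacked over time before being passed to the shared voiceprint feature module.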
Further, in the detection phase, the shared voiceprint feature vector is also used to detect the corresponding target of at least one domain-adaptive adversarial task associated with the replay attack detection, the domain-adaptive adversarial tasks comprising: a replay device detection task, a replay environment detection task and a replay speaker detection task.
Furthermore, the shared voiceprint feature vector is extracted by a shared voiceprint feature module, whether the recording is replayed is detected by a replay attack detection module, the replay device detection task is performed by a replay device detection module, the replay environment detection task is performed by a replay environment detection module, and the replay speaker detection task is performed by a replay speaker detection module.
Further, the shared voiceprint feature module, the replay attack detection module, the replay device detection module, the replay environment detection module and the replay speaker detection module are each formed of a deep neural network comprising one or a combination of a convolutional neural network (CNN), a recurrent neural network (RNN, LSTM, GRU) and a time-delay neural network (TDNN).
Further, the method also comprises a training method for each module, wherein the weight of the shared voiceprint feature module is $W_f$, the weight of the replay attack detection module is $W_a$, the weight of the replay device detection module is $W_d$, the weight of the replay speaker detection module is $W_s$, and the weight of the replay environment detection module is $W_e$. The training steps of each module are as follows:
S0: inputting the acoustic features of the recording into the shared voiceprint feature module, and extracting a shared voiceprint feature vector;
S1: inputting the shared voiceprint feature vector of S0 into the replay attack detection module, and outputting a classification error $L_a$;
S2: inputting the shared voiceprint feature vector of S0 into the replay device detection module, and outputting a classification error $L_d$;
S3: inputting the shared voiceprint feature vector of S0 into the replay speaker detection module, and outputting a classification error $L_s$;
S4: inputting the shared voiceprint feature vector of S0 into the replay environment detection module, and outputting a classification error $L_e$;
S5: the weights are updated as follows:
where ε is the learning rate, and $\alpha_1$, $\alpha_2$, $\alpha_3$ are the weights of the replay device detection module, the replay speaker detection module and the replay environment detection module, respectively.
S6: steps S0 to S5 are repeated until the modules converge.
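For step S5, one set of update rules consistent with the gradient feedback described above (direct feedback of $L_a$, and inverted feedback of $L_d$, $L_s$ and $L_e$ to the shared module) is the standard gradient-reversal form sketched below. It is offered as a reconstruction under that assumption, using only the symbols defined in this section, and is not a verbatim statement of the formulas of the invention.

```latex
\begin{aligned}
W_a &\leftarrow W_a - \varepsilon \frac{\partial L_a}{\partial W_a} \\
W_d &\leftarrow W_d - \varepsilon \frac{\partial L_d}{\partial W_d} \\
W_s &\leftarrow W_s - \varepsilon \frac{\partial L_s}{\partial W_s} \\
W_e &\leftarrow W_e - \varepsilon \frac{\partial L_e}{\partial W_e} \\
W_f &\leftarrow W_f - \varepsilon \left( \frac{\partial L_a}{\partial W_f}
      - \alpha_1 \frac{\partial L_d}{\partial W_f}
      - \alpha_2 \frac{\partial L_s}{\partial W_f}
      - \alpha_3 \frac{\partial L_e}{\partial W_f} \right)
\end{aligned}
```

Under this form, each detection module descends its own classification error, while the shared module descends $L_a$ and ascends the three domain errors, which pushes the shared voiceprint features to carry little information about the replay device, speaker and environment.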
An embodiment of the invention further provides a domain-adaptive recording replay attack detection system, comprising the following modules:
an acoustic feature extraction module for extracting acoustic features from at least one recording region in the recording;
a shared voiceprint feature extraction module for extracting a shared voiceprint feature vector from the acoustic features;
a detection module for detecting, from the shared voiceprint feature vector, whether the recording is a replay attack.
Further, the detection module is also used to detect at least one domain-adaptive adversarial task associated with the replay attack.
Further, the shared voiceprint feature extraction module and the detection module each further comprise a deep neural network module.
Further, the system also comprises a training module for training the deep neural network modules in the shared voiceprint feature extraction module and the detection module.
The invention has the beneficial effects that:
the invention can solve the performance degradation of recording replay attack detection systems caused by the domain diversity of replay devices, environments and speakers; the robustness of the replay attack detection system can still be ensured when the devices, environments and speakers of the replayed recordings vary across domains.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention will be described in detail below. It is to be understood that the described embodiments are merely exemplary of the invention, and not restrictive of the full scope of the invention. All other embodiments, which can be derived by a person skilled in the art from the examples given herein without any inventive step, are within the scope of the present invention.
Example one
The domain-adaptive replay attack detection method proposed by the present invention is described with reference to Figs. 1 and 2, where Fig. 1 shows a flowchart of the replay attack detection method and Fig. 2 shows a training flowchart of the domain-adaptive replay attack detection method.
In step S101, acoustic features are extracted from at least one recording region in the recording, the acoustic features including Mel-Frequency Cepstral Coefficients (MFCC) or Power-Normalized Cepstral Coefficients (PNCC);
In step S102, a shared voiceprint feature vector is extracted from the acoustic features extracted in step S101;
In step S103, whether the recording is a replay attack is detected from the shared voiceprint feature vector extracted in step S102. In the detection phase, the shared voiceprint feature vector is also used to detect the corresponding target of at least one domain-adaptive adversarial task associated with the replay attack detection, the domain-adaptive adversarial tasks including, but not limited to, a replay device detection task, a replay environment detection task and a replay speaker detection task, and the detection results of all domain-adaptive adversarial tasks are obtained. The shared voiceprint feature vector is extracted by a shared voiceprint feature module, whether the recording is replayed is detected by a replay attack detection module, the replay device detection task is performed by a replay device detection module, the replay environment detection task is performed by a replay environment detection module, and the replay speaker detection task is performed by a replay speaker detection module. The shared voiceprint feature module, the replay attack detection module, the replay device detection module, the replay environment detection module and the replay speaker detection module are each composed of a deep neural network comprising one or a combination of a convolutional neural network (CNN), a recurrent neural network (RNN, LSTM, GRU) and a time-delay neural network (TDNN).
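The flow of steps S102 and S103 can be sketched as one shared module feeding four detection heads. In this minimal standard-library sketch a single linear layer stands in for each deep neural network, and the feature dimension, embedding dimension, class counts and random initial weights are all illustrative assumptions rather than values from this document.

```python
import math
import random

def linear(weights, x):
    """Matrix-vector product; each row of weights produces one output unit."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def init(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
FEAT_DIM, EMB_DIM = 5, 4  # hypothetical sizes

# Shared voiceprint feature module (weight W_f).
W_f = init(EMB_DIM, FEAT_DIM, rng)

# Four detection heads on top of the shared vector; class counts are made up.
heads = {
    "replay_attack": init(2, EMB_DIM, rng),       # W_a: genuine vs. replayed
    "replay_device": init(3, EMB_DIM, rng),       # W_d
    "replay_environment": init(3, EMB_DIM, rng),  # W_e
    "replay_speaker": init(4, EMB_DIM, rng),      # W_s
}

def detect(acoustic_features):
    """Extract the shared vector once, then score every task from it."""
    shared = [math.tanh(v) for v in linear(W_f, acoustic_features)]
    return {name: softmax(linear(W, shared)) for name, W in heads.items()}

out = detect([0.2, -0.1, 0.4, 0.0, 0.3])
```

Each entry of `out` is a probability distribution over that task's classes; the `replay_attack` entry carries the replay decision, while the other three entries are the results of the domain-adaptive adversarial tasks.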
In addition, the method of the present invention further comprises a training method of a shared voiceprint feature module, a replay attack detection module, a replay device detection module, a replay environment detection module and a replay speaker detection module, as shown in fig. 2.
In step S201, a recording sample in the training set, its true replay label and the true labels of the domain-adaptive adversarial tasks are acquired;
In step S202, the weight of the shared voiceprint feature module is $W_f$, the weight of the replay attack detection module is $W_a$, the weight of the replay device detection module is $W_d$, the weight of the replay speaker detection module is $W_s$, and the weight of the replay environment detection module is $W_e$. The acoustic features of the recording are input into the shared voiceprint feature module to extract a shared voiceprint feature vector, and the shared voiceprint feature vector is input into the replay attack detection module, the replay device detection module, the replay speaker detection module and the replay environment detection module to obtain the replay detection result and the detection results of the domain-adaptive adversarial tasks;
In step S203, the replay detection result is compared with the true replay label of the recording, and a detection error $L_a$ is obtained;
In step S204, the parameters of the replay attack detection module are updated by back propagation, the update being: $W_a \leftarrow W_a - \varepsilon \, \partial L_a / \partial W_a$,
wherein ε is the learning rate;
In step S205, the detection results of the domain-adaptive adversarial tasks are compared with the true labels of the respective tasks, and the detection errors $L_d$, $L_s$ and $L_e$ are obtained;
In step S206, the parameters of the domain-adaptive adversarial task detection modules are updated by back propagation, the update being as follows:
wherein ε is the learning rate;
In step S207, the parameters of the shared voiceprint feature module are updated by back-propagating the detection errors of the domain-adaptive adversarial tasks and the replay detection error at the same time, the update being as follows:
where ε is the learning rate, and $\alpha_1$, $\alpha_2$, $\alpha_3$ are the weights of the replay device detection module, the replay speaker detection module and the replay environment detection module, respectively.
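The parameter updates of steps S204, S206 and S207 can be sketched numerically as follows. The sketch assumes the standard gradient-reversal form, in which each detection module follows plain gradient descent on its own error and the shared module subtracts the gradient of $L_a$ while adding back the $\alpha$-weighted gradients of $L_d$, $L_s$ and $L_e$. The function names and all numeric values are hypothetical, and the gradients are supplied by hand rather than computed by back propagation.

```python
def sgd_step(weights, grad, eps):
    """Plain gradient descent, W <- W - eps * dL/dW (assumed for each detection module)."""
    return [w - eps * g for w, g in zip(weights, grad)]

def shared_step(w_f, grad_a, adv_grads, alphas, eps):
    """Shared-module update with the adversarial gradients inverted (assumed form):
    W_f <- W_f - eps * (dL_a/dW_f - sum_k alpha_k * dL_k/dW_f)."""
    updated = []
    for i, w in enumerate(w_f):
        reversed_part = sum(a * g[i] for a, g in zip(alphas, adv_grads))
        updated.append(w - eps * (grad_a[i] - reversed_part))
    return updated

# Hypothetical values for a two-parameter shared module.
eps = 0.1
w_f = [0.5, -0.2]
grad_a = [0.1, 0.3]            # dL_a/dW_f
adv_grads = [[0.2, 0.0],       # dL_d/dW_f (replay device)
             [0.0, 0.1],       # dL_s/dW_f (replay speaker)
             [0.1, 0.1]]       # dL_e/dW_f (replay environment)
alphas = [0.5, 0.5, 1.0]       # alpha_1, alpha_2, alpha_3

new_w_f = shared_step(w_f, grad_a, adv_grads, alphas, eps)
new_w_a = sgd_step([1.0], [0.5], eps)   # a one-parameter detection module
```

With these numbers the first shared parameter moves from 0.5 to about 0.51: the inverted domain gradients (0.2 in total) outweigh the replay gradient (0.1), so the parameter moves in the direction that increases the domain errors.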
In step S208, it is determined whether the modules have converged, whether the number of training iterations has reached the set maximum number of iterations, or whether the module error has reached the set minimum error. If any one of these conditions is satisfied, the training is terminated; otherwise, steps S201 to S208 are repeated.
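The termination test of step S208 can be sketched as below. Treating convergence as "the error changed by less than a tolerance since the previous iteration" is an assumption made for illustration, as are the function names, the default thresholds and the synthetic error curve.

```python
def should_stop(converged, iteration, max_iterations, error, min_error):
    """Return True as soon as any one of the three termination conditions holds."""
    return converged or iteration >= max_iterations or error <= min_error

def train(step_fn, max_iterations=100, min_error=1e-3, tol=1e-6):
    """Repeat step_fn (a stand-in for steps S201 to S207) until should_stop fires."""
    previous = float("inf")
    iteration = 0
    while True:
        error = step_fn(iteration)          # one training pass, returns the error
        iteration += 1
        converged = abs(previous - error) < tol
        if should_stop(converged, iteration, max_iterations, error, min_error):
            return iteration, error
        previous = error

# Synthetic error curve 1/(i+1), used only to exercise the loop.
iterations, final_error = train(lambda i: 1.0 / (i + 1),
                                max_iterations=50, min_error=0.051)
```

Here the loop ends through the minimum-error condition; lowering `max_iterations` below 20 would make the iteration cap fire first instead.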
The domain-adaptive recording replay attack detection method provided by this embodiment of the invention can still ensure the robustness of a recording replay attack detection system under domain diversity of replay devices, environments and speakers.
Example two
The domain-adaptive replay attack detection system proposed by the present invention is described with reference to Fig. 3, which shows the constituent modules of the system. Referring to Fig. 3, the system includes an acoustic feature extraction module 301, a shared voiceprint feature extraction module 302, a detection module 303 and a training module 304.
The acoustic feature extraction module 301 extracts acoustic features from at least one recording region, or from the whole recording, in the recording data;
The shared voiceprint feature extraction module 302 extracts a shared voiceprint feature vector from the acoustic features output by the acoustic feature extraction module 301;
The detection module 303 detects whether the recording is a replayed recording from the shared voiceprint feature vector output by the shared voiceprint feature extraction module 302. The detection module 303 may also detect, from the shared voiceprint feature vector, at least one domain-adaptive adversarial task associated with the replay attack, the adversarial tasks including a replay device detection task, a replay environment detection task and a replay speaker detection task, and obtain the detection results of all the domain-adaptive adversarial tasks.
The training module 304 is used to train the deep neural network modules in the shared voiceprint feature extraction module 302 and the detection module 303. Replay attack detection and the at least one associated domain-adaptive adversarial task are trained simultaneously; for the training steps, refer to Example one above.
The domain-adaptive recording replay attack detection system provided by this embodiment of the invention can still ensure the robustness of the attack detection system under domain diversity of replay devices, environments and speakers.
Those skilled in the art will understand that all or part of the processes of the above method embodiments may be implemented by a program instructing the related hardware. The program may be stored in a computer-readable storage medium and, when executed, may include the processes of the above method embodiments. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), or the like.
The above description covers only specific embodiments of the present invention, but the protection scope of the present invention is not limited thereto. Any change or substitution that a person skilled in the art can readily conceive within the technical scope disclosed by the present invention shall fall within the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims. It should be noted that the technical features described in the above embodiments may be combined in any suitable manner where no contradiction arises; to avoid unnecessary repetition, the possible combinations are not described separately. In addition, the various embodiments of the present invention may also be combined in any manner, and such combinations shall likewise be regarded as disclosed by the present invention as long as they do not depart from its spirit.