CN111785274B

Info

Publication number: CN111785274B
Application number: CN202010600719.4A
Authority: CN
Inventors: 王让定; 王冬华; 董理; 严迪群
Original assignee: Ningbo University
Current assignee: Ningbo University
Priority date: 2020-06-28
Filing date: 2020-06-28
Publication date: 2023-12-05
Anticipated expiration: 2040-06-28
Also published as: CN111785274A

Abstract

The application relates to a black-box adversarial sample generation method for a speech recognition system, comprising: S1, initializing a population; S2, calculating the fitness value of each individual in the population; S3, judging whether an individual's fitness value meets the set condition and, if so, obtaining an adversarial sample, otherwise executing S4; S4, performing a selection operation on the individuals in the population to obtain elite individuals; S5, performing a crossover operation on the individuals remaining after the elite individuals are obtained; S6, performing a mutation operation on the remaining individuals; S7, forming the next generation from the mutated remaining individuals together with the elite individuals, and returning to S1. Even when the target speech recognition network itself cannot be obtained, the method can generate adversarial samples for the target speech recognition system using only the speech-to-text function that the network provides, which makes it highly practical.

Description

Translated from Chinese

A black-box adversarial sample generation method for speech recognition systems

Technical field

The present invention relates to the field of speech recognition technology, and in particular to a black-box adversarial sample generation method for speech recognition systems.

Background art

A speech adversarial sample is a sample to which an attacker has purposefully added subtle perturbations to the speech signal. Its main purpose is to make the target speech recognition system misrecognize the sample, or even recognize it as a result specified by the attacker. To the ear, a speech adversarial sample sounds similar to the sample before modification; in effect, however, it causes the speech recognition system to output a wrong recognition result. Existing mainstream attack methods fall into two types according to whether the attacker can use information about the target network: white-box attacks and black-box attacks. A white-box attack assumes that the attacker can obtain the model information of the target speech recognition system (including the network model, its parameters, and so on), whereas a black-box attack assumes that the attacker cannot obtain the model information but can obtain the target speech recognition system's output for an input speech signal.

In current research on attacks against speech recognition systems, Taori uses a genetic algorithm to iteratively search for the optimal individual that satisfies the adversarial sample conditions, while Carlini assumes that the attacker can obtain information about the target speech recognition system, designs an objective function based on the output of the target recognition system, and optimizes that objective function to obtain the optimal adversarial sample. Taori's genetic algorithm uses the output of the last layer of the target speech recognition network in its fitness calculation, so the method cannot be applied to adversarial sample attacks in real scenarios; in addition, it only adds uniformly distributed random perturbations in the mutation operator, which makes the search for the optimal individual expensive in time. Carlini designs the optimization objective from the output of the last layer of the target speech recognition network and the sparse encoding of the target text, and uses gradient backpropagation to optimize the objective repeatedly until the target speech recognition system recognizes the speech sample as the target text. Although the speech samples generated this way do cause the target speech recognition network to misrecognize, the method assumes that the attacker can obtain the parameters of the target network, so it too cannot be applied to real adversarial sample attacks. Furthermore, existing adversarial sample methods are effective against only a single speech recognition system. In short, most existing speech adversarial sample attack methods assume that the attacker can obtain the parameters of the target model, which prevents them from being applied to attacks in the physical world.

Summary of the invention

In view of the above problems, the purpose of the present invention is to provide a black-box adversarial sample generation method for speech recognition systems that can generate speech adversarial samples without assuming that the attacker can obtain information about the target model.

To achieve the above purpose, the technical solution of the present invention is a black-box adversarial sample generation method for speech recognition systems, characterized by comprising:

S1, population initialization;

S2, calculating the fitness value of each individual in the population;

S3, determining whether an individual's fitness value meets the set condition; if so, an adversarial sample is obtained; if not, S4 is executed;

S4, performing a selection operation on the individuals in the population to obtain elite individuals;

S5, performing a crossover operation on the individuals remaining in the population after the elite individuals are obtained;

S6, performing a mutation operation on the remaining individuals;

S7, forming the next generation from the remaining individuals after the mutation operation together with the elite individuals, and returning to S1.
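
The control flow of steps S1 to S7 can be sketched as a small driver loop. This is a minimal sketch under stated assumptions, not the patented implementation: the function name `run_search`, the callback parameters `fitness`, `crossover` and `mutate`, and the `max_generations` cap are all hypothetical (the text gives no generation limit). The initial population from S1 is passed in by the caller, a lower fitness score is treated as better, and the loop returns to the fitness evaluation with each new generation, which is how this sketch reads the return step.

```python
import random
from typing import Callable, List, Optional

Individual = List[float]


def run_search(
    population: List[Individual],          # result of S1, supplied by the caller
    fitness: Callable[[Individual], float],
    crossover: Callable[[Individual, Individual], Individual],
    mutate: Callable[[Individual], Individual],
    topk: int,
    max_generations: int,                  # hypothetical cap, not in the text
    rng: random.Random,
) -> Optional[Individual]:
    for _ in range(max_generations):
        # S2: score every individual once (one black-box query each).
        scored = sorted(((fitness(p), p) for p in population),
                        key=lambda pair: pair[0])
        # S3: a score of 0 means the set condition is met.
        if scored[0][0] == 0:
            return scored[0][1]
        ranked = [p for _, p in scored]
        # S4: keep the best topk individuals untouched.
        elites, rest = ranked[:topk], ranked[topk:]
        # S5 + S6: breed and mutate len(rest) children (needs len(rest) >= 2).
        children = []
        for _ in range(len(rest)):
            parent1, parent2 = rng.sample(rest, 2)
            children.append(mutate(crossover(parent1, parent2)))
        # S7: elites plus mutated children form the next generation.
        population = elites + children
    return None
```

Scoring each individual once per generation matters in the black-box setting, because every fitness evaluation is a query to the target recognizer.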

Further, the individuals of the population in S1 are obtained by the following formula:

p_i = x + r_i × w_i, i = 1, 2, …, N,

where x is the audio sample to be modified, N is the population size, r_i is the random perturbation added to the i-th individual, and w_i is the perturbation weight vector calculated from the audio sample x to be modified;

r_i,j is obtained by the following formula:

r_i,j = R(-b, b), j = 1, 2, …, M,

where M is the number of genes in a single individual, r_i,j is the perturbation added to the j-th gene of the i-th individual, and b is the value range of the perturbation, b = 256; R() is a random function that selects a random integer from the range [-b, b];

w_i,j is obtained by the following formula:

w_i,j = R′(q, q_max), if |x_j - x_j-1| is in the first 50% of Sort(|x_j - x_j-1|),
w_i,j = R′(q_min, q), otherwise,

where w_i,j is the weight of the j-th gene of the i-th individual; q_min, q and q_max bound the value range of the weight, with q_min = 0.6, q = 1 and q_max = 2; R′() is a random function that generates a random number in the interval [q_min, q] or [q, q_max] as the weight of the current gene; |·| is the absolute value function; and Sort() sorts its argument from largest to smallest.

Further, the fitness value in S2 is calculated using the following formula:

WER = Levenshtein(s, t) / H, s = f(p),

where f() is the function of the target speech recognition network, which receives the individual p as input and outputs the corresponding transcription s; Levenshtein(s, t) is the minimum number of single-character edits (insertions, deletions or substitutions) needed to convert the individual's recognized text s into the target text t; and H is the number of characters in the target text t.

Further, the set condition in S3 is that the WER value is 0.

Further, the selection operation in S4 includes sorting the individuals by fitness value in descending order to obtain the sorted population p′ = (p₁′, p₂′, ..., p_N′), and taking the first topk individuals as the elite individuals p′_topk.

Further, the number of remaining individuals in S5 is N-topk, and the crossover operation in S5 includes:

First, two individuals are randomly taken from the N-topk remaining individuals p′_N-topk as parent 1 and parent 2; then a random number vector λ with the same length as an individual's genes is generated, with each element of λ in the range [0,1]; finally, the offspring's values are chosen according to λ, following the rule below:

offspring_i = parent1_i, if λ_i < 0.5,
offspring_i = parent2_i, otherwise.

In the above formula, the value of the i-th gene of the offspring is determined by the i-th value of the random vector λ: if λ_i is less than 0.5, the i-th value of the offspring is taken from the i-th gene parent1_i of parent 1; otherwise, it is taken from the i-th gene parent2_i of parent 2. In this way, N-topk offspring are generated.

Further, the mutation operation in S6 is performed using the following formula:

new_p_i = offspring + r_i × w_i, i = 1, 2, …, N-topk.

Compared with the prior art, the advantages of the present invention are:

When the attacker cannot obtain the target speech recognition network, the method of this application uses only the speech-to-text function provided by that network and, by improving the fitness function and the initialization and mutation operators of the traditional genetic algorithm, successfully generates adversarial samples for the target speech recognition system, which makes the method highly practical.

Brief description of the drawings

Figure 1 is a flow chart of the method of this application.

Detailed description

Embodiments of the present invention are described in detail below, and examples of the embodiments are illustrated in the accompanying drawings, in which the same or similar reference numerals denote the same or similar elements, or elements with the same or similar functions, throughout. The embodiments described below with reference to the drawings are exemplary, serve only to explain the present invention, and are not to be understood as limiting it.

Figure 1 is a flow chart of the method of this application. As shown in the figure, the black-box adversarial sample generation method for speech recognition systems includes the following steps:

S1, population initialization;

Specifically, the detailed process of the method is as follows:

Population initialization:

Assuming that the audio x is the sample to be modified, the individuals in the population are obtained by formula (1):

p_i = x + r_i × w_i, i = 1, 2, …, N (1)

where N is the population size, r_i is the random perturbation added to the i-th individual, obtained from formula (2), and w_i is the perturbation weight vector calculated from the sample x to be modified, obtained from formula (3). When a genetic algorithm is used to generate speech adversarial samples, an individual is a sequence of speech sample values, i.e. a one-dimensional vector, and the number of genes depends on the length of the speech sequence (1 second of audio has 16,000 sampling points, i.e. 16,000 genes). Making the target speech recognition system produce a wrong result by adding random perturbations alone takes a long time, and the algorithm converges slowly. To solve this problem, the present invention proposes a perturbation weight method: greater weight is given to the speech content part, because modifying the sampling points of the speech content more strongly changes the recognition result of the target speech recognition system faster. The perturbation weights are given by formula (3).

r_i,j = R(-b, b), j = 1, 2, …, M (2)

where M is the number of genes in a single individual, r_i,j is the perturbation added to the j-th gene of the i-th individual, and b is the value range of the perturbation, with b = 256 in this embodiment; R() is a random function that selects a random integer from the range [-b, b]. Formula (2) generates the perturbation for each individual, and every individual's perturbation is different, which ensures the diversity of the population.
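
Formula (2) amounts to drawing one independent integer per gene. The following is a minimal sketch, assuming the audio is held as a plain Python list of integer samples; the function name `random_perturbation` and the `rng` parameter are illustrative choices, not names from the patent.

```python
import random


def random_perturbation(num_genes, b=256, rng=None):
    """Return r_i: one random integer in [-b, b] per gene (formula (2))."""
    if rng is None:
        rng = random.Random()
    # random.Random.randint includes both endpoints, matching the range [-b, b].
    return [rng.randint(-b, b) for _ in range(num_genes)]
```

Passing a seeded `random.Random` makes a run repeatable, while calling the function once per individual gives each individual its own perturbation.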

w_i,j = R′(q, q_max), if |x_j - x_j-1| is in the first 50% of Sort(|x_j - x_j-1|),
w_i,j = R′(q_min, q), otherwise (3)

where w_i,j is the weight of the j-th gene of the i-th individual; q_min, q and q_max bound the value range of the weight, with q_min = 0.6, q = 1 and q_max = 2; R′() is a random function that generates a random number in the interval [q_min, q] or [q, q_max] as the weight of the current gene; |·| is the absolute value function; and Sort() sorts its argument from largest to smallest. The perturbation weight method proceeds as follows. First, a first-order difference |x_j - x_j-1| is computed over the audio sample x to be modified, which shortens the gene length to M-1. To keep the number of genes unchanged, the value of the original first gene is inserted at the first position of the differenced vector, giving a vector of length M, {x₁, x₂-x₁, ..., x_M-x_M-1}. Then the absolute value of each element is taken and the elements are sorted from largest to smallest. Finally, the sampling points in the first 50% of the sorted order cover most of the speech content and are therefore given a higher weight, while the sampling points in the last 50% are mostly silent segments and are therefore given a lower weight.
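
The perturbation weight procedure and formula (1) can be sketched as below. This is a sketch under assumptions rather than the patented code: the function names are hypothetical, ties in the sort are broken by sample position, an odd-length signal puts the extra sample in the lower-weight half, and the result is neither rounded nor clipped to a sample range, since the text does not address those details.

```python
import random


def perturbation_weights(x, q_min=0.6, q=1.0, q_max=2.0, rng=None):
    """Weight vector w_i for audio x: large sample-to-sample changes get [q, q_max]."""
    if rng is None:
        rng = random.Random()
    m = len(x)
    # First-order difference with x[0] kept in the first slot, then absolute value.
    diff = [abs(x[0])] + [abs(x[j] - x[j - 1]) for j in range(1, m)]
    # Gene indices ordered by magnitude, largest first.
    order = sorted(range(m), key=lambda j: diff[j], reverse=True)
    top_half = set(order[: m // 2])
    return [
        rng.uniform(q, q_max) if j in top_half else rng.uniform(q_min, q)
        for j in range(m)
    ]


def init_population(x, n, b=256, rng=None):
    """Build N individuals p_i = x + r_i * w_i (formula (1))."""
    if rng is None:
        rng = random.Random()
    population = []
    for _ in range(n):
        w = perturbation_weights(x, rng=rng)
        r = [rng.randint(-b, b) for _ in range(len(x))]
        population.append([x_j + r_j * w_j for x_j, r_j, w_j in zip(x, r, w)])
    return population
```

For a signal such as `[0, 0, 0, 0, 1000, -1000, 1000, -1000]`, the four fast-changing samples at the end land in the higher-weight half, so they receive weights between 1 and 2 while the silent start receives weights between 0.6 and 1.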

Fitness value calculation:

The fitness value is the sole basis for judging whether an individual qualifies as an adversarial sample. The design of the fitness function also determines whether the algorithm can be applied to real attacks. Therefore, so that the algorithm can be applied to real attacks, the present invention uses the word error rate (WER) based on the Levenshtein distance (edit distance) as the fitness value of an individual, as shown in formula (4):

WER = Levenshtein(s, t) / H, s = f(p) (4)

where f() is the function of the target speech recognition network, which receives speech (an individual p) as input and outputs the corresponding transcription s; the Levenshtein(s, t) distance is the minimum number of single-character edits (insertions, deletions or substitutions) needed to convert the individual's recognized text s into the target text t; and H is the number of characters in the target text t. The smaller the WER the better: when the recognition result s of an individual is identical to the target text t, the WER value is 0, which means an adversarial sample has been generated.
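
The fitness computation only needs the recognizer's text output, which the sketch below models as a callable. It is a minimal illustration: `recognize` is a hypothetical stand-in for the target system's speech-to-text function f(), the function names are illustrative, and the target text is assumed to be non-empty so that H is not zero.

```python
def levenshtein(s, t):
    """Minimum number of single-character insertions, deletions or substitutions."""
    prev = list(range(len(t) + 1))          # distances from "" to prefixes of t
    for i, cs in enumerate(s, start=1):
        cur = [i]                           # distance from s[:i] to ""
        for j, ct in enumerate(t, start=1):
            cur.append(min(
                prev[j] + 1,                # delete cs
                cur[j - 1] + 1,             # insert ct
                prev[j - 1] + (cs != ct),   # substitute, or keep if equal
            ))
        prev = cur
    return prev[-1]


def wer_fitness(individual, target, recognize):
    """Character-level WER of formula (4): Levenshtein(f(p), t) / H."""
    s = recognize(individual)               # one black-box query
    return levenshtein(s, target) / len(target)
```

For example, if the recognizer returns "oven" for a target of "open", one substitution over four characters gives a fitness of 0.25, and an exact match gives 0.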

If no adversarial sample is obtained from the fitness value calculation, the selection operation is performed next, as follows:

Sort the individuals by fitness value in descending order to obtain the sorted population p′ = (p₁′, p₂′, ..., p_N′). The first topk individuals are taken as the elite individuals p′_topk, which do not take part in crossover or mutation; crossover and mutation are applied to the remaining N-topk individuals p′_N-topk.
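
Elite selection can be sketched as a sort followed by a split. The text specifies a descending sort by fitness value without stating how a WER, where smaller is better, maps onto that ranking; this sketch assumes the ranking is best first and therefore places the lowest WER first. That reading is an interpretation, and the function name `select_elites` is hypothetical.

```python
def select_elites(population, scores, topk):
    """Split the population into (elites, rest), best individuals first.

    scores[i] is the WER of population[i]; a lower WER is treated as better,
    which is an assumption about how the ranking is meant to work.
    """
    order = sorted(range(len(population)), key=lambda i: scores[i])
    ranked = [population[i] for i in order]
    return ranked[:topk], ranked[topk:]
```

The elites are returned unchanged so that the best candidates found so far cannot be lost to crossover or mutation.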

Crossover operation:

A crossover operation is performed on the remaining N-topk individuals. The crossover rule used in this application is shown below:

Parent 1:   p_i,1  p_i,2  p_i,3  p_i,4  p_i,5  …  p_i,M-4  p_i,M-3  p_i,M-2  p_i,M-1  p_i,M
Parent 2:   p_k,1  p_k,2  p_k,3  p_k,4  p_k,5  …  p_k,M-4  p_k,M-3  p_k,M-2  p_k,M-1  p_k,M
Offspring:  p_i,1  p_i,2  p_i,3  p_k,4  p_k,5  …  p_k,M-4  p_k,M-3  p_i,M-2  p_k,M-1  p_i,M

As shown in the table above, the crossover operator used in the present invention works as follows: first, two individuals are randomly taken from the N-topk individuals p′_N-topk as parent 1 and parent 2; then a random number vector λ with the same length as an individual's genes is generated, with each random number in the range [0,1]; finally, the offspring's values are chosen according to λ, following the rule in formula (5):

offspring_i = parent1_i, if λ_i < 0.5,
offspring_i = parent2_i, otherwise (5)

where the value of the i-th gene of the offspring is determined by the i-th value of the random vector λ: if λ_i is less than 0.5, the i-th value of the offspring is taken from the i-th gene parent1_i of parent 1; otherwise, it is taken from the i-th gene parent2_i of parent 2. In this way, N-topk offspring are generated.
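
The gene-wise rule described here is a uniform crossover, sketched below. The names `uniform_crossover` and `make_offspring` are hypothetical, individuals are assumed to be equal-length lists, and Python's `random()` draws from [0, 1) rather than the closed interval [0,1] in the text, which does not affect the 0.5 threshold in practice.

```python
import random


def uniform_crossover(parent1, parent2, rng=None):
    """Pick each gene from parent1 if lambda_i < 0.5, else from parent2."""
    if rng is None:
        rng = random.Random()
    lam = [rng.random() for _ in parent1]   # one lambda_i per gene
    return [g1 if l < 0.5 else g2 for g1, g2, l in zip(parent1, parent2, lam)]


def make_offspring(rest, rng=None):
    """Produce len(rest) children from randomly paired non-elite individuals."""
    if rng is None:
        rng = random.Random()
    children = []
    for _ in range(len(rest)):
        # sample() picks two distinct individuals, so rest needs at least two.
        parent1, parent2 = rng.sample(rest, 2)
        children.append(uniform_crossover(parent1, parent2, rng))
    return children
```

Because every gene is chosen independently, a child mixes its parents sample by sample instead of swapping contiguous segments, which matches the interleaved pattern in the table.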

Mutation operation:

The N-topk offspring produced by the crossover operation are mutated according to the mutation probability. The mutation operator used in the present invention is shown in formula (6):

new_p_i = offspring + r_i × w_i, i = 1, 2, …, N-topk (6)

The mutation operator used in the present invention replaces the audio x in the initialization operator with the offspring produced by the crossover operation; correspondingly, replacing the variable x in formula (3) with offspring yields the mutation operator used in the present invention. Finally, the N-topk mutated offspring and the topk elite individuals together form the N individuals of the next generation, which continue the calculation as a new population.
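
Mutation and the assembly of the next generation can be sketched as follows. Several details are assumptions: the text mentions a mutation probability without giving its value or saying whether it applies per offspring or per gene, so `mutation_prob` is a hypothetical per-offspring parameter; `weights_fn` is a hypothetical callable standing in for formula (3) evaluated on the offspring; and the function names are illustrative.

```python
import random


def mutate(offspring, weights_fn, b=256, mutation_prob=1.0, rng=None):
    """Formula (6): new_p = offspring + r * w, applied with probability mutation_prob."""
    if rng is None:
        rng = random.Random()
    if rng.random() >= mutation_prob:
        return list(offspring)              # left unchanged this generation
    w = weights_fn(offspring)               # weights recomputed from the offspring
    return [g + rng.randint(-b, b) * w_j for g, w_j in zip(offspring, w)]


def next_generation(elites, offspring, weights_fn, rng=None, **mutate_options):
    """Combine untouched elites with mutated offspring into N individuals."""
    mutated = [mutate(o, weights_fn, rng=rng, **mutate_options) for o in offspring]
    return list(elites) + mutated
```

Recomputing the weights from each offspring, rather than reusing those of the original audio, follows the statement that x in formula (3) is replaced by the offspring.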

Although embodiments of the present invention have been shown and described, those skilled in the art will understand that various changes, modifications, substitutions and variations can be made to these embodiments without departing from the principles and spirit of the invention. The scope of the invention is defined by the claims and their equivalents.