CN118865993A - Speech signal noise reduction method, system and device - Google Patents

Speech signal noise reduction method, system and device

Info

Publication number
CN118865993A
Authority
CN
China
Prior art keywords
data
semantic
speech
generate
noise
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202411195163.XA
Other languages
Chinese (zh)
Other versions
CN118865993B (en)
Inventor
周陈
胡长风
唐少雄
杨轲淇
涂胜军
杨云波
夏成刚
雷霆
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hunan Zhongke Youxin Technology Co ltd
Original Assignee
Hunan Zhongke Youxin Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hunan Zhongke Youxin Technology Co ltd
Priority to CN202411195163.XA
Publication of CN118865993A
Application granted
Publication of CN118865993B
Status: Active
Anticipated expiration

Abstract

The present invention relates to the field of speech signal processing, and in particular to a speech signal noise reduction method, system, and device. The method comprises the following steps: performing adaptive framing on the speech signal data to be processed to obtain framed speech signal data; performing semantic embedding space processing on the framed speech signal data to obtain semantic embedding space data; constructing a semantic vector field from the semantic embedding space data to generate speech semantic vector field data; marking speech noise regions in the speech semantic vector field data to generate semantic noise region data; performing noise frame repair according to the semantic noise region data to generate noise-reduced speech signal data; and performing auditory masking spectrum correction on the noise-reduced speech signal data to generate enhanced speech signal data. The invention uses semantic information to guide noise reduction of the speech signal precisely, achieving an excellent noise reduction effect even in complex scenarios such as low signal-to-noise ratio, thereby significantly improving the user experience.

Description

Translated from Chinese
Speech signal noise reduction method, system and device

Technical Field

The present invention relates to the technical field of speech signal processing, and in particular to a speech signal noise reduction method, system and device.

Background Art

With the rapid development of information technology, voice communication and speech recognition are widely used across many fields. Clear transmission and processing of speech signals is crucial in scenarios such as smartphones, smart assistants, and remote conferencing. In practice, however, speech signals are often corrupted by noise originating either from the environment (e.g., traffic or crowd noise) or from the equipment itself (e.g., poor microphone quality or unstable signal transmission). Such noise significantly reduces the clarity and intelligibility of the speech and can even cause errors in the information conveyed. Traditional denoising methods, mainly spectral subtraction, Wiener filtering, and Kalman filtering, can reduce background noise to some extent, but they often suffer from unsatisfactory noise reduction and speech distortion. In particular, when processing low signal-to-noise-ratio audio, they tend to over-suppress weak speech components, making the speech hard to hear and degrading the user experience.
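The over-suppression problem of the traditional methods named above can be seen in a minimal spectral-subtraction sketch (the function and parameter names below are illustrative, not from the patent): subtracting an estimated noise magnitude from the noisy magnitude and clamping the result to a spectral floor is exactly the step that, with too aggressive a floor, erases weak speech.

```python
import numpy as np

def spectral_subtraction(noisy, noise_est, floor=0.01):
    """Classic magnitude-domain spectral subtraction (a sketch).

    noisy, noise_est: 1-D arrays of equal length (one time-domain frame each).
    floor: spectral floor preventing negative magnitudes; too low a floor is
    what over-suppresses weak speech in low-SNR audio.
    """
    spec = np.fft.rfft(noisy)
    mag, phase = np.abs(spec), np.angle(spec)
    noise_mag = np.abs(np.fft.rfft(noise_est))
    # Subtract the noise magnitude, clamped to a fraction of the original.
    clean_mag = np.maximum(mag - noise_mag, floor * mag)
    # Reuse the noisy phase for reconstruction.
    return np.fft.irfft(clean_mag * np.exp(1j * phase), n=len(noisy))
```

With a zero noise estimate the frame passes through unchanged, which makes the method easy to sanity-check.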

Summary of the Invention

In view of the above, the present invention provides a speech signal noise reduction method, system and device to solve at least one of the technical problems described above.

To achieve this object, a speech signal noise reduction method comprises the following steps:

Step S1: performing adaptive framing on the speech signal data to be processed to obtain framed speech signal data; performing fundamental-frequency voiceprint feature fusion on the framed speech signal data to generate initial voiceprint base data;

Step S2: performing acoustic context enhancement on the initial voiceprint base data to generate acoustic feature vector data; using the acoustic feature vector data to perform semantic embedding space processing on the framed speech signal data to obtain semantic embedding space data; performing temporal semantic trajectory processing on the semantic embedding space data to obtain speech semantic trajectory data;

Step S3: constructing a semantic vector field from the speech semantic trajectory data and the semantic embedding space data to generate speech semantic vector field data; marking speech noise regions in the speech semantic vector field data to generate semantic noise region data;

Step S4: performing noise suppression on the speech semantic vector field data using the semantic noise region data to generate noise-reduced semantic vector field data; using the noise-reduced semantic vector field data to identify key semantic noise frames in the speech semantic trajectory data, generating key speech noise frame data; performing noise frame repair according to the key speech noise frame data to generate noise-reduced speech signal data;

Step S5: performing multi-fundamental-frequency spectral decomposition on the noise-reduced speech signal data to generate multi-fundamental speech spectrum data; performing auditory masking spectrum correction on the multi-fundamental speech spectrum data, and reconstructing the speech signal to generate enhanced speech signal data.

The present invention applies adaptive framing to the speech signal to be processed, flexibly adjusting the analysis window according to the dynamic characteristics of the signal and thereby accurately capturing its different frequency components. This framing captures the instantaneous characteristics of the speech signal and helps extract its fundamental frequency features; fusing fundamental-frequency and voiceprint features not only extracts information closely tied to the speaker's identity but also improves robustness to environmental changes. Acoustic context enhancement based on the initial voiceprint base data improves the quality of the acoustic feature vector data by effectively taking contextual information into account and reducing the interference of external noise on the speech signal. Using the acoustic feature vector data to perform semantic embedding space processing on the framed speech signal data maps the speech signal into a high-dimensional semantic space, making its semantic information more explicit. Constructing a semantic vector field from the speech semantic trajectory data and the semantic embedding space data combines acoustic and semantic information. Accurately marking noise regions effectively guides the direction and strategy of noise suppression and improves the precision of the denoising process: suppressing noise in the speech semantic vector field data via the semantic noise region data reduces the noise components of the speech signal and improves its clarity. Using the noise-reduced semantic vector field data to identify key semantic noise frames in the speech semantic trajectory data pinpoints the noise frames with the greatest impact on speech understanding, and repairing those frames significantly improves the audibility and intelligibility of the signal. Multi-fundamental-frequency spectral decomposition splits the noise-reduced speech signal into multiple frequency components, making its frequency-domain characteristics more distinct; auditory masking spectrum correction on the multi-fundamental speech spectrum data further improves spectral quality and removes unnecessary frequency interference. On this basis, speech signal reconstruction generates enhanced speech signal data, effectively converting the processed frequency-domain information into a high-quality time-domain signal and improving overall quality and user experience.
In summary, the speech signal noise reduction method of the present invention maps the semantic information of the speech signal into a high-dimensional space to form semantic trajectory data, performs semantics-guided noise identification on the speech signal according to that trajectory to obtain key noise frames, and, taking the acoustic features of those frames together with the speech trend into account, performs precise denoising and repair. This preserves the semantic information of the speech during enhancement, improves clarity and intelligibility, and effectively avoids the damage caused by over-suppression.

Preferably, step S1 comprises the following steps:

Step S11: performing voice endpoint detection on the speech signal data to be processed and removing silent segments to generate initial speech signal data;

Step S12: extracting the fundamental-frequency period from the initial speech signal data to obtain fundamental-frequency contour feature data;

Step S13: performing adaptive framing on the initial speech signal data to obtain framed speech signal data;

Step S14: applying a fast Fourier transform to the framed speech signal data and computing Mel-frequency cepstral coefficients with a Mel filter bank to generate a Mel spectral feature matrix;

Step S15: performing dynamic time warping on the fundamental-frequency contour feature data using the Mel spectral feature matrix, and fusing the fundamental-frequency and voiceprint features to obtain fused fundamental-frequency feature data;

Step S16: estimating the voiceprint distribution probability from the fused fundamental-frequency feature data and applying Gaussian blur generalization to generate the initial voiceprint base data.
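The fundamental-frequency period extraction of step S12 can be realized in several ways; one common choice, sketched here under that assumption (the patent does not prescribe an algorithm), is frame-level autocorrelation: the lag of the strongest autocorrelation peak within the admissible pitch range gives the period.

```python
import numpy as np

def f0_autocorr(frame, sr, fmin=60.0, fmax=400.0):
    """Estimate the fundamental frequency of one frame by autocorrelation.

    A plausible realization of step S12's "fundamental frequency period
    extraction"; fmin/fmax bound the admissible pitch range in Hz.
    """
    frame = frame - frame.mean()                       # remove DC offset
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = int(sr / fmax), int(sr / fmin)            # admissible pitch lags
    lag = lo + int(np.argmax(ac[lo:hi]))               # strongest periodicity
    return sr / lag
```

On a pure 200 Hz tone sampled at 8 kHz the strongest in-range peak falls at a lag of 40 samples, recovering 200 Hz.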

Voice endpoint detection on the speech signal data to be processed accurately identifies the start and end positions of speech, effectively removing silent segments. Fundamental-frequency period extraction from the initial speech signal data recovers the signal's basic frequency information. Adaptive framing divides the continuous speech signal into short-time frames, capturing fine-grained changes in the speech while adapting to varying speaking rates across scenarios, which improves the flexibility and effectiveness of processing. The fast Fourier transform converts the framed time-domain signal into the frequency domain, revealing its spectral characteristics; computing Mel-frequency cepstral coefficients (MFCC) with a Mel filter bank yields a Mel spectral feature matrix that more accurately reflects human perception of different frequencies, improving the representativeness and effectiveness of the speech features. Dynamic time warping of the fundamental-frequency contour features against the Mel spectral feature matrix aligns temporal differences between speech samples, ensuring consistent feature matching even when speaking rates differ. Finally, Gaussian blur generalization reduces the influence of noise on the feature data and improves the robustness and generalization ability of the voiceprint.

Preferably, step S2 comprises the following steps:

Step S21: performing acoustic context enhancement on the initial voiceprint base data to generate acoustic feature vector data;

Step S22: using a preset speech recognition model, guided by the acoustic feature vector data, to perform frame-by-frame speech recognition on the framed speech signal data, obtaining frame-level speech text data;

Step S23: performing semantic embedding space processing on the frame-level speech text data to obtain semantic embedding space data;

Step S24: mapping the framed speech signal data into the semantic space using the semantic embedding space data to generate per-frame speech semantic coordinate data;

Step S25: performing temporal semantic trajectory processing on the per-frame speech semantic coordinate data to obtain speech semantic trajectory data.
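The data flow of steps S24 and S25 can be sketched in miniature: once each frame has a coordinate in the embedding space, the temporal trajectory is just the ordered sequence of those coordinates, and its frame-to-frame displacements describe the semantic dynamics. This is a toy reading under stated assumptions; real embeddings would come from the recognition and embedding pipeline, not the placeholder vectors used here.

```python
import numpy as np

def semantic_trajectory(frame_embeddings):
    """Treat per-frame embeddings as semantic coordinates (S24) and derive a
    temporal trajectory (S25) as consecutive displacements and their norms."""
    coords = np.asarray(frame_embeddings, dtype=float)
    steps = np.diff(coords, axis=0)          # frame-to-frame semantic motion
    speed = np.linalg.norm(steps, axis=1)    # magnitude of semantic change
    return coords, steps, speed
```

Large `speed` values flag frames where the semantic content jumps abruptly, which is the kind of discontinuity the later noise-marking steps exploit.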

Acoustic context enhancement based on the initial voiceprint base data effectively improves the quality and accuracy of the acoustic feature vector data, makes the extracted features more robust to environmental changes, and enhances the expressiveness of the speech information in the signal. Frame-by-frame recognition with a preset speech recognition model achieves accurate speech-to-text conversion, ensures that the content of every frame is effectively captured, and reduces recognition errors caused by environmental noise. Semantic embedding space processing maps the frame-level text into a high-dimensional semantic space that better reflects the semantic characteristics of the speech content, improving the accuracy and depth of speech understanding. Mapping the framed speech signal data into the semantic space represents the semantic information of each frame as coordinates, making the semantic relationships between frames more intuitive. Temporal semantic trajectory processing of the per-frame semantic coordinates reflects the semantic dynamics of the speech signal over time and reveals how its semantic content evolves, capturing not only the meaning of individual words but also the logical structure and semantic flow of the entire utterance.

Preferably, step S23 comprises the following steps:

Step S231: converting the frame-level speech text data into high-dimensional semantic vectors to generate original semantic vector data;

Step S232: constructing a high-dimensional manifold structure from the original semantic vector data to obtain semantic manifold structure data;

Step S233: identifying core semantic factors in the semantic manifold structure data using the frame-level speech text data to generate core semantic factor data;

Step S234: performing contextual semantic embedding matching on the frame-level speech text data against a preset speech corpus, and generating semantic corpus entries with a generative adversarial network to obtain diversified corpus data;

Step S235: decoupling semantic features from the diversified corpus data and building a semantic association network to generate semantic relationship network data;

Step S236: dynamically fusing the semantic manifold structure data with the semantic relationship network data to obtain dynamic semantic embedding data;

Step S237: calibrating the semantic scale of the dynamic semantic embedding data and building a fast KD-tree index from the core semantic factor data, thereby obtaining the semantic embedding space data.
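The "KD tree fast index" of step S237 can be illustrated with SciPy's `KDTree` (the embedding values below are random stand-ins for the patent's calibrated semantic embeddings): after building the index once, nearest-neighbour lookups in the embedding space take logarithmic rather than linear time.

```python
import numpy as np
from scipy.spatial import KDTree

# Random stand-ins: 1000 "frames" embedded in an 8-D semantic space.
rng = np.random.default_rng(42)
embeddings = rng.standard_normal((1000, 8))

index = KDTree(embeddings)            # build the KD-tree index once
query = embeddings[17] + 1e-4         # a point very near a known frame
dist, idx = index.query(query, k=3)   # 3 nearest semantic neighbours
```

`idx[0]` returns the frame the query was derived from, demonstrating the retrieval that later steps use to relate a frame's semantics to its neighbours.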

Converting text into high-dimensional vectors makes the expression of semantic information richer and more comprehensive: high-dimensional semantic vectors capture subtle differences and latent semantic relationships in the text, and the constructed manifold effectively represents the relationships and distribution of those vectors. Identifying core semantic factors extracts the key factors from the complex semantic manifold, clarifying the main information and topics in the speech signal and thereby improving recognition efficiency and accuracy. Context matching generates a diversified corpus related to the speech content, enriching the background information available for processing, while decoupling semantic features clarifies the independence of, and relationships between, different semantic components. Dynamic semantic fusion of the manifold structure with the semantic relationship network adjusts and optimizes the semantic representation so that the embedding reflects both the immediate semantics of the current signal and the historical context and likely semantic development, improving accuracy and adaptability. Finally, semantic scale calibration standardizes the dynamic semantic embedding data so that different semantic features can be compared and retrieved on a unified scale, and the KD-tree provides efficient data indexing and querying.

Preferably, step S3 comprises the following steps:

Step S31: constructing a semantic vector field from the speech semantic trajectory data and the semantic embedding space data to generate speech semantic vector field data;

Step S32: performing grid tensor feature analysis on the speech semantic vector field data to generate grid semantic tensor data; tracing semantic vector streamlines through the speech semantic vector field data using the grid semantic tensor data to obtain semantic vector streamline data;

Step S33: evaluating streamline stability from the semantic vector streamline data to generate streamline stability evaluation data;

Step S34: performing semantic vortex detection on the speech semantic vector field data based on the semantic vector streamline data to generate speech vortex region data;

Step S35: computing the average velocity of the semantic vector streamline data with preset multi-scale sliding windows to generate semantic streamline velocity sequence data;

Step S36: extracting semantic acceleration from the speech vortex region data based on the semantic streamline velocity sequence data, and quantifying vortex intensity to generate vortex intensity data;

Step S37: marking speech noise regions in the speech semantic vector field data using the vortex intensity data, the speech vortex region data, and the streamline stability evaluation data to generate semantic noise region data.
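The vortex detection of step S34 borrows fluid-dynamics language; a minimal stand-in (the patent does not give the formula) is the discrete 2-D curl of a gridded vector field: cells where the curl magnitude is large are rotational and can be flagged as vortex regions.

```python
import numpy as np

def vorticity(vx, vy, spacing=1.0):
    """2-D curl of a gridded vector field: w = dvy/dx - dvx/dy.

    Arrays are indexed [y, x]; a minimal stand-in for the patent's
    "semantic vortex detection" (step S34).
    """
    dvy_dx = np.gradient(vy, spacing, axis=1)
    dvx_dy = np.gradient(vx, spacing, axis=0)
    return dvy_dx - dvx_dy

def mark_vortex_regions(vx, vy, thresh=1.0):
    """Flag grid cells whose rotation exceeds the threshold."""
    return np.abs(vorticity(vx, vy)) > thresh
```

A rigid rotation (vx = -y, vy = x) has constant curl 2 everywhere, while a uniform field has curl 0, so the two cases cleanly separate under any threshold between 0 and 2.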

Constructing the semantic vector field from the speech semantic trajectory data and the semantic embedding space data represents the semantic information of the speech signal as spatial vectors, providing a comprehensive view of the speech content. Grid tensor feature analysis extracts the important features of the vector field and reveals the structure of the speech signal, while streamline tracing shows how semantic information flows and changes through the space, helping to identify the signal's dynamic characteristics. Analyzing the continuity and consistency of the streamlines helps identify interruptions or irregularities caused by noise. Vortex detection identifies complex patterns and potential noise regions in the speech signal, and the speech vortex region data pinpoints the parts of the signal where interference is present. Streamline velocity computation reveals how the signal changes across different time scales; extracting semantic acceleration exposes its trends, and quantifying vortex intensity helps assess the impact of noise on the signal. By jointly considering vortex intensity and streamline stability, the noise regions of the speech signal can be accurately identified and marked.

Preferably, step S31 comprises the following steps:

Step S311: gridding the semantic embedding space data to generate grid semantic space data; computing initial semantic potential energy values over the grid semantic space data to generate initial semantic potential energy data;

Step S312: constructing an initial vector field over the grid semantic space data from the initial semantic potential energy data to obtain initial semantic vector field data;

Step S313: dynamically adjusting the initial semantic vector field data using the speech semantic trajectory data to generate dynamic semantic vector field data; wherein step S313 specifically comprises:

Step S3131: mapping the speech semantic trajectory onto the grid semantic space data to generate semantic trajectory grid sequence data;

Step S3132: culling, from the grid semantic space data, the grids contained in the semantic trajectory grid sequence data to obtain non-trajectory grid data;

Step S3133: adjusting the semantic trajectory vectors of the initial semantic vector field data using the speech semantic trajectory data, based on the semantic trajectory grid sequence data, to generate semantic trajectory vector adjustment data;

Step S3134: simulating potential field diffusion over the initial semantic vector field data using the semantic trajectory vector adjustment data, based on the non-trajectory grid data, and deriving vector adjustment values to generate non-trajectory vector adjustment data;

Step S3135: dynamically adjusting the initial semantic vector field data using the semantic trajectory vector adjustment data and the non-trajectory vector adjustment data to generate dynamic semantic vector field data;

Step S314: smoothing and optimizing the dynamic semantic vector field data to generate the speech semantic vector field data.
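Step S312 derives a vector field from gridded potential energy values; the natural construction, sketched here under the assumption of a gradient field (the patent leaves the exact rule unstated), is the negative gradient v = -∇φ, so vectors point "downhill" toward semantic potential minima. The quadratic-bowl potential below is purely illustrative.

```python
import numpy as np

def potential_to_vector_field(phi, spacing=1.0):
    """Build an initial vector field (cf. step S312) as v = -grad(phi)
    on a regular grid; arrays are indexed [y, x]."""
    gy, gx = np.gradient(phi, spacing)   # d(phi)/dy, d(phi)/dx on the grid
    return -gx, -gy                      # field points toward lower potential

# Illustrative quadratic bowl centred on the grid: all vectors should
# point back toward the centre.
y, x = np.mgrid[-4:5, -4:5].astype(float)
phi = x**2 + y**2
vx, vy = potential_to_vector_field(phi)
```

For this potential the analytic field is vx = -2x, vy = -2y, which the central differences reproduce exactly at interior grid points.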

Dividing the semantic embedding space into grids organizes and represents semantic information more systematically and captures semantic changes at fine granularity. Computing potential energy values yields an initial semantic vector field that reflects how semantic information is distributed in the space, identifying semantic highs and lows and helping to prioritize semantic information. Dynamically adjusting the field with the speech semantic trajectory data reflects changes in the speech signal in real time and improves adaptability to the speech content. Mapping the trajectory onto the grid semantic space makes the path of semantic flow explicit, and culling the trajectory grids from the grid space effectively separates out the regions affected by noise or irrelevant signals. Trajectory-vector and non-trajectory-vector adjustment together optimize the initial field precisely: the former adjusts vectors directly along the main semantic trajectory to ensure accurate semantic expression, while the latter, through potential field diffusion simulation and derivation of adjustment values, reasonably smooths the non-trajectory regions, suppresses noise interference with the vector field, and preserves its overall coherence. After smoothing, the speech semantic vector field data more accurately reflects the true characteristics of the speech signal, free of fluctuations introduced by unwanted noise.

Preferably, step S4 comprises the following steps:

Step S41: performing noise suppression on the speech semantic vector field data using the semantic noise region data, followed by adaptive vector field denoising, to generate noise-reduced semantic vector field data;

Step S42: remapping the speech semantic trajectory onto the noise-reduced semantic vector field data to generate semantic trajectory remapping data;

Step S43: applying dynamic potential energy constraints to the speech semantic vector field data using the semantic trajectory remapping data, and computing trajectory drift potential energy to obtain semantic trajectory drift potential energy data;

Step S44: identifying key semantic noise frames in the speech semantic trajectory data from the semantic trajectory drift potential energy data against a preset noise potential energy threshold, and performing inverse acoustic feature mapping to generate key speech noise frame data;

Step S45: performing noise frame repair on the framed speech signal data using the key speech noise frame data to generate noise-reduced speech signal data.

本发明通过针对语义噪声区域数据进行的噪声抑制处理,矢量场自适应去噪处理根据语音信号的特性和噪声分布自适应调整去噪策略,避免了过度去噪造成的有用信息损失,保证了降噪语义矢量场数据的高质量,直接针对已识别的噪声区域采取措施,有效减小了噪声对语音信号的干扰。通过重映射处理,可以将降噪后的语义矢量场数据与实际的语义轨迹进行对齐,确保语音信号的语义信息能够准确反映其动态变化。通过动态势能约束和轨迹漂移势能计算,量化了语义轨迹在去噪处理后出现的微小偏移或不连续性。通过对比语义轨迹漂移势能数据,准确识别出那些受到噪声严重影响的关键帧,声学特征逆映射则将识别出的噪声影响转化为原始语音信号中的具体帧,为实际信号修复操作建立了直接的联系。通过语音关键噪声帧数据对分帧语音信号数据进行的噪音帧修复处理,直接针对被噪声污染的帧实施修复,不仅减少了噪声干扰,还尽可能地保留了语音信号的原始特征和语义完整性。生成的降噪语音信号数据在保持语音清晰度的同时,大幅度提升了语音质量。The present invention performs noise suppression processing on semantic noise area data, and the vector field adaptive denoising processing adaptively adjusts the denoising strategy according to the characteristics of the speech signal and the noise distribution, thereby avoiding the loss of useful information caused by excessive denoising, ensuring the high quality of the denoised semantic vector field data, taking measures directly for the identified noise area, and effectively reducing the interference of noise on the speech signal. Through remapping processing, the semantic vector field data after denoising can be aligned with the actual semantic trajectory to ensure that the semantic information of the speech signal can accurately reflect its dynamic changes. Through dynamic potential energy constraints and trajectory drift potential energy calculation, the slight offset or discontinuity of the semantic trajectory after denoising is quantified. By comparing the semantic trajectory drift potential energy data, the key frames that are severely affected by noise are accurately identified, and the acoustic feature inverse mapping converts the identified noise impact into specific frames in the original speech signal, establishing a direct connection for the actual signal repair operation. 
The noise frame repair processing performed on the framed speech signal data by the speech key noise frame data directly implements the repair on the noise-contaminated frames, which not only reduces the noise interference, but also retains the original features and semantic integrity of the speech signal as much as possible. The generated noise-reduced speech signal data significantly improves the speech quality while maintaining speech clarity.

优选地,步骤S5包括以下步骤:Preferably, step S5 comprises the following steps:

步骤S51:对降噪语音信号数据进行时频分析,得到降噪语音频谱数据;对语音语义矢量场数据进行频域转化处理,生成语义频谱场数据;Step S51: performing time-frequency analysis on the noise reduction speech signal data to obtain noise reduction speech spectrum data; performing frequency domain conversion processing on the speech semantic vector field data to generate semantic spectrum field data;

步骤S52:根据降噪语音频谱数据进行多基频谱分解处理,生成多基频语音谱数据;Step S52: performing multi-base spectrum decomposition processing according to the noise reduction speech spectrum data to generate multi-base frequency speech spectrum data;

步骤S53:利用语义频谱场数据对多基频语音谱数据进行语义谐波关联分析,生成语义谐波关联矩阵数据;Step S53: using the semantic spectrum field data to perform semantic harmonic association analysis on the multi-fundamental frequency speech spectrum data to generate semantic harmonic association matrix data;

步骤S54:根据语义谐波关联矩阵数据进行谐波语义置信度评估,并进行语音增强掩码处理,生成语音增强引导掩码数据;Step S54: performing harmonic semantic confidence evaluation according to the semantic harmonic association matrix data, and performing speech enhancement mask processing to generate speech enhancement guidance mask data;

步骤S55:通过语音增强引导掩码数据对降噪语音频谱数据进行语义增强处理,并进行听觉掩蔽频谱修正,生成修正语音频谱数据;Step S55: semantically enhancing the noise reduction speech spectrum data through the speech enhancement guide mask data, and performing auditory masking spectrum correction to generate corrected speech spectrum data;

步骤S56:根据修正语音频谱数据进行语音信号重构处理,生成增强语音信号数据。Step S56: Perform speech signal reconstruction processing according to the modified speech spectrum data to generate enhanced speech signal data.

本发明通过时频分析能够有效地将语音信号在时间和频率域中进行表示,揭示出语音的频域特性。通过多基频谱分解,能够将复杂的语音信号分解成多个频率成分,使得每个频率成分的特征更加明显,揭示了语音信号中各个频率成分的细微差异。通过分析语义频谱场与语音谱之间的谐波关系,能够识别出语音信号中的重要频率成分及其与语义的关联性。根据语义谐波关联矩阵数据进行的谐波语义置信度评估,评估了各个频谱成分对于传达语义信息的贡献度,通过语音增强掩码处理生成的语音增强引导掩码数据,有效指明了哪些频谱成分应该被加强或削弱。通过语音增强引导掩码数据对降噪语音频谱数据进行的语义增强处理,针对性地增强了语音信号中携带重要语义信息的频谱成分,结合听觉掩蔽频谱修正,进一步根据人类听觉系统的特性调整频谱,生成的修正语音频谱数据在提高语音清晰度的同时,使得人耳对语音信号的感知更加自然和清晰,确保了增强后的语音信号更符合听觉特性。The present invention can effectively represent the speech signal in the time and frequency domains through time-frequency analysis, revealing the frequency domain characteristics of the speech. Through multi-base spectrum decomposition, the complex speech signal can be decomposed into multiple frequency components, so that the characteristics of each frequency component are more obvious, revealing the subtle differences of each frequency component in the speech signal. By analyzing the harmonic relationship between the semantic spectrum field and the speech spectrum, the important frequency components in the speech signal and their relevance to the semantics can be identified. The harmonic semantic confidence evaluation based on the semantic harmonic association matrix data evaluates the contribution of each spectrum component to conveying semantic information, and the speech enhancement guide mask data generated by the speech enhancement mask processing effectively indicates which spectrum components should be strengthened or weakened. The semantic enhancement processing of the noise-reduced speech spectrum data by the speech enhancement guide mask data specifically enhances the spectrum components carrying important semantic information in the speech signal, combines the auditory masking spectrum correction, and further adjusts the spectrum according to the characteristics of the human auditory system. 
The generated corrected speech spectrum data improves the speech clarity while making the human ear perceive the speech signal more natural and clear, ensuring that the enhanced speech signal is more in line with the auditory characteristics.

优选地,本发明还提供一种语音信号降噪系统,执行如上所述的语音信号降噪方法,该语音信号降噪系统包括:Preferably, the present invention further provides a speech signal denoising system, which executes the speech signal denoising method as described above, and the speech signal denoising system comprises:

自适应分帧模块,用于对待处理语音信号数据进行自适应分帧处理,得到分帧语音信号数据;根据分帧语音信号数据进行基频声纹特征融合,生成初始声纹基底数据;The adaptive framing module is used to perform adaptive framing processing on the voice signal data to be processed to obtain framed voice signal data; perform fundamental frequency voiceprint feature fusion according to the framed voice signal data to generate initial voiceprint base data;

语音语义分析模块,用于根据初始声纹基底数据进行声学上下文增强处理,生成声学特征向量数据;通过声学特征向量数据对分帧语音信号数据进行语义嵌入空间处理,得到语义嵌入空间数据;根据语义嵌入空间数据进行时序语义轨迹处理,得到语音语义轨迹数据;The speech semantic analysis module is used to perform acoustic context enhancement processing based on the initial voiceprint base data to generate acoustic feature vector data; perform semantic embedding space processing on the framed speech signal data through the acoustic feature vector data to obtain semantic embedding space data; perform time-series semantic trajectory processing based on the semantic embedding space data to obtain speech semantic trajectory data;

语音噪声识别模块,用于根据语音语义轨迹数据以及语义嵌入空间数据进行语义矢量场构建,生成语音语义矢量场数据;对语音语义矢量场数据进行语音噪声区域标记,生成语义噪声区域数据;The speech noise recognition module is used to construct a semantic vector field based on the speech semantic trajectory data and the semantic embedding space data to generate speech semantic vector field data; mark the speech noise area of the speech semantic vector field data to generate semantic noise area data;

语音噪声修复模块,用于通过语义噪声区域数据对语音语义矢量场数据进行噪声抑制处理,生成降噪语义矢量场数据;利用降噪语义矢量场数据对语音语义轨迹数据进行关键语义噪声帧识别,生成语音关键噪声帧数据;根据语音关键噪声帧数据进行噪音帧修复处理,生成降噪语音信号数据;The speech noise repair module is used to perform noise suppression processing on the speech semantic vector field data through the semantic noise area data to generate noise reduction semantic vector field data; perform key semantic noise frame recognition on the speech semantic trajectory data using the noise reduction semantic vector field data to generate speech key noise frame data; perform noise frame repair processing according to the speech key noise frame data to generate noise reduction speech signal data;

语音信号增强模块,用于对降噪语音信号数据进行多基频谱分解处理,生成多基频语音谱数据;根据多基频语音谱数据进行听觉掩蔽频谱修正,并进行语音信号重构处理,生成增强语音信号数据。The speech signal enhancement module is used to perform multi-base spectrum decomposition processing on the noise reduction speech signal data to generate multi-base frequency speech spectrum data; perform auditory masking spectrum correction based on the multi-base frequency speech spectrum data, and perform speech signal reconstruction processing to generate enhanced speech signal data.

优选地,本发明还提供了一种语音信号降噪设备,包括:Preferably, the present invention further provides a speech signal noise reduction device, comprising:

存储器,用于存储计算机程序;Memory for storing computer programs;

处理器,用于执行所述计算机程序时实现如权利要求1至8任一项所述的语音信号降噪方法的步骤。A processor, configured to implement the steps of the method for reducing noise of a speech signal as claimed in any one of claims 1 to 8 when executing the computer program.

附图说明BRIEF DESCRIPTION OF THE DRAWINGS

图1为本发明语音信号降噪方法的步骤流程示意图;FIG1 is a schematic diagram of the steps of a method for reducing noise in a speech signal according to the present invention;

图2为图1中步骤S1的详细实施步骤流程示意图;FIG2 is a schematic diagram of a detailed implementation process of step S1 in FIG1 ;

图3为图1中步骤S2的详细实施步骤流程示意图;FIG3 is a schematic diagram of a detailed implementation process of step S2 in FIG1 ;

图4为图1中步骤S4的详细实施步骤流程示意图;FIG4 is a schematic diagram of a detailed implementation process of step S4 in FIG1 ;

本发明目的的实现、功能特点及优点将结合实施例,参照附图做进一步说明。The realization of the purpose, functional features and advantages of the present invention will be further explained in conjunction with embodiments and with reference to the accompanying drawings.

具体实施方式DETAILED DESCRIPTION

下面结合附图对本发明的技术方法进行清楚、完整的描述,显然,所描述的实施例是本发明的一部分实施例,而不是全部的实施例。基于本发明中的实施例,本领域所属的技术人员在没有做出创造性劳动前提下所获得的所有其他实施例,都属于本发明保护的范围。The technical method of the present invention is described clearly and completely below in conjunction with the accompanying drawings. Obviously, the described embodiments are part of the embodiments of the present invention, not all of the embodiments. Based on the embodiments of the present invention, all other embodiments obtained by technicians in this field without creative work are within the scope of protection of the present invention.

此外,附图仅为本发明的示意性图解,并非一定是按比例绘制。图中相同的附图标记表示相同或类似的部分,因而将省略对它们的重复描述。附图中所示的一些方框图是功能实体,不一定必须与物理或逻辑上独立的实体相对应。可以采用软件形式来实现功能实体,或在一个或多个硬件模块或集成电路中实现这些功能实体,或在不同网络和/或处理器方法和/或微控制器方法中实现这些功能实体。In addition, the accompanying drawings are only schematic illustrations of the present invention and are not necessarily drawn to scale. The same reference numerals in the figures represent the same or similar parts, and their repeated description will be omitted. Some of the block diagrams shown in the accompanying drawings are functional entities and do not necessarily correspond to physically or logically independent entities. The functional entities can be implemented in software form, or implemented in one or more hardware modules or integrated circuits, or implemented in different networks and/or processor methods and/or microcontroller methods.

应当理解的是,虽然在这里可能使用了术语“第一”、“第二”等等来描述各个单元,但是这些单元不应当受这些术语限制。使用这些术语仅仅是为了将一个单元与另一个单元进行区分。举例来说,在不背离示例性实施例的范围的情况下,第一单元可以被称为第二单元,并且类似地第二单元可以被称为第一单元。这里所使用的术语“和/或”包括其中一个或更多所列出的相关联项目的任意和所有组合。It should be understood that, although the terms "first", "second", etc. may be used herein to describe various units, these units should not be limited by these terms. These terms are used only to distinguish one unit from another unit. For example, without departing from the scope of the exemplary embodiments, the first unit may be referred to as the second unit, and similarly the second unit may be referred to as the first unit. The term "and/or" used herein includes any and all combinations of one or more of the listed associated items.

为实现上述目的,请参阅图1至图4,本发明提供一种语音信号降噪方法,包括以下步骤:To achieve the above object, referring to FIG. 1 to FIG. 4 , the present invention provides a method for reducing noise of a speech signal, comprising the following steps:

步骤S1:对待处理语音信号数据进行自适应分帧处理,得到分帧语音信号数据;根据分帧语音信号数据进行基频声纹特征融合,生成初始声纹基底数据;Step S1: Adaptively frame the speech signal data to be processed to obtain framed speech signal data; perform fundamental frequency voiceprint feature fusion according to the framed speech signal data to generate initial voiceprint base data;

步骤S2:根据初始声纹基底数据进行声学上下文增强处理,生成声学特征向量数据;通过声学特征向量数据对分帧语音信号数据进行语义嵌入空间处理,得到语义嵌入空间数据;根据语义嵌入空间数据进行时序语义轨迹处理,得到语音语义轨迹数据;Step S2: Perform acoustic context enhancement processing on the initial voiceprint base data to generate acoustic feature vector data; perform semantic embedding space processing on the framed speech signal data through the acoustic feature vector data to obtain semantic embedding space data; perform temporal semantic trajectory processing on the semantic embedding space data to obtain speech semantic trajectory data;

步骤S3:根据语音语义轨迹数据以及语义嵌入空间数据进行语义矢量场构建,生成语音语义矢量场数据;对语音语义矢量场数据进行语音噪声区域标记,生成语义噪声区域数据;Step S3: constructing a semantic vector field according to the speech semantic trajectory data and the semantic embedding space data to generate speech semantic vector field data; marking the speech noise area of the speech semantic vector field data to generate semantic noise area data;

步骤S4:通过语义噪声区域数据对语音语义矢量场数据进行噪声抑制处理,生成降噪语义矢量场数据;利用降噪语义矢量场数据对语音语义轨迹数据进行关键语义噪声帧识别,生成语音关键噪声帧数据;根据语音关键噪声帧数据进行噪音帧修复处理,生成降噪语音信号数据;Step S4: performing noise suppression processing on the speech semantic vector field data through the semantic noise area data to generate noise reduction semantic vector field data; performing key semantic noise frame recognition on the speech semantic trajectory data using the noise reduction semantic vector field data to generate speech key noise frame data; performing noise frame repair processing according to the speech key noise frame data to generate noise reduction speech signal data;

步骤S5:对降噪语音信号数据进行多基频谱分解处理,生成多基频语音谱数据;根据多基频语音谱数据进行听觉掩蔽频谱修正,并进行语音信号重构处理,生成增强语音信号数据。Step S5: performing multi-base frequency spectrum decomposition processing on the noise reduction speech signal data to generate multi-base frequency speech spectrum data; performing auditory masking spectrum correction based on the multi-base frequency speech spectrum data, and performing speech signal reconstruction processing to generate enhanced speech signal data.

本发明实施例中,参考图1所述,为本发明语音信号降噪方法的步骤流程示意图,在本实施例中,所述的语音信号降噪方法包括以下步骤:In the embodiment of the present invention, referring to FIG. 1 , which is a schematic flow chart of the steps of the speech signal noise reduction method of the present invention, in this embodiment, the speech signal noise reduction method comprises the following steps:

步骤S1:对待处理语音信号数据进行自适应分帧处理,得到分帧语音信号数据;根据分帧语音信号数据进行基频声纹特征融合,生成初始声纹基底数据;Step S1: Adaptively frame the speech signal data to be processed to obtain framed speech signal data; perform fundamental frequency voiceprint feature fusion according to the framed speech signal data to generate initial voiceprint base data;

本发明实施例中,对待处理的语音信号数据进行自适应分帧处理,可以使用Hamming窗对语音信号进行分帧,帧长设置为25ms,帧移设置为10ms,以保证相邻帧之间有一定的重叠,从而保留语音信号的连续性。然后,根据分帧语音信号数据进行基频声纹特征融合。首先,利用Praat软件提取每帧语音信号的基频、共振峰等声学参数,构建声纹特征向量。接着,采用动态时间规整(DTW)算法计算相邻帧之间的声纹特征向量距离,并根据距离大小对相邻帧进行加权融合,生成初始声纹基底数据。In an embodiment of the present invention, the voice signal data to be processed is adaptively framed, and the Hamming window can be used to frame the voice signal, with the frame length set to 25ms and the frame shift set to 10ms to ensure a certain overlap between adjacent frames, thereby retaining the continuity of the voice signal. Then, the fundamental frequency voiceprint feature fusion is performed based on the framed voice signal data. First, the Praat software is used to extract the fundamental frequency, resonance peak and other acoustic parameters of each frame of the voice signal to construct a voiceprint feature vector. Next, the dynamic time warping (DTW) algorithm is used to calculate the distance between the voiceprint feature vectors of adjacent frames, and the adjacent frames are weighted fused according to the distance to generate the initial voiceprint base data.
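The adaptive framing described above (25 ms Hamming-windowed frames with a 10 ms shift) can be sketched in pure Python. The 16 kHz sampling rate is an assumption of this sketch; the patent does not fix one:

```python
import math

def hamming(n):
    # Hamming window coefficients of length n
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]

def frame_signal(signal, sample_rate=16000, frame_ms=25, shift_ms=10):
    """Split a signal into overlapping Hamming-windowed frames:
    25 ms frames with a 10 ms shift, as described for step S1."""
    frame_len = int(sample_rate * frame_ms / 1000)  # 400 samples at 16 kHz
    shift = int(sample_rate * shift_ms / 1000)      # 160 samples at 16 kHz
    window = hamming(frame_len)
    return [[s * w for s, w in zip(signal[start:start + frame_len], window)]
            for start in range(0, len(signal) - frame_len + 1, shift)]
```

With the 10 ms shift, adjacent frames share 60% of their samples, which preserves the continuity the text emphasizes.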

步骤S2:根据初始声纹基底数据进行声学上下文增强处理,生成声学特征向量数据;通过声学特征向量数据对分帧语音信号数据进行语义嵌入空间处理,得到语义嵌入空间数据;根据语义嵌入空间数据进行时序语义轨迹处理,得到语音语义轨迹数据;Step S2: Perform acoustic context enhancement processing on the initial voiceprint base data to generate acoustic feature vector data; perform semantic embedding space processing on the framed speech signal data through the acoustic feature vector data to obtain semantic embedding space data; perform temporal semantic trajectory processing on the semantic embedding space data to obtain speech semantic trajectory data;

本发明实施例中,基于声学特征向量数据,利用预设的语音识别模型对分帧语音信号进行逐帧语音识别。可以使用基于混合声学模型和语言模型的语音识别系统,例如基于隐马尔可夫模型(HMM)和N-gram语言模型的语音识别系统,或基于深度神经网络(DNN)的端到端语音识别系统,得到帧级语音文本数据。根据帧级语音文本数据进行语义嵌入空间处理。可以使用预训练的词向量模型,例如Word2Vec、GloVe或BERT,将帧级语音文本数据中的每个词或音素转换为对应的词向量,生成语义嵌入空间数据。In an embodiment of the present invention, based on the acoustic feature vector data, a preset speech recognition model is used to perform frame-by-frame speech recognition on the framed speech signal. A speech recognition system based on a hybrid acoustic model and a language model, such as a speech recognition system based on a hidden Markov model (HMM) and an N-gram language model, or an end-to-end speech recognition system based on a deep neural network (DNN) can be used to obtain frame-level speech text data. Semantic embedding space processing is performed according to the frame-level speech text data. A pre-trained word vector model, such as Word2Vec, GloVe, or BERT, can be used to convert each word or phoneme in the frame-level speech text data into a corresponding word vector to generate semantic embedding space data.
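Once each frame has been embedded (e.g. as a Word2Vec or BERT vector), the temporal semantic trajectory of step S2 can be traced by comparing consecutive embeddings. A minimal illustration; the cosine-similarity measure is an assumption of this sketch, not a formula fixed by the patent:

```python
import math

def cosine_similarity(u, v):
    # Cosine similarity between two embedding vectors
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def semantic_trajectory(frame_embeddings):
    """Hypothetical temporal semantic trajectory: similarity between
    consecutive frame embeddings, where sharp drops suggest semantic
    discontinuities worth inspecting for noise."""
    return [cosine_similarity(a, b)
            for a, b in zip(frame_embeddings, frame_embeddings[1:])]
```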

步骤S3:根据语音语义轨迹数据以及语义嵌入空间数据进行语义矢量场构建,生成语音语义矢量场数据;对语音语义矢量场数据进行语音噪声区域标记,生成语义噪声区域数据;Step S3: constructing a semantic vector field according to the speech semantic trajectory data and the semantic embedding space data to generate speech semantic vector field data; marking the speech noise area of the speech semantic vector field data to generate semantic noise area data;

本发明实施例中,根据语音语义轨迹数据以及语义嵌入空间数据进行语义矢量场构建,可以使用自组织映射(SOM)网络。将语音语义轨迹数据和语义嵌入空间数据作为SOM网络的输入,通过网络的自组织学习过程,将高维的语义信息映射到二维平面空间,生成语音语义矢量场数据。利用语音语义轨迹数据的特征对语音语义矢量场数据进行分析,例如计算语义矢量场的旋度、散度等,识别出语义信息发生突变或不连续的区域,将其标记为潜在的语音噪声区域,生成语义噪声区域数据。In an embodiment of the present invention, a self-organizing map (SOM) network can be used to construct a semantic vector field based on speech semantic trajectory data and semantic embedding space data. The speech semantic trajectory data and semantic embedding space data are used as inputs of the SOM network, and through the self-organizing learning process of the network, high-dimensional semantic information is mapped to a two-dimensional plane space to generate speech semantic vector field data. The speech semantic vector field data is analyzed using the characteristics of the speech semantic trajectory data, such as calculating the curl and divergence of the semantic vector field, identifying areas where semantic information has undergone mutations or discontinuities, marking them as potential speech noise areas, and generating semantic noise area data.
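Computing the divergence of the semantic vector field, as suggested above for spotting discontinuities, reduces to finite differences on the SOM grid. A sketch over a row-major 2-D grid; the grid layout and the threshold value are assumptions of this example:

```python
def divergence(vx, vy):
    """Central-difference divergence of a 2-D vector field given as two
    row-major grids vx[i][j], vy[i][j]; computed at interior points only."""
    rows, cols = len(vx), len(vx[0])
    div = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows - 1):
        for j in range(1, cols - 1):
            div[i][j] = ((vx[i][j + 1] - vx[i][j - 1]) / 2.0
                         + (vy[i + 1][j] - vy[i - 1][j]) / 2.0)
    return div

def mark_noise_cells(vx, vy, threshold):
    # Grid cells with |divergence| above the threshold become
    # candidate semantic-noise regions (step S3).
    return [(i, j)
            for i, row in enumerate(divergence(vx, vy))
            for j, d in enumerate(row) if abs(d) > threshold]
```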

步骤S4:通过语义噪声区域数据对语音语义矢量场数据进行噪声抑制处理,生成降噪语义矢量场数据;利用降噪语义矢量场数据对语音语义轨迹数据进行关键语义噪声帧识别,生成语音关键噪声帧数据;根据语音关键噪声帧数据进行噪音帧修复处理,生成降噪语音信号数据;Step S4: performing noise suppression processing on the speech semantic vector field data through the semantic noise area data to generate noise reduction semantic vector field data; performing key semantic noise frame recognition on the speech semantic trajectory data using the noise reduction semantic vector field data to generate speech key noise frame data; performing noise frame repair processing according to the speech key noise frame data to generate noise reduction speech signal data;

本发明实施例中,利用语义噪声区域数据对语音语义矢量场数据进行噪声抑制处理。可以使用掩码方法,将语义噪声区域的矢量值抑制或置零,生成降噪语义矢量场数据。利用降噪语义矢量场数据对语音语义轨迹数据进行分析,识别出受噪声影响的关键语义噪声帧。例如,可以计算语义轨迹在降噪前后对应位置的语义向量差值,将差值较大的帧标记为关键语义噪声帧,生成语音关键噪声帧数据。根据语音关键噪声帧数据对分帧语音信号数据进行噪音帧修复处理。可以使用基于深度学习的语音生成模型,例如WaveNet或WaveRNN,对关键噪声帧进行修复,生成估计的干净语音帧,替换原始语音信号中对应的帧,生成降噪语音信号数据。In an embodiment of the present invention, the semantic noise area data is used to perform noise suppression processing on the speech semantic vector field data. A masking method can be used to suppress or set the vector value of the semantic noise area to zero to generate noise-reduced semantic vector field data. The speech semantic trajectory data is analyzed using the noise-reduced semantic vector field data to identify key semantic noise frames affected by noise. For example, the difference in semantic vectors at corresponding positions of the semantic trajectory before and after noise reduction can be calculated, and the frames with larger differences can be marked as key semantic noise frames to generate speech key noise frame data. Noise frame repair processing is performed on the framed speech signal data according to the speech key noise frame data. A speech generation model based on deep learning, such as WaveNet or WaveRNN, can be used to repair key noise frames, generate estimated clean speech frames, replace the corresponding frames in the original speech signal, and generate noise-reduced speech signal data.
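The masking-style suppression and the before/after trajectory comparison described here can be sketched directly. The zeroing strategy matches the text; the Euclidean distance for frame comparison is an assumption consistent with "semantic vector difference":

```python
import math

def suppress_noise_regions(field, noise_cells):
    """Zero out vectors inside the marked semantic-noise regions,
    the masking approach described for step S4."""
    clean = [row[:] for row in field]
    for i, j in noise_cells:
        clean[i][j] = 0.0
    return clean

def key_noise_frames(traj_before, traj_after, threshold):
    """Frames whose semantic vector moved farther than `threshold`
    between the original and denoised trajectories."""
    return [idx for idx, (a, b) in enumerate(zip(traj_before, traj_after))
            if math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b))) > threshold]
```

The flagged frame indices would then be handed to a generative model (e.g. WaveNet-style) for waveform repair, which is beyond this sketch.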

步骤S5:对降噪语音信号数据进行多基频谱分解处理,生成多基频语音谱数据;根据多基频语音谱数据进行听觉掩蔽频谱修正,并进行语音信号重构处理,生成增强语音信号数据。Step S5: performing multi-base frequency spectrum decomposition processing on the noise reduction speech signal data to generate multi-base frequency speech spectrum data; performing auditory masking spectrum correction based on the multi-base frequency speech spectrum data, and performing speech signal reconstruction processing to generate enhanced speech signal data.

本发明实施例中,对降噪语音信号数据进行多基频谱分解处理,例如使用经验模态分解(EMD)算法,将其分解成多个不同频率尺度的固有模态函数(IMF),生成多基频语音谱数据。根据多基频语音谱数据进行听觉掩蔽频谱修正。根据人耳的听觉掩蔽效应,对多基频语音谱数据进行修正,例如降低噪声在掩蔽阈值以下的能量,以提高语音的可懂度。将修正后的多基频语音谱数据进行语音信号重构处理。将各个IMF分量叠加,生成增强语音信号数据,完成语音信号降噪。In an embodiment of the present invention, the noise reduction speech signal data is subjected to multi-base spectrum decomposition processing, for example, using the empirical mode decomposition (EMD) algorithm to decompose it into a plurality of intrinsic mode functions (IMFs) of different frequency scales, and generate multi-baseband speech spectrum data. Auditory masking spectrum correction is performed based on the multi-baseband speech spectrum data. Based on the auditory masking effect of the human ear, the multi-baseband speech spectrum data is corrected, for example, the energy of the noise below the masking threshold is reduced to improve the intelligibility of the speech. The corrected multi-baseband speech spectrum data is subjected to speech signal reconstruction processing. The various IMF components are superimposed to generate enhanced speech signal data, and the speech signal noise reduction is completed.
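Two pieces of step S5 admit short sketches: EMD reconstruction is literally a sample-wise sum of IMFs, and the masking correction can be crudely approximated by attenuating components below a masked threshold. A real system would use a psychoacoustic model (Bark-band spreading functions); the relative threshold and floor gain here are purely illustrative assumptions:

```python
def masking_correction(magnitudes, rel_threshold=0.05, floor_gain=0.1):
    """Toy masking correction: bins whose magnitude falls below
    rel_threshold * (spectral peak) are treated as masked noise and
    attenuated by floor_gain, echoing 'reduce noise energy below the
    masking threshold' in step S5."""
    peak = max(magnitudes) if magnitudes else 0.0
    thr = rel_threshold * peak
    return [m if m >= thr else m * floor_gain for m in magnitudes]

def reconstruct_from_imfs(imfs):
    # EMD reconstruction: the signal is the sample-wise sum of its IMFs
    return [sum(samples) for samples in zip(*imfs)]
```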

优选地,步骤S1包括以下步骤:Preferably, step S1 comprises the following steps:

步骤S11:对待处理语音信号数据进行语音端点检测,并进行静音段剔除处理,生成初始语音信号数据;Step S11: performing voice endpoint detection on the voice signal data to be processed and performing silent segment elimination processing to generate initial voice signal data;

步骤S12:根据初始语音信号数据进行基频周期提取,得到基频轮廓特征数据;Step S12: extracting the fundamental frequency period according to the initial speech signal data to obtain fundamental frequency contour feature data;

步骤S13:根据初始语音信号数据进行自适应分帧处理,得到分帧语音信号数据;Step S13: performing adaptive framing processing according to the initial voice signal data to obtain framed voice signal data;

步骤S14:对分帧语音信号数据进行快速傅里叶变换处理,并根据梅尔滤波器进行梅尔频率倒谱系数计算,生成梅尔频谱特征矩阵数据;Step S14: performing fast Fourier transform processing on the framed speech signal data, and calculating the Mel frequency cepstrum coefficients according to the Mel filter to generate Mel frequency spectrum feature matrix data;

步骤S15:通过梅尔频谱特征矩阵数据对基频轮廓特征数据进行动态时间规整处理,并进行基频声纹特征融合,得到基频融合特征数据;Step S15: performing dynamic time warping processing on the fundamental frequency profile feature data through the Mel frequency spectrum feature matrix data, and performing fundamental frequency voiceprint feature fusion to obtain fundamental frequency fusion feature data;

步骤S16:基于基频融合特征数据进行声纹分布概率估计,并进行高斯模糊泛化处理,生成初始声纹基底数据。Step S16: Estimating the voiceprint distribution probability based on the baseband fusion feature data, and performing Gaussian fuzzy generalization processing to generate initial voiceprint base data.

作为本发明的一个实例,参考图2所示,为图1中步骤S1的详细实施步骤流程示意图,在本实例中所述步骤S1包括:As an example of the present invention, referring to FIG. 2 , it is a schematic diagram of a detailed implementation step flow of step S1 in FIG. 1 . In this example, step S1 includes:

步骤S11:对待处理语音信号数据进行语音端点检测,并进行静音段剔除处理,生成初始语音信号数据;Step S11: performing voice endpoint detection on the voice signal data to be processed and performing silent segment elimination processing to generate initial voice signal data;

本发明实施例中,对输入的原始语音信号数据进行语音端点检测,可以使用基于能量和过零率的双门限法。首先,设定能量阈值和过零率阈值,并计算语音信号的短时能量和短时过零率。当短时能量和短时过零率同时超过预设阈值时,判定为语音段的起始点;当短时能量和短时过零率同时低于预设阈值时,判定为语音段的结束点。根据检测到的语音端点,剔除静音段落,保留包含语音信息的片段,生成初始语音信号数据。In an embodiment of the present invention, a double threshold method based on energy and zero-crossing rate can be used to detect speech endpoints of input original speech signal data. First, an energy threshold and a zero-crossing rate threshold are set, and the short-time energy and short-time zero-crossing rate of the speech signal are calculated. When the short-time energy and the short-time zero-crossing rate both exceed the preset threshold, it is determined as the starting point of the speech segment; when the short-time energy and the short-time zero-crossing rate are both lower than the preset threshold, it is determined as the end point of the speech segment. According to the detected speech endpoints, the silent segments are eliminated, and the segments containing speech information are retained to generate the initial speech signal data.
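The double-threshold detector (short-time energy plus zero-crossing rate) can be sketched as follows, keeping a frame only when both measures exceed their thresholds, exactly as the criterion above states. Production detectors usually add hysteresis and hangover smoothing, omitted here:

```python
def short_time_energy(frame):
    # Sum of squared samples within one frame
    return sum(s * s for s in frame)

def zero_crossing_rate(frame):
    # Fraction of adjacent sample pairs that change sign
    return sum(1 for a, b in zip(frame, frame[1:]) if a * b < 0) / len(frame)

def detect_speech_frames(frames, energy_thr, zcr_thr):
    """Keep frames classified as speech by the double-threshold rule
    of step S11; everything else is treated as silence and dropped."""
    return [f for f in frames
            if short_time_energy(f) > energy_thr
            and zero_crossing_rate(f) > zcr_thr]
```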

步骤S12:根据初始语音信号数据进行基频周期提取,得到基频轮廓特征数据;Step S12: extracting the fundamental frequency period according to the initial speech signal data to obtain fundamental frequency contour feature data;

本发明实施例中,对初始语音信号数据进行预处理,包括预加重和加窗操作,以突出语音信号中的高频信息和平滑信号。然后,计算每帧语音信号的自相关函数,并找到自相关函数的第一个峰值位置,该峰值对应基音周期。为了提高基频周期提取的准确性,可以采用动态阈值方法来确定峰值位置,并对提取到的基频周期进行中值滤波处理,去除异常值,最终得到平滑的基频轮廓特征数据。In the embodiment of the present invention, the initial speech signal data is preprocessed, including pre-emphasis and windowing operations, to highlight the high-frequency information and smooth signal in the speech signal. Then, the autocorrelation function of each frame of the speech signal is calculated, and the first peak position of the autocorrelation function is found, and the peak corresponds to the fundamental frequency period. In order to improve the accuracy of the fundamental frequency period extraction, a dynamic threshold method can be used to determine the peak position, and the extracted fundamental frequency period is subjected to median filtering to remove abnormal values, and finally obtain smooth fundamental frequency contour feature data.
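The autocorrelation pitch tracker with median smoothing described above, as a pure-Python sketch; the 60-400 Hz search band is an assumed plausible pitch range, and pre-emphasis/windowing are taken as already applied:

```python
import math

def autocorr_pitch(frame, sample_rate, fmin=60.0, fmax=400.0):
    """Estimate pitch from the strongest autocorrelation peak within the
    lag range corresponding to fmin..fmax Hz (step S12)."""
    lo = int(sample_rate / fmax)
    hi = min(int(sample_rate / fmin), len(frame) - 1)
    best_lag, best_r = lo, float('-inf')
    for lag in range(lo, hi + 1):
        r = sum(frame[i] * frame[i + lag] for i in range(len(frame) - lag))
        if r > best_r:
            best_r, best_lag = r, lag
    return sample_rate / best_lag  # pitch estimate in Hz

def median_smooth(values, k=3):
    # Median filter to knock out isolated pitch-tracking outliers
    half = k // 2
    out = []
    for i in range(len(values)):
        window = sorted(values[max(0, i - half):i + half + 1])
        out.append(window[len(window) // 2])
    return out
```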

步骤S13:根据初始语音信号数据进行自适应分帧处理,得到分帧语音信号数据;Step S13: performing adaptive framing processing according to the initial voice signal data to obtain framed voice signal data;

本发明实施例中,对初始语音信号数据进行自适应分帧处理。可以使用Hamming窗对语音信号进行分帧,帧长设置为25ms,帧移设置为10ms,以保证相邻帧之间有一定的重叠,从而保留语音信号的连续性,得到分帧语音信号数据。例如,使用短时傅里叶变换(STFT)技术,将连续的语音信号划分为若干帧。每帧的长度和重叠度可以根据语音信号的特性进行调整,通常选择20ms至30ms的帧长及50%的重叠。通过这种方式,能够生成分帧语音信号数据。In an embodiment of the present invention, the initial speech signal data is subjected to adaptive framing processing. The speech signal can be framed using a Hamming window, with the frame length set to 25ms and the frame shift set to 10ms to ensure a certain overlap between adjacent frames, thereby retaining the continuity of the speech signal and obtaining framed speech signal data. For example, a continuous speech signal is divided into several frames using a short-time Fourier transform (STFT) technique. The length and overlap of each frame can be adjusted according to the characteristics of the speech signal, and a frame length of 20ms to 30ms and an overlap of 50% are usually selected. In this way, framed speech signal data can be generated.

步骤S14:对分帧语音信号数据进行快速傅里叶变换处理,并根据梅尔滤波器进行梅尔频率倒谱系数计算,生成梅尔频谱特征矩阵数据;Step S14: performing fast Fourier transform processing on the framed speech signal data, and calculating the Mel frequency cepstrum coefficients according to the Mel filter to generate Mel frequency spectrum feature matrix data;

本发明实施例中,对分帧语音信号数据进行快速傅里叶变换(FFT)处理,将时域信号转换到频域,得到语音信号的频谱信息。然后,根据梅尔滤波器组对频谱进行滤波,模拟人耳的听觉特性,得到不同频率带的能量分布。最后,对滤波后的能量谱进行对数运算和离散余弦变换(DCT),得到梅尔频率倒谱系数(MFCC),并将所有帧的MFCC系数构成梅尔频谱特征矩阵数据。In the embodiment of the present invention, the framed speech signal data is processed by fast Fourier transform (FFT), the time domain signal is converted to the frequency domain, and the spectrum information of the speech signal is obtained. Then, the spectrum is filtered according to the Mel filter bank to simulate the auditory characteristics of the human ear and obtain the energy distribution of different frequency bands. Finally, the filtered energy spectrum is logarithmically operated and discrete cosine transformed (DCT) to obtain the Mel frequency cepstrum coefficient (MFCC), and the MFCC coefficients of all frames constitute the Mel spectrum feature matrix data.
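The mel-scale warping behind the filterbank in step S14 follows the standard 2595·log10(1 + f/700) formula. A sketch of the conversion and of the triangular filter centre frequencies (the full FFT/log/DCT pipeline is left to a DSP library):

```python
import math

def hz_to_mel(f):
    # Standard HTK-style mel scale
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    # Inverse of hz_to_mel
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filter_centers(n_filters, fmin, fmax):
    """Centre frequencies of a triangular mel filterbank: equally spaced
    on the mel scale between fmin and fmax, then mapped back to Hz."""
    m_lo, m_hi = hz_to_mel(fmin), hz_to_mel(fmax)
    step = (m_hi - m_lo) / (n_filters + 1)
    return [mel_to_hz(m_lo + step * (i + 1)) for i in range(n_filters)]
```

The equal spacing in mel rather than in Hz is what gives the filterbank its finer resolution at low frequencies, mimicking the ear as the text describes.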

步骤S15:通过梅尔频谱特征矩阵数据对基频轮廓特征数据进行动态时间规整处理,并进行基频声纹特征融合,得到基频融合特征数据;Step S15: performing dynamic time warping processing on the fundamental frequency profile feature data through the Mel frequency spectrum feature matrix data, and performing fundamental frequency voiceprint feature fusion to obtain fundamental frequency fusion feature data;

本发明实施例中,利用动态时间规整(DTW)算法对梅尔频谱特征矩阵数据和基频轮廓特征数据进行对齐,以消除不同语音信号在时间轴上的差异。DTW算法通过寻找两个时间序列之间的最佳匹配路径,使得累积距离最小,从而实现时间规整。根据对齐后的特征序列,进行基频声纹特征融合。可以使用线性加权融合的方式,根据经验或实验结果设置权重系数,将梅尔频谱特征和基频轮廓特征进行加权求和,得到基频融合特征数据。In an embodiment of the present invention, a dynamic time warping (DTW) algorithm is used to align the Mel spectrum feature matrix data and the fundamental frequency contour feature data to eliminate the differences between different speech signals on the time axis. The DTW algorithm achieves time warping by finding the best matching path between two time series to minimize the cumulative distance. Based on the aligned feature sequence, fundamental frequency voiceprint feature fusion is performed. A linear weighted fusion method can be used to set the weight coefficient based on experience or experimental results, and the Mel spectrum features and fundamental frequency contour features are weighted and summed to obtain fundamental frequency fusion feature data.
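The DTW alignment in step S15 is the classic dynamic program for minimum cumulative distance. A scalar-sequence sketch; for MFCC or fundamental-frequency frames the absolute difference would be replaced by a vector distance:

```python
def dtw_distance(seq_a, seq_b):
    """Classic dynamic-time-warping alignment cost between two sequences,
    minimizing cumulative local cost over all monotone warping paths."""
    n, m = len(seq_a), len(seq_b)
    INF = float('inf')
    d = [[INF] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(seq_a[i - 1] - seq_b[j - 1])
            d[i][j] = cost + min(d[i - 1][j],      # insertion
                                 d[i][j - 1],      # deletion
                                 d[i - 1][j - 1])  # match
    return d[n][m]
```

A zero cost for two sequences that differ only in tempo (e.g. a repeated value) is exactly the time-axis invariance the fusion step relies on.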

步骤S16:基于基频融合特征数据进行声纹分布概率估计,并进行高斯模糊泛化处理,生成初始声纹基底数据。Step S16: estimating the voiceprint distribution probability based on the fundamental frequency fusion feature data, and performing Gaussian blur generalization processing to generate initial voiceprint base data.

本发明实施例中,基于基频融合特征数据进行声纹分布概率估计,可以使用高斯混合模型(GMM)。GMM可以将复杂的语音特征分布建模为多个高斯分布的加权和,每个高斯分布代表一种声纹模式。通过估计GMM的参数,可以得到每个语音帧属于不同声纹模式的概率。通过这种概率估计,可以更好地理解不同声纹数据之间的关系。随后,对声纹分布进行高斯模糊泛化处理,以降低特征数据的敏感性和噪声影响,从而生成初始声纹基底数据。In an embodiment of the present invention, a Gaussian mixture model (GMM) can be used to estimate the voiceprint distribution probability based on the fundamental frequency fusion feature data. A GMM models a complex speech feature distribution as a weighted sum of multiple Gaussian components, each of which represents a voiceprint pattern. By estimating the parameters of the GMM, the probability that each speech frame belongs to each voiceprint pattern can be obtained. Through this probability estimation, the relationship between different voiceprint data can be better understood. Subsequently, Gaussian blur generalization is applied to the voiceprint distribution to reduce the sensitivity of the feature data and the influence of noise, thereby generating the initial voiceprint base data.
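The two operations above can be sketched for the 1-D case: computing the posterior probability of each GMM component for a frame (with GMM parameters assumed already estimated, e.g. by EM), then blurring a distribution vector with a Gaussian kernel. The parameters below are illustrative, not from the embodiment:

```python
import math

def gmm_posteriors(x, weights, means, stds):
    """P(component | x) for a 1-D GMM with known parameters."""
    dens = [w * math.exp(-(x - m) ** 2 / (2 * s * s)) / (s * math.sqrt(2 * math.pi))
            for w, m, s in zip(weights, means, stds)]
    total = sum(dens)
    return [d / total for d in dens]

def gaussian_blur(values, sigma=1.0, radius=2):
    """Gaussian smoothing of a distribution vector; edges are replicated."""
    kernel = [math.exp(-(i * i) / (2 * sigma * sigma))
              for i in range(-radius, radius + 1)]
    ksum = sum(kernel)
    kernel = [k / ksum for k in kernel]
    n = len(values)
    return [sum(k * values[min(max(i + j - radius, 0), n - 1)]
                for j, k in enumerate(kernel))
            for i in range(n)]
```

Blurring the per-frame posterior vectors in this way reduces their sensitivity to noise before they are collected into the initial voiceprint base data.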

优选地,步骤S2包括以下步骤:Preferably, step S2 comprises the following steps:

步骤S21:根据初始声纹基底数据进行声学上下文增强处理,生成声学特征向量数据;Step S21: performing acoustic context enhancement processing according to the initial voiceprint base data to generate acoustic feature vector data;

步骤S22:基于声学特征向量数据利用预设的语音识别模型对分帧语音信号数据进行逐帧语音识别,得到帧级语音文本数据;Step S22: performing frame-by-frame speech recognition on the framed speech signal data using a preset speech recognition model based on the acoustic feature vector data to obtain frame-level speech text data;

步骤S23:根据帧级语音文本数据进行语义嵌入空间处理,得到语义嵌入空间数据;Step S23: performing semantic embedding space processing according to the frame-level speech text data to obtain semantic embedding space data;

步骤S24:通过语义嵌入空间数据对分帧语音信号数据进行语义空间映射,生成分帧语音语义坐标数据;Step S24: semantic space mapping is performed on the framed speech signal data by using the semantic embedding space data to generate framed speech semantic coordinate data;

步骤S25:根据分帧语音语义坐标数据进行时序语义轨迹处理,得到语音语义轨迹数据。Step S25: performing time-series semantic trajectory processing according to the framed speech semantic coordinate data to obtain speech semantic trajectory data.

作为本发明的一个实例,参考图3所示,为图1中步骤S2的详细实施步骤流程示意图,在本实例中所述步骤S2包括:As an example of the present invention, referring to FIG3 , which is a schematic flow chart of detailed implementation steps of step S2 in FIG1 , in this example, step S2 includes:

步骤S21:根据初始声纹基底数据进行声学上下文增强处理,生成声学特征向量数据;Step S21: performing acoustic context enhancement processing according to the initial voiceprint base data to generate acoustic feature vector data;

本发明实施例中,使用循环神经网络(RNN),例如长短时记忆网络(LSTM),来学习语音信号的时序特征。将初始声纹基底数据作为RNN的输入,通过网络的循环结构和非线性激活函数,可以捕捉到语音信号的上下文信息,从而生成更具表征能力的声学特征向量数据。例如,可以将每帧语音的初始声纹基底数据,以及其前后几帧的声纹数据拼接成一个向量,作为LSTM的输入。LSTM网络会学习到相邻帧之间的声学特征关系,并输出一个包含更丰富上下文信息的声学特征向量。In an embodiment of the present invention, a recurrent neural network (RNN), such as a long short-term memory network (LSTM), is used to learn the temporal characteristics of speech signals. The initial voiceprint base data is used as the input of the RNN. Through the network's cyclic structure and nonlinear activation function, the contextual information of the speech signal can be captured, thereby generating more representative acoustic feature vector data. For example, the initial voiceprint base data of each frame of speech and the voiceprint data of the previous and next frames can be spliced into a vector as the input of the LSTM. The LSTM network will learn the acoustic feature relationship between adjacent frames and output an acoustic feature vector containing richer contextual information.
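The frame-splicing step described above (concatenating each frame's voiceprint base vector with its neighbours to form the LSTM input) can be sketched without the network itself; the context width is an illustrative choice:

```python
def splice_frames(frames, context=2):
    """Concatenate each frame vector with its +-context neighbours.
    Edge frames are replicated so every spliced vector has the same length."""
    n = len(frames)
    out = []
    for i in range(n):
        spliced = []
        for off in range(-context, context + 1):
            j = min(max(i + off, 0), n - 1)  # clamp at the sequence edges
            spliced.extend(frames[j])
        out.append(spliced)
    return out
```

Each spliced vector (dimension `(2*context+1) * d`) would then be fed to the LSTM, which outputs the context-enhanced acoustic feature vector for that frame.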

步骤S22:基于声学特征向量数据利用预设的语音识别模型对分帧语音信号数据进行逐帧语音识别,得到帧级语音文本数据;Step S22: performing frame-by-frame speech recognition on the framed speech signal data using a preset speech recognition model based on the acoustic feature vector data to obtain frame-level speech text data;

本发明实施例中,基于声学特征向量数据,利用预设的语音识别模型对分帧语音信号数据进行逐帧语音识别。可以选择使用混合声学模型和语言模型的语音识别系统,例如基于隐马尔可夫模型(HMM)和N-gram语言模型的语音识别系统,或基于深度神经网络(DNN)的端到端语音识别系统。将每一帧的声学特征向量输入到语音识别模型中,解码得到对应帧的语音识别结果,最终得到帧级语音文本数据,即每个语音帧对应的音素或词序列。In an embodiment of the present invention, based on the acoustic feature vector data, a preset speech recognition model is used to perform frame-by-frame speech recognition on the framed speech signal data. A speech recognition system using a hybrid acoustic model and a language model can be selected, such as a speech recognition system based on a hidden Markov model (HMM) and an N-gram language model, or an end-to-end speech recognition system based on a deep neural network (DNN). The acoustic feature vector of each frame is input into the speech recognition model, and the speech recognition result of the corresponding frame is obtained by decoding, and finally the frame-level speech text data, that is, the phoneme or word sequence corresponding to each speech frame, is obtained.

步骤S23:根据帧级语音文本数据进行语义嵌入空间处理,得到语义嵌入空间数据;Step S23: performing semantic embedding space processing according to the frame-level speech text data to obtain semantic embedding space data;

本发明实施例中,根据帧级语音文本数据进行语义嵌入空间处理,可以使用预训练的词向量模型,例如Word2Vec、GloVe或BERT。这些模型可以将每个词或音素映射到一个高维向量空间,使得语义相似的词或音素在向量空间中的距离更近。将帧级语音文本数据中的每个词或音素转换为对应的词向量,并对每个语音帧的词向量进行平均或加权平均,得到每个语音帧的语义嵌入向量,最终构成语义嵌入空间数据。In an embodiment of the present invention, semantic embedding space processing is performed according to the frame-level speech text data, and a pre-trained word vector model, such as Word2Vec, GloVe or BERT, can be used. These models can map each word or phoneme to a high-dimensional vector space, so that semantically similar words or phonemes are closer in the vector space. Each word or phoneme in the frame-level speech text data is converted into a corresponding word vector, and the word vector of each speech frame is averaged or weighted averaged to obtain a semantic embedding vector for each speech frame, and finally constitute the semantic embedding space data.
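The per-frame averaging of word vectors described above can be sketched as follows; the vector table is a stand-in for a pre-trained Word2Vec/GloVe/BERT lookup, and out-of-vocabulary handling is an assumption:

```python
def frame_embedding(tokens, word_vectors, weights=None):
    """Average (or weighted-average) the word vectors of one frame's tokens.
    `word_vectors` maps token -> vector; unknown tokens are skipped."""
    vecs, ws = [], []
    for idx, tok in enumerate(tokens):
        if tok in word_vectors:
            vecs.append(word_vectors[tok])
            ws.append(1.0 if weights is None else weights[idx])
    if not vecs:
        return None  # frame had no known tokens
    total = sum(ws)
    dim = len(vecs[0])
    return [sum(w * v[d] for w, v in zip(ws, vecs)) / total for d in range(dim)]
```

Collecting these per-frame vectors over all frames constitutes the semantic embedding space data.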

步骤S24:通过语义嵌入空间数据对分帧语音信号数据进行语义空间映射,生成分帧语音语义坐标数据;Step S24: semantic space mapping is performed on the framed speech signal data by using the semantic embedding space data to generate framed speech semantic coordinate data;

本发明实施例中,通过语义嵌入空间数据对分帧语音信号数据进行语义空间映射。可以将每个语音帧的声学特征向量和语义嵌入向量拼接成一个新的向量,并利用线性变换或非线性变换将其映射到一个新的语义空间中。例如,可以使用多层感知机(MLP)对拼接后的向量进行非线性变换,学习声学特征和语义特征之间的复杂关系,并将每个语音帧映射到语义空间中的一个点,其坐标即为分帧语音语义坐标数据,实现语音信号从声学空间到语义空间的转换。In an embodiment of the present invention, semantic space mapping is performed on the framed speech signal data through semantic embedding space data. The acoustic feature vector and semantic embedding vector of each speech frame can be spliced into a new vector, and mapped to a new semantic space using linear transformation or nonlinear transformation. For example, a multi-layer perceptron (MLP) can be used to perform nonlinear transformation on the spliced vector, learn the complex relationship between acoustic features and semantic features, and map each speech frame to a point in the semantic space, whose coordinates are the framed speech semantic coordinate data, to achieve the conversion of the speech signal from the acoustic space to the semantic space.
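The concatenate-then-transform mapping described above can be sketched as a single MLP hidden layer; the weights and dimensions are illustrative placeholders for a trained network, not parameters from the embodiment:

```python
import math

def semantic_map(acoustic_vec, semantic_vec, weight_rows, bias):
    """One hidden layer of the mapping MLP: concatenate the two feature
    vectors, apply an affine transform, then a tanh nonlinearity.
    The output is the frame's coordinate in the semantic space."""
    x = list(acoustic_vec) + list(semantic_vec)
    return [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(weight_rows, bias)]
```

A trained multi-layer network would stack several such layers; the final layer's output gives the framed speech semantic coordinate data.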

步骤S25:根据分帧语音语义坐标数据进行时序语义轨迹处理,得到语音语义轨迹数据。Step S25: performing time-series semantic trajectory processing according to the framed speech semantic coordinate data to obtain speech semantic trajectory data.

本发明实施例中,根据分帧语音语义坐标数据进行时序语义轨迹处理,可以将每个语音帧在语义空间中的坐标连接起来,形成一条连续的轨迹,即语音语义轨迹数据。这条轨迹反映了语音信号的语义信息随时间的变化趋势。为了更好地捕捉语义轨迹的动态信息,可以对轨迹进行平滑处理,例如使用移动平均滤波器或高斯滤波器,去除高频噪声,得到更加平滑的语音语义轨迹。In the embodiment of the present invention, the time-series semantic trajectory processing is performed according to the framed speech semantic coordinate data, and the coordinates of each speech frame in the semantic space can be connected to form a continuous trajectory, i.e., speech semantic trajectory data. This trajectory reflects the changing trend of the semantic information of the speech signal over time. In order to better capture the dynamic information of the semantic trajectory, the trajectory can be smoothed, for example, using a moving average filter or a Gaussian filter to remove high-frequency noise and obtain a smoother speech semantic trajectory.
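The moving-average smoothing of the semantic trajectory mentioned above can be sketched as follows (the window size is an illustrative choice; a Gaussian filter would simply replace the uniform weights):

```python
def smooth_trajectory(points, window=3):
    """Moving-average smoothing of a semantic trajectory.
    `points` is a list of coordinate tuples; the window shrinks at the ends."""
    half = window // 2
    n = len(points)
    dim = len(points[0])
    out = []
    for i in range(n):
        lo, hi = max(0, i - half), min(n, i + half + 1)
        out.append(tuple(sum(p[d] for p in points[lo:hi]) / (hi - lo)
                         for d in range(dim)))
    return out
```

The smoothed point sequence is the speech semantic trajectory data with high-frequency jitter removed.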

优选地,步骤S23包括以下步骤:Preferably, step S23 includes the following steps:

步骤S231:对帧级语音文本数据进行高维语义向量转换,生成语义原向量数据;Step S231: converting the frame-level speech text data into a high-dimensional semantic vector to generate semantic original vector data;

步骤S232:根据语义原向量数据进行高维流形结构构建,得到语义流形结构数据;Step S232: constructing a high-dimensional manifold structure according to the semantic original vector data to obtain semantic manifold structure data;

步骤S233:通过帧级语音文本数据对语义流形结构数据进行核心语义因子标识,生成核心语义因子数据;Step S233: using the frame-level speech text data to identify the core semantic factors of the semantic manifold structure data, and generating core semantic factor data;

步骤S234:利用预设的语音语料库对帧级语音文本数据进行上下文语义嵌入匹配,并通过对抗生成网络进行语义语料生成,得到多样化语料数据;Step S234: using a preset speech corpus to perform contextual semantic embedding matching on the frame-level speech text data, and generating semantic corpus through a generative adversarial network to obtain diversified corpus data;

步骤S235:根据多样化语料数据进行语义特征解耦,并进行语义关联网络处理,生成语义关系网络数据;Step S235: decoupling semantic features according to the diversified corpus data, and performing semantic association network processing to generate semantic relationship network data;

步骤S236:通过语义关系网络数据对语义流形结构数据进行动态语义融合,得到动态语义嵌入数据;Step S236: Dynamically semantically fuse the semantic manifold structure data through the semantic relationship network data to obtain dynamic semantic embedding data;

步骤S237:对动态语义嵌入数据进行语义刻度标定,并根据核心语义因子数据进行KD树快速索引构建,从而得到语义嵌入空间数据。Step S237: semantically calibrate the dynamic semantic embedding data, and construct a KD tree fast index based on the core semantic factor data, so as to obtain semantic embedding space data.

本发明实施例中,使用预训练的语言模型,例如BERT或GPT。将帧级语音文本数据输入到预训练的语言模型中,获取每个词或子词对应的语义向量表示。为了更好地捕捉句子级别的语义信息,可以采用BERT模型中的[CLS]标记位对应的向量表示,或对句子中所有词的向量表示进行平均池化操作,生成语义原向量数据。根据语义原向量数据进行高维流形结构构建,可以使用t-SNE等降维算法。t-SNE算法可以将高维数据映射到低维空间,同时保留数据点之间的距离关系,从而揭示数据内在的流形结构。将语义原向量数据作为输入,通过t-SNE算法将其映射到二维或三维空间,并构建Delaunay三角剖分或k-近邻图,得到语义流形结构数据。通过帧级语音文本数据对语义流形结构数据进行核心语义因子标识,可以使用基于图论的算法。首先,将帧级语音文本数据中的每个词语映射到语义流形结构中的对应节点。然后,计算每个节点的度中心性、介数中心性或PageRank值等指标,用于衡量节点在图中的重要程度。最后,选择中心性指标排名靠前的节点作为核心语义因子,并将其标识在语义流形结构上,生成核心语义因子数据。利用预设的语音语料库对帧级语音文本数据进行上下文语义嵌入匹配,可以使用SentenceBERT等模型计算句子级别的语义相似度。将帧级语音文本数据作为查询语句,在预设的语音语料库中检索语义相似的句子。然后,使用对抗生成网络(GAN)进行语义语料生成。将检索到的相似句子作为GAN的训练数据,训练一个生成器网络,使其能够生成与输入句子语义相似的新句子,从而扩展语料库的多样性,得到多样化语料数据。使MINE方法估计不同语义特征之间的互信息,并选择互信息较低的特征组合进行解耦,从而消除语义特征之间的冗余信息。然后,使用图注意力网络(GAT)或图卷积网络(GCN)等方法进行语义关联网络处理。将多样化语料数据中的词语作为节点,词语之间的共现关系或语义相似关系作为边,构建语义关联网络,并使用GAT或GCN学习节点的隐藏表示,从而捕获词语之间的语义关联关系,生成语义关系网络数据。通过语义关系网络数据对语义流形结构数据进行动态语义融合,可以使用图神经网络(GNN)模型。将语义流形结构数据和语义关系网络数据作为GNN模型的输入,通过图卷积或消息传递机制,将节点的局部结构信息和全局语义关联信息进行融合,更新节点的表示,得到动态语义嵌入数据。对动态语义嵌入数据进行语义刻度标定,可以使用Procrustes分析等方法。将动态语义嵌入数据与预先定义好的语义空间进行对齐,例如WordNet或ConceptNet,从而为动态语义嵌入数据提供语义解释。然后,根据核心语义因子数据进行KD树快速索引构建。将核心语义因子作为KD树的节点,根据其在语义嵌入空间中的坐标进行层次划分,构建KD树索引结构,用于快速搜索最近邻节点,最终得到语义嵌入空间数据。In an embodiment of the present invention, a pre-trained language model, such as BERT or GPT, is used. The frame-level speech text data is input into the pre-trained language model to obtain the semantic vector representation corresponding to each word or subword. In order to better capture the semantic information at the sentence level, the vector representation corresponding to the [CLS] tag in the BERT model can be used, or the vector representation of all words in the sentence can be averaged and pooled to generate semantic original vector data. According to the semantic original vector data, a high-dimensional manifold structure is constructed, and a dimensionality reduction algorithm such as t-SNE can be used. 
The t-SNE algorithm can map high-dimensional data to a low-dimensional space while retaining the distance relationship between data points, thereby revealing the inherent manifold structure of the data. The semantic original vector data is used as input, mapped to a two-dimensional or three-dimensional space by the t-SNE algorithm, and a Delaunay triangulation or a k-nearest neighbor graph is constructed to obtain semantic manifold structure data. The core semantic factors of the semantic manifold structure data are identified by frame-level speech text data, and an algorithm based on graph theory can be used. First, each word in the frame-level speech text data is mapped to a corresponding node in the semantic manifold structure. Then, the degree centrality, betweenness centrality or PageRank value of each node are calculated to measure the importance of the node in the graph. Finally, the nodes with the highest centrality index are selected as core semantic factors and marked on the semantic manifold structure to generate core semantic factor data. The frame-level speech text data is matched with contextual semantic embedding using the preset speech corpus, and the semantic similarity at the sentence level can be calculated using models such as SentenceBERT. The frame-level speech text data is used as a query sentence to retrieve semantically similar sentences in the preset speech corpus. Then, the adversarial generative network (GAN) is used for semantic corpus generation. The retrieved similar sentences are used as training data for GAN to train a generator network so that it can generate new sentences that are semantically similar to the input sentences, thereby expanding the diversity of the corpus and obtaining diversified corpus data. 
The MINE method is used to estimate the mutual information between different semantic features, and feature combinations with low mutual information are selected for decoupling, thereby eliminating redundant information between semantic features. Then, methods such as graph attention network (GAT) or graph convolutional network (GCN) are used for semantic association network processing. The words in the diverse corpus data are used as nodes, and the co-occurrence relationship or semantic similarity relationship between words is used as the edge to construct a semantic association network, and the hidden representation of the node is learned using GAT or GCN to capture the semantic association relationship between words and generate semantic relationship network data. The graph neural network (GNN) model can be used to perform dynamic semantic fusion of semantic manifold structure data through semantic relationship network data. The semantic manifold structure data and semantic relationship network data are used as the input of the GNN model. The local structure information and global semantic association information of the node are fused through graph convolution or message passing mechanism, and the representation of the node is updated to obtain dynamic semantic embedding data. The dynamic semantic embedding data can be semantically calibrated using methods such as Procrustes analysis. The dynamic semantic embedding data is aligned with a pre-defined semantic space, such as WordNet or ConceptNet, to provide semantic interpretation for the dynamic semantic embedding data. Then, the KD tree fast index is constructed based on the core semantic factor data. 
The core semantic factor is used as the node of the KD tree, and the core semantic factor is hierarchically divided according to its coordinates in the semantic embedding space to construct a KD tree index structure for fast searching of the nearest neighbor node, and finally the semantic embedding space data is obtained.
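The KD-tree index over core semantic factors described in step S237 can be sketched in pure Python (libraries such as `scipy.spatial.KDTree` would normally be used); the point coordinates below are illustrative:

```python
import math

def build_kd(points, depth=0):
    """Recursively build a KD tree over core-semantic-factor coordinates,
    splitting on axes in round-robin order at the median point."""
    if not points:
        return None
    axis = depth % len(points[0])
    pts = sorted(points, key=lambda p: p[axis])
    mid = len(pts) // 2
    return {"point": pts[mid],
            "left": build_kd(pts[:mid], depth + 1),
            "right": build_kd(pts[mid + 1:], depth + 1)}

def nearest(node, target, depth=0, best=None):
    """Nearest-neighbour search with branch pruning."""
    if node is None:
        return best
    if best is None or math.dist(node["point"], target) < math.dist(best, target):
        best = node["point"]
    axis = depth % len(target)
    diff = target[axis] - node["point"][axis]
    near, far = (node["left"], node["right"]) if diff < 0 else (node["right"], node["left"])
    best = nearest(near, target, depth + 1, best)
    if abs(diff) < math.dist(best, target):  # search sphere crosses the split plane
        best = nearest(far, target, depth + 1, best)
    return best
```

Querying the tree returns the nearest core semantic factor for any point in the embedding space in roughly logarithmic time.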

优选地,步骤S3包括以下步骤:Preferably, step S3 comprises the following steps:

步骤S31:根据语音语义轨迹数据以及语义嵌入空间数据进行语义矢量场构建,生成语音语义矢量场数据;Step S31: constructing a semantic vector field according to the speech semantic trajectory data and the semantic embedding space data to generate speech semantic vector field data;

步骤S32:对语音语义矢量场数据进行网格张量特征分析,生成网格语义张量数据;通过网格语义张量数据对语音语义矢量场数据进行语义矢量流线追踪,得到语义矢量流线数据;Step S32: performing grid tensor feature analysis on the speech semantic vector field data to generate grid semantic tensor data; performing semantic vector streamline tracing on the speech semantic vector field data through the grid semantic tensor data to obtain semantic vector streamline data;

步骤S33:根据语义矢量流线数据进行流线稳定性评估,生成流线稳定评估数据;Step S33: performing streamline stability evaluation according to the semantic vector streamline data to generate streamline stability evaluation data;

步骤S34:基于语义矢量流线数据对语音语义矢量场数据进行语义涡流检测,生成语音涡流区域数据;Step S34: performing semantic eddy current detection on the speech semantic vector field data based on the semantic vector streamline data to generate speech eddy current area data;

步骤S35:通过预设的多尺度滑动窗口对语义矢量流线数据进行平均速度计算,生成语义流线速度序列数据;Step S35: calculating the average speed of the semantic vector streamline data through a preset multi-scale sliding window to generate semantic streamline speed sequence data;

步骤S36:基于语义流线速度序列数据对语音涡流区域数据进行语义加速度提取,并进行涡流强度量化,生成涡流强度数据;Step S36: extracting semantic acceleration from speech eddy current region data based on semantic streamline velocity sequence data, and quantifying eddy current intensity to generate eddy current intensity data;

步骤S37:通过涡流强度数据、语音涡流区域数据以及流线稳定评估数据对语音语义矢量场数据进行语音噪声区域标记,生成语义噪声区域数据。Step S37: Mark the speech noise area of the speech semantic vector field data using the eddy current intensity data, the speech eddy current area data, and the streamline stability evaluation data to generate semantic noise area data.

本发明实施例中,根据语音语义轨迹数据以及语义嵌入空间数据,可以使用插值方法构建语义矢量场。首先,将语义嵌入空间看作一个连续的场,语音语义轨迹数据中的每个点都对应着语义空间中的一个位置。然后,利用径向基函数(RBF)插值或反距离加权插值等方法,将离散的轨迹点扩展成连续的矢量场。对于语义空间中的任意一点,可以根据其与轨迹点的距离和轨迹点上对应的语义向量,计算该点的语义向量,从而得到完整的语音语义矢量场数据,用于表征语音信号在不同语义位置上的变化趋势。将语音语义矢量场所在的语义空间划分成均匀的网格,并在每个网格点上计算矢量场的梯度张量。梯度张量可以描述矢量场在该点的变化方向和变化率,反映了语义信息的局部变化特征。通过网格语义张量数据对语音语义矢量场数据进行语义矢量流线追踪,可以使用Runge-Kutta法。从语义矢量场的起始点开始,沿着梯度方向迭代计算下一个点的位置,直到达到矢量场的终点或预设的步数,从而得到一条完整的语义矢量流线。重复上述步骤,可以得到多条语义矢量流线,构成语义矢量流线数据。可以使用Lyapunov指数。Lyapunov指数可以衡量系统在相空间中相邻轨迹的发散速度,用于判断系统的混沌程度。对于每条语义矢量流线,可以计算其Lyapunov指数,并根据指数的正负性判断流线的稳定性。正的Lyapunov指数表示流线是不稳定的,容易受到噪声干扰;负的Lyapunov指数表示流线是稳定的,能够抵抗噪声干扰。旋度是一个向量场,用于描述矢量场在某一点的旋转程度。对于语音语义矢量场,可以计算每个点的旋度,并根据旋度的大小和方向判断是否存在语义涡流。例如,可以设定一个旋度阈值,当某个点的旋度模长大于该阈值时,则认为该点存在语义涡流。设定多个不同大小的滑动窗口,例如5点、10点和20点。然后,将滑动窗口沿着每条语义矢量流线滑动,计算窗口内所有点的平均速度。具体来说,可以将每个点看作一个时间步,利用相邻两个时间步上语义向量的差值除以时间间隔,得到该点的速度。In an embodiment of the present invention, an interpolation method can be used to construct a semantic vector field based on speech semantic trajectory data and semantic embedding space data. First, the semantic embedding space is regarded as a continuous field, and each point in the speech semantic trajectory data corresponds to a position in the semantic space. Then, the discrete trajectory points are expanded into a continuous vector field using methods such as radial basis function (RBF) interpolation or inverse distance weighted interpolation. For any point in the semantic space, the semantic vector of the point can be calculated based on its distance from the trajectory point and the semantic vector corresponding to the trajectory point, thereby obtaining complete speech semantic vector field data for characterizing the change trend of the speech signal at different semantic positions. 
The semantic space where the speech semantic vector field is located is divided into a uniform grid, and the gradient tensor of the vector field is calculated at each grid point. The gradient tensor can describe the direction and rate of change of the vector field at the point, reflecting the local change characteristics of the semantic information. The Runge-Kutta method can be used to track the semantic vector streamlines of the speech semantic vector field data using the grid semantic tensor data. Starting from the starting point of the semantic vector field, the position of the next point is iteratively calculated along the gradient direction until the end point of the vector field or the preset number of steps is reached, thereby obtaining a complete semantic vector streamline. Repeating the above steps, multiple semantic vector streamlines can be obtained to form semantic vector streamline data. For streamline stability evaluation, the Lyapunov exponent can be used. The Lyapunov exponent measures the divergence speed of adjacent trajectories of the system in phase space and is used to judge the degree of chaos of the system. For each semantic vector streamline, its Lyapunov exponent can be calculated, and the stability of the streamline can be judged according to the sign of the exponent. A positive Lyapunov exponent indicates that the streamline is unstable and easily disturbed by noise; a negative Lyapunov exponent indicates that the streamline is stable and can resist noise interference. The curl is a vector quantity that describes the degree of local rotation of a vector field at a point. For the speech semantic vector field, the curl at each point can be calculated, and the presence of a semantic vortex can be judged according to the magnitude and direction of the curl. For example, a curl threshold can be set: when the curl magnitude at a point is greater than the threshold, a semantic vortex is considered to exist at that point. 
Set multiple sliding windows of different sizes, such as 5 points, 10 points and 20 points. Then, slide the sliding window along each semantic vector streamline, and calculate the average speed of all points in the window. Specifically, each point can be regarded as a time step, and the speed of the point can be obtained by dividing the difference between the semantic vectors of two adjacent time steps by the time interval.
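The per-point speed and multi-scale sliding-window averaging described above can be sketched as follows (the window sizes 5/10/20 are taken from the text; the stride of 1 and the tail truncation are assumptions):

```python
import math

def point_speeds(streamline, dt=1.0):
    """Per-step speed: distance between consecutive semantic vectors divided by dt."""
    return [math.dist(a, b) / dt for a, b in zip(streamline, streamline[1:])]

def windowed_mean_speeds(speeds, windows=(5, 10, 20)):
    """Mean speed under each sliding-window size (stride 1, windows fully inside)."""
    result = {}
    for w in windows:
        result[w] = [sum(speeds[i:i + w]) / w
                     for i in range(max(len(speeds) - w + 1, 0))]
    return result
```

Differencing the windowed speed sequences again would give the semantic acceleration used for eddy-intensity quantification in step S36.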

优选地,步骤S31包括以下步骤:Preferably, step S31 includes the following steps:

步骤S311:根据语义嵌入空间数据进行语义空间网格化处理,生成网格语义空间数据;对网格语义空间数据进行初始语义势能值计算,生成初始语义势能数据;Step S311: performing semantic space gridding processing on the semantic embedding space data to generate grid semantic space data; performing initial semantic potential energy value calculation on the grid semantic space data to generate initial semantic potential energy data;

步骤S312:通过初始语义势能数据对网格语义空间数据进行初始矢量场构建,得到初始语义矢量场数据;Step S312: constructing an initial vector field for the grid semantic space data using the initial semantic potential energy data to obtain initial semantic vector field data;

步骤S313:通过语音语义轨迹数据对初始语义矢量场数据进行矢量场动态调整,生成动态语义矢量场数据;其中,步骤S313具体为:Step S313: dynamically adjusting the initial semantic vector field data using the speech semantic trajectory data to generate dynamic semantic vector field data; wherein step S313 is specifically as follows:

步骤S3131:利用语音语义轨迹数据对网格语义空间数据进行语义轨迹映射处理,生成语义轨迹网格序列数据;Step S3131: using the speech semantic trajectory data to perform semantic trajectory mapping processing on the grid semantic space data to generate semantic trajectory grid sequence data;

步骤S3132:利用网格语义空间数据对语义轨迹网格序列数据进行网格剔除,得到非语义轨迹网格数据;Step S3132: using the grid semantic space data to perform grid elimination on the semantic trajectory grid sequence data to obtain non-semantic trajectory grid data;

步骤S3133:基于语义轨迹网格序列数据利用语音语义轨迹数据对初始语义矢量场数据进行语义轨迹矢量调整,生成语义轨迹矢量调整数据;Step S3133: adjusting the semantic trajectory vector of the initial semantic vector field data using the speech semantic trajectory data based on the semantic trajectory grid sequence data to generate semantic trajectory vector adjustment data;

步骤S3134:基于非语义轨迹网格数据利用语义轨迹矢量调整数据对初始语义矢量场数据进行势能场扩散模拟,并进行矢量调整值推导,生成非轨迹矢量调整数据;Step S3134: Based on the non-semantic trajectory grid data, the semantic trajectory vector adjustment data is used to perform potential field diffusion simulation on the initial semantic vector field data, and vector adjustment value derivation is performed to generate non-trajectory vector adjustment data;

步骤S3135:通过语义轨迹矢量调整数据以及非轨迹矢量调整数据对初始语义矢量场数据进行矢量场动态调整,生成动态语义矢量场数据;Step S3135: dynamically adjusting the vector field of the initial semantic vector field data by using the semantic trajectory vector adjustment data and the non-trajectory vector adjustment data to generate dynamic semantic vector field data;

步骤S314:根据动态语义矢量场数据进行矢量场平滑优化,生成语音语义矢量场数据。Step S314: performing vector field smoothing optimization according to the dynamic semantic vector field data to generate speech semantic vector field data.

本发明实施例中,将语义嵌入空间看作一个多维空间,按照预设的网格大小,将每个维度等距离划分成若干个区间,从而将整个语义空间分割成多个大小相等的超立方体网格。每个网格代表语义空间中的一个局部区域,生成网格语义空间数据。然后,对网格语义空间数据进行初始语义势能值计算。可以利用每个网格内的语义向量密度或信息熵来表示初始语义势能值。利用语音语义轨迹数据对网格语义空间数据进行语义轨迹映射处理,可以采用最近邻搜索法。对于语音语义轨迹数据中的每个点,找到与其在语义空间中距离最近的网格,并将该点映射到该网格上,生成语义轨迹网格序列数据。利用网格语义空间数据对语义轨迹网格序列数据进行网格剔除,可以采用集合运算。将所有语义轨迹网格从网格语义空间数据中剔除,得到非语义轨迹网格数据,用于标识语音语义轨迹未经过的网格区域。对于每个语义轨迹网格,根据其在语音语义轨迹中的位置和方向,调整该网格上初始语义矢量的方向和大小。例如,可以将初始语义矢量的方向调整为与语音语义轨迹在该网格处的切线方向一致,并将初始语义矢量的大小设置为与语音语义轨迹在该网格处的速度成正比,生成语义轨迹矢量调整数据。可以采用热传导方程进行模拟,例如,将语义轨迹矢量调整数据看作热源,非语义轨迹网格看作热传导介质,利用热传导方程模拟热量从热源向周围扩散的过程。根据模拟结果,可以得到每个非语义轨迹网格上的温度变化量,并将其转化为对应的矢量调整值,生成非轨迹矢量调整数据。通过语义轨迹矢量调整数据以及非轨迹矢量调整数据对初始语义矢量场数据进行矢量场动态调整,可以采用加权平均法。将初始语义矢量、语义轨迹矢量调整量和非轨迹矢量调整量进行加权平均,得到每个网格上最终的语义矢量,生成动态语义矢量场数据。In an embodiment of the present invention, the semantic embedding space is regarded as a multidimensional space, and each dimension is equally divided into several intervals according to a preset grid size, so that the entire semantic space is divided into multiple hypercube grids of equal size. Each grid represents a local area in the semantic space, and grid semantic space data is generated. Then, the initial semantic potential energy value is calculated for the grid semantic space data. The initial semantic potential energy value can be represented by the semantic vector density or information entropy in each grid. 
The speech semantic trajectory data is used to perform semantic trajectory mapping processing on the grid semantic space data, and the nearest neighbor search method can be used. For each point in the speech semantic trajectory data, find the grid closest to it in the semantic space, and map the point to the grid to generate semantic trajectory grid sequence data. Grid culling is performed on the semantic trajectory grid sequence data using the grid semantic space data, and set operations can be used. All semantic trajectory grids are removed from the grid semantic space data to obtain non-semantic trajectory grid data, which is used to identify the grid area that the speech semantic trajectory has not passed through. For each semantic trajectory grid, according to its position and direction in the speech semantic trajectory, the direction and size of the initial semantic vector on the grid are adjusted. For example, the direction of the initial semantic vector can be adjusted to be consistent with the tangent direction of the speech semantic trajectory at the grid, and the size of the initial semantic vector can be set to be proportional to the speed of the speech semantic trajectory at the grid, and the semantic trajectory vector adjustment data is generated. The heat conduction equation can be used for simulation, for example, the semantic trajectory vector adjustment data is regarded as a heat source, the non-semantic trajectory grid is regarded as a heat conduction medium, and the heat conduction equation is used to simulate the process of heat diffusion from the heat source to the surrounding. 
According to the simulation results, the temperature change on each non-semantic trajectory grid can be obtained, and it is converted into a corresponding vector adjustment value to generate non-trajectory vector adjustment data. The initial semantic vector field data is dynamically adjusted by using the semantic trajectory vector adjustment data and the non-trajectory vector adjustment data, and a weighted average method can be used. The initial semantic vector, the semantic trajectory vector adjustment amount, and the non-trajectory vector adjustment amount are weighted averaged to obtain the final semantic vector on each grid and generate dynamic semantic vector field data.
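The heat-style diffusion of adjustment values from trajectory grids (heat sources) into the surrounding non-trajectory grids can be sketched with an explicit finite-difference scheme on a 2-D grid; the iteration count and diffusion coefficient are illustrative choices, not values from the embodiment:

```python
def diffuse_adjustments(grid, source_cells, iters=50, alpha=0.2):
    """Explicit heat-equation-style diffusion of adjustment values over a 2-D grid.
    Cells in `source_cells` (the semantic-trajectory grids) are clamped, acting
    as heat sources; all other cells relax toward their neighbours' mean."""
    rows, cols = len(grid), len(grid[0])
    g = [row[:] for row in grid]
    for _ in range(iters):
        nxt = [row[:] for row in g]
        for i in range(rows):
            for j in range(cols):
                if (i, j) in source_cells:
                    continue  # heat sources keep their adjustment value
                nbrs = [g[ni][nj]
                        for ni, nj in ((i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1))
                        if 0 <= ni < rows and 0 <= nj < cols]
                nxt[i][j] = g[i][j] + alpha * (sum(nbrs) / len(nbrs) - g[i][j])
        g = nxt
    return g
```

The converged values on non-trajectory cells play the role of the "temperature change" that is converted into non-trajectory vector adjustment data; a weighted average of initial vector, trajectory adjustment, and diffused adjustment then yields the dynamic semantic vector field.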

优选地,步骤S4包括以下步骤:Preferably, step S4 comprises the following steps:

步骤S41:通过语义噪声区域数据对语音语义矢量场数据进行噪声抑制处理,并进行矢量场自适应去噪处理,生成降噪语义矢量场数据;Step S41: performing noise suppression processing on the speech semantic vector field data through the semantic noise region data, and performing vector field adaptive denoising processing to generate noise-reduced semantic vector field data;

步骤S42:利用语音语义轨迹数据对降噪语义矢量场数据进行语义轨迹重映射处理,生成语义轨迹重映射数据;Step S42: performing semantic trajectory remapping processing on the noise reduction semantic vector field data using the speech semantic trajectory data to generate semantic trajectory remapping data;

步骤S43:通过语义轨迹重映射数据对语音语义矢量场数据进行动态势能约束,并进行轨迹漂移势能计算,得到语义轨迹漂移势能数据;Step S43: Dynamic potential energy constraints are performed on the speech semantic vector field data through the semantic trajectory remapping data, and trajectory drift potential energy is calculated to obtain semantic trajectory drift potential energy data;

步骤S44:基于预设的噪声势能阈值通过语义轨迹漂移势能数据对语音语义轨迹数据进行关键语义噪声帧识别,并进行声学特征逆映射,生成语音关键噪声帧数据;Step S44: Based on a preset noise potential energy threshold, the speech semantic trajectory data is subjected to key semantic noise frame recognition through the semantic trajectory drift potential energy data, and acoustic feature inverse mapping is performed to generate speech key noise frame data;

步骤S45:通过语音关键噪声帧数据对分帧语音信号数据进行噪音帧修复处理,生成降噪语音信号数据。Step S45: Perform noise frame repair processing on the framed speech signal data using the speech key noise frame data to generate noise-reduced speech signal data.

作为本发明的一个实例,参考图4所示,为图1中步骤S4的详细实施步骤流程示意图,在本实例中所述步骤S4包括:As an example of the present invention, referring to FIG4 , which is a schematic flow chart of detailed implementation steps of step S4 in FIG1 , in this example, step S4 includes:

步骤S41:通过语义噪声区域数据对语音语义矢量场数据进行噪声抑制处理,并进行矢量场自适应去噪处理,生成降噪语义矢量场数据;Step S41: performing noise suppression processing on the speech semantic vector field data through the semantic noise region data, and performing vector field adaptive denoising processing to generate noise-reduced semantic vector field data;

本发明实施例中,通过语义噪声区域数据对语音语义矢量场数据进行噪声抑制处理,可以采用掩码方法。首先,根据语义噪声区域数据生成一个二值掩码,将语义噪声区域对应的网格点标记为0,其他区域标记为1。然后,将该掩码与语音语义矢量场数据相乘,即可抑制噪声区域的矢量值,保留非噪声区域的矢量信息。为了避免过度抑制,可以采用软掩码方法,将噪声区域的矢量值乘以一个小于1的权重,而非直接置零。随后进行矢量场自适应去噪处理,可以使用各向异性扩散方程来控制扩散的方向和速度,例如沿着语义矢量的方向进行扩散,并在语义边界处减缓扩散速度,以避免语义信息的过度平滑,最终生成降噪语义矢量场数据。In an embodiment of the present invention, a masking method can be used to perform noise suppression processing on speech semantic vector field data through semantic noise area data. First, a binary mask is generated according to the semantic noise area data, and the grid points corresponding to the semantic noise area are marked as 0, and other areas are marked as 1. Then, the mask is multiplied with the speech semantic vector field data to suppress the vector value of the noise area and retain the vector information of the non-noise area. In order to avoid excessive suppression, a soft masking method can be used to multiply the vector value of the noise area by a weight less than 1 instead of directly setting it to zero. Subsequently, for the vector field adaptive denoising, an anisotropic diffusion equation can be used to control the direction and speed of diffusion, for example, diffusing along the direction of the semantic vector and slowing down the diffusion speed at the semantic boundary to avoid excessive smoothing of the semantic information, and finally generating noise-reduced semantic vector field data.
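The soft-mask suppression described above can be sketched as follows; the field is flattened to a list of grid vectors for simplicity, and the noise weight is an illustrative value:

```python
def soft_mask_field(field, noise_mask, noise_weight=0.2):
    """Soft masking: vectors in noise regions are scaled by a weight < 1
    instead of being zeroed, avoiding over-suppression."""
    out = []
    for vec, is_noise in zip(field, noise_mask):
        w = noise_weight if is_noise else 1.0
        out.append([w * c for c in vec])
    return out
```

Setting `noise_weight=0.0` recovers the hard binary mask described first in the paragraph.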

步骤S42:利用语音语义轨迹数据对降噪语义矢量场数据进行语义轨迹重映射处理,生成语义轨迹重映射数据;Step S42: performing semantic trajectory remapping processing on the noise reduction semantic vector field data using the speech semantic trajectory data to generate semantic trajectory remapping data;

本发明实施例中,采用最近邻插值法进行语义轨迹重映射。首先,将降噪后的语义矢量场看作一个新的语义空间,该空间中的每个网格点都对应着一个语义向量。然后,对于语音语义轨迹数据中的每一个轨迹点,计算其与新语义空间中所有网格点的距离。可以选择欧氏距离、余弦距离等常用的距离度量方法计算距离。找到距离最近的网格点,并将该轨迹点映射到该网格点上。这样,就完成了原始语音语义轨迹到降噪语义矢量场上的映射。假设原始语义空间中有一个轨迹点A,其坐标为(1,2),在降噪后的语义空间中,与其距离最近的网格点的坐标为(1.1,1.9),则将轨迹点A重映射到坐标为(1.1,1.9)的网格点上。对语音语义轨迹数据中的所有轨迹点进行相同的操作,即可得到语义轨迹重映射数据。In an embodiment of the present invention, the nearest neighbor interpolation method is used for semantic trajectory remapping. First, the semantic vector field after noise reduction is regarded as a new semantic space, and each grid point in the space corresponds to a semantic vector. Then, for each trajectory point in the speech semantic trajectory data, the distance between it and all grid points in the new semantic space is calculated. Common distance measurement methods such as Euclidean distance and cosine distance can be selected to calculate the distance. Find the nearest grid point and map the trajectory point to the grid point. In this way, the mapping of the original speech semantic trajectory to the noise reduction semantic vector field is completed. Assuming that there is a trajectory point A in the original semantic space, its coordinates are (1,2), and in the semantic space after noise reduction, the coordinates of the grid point closest to it are (1.1,1.9), then the trajectory point A is remapped to the grid point with coordinates (1.1,1.9). Perform the same operation on all trajectory points in the speech semantic trajectory data to obtain the semantic trajectory remapping data.
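The nearest-neighbour remapping described above, including the (1,2) → (1.1,1.9) example, can be sketched as follows; the function name and array shapes are illustrative assumptions, and Euclidean distance is used here although the text notes cosine distance would also work.

```python
import numpy as np

def remap_trajectory(trajectory, grid_points):
    """Map each semantic trajectory point to its nearest grid point.

    trajectory:  (N, d) array of trajectory points in the original space.
    grid_points: (M, d) array of grid points of the denoised semantic space.
    """
    remapped = []
    for p in trajectory:
        dists = np.linalg.norm(grid_points - p, axis=1)  # distance to every grid point
        remapped.append(grid_points[np.argmin(dists)])   # snap to the nearest one
    return np.array(remapped)
```

For large grids, a KD-tree lookup would replace the brute-force distance scan without changing the result.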

步骤S43：通过语义轨迹重映射数据对语音语义矢量场数据进行动态势能约束，并进行轨迹漂移势能计算，得到语义轨迹漂移势能数据；Step S43: applying dynamic potential energy constraints to the speech semantic vector field data through the semantic trajectory remapping data, and performing trajectory drift potential energy calculation to obtain semantic trajectory drift potential energy data;

本发明实施例中,以每个重映射后的轨迹点为中心,构建一个高斯函数,该函数的值在轨迹点处最大,随着距离轨迹点距离的增加而衰减。高斯函数的幅值可以根据轨迹点的重要性进行调整,例如可以使用轨迹点的速度或曲率作为权重。将所有轨迹点对应的高斯函数叠加起来,就可以得到一个描述语义轨迹影响力的势能约束场。以轨迹点为中心,距离轨迹点越近的区域,其势能约束作用越强。然后,进行轨迹漂移势能计算。对于重映射后的语义轨迹上的每个点,计算其在原始语义矢量场和降噪语义矢量场上对应位置的势能差值,并将该差值作为该点的轨迹漂移势能,反映了噪声抑制处理对语音语义轨迹的影响程度,得到语义轨迹漂移势能数据。In an embodiment of the present invention, a Gaussian function is constructed with each remapped trajectory point as the center, and the value of the function is the largest at the trajectory point and decays as the distance from the trajectory point increases. The amplitude of the Gaussian function can be adjusted according to the importance of the trajectory point. For example, the speed or curvature of the trajectory point can be used as a weight. By superimposing the Gaussian functions corresponding to all trajectory points, a potential constraint field that describes the influence of the semantic trajectory can be obtained. With the trajectory point as the center, the closer the area is to the trajectory point, the stronger its potential constraint effect. Then, the trajectory drift potential energy calculation is performed. For each point on the remapped semantic trajectory, the potential energy difference between the corresponding positions on the original semantic vector field and the denoised semantic vector field is calculated, and the difference is used as the trajectory drift potential energy of the point, which reflects the degree of influence of the noise suppression processing on the speech semantic trajectory, and the semantic trajectory drift potential energy data is obtained.
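A minimal sketch of the superposed Gaussian constraint field and the trajectory drift potential described above, under the assumption of 2-D grid coordinates and per-point amplitude weights (e.g. derived from speed or curvature); the function names and `sigma` value are illustrative.

```python
import numpy as np

def potential_field(grid_xy, traj_points, amplitudes, sigma=1.0):
    """Superpose one Gaussian per remapped trajectory point.

    Each Gaussian peaks at its trajectory point and decays with distance;
    `amplitudes` weight the points by their importance.
    """
    grid_xy = np.asarray(grid_xy)
    field = np.zeros(len(grid_xy))
    for p, a in zip(traj_points, amplitudes):
        d2 = np.sum((grid_xy - p) ** 2, axis=1)      # squared distance to the point
        field += a * np.exp(-d2 / (2.0 * sigma**2))  # Gaussian bump, peak = a
    return field

def drift_potential(pot_original, pot_denoised):
    """Trajectory drift potential: per-point difference between the potential
    evaluated on the original and on the denoised semantic vector fields."""
    return np.abs(np.asarray(pot_original) - np.asarray(pot_denoised))
```

A large drift potential at a point then signals that noise suppression moved that part of the trajectory significantly.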

步骤S44:基于预设的噪声势能阈值通过语义轨迹漂移势能数据对语音语义轨迹数据进行关键语义噪声帧识别,并进行声学特征逆映射,生成语音关键噪声帧数据;Step S44: Based on a preset noise potential energy threshold, the speech semantic trajectory data is subjected to key semantic noise frame recognition through the semantic trajectory drift potential energy data, and acoustic feature inverse mapping is performed to generate speech key noise frame data;

本发明实施例中，通过语义轨迹漂移势能数据对语音语义轨迹数据进行关键语义噪声帧识别。如果某个语音帧对应的语义轨迹点的漂移势能高于预设的阈值，则认为该语音帧受到了噪声的干扰，将其标记为关键语义噪声帧。因为较大的漂移势能表示该语音帧在语义空间中的位置发生了显著变化，而这通常是由于噪声导致的。然后，对关键语义噪声帧进行声学特征逆映射。由于之前的处理步骤都是在语义空间中进行的，因此需要将关键语义噪声帧映射回原始的声学特征空间。可以使用预训练的声学模型或深度神经网络将语义特征转换为声学特征，生成语音关键噪声帧数据。In an embodiment of the present invention, key semantic noise frames are identified in the speech semantic trajectory data using the semantic trajectory drift potential energy data. If the drift potential energy of the semantic trajectory point corresponding to a speech frame is higher than the preset threshold, that frame is considered to be corrupted by noise and is marked as a key semantic noise frame: a large drift potential energy indicates that the frame's position in the semantic space has shifted significantly, which is usually caused by noise. The key semantic noise frames are then subjected to acoustic feature inverse mapping. Since the preceding processing steps are all performed in the semantic space, the key semantic noise frames need to be mapped back to the original acoustic feature space; a pre-trained acoustic model or a deep neural network can be used to convert the semantic features into acoustic features, generating the speech key noise frame data.
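The threshold test on the drift potential reduces to a few lines; the function name and the example threshold are illustrative assumptions, and the acoustic-feature inverse mapping is not shown.

```python
import numpy as np

def flag_key_noise_frames(drift_energy, threshold):
    """Return the indices of frames whose semantic trajectory drift
    potential exceeds the preset noise potential energy threshold."""
    drift_energy = np.asarray(drift_energy)
    return np.flatnonzero(drift_energy > threshold).tolist()
```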

步骤S45:通过语音关键噪声帧数据对分帧语音信号数据进行噪音帧修复处理,生成降噪语音信号数据。Step S45: Perform noise frame repair processing on the framed speech signal data using the speech key noise frame data to generate noise-reduced speech signal data.

本发明实施例中，确保分帧语音信号数据和语音关键噪声帧数据均已准备好。分帧语音信号数据通常为一个二维数组，其中每一行代表一个时间帧，每一列代表特征维度。语音关键噪声帧数据则记录了被识别为噪声的帧索引。将语音关键噪声帧的声学特征以及其前后几帧的干净语音帧的声学特征作为输入，结合语音语义矢量场数据，对关键噪声帧的声学特征向量进行调整，使其更接近于目标语音信号的声学特征向量。训练WaveNet模型学习噪声语音到干净语音的映射关系。训练完成后，可以使用该模型对关键噪声帧进行修复，生成估计的干净语音帧。In an embodiment of the present invention, it is first ensured that both the framed speech signal data and the speech key noise frame data are ready. The framed speech signal data is usually a two-dimensional array in which each row represents a time frame and each column a feature dimension, while the speech key noise frame data records the indices of the frames identified as noise. The acoustic features of each speech key noise frame, together with those of the clean speech frames immediately before and after it, are taken as input, and the acoustic feature vector of the key noise frame is adjusted in combination with the speech semantic vector field data to bring it closer to the acoustic feature vector of the target speech signal. A WaveNet model is trained to learn the mapping from noisy speech to clean speech; once training is complete, the model can be used to repair the key noise frames and generate estimated clean speech frames.
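As a crude stand-in for the trained WaveNet-style repair model, the sketch below replaces each flagged frame with the mean of its neighbouring clean frames; the `context` width, function name, and (T, D) array layout are illustrative assumptions, and in the described method a learned model would take the place of the averaging step.

```python
import numpy as np

def repair_noise_frames(frames, noise_idx, context=2):
    """Replace each flagged frame with the mean of nearby clean frames.

    frames:    (T, D) array, one row per time frame of acoustic features.
    noise_idx: indices of frames flagged as key noise frames.
    context:   number of frames on each side considered as repair context.
    """
    noise_set = set(noise_idx)
    repaired = frames.copy()
    for i in noise_idx:
        lo, hi = max(0, i - context), min(len(frames), i + context + 1)
        clean = [frames[j] for j in range(lo, hi) if j not in noise_set]
        if clean:  # fall back to the original frame if no clean neighbour exists
            repaired[i] = np.mean(clean, axis=0)
    return repaired
```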

优选地,步骤S5包括以下步骤:Preferably, step S5 comprises the following steps:

步骤S51:对降噪语音信号数据进行时频分析,得到降噪语音频谱数据;对语音语义矢量场数据进行频域转化处理,生成语义频谱场数据;Step S51: performing time-frequency analysis on the noise reduction speech signal data to obtain noise reduction speech spectrum data; performing frequency domain conversion processing on the speech semantic vector field data to generate semantic spectrum field data;

步骤S52:根据降噪语音频谱数据进行多基频谱分解处理,生成多基频语音谱数据;Step S52: performing multi-base spectrum decomposition processing according to the noise reduction speech spectrum data to generate multi-base frequency speech spectrum data;

步骤S53:利用语义频谱场数据对多基频语音谱数据进行语义谐波关联分析,生成语义谐波关联矩阵数据;Step S53: using the semantic spectrum field data to perform semantic harmonic association analysis on the multi-fundamental frequency speech spectrum data to generate semantic harmonic association matrix data;

步骤S54:根据语义谐波关联矩阵数据进行谐波语义置信度评估,并进行语音增强掩码处理,生成语音增强引导掩码数据;Step S54: performing harmonic semantic confidence evaluation according to the semantic harmonic association matrix data, and performing speech enhancement mask processing to generate speech enhancement guidance mask data;

步骤S55:通过语音增强引导掩码数据对降噪语音频谱数据进行语义增强处理,并进行听觉掩蔽频谱修正,生成修正语音频谱数据;Step S55: semantically enhancing the noise reduction speech spectrum data through the speech enhancement guide mask data, and performing auditory masking spectrum correction to generate corrected speech spectrum data;

步骤S56:根据修正语音频谱数据进行语音信号重构处理,生成增强语音信号数据。Step S56: Perform speech signal reconstruction processing according to the modified speech spectrum data to generate enhanced speech signal data.

本发明实施例中,对降噪语音信号数据进行时频分析,可以使用短时傅里叶变换(STFT)。将降噪语音信号数据划分成多个重叠的短时帧,对每一帧进行傅里叶变换,得到该帧的频谱信息。将所有帧的频谱信息按时间顺序排列,即可得到降噪语音频谱数据。将语义矢量场中的每个语义向量投影到预先定义的语义基向量上,得到该语义向量在不同语义维度上的分量。然后,对每个语义维度上的分量进行傅里叶变换,得到该语义维度在频域上的表示。根据降噪语音频谱数据进行多基频谱分解处理,可以使用经验模态分解(EMD)算法。EMD算法可以将一个非平稳信号分解成多个不同频率尺度的固有模态函数(IMF),每个IMF分量代表了原始信号在特定频率范围内的振荡模式。将降噪语音频谱数据作为输入,通过EMD算法将其分解成多个IMF分量,生成多基频语音谱数据,计算每个IMF分量的瞬时频率,该频率代表了该IMF分量在特定时间点的频率值。然后,根据瞬时频率,将每个IMF分量与语义频谱场数据中对应频率的语义信息进行关联。可以计算IMF分量与语义频谱之间在特定频率点上的相似度,例如使用余弦相似度。最后,将所有时间点和所有IMF分量的相似度组合起来,生成语义谐波关联矩阵数据。可以根据相似度值的大小,设定一个阈值,将相似度高于阈值的IMF分量标记为与该语义信息强相关,并赋予较高的置信度;将相似度低于阈值的IMF分量标记为与该语义信息弱相关,并赋予较低的置信度。然后,进行语音增强掩码处理。根据谐波语义置信度,生成语音增强引导掩码数据。置信度高的区域对应着重要的语音信息,掩码值接近于1;置信度低的区域对应着噪声或不重要的语音信息,掩码值接近于0。将语音增强引导掩码数据与降噪语音频谱数据相乘,增强与重要语义信息相关的语音成分,抑制与噪声或不重要语义信息相关的语音成分。然后,进行听觉掩蔽频谱修正。根据人耳的听觉掩蔽效应,对语义增强后的语音频谱进行修正,例如降低噪声在掩蔽阈值以下的能量,以提高语音的可懂度,生成修正语音频谱数据。可以使用逆短时傅里叶变换(ISTFT)。将修正后的频谱数据进行ISTFT,即可得到增强语音信号数据,完成语音信号的增强过程。In an embodiment of the present invention, a short-time Fourier transform (STFT) can be used to perform time-frequency analysis on the noise reduction speech signal data. The noise reduction speech signal data is divided into a plurality of overlapping short-time frames, and a Fourier transform is performed on each frame to obtain the spectrum information of the frame. The spectrum information of all frames is arranged in chronological order to obtain the noise reduction speech spectrum data. Each semantic vector in the semantic vector field is projected onto a predefined semantic basis vector to obtain the components of the semantic vector in different semantic dimensions. Then, a Fourier transform is performed on the components on each semantic dimension to obtain the representation of the semantic dimension in the frequency domain. The empirical mode decomposition (EMD) algorithm can be used to perform multi-base spectrum decomposition processing according to the noise reduction speech spectrum data. 
The EMD algorithm decomposes a non-stationary signal into a number of intrinsic mode functions (IMFs) at different frequency scales, each IMF component representing the oscillation mode of the original signal within a specific frequency range. Taking the noise-reduced speech spectrum data as input, the EMD algorithm decomposes it into multiple IMF components to generate the multi-base frequency speech spectrum data, and the instantaneous frequency of each IMF component is calculated, representing the frequency of that component at a specific time point. Then, according to the instantaneous frequency, each IMF component is associated with the semantic information at the corresponding frequency in the semantic spectrum field data; the similarity between an IMF component and the semantic spectrum at a specific frequency point can be computed, for example, with cosine similarity. Finally, the similarities over all time points and all IMF components are combined to generate the semantic harmonic association matrix data. A threshold can be set on the similarity values: IMF components whose similarity exceeds the threshold are marked as strongly correlated with the semantic information and given a high confidence, while those below the threshold are marked as weakly correlated and given a low confidence. Speech enhancement mask processing is then performed: speech enhancement guide mask data is generated from the harmonic semantic confidence, with mask values close to 1 in high-confidence regions corresponding to important speech information, and close to 0 in low-confidence regions corresponding to noise or unimportant speech information. 
The speech enhancement guide mask data is multiplied with the noise-reduced speech spectrum data, enhancing the speech components related to important semantic information and suppressing those related to noise or unimportant semantic information. Auditory masking spectrum correction is then performed: according to the auditory masking effect of the human ear, the semantically enhanced speech spectrum is corrected, for example by reducing noise energy below the masking threshold, to improve speech intelligibility and generate the corrected speech spectrum data. Finally, an inverse short-time Fourier transform (ISTFT) is applied to the corrected spectrum data to obtain the enhanced speech signal data, completing the speech signal enhancement process.
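The confidence-to-mask and mask-application steps (steps S54 and S55) can be sketched as follows; the mask `floor` value, function names, and array shapes are illustrative assumptions, and the EMD decomposition, auditory-masking correction, and ISTFT reconstruction are omitted.

```python
import numpy as np

def enhancement_mask(confidence, floor=0.1):
    """Map harmonic semantic confidence in [0, 1] to spectral gains.

    High-confidence bins keep gains near 1; low-confidence bins are
    attenuated but kept above `floor` so no bin is silenced outright.
    """
    return np.clip(np.asarray(confidence, dtype=float), floor, 1.0)

def apply_enhancement(spectrum, mask):
    """Element-wise masking of the noise-reduced magnitude spectrum:
    enhance bins tied to important semantics, suppress the rest."""
    return np.asarray(spectrum) * mask
```

The masked spectrum would then pass through the auditory-masking correction before the ISTFT reconstructs the enhanced time-domain signal.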

优选地,本发明还提供一种语音信号降噪系统,执行如上所述的语音信号降噪方法,该语音信号降噪系统包括:Preferably, the present invention further provides a speech signal denoising system, which executes the speech signal denoising method as described above, and the speech signal denoising system comprises:

自适应分帧模块,用于对待处理语音信号数据进行自适应分帧处理,得到分帧语音信号数据;根据分帧语音信号数据进行基频声纹特征融合,生成初始声纹基底数据;The adaptive framing module is used to perform adaptive framing processing on the voice signal data to be processed to obtain framed voice signal data; perform fundamental frequency voiceprint feature fusion according to the framed voice signal data to generate initial voiceprint base data;

语音语义分析模块,用于根据初始声纹基底数据进行声学上下文增强处理,生成声学特征向量数据;通过声学特征向量数据对分帧语音信号数据进行语义嵌入空间处理,得到语义嵌入空间数据;根据语义嵌入空间数据进行时序语义轨迹处理,得到语音语义轨迹数据;The speech semantic analysis module is used to perform acoustic context enhancement processing based on the initial voiceprint base data to generate acoustic feature vector data; perform semantic embedding space processing on the framed speech signal data through the acoustic feature vector data to obtain semantic embedding space data; perform time-series semantic trajectory processing based on the semantic embedding space data to obtain speech semantic trajectory data;

语音噪声识别模块,用于根据语音语义轨迹数据以及语义嵌入空间数据进行语义矢量场构建,生成语音语义矢量场数据;对语音语义矢量场数据进行语音噪声区域标记,生成语义噪声区域数据;The speech noise recognition module is used to construct a semantic vector field based on the speech semantic trajectory data and the semantic embedding space data to generate speech semantic vector field data; mark the speech noise area of the speech semantic vector field data to generate semantic noise area data;

语音噪声修复模块,用于通过语义噪声区域数据对语音语义矢量场数据进行噪声抑制处理,生成降噪语义矢量场数据;利用降噪语义矢量场数据对语音语义轨迹数据进行关键语义噪声帧识别,生成语音关键噪声帧数据;根据语音关键噪声帧数据进行噪音帧修复处理,生成降噪语音信号数据;The speech noise repair module is used to perform noise suppression processing on the speech semantic vector field data through the semantic noise area data to generate noise reduction semantic vector field data; perform key semantic noise frame recognition on the speech semantic trajectory data using the noise reduction semantic vector field data to generate speech key noise frame data; perform noise frame repair processing according to the speech key noise frame data to generate noise reduction speech signal data;

语音信号增强模块,用于对降噪语音信号数据进行多基频谱分解处理,生成多基频语音谱数据;根据多基频语音谱数据进行听觉掩蔽频谱修正,并进行语音信号重构处理,生成增强语音信号数据。The speech signal enhancement module is used to perform multi-base spectrum decomposition processing on the noise reduction speech signal data to generate multi-base frequency speech spectrum data; perform auditory masking spectrum correction based on the multi-base frequency speech spectrum data, and perform speech signal reconstruction processing to generate enhanced speech signal data.

优选地,本发明还提供了一种语音信号降噪设备,包括:Preferably, the present invention further provides a speech signal noise reduction device, comprising:

存储器,用于存储计算机程序;Memory for storing computer programs;

处理器,用于执行所述计算机程序时实现如权利要求1至8任一项所述的语音信号降噪方法的步骤。A processor, configured to implement the steps of the method for reducing noise of a speech signal as claimed in any one of claims 1 to 8 when executing the computer program.

本申请有益效果在于，对待处理语音信号数据进行自适应分帧处理，将连续的语音信号划分为若干帧，能够充分捕捉到语音信号的动态特性。与传统方法相比，自适应分帧处理不仅能提供更精确的时间分辨率，还能更好地反映信号的瞬时变化。利用语音识别技术将语音信号转换为文本信息。在此基础上，利用预训练的语义模型将文本信息映射到语义嵌入空间，构建出包含丰富语义信息的语音表示。将语义嵌入空间进一步构建为语义矢量场，通过分析语音信号随时间变化的语义特征，系统能够识别出语音信号中的关键部分，降低因静默或背景噪声干扰而导致的语音信号丢失问题。将语音信号在时间维度上的语义变化轨迹映射到该矢量场中。The beneficial effects of the present application are as follows. The speech signal data to be processed is adaptively framed, dividing the continuous speech signal into several frames, which fully captures the dynamic characteristics of the speech signal. Compared with traditional methods, adaptive framing not only provides more accurate time resolution but also better reflects the instantaneous changes of the signal. Speech recognition technology is used to convert the speech signal into text information; on this basis, a pre-trained semantic model maps the text information into a semantic embedding space, constructing a speech representation rich in semantic information. The semantic embedding space is further built into a semantic vector field; by analyzing how the semantic features of the speech signal change over time, the system can identify the key parts of the speech signal and reduce the loss of speech information caused by silence or background-noise interference. The semantic change trajectory of the speech signal along the time dimension is then mapped into this vector field. 
By analyzing the topological structure of the semantic vector field and the motion characteristics of the speech semantic trajectory, such as semantic vortex, trajectory drift, etc., potential noise areas can be effectively identified, especially the weak signal part under the noise background. Targeted noise suppression processing is performed on the semantic vector field, such as suppressing or setting the vectors in the noise area to zero, identifying key speech frames affected by noise, and using deep learning models to perform fine repair on these frames, thereby minimizing noise while retaining the original speech information. The denoised speech is subjected to multi-base spectrum decomposition, and harmonic correlation analysis is performed in combination with semantic information to evaluate the correlation between different frequency components and semantic information, and a speech enhancement guidance mask is generated accordingly. Finally, the speech spectrum is enhanced using the mask and corrected in combination with the auditory masking effect, ultimately reconstructing a clear, highly intelligible enhanced speech signal, significantly improving the user experience.

因此，无论从哪一点来看，均应将实施例看作是示范性的，而且是非限制性的，本发明的范围由所附权利要求而不是上述说明限定，因此旨在将落在申请文件的等同要件的含义和范围内的所有变化涵括在本发明内。Therefore, the embodiments are to be regarded in all respects as illustrative and non-restrictive. The scope of the present invention is defined by the appended claims rather than by the foregoing description, and all changes that fall within the meaning and range of equivalent elements of the application documents are therefore intended to be embraced by the present invention.

以上所述仅是本发明的具体实施方式，使本领域技术人员能够理解或实现本发明。对这些实施例的多种修改对本领域的技术人员来说将是显而易见的，本文中所定义的一般原理可以在不脱离本发明的精神或范围的情况下，在其它实施例中实现。因此，本发明将不会被限制于本文所示的这些实施例，而是要符合与本文所发明的原理和新颖特点相一致的最宽的范围。The above description covers only specific embodiments of the present invention, enabling those skilled in the art to understand or implement it. Various modifications to these embodiments will be apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of the present invention. Therefore, the present invention is not to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

Translated from Chinese
1.一种语音信号降噪方法,其特征在于,包括以下步骤:1. A method for reducing noise of a speech signal, comprising the following steps:步骤S1:对待处理语音信号数据进行自适应分帧处理,得到分帧语音信号数据;根据分帧语音信号数据进行基频声纹特征融合,生成初始声纹基底数据;Step S1: Adaptively frame the speech signal data to be processed to obtain framed speech signal data; perform fundamental frequency voiceprint feature fusion according to the framed speech signal data to generate initial voiceprint base data;步骤S2:根据初始声纹基底数据进行声学上下文增强处理,生成声学特征向量数据;通过声学特征向量数据对分帧语音信号数据进行语义嵌入空间处理,得到语义嵌入空间数据;根据语义嵌入空间数据进行时序语义轨迹处理,得到语音语义轨迹数据;Step S2: Perform acoustic context enhancement processing on the initial voiceprint base data to generate acoustic feature vector data; perform semantic embedding space processing on the framed speech signal data through the acoustic feature vector data to obtain semantic embedding space data; perform temporal semantic trajectory processing on the semantic embedding space data to obtain speech semantic trajectory data;步骤S3:根据语音语义轨迹数据以及语义嵌入空间数据进行语义矢量场构建,生成语音语义矢量场数据;对语音语义矢量场数据进行语音噪声区域标记,生成语义噪声区域数据;Step S3: constructing a semantic vector field according to the speech semantic trajectory data and the semantic embedding space data to generate speech semantic vector field data; marking the speech noise area of the speech semantic vector field data to generate semantic noise area data;步骤S4:通过语义噪声区域数据对语音语义矢量场数据进行噪声抑制处理,生成降噪语义矢量场数据;利用降噪语义矢量场数据对语音语义轨迹数据进行关键语义噪声帧识别,生成语音关键噪声帧数据;根据语音关键噪声帧数据进行噪音帧修复处理,生成降噪语音信号数据;Step S4: performing noise suppression processing on the speech semantic vector field data through the semantic noise area data to generate noise reduction semantic vector field data; performing key semantic noise frame recognition on the speech semantic trajectory data using the noise reduction semantic vector field data to generate speech key noise frame data; performing noise frame repair processing according to the speech key noise frame data to generate noise reduction speech signal 
data;步骤S5:对降噪语音信号数据进行多基频谱分解处理,生成多基频语音谱数据;根据多基频语音谱数据进行听觉掩蔽频谱修正,并进行语音信号重构处理,生成增强语音信号数据。Step S5: performing multi-base frequency spectrum decomposition processing on the noise reduction speech signal data to generate multi-base frequency speech spectrum data; performing auditory masking spectrum correction based on the multi-base frequency speech spectrum data, and performing speech signal reconstruction processing to generate enhanced speech signal data.2.根据权利要求1所述的语音信号降噪方法,其特征在于,步骤S1包括以下步骤:2. The method for reducing noise of a speech signal according to claim 1, wherein step S1 comprises the following steps:步骤S11:对待处理语音信号数据进行语音端点检测,并进行静音段剔除处理,生成初始语音信号数据;Step S11: performing voice endpoint detection on the voice signal data to be processed and performing silent segment elimination processing to generate initial voice signal data;步骤S12:根据初始语音信号数据进行基频周期提取,得到基频轮廓特征数据;Step S12: extracting the fundamental frequency period according to the initial speech signal data to obtain fundamental frequency contour feature data;步骤S13:根据初始语音信号数据进行自适应分帧处理,得到分帧语音信号数据;Step S13: performing adaptive framing processing according to the initial voice signal data to obtain framed voice signal data;步骤S14:对分帧语音信号数据进行快速傅里叶变换处理,并根据梅尔滤波器进行梅尔频率倒谱系数计算,生成梅尔频谱特征矩阵数据;Step S14: performing fast Fourier transform processing on the framed speech signal data, and calculating the Mel frequency cepstrum coefficients according to the Mel filter to generate Mel frequency spectrum feature matrix data;步骤S15:通过梅尔频谱特征矩阵数据对基频轮廓特征数据进行动态时间规整处理,并进行基频声纹特征融合,得到基频融合特征数据;Step S15: performing dynamic time warping processing on the fundamental frequency profile feature data through the Mel frequency spectrum feature matrix data, and performing fundamental frequency voiceprint feature fusion to obtain fundamental frequency fusion feature data;步骤S16:基于基频融合特征数据进行声纹分布概率估计,并进行高斯模糊泛化处理,生成初始声纹基底数据。Step S16: Estimating the voiceprint distribution probability based on the baseband fusion feature data, and performing Gaussian fuzzy 
generalization processing to generate initial voiceprint base data.3.根据权利要求1所述的语音信号降噪方法,其特征在于,步骤S2包括以下步骤:3. The method for reducing noise of a speech signal according to claim 1, wherein step S2 comprises the following steps:步骤S21:根据初始声纹基底数据进行声学上下文增强处理,生成声学特征向量数据;Step S21: performing acoustic context enhancement processing according to the initial voiceprint base data to generate acoustic feature vector data;步骤S22:基于声学特征向量数据利用预设的语音识别模型对分帧语音信号数据进行逐帧语音识别,得到帧级语音文本数据;Step S22: performing frame-by-frame speech recognition on the framed speech signal data using a preset speech recognition model based on the acoustic feature vector data to obtain frame-level speech text data;步骤S23:根据帧级语音文本数据进行语义嵌入空间处理,得到语义嵌入空间数据;Step S23: performing semantic embedding space processing according to the frame-level speech text data to obtain semantic embedding space data;步骤S24:通过语义嵌入空间数据对分帧语音信号数据进行语义空间映射,生成分帧语音语义坐标数据;Step S24: semantic space mapping is performed on the framed speech signal data by using the semantic embedding space data to generate framed speech semantic coordinate data;步骤S25:根据分帧语音语义坐标数据进行时序语义轨迹处理,得到语音语义轨迹数据。Step S25: performing time-series semantic trajectory processing according to the framed speech semantic coordinate data to obtain speech semantic trajectory data.4.根据权利要求3所述的语音信号降噪方法,其特征在于,步骤S23包括以下步骤:4. 
The method for reducing noise of a speech signal according to claim 3, wherein step S23 comprises the following steps:步骤S231:对帧级语音文本数据进行高维语义向量转换,生成语义原向量数据;Step S231: converting the frame-level speech text data into a high-dimensional semantic vector to generate semantic original vector data;步骤S232:根据语义原向量数据进行高维流形结构构建,得到语义流形结构数据;Step S232: constructing a high-dimensional manifold structure according to the semantic original vector data to obtain semantic manifold structure data;步骤S233:通过帧级语音文本数据对语义流形结构数据进行核心语义因子标识,生成核心语义因子数据;Step S233: using the frame-level speech text data to identify the core semantic factors of the semantic manifold structure data, and generating core semantic factor data;步骤S234:利用预设的语音语料库对帧级语音文本数据进行上下文语义嵌入匹配,并通过对抗生成网络进行语义语料生成,得到多样化语料数据;Step S234: using a preset speech corpus to perform contextual semantic embedding matching on the frame-level speech text data, and generating semantic corpus through a generative adversarial network to obtain diversified corpus data;步骤S235:根据多样化语料数据进行语义特征解耦,并进行语义关联网络处理,生成语义关系网络数据;Step S235: decoupling semantic features according to the diversified corpus data, and performing semantic association network processing to generate semantic relationship network data;步骤S236:通过语义关系网络数据对语义流形结构数据进行动态语义融合,得到动态语义嵌入数据;Step S236: Dynamically semantically fuse the semantic manifold structure data through the semantic relationship network data to obtain dynamic semantic embedding data;步骤S237:对动态语义嵌入数据进行语义刻度标定,并根据核心语义因子数据进行KD树快速索引构建,从而得到语义嵌入空间数据。Step S237: semantically calibrate the dynamic semantic embedding data, and construct a KD tree fast index based on the core semantic factor data, so as to obtain semantic embedding space data.5.根据权利要求1所述的语音信号降噪方法,其特征在于,步骤S3包括以下步骤:5. 
The method for reducing noise of a speech signal according to claim 1, wherein step S3 comprises the following steps:步骤S31:根据语音语义轨迹数据以及语义嵌入空间数据进行语义矢量场构建,生成语音语义矢量场数据;Step S31: constructing a semantic vector field according to the speech semantic trajectory data and the semantic embedding space data to generate speech semantic vector field data;步骤S32:对语音语义矢量场数据进行网格张量特征分析,生成网格语义张量数据;通过网格语义张量数据对语音语义矢量场数据进行语义矢量流线追踪,得到语义矢量流线数据;Step S32: performing grid tensor feature analysis on the speech semantic vector field data to generate grid semantic tensor data; performing semantic vector streamline tracing on the speech semantic vector field data through the grid semantic tensor data to obtain semantic vector streamline data;步骤S33:根据语义矢量流线数据进行流线稳定性评估,生成流线稳定评估数据;Step S33: performing streamline stability evaluation according to the semantic vector streamline data to generate streamline stability evaluation data;步骤S34:基于语义矢量流线数据对语音语义矢量场数据进行语义涡流检测,生成语音涡流区域数据;Step S34: performing semantic eddy current detection on the speech semantic vector field data based on the semantic vector streamline data to generate speech eddy current area data;步骤S35:通过预设的多尺度滑动窗口对语义矢量流线数据进行平均速度计算,生成语义流线速度序列数据;Step S35: calculating the average speed of the semantic vector streamline data through a preset multi-scale sliding window to generate semantic streamline speed sequence data;步骤S36:基于语义流线速度序列数据对语音涡流区域数据进行语义加速度提取,并进行涡流强度量化,生成涡流强度数据;Step S36: extracting semantic acceleration from speech eddy current region data based on semantic streamline velocity sequence data, and quantifying eddy current intensity to generate eddy current intensity data;步骤S37:通过涡流强度数据、语音涡流区域数据以及流线稳定评估数据对语音语义矢量场数据进行语音噪声区域标记,生成语义噪声区域数据。Step S37: Mark the speech noise area of the speech semantic vector field data using the eddy current intensity data, the speech eddy current area data, and the streamline stability evaluation data to generate semantic noise area data.6.根据权利要求5所述的语音信号降噪方法,其特征在于,步骤S31包括以下步骤:6. 
The method for reducing noise of a speech signal according to claim 5, wherein step S31 comprises the following steps:步骤S311:根据语义嵌入空间数据进行语义空间网格化处理,生成网格语义空间数据;对网格语义空间数据进行初始语义势能值计算,生成初始语义势能数据;Step S311: performing semantic space gridding processing on the semantic embedding space data to generate grid semantic space data; performing initial semantic potential energy value calculation on the grid semantic space data to generate initial semantic potential energy data;步骤S312:通过初始语义势能数据对网格语义空间数据进行初始矢量场构建,得到初始语义矢量场数据;Step S312: constructing an initial vector field for the grid semantic space data using the initial semantic potential energy data to obtain initial semantic vector field data;步骤S313:通过语音语义轨迹数据对初始语义矢量场数据进行矢量场动态调整,生成动态语义矢量场数据;其中,步骤S313具体为:Step S313: dynamically adjusting the initial semantic vector field data using the speech semantic trajectory data to generate dynamic semantic vector field data; wherein step S313 is specifically as follows:步骤S3131:利用语音语义轨迹数据对网格语义空间数据进行语义轨迹映射处理,生成语义轨迹网格序列数据;Step S3131: using the speech semantic trajectory data to perform semantic trajectory mapping processing on the grid semantic space data to generate semantic trajectory grid sequence data;步骤S3132:利用网格语义空间数据对语义轨迹网格序列数据进行网格剔除,得到非语义轨迹网格数据;Step S3132: using the grid semantic space data to perform grid elimination on the semantic trajectory grid sequence data to obtain non-semantic trajectory grid data;步骤S3133:基于语义轨迹网格序列数据利用语音语义轨迹数据对初始语义矢量场数据进行语义轨迹矢量调整,生成语义轨迹矢量调整数据;Step S3133: adjusting the semantic trajectory vector of the initial semantic vector field data using the speech semantic trajectory data based on the semantic trajectory grid sequence data to generate semantic trajectory vector adjustment data;步骤S3134:基于非语义轨迹网格数据利用语义轨迹矢量调整数据对初始语义矢量场数据进行势能场扩散模拟,并进行矢量调整值推导,生成非轨迹矢量调整数据;Step S3134: Based on the non-semantic trajectory grid data, the semantic trajectory vector adjustment data is used to perform potential field diffusion simulation on the initial semantic vector field data, 
and vector adjustment value derivation is performed to generate non-trajectory vector adjustment data;

Step S3135: dynamically adjusting the initial semantic vector field data using the semantic trajectory vector adjustment data and the non-trajectory vector adjustment data to generate dynamic semantic vector field data;

Step S314: performing vector field smoothing optimization on the dynamic semantic vector field data to generate speech semantic vector field data.

7. The speech signal noise reduction method according to claim 1, wherein step S4 comprises the following steps:

Step S41: performing noise suppression on the speech semantic vector field data using the semantic noise region data, followed by vector field adaptive denoising, to generate noise-reduced semantic vector field data;

Step S42: performing semantic trajectory remapping on the noise-reduced semantic vector field data using the speech semantic trajectory data to generate semantic trajectory remapping data;

Step S43: applying dynamic potential energy constraints to the speech semantic vector field data using the semantic trajectory remapping data, and computing trajectory drift potential energy to obtain semantic trajectory drift potential energy data;

Step S44: identifying key semantic noise frames in the speech semantic trajectory data from the semantic trajectory drift potential energy data against a preset noise potential energy threshold, and performing acoustic feature inverse mapping to generate speech key noise frame data;

Step S45: performing noise frame repair on the framed speech signal data using the speech key noise frame data to generate noise-reduced speech signal data.

8. The speech signal noise reduction method according to claim 1, wherein step S5 comprises the following steps:

Step S51: performing time-frequency analysis on the noise-reduced speech signal data to obtain noise-reduced speech spectrum data, and performing frequency-domain conversion on the speech semantic vector field data to generate semantic spectrum field data;

Step S52: performing multi-fundamental-frequency spectrum decomposition on the noise-reduced speech spectrum data to generate multi-fundamental-frequency speech spectrum data;

Step S53: performing semantic harmonic association analysis on the multi-fundamental-frequency speech spectrum data using the semantic spectrum field data to generate semantic harmonic association matrix data;

Step S54: evaluating harmonic semantic confidence from the semantic harmonic association matrix data, and performing speech enhancement mask processing to generate speech enhancement guidance mask data;

Step S55: semantically enhancing the noise-reduced speech spectrum data using the speech enhancement guidance mask data, and performing auditory masking spectrum correction to generate corrected speech spectrum data;

Step S56: performing speech signal reconstruction from the corrected speech spectrum data to generate enhanced speech signal data.

9. A speech signal noise reduction system for executing the speech signal noise reduction method according to claim 1, the system comprising:

an adaptive framing module, configured to perform adaptive framing on the speech signal data to be processed to obtain framed speech signal data, and to perform fundamental-frequency voiceprint feature fusion on the framed speech signal data to generate initial voiceprint base data;

a speech semantic analysis module, configured to perform acoustic context enhancement on the initial voiceprint base data to generate acoustic feature vector data; to perform semantic embedding space processing on the framed speech signal data using the acoustic feature vector data to obtain semantic embedding space data; and to perform time-series semantic trajectory processing on the semantic embedding space data to obtain speech semantic trajectory data;

a speech noise recognition module, configured to construct a semantic vector field from the speech semantic trajectory data and the semantic embedding space data to generate speech semantic vector field data, and to mark speech noise regions in the speech semantic vector field data to generate semantic noise region data;

a speech noise repair module, configured to perform noise suppression on the speech semantic vector field data using the semantic noise region data to generate noise-reduced semantic vector field data; to identify key semantic noise frames in the speech semantic trajectory data using the noise-reduced semantic vector field data to generate speech key noise frame data; and to perform noise frame repair based on the speech key noise frame data to generate noise-reduced speech signal data;

a speech signal enhancement module, configured to perform multi-fundamental-frequency spectrum decomposition on the noise-reduced speech signal data to generate multi-fundamental-frequency speech spectrum data, to perform auditory masking spectrum correction based on the multi-fundamental-frequency speech spectrum data, and to perform speech signal reconstruction to generate enhanced speech signal data.

10. A speech signal noise reduction device, comprising:

a memory for storing a computer program; and

a processor configured to implement the speech signal noise reduction method according to any one of claims 1 to 8 when executing the computer program.
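The claims specify the pipeline only at a functional level. As a minimal illustrative sketch of the adaptive framing idea (not the patented method — the energy-based frame-length rule, the function name, and all thresholds are assumptions chosen for the example), variable-length framing could look like this:

```python
import numpy as np

def adaptive_frames(signal, sr=16000, min_len=0.010, max_len=0.030, energy_thresh=0.01):
    """Toy adaptive framing: high-energy (speech-like) regions get short
    frames for finer time resolution, quiet regions get long frames.
    The rule and thresholds are illustrative assumptions, not the patent's."""
    min_n, max_n = int(min_len * sr), int(max_len * sr)
    frames, pos = [], 0
    while pos < len(signal):
        # Probe a max-length window to decide how finely to slice here.
        probe = signal[pos:pos + max_n]
        rms = np.sqrt(np.mean(probe ** 2))
        step = min_n if rms > energy_thresh else max_n
        frames.append(signal[pos:pos + step])
        pos += step
    return frames
```

Because the frames are contiguous slices, concatenating them reproduces the input exactly, so any later per-frame repair can be written back losslessly.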
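Steps S43–S44 locate key noise frames by thresholding a "trajectory drift potential energy" over the semantic trajectory, but the patent does not disclose its functional form. The sketch below approximates drift as the squared distance of each frame's embedding from the midpoint of its two neighbours; the name, the formula, and the threshold semantics are all assumptions for illustration:

```python
import numpy as np

def key_noise_frames(embeddings, thresh):
    """Flag frames whose per-frame semantic embedding jumps off the local
    trajectory. 'Drift potential' is approximated as squared deviation from
    the neighbours' midpoint -- an invented stand-in for steps S43-S44."""
    e = np.asarray(embeddings, dtype=float)
    drift = np.zeros(len(e))
    # Interior frames only; first and last frames have no two-sided neighbourhood.
    drift[1:-1] = np.sum((e[1:-1] - 0.5 * (e[:-2] + e[2:])) ** 2, axis=1)
    return np.flatnonzero(drift > thresh)
```

A smooth trajectory yields near-zero drift everywhere, so only frames with abrupt semantic discontinuities (a plausible signature of noise bursts) exceed the threshold.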
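For the noise frame repair of step S45, one plausible reading (again an illustrative assumption, not the disclosed method) is to replace each flagged frame by interpolating between its nearest clean neighbours:

```python
import numpy as np

def repair_noise_frames(frames, noisy_idx):
    """Replace frames flagged as noise with a linear cross-fade between the
    nearest clean neighbours. Assumes equal-length frames and that the first
    and last frames are clean; a toy stand-in for the patent's repair step."""
    out = [f.copy() for f in frames]
    noisy = set(noisy_idx)
    for i in sorted(noisy):
        lo = max(j for j in range(i - 1, -1, -1) if j not in noisy)
        hi = min(j for j in range(i + 1, len(frames)) if j not in noisy)
        w = (i - lo) / (hi - lo)  # position of i between the clean anchors
        out[i] = (1 - w) * frames[lo] + w * frames[hi]
    return out
```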
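Steps S55–S56 apply a guidance mask to the spectrum and reconstruct the signal. The mechanics of mask-then-reconstruct can be sketched with a plain per-bin gain in the frequency domain; the semantic and auditory-masking logic that would actually produce the mask is abstracted into a given `mask` array here, and the function name is invented for the example:

```python
import numpy as np

def mask_enhance(frame, mask):
    """Apply a per-bin gain mask to a frame's real-input spectrum and return
    the time-domain frame. `mask` has len(frame)//2 + 1 entries, one per
    rfft bin; how the mask is derived is outside this sketch."""
    spec = np.fft.rfft(frame)
    return np.fft.irfft(spec * mask, n=len(frame))
```

With tones placed on exact FFT bins, zeroing the interfering bins removes the interference exactly while leaving the wanted component untouched, which is the core behaviour any mask-based enhancer relies on.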
CN202411195163.XA | Priority date: 2024-08-29 | Filing date: 2024-08-29 | Voice signal noise reduction method, system and equipment | Active | CN118865993B (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202411195163.XA | 2024-08-29 | 2024-08-29 | Voice signal noise reduction method, system and equipment

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN202411195163.XA | 2024-08-29 | 2024-08-29 | Voice signal noise reduction method, system and equipment

Publications (2)

Publication Number | Publication Date
CN118865993A (en) | 2024-10-29
CN118865993B (en) | 2025-02-14

Family

ID=93173780

Family Applications (1)

Application Number | Status | Publication | Priority Date | Filing Date | Title
CN202411195163.XA | Active | CN118865993B (en) | 2024-08-29 | 2024-08-29 | Voice signal noise reduction method, system and equipment

Country Status (1)

Country | Link
CN (1) | CN118865993B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN120032659A (en)* | 2025-04-17 | 2025-05-23 | 清枫(北京)科技有限公司 | Noise filtering method, device and medium based on deep learning to build noise model
CN120299466A (en)* | 2025-06-12 | 2025-07-11 | 北京生数科技有限公司 | Audio data processing method, device, equipment, storage medium and program product

Citations (7)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
WO2017197953A1 (en)* | 2016-05-16 | 2017-11-23 | 腾讯科技(深圳)有限公司 | Voiceprint-based identity recognition method and device
WO2019128140A1 (en)* | 2017-12-28 | 2019-07-04 | 科大讯飞股份有限公司 | Voice denoising method and apparatus, server and storage medium
US20220383876A1 (en)* | 2021-08-09 | 2022-12-01 | Beijing Baidu Netcom Science Technology Co., Ltd. | Method of converting speech, electronic device, and readable storage medium
WO2023116243A1 (en)* | 2021-12-20 | 2023-06-29 | 阿里巴巴达摩院(杭州)科技有限公司 | Data conversion method and computer storage medium
CN116665669A (en)* | 2023-07-19 | 2023-08-29 | 上海海启科技有限公司 | A voice interaction method and system based on artificial intelligence
EP4266306A1 (en)* | 2022-04-22 | 2023-10-25 | Papercup Technologies Limited | A speech processing system and a method of processing a speech signal
US20240005941A1 (en)* | 2022-05-30 | 2024-01-04 | Institute Of Automation, Chinese Academy Of Sciences | Target speaker separation system, device and storage medium


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
ZHANG Xiao; ZHANG Wei; WANG Wenhao; WAN Yongjing: "Speech conversion algorithm based on multi-spectral feature generative adversarial networks", Computer Engineering and Science, no. 05, 15 May 2020 (2020-05-15)*


Also Published As

Publication number | Publication date
CN118865993B (en) | 2025-02-14

Similar Documents

Publication | Title
Zhang et al. | Boosting contextual information for deep neural network based voice activity detection
Deshwal et al. | Feature extraction methods in language identification: a survey
CN118865993B (en) | Voice signal noise reduction method, system and equipment
Ochieng | Deep neural network techniques for monaural speech enhancement and separation: state of the art analysis
JP2021516369A (en) | Mixed speech recognition method, device and computer readable storage medium
Singh et al. | DeepF0: End-to-end fundamental frequency estimation for music and speech signals
CN108305616A (en) | An audio scene recognition method and device based on long- and short-term feature extraction
Li et al. | Audio anti-spoofing detection: A survey
WO2021189642A1 (en) | Method and device for signal processing, computer device, and storage medium
CN116665669A (en) | A voice interaction method and system based on artificial intelligence
TWI659410B (en) | Audio recognition method and device
Ahmed et al. | CNN-based speech segments endpoints detection framework using short-time signal energy features
Zhu et al. | A robust and lightweight voice activity detection algorithm for speech enhancement at low signal-to-noise ratio
CN117912452 (en) | Spoken English recognition method and system based on contrastive learning and mixed attention
Zhao et al. | A survey on automatic emotion recognition using audio big data and deep learning architectures
CN119380719A (en) | Audio to text conversion method and device, electronic device, and storage medium
Liu et al. | Harmonet: Partial deepfake detection network based on multi-scale harmof0 feature fusion
Asuai et al. | Hybrid CNN-LSTM Architectures for Deepfake Audio Detection Using Mel Frequency Cepstral Coefficients and Spectogram Analysis
CN118588074A (en) | A precise recognition method and system for voice control
Rituerto-González et al. | End-to-end recurrent denoising autoencoder embeddings for speaker identification
CN117457005 (en) | A voiceprint recognition method and device based on momentum contrastive learning
CN116959425A (en) | Robust tracing method and device for fake voice algorithms
Hajihashemi et al. | Novel sound event and sound activity detection framework based on intrinsic mode functions and deep learning
Zhang et al. | Densely-connected convolutional recurrent network for fundamental frequency estimation in noisy speech
Kesavaraj et al. | End-to-end user-defined keyword spotting using shifted delta coefficients

Legal Events

Code | Title
PB01 | Publication
SE01 | Entry into force of request for substantive examination
GR01 | Patent grant
