技术领域Technical Field
本发明涉及数据分析技术领域,尤其涉及一种非经营性交互式直播数据智能分析系统。The present invention relates to the technical field of data analysis, and in particular to a non-commercial interactive live broadcast data intelligent analysis system.
背景技术Background Art
随着互联网技术的快速发展,交互式直播已成为一种流行的信息传播和社交方式。然而,现有的直播数据分析技术往往局限于表面的数据收集,如观众数量、弹幕量,缺乏对直播内容深度分析的能力。这导致了直播内容质量的评估和直播者表现的深入理解存在不足,难以满足用户对高质量直播内容的需求。With the rapid development of Internet technology, interactive live broadcasting has become a popular way of information dissemination and social interaction. However, existing live broadcast data analysis technologies are often limited to superficial data collection, such as the number of viewers and the amount of barrage, and lack the ability to conduct in-depth analysis of live broadcast content. This leads to deficiencies in the evaluation of live broadcast content quality and the in-depth understanding of the live broadcaster's performance, making it difficult to meet users' demand for high-quality live broadcast content.
进一步地,现有的直播数据分析系统大多侧重于经营性指标,如收益、广告效果,而非经营性指标,如直播者的情绪表达、语言的专业性、条理性和互动性指标,对于提升观众的观看体验和直播的社交互动具有显著影响。然而,这些指标往往因为缺乏量化和系统化的分析方法而被现有系统所忽视。Furthermore, most existing live broadcast data analysis systems focus on commercial indicators such as revenue and advertising effectiveness, whereas non-commercial indicators, such as the live broadcaster's emotional expression and the professionalism, coherence and interactivity of the language, have a significant impact on improving the audience's viewing experience and the social interaction of the live broadcast. However, these indicators are often ignored by existing systems due to the lack of quantitative and systematic analysis methods.
此外,直播过程中的实时互动,尤其是弹幕的交互,是衡量直播活跃度的重要指标。现有技术在处理弹幕数据时,往往只关注数量而忽略质量,无法有效分析弹幕的峰值变化、波动比以及与直播者的互动情况,从而无法全面评估直播的互动质量和观众的参与度。In addition, real-time interaction during live broadcast, especially the interaction of bullet comments, is an important indicator for measuring live broadcast activity. When processing bullet comment data, existing technologies often only focus on quantity and ignore quality, and cannot effectively analyze the peak changes, fluctuation ratios, and interaction with the live broadcaster of the bullet comment, thus failing to fully evaluate the interactive quality of the live broadcast and the audience's participation.
最后,直播类型多样,如氛围型直播、复合型直播与专业型直播,而现有的直播分析系统往往采用单一的分析模型,无法适应不同类型的直播内容。Finally, live broadcasts come in various types, such as atmosphere-oriented live broadcasts, composite live broadcasts and professional live broadcasts, yet existing live broadcast analysis systems often adopt a single analysis model and cannot adapt to different types of live broadcast content.
发明内容Summary of the Invention
基于此,本发明有必要提供一种非经营性交互式直播数据智能分析系统,以解决至少一个上述技术问题。Based on this, it is necessary for the present invention to provide a non-commercial interactive live broadcast data intelligent analysis system to solve at least one of the above-mentioned technical problems.
为实现上述目的,一种非经营性交互式直播数据智能分析系统,包括以下模块:To achieve the above purpose, a non-commercial interactive live broadcast data intelligent analysis system includes the following modules:
数据采集模块S1,用于对直播间进行实时多源数据采集,得到直播数据集,其中直播数据集包括直播者音频数据、弹幕交互数据、直播间视频数据;根据直播者音频数据对直播者进行语言属性分布计算,得到语言属性分布数据;The data collection module S1 is used to collect real-time multi-source data from the live broadcast room to obtain a live broadcast data set, where the live broadcast data set includes the live broadcaster's audio data, barrage interaction data, and live broadcast room video data; the language attribute distribution of the live broadcaster is calculated based on the live broadcaster's audio data to obtain language attribute distribution data;
音域语速分析模块S2,用于基于直播者音频数据对直播者进行音域与语速识别,得到直播者音域数据与直播者语速数据;The voice range and speech speed analysis module S2 is used to identify the voice range and speech speed of the live broadcaster based on the live broadcaster's audio data, and obtain the voice range data and speech speed data of the live broadcaster;
情绪分析模块S3,用于基于直播间视频数据对直播者进行情绪占比分析,得到直播者情绪占比数据;基于直播间视频数据对直播者进行情绪丰富度计算,得到情绪丰富度分布数据;The emotion analysis module S3 is used to analyze the emotion proportion of the live broadcaster based on the video data of the live broadcast room to obtain the emotion proportion data of the live broadcaster; calculate the emotion richness of the live broadcaster based on the video data of the live broadcast room to obtain the emotion richness distribution data;
弹幕分析模块S4,用于根据弹幕交互数据对直播间进行弹幕量峰值变化分析,得到弹幕峰值跳变频率数据;根据弹幕交互数据对直播间进行弹幕峰值波动比计算,得到弹幕峰值波动比分布数据;基于直播者音频数据与弹幕交互数据对直播间进行弹幕互动分析,得到弹幕互动回复率数据;The bullet screen analysis module S4 is used to analyze the peak value change of the bullet screen volume in the live broadcast room according to the bullet screen interaction data to obtain the bullet screen peak jump frequency data; calculate the bullet screen peak fluctuation ratio of the live broadcast room according to the bullet screen interaction data to obtain the bullet screen peak fluctuation ratio distribution data; analyze the bullet screen interaction of the live broadcast room based on the live broadcaster audio data and the bullet screen interaction data to obtain the bullet screen interaction reply rate data;
直播类型判定模块S5,用于根据直播者音频数据与直播间视频数据对直播间进行直播内容识别,得到直播内容类型数据集;根据语言属性分布数据、直播者音域数据、直播者语速数据、直播者情绪占比数据、情绪丰富度分布数据、弹幕峰值跳变频率数据、弹幕峰值波动比分布数据、弹幕互动回复率数据与直播内容类型数据集对直播间进行直播类型判定,得到直播类型数据。The live broadcast type determination module S5 is used to identify the live broadcast content of the live broadcast room according to the audio data of the live broadcaster and the video data of the live broadcast room to obtain a live broadcast content type data set; determine the live broadcast type of the live broadcast room according to the language attribute distribution data, the vocal range data of the live broadcaster, the speech speed data of the live broadcaster, the proportion data of the live broadcaster's emotions, the distribution data of the richness of emotions, the peak jump frequency data of the barrage, the peak fluctuation ratio distribution data of the barrage, the interactive reply rate data of the barrage and the live content type data set to obtain the live broadcast type data.
本发明中,通过实时多源数据采集模块,能够全面获取直播者的音频数据、弹幕交互数据和直播间视频数据,确保数据的多维度性和丰富性,克服了数据单一、片面的缺点。通过语言属性分布计算和音域语速分析模块,能够精确识别直播者的音域和语速数据,帮助深入了解直播者的语言表达特征,从而评估直播内容的条理性和专业性,提高了语言分析的准确性和细致程度。通过情绪分析模块利用视频数据对直播者的情绪占比和情绪丰富度进行分析,能够实时监测和量化直播者的情绪状态,提供情绪变化的直观数据支持,为直播内容质量评估提供重要依据,弥补了现有技术在情绪表达分析上的空白。通过弹幕分析模块对弹幕量峰值变化、弹幕峰值波动比和弹幕互动回复率的分析,能够深入了解观众互动情况和弹幕质量,帮助评估直播间的互动活跃度和观众参与度,全面反映弹幕的互动质量,克服了只关注数量而忽视质量的局限。通过直播类型判定模块结合多维度数据进行直播内容识别和类型判定,确保能够适应不同类型的直播内容,提供针对性的数据分析和评估,提升直播分析的适应性和精确性。综上所述,本发明通过对直播过程中的实时数据进行分析,确保对直播互动和内容变化的动态监测与即时反馈,提高了数据分析的时效性和准确性;还通过融合直播者音频、视频数据和弹幕数据,从多角度对直播内容进行综合评估,克服了现有技术中数据分析片面、指标单一的缺陷,提供更为全面和深入的直播数据分析结果。In the present invention, through the real-time multi-source data acquisition module, the audio data, bullet screen interaction data and live broadcast room video data of the live broadcaster can be fully obtained, ensuring the multidimensionality and richness of the data, overcoming the shortcomings of single and one-sided data. Through the language attribute distribution calculation and range and speed analysis module, the range and speed data of the live broadcaster can be accurately identified, helping to deeply understand the language expression characteristics of the live broadcaster, thereby evaluating the orderliness and professionalism of the live broadcast content, and improving the accuracy and meticulousness of language analysis. Through the emotion analysis module, the emotional proportion and emotional richness of the live broadcaster are analyzed using video data, and the emotional state of the live broadcaster can be monitored and quantified in real time, providing intuitive data support for emotional changes, providing an important basis for the quality evaluation of live broadcast content, and filling the gap in the existing technology in emotional expression analysis. Through the analysis of the peak change of bullet screen volume, the peak fluctuation ratio of bullet screen and the interactive reply rate of bullet screen by the bullet screen analysis module, the audience interaction situation and the quality of bullet screen can be deeply understood, helping to evaluate the interactive activity and audience participation of the live broadcast room, and comprehensively reflecting the interactive quality of bullet screen, overcoming the limitation of only focusing on quantity and ignoring quality. The live broadcast type determination module combines multi-dimensional data to identify and determine the live broadcast content, ensuring that it can adapt to different types of live broadcast content, providing targeted data analysis and evaluation, and improving the adaptability and accuracy of live broadcast analysis. In summary, the present invention analyzes real-time data during the live broadcast process to ensure dynamic monitoring and immediate feedback of live broadcast interactions and content changes, thereby improving the timeliness and accuracy of data analysis; it also integrates the live broadcaster's audio, video data, and barrage data to comprehensively evaluate the live broadcast content from multiple angles, thereby overcoming the defects of one-sided data analysis and single indicators in the prior art, and providing more comprehensive and in-depth live broadcast data analysis results.
附图说明BRIEF DESCRIPTION OF THE DRAWINGS
通过阅读参照以下附图对非限制性实施例所作的详细描述,本发明的其它特征、目的和优点将会变得更明显:Other features, objects and advantages of the present invention will become more apparent upon reading the detailed description of non-limiting embodiments made with reference to the following drawings:
图1示出了一实施例的非经营性交互式直播数据智能分析系统的模块流程示意图。FIG1 shows a module flow diagram of a non-commercial interactive live broadcast data intelligent analysis system according to an embodiment.
图2示出了一实施例的S3的详细步骤流程示意图。FIG. 2 is a schematic diagram showing a detailed step flow chart of S3 in an embodiment.
图3示出了一实施例的S37的详细步骤流程示意图。FIG. 3 is a schematic diagram showing a detailed step flow chart of S37 according to an embodiment.
具体实施方式DETAILED DESCRIPTION
下面结合附图对本发明的技术方法进行清楚、完整的描述,显然,所描述的实施例是本发明的一部分实施例,而不是全部的实施例。基于本发明中的实施例,本领域所属的技术人员在没有做出创造性劳动前提下所获得的所有其他实施例,都属于本发明保护的范围。The technical method of the present invention is described clearly and completely below in conjunction with the accompanying drawings. Obviously, the described embodiments are only some, rather than all, of the embodiments of the present invention. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort fall within the scope of protection of the present invention.
此外,附图仅为本发明的示意性图解,并非一定是按比例绘制。图中相同的附图标记表示相同或类似的部分,因而将省略对它们的重复描述。附图中所示的一些方框图是功能实体,不一定必须与物理或逻辑上独立的实体相对应。可以采用软件形式来实现功能实体,或在一个或多个硬件模块或集成电路中实现这些功能实体,或在不同网络和/或处理器方法和/或微控制器方法中实现这些功能实体。In addition, the accompanying drawings are only schematic illustrations of the present invention and are not necessarily drawn to scale. The same reference numerals in the figures represent the same or similar parts, and their repeated description will be omitted. Some of the block diagrams shown in the accompanying drawings are functional entities and do not necessarily correspond to physically or logically independent entities. The functional entities can be implemented in software form, or implemented in one or more hardware modules or integrated circuits, or implemented in different networks and/or processor methods and/or microcontroller methods.
应当理解的是,虽然在这里可能使用了术语“第一”、“第二”等等来描述各个单元,但是这些单元不应当受这些术语限制。使用这些术语仅仅是为了将一个单元与另一个单元进行区分。举例来说,在不背离示例性实施例的范围的情况下,第一单元可以被称为第二单元,并且类似地第二单元可以被称为第一单元。这里所使用的术语“和/或”包括其中一个或更多所列出的相关联项目的任意和所有组合。It should be understood that, although the terms "first", "second", etc. may be used herein to describe various units, these units should not be limited by these terms. These terms are used only to distinguish one unit from another unit. For example, without departing from the scope of the exemplary embodiments, the first unit may be referred to as the second unit, and similarly the second unit may be referred to as the first unit. The term "and/or" used herein includes any and all combinations of one or more of the listed associated items.
为实现上述目的,请参阅图1至图3,本发明提供了一种非经营性交互式直播数据智能分析系统,包括以下模块:To achieve the above purpose, please refer to Figures 1 to 3. The present invention provides a non-commercial interactive live broadcast data intelligent analysis system, including the following modules:
数据采集模块S1,用于对直播间进行实时多源数据采集,得到直播数据集,其中直播数据集包括直播者音频数据、弹幕交互数据、直播间视频数据;根据直播者音频数据对直播者进行语言属性分布计算,得到语言属性分布数据;The data collection module S1 is used to collect real-time multi-source data from the live broadcast room to obtain a live broadcast data set, where the live broadcast data set includes the live broadcaster's audio data, barrage interaction data, and live broadcast room video data; the language attribute distribution of the live broadcaster is calculated based on the live broadcaster's audio data to obtain language attribute distribution data;
音域语速分析模块S2,用于基于直播者音频数据对直播者进行音域与语速识别,得到直播者音域数据与直播者语速数据;The voice range and speech speed analysis module S2 is used to identify the voice range and speech speed of the live broadcaster based on the live broadcaster's audio data, and obtain the voice range data and speech speed data of the live broadcaster;
情绪分析模块S3,用于基于直播间视频数据对直播者进行情绪占比分析,得到直播者情绪占比数据;基于直播间视频数据对直播者进行情绪丰富度计算,得到情绪丰富度分布数据;The emotion analysis module S3 is used to analyze the emotion proportion of the live broadcaster based on the video data of the live broadcast room to obtain the emotion proportion data of the live broadcaster; calculate the emotion richness of the live broadcaster based on the video data of the live broadcast room to obtain the emotion richness distribution data;
弹幕分析模块S4,用于根据弹幕交互数据对直播间进行弹幕量峰值变化分析,得到弹幕峰值跳变频率数据;根据弹幕交互数据对直播间进行弹幕峰值波动比计算,得到弹幕峰值波动比分布数据;基于直播者音频数据与弹幕交互数据对直播间进行弹幕互动分析,得到弹幕互动回复率数据;The bullet screen analysis module S4 is used to analyze the peak value change of the bullet screen volume in the live broadcast room according to the bullet screen interaction data to obtain the bullet screen peak jump frequency data; calculate the bullet screen peak fluctuation ratio of the live broadcast room according to the bullet screen interaction data to obtain the bullet screen peak fluctuation ratio distribution data; analyze the bullet screen interaction of the live broadcast room based on the live broadcaster audio data and the bullet screen interaction data to obtain the bullet screen interaction reply rate data;
直播类型判定模块S5,用于根据直播者音频数据与直播间视频数据对直播间进行直播内容识别,得到直播内容类型数据集;根据语言属性分布数据、直播者音域数据、直播者语速数据、直播者情绪占比数据、情绪丰富度分布数据、弹幕峰值跳变频率数据、弹幕峰值波动比分布数据、弹幕互动回复率数据与直播内容类型数据集对直播间进行直播类型判定,得到直播类型数据。The live broadcast type determination module S5 is used to identify the live broadcast content of the live broadcast room according to the audio data of the live broadcaster and the video data of the live broadcast room to obtain a live broadcast content type data set; determine the live broadcast type of the live broadcast room according to the language attribute distribution data, the vocal range data of the live broadcaster, the speech speed data of the live broadcaster, the proportion data of the live broadcaster's emotions, the distribution data of the richness of emotions, the peak jump frequency data of the barrage, the peak fluctuation ratio distribution data of the barrage, the interactive reply rate data of the barrage and the live content type data set to obtain the live broadcast type data.
本发明中,通过实时多源数据采集模块,能够全面获取直播者的音频数据、弹幕交互数据和直播间视频数据,确保数据的多维度性和丰富性,克服了数据单一、片面的缺点。通过语言属性分布计算和音域语速分析模块,能够精确识别直播者的音域和语速数据,帮助深入了解直播者的语言表达特征,从而评估直播内容的条理性和专业性,提高了语言分析的准确性和细致程度。通过情绪分析模块利用视频数据对直播者的情绪占比和情绪丰富度进行分析,能够实时监测和量化直播者的情绪状态,提供情绪变化的直观数据支持,为直播内容质量评估提供重要依据,弥补了现有技术在情绪表达分析上的空白。通过弹幕分析模块对弹幕量峰值变化、弹幕峰值波动比和弹幕互动回复率的分析,能够深入了解观众互动情况和弹幕质量,帮助评估直播间的互动活跃度和观众参与度,全面反映弹幕的互动质量,克服了只关注数量而忽视质量的局限。通过直播类型判定模块结合多维度数据进行直播内容识别和类型判定,确保系统能够适应不同类型的直播内容,提供针对性的数据分析和评估,提升直播分析的适应性和精确性。综上所述,本发明通过对直播过程中的实时数据进行分析,确保对直播互动和内容变化的动态监测与即时反馈,提高了数据分析的时效性和准确性;还通过融合直播者音频、视频数据和弹幕数据,从多角度对直播内容进行综合评估,克服了现有技术中数据分析片面、指标单一的缺陷,提供更为全面和深入的直播数据分析结果。In the present invention, through the real-time multi-source data acquisition module, the audio data, bullet screen interaction data and live broadcast room video data of the live broadcaster can be fully obtained, ensuring the multidimensionality and richness of the data, overcoming the shortcomings of single and one-sided data. Through the language attribute distribution calculation and range and speed analysis module, the range and speed data of the live broadcaster can be accurately identified, helping to deeply understand the language expression characteristics of the live broadcaster, thereby evaluating the orderliness and professionalism of the live broadcast content, and improving the accuracy and meticulousness of language analysis. Through the emotion analysis module, the emotional proportion and emotional richness of the live broadcaster are analyzed using video data, and the emotional state of the live broadcaster can be monitored and quantified in real time, providing intuitive data support for emotional changes, providing an important basis for the quality evaluation of live broadcast content, and filling the gap in the existing technology in emotional expression analysis. Through the analysis of the peak change of bullet screen volume, the peak fluctuation ratio of bullet screen and the interactive reply rate of bullet screen by the bullet screen analysis module, the audience interaction situation and the quality of bullet screen can be deeply understood, helping to evaluate the interactive activity and audience participation of the live broadcast room, and comprehensively reflecting the interactive quality of bullet screen, overcoming the limitation of only focusing on quantity and ignoring quality. The live broadcast type determination module combines multi-dimensional data to identify and determine the live broadcast content, ensuring that the system can adapt to different types of live broadcast content, providing targeted data analysis and evaluation, and improving the adaptability and accuracy of live broadcast analysis. In summary, the present invention analyzes real-time data during the live broadcast process to ensure dynamic monitoring and immediate feedback of live broadcast interactions and content changes, thereby improving the timeliness and accuracy of data analysis; it also integrates the live broadcaster's audio, video data, and barrage data to comprehensively evaluate the live broadcast content from multiple angles, thereby overcoming the defects of one-sided data analysis and single indicators in the prior art, and providing more comprehensive and in-depth live broadcast data analysis results.
本实施例中,通过利用专用的API接口和实时流处理技术(如Kafka或Flink),从直播平台提取音频、视频和弹幕数据。然后,基于直播者音频数据,采用自然语言处理(NLP)技术(如Google Cloud Speech-to-Text),对直播者的语言进行分析,识别话术类型和频率,计算并得到语言属性分布数据。使用音频信号处理工具(如Librosa或Praat)对直播者音频数据进行分析,通过频谱分析和语速检测,识别直播者的音域和语速,分别得到直播者音域数据和直播者语速数据。例如,通过计算音频信号的平均频率范围可以确定音域,通过检测每分钟的词数可以确定语速。使用计算机视觉和面部表情识别模型技术,检测并分析直播者的面部表情,计算情绪占比,得到直播者情绪占比数据。同时,通过时间序列分析,评估直播者在不同时间段的情绪变化频率,得到情绪丰富度分布数据。例如,识别出直播者在每分钟内表现的不同情绪次数,以评估其情绪丰富度。使用数据流处理技术和峰值检测算法,计算每分钟的弹幕数量变化,得到弹幕峰值跳变频率数据。接着,通过分析弹幕数量的波动情况,计算弹幕峰值波动比分布数据。对于弹幕互动回复率,通过对比直播者音频数据与弹幕交互数据,识别直播者对弹幕问题的回复情况,计算得到及时回复率。最后,利用图像识别和文本分析技术,识别直播中的内容类型,例如商品展示、讲解类内容,并生成直播内容类型数据集。然后,将语言属性分布数据、直播者音域数据、直播者语速数据、直播者情绪占比数据、情绪丰富度分布数据、弹幕峰值跳变频率数据、弹幕峰值波动比分布数据、弹幕互动回复率数据与直播内容类型数据集进行对比分析。通过预设的直播类型模板库(如平和型直播、互动型直播),对直播间进行直播类型判定,最终得到直播类型数据。In this embodiment, audio, video and bullet screen data are extracted from the live broadcast platform by using a dedicated API interface and real-time stream processing technology (such as Kafka or Flink). Then, based on the live broadcaster's audio data, natural language processing (NLP) technology (such as Google Cloud Speech-to-Text) is used to analyze the live broadcaster's language, identify the type and frequency of speech, and calculate and obtain language attribute distribution data. Use an audio signal processing tool (such as Librosa or Praat) to analyze the live broadcaster's audio data, identify the live broadcaster's range and speech rate through spectrum analysis and speech rate detection, and obtain the live broadcaster's range data and live broadcaster's speech rate data respectively. For example, the range can be determined by calculating the average frequency range of the audio signal, and the speech rate can be determined by detecting the number of words per minute. Computer vision and facial expression recognition model technology are used to detect and analyze the live broadcaster's facial expressions, calculate the emotional proportion, and obtain the live broadcaster's emotional proportion data. At the same time, through time series analysis, the frequency of emotional changes of the live broadcaster in different time periods is evaluated to obtain emotional richness distribution data. For example, the number of different emotions expressed by the live broadcaster per minute is identified to evaluate its emotional richness. Using data stream processing technology and peak detection algorithm, the change in the number of bullet screens per minute is calculated to obtain the bullet screen peak jump frequency data. Then, by analyzing the fluctuation of the number of bullet screens, the bullet screen peak fluctuation ratio distribution data is calculated. For the bullet screen interactive reply rate, by comparing the live broadcaster's audio data with the bullet screen interaction data, the live broadcaster's response to the bullet screen questions is identified, and the timely reply rate is calculated. Finally, using image recognition and text analysis technology, the content type in the live broadcast, such as product display and explanation content, is identified, and a live broadcast content type data set is generated. Then, the language attribute distribution data, the live broadcaster's voice range data, the live broadcaster's speech speed data, the live broadcaster's emotional proportion data, the emotional richness distribution data, the bullet screen peak jump frequency data, the bullet screen peak fluctuation ratio distribution data, and the bullet screen interactive reply rate data are compared and analyzed with the live broadcast content type data set. Through the preset live broadcast type template library (such as peaceful live broadcast and interactive live broadcast), the live broadcast type of the live broadcast room is determined, and the live broadcast type data is finally obtained.
优选地,数据采集模块S1包括:Preferably, the data acquisition module S1 includes:
S11:对直播间进行实时多源数据采集,得到直播数据集,其中直播数据集包括直播者音频数据、弹幕交互数据、直播间视频数据;S11: Real-time multi-source data collection is performed on the live broadcast room to obtain a live broadcast data set, where the live broadcast data set includes live broadcaster audio data, bullet screen interaction data, and live broadcast room video data;
具体地,通过高质量的麦克风和音频采集卡采集直播者的音频数据。使用OBS(Open Broadcaster Software)这样的直播软件来处理音频输入,并录制为音频文件(如WAV或MP3格式),最终得到直播者音频数据。通过弹幕捕获工具或API(例如Bilibili的弹幕API)实时抓取直播间的弹幕数据,这些API可以提供弹幕内容、发送时间和发送用户的信息,将这些数据记录到数据库或JSON文件中,即得到弹幕交互数据。接着,使用视频采集卡将直播视频信号输入计算机,同样使用直播软件(如OBS)实时录制视频数据。OBS软件可以将直播视频保存为多种格式(如MP4、FLV),从而得到直播间视频数据。最终,所有这些数据(音频、弹幕、视频)被同步存储在一个结构化的直播数据集中(例如HDF5文件或关系数据库)。Specifically, the audio data of the live broadcaster is collected through a high-quality microphone and an audio capture card. Live broadcast software such as OBS (Open Broadcaster Software) is used to process the audio input and record it as an audio file (such as WAV or MP3 format), and finally the audio data of the live broadcaster is obtained. The live broadcast room's barrage data is captured in real time through barrage capture tools or APIs (such as Bilibili's barrage API). These APIs can provide barrage content, sending time, and information about the sending user. These data are recorded in a database or JSON file to obtain barrage interaction data. Next, a video capture card is used to input the live video signal into a computer, and live broadcast software (such as OBS) is also used to record video data in real time. OBS software can save live videos in multiple formats (such as MP4, FLV), thereby obtaining live broadcast room video data. Finally, all these data (audio, barrage, video) are synchronously stored in a structured live broadcast data set (such as an HDF5 file or a relational database).
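作为示意,下面给出一个把弹幕记录(内容、发送用户、接收时间)追加写入JSON Lines文件的极简Python片段;字段名称均为假设,实际字段以所用平台API返回的数据为准。As an illustration only, here is a minimal Python snippet that appends bullet-comment records (content, sender, receive time) to a JSON Lines file; the field names are assumptions and the actual fields depend on what the platform API returns.

```python
import json
import time

def append_danmaku_record(path, content, user_id):
    """Append one bullet-comment record to a JSON Lines file.
    The field names (content, user_id, timestamp) are illustrative only."""
    record = {
        "content": content,        # bullet-comment text
        "user_id": user_id,        # sender identifier
        "timestamp": time.time(),  # receive time in seconds
    }
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")

# Example: append_danmaku_record("danmaku.jsonl", "主播好!", "user_123")
```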
S12:对直播者音频数据进行背景杂音去除,得到净人声音频数据;S12: removing background noise from the live broadcaster's audio data to obtain clean human voice audio data;
具体地,本实施例的具体处理流程请参考S12的子步骤。Specifically, please refer to sub-step S12 for the specific processing flow of this embodiment.
S13:对净人声音频数据进行自动语音识别,得到直播者文本转录数据;S13: Performing automatic speech recognition on the clean human voice audio data to obtain text transcription data of the live broadcaster;
具体地,使用Google Cloud Speech-to-Text,将净人声音频数据上传到Google Cloud Storage,然后通过API调用,将音频流传输给服务进行实时语音识别。识别结果将以文本形式返回,最终得到直播者文本转录数据。Specifically, Google Cloud Speech-to-Text is used: the clean human voice audio data is uploaded to Google Cloud Storage, and the audio stream is then transmitted to the service through API calls for real-time speech recognition. The recognition results are returned in text form, finally yielding the live broadcaster's text transcription data.
S14:对直播者文本转录数据进行语义无关词语剔除,得到直播者有效文本数据;S14: removing semantically irrelevant words from the live broadcaster's text transcription data to obtain valid text data of the live broadcaster;
具体地,通过自然语言处理技术,识别并列出直播者文本转录数据中常见的停用词,如连接词、介词、冠词,这些词语在文本分析中通常不携带重要信息。接着,利用文本编辑工具或专业的文本分析软件,对转录文本进行扫描,将这些停用词以及无意义的重复词汇、标点符号进行标记和删除。例如,如果直播者文本转录数据中出现大量的填充词(如“嗯”、“啊”),或者出现重复的词语如“非常非常”,则通过查找和替换功能,将这些词语从文本中移除。此外,可以通过设置规则来识别和删除超出一定频率的常见词,最终得到直播者有效文本数据。Specifically, through natural language processing technology, common stop words in the live broadcaster's text transcription data, such as conjunctions, prepositions, and articles, are identified and listed. These words usually do not carry important information in text analysis. Then, use text editing tools or professional text analysis software to scan the transcribed text, mark and delete these stop words and meaningless repeated words and punctuation marks. For example, if a large number of filler words (such as "um" and "ah") appear in the live broadcaster's text transcription data, or repeated words such as "very, very" appear, these words are removed from the text through the find and replace function. In addition, rules can be set to identify and delete common words that exceed a certain frequency, and finally obtain the live broadcaster's valid text data.
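下面给出一个仅作示意的Python片段,演示如何从转录文本中剔除填充词、停用词与相邻重复词;其中的词表均为假设示例,实际应用中应替换为完整词表。The following Python snippet is for illustration only, showing how filler words, stop words and adjacent duplicated words might be stripped from the transcript; the word lists are hypothetical examples and a full lexicon would be used in practice.

```python
import re

# Hypothetical, abbreviated lexicons; a real system would use full stop-word lists.
FILLERS = {"嗯", "啊", "呃", "那个"}
STOPWORDS = {"的", "了", "在", "和", "与"}

def clean_transcript(text: str) -> str:
    """Remove filler words, listed stop words and immediate word repetitions."""
    # Collapse immediate duplications such as "非常非常" -> "非常"
    text = re.sub(r"(\S{1,4})\1+", r"\1", text)
    # Simplified tokenisation on word/CJK runs; a segmenter such as jieba could be used instead
    tokens = re.findall(r"[\w\u4e00-\u9fff]+", text)
    kept = [t for t in tokens if t not in FILLERS and t not in STOPWORDS]
    return " ".join(kept)

print(clean_transcript("嗯 这个 产品 非常非常 好 的"))  # -> "这个 产品 非常 好"
```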
S15:根据直播者有效文本数据对直播者进行语言属性分布计算,得到语言属性分布数据。S15: Calculate the language attribute distribution of the live broadcaster according to the live broadcaster's valid text data to obtain language attribute distribution data.
具体地,本实施例的具体处理流程请参考S15的子步骤。Specifically, please refer to sub-step S15 for the specific processing flow of this embodiment.
本发明通过实时多源数据采集,确保了直播数据的全面性和时效性,为后续分析提供了充足而准确的数据基础,有助于全面了解直播情况。通过背景杂音去除能够提高音频数据的质量,有利于提取出直播者的清晰语音信息,提高了音频数据的可用性。通过自动语音识别,将直播者的语音转换成文本数据,能够提高语音信息的可读性和可分析性。通过剔除语义无关词语可以减少数据的干扰,使得直播者的有效文本数据更具有代表性和重要性,有助于提高语言属性分布数据的准确性和可信度。通过对直播者有效文本数据进行语言属性分布计算,能够深入了解直播者的语言特征和表达方式,为后续的语言分析和内容评估提供重要依据,提高了对直播内容的深度理解和评估能力。The present invention ensures the comprehensiveness and timeliness of live broadcast data through real-time multi-source data collection, provides a sufficient and accurate data basis for subsequent analysis, and helps to fully understand the live broadcast situation. The quality of audio data can be improved by removing background noise, which is conducive to extracting clear voice information of the live broadcaster and improving the availability of audio data. By automatic speech recognition, the voice of the live broadcaster is converted into text data, which can improve the readability and analyzability of the voice information. By eliminating semantically irrelevant words, the interference of the data can be reduced, making the effective text data of the live broadcaster more representative and important, which helps to improve the accuracy and credibility of the language attribute distribution data. By calculating the language attribute distribution of the effective text data of the live broadcaster, the language characteristics and expressions of the live broadcaster can be deeply understood, which provides an important basis for subsequent language analysis and content evaluation, and improves the in-depth understanding and evaluation capabilities of the live broadcast content.
优选地,S12包括:Preferably, S12 includes:
S121:对直播者音频数据进行背景音乐消除,得到无背景乐音频数据;S121: removing the background music from the live broadcaster's audio data to obtain audio data without background music;
具体地,将直播者音频数据文件导入音频编辑软件(如Adobe Audition或Audacity)中。接着,利用软件的“效果”菜单中的“降噪”功能,选择背景音乐部分作为噪声样本。软件将分析所选部分的音频特性,并尝试识别和消除整个音频轨道中的相似声音模式。此外,使用“频谱编辑”工具,通过视觉识别背景音乐所在的频率范围,并尝试衰减或移除这些频率,从而实现背景音乐的消除,最终得到无背景乐音频数据。Specifically, the live broadcaster audio data file is imported into audio editing software (such as Adobe Audition or Audacity). Next, the "Noise Reduction" function in the "Effects" menu of the software is used to select the background music part as the noise sample. The software will analyze the audio characteristics of the selected part and try to identify and eliminate similar sound patterns in the entire audio track. In addition, the "Spectrum Editing" tool is used to visually identify the frequency range where the background music is located, and try to attenuate or remove these frequencies, thereby eliminating the background music and finally obtaining audio data without background music.
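作为上述图形界面流程的一种程序化替代,下面给出一个假设使用librosa与noisereduce包的示意片段;它只做谱门限式降噪,是对背景音乐消除步骤的简化替身,并非专业音频软件功能的等价实现。As a programmatic alternative to the GUI workflow above, here is an illustrative snippet assuming the librosa and noisereduce packages; it performs spectral-gating noise suppression and is only a simplified stand-in for the background-music removal step, not an equivalent of the audio-editor features described.

```python
import librosa
import noisereduce as nr
import soundfile as sf

# Load the broadcaster audio (file path is illustrative).
y, sr = librosa.load("broadcaster_audio.wav", sr=None)

# Spectral-gating noise reduction; treats steady background sound as noise.
y_clean = nr.reduce_noise(y=y, sr=sr)

sf.write("broadcaster_no_bgm.wav", y_clean, sr)
```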
S122:对无背景乐音频数据进行扬声器计数,得到扬声器计数结果数据;S122: Perform speaker counting on the audio data without background music to obtain speaker counting result data;
具体地,将无背景乐音频数据导入Praat。然后,使用软件的“声音”菜单中的“声音分析”功能,选择“声音信号处理”选项,对音频进行频谱分析。通过观察频谱图,可以识别出不同频率成分的分布情况。接着,使用“声源分离”工具,根据音频中的声音特性,如音高、音量和音色,来区分和计数音频中的独立声源,从而得到扬声器计数结果数据。Specifically, the audio data without background music was imported into Praat. Then, the "Sound Analysis" function in the "Sound" menu of the software was used, and the "Sound Signal Processing" option was selected to perform spectrum analysis on the audio. By observing the spectrum graph, the distribution of different frequency components can be identified. Next, the "Sound Source Separation" tool was used to distinguish and count independent sound sources in the audio based on the sound characteristics in the audio, such as pitch, volume, and timbre, to obtain the speaker counting result data.
S123:若扬声器计数结果数据显示单人扬声,则将无背景乐音频数据作为主播语音数据,并执行S126;S123: If the speaker counting result data indicates that only a single person is speaking, the audio data without background music is used as the anchor voice data, and S126 is executed;
具体地,如果确认直播者音频中只有单一的发言者,则直接将无背景乐音频数据作为主播语音数据,并执行S126。Specifically, if it is confirmed that there is only a single speaker in the live broadcast audio, the audio data without background music is directly used as the host voice data, and S126 is executed.
S124:若扬声器计数结果数据显示多人扬声,则对无背景乐音频数据进行语音分离,得到单声道语音数据集,其中单声道语音数据集包含若干个单声道语音数据;S124: If the speaker counting result data indicates that multiple people are speaking, voice separation is performed on the audio data without background music to obtain a mono voice data set, wherein the mono voice data set includes a plurality of mono voice data;
具体地,如果扬声器计数结果数据显示存在多人扬声,则对无背景乐的音频数据进行语音分离,将无背景乐音频数据导入处理软件(如Spleeter)中。使用Spleeter自动分析音频并识别出不同声源,处理完成后,将得到一个包含多个单声道语音的单声道语音数据集,其中单声道语音数据集包含若干个单声道语音数据,即每个单声道包含一个独立声源的语音。Specifically, if the speaker count result data shows that there are multiple speakers, the audio data without background music is separated and imported into the processing software (such as Spleeter). Spleeter is used to automatically analyze the audio and identify different sound sources. After the processing is completed, a mono voice data set containing multiple mono voices will be obtained, where the mono voice data set contains several mono voice data, that is, each mono channel contains the voice of an independent sound source.
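下面是一个假设使用Spleeter的Python API进行两轨分离(人声/伴奏)的示意片段,作为声源分离步骤的简化示例;真正的多说话人分离可能需要专门的说话人分离模型,文件路径为假设。Below is an illustrative snippet assuming Spleeter's Python API for two-stem (vocals/accompaniment) separation, as a simplified example of the source-separation step; true multi-speaker separation may require a dedicated speaker-separation model, and the file paths are assumptions.

```python
from spleeter.separator import Separator

# 'spleeter:2stems' splits the mix into vocals and accompaniment;
# this is a simplified stand-in for per-speaker separation.
separator = Separator("spleeter:2stems")

# Writes vocals.wav and accompaniment.wav under output/no_bgm_audio/.
separator.separate_to_file("no_bgm_audio.wav", "output/")
```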
S125:获取当前主播音频指纹数据,根据当前主播音频指纹数据对单声道语音数据集进行主播语音匹配,得到主播语音数据;S125: Acquire the current host audio fingerprint data, and perform host voice matching on the mono voice data set according to the current host audio fingerprint data to obtain the host voice data;
具体地,在直播开始前,使用麦克风主动录入当前主播的音频数据,使用工具如Shazam或ACRCloud对当前主播的音频数据进行音频指纹提取,得到当前主播音频指纹数据。将当前主播音频指纹数据作为参照,使用音频比对软件对单声道语音数据集进行匹配。软件会分析每个单声道语音的特征,并与主播的音频指纹进行比对,找出最匹配的音频数据,最终,得到主播语音数据。Specifically, before the live broadcast begins, use a microphone to actively record the current host's audio data, and use tools such as Shazam or ACRCloud to extract the audio fingerprint of the current host's audio data to obtain the current host's audio fingerprint data. Use the current host's audio fingerprint data as a reference and use audio comparison software to match the mono voice data set. The software will analyze the characteristics of each mono voice and compare it with the host's audio fingerprint to find the best matching audio data, and finally obtain the host's voice data.
S126:对主播语音数据进行残余噪声消除,得到净人声音频数据。S126: Eliminate residual noise from the anchor's voice data to obtain clean human voice audio data.
具体地,在Adobe Audition中,选取主播语音数据中一小段只包含噪声的音频样本,然后使用“降噪”效果来学习噪声的特性。学习完成后,软件将应用这些参数到主播语音数据整个音频轨道上,以消除噪声,从而得到净人声音频数据。Specifically, in Adobe Audition, a small audio sample containing only noise is selected from the host's voice data, and then the "Noise Reduction" effect is used to learn the characteristics of the noise. After learning is completed, the software will apply these parameters to the entire audio track of the host's voice data to eliminate the noise, thereby obtaining clean human voice audio data.
本发明通过消除背景音乐有助于提高音频数据的清晰度和纯净度,使得后续的语音识别和分析更加准确可靠。通过扬声器计数能够准确判断直播者的发声情况,有助于识别单人或多人发言情况,为后续的音频处理提供指导,确保了对直播者语音的准确提取。在识别到单人发言情况下,直接将无背景乐音频数据作为主播语音数据,节省了额外的处理步骤,提高了处理效率和准确性。对于多人发言情况,通过语音分离得到单声道语音数据集,有助于提取出各个发言者的独立语音信息。通过根据当前主播音频指纹数据对单声道语音数据集进行匹配,能够准确提取出主播的语音信息。通过残余噪声消除,提高了主播语音数据的清晰度和质量,使得后续的语音识别和分析更加准确可靠。The present invention helps to improve the clarity and purity of audio data by eliminating background music, making subsequent speech recognition and analysis more accurate and reliable. The speaker counting can accurately determine the vocalization of the live broadcaster, which helps to identify the situation of single or multiple speakers, provides guidance for subsequent audio processing, and ensures the accurate extraction of the live broadcaster's voice. When a single speaker is identified, the audio data without background music is directly used as the host's voice data, saving additional processing steps and improving processing efficiency and accuracy. For multiple speakers, a monophonic voice data set is obtained through voice separation, which helps to extract the independent voice information of each speaker. By matching the monophonic voice data set according to the current host audio fingerprint data, the host's voice information can be accurately extracted. The residual noise elimination improves the clarity and quality of the host's voice data, making subsequent speech recognition and analysis more accurate and reliable.
优选地,S15包括:Preferably, S15 includes:
S151:对直播者有效文本数据进行分词与词性标注,得到直播者标注有效文本数据;S151: performing word segmentation and part-of-speech tagging on the live broadcaster's valid text data to obtain the live broadcaster's annotated valid text data;
具体地,将直播者有效文本数据输入到NLP工具中。工具将文本按照分隔符(如空格、标点符号)进行分割,形成词汇列表。然后,对这些词汇进行词性标注,为每个词汇添加一个词性标签,从而得到直播者标注有效文本数据。例如,对于句子“直播者在直播中表现得非常专业”,分词结果是[“直播者”,“在”,“直播”,“中”,“表现”,“得”,“非常”,“专业”],词性标注结果是[“名词”,“介词”,“动词”,“副词”,“副词”,“动词”,“副词”,“形容词”]。Specifically, the live broadcaster's valid text data is input into the NLP tool. The tool splits the text according to delimiters (such as spaces and punctuation marks) to form a vocabulary list. Then, these words are tagged with parts of speech, and a part-of-speech tag is added to each word to obtain the live broadcaster's tagged valid text data. For example, for the sentence "The live broadcaster behaves very professionally in the live broadcast", the word segmentation result is ["live broadcaster", "in", "live broadcast", "in", "performance", "get", "very", "professional"], and the part-of-speech tagging result is ["noun", "preposition", "verb", "adverb", "adverb", "verb", "adverb", "adjective"].
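对于中文文本,可使用jieba的词性标注接口实现分词与标注;以下片段仅作示意。For Chinese text, jieba's part-of-speech interface can perform segmentation and tagging; the snippet below is illustrative only.

```python
import jieba.posseg as pseg

text = "直播者在直播中表现得非常专业"
# pseg.cut yields (word, flag) pairs, where flag is a part-of-speech tag
# such as 'n' (noun), 'p' (preposition), 'd' (adverb), 'a' (adjective).
tagged = [(word, flag) for word, flag in pseg.cut(text)]
print(tagged)
```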
S152:对直播者标注有效文本数据进行语义特征提取,得到直播者有效语义向量;S152: extracting semantic features from the live broadcaster's annotated valid text data to obtain a live broadcaster's valid semantic vector;
具体地,选择一个预训练的词嵌入模型,如Google的Word2Vec模型。然后,将直播者标注有效文本数据的词汇列表输入到词嵌入模型中,模型将为每个词汇生成一个向量表示。接着,对这些向量进行聚合操作,如取平均值或加权和,以形成整个文本的语义向量,即直播者有效语义向量。例如,对于一个句子的词汇列表,通过Word2Vec模型得到的向量集合可能包括[向量1,向量2,...,向量n],通过计算这些向量的平均值,得到句子的语义向量。Specifically, a pre-trained word embedding model is selected, such as Google's Word2Vec model. Then, the vocabulary list of the live broadcaster's annotated valid text data is input into the word embedding model, and the model will generate a vector representation for each vocabulary. Next, these vectors are aggregated, such as taking the average or weighted sum, to form a semantic vector for the entire text, that is, the live broadcaster's effective semantic vector. For example, for a sentence's vocabulary list, the vector set obtained by the Word2Vec model may include [vector 1, vector 2, ..., vector n], and the semantic vector of the sentence is obtained by calculating the average of these vectors.
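下面是一个假设已有预训练词向量文件的gensim示意片段,通过对词向量取平均得到句子级语义向量;文件路径与格式均为假设。The following gensim sketch assumes a pre-trained word-vector file is available and averages word vectors to obtain a sentence-level semantic vector; the file path and format are assumptions.

```python
import numpy as np
from gensim.models import KeyedVectors

# Path and format of the pre-trained vectors are assumptions.
kv = KeyedVectors.load_word2vec_format("word2vec.bin", binary=True)

def sentence_vector(tokens):
    """Average the embeddings of in-vocabulary tokens; zero vector if none."""
    vecs = [kv[t] for t in tokens if t in kv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(kv.vector_size)

vec = sentence_vector(["直播者", "表现", "专业"])
```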
S153:根据预设的语言属性语料库对直播者有效语义向量进行相似度计算,得到语义-属性相似度矩阵,其中,预设的语言属性语料库包括感受型语料库、情绪型语料库、专业型语料库、条理型语料库、故事型语料库、互动型语料库、引导型语料库、介绍型语料库、推荐型语料库与条理型语料库;S153: performing similarity calculation on the effective semantic vector of the live broadcaster according to a preset language attribute corpus to obtain a semantic-attribute similarity matrix, wherein the preset language attribute corpus includes a feeling corpus, an emotional corpus, a professional corpus, a logical corpus, a story corpus, an interactive corpus, a guiding corpus, an introductory corpus, a recommended corpus and a logical corpus;
具体地,获取预设的语言属性语料库,每个语料库代表一种语言属性,如感受型、情绪型。然后,使用与S152步骤相同的词嵌入模型,将语料库中的文本转换为向量形式。接着,计算直播者的语义向量与每个语料库向量的相似度。例如,使用余弦相似度公式计算两个向量之间的夹角的余弦值,得到相似度分数。将这些分数组织成一个矩阵,其中行代表语料库,列代表直播者的文本数据,矩阵中的每个元素是相似度分数,从而得到语义-属性相似度矩阵。Specifically, a preset language attribute corpus is obtained, each corpus represents a language attribute, such as feeling type and emotional type. Then, the text in the corpus is converted into a vector form using the same word embedding model as step S152. Next, the similarity between the semantic vector of the live broadcaster and each corpus vector is calculated. For example, the cosine similarity formula is used to calculate the cosine value of the angle between the two vectors to obtain a similarity score. These scores are organized into a matrix, in which the rows represent the corpus, the columns represent the text data of the live broadcaster, and each element in the matrix is a similarity score, thereby obtaining a semantic-attribute similarity matrix.
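语义-属性相似度矩阵可按余弦相似度逐元素计算,下面给出一个仅作示意的numpy实现,向量维度与数量均为占位假设。The semantic-attribute similarity matrix can be computed element-wise with cosine similarity; the numpy sketch below is illustrative only, with placeholder vector dimensions and counts.

```python
import numpy as np

def cosine_similarity_matrix(corpus_vecs, text_vecs):
    """Rows: language-attribute corpora, columns: broadcaster text vectors."""
    a = corpus_vecs / np.linalg.norm(corpus_vecs, axis=1, keepdims=True)
    b = text_vecs / np.linalg.norm(text_vecs, axis=1, keepdims=True)
    return a @ b.T  # entry (i, j) = cosine similarity between corpus i and text j

# Example with random placeholder vectors (10 corpora, 50 text units, 300-dim).
sim = cosine_similarity_matrix(np.random.rand(10, 300), np.random.rand(50, 300))
```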
S154:基于语义-属性相似度矩阵对直播者有效语义向量进行图论聚类,得到直播者有效语义聚类结果数据;S154: performing graph-theoretic clustering on the live streamer's effective semantic vectors based on the semantic-attribute similarity matrix to obtain live streamer effective semantic clustering result data;
具体地,首先,根据语义-属性相似度矩阵,构建一个加权图,其中每个节点代表一个文本数据点(如句子或短语),节点间的边权重由相似度分数决定。接着,使用谱聚类算法,通过计算图的拉普拉斯矩阵的特征向量来确定节点的聚类结构。在聚类过程中,算法将识别出紧密连接的节点群组,这些群组代表了具有相似语义特征的文本数据集合。通过设置合适的聚类阈值或迭代次数,可以得到最终的聚类结果。每个聚类结果将包含一组具有共同语义属性的文本数据点,形成直播者有效语义聚类结果数据。Specifically, first, a weighted graph is constructed based on the semantic-attribute similarity matrix, in which each node represents a text data point (such as a sentence or phrase), and the edge weights between nodes are determined by the similarity score. Then, the spectral clustering algorithm is used to determine the clustering structure of the nodes by calculating the eigenvectors of the Laplacian matrix of the graph. During the clustering process, the algorithm will identify tightly connected groups of nodes that represent text data sets with similar semantic features. By setting an appropriate clustering threshold or number of iterations, the final clustering result can be obtained. Each clustering result will contain a set of text data points with common semantic attributes, forming effective semantic clustering result data for live broadcasters.
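下面是基于scikit-learn预计算亲和度模式的谱聚类示意片段;文本间亲和度由属性相似度剖面构造,这是对上文"边权重由相似度分数决定"的一种可能理解,聚类数为假设值。Below is a spectral-clustering sketch using scikit-learn's precomputed-affinity mode; the text-to-text affinity is built from attribute-similarity profiles, one plausible reading of "edge weights determined by the similarity scores" above, and the number of clusters is assumed.

```python
import numpy as np
from sklearn.cluster import SpectralClustering

# Placeholder semantic-attribute similarity matrix (rows: 10 attribute corpora,
# columns: 50 text units); in practice this comes from the previous step.
sim = np.random.rand(10, 50)

# Square affinity between text units, built from their attribute profiles.
profiles = sim.T / np.linalg.norm(sim.T, axis=1, keepdims=True)
affinity = np.clip(profiles @ profiles.T, 0.0, 1.0)

labels = SpectralClustering(
    n_clusters=5,              # assumed number of semantic clusters
    affinity="precomputed",
    random_state=0,
).fit_predict(affinity)
```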
S155:对直播者有效语义聚类结果数据进行语言属性归类统计,得到语言属性分布数据。S155: Perform language attribute classification and statistics on the live broadcaster's effective semantic clustering result data to obtain language attribute distribution data.
具体地,首先,通过人工分析或使用机器学习分类器来对直播者有效语义聚类结果数据中每个聚类结果中的文本数据点进行审查,识别其主要的语言属性特征。例如,如果一个聚类结果中的句子大多包含解释性表达词汇,这个聚类将被归类为“讲解型”。接着,统计每个语言属性类别中的聚类数量或文本数据点数量。通过创建一个语言属性分布矩阵来实现,其中行代表不同的语言属性类别,列代表不同的聚类或文本数据点,矩阵中的元素表示归类统计的结果。最后,通过对分布矩阵的分析,可以得到每种语言属性在直播者文本数据中的分布情况,包括它们出现的频率或占比,从而得到语言属性分布数据。Specifically, first, the text data points in each clustering result of the live broadcaster's effective semantic clustering result data are reviewed by manual analysis or by using a machine learning classifier to identify its main language attribute characteristics. For example, if the sentences in a clustering result mostly contain explanatory expression words, this cluster will be classified as "explanatory type". Next, the number of clusters or text data points in each language attribute category is counted. This is achieved by creating a language attribute distribution matrix, in which rows represent different language attribute categories, columns represent different clusters or text data points, and the elements in the matrix represent the results of the classification statistics. Finally, by analyzing the distribution matrix, the distribution of each language attribute in the live broadcaster's text data can be obtained, including their frequency or proportion of occurrence, thereby obtaining language attribute distribution data.
本发明通过分词与词性标注有助于对直播者有效文本数据进行更细致的语义分析,提高了对文本数据的理解和处理能力。通过语义特征提取,能够从直播者标注有效文本数据中提取出关键的语义信息,有助于深入理解直播者的语言特征和表达方式。通过根据预设的语言属性语料库,对直播者有效语义向量进行相似度计算,能够准确评估直播者语义与各语言属性的相似度。通过图论聚类,对直播者有效语义向量进行聚类分析,能够有效地将相似的语义特征归类到同一类别中。通过对直播者有效语义聚类结果数据进行语言属性归类统计,能够准确地统计出各种语言属性在直播者语言表达中的分布情况,有助于全面了解直播者的语言特征和表达风格。The present invention helps to perform more detailed semantic analysis on the live broadcaster's effective text data through word segmentation and part-of-speech tagging, thereby improving the ability to understand and process text data. Through semantic feature extraction, key semantic information can be extracted from the live broadcaster's annotated effective text data, which helps to deeply understand the live broadcaster's language characteristics and expressions. By performing similarity calculation on the live broadcaster's effective semantic vector according to a preset language attribute corpus, the similarity between the live broadcaster's semantics and various language attributes can be accurately evaluated. Through graph clustering, cluster analysis is performed on the live broadcaster's effective semantic vectors, which can effectively classify similar semantic features into the same category. By performing language attribute classification statistics on the live broadcaster's effective semantic clustering result data, the distribution of various language attributes in the live broadcaster's language expression can be accurately counted, which helps to fully understand the live broadcaster's language characteristics and expression style.
优选地,音域语速分析模块S2包括:Preferably, the range and speech rate analysis module S2 includes:
S21:对直播者音频数据进行谱加性滤波增强,得到直播者增强音频数据;S21: Perform spectral additive filtering enhancement on the live broadcaster's audio data to obtain live broadcaster enhanced audio data;
具体地,将直播者音频数据导入编程环境如Python的scipy库中的信号处理模块中。使用如scipy.signal模块中的butter或cheby1滤波器设计函数来创建滤波器,并应用lfilter函数对直播者音频数据进行滤波处理。处理后的音频数据将具有更清晰的声音特征,从而得到直播者增强音频数据。Specifically, the live broadcaster audio data is imported into a programming environment such as a signal processing module in Python's scipy library. A filter design function such as butter or cheby1 in the scipy.signal module is used to create a filter, and the lfilter function is applied to filter the live broadcaster audio data. The processed audio data will have clearer sound characteristics, thereby obtaining live broadcaster enhanced audio data.
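作为该滤波增强步骤的一个简化示意,下面给出使用scipy.signal的butter与lfilter做带通滤波的片段;截止频率为假设值,并非谱加性滤波的完整实现。As a simplified illustration of this filtering-enhancement step, the snippet below uses scipy.signal's butter and lfilter for band-pass filtering; the cut-off frequencies are assumed values and this is not a full implementation of spectral additive filtering.

```python
import numpy as np
from scipy.signal import butter, lfilter

def bandpass_voice(audio: np.ndarray, sr: int, low=80.0, high=4000.0, order=4):
    """Band-pass filter keeping roughly the human-voice band (assumed 80-4000 Hz)."""
    nyq = 0.5 * sr
    b, a = butter(order, [low / nyq, high / nyq], btype="band")
    return lfilter(b, a, audio)

# Example with a placeholder signal sampled at 16 kHz.
sr = 16000
enhanced = bandpass_voice(np.random.randn(sr * 5).astype(np.float32), sr)
```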
S22:对直播者增强音频数据进行基频轨迹提取,得到直播者音频基频轨迹数据;S22: extracting the fundamental frequency trajectory of the live broadcaster's enhanced audio data to obtain the fundamental frequency trajectory data of the live broadcaster's audio;
具体地,对直播者增强音频数据进行预处理,如采样率转换和窗函数应用,以准备数据进行基频检测。然后,使用Python中librosa库中的pitches函数,该函数可以应用如YIN(一种算法,用于估计音频信号的基频)算法来检测音频信号的基频。通过分析音频信号在时间序列上的基频变化,可以得到直播者音频基频轨迹数据,这些数据反映了直播者在不同时间点的音高变化。Specifically, the live broadcaster enhanced audio data is preprocessed, such as sampling rate conversion and window function application, to prepare the data for fundamental frequency detection. Then, the pitches function in the librosa library in Python is used, which can apply algorithms such as YIN (an algorithm for estimating the fundamental frequency of an audio signal) to detect the fundamental frequency of the audio signal. By analyzing the fundamental frequency changes of the audio signal in the time series, the live broadcaster audio fundamental frequency trajectory data can be obtained, which reflects the pitch changes of the live broadcaster at different time points.
S23:基于直播者音频基频轨迹数据对直播者进行音域统计,得到直播者音域数据;S23: performing vocal range statistics on the live broadcaster based on the audio fundamental frequency trajectory data of the live broadcaster to obtain the vocal range data of the live broadcaster;
具体地,通过简单地扫描基频轨迹数据,找到最小值和最大值来实现。然后,根据这些极值,计算音域的宽度,即最高音高与最低音高之间的差值,从而得到直播者音域数据。Specifically, it is achieved by simply scanning the fundamental frequency trajectory data and finding the minimum and maximum values. Then, based on these extreme values, the width of the range is calculated, that is, the difference between the highest pitch and the lowest pitch, so as to obtain the live broadcaster's range data.
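结合S22与S23,下面给出一个使用librosa的pYIN实现(上文所述YIN算法的概率化变体)提取基频轨迹并统计音域范围的示意片段;文件路径与频率上下限均为假设。Combining S22 and S23, the sketch below uses librosa's pYIN implementation (a probabilistic variant of the YIN algorithm mentioned above) to extract the fundamental-frequency trajectory and compute the vocal range; the file path and frequency bounds are assumptions.

```python
import numpy as np
import librosa

# Load the enhanced broadcaster audio (path is illustrative).
y, sr = librosa.load("broadcaster_enhanced.wav", sr=None)

# pYIN fundamental-frequency track; unvoiced frames come back as NaN.
f0, voiced_flag, voiced_prob = librosa.pyin(
    y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C6"), sr=sr
)

f0_min = np.nanmin(f0)          # lowest voiced pitch (Hz)
f0_max = np.nanmax(f0)          # highest voiced pitch (Hz)
range_hz = f0_max - f0_min      # vocal-range width in Hz, as described in S23
print(f"range: {f0_min:.1f}-{f0_max:.1f} Hz (width {range_hz:.1f} Hz)")
```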
S24:对直播者音频数据进行发音元分段,得到直播者发音元序列数据,并对直播者音频数据进行语音活动检测,得到直播者语音分段数据;S24: segmenting the live broadcaster's audio data into pronunciation units to obtain the live broadcaster's pronunciation unit sequence data, and performing voice activity detection on the live broadcaster's audio data to obtain the live broadcaster's voice segmentation data;
具体地,使用Praat或Python的librosa库来加载直播者音频数据。使用这些工具的算法对直播者音频数据的音频信号进行帧分割,即把连续的音频信号切分成短时间内的小段。接下来,对每一帧音频进行特征提取,如梅尔频率倒谱系数(MFCCs),这些特征能够反映音频信号的发音特性。然后,应用语音活动检测算法高斯混合模型(GMM)来区分语音帧和非语音帧,可以得到直播者语音分段数据,即标记出音频中哪些部分是语音,哪些部分是非语音。Specifically, use Praat or Python's librosa library to load the live broadcaster's audio data. Use the algorithms of these tools to perform frame segmentation on the audio signal of the live broadcaster's audio data, that is, to divide the continuous audio signal into short segments. Next, extract features from each frame of audio, such as Mel-frequency cepstral coefficients (MFCCs), which can reflect the pronunciation characteristics of the audio signal. Then, apply the voice activity detection algorithm Gaussian mixture model (GMM) to distinguish between speech frames and non-speech frames, and obtain the live broadcaster's voice segmentation data, that is, mark which parts of the audio are speech and which parts are non-speech.
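下面的示意片段按上述思路提取MFCC特征,并用双分量高斯混合模型对帧能量做语音/非语音区分;这是对所述GMM语音活动检测的简化替身,路径与采样率为假设。The sketch below extracts MFCC features and uses a two-component Gaussian mixture model on frame energy for a speech/non-speech split; it is a simplified stand-in for the GMM-based voice activity detection described, and the path and sampling rate are assumptions.

```python
import numpy as np
import librosa
from sklearn.mixture import GaussianMixture

y, sr = librosa.load("broadcaster_audio.wav", sr=16000)

# Frame-level features: MFCCs and log RMS energy.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)          # (13, n_frames)
rms = librosa.feature.rms(y=y)[0]                           # (n_frames,)
log_energy = np.log(rms + 1e-8).reshape(-1, 1)

# Two-component GMM: the higher-mean component is treated as "speech".
gmm = GaussianMixture(n_components=2, random_state=0).fit(log_energy)
speech_component = int(np.argmax(gmm.means_.ravel()))
is_speech = gmm.predict(log_energy) == speech_component     # per-frame VAD flags
```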
S25:获取发音速率参照表,并根据发音速率参照表、直播者发音元序列数据与直播者语音分段数据对直播者进行语速估计,得到直播者语速数据,其中,直播者语速数据包括字节语速数据与发音元语速数据。S25: Obtain a pronunciation rate reference table, and estimate the speaker's speaking rate based on the pronunciation rate reference table, the speaker's pronunciation unit sequence data, and the speaker's voice segmentation data to obtain the speaker's speaking rate data, wherein the speaker's speaking rate data includes byte speaking rate data and pronunciation unit speaking rate data.
具体地,通过人工录入获取发音速率参照表,这可以是一个包含不同语言或方言中,平均每个音素或单词所需时间的数据库。参照表是基于先前的实验数据,提供了不同语言环境下的标准发音速率。然后,统计直播者发音元序列数据中的音素或单词数量,以及这些音素或单词对应的实际发音时间。这可以通过分析语音分段数据来实现,即测量每个发音元或单词的持续时间。接着,使用发音速率参照表来估计直播者的基准语速。例如,如果参照表指出一个语言环境中的平均发音速率是每秒3个音素,而直播者实际上在相同时间内发了4个音素,那么直播者的语速就比基准快。最后,计算直播者的字节语速数据和发音元语速数据。字节语速数据是指每秒发出的音素数量,而发音元语速数据则是指每秒发出的音节或单词数量,从而得到直播者语速数据。Specifically, a pronunciation rate reference table is obtained by manual entry, which can be a database containing the average time required for each phoneme or word in different languages or dialects. The reference table is based on previous experimental data and provides standard pronunciation rates in different language environments. Then, the number of phonemes or words in the pronunciation unit sequence data of the live broadcaster and the actual pronunciation time corresponding to these phonemes or words are counted. This can be achieved by analyzing the speech segmentation data, that is, measuring the duration of each pronunciation unit or word. Next, the pronunciation rate reference table is used to estimate the baseline speaking rate of the live broadcaster. For example, if the reference table indicates that the average pronunciation rate in a language environment is 3 phonemes per second, and the live broadcaster actually sends 4 phonemes in the same time, then the live broadcaster's speaking speed is faster than the benchmark. Finally, the byte speaking rate data and pronunciation unit speaking rate data of the live broadcaster are calculated. The byte speaking rate data refers to the number of phonemes emitted per second, while the pronunciation unit speaking rate data refers to the number of syllables or words emitted per second, thereby obtaining the live broadcaster's speaking rate data.
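在得到发音元序列与语音分段后,语速估计本质上是计数与除法;下面是一个仅作示意的计算,所有数值均为假设。Once the pronunciation-unit sequence and the speech segments are available, speech-rate estimation reduces to counting and division; the calculation below is illustrative only and all figures are assumed.

```python
# Illustrative inputs: phoneme/word counts and the detected speech segments (seconds).
phoneme_count = 960
word_count = 420
speech_segments = [(0.0, 35.2), (40.1, 121.7), (130.0, 190.4)]  # (start, end) pairs

voiced_seconds = sum(end - start for start, end in speech_segments)

phonemes_per_second = phoneme_count / voiced_seconds   # phoneme-level speech rate
words_per_second = word_count / voiced_seconds         # pronunciation-unit speech rate

# Comparison against an assumed reference rate of 3 phonemes per second.
reference_rate = 3.0
relative_speed = phonemes_per_second / reference_rate
```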
本发明通过谱加性滤波增强有助于提高音频数据的清晰度和可听性。通过基频轨迹提取能够准确提取出音频中的基频信息,有助于对直播者音域进行统计分析。通过基于直播者音频基频轨迹数据进行音域统计,能够准确评估直播者的音域范围和分布情况。通过发音元分段和语音活动检测有助于将音频数据分割成语音活动段和非语音活动段,提取出直播者的发音元序列数据。通过根据发音速率参照表和直播者的发音元序列数据与语音分段数据,对直播者的语速进行估计,得到直播者的语速数据。这些数据包括字节语速数据和发音元语速数据,有助于评估直播者的语言表达速度和节奏感。The present invention helps to improve the clarity and audibility of audio data through spectral additive filtering enhancement. The fundamental frequency information in the audio can be accurately extracted through fundamental frequency trajectory extraction, which is helpful for statistical analysis of the vocal range of the live broadcaster. By performing vocal range statistics based on the audio fundamental frequency trajectory data of the live broadcaster, the vocal range and distribution of the live broadcaster can be accurately evaluated. Pronunciation unit segmentation and voice activity detection help to segment the audio data into voice activity segments and non-voice activity segments, and extract the pronunciation unit sequence data of the live broadcaster. By estimating the speaking speed of the live broadcaster according to the pronunciation rate reference table and the pronunciation unit sequence data and voice segmentation data of the live broadcaster, the speaking speed data of the live broadcaster is obtained. These data include byte speech rate data and pronunciation unit speech rate data, which help to evaluate the language expression speed and sense of rhythm of the live broadcaster.
优选地,情绪分析模块S3包括:Preferably, the sentiment analysis module S3 includes:
S31:对直播间视频数据进行视频解码,得到直播间视频帧序列;S31: Decode the live broadcast room video data to obtain a live broadcast room video frame sequence;
具体地,使用FFmpeg通过命令行工具将直播间视频数据的视频流解码为图像序列,从而得到直播间视频帧序列。Specifically, FFmpeg is used through a command line tool to decode the video stream of the live broadcast room video data into an image sequence, thereby obtaining a live broadcast room video frame sequence.
S32:对直播间视频帧序列进行视频帧稳像,得到直播间稳定视频帧序列;S32: Performing video frame stabilization on the live broadcast room video frame sequence to obtain a stable video frame sequence of the live broadcast room;
具体地,使用OpenCV来分析直播间视频帧序列中的运动,应用光流算法(OpticalFlow)来估计帧间的运动矢量,这有助于检测和跟踪视频中的运动物体或特征点。然后,根据运动矢量计算一个稳定的变换矩阵,如平移、旋转或缩放,以对齐连续帧中的特征点。使用OpenCV中的warpAffine函数,将这个变换应用到每一帧上,从而实现帧的稳定,得到直播间稳定视频帧序列。Specifically, OpenCV is used to analyze the motion in the live video frame sequence, and the optical flow algorithm (OpticalFlow) is applied to estimate the motion vector between frames, which helps to detect and track moving objects or feature points in the video. Then, a stable transformation matrix such as translation, rotation, or scaling is calculated based on the motion vector to align the feature points in consecutive frames. The warpAffine function in OpenCV is used to apply this transformation to each frame to achieve frame stabilization and obtain a stable video frame sequence in the live broadcast room.
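下面是一个按上述光流思路进行帧间稳像的OpenCV示意片段,仅演示相邻两帧对齐的核心调用;输入文件路径为假设。The following OpenCV sketch illustrates the core calls for frame-to-frame stabilization along the optical-flow lines described above, showing alignment of one pair of adjacent frames only; the input path is an assumption.

```python
import cv2
import numpy as np

cap = cv2.VideoCapture("live_room.mp4")      # illustrative input path
ok, prev = cap.read()
prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)

ok, frame = cap.read()
gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

# Track corner features from the previous frame into the current one.
prev_pts = cv2.goodFeaturesToTrack(prev_gray, maxCorners=200,
                                   qualityLevel=0.01, minDistance=30)
curr_pts, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, gray, prev_pts, None)
good_prev = prev_pts[status.ravel() == 1]
good_curr = curr_pts[status.ravel() == 1]

# Estimate a translation/rotation/scale transform and re-align the current frame.
m, _ = cv2.estimateAffinePartial2D(good_curr, good_prev)
h, w = frame.shape[:2]
stabilized = cv2.warpAffine(frame, m, (w, h))
cap.release()
```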
S33:对直播间稳定视频帧序列进行人脸代表视频帧提取,得到最佳清晰人脸图像帧;S33: extracting representative face video frames from the stable video frame sequence in the live broadcast room to obtain the best clear face image frame;
具体地,使用OpenCV中的Haar级联分类器来识别直播间稳定视频帧序列中的人脸区域。算法可以在每一帧中标记出人脸的位置和大小。然后,评估每一帧中检测到的人脸图像的质量,如清晰度、对比度和面部特征的完整性。可以设计一个质量评分机制,为每一帧的人脸图像分配一个分数。接着,选择得分最高的帧作为最佳清晰人脸图像帧。Specifically, the Haar cascade classifier in OpenCV is used to identify the face area in the stable video frame sequence of the live broadcast room. The algorithm can mark the position and size of the face in each frame. Then, the quality of the face image detected in each frame is evaluated, such as clarity, contrast, and integrity of facial features. A quality scoring mechanism can be designed to assign a score to each frame of the face image. Then, the frame with the highest score is selected as the best clear face image frame.
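下面是一个使用Haar级联检测人脸并以拉普拉斯方差作为清晰度评分来挑选最佳帧的示意片段;评分机制为简化假设。Below is a sketch that detects faces with a Haar cascade and uses the variance of the Laplacian as a sharpness score to pick the best frame; the scoring scheme is a simplified assumption.

```python
import cv2

face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
)

def frame_face_score(frame):
    """Return the sharpness score of the largest detected face, or 0 if none."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return 0.0
    x, y, w, h = max(faces, key=lambda f: f[2] * f[3])    # largest face region
    face_roi = gray[y:y + h, x:x + w]
    return cv2.Laplacian(face_roi, cv2.CV_64F).var()       # higher = sharper

# best_frame = max(stabilized_frames, key=frame_face_score)  # stabilized_frames is assumed
```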
S34:将最佳清晰人脸图像帧与预设的直播者面部图像库进行人脸识别匹配,得到直播者身份数据,并根据直播者身份数据从预设的直播者专属微表情识别模型库提取对应的直播者专属微表情识别模型,得到直播者专属微表情识别模型;S34: performing face recognition matching on the best clear face image frame and the preset live broadcaster face image library to obtain the live broadcaster identity data, and extracting the corresponding live broadcaster exclusive micro-expression recognition model from the preset live broadcaster exclusive micro-expression recognition model library according to the live broadcaster identity data to obtain the live broadcaster exclusive micro-expression recognition model;
具体地,收集并建立一个包含不同直播者面部特征的图像库。这个图像库将作为人脸识别的参考数据集(即预设的直播者面部图像库)。接着,使用OpenCV库中的FaceRecognizer类来处理最佳清晰人脸图像帧。通过将这一帧与面部图像库中的图像进行比较,识别出直播者的身份。识别成功后,根据得到的直播者身份数据,从预设的直播者专属微表情识别模型库中提取对应的模型。这个模型库是事先根据每个直播者的表情数据训练得到的,可以通过调用模型库的检索系统来实现自动匹配和提取。Specifically, an image library containing facial features of different broadcasters is collected and established. This image library will be used as a reference data set for face recognition (i.e., a preset live broadcaster facial image library). Next, the FaceRecognizer class in the OpenCV library is used to process the best clear face image frame. The identity of the broadcaster is identified by comparing this frame with the image in the facial image library. After successful recognition, the corresponding model is extracted from the preset live broadcaster-exclusive micro-expression recognition model library based on the obtained live broadcaster identity data. This model library is trained in advance based on the expression data of each live broadcaster, and automatic matching and extraction can be achieved by calling the retrieval system of the model library.
S35:利用直播者专属微表情识别模型对直播间稳定视频帧序列进行微表情检测与跟踪,得到直播者微表情时序数据;S35: Detect and track micro-expressions of the stable video frame sequence in the live broadcast room using a micro-expression recognition model exclusive to the live broadcaster, and obtain micro-expression time series data of the live broadcaster;
具体地,使用视频处理软件或库,如OpenCV,来逐帧处理直播间稳定视频帧序列。对于每一帧,使用直播者专属微表情识别模型来检测是否存在微表情。微表情是短暂的、难以察觉的面部表情变化,可以揭示直播者的真实情绪状态。然后,如果模型检测到微表情,记录下微表情的类型、出现的时间点以及持续的时间。这些数据将被用于构建直播者的微表情时序数据,即一个包含微表情特征和时间信息的数据序列。Specifically, video processing software or libraries, such as OpenCV, are used to process the stable video frame sequence of the live broadcast room frame by frame. For each frame, a micro-expression recognition model dedicated to the broadcaster is used to detect whether there is a micro-expression. Micro-expressions are short-lived, imperceptible changes in facial expressions that can reveal the true emotional state of the broadcaster. Then, if the model detects a micro-expression, the type of micro-expression, the time point when it appears, and the duration of the micro-expression are recorded. These data will be used to construct the micro-expression time series data of the broadcaster, that is, a data sequence containing micro-expression features and time information.
S36:对直播者微表情时序数据进行情绪分类与统计,得到直播者情绪占比数据;S36: classifying and counting the emotions of the live streamer's micro-expression time series data to obtain the live streamer's emotion proportion data;
具体地,在Python的Pandas库中对直播者微表情时序数据进行整理,确保每条记录都包含情绪类别标签和对应的情绪事件发生的起止时间。随后,对每种情绪类别的每个事件计算持续时间,通过简单地从结束时间减去开始时间得到的。在Python的Pandas库中,通过相应的时间序列操作来计算持续时间。接下来,为了得到每种情绪在整个直播时段中的占比,将每种情绪的总持续时间除以已知的直播总时长,在Pandas中通过比例计算完成。此外,情绪频率的统计也是必要的,它通过计数每个情绪类别的事件来实现,在Pandas中利用value_counts函数来快速得到每种情绪的出现次数。通过这些步骤,能够获得直播者情绪的持续时间、占比和频率关键数据。根据得到的关键数汇总形成直播者情绪占比数据。Specifically, the time series data of the micro-expressions of the live streamers are sorted in the Python Pandas library to ensure that each record contains the emotion category label and the start and end time of the corresponding emotion event. Subsequently, the duration is calculated for each event of each emotion category, which is obtained by simply subtracting the start time from the end time. In the Python Pandas library, the duration is calculated by the corresponding time series operation. Next, in order to obtain the proportion of each emotion in the entire live broadcast period, the total duration of each emotion is divided by the known total live broadcast time, which is completed in Pandas by proportion calculation. In addition, the statistics of emotion frequency are also necessary, which is achieved by counting the events of each emotion category. The value_counts function is used in Pandas to quickly obtain the number of occurrences of each emotion. Through these steps, the key data of the duration, proportion and frequency of the live streamer's emotions can be obtained. The live streamer's emotion proportion data is summarized according to the key numbers obtained.
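下面的pandas示意片段演示持续时间、占比与频次三项统计;事件表的列名与直播总时长均为假设。The pandas sketch below demonstrates the duration, proportion and frequency statistics; the event-table column names and the total broadcast length are assumptions.

```python
import pandas as pd

# Hypothetical micro-expression event table: one row per detected emotion event.
events = pd.DataFrame({
    "emotion": ["happy", "surprised", "happy", "neutral"],
    "start_s": [12.0, 95.5, 300.0, 410.2],
    "end_s":   [18.5, 97.0, 306.3, 430.0],
})
total_broadcast_s = 3600.0          # assumed total live-broadcast duration

events["duration_s"] = events["end_s"] - events["start_s"]

duration_per_emotion = events.groupby("emotion")["duration_s"].sum()
proportion = duration_per_emotion / total_broadcast_s   # share of broadcast time
frequency = events["emotion"].value_counts()             # number of events per emotion

print(proportion)
print(frequency)
```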
S37:根据直播者微表情时序数据对直播者进行情绪丰富度计算,得到情绪丰富度分布数据。S37: Calculate the emotional richness of the live streamer based on the live streamer's micro-expression time series data to obtain emotional richness distribution data.
具体地,本实施例的具体处理流程请参考后文S371至S375的子步骤。Specifically, for the specific processing flow of this embodiment, please refer to sub-steps S371 to S375 described below.
本发明通过视频解码,得到直播间视频帧序列,为后续的视频帧稳像和人脸代表视频帧提取提供了准确的视频数据基础,确保了情绪分析的准确性和可靠性。通过视频帧稳像有助于减少视频帧的抖动和晃动,提高了视频质量和清晰度,使得后续的人脸识别和微表情检测更加准确可靠。通过提取最佳清晰人脸图像帧,确保了情绪分析的对象准确性,有助于提高情绪分析的可信度和准确性。通过人脸识别匹配,得到直播者身份数据,并提取出直播者专属微表情识别模型,确保了情绪分析的对象准确性和个性化分析,提高了情绪分析的针对性和精确性。通过利用直播者专属微表情识别模型对稳定视频帧序列进行微表情检测与跟踪,得到直播者微表情时序数据,有助于深入理解直播者的情绪变化和表达方式,提高了情绪分析的细致程度和深度。通过对微表情时序数据进行情绪分类与统计,得到直播者情绪占比数据,能够准确评估直播者的情绪状态,有助于全面了解直播者的情绪表达特征。通过根据直播者微表情时序数据对直播者进行情绪丰富度计算,得到情绪丰富度分布数据,有助于评估直播者情绪表达的丰富程度和变化规律,提高了情绪分析的深度和准确性。The present invention obtains a live broadcast room video frame sequence through video decoding, provides an accurate video data basis for subsequent video frame stabilization and face representative video frame extraction, and ensures the accuracy and reliability of emotional analysis. Video frame stabilization helps to reduce the jitter and shake of video frames, improves video quality and clarity, and makes subsequent face recognition and micro-expression detection more accurate and reliable. By extracting the best clear face image frame, the object accuracy of emotional analysis is ensured, which helps to improve the credibility and accuracy of emotional analysis. Through face recognition matching, the identity data of the live broadcaster is obtained, and the exclusive micro-expression recognition model of the live broadcaster is extracted, which ensures the object accuracy and personalized analysis of emotional analysis, and improves the pertinence and accuracy of emotional analysis. By using the exclusive micro-expression recognition model of the live broadcaster to detect and track the stable video frame sequence, the micro-expression time series data of the live broadcaster is obtained, which helps to deeply understand the emotional changes and expressions of the live broadcaster, and improves the meticulousness and depth of emotional analysis. By performing emotional classification and statistics on the micro-expression time series data, the emotional proportion data of the live broadcaster is obtained, which can accurately evaluate the emotional state of the live broadcaster and help to fully understand the emotional expression characteristics of the live broadcaster. By calculating the emotional richness of the live streamer based on the time series data of his/her micro-expressions, we can obtain the emotional richness distribution data, which helps to evaluate the richness and changing patterns of the live streamer's emotional expression and improves the depth and accuracy of emotional analysis.
优选地,所述预设的直播者专属微表情识别模型库中包含若干个直播者专属微表情识别模型,其中,直播者专属微表情识别模型与直播者存在唯一对应关系,所述直播者专属微表情识别模型的具体构建过程包括:Preferably, the preset live streamer-specific micro-expression recognition model library contains a number of live streamer-specific micro-expression recognition models, wherein the live streamer-specific micro-expression recognition model has a unique corresponding relationship with the live streamer, and the specific construction process of the live streamer-specific micro-expression recognition model includes:
S3401:获取情绪类别列表;S3401: Get a list of emotion categories;
具体地,通过人工录入获取情绪类别列表,情绪类别列表中包含多种情绪类别,如快乐、悲伤、愤怒、惊讶、恐惧、厌恶。Specifically, an emotion category list is obtained through manual entry, and the emotion category list includes multiple emotion categories, such as happiness, sadness, anger, surprise, fear, and disgust.
S3402:根据情绪类别列表对直播者进行各情绪面部表情图像采集,得到直播者面部表情图像集,其中,直播者面部表情图像集中包含若干张直播者面部表情图像,所述直播者面部表情图像与情绪类别列表中的情绪类别一一对应;S3402: collecting facial expression images of each emotion of the live broadcaster according to the emotion category list to obtain a facial expression image set of the live broadcaster, wherein the facial expression image set of the live broadcaster includes a plurality of facial expression images of the live broadcaster, and the facial expression images of the live broadcaster correspond to the emotion categories in the emotion category list one by one;
具体地,根据情绪类别列表,指导直播者依次展现每种情绪的面部表情。采用静态图像采集,即让直播者保持某一情绪表情几秒钟,连续拍摄多张照片,将所拍的照片中最清晰的一张照片作为对应情绪类别的代表性情绪面部表情图像,并对其进行对应的情绪类别标注,依次遍历完情绪类别列表中的情绪类别后,最后得到直播者面部表情图像集,其中,直播者面部表情图像集中包含若干张直播者面部表情图像,所述直播者面部表情图像与情绪类别列表中的情绪类别一一对应。Specifically, according to the list of emotion categories, the live broadcaster is guided to show the facial expression of each emotion in turn. Static image acquisition is adopted, that is, the live broadcaster is asked to maintain a certain emotional expression for a few seconds, and multiple photos are taken continuously. The clearest photo among the photos taken is used as the representative emotional facial expression image of the corresponding emotion category, and the corresponding emotion category is labeled for it. After traversing the emotion categories in the emotion category list in turn, a live broadcaster facial expression image set is finally obtained, wherein the live broadcaster facial expression image set contains a plurality of live broadcaster facial expression images, and the live broadcaster facial expression images correspond to the emotion categories in the emotion category list one by one.
S3403:对直播者面部表情图像进行面部关键点检测与标注,得到面部关键点标注数据;S3403: Detect and annotate facial key points of the live broadcaster's facial expression image to obtain facial key point annotated data;
具体地,使用dlib库或face_recognition库来识别每张直播者面部表情图像中的关键点。这些工具可以自动检测图像中的面部特征,并在图像上标注出相应的关键点。使用图像标注软件,如LabelImg,来手动编辑和确认关键点的位置。最后,保存面部关键点标注数据,通常包括每个关键点的坐标和相关特征描述。Specifically, use the dlib library or the face_recognition library to identify the key points in each live broadcaster's facial expression image. These tools can automatically detect facial features in the image and mark the corresponding key points on the image. Use image annotation software, such as LabelImg, to manually edit and confirm the location of the key points. Finally, save the facial key point annotation data, which usually includes the coordinates of each key point and related feature descriptions.
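A minimal sketch of this annotation step with the face_recognition library (built on dlib) is shown below; the image and output paths are illustrative placeholders.

```python
import json
import face_recognition

def annotate_landmarks(image_path, out_path):
    image = face_recognition.load_image_file(image_path)
    landmarks = face_recognition.face_landmarks(image)   # (x, y) points grouped per facial region
    with open(out_path, "w", encoding="utf-8") as f:
        json.dump(landmarks, f, ensure_ascii=False)       # saved key point annotation data
    return landmarks
```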
S3404:根据面部关键点标注数据对对应的直播者面部表情图像进行仿射变换数据增广,得到增广面部表情图像集,其中增广面部表情图像集中包含若干张具有相同情绪的增广面部表情图像;S3404: performing affine transformation data augmentation on the corresponding live broadcaster's facial expression image according to the facial key point annotation data to obtain an augmented facial expression image set, wherein the augmented facial expression image set includes a plurality of augmented facial expression images with the same emotion;
具体地,根据面部关键点标注数据,使用图像处理软件或编程库,如OpenCV,对对应的直播者面部表情图像执行仿射变换。仿射变换包括但不限于平移、旋转、缩放和剪切操作,可以在不改变面部表情特征的情况下,改变图像的几何属性。然后,设计一系列的仿射变换参数,如旋转角度、缩放比例,以生成不同变体的面部表情图像。接着,应用这些变换参数到原始图像上,生成增广的面部表情图像。每张原始图像都可以根据设计的参数集生成多张变换后的图像,从而构成增广面部表情图像集。Specifically, according to the facial key point annotation data, image processing software or a programming library such as OpenCV is used to perform affine transformations on the corresponding live broadcaster's facial expression image. Affine transformations include but are not limited to translation, rotation, scaling and shearing operations, which can change the geometric properties of the image without changing the facial expression features. Then, a series of affine transformation parameters, such as rotation angles and scaling ratios, are designed to generate different variants of the facial expression image. Next, these transformation parameters are applied to the original image to generate augmented facial expression images. Each original image can generate multiple transformed images according to the designed parameter set, thereby forming an augmented facial expression image set.
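The sketch below illustrates one way to carry out this augmentation with OpenCV; the rotation angles and scaling ratios are example parameters, not values fixed by the text.

```python
import cv2

def augment_expression_image(image, angles=(-10, -5, 5, 10), scales=(0.9, 1.0, 1.1)):
    h, w = image.shape[:2]
    center = (w / 2, h / 2)
    variants = []
    for angle in angles:
        for scale in scales:
            m = cv2.getRotationMatrix2D(center, angle, scale)   # rotation + scaling matrix
            variants.append(cv2.warpAffine(image, m, (w, h)))   # same emotion, new geometry
    return variants
```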
S3405:遍历直播者面部表情图像集,执行S3403-S3404,从而得到若干组增广面部表情图像集;S3405: traverse the live broadcaster's facial expression image set, execute S3403-S3404, and thus obtain several sets of augmented facial expression image sets;
具体地,利用预设的自动化的脚本顺序处理直播者面部表情图像集中的每张图像。脚本将依次对每张图像执行关键点检测与标注,然后根据标注结果应用仿射变换。然后,对于每张图像,重复应用不同的仿射变换参数集,生成多组增广面部表情图像集。每组增广面部表情图像集将包含若干张具有相同情绪但经过不同变换的面部表情图像。最后,将生成的增广图像保存到相应的图像集中,每组增广面部表情图像集都与特定的原始图像和情绪类别相关联。Specifically, a preset automated script is used to sequentially process each image in the live broadcaster's facial expression image set. The script will perform key point detection and annotation on each image in turn, and then apply affine transformation based on the annotation results. Then, for each image, different affine transformation parameter sets are repeatedly applied to generate multiple sets of augmented facial expression image sets. Each set of augmented facial expression image sets will contain several facial expression images with the same emotion but with different transformations. Finally, the generated augmented images are saved in the corresponding image set, and each set of augmented facial expression image sets is associated with a specific original image and emotion category.
S3406:获取情绪重要级别字典;S3406: Obtaining a dictionary of emotional importance levels;
具体地,通过人工录入来获取情绪重要级别字典,例如,可以设置一个从1到10的等级量表,10表示最重要的情绪类别。Specifically, the emotion importance level dictionary is obtained through manual entry. For example, a rating scale from 1 to 10 may be set, where 10 represents the most important emotion category.
S3407:根据情绪重要级别字典和预设的重要等级图像集划分比例对各组增广面部表情图像集进行图像集划分,得到多组面部表情训练图像集、多组面部表情验证图像集与多组面部表情测试图像集;S3407: dividing each group of augmented facial expression image sets into image sets according to the emotion importance level dictionary and the preset importance level image set division ratio, to obtain multiple groups of facial expression training image sets, multiple groups of facial expression verification image sets, and multiple groups of facial expression test image sets;
具体地,首先,根据情绪重要级别字典,确定每种情绪类别在图像集中的占比。例如,对于更重要的情绪类别,需要分配更多的图像到训练集中。然后,根据预设的重要等级图像集划分比例(如70%的训练集、15%的验证集和15%的测试集)使用编程语言(如Python)来随机划分对应组增广面部表情图像集,同时确保每种情绪类别的图像在各个集合中的比例符合预设的要求。这个比例可以根据实际需要和经验进行调整。最终得到多组面部表情训练图像集、多组面部表情验证图像集与多组面部表情测试图像集。Specifically, first, according to the emotion importance level dictionary, determine the proportion of each emotion category in the image set. For example, for more important emotion categories, more images need to be allocated to the training set. Then, according to the preset importance level image set division ratio (such as 70% training set, 15% validation set and 15% test set), use a programming language (such as Python) to randomly divide the corresponding group of augmented facial expression image sets, while ensuring that the proportion of images of each emotion category in each set meets the preset requirements. This ratio can be adjusted according to actual needs and experience. Finally, multiple groups of facial expression training image sets, multiple groups of facial expression verification image sets and multiple groups of facial expression test image sets are obtained.
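A minimal sketch of the 70% / 15% / 15% split mentioned above is given below; per-emotion ratios would be applied by calling this once for each emotion group of the augmented image set.

```python
import random

def split_image_set(image_paths, train_ratio=0.70, val_ratio=0.15, seed=42):
    paths = list(image_paths)
    random.Random(seed).shuffle(paths)
    n_train = int(len(paths) * train_ratio)
    n_val = int(len(paths) * val_ratio)
    return (paths[:n_train],                    # training set
            paths[n_train:n_train + n_val],     # validation set
            paths[n_train + n_val:])            # test set
```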
S3408:利用多组面部表情训练图像集、多组面部表情验证图像集与多组面部表情测试图像集对预设的基于注意力机制的生成对抗网络进行模型训练,得到直播者专属微表情识别模型。S3408: Use multiple sets of facial expression training image sets, multiple sets of facial expression verification image sets and multiple sets of facial expression test image sets to perform model training on a preset attention mechanism-based generative adversarial network to obtain a micro-expression recognition model exclusive to the live broadcaster.
具体地,选择合适的基于注意力机制的生成对抗网络(GAN)架构,例如在DCGAN或WGAN等骨干网络中引入注意力模块,作为微表情识别模型的基础。然后,使用多组面部表情训练图像集来训练GAN的生成器和判别器。生成器负责生成逼真的面部表情图像,而判别器则负责区分真实图像和生成图像。接着,使用面部表情验证图像集来调整和优化模型的参数,如学习率、批次大小。最后,使用面部表情测试图像集来评估模型的最终性能,包括识别准确率、召回率和F1分数等指标。确保模型在各种情绪类别上都具有良好的识别能力,特别是那些在情绪重要级别字典中标记为重要的情绪类别,最终得到直播者专属微表情识别模型。Specifically, a suitable attention-based generative adversarial network (GAN) architecture is selected as the basis of the micro-expression recognition model, for example by introducing attention modules into a DCGAN or WGAN backbone. Then, multiple sets of facial expression training images are used to train the generator and discriminator of the GAN. The generator is responsible for generating realistic facial expression images, while the discriminator is responsible for distinguishing between real images and generated images. Next, the facial expression verification image set is used to adjust and optimize the parameters of the model, such as the learning rate and batch size. Finally, the facial expression test image set is used to evaluate the final performance of the model, including metrics such as recognition accuracy, recall rate and F1 score. Ensure that the model has good recognition capabilities across the various emotion categories, especially those marked as important in the emotion importance level dictionary, and finally obtain the live broadcaster-exclusive micro-expression recognition model.
本发明通过获取情绪类别列表有助于明确需要识别的情绪类型,为后续的图像采集和模型训练提供了明确的方向和目标,确保了模型的针对性和有效性。通过对直播者进行各情绪面部表情图像的采集,建立了与情绪类别一一对应的面部表情图像集,为模型训练提供了丰富的情绪表达样本。通过关键点检测和仿射变换数据增广,增加了训练数据的多样性和丰富性,使得模型更具鲁棒性和泛化能力,提高了模型对不同姿态和光照条件下面部表情的识别准确度。通过遍历直播者面部表情图像集,得到多组增广面部表情图像集,进一步丰富了训练数据的多样性和数量,有助于提高模型的训练效果和泛化能力。通过根据情绪重要级别字典和预设的重要等级图像集划分比例,得到多组训练、验证和测试图像集,能够有效地评估模型的性能和泛化能力,提高了模型的稳定性和可靠性。利用多组图像集对基于注意力机制的生成对抗网络进行模型训练,得到直播者专属微表情识别模型,有助于提高模型对直播者微表情的准确识别能力。The present invention helps to clarify the type of emotion to be identified by obtaining a list of emotion categories, provides a clear direction and goal for subsequent image acquisition and model training, and ensures the pertinence and effectiveness of the model. By collecting facial expression images of each emotion of the live broadcaster, a facial expression image set corresponding to the emotion category is established, providing rich emotion expression samples for model training. Through key point detection and affine transformation data augmentation, the diversity and richness of the training data are increased, making the model more robust and generalizable, and improving the recognition accuracy of the model for facial expressions under different postures and lighting conditions. By traversing the facial expression image set of the live broadcaster, multiple groups of augmented facial expression image sets are obtained, which further enriches the diversity and quantity of the training data and helps to improve the training effect and generalization ability of the model. By dividing the ratio according to the emotional importance level dictionary and the preset importance level image set, multiple groups of training, verification and test image sets are obtained, which can effectively evaluate the performance and generalization ability of the model and improve the stability and reliability of the model. Using multiple groups of image sets to train the generative adversarial network based on the attention mechanism, a live broadcaster-exclusive micro-expression recognition model is obtained, which helps to improve the model's accurate recognition ability of the live broadcaster's micro-expressions.
优选地,S37包括:Preferably, S37 includes:
S371:按照预设的时间分割窗口对直播者微表情时序数据进行表情断点分割,得到直播者微表情片段集,其中直播者微表情片段集中包含若干个直播者微表情片段;S371: segmenting the micro-expression time series data of the live broadcaster by expression breakpoints according to a preset time segmentation window to obtain a micro-expression segment set of the live broadcaster, wherein the micro-expression segment set of the live broadcaster includes a plurality of micro-expression segments of the live broadcaster;
具体地,确定预设的时间分割窗口的大小,例如,选择1分钟或2分钟作为一个窗口的时长。使用如FFmpeg或OpenCV来遍历直播者微表情时序数据。按照时间窗口对视频帧进行分割,确保每个分割出的片段包含连续的视频帧,从而得到直播者微表情片段集,其中直播者微表情片段集中包含若干个直播者微表情片段。Specifically, determine the size of the preset time segmentation window, for example, select 1 minute or 2 minutes as the duration of a window. Use FFmpeg or OpenCV to traverse the micro-expression time series data of the live broadcaster. Segment the video frames according to the time window to ensure that each segmented segment contains continuous video frames, thereby obtaining a live broadcaster micro-expression segment set, wherein the live broadcaster micro-expression segment set contains several live broadcaster micro-expression segments.
S372:对直播者微表情片段集中各个直播者微表情片段进行情绪类别提取,得到直播者情绪类别序列集;S372: extracting emotion categories from each micro-expression segment of the live broadcaster in the micro-expression segment set to obtain a sequence set of emotion categories of the live broadcaster;
具体地,对直播者微表情片段集中各个直播者微表情片段的情绪标注进行统计。使用电子表格软件(如Microsoft Excel)或数据分析软件(如Python的Pandas库)来计算每个直播者微表情片段中情绪类别的出现次数与顺序,从而得到直播者情绪类别序列集。Specifically, the emotion annotations of each live broadcaster micro-expression segment in the micro-expression segment set are counted. Spreadsheet software (such as Microsoft Excel) or data analysis software (such as Python's Pandas library) is used to calculate the number and order of occurrences of emotion categories in each live broadcaster micro-expression segment, thereby obtaining the live broadcaster emotion category sequence set.
S373:利用直播者情绪类别序列集对直播者微表情片段集中各个直播者微表情片段进行情绪标注,得到直播者情绪标注序列集;S373: using the live broadcaster emotion category sequence set to perform emotion annotation on each live broadcaster micro-expression segment in the live broadcaster micro-expression segment set, to obtain a live broadcaster emotion annotation sequence set;
具体地,使用视频分析软件打开直播者微表情片段集,对每个片段进行逐一复查。对于每个片段,根据情绪类别序列集确定其情绪类别,并在片段的元数据中记录这一类别。完成所有片段的情绪标注后,将标注信息整理成直播者情绪标注序列集。Specifically, the live broadcaster micro-expression clip set is opened using video analysis software, and each clip is reviewed one by one. For each clip, its emotion category is determined according to the emotion category sequence set, and this category is recorded in the metadata of the clip. After completing the emotion labeling of all clips, the labeling information is organized into the live broadcaster emotion labeling sequence set.
S374:根据直播者情绪标注序列集对直播者微表情片段集中各个直播者微表情片段进行情绪变化计数,得到直播者情绪变化频率集;S374: counting emotion changes of each micro-expression segment of the live broadcaster in the micro-expression segment set according to the live broadcaster emotion annotation sequence set, and obtaining a frequency set of emotion changes of the live broadcaster;
具体地,遍历直播者微表情片段集,对于序列集中的每个连续片段,根据直播者情绪标注序列集统计出每个直播者微表情片段中直播者的情绪类别的变化次数,从而得到直播者情绪变化频率集;例如,一个直播者微表情片段中直播者的情绪由喜悦到激动,再由激动到惊讶,则确定该片段的直播者情绪变化频率为2次/min。Specifically, the set of live broadcaster micro-expression segments is traversed, and for each continuous segment in the sequence set, the number of times the live broadcaster's emotion category changes within each micro-expression segment is counted according to the live broadcaster emotion annotation sequence set, so as to obtain the live broadcaster emotion change frequency set; for example, if in a micro-expression segment the live broadcaster's emotion changes from joy to excitement, and then from excitement to surprise, the emotion change frequency of that segment is determined to be 2 times/min.
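The count can be expressed as the short sketch below, which registers a change whenever a label differs from the previous one within a segment and divides by the window length (1 minute in the example above).

```python
def emotion_change_frequency(segment_labels, window_minutes=1.0):
    # segment_labels: chronological emotion labels of one micro-expression segment
    changes = sum(1 for prev, cur in zip(segment_labels, segment_labels[1:]) if prev != cur)
    return changes / window_minutes

# e.g. emotion_change_frequency(["joy", "excitement", "surprise"]) -> 2.0 changes/min
```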
S375:对直播者情绪变化频率集进行分布建模,得到情绪丰富度分布数据。S375: Perform distribution modeling on the frequency set of the live broadcaster's emotion changes to obtain emotion richness distribution data.
具体地,将直播者情绪变化频率集按照时间顺序排列,确保每个时间单元或时间段的情绪变化次数都被记录和排序。确定分析的时间窗口大小,例如,可以按照直播的每个小时、每半小时或自定义时间段来划分。在每个时间窗口内,统计该窗口内所有微表情片段的情绪变化次数,得到该时间段的情绪变化频率。分析情绪变化频率在不同时间窗口的分布特征,包括频率的中心趋势、离散程度和偏态。选择泊松分布统计分布模型来拟合情绪变化频率的数据。根据拟合的分布模型,计算情绪丰富度指标,如情绪变化的多样性、频率的波动范围。最终得到情绪丰富度分布数据。Specifically, arrange the frequency set of the live broadcaster's emotion changes in chronological order to ensure that the number of emotion changes in each time unit or time period is recorded and sorted. Determine the size of the time window for analysis. For example, it can be divided according to each hour, half hour or custom time period of the live broadcast. In each time window, count the number of emotion changes in all micro-expression fragments in the window to obtain the frequency of emotion changes in the time period. Analyze the distribution characteristics of the frequency of emotion changes in different time windows, including the central tendency, discreteness and skewness of the frequency. Select the Poisson distribution statistical distribution model to fit the data of the frequency of emotion changes. According to the fitted distribution model, calculate the emotion richness indicators, such as the diversity of emotion changes and the fluctuation range of frequency. Finally, the emotion richness distribution data is obtained.
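A minimal sketch of the Poisson modelling step follows; for a Poisson model the maximum-likelihood rate is simply the mean of the per-window change counts, and the returned fields are an illustrative summary of the richness distribution.

```python
import numpy as np
from scipy.stats import poisson

def model_emotion_richness(change_counts):
    counts = np.asarray(change_counts, dtype=float)
    lam = counts.mean()                                # fitted Poisson rate (changes per window)
    ks = np.arange(0, int(counts.max()) + 1)
    return {"lambda": lam,
            "spread": counts.std(ddof=1),              # fluctuation range of the frequencies
            "pmf": dict(zip(ks.tolist(), poisson.pmf(ks, lam).tolist()))}
```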
本发明通过按照预设的时间分割窗口对直播者微表情时序数据进行表情断点分割,这可以更好地对微表情进行分析,将长时间的微表情序列切分成多个片段,提高了情绪变化的分辨率和准确性。通过对直播者微表情片段集中各个直播者微表情片段进行情绪类别提取,这有助于从直播者微表情片段中提取出情绪在时间维度上出现的次数。通过情绪标注,能够准确地识别出每个微表情片段所表达的情绪,为情绪变化频率的计算提供了准确的依据。通过根据情绪标注序列集对微表情片段集中各个微表情片段进行情绪变化计数,这有助于了解直播者情绪的变化规律和频率。通过对情绪变化频率集进行分布建模,这可以量化直播者情绪的变化情况,为直播者情绪的综合评估提供了准确的数据依据,有助于了解直播者的情绪表达特点和情绪变化规律。The present invention performs expression breakpoint segmentation on the micro-expression time series data of the live broadcaster according to a preset time segmentation window, which can better analyze the micro-expressions, divide the long micro-expression sequence into multiple segments, and improve the resolution and accuracy of emotional changes. By extracting the emotion category of each live broadcaster micro-expression segment in the live broadcaster micro-expression segment set, it is helpful to extract the number of times the emotion appears in the time dimension from the live broadcaster micro-expression segment. Through emotion annotation, the emotion expressed by each micro-expression segment can be accurately identified, which provides an accurate basis for the calculation of the frequency of emotional changes. By counting the emotional changes of each micro-expression segment in the micro-expression segment set according to the emotion annotation sequence set, it is helpful to understand the changing rules and frequency of the live broadcaster's emotions. By performing distribution modeling on the emotion change frequency set, it is possible to quantify the changes in the emotions of the live broadcaster, provide an accurate data basis for the comprehensive evaluation of the emotions of the live broadcaster, and help understand the emotional expression characteristics and emotional change rules of the live broadcaster.
优选地,弹幕分析模块S4包括:Preferably, the bullet comment analysis module S4 includes:
S41:对弹幕交互数据进行弹幕时间戳归一化处理,得到归一化弹幕时间戳数据;S41: performing a normalization process on the bullet chat interaction data to obtain normalized bullet chat timestamp data;
具体地,例如,如果弹幕交互数据的原始时间戳包括年月日时分秒,而分析只需要到分钟或秒级,使用日期时间处理库,如Python的datetime模块,来转换时间戳到所需的格式。首先,解析原始时间戳数据到datetime对象,然后根据需要提取出分钟或秒的数值,作为归一化后的时间戳,从而得到归一化弹幕时间戳数据。Specifically, for example, if the original timestamp of the barrage interaction data includes year, month, day, hour, minute, and second, and the analysis only needs to be done to the minute or second level, use a date and time processing library, such as Python's datetime module, to convert the timestamp to the required format. First, parse the original timestamp data into a datetime object, and then extract the minute or second value as needed as the normalized timestamp to obtain the normalized barrage timestamp data.
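One concrete form of this normalization is sketched below, assuming raw stamps such as "2024-07-03 20:15:42"; each timestamp is reduced to seconds elapsed since the start of the broadcast.

```python
from datetime import datetime

def normalize_timestamps(raw_timestamps, fmt="%Y-%m-%d %H:%M:%S"):
    parsed = [datetime.strptime(t, fmt) for t in raw_timestamps]
    start = min(parsed)                                   # broadcast start as reference point
    return [(t - start).total_seconds() for t in parsed]  # normalized timestamps in seconds
```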
S42:根据归一化弹幕时间戳数据与弹幕交互数据对直播间进行平均弹幕频率计算,得到平均弹幕频率数据;S42: Calculate the average barrage frequency of the live broadcast room according to the normalized barrage timestamp data and the barrage interaction data to obtain average barrage frequency data;
具体地,例如,在Python中,使用Pandas库来处理归一化弹幕时间戳数据,通过resample方法对数据进行重采样,然后计算每个时间间隔(如1分钟)的弹幕数量,最后取平均值得到平均弹幕频率。Specifically, for example, in Python, the Pandas library is used to process the normalized barrage timestamp data, the data is resampled through the resample method, and then the number of barrages in each time interval (such as 1 minute) is calculated, and finally the average is taken to obtain the average barrage frequency.
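The resampling described here can be written compactly as follows: the danmaku timestamps are binned into 1-minute intervals and the per-bin counts are averaged.

```python
import pandas as pd

def average_danmaku_frequency(timestamps):
    s = pd.Series(1, index=pd.to_datetime(timestamps))
    per_minute = s.resample("1min").count()   # danmaku messages in each minute
    return per_minute.mean()                  # average danmaku frequency
```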
S43:根据归一化弹幕时间戳数据与弹幕交互数据对直播间进行弹幕频率序列绘制,得到弹幕频率序列折线图;S43: Draw a barrage frequency sequence for the live broadcast room according to the normalized barrage timestamp data and the barrage interaction data to obtain a barrage frequency sequence line graph;
具体地,在编程环境Python中,根据归一化弹幕时间戳和弹幕数量数据创建数据对。然后,使用Python的Matplotlib库,来绘制折线图。在Python中,使用Matplotlib的plot函数来绘制时间序列数据,并使用show函数展示图表。折线图的X轴表示时间(归一化时间戳),Y轴表示弹幕数量,通过观察折线图的波动,可以了解弹幕活动的密集程度和变化趋势。从而得到弹幕频率序列折线图。Specifically, in the programming environment Python, create a data pair based on the normalized bullet chat timestamp and the bullet chat number data. Then, use Python's Matplotlib library to draw a line chart. In Python, use Matplotlib's plot function to plot time series data, and use the show function to display the chart. The X-axis of the line chart represents time (normalized timestamp), and the Y-axis represents the number of bullet chats. By observing the fluctuations of the line chart, you can understand the density and changing trend of the bullet chat activity. Thus, a bullet chat frequency series line chart is obtained.
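A minimal Matplotlib sketch of this chart is given below; x holds the normalized timestamps of the bins and y the danmaku count per bin, both assumed to come from the previous step.

```python
import matplotlib.pyplot as plt

def plot_danmaku_frequency(x, y, out_path="danmaku_frequency.png"):
    plt.figure(figsize=(10, 4))
    plt.plot(x, y)                      # line chart: time vs. danmaku count
    plt.xlabel("normalized time")
    plt.ylabel("danmaku count")
    plt.title("Danmaku frequency sequence")
    plt.savefig(out_path)               # plt.show() can be used for interactive review
    plt.close()
```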
S44:基于弹幕频率序列折线图按照预设的时间窗口尺度对直播间进行弹幕频率峰值跳变统计,得到弹幕峰值跳变频率数据;S44: Based on the barrage frequency sequence line graph, the barrage frequency peak jump statistics of the live broadcast room are performed according to the preset time window scale to obtain the barrage peak jump frequency data;
具体地,定义“峰值跳变”的标准,例如,设定为弹幕数量的突然增加超过某个百分比或绝对值阈值。使用图表分析工具,如Excel或Python的Matplotlib,对弹幕频率序列折线图进行详细审查。根据预设的时间窗口尺度(如每分钟或每5分钟),检查每个窗口内弹幕数量的峰值。确定峰值跳变的事件,设置一个阈值,比如弹幕数量比前一时间窗口增加50%以上。记录每次峰值跳变的发生时间和弹幕数量,以及该事件在直播过程中的频率,从而得到弹幕峰值跳变频率数据。Specifically, define the criteria for "peak jump", for example, set it as a sudden increase in the number of barrages exceeding a certain percentage or absolute value threshold. Use chart analysis tools, such as Excel or Python's Matplotlib, to conduct a detailed review of the barrage frequency series line chart. According to the preset time window scale (such as every minute or every 5 minutes), check the peak number of barrages in each window. Determine the event of peak jump and set a threshold, such as the number of barrages increasing by more than 50% compared with the previous time window. Record the time and number of barrages of each peak jump, as well as the frequency of the event during the live broadcast, so as to obtain the barrage peak jump frequency data.
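The sketch below shows one way to implement this statistic: a jump is flagged when a window's count exceeds the previous window by more than the given ratio (50% in the example), and the jump frequency assumes equally sized windows.

```python
def peak_jump_frequency(window_counts, ratio=0.5):
    jumps = [i for i in range(1, len(window_counts))
             if window_counts[i - 1] > 0
             and (window_counts[i] - window_counts[i - 1]) / window_counts[i - 1] > ratio]
    return jumps, len(jumps) / max(len(window_counts), 1)   # jump positions, jumps per window
```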
S45:对弹幕频率序列折线图进行峰值弹幕量与对应时间点提取,得到弹幕频率峰值序列与弹幕峰值时间戳序列;S45: extracting the peak value of the bullet screen and the corresponding time point from the bullet screen frequency sequence line graph to obtain a bullet screen frequency peak sequence and a bullet screen peak timestamp sequence;
具体地,使用NumPy和Matplotlib库来识别弹幕频率序列折线图中的峰值点,这些点代表弹幕数量的局部最大值。为每个峰值点记录弹幕的数量和对应的时间戳。创建两个序列:一个包含所有峰值弹幕的数量,另一个包含对应的时间戳。最后,得到弹幕频率峰值序列与弹幕峰值时间戳序列。Specifically, we use NumPy and Matplotlib libraries to identify the peak points in the barrage frequency series line chart, which represent the local maximum of the number of barrages. We record the number of barrages and the corresponding timestamp for each peak point. We create two sequences: one containing the number of all peak barrages and the other containing the corresponding timestamps. Finally, we get the barrage frequency peak sequence and the barrage peak timestamp sequence.
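A minimal sketch of the extraction is shown below, using scipy.signal.find_peaks as a common companion to the NumPy/Matplotlib workflow named in the text.

```python
import numpy as np
from scipy.signal import find_peaks

def extract_peaks(timestamps, counts):
    counts = np.asarray(counts)
    idx, _ = find_peaks(counts)                    # indices of local maxima
    peak_values = counts[idx].tolist()             # danmaku frequency peak sequence
    peak_times = [timestamps[i] for i in idx]      # danmaku peak timestamp sequence
    return peak_values, peak_times
```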
S46:根据平均弹幕频率数据与弹幕频率峰值序列进行弹幕峰值波动比计算,得到弹幕峰值波动比分布数据;S46: Calculate the peak fluctuation ratio of the bullet screen according to the average bullet screen frequency data and the peak sequence of the bullet screen frequency to obtain distribution data of the peak fluctuation ratio of the bullet screen;
具体地,使用平均弹幕频率数据作为基准。对于每个峰值弹幕量,计算其与平均弹幕频率的比值,这个比值即为波动比。利用Python中的Pandas库,创建一个数据框(DataFrame),将平均弹幕频率数据和弹幕频率峰值序列作为数据框的列。使用Python中的Pandas库对数据框进行操作,通过简单的除法计算每个峰值的波动比,最终得到弹幕峰值波动比分布数据。Specifically, the average barrage frequency data is used as a benchmark. For each peak barrage volume, calculate its ratio to the average barrage frequency, and this ratio is the fluctuation ratio. Using the Pandas library in Python, create a data frame (DataFrame), and use the average barrage frequency data and the barrage frequency peak sequence as columns of the data frame. Use the Pandas library in Python to operate the data frame, calculate the fluctuation ratio of each peak value through simple division, and finally obtain the barrage peak fluctuation ratio distribution data.
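Expressed in Pandas, the ratio calculation reduces to the short sketch below, where each peak volume is divided by the average frequency from the earlier step.

```python
import pandas as pd

def peak_fluctuation_ratios(peak_values, average_frequency):
    df = pd.DataFrame({"peak": peak_values})
    df["fluctuation_ratio"] = df["peak"] / average_frequency
    return df    # the 'fluctuation_ratio' column forms the distribution data
```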
S47:对弹幕交互数据进行观众提问问题识别,得到观众提问问题数据集;S47: Identify audience questions on the bullet screen interaction data to obtain an audience question dataset;
具体地,使用文本分析工具或自然语言处理(NLP)库,如Python的NLTK或spaCy,来处理弹幕交互数据中的弹幕文本数据。定义问题的识别规则来识别提问。问题的识别规则包括关键词识别(如“什么”、“怎么”、“为什么”)。应用定义的规则对每条弹幕进行分析,标记出被识别为问题的弹幕。将识别出的提问问题数据收集起来,形成一个观众提问问题数据集。Specifically, use text analysis tools or natural language processing (NLP) libraries, such as Python's NLTK or spaCy, to process the bullet text data in the bullet interaction data. Define question recognition rules to identify questions. Question recognition rules include keyword recognition (such as "what", "how", "why"). Apply the defined rules to analyze each bullet, and mark the bullets that are identified as questions. Collect the identified question data to form a dataset of audience questions.
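A minimal sketch of the rule-based detection is given below; the keyword tuple is a small illustrative subset, not the full rule set of this embodiment.

```python
QUESTION_KEYWORDS = ("什么", "怎么", "为什么", "吗", "?", "？")

def extract_questions(danmaku_texts):
    # keep every danmaku message that matches at least one question keyword
    return [text for text in danmaku_texts
            if any(keyword in text for keyword in QUESTION_KEYWORDS)]
```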
S48:对净人声音频数据进行自动语音识别,得到直播者文本转录数据,并根据直播者文本转录数据与观众提问问题数据集对直播间进行互动回复率计算,得到弹幕互动回复率数据。S48: Automatically perform speech recognition on the clean human voice audio data to obtain the live broadcaster's text transcription data, and calculate the interactive response rate of the live broadcast room based on the live broadcaster's text transcription data and the audience question data set to obtain the barrage interactive response rate data.
具体地,使用Google Speech-to-Text API对直播者的净人声音频数据进行转录,得到直播者文本转录数据。将直播者文本转录数据与观众提问问题数据集进行匹配。这可以通过查找直播者文本中是否直接回答了观众提问的问题来实现。对于每个识别出的提问,检查直播者的后续发言中是否有相应的回答。设置一个时间阈值,如在提问后一分钟内的回答被认为是有效的互动回复。记录有效的互动回复数量,即直播者实际回答的观众问题数量。计算弹幕互动回复率,通过将有效互动回复的数量除以总提问数量来实现。例如,在Excel中,可以使用公式:弹幕互动回复率数据=有效回复数量÷提问总数。Specifically, the Google Speech-to-Text API is used to transcribe the live broadcaster's net human voice audio data to obtain the live broadcaster's text transcription data. The live broadcaster's text transcription data is matched with the audience question data set. This can be achieved by finding whether the live broadcaster's text directly answers the audience's questions. For each identified question, check whether there is a corresponding answer in the live broadcaster's subsequent speech. Set a time threshold, such as answers within one minute after the question is considered a valid interactive response. Record the number of valid interactive responses, that is, the number of audience questions actually answered by the live broadcaster. Calculate the barrage interactive response rate by dividing the number of valid interactive responses by the total number of questions. For example, in Excel, you can use the formula: Barrage interactive response rate data = number of valid responses ÷ total number of questions.
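The reply-rate computation can be sketched as follows. Questions and transcript segments are assumed to be (timestamp_seconds, text) pairs, and a question counts as answered if a transcript segment within the time threshold shares enough characters with it; the character-overlap test is a simple stand-in for the matching step, not the method fixed by the text.

```python
def interaction_reply_rate(questions, transcript_segments, window_s=60.0, min_overlap=2):
    answered = 0
    for q_time, q_text in questions:
        q_chars = set(q_text)
        for s_time, s_text in transcript_segments:
            if 0 <= s_time - q_time <= window_s and len(q_chars & set(s_text)) >= min_overlap:
                answered += 1
                break
    return answered / len(questions) if questions else 0.0   # valid replies ÷ total questions
```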
本发明通过归一化处理,使得弹幕数据的时间戳具有统一的尺度,有助于提高数据的可比性和准确性。通过计算得到平均弹幕频率数据,可以帮助了解直播间整体的弹幕互动情况。通过绘制弹幕频率序列的折线图,直观展示了弹幕的变化趋势,有助于发现弹幕活跃度的高低变化规律。通过统计弹幕频率在预设时间窗口内的峰值跳变情况,有助于发现直播间弹幕活跃度的突变情况,提高了对直播互动活跃度变化的感知能力。通过提取弹幕频率的峰值量和对应的时间点,有助于了解弹幕互动的高峰时刻。通过根据平均弹幕频率数据和弹幕频率峰值序列计算弹幕峰值波动比,有助于了解弹幕活跃度的波动情况,为评估直播互动质量提供了定量数据支持。通过识别观众的提问问题,可以了解直播间观众的关注点和疑问。通过自动语音识别得到直播者文本转录数据,并结合观众提问问题数据集计算互动回复率,可以评估主播对观众提问问题的回复效率和质量,为改善直播互动体验提供了指导。The present invention uses normalization processing to make the timestamp of the barrage data have a unified scale, which helps to improve the comparability and accuracy of the data. By calculating the average barrage frequency data, it can help understand the overall barrage interaction in the live broadcast room. By drawing a line graph of the barrage frequency sequence, the changing trend of the barrage is intuitively displayed, which helps to find the high and low changing rules of the barrage activity. By counting the peak jump of the barrage frequency in the preset time window, it is helpful to find the sudden change of the barrage activity in the live broadcast room, and improve the perception of the change of the live interactive activity. By extracting the peak value of the barrage frequency and the corresponding time point, it is helpful to understand the peak moment of the barrage interaction. By calculating the barrage peak fluctuation ratio according to the average barrage frequency data and the barrage frequency peak sequence, it is helpful to understand the fluctuation of the barrage activity, and provide quantitative data support for evaluating the quality of live interactive. By identifying the audience's questions, the focus and questions of the audience in the live broadcast room can be understood. By obtaining the live broadcaster's text transcription data through automatic speech recognition, and calculating the interactive response rate in combination with the audience's question data set, the efficiency and quality of the host's response to the audience's questions can be evaluated, providing guidance for improving the live interactive experience.
优选地,直播类型判定模块S5包括:Preferably, the live broadcast type determination module S5 includes:
S51:对直播者音频数据进行语音分割与语音增强,得到增强直播者语音片段序列,并对增强直播者语音片段序列进行自动语音识别,得到直播语音文本段序列;S51: performing speech segmentation and speech enhancement on the live broadcaster audio data to obtain a sequence of enhanced live broadcaster speech segments, and performing automatic speech recognition on the sequence of enhanced live broadcaster speech segments to obtain a sequence of live broadcaster speech text segments;
具体地,使用Adobe Audition对直播者音频数据进行噪声降低和回声消除。应用Praat或Python的librosa库,执行语音分割,将直播者音频数据中非语音部分(如背景音乐或长时间的静默)从音频中移除,从而得到增强直播者语音片段序列。使用Google Cloud Speech-to-Text对增强直播者语音片段序列进行识别,将语音转换为文本数据,从而得到直播语音文本段序列。Specifically, Adobe Audition is used to reduce noise and cancel echoes in the live broadcaster's audio data. Praat or Python's librosa library is applied to perform speech segmentation, removing the non-speech parts of the live broadcaster's audio data (such as background music or long periods of silence), thereby obtaining the enhanced live broadcaster speech segment sequence. Google Cloud Speech-to-Text is used to recognize the enhanced live broadcaster speech segment sequence and convert the speech into text data, thereby obtaining the live broadcast speech text segment sequence.
S52:对直播语音文本段序列进行主题分类,得到直播者语音主题标签序列;S52: performing topic classification on the live broadcast voice text segment sequence to obtain a live broadcaster voice topic label sequence;
具体地,应用主题建模技术,如Latent Dirichlet Allocation (LDA)或Non-negative Matrix Factorization (NMF),来识别直播语音文本段序列中的主要主题。使用支持向量机(SVM)或随机森林,训练模型对直播语音文本段进行分类,根据训练结果为每个文本段分配主题标签。收集所有文本段的主题标签,形成直播者语音主题标签序列。Specifically, topic modeling techniques, such as Latent Dirichlet Allocation (LDA) or Non-negative Matrix Factorization (NMF), are applied to identify the main topics in the live speech text segment sequence. Using support vector machine (SVM) or random forest, the model is trained to classify the live speech text segments, and a topic label is assigned to each text segment based on the training results. The topic labels of all text segments are collected to form a sequence of live broadcaster speech topic labels.
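A minimal scikit-learn sketch of the LDA labelling is shown below; the topic count is an assumed value, and the text segments are assumed to be pre-tokenised (for example with jieba) and whitespace separated so that CountVectorizer can build the vocabulary.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

def speech_topic_labels(text_segments, n_topics=5):
    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(text_segments)
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)
    doc_topics = lda.fit_transform(X)        # per-segment topic distribution
    return doc_topics.argmax(axis=1)         # dominant topic index per text segment
```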
S53:对直播间视频数据进行视频解码,得到直播间视频帧序列,并对直播间视频帧序列进行视觉场景识别,得到直播场景分类结果序列;S53: performing video decoding on the live broadcast room video data to obtain a live broadcast room video frame sequence, and performing visual scene recognition on the live broadcast room video frame sequence to obtain a live broadcast scene classification result sequence;
具体地,使用FFmpeg对直播间视频数据进行解码,提取视频帧序列,得到直播间视频帧序列。准备一个标注好的场景数据集,包含不同类型的直播内容场景,如演绎类、互动类、商品展示。使用TensorFlow或PyTorch,训练一个场景识别模型。这涉及使用卷积神经网络(CNN)来学习视频帧中的特征和模式。将训练好的模型应用于直播间视频帧序列,进行实时或批量的场景识别。对于每个视频帧,模型将输出一个场景类别标签,如“商品展示”或“讲解类内容”。将识别结果按视频帧的顺序排列,构建直播场景分类结果序列。这可以通过编程实现,将每个帧的识别结果与时间戳关联,形成有序的数据集,最终得到直播场景分类结果序列。Specifically, use FFmpeg to decode the live video data, extract the video frame sequence, and obtain the live video frame sequence. Prepare a labeled scene dataset containing different types of live content scenes, such as interpretation, interaction, and product display. Use TensorFlow or PyTorch to train a scene recognition model. This involves using a convolutional neural network (CNN) to learn features and patterns in video frames. Apply the trained model to the live video frame sequence for real-time or batch scene recognition. For each video frame, the model will output a scene category label, such as "product display" or "explanatory content." Arrange the recognition results in the order of the video frames to construct a sequence of live scene classification results. This can be achieved through programming, associating the recognition results of each frame with a timestamp to form an ordered dataset, and finally obtaining a sequence of live scene classification results.
S54:基于直播者语音主题标签序列与直播场景分类结果序列对直播间进行多模态内容融合分析,得到直播内容标签序列;S54: performing a multimodal content fusion analysis on the live broadcast room based on the live broadcaster voice topic label sequence and the live broadcast scene classification result sequence to obtain a live broadcast content label sequence;
具体地,将直播者语音主题标签序列与直播场景分类结果序列进行时间同步对齐,通过时间戳或帧对齐的方法,例如,第10分钟到第15分钟的语音主题标签是“产品介绍”,而同一时间段的场景分类结果是“产品展示”。然后,使用多模态深度学习模型对对齐后的数据进行融合分析,例如,通过结合“产品介绍”的语音主题标签和“产品展示”的场景分类结果,可以生成更具描述性的直播内容标签,如“详细产品介绍”、“特写产品展示”。最终,根据融合分析的结果,通过规则引擎或分类器(如决策树、支持向量机)生成最终的直播内容标签序列,例如包括“详细产品介绍”、“用户互动问答”、“促销特写镜头”。Specifically, the live broadcaster's voice topic label sequence is time-synchronized with the live broadcast scene classification result sequence, through the method of timestamp or frame alignment. For example, the voice topic label from the 10th minute to the 15th minute is "product introduction", while the scene classification result of the same time period is "product display". Then, a multimodal deep learning model is used to perform fusion analysis on the aligned data. For example, by combining the voice topic label of "product introduction" and the scene classification result of "product display", more descriptive live broadcast content labels such as "detailed product introduction" and "close-up product display" can be generated. Finally, according to the results of the fusion analysis, the final live broadcast content label sequence is generated through a rule engine or classifier (such as a decision tree, support vector machine), for example, including "detailed product introduction", "user interactive question and answer", and "promotion close-up".
S55:根据预设的直播内容标签对应规则对直播内容标签序列进行内容类型映射,得到直播内容类型数据集;S55: performing content type mapping on the live content tag sequence according to a preset live content tag corresponding rule to obtain a live content type data set;
具体地,定义一个映射规则表,明确直播内容的标签如何映射到具体的内容类型。例如,标签“商品展示”对应于“销售型内容”直播,而“互动类内容”对应于“互动型内容”直播。使用Python中if-elif-else语句或字典映射来实现映射过程,根据映射规则表将直播内容标签序列转换为具体的内容类型。对于每个直播内容标签,确定其对应的内容类型,并记录在直播内容类型数据集中。Specifically, define a mapping rule table to clarify how live content labels are mapped to specific content types. For example, the label "product display" corresponds to "sales content" live broadcasts, while "interactive content" corresponds to "interactive content" live broadcasts. Use if-elif-else statements or dictionary mappings in Python to implement the mapping process, and convert the live content label sequence into specific content types according to the mapping rule table. For each live content label, determine its corresponding content type and record it in the live content type dataset.
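The dictionary mapping described here can be written as the short sketch below; the table holds the two example pairs from the text plus an assumed fallback category.

```python
LABEL_TO_TYPE = {
    "商品展示": "销售型内容",
    "互动类内容": "互动型内容",
}

def map_content_types(label_sequence, default="其他内容"):
    # map each live content label to its content type, falling back to a default
    return [LABEL_TO_TYPE.get(label, default) for label in label_sequence]
```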
S56:将语言属性分布数据、直播者音域数据、直播者语速数据、直播者情绪占比数据、情绪丰富度分布数据、弹幕峰值跳变频率数据、弹幕峰值波动比分布数据、弹幕互动回复率数据、直播内容类型数据集与预设的直播类型模板库进行对照匹配,得到直播类型数据。S56: Compare and match the language attribute distribution data, the live broadcaster's vocal range data, the live broadcaster's speaking speed data, the live broadcaster's emotion proportion data, the emotion richness distribution data, the barrage peak jump frequency data, the barrage peak fluctuation ratio distribution data, the barrage interaction reply rate data, and the live broadcast content type data set with the preset live broadcast type template library to obtain the live broadcast type data.
具体地,例如,假设语言属性分布数据为引导型话术占比30%,介绍型话术占比20%,推荐型话术占比20%,条理型话术占比10%,专业型话术占比10%,故事型话术占比10%;直播者音域数据为音量中,音高中;直播者语速数据为语速中(120词/分钟);直播者情绪占比数据为中性表情占比40%,思考表情占比30%,惊讶表情占比30%;情绪丰富度分布数据为情绪变化不多于1次/min;弹幕峰值跳变频率数据为2次/min;弹幕峰值波动比分布数据为变化比例不大;弹幕互动回复率数据为未及时回复率为40%;直播内容类型数据集为商品展示50%,商品卖点关键词40%,其他内容10%。根据上述的多维度数据和预设的直播类型模板库中对应的属性参数进行一一对比,得到匹配度最高的为平和型直播,将平和型直播记为直播类型数据。其中,预设的直播类型模板库中包括氛围型直播、复合型直播、专业型直播、即兴型直播、平和型直播、讲解型直播、表演型直播。Specifically, for example, assuming that the language attribute distribution data is that the guiding speech accounts for 30%, the introduction speech accounts for 20%, the recommendation speech accounts for 20%, the logical speech accounts for 10%, the professional speech accounts for 10%, and the storytelling speech accounts for 10%; the live broadcaster's voice range data is medium volume and medium pitch; the live broadcaster's speech speed data is medium speech speed (120 words/minute); the live broadcaster's emotional proportion data is that the neutral expression accounts for 40%, the thinking expression accounts for 30%, and the surprised expression accounts for 30%; the emotional richness distribution data is that the emotional change is no more than 1 time/min; the barrage peak jump frequency data is 2 times/min; the barrage peak fluctuation ratio distribution data is that the change ratio is not large; the barrage interaction reply rate data is that the untimely reply rate is 40%; the live broadcast content type data set is 50% for product display, 40% for product selling point keywords, and 10% for other content. According to the above multi-dimensional data and the corresponding attribute parameters in the preset live broadcast type template library, a one-to-one comparison is made, and the highest matching degree is obtained as the peaceful type live broadcast, and the peaceful type live broadcast is recorded as the live broadcast type data. Among them, the preset live broadcast type template library includes atmosphere-type live broadcast, compound live broadcast, professional live broadcast, impromptu live broadcast, peaceful live broadcast, explanation-type live broadcast, and performance-type live broadcast.
本发明通过对直播者音频数据进行语音分割与增强,有助于提取出清晰的直播者语音片段序列,为后续的语音识别提供良好的输入条件,提高了语音识别的准确性。通过对直播语音文本段序列进行主题分类,可以了解直播者在直播过程中主要讨论的话题或内容,为直播类型的判断提供重要依据,增强了对直播内容的理解和把握。通过对直播间视频数据进行视觉场景识别,能够识别出直播场景的类别,提高对直播内容的多维度理解能力。通过将直播者语音主题标签序列与直播场景分类结果序列进行融合分析,可以综合考虑语音内容和视觉内容,为直播内容的综合分析提供更为全面的数据支持。根据预设的直播内容标签对应规则对直播内容标签序列进行内容类型映射,有助于将各项分析结果转化为具体的直播类型,提高了对直播内容类型的准确判断。通过将各项分析结果与预设的直播类型模板库进行对照匹配,可以得到直播类型数据,为最终的直播类型判断提供了定量化的依据,提高了直播类型判定的准确性和可靠性。The present invention helps to extract a clear sequence of live broadcaster voice segments by performing voice segmentation and enhancement on the live broadcaster audio data, provides good input conditions for subsequent voice recognition, and improves the accuracy of voice recognition. By subject-classifying the live broadcast voice text segment sequence, the topics or contents mainly discussed by the live broadcaster during the live broadcast process can be understood, which provides an important basis for judging the live broadcast type and enhances the understanding and grasp of the live broadcast content. By performing visual scene recognition on the live broadcast room video data, the category of the live broadcast scene can be identified, and the multi-dimensional understanding ability of the live broadcast content can be improved. By fusing and analyzing the live broadcaster voice topic label sequence with the live broadcast scene classification result sequence, the voice content and visual content can be comprehensively considered, and more comprehensive data support can be provided for the comprehensive analysis of the live broadcast content. According to the preset live broadcast content label corresponding rules, the live broadcast content label sequence is mapped to the content type, which helps to convert the analysis results into specific live broadcast types and improves the accurate judgment of the live broadcast content type. By comparing and matching the analysis results with the preset live broadcast type template library, the live broadcast type data can be obtained, which provides a quantitative basis for the final live broadcast type judgment and improves the accuracy and reliability of the live broadcast type judgment.
因此,无论从哪一点来看,均应将实施例看作是示范性的,而且是非限制性的,本发明的范围由所附权利要求而不是上述说明限定,因此旨在将落在申请文件的等同要件的含义和范围内的所有变化涵括在本发明内。Therefore, the embodiments should be regarded as illustrative and non-restrictive from all points, and the scope of the present invention is limited by the appended claims rather than the above description, and it is intended that all changes falling within the meaning and range of equivalent elements of the application documents are included in the present invention.
以上所述仅是本发明的具体实施方式,使本领域技术人员能够理解或实现本发明。对这些实施例的多种修改对本领域的技术人员来说将是显而易见的,本文中所定义的一般原理可以在不脱离本发明的精神或范围的情况下,在其它实施例中实现。因此,本发明将不会被限制于本文所示的这些实施例,而是要符合与本文所发明的原理和新颖特点相一致的最宽的范围。The above description is only a specific embodiment of the present invention, so that those skilled in the art can understand or implement the present invention. Various modifications to these embodiments will be apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of the present invention. Therefore, the present invention will not be limited to the embodiments shown herein, but should conform to the widest scope consistent with the principles and novel features invented herein.