CN118503512A

Info

Publication number: CN118503512A
Application number: CN202410512695.5A
Authority: CN
Inventors: 孟令伍; 贺成龙; 刘光迟; 顾学海; 丁灿; 尹晓阳; 李惠柯; 刘佳林; 刘蛰
Original assignee: Nanjing Laiwangxin Technology Research Institute Co ltd; Southeast University
Current assignee: Nanjing Laiwangxin Technology Research Institute Co ltd; Southeast University
Priority date: 2024-04-26
Filing date: 2024-04-26
Publication date: 2024-08-16

Abstract

The invention provides an Elasticsearch retrieval optimization system for large-scale online public opinion, comprising a data aggregation module, an optimization mechanism and a retrieval service module, wherein: the data aggregation module preprocesses multimodal online public opinion data, sends the resulting intermediate data to the distributed message bus Kafka, and finally persists it in the Elasticsearch distributed retrieval engine; the optimization mechanism comprises: constructing text semantic vectors with the deep learning model SBert for semantic retrieval; converting the text and pictures in the multimodal online public opinion data into text vectors and picture vectors with the CLIP multimodal contrastive learning model for vector retrieval; and optimizing the retrieval performance of the Elasticsearch distributed retrieval engine with a sharding optimization strategy; the retrieval service module uses a Boolean combination querier to perform multimodal retrieval based on the text semantic vectors, text vectors and picture vectors constructed in the optimization mechanism.

Description

Translated from Chinese

An Elasticsearch retrieval optimization system for large-scale online public opinion

Technical Field

The present invention relates to a retrieval optimization system, in particular to an Elasticsearch retrieval optimization system for large-scale online public opinion.

Background Art

In recent years, social platforms have spread and developed rapidly. Users generate large amounts of social media content online, including text, images and video, which has made these platforms the Internet service with the widest user coverage, very large information volume, fast dissemination, broad social influence and great development potential. The public opinion data formed on social networks contains a huge amount of information that urgently needs to be analyzed and mined; it consists mostly of unstructured text, accompanied by rich modal information such as pictures and videos. To support fast, targeted analysis, the accuracy and speed of the search engine become crucial. Traditional databases are limited in the full-text index storage structures they offer for massive data, and Elasticsearch (ES for short), an open-source distributed full-text retrieval engine, emerged in response. It currently ranks first in the DB search engine rankings and is widely used in finance, shopping, social networking and other fields.

However, the following problems remain when Elasticsearch is used to retrieve public opinion data: 1) how to set the number of index shards; 2) how to place the shards on the cluster nodes; 3) how to balance the load. Engineering practice shows that public opinion big data shares four characteristics: first, the data scale is extremely large and keeps growing; second, unstructured data accounts for more than 90% of the whole data set; third, queries must combine multiple conditions across platforms; fourth, for time-sensitive feature data, the data QoS decays over time.

Summary of the Invention

Purpose of the invention: the technical problem to be solved by the present invention is to provide, in view of the deficiencies of the prior art, an Elasticsearch retrieval optimization system for large-scale online public opinion.

To solve the above technical problem, the present invention discloses an Elasticsearch retrieval optimization system for large-scale online public opinion, comprising a data aggregation module, an optimization mechanism and a retrieval service module, wherein:

The data aggregation module is used to preprocess the multimodal online public opinion data, send the resulting intermediate data to the distributed message bus Kafka, and finally persist it in the Elasticsearch distributed retrieval engine;

The optimization mechanism includes: constructing text semantic vectors based on the deep learning model SBert for semantic retrieval; converting the text and images in the multimodal online public opinion data into text vectors and image vectors based on the CLIP multimodal contrastive learning model for vector retrieval; and optimizing the retrieval performance of the Elasticsearch distributed retrieval engine with a sharding optimization strategy;

The retrieval service module is used to perform multimodal retrieval with a Boolean combination querier, based on the text semantic vectors, text vectors and image vectors constructed in the optimization mechanism.

Further, the multimodal online public opinion data in the data aggregation module is data from different source media, from different media platforms and in different formats, wherein the source media at least include: databases, files and interface-based access.

Further, the preprocessing in the data aggregation module at least includes:

Loading, extracting and transforming the multimodal online public opinion data with the distributed stream computing framework Flink;

Associating the multimodal online public opinion data through MD5 to obtain the intermediate data, wherein image data and video data are downloaded and then converted to Base64 for storage.

Further, the optimization mechanism specifically includes the following steps:

Step 1: construct text semantic vectors based on the deep learning model SBert, specifically including:

Using the SBert deep learning model to build vectors of a preset dimension from the fields of the text data in the multimodal online public opinion data, and storing them in the database;

Step 2: convert the text and images in the multimodal online public opinion data into text vectors and image vectors based on the CLIP multimodal contrastive learning model, specifically including:

Encoding the images in the multimodal online public opinion data into vectors of a preset dimension with the pre-trained CLIP multimodal contrastive learning model, and storing them in the database;

Step 3: optimize the retrieval performance of the Elasticsearch distributed retrieval engine with a sharding optimization strategy, wherein the sharding optimization strategy includes:

Step 3-1: establish an index hot-cold separation mechanism;

Step 3-2: determine the number of index shards;

Step 3-3: set the index shard placement strategy.

Further, establishing the index hot-cold separation mechanism in step 3-1 specifically includes:

Step 3-1-1: create index types according to the metadata information in the database;

Step 3-1-2: use a full-text search field type for the text data in the multimodal online public opinion data and build a word segmentation index;

Step 3-1-3: combine the word segmentation index with the other field indexes into an index template;

Step 3-1-4: build indexes by preset time period according to the business data volume of the Elasticsearch distributed retrieval engine;

Step 3-1-5: set the index hot-cold mechanism according to the business scenario of the Elasticsearch distributed retrieval engine, as follows:

According to the preset QoS data service requirements, place the most recent indexes on hot-data (high-configuration) nodes and the remaining historical data on cold-data (low-configuration) nodes;

Step 3-1-6: build aliases for the time-segmented indexes according to the business, for cross-index queries;

Step 3-1-7: periodically migrate hot-data shards to cold-data (low-configuration) nodes.

Further, determining the number of index shards in step 3-2 means designing the number of shards of an index according to the volume of data processed by the Elasticsearch distributed retrieval engine and the load of the engine, specifically including:

Step 3-2-1: check the performance of each node of the Elasticsearch distributed retrieval engine; a node that satisfies both of the following conditions is judged to be an available node, and the corresponding position of the node array nodeArr[] is set to 1:

Condition 1: check the disk usage DSi, disk IO usage DSIOi, JVM memory usage MRi and CPU usage CPUi of the current node i; if none of these usage rates exceeds 90%, condition 1 is met;

Condition 2: check the number of existing shards SNi of the current node i; if it is less than the maximum number of shards MNi of node i, condition 2 is met;

Step 3-2-2: calculate the initial number of primary shards shardNum₁, as follows:

where D denotes the index business data volume of the Elasticsearch distributed retrieval engine;

Step 3-2-3: calculate the number N of primary shards placed on each node, as follows:

where N_au denotes the number of available nodes; when N > X, take N = X, where X denotes the maximum number of shards of the same index allowed on node i;

The number of available nodes N_au is calculated as follows:

If, for any node i, the sum of its existing shard count SNi and the primary shard count N is greater than or equal to its maximum shard count MNi, the node is judged unavailable and the corresponding value of the node array nodeArr[] is updated to 0; the available nodes are obtained from the updated nodeArr[] array, that is, the number of available nodes is:

Step 3-2-4: calculate the initial number of primary shards shardNum₂, as follows:

Step 3-2-5: calculate the number of index shards shardNum, as follows:

where k1 and k2 are weight coefficients with k1 + k2 = 1, and θ is the expansion coefficient, which measures how far the index data volume is expected to expand, θ ∈ (0, 1].

Further, setting the index shard placement strategy in step 3-3 specifically includes:

Step 3-3-1: select the performance indicator parameters, including: CPU usage, JVM memory usage and disk IO usage;

Step 3-3-2: monitor the performance indicator parameters of the Elasticsearch distributed retrieval engine, specifically including:

Use the Kibana tool for resource monitoring; with a sampling frequency of f seconds and a collection period of d minutes, compute the average of each performance indicator parameter over each collection period and save the results; predict each performance indicator parameter by double (second-order) exponential smoothing, computing the deviation variance S for different values of the smoothing coefficient α and taking the value of α for which S is smallest;

Step 3-3-3: determine the indicator weights, that is, perform a statistical analysis of the historical performance indicator parameters and determine the indicator weights by the entropy method, as follows:

Step 3-3-3-1: construct the load information decision matrix M, as follows:

where n denotes the number of collection periods, and CAU_n, MAU_n and DAIO_n denote the CPU usage, JVM memory usage and disk IO usage of the n-th collection period respectively;

Step 3-3-3-2: standardize each column of the load information decision matrix M to obtain the decision matrix R, as follows:

where the elements in row t, columns 1, 2 and 3 of R are the standardized values of CAU_t, MAU_t and DAIO_t respectively; each column of R is normalized, that is, the values in each column sum to 1, where j = 1, 2, 3 denotes the index of the indicator parameter;

Step 3-3-3-3: use the entropy formula to calculate the uncertainty of the performance indicator parameters, with E_j denoting the entropy of the j-th performance indicator parameter, as follows:

where the constant K = 1/ln(n);

Step 3-3-3-4: define D_j as the contribution of the j-th performance indicator parameter, as follows:

D_j = 1 - E_j

Step 3-3-3-5: calculate the objective weight values of the performance indicator parameters, as follows:

where WO_j denotes the objective weight value of the j-th performance indicator parameter, and WO₁ + WO₂ + WO₃ = 1;

Step 3-3-4: set the number of shards according to the objective weight values, specifically including:

Step 3-3-4-1: from step 3-3-3, obtain the objective weight values WO₁, WO₂ and WO₃ of the three indicators CPU usage, JVM memory usage and disk IO usage respectively;

Step 3-3-4-2: obtain the processing capacity CA_i of each node from the above objective weight values, as follows:

where the inputs are the CPU usage, JVM memory usage and disk IO usage predicted by double exponential smoothing, and i denotes the i-th node;

Step 3-3-4-3: calculate the proportion of data that should be allocated to each node, which gives the number of shards, as follows:

where DP_i denotes the proportion of data that should be allocated to the i-th node, and m is the number of nodes;

Step 3-3-4-4: if the number of shards available on a node is less than the number of shards calculated in step 3-3-4-3, modify that node's shard limit parameter.
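The formulas for CA_i and DP_i are not reproduced in this text. As a rough sketch only, the Python below assumes that a node's processing capacity is the weighted sum of its predicted idle fractions, CA_i = Σ_j WO_j · (1 − predicted usage_j), and that DP_i = CA_i / Σ_k CA_k. Both forms, the largest-remainder rounding, and all function names are illustrative assumptions rather than the patent's definitions.

```python
from typing import List, Sequence


def node_capacity(weights: Sequence[float], predicted_usage: Sequence[float]) -> float:
    """Assumed form: CA_i = sum_j WO_j * (1 - predicted usage_j), usages in [0, 1]."""
    return sum(w * (1.0 - u) for w, u in zip(weights, predicted_usage))


def data_shares(weights: Sequence[float],
                usage_per_node: Sequence[Sequence[float]]) -> List[float]:
    """Assumed form: DP_i = CA_i / sum_k CA_k over the m nodes."""
    capacities = [node_capacity(weights, usage) for usage in usage_per_node]
    total = sum(capacities)
    if total <= 0:
        raise ValueError("no spare capacity on any node")
    return [c / total for c in capacities]


def allocate_shards(shares: Sequence[float], shard_num: int) -> List[int]:
    """Turn shares into whole shard counts (largest-remainder rounding)."""
    quotas = [s * shard_num for s in shares]
    counts = [int(q) for q in quotas]
    leftover = shard_num - sum(counts)
    # Hand the remaining shards to the nodes with the largest fractional part.
    by_fraction = sorted(range(len(quotas)),
                         key=lambda i: quotas[i] - counts[i], reverse=True)
    for i in by_fraction[:leftover]:
        counts[i] += 1
    return counts
```

Under these assumptions a lightly loaded node receives a larger share than a heavily loaded one, which matches the stated intent that nodes with lower load should hold more shards.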

Further, the retrieval service module specifically includes:

The Boolean combination querier, which corresponds to the Boolean query operator in the Elasticsearch distributed retrieval engine;

The multimodal retrieval, which at least includes: text-to-text search, text-to-image search and image-to-image search.

Further, the preset dimension in step 1 and step 2 is 768.

Further, the storage in step 1 and step 2 uses the dense_vector vector type.

Beneficial effects:

1. The present invention builds sentence semantic vectors with the pre-trained deep model SBert to achieve semantic retrieval, which overcomes the loss of key information caused by traditional keyword-only retrieval and increases recall.

2. The present invention builds text and image vectors with the CLIP multimodal model to achieve text-to-image search and image-to-image search.

3. The present invention applies hot-cold separation, shard count setting and a shard placement strategy at index creation, which balances the cluster load and reduces user response latency and system software and hardware overhead.

4. The present invention designs a Boolean combination querier that supports users' complex business retrieval scenarios.

In summary, compared with a traditional Elasticsearch retrieval system for online public opinion, the present invention, in terms of function, meets the need for complex combined business queries in the field of large-scale online public opinion and provides Boolean combination, text-to-text, text-to-image and image-to-image retrieval; in terms of performance, retrieval recall and latency are considerably improved. Overall it improves the user experience, provides strong and stable data support services, and creates commercial value.

BRIEF DESCRIPTION OF THE DRAWINGS

The present invention is described in further detail below with reference to the drawings and specific embodiments, from which the above and other advantages of the invention will become clearer.

Figure 1 is an architecture diagram of the Elasticsearch retrieval optimization system for large-scale online public opinion.

Figure 2 is a diagram of the Elasticsearch index creation and hot-cold separation mechanism.

Figure 3a shows the CPU usage trend of the Elasticsearch load as monitored by Kibana.

Figure 3b shows the JVM memory usage trend of the Elasticsearch load as monitored by Kibana.

Figure 3c shows the disk IO usage trend of the Elasticsearch load as monitored by Kibana.

Figure 4 is a diagram of the Elasticsearch shard storage architecture.

Figure 5 is a flow chart of the Boolean combination operation.

Figure 6 is a visualization of the Boolean combination retrieval system.

Figure 7 is a visualization of vector retrieval.

Figure 8 is a visualization of sentence semantic retrieval.

Figure 9 is a visualization of text-to-image search.

Figure 10 is a visualization of image-to-image search.

DETAILED DESCRIPTION

The present invention proposes a retrieval optimization strategy for the public opinion domain. First, a Boolean combination operator is proposed to meet the need for cross-platform, multi-condition combined queries over public opinion data; second, an index shard placement model is built from node performance and the estimated business data volume to improve data retrieval performance.

The Elasticsearch cross-modal retrieval optimization system for large-scale online public opinion provided by the present invention improves the recall and response efficiency of online public opinion data retrieval, enables fast and accurate retrieval from massive data, and ultimately supports timely decision-making for enterprises and increases their benefits.

The overall architecture of the present invention is shown in Figure 1 and includes a data aggregation module, an Elasticsearch optimization mechanism and a retrieval module;

The data aggregation module loads, extracts and transforms the multimodal online public opinion data with the distributed stream computing framework Flink, sends the processed intermediate data to the distributed message bus Kafka, and finally persists it in Elasticsearch;

The Elasticsearch optimization mechanism first constructs text semantic vectors based on the deep learning model SBert (Sentence-BERT) to achieve semantic retrieval, overcoming the loss of key information caused by poorly chosen keywords in traditional retrieval; next, it converts text and images into vectors based on the CLIP (Contrastive Language-Image Pre-Training) multimodal contrastive learning model, so that text-to-image and image-to-image search can be performed by vector retrieval; finally, it proposes an Elasticsearch sharding optimization strategy that optimizes retrieval performance through an index hot-cold separation mechanism, determination of the number of index shards, and an index shard placement strategy;

The Elasticsearch retrieval service module provides an Elasticsearch Boolean combination querier, which allows query conditions to be extended flexibly and dynamically and supports the complex conditional query scenarios of online public opinion; based on the constructed vectors, it provides text-to-text, text-to-image and image-to-image search;
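As an illustration of how such a querier might assemble requests, the Python sketch below builds an Elasticsearch `bool` query body from dynamically supplied clauses and wraps a base query in a `script_score` query that ranks by cosine similarity against a `dense_vector` field. It assumes an Elasticsearch version in which `script_score` and `cosineSimilarity` accept the field name as a string; the field names (`content`, `platform`, `image_vector`) and function names are hypothetical and do not come from the patent.

```python
from typing import Any, Dict, List, Optional, Sequence

Clause = Dict[str, Any]


def build_bool_query(must: Optional[List[Clause]] = None,
                     should: Optional[List[Clause]] = None,
                     must_not: Optional[List[Clause]] = None,
                     filters: Optional[List[Clause]] = None) -> Dict[str, Any]:
    """Combine optional clause lists into one bool query; empty lists are omitted."""
    bool_body: Dict[str, Any] = {}
    for key, clauses in (("must", must), ("should", should),
                         ("must_not", must_not), ("filter", filters)):
        if clauses:
            bool_body[key] = clauses
    return {"query": {"bool": bool_body}}


def build_vector_query(query_vector: Sequence[float], vector_field: str,
                       base_query: Optional[Clause] = None) -> Dict[str, Any]:
    """Rank documents matched by base_query by cosine similarity to query_vector."""
    return {
        "query": {
            "script_score": {
                "query": base_query or {"match_all": {}},
                "script": {
                    # +1.0 keeps the score non-negative, as Elasticsearch requires.
                    "source": f"cosineSimilarity(params.query_vector, '{vector_field}') + 1.0",
                    "params": {"query_vector": list(query_vector)},
                },
            }
        }
    }
```

A text-to-image search would pass the CLIP text embedding as `query_vector` with the image vector field, while an image-to-image search would pass an image embedding instead; the `bool` part of a query can be supplied as `base_query` to restrict the candidates first.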

The data aggregation module aggregates data from different source media, different media platforms and different formats into the Kafka message bus and finally persists it in Elasticsearch. The source media include databases, files and interface-based access; the media platforms are mainly mainstream Internet media; public opinion data in different formats is associated through MD5, and pictures and videos are downloaded and converted to Base64 for storage.
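A minimal Python sketch of the MD5 association and Base64 conversion described above. The text does not state what the MD5 digest is computed over; the sketch assumes it is a hash of the platform name and post identifier, so that the text record and its media records share one key. The record layout and function names are hypothetical.

```python
import base64
import hashlib
from typing import Dict, List


def association_key(platform: str, post_id: str) -> str:
    """Assumed key: MD5 hex digest of 'platform:post_id'."""
    return hashlib.md5(f"{platform}:{post_id}".encode("utf-8")).hexdigest()


def build_records(platform: str, post_id: str, text: str,
                  media_blobs: List[bytes]) -> List[Dict[str, str]]:
    """Return one text record plus one record per downloaded image or video."""
    key = association_key(platform, post_id)
    records = [{"md5": key, "type": "text", "content": text}]
    for blob in media_blobs:
        records.append({
            "md5": key,
            "type": "media",
            # Binary content is stored as a Base64 string.
            "data_base64": base64.b64encode(blob).decode("ascii"),
        })
    return records
```

Because every record derived from one post carries the same `md5` value, text and media can be joined again after they have passed through Kafka and been indexed separately.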

The Elasticsearch optimization mechanism performs the following steps:

(1) Building text vectors with SBert

The public opinion text fields are encoded into 768-dimensional vectors with the sbert-base-chinese-nli deep learning model and stored in Elasticsearch using the dense_vector vector type;

(2) CLIP multimodal vector model

The public opinion pictures are encoded into 768-dimensional vectors with the CLIP-ViT-B-32-multilingual-v1 multimodal pre-trained model and stored in Elasticsearch using the dense_vector vector type;
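For illustration, an index mapping with two 768-dimensional `dense_vector` fields could be expressed as the Python dictionary below, ready to be sent as the body of an index-creation request. The field names are hypothetical, and the `ik_max_word` analyzer assumes the IK analysis plugin is installed; neither is specified at this point in the text.

```python
from typing import Any, Dict

VECTOR_DIMS = 768  # dimension stated in the text for both SBert and CLIP vectors


def build_index_mapping(dims: int = VECTOR_DIMS) -> Dict[str, Any]:
    """Mapping sketch: full-text field, two vector fields, Base64 media, timestamp."""
    return {
        "mappings": {
            "properties": {
                "content": {"type": "text", "analyzer": "ik_max_word"},
                "content_vector": {"type": "dense_vector", "dims": dims},
                "image_vector": {"type": "dense_vector", "dims": dims},
                "image_base64": {"type": "binary"},
                "publish_time": {"type": "date"},
            }
        }
    }
```

The vectors written to `content_vector` and `image_vector` must have exactly `dims` elements, so the encoder output size and the mapping have to be kept in step.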

(3) Elasticsearch index sharding strategy

1. Hot-cold separation mechanism, as shown in Figure 2. The specific steps are as follows:

1) Create the index types in ES according to the metadata information;

2) Use a full-text search field type for document-type data and build a word segmentation index;

3) Combine the IK word segmentation index with the other field index types into an index template;

4) Build indexes by day, month or year according to the business data volume;

5) Set the index hot-cold mechanism according to the business scenario: according to the QoS data service requirements, place the most recent indexes on hot-data (high-configuration) nodes and older historical data on cold-data (low-configuration) nodes;

6) Build aliases for the segmented indexes according to the business, to make cross-index queries easier;

7) Periodically migrate shards of historical hot data to cold-data (low-configuration) nodes.
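The steps above can be sketched in Python as helpers that name time-based indexes, choose a hot or cold tier by age, and build the request bodies for the index settings and alias APIs. The sketch relies on Elasticsearch shard allocation filtering (`index.routing.allocation.require.<attribute>`); the node attribute name `box_type`, the 30-day hot window and the index naming scheme are assumptions made for the example.

```python
from datetime import date
from typing import Any, Dict, List

_FORMATS = {"day": "%Y.%m.%d", "month": "%Y.%m", "year": "%Y"}


def index_name(prefix: str, day: date, granularity: str = "month") -> str:
    """Name a time-segmented index, e.g. opinion-2024.04."""
    return f"{prefix}-{day.strftime(_FORMATS[granularity])}"


def tier_settings(tier: str, attribute: str = "box_type") -> Dict[str, str]:
    """Index settings that pin an index to nodes tagged with the given tier."""
    return {f"index.routing.allocation.require.{attribute}": tier}


def plan_tiers(newest_doc_day: Dict[str, date], today: date,
               hot_days: int = 30) -> Dict[str, Dict[str, str]]:
    """Map each index to hot or cold settings from the age of its newest data."""
    plan = {}
    for name, day in newest_doc_day.items():
        tier = "hot" if (today - day).days <= hot_days else "cold"
        plan[name] = tier_settings(tier)
    return plan


def alias_actions(alias: str, indices: List[str]) -> Dict[str, Any]:
    """Body for the aliases API that groups segmented indexes under one alias."""
    return {"actions": [{"add": {"index": name, "alias": alias}} for name in indices]}
```

Re-running `plan_tiers` on a schedule and applying the changed settings is one way to realize the periodic hot-to-cold migration, since Elasticsearch relocates shards when an index's allocation filter changes.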

2. Determining the number of index shards. As shown in Figure 3, Kibana is used to monitor the resources of Elasticsearch, and the final number of shards is determined with the index shard count strategy of the present invention. The parameters for determining the number of index shards are listed in Table 1:

Table 1. Parameters for determining the number of index shards

By default, ES creates 5 primary shards and 1 replica for each index. Once the number of primary shards is set it cannot be changed, and the shards are distributed evenly across the nodes of the ES cluster. If an index holds a large amount of data and has too few shards, retrieval cannot make full use of the available resources; if it has too many shards, much time is spent establishing requests and transferring data, which wastes resources; uneven shard distribution also degrades overall performance. In short, too many shards, too few shards, or an unbalanced cluster load all hurt write and query performance, so the number of shards for an index should be designed according to the data volume and the cluster load.

Before an index is sharded, the performance of each cluster node is checked. A node that satisfies both of the following conditions is provisionally judged to be an available node, and the corresponding position of the node array nodeArr[] is set to 1. Two types of check conditions are used:

Condition 1: check the node's disk usage DSi, disk IO usage DSIOi, JVM memory usage MRi and CPU usage CPUi; the condition is met if none of these usage rates exceeds 90%.

Condition 2: check the node's number of existing shards SNi; the condition is met if it is less than MNi.
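The two check conditions can be written directly as a small Python routine that fills the 0/1 node array. Usage rates are expressed as fractions in [0, 1]; the class and field names are illustrative and mirror the symbols DSi, DSIOi, MRi, CPUi, SNi and MNi.

```python
from dataclasses import dataclass
from typing import List

USAGE_LIMIT = 0.90  # the 90% threshold from condition 1


@dataclass
class NodeStats:
    disk_usage: float      # DSi
    disk_io_usage: float   # DSIOi
    jvm_mem_usage: float   # MRi
    cpu_usage: float       # CPUi
    shard_count: int       # SNi, shards already on the node
    max_shards: int        # MNi, shard limit of the node


def is_available(node: NodeStats, limit: float = USAGE_LIMIT) -> bool:
    """Condition 1: no usage rate above the limit. Condition 2: SNi < MNi."""
    usages = (node.disk_usage, node.disk_io_usage, node.jvm_mem_usage, node.cpu_usage)
    return all(u <= limit for u in usages) and node.shard_count < node.max_shards


def build_node_array(nodes: List[NodeStats]) -> List[int]:
    """Return nodeArr[]: 1 for available nodes, 0 otherwise."""
    return [1 if is_available(n) else 0 for n in nodes]
```

Summing the returned array gives a first estimate of the number of available nodes, before the later adjustment that accounts for the shards about to be placed.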

First, according to related research, the business data volume of a single ES shard should preferably not exceed 25 GB; an initial number of primary shards is obtained from the following formula.

Then, since ES distributes shards evenly across the nodes by default, the number N of primary shards placed on each node is obtained from the following formula.

When N > X, take N = X. If the sum of a node's existing shard count SNi and N is greater than or equal to the upper limit MNi, the node is judged unavailable and the corresponding value of the node array nodeArr[] is updated to 0; the available nodes are finally obtained from the updated nodeArr[] array, which gives the number of available nodes.

The number of shards is related to the number of available nodes in the cluster. According to related research, shards can initially be created at 1.5 to 3 times the number of available nodes (N_au). To allow for later data growth and for one or more replica shards that ensure high availability, shards (primary and replica) are created here at 3 times the number of nodes, and a rough primary shard count is derived from the number of available nodes.

Taking into account the number of available nodes in the cluster, the index business data volume and the expected expansion of the index data, the node-based and data-volume-based results are linearly weighted and then divided by an expansion coefficient θ.

Here shardNum is the final number of shards, k1 and k2 are weight coefficients with k1 + k2 = 1, and θ measures how far the index data volume is expected to expand, θ ∈ (0, 1]. For indexes whose data is expected to grow substantially, a smaller value of θ can be chosen.
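The formulas referred to above are not reproduced in this text, so the Python sketch below is only one plausible reading of the prose. It assumes shardNum₁ = ⌈D / 25 GB⌉, assumes shardNum₂ is 3 × N_au total shards divided by (1 + replicas), and assumes shardNum = ⌈(k1·shardNum₁ + k2·shardNum₂) / θ⌉. The assignment of k1 to the data-volume term, the rounding, and the replica handling are all assumptions.

```python
import math

MAX_SHARD_SIZE_GB = 25.0   # per-shard size limit cited in the text
NODE_FACTOR = 3.0          # "3 times the number of nodes" from the text


def shard_num_by_volume(data_gb: float) -> int:
    """Assumed shardNum_1: enough primaries to keep each shard under 25 GB."""
    return max(1, math.ceil(data_gb / MAX_SHARD_SIZE_GB))


def shard_num_by_nodes(available_nodes: int, replicas: int = 1) -> int:
    """Assumed shardNum_2: primaries such that primaries + replicas = 3 * N_au."""
    return max(1, math.ceil(NODE_FACTOR * available_nodes / (1 + replicas)))


def final_shard_num(data_gb: float, available_nodes: int, k1: float = 0.5,
                    k2: float = 0.5, theta: float = 1.0, replicas: int = 1) -> int:
    """Assumed shardNum: weighted mix of both estimates, divided by theta."""
    if abs(k1 + k2 - 1.0) > 1e-9:
        raise ValueError("k1 + k2 must equal 1")
    if not 0.0 < theta <= 1.0:
        raise ValueError("theta must lie in (0, 1]")
    weighted = (k1 * shard_num_by_volume(data_gb)
                + k2 * shard_num_by_nodes(available_nodes, replicas))
    return math.ceil(weighted / theta)
```

Under these assumptions, 500 GB of data on 6 available nodes with one replica gives estimates of 20 and 9; equal weights and θ = 1 yield 15 shards, while θ = 0.5 doubles the headroom to 29.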

3. Index shard placement

The index shard placement parameters are listed in Table 2:

Table 2. Index shard placement parameters

Existing sharding strategies mainly follow the principle of dispersion. As shown in Figure 4, the index shard placement method proposed in the present invention is as follows:

In general, nodes with lower load have more spare processing capacity and should be assigned more shards. ES cannot place shards according to the resources of heterogeneous cluster nodes when an index is first created, but the desired placement can be achieved indirectly by migrating shards after initialization, with cluster.routing.allocation.disableallocation set to true to stop Elasticsearch's default load balancing strategy.
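Shard migration after initialization can be expressed as `move` commands for the cluster reroute API. The Python sketch below compares a current and a target placement and emits one command per shard that has to move; the placement dictionaries, index name and node names are hypothetical inputs, and sending the body to the cluster is left out.

```python
from typing import Any, Dict, List


def move_command(index: str, shard: int, from_node: str, to_node: str) -> Dict[str, Any]:
    """One 'move' command in the format used by the cluster reroute API."""
    return {"move": {"index": index, "shard": shard,
                     "from_node": from_node, "to_node": to_node}}


def reroute_body(index: str, current: Dict[int, str],
                 target: Dict[int, str]) -> Dict[str, List[Dict[str, Any]]]:
    """Build a reroute request body moving shards whose target node differs."""
    commands = []
    for shard, node in sorted(target.items()):
        source = current.get(shard)
        if source is not None and source != node:
            commands.append(move_command(index, shard, source, node))
    return {"commands": commands}
```

The target placement would come from the per-node shard counts computed by the placement strategy, and automatic rebalancing has to stay disabled so the cluster does not undo the moves.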

The present invention selects CPU utilization, JVM memory utilization, and disk IO utilization as the key Elasticsearch performance indicators.

(1) Load monitoring

Kibana is used to monitor Elasticsearch resource usage. Kibana monitoring shows that the resource usage indicators fluctuate and vary periodically, so relying on instantaneous load readings would make the monitored load indicators inaccurate.

(2) Collection module

The load monitoring module samples every f seconds over a collection period of d minutes, computes the average load of each indicator for each period, and stores the results in a database.

(3) Resource determination module

To avoid inaccurate load monitoring caused by instantaneous load peaks, the present invention predicts the load of each indicator using the second-order (double) exponential smoothing method (reference: Huang Weida, Zhu Weijun, Lan Yingbin. Energy analysis and prediction method based on quadratic exponential smoothing method [J]. Energy Conservation and Environmental Protection, 2023, (02): 63-65). The smoothing coefficient α is varied, the deviation variance S is computed for each value, and the α that minimizes S is selected.
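The collection and prediction steps can be sketched as follows. This is a minimal illustration rather than the patent's implementation: it assumes Brown's form of double exponential smoothing, initializes both smoothed series with the first observation, takes S to be the sum of squared one-step-ahead forecast errors, and searches α over a fixed grid from 0.1 to 0.9. With sampling every f seconds and a period of d minutes, `per_period` would be 60·d/f.

```python
from statistics import fmean


def period_averages(samples, per_period):
    """Average consecutive groups of `per_period` samples, one value per period."""
    return [fmean(samples[i:i + per_period])
            for i in range(0, len(samples) - per_period + 1, per_period)]


def brown_forecast(series, alpha):
    """Return (forecast for the next period, sum of squared one-step errors)."""
    s1 = s2 = series[0]          # assumed initialization
    forecast = series[0]
    sse = 0.0
    for t, x in enumerate(series):
        if t > 0:
            sse += (x - forecast) ** 2        # error of the forecast made at t-1
        s1 = alpha * x + (1 - alpha) * s1     # first smoothing
        s2 = alpha * s1 + (1 - alpha) * s2    # second smoothing
        level = 2 * s1 - s2
        trend = alpha / (1 - alpha) * (s1 - s2)
        forecast = level + trend              # one step ahead
    return forecast, sse


def best_alpha(series, grid=None):
    """Pick the alpha in the grid with the smallest squared-error sum S."""
    grid = grid or [a / 10 for a in range(1, 10)]
    return min(grid, key=lambda a: brown_forecast(series, a)[1])


if __name__ == "__main__":
    cpu = period_averages([0.30, 0.34, 0.41, 0.45, 0.52, 0.50, 0.61, 0.63], 2)
    alpha = best_alpha(cpu)
    print(alpha, brown_forecast(cpu, alpha)[0])
```

The forecast for the next period, rather than the latest raw reading, is then used as the node's utilization in the later placement step.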

(4) Indicator weight determination

The weight coefficients are determined by the entropy method from a statistical analysis of historical load indicators. The entropy method is a mathematical method that uses the dispersion of an indicator to reflect its degree of influence, so weights can be determined objectively from how much the indicator values vary. An indicator's weight is positively correlated with its variation: the more the values vary, the larger the weight, and vice versa.

The specific steps are as follows:

a. Construct the load information decision matrix M

where n is the number of periods, and CAU_n, MAU_n, and DAIO_n are the CPU, JVM memory, and disk IO utilization in the n-th period, respectively.

b. Standardize each column of the decision matrix M to obtain the decision matrix R

where each column of matrix R is normalized, that is, the values in each column sum to 1.

c. Use the entropy formula to compute the uncertainty of each indicator, with E denoting the entropy of an indicator. The formula is as follows:

E_j is the entropy of indicator j, and the constant K = 1/ln(n) guarantees 0 ≤ E ≤ 1, so E is at most 1.

The formula shows that when the values under an attribute contribute almost equally, E tends to 1 and the weight of that attribute tends to 0, so the attribute can be ignored in the decision. In other words, the spread of the values in an attribute column determines the size of its weight coefficient. Accordingly, D_j is defined as the contribution of indicator j, D_j = 1 - E_j.

d. Compute the objective weight of each indicator. The formula is as follows:

WO₁, WO₂, and WO₃ are the objective weights of CPU, JVM memory, and disk IO with respect to node load, and WO₁ + WO₂ + WO₃ = 1. The algorithm takes as input the matrix of per-period load values for each indicator and computes each indicator's objective weight by the entropy method.
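Steps a to d can be sketched in a few lines. The formula images are not reproduced in this text, so the code assumes the standard entropy-weight forms that satisfy the stated properties: each column is divided by its sum (columns of R sum to 1), E_j = -K Σ r_ij ln r_ij with K = 1/ln(n), D_j = 1 - E_j, and WO_j = D_j / Σ D_j. The equal-weight fallback for the case where every indicator is constant is an added assumption.

```python
import math


def entropy_weights(m):
    """m: n rows (periods) x 3 columns (CPU, JVM memory, disk IO utilization).

    Returns the objective weights WO_j, which sum to 1.
    """
    n = len(m)
    if n < 2:
        raise ValueError("at least two periods are needed, since K = 1/ln(n)")
    k = 1.0 / math.log(n)
    contributions = []
    for j in range(len(m[0])):
        column = [row[j] for row in m]
        total = sum(column)
        r = [v / total for v in column]                    # column of R, sums to 1
        e_j = -k * sum(p * math.log(p) for p in r if p > 0)
        contributions.append(max(0.0, 1.0 - e_j))          # D_j, clamped against rounding
    d_sum = sum(contributions)
    if d_sum < 1e-12:                                      # assumed fallback: no variation
        return [1.0 / len(contributions)] * len(contributions)
    return [d / d_sum for d in contributions]


if __name__ == "__main__":
    history = [
        [0.20, 0.50, 0.10],
        [0.25, 0.52, 0.90],
        [0.22, 0.49, 0.50],
    ]
    print(entropy_weights(history))
```

In the sample history the disk IO column varies far more than the other two, so it receives most of the weight, which is the behavior the text describes.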

(5) Shard placement

First, the preceding module gives the weights of CPU, JVM memory, and disk IO in the load as w₁, w₂, and w₃, respectively.

Then, the processing capacity of each node is obtained from the indicator weights. The formula is as follows:

where the utilization terms are the predicted CPU, memory, and disk IO utilization, respectively, and i denotes the i-th node.

Finally, the proportion of data to be allocated to each node is obtained. The formula is as follows:

where DP_i is the proportion of data that should be allocated to the i-th node, and m is the number of nodes.

The steps above give the amount of data to allocate to each node in the cluster, and hence the corresponding number of shards. For each node, check whether the number of shards available on it is greater than the number assigned to it. If it is smaller, modify that node's shard limit parameter total_shards_per_node.
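The placement step can be sketched as below. The capacity and proportion formulas are not reproduced in this text, so the code makes two labeled assumptions: a node's processing capacity is the weighted sum of its unused share of each resource, Σ w_k (1 - predicted utilization), and DP_i is that capacity divided by the total over all m nodes. Converting proportions into whole shard counts by the largest-remainder rule is also an illustrative choice. Elasticsearch documents an index-level setting named `index.routing.allocation.total_shards_per_node`, which is presumably the limit the text refers to.

```python
def node_capacity(predicted, weights):
    """Assumed form: weighted spare capacity from predicted (cpu, mem, disk_io) utilization."""
    return sum(w * (1.0 - u) for w, u in zip(weights, predicted))


def data_proportions(predictions, weights):
    """DP_i = capacity_i / sum of capacities over all nodes (assumed form)."""
    capacities = [node_capacity(p, weights) for p in predictions]
    total = sum(capacities)
    return [c / total for c in capacities]


def shards_per_node(proportions, total_shards):
    """Turn proportions into integer shard counts with the largest-remainder rule."""
    raw = [p * total_shards for p in proportions]
    counts = [int(r) for r in raw]
    leftover = total_shards - sum(counts)
    by_remainder = sorted(range(len(raw)), key=lambda i: raw[i] - counts[i], reverse=True)
    for i in by_remainder[:leftover]:
        counts[i] += 1
    return counts


def nodes_needing_higher_limit(counts, limits):
    """Indices of nodes whose current shard limit is below the assigned count."""
    return [i for i, (c, lim) in enumerate(zip(counts, limits)) if lim < c]


if __name__ == "__main__":
    weights = (0.5, 0.3, 0.2)                          # w1, w2, w3 from the entropy step
    predictions = [(0.2, 0.2, 0.2), (0.6, 0.6, 0.6)]   # two nodes, hypothetical values
    counts = shards_per_node(data_proportions(predictions, weights), 18)
    print(counts, nodes_needing_higher_limit(counts, [10, 10]))
```

A lightly loaded node ends up with more shards, and any node whose limit is below its assignment is flagged so the limit can be changed before shards are moved.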

The Elasticsearch query service module performs the following steps:

(1) Boolean combination querier

As shown in FIG. 5, in the public opinion business scenario a query must combine conditions such as social platform, business tag, keyword, and time range. Conditions are combined in two ways: within a platform and across platforms. The combination logic is AND, OR, and NOT, corresponding to the Elasticsearch Boolean query operators must/should/mustNot. The specific steps are shown in Table 3:

Table 3 Boolean combination query steps
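A minimal sketch of how such a combined query might be assembled is shown below. It builds an Elasticsearch Query DSL body as a plain dictionary; in the JSON DSL the NOT clause is spelled `must_not`, while `mustNot` is the Java builder method name. The field names `platform`, `title`, and `publish_time`, and the platform value in the usage example, are hypothetical.

```python
def platform_clause(platform, all_of, any_of, none_of, start, end):
    """Conditions inside one platform: AND -> must, OR -> should, NOT -> must_not."""
    must = [
        {"term": {"platform": platform}},
        {"range": {"publish_time": {"gte": start, "lte": end}}},
    ]
    must += [{"match_phrase": {"title": k}} for k in all_of]
    if any_of:
        must.append({
            "bool": {
                "should": [{"match_phrase": {"title": k}} for k in any_of],
                "minimum_should_match": 1,
            }
        })
    return {
        "bool": {
            "must": must,
            "must_not": [{"match_phrase": {"title": k}} for k in none_of],
        }
    }


def combine_platforms(clauses):
    """Conditions across platforms: a document may match any platform clause."""
    return {"query": {"bool": {"should": clauses, "minimum_should_match": 1}}}


if __name__ == "__main__":
    import json
    body = combine_platforms([
        platform_clause("weibo", ["大学"], ["江苏", "南京", "苏州"], ["苏州"],
                        "2024-01-01", "2024-03-31"),
    ])
    print(json.dumps(body, ensure_ascii=False, indent=2))
```

Nesting one `bool` per platform under an outer `should` keeps the within-platform and across-platform logic separate, which mirrors the two combination modes described above.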

(2) Text-to-text search

A 768-dimensional vector field of type dense_vector is created in Elasticsearch. The sentence to be retrieved is converted into a 768-dimensional vector by the SBert sentence vector model and passed as the query parameter. Its cosine similarity with the 768-dimensional vectors stored in Elasticsearch is computed, and the items with high similarity are returned.
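The mapping and query can be sketched as request bodies. This is one documented way to run a cosine-similarity search in Elasticsearch (a `script_score` query calling `cosineSimilarity`, with 1.0 added because scores may not be negative); the text does not say which query form the system uses. The field name `text_vector` is hypothetical, and the query vector is assumed to come from whichever sentence encoder is deployed.

```python
DIMS = 768  # dimensionality stated in the text


def vector_mapping(field="text_vector"):
    """Body for index creation with a 768-dimensional dense_vector field."""
    return {"mappings": {"properties": {field: {"type": "dense_vector", "dims": DIMS}}}}


def cosine_query(query_vector, field="text_vector", size=10):
    """Body for a search ranked by cosine similarity to query_vector."""
    if len(query_vector) != DIMS:
        raise ValueError(f"expected a {DIMS}-dimensional vector")
    return {
        "size": size,
        "query": {
            "script_score": {
                "query": {"match_all": {}},
                "script": {
                    "source": f"cosineSimilarity(params.query_vector, '{field}') + 1.0",
                    "params": {"query_vector": list(query_vector)},
                },
            }
        },
    }


if __name__ == "__main__":
    placeholder = [0.0] * (DIMS - 1) + [1.0]   # stands in for an encoder output
    print(vector_mapping())
    print(cosine_query(placeholder)["query"]["script_score"]["script"]["source"])
```

The same two bodies apply to image vectors by changing the field name, as long as the stored and query vectors come from the same model.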

(3) Multimodal retrieval

A 768-dimensional vector field of type dense_vector is created in Elasticsearch. The sentence or image to be retrieved is converted into a 768-dimensional vector by the CLIP multimodal model and passed as the query parameter. Its cosine similarity with the 768-dimensional image vectors stored in Elasticsearch is computed, and the items with high similarity are returned.

Specifically, the present invention provides an Elasticsearch retrieval optimization system for large-scale online public opinion. First, large-scale online public opinion data is aggregated with a Flink + Kafka + Elasticsearch stream-processing architecture. Second, the Elasticsearch optimization mechanism module mainly comprises the SBert sentence vector model, the CLIP multimodal vector model, and the Elasticsearch index sharding strategy. Finally, the Elasticsearch retrieval service provides a Boolean combination querier and cross-modal retrieval.

Embodiment:

(1) Experimental data

The experiment uses online public opinion data generated on the Internet, at 1.5 million items per day over a time range of one year. The composition of the dataset is detailed in Table 4. Each test was run three times with the cache cleared before each run, and the average was taken as the final result.

Table 4 Dataset details

(2) Experimental environment

The cluster consists of 6 Elasticsearch nodes, as shown in Table 5:

Table 5 Experimental environment

(3) Experimental procedure

The present invention provides an Elasticsearch retrieval optimization system for large-scale online public opinion data. There are many methods and approaches for implementing this technical solution, and the above is only a preferred embodiment of the present invention. It should be noted that a person of ordinary skill in the art may make improvements and refinements without departing from the principles of the present invention, and these improvements and refinements shall also fall within the scope of protection of the present invention. Any component not specified in this embodiment can be implemented with existing technology.

1. Boolean combination search

As shown in FIG. 6, the Elasticsearch retrieval system runs a complex Boolean combination query over the online public opinion data. The query conditions are the keyword expression "(江苏|南京|苏州)&大学&！苏州" (that is, "(Jiangsu|Nanjing|Suzhou)&University&!Suzhou"), together with platform types, a time range, and so on. Items whose titles match the keyword expression are returned, and the returned results verify the effectiveness of the Boolean combination query proposed in the present invention.

2. Multimodal retrieval

(1) Text-to-text search

As shown in FIG. 7, to retrieve items similar to the first public opinion item, the sentence to be retrieved is first encoded into a 768-dimensional vector, and a cosine similarity search is then run with that vector. The results are ranked so that items with high semantic similarity (the score value) appear first. The last item is essentially unrelated and has the lowest similarity (its cosine value is actually negative, but the score is positive because 1 is added). As shown in FIG. 8, the public opinion system performs semantic retrieval on sentences with good results.
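The remark about the added 1 can be checked with a small worked example. The sketch below computes cosine similarity in plain Python and applies the same +1 shift, so a cosine of -1 maps to a score of 0 and a cosine of 1 maps to a score of 2; the original cosine value is recovered as score - 1. The two-dimensional vectors and document ids are toy values chosen only for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def shifted_score(a, b):
    """Cosine similarity shifted by +1 so that the score is never negative."""
    return cosine(a, b) + 1.0


def rank(query, docs):
    """Sort (doc_id, vector) pairs by shifted score, highest first."""
    return sorted(docs, key=lambda item: shifted_score(query, item[1]), reverse=True)


if __name__ == "__main__":
    query = [1.0, 0.0]
    docs = [("opposite", [-1.0, 0.0]), ("same", [1.0, 0.0]), ("orthogonal", [0.0, 1.0])]
    for doc_id, vec in rank(query, docs):
        print(doc_id, shifted_score(query, vec))
```

Because the shift is a constant, it changes only the displayed score values and not the ranking order.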

(2) Text-to-image search

As shown in FIG. 9, to search with the text "多人篮球" ("multi-person basketball"), the text is first encoded into a 768-dimensional vector, and a cosine similarity search is then run with that vector. The results show images of multi-person basketball, so the retrieval works well.

(3) Image-to-image search

As shown in FIG. 10, to search with an uploaded image, the image is first encoded into a 768-dimensional vector, and a cosine similarity search is then run with that vector. The results show similar images, so the retrieval works well.

3. Elasticsearch optimization mechanism

Table 6 Query tasks for four online public opinion business scenarios

As shown in Table 6, query statements for four online public opinion business scenarios were executed in the experiment. Overall, the default Elasticsearch sharding mechanism gives the worst query efficiency, building segmented indexes by month improves it, and the hot-cold mechanism proposed in the present invention (segmentation + hot-cold tiering + SSD) further improves the performance of the Elasticsearch retrieval system. The benefit of the optimization grows as the data volume scales out.
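With month-segmented indexes, a time-bounded query only needs to address the indexes whose months overlap the requested range, which is one plausible reason the segmented layout helps. The sketch below lists those index names; the naming pattern `opinion-YYYY.MM` is a hypothetical convention, since the text does not give one.

```python
from datetime import date


def monthly_indices(start: date, end: date, prefix: str = "opinion-"):
    """Names of the monthly indexes that overlap the closed range [start, end]."""
    if start > end:
        raise ValueError("start must not be after end")
    names = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        names.append(f"{prefix}{year:04d}.{month:02d}")
        month += 1
        if month == 13:          # roll over into the next year
            year, month = year + 1, 1
    return names


if __name__ == "__main__":
    # A query for mid-November 2023 through early February 2024 touches four indexes.
    print(",".join(monthly_indices(date(2023, 11, 15), date(2024, 2, 3))))
```

The comma-joined list can be used as the target of a multi-index search, so months outside the range are never opened.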

Table 7 Execution results of query Q2

As shown in Table 7, query Q2 was executed in the experiment. The shard count optimization plus placement optimization strategy proposed in the present invention substantially improves retrieval performance, and the benefit grows as the data volume scales out.

In a specific implementation, the present application provides a computer storage medium and a corresponding data processing unit. The computer storage medium can store a computer program which, when executed by the data processing unit, can carry out the content of the Elasticsearch retrieval optimization system for large-scale online public opinion provided by the present invention, as well as some or all of the steps in each embodiment. The storage medium may be a magnetic disk, an optical disc, a read-only memory (ROM), a random access memory (RAM), or the like.

Those skilled in the art will clearly understand that the technical solutions in the embodiments of the present invention can be implemented by means of a computer program and a corresponding general-purpose hardware platform. On this understanding, the technical solutions in the embodiments of the present invention, in essence or in the part that contributes to the prior art, can be embodied as a computer program, that is, a software product. The software product can be stored in a storage medium and includes a number of instructions that cause a device containing a data processing unit (which may be a personal computer, a server, a single-chip microcomputer, an MCU, a network device, or the like) to execute the methods described in the embodiments of the present invention or in certain parts of the embodiments.

The present invention provides an approach and a method for an Elasticsearch retrieval optimization system for large-scale online public opinion. There are many methods and approaches for implementing this technical solution, and the above is only a preferred embodiment of the present invention. It should be noted that a person of ordinary skill in the art may make improvements and refinements without departing from the principles of the present invention, and these improvements and refinements shall also fall within the scope of protection of the present invention. Any component not specified in this embodiment can be implemented with existing technology.