CN116306605A - Title screening method, device and computer equipment - Google Patents

Title screening method, device and computer equipment

Info

Publication number
CN116306605A
Authority
CN
China
Prior art keywords
text
titles
title
numbers
sequence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310308319.XA
Other languages
Chinese (zh)
Inventor
柴玉倩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Qichacha Technology Co ltd
Original Assignee
Qichacha Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Qichacha Technology Co ltd
Priority to CN202310308319.XA
Publication of CN116306605A
Legal status: Pending

Abstract

This application relates to a title screening method. The method includes: obtaining the recognized titles in a text, where the recognized title information includes a text style attribute, a text size attribute, and a text number; classifying the recognized titles according to the text style attribute and text size attribute to obtain recognized titles of multiple categories; obtaining the unrecognized titles in the text according to the text style attribute and text size attribute; adding the recognized and unrecognized titles of the same category to the same title sequence and sorting the titles in the sequence by text number; and, if the text numbers of the title sequence are consecutive, taking the titles in the title sequence as the titles in the text. The method can recover unrecognized titles and thereby solves the problem of incomplete recognized title sequences.

Description

Title screening method, device and computer equipment

Technical field

The present application relates to the field of computer technology, and in particular to a title screening method, device, and computer equipment.

Background

A text may contain multiple titles that differ in format, attributes, and font size, and titles of the same type share the same title elements. The content of a text can be understood by obtaining its titles.

In the related art, title elements in a text can be recognized with a layout analysis model, but the recognition accuracy of such a model is low. A title is generally preceded by a text number, and the text numbers of the titles obtained with a layout analysis model may be non-consecutive, so not all titles in the text can be obtained.

Summary of the invention

In view of the above technical problem, it is necessary to provide a title screening method that derives the text style attribute, text size attribute, and text number from the title information produced by a layout analysis model, and uses them to recognize titles in the text that the model missed.

In a first aspect, the present application provides a title screening method. The method includes:

obtaining the recognized titles in a text, where the recognized title information includes a text style attribute, a text size attribute, and a text number, and classifying the recognized titles according to the text style attribute and text size attribute to obtain recognized titles of multiple categories;

obtaining the unrecognized titles in the text according to the text style attribute and text size attribute;

adding the recognized and unrecognized titles of the same category to the same title sequence, and sorting the titles in the title sequence by text number; if the text numbers of the title sequence are consecutive, the titles in the title sequence are the titles in the text.

In one embodiment, titles with the same text style attribute are assigned to title sequences according to the text size attribute, the text size attribute being smaller than a preset text threshold.

In one embodiment, the text numbers of a title sequence are also considered consecutive when the sequence contains identical text numbers.

In one embodiment, when identical text numbers exist in the title sequence, all but the first of the identical text numbers are deleted to obtain consecutive text numbers.

In one embodiment, the preset text threshold is less than or equal to 1.5 points.

In a second aspect, the present application also provides a title screening device. The device includes:

a classification module, configured to obtain the recognized titles in a text, where the recognized title information includes a text style attribute, a text size attribute, and a text number, and to classify the recognized titles according to the text style attribute and text size attribute to obtain recognized titles of multiple categories;

an unrecognized title acquisition module, configured to obtain the unrecognized titles in the text according to the text style attribute and text size attribute;

a sorting module, configured to add the recognized and unrecognized titles of the same category to the same title sequence and to sort the titles in the title sequence by text number; if the text numbers of the title sequence are consecutive, the titles in the title sequence are the titles in the text.

In one embodiment, titles with the same text style attribute are assigned to title sequences according to the text size attribute, the text size attribute being smaller than a preset text threshold.

In one embodiment, the text numbers of a title sequence are also considered consecutive when the sequence contains identical text numbers.

In one embodiment, when identical text numbers exist in the title sequence, all but the first of the identical text numbers are deleted to obtain consecutive text numbers.

In one embodiment, the preset text threshold is less than or equal to 1.5 points.

In a third aspect, the present disclosure also provides a computer device. The computer device includes a memory and a processor; the memory stores a computer program, and the processor implements the steps of the title screening method when executing the computer program.

In a fourth aspect, the present disclosure also provides a computer-readable storage medium. A computer program is stored on the computer-readable storage medium, and the steps of the title screening method are implemented when the computer program is executed by a processor.

In a fifth aspect, the present disclosure also provides a computer program product. The computer program product includes a computer program, and the steps of the title screening method are implemented when the computer program is executed by a processor.

The above title screening method has at least the following beneficial effects:

In the embodiments provided by the present disclosure, title information is obtained from the recognized titles produced by the layout analysis model, and the unrecognized titles in the text are obtained from that title information. The recognized and unrecognized titles of the same category are added to the same title sequence, and checking whether the text numbers in the title sequence are consecutive makes it possible to obtain all titles present in the text.

It should be understood that the foregoing general description and the following detailed description are exemplary and explanatory only and do not limit the present disclosure.

Description of drawings

To explain the technical solutions in the embodiments of the present disclosure or in the conventional art more clearly, the drawings needed for describing them are briefly introduced below. The drawings described below are merely some embodiments of the present disclosure, and a person of ordinary skill in the art can derive other drawings from them without creative effort.

Fig. 1 is an application environment diagram of a title screening method in one embodiment;

Fig. 2 is a schematic flow chart of a title screening method in one embodiment;

Fig. 3 is a recognized title information table in one embodiment;

Fig. 4 is an unrecognized title table in one embodiment;

Fig. 5 is a schematic diagram of a title sequence in one embodiment;

Fig. 6 is a structural block diagram of a title screening device in one embodiment;

Fig. 7 is an internal structure diagram of a computer device in one embodiment;

Fig. 8 is an internal structure diagram of a server in one embodiment.

Detailed description

To help a person of ordinary skill in the art better understand the technical solutions of the present disclosure, the technical solutions in the embodiments of the present disclosure are described clearly and completely below with reference to the drawings.

It should be noted that the terms "first", "second", and the like in the specification, claims, and drawings of the present disclosure are used to distinguish similar objects and do not necessarily describe a specific order or sequence. It should be understood that data so used are interchangeable where appropriate, so that the embodiments of the present disclosure described herein can be practiced in orders other than those illustrated or described herein. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the present disclosure; rather, they are merely examples of devices and methods consistent with some aspects of the present disclosure as detailed in the appended claims. The terms "comprise", "include", and any variants thereof are intended to cover a non-exclusive inclusion, so that a process, method, product, or device comprising a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to that process, method, product, or device. Without further limitation, the presence of additional identical or equivalent elements in a process, method, product, or device comprising the stated elements is not excluded. For example, words such as first and second, where used, denote names and do not indicate any particular order.

An embodiment of the present disclosure provides a title screening method that can be applied in the application environment shown in Fig. 1, in which a terminal 102 communicates with a server 104 over a network. A data storage system can store the data that the server 104 needs to process; it can be integrated on the server 104 or placed on the cloud or on another network server. The terminal 102 can be, but is not limited to, a personal computer, notebook computer, smartphone, tablet computer, Internet of Things device, or portable wearable device. The Internet of Things device can be a smart speaker, smart TV, smart air conditioner, smart vehicle-mounted device, or the like; the portable wearable device can be a smart watch, smart bracelet, head-mounted device, or the like. The server 104 can be implemented as an independent server or as a server cluster composed of multiple servers.

In some embodiments of the present disclosure, as shown in Fig. 2, a title screening method is provided. It is described here using the example of the server in Fig. 1 processing the titles. It can be understood that the method can be applied to a server, or to a system that includes a terminal and a server and implemented through interaction between the two. In one specific embodiment, the method can include the following steps:

S202: Obtain the recognized titles in the text, where the recognized title information includes a text style attribute, a text size attribute, and a text number, and classify the recognized titles according to the text style attribute and text size attribute to obtain recognized titles of multiple categories.

A text may contain titles of several types, and titles of different types may have different text style attributes and text size attributes. The recognized titles in the text can be obtained with a layout analysis model; however, the model's recognition accuracy is low, so unrecognized titles may remain in the text. The text style attribute can include styles such as theme font, bold, and color; the text size attribute indicates the font size of the title; and the text number indicates the ordinal of the title at its level, for example 第一章节 (Chapter 1), 一 (one), or 1.
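The patent does not say how the text number is extracted from a title. The following is a minimal sketch of one possible approach, assuming that the number appears as a prefix made of an optional "第", an Arabic or Chinese numeral, and an optional unit and delimiter. The function names, the regular expression, and the supported range (1 to 99 for Chinese numerals) are illustrative assumptions, not part of the patent.

```python
import re

# Digits used in Chinese ordinal numbering (一 = 1 ... 九 = 9).
_DIGITS = {"一": 1, "二": 2, "三": 3, "四": 4, "五": 5,
           "六": 6, "七": 7, "八": 8, "九": 9}

# Optional "第", then a numeral, then an optional unit and delimiter.
_PREFIX = re.compile(r"^(第?)([0-9]+|[一二三四五六七八九十]+)([章节条]?[、.]?)")


def chinese_to_int(token):
    """Convert a Chinese numeral between 一 (1) and 九十九 (99) to an int."""
    if "十" in token:
        tens, _, ones = token.partition("十")
        return (_DIGITS[tens] if tens else 1) * 10 + (_DIGITS[ones] if ones else 0)
    return _DIGITS[token]


def split_prefix(title):
    """Return (style_pattern, text_number) for a numbered title, else None.

    The numeral is replaced by "*" to form the style pattern.
    """
    m = _PREFIX.match(title)
    if m is None:
        return None
    lead, token, tail = m.groups()
    try:
        number = int(token) if token.isascii() else chinese_to_int(token)
    except KeyError:
        return None  # numeral outside the supported 1-99 range
    return f"{lead}*{tail}", number
```

With this sketch, `split_prefix("第一节、声明")` yields the pattern `第*节、` with number 1, and `split_prefix("二、报告")` yields `*、` with number 2. A real system would need extra rules to reject body text that merely starts with a number.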

In some embodiments of the present disclosure, the text style attribute, text size attribute, and text number can be read from the recognized title information, and recognized titles with the same text style attribute and text size attribute are placed in the same category, yielding recognized titles of multiple categories. Fig. 3 is a recognized title information table in one embodiment; it lists, for each title, the page number, text content, text style attribute, text size attribute, and text number. For example, the title whose text content is "第一节、声明" (Section 1, Statement) has the text style attribute "第*节、", and the title whose text content is "二、报告" (2, Report) has the text style attribute "*、". The titles in the table fall into two groups: entries No. 1 and No. 7 form group A, and entries No. 2, 3, 4, 5, and 6 form group B. In this way the recognized titles are classified by text style attribute and text size attribute into multiple categories.
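The classification in S202 can be sketched as grouping by the pair (text style attribute, text size attribute). The `Title` record, its field names, and the sample rows below are hypothetical; the contents of Fig. 3 are not reproduced in the text, so only the two quoted titles come from the source.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Title:
    page: int       # page on which the title appears
    content: str    # full title text
    style: str      # text style attribute, e.g. "第*节、"
    size: float     # text size attribute in points (assumed unit)
    number: int     # text number parsed from the title


def classify(titles):
    """Group titles that share both style and size into one category."""
    groups = defaultdict(list)
    for t in titles:
        groups[(t.style, t.size)].append(t)
    return dict(groups)


# Hypothetical rows in the spirit of Fig. 3.
recognized = [
    Title(1, "第一节、声明", "第*节、", 16.0, 1),
    Title(2, "一、释义", "*、", 12.0, 1),
    Title(3, "二、报告", "*、", 12.0, 2),
    Title(9, "第二节、附录", "第*节、", 16.0, 2),
]
categories = classify(recognized)
```

Here `categories` has two keys, one per group, which mirrors the A/B split described for Fig. 3. Exact equality on size is the simplest choice; grouping sizes within a tolerance is sketched separately where the text threshold is discussed.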

S204: Obtain the unrecognized titles in the text according to the text style attribute and text size attribute.

The recognized title information table of Fig. 3 provides the text style attribute, text size attribute, and text number of each recognized title, and the recognized titles have already been classified by text style attribute and text size attribute. Because titles summarize the text, their text style and text size attributes differ from those of ordinary text. The text style and text size attributes of the titles in the text can therefore be derived from the recognized titles and used to find unrecognized titles in the text. An unrecognized title is a potential title. Ordinary text is all text other than titles, and some ordinary text may quote a title, so the unrecognized titles can be screened a second time to determine whether each one qualifies.

Fig. 4 is an unrecognized title table in one embodiment. It lists the text content, text style attribute, text size attribute, and group of each title. According to the text style attribute and text size attribute, the texts No. 1 and No. 3 are assigned to group B, and the text No. 2 is assigned to group A.
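One way to read S204 is as a scan over all text lines for those whose style and size match a category already seen among the recognized titles. The sketch below assumes each line is a dictionary with hypothetical keys `page`, `content`, `style`, and `size`; the patent does not specify the data layout or the matching rule, and the secondary screening it mentions is left out.

```python
def find_unrecognized(lines, recognized):
    """Return lines that look like titles but were not recognized.

    A line is a candidate when its (style, size) pair matches a category
    of recognized titles and it is not itself a recognized title.
    """
    known = {(t["style"], t["size"]) for t in recognized}
    already = {(t["page"], t["content"]) for t in recognized}
    return [ln for ln in lines
            if (ln["style"], ln["size"]) in known
            and (ln["page"], ln["content"]) not in already]


# Hypothetical input: three numbered lines and one body paragraph.
recognized = [
    {"page": 1, "content": "一、释义", "style": "*、", "size": 12.0},
    {"page": 5, "content": "三、结论", "style": "*、", "size": 12.0},
]
lines = recognized + [
    {"page": 3, "content": "二、报告", "style": "*、", "size": 12.0},
    {"page": 3, "content": "正文内容", "style": None, "size": 10.5},
]
candidates = find_unrecognized(lines, recognized)
```

In this example only "二、报告" is returned. Candidates are still only potential titles, which is why the text suggests screening them again.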

S206: Add the recognized and unrecognized titles of the same category to the same title sequence, and sort the titles in the title sequence by text number; if the text numbers of the title sequence are consecutive, the titles in the title sequence are the titles in the text.

A title sequence contains titles of a single category. The recognized and unrecognized titles of the same category are added to the same title sequence, and the text numbers in the sequence are used to judge whether all titles of that category in the text have been recognized. If the text numbers in the title sequence are consecutive, with no break in the numbering, the title sequence comprising the recognized and unrecognized titles is complete and contains all titles of that category in the text. Consecutive numbering covers text numbers that run in increasing or decreasing order, as well as consecutive text numbers that contain identical values.

Fig. 5 is a schematic diagram of a title sequence in one embodiment. The titles with text numbers 1 and 3 are recognized titles, and the title with text number 2 is an unrecognized title. Whether the current sequence is a complete title sequence can then be judged by checking whether the text numbers are consecutive. A remark can be added after the title with text number 2 to indicate that it is an unrecognized title. Consecutive text numbers can be written as [1, 2, 3] or [1, 2, 3, 4]; the first text number is 1, and if the first number is not 1, the current sequence can also be judged incomplete. Text numbers of [1, 3, 7] or [1, 7, 3] both indicate that the current sequence is not a complete title sequence.
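The completeness rule described for S206 can be sketched as a short check: sort the text numbers, treat repeated values as one, and require the result to run 1, 2, 3, ... without a gap. The function name `is_complete` is an assumption, and treating an empty sequence as incomplete is a choice made here, not something the patent states.

```python
def is_complete(numbers):
    """Return True if the text numbers form 1..n with no gaps.

    Order does not matter because the sequence is sorted first, and
    identical numbers are collapsed before the check.
    """
    ordered = sorted(set(numbers))
    return bool(ordered) and ordered == list(range(1, len(ordered) + 1))


checks = {
    "1,2,3": is_complete([1, 2, 3]),
    "1,3,7": is_complete([1, 3, 7]),
    "1,7,3": is_complete([1, 7, 3]),
    "2,3,4": is_complete([2, 3, 4]),
    "1,2,2,3": is_complete([1, 2, 2, 3]),
}
```

The sequences [1, 3, 7] and [1, 7, 3] fail because of the gaps, [2, 3, 4] fails because the first number is not 1, and [1, 2, 2, 3] passes because identical numbers are tolerated.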

In the above title screening method, the recognized titles are obtained with a layout analysis model, and the unrecognized titles in the text are obtained from the text style attribute and text size attribute of the recognized titles. The recognized and unrecognized titles of the same category are added to the same title sequence, and checking whether the text numbers in the title sequence are consecutive makes it possible to obtain all titles present in the text.

In some embodiments of the present disclosure, titles with the same text style attribute are assigned to title sequences according to the text size attribute, the text size attribute being smaller than a preset text threshold.

A title sequence contains titles of a single category, and the recognized titles are classified by text style attribute and text size attribute into multiple categories. Titles with the same text style attribute are obtained first. Because these may include both larger and smaller titles, they are then divided into sequences according to the text size attribute.

In some embodiments of the present disclosure, the text numbers of a title sequence are also considered consecutive when the sequence contains identical text numbers.

After the recognized and unrecognized titles of the same category are added to the same title sequence, whether the title sequence is complete can be judged by checking whether the text numbers are consecutive. If adjacent text numbers are identical, the text numbers are still considered consecutive: the recognized titles and unrecognized titles may contain the same title, which produces identical text numbers.

In some embodiments of the present disclosure, when identical text numbers exist in the title sequence, all but the first of the identical text numbers are deleted to obtain consecutive text numbers.

When the recognized titles and unrecognized titles contain the same title, identical text numbers may appear. The first of the identical text numbers is kept and the others are deleted, which yields consecutive text numbers.
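The deduplication step can be sketched as a single pass that keeps the first title seen for each text number. The dictionary keys and the `source` field are hypothetical, and the sketch assumes the sequence has already been sorted by text number with a stable sort, so that "first" means the first entry in that order.

```python
def drop_duplicate_numbers(sequence):
    """Keep only the first title for each text number, preserving order."""
    seen = set()
    kept = []
    for title in sequence:
        if title["number"] in seen:
            continue  # a later title with the same number is dropped
        seen.add(title["number"])
        kept.append(title)
    return kept


# Hypothetical sequence in which title 2 was found twice.
sequence = [
    {"number": 1, "content": "一、释义", "source": "recognized"},
    {"number": 2, "content": "二、报告", "source": "recognized"},
    {"number": 2, "content": "二、报告", "source": "unrecognized"},
    {"number": 3, "content": "三、结论", "source": "recognized"},
]
deduplicated = drop_duplicate_numbers(sequence)
```

After the pass, the text numbers are 1, 2, 3 and the retained copy of title 2 is the one that came first in the sequence.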

In some embodiments of the present disclosure, the preset text threshold is less than or equal to 1.5 points.

Titles with the same text style attribute are assigned to title sequences according to the text size attribute, the text size attribute being smaller than the preset text threshold. Titles with the same text style attribute may include both parent titles and subtitles, in which case they must be divided further by text size attribute. Titles at different levels generally differ in text size by 2 or more, so the preset text threshold can be set to 1.5 points or less. For example, with a parent title "一、释义" (1, Definitions) and a subtitle "一、说明" (1, Description), the text style attributes are the same but the text size attributes differ, so the parent title and subtitle can be divided into two different title sequences according to the text size attribute.
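The text does not spell out what is compared against the threshold. The sketch below assumes one plausible reading: among titles with the same text style attribute, two titles belong to the same sequence when their text sizes differ by less than the threshold. The function name, the dictionary keys, the sample sizes, and the choice to compare against the first (smallest) size in each group are all assumptions.

```python
def split_by_size(titles, threshold=1.5):
    """Split same-style titles into sequences of similar text size.

    Titles are sorted by size; a new sequence starts whenever a title's
    size differs from the first size in the current sequence by the
    threshold or more.
    """
    sequences = []
    for title in sorted(titles, key=lambda t: t["size"]):
        if sequences and title["size"] - sequences[-1][0]["size"] < threshold:
            sequences[-1].append(title)
        else:
            sequences.append([title])
    return sequences


# Hypothetical titles that all share the style "*、".
same_style = [
    {"content": "一、释义", "size": 16.0},   # parent title
    {"content": "一、说明", "size": 12.0},   # subtitle
    {"content": "二、报告", "size": 16.5},   # parent title, slightly larger
]
sequences = split_by_size(same_style)
```

With these sizes the subtitle ends up in its own sequence, while the two parent titles, which differ by only 0.5 points, share one. Anchoring on the first size in each group avoids chaining many small differences into one oversized group.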

It should be understood that although the steps in the flow charts of the above embodiments are shown in the order indicated by the arrows, they are not necessarily executed in that order. Unless explicitly stated herein, there is no strict restriction on the execution order, and the steps can be executed in other orders. Moreover, at least some of the steps in the flow charts can include multiple sub-steps or stages. These sub-steps or stages are not necessarily completed at the same moment and can be executed at different times; nor are they necessarily executed one after another, and they can be executed in turn or alternately with other steps or with at least part of the sub-steps or stages of other steps.

Based on the same inventive concept, an embodiment of the present disclosure also provides a title screening device for implementing the title screening method described above. The solution provided by the device is similar to that described for the method, so for the specific limitations in the device embodiments below, reference can be made to the limitations of the title screening method above, which are not repeated here.

The device can include systems (including distributed systems), software (applications), modules, components, servers, clients, and the like that use the methods described in the embodiments of this specification, combined with the necessary implementation hardware. Based on the same innovative concept, the devices in one or more embodiments provided by the present disclosure are described in the following embodiments. Because the device solves the problem in a way similar to the method, the implementation of the specific devices in the embodiments of this specification can refer to the implementation of the foregoing method, and repeated details are omitted. As used below, the term "unit" or "module" can be a combination of software and/or hardware that realizes a predetermined function. Although the devices described in the following embodiments are preferably implemented in software, implementations in hardware, or in a combination of software and hardware, are also possible and contemplated.

在一个实施例中,如图6所示,提供了一种标题筛选装置600,所述装置可以为前述服务器,或者集成于所述服务器的模块、组件、器件、单元等。该装置600可以包括:In one embodiment, as shown in FIG. 6 , atitle screening apparatus 600 is provided, and the apparatus may be the aforementioned server, or a module, component, device, unit, etc. integrated in the server. Thedevice 600 may include:

分类模块602,用于获取文本中的已识别标题,所述已识别标题信息包括文本样式属性、文本大小属性、文本数字编号,根据文本样式属性、文本大小属性将已识别标题进行分类,得到多个类别的已识别标题;Theclassification module 602 is used to obtain the identified titles in the text. The identified title information includes text style attributes, text size attributes, and text number numbers, and classifies the identified titles according to the text style attributes and text size attributes to obtain multiple identified titles for categories;

未识别标题获取模块604,用于根据所述文本样式属性、文本大小属性获取文本中的未识别标题;An unrecognizedtitle acquisition module 604, configured to acquire unrecognized titles in the text according to the text style attribute and text size attribute;

排序模块606,用于将同一类别的已识别标题、未识别标题加入同一标题序列,将所述标题序列中的标题按照文本数字编号进行排序,若所述标题序列的文本数字编号连续,所述标题序列中的标题为文本中的标题。Thesorting module 606 is used to add the identified titles and unidentified titles of the same category to the same title sequence, sort the titles in the title sequence according to the text numbers, if the text numbers of the title sequence are continuous, the The titles in the title sequence are the titles in the text.

在一个实施例中,相同文本样式属性的标题根据文本大小属性获取标题序列,所述文本大小属性小于预设的文本阈值。In one embodiment, titles with the same text style attribute acquire a title sequence according to a text size attribute, and the text size attribute is smaller than a preset text threshold.

在一个实施例中,所述标题序列的文本数字编号连续包括所述标题序列中存在相同的文本数字编号。In one embodiment, the sequence of text-numeric numbers of the title sequence comprises the presence of the same text-numeric number in the title sequence.

在一个实施例中,所述标题序列中存在相同的文本数字编号,将相同的文本数字编号中除首位之外的其余文本数字编号删除,得到连续的文本数字编号。In one embodiment, if there are identical textual numerals in the title sequence, the remaining textual numerals in the same textual numerals except the first one are deleted to obtain continuous textual numerals.

在一个实施例中,所述预设的文本阈值小于或等于1.5磅。In one embodiment, the preset text threshold is less than or equal to 1.5 points.

As for the apparatus in the foregoing embodiments, the specific manner in which each module performs its operations has been described in detail in the embodiments of the method and is not repeated here.

Each module of the title screening apparatus described above may be implemented wholly or partly in software, hardware, or a combination of the two. The modules may be embedded in, or independent of, a processor of a computer device in hardware form, or stored in a memory of the computer device in software form, so that the processor can call them and perform the operations corresponding to each module.

In one embodiment, a computer device is provided. The computer device may be a server, and its internal structure may be as shown in FIG. 7. The computer device includes a processor, a memory, and a network interface connected by a system bus. The processor of the computer device provides computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for running the operating system and the computer program stored in the non-volatile storage medium. The database of the computer device stores recognized titles and unrecognized titles. The network interface of the computer device communicates with an external terminal over a network connection. When executed by the processor, the computer program implements a title screening method.

In one embodiment, a computer device is provided. The computer device may be a terminal, and its internal structure may be as shown in FIG. 8. The computer device includes a processor, a memory, a communication interface, a display screen, and an input device connected by a system bus. The processor of the computer device provides computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for running the operating system and the computer program stored in the non-volatile storage medium. The communication interface of the computer device communicates with an external terminal in a wired or wireless manner; the wireless manner may be implemented through Wi-Fi, a mobile cellular network, NFC (near field communication), or other technologies. When executed by the processor, the computer program implements the title screening method. The display screen of the computer device may be a liquid crystal display or an electronic ink display. The input device of the computer device may be a touch layer covering the display screen, a button, trackball, or touchpad provided on the housing of the computer device, or an external keyboard, touchpad, or mouse.

Those skilled in the art will understand that the structures shown in FIG. 7 and FIG. 8 are only block diagrams of the parts relevant to the solution of the present disclosure and do not limit the computer devices to which the solution is applied. A specific computer device may include more or fewer components than shown in the figures, combine certain components, or arrange the components differently.

In one embodiment, a computer-readable storage medium is provided, on which a computer program is stored. When executed by a processor, the computer program implements the method described in any embodiment of the present disclosure.

In one embodiment, a computer program product is provided, including a computer program. When executed by a processor, the computer program implements the method described in any embodiment of the present disclosure.

Those of ordinary skill in the art will understand that all or part of the processes in the methods of the above embodiments can be carried out by a computer program instructing the relevant hardware. The computer program may be stored in a non-volatile computer-readable storage medium and, when executed, may include the processes of the method embodiments described above. Any reference to a memory, database, or other medium used in the embodiments provided by the present disclosure may include at least one of non-volatile and volatile memory. Non-volatile memory may include read-only memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, high-density embedded non-volatile memory, resistive random access memory (ReRAM), magnetoresistive random access memory (MRAM), ferroelectric random access memory (FRAM), phase change memory (PCM), graphene memory, and the like. Volatile memory may include random access memory (RAM), external cache memory, and the like. By way of illustration and not limitation, RAM may take many forms, such as static random access memory (SRAM) or dynamic random access memory (DRAM). The databases involved in the embodiments provided by the present disclosure may include at least one of a relational database and a non-relational database. Non-relational databases may include, without limitation, blockchain-based distributed databases. The processors involved in the embodiments provided by the present disclosure may be, without limitation, general-purpose processors, central processing units, graphics processors, digital signal processors, programmable logic devices, data processing logic devices based on quantum computing, and the like.

The technical features of the above embodiments may be combined arbitrarily. For brevity, not every possible combination of these technical features is described. However, as long as a combination of these technical features involves no contradiction, it should be considered within the scope of this specification.

The embodiments described above express only several implementations of the present disclosure. Their description is relatively specific and detailed, but it should not be construed as limiting the patent scope of the present disclosure. It should be noted that those of ordinary skill in the art can make a number of variations and improvements without departing from the concept of the present disclosure, and these all fall within the protection scope of the present disclosure. The protection scope of the present disclosure is therefore defined by the appended claims.

Claims (13)

Application CN202310308319.XA | Priority date 2023-03-27 | Filing date 2023-03-27 | Title screening method, device and computer equipment | Pending | CN116306605A (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202310308319.XA, CN116306605A (en) | 2023-03-27 | 2023-03-27 | Title screening method, device and computer equipment

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN202310308319.XA, CN116306605A (en) | 2023-03-27 | 2023-03-27 | Title screening method, device and computer equipment

Publications (1)

Publication Number | Publication Date
CN116306605A (en) | 2023-06-23

Family

ID=86779630

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN202310308319.XA, Pending, CN116306605A (en) | Title screening method, device and computer equipment | 2023-03-27 | 2023-03-27

Country Status (1)

Country | Link
CN (1) | CN116306605A (en)

Legal Events

Code | Title
PB01 | Publication
SE01 | Entry into force of request for substantive examination
CB02 | Change of applicant information

CB02 details:
Country or region after: China
Address after: No. 8 Huizhi Street, Suzhou Industrial Park, Suzhou Area, China (Jiangsu) Pilot Free Trade Zone, Suzhou City, Jiangsu Province, 215000
Applicant after: Qichacha Technology Co.,Ltd.
Address before: Room 503, 5/F, C1 building, 88 Dongchang Road, Suzhou Industrial Park, 215000, Jiangsu Province
Applicant before: Qicha Technology Co.,Ltd.
Country or region before: China
