







Technical Field
The present application relates to the field of computer technology, and in particular to a title screening method, a title screening apparatus, and a computer device.
Background
A text may contain multiple titles that differ in format, attributes, and font size, while titles of the same type share the same title elements. The content of a text can be understood by obtaining its titles.
In the related art, title elements in a text can be identified by a layout analysis model, but the recognition accuracy of the layout analysis model is low. A title is generally preceded by a text numeral. The text numerals of the titles obtained with the layout analysis model may be discontinuous, so not all titles in the text can be obtained.
Summary
In view of the above technical problem, it is necessary to provide a title screening method that obtains the text style attribute, text size attribute, and text numeral from the information of the titles recognized by a layout analysis model, and thereby identifies titles in the text that the model failed to recognize.
In a first aspect, the present application provides a title screening method. The method includes:
obtaining recognized titles in a text, where the information of each recognized title includes a text style attribute, a text size attribute, and a text numeral, and classifying the recognized titles according to the text style attribute and the text size attribute to obtain multiple categories of recognized titles;
obtaining unrecognized titles in the text according to the text style attribute and the text size attribute;
adding the recognized titles and unrecognized titles of the same category to the same title sequence, and sorting the titles in the title sequence by text numeral, where, if the text numerals of the title sequence are continuous, the titles in the title sequence are the titles in the text.
In one embodiment, titles with the same text style attribute are formed into a title sequence according to the text size attribute, where the text size attribute is smaller than a preset text threshold.
In one embodiment, the text numerals of the title sequence being continuous includes the case where identical text numerals exist in the title sequence.
In one embodiment, when identical text numerals exist in the title sequence, all of the identical text numerals except the first are deleted to obtain continuous text numerals.
In one embodiment, the preset text threshold is less than or equal to 1.5 points.
In a second aspect, the present application further provides a title screening apparatus. The apparatus includes:
a classification module, configured to obtain recognized titles in a text, where the information of each recognized title includes a text style attribute, a text size attribute, and a text numeral, and to classify the recognized titles according to the text style attribute and the text size attribute to obtain multiple categories of recognized titles;
an unrecognized title acquisition module, configured to obtain unrecognized titles in the text according to the text style attribute and the text size attribute;
a sorting module, configured to add the recognized titles and unrecognized titles of the same category to the same title sequence and to sort the titles in the title sequence by text numeral, where, if the text numerals of the title sequence are continuous, the titles in the title sequence are the titles in the text.
In one embodiment, titles with the same text style attribute are formed into a title sequence according to the text size attribute, where the text size attribute is smaller than a preset text threshold.
In one embodiment, the text numerals of the title sequence being continuous includes the case where identical text numerals exist in the title sequence.
In one embodiment, when identical text numerals exist in the title sequence, all of the identical text numerals except the first are deleted to obtain continuous text numerals.
In one embodiment, the preset text threshold is less than or equal to 1.5 points.
In a third aspect, the present disclosure further provides a computer device. The computer device includes a memory and a processor, the memory stores a computer program, and the processor implements the steps of the title screening method when executing the computer program.
In a fourth aspect, the present disclosure further provides a computer-readable storage medium. The computer-readable storage medium stores a computer program which, when executed by a processor, implements the steps of the title screening method.
In a fifth aspect, the present disclosure further provides a computer program product. The computer program product includes a computer program which, when executed by a processor, implements the steps of the title screening method.
The above title screening method provides at least the following beneficial effects:
In the embodiments provided by the present disclosure, title information is obtained from the recognized titles produced by the layout analysis model, and unrecognized titles in the text are obtained according to that title information. The recognized titles and unrecognized titles of the same category are added to the same title sequence, and by judging whether the text numerals in the title sequence are continuous, all titles present in the text can be obtained.
It should be understood that the above general description and the following detailed description are merely exemplary and explanatory, and do not limit the present disclosure.
Brief Description of the Drawings
To describe the technical solutions in the embodiments of the present disclosure or in the conventional technology more clearly, the drawings required for describing the embodiments or the conventional technology are briefly introduced below. The drawings described below are merely some embodiments of the present disclosure, and a person of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a diagram of the application environment of a title screening method in one embodiment;
Fig. 2 is a schematic flowchart of a title screening method in one embodiment;
Fig. 3 is a recognized title information table in one embodiment;
Fig. 4 is an unrecognized title table in one embodiment;
Fig. 5 is a schematic diagram of a title sequence in one embodiment;
Fig. 6 is a structural block diagram of a title screening apparatus in one embodiment;
Fig. 7 is an internal structure diagram of a computer device in one embodiment;
Fig. 8 is an internal structure diagram of a server in one embodiment.
Detailed Description
To enable a person of ordinary skill in the art to better understand the technical solutions of the present disclosure, the technical solutions in the embodiments of the present disclosure are described clearly and completely below with reference to the drawings.
It should be noted that the terms "first", "second", and the like in the specification, the claims, and the above drawings of the present disclosure are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that data so used are interchangeable where appropriate, so that the embodiments of the present disclosure described herein can be implemented in orders other than those illustrated or described herein. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatuses and methods consistent with some aspects of the present disclosure as detailed in the appended claims. The terms "include", "comprise", and any other variants thereof are intended to cover a non-exclusive inclusion, so that a process, method, product, or device that includes a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, product, or device. Without further limitation, the presence of additional identical or equivalent elements in the process, method, product, or device that includes the stated elements is not excluded. For example, words such as "first" and "second", where used, denote names and do not indicate any particular order.
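An embodiment of the present disclosure provides a title screening method that can be applied in the application environment shown in Fig. 1. The terminal 102 communicates with the server 104 over a network. A data storage system can store the data that the server 104 needs to process. The data storage system may be integrated on the server 104, or placed on the cloud or on another network server. The terminal 102 may be, but is not limited to, any of various personal computers, notebook computers, smartphones, tablet computers, Internet of Things devices, and portable wearable devices. The Internet of Things devices may be smart speakers, smart TVs, smart air conditioners, smart in-vehicle devices, and the like. The portable wearable devices may be smart watches, smart bracelets, head-mounted devices, and the like. The server 104 may be implemented as an independent server or as a server cluster composed of multiple servers.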
In some embodiments of the present disclosure, as shown in Fig. 2, a title screening method is provided. The method is described using the example of its application to the server in Fig. 1 for processing titles. It can be understood that the method may be applied to a server, or to a system that includes a terminal and a server, in which case it is implemented through interaction between the terminal and the server. In one specific embodiment, the method may include the following steps:
S202: Obtain recognized titles in a text, where the information of each recognized title includes a text style attribute, a text size attribute, and a text numeral, and classify the recognized titles according to the text style attribute and the text size attribute to obtain multiple categories of recognized titles.
A text may contain multiple types of titles, and different types of titles may have different text style attributes and text size attributes. The recognized titles in the text can be obtained with a layout analysis model. However, the recognition accuracy of the layout analysis model is low, so unrecognized titles may remain in the text. The text style attribute may include styles such as theme font, bold, and color. The text size attribute indicates the font size of the title. The text numeral indicates the ordinal of the title within its level, for example "第一章节" (Chapter 1), "一" (the Chinese numeral one), or "1".
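The source does not say how the text numeral or the wildcard style string is extracted from a title. The following is a minimal sketch under stated assumptions: the numeral is the first Chinese or Arabic numeral before the delimiter "、", only well-formed numerals from 1 to 99 are handled, and the function names `cn_to_int` and `split_title` are hypothetical.

```python
import re
from typing import Optional, Tuple

# Chinese digits 1-9; "十" (ten) is handled separately in cn_to_int.
CN_DIGITS = {"一": 1, "二": 2, "三": 3, "四": 4, "五": 5,
             "六": 6, "七": 7, "八": 8, "九": 9}
NUMERAL = re.compile(r"[一二三四五六七八九十]+|\d+")


def cn_to_int(numeral: str) -> int:
    """Convert an Arabic digit string or a well-formed Chinese numeral (1-99)."""
    if numeral.isdigit():
        return int(numeral)
    if "十" in numeral:
        tens, _, ones = numeral.partition("十")
        return (CN_DIGITS[tens] if tens else 1) * 10 + (CN_DIGITS[ones] if ones else 0)
    return CN_DIGITS[numeral]


def split_title(text: str) -> Optional[Tuple[str, int]]:
    """Return (style, numeral) for a title, or None if no numeral precedes "、"."""
    head, sep, _ = text.partition("、")
    if not sep:
        return None
    match = NUMERAL.search(head)
    if match is None:
        return None
    # Replace the numeral with "*" to obtain the style string, e.g. "第*节、".
    style = head[:match.start()] + "*" + head[match.end():] + sep
    return style, cn_to_int(match.group())


print(split_title("第一节、声明"))
print(split_title("二、报告"))
```

A real implementation would likely need additional delimiters and numeral forms; those are outside what the source specifies.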
In some embodiments of the present disclosure, the text style attribute, text size attribute, and text numeral in the recognized title information are obtained, and recognized titles with the same text style attribute and the same text size attribute are assigned to the same category, yielding multiple categories of recognized titles. Fig. 3 is a recognized title information table in one embodiment. It includes the page on which each title appears, its text content, text style attribute, text size attribute, and text numeral. For example, the title whose text content is "第一节、声明" (Section 1, Statement) has the text style attribute "第*节、", and the title whose text content is "二、报告" (2, Report) has the text style attribute "*、". The titles in the recognized title information table can be divided into two groups: the titles with serial numbers 1 and 7 form group A, and the titles with serial numbers 2, 3, 4, 5, and 6 form group B. In this way the recognized titles are classified according to the text style attribute and the text size attribute, and multiple categories of recognized titles are obtained.
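The classification in S202 can be sketched as grouping by the pair (text style attribute, text size attribute). This is an illustrative sketch only: the `Title` record, the `classify_titles` function, and the sample rows are assumptions, since the actual contents of Fig. 3 are not reproduced in the text.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Title:
    page: int      # page on which the title appears
    text: str      # text content
    style: str     # text style attribute, e.g. "第*节、" or "*、"
    size: float    # text size attribute, in points
    number: int    # text numeral


def classify_titles(titles):
    """Group titles that share the same (style, size) pair into one category."""
    groups = defaultdict(list)
    for title in titles:
        groups[(title.style, title.size)].append(title)
    return dict(groups)


# Hypothetical recognized titles (not the actual rows of Fig. 3).
recognized = [
    Title(1, "第一节、声明", "第*节、", 16.0, 1),
    Title(2, "一、释义", "*、", 14.0, 1),
    Title(3, "三、总结", "*、", 14.0, 3),
    Title(9, "第二节、说明", "第*节、", 16.0, 2),
]
groups = classify_titles(recognized)
for key, members in groups.items():
    print(key, [t.number for t in members])
```

Using the exact size as part of the key assumes that sizes within a category are identical; a tolerance-based comparison may be more appropriate for sizes measured from rendered pages.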
S204: Obtain unrecognized titles in the text according to the text style attribute and the text size attribute.
The recognized title information table of Fig. 3 provides the recognized title information, including the text style attribute, text size attribute, and text numeral, and the recognized titles have already been classified according to the text style attribute and the text size attribute. Because a title summarizes the text, its text style attribute and text size attribute differ from those of ordinary text. The text style attributes and text size attributes used by titles in the text can therefore be derived from the recognized title information, and the unrecognized titles in the text can be obtained according to those attributes. An unrecognized title is a potential title, and ordinary text is all text other than titles. Some ordinary text may quote a title, so the unrecognized titles can be screened a second time to judge whether they meet the requirements.
Fig. 4 is an unrecognized title table in one embodiment. It includes the text content, text style attribute, and text size attribute of each title, and the group to which it belongs. According to the text style attribute and the text size attribute, the texts with serial numbers 1 and 3 are assigned to group B, and the text with serial number 2 is assigned to group A.
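One way to picture S204 is to scan the text lines that were not recognized as titles and keep those whose style pattern and size match a known category. The sketch below is an assumption-laden illustration: the regular expressions in `STYLE_PATTERNS`, the function `find_unrecognized`, and the sample lines are hypothetical and are not taken from the source.

```python
import re

# Hypothetical patterns for the two style attributes mentioned in the text.
CN_NUM = "[一二三四五六七八九十]+"
STYLE_PATTERNS = {
    "第*节、": re.compile(rf"^第{CN_NUM}节、"),
    "*、": re.compile(rf"^{CN_NUM}、"),
}


def find_unrecognized(lines, known_categories, recognized_texts):
    """Return (text, style, size) for lines that look like titles of a known category.

    lines: iterable of (text, size) pairs for every line in the document.
    known_categories: set of (style, size) pairs taken from the recognized titles.
    recognized_texts: set of title texts the layout model already found.
    """
    candidates = []
    for text, size in lines:
        if text in recognized_texts:
            continue
        for style, pattern in STYLE_PATTERNS.items():
            if pattern.match(text) and (style, size) in known_categories:
                candidates.append((text, style, size))
                break
    return candidates


lines = [
    ("一、释义", 14.0),
    ("二、报告", 14.0),
    ("本节为正文内容。", 10.5),
    ("三、如上文所述,略。", 10.5),   # title-like prefix, but body-text size
    ("第二节、说明", 16.0),
]
known = {("第*节、", 16.0), ("*、", 14.0)}
found = find_unrecognized(lines, known, recognized_texts={"一、释义"})
print(found)
```

The size check is what filters out ordinary text that merely begins like a title; the secondary screening mentioned in the description would still be applied afterwards.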
S206: Add the recognized titles and unrecognized titles of the same category to the same title sequence, and sort the titles in the title sequence by text numeral. If the text numerals of the title sequence are continuous, the titles in the title sequence are the titles in the text.
A title sequence contains titles of a single category. The recognized titles and unrecognized titles of the same category are added to the same title sequence, and the text numerals in the title sequence are used to judge whether all titles of that category in the text have been identified. If the text numerals in the title sequence are continuous, with no gap in the numbering, the title sequence that includes the recognized and unrecognized titles is a complete title sequence containing all titles of that category in the text. Continuity may include text numerals that follow an increasing or decreasing trend, or a run of continuous text numerals in which identical text numerals appear.
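Fig. 5 is a schematic diagram of a title sequence in one embodiment. The titles with text numerals 1 and 3 are recognized titles, and the title with text numeral 2 is an unrecognized title. Whether the current sequence is a complete title sequence can then be judged by checking whether the text numerals are continuous. A note can be added after the title with text numeral 2 to indicate that it was an unrecognized title. Continuous text numerals can be expressed as [1, 2, 3] or [1, 2, 3, 4]. The first text numeral is 1. If the first numeral is not 1, the current sequence can also be judged not to be a complete title sequence. If the text numerals are [1, 3, 7] or [1, 7, 3], the current sequence is judged not to be a complete title sequence.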
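The completeness check described for S206 can be sketched as follows. This is one possible reading rather than the definitive rule: the numerals are sorted in ascending order, the first must be 1, and each step must be 0 (an identical numeral) or 1. The function name `is_complete_sequence` is hypothetical.

```python
def is_complete_sequence(numbers):
    """Return True if the sorted numerals start at 1 and have no gaps.

    Repeated numerals are tolerated, matching the embodiment in which
    identical text numerals still count as continuous.
    """
    if not numbers:
        return False
    ordered = sorted(numbers)
    if ordered[0] != 1:
        return False
    return all(b - a in (0, 1) for a, b in zip(ordered, ordered[1:]))


# Recognized titles 1 and 3 plus the unrecognized title 2 form a complete sequence.
print(is_complete_sequence([1, 3, 2]))
print(is_complete_sequence([1, 3, 7]))
```

Sorting first means that input order does not matter, which is why [1, 3, 7] and [1, 7, 3] give the same answer.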
In the above title screening method, recognized titles are obtained with a layout analysis model, and unrecognized titles in the text are obtained from the text style attribute and text size attribute of the recognized titles. The recognized titles and unrecognized titles of the same category are added to the same title sequence, and by judging whether the text numerals in the title sequence are continuous, all titles present in the text can be obtained.
In some embodiments of the present disclosure, titles with the same text style attribute are formed into a title sequence according to the text size attribute, where the text size attribute is smaller than a preset text threshold.
A title sequence contains titles of a single category, and the recognized titles are classified according to the text style attribute and the text size attribute to obtain multiple categories of recognized titles. Titles with the same text style attribute are obtained first. Because titles with the same text style attribute may include both higher-level and lower-level titles, they are then divided into sequences according to the text size attribute.
In some embodiments of the present disclosure, the text numerals of the title sequence being continuous includes the case where identical text numerals exist in the title sequence.
After the recognized titles and unrecognized titles of the same category are added to the same title sequence, whether the title sequence is complete can be judged by checking whether the text numerals are continuous. If adjacent text numerals are identical, the text numerals are still considered continuous. Identical text numerals arise because the same title may appear among both the recognized titles and the unrecognized titles.
In some embodiments of the present disclosure, when identical text numerals exist in the title sequence, all of the identical text numerals except the first are deleted to obtain continuous text numerals.
When the same title appears among both the recognized titles and the unrecognized titles, identical text numerals may occur. The first of the identical text numerals is kept and the rest are deleted, which yields continuous text numerals.
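Keeping only the first of several identical text numerals can be sketched as an order-preserving de-duplication. The function name `drop_duplicate_numbers` and the (numeral, text) tuple layout are assumptions made for illustration.

```python
def drop_duplicate_numbers(titles):
    """Keep the first title for each text numeral and drop later repeats.

    titles: list of (numeral, text) pairs, already sorted by numeral.
    """
    seen = set()
    kept = []
    for number, text in titles:
        if number in seen:
            continue  # a later entry with the same numeral is discarded
        seen.add(number)
        kept.append((number, text))
    return kept


sequence = [(1, "一、释义"), (2, "二、报告"), (2, "二、报告"), (3, "三、总结")]
deduplicated = drop_duplicate_numbers(sequence)
print([number for number, _ in deduplicated])
```

Which copy counts as "first" depends on how the merged sequence was ordered before this step; the source does not specify whether recognized or unrecognized titles take precedence.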
In some embodiments of the present disclosure, the preset text threshold is less than or equal to 1.5 points.
Titles with the same text style attribute are formed into a title sequence according to the text size attribute, where the text size attribute is smaller than the preset text threshold. When the text style attributes are identical, parent titles and subtitles may both be present, so the titles need to be divided again by text size attribute. The text sizes of titles at different levels generally differ by 2 or more, so the preset text threshold can be set to less than or equal to 1.5 points. This suggests that the threshold is compared against the difference in text size between titles, so that titles within one sequence differ in size by less than the threshold. For example, the parent title is "一、释义" (1, Definitions) and the subtitle is "一、说明" (1, Description). Their text style attributes are the same but their text size attributes differ, so the parent title and the subtitle can be divided into two different title sequences according to the text size attribute.
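A possible reading of the size threshold is sketched below: titles that share a style are sorted by size, and a title joins the current group only if its size differs from the group's largest size by less than the threshold. The source does not spell out the comparison, so this interpretation, the function `split_by_size`, and the sample sizes are all assumptions.

```python
SIZE_THRESHOLD_PT = 1.5  # preset text threshold, in points


def split_by_size(titles, threshold=SIZE_THRESHOLD_PT):
    """Split same-style titles into groups whose sizes differ by less than threshold.

    titles: list of (text, size) pairs that share one text style attribute.
    Returns a list of groups, ordered from largest to smallest text size.
    """
    groups = []
    for text, size in sorted(titles, key=lambda item: item[1], reverse=True):
        # Compare against the first (largest) member of the current group.
        if groups and groups[-1][0][1] - size < threshold:
            groups[-1].append((text, size))
        else:
            groups.append([(text, size)])
    return groups


same_style = [
    ("一、释义", 16.0),   # parent title
    ("一、说明", 14.0),   # subtitle with the same style
    ("二、范围", 16.5),
    ("二、细节", 13.5),
]
size_groups = split_by_size(same_style)
print([[size for _, size in group] for group in size_groups])
```

With levels that differ by 2 points or more, a threshold of at most 1.5 points keeps parent titles and subtitles in separate groups while tolerating small size variations within a level.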
It should be understood that although the steps in the flowcharts of the above embodiments are shown in the order indicated by the arrows, they are not necessarily executed in that order. Unless explicitly stated herein, there is no strict restriction on the order in which these steps are executed, and they may be executed in other orders. Moreover, at least some of the steps in the flowcharts of the above embodiments may include multiple sub-steps or stages. These sub-steps or stages are not necessarily completed at the same moment and may be executed at different moments. Nor are they necessarily executed sequentially; they may be executed in turn or alternately with other steps, or with at least some of the sub-steps or stages of other steps.
Based on the same inventive concept, an embodiment of the present disclosure further provides a title screening apparatus for implementing the title screening method described above. The solution provided by the apparatus is similar to the implementation described in the above method. For the specific limitations in the title screening apparatus embodiments provided below, reference may therefore be made to the limitations of the title screening method above, which are not repeated here.
The apparatus may include a system (including a distributed system), software (an application), a module, a component, a server, a client, or the like that uses the method described in the embodiments of this specification, combined with the necessary implementing hardware. Based on the same inventive concept, the apparatus in one or more embodiments of the present disclosure is described in the following embodiments. Because the apparatus solves the problem in a manner similar to the method, the implementation of the specific apparatus in the embodiments of this specification may refer to the implementation of the foregoing method, and repeated content is omitted. As used below, the term "unit" or "module" may be a combination of software and/or hardware that implements a predetermined function. Although the apparatus described in the following embodiments is preferably implemented in software, implementation in hardware, or in a combination of software and hardware, is also possible and contemplated.
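In one embodiment, as shown in Fig. 6, a title screening apparatus 600 is provided. The apparatus may be the aforementioned server, or a module, component, device, unit, or the like integrated in the server. The apparatus 600 may include:
a classification module 602, configured to obtain recognized titles in a text, where the information of each recognized title includes a text style attribute, a text size attribute, and a text numeral, and to classify the recognized titles according to the text style attribute and the text size attribute to obtain multiple categories of recognized titles;
an unrecognized title acquisition module 604, configured to obtain unrecognized titles in the text according to the text style attribute and the text size attribute;
a sorting module 606, configured to add the recognized titles and unrecognized titles of the same category to the same title sequence and to sort the titles in the title sequence by text numeral, where, if the text numerals of the title sequence are continuous, the titles in the title sequence are the titles in the text.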
In one embodiment, titles with the same text style attribute are formed into a title sequence according to the text size attribute, where the text size attribute is smaller than a preset text threshold.
In one embodiment, the text numerals of the title sequence being continuous includes the case where identical text numerals exist in the title sequence.
In one embodiment, when identical text numerals exist in the title sequence, all of the identical text numerals except the first are deleted to obtain continuous text numerals.
In one embodiment, the preset text threshold is less than or equal to 1.5 points.
Regarding the apparatus in the above embodiments, the specific manner in which each module performs its operations has been described in detail in the embodiments of the method and is not elaborated here.
Each module in the above title screening apparatus may be implemented wholly or partly in software, hardware, or a combination thereof. The modules may be embedded in, or independent of, a processor in a computer device in hardware form, or stored in a memory of the computer device in software form, so that the processor can invoke them and execute the operations corresponding to each module.
In one embodiment, a computer device is provided. The computer device may be a server, and its internal structure may be as shown in Fig. 7. The computer device includes a processor, a memory, and a network interface connected through a system bus. The processor of the computer device provides computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for running the operating system and the computer program stored in the non-volatile storage medium. The database of the computer device stores the recognized titles and unrecognized titles. The network interface of the computer device communicates with an external terminal through a network connection. When executed by the processor, the computer program implements a title screening method.
In one embodiment, a computer device is provided. The computer device may be a terminal, and its internal structure may be as shown in Fig. 8. The computer device includes a processor, a memory, a communication interface, a display screen, and an input device connected through a system bus. The processor of the computer device provides computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for running the operating system and the computer program stored in the non-volatile storage medium. The communication interface of the computer device communicates with an external terminal in a wired or wireless manner. The wireless manner may be implemented through WIFI, a mobile cellular network, NFC (near field communication), or other technologies. When executed by the processor, the computer program implements the title screening method. The display screen of the computer device may be a liquid crystal display or an electronic ink display. The input device of the computer device may be a touch layer covering the display screen, a button, trackball, or touchpad provided on the housing of the computer device, or an external keyboard, touchpad, or mouse.
A person skilled in the art can understand that the structures shown in Fig. 7 and Fig. 8 are merely block diagrams of the partial structures related to the solution of the present disclosure and do not limit the computer device to which the solution is applied. A specific computer device may include more or fewer components than shown in the figures, combine certain components, or have a different arrangement of components.
In one embodiment, a computer-readable storage medium is provided. It stores a computer program which, when executed by a processor, implements the method described in any embodiment of the present disclosure.
In one embodiment, a computer program product is provided. It includes a computer program which, when executed by a processor, implements the method described in any embodiment of the present disclosure.
A person of ordinary skill in the art can understand that all or part of the processes in the methods of the above embodiments can be completed by a computer program instructing the relevant hardware. The computer program may be stored in a non-volatile computer-readable storage medium and, when executed, may include the processes of the above method embodiments. Any reference to a memory, database, or other medium used in the embodiments provided by the present disclosure may include at least one of non-volatile memory and volatile memory. Non-volatile memory may include read-only memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, high-density embedded non-volatile memory, resistive random access memory (ReRAM), magnetoresistive random access memory (MRAM), ferroelectric random access memory (FRAM), phase change memory (PCM), graphene memory, and the like. Volatile memory may include random access memory (RAM), external cache memory, and the like. By way of illustration and not limitation, RAM may take many forms, such as static random access memory (SRAM) or dynamic random access memory (DRAM). The databases involved in the embodiments provided by the present disclosure may include at least one of a relational database and a non-relational database. Non-relational databases may include, without limitation, blockchain-based distributed databases. The processors involved in the embodiments provided by the present disclosure may be, without limitation, general-purpose processors, central processing units, graphics processing units, digital signal processors, programmable logic devices, data processing logic devices based on quantum computing, and the like.
The technical features of the above embodiments may be combined arbitrarily. For brevity, not every possible combination of these technical features is described; however, as long as a combination of these technical features involves no contradiction, it should be considered within the scope of this specification.
The above embodiments express only several implementations of the present disclosure. Their description is relatively specific and detailed, but it should not be construed as limiting the patent scope of the present disclosure. It should be noted that those of ordinary skill in the art can make a number of modifications and improvements without departing from the concept of the present disclosure, and all of these fall within the protection scope of the present disclosure. The protection scope of the present disclosure is therefore determined by the appended claims.
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202310308319.XA / CN116306605A (en) | 2023-03-27 | 2023-03-27 | Title screening method, device and computer equipment |
| Publication Number | Publication Date |
|---|---|
| CN116306605A (en) | 2023-06-23 |
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202310308319.XA (Pending) / CN116306605A (en) | 2023-03-27 | 2023-03-27 | Title screening method, device and computer equipment |
| Country | Link |
|---|---|
| CN (1) | CN116306605A (en) |
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN112231468A (en)* | 2020-10-15 | 2021-01-15 | 平安科技(深圳)有限公司 | Information generation method and device, electronic equipment and storage medium |
| WO2021068684A1 (en)* | 2019-10-11 | 2021-04-15 | 平安科技(深圳)有限公司 | Method and apparatus for automatically generating document directory, computer device and storage medium |
| CN112818687A (en)* | 2021-03-25 | 2021-05-18 | 杭州数澜科技有限公司 | Method, device, electronic equipment and storage medium for constructing title recognition model |
| CN114330313A (en)* | 2021-11-30 | 2022-04-12 | 广州金山移动科技有限公司 | Method and apparatus, electronic device, and storage medium for identifying document chapter titles |
| CN114691919A (en)* | 2022-04-02 | 2022-07-01 | 广州故新智能科技有限责任公司 | Text format auditing module for financial long text rechecking system |
| CN115393127A (en)* | 2022-08-18 | 2022-11-25 | 北京航天情报与信息研究所 | Method and device for optimizing collection of legal and legal items |
| Publication | Title | Publication Date |
|---|---|---|
| TWI718643B (en) | Method and device for identifying abnormal groups | |
| CN102999551B (en) | The automatization of data entity divides scope | |
| US20210089614A1 (en) | Automatically Styling Content Based On Named Entity Recognition | |
| CN113761185A (en) | Main key extraction method, equipment and storage medium | |
| US8954838B2 (en) | Presenting data in a tabular format | |
| CN113343102A (en) | Data recommendation method and device based on feature screening, electronic equipment and medium | |
| CN114168836A (en) | Webpage data analysis and visualization method and device, electronic equipment and medium | |
| US9430528B2 (en) | Grid queries | |
| US20170323007A1 (en) | Identifier Based Glyph Search | |
| CN112561642A (en) | Multidimensional product comparison analysis method and device, computer equipment and storage medium | |
| CN110837559B (en) | Statement sample set generation method, electronic device and storage medium | |
| CN115048536A (en) | Knowledge graph generation method and device, computer equipment and storage medium | |
| CN117373046A (en) | Information extraction method, device, computer equipment and storage medium | |
| CN115827864A (en) | Processing method for automatic classification of bulletins | |
| CN116306605A (en) | Title screening method, device and computer equipment | |
| CN117389960A (en) | File parsing method, apparatus, device, storage medium and program product | |
| CN116561181A (en) | Data query method, device, computer equipment, and computer-readable storage medium | |
| CN114610749B (en) | Database execution statement optimization method, apparatus, device, medium and program product | |
| Viswanathan et al. | R data analysis cookbook | |
| CN116702024B (en) | Method, device, computer equipment and storage medium for identifying type of stream data | |
| CN115098686B (en) | Method, device and computer equipment for determining classification information | |
| CN116578583B (en) | Abnormal statement identification method, device, equipment and storage medium | |
| CN107608947A (en) | Html file processing method and processing device, electronic equipment | |
| CN119961949A (en) | Contract text desensitization method, computer device, readable storage medium and program product | |
| CN115146051A (en) | Sample processing method, apparatus, computer equipment and storage medium |
| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| | CB02 | Change of applicant information | Country or region after: China. Address after: No. 8 Huizhi Street, Suzhou Industrial Park, Suzhou Area, China (Jiangsu) Pilot Free Trade Zone, Suzhou City, Jiangsu Province, 215000. Applicant after: Qichacha Technology Co., Ltd. Address before: Room 503, 5/F, C1 building, 88 Dongchang Road, Suzhou Industrial Park, 215000, Jiangsu Province. Applicant before: Qicha Technology Co., Ltd. Country or region before: China. |