CN112507684A

Info

Publication number: CN112507684A
Application number: CN202011378235.6A
Authority: CN
Inventors: 郑烨翰; 罗雨
Original assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Current assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Priority date: 2020-11-30
Filing date: 2020-11-30
Publication date: 2021-03-16
Anticipated expiration: 2040-11-30
Also published as: CN112507684B

Abstract

Embodiments of the application disclose a method and an apparatus for detecting original text, an electronic device, and a computer-readable storage medium, and relate to the technical fields of natural language processing, knowledge graphs, cloud services, and deep learning. One embodiment of the method comprises: extracting a theme from an acquired text to be detected; extracting subject-predicate-object (SPO) triples from the text to be detected; and calculating the degree of similarity between the theme and SPO triples of the text to be detected and those of a public text, and determining, based on the degree of similarity, whether the text to be detected is an original text. To identify non-original text that has undergone more complex rewriting, the embodiment compares the themes and SPO triples that express the content of the text to be detected and of the public text. It can thereby identify more accurately whether the two texts are substantially equivalent or highly similar in content, which makes the original-text detection result more accurate.

Description

Method and device for detecting original text, electronic equipment and storage medium

Technical Field

The present application relates to the field of data processing technologies, specifically to natural language processing, knowledge graph, cloud service, and deep learning technologies, and more particularly to a method and an apparatus for detecting original text, an electronic device, and a computer-readable storage medium.

Background

As information becomes increasingly digitized and the channels for acquiring it multiply, people can access a wealth of information. However, the same information is easily republished online by others as original content, either verbatim or after simple rewriting. Verifying whether content is genuinely original is therefore important for promoting the creation of high-quality content, protecting authors' copyright, curbing the spread of low-quality information by marketing accounts, and combating academic misconduct.

The prior art matches identical text and measures literal text similarity directly, so it can only identify non-original text that has been plagiarized verbatim or simply rewritten.

Disclosure of Invention

Embodiments of the application provide a method and an apparatus for detecting original text, an electronic device, and a computer-readable storage medium.

In a first aspect, an embodiment of the present application provides a method for detecting original text, including: extracting a theme from an acquired text to be detected; extracting subject-predicate-object (SPO) triples from the text to be detected; and calculating the degree of similarity between the theme and SPO triples of the text to be detected and those of a public text, and determining, based on the degree of similarity, whether the text to be detected is an original text.

In a second aspect, an embodiment of the present application provides an apparatus for detecting original text, including: a theme extraction unit configured to extract a theme from an acquired text to be detected; an SPO triple extraction unit configured to extract subject-predicate-object (SPO) triples from the text to be detected; and an original text determination unit configured to calculate the degree of similarity between the theme and SPO triples of the text to be detected and those of a public text, and to determine, based on the degree of similarity, whether the text to be detected is an original text.

In a third aspect, an embodiment of the present application provides an electronic device, including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor, and the instructions, when executed, cause the at least one processor to perform the method for detecting original text described in any implementation of the first aspect.

In a fourth aspect, the present application provides a non-transitory computer-readable storage medium storing computer instructions for enabling a computer to implement the method for detecting original text as described in any implementation manner of the first aspect when executed.

According to the method, the apparatus, the electronic device, and the computer-readable storage medium for detecting original text provided by the embodiments of the application, a theme is first extracted from an acquired text to be detected; SPO triples are then extracted from the text to be detected; finally, the degree of similarity between the theme and SPO triples of the text to be detected and those of a public text is calculated, and whether the text to be detected is an original text is determined based on the degree of similarity. To identify non-original text that has undergone more complex rewriting, the themes and SPO triples that express the content of the two texts are compared, so that whether the two texts are substantially equivalent or highly similar in content can be identified more accurately, which makes the original-text detection result more accurate.

It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.

Drawings

Other features, objects and advantages of the present application will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, made with reference to the accompanying drawings in which:

FIG. 1 is an exemplary system architecture to which the present application may be applied;

FIG. 2 is a flowchart of a method for detecting original text according to an embodiment of the present application;

FIG. 3 is a flowchart of another method for detecting original text according to an embodiment of the present application;

FIG. 4 is a flowchart of a method for determining whether a text is an original text according to an embodiment of the present application;

FIG. 5 is a block diagram illustrating the structure of an apparatus for detecting original text according to an embodiment of the present application;

FIG. 6 is a schematic structural diagram of an electronic device suitable for executing the method for detecting original text according to an embodiment of the present application.

Detailed Description

The present application will be described in further detail with reference to the following drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the relevant invention and not restrictive of the invention. It should be noted that, for convenience of description, only the portions related to the related invention are shown in the drawings.

It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict. The present application will be described in detail below with reference to the embodiments with reference to the attached drawings.

FIG. 1 illustrates an exemplary system architecture 100 to which embodiments of the method, apparatus, electronic device, and computer-readable storage medium for detecting original text of the present application may be applied.

As shown in FIG. 1, the system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 serves as a medium for providing communication links between the terminal devices 101, 102, 103 and the server 105. The network 104 may include various connection types, such as wired or wireless communication links, or fiber optic cables.

A user may use the terminal devices 101, 102, 103 to interact with the server 105 via the network 104, for example to receive or send messages. Various applications for information communication between the terminal devices 101, 102, 103 and the server 105 may be installed on them, such as an original text detection application, a data transmission application, and an instant messaging application.

The terminal devices 101, 102, 103 and the server 105 may be hardware or software. When the terminal devices 101, 102, 103 are hardware, they may be various electronic devices with display screens, including but not limited to smartphones, tablet computers, laptop computers, and desktop computers. When the terminal devices 101, 102, 103 are software, they may be installed in the electronic devices listed above and may be implemented as multiple pieces of software or software modules, or as a single piece of software or software module, which is not limited herein. When the server 105 is hardware, it may be implemented as a distributed server cluster composed of multiple servers, or as a single server; when the server 105 is software, it may likewise be implemented as multiple pieces of software or software modules, or as a single one, which is not limited herein.

The server 105 may provide various services through built-in applications. Taking as an example an original text detection application that provides a service for detecting whether a text is original, the server 105 may achieve the following when running the application: first, it receives, via the network 104, a text to be detected sent by an author through the terminal devices 101, 102, 103; then, it extracts a theme and SPO triples from the text to be detected; finally, it calculates the degree of similarity between the theme and SPO triples of the text to be detected and those of a public text, and determines, based on the degree of similarity, whether the text to be detected is an original text. Further, the server 105 may return corresponding response information to the author according to the detection result.

It should be noted that the text to be detected may be acquired from the terminal devices 101, 102, 103 through the network 104, or may be stored locally on the server 105 in advance in various ways. Thus, when the server 105 detects that such data is already stored locally (e.g., pending original text detection tasks left over from before processing started), it may retrieve the data directly from local storage, in which case the exemplary system architecture 100 may omit the terminal devices 101, 102, 103 and the network 104.

The method for detecting original text provided in the following embodiments of the present application is generally executed by the server 105, which has stronger computing power and more computing resources, so that the result can be obtained as quickly as possible. Accordingly, the apparatus for detecting original text is also generally provided in the server 105. However, when the terminal devices 101, 102, 103 also have computing capabilities and resources that meet the requirements, they may complete the operations otherwise performed by the server 105 through the original text detection application installed on them, and output the same result as the server 105. In particular, when several terminal devices with different computing capabilities are present at the same time, and the original text detection application determines that a given terminal device has strong computing capability and ample idle computing resources, that terminal device may perform the computation to relieve the computing load on the server 105. In that case the apparatus for detecting original text may be provided in the terminal devices 101, 102, 103, and the exemplary system architecture 100 may omit the server 105 and the network 104.

It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.

Referring to FIG. 2, FIG. 2 is a flowchart of a method for detecting original text according to an embodiment of the present application, where the process 200 includes the following steps:

Step 201: extracting a theme from the acquired text to be detected;

In this step, the execution body of the method for detecting original text (for example, the server 105 shown in FIG. 1) extracts a theme from the acquired text to be detected.

The text to be detected is a text on which original-text detection needs to be performed. It may be a text submitted to the execution body by its author, a text submitted by someone else on the author's behalf, or a designated text included in a detection instruction; this is not specifically limited here.

The text to be detected may be obtained in a variety of ways. For example, the execution body may obtain it from a local text storage path (to distinguish it from texts already detected, a text to be detected may carry a to-be-detected tag that is attached only to such texts), may receive it over a network from an author's authoring terminal (for example, a terminal device shown in FIG. 1), or may obtain it from an output port of a text generation service, where the text generation service may be a service that automatically generates a complete text from given keywords.

The theme of the text to be detected is the core meaning expressed by its content, so the theme allows the content to be understood more accurately at the semantic level. The theme may include the intention of the text, the emotional tendency it expresses, the object it describes, and the like. Further, because a text to be detected usually includes several chapters, each chapter several paragraphs, and each paragraph several sentences, it is difficult to obtain the complete theme of the text directly. The themes of each chapter, paragraph, and sentence may therefore be obtained first and then summarized as required to obtain the complete theme of the text to be detected.

Step 202: extracting subject-predicate-object triples from the text to be detected;

Building on step 201, in this step the execution body extracts subject-predicate-object triples from the text to be detected. A subject-predicate-object triple is also called an SPO triple, and it is usually extracted from a sentence, the smallest unit of an article. S abbreviates subject: the subject of each sentence, that is, the initiator of the action, usually a noun or pronoun placed at the beginning of the sentence. P abbreviates predicate: the predicate of each sentence, a verb in any of its tenses, usually immediately following the subject. O abbreviates object: the object of each sentence, that is, the recipient of the action, usually a noun or pronoun appearing after the predicate verb.

This step extracts SPO triples rather than other sentence components (such as complements, attributives, and adverbials) because the subject, predicate, and object express the actual meaning of a sentence accurately with the fewest components. In some special sentences, one or two of the three may be missing; in that case, the missing components can be supplied from the surrounding context so that the meaning of the sentence can still be determined accurately.
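The following is a minimal Python sketch of how an SPO triple with possibly missing components might be represented and completed from context. The `SPOTriple` class and the `fill_missing_subjects` helper are hypothetical names introduced only for illustration, and carrying the previous subject forward is just one simple heuristic, not a strategy stated in the application.

```python
from dataclasses import dataclass, replace
from typing import List, Optional


@dataclass(frozen=True)
class SPOTriple:
    """One subject-predicate-object triple; any component may be missing."""
    subject: Optional[str]
    predicate: Optional[str]
    obj: Optional[str]


def fill_missing_subjects(triples: List[SPOTriple]) -> List[SPOTriple]:
    """Fill a missing subject with the most recent explicit subject.

    Assumption: the preceding sentence is the relevant context. A real
    system would need proper coreference or ellipsis resolution.
    """
    filled = []
    last_subject = None
    for triple in triples:
        if triple.subject is not None:
            last_subject = triple.subject
        elif last_subject is not None:
            triple = replace(triple, subject=last_subject)
        filled.append(triple)
    return filled
```

For example, given the triples for "Li Lei wrote a novel" followed by a subject-less "published it", the second triple would inherit "Li Lei" as its subject.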

Step 203: calculating the degree of similarity between the theme and SPO triples of the text to be detected and those of the public text, and determining whether the text to be detected is an original text based on the degree of similarity.

Building on step 201 and step 202, in this step the execution body calculates the degree of similarity between the theme and SPO triples extracted from the text to be detected and the theme and SPO triples extracted from the public text, and determines, based on the calculated degree of similarity, whether the text to be detected is an original text.

Unlike the prior art, which calculates similarity only at the literal level, the comparison here is based on the theme extracted in step 201, which represents the core meaning, and the SPO triples extracted in step 202, which express the actual content of the text. It therefore compares the meaning and actual content of the two texts, which makes it easier to expose "pseudo-original" texts that change the form but not the substance (literally, "changing the soup but not the medicine").

To identify non-original text that has undergone more complex rewriting, the method for detecting original text provided by this embodiment compares the themes and SPO triples that express the content of the text to be detected and of the public text. It can thereby identify more accurately whether the two texts are substantially equivalent or highly similar in content, which makes the original-text detection result more accurate.

Referring to FIG. 3, FIG. 3 is a flowchart of another method for detecting original text according to an embodiment of the present application, where the process 300 includes the following steps:

Step 301: acquiring a text to be detected;

Step 302: splitting the text to be detected into at least one paragraph, and splitting each paragraph into at least one sentence;

Step 303: extracting a core phrase from each paragraph and each sentence, and taking the core phrase as the theme of the corresponding paragraph or sentence;

For a text to be detected that does not contain chapters, steps 302-303 first split the text into several paragraphs and each paragraph into several sentences, then extract from each paragraph and each sentence a core phrase expressing its core meaning, and finally take the extracted core phrase as the theme of the corresponding paragraph or sentence. Paragraph and sentence splitting can be performed with a Chinese natural language processing tool that takes the structure of Chinese text into account. The core phrase can be extracted in various ways: for example, keywords obtained by a keyword extraction technique can be used as the core phrase, or the importance of each word in a sentence can be determined with TF-IDF (Term Frequency-Inverse Document Frequency) so that the core phrase can be extracted more reliably.
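As an illustration of steps 302-303, the sketch below splits a text into paragraphs and sentences and ranks the words of each unit by TF-IDF. It is a simplified stand-in under stated assumptions: paragraphs are separated by blank lines, tokens are found with a plain `\w+` regular expression (Chinese text would need a word segmenter first), and the function names are hypothetical rather than taken from the application.

```python
import math
import re
from collections import Counter


def split_paragraphs(text):
    """Split on blank lines (assumed paragraph separator)."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def split_sentences(paragraph):
    """Split after Chinese or Latin sentence-ending punctuation."""
    parts = re.split(r"(?<=[。！？.!?])\s*", paragraph)
    return [s.strip() for s in parts if s.strip()]


def core_phrases(units, top_k=1):
    """Return the top_k highest-scoring TF-IDF tokens for each text unit."""
    tokenized = [re.findall(r"\w+", u.lower()) for u in units]
    n_units = len(tokenized)
    # Document frequency: in how many units does each token occur?
    doc_freq = Counter(tok for toks in tokenized for tok in set(toks))
    phrases = []
    for toks in tokenized:
        if not toks:
            phrases.append([])
            continue
        term_freq = Counter(toks)
        scores = {
            tok: (count / len(toks)) * math.log(n_units / doc_freq[tok])
            for tok, count in term_freq.items()
        }
        # Sort by descending score, breaking ties alphabetically.
        ranked = sorted(scores, key=lambda tok: (-scores[tok], tok))
        phrases.append(ranked[:top_k])
    return phrases
```

Words that occur in every unit receive a score of zero, so frequent function words drop out of the ranking without a stop-word list.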

Further, when the complete theme of the text to be detected is needed, the themes extracted from the paragraphs and sentences may be statistically analyzed and summarized, and the theme of the whole text distilled from them. This process may also take into account the title of the text to be detected (including the main title and any subtitle). When the text to be detected clearly belongs to a larger body of text (for example, a chapter of a novel or of a book), its complete theme may additionally be determined by considering the themes of the other parts of that larger body and the contextual relations between them, so as to improve accuracy.

Step 304: identifying entity texts in the text to be detected by using a knowledge-graph entity recognition technique;

Step 305: extracting associated texts that have a subject-predicate-object relation with the entity texts by using a knowledge-graph relation extraction technique;

Step 306: generating SPO triples from the entity texts and the corresponding associated texts;

For step 202 in process 200, steps 304-306 of this embodiment provide a specific implementation: first, knowledge-graph-based entity recognition and relation extraction are used to obtain each entity text contained in the text to be detected and the associated text that has a subject-predicate-object relation with it; then SPO triples are generated from the entity texts and their associated texts. The knowledge graph is used because a knowledge graph that records sentence components, and which words typically serve as which components in an article, is very helpful for this purpose. The relationships between entities and nodes, which the knowledge graph records as a network, also help in extracting SPO triples.

Of course, when the above purpose cannot be achieved through the knowledge graph, other technologies capable of achieving similar effects may be adopted instead, and the selection may be flexible according to the actual application scenario.

Step 307: calculating the vector similarity between the vectorized SPO triples of the text to be detected and those of the public text in the same vector description space;

A vectorized SPO triple is the result of converting an SPO triple from text form into vector form.

In this step, the execution body uses the vectorized SPO triples to calculate the similarity between the SPO triples of the text to be detected and those of the public text in the same vector description space.
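The sketch below illustrates the idea of step 307 with a deliberately simple vector space. It assumes a bag-of-words vector over a vocabulary shared by both texts and uses cosine similarity; the application does not specify the vectorization or the similarity measure, and a practical system would more likely use learned embeddings. All function names are hypothetical.

```python
import math
from collections import Counter


def build_vocab(*triple_lists):
    """Shared vocabulary, so both texts live in the same vector space."""
    return sorted({
        word
        for triples in triple_lists
        for triple in triples
        for word in " ".join(triple).lower().split()
    })


def vectorize(triple, vocab):
    """Bag-of-words vector for one (subject, predicate, object) tuple."""
    counts = Counter(" ".join(triple).lower().split())
    return [float(counts[word]) for word in vocab]


def cosine(u, v):
    """Cosine similarity; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def best_match_similarities(candidate, published):
    """For each candidate triple, the highest similarity to any published triple."""
    vocab = build_vocab(candidate, published)
    published_vecs = [vectorize(t, vocab) for t in published]
    return [
        max((cosine(vectorize(t, vocab), p) for p in published_vecs), default=0.0)
        for t in candidate
    ]
```

Counting how many of the returned values exceed a chosen cut-off gives a triple-level similarity rate of the kind used in the application scenario later in this description.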

Step 308: determining the number of similar themes, and the number of similar themes with the same distribution, between the text to be detected and the public text;

In this step, the execution body uses the number of similar themes and the number of similar themes with the same distribution as measures of theme similarity between the text to be detected and the public text. The number of similar themes with the same distribution is the number of similar themes located at the same position in both texts; it reflects, as far as possible, cases where the two texts express the same viewpoint in the same place.
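A minimal sketch of step 308 follows. It assumes each text's themes are stored as a dictionary from a position (here a hypothetical `(paragraph_index, sentence_index)` pair) to a theme phrase, and it treats two themes as similar when their word-level Jaccard overlap reaches a chosen cut-off. Both the representation and the similarity test are illustrative assumptions, not details given in the application.

```python
def jaccard(a, b):
    """Word-level Jaccard overlap between two theme phrases."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 0.0


def count_similar_themes(candidate, published, min_sim=0.5):
    """Return (number of similar themes, number of those at the same position).

    candidate and published map a position to a theme phrase.
    """
    similar = 0
    same_position = 0
    for position, theme in candidate.items():
        matching_positions = [
            pos for pos, other in published.items()
            if jaccard(theme, other) >= min_sim
        ]
        if matching_positions:
            similar += 1
            if position in matching_positions:
                same_position += 1
    return similar, same_position
```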

Step 309: determining the degree of similarity between the theme and SPO triples of the text to be detected and those of the public text, based on the number of similar themes, the number of similar themes with the same distribution, and the vector similarity;

Building on step 307 and step 308, in this step the execution body determines the degree of similarity between the theme and SPO triples of the text to be detected and those of the public text, based on the number of similar themes, the number of similar themes with the same distribution, and the vector similarity.

For the first half of step 203 in process 200, this embodiment provides a specific implementation through steps 307-309: the SPO triples are converted into vector form and the similarity between the vectors is calculated; the theme similarity between the text to be detected and the public text is determined from the number of similar themes and the number of similar themes with the same distribution; and finally the degree of similarity between the two texts is determined by combining the theme similarity and the SPO triple similarity, which yields a more accurate and comprehensive result.

Step 310: determining whether the text to be detected is an original text based on the degree of similarity.

This step is the same as the second half of step 203 in the process 200 shown in FIG. 2. For the shared content, refer to the corresponding part of the previous embodiment, which is not repeated here.

On the basis of the previous embodiment, this embodiment provides, through steps 302-303, a specific implementation for extracting a theme from the text to be detected; through steps 304-306, a specific implementation for extracting SPO triples from the text to be detected; and, through steps 307-309, a specific implementation for calculating the degree of similarity between the text to be detected and the public text from the themes and SPO triples.

On the basis of any of the above embodiments, to make the final determination of whether a text is original as accurate as possible, parameters such as text similarity, text repetition, and repetition rate may be combined with the theme similarity and SPO similarity provided above, to assist the judgment at the literal level. The following takes text similarity as an example to describe a specific implementation; other parameters may participate in the same manner:

Referring to FIG. 4, FIG. 4 is a flowchart of a method for determining whether a text is an original text according to an embodiment of the present application, where the process 400 includes the following steps:

Step 401: acquiring the text similarity between the text to be detected and the public text;

Step 402: respectively acquiring a first weight and a second weight which are distributed for the similarity degree and the text similarity in advance;

the first weight is greater than the second weight, that is, the text similarity is used as an auxiliary judgment factor, for example, the first weight is 72% and the second weight is 28%.

Step 403: calculating a comprehensive similarity from the degree of similarity weighted by the first weight and the text similarity weighted by the second weight;

Step 404: determining that the text to be detected is a non-original text in response to the comprehensive similarity exceeding a preset threshold;

Step 405: determining that the text to be detected is an original text in response to the comprehensive similarity not exceeding the preset threshold.
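Steps 401-405 can be sketched as follows. The default weights are the 72% and 28% given as an example above; the function names and the `threshold` parameter are hypothetical, and no particular threshold value is implied by this part of the description.

```python
def comprehensive_similarity(degree, text_similarity, w_degree=0.72, w_text=0.28):
    """Weighted sum of the theme/SPO similarity degree and the literal text similarity."""
    if w_degree <= w_text:
        # Text similarity is only an auxiliary factor, so it must weigh less.
        raise ValueError("the first weight must be greater than the second weight")
    return w_degree * degree + w_text * text_similarity


def is_original(degree, text_similarity, threshold, **weights):
    """A text counts as original when the comprehensive similarity does not exceed the threshold."""
    return comprehensive_similarity(degree, text_similarity, **weights) <= threshold
```

For instance, with a hypothetical threshold of 0.6, a similarity degree of 0.9 and a text similarity of 0.2 would give a comprehensive similarity of about 0.70, so the text would be treated as non-original.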

In order to deepen understanding, the application also provides a specific implementation scheme by combining a specific application scene:

A high-quality original content community provides high-quality original content to its registered users, creating a healthy ecosystem built on originality. The core of the community is an original text detection service implemented by a combination of manual and automatic strategies. The service is hosted on the community's server A, which provides a communication port for the community's creators.

1) The server A receives an article to be detected uploaded by a creator X;

2) Server A calls its built-in original text detection strategy to extract theme information M from each paragraph and each sentence of the article to be detected, and extracts SPO information N from each sentence with the help of a knowledge graph;

The theme information M comprises 100 themes and their distribution positions in the article, and the SPO information N comprises at least 70 distinct SPO triples, each with a subject, a predicate, and an object.

3) By comparing the theme information M with the theme information Mo of the public articles, server A determines that the public article closest in theme to the article to be detected shares 14 similar themes with it, of which only 6 occupy the same distribution positions. The theme similarity rate is therefore only 14/70 = 1/5 = 20%. (Because 6/14 < 50%, the same-position count is not used as an influencing factor here. If the proportion of same-position similar themes among the similar themes exceeded 50%, it would affect how the theme similarity rate is calculated, for example by adding a fixed 10% to the existing result.)

4) Server A converts the SPO information N into vectors N1 in a preset vector description space and then calculates their similarity to the vectors N0 of the public article, using the vector distance as the similarity between two vectors. It determines that 52 of the 70 vectors in N1 have a similarity above 60% to the SPO triple vectors of the public article, giving an SPO triple similarity rate of 52/70 ≈ 74.3%;

5) Server A weights the theme similarity rate, the SPO triple similarity rate, and the conventional text repetition rate by their preset weights and sums them, as follows:

74.3% × 0.8 + 20% × 0.15 + 40% (text repetition rate) × 0.05 = 64.44%;

6) Because 64.44% > 45% (the preset similarity threshold for non-original text), server A returns, through the communication port, a notification to creator X that the article cannot be published as original text.
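The arithmetic in steps 3) to 6) can be checked with a short sketch. The rates, weights, the 50% rule with its 10% bonus, and the 45% threshold are taken from the scenario above; the function and variable names are hypothetical. With the terms as listed in step 5), the weighted sum evaluates to roughly 64.44%, which is above the 45% threshold.

```python
def theme_similarity_rate(similar, total, same_position, bonus=0.10):
    """Share of similar themes, plus a fixed bonus when more than half
    of the similar themes sit at the same position (example rule from step 3)."""
    rate = similar / total
    if similar and same_position / similar > 0.5:
        rate += bonus
    return rate


theme_rate = theme_similarity_rate(similar=14, total=70, same_position=6)  # 6/14 < 50%, no bonus
spo_rate = round(52 / 70, 3)   # 52 of 70 triple vectors above the 60% cut-off
text_repetition_rate = 0.40

# Preset weights from step 5: SPO 0.8, theme 0.15, text repetition 0.05.
overall = 0.8 * spo_rate + 0.15 * theme_rate + 0.05 * text_repetition_rate

THRESHOLD = 0.45
publishable_as_original = overall <= THRESHOLD
print(f"overall similarity: {overall:.2%}, original: {publishable_as_original}")
```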

With further reference to fig. 5, as an implementation of the method shown in the above figures, the present application provides an embodiment of an apparatus for detecting an original text, where the embodiment of the apparatus corresponds to the embodiment of the method shown in fig. 2, and the apparatus may be applied to various electronic devices.

As shown in fig. 5, theapparatus 500 for detecting an original text of the present embodiment may include: the system comprises atheme extraction unit 501, a principal and predicatetriple extraction unit 502 and an originaltext determination unit 503. Thetheme extracting unit 501 is configured to extract a theme from the acquired text to be detected; a predicate elementtriple extracting unit 502 configured to extract a predicate element triple from the text to be detected; the originaltext determining unit 503 is configured to calculate a similarity degree between the subject and the predicate triple of the text to be detected and the public text, and determine whether the text to be detected is the original text based on the similarity degree.

In the apparatus 500 for detecting an original text of the present embodiment, the specific processing of the topic extraction unit 501, the SPO triple extraction unit 502, and the original text determination unit 503, and the technical effects thereof, are described in connection with steps 201-203 in the embodiment corresponding to Fig. 2 and are not repeated here.

In some optional implementations of this embodiment, the topic extraction unit 501 may be further configured to:

split the text to be detected into at least one paragraph, and split each paragraph into at least one sentence;

and extract a core phrase from each paragraph and each sentence respectively, and take the core phrase as the topic of the corresponding paragraph or sentence.
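The two operations of the topic extraction unit can be sketched as follows. This is a minimal illustration, not the method of the application: the blank-line paragraph rule, the punctuation-based sentence rule, the stopword list, and the "most frequent non-stopword word" heuristic for the core phrase are all assumptions, since the text does not specify how the core phrase is obtained.

```python
import re
from collections import Counter

# Hypothetical stopword list used only by this toy heuristic.
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "it", "on", "for"}

def split_text(text):
    """Split text into paragraphs (blank-line separated), each a list of sentences."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    return [
        [s.strip() for s in re.split(r"(?<=[.!?])\s+", p) if s.strip()]
        for p in paragraphs
    ]

def core_phrase(fragment):
    """Toy stand-in for core phrase extraction: the most frequent non-stopword word."""
    words = [w for w in re.findall(r"[a-z]+", fragment.lower()) if w not in STOPWORDS]
    return Counter(words).most_common(1)[0][0] if words else ""

def extract_topics(text):
    """Return (location, topic) pairs for every paragraph and every sentence."""
    topics = []
    for p_idx, sentences in enumerate(split_text(text)):
        topics.append((("paragraph", p_idx), core_phrase(" ".join(sentences))))
        for s_idx, sentence in enumerate(sentences):
            topics.append((("sentence", p_idx, s_idx), core_phrase(sentence)))
    return topics
```

Keeping the paragraph and sentence indices alongside each topic preserves where a topic occurs, which a later comparison of topic distribution between two texts would need.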

In some optional implementations of this embodiment, the SPO triple extraction unit 502 may be further configured to:

identify entity text in the text to be detected by using the entity recognition technology of a knowledge graph;

extract associated text that has a subject-predicate-object relationship with the entity by using the relation extraction technology of a knowledge graph;

and generate an SPO triple from the subject text and the corresponding associated text.
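The shape of the data produced by this unit can be illustrated with a toy sketch. The two lookup sets below are hypothetical stand-ins for the entity recognition and relation extraction services of a knowledge graph, which the text does not describe in detail; a real system would not rely on whitespace tokens or fixed word lists.

```python
from typing import List, NamedTuple

class SPOTriple(NamedTuple):
    subject: str
    predicate: str
    obj: str

# Hypothetical lookup tables standing in for knowledge-graph services.
KNOWN_ENTITIES = {"Alice", "Acme"}
KNOWN_PREDICATES = {"founded", "acquired"}

def extract_triples(sentence: str) -> List[SPOTriple]:
    """Emit a triple wherever a known entity is directly followed by a known predicate."""
    tokens = sentence.rstrip(".").split()
    triples = []
    # Stop two tokens early so that a predicate and an object always follow.
    for i, token in enumerate(tokens[:-2]):
        if token in KNOWN_ENTITIES and tokens[i + 1] in KNOWN_PREDICATES:
            obj = " ".join(tokens[i + 2:])
            triples.append(SPOTriple(token, tokens[i + 1], obj))
    return triples

print(extract_triples("Alice founded Acme."))
```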

In some optional implementations of this embodiment, the original text determination unit 503 may include a similarity degree calculation subunit configured to calculate the degree of similarity between the respective topics and SPO triples of the text to be detected and the public text, and the similarity degree calculation subunit may be further configured to:

calculate, in the same vector description space, the vector similarity of the vectorized SPO triples of the text to be detected and of the public text, where a vectorized SPO triple is the result of converting an SPO triple in text form into vector form;

determine the number of similar topics and the number of similar topics with the same distribution between the text to be detected and the public text;

and determine the degree of similarity between the respective topics and SPO triples of the text to be detected and the public text based on the number of similar topics, the number of similar topics with the same distribution, and the vector similarity.
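The vector similarity part of this calculation might look like the sketch below. It assumes cosine similarity as the vector measure and a 0.6 cut-off for counting a triple as similar; both are illustrative choices, and the function names are hypothetical. How the topic counts are combined with the vector similarity is not specified in this passage, so that step is not modeled here.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def spo_similarity_rate(candidate_vectors, public_vectors, min_similarity=0.6):
    """Share of candidate SPO vectors that are similar to at least one public SPO vector."""
    if not candidate_vectors:
        return 0.0
    matched = sum(
        1
        for c in candidate_vectors
        if any(cosine_similarity(c, p) > min_similarity for p in public_vectors)
    )
    return matched / len(candidate_vectors)

# Three candidate vectors compared against one public vector.
rate = spo_similarity_rate([[1, 0], [0, 1], [1, 1]], [[1, 0]])
print(f"{rate:.1%}")
```

Both sets of vectors must come from the same embedding, which is what placing them in the same vector description space amounts to.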

In some optional implementations of the present embodiment, the original text determination unit 503 may include an original text determination subunit configured to determine, based on the degree of similarity, whether the text to be detected is original text, and the original text determination subunit may include:

a comprehensive determination module configured to determine whether the text to be detected is original text based on the degree of similarity and the text similarity.

In some optional implementations of this embodiment, the comprehensive determination module may be further configured to:

acquire a first weight and a second weight assigned in advance to the degree of similarity and the text similarity, respectively, where the first weight is greater than the second weight;

calculate a comprehensive similarity from the degree of similarity weighted by the first weight and the text similarity weighted by the second weight;

determine that the text to be detected is non-original text in response to the comprehensive similarity exceeding a preset threshold;

and determine that the text to be detected is original text in response to the comprehensive similarity not exceeding the preset threshold.
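The decision rule of the comprehensive determination module can be written as one small function. The sketch assumes a plain weighted sum; the default weights 0.8 and 0.2 are hypothetical, and the default threshold 0.45 is borrowed from the worked example only for illustration. The one constraint taken from the text is that the first weight must be greater than the second.

```python
def is_original_text(similarity_degree, text_similarity,
                     first_weight=0.8, second_weight=0.2, threshold=0.45):
    """Return (is_original, comprehensive_similarity) for similarities given in [0, 1]."""
    if first_weight <= second_weight:
        raise ValueError("the first weight must be greater than the second weight")
    comprehensive = first_weight * similarity_degree + second_weight * text_similarity
    # Exceeding the threshold means non-original; otherwise the text is original.
    return comprehensive <= threshold, comprehensive

original, score = is_original_text(0.7, 0.4)
print(original, round(score, 2))
```

Weighting the topic and SPO similarity more heavily than the surface text similarity lets a heavily reworded copy still be flagged even when its literal overlap with the public text is low.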

In order to identify non-original text that has undergone more complex rewriting, the apparatus for detecting an original text provided in this embodiment compares the topics and SPO triples that express the content of the text to be detected with those of the public text. This identifies more accurately whether the two texts are substantially identical or highly similar in content, so that the detection result for original text is more accurate.

According to an embodiment of the present application, an electronic device and a computer-readable storage medium are also provided.

Fig. 6 shows a block diagram of an electronic device suitable for implementing the method for detecting original text of the embodiments of the present application. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing devices, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are examples only and are not meant to limit the implementations of the present application described and/or claimed herein.

As shown in Fig. 6, the electronic device includes one or more processors 601, a memory 602, and interfaces for connecting the components, including a high-speed interface and a low-speed interface. The components are interconnected by different buses and may be mounted on a common motherboard or in other manners as desired. The processor may process instructions executed within the electronic device, including instructions stored in or on the memory for displaying graphical information of a GUI on an external input/output apparatus (such as a display device coupled to the interface). In other embodiments, multiple processors and/or multiple buses may be used together with multiple memories, as desired. Likewise, multiple electronic devices may be connected, with each device providing part of the necessary operations (e.g., as a server array, a group of blade servers, or a multi-processor system). Fig. 6 takes one processor 601 as an example.

The memory 602 is the non-transitory computer-readable storage medium provided herein. The memory stores instructions executable by at least one processor, so that the at least one processor performs the method for detecting original text provided herein. The non-transitory computer-readable storage medium of the present application stores computer instructions for causing a computer to perform the method for detecting original text provided by the present application.

The memory 602, as a non-transitory computer-readable storage medium, may be used to store non-transitory software programs, non-transitory computer-executable programs, and modules, such as the program instructions/modules corresponding to the method for detecting an original text in the embodiments of the present application (for example, the topic extraction unit 501, the SPO triple extraction unit 502, and the original text determination unit 503 shown in Fig. 5). The processor 601 executes the various functional applications and data processing of the server by running the non-transitory software programs, instructions, and modules stored in the memory 602, that is, it implements the method for detecting an original text in the above method embodiments.

The memory 602 may include a storage program area and a storage data area, wherein the storage program area may store an operating system and an application program required for at least one function; the storage data area may store various types of data created by the electronic device in performing the method for detecting the original text, and the like. Further, the memory 602 may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid-state storage device. In some embodiments, the memory 602 optionally includes memory remotely located from the processor 601, and these remote memories may be connected over a network to an electronic device adapted to perform the method for detecting the original text. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.

The electronic device adapted to perform the method for detecting an original text may further include an input device 603 and an output device 604. The processor 601, the memory 602, the input device 603, and the output device 604 may be connected by a bus or other means; Fig. 6 illustrates connection by a bus as an example.

The input device 603 may receive input numeric or character information and generate key signal inputs related to user settings and function control of an electronic device suitable for performing the method for detecting the original text. Examples of the input device include a touch screen, a keypad, a mouse, a track pad, a touch pad, a pointing stick, one or more mouse buttons, a track ball, and a joystick. The output device 604 may include a display device, auxiliary lighting devices (e.g., LEDs), and tactile feedback devices (e.g., vibrating motors), among others. The display device may include, but is not limited to, a Liquid Crystal Display (LCD), a Light Emitting Diode (LED) display, and a plasma display. In some implementations, the display device can be a touch screen.

Various implementations of the systems and techniques described here can be realized in digital electronic circuitry, integrated circuitry, application-specific integrated circuits (ASICs), computer hardware, firmware, software, and/or combinations thereof. These various implementations may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.

These computer programs (also known as programs, software applications, or code) include machine instructions for a programmable processor, and may be implemented using high-level procedural and/or object-oriented programming languages, and/or assembly/machine languages. As used herein, the terms "machine-readable medium" and "computer-readable medium" refer to any computer program product, apparatus, and/or device (e.g., magnetic discs, optical disks, memory, Programmable Logic Devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term "machine-readable signal" refers to any signal used to provide machine instructions and/or data to a programmable processor.

To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.

The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include Local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.

The computing system may include clients and servers. A client and a server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, also called a cloud computing server or a cloud host, which is a host product in a cloud computing service system and addresses the high management difficulty and weak service scalability of conventional physical hosts and Virtual Private Server (VPS) services.

In order to identify non-original text that has undergone more complex rewriting, the embodiments of the present application compare the topics and SPO triples that express the content of the text to be detected with those of the public text. This identifies more accurately whether the two texts are substantially identical or highly similar in content, so that the detection result for original text is more accurate.

It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present application may be executed in parallel, sequentially, or in different orders, and the present invention is not limited thereto as long as the desired results of the technical solutions disclosed in the present application can be achieved.

The above-described embodiments should not be construed as limiting the scope of the present application. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims

1. A method for detecting original text, comprising:

extracting a topic from an acquired text to be detected;

extracting subject-predicate-object triples from the text to be detected; and

calculating a degree of similarity between the respective topics and subject-predicate-object triples of the text to be detected and a public text, and determining, based on the degree of similarity, whether the text to be detected is original text.

2. The method according to claim 1, wherein extracting the topic from the acquired text to be detected comprises:

splitting the text to be detected into at least one paragraph, and splitting each paragraph into at least one sentence; and

extracting a core phrase from each paragraph and each sentence respectively, and taking the core phrase as the topic of the corresponding paragraph or sentence.

3. The method according to claim 1, wherein extracting the subject-predicate-object triples from the text to be detected comprises:

identifying entity text in the text to be detected by using the entity recognition technology of a knowledge graph;

extracting associated text that has a subject-predicate-object relationship with the entity by using the relation extraction technology of a knowledge graph; and

generating the subject-predicate-object triples according to the subject text and the corresponding associated text.

4. The method according to claim 1, wherein calculating the degree of similarity between the respective topics and subject-predicate-object triples of the text to be detected and the public text comprises:

calculating, in the same vector description space, a vector similarity of the respective vectorized subject-predicate-object triples of the text to be detected and the public text, wherein a vectorized subject-predicate-object triple is the result of converting a subject-predicate-object triple in text form into vector form;

determining a number of similar topics and a number of similar topics with the same distribution between the text to be detected and the public text; and

determining the degree of similarity between the respective topics and subject-predicate-object triples of the text to be detected and the public text based on the number of similar topics, the number of similar topics with the same distribution, and the vector similarity.

5. The method according to any one of claims 1-4, wherein determining, based on the degree of similarity, whether the text to be detected is original text comprises:

obtaining a text similarity between the text to be detected and the public text; and

determining whether the text to be detected is original text based on the degree of similarity and the text similarity.

6. The method according to claim 5, wherein determining whether the text to be detected is original text based on the degree of similarity and the text similarity comprises:

acquiring a first weight and a second weight assigned in advance to the degree of similarity and the text similarity, respectively, wherein the first weight is greater than the second weight;

calculating a comprehensive similarity from the degree of similarity weighted by the first weight and the text similarity weighted by the second weight;

determining that the text to be detected is non-original text in response to the comprehensive similarity exceeding a preset threshold; and

determining that the text to be detected is original text in response to the comprehensive similarity not exceeding the preset threshold.

7. An apparatus for detecting original text, comprising:

a topic extraction unit configured to extract a topic from an acquired text to be detected;

a subject-predicate-object triple extraction unit configured to extract subject-predicate-object triples from the text to be detected; and

an original text determination unit configured to calculate a degree of similarity between the respective topics and subject-predicate-object triples of the text to be detected and a public text, and to determine, based on the degree of similarity, whether the text to be detected is original text.

8. The apparatus according to claim 7, wherein the topic extraction unit is further configured to:

9. The apparatus according to claim 7, wherein the subject-predicate-object triple extraction unit is further configured to:

10. The apparatus according to claim 7, wherein the original text determination unit comprises a similarity degree calculation subunit configured to calculate the degree of similarity between the respective topics and subject-predicate-object triples of the text to be detected and the public text, and the similarity degree calculation subunit is further configured to:

11. The apparatus according to any one of claims 7-10, wherein the original text determination unit comprises an original text determination subunit configured to determine, based on the degree of similarity, whether the text to be detected is original text, and the original text determination subunit comprises:

a text similarity obtaining module configured to obtain a text similarity between the text to be detected and the public text; and

a comprehensive determination module configured to determine whether the text to be detected is original text based on the degree of similarity and the text similarity.

12. The apparatus according to claim 11, wherein the comprehensive determination module is further configured to:

13. An electronic device, comprising:

at least one processor; and

a memory communicatively coupled to the at least one processor; wherein

the memory stores instructions executable by the at least one processor, and the instructions, when executed by the at least one processor, enable the at least one processor to perform the method for detecting original text according to any one of claims 1-6.

14. A non-transitory computer-readable storage medium storing computer instructions, wherein the computer instructions are used to cause a computer to perform the method for detecting original text according to any one of claims 1-6.