Detailed Description
In order that the above objects, features and advantages of the present application can be more clearly understood, a detailed description of the present application will be given below with reference to the accompanying drawings and specific embodiments. It should be noted that the embodiments and features of the embodiments of the present application may be combined with each other without conflict.
In the following description, numerous specific details are set forth to provide a thorough understanding of the present application, and the described embodiments are a part, but not all, of the present application.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein in the description of the present application is for the purpose of describing particular embodiments only and is not intended to be limiting of the application.
The embodiments of the present application can acquire and process related data based on artificial intelligence technology. Artificial Intelligence (AI) is the theory, method, technology, and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to obtain the best result.
Basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operating/interaction systems, mechatronics, and the like. Artificial intelligence software technologies mainly include computer vision, robotics, biometric recognition, speech processing, natural language processing, and machine learning/deep learning.
The deep-learning-based article generation method provided by the embodiments of the present application is executed by a computer device, and correspondingly the deep-learning-based article generation apparatus runs in the computer device. Fig. 1 is a flowchart of an article generation method based on deep learning according to an embodiment of the present application. As shown in Fig. 1, the method may include the following steps; according to different requirements, the order of the steps in the flowchart may be changed and some steps may be omitted:
S11, obtaining a trending topic set, and screening candidate topics from the trending topic set according to a first preset rule.
In at least one embodiment of the present application, to ensure that the topic of a soft text is sufficiently novel, operators often need to stay sensitive to all kinds of trending topics. When a trending topic appears, the soft text needs to be written and published as quickly as possible. However, as the internet matures, the life cycle of a trending topic may now be 1 day, half a day, or even 2-3 hours, which severely tests both the ability of operators and the company's review process. The present application provides a trending topic monitoring system that establishes communication connections with a first target platform system and a second target platform system and acquires the trending topic set from them. In one embodiment, the first target platform system may be an internal customer service system, and the second target platform system may be an external news information system. The trending topic set comprises a plurality of internal topics and a plurality of external topics, where the internal topics may be topics acquired from the internal customer service system and the external topics may be topics acquired from the external news information system. The internal topics may be "what is a policyholder", "what is a policy loan", and the like, and the external topics may be "traffic accident in Tai'an, Shandong causes 4 deaths and 2 injuries", "more than 40 vehicles collide on the Bao-Mao Expressway in Shaanxi", and the like, which is not limited herein. A candidate topic is a topic in the trending topic set that is used to generate the target soft text. In an embodiment, the first preset rule may be random selection, that is, the candidate topic may be a topic selected at random from the trending topic set.
In other embodiments, the first preset rule may instead select topics according to the user's reading preference: the reading preference of the target user over a period of time is obtained, and topics similar to that preference are selected from the trending topic set as the candidate topics, which is not limited herein.
Optionally, the obtaining of the trending topic set includes:
obtaining a chat record set from a first target platform system;
extracting topic keywords corresponding to each chat record in the chat record set, and selecting target topic keywords with the occurrence frequency higher than a preset frequency threshold value as internal topics;
collecting a plurality of news lists from the second target platform system;
acquiring target news lists that match a preset vocabulary as external topics;
and combining the internal topics and the external topics to obtain the trending topic set.
That is, communication connections are established with the internal customer service system and the external news information system respectively, internal topics are obtained from the internal customer service system and external topics from the external news information system, and the internal and external topics are combined to obtain the trending topic set. The chat record set may contain chat records between customers and an intelligent customer service agent, or between customers and a human customer service agent. Most chat records reflect customers' conceptual questions about related products; for example, a customer may ask "what is medical insurance" or "what is life insurance". Whether a topic is a trending topic is determined by calculating the occurrence frequency of the topic keyword corresponding to each question, where topic keywords are product-related words such as "medical insurance" and "life insurance". It can be understood that when the occurrence frequency is higher than the preset frequency threshold, the target topic keyword is determined to be an internal topic; when the occurrence frequency is lower than the preset frequency threshold, the target topic keyword is determined not to be an internal topic. The preset frequency threshold is a threshold set in advance for evaluating whether a question is an internal topic.
The preset vocabulary is a vocabulary set in advance and associated with a product. It can be stored in a preset database and, for the privacy and reliability of data storage, the preset database may be a node in a blockchain. Taking an insurance product and an insurance soft text as an example, the preset vocabulary may be the common insurance-industry terms collected in an insurance glossary, in which more than 1,300 entries are divided into 8 classes and collected by class, and the basic entries are arranged alphabetically in English with English-Chinese correspondence. Whether a news list can serve as an external topic is determined by detecting whether a news list matching the preset vocabulary exists in the external news information system. It can be understood that when a news list matching the preset vocabulary exists in the external news information system, that target news list is determined to be an external topic; a news list that does not match the preset vocabulary is determined not to be an external topic.
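The acquisition of the trending topic set described above can be sketched roughly as follows. This is a minimal illustration, not the claimed implementation: it assumes topic keywords have already been extracted from each chat record, treats "occurrence frequency" as a raw count, and treats "matching the preset vocabulary" as simple substring containment. All function and parameter names are hypothetical.

```python
from collections import Counter


def select_internal_topics(chat_keywords, freq_threshold):
    """Return topic keywords that occur more often than freq_threshold.

    chat_keywords: one list of already-extracted topic keywords per chat record.
    Assumption: "frequency" is a raw occurrence count.
    """
    counts = Counter(kw for record in chat_keywords for kw in record)
    return [kw for kw, n in counts.items() if n > freq_threshold]


def select_external_topics(news_titles, preset_vocabulary):
    """Return news titles containing at least one preset vocabulary term.

    Assumption: a "match" is plain substring containment.
    """
    return [t for t in news_titles if any(term in t for term in preset_vocabulary)]


def build_trending_topic_set(chat_keywords, news_titles, preset_vocabulary, freq_threshold):
    """Merge internal and external topics, dropping duplicates but keeping order."""
    merged = (select_internal_topics(chat_keywords, freq_threshold)
              + select_external_topics(news_titles, preset_vocabulary))
    return list(dict.fromkeys(merged))
```

In practice the threshold and the vocabulary would come from the preset database; they are passed in as plain arguments here only to keep the sketch self-contained.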
S12, identifying the type of the candidate topic, and matching a corresponding target outline according to the type.
In at least one embodiment of the present application, taking insurance soft texts as an example, the types of candidate topics can be roughly divided into concept introduction, objection processing, product interpretation, and claim cases. The application provides different modules for different topic types. For example, for a candidate topic of the claim case type, the corresponding target outline has 4 modules: customer situation, insurance experience, claim settlement result, and story enlightenment. Each type has its own target outline; the target outline refers to the module contents of the soft text, and one outline item corresponds to one module. It can be understood that when the target outline has four items, the soft text also has four module contents.
Optionally, before the identifying the type of the candidate topic, the method further comprises:
acquiring historical soft text sets corresponding to different types of hot topics;
extracting event element information corresponding to each historical soft text in the historical soft text set to generate an initial event set;
calculating the importance degree of each event element in the initial event set in the event depiction, and selecting the target event elements with the importance degree higher than a preset importance degree threshold value to form an initial summary frame corresponding to the event;
and determining the target outline corresponding to each type of trending topic based on the initial summary frame.
The historical soft text set comprises a plurality of historical soft texts corresponding to different types of trending topics. The historical soft texts may be articles written by operators and stored in a preset database; each historical soft text is divided into several paragraphs by category, with each paragraph corresponding to one category, and the categories are not limited herein. In one embodiment, the historical soft texts in the set may be those whose reading volume is higher than a preset reading volume threshold, which is a threshold set in advance for evaluating how well a historical soft text was received by readers. By extracting the frame from soft texts whose reading volume is higher than the preset reading volume threshold, the present application can obtain a target outline that meets readers' preferences, which improves the accuracy of the target outline and therefore the accuracy of the generated target soft text.
The main task of event extraction is to find events in massive network data and to structure them around event elements. Illustratively, a sentence splitter is used to divide the historical soft texts into sentences, and a natural language processing tool is used to perform lexical and syntactic analysis on them, parsing the text into syntax trees and identifying dependency relationships. Named entity recognition is then performed on the historical soft texts according to the structural features of words in the syntax tree and an entity element database, mining event element information such as the event name, event time, event location, event participants, event cause, and event impact, and the event element information is stored in a preset data format to obtain the initial event set. The preset data format is a format set in advance for storing event element information.
The more critical an event element is to describing the event, the larger its importance value, which ranges between 0 and 1. In one embodiment, the importance of an event element is determined by how frequently it occurs in the historical soft texts: the more frequently an event element occurs in the historical soft texts, the greater its importance, and the less frequently it occurs, the lower its importance. Target event elements whose importance is higher than a preset importance threshold are selected to form the initial summary frame corresponding to the event, where the preset importance threshold is a threshold set in advance for evaluating the importance of event elements.
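One way to obtain an importance value in the range 0 to 1 from occurrence frequency is sketched below. The exact formula is not given in the text, so this sketch assumes importance is the fraction of historical soft texts in which an element appears; the function names and the input layout are hypothetical.

```python
from collections import Counter


def element_importance(event_sets):
    """Map each event element to the share of soft texts that contain it.

    event_sets: one collection of event element names per historical soft text.
    Assumption: importance = document frequency / number of soft texts.
    """
    total = len(event_sets)
    if total == 0:
        return {}
    counts = Counter(elem for elems in event_sets for elem in set(elems))
    return {elem: n / total for elem, n in counts.items()}


def initial_summary_frame(event_sets, importance_threshold):
    """Keep the elements whose importance exceeds the preset threshold."""
    importance = element_importance(event_sets)
    return sorted(e for e, score in importance.items() if score > importance_threshold)
```

The result is sorted alphabetically only to make the output deterministic; the ordering carries no meaning in this sketch.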
Determining the target outline corresponding to each type based on the initial summary frame may include: obtaining the categories of the event elements in the initial summary frame; and traversing a preset mapping relationship between categories and outlines according to the categories to obtain the target outline corresponding to each category. The mapping relationship between categories and outlines is set in advance; the categories may include, but are not limited to, party information, party experience, event result, and cause analysis, and the outlines may include, but are not limited to, customer situation, insurance experience, claim settlement result, and story enlightenment. In one embodiment, a number of necessary event elements may be set for each category, and when the initial summary frame contains the necessary event elements of a category, the category to which the event elements belong can be determined. Illustratively, when only the event name and event participant elements appear in a certain paragraph of a historical soft text, since {event participant} is the necessary event element of the customer situation category, the category corresponding to the paragraph can be determined as customer situation, and traversing the mapping relationship between categories and outlines shows that the corresponding outline is customer situation.
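The category lookup can be pictured as two small tables, one from necessary event elements to a category and one from the category to an outline item. The sketch below is illustrative only: of the necessary-element sets, only {event participant} is taken from the text, and the other sets, the pairing of categories with outline items, and all names are assumptions.

```python
# Hypothetical necessary-element sets; only {"event participant"} comes from the text.
NECESSARY_ELEMENTS = {
    "party information": {"event participant"},
    "party experience": {"event time", "event location"},
    "event result": {"event impact"},
    "cause analysis": {"event cause"},
}

# Assumed pairing of categories with outline items.
CATEGORY_TO_OUTLINE = {
    "party information": "customer situation",
    "party experience": "insurance experience",
    "event result": "claim settlement result",
    "cause analysis": "story enlightenment",
}


def match_outline_items(frame_elements):
    """Return the outline items whose category has all necessary elements present."""
    present = set(frame_elements)
    return [
        CATEGORY_TO_OUTLINE[category]
        for category, needed in NECESSARY_ELEMENTS.items()
        if needed <= present  # subset test: every necessary element is present
    ]
```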
Optionally, the identifying the type of the candidate topic comprises:
performing word segmentation processing and part-of-speech tagging on the candidate topics by using a word segmentation and part-of-speech tagging combined model constructed by fusing external knowledge to obtain word segmentation results carrying part-of-speech tagging;
detecting whether the word segmentation result contains preset keywords or not;
and when the detection result is that the word segmentation result contains the preset keyword, determining the type corresponding to the preset keyword as the type of the candidate topic.
Generally, a syntactic analysis tool is used to split a piece of text into sentences and then perform word segmentation (Segmentor), part-of-speech tagging (Postagger), and syntactic analysis (Parser) in sequence to obtain the word segmentation result. For example, when the candidate topic is "what is an applicant", the corresponding word segmentation result may be "what", "is", "applicant", where "what" is a preset keyword associated with two types, "concept introduction" and "product interpretation"; which of the two types the candidate topic belongs to is determined by detecting whether the word segmentation result includes an insurance product name. It can be understood that, since this word segmentation result does not include an insurance product name, the type of the candidate topic is "concept introduction". For another example, when the candidate topic is "what is critical illness insurance", the corresponding word segmentation result may be "what", "is", "critical illness insurance", where "what" is again a preset keyword associated with "concept introduction" and "product interpretation"; since the word segmentation result includes an insurance product name, the type of the candidate topic is "product interpretation". In addition, when the preset keyword is "why", the type of the corresponding candidate topic may be "objection processing". For external topics, the type of the corresponding candidate topic may be "claim case".
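The keyword rules in these examples can be written down as a small decision function. The sketch below assumes word segmentation has already been done by some external tool and only shows the rule lookup; the function name, the `is_external` flag, and the product-name set are hypothetical.

```python
def identify_topic_type(tokens, product_names, is_external=False):
    """Classify a candidate topic from its word segmentation result.

    tokens: segmented words of the candidate topic.
    product_names: known insurance product names (assumed to be available).
    Returns None when no preset keyword is found.
    """
    if is_external:
        # External (news) topics are treated as claim cases.
        return "claim case"
    if "why" in tokens:
        return "objection processing"
    if "what" in tokens:
        # "what" is shared by two types; a product name disambiguates them.
        if any(token in product_names for token in tokens):
            return "product interpretation"
        return "concept introduction"
    return None
```

For instance, the tokens "what", "is", "applicant" would fall through to the concept introduction branch as long as "applicant" is not listed as a product name.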
Optionally, the matching of the corresponding target outline according to the type includes:
acquiring a preset mapping relation between the type and the outline;
and traversing the mapping relation according to the type to obtain a target outline matched with the type.
S13, obtaining topic content that corresponds to the candidate topic and meets a preset condition.
In at least one embodiment of the present application, each candidate topic has corresponding topic content. Candidate topics can be divided into external topics and internal topics according to how they were acquired. When the candidate topic is an external topic, topic content meeting the preset condition can be acquired from the second target platform system corresponding to the external topic. Optionally, the obtaining of the topic content that corresponds to the candidate topic and meets the preset condition includes:
acquiring a text link corresponding to each candidate topic;
crawling initial topic content corresponding to the candidate topic according to the text link;
and preprocessing the initial topic content to obtain target topic content.
External topics can be crawled from the external news information system, which holds the external topics together with their corresponding text links; the initial topic content corresponding to a candidate topic can be crawled through its text link. Because the initial topic content contains information irrelevant to soft text generation, such as the text's editing time and editor, this irrelevant information is removed by preprocessing to obtain target topic content that meets the preset condition. The preset condition may be, for example, a preset format condition that the topic content needs to meet, which is not limited herein.
When the candidate topic is an internal topic, topic content meeting the preset condition can be acquired from the first target platform system corresponding to the internal topic. Optionally, the obtaining of the topic content that corresponds to the candidate topic and meets the preset condition includes:
acquiring a target chat record set corresponding to the candidate topic;
determining the chat participants corresponding to each target chat record in the target chat record set and the initial text corresponding to each chat participant;
and combining and preprocessing the initial texts to obtain target topic content.
The target chat record set is the set of chat records that are related to the candidate topic and have been deduplicated. Each target chat record in the set has two chat participants, which may be an intelligent customer service agent and a customer, or a human customer service agent and a customer. Each chat participant has a corresponding initial text. For example, when a customer asks "what is a policyholder", the question "what is a policyholder" is the initial text corresponding to the customer, and the intelligent or human customer service agent answers the customer's question, with the specific answer serving as the initial text corresponding to the agent. A question such as "what is a policyholder" may lead to other sub-questions during the conversation, each with its own initial text; that is, the number of initial texts corresponding to the candidate topic may be 1 or more. When there are multiple initial texts, they are combined and preprocessed to obtain the target topic content. The preprocessing may be, for example, deleting stop words, which is not limited herein.
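A rough sketch of combining and preprocessing the initial texts is given below. It assumes each chat record is a (participant, text) pair, that deduplication means dropping exact repeats, and that preprocessing is whitespace tokenization plus stop-word removal. The stop-word list and all names are placeholders, not part of the described method.

```python
# Placeholder stop-word list; a real system would load a fuller one.
STOP_WORDS = {"the", "a", "an", "is", "of"}


def build_topic_content(chat_records):
    """Merge the initial texts of all chat participants into one topic content.

    chat_records: list of (participant, text) pairs for one candidate topic.
    Exact duplicate texts are dropped, then stop words are removed.
    """
    seen = set()
    cleaned = []
    for _participant, text in chat_records:
        if text in seen:
            continue  # deduplicate repeated messages
        seen.add(text)
        words = [w for w in text.split() if w.lower() not in STOP_WORDS]
        cleaned.append(" ".join(words))
    return " ".join(cleaned)
```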
S14, extracting abstract texts from the topic content according to the target outline, and filling the abstract texts into the target areas corresponding to the target outline to obtain an initial soft text.
In at least one embodiment of the present application, the abstract text is the text corresponding to the target outline: each item of the target outline has one corresponding abstract text, and filling each abstract text into the target area corresponding to its outline item yields the initial soft text. An association relationship is established between the abstract texts and the target areas, and each abstract text is filled into the target area corresponding to the target outline by querying this association relationship.
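The association between abstract texts and target areas can be illustrated with a simple lookup keyed by outline item. This is only a sketch under the assumption that each outline item names exactly one target area; the function name and the output structure are hypothetical.

```python
def fill_initial_soft_text(target_outline, abstract_by_item):
    """Fill each outline item's target area with its abstract text.

    target_outline: ordered outline items, one per module of the soft text.
    abstract_by_item: association from outline item to abstract text.
    Items without an abstract are left with an empty target area.
    """
    return [
        {"outline_item": item, "abstract": abstract_by_item.get(item, "")}
        for item in target_outline
    ]
```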
Optionally, the extracting of the abstract text from the topic content according to the target outline includes:
clustering the topic content in units of sentences to obtain a clustering result;
calling a pre-trained topic word extraction model to extract topic words from the clustering result to obtain the topic words corresponding to the topic content;
counting the words in the topic content whose semantics are the same as or similar to the topic words, together with their word frequencies, and combining these words with the topic words to obtain a high-frequency word set corresponding to the topic content;
constructing a text network graph for the topic content, and extracting an abstract based on the text network graph and the high-frequency word set to obtain a candidate abstract sentence group;
and removing redundancy from the candidate abstract sentence group to obtain an initial abstract text, and optimizing the initial abstract text to obtain a target abstract text.
The conventional k-means algorithm is used to cluster the topic content and obtain a clustering result; clustering with the k-means algorithm belongs to the prior art and is not described herein again. Topic words are extracted from the clustering result using a Latent Dirichlet Allocation topic model (hereinafter, the LDA topic model) to obtain the topic words corresponding to the topic content. The LDA topic model can identify latent topic information in a large-scale text collection and expresses it as a probability distribution. Therefore, applying the LDA topic model to the clustered topic content yields the topics of the text contained in each cluster, from which the main meaning of that text can be understood. To find words with the same or similar semantics, the words can be vectorized and the Euclidean distance between the word vectors calculated to obtain a similarity value between the words.
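The high-frequency word set step can be sketched as follows, leaving clustering and LDA to existing libraries. The sketch assumes word vectors are already available as a plain dictionary and converts Euclidean distance d into a similarity with 1 / (1 + d); that conversion, the threshold, and all names are assumptions, since the text does not specify them.

```python
import math
from collections import Counter


def vector_similarity(u, v):
    """Turn Euclidean distance into a similarity in (0, 1] (assumed formula)."""
    return 1.0 / (1.0 + math.dist(u, v))


def high_frequency_words(tokens, topic_words, vectors, min_similarity):
    """Count words that equal, or are close in vector space to, a topic word.

    tokens: words of the topic content.
    vectors: mapping from word to its embedding (assumed to be precomputed).
    Returns a dict of word -> frequency in the topic content.
    """
    counts = Counter(tokens)
    result = {}
    for word, freq in counts.items():
        for topic_word in topic_words:
            same = word == topic_word
            close = (
                word in vectors
                and topic_word in vectors
                and vector_similarity(vectors[word], vectors[topic_word]) >= min_similarity
            )
            if same or close:
                result[word] = freq
                break
    return result
```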
A text network graph is constructed by taking the sentences of the text as nodes and determining whether a similarity relationship exists between each pair of nodes. If the similarity between two nodes is greater than a set threshold, an edge exists between the two nodes and the similarity value is the weight of the edge; otherwise, no edge exists. The threshold is set to 0.0001.
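The graph construction can be sketched with sentences as nodes and a weighted edge wherever the similarity exceeds 0.0001. The text does not say how sentence similarity is measured, so the sketch substitutes Jaccard overlap of lowercase words purely as a stand-in; the function names are hypothetical.

```python
from itertools import combinations


def sentence_similarity(a, b):
    """Jaccard overlap of the two sentences' word sets (assumed measure)."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    if not words_a or not words_b:
        return 0.0
    return len(words_a & words_b) / len(words_a | words_b)


def build_text_graph(sentences, threshold=0.0001):
    """Return {(i, j): weight} for sentence pairs whose similarity exceeds threshold."""
    edges = {}
    for i, j in combinations(range(len(sentences)), 2):
        weight = sentence_similarity(sentences[i], sentences[j])
        if weight > threshold:
            edges[(i, j)] = weight  # the similarity value is the edge weight
    return edges
```

With a threshold this low, almost any shared word creates an edge, so the edge weights rather than the edge set carry most of the information for later ranking.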
Optimizing the initial abstract text means obtaining the order in which each of its sentences appears in the original text and adding a sequence number label to each sentence. The sentences in the initial abstract text are sorted by weight, so adjacent sentences may not read coherently; outputting the sentences in the order in which they appear in the original text gives the generated abstract a degree of semantic coherence. A text matching algorithm is used to obtain the sequence number labels of all sentences in the initial abstract text in turn, and the sentences are then output in ascending order of sequence number to obtain the target abstract text.
In other embodiments, before the optimizing the initial abstract text to obtain the target abstract text, the method further includes: acquiring a preset abstract word count requirement; and preprocessing the initial abstract text, for example by removing stop words, according to the word count requirement to obtain a target abstract text that meets the requirement.
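The reordering step can be sketched as below. The "text matching algorithm" is not specified, so this sketch assumes exact string matching between abstract sentences and original sentences; sentences that cannot be matched are placed at the end. The function name is hypothetical.

```python
def reorder_by_original_position(abstract_sentences, original_sentences):
    """Sort weight-ranked abstract sentences into their original text order.

    Each sentence gets a sequence number label equal to the index of its
    first occurrence in the original text (exact match assumed).
    """
    label = {}
    for index, sentence in enumerate(original_sentences):
        label.setdefault(sentence, index)
    unmatched = len(original_sentences)  # label for sentences not found
    return sorted(abstract_sentences, key=lambda s: label.get(s, unmatched))
```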
In other embodiments, the topic words contained in the target abstract text and the words in the high-frequency word set corresponding to the topic content can be highlighted and automatically given jump links, that is, the text is changed from a <text> type to a <url> type on the front-end page. The user can click a highlighted keyword to directly view more content related to the topic, which strengthens the customer's understanding and promotes conversion.
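One possible way to turn keywords into highlighted jump links is sketched below. The HTML markup, the `highlight` class name, the link targets, and the function name are all illustrative assumptions; the text only states that the keywords are highlighted and linked.

```python
import html
import re


def link_keywords(text, keyword_urls):
    """Wrap each keyword occurrence in a highlighted anchor tag.

    keyword_urls: mapping from keyword to its (hypothetical) jump link.
    Longer keywords are matched first so that they are not split by shorter ones.
    """
    if not keyword_urls:
        return html.escape(text)
    ordered = sorted(keyword_urls, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(k) for k in ordered))
    pieces, last = [], 0
    for match in pattern.finditer(text):
        pieces.append(html.escape(text[last:match.start()]))
        url = html.escape(keyword_urls[match.group(0)], quote=True)
        label = html.escape(match.group(0))
        pieces.append(f'<a class="highlight" href="{url}">{label}</a>')
        last = match.end()
    pieces.append(html.escape(text[last:]))
    return "".join(pieces)
```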
S15, acquiring a target illustration corresponding to the abstract text, and filling the target illustration into the target position of the initial soft text to obtain the target soft text.
In at least one embodiment of the present application, a preset database stores a large number of product-related illustrations. Each illustration carries a corresponding label that marks its main content; the label may be set manually or obtained automatically by analyzing the content of the illustration. Content analysis of illustrations belongs to the prior art and is not described herein again.
Optionally, the obtaining of the target illustration corresponding to the abstract text includes:
acquiring the topic word corresponding to the abstract text;
acquiring the label set corresponding to the illustrations in a preset database;
detecting whether the degree of similarity between a target label in the label set and the topic word exceeds a preset similarity threshold;
and when the detection result shows that the degree of similarity between the target label and the topic word exceeds the preset similarity threshold, determining the illustration corresponding to the target label as the target illustration.
A label vector is calculated for each label in the label set, a topic vector is calculated for the topic word, and the degree of similarity between a label and the topic word is determined by calculating the Euclidean distance between the label vector and the topic vector. The preset similarity threshold is a threshold set in advance for evaluating the distance between two vectors.
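The label matching can be sketched as a nearest-label filter. Because a smaller Euclidean distance means a closer match, the sketch again assumes the conversion similarity = 1 / (1 + distance) so that "exceeds the threshold" selects close labels; the vectors, the conversion, and all names are assumptions.

```python
import math


def select_target_illustrations(topic_vector, label_vectors, illustration_by_label,
                                similarity_threshold):
    """Return illustrations whose label is similar enough to the topic word.

    label_vectors: mapping from label to its vector (assumed precomputed).
    illustration_by_label: mapping from label to an illustration identifier.
    """
    selected = []
    for label, vector in label_vectors.items():
        similarity = 1.0 / (1.0 + math.dist(topic_vector, vector))
        if similarity > similarity_threshold:
            selected.append(illustration_by_label[label])
    return selected
```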
In an embodiment, the target soft text may be typeset according to a preset typesetting format, and the preset typesetting format may be, for example, a sequence of { major title, minor title, illustration, summary }. There are several locations in the target soft text for filling in target illustrations.
Optionally, the filling the target illustration to the target position of the initial soft text to obtain the target soft text includes:
acquiring a target abstract text corresponding to the target illustration;
querying the preset typesetting format for the target abstract text to obtain the target position corresponding to the target illustration;
and filling the target illustration to the target position to obtain the target soft text.
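The position lookup can be illustrated with the example typesetting sequence of major title, minor title, illustration, and summary. The sketch assumes one module of the soft text is held as a dictionary of its text parts; the data layout and the function name are hypothetical.

```python
# Example typesetting sequence taken from the description.
PRESET_LAYOUT = ("major title", "minor title", "illustration", "summary")


def fill_illustration(section, illustration, layout=PRESET_LAYOUT):
    """Place the illustration at its slot in the preset typesetting sequence.

    section: dict holding the text parts of one module of the soft text.
    Returns (slot, content) pairs in layout order.
    """
    parts = dict(section)  # copy so the caller's data is not modified
    parts["illustration"] = illustration
    return [(slot, parts[slot]) for slot in layout if slot in parts]
```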
According to the deep-learning-based article generation method provided by the embodiments of the present application, connections are established with the external news information system and the internal customer service system, the trending topic set is obtained from both, candidate topics are then screened, and target soft texts are generated for the candidate topics. Trending events can thus be tracked automatically and comprehensively without manually searching for and screening trending topics, which ensures the timeliness of the target soft texts and improves the efficiency of generating them. In addition, by extracting the frame of historical soft texts whose reading volume is higher than the preset reading volume threshold, a target outline that meets readers' preferences can be obtained, which improves the accuracy of the target outline and of the generated target soft text. Furthermore, the abstract text is generated automatically by means of the topic word extraction model and the analysis of words and word frequencies in the topic content, and is filled into the target area to obtain the soft text, which improves the accuracy and efficiency of generating both the abstract text and the soft text. The present application can be applied to functional modules of smart cities such as smart government affairs and smart transportation, for example a deep-learning-based article generation module for smart government affairs, and can promote the rapid development of smart cities.
Fig. 2 is a block diagram of an article generation apparatus based on deep learning according to a second embodiment of the present application.
In some embodiments, the deep-learning-based article generation apparatus 20 may include a plurality of functional modules composed of computer program segments. The computer program segments of the deep-learning-based article generation apparatus 20 may be stored in a memory of a computer device and executed by at least one processor to perform the functions of deep-learning-based article generation (described in detail with reference to Fig. 1).
In the present embodiment, the deep-learning-based article generation apparatus 20 may be divided into a plurality of functional modules according to the functions it performs. The functional modules may include: a topic screening module 201, an outline matching module 202, a content obtaining module 203, a summary extracting module 204, and an illustration filling module 205. A module as referred to herein is a series of computer program segments that are stored in a memory, can be executed by at least one processor, and perform a fixed function. In the present embodiment, the functions of the modules will be described in detail in the following embodiments.
The topic screening module 201 may be configured to obtain a trending topic set, and screen candidate topics from the trending topic set according to a first preset rule.
In at least one embodiment of the present application, to ensure that the topic of a soft text is sufficiently novel, operators often need to stay sensitive to all kinds of trending topics. When a trending topic appears, the soft text needs to be written and published as quickly as possible. However, as the internet matures, the life cycle of a trending topic may be 1 day, half a day, or even 2-3 hours, which severely tests both the ability of operators and the company's review process. The present application provides a trending topic monitoring system that connects to a first target platform system and a second target platform system and acquires the trending topic set from them. In one embodiment, the first target platform system may be an internal customer service system, and the second target platform system may be an external news information system. The trending topic set includes a plurality of internal topics and a plurality of external topics, where the internal topics may be "what is a policyholder", "what is a policy loan", and the like, and the external topics may be "traffic accident in Tai'an, Shandong causes 4 deaths and 2 injuries", "more than 40 vehicles collide on the Bao-Mao Expressway in Shaanxi", and the like, which is not limited herein. A candidate topic is a topic in the trending topic set that is used to generate the target soft text. In an embodiment, the first preset rule may be random selection, that is, the candidate topic may be a topic selected at random from the trending topic set, which is not limited herein.
Optionally, the obtaining of the trending topic set includes:
obtaining a chat record set from a first target platform system;
extracting topic keywords corresponding to each chat record in the chat record set, and selecting target topic keywords with the occurrence frequency higher than a preset frequency threshold value as internal topics;
collecting a plurality of news lists from the second target platform system;
acquiring a target news list matched with preset vocabularies as external topics;
and combining the internal topics and the external topics to obtain the trending topic set.
Communication connections are established with the internal customer service system and the external news information system respectively, internal topics are obtained from the internal customer service system and external topics from the external news information system, and the two are combined to obtain the trending topic set. The chat record set may consist of chat records between a client and an intelligent customer service agent or between a client and a human customer service agent. Most of the chat records reflect clients' conceptual questions about related products; for example, a client may ask "what is medical insurance" or "what is life insurance". Whether a question is a hot topic is determined by calculating the frequency of the topic keyword corresponding to the question: when the occurrence frequency is higher than a preset frequency threshold, the target topic keyword is determined to be an internal topic; when the occurrence frequency is lower than the preset frequency threshold, the target topic keyword is determined not to be an internal topic. The preset frequency threshold is a preset threshold for evaluating whether a question is an internal topic.
The preset vocabulary refers to preset vocabulary associated with a product and can be stored in a preset database; in consideration of the privacy and reliability of data storage, the preset database may be a node in a blockchain. Taking an insurance product as the product and an insurance-class soft text as the target soft text as an example, the preset vocabulary may be the common insurance-industry vocabulary in an insurance glossary, in which the more than 1,300 entries of the whole book are divided into 8 classes and collected by class, and the basic entries are arranged alphabetically in English with their Chinese equivalents. Whether a news list can be used as an external topic is determined by detecting whether a news list matching the preset vocabulary exists in the external news information system: when the detection result is that such a news list exists, the target news list is determined to be an external topic; when the detection result is that no news list matching the preset vocabulary exists, the target news list is determined not to be an external topic.
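The two screening rules above (a frequency threshold for internal topics, vocabulary matching for external topics) can be sketched as follows. This is a minimal illustration rather than the implementation of the embodiment: the function name `build_trending_topics`, the assumption of one extracted keyword per chat record, the substring-based matching, and the sample data are all hypothetical.

```python
from collections import Counter


def build_trending_topics(chat_keywords, news_titles, preset_vocabulary, freq_threshold):
    """Combine internal and external topics into one trending topic set.

    chat_keywords: one topic keyword per chat record (assumed already extracted).
    news_titles: titles taken from the external news lists.
    """
    # Internal topics: keywords whose occurrence count exceeds the threshold.
    counts = Counter(chat_keywords)
    internal = [kw for kw, n in counts.items() if n > freq_threshold]
    # External topics: titles containing at least one preset vocabulary word
    # (simple substring matching is an assumption made for this sketch).
    external = [t for t in news_titles if any(w in t for w in preset_vocabulary)]
    return internal + external


# Made-up sample data for illustration only.
chat_keywords = ["policyholder", "policy loan", "policyholder",
                 "premium", "policyholder", "policy loan"]
news_titles = ["Traffic accident claim settled within a day",
               "Local team wins league title"]
topics = build_trending_topics(chat_keywords, news_titles,
                               {"claim", "insurance"}, freq_threshold=1)
print(topics)
```

In this sketch a keyword that appears only once ("premium") is dropped, and only the news title containing a preset vocabulary word is kept.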
The outline matching module 202 may be configured to identify a type of the candidate topic and match a corresponding target outline according to the type.
In at least one embodiment of the present application, taking insurance-class soft texts as an example, the types of the candidate topics can be roughly divided into concept introduction, objection processing, product interpretation, and claim cases. The application provides different templates for different topic types; for example, for a candidate topic of the claim case type, the corresponding target outline has 4 modules: customer situation, risk process, claim settlement result, and story enlightenment. Each type has its own target outlines, where a target outline refers to a content module of the soft text, and one outline corresponds to one module. It can be understood that, when there are four target outlines, the soft text also has four content modules.
Optionally, before the identifying the type of the candidate topic, the method further comprises:
acquiring historical soft text sets corresponding to different types of hot topics;
extracting event element information corresponding to each historical soft text in the historical soft text set to generate an initial event set;
calculating the importance degree of each event element in the initial event set in the event depiction, and selecting the target event elements with the importance degree higher than a preset importance degree threshold value to form an initial summary frame corresponding to the event;
and determining target outlines corresponding to the hot topics of all types based on the initial summary frame.
The historical soft text set comprises a plurality of historical soft texts corresponding to hot topics of different types. The historical soft texts may be articles written by operators and stored in a preset database; each historical soft text is divided into a plurality of paragraphs according to categories, each paragraph corresponds to one category, and the categories are not limited herein. In one embodiment, the historical soft texts in the historical soft text set may be texts whose reading amount is higher than a preset reading amount threshold, the preset reading amount threshold being a preset threshold for evaluating how well a historical soft text was received by its audience. By extracting the frame from soft texts whose reading amount is higher than the preset reading amount threshold, a target outline that meets readers' preferences can be obtained, which improves the accuracy of the obtained target outline and therefore the accuracy of the generated target soft text.
The main task of event extraction is to find events in massive network data and structure them around event elements. Illustratively, a sentence splitter is used to divide each historical soft text into sentences, and a natural language processing tool is used to perform lexical and syntactic analysis on the historical soft text, parsing it into a syntax tree and identifying dependency relationships. According to the structural features of words in the syntax tree and an entity element database, named entity recognition is performed on the historical soft text to mine the event element information involved in an event, such as the event name, event occurrence time, event occurrence address, event participants, event occurrence reason and event influence, and the event element information is stored according to a preset data format to obtain the initial event set. The preset data format is a preset format for storing event element information.
The more critical an event element is to describing the event, the larger its importance degree, whose value ranges between 0 and 1. In one embodiment, the importance degree of an event element is determined by how frequently the event element occurs in the historical soft texts: the more frequently an event element occurs in the historical soft texts, the greater its importance degree; the less frequently it occurs, the smaller its importance degree. Target event elements whose importance degree is higher than a preset importance degree threshold are selected to form the initial summary frame corresponding to the event, where the preset importance degree threshold is a preset threshold for evaluating the importance degree of event elements.
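The frequency-based importance degree described above can be illustrated with a short sketch. It assumes, purely for illustration, that the importance degree of an element is the fraction of historical soft texts in which the element appears, which keeps the value between 0 and 1; the text does not fix a formula, and the function name `select_key_elements` and the sample data are hypothetical.

```python
from collections import Counter


def select_key_elements(event_sets, importance_threshold):
    """Score event elements and keep those above the importance threshold.

    event_sets: one collection of event element names per historical soft text.
    Importance is approximated as the share of texts containing the element
    (an assumption of this sketch), so it always lies between 0 and 1.
    """
    if not event_sets:
        return {}, []
    doc_count = len(event_sets)
    # Count each element at most once per text.
    counts = Counter(el for elements in event_sets for el in set(elements))
    importance = {el: n / doc_count for el, n in counts.items()}
    frame = sorted(el for el, score in importance.items()
                   if score > importance_threshold)
    return importance, frame


# Made-up event element sets for four historical soft texts.
events = [
    {"event participant", "event time", "event result"},
    {"event participant", "event cause", "event result"},
    {"event participant", "event address"},
    {"event participant", "event result", "event time"},
]
importance, frame = select_key_elements(events, importance_threshold=0.5)
print(frame)
```

With a threshold of 0.5, only the elements that occur in more than half of the texts enter the initial summary frame.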
The determining of the target outline corresponding to each type based on the initial summary frame may include: obtaining the categories of the event elements in the initial summary frame; and traversing a preset mapping relation between categories and outlines according to the categories to obtain the target outlines corresponding to the categories. The mapping relation between the categories and the outlines is preset; the categories may include, but are not limited to, party information, party experiences, event results and reason profiling, and the outlines may include, but are not limited to, customer conditions, insurance experiences, claim settlement results and story revelations. In one embodiment, a number of necessary event elements may be set for each category, and when the necessary event elements of a category are included in the initial summary frame, the category to which the event elements belong can be determined. Illustratively, when only the event name and event participant elements appear in a certain paragraph of a historical soft text, since {event participant} is a necessary event element of the customer conditions category, the category corresponding to the paragraph can be customer conditions, and traversing the mapping relation between categories and outlines shows that the outline corresponding to that category is customer conditions.
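The category lookup can be sketched as two small tables. The following is a hedged illustration only: pairing each category with an outline in the order in which the text lists them is an assumption, and every necessary element set other than {event participant} is hypothetical.

```python
# Necessary event elements per category. Only "event participant" comes from
# the description; the other sets are hypothetical placeholders.
NECESSARY_ELEMENTS = {
    "party information": {"event participant"},
    "party experiences": {"event occurrence time", "event occurrence address"},
    "event results": {"event influence"},
    "reason profiling": {"event occurrence reason"},
}

# Preset mapping between categories and outlines (pairing by listed order
# is an assumption of this sketch).
CATEGORY_TO_OUTLINE = {
    "party information": "customer conditions",
    "party experiences": "insurance experiences",
    "event results": "claim settlement results",
    "reason profiling": "story revelations",
}


def outlines_for_paragraph(elements):
    """Return the outlines whose category has all necessary elements present."""
    found = set(elements)
    return [CATEGORY_TO_OUTLINE[category]
            for category, needed in NECESSARY_ELEMENTS.items()
            if needed <= found]


print(outlines_for_paragraph({"event name", "event participant"}))
```

A paragraph that contains only the event name and the event participant therefore resolves to the customer conditions outline, as in the example above.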
Optionally, the identifying the type of the candidate topic comprises:
performing word segmentation processing and part-of-speech tagging on the candidate topics by using a word segmentation and part-of-speech tagging combined model constructed by fusing external knowledge to obtain word segmentation results carrying part-of-speech tagging;
detecting whether the word segmentation result contains preset keywords or not;
and when the detection result is that the word segmentation result contains the preset keyword, determining the type corresponding to the preset keyword as the type of the candidate theme.
Generally, a syntactic analysis tool is used to split a piece of text into sentences and then perform word segmentation (Segmentor), part-of-speech tagging (Postagger) and syntactic analysis (Parser) in sequence to obtain the word segmentation result. For example, when the candidate topic is "what is a policyholder", the corresponding word segmentation result may be "what", "is", "policyholder", wherein "what" is a preset keyword associated with two types, "concept introduction" and "product interpretation"; whether the candidate topic belongs to "concept introduction" or "product interpretation" is determined by detecting whether the word segmentation result includes an insurance product name. Since this word segmentation result does not include an insurance product name, the type of the candidate topic is "concept introduction". For another example, when the candidate topic is "what is critical illness insurance", the corresponding word segmentation result may be "what", "is", "critical illness insurance"; "what" is again a preset keyword associated with "concept introduction" and "product interpretation", and since the word segmentation result includes an insurance product name, the type of the candidate topic is "product interpretation". In addition, when the preset keyword is "why", the type of the corresponding candidate topic may be "objection processing". For external topics, the type of the corresponding candidate topic may be "claim case".
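The keyword rules in this paragraph can be written down as a small classifier. This is a minimal sketch under stated assumptions: the word segmentation result is taken as given (no real segmentation tool is called), and the product name list `PRODUCT_NAMES` and the function name `identify_topic_type` are hypothetical.

```python
# Hypothetical list of insurance product names used to tell
# "product interpretation" apart from "concept introduction".
PRODUCT_NAMES = {"critical illness insurance", "medical insurance"}


def identify_topic_type(tokens, is_external=False):
    """Map a word segmentation result to a candidate topic type.

    tokens: words produced by the segmentation step (assumed given).
    is_external: True when the topic came from the external news system.
    """
    if is_external:
        return "claim case"
    words = {t.lower() for t in tokens}
    if "why" in words:
        return "objection processing"
    if "what" in words:
        # "what" is shared by two types; a product name decides between them.
        if words & PRODUCT_NAMES:
            return "product interpretation"
        return "concept introduction"
    return None  # no preset keyword found


print(identify_topic_type(["what", "is", "policyholder"]))
print(identify_topic_type(["what", "is", "critical illness insurance"]))
```

Returning `None` when no preset keyword is found is a choice made for this sketch; the text does not say how such topics are handled.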
Optionally, the matching of the corresponding target outline according to the type includes:
acquiring a preset mapping relation between the type and the outline;
and traversing the mapping relation according to the type to obtain a target outline matched with the type.
The content obtaining module 203 is configured to obtain subject content that corresponds to the candidate topic and meets a preset condition.
In at least one embodiment of the present application, each candidate topic has corresponding subject content. According to the acquisition route, candidate topics can be divided into external topics and internal topics. When the candidate topic is an external topic, subject content meeting the preset condition can be acquired from the second target platform system corresponding to the external topic.
Optionally, the obtaining of the subject content that corresponds to the candidate topic and meets the preset condition includes:
acquiring a text link corresponding to each candidate topic;
crawling initial subject content corresponding to the candidate topic according to the text link;
and preprocessing the initial subject content to obtain target subject content.
The external topics can be crawled from the external news information system, which holds the external topics and the text link corresponding to each external topic; the initial subject content corresponding to a candidate topic can be crawled through its text link. Because the initial subject content includes information irrelevant to soft text generation, such as the text editing time and the text editing object, the irrelevant information is removed through preprocessing to obtain the target subject content that meets the preset condition. The preset condition may be, for example, a preset format condition that the subject content needs to meet, which is not limited herein.
When the candidate topic is an internal topic, optionally, the obtaining of the subject content that corresponds to the candidate topic and meets the preset condition includes:
acquiring a target chat record set corresponding to the candidate topic;
determining the chat participants corresponding to each target chat record in the target chat record set and the initial text corresponding to each chat participant;
and combining the initial texts to obtain the subject content related to the candidate topic.
The target chat record set is a set of chat records that are related to the candidate topic and have undergone de-duplication. Each target chat record in the target chat record set has two chat participants, which may be an intelligent customer service agent and a client, or a human customer service agent and a client. Each chat participant has corresponding initial text. For example, when a client asks "what is a policyholder", the question "what is a policyholder" may be used as the initial text corresponding to the client; the intelligent or human customer service agent answers the client's question, and the specific answer may be used as the initial text corresponding to that agent. The question "what is a policyholder" may give rise to further sub-questions during the conversation, each with its own initial text; that is, the number of initial texts corresponding to the candidate topic may be one or more. When there are multiple initial texts, they are combined to obtain the subject content related to the candidate topic.
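The de-duplication and merging of initial texts can be sketched as below. The record layout (a list of `(participant, text)` turns), the function name `build_subject_content`, and the sample dialogue are assumptions made for illustration; exact-match de-duplication stands in for whatever de-duplication the embodiment uses.

```python
def build_subject_content(chat_records):
    """Merge the initial texts of de-duplicated chat records into one text.

    chat_records: list of records, each a list of (participant, text) turns.
    """
    seen = set()
    parts = []
    for record in chat_records:
        key = tuple(record)
        if key in seen:  # skip records that exactly repeat an earlier one
            continue
        seen.add(key)
        for _participant, text in record:
            parts.append(text)
    return "\n".join(parts)


# Made-up chat records; the second one duplicates the first.
records = [
    [("client", "What is a policyholder?"),
     ("agent", "The policyholder is the party that takes out the policy.")],
    [("client", "What is a policyholder?"),
     ("agent", "The policyholder is the party that takes out the policy.")],
    [("client", "Can the policyholder differ from the insured?"),
     ("agent", "Yes, they can be different people.")],
]
content = build_subject_content(records)
print(content)
```

The question and its derived sub-question each contribute their own initial texts, and the merged result serves as the subject content for the candidate topic.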
The abstract extracting module 204 may be configured to extract an abstract text from the subject content according to the target outline, and fill the abstract text into a target area corresponding to the target outline to obtain an initial soft text.
In at least one embodiment of the present application, the abstract text refers to the text corresponding to a target outline: each target outline has one corresponding abstract text, and the abstract text is filled into the target area corresponding to that target outline to obtain the initial soft text. An association relation between the abstract text and the target area is established, and the abstract text is filled into the target area corresponding to the target outline by querying the association relation.
Optionally, the extracting the abstract text from the subject content according to the target outline includes:
clustering the subject contents by taking sentences as units to obtain a clustering result;
calling a pre-trained subject word extraction model to extract subject words from the clustering result to obtain subject words corresponding to the subject contents;
counting words and word frequencies of the words with the same or similar semantics with the subject words in the subject content, and combining the words and the subject words to obtain a high-frequency word set corresponding to the subject content;
constructing a text network graph for the subject content, and extracting an abstract based on the text network graph and the high-frequency word set to obtain a candidate abstract sentence cluster;
and removing redundancy from the candidate abstract sentence cluster to obtain an initial abstract text, and optimizing the initial abstract text to obtain a target abstract text.
A conventional k-means algorithm is adopted to cluster the subject content and obtain the clustering result; clustering with the k-means algorithm belongs to the prior art and is not described herein again. A Latent Dirichlet Allocation topic model (hereinafter referred to as the LDA topic model) is used to extract subject words from the clustering result to obtain the subject words corresponding to the subject content. The LDA topic model can identify latent topic information in a large-scale text set and presents it in the form of probability distributions. Therefore, by using the LDA topic model to extract topics from the clustered subject content, the topic corresponding to the text contained in each cluster can be obtained, and the main meaning of that text can be further known. To identify words with similar or identical semantics, the words can be vectorized and the Euclidean distance between the word vectors calculated to obtain a similarity value between the words.
Sentences are taken as the nodes of the text, and whether a similarity relation exists between nodes is determined, so as to construct the text network graph. If the similarity between two nodes is larger than a set threshold, an edge exists between the two nodes and the similarity value is the weight of the edge; otherwise, no edge exists. The threshold is set to 0.0001.
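The graph construction rule can be illustrated as follows. The text does not state which similarity measure is used between sentences, so this sketch assumes cosine similarity over bag-of-words counts; the function names and sample sentences are hypothetical, while the 0.0001 threshold comes from the description.

```python
import math
from collections import Counter
from itertools import combinations


def cosine_similarity(a, b):
    """Cosine similarity of two sentences over word counts (an assumed measure)."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va.keys() & vb.keys())
    norm = (math.sqrt(sum(v * v for v in va.values()))
            * math.sqrt(sum(v * v for v in vb.values())))
    return dot / norm if norm else 0.0


def build_text_graph(sentences, threshold=0.0001):
    """Return weighted edges {(i, j): similarity} between sentence nodes."""
    edges = {}
    for i, j in combinations(range(len(sentences)), 2):
        sim = cosine_similarity(sentences[i], sentences[j])
        if sim > threshold:  # an edge exists only above the set threshold
            edges[(i, j)] = sim
    return edges


# Made-up sentences for illustration.
sentences = [
    "the client bought medical insurance",
    "medical insurance paid the claim",
    "weather was sunny today",
]
edges = build_text_graph(sentences)
print(edges)
```

Because the threshold is so small, almost any pair of sentences sharing a word is connected, and the edge weight carries the actual strength of the relation.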
The optimization of the initial abstract text consists of obtaining the order in which each of its sentences appears in the original text and adding a sequence number label to each sentence. The sentences in the initial abstract text are sorted by weight value, so consecutive sentences may not read coherently; if the sentences are instead output in the order in which they appear in the original text, the generated abstract retains a degree of semantic coherence. A text matching algorithm is used to obtain the sequence number label of each sentence in the initial abstract text in turn, and the sentences are then output in ascending order of sequence number label to obtain the target abstract text.
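The reordering step can be sketched in a few lines. Exact string matching is used here in place of the unspecified text matching algorithm, and the function name `order_summary` and the sample sentences are hypothetical.

```python
def order_summary(original_sentences, summary_sentences):
    """Reorder summary sentences by their position in the original text.

    Exact string matching stands in for the text matching algorithm,
    which the description does not specify.
    """
    # Sequence number label = index of the first appearance in the original.
    position = {}
    for idx, sentence in enumerate(original_sentences):
        position.setdefault(sentence, idx)
    labelled = [(position[s], s) for s in summary_sentences if s in position]
    return [s for _, s in sorted(labelled)]


# Made-up sentences; the summary arrives sorted by weight, not by position.
original = ["The client fell ill.", "She filed a claim.",
            "The insurer reviewed it.", "The claim was paid."]
by_weight = ["The claim was paid.", "The client fell ill.", "She filed a claim."]
ordered = order_summary(original, by_weight)
print(ordered)
```

Sentences that cannot be matched to the original text are dropped in this sketch; an implementation could instead append them at the end.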
In other embodiments, before the optimizing of the initial abstract text to obtain the target abstract text, the method further includes: acquiring a preset abstract word count requirement; and performing preprocessing such as stop-word removal on the initial abstract text according to the abstract word count requirement to obtain a target abstract text that meets the requirement.
In other embodiments, the subject words contained in the target abstract text and the words in the high-frequency word set corresponding to the subject content can be highlighted and automatically given jump links, that is, changed from the < text > type to the < url > type on the front-end page. The user can click a highlighted keyword to directly view more content related to the topic, which enhances the client's understanding and promotes conversion.
The illustration filling module 205 may be configured to obtain a target illustration corresponding to the abstract text, and fill the target illustration to a target position of the initial soft text to obtain a target soft text.
In at least one embodiment of the present application, a preset database stores a large number of illustrations related to products. Each illustration carries a corresponding label that marks the main content of the illustration; the label may be set manually or obtained automatically by analyzing the content of the illustration. Content analysis of illustrations belongs to the prior art and is not described herein again.
Optionally, the obtaining of the target illustration corresponding to the abstract text includes:
acquiring a subject term corresponding to the abstract text;
acquiring a label set corresponding to an illustration in a preset database;
detecting whether the similarity degree between a target label in the label set and the subject term exceeds a preset similarity threshold;
and when the detection result shows that the similarity degree of the target label and the subject term exceeds a preset similarity threshold value, determining an illustration corresponding to the target label as a target illustration.
A label vector corresponding to each label in the label set and a subject vector corresponding to the subject term are calculated, and the similarity degree between a label and the subject term is determined by calculating the Euclidean distance between the label vector and the subject vector. The preset similarity threshold is a preset threshold for evaluating the distance between the two vectors.
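The illustration selection can be sketched as follows. Since a Euclidean distance shrinks as two vectors become more alike, this sketch converts the distance d into a similarity 1 / (1 + d) before comparing it with the threshold; that conversion, the two-dimensional vectors, the file names and the function names are all assumptions made for illustration.

```python
import math


def euclidean_distance(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def select_illustrations(subject_vector, label_vectors, illustrations,
                         similarity_threshold):
    """Return illustrations whose label is similar enough to the subject term.

    Similarity is taken as 1 / (1 + distance), an assumed conversion so that
    a smaller distance gives a larger similarity degree.
    """
    selected = []
    for label, vector in label_vectors.items():
        similarity = 1.0 / (1.0 + euclidean_distance(subject_vector, vector))
        if similarity > similarity_threshold:
            selected.append(illustrations[label])
    return selected


# Made-up vectors and file names for illustration.
label_vectors = {"claim settlement": [1.0, 0.0], "car": [0.0, 1.0]}
illustrations = {"claim settlement": "claim_settlement.png", "car": "car.png"}
targets = select_illustrations([1.0, 0.0], label_vectors, illustrations,
                               similarity_threshold=0.8)
print(targets)
```

An equivalent design compares the raw distance against an upper bound instead; either way the threshold has to be chosen for the vector space in use.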
In an embodiment, the target soft text may be typeset according to a preset typesetting format, and the preset typesetting format may be, for example, a sequence of { major title, minor title, illustration, summary }. There are several locations in the target soft text for filling in target illustrations.
Optionally, the filling the target illustration to the target position of the initial soft text to obtain the target soft text includes:
acquiring a target abstract text corresponding to the target illustration;
querying the preset typesetting format according to the target abstract text to obtain a target position corresponding to the target illustration;
and filling the target illustration to the target position to obtain the target soft text.
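The filling step can be illustrated with a small assembly routine. It assumes the typesetting order {major title, minor title, illustration, summary} mentioned above, so the target position of an illustration is taken to be directly after the minor title of its abstract text; the function name `assemble_soft_text`, the placeholder syntax and the sample data are hypothetical.

```python
def assemble_soft_text(title, sections, illustrations):
    """Lay out a soft text as major title, then per section:
    minor title, illustration (if any), summary.

    sections: list of (outline, abstract_text) pairs.
    illustrations: dict mapping an abstract text to its target illustration.
    """
    lines = [title]
    for outline, abstract in sections:
        lines.append(outline)  # minor title taken from the target outline
        image = illustrations.get(abstract)
        if image is not None:
            # Target position: between the minor title and the summary.
            lines.append(f"[illustration: {image}]")
        lines.append(abstract)
    return "\n".join(lines)


# Made-up content for illustration.
sections = [
    ("customer situation", "Mr. Li bought a policy."),
    ("claim settlement result", "The claim was paid."),
]
illustrations = {"The claim was paid.": "claim.png"}
soft_text = assemble_soft_text("A claim story", sections, illustrations)
print(soft_text)
```

Sections without a matching illustration are emitted with their title and summary only.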
Fig. 3 is a schematic structural diagram of a computer device according to a third embodiment of the present application. In the preferred embodiment of the present application, the computer device 3 includes a memory 31, at least one processor 32, at least one communication bus 33, and a transceiver 34.
It will be appreciated by those skilled in the art that the configuration of the computer device shown in fig. 3 does not limit the embodiments of the present application; the configuration may be a bus-type configuration or a star-type configuration, and the computer device 3 may include more or fewer hardware or software components than those shown, or a different arrangement of components.
In some embodiments, the computer device 3 is a device capable of automatically performing numerical calculation and/or information processing according to instructions set or stored in advance, and the hardware includes but is not limited to a microprocessor, an application specific integrated circuit, a programmable gate array, a digital processor, an embedded device, and the like. The computer device 3 may also include a client device, which includes, but is not limited to, any electronic product capable of interacting with a client through a keyboard, a mouse, a remote controller, a touch pad, or a voice control device, for example, a personal computer, a tablet computer, a smart phone, a digital camera, etc.
It should be noted that the computer device 3 is only an example, and other existing or future electronic products that can be adapted to the present application are also included within the scope of the present application and are incorporated herein by reference.
In some embodiments, the memory 31 has stored therein a computer program that, when executed by the at least one processor 32, implements all or part of the steps of the deep learning-based article generation method described above. The memory 31 includes a Read-Only Memory (ROM), a Programmable Read-Only Memory (PROM), an Erasable Programmable Read-Only Memory (EPROM), a One-time Programmable Read-Only Memory (OTPROM), an Electrically-Erasable Programmable Read-Only Memory (EEPROM), a Compact Disc Read-Only Memory (CD-ROM) or other optical disk memory, a magnetic disk memory, a tape memory, or any other computer-readable medium capable of carrying or storing data.
Further, the computer-readable storage medium may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function, and the like; the storage data area may store data created according to the use of the blockchain node, and the like.
The block chain referred by the application is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, a consensus mechanism, an encryption algorithm and the like. A block chain (Blockchain), which is essentially a decentralized database, is a series of data blocks associated by using a cryptographic method, and each data block contains information of a batch of network transactions, so as to verify the validity (anti-counterfeiting) of the information and generate a next block. The blockchain may include a blockchain underlying platform, a platform product service layer, an application service layer, and the like.
In some embodiments, the at least one processor 32 is a control unit of the computer device 3, connects the various components of the entire computer device 3 by using various interfaces and lines, and executes the various functions of the computer device 3 and processes its data by running or executing programs or modules stored in the memory 31 and calling data stored in the memory 31. For example, the at least one processor 32, when executing the computer program stored in the memory, implements all or part of the steps of the deep learning-based article generation method described in the embodiments of the present application, or implements all or part of the functions of the deep learning-based article generation apparatus. The at least one processor 32 may be composed of integrated circuits, for example, a single packaged integrated circuit, or a plurality of packaged integrated circuits with the same or different functions, including one or more Central Processing Units (CPUs), microprocessors, digital processing chips, graphics processors, and combinations of various control chips.
In some embodiments, the at least one communication bus 33 is arranged to enable connection and communication between the memory 31, the at least one processor 32, and the like.
Although not shown, the computer device 3 may further include a power supply (such as a battery) for supplying power to each component. Preferably, the power supply may be logically connected to the at least one processor 32 through a power management device, so that functions such as charging management, discharging management and power consumption management are implemented through the power management device. The power supply may also include any component such as one or more DC or AC power sources, recharging devices, power failure detection circuitry, power converters or inverters, and power status indicators. The computer device 3 may further include various sensors, a Bluetooth module, a Wi-Fi module, and the like, which are not described herein again.
The integrated unit implemented in the form of a software functional module may be stored in a computer-readable storage medium. The software functional module is stored in a storage medium and includes several instructions that enable a computer device (which may be a personal computer, a computer device, or a network device) or a processor to execute parts of the methods according to the embodiments of the present application.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the modules is only one logical functional division, and other divisions may be realized in practice.
The modules described as separate parts may or may not be physically separate, and parts displayed as modules may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment.
In addition, functional modules in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional module.
It will be evident to those skilled in the art that the present application is not limited to the details of the foregoing illustrative embodiments, and that the present application may be embodied in other specific forms without departing from the spirit or essential attributes thereof. The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the application being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference sign in a claim should not be construed as limiting the claim concerned. Furthermore, it is obvious that the word "comprising" does not exclude other elements or that the singular does not exclude the plural. A plurality of units or means recited in the specification may also be implemented by one unit or means through software or hardware. The terms first, second, etc. are used to denote names, but not any particular order.
Finally, it should be noted that the above embodiments are only used for illustrating the technical solutions of the present application and not for limiting, and although the present application is described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that modifications or equivalent substitutions can be made on the technical solutions of the present application without departing from the spirit and scope of the technical solutions of the present application.