Background
A recommendation system is an information filtering tool that aims to accurately predict a user's degree of preference for items, so that the items most valuable to the user are presented first. The user's historical behavior data is the basis on which the recommendation system operates, and this data often includes the user's personal sensitive data. A precondition for protecting the privacy of sensitive data is that the sensitive data can be picked out of a large amount of data, that is, that the sensitive data can be identified.
Traditional sensitive data identification methods mainly comprise dictionary matching and manual identification, and the industry mostly combines the two. The main process is as follows: a user defines a sensitive data pattern matching formula; the dictionary matching range is determined according to a predefined model; the target is scanned using dictionary matching; after the scan completes, the matching results are filtered manually and the pattern matching formula is optimized. However, because of the manual judgment criteria and the dictionary matching step, identification is slow.
Disclosure of Invention
The invention aims to provide a sensitive data identification method based on a knowledge graph that improves identification speed.
To achieve this aim, the invention provides a sensitive data identification method based on a knowledge graph, comprising the following steps:
preprocessing the acquired original data and constructing a user-item pattern graph;
constructing a knowledge graph according to the pattern graph and the preprocessed data;
constructing a sensitive relation reasoning rule and completing the knowledge graph;
querying the sensitive data in the knowledge graph and outputting the sensitive data.
Preprocessing the acquired original data and constructing the user-item pattern graph comprises:
unifying the data storage format and the encoding method across the various types of acquired original data, and deleting redundant data.
Preprocessing the acquired original data and constructing the user-item pattern graph further comprises:
taking the age, occupation and gender of the user as the attributes of the user, marking the relationship between the user and the item as a purchase relationship, and then performing entity alignment on the preprocessed data with a database tool to construct the user-item pattern graph.
Constructing the knowledge graph according to the pattern graph and the preprocessed data comprises:
taking the user and the item as nodes, and constructing an attribute graph model from the acquired key-value pairs of each attribute of the user and the item.
Constructing the knowledge graph according to the pattern graph and the preprocessed data further comprises:
mapping the user to a head entity, mapping the item to a tail entity, mapping the relationship between the user and the corresponding item to 0 or 1, and storing the knowledge graph in a graph database.
Querying the sensitive data in the knowledge graph and outputting the sensitive data comprises:
querying the completed graph data in the knowledge graph with a graph query language, and returning all user and item nodes that have the corresponding sensitive relations according to the declared query target.
Querying the sensitive data in the knowledge graph and outputting the sensitive data further comprises:
restoring the returned sensitive nodes to the corresponding original data according to the data storage format and the encoding method, and storing the original data in a corresponding storage file.
In the sensitive data identification method based on a knowledge graph of the invention, the acquired original data set is first preprocessed in order to construct a user-item knowledge graph: a user-item pattern graph is constructed from the preprocessed data, and the knowledge graph is then constructed from the preprocessed data and the pattern graph. Second, in order to identify sensitive data, the sensitive relations that did not originally exist between users and items in the knowledge graph are completed through the constructed sensitive relation reasoning rule. Finally, the whole knowledge graph is queried for sensitive data and the data is output, which improves identification speed.
Detailed Description
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein the same or similar reference numerals refer throughout to the same or similar elements or to elements having the same or similar functions. The embodiments described below with reference to the drawings are exemplary, are intended to explain the invention, and are not to be construed as limiting the invention.
In the description of the present invention, "a plurality" means two or more unless specifically defined otherwise.
Referring to Figs. 1 and 2, the present invention provides a sensitive data identification method based on a knowledge graph, which includes the following steps:
S101, preprocessing the acquired original data and constructing a user-item pattern graph.
Specifically, the original data includes structured scoring data and unstructured text data, such as user comments and product descriptions. This part of the work mainly covers the following aspects:
1. For the various types of data, the data storage format and the encoding method are unified first. To meet the requirements of subsequent knowledge extraction and data storage, the scoring data from multiple data sources is converted into a file containing the user ID, item ID, user attributes and item attributes.
2. E-commerce data contains a lot of redundant data, such as repeated scoring records, advertisement information in texts, repeated paragraphs and low-quality data. Data that is unreliable or incomplete is therefore removed, which normalizes the data and improves its usability, and also reduces the amount of data to be identified later, improving identification speed.
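The two preprocessing aspects above can be sketched roughly as follows. This is only a minimal illustration, not the method's actual implementation: the record layout (`user_id`, `item_id`, `score`), the `Rating` type and the `normalize_ratings` helper are assumptions made for the example, and "redundant" is narrowed here to malformed records and repeated scores for the same user and item.

```python
from typing import Iterable, NamedTuple


class Rating(NamedTuple):
    user_id: int
    item_id: int
    score: float


def normalize_ratings(raw_rows: Iterable[dict]) -> list:
    """Unify field types and drop incomplete or repeated scoring records."""
    seen = set()
    cleaned = []
    for row in raw_rows:
        try:
            rating = Rating(int(row["user_id"]), int(row["item_id"]), float(row["score"]))
        except (KeyError, TypeError, ValueError):
            continue  # incomplete or malformed record: treated as low-quality data
        key = (rating.user_id, rating.item_id)
        if key in seen:
            continue  # repeated scoring record for the same user and item
        seen.add(key)
        cleaned.append(rating)
    return cleaned


# Hypothetical rows from two sources that store the same fields differently.
raw = [
    {"user_id": "231", "item_id": "324", "score": "5"},
    {"user_id": 231, "item_id": 324, "score": 5.0},      # duplicate, other format
    {"user_id": "232", "item_id": None, "score": "3"},   # incomplete record
    {"user_id": "233", "item_id": "325", "score": "4"},
]
ratings = normalize_ratings(raw)
```

Only the first and last rows survive, so later steps have fewer records to scan.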
The preprocessed data mainly comprises three files: ratio.dat, users.dat and product.dat. ratio.dat mainly stores users and the scores those users gave to items; users.dat mainly stores users and their attributes (gender, age, occupation and the like); product.dat mainly stores items and their attributes (such as category).
The age, occupation and gender of the user are selected as the attributes of the user, and the relationship between the user and the item is marked as a purchase relationship. Entity alignment is performed on the data in the data set with a database tool. The user-item pattern graph is then constructed from the preprocessed information; as shown in Fig. 3, it includes the user, the item, the related attributes of the user, and the purchase relationship between the user and the item.
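One way to picture the pattern graph is as a small schema object that lists the entity types, their attributes and the allowed relation. The sketch below is illustrative only; the class names `PatternGraph` and `RelationType` and the type names `User` and `Item` are assumptions, and the item attribute `category` is taken from the description of product.dat.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class RelationType:
    name: str
    head: str
    tail: str


@dataclass
class PatternGraph:
    # entity type name -> tuple of attribute names
    entities: dict = field(default_factory=dict)
    relations: list = field(default_factory=list)

    def add_entity(self, name, attributes):
        self.entities[name] = tuple(attributes)

    def add_relation(self, name, head, tail):
        # A relation may only connect entity types that are already declared.
        for endpoint in (head, tail):
            if endpoint not in self.entities:
                raise ValueError(f"unknown entity type: {endpoint}")
        self.relations.append(RelationType(name, head, tail))


schema = PatternGraph()
schema.add_entity("User", ["age", "occupation", "gender"])
schema.add_entity("Item", ["category"])
schema.add_relation("purchase", head="User", tail="Item")
```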
S102, constructing a knowledge graph according to the pattern graph and the preprocessed data.
Specifically, the user-item knowledge graph is constructed from the preprocessed data and the designed pattern graph. This mainly involves knowledge representation, knowledge extraction, knowledge fusion and related steps.
The specific steps of knowledge representation are as follows:
The attribute graph model is used as the data model. Users and items are represented as nodes: a user node has attributes such as age, occupation and gender, an item node has attributes such as category and price, and each attribute is a key-value pair. Each edge has a label representing a relation, and edges can also have attributes. For example, for the statement "the user once purchased a pen", the corresponding attribute graph is shown in Fig. 4, and the corresponding triple can be represented as (lie, purchase, pen), meaning that a "purchase" relation exists between the head entity "lie" and the tail entity "pen"; the graph also shows the age of the user and the price of the item.
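The attribute graph model described above can be sketched with two small classes. This is a minimal illustration under assumptions: the `Node` and `Edge` classes are hypothetical, and the age and price values are invented for the example since the text does not give them.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    node_id: str
    label: str                                       # "User" or "Item"
    properties: dict = field(default_factory=dict)   # key-value attributes


@dataclass
class Edge:
    head: Node
    label: str                                       # relation name
    tail: Node
    properties: dict = field(default_factory=dict)   # edges may carry attributes too

    def as_triple(self):
        """Return the (head entity, relation, tail entity) view of this edge."""
        return (self.head.node_id, self.label, self.tail.node_id)


user = Node("lie", "User", {"age": 30})      # age value is made up
pen = Node("pen", "Item", {"price": 2.5})    # price value is made up
edge = Edge(user, "purchase", pen)
triple = edge.as_triple()
```

The triple keeps only the relation, while the node properties retain the user's age and the item's price.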
The specific steps of knowledge extraction are as follows:
The purpose of knowledge extraction is to perform entity extraction, entity relation extraction and attribute extraction on the structured and unstructured data and to store the extracted data in the knowledge graph. For structured data, the user ID is mapped directly to the ID of the head entity, the item ID is mapped to the ID of the tail entity, and the relation between the user ID and the item ID is mapped to 1 or 0, indicating whether or not the user has interacted with the item. For example, the triple (231, 1, 324) indicates that the user with ID 231 purchased the item with ID 324.
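For the structured case, the mapping to (head, relation, tail) triples with a 1 or 0 relation might look like the sketch below. The function name `extract_triples` and the second user and item are assumptions for illustration; only the triple (231, 1, 324) comes from the text.

```python
def extract_triples(user_ids, item_ids, purchases):
    """Map each (user, item) pair to (head ID, relation, tail ID).

    The relation is 1 if the user interacted with the item, otherwise 0.
    """
    purchased = set(purchases)
    return [
        (user_id, 1 if (user_id, item_id) in purchased else 0, item_id)
        for user_id in user_ids
        for item_id in item_ids
    ]


triples = extract_triples(
    user_ids=[231, 232],
    item_ids=[324, 325],
    purchases=[(231, 324), (232, 325)],
)
```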
The specific steps of storing the knowledge graph are as follows:
The user-item knowledge graph has a graph structure, and storing it in a relational database would hinder the subsequent searching and processing of sensitive relations. The user-item knowledge graph is therefore stored in the mainstream graph database Neo4j: a user entity is stored as a node in the graph database, and the relation between a user and an item is stored as an edge connecting the nodes.
After entity extraction, entity relation extraction and attribute extraction, the corresponding knowledge graph is established and stored in the graph database. A knowledge graph stored in a graph database can fuse the multi-source data of a recommendation system scenario well, and sensitive data can be extracted quickly by querying the database with a graph query language.
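As a rough sketch of the storage step, the extracted triples could be turned into parameterized Cypher statements before being sent to the graph database. The node labels `User` and `Item`, the relationship type `PURCHASE` and the helper `purchase_statements` are assumptions for illustration; the text does not specify the labels used in Neo4j, and the sketch stops short of opening a database connection.

```python
# Parameterized Cypher: MERGE creates the nodes and the edge only if missing.
PURCHASE_QUERY = (
    "MERGE (u:User {id: $user_id}) "
    "MERGE (i:Item {id: $item_id}) "
    "MERGE (u)-[:PURCHASE]->(i)"
)


def purchase_statements(triples):
    """Return (query, parameters) pairs for every triple whose relation is 1."""
    return [
        (PURCHASE_QUERY, {"user_id": head, "item_id": tail})
        for head, relation, tail in triples
        if relation == 1
    ]


statements = purchase_statements([(231, 1, 324), (231, 0, 325)])
```

Each pair could then be executed through whichever Neo4j client is in use; passing IDs as parameters rather than formatting them into the query text avoids quoting problems.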
S103, establishing a sensitive relation inference rule, and completing the knowledge graph.
Specifically, in order to identify sensitive data, a sensitive relation inference rule between users and items must first be constructed. The rules are defined over the attributes of the user (age, occupation, gender and the like) and the attributes of the item (price, category and the like). An example of a defined rule is as follows:
Rule 1: if a user with a general occupation often buys medication, the item is sensitive for that user.
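Read as a predicate over user attributes, item attributes and purchase frequency, Rule 1 might be sketched as below. This is an illustration only and not the Jena rule itself: the attribute values `"general"` and `"medication"` and the threshold `MIN_PURCHASES` that stands in for "often" are assumptions.

```python
MIN_PURCHASES = 3  # hypothetical threshold for "often"; not given in the text


def rule1(user, item, purchase_count):
    """Return True if the item should be marked sensitive for this user."""
    return (
        user.get("occupation") == "general"
        and item.get("category") == "medication"
        and purchase_count >= MIN_PURCHASES
    )


is_sensitive = rule1({"occupation": "general"}, {"category": "medication"}, 4)
```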
Inference is carried out with the rule-based inference engine built into the Jena toolkit. This mainly involves three steps: importing the modules required for modeling, building the ontology, and adding the inference engine.
1. Modules required for modeling
First, a Model is built with the most basic package, org.apache.jena.rdf.model. Second, org.apache.jena.vocabulary.RDFS is used to establish binary relations. org.apache.jena.reasoner.Reasoner and org.apache.jena.reasoner.ReasonerRegistry are used to create inference engines.
2. Building ontologies
The Model established in step 1 above is essentially the knowledge base structure, that is, the ontology, in Jena.
3. Adding the inference engine
The built-in RDFS reasoner is selected directly to complete the sensitive relation inference.
After the example rule above, Rule 1, is executed, sensitive relations are added between all nodes in the knowledge graph that meet its conditions.
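The effect of executing a rule on the graph, that is, adding the sensitive relations that were not there before, can be illustrated without Jena by a single forward pass over the purchase records. The function `complete_sensitive_relations`, the edge representation and the example rule with a threshold of two purchases are all assumptions made for this sketch.

```python
from collections import Counter


def complete_sensitive_relations(users, items, purchases, rule):
    """Return purchase edges plus a 'sensitive' edge wherever the rule fires."""
    counts = Counter(purchases)  # (user_id, item_id) -> number of purchases
    edges = {(user_id, "purchase", item_id) for user_id, item_id in counts}
    for (user_id, item_id), count in counts.items():
        if rule(users[user_id], items[item_id], count):
            edges.add((user_id, "sensitive", item_id))
    return edges


def example_rule(user, item, count):
    # Hypothetical reading of Rule 1 with "often" taken as two or more purchases.
    return (
        user["occupation"] == "general"
        and item["category"] == "medication"
        and count >= 2
    )


users = {231: {"occupation": "general"}, 232: {"occupation": "doctor"}}
items = {324: {"category": "medication"}, 325: {"category": "stationery"}}
purchases = [(231, 324), (231, 324), (231, 325), (232, 324), (232, 324)]
edges = complete_sensitive_relations(users, items, purchases, example_rule)
```

After the pass, user 231 and item 324 are linked by both a purchase edge and a sensitive edge, while the other pairs keep only their purchase edge.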
S104, querying the sensitive data in the knowledge graph and outputting the sensitive data.
Specifically, after step S103 the knowledge graph contains the completed sensitive relations between users and items, so the relations between nodes are expanded from the original single purchase relation to a purchase relation and a sensitive relation. The graph data in the knowledge graph is queried with Cypher, a declarative graph query language; the main approach is to declare the "query target" directly on the knowledge graph. The example script below returns all nodes that have a relationship with a node labeled sensitive, that is, all user and item nodes that have a sensitive relation.
MATCH (n)--(m:sensitive)
RETURN n;
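For readers without a graph database at hand, the intent of the query can be mimicked over an in-memory edge list. This is not a translation of the Cypher script: it assumes the sensitive relation is stored as an edge named `"sensitive"`, and the function `query_sensitive_nodes` is hypothetical.

```python
def query_sensitive_nodes(edges):
    """Return the user IDs and item IDs that take part in a sensitive relation."""
    users, items = set(), set()
    for head, relation, tail in edges:
        if relation == "sensitive":
            users.add(head)
            items.add(tail)
    return sorted(users), sorted(items)


edges = [
    (231, "purchase", 324),
    (231, "sensitive", 324),
    (231, "purchase", 325),
    (232, "purchase", 324),
]
sensitive_users, sensitive_items = query_sensitive_nodes(edges)
```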
Outputting sensitive data
All identified sensitive nodes are restored to original data according to the data storage format and the encoding method, and the original data is stored in a storage file or in the established graph database. The sensitive data is thereby selected from a large amount of data, completing the identification of the sensitive data.
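The output step can be sketched as joining each sensitive (user, item) pair back to its attribute records and writing the result in a tabular format. The CSV layout, the column names and the helper `export_sensitive_records` are assumptions; the text does not fix the format of the storage file. An in-memory buffer stands in for the file here.

```python
import csv
import io


def export_sensitive_records(sensitive_pairs, users, items, out):
    """Write one row per sensitive (user, item) pair to a file-like object."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["user_id", "item_id", "occupation", "category"])
    for user_id, item_id in sorted(sensitive_pairs):
        writer.writerow(
            [user_id, item_id, users[user_id]["occupation"], items[item_id]["category"]]
        )


buffer = io.StringIO()  # stands in for an opened storage file
export_sensitive_records(
    {(231, 324)},
    users={231: {"occupation": "general"}},
    items={324: {"category": "medication"}},
    out=buffer,
)
output = buffer.getvalue()
```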
The innovation points of the invention comprise the following aspects:
A personalized privacy definition is proposed that takes into account the sensitivity between users and items. Existing work on privacy protection in recommendation systems assumes that all user feedback data is sensitive. In fact, different users are sensitive to different items to different degrees; that is, different users have different privacy protection requirements, and different items have different degrees of sensitivity.
A method is provided that identifies sensitive data by constructing a user-item knowledge graph and completing, through relation reasoning, the sensitive relations missing from the knowledge graph, which addresses the low identification speed of traditional sensitive data identification methods.
Compared with the prior art, the method not only effectively improves the accuracy of sensitive data identification, but also improves identification speed when facing a large amount of complex data.
While the invention has been described with reference to a preferred embodiment, it will be understood by those skilled in the art that various changes in form and detail may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.