Technical Field
The present invention relates to the field of data processing technology, and in particular to a resume data information parsing and matching method, apparatus, electronic device, and medium.
Background Art
In existing technical solutions, resume matching usually relies on manual screening to find the resumes associated with a position, which incurs high labor costs and takes a long time.
At present, intelligent resume screening remains at the preliminary stage of removing resumes that fail certain requirements (for example, screening out resumes that do not meet the educational qualification), and cannot yet match positions and resumes automatically.
Summary of the Invention
In view of the above, it is necessary to provide a resume data information parsing and matching method, apparatus, electronic device, and medium that can match positions and resumes intelligently, quickly, and accurately.
A resume data information parsing and matching method, the method comprising:
retrieving resumes from a database, and preprocessing the retrieved resumes to obtain resumes to be parsed;
constructing a word segmentation directed acyclic graph according to a pre-constructed word segmentation dictionary, and segmenting the resumes to be parsed according to the constructed word segmentation directed acyclic graph to obtain segmented resume text;
constructing a co-occurrence matrix according to the segmented resume text, and determining keywords of the resume text based on the co-occurrence matrix;
acquiring a character sequence in the keywords, and processing the character sequence using a word representation model to obtain a word representation of the character sequence;
inputting the word representation into a constructed resume tag parsing model to obtain a predicted resume tag sequence;
calculating the similarity between each tag in the resume tag sequence and the tag of each position, and determining, from the resumes to be parsed, the resume matching each position according to the calculated similarity.
According to a preferred embodiment of the present invention, preprocessing the retrieved resumes comprises:
removing stop words from the retrieved resumes using a stop word list filtering method.
According to a preferred embodiment of the present invention, constructing a co-occurrence matrix according to the resume text and determining keywords of the resume text based on the co-occurrence matrix comprises:
constructing the co-occurrence matrix according to the number of occurrences of each word segment in the resume text;
extracting the word frequency and degree of each word segment from the co-occurrence matrix;
calculating a score for each word segment according to its word frequency and degree;
outputting the word segments in descending order of score to obtain the keywords of the resume text.
According to a preferred embodiment of the present invention, after the keywords of the resume text are obtained, the method further comprises:
when the number of times two keywords are adjacent in the same document is greater than a preset value, merging the two keywords into a new keyword.
According to a preferred embodiment of the present invention, performing word representation processing on the character sequence using a word representation model to obtain the word representation of the character sequence comprises:
inputting the character sequence in the keywords into the word representation model, generating a first vector containing the character sequence and its preceding-context information by reading the character sequence forward, and generating a second vector containing the character sequence and its following-context information by reading the character sequence backward;
concatenating the first vector and the second vector to obtain a word representation containing the character sequence and its context information.
According to a preferred embodiment of the present invention, the method further comprises:
acquiring resume data;
splitting the resume data to obtain a training set and a validation set;
training a CRF model using the training set, and predicting a target tag sequence using a conditional log-likelihood function and a maximum score formula;
verifying the target tag sequence with the validation set;
when the target tag sequence passes verification, stopping training and obtaining the resume tag parsing model.
According to a preferred embodiment of the present invention, calculating the similarity between each tag in the resume tag sequence and the tag of each position, and determining, from the resumes to be parsed, the resume matching each position according to the calculated similarity comprises:
calculating the cosine distance between each tag and the tag of each position;
when the cosine distance between a target tag and a target position is less than or equal to a preset distance, retrieving the target resume corresponding to the target tag from the resumes to be parsed;
determining that the target resume matches the target position.
A resume data information parsing and matching apparatus, the apparatus comprising:
a preprocessing unit, configured to retrieve resumes from a database and preprocess the retrieved resumes to obtain resumes to be parsed;
a construction unit, configured to construct a word segmentation directed acyclic graph according to a pre-constructed word segmentation dictionary, and segment the resumes to be parsed according to the constructed word segmentation directed acyclic graph to obtain segmented resume text;
a determination unit, configured to construct a co-occurrence matrix according to the segmented resume text, and determine keywords of the resume text based on the co-occurrence matrix;
a processing unit, configured to acquire a character sequence in the keywords, and process the character sequence using a word representation model to obtain a word representation of the character sequence;
a prediction unit, configured to input the word representation into a constructed resume tag parsing model to obtain a predicted resume tag sequence;
the determination unit being further configured to calculate the similarity between each tag in the resume tag sequence and the tag of each position, and determine, from the resumes to be parsed, the resume matching each position according to the calculated similarity.
According to a preferred embodiment of the present invention, the preprocessing unit is specifically configured to:
remove stop words from the retrieved resumes using a stop word list filtering method.
According to a preferred embodiment of the present invention, the determination unit constructing a co-occurrence matrix according to the resume text and determining keywords of the resume text based on the co-occurrence matrix comprises:
constructing the co-occurrence matrix according to the number of occurrences of each word segment in the resume text;
extracting the word frequency and degree of each word segment from the co-occurrence matrix;
calculating a score for each word segment according to its word frequency and degree;
outputting the word segments in descending order of score to obtain the keywords of the resume text.
According to a preferred embodiment of the present invention, the apparatus further comprises:
a merging unit, configured to, after the keywords of the resume text are obtained, merge two keywords into a new keyword when the number of times the two keywords are adjacent in the same document is greater than a preset value.
According to a preferred embodiment of the present invention, the processing unit is specifically configured to:
input the character sequence in the keywords into the word representation model, generate a first vector containing the character sequence and its preceding-context information by reading the character sequence forward, and generate a second vector containing the character sequence and its following-context information by reading the character sequence backward;
concatenate the first vector and the second vector to obtain a word representation containing the character sequence and its context information.
According to a preferred embodiment of the present invention, the apparatus further comprises:
an acquisition unit, configured to acquire resume data;
a splitting unit, configured to split the resume data to obtain a training set and a validation set;
a training unit, configured to train a CRF model using the training set, and predict a target tag sequence using a conditional log-likelihood function and a maximum score formula;
a verification unit, configured to verify the target tag sequence with the validation set;
the training unit being further configured to stop training and obtain the resume tag parsing model when the target tag sequence passes verification.
According to a preferred embodiment of the present invention, the determination unit calculating the similarity between each tag in the resume tag sequence and the tag of each position, and determining, from the resumes to be parsed, the resume matching each position according to the calculated similarity comprises:
calculating the cosine distance between each tag and the tag of each position;
when the cosine distance between a target tag and a target position is less than or equal to a preset distance, retrieving the target resume corresponding to the target tag from the resumes to be parsed;
determining that the target resume matches the target position.
An electronic device, the electronic device comprising:
a memory storing at least one instruction; and
a processor executing the instructions stored in the memory to implement the resume data information parsing and matching method.
A computer-readable storage medium storing at least one instruction, the at least one instruction being executed by a processor in an electronic device to implement the resume data information parsing and matching method.
As can be seen from the above technical solutions, the present invention retrieves resumes from a database and preprocesses them to obtain resumes to be parsed. It constructs a word segmentation directed acyclic graph according to a pre-constructed word segmentation dictionary and segments the resumes to be parsed according to that graph, so that the segmentation result is obtained quickly. It then constructs a co-occurrence matrix from the segmented resume text, determines the keywords of the resume text based on the co-occurrence matrix, acquires the character sequence in the keywords, and processes the character sequence with a word representation model to obtain its word representation, which improves the parsing effect. Finally, it inputs the word representation into the constructed resume tag parsing model to obtain a predicted resume tag sequence, calculates the similarity between each tag in the resume tag sequence and the tag of each position, and determines the resume matching each position according to the calculated similarity, thereby matching positions and resumes intelligently, quickly, and accurately.
Brief Description of the Drawings
FIG. 1 is a flowchart of a preferred embodiment of the resume data information parsing and matching method of the present invention.
FIG. 2 is a functional module diagram of a preferred embodiment of the resume data information parsing and matching apparatus of the present invention.
FIG. 3 is a schematic structural diagram of an electronic device of a preferred embodiment implementing the resume data information parsing and matching method of the present invention.
FIG. 4 is a schematic diagram of the co-occurrence matrix in a preferred embodiment implementing the resume data information parsing and matching method of the present invention.
The realization of the purpose, the functional features, and the advantages of the present invention will be further described in conjunction with the embodiments and with reference to the accompanying drawings.
Detailed Description
To make the purpose, technical solutions, and advantages of the present invention clearer, the present invention is described in detail below with reference to the accompanying drawings and specific embodiments.
FIG. 1 is a flowchart of a preferred embodiment of the resume data information parsing and matching method of the present invention. Depending on requirements, the order of the steps in the flowchart may be changed, and some steps may be omitted.
The resume data information parsing and matching method is applied to one or more electronic devices. An electronic device is a device that can automatically perform numerical calculation and/or information processing according to preset or stored instructions; its hardware includes, but is not limited to, a microprocessor, an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA), a Digital Signal Processor (DSP), an embedded device, and the like.
The electronic device may be any electronic product capable of human-computer interaction with a user, for example, a personal computer, a tablet computer, a smartphone, a Personal Digital Assistant (PDA), a game console, an Internet Protocol Television (IPTV), a smart wearable device, and the like.
The electronic device may also include a network device and/or a user device. The network device includes, but is not limited to, a single network server, a server group composed of multiple network servers, or a cloud composed of a large number of hosts or network servers based on cloud computing.
The network in which the electronic device is located includes, but is not limited to, the Internet, a wide area network, a metropolitan area network, a local area network, a Virtual Private Network (VPN), and the like.
S10, retrieving resumes from a database, and preprocessing the retrieved resumes to obtain resumes to be parsed.
In at least one embodiment of the present invention, the database may be a database that communicates with the electronic device or an internal database of the electronic device, and may be configured as required.
For example, the database may be a talent pool. The electronic device retrieves and organizes resumes from the talent pool to obtain a large number of resumes. A resume can be summarized as a set of nouns {name, gender, date of birth, political status, school, education level, major, contact information, native place, educational experience, skills, ...}, in which every item has an expanded description and the items are separated by delimiters. Because of the particular nature of job hunting as a social behavior, and because people imitate one another, many job seekers describe their own characteristics in very similar ways. From the large number of resumes sharing these commonalities, the electronic device parses out the resumes containing content that the resume screener is interested in and cares about, forming a roughly convergent, finite set of resumes that serves as the retrieved resumes.
In at least one embodiment of the present invention, since the same person may submit multiple resumes during a job search, the electronic device may first remove duplicate resumes, thereby deduplicating the resumes.
Furthermore, resumes also contain redundant stop words, which likewise degrade parsing. The stop words therefore need to be removed as well; that is, the retrieved resumes are preprocessed.
Specifically, the electronic device preprocessing the retrieved resumes comprises:
the electronic device removing stop words from the retrieved resumes using a stop word list filtering method.
Stop words are function words in the text data that carry no actual meaning. They contribute nothing to distinguishing one text from another, yet they occur frequently; they may include common pronouns, prepositions, and the like. Left in place, stop words reduce the accuracy of text classification.
Furthermore, the electronic device may match the words in the retrieved resumes one by one against a pre-built stop word list. If a word matches, it is a stop word, and the electronic device deletes it.
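The deduplication and stop word filtering described above can be sketched as follows. This is a minimal illustration rather than the patented implementation: it assumes each resume has already been split into tokens, treats two resumes as duplicates only when their text is identical, and uses a tiny hypothetical stop word list named `STOP_WORDS`.

```python
def deduplicate(resumes):
    """Keep the first copy of each resume text, preserving the original order."""
    seen = set()
    unique = []
    for text in resumes:
        if text not in seen:
            seen.add(text)
            unique.append(text)
    return unique


def remove_stop_words(tokens, stop_words):
    """Drop every token that matches an entry in the stop word list."""
    return [tok for tok in tokens if tok not in stop_words]


# Hypothetical stop word list; a real list would be far larger.
STOP_WORDS = frozenset({"的", "了", "和", "在", "我"})

tokens = ["我", "擅长", "研究", "和", "编程"]
filtered = remove_stop_words(tokens, STOP_WORDS)
print(filtered)
```

Using a `frozenset` keeps each membership check constant-time, which matters when the same list is applied to a large number of resumes.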
S11, constructing a word segmentation directed acyclic graph according to a pre-constructed word segmentation dictionary, and segmenting the resumes to be parsed according to the constructed word segmentation directed acyclic graph to obtain segmented resume text.
In at least one embodiment of the present invention, the word segmentation dictionary may include a prefix dictionary, a custom dictionary, and the like.
The prefix dictionary contains the prefixes of every word in the statistical dictionary. For example, the prefixes of the dictionary word "北京大学" (Peking University) are "北", "北京", and "北京大", and the prefix of the word "大学" (university) is "大". The custom dictionary, also called a proper noun dictionary, contains words that are absent from the statistical dictionary but are specific to a given field, such as "简历" (resume) and "工作经历" (work experience).
Furthermore, the electronic device constructs the word segmentation directed acyclic graph according to the pre-constructed word segmentation dictionary, where each word corresponds to one directed edge in the graph and is assigned a corresponding edge length (weight). Among all paths from the start node to the end node, the electronic device computes the path lengths and arranges the distinct length values in strictly ascending order (that is, the values at any two different ranks are never equal; the same applies below). The sets of paths with the 1st, 2nd, ..., i-th, ..., N-th shortest lengths form the corresponding rough segmentation result set. If two or more paths have equal length, they share the same rank i and are all included in the rough segmentation result set, without affecting the rank numbers of the other paths. The final rough segmentation result set therefore has a size greater than or equal to N, and the segmented resume text is obtained from it.
With the above implementation, the segmentation result of the resume text can be obtained quickly using the word segmentation dictionary and the directed acyclic graph.
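As an illustration of dictionary-driven segmentation over a directed acyclic graph, the sketch below builds the graph for one sentence and recovers only the single shortest path, not the full N-shortest result set described above. It assumes every edge has weight 1 (the source does not specify the weights), checks candidate words against a plain set rather than a prefix dictionary, and uses a hypothetical `DICTIONARY`.

```python
def build_dag(sentence, dictionary):
    """Map each start index to the exclusive end indices of candidate words.

    A single character is always a candidate, so every position is reachable.
    """
    n = len(sentence)
    dag = {}
    for start in range(n):
        ends = [start + 1]
        for end in range(start + 2, n + 1):
            if sentence[start:end] in dictionary:
                ends.append(end)
        dag[start] = ends
    return dag


def shortest_segmentation(sentence, dictionary):
    """Return the segmentation that uses the fewest edges (unit edge weights)."""
    n = len(sentence)
    dag = build_dag(sentence, dictionary)
    # best[i] = (cost of the cheapest path from i to the end, next node on it)
    best = {n: (0, n)}
    for start in range(n - 1, -1, -1):
        best[start] = min((1 + best[end][0], end) for end in dag[start])
    words, pos = [], 0
    while pos < n:
        nxt = best[pos][1]
        words.append(sentence[pos:nxt])
        pos = nxt
    return words


DICTIONARY = {"北京", "北京大学", "大学", "毕业"}  # hypothetical entries
print(shortest_segmentation("北京大学毕业", DICTIONARY))
```

Because the graph only has edges pointing forward in the sentence, a single right-to-left pass is enough to find the cheapest path; extending the table to keep the N best costs per node would yield the rough segmentation result set.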
S12, constructing a co-occurrence matrix according to the segmented resume text, and determining keywords of the resume text based on the co-occurrence matrix.
In at least one embodiment of the present invention, the electronic device constructing a co-occurrence matrix according to the resume text and determining keywords of the resume text based on the co-occurrence matrix comprises:
the electronic device constructing the co-occurrence matrix according to the number of occurrences of each word segment in the resume text, and extracting the word frequency (freq) and degree (deg) of each word segment from the co-occurrence matrix; the electronic device calculating a score for each word segment according to its word frequency and degree, and then outputting the word segments in descending order of score to obtain the keywords of the resume text.
For example, the electronic device outputs the word segments in descending order of score and takes the top n of them, such as the top one third by score, as the keywords of the resume text.
The co-occurrence matrix is built by counting how many times words co-occur within a window of a predetermined size; the co-occurrence counts of the words surrounding a given word serve as the vector of that word.
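The keyword scoring can be sketched as below. Several choices here are assumptions not stated in the source: the window size is 1, the diagonal of the matrix stores each word's own frequency, the degree is the row sum, and the score is the RAKE-style ratio deg/freq. The three token lists are the example sentences used in this description with punctuation omitted.

```python
from collections import defaultdict


def build_cooccurrence(sentences, window=1):
    """Count co-occurrences within `window` positions; diagonal = word frequency."""
    matrix = defaultdict(lambda: defaultdict(int))
    for tokens in sentences:
        for i, word in enumerate(tokens):
            matrix[word][word] += 1
            for j in range(i + 1, min(i + window + 1, len(tokens))):
                other = tokens[j]
                if other != word:
                    matrix[word][other] += 1
                    matrix[other][word] += 1
    return matrix


def rank_keywords(sentences, window=1):
    """Rank words by deg/freq (an assumed scoring rule), highest score first."""
    matrix = build_cooccurrence(sentences, window)
    scores = {}
    for word in matrix:
        freq = matrix[word][word]
        deg = sum(matrix[word].values())
        scores[word] = deg / freq
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


sentences = [
    ["我", "擅长", "研究"],
    ["我", "擅长", "编程"],
    ["我", "享受", "阅读"],
]
ranked = rank_keywords(sentences)
top_n = max(1, len(ranked) // 3)  # keep the top third, as in the example above
keywords = [word for word, _ in ranked[:top_n]]
print(keywords)
```

A different scoring rule, such as using deg alone, would only change the expression assigned to `scores[word]`.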
For example, suppose the resume text contains the following corpus:
我擅长研究。 (I am good at research.) (This sentence contains the word segments "我", "擅长", "研究", and "。"; the two sentences below are segmented in the same way and are not listed one by one.)
我擅长编程。 (I am good at programming.)
我享受阅读。 (I enjoy reading.)
The co-occurrence matrix X constructed from the above corpus is shown in FIG. 4.
In at least one embodiment of the present invention, after the keywords of the resume text are obtained, the method further comprises:
when the number of times two keywords are adjacent in the same document is greater than a preset value, the electronic device merging the two keywords into a new keyword.
The preset value may be, for example, 2.
With the above implementation, keywords that frequently appear next to each other can be merged further, avoiding redundant keywords.
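The merging rule can be illustrated with the following sketch. It assumes the document is available as an ordered token list, that the merged keyword is the plain concatenation of the two words, and that the two original keywords are dropped once merged; none of these details is fixed by the source.

```python
from collections import Counter


def merge_adjacent_keywords(tokens, keywords, threshold=2):
    """Merge keyword pairs that appear side by side more than `threshold` times."""
    pair_counts = Counter(
        (first, second)
        for first, second in zip(tokens, tokens[1:])
        if first in keywords and second in keywords
    )
    merged = set(keywords)
    for (first, second), count in pair_counts.items():
        if count > threshold:
            # Assumption: the merged keyword replaces its two parts.
            merged.discard(first)
            merged.discard(second)
            merged.add(first + second)
    return merged


tokens = ["机器", "学习", "机器", "学习", "机器", "学习", "数据"]
result = merge_adjacent_keywords(tokens, {"机器", "学习", "数据"}, threshold=2)
print(sorted(result))
```

In this example "机器" is followed by "学习" three times, which exceeds the threshold of 2, so the pair becomes the single keyword "机器学习".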
S13, acquiring a character sequence in the keywords, and performing word representation processing on the character sequence using a word representation model to obtain a word representation of the character sequence.
In at least one embodiment of the present invention, the electronic device processing the character sequence using a word representation model to obtain the word representation of the character sequence comprises:
the electronic device inputting the character sequence in the keywords into the word representation model, generating a first vector containing the character sequence and its preceding-context information by reading the character sequence forward, and generating a second vector containing the character sequence and its following-context information by reading the character sequence backward; the electronic device then concatenating the first vector and the second vector to obtain a word representation containing the character sequence and its context information.
For example, given the character sequence $Char = (char_1, char_2, \ldots, char_n)$ of an unstructured text resume containing n key characters, where each $char_i$ is a d-dimensional character vector, the character sequence is input into the word representation model so that the model can be used to model it. Reading the character sequence forward generates a vector containing the character sequence and its preceding-context information, denoted $CharF_i$. Likewise, reading the character sequence backward generates a vector containing the character sequence and its following-context information, denoted $CharB_i$. $CharF_i$ and $CharB_i$ are then concatenated to form a word representation containing the character sequence and its context information:
$Wd = [CharF_i : CharB_i]$
In this way, the electronic device obtains the word representation of the character sequence.
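To make the forward and backward reading concrete, the sketch below runs a plain tanh recurrent cell over the character vectors in both directions and concatenates the two states at each position. The cell type, the random weights, and the sizes `d` and `hidden_size` are illustrative assumptions; the source does not specify the internals of the word representation model at this point.

```python
import math
import random


def rnn_states(vectors, w_in, w_rec):
    """Run a simple tanh recurrent cell and return the hidden state at each step."""
    hidden = [0.0] * len(w_rec)
    states = []
    for x in vectors:
        hidden = [
            math.tanh(
                sum(w * xi for w, xi in zip(w_in[h], x))
                + sum(w * hj for w, hj in zip(w_rec[h], hidden))
            )
            for h in range(len(w_rec))
        ]
        states.append(hidden)
    return states


def word_representation(char_vectors, params):
    """Concatenate forward (CharF_i) and backward (CharB_i) states per position."""
    forward = rnn_states(char_vectors, params["f_in"], params["f_rec"])
    backward = rnn_states(char_vectors[::-1], params["b_in"], params["b_rec"])[::-1]
    return [f + b for f, b in zip(forward, backward)]


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
d, hidden_size = 4, 3  # hypothetical dimensions
params = {
    "f_in": random_matrix(hidden_size, d, rng),
    "f_rec": random_matrix(hidden_size, hidden_size, rng),
    "b_in": random_matrix(hidden_size, d, rng),
    "b_rec": random_matrix(hidden_size, hidden_size, rng),
}
chars = [[rng.uniform(-1.0, 1.0) for _ in range(d)] for _ in range(5)]
wd = word_representation(chars, params)
print(len(wd), len(wd[0]))
```

Each output vector has twice the hidden size: the first half summarizes the characters up to that position, and the second half summarizes the characters from that position to the end.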
It should be noted that, in natural language processing, various word representation models can be used to express the symbolic information of a "word" as a mathematical vector. The vector representation of a word can serve as the input of various machine learning models. Existing word representation models fall into two broad categories: syntagmatic models and paradigmatic models.
Furthermore, the electronic device may additionally format the word representation using regular expression matching, and then parse it, classify it, and store it in a designated database for subsequent use.
S14, inputting the word representation into the constructed resume tag parsing model to obtain a predicted resume tag sequence.
In at least one embodiment of the present invention, the resume tag parsing model is obtained by training on a large amount of resume data as training samples and verifying with a validation set. Parsing the unstructured word representation with the resume tag parsing model outputs the corresponding tags, which form the resume tag sequence.
For example, the tags in the resume tag sequence may include, but are not limited to: undergraduate, postgraduate, proficient in WORD, and the like.
In at least one embodiment of the present invention, the method further comprises:
the electronic device acquiring resume data and splitting the resume data to obtain a training set and a validation set; training a CRF model using the training set, and predicting a target tag sequence using a conditional log-likelihood function and a maximum score formula; verifying the target tag sequence with the validation set; and, when the target tag sequence passes verification, stopping training and obtaining the resume tag parsing model.
The target tag sequence refers to the predicted most suitable tag sequence.
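The split-train-verify loop can be outlined as follows. The sketch is model-agnostic: `train_epoch` and `evaluate` are hypothetical callbacks supplied by the caller, and "passing verification" is assumed to mean reaching a target accuracy on the validation set, which the source does not define.

```python
import random


def split_dataset(samples, validation_ratio=0.2, seed=0):
    """Shuffle the samples and split them into a training set and a validation set."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) - int(len(shuffled) * validation_ratio)
    return shuffled[:cut], shuffled[cut:]


def train_until_validated(train_epoch, evaluate, train_set, validation_set,
                          target_accuracy=0.9, max_epochs=50):
    """Train on the training set and stop once validation accuracy is high enough.

    Returns the number of epochs run and the last validation accuracy.
    """
    accuracy = 0.0
    for epoch in range(1, max_epochs + 1):
        train_epoch(train_set)                # fit on training data only
        accuracy = evaluate(validation_set)   # check on held-out data
        if accuracy >= target_accuracy:       # assumed pass criterion
            return epoch, accuracy
    return max_epochs, accuracy


train_set, validation_set = split_dataset(range(10))
print(len(train_set), len(validation_set))
```

Keeping the validation samples out of the training call is what makes the stopping check meaningful: the model is judged on resumes it has not been fitted to.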
具体地,所述电子设备采用CRF(conditional random field,条件随机场)进行建模。假定得到非结构化文本的关键字信息的输出目标序列(即对应的标签序列)为:y=(y1,…yn)。为了有效获得非结构化文本简历信息的目标序列,模型的分值公式定义如下:Specifically, the electronic device uses CRF (conditional random field) for modeling. Assume that the output target sequence of keyword information of unstructured text (i.e., the corresponding label sequence) is: y = (y1 , ... yn ). In order to effectively obtain the target sequence of unstructured text resume information, the score formula of the model is defined as follows:
其中,P表示双向LSTM算法(Long short-term memory,长短期记忆算法)的输出分值矩阵,其大小为n×k,k表示目标标签的数量,所述目标标签即对该简历的概述评价,n表示词序列的长度,A表示转移分值矩阵。当j=0时,y0表示的是一个序列开始的标志,当j=n时,yn+1表示的是一个序列结束的标志,A方阵的大小为k+2。Wherein, P represents the output score matrix of the bidirectional LSTM algorithm (Long short-term memory), whose size is n×k, k represents the number of target tags, i.e., the summary evaluation of the resume, n represents the length of the word sequence, and A represents the transfer score matrix. When j=0, y0 represents the mark of the beginning of a sequence, when j=n, yn+1 represents the mark of the end of a sequence, and the size of the A matrix is k+2.
在所有简历信息的标签序列上,CRF生成目标序列y的概率为:On all label sequences of resume information, the probability that CRF generates the target sequence y is:
其中,Y_Wd代表简历信息序列Wd对应的所有可能标签序列。在训练过程中,为了获得简历信息正确的标签序列,将最大化正确标签序列的条件对数似然函数,并使用最大分值公式预测最合适的标签序列:Here, Y_Wd denotes the set of all possible label sequences for the resume information sequence Wd. During training, to obtain the correct label sequence of the resume information, the conditional log-likelihood of the correct label sequence is maximized, and the maximum score formula is used to predict the most suitable label sequence:
通过上述实施方式,结合条件对数似然函数及最大分值公式,能够提升模型的准确率。Through the above implementation, combined with the conditional log-likelihood function and the maximum score formula, the accuracy of the model can be improved.
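The formulas referred to in this passage are not reproduced in the text. One set of forms consistent with the definitions given (P of size n×k, A of size k+2, start marker y_0, end marker y_{n+1}, candidate set Y_Wd) is the standard BiLSTM-CRF formulation sketched below. The function name s and the symbol y^{*} are assumed notation, not taken from the source.

```latex
\begin{align}
s(Wd, y) &= \sum_{j=0}^{n} A_{y_j,\,y_{j+1}} + \sum_{j=1}^{n} P_{j,\,y_j} \\
p(y \mid Wd) &= \frac{\exp\bigl(s(Wd, y)\bigr)}{\sum_{\tilde{y} \in Y_{Wd}} \exp\bigl(s(Wd, \tilde{y})\bigr)} \\
\log p(y \mid Wd) &= s(Wd, y) - \log \sum_{\tilde{y} \in Y_{Wd}} \exp\bigl(s(Wd, \tilde{y})\bigr) \\
y^{*} &= \arg\max_{\tilde{y} \in Y_{Wd}} s(Wd, \tilde{y})
\end{align}
```

Under this reading, training maximizes the third expression for the correct label sequence, and prediction returns the sequence y^{*} with the highest score.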
S15,计算所述简历标签序列中的每个标签与每个岗位的标签的相似度,并根据计算的相似度从所述待解析简历中确定与每个岗位匹配的简历。S15, calculating the similarity between each tag in the resume tag sequence and the tag of each position, and determining the resume matching each position from the resumes to be parsed according to the calculated similarity.
在本发明的至少一个实施例中,所述电子设备计算所述简历标签序列中的每个标签与每个岗位的标签的相似度,并根据计算的相似度从所述待解析简历中确定与每个岗位匹配的简历包括:In at least one embodiment of the present invention, the electronic device calculates the similarity between each tag in the resume tag sequence and the tag of each position, and determines the resume matching each position from the resume to be parsed according to the calculated similarity, including:
所述电子设备计算每个标签与每个岗位的标签之间的余弦距离,当存在目标标签与目标岗位之间的余弦距离小于或者等于预设距离时,所述电子设备从所述待解析简历中调取所述目标标签对应的目标简历,并确定所述目标简历与所述目标岗位相匹配。The electronic device calculates the cosine distance between each tag and the tag of each position. When the cosine distance between the target tag and the target position is less than or equal to a preset distance, the electronic device retrieves the target resume corresponding to the target tag from the resume to be parsed, and determines that the target resume matches the target position.
具体地,余弦相似度是用向量空间中两个向量夹角的余弦值作为衡量两个个体间差异的大小的度量,余弦值越接近1,就表明夹角越接近0度,也就是两个向量越相似;所述余弦距离可取为1减去该余弦值,因此余弦距离越小,两个向量越相似。Specifically, cosine similarity uses the cosine of the angle between two vectors in the vector space to measure how much two individuals differ. The closer the cosine is to 1, the closer the angle is to 0 degrees, and the more similar the two vectors are. The cosine distance can be taken as 1 minus this cosine, so a smaller cosine distance indicates more similar vectors.
例如:对于所得到的简历标签序列X和入职岗位所需要的简历标签序列Y,利用下列式子进行计算,式中X_i表示简历标签序列X中第i个向量,Y_i表示入职岗位所需要的简历标签序列Y中第i个向量:For example, for the obtained resume label sequence X and the resume label sequence Y required for the position, the following formula is used, where X_i denotes the i-th vector in X and Y_i denotes the i-th vector in Y:
$$\cos\theta = \frac{\sum_{i=1}^{n} X_i Y_i}{\sqrt{\sum_{i=1}^{n} X_i^{2}}\,\sqrt{\sum_{i=1}^{n} Y_i^{2}}}$$
产生的相似性范围从-1到1,其中,-1意味着两个向量指向的方向正好截然相反,1表示它们的指向是完全相同的,0通常表示它们之间是独立的,而在这之间的值则表示中度的相似性或相异性,根据这一算法,能够对每个岗位选取标签相似度较高的简历,以进行快速匹配入职。The resulting similarity ranges from -1 to 1: -1 means the two vectors point in exactly opposite directions, 1 means they point in the same direction, 0 usually means they are independent, and values in between indicate intermediate similarity or dissimilarity. With this algorithm, the resumes whose labels are most similar to those of each position can be selected, so that candidates are matched to positions quickly.
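The matching rule can be sketched in Python as follows. This is an illustration under stated assumptions rather than the patented implementation: label sequences are assumed to be already encoded as numeric vectors of equal length, cosine distance is taken as 1 minus cosine similarity, and the function names and the `max_distance` value of 0.2 are hypothetical.

```python
import math


def cosine_similarity(x, y):
    """Cosine of the angle between x and y: dot(x, y) / (|x| * |y|)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    if norm == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / norm


def cosine_distance(x, y):
    """Assumed convention: distance = 1 - similarity, so smaller means closer."""
    return 1.0 - cosine_similarity(x, y)


def matches(resume_vec, position_vec, max_distance=0.2):
    """A resume matches a position when the distance is within the preset bound."""
    return cosine_distance(resume_vec, position_vec) <= max_distance


same_direction = cosine_similarity([1.0, 0.0], [1.0, 0.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
opposite = cosine_similarity([1.0, 0.0], [-1.0, 0.0])
```

Because the comparison is on direction only, two vectors that differ by a positive scale factor are treated as a perfect match.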
在本发明的至少一个实施例中,所述电子设备还可以根据得到的简历标签序列及配置的相应的权重(如:研究生标签在简历评分中所占权重为0.2,而本科生标签在简历评分中所占权重为0.1),将所述简历标签序列通过得分进行表示,进一步根据得分快速筛选出所需的员工。In at least one embodiment of the present invention, the electronic device may also convert the resume tag sequence into a score, using the obtained resume tag sequence and the weights configured for its tags (for example, the graduate tag carries a weight of 0.2 in the resume score and the undergraduate tag a weight of 0.1), and then quickly select the desired candidates according to the score.
由以上技术方案可以看出,本发明能够从数据库中调取简历,并对调取的简历进行预处理,得到待解析简历,根据预先构建的分词词典构建词语切分有向无环图,并根据构建的词语切分有向无环图切分所述待解析简历,得到分词处理后的简历文本,进而能够快速得到待解析简历的分词结果,进一步根据所述简历文本构建共现矩阵,并基于所述共现矩阵确定所述简历文本的关键词,获取所述关键词中的字序列,并利用词表示模型对所述字序列进行处理,得到所述字序列的词表示,提升了解析效果,将所述词表示输入到构建的简历标签解析模型中,得到预测的简历标签序列,进一步计算所述简历标签序列中的每个标签与每个岗位的标签的相似度,并根据计算的相似度从所述待解析简历中确定与每个岗位匹配的简历,实现对岗位与简历快速且准确地智能匹配。It can be seen from the above technical scheme that the present invention can retrieve a resume from a database, and pre-process the retrieved resume to obtain a resume to be parsed, construct a word segmentation directed acyclic graph according to a pre-constructed word segmentation dictionary, and segment the resume to be parsed according to the constructed word segmentation directed acyclic graph to obtain a resume text after word segmentation processing, and then quickly obtain the word segmentation result of the resume to be parsed, further construct a co-occurrence matrix according to the resume text, and determine the keywords of the resume text based on the co-occurrence matrix, obtain the word sequence in the keyword, and use the word representation model to process the word sequence to obtain the word representation of the word sequence, thereby improving the parsing effect, and input the word representation into the constructed resume label parsing model to obtain a predicted resume label sequence, further calculate the similarity between each label in the resume label sequence and the label of each position, and determine the resume matching each position from the resume to be parsed according to the calculated similarity, thereby realizing fast and accurate intelligent matching of positions and resumes.
如图2所示,是本发明简历数据信息解析及匹配装置的较佳实施例的功能模块图。所述简历数据信息解析及匹配装置11包括预处理单元110、构建单元111、确定单元112、处理单元113、预测单元114、合并单元115、训练单元116、获取单元117、拆分单元118、验证单元119。本发明所称的模块/单元是指一种能够被处理器13所执行,并且能够完成固定功能的一系列计算机程序段,其存储在存储器12中。在本实施例中,关于各模块/单元的功能将在后续的实施例中详述。As shown in Figure 2, it is a functional module diagram of a preferred embodiment of the resume data information parsing and matching device of the present invention. The resume data information parsing and matching device 11 includes a preprocessing unit 110, a construction unit 111, a determination unit 112, a processing unit 113, a prediction unit 114, a merging unit 115, a training unit 116, an acquisition unit 117, a splitting unit 118, and a verification unit 119. The module/unit referred to in the present invention refers to a series of computer program segments that can be executed by a processor 13 and can perform fixed functions, which are stored in a memory 12. In this embodiment, the functions of each module/unit will be described in detail in subsequent embodiments.
预处理单元110从数据库中调取简历,并对调取的简历进行预处理,得到待解析简历。The preprocessing unit 110 retrieves a resume from a database and preprocesses the retrieved resume to obtain a resume to be parsed.
在本发明的至少一个实施例中,所述数据库可以是与电子设备相通信的数据库,也可以是所述电子设备的内部数据库,根据不同的需求,可以进行自定义配置。In at least one embodiment of the present invention, the database may be a database that communicates with the electronic device, or may be an internal database of the electronic device, and may be customized according to different requirements.
例如:所述数据库可以是人才库。所述预处理单元110从所述人才库中进行简历的调取和整理,得到大量简历。所述简历可以归纳成一个名词集合{姓名、性别、生日、政治面貌、学校、学历、专业、联系方式、籍贯、教育经历、技能……},其中的每一项内容都有展开描述,并且每一项都有分隔符分开。由于求职这一社会行为的特殊性以及人与人之间的模仿,很多求职人员在描述自身特点方面有相当大的共性。所述预处理单元110从大量的具有共性的简历之中解析出包括简历挑选者感兴趣和关心的内容的简历,形成一个大致收敛的有限的简历集合,作为所述调取的简历。For example, the database may be a talent pool. The preprocessing unit 110 retrieves and organizes resumes from the talent pool to obtain a large number of resumes. A resume can be summarized as a set of nouns {name, gender, date of birth, political affiliation, school, education level, major, contact information, place of origin, educational experience, skills, ...}, where each item has an expanded description and the items are separated by delimiters. Because of the particular nature of job hunting as a social behavior, and because people imitate one another, many job seekers describe their own characteristics in very similar ways. From this large number of similar resumes, the preprocessing unit 110 parses out those containing the content that the resume selector is interested in and cares about, forming a roughly convergent, finite set of resumes that serves as the retrieved resumes.
在本发明的至少一个实施例中,由于在求职过程中,同一人有可能发送多份简历,因此可以首先将重复的简历进行剔除,从而实现简历的去重。In at least one embodiment of the present invention, since the same person may send multiple resumes during the job search process, duplicate resumes may be removed first, thereby achieving resume deduplication.
进一步地,由于简历中还存在一些冗余的停用词,同样会对解析产生不利影响,因此,还需要剔除停用词,即对调取的简历进行预处理。Furthermore, since there are some redundant stop words in the resume, which will also have an adverse effect on the parsing, it is also necessary to remove the stop words, that is, pre-process the retrieved resume.
具体地,所述预处理单元110对调取的简历进行预处理包括:Specifically, the preprocessing unit 110 preprocesses the retrieved resume including:
所述预处理单元110采用停用词表过滤方法对所述调取的简历进行去停用词处理。The pre-processing unit 110 uses a stop word list filtering method to remove stop words from the retrieved resume.
其中,所述停用词是文本数据功能词中没有实际意义的词,对文本的分类贡献很小,但是出现的频率高,具体可以包括常用的代词、介词等。若不剔除,所述停用词会降低文本分类效果的准确性。Stop words are function words that carry no real meaning in the text data. They contribute little to text classification yet occur frequently, and may include common pronouns, prepositions, and the like. If they are not removed, stop words reduce the accuracy of text classification.
进一步地,所述预处理单元110可以将调取的简历中的词语与预先构建好的停用词表进行一一匹配,如果匹配成功,那么该词语就是停用词,所述预处理单元110将该词删除。Furthermore, the preprocessing unit 110 may match the words in the retrieved resume with the pre-constructed stop word list one by one. If the match is successful, the word is a stop word and the preprocessing unit 110 deletes the word.
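The deduplication and stop-word filtering steps can be sketched as below. This is a minimal illustration, not the patented implementation: the function names, the exact-text test for duplicates, and the small `STOP_WORDS` set are assumptions made for the example.

```python
def deduplicate(resumes):
    """Keep the first copy of each resume text, preserving the original order."""
    seen = set()
    unique = []
    for text in resumes:
        if text not in seen:
            seen.add(text)
            unique.append(text)
    return unique


def remove_stop_words(tokens, stop_words):
    """Drop every token that matches an entry in the stop-word list."""
    return [tok for tok in tokens if tok not in stop_words]


# Hypothetical stop-word list; a real list would be far longer.
STOP_WORDS = {"我", "的", "在", "和"}

unique_resumes = deduplicate(["resume A", "resume B", "resume A"])
filtered = remove_stop_words(["我", "擅长", "研究"], STOP_WORDS)
```

Storing the stop-word list as a set keeps each lookup constant-time, which matters when every word of every resume is checked against it.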
构建单元111根据预先构建的分词词典构建词语切分有向无环图,并根据构建的词语切分有向无环图切分所述待解析简历,得到分词处理后的简历文本。The construction unit 111 constructs a word segmentation directed acyclic graph according to a pre-constructed word segmentation dictionary, and segments the resume to be parsed according to the constructed word segmentation directed acyclic graph to obtain the resume text after word segmentation processing.
在本发明的至少一个实施例中,所述分词词典可以包括前缀字典、自定义字典等。In at least one embodiment of the present invention, the word segmentation dictionary may include a prefix dictionary, a custom dictionary, and the like.
其中,所述前缀字典包括统计的词典中每一个分词的前缀,例如:词典中的词“北京大学”的前缀分别是“北”、“北京”、“北京大”;词“大学”的前缀是“大”;所述自定义字典也可以称为专有名词词典,包括在统计的词典中不存在,但是某领域特定、专有的词,如简历、工作经历等。The prefix dictionary contains the prefixes of every word in the statistical dictionary. For example, the prefixes of the dictionary word "北京大学" (Peking University) are "北", "北京", and "北京大", and the prefix of the word "大学" (university) is "大". The custom dictionary, which may also be called a proper noun dictionary, contains words that do not exist in the statistical dictionary but are specific to a particular field, such as "resume" and "work experience".
进一步地,所述构建单元111根据预先构建的分词词典构建词语切分有向无环图,其中,每个词对应图中的一条有向边,并赋给相应的边长(权值)。进一步地,所述构建单元111在起点到终点的所有路径中,求出长度值,并按严格升序排列(即:任何两个不同位置上的值一定不等,下同),依次为第1,第2,…,第i,…,第N的路径集合,作为相应的粗分结果集。如果两条或两条以上路径的长度相等,那么它们的长度并列为第i,都要列入所述粗分结果集,而且不影响其他路径的排列序号,最后的粗分结果集的大小大于或等于N,据此得到经过分词处理的简历文本。Furthermore, the construction unit 111 constructs a word segmentation directed acyclic graph according to the pre-constructed word segmentation dictionary, in which each word corresponds to one directed edge of the graph and is assigned an edge length (weight). The construction unit 111 then computes the length of every path from the start point to the end point and arranges the distinct length values in strictly ascending order (that is, the values at any two different positions are never equal; the same applies below), giving the 1st, 2nd, ..., i-th, ..., N-th path sets, which form the rough segmentation result set. If two or more paths have the same length, they share the same rank i and are all included in the rough segmentation result set without affecting the rank of any other path, so the final rough segmentation result set contains at least N paths. The resume text after word segmentation is obtained from this result set.
通过上述实施方式,能够利用分词词典及有向无环图快速得到简历文本的分词结果。Through the above implementation, the word segmentation result of the resume text can be quickly obtained by using the word segmentation dictionary and the directed acyclic graph.
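The graph construction and the ranking of paths by length can be sketched as follows. Several points are assumptions made for illustration: every edge is given length 1 (so the path length equals the number of words), characters not covered by the dictionary fall back to single-character edges, and all paths are enumerated exhaustively, which is only practical for short strings. The names `build_dag`, `all_paths`, `rough_segmentations`, and `DICTIONARY` are hypothetical.

```python
def build_dag(text, dictionary):
    """Map each start index to the end indices (exclusive) of words starting there."""
    dag = {}
    for i in range(len(text)):
        ends = [j for j in range(i + 1, len(text) + 1) if text[i:j] in dictionary]
        dag[i] = ends or [i + 1]  # fall back to a single-character edge
    return dag


def all_paths(text, dag, start=0):
    """Enumerate every start-to-end path as a list of words."""
    if start == len(text):
        return [[]]
    paths = []
    for end in dag[start]:
        for rest in all_paths(text, dag, end):
            paths.append([text[start:end]] + rest)
    return paths


def rough_segmentations(text, dictionary, n=1):
    """Return all paths whose length is among the n smallest distinct lengths."""
    paths = all_paths(text, build_dag(text, dictionary))
    kept_lengths = sorted({len(p) for p in paths})[:n]
    return [p for p in sorted(paths, key=len) if len(p) in kept_lengths]


DICTIONARY = {"北京", "大学", "北京大学", "毕业"}
best = rough_segmentations("北京大学毕业", DICTIONARY, n=1)
top_two = rough_segmentations("北京大学毕业", DICTIONARY, n=2)
```

Paths that tie on length are all kept, so the result can contain more than n segmentations, in line with the description above.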
确定单元112根据所述简历文本构建共现矩阵,并基于所述共现矩阵确定所述简历文本的关键词。The determination unit 112 constructs a co-occurrence matrix according to the resume text, and determines keywords of the resume text based on the co-occurrence matrix.
在本发明的至少一个实施例中,所述确定单元112根据所述简历文本构建共现矩阵,并基于所述共现矩阵确定所述简历文本的关键词包括:In at least one embodiment of the present invention, the determining unit 112 constructs a co-occurrence matrix according to the resume text, and determines the keywords of the resume text based on the co-occurrence matrix, including:
所述确定单元112根据所述简历文本中每个分词出现的次数构建所述共现矩阵,并从所述共现矩阵中提取每个分词的词频(freq)及度(deg),所述确定单元112根据每个分词的词频及度计算每个分词的得分,进一步根据每个分词的得分对每个分词进行降序输出,得到所述简历文本的关键词。The determination unit 112 constructs the co-occurrence matrix from the number of times each segmented word appears in the resume text, and extracts the word frequency (freq) and degree (deg) of each segmented word from the co-occurrence matrix. The determination unit 112 then calculates a score for each segmented word from its frequency and degree, and outputs the segmented words in descending order of score to obtain the keywords of the resume text.
例如:所述确定单元112根据每个分词的得分对每个分词降序输出,得到前n个词语,如按score大小降序输出前1/3的词语作为所述简历文本的关键词。For example, the determination unit 112 outputs the segmented words in descending order of score and takes the top n words, such as the top third of the words by score, as the keywords of the resume text.
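The scoring step can be sketched as below. The text does not state how the score is computed from freq and deg, so the ratio deg/freq used here (as in RAKE-style keyword extraction) is an assumption, as is counting degree over whole token groups. The function names and sample tokens are hypothetical.

```python
from collections import Counter


def score_terms(groups):
    """Score each word as deg / freq, counting co-occurrence within a group."""
    freq = Counter()
    deg = Counter()
    for group in groups:
        for word in group:
            freq[word] += 1
            deg[word] += len(group)  # co-occurs with every word in the group, itself included
    return {word: deg[word] / freq[word] for word in freq}


def top_keywords(scores, fraction=1 / 3):
    """Sort by descending score (ties broken alphabetically) and keep the top fraction."""
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    keep = max(1, round(len(ranked) * fraction))
    return ranked[:keep]


groups = [
    ["good_at", "research"],
    ["good_at", "programming", "design"],
    ["enjoy", "reading"],
]
scores = score_terms(groups)
keywords = top_keywords(scores)
```

With this ratio, words that tend to appear inside longer groups score higher than words that appear often but mostly in short ones.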
其中,所述共现矩阵是通过统计一个事先指定大小的窗口内的词语的共现次数,以词语周边的共现词的次数作为当前词语的向量。The co-occurrence matrix is obtained by counting the number of co-occurrences of words in a window of a predetermined size, and taking the number of co-occurrences of words around the word as the vector of the current word.
例如,当所述简历文本中有如下语料:For example, when the resume text contains the following corpus:
我擅长研究。(该语料中包括分词:“我”、“擅长”、“研究”及“。”,下面两个语料采取类似的分词方式,将不再一一列举)I am good at research. (This sentence is segmented into the words "我" (I), "擅长" (am good at), "研究" (research), and "。"; the following two sentences are segmented in the same way and are not listed one by one.)
我擅长编程。I am good at programming.
我享受阅读。I enjoy reading.
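The window-based counting can be sketched on these three sentences as follows. The window size of 1 (immediate neighbors only), the symmetric counting, and the function name are assumptions made for illustration, so the resulting matrix is not guaranteed to match the one in the figure.

```python
def cooccurrence_matrix(sentences, window=1):
    """Count, for each word, how often every other word appears within the window."""
    vocab = sorted({word for sent in sentences for word in sent})
    index = {word: i for i, word in enumerate(vocab)}
    matrix = [[0] * len(vocab) for _ in vocab]
    for sent in sentences:
        for i, word in enumerate(sent):
            lo, hi = max(0, i - window), min(len(sent), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    matrix[index[word]][index[sent[j]]] += 1
    return vocab, matrix


corpus = [
    ["我", "擅长", "研究", "。"],
    ["我", "擅长", "编程", "。"],
    ["我", "享受", "阅读", "。"],
]
vocab, matrix = cooccurrence_matrix(corpus)
idx = {word: i for i, word in enumerate(vocab)}
```

Each row of `matrix` is the count vector of the corresponding word in `vocab`; for instance, the entry for "我" and "擅长" is 2 because the two words are adjacent in two of the sentences.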
根据上述简历文本中的语料,构建的共现矩阵X如图4所示。According to the corpus in the resume text above, the constructed co-occurrence matrix X is shown in Figure 4.
在本发明的至少一个实施例中,在得到所述简历文本的关键词后,所述方法还包括:In at least one embodiment of the present invention, after the keywords of the resume text are obtained, the method further includes:
当有两个关键词在同一文档中相邻的次数大于预设值时,合并单元115将所述两个关键词合并为新的关键词。When the number of times two keywords are adjacent to each other in the same document is greater than a preset value, the merging unit 115 merges the two keywords into a new keyword.
其中,所述预设值可以是2次等。Among them, the preset value can be 2 times, etc.
通过上述实施方式,能够将相似的关键词进一步合并,避免出现冗余关键词。Through the above implementation, similar keywords can be further merged to avoid redundant keywords.
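The merging rule can be sketched as follows. It assumes that the document is available as an ordered token list, that "adjacent" means immediately consecutive tokens, and that the merged keyword is the plain concatenation of the two; the function name and the sample tokens are hypothetical.

```python
from collections import Counter


def merge_adjacent_keywords(tokens, keywords, threshold=2):
    """Return new keywords formed from keyword pairs adjacent more than `threshold` times."""
    pair_counts = Counter(
        (a, b)
        for a, b in zip(tokens, tokens[1:])
        if a in keywords and b in keywords
    )
    return {a + b for (a, b), count in pair_counts.items() if count > threshold}


tokens = ["数据", "分析", "和", "数据", "分析", "及", "数据", "分析", "经验"]
merged = merge_adjacent_keywords(tokens, {"数据", "分析", "经验"}, threshold=2)
```

Here "数据" and "分析" are adjacent three times, which exceeds the preset value of 2, so they are merged, while "分析" and "经验" are adjacent only once and stay separate.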
处理单元113获取所述关键词中的字序列,并利用词表示模型对所述字序列进行词表示处理,得到所述字序列的词表示。The processing unit 113 obtains a character sequence in the keyword, and performs a word representation process on the character sequence using a word representation model to obtain a word representation of the character sequence.
在本发明的至少一个实施例中,所述处理单元113利用词表示模型对所述字序列进行处理,得到所述字序列的词表示包括:In at least one embodiment of the present invention, the processing unit 113 processing the character sequence using a word representation model to obtain a word representation of the character sequence includes:
所述处理单元113将所述关键词中的字序列输入所述词表示模型,并通过正向读取所述字序列生成包含所述字序列以及所述字序列的上文信息的第一向量,及通过反向读取所述字序列生成包含所述字序列以及所述字序列的下文信息的第二向量,所述处理单元113连接所述第一向量及所述第二向量,得到包含所述字序列及所述字序列的上下文信息的词表示。The processing unit 113 inputs the character sequence in the keywords into the word representation model. By reading the character sequence forward, it generates a first vector containing the character sequence and its preceding context; by reading the character sequence backward, it generates a second vector containing the character sequence and its following context. The processing unit 113 then concatenates the first vector and the second vector to obtain a word representation containing the character sequence and its full context.
例如:对于给定一个包含n个关键字的非结构化文本简历的字序列Char=(char_1,char_2,…,char_n),其中char_i是一个维度为d维的字向量,将所述非结构化文本字序列输入到词表示模型中,从而利用该词表示模型对字序列进行建模,通过正向读取字序列,以生成一个包含字序列以及字序列上文信息的向量,表示为CharF_i,同理,通过反向读取字序列,以生成一个包含字序列以及字序列下文信息的向量,表示为CharB_i,然后将CharF_i和CharB_i连接,形成一个包含字序列以及上下文信息的词表示:For example, given the character sequence Char = (char_1, char_2, …, char_n) of an unstructured text resume containing n keywords, where each char_i is a d-dimensional character vector, the character sequence is input into the word representation model, which models it as follows. Reading the sequence forward produces a vector, denoted CharF_i, that contains the character sequence and its preceding context. Likewise, reading the sequence backward produces a vector, denoted CharB_i, that contains the character sequence and its following context. CharF_i and CharB_i are then concatenated to form a word representation that contains the character sequence and its full context:
Wd = [CharF_i : CharB_i]
据此,所述处理单元113得到所述字序列的词表示。Based on this, the processing unit 113 obtains the word representation of the character sequence.
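The forward read, backward read, and concatenation can be sketched as below. To keep the example dependency-free, a plain tanh recurrent cell stands in for the LSTM cell, and the weights are random rather than trained; both are assumptions made for illustration, and all names are hypothetical.

```python
import math
import random


def run_rnn(vectors, w_in, w_rec):
    """Simple tanh recurrent pass; returns the hidden state after each input."""
    hidden = [0.0] * len(w_rec)
    states = []
    for x in vectors:
        hidden = [
            math.tanh(
                sum(w * v for w, v in zip(w_in[h], x))
                + sum(w * s for w, s in zip(w_rec[h], hidden))
            )
            for h in range(len(w_rec))
        ]
        states.append(hidden)
    return states


def bidirectional_repr(vectors, fwd, bwd):
    """Concatenate the forward state and the backward state at every position."""
    forward = run_rnn(vectors, *fwd)
    backward = run_rnn(vectors[::-1], *bwd)[::-1]
    return [f + b for f, b in zip(forward, backward)]


def random_matrix(rng, rows, cols):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


D, H = 2, 3  # character vector size d and hidden size (both illustrative)
rng = random.Random(0)
fwd = (random_matrix(rng, H, D), random_matrix(rng, H, H))
bwd = (random_matrix(rng, H, D), random_matrix(rng, H, H))

chars = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
wd = bidirectional_repr(chars, fwd, bwd)
wd_changed = bidirectional_repr(chars[:2] + [[0.5, -0.5]], fwd, bwd)
```

Changing only the last character leaves the forward half of the first position untouched but alters its backward half, which is the sense in which the concatenated vector carries both preceding and following context.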
需要说明的是,在进行自然语言处理时,可以利用各种词表示模型将“词”这一符号信息表示成数学上的向量形式。词的向量表示可以作为各种机器学习模型的输入来使用。现有的词表示模型可以包括两大类:一类是syntagmatic models,一类是paradigmatic models。It should be noted that, in natural language processing, various word representation models can be used to represent the symbolic information of a "word" as a mathematical vector. The vector representation of a word can be used as input to various machine learning models. Existing word representation models fall into two broad categories: syntagmatic models and paradigmatic models.
进一步地,对于该词表示,所述电子设备还可以进一步使用正则表达匹配对其进行格式化处理,进而解析、分类,存入指定数据库中,以供后续使用。Furthermore, for the word representation, the electronic device may further use regular expression matching to format it, and then parse, classify, and store it in a designated database for subsequent use.
预测单元114将所述词表示输入到构建的简历标签解析模型中,得到预测的简历标签序列。The prediction unit 114 inputs the word representation into the constructed resume tag parsing model to obtain a predicted resume tag sequence.
在本发明的至少一个实施例中,所述简历标签解析模型是以大量的简历数据作为训练样本进行训练,并以验证集进行验证而得到。利用所述简历标签解析模型对非结构化的词表示进行解析,能够输出相对应的标签以形成所述简历标签序列。In at least one embodiment of the present invention, the resume tag parsing model is trained using a large amount of resume data as training samples and verified using a validation set. The resume tag parsing model is used to parse unstructured word representations and output corresponding tags to form the resume tag sequence.
例如:所述简历标签序列中的标签可以包括,但不限于:本科生、研究生、熟练掌握WORD等。For example, the labels in the resume label sequence may include, but are not limited to: undergraduate, graduate, proficient in WORD, etc.
在本发明的至少一个实施例中,训练所述简历标签解析模型包括:In at least one embodiment of the present invention, training the resume tag parsing model includes:
获取单元117获取简历数据,拆分单元118拆分所述简历数据,得到训练集和验证集,进一步地,训练单元116利用所述训练集训练CRF模型,并采用条件对数似然函数及最大分值公式预测目标标签序列,验证单元119以所述验证集验证所述目标标签序列,当所述目标标签序列通过验证时,所述训练单元116停止训练并得到所述简历标签解析模型。The acquisition unit 117 acquires the resume data, and the splitting unit 118 splits the resume data into a training set and a validation set. The training unit 116 then trains the CRF model on the training set and uses the conditional log-likelihood function and the maximum score formula to predict the target label sequence, and the verification unit 119 verifies the target label sequence against the validation set. When the target label sequence passes verification, the training unit 116 stops training and obtains the resume label parsing model.
其中,所述目标标签序列是指预测的最适合的标签序列。Here, the target label sequence refers to the predicted most suitable label sequence.
具体地,所述训练单元116采用CRF(conditional random field,条件随机场)进行建模。假定得到非结构化文本的关键字信息的输出目标序列(即对应的标签序列)为:y=(y_1,…,y_n)。为了有效获得非结构化文本简历信息的目标序列,模型的分值公式定义如下:Specifically, the training unit 116 uses a CRF (conditional random field) for modeling. Assume that the output target sequence of the keyword information of the unstructured text (i.e., the corresponding label sequence) is y = (y_1, …, y_n). To obtain the target sequence of the unstructured text resume information effectively, the score formula of the model is defined as follows:
其中,P表示双向LSTM算法(Long short-term memory,长短期记忆算法)的输出分值矩阵,其大小为n×k,k表示目标标签的数量,所述目标标签即对该简历的概述评价,n表示词序列的长度,A表示转移分值矩阵。当j=0时,y_0表示的是一个序列开始的标志,当j=n时,y_{n+1}表示的是一个序列结束的标志,A方阵的大小为k+2。Here, P is the output score matrix of the bidirectional LSTM (long short-term memory) network, of size n×k, where k is the number of target labels (each target label being a summary evaluation of the resume) and n is the length of the word sequence; A is the transition score matrix. When j=0, y_0 is the start-of-sequence marker, and when j=n, y_{n+1} is the end-of-sequence marker, so A is a square matrix of size k+2.
在所有简历信息的标签序列上,CRF生成目标序列y的概率为:On all label sequences of resume information, the probability that CRF generates the target sequence y is:
其中,Y_Wd代表简历信息序列Wd对应的所有可能标签序列。在训练过程中,为了获得简历信息正确的标签序列,所述训练单元116将最大化正确标签序列的条件对数似然函数,并使用最大分值公式预测最合适的标签序列:Here, Y_Wd denotes the set of all possible label sequences for the resume information sequence Wd. During training, to obtain the correct label sequence of the resume information, the training unit 116 maximizes the conditional log-likelihood of the correct label sequence and uses the maximum score formula to predict the most suitable label sequence:
通过上述实施方式,结合条件对数似然函数及最大分值公式,能够提升模型的准确率。Through the above implementation, combined with the conditional log-likelihood function and the maximum score formula, the accuracy of the model can be improved.
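The scoring, likelihood, and prediction steps can be sketched as follows, assuming the standard BiLSTM-CRF form in which a sequence score is the sum of its transition scores from A and its emission scores from P. Label indices run from 0 to k-1, with index k reserved for the start marker and k+1 for the end marker. The function names and the small matrices are hypothetical, and the likelihood is computed by brute-force enumeration, which only suits tiny examples.

```python
import math
from itertools import product


def sequence_score(P, A, tags):
    """Sum of transition scores (including start and end) and emission scores."""
    k = len(P[0])
    path = [k] + list(tags) + [k + 1]  # k = start marker, k + 1 = end marker
    transition = sum(A[a][b] for a, b in zip(path, path[1:]))
    emission = sum(P[j][t] for j, t in enumerate(tags))
    return transition + emission


def log_likelihood(P, A, tags):
    """Conditional log-likelihood of `tags`, normalizing over all label sequences."""
    k = len(P[0])
    scores = [sequence_score(P, A, y) for y in product(range(k), repeat=len(P))]
    return sequence_score(P, A, tags) - math.log(sum(math.exp(s) for s in scores))


def viterbi(P, A):
    """Return the label sequence with the maximum score."""
    k = len(P[0])
    start, end = k, k + 1
    best = [A[start][t] + P[0][t] for t in range(k)]
    back = []
    for row in P[1:]:
        pointers, scores = [], []
        for t in range(k):
            prev = max(range(k), key=lambda p: best[p] + A[p][t])
            pointers.append(prev)
            scores.append(best[prev] + A[prev][t] + row[t])
        back.append(pointers)
        best = scores
    tags = [max(range(k), key=lambda t: best[t] + A[t][end])]
    for pointers in reversed(back):
        tags.append(pointers[tags[-1]])
    return tags[::-1]


P = [[1.0, 0.2], [0.1, 0.9], [0.8, 0.3]]  # n = 3 positions, k = 2 labels
A = [
    [0.5, -0.2, 0.0, 0.1],
    [-0.3, 0.4, 0.0, 0.2],
    [0.1, 0.0, 0.0, 0.0],  # transitions out of the start marker
    [0.0, 0.0, 0.0, 0.0],  # the end marker has no outgoing transitions
]
predicted = viterbi(P, A)
```

Training would adjust P and A to raise `log_likelihood` for the correct label sequences; `viterbi` then finds the highest-scoring sequence without enumerating every candidate.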
所述确定单元112计算所述简历标签序列中的每个标签与每个岗位的标签的相似度,并根据计算的相似度从所述待解析简历中确定与每个岗位匹配的简历。The determination unit 112 calculates the similarity between each tag in the resume tag sequence and the tag of each position, and determines the resume matching each position from the resumes to be parsed according to the calculated similarity.
在本发明的至少一个实施例中,所述确定单元112计算所述简历标签序列中的每个标签与每个岗位的标签的相似度,并根据计算的相似度从所述待解析简历中确定与每个岗位匹配的简历包括:In at least one embodiment of the present invention, the determining unit 112 calculates the similarity between each tag in the resume tag sequence and the tag of each position, and determines the resume matching each position from the resume to be parsed according to the calculated similarity, including:
所述确定单元112计算每个标签与每个岗位的标签之间的余弦距离,当存在目标标签与目标岗位之间的余弦距离小于或者等于预设距离时,所述确定单元112从所述待解析简历中调取所述目标标签对应的目标简历,并确定所述目标简历与所述目标岗位相匹配。The determination unit 112 calculates the cosine distance between each label and the label of each position. When the cosine distance between the target label and the target position is less than or equal to the preset distance, the determination unit 112 retrieves the target resume corresponding to the target label from the resume to be parsed, and determines that the target resume matches the target position.
具体地,余弦相似度是用向量空间中两个向量夹角的余弦值作为衡量两个个体间差异的大小的度量,余弦值越接近1,就表明夹角越接近0度,也就是两个向量越相似;所述余弦距离可取为1减去该余弦值,因此余弦距离越小,两个向量越相似。Specifically, cosine similarity uses the cosine of the angle between two vectors in the vector space to measure how much two individuals differ. The closer the cosine is to 1, the closer the angle is to 0 degrees, and the more similar the two vectors are. The cosine distance can be taken as 1 minus this cosine, so a smaller cosine distance indicates more similar vectors.
例如:对于所得到的简历标签序列X和入职岗位所需要的简历标签序列Y,利用下列式子进行计算,式中X_i表示简历标签序列X中第i个向量,Y_i表示入职岗位所需要的简历标签序列Y中第i个向量:For example, for the obtained resume label sequence X and the resume label sequence Y required for the position, the following formula is used, where X_i denotes the i-th vector in X and Y_i denotes the i-th vector in Y:
$$\cos\theta = \frac{\sum_{i=1}^{n} X_i Y_i}{\sqrt{\sum_{i=1}^{n} X_i^{2}}\,\sqrt{\sum_{i=1}^{n} Y_i^{2}}}$$
产生的相似性范围从-1到1,其中,-1意味着两个向量指向的方向正好截然相反,1表示它们的指向是完全相同的,0通常表示它们之间是独立的,而在这之间的值则表示中度的相似性或相异性,根据这一算法,能够对每个岗位选取标签相似度较高的简历,以进行快速匹配入职。The resulting similarity ranges from -1 to 1: -1 means the two vectors point in exactly opposite directions, 1 means they point in the same direction, 0 usually means they are independent, and values in between indicate intermediate similarity or dissimilarity. With this algorithm, the resumes whose labels are most similar to those of each position can be selected, so that candidates are matched to positions quickly.
在本发明的至少一个实施例中,所述确定单元112还可以根据得到的简历标签序列及配置的相应的权重(如:研究生标签在简历评分中所占权重为0.2,而本科生标签在简历评分中所占权重为0.1),将所述简历标签序列通过得分进行表示,进一步根据得分快速筛选出所需的员工。In at least one embodiment of the present invention, the determination unit 112 may also convert the resume tag sequence into a score, using the obtained resume tag sequence and the weights configured for its tags (for example, the graduate tag carries a weight of 0.2 in the resume score and the undergraduate tag a weight of 0.1), and then quickly select the desired candidates according to the score.
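The weighted scoring can be sketched as below. The two weights 0.2 and 0.1 come from the example in the text; summing the weights of a resume's tags, the extra weight for "proficient in WORD", the zero default for unknown tags, and all names are assumptions made for illustration.

```python
# Weights for "graduate" and "undergraduate" follow the example in the text;
# the third entry is a hypothetical addition.
TAG_WEIGHTS = {"graduate": 0.2, "undergraduate": 0.1, "proficient in WORD": 0.05}


def resume_score(tags, weights, default=0.0):
    """Assumed scoring rule: add up the configured weight of every tag."""
    return sum(weights.get(tag, default) for tag in tags)


def rank_resumes(resume_tags, weights):
    """Return resume identifiers ordered from the highest score to the lowest."""
    return sorted(
        resume_tags,
        key=lambda rid: resume_score(resume_tags[rid], weights),
        reverse=True,
    )


resume_tags = {
    "r1": ["undergraduate"],
    "r2": ["graduate", "proficient in WORD"],
}
ranking = rank_resumes(resume_tags, TAG_WEIGHTS)
```

Reducing each tag sequence to a single number makes screening a simple sort or threshold test, at the cost of hiding which tags produced the score.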
由以上技术方案可以看出,本发明能够从数据库中调取简历,并对调取的简历进行预处理,得到待解析简历,根据预先构建的分词词典构建词语切分有向无环图,并根据构建的词语切分有向无环图切分所述待解析简历,得到简历文本,进而能够快速得到待解析简历的分词结果,进一步根据所述简历文本构建共现矩阵,并基于所述共现矩阵确定所述简历文本的关键词,获取所述关键词中的字序列,并利用词表示模型对所述字序列进行处理,得到所述字序列的词表示,提升了解析效果,将所述词表示输入到构建的简历标签解析模型中,得到预测的简历标签序列,进一步计算所述简历标签序列中的每个标签与每个岗位的标签的相似度,并根据计算的相似度从所述待解析简历中确定与每个岗位匹配的简历,实现对岗位与简历快速且准确地智能匹配。It can be seen from the above technical scheme that the present invention can retrieve a resume from a database, and pre-process the retrieved resume to obtain a resume to be parsed, construct a word segmentation directed acyclic graph according to a pre-constructed word segmentation dictionary, and segment the resume to be parsed according to the constructed word segmentation directed acyclic graph to obtain a resume text, and then quickly obtain the word segmentation result of the resume to be parsed, further construct a co-occurrence matrix according to the resume text, and determine the keywords of the resume text based on the co-occurrence matrix, obtain the word sequence in the keyword, and use the word representation model to process the word sequence to obtain the word representation of the word sequence, thereby improving the parsing effect, input the word representation into the constructed resume label parsing model, obtain a predicted resume label sequence, further calculate the similarity between each label in the resume label sequence and the label of each position, and determine the resume matching each position from the resume to be parsed according to the calculated similarity, thereby realizing fast and accurate intelligent matching of positions and resumes.
如图3所示,是本发明实现简历数据信息解析及匹配方法的较佳实施例的电子设备的结构示意图。As shown in FIG. 3 , it is a schematic diagram of the structure of an electronic device of a preferred embodiment of the resume data information parsing and matching method of the present invention.
所述电子设备1可以包括存储器12、处理器13和总线,还可以包括存储在所述存储器12中并可在所述处理器13上运行的计算机程序,例如简历数据信息解析及匹配程序。The electronic device 1 may include a memory 12, a processor 13 and a bus, and may also include a computer program stored in the memory 12 and executable on the processor 13, such as a resume data information parsing and matching program.
本领域技术人员可以理解,所述示意图仅仅是电子设备1的示例,并不构成对电子设备1的限定,所述电子设备1既可以是总线型结构,也可以是星形结构,所述电子设备1还可以包括比图示更多或更少的其他硬件或者软件,或者不同的部件布置,例如所述电子设备1还可以包括输入输出设备、网络接入设备等。Those skilled in the art will appreciate that the schematic diagram is merely an example of the electronic device 1 and does not constitute a limitation on the electronic device 1. The electronic device 1 may have either a bus structure or a star structure. The electronic device 1 may also include more or less other hardware or software than shown in the diagram, or a different arrangement of components. For example, the electronic device 1 may also include input and output devices, network access devices, etc.
需要说明的是,所述电子设备1仅为举例,其他现有的或今后可能出现的电子产品如可适应于本发明,也应包含在本发明的保护范围以内,并以引用方式包含于此。It should be noted that the electronic device 1 is only an example, and other existing or future electronic products that are suitable for the present invention should also be included in the protection scope of the present invention and included here by reference.
其中,存储器12至少包括一种类型的可读存储介质,所述可读存储介质包括闪存、移动硬盘、多媒体卡、卡型存储器(例如:SD或DX存储器等)、磁性存储器、磁盘、光盘等。存储器12在一些实施例中可以是电子设备1的内部存储单元,例如该电子设备1的移动硬盘。存储器12在另一些实施例中也可以是电子设备1的外部存储设备,例如电子设备1上配备的插接式移动硬盘、智能存储卡(Smart Media Card,SMC)、安全数字(Secure Digital,SD)卡、闪存卡(Flash Card)等。进一步地,存储器12还可以既包括电子设备1的内部存储单元也包括外部存储设备。存储器12不仅可以用于存储安装于电子设备1的应用软件及各类数据,例如简历数据信息解析及匹配程序的代码等,还可以用于暂时地存储已经输出或者将要输出的数据。Among them, the memory 12 includes at least one type of readable storage medium, and the readable storage medium includes flash memory, mobile hard disk, multimedia card, card-type memory (for example: SD or DX memory, etc.), magnetic memory, disk, optical disk, etc. The memory 12 can be an internal storage unit of the electronic device 1 in some embodiments, such as a mobile hard disk of the electronic device 1. The memory 12 can also be an external storage device of the electronic device 1 in other embodiments, such as a plug-in mobile hard disk, a smart memory card (Smart Media Card, SMC), a secure digital (Secure Digital, SD) card, a flash card (Flash Card), etc. equipped on the electronic device 1. Further, the memory 12 can also include both an internal storage unit of the electronic device 1 and an external storage device. The memory 12 can not only be used to store application software and various types of data installed in the electronic device 1, such as the code of the resume data information parsing and matching program, but also can be used to temporarily store data that has been output or is to be output.
处理器13在一些实施例中可以由集成电路组成,例如可以由单个封装的集成电路所组成,也可以是由多个相同功能或不同功能封装的集成电路所组成,包括一个或者多个中央处理器(Central Processing Unit,CPU)、微处理器、数字处理芯片、图形处理器及各种控制芯片的组合等。处理器13是所述电子设备1的控制核心(Control Unit),利用各种接口和线路连接整个电子设备1的各个部件,通过运行或执行存储在所述存储器12内的程序或者模块(例如执行简历数据信息解析及匹配程序等),以及调用存储在所述存储器12内的数据,以执行电子设备1的各种功能和处理数据。In some embodiments, the processor 13 may be composed of integrated circuits, for example a single packaged integrated circuit, or multiple packaged integrated circuits with the same or different functions, including one or more central processing units (CPUs), microprocessors, digital processing chips, graphics processors, and combinations of various control chips. The processor 13 is the control core (Control Unit) of the electronic device 1. It connects the components of the entire electronic device 1 through various interfaces and lines, and performs the functions of the electronic device 1 and processes data by running or executing the programs or modules stored in the memory 12 (for example, the resume data information parsing and matching program) and by calling the data stored in the memory 12.
所述处理器13执行所述电子设备1的操作系统以及安装的各类应用程序。所述处理器13执行所述应用程序以实现上述各个简历数据信息解析及匹配方法实施例中的步骤,例如图1所示的步骤S10、S11、S12、S13、S14、S15。The processor 13 executes the operating system of the electronic device 1 and the various installed applications. The processor 13 executes the applications to implement the steps of the resume data information parsing and matching method embodiments above, such as steps S10, S11, S12, S13, S14, and S15 shown in FIG. 1.
Alternatively, when executing the computer program, the processor 13 implements the functions of the modules/units in the above device embodiments, for example:
Retrieving resumes from the database and preprocessing the retrieved resumes to obtain resumes to be parsed;
Constructing a word segmentation directed acyclic graph according to a pre-constructed word segmentation dictionary, and segmenting the resumes to be parsed according to the constructed word segmentation directed acyclic graph to obtain segmented resume text;
Constructing a co-occurrence matrix from the segmented resume text, and determining keywords of the resume text based on the co-occurrence matrix;
Acquiring a character sequence from the keywords, and processing the character sequence with a word representation model to obtain a word representation of the character sequence;
Inputting the word representation into the constructed resume tag parsing model to obtain a predicted resume tag sequence;
Calculating the similarity between each tag in the resume tag sequence and the tags of each position, and determining, from the resumes to be parsed, the resumes that match each position according to the calculated similarity.
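The segmentation step listed above can be illustrated with a minimal Python sketch. It assumes the segmentation dictionary maps each word to a frequency and that the path through the directed acyclic graph is chosen by maximum log-probability; the function names `build_dag` and `segment`, the sample dictionary, and the path-selection rule are illustrative assumptions rather than details stated in this description.

```python
import math


def build_dag(text, dictionary):
    """Map each start index to the inclusive end indices of dictionary words."""
    dag = {}
    n = len(text)
    for i in range(n):
        ends = [j for j in range(i, n) if text[i:j + 1] in dictionary]
        if i not in ends:
            # A single character is always kept as a fallback edge.
            ends.insert(0, i)
        dag[i] = ends
    return dag


def segment(text, dictionary):
    """Split text along the highest-probability path through the DAG.

    Assumption: word probability is frequency / total frequency, and
    out-of-dictionary characters get a frequency of 1.
    """
    dag = build_dag(text, dictionary)
    n = len(text)
    total = sum(dictionary.values())
    # route[i] = (best log-probability of text[i:], end index of first word)
    route = {n: (0.0, 0)}
    for i in range(n - 1, -1, -1):
        route[i] = max(
            (math.log(dictionary.get(text[i:j + 1], 1) / total) + route[j + 1][0], j)
            for j in dag[i]
        )
    words, i = [], 0
    while i < n:
        j = route[i][1]
        words.append(text[i:j + 1])
        i = j + 1
    return words


# Hypothetical dictionary, for illustration only.
demo_dictionary = {"data": 10, "base": 8, "database": 12, "admin": 6}
demo_words = segment("databaseadmin", demo_dictionary)
```

With this toy dictionary, the longer entry "database" is preferred over "data" followed by "base" because the single-word path has the higher combined probability.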
By way of example, the computer program may be divided into one or more modules/units, which are stored in the memory 12 and executed by the processor 13 to carry out the present invention. The one or more modules/units may be a series of computer program instruction segments capable of performing specific functions, and the instruction segments describe the execution of the computer program in the electronic device 1. For example, the computer program may be divided into a preprocessing unit 110, a construction unit 111, a determination unit 112, a processing unit 113, a prediction unit 114, a merging unit 115, a training unit 116, an acquisition unit 117, a splitting unit 118, and a verification unit 119.
The above integrated unit, when implemented in the form of a software functional module, can be stored in a computer-readable storage medium. The software functional module is stored in a storage medium and includes a number of instructions that cause a computer device (such as a personal computer, another computing device, or a network device) or a processor to execute part of the method described in each embodiment of the present invention.
If the modules/units integrated in the electronic device 1 are implemented in the form of software functional units and sold or used as independent products, they can be stored in a computer-readable storage medium. Based on this understanding, all or part of the processes in the methods of the above embodiments can also be completed by a computer program instructing the relevant hardware. The computer program can be stored in a computer-readable storage medium and, when executed by a processor, implements the steps of each of the above method embodiments.
The computer program includes computer program code, which may be in source code form, object code form, an executable file, or some intermediate form. The computer-readable medium may include any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disc, computer memory, or read-only memory (ROM).
The bus may be a peripheral component interconnect (PCI) bus, an extended industry standard architecture (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, and so on. For ease of illustration, only one arrow is shown in FIG. 3, but this does not mean that there is only one bus or one type of bus. The bus is configured to provide connection and communication between the memory 12, the at least one processor 13, and other components.
Although not shown, the electronic device 1 may also include a power source (such as a battery) that supplies power to the components. Preferably, the power source is logically connected to the at least one processor 13 through a power management device, so that charging management, discharging management, and power consumption management are implemented through the power management device. The power source may also include any of one or more DC or AC power supplies, a recharging device, a power failure detection circuit, a power converter or inverter, a power status indicator, and similar components. The electronic device 1 may also include various sensors, a Bluetooth module, a Wi-Fi module, and the like, which are not described further here.
Further, the electronic device 1 may also include a network interface. Optionally, the network interface may include a wired interface and/or a wireless interface (such as a Wi-Fi interface or a Bluetooth interface), and is generally used to establish a communication connection between the electronic device 1 and other electronic devices.
Optionally, the electronic device 1 may further include a user interface, which may be a display or an input unit (such as a keyboard). Optionally, the user interface may also be a standard wired or wireless interface. Optionally, in some embodiments, the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch display, or the like. The display may also be referred to as a display screen or display unit, and is used to display the information processed in the electronic device 1 and to display a visual user interface.
It should be understood that the embodiments are for illustration only, and the scope of the patent application is not limited by this structure.
FIG. 3 shows only an electronic device 1 having components 12-13. Those skilled in the art will understand that the structure shown in FIG. 3 does not limit the electronic device 1, which may include fewer or more components than shown, a combination of certain components, or a different arrangement of components.
With reference to FIG. 1, the memory 12 in the electronic device 1 stores a plurality of instructions to implement a resume data information parsing and matching method, and the processor 13 can execute the plurality of instructions to implement:
Retrieving resumes from the database and preprocessing the retrieved resumes to obtain resumes to be parsed;
Constructing a word segmentation directed acyclic graph according to a pre-constructed word segmentation dictionary, and segmenting the resumes to be parsed according to the constructed word segmentation directed acyclic graph to obtain segmented resume text;
Constructing a co-occurrence matrix from the segmented resume text, and determining keywords of the resume text based on the co-occurrence matrix;
Acquiring a character sequence from the keywords, and processing the character sequence with a word representation model to obtain a word representation of the character sequence;
Inputting the word representation into the constructed resume tag parsing model to obtain a predicted resume tag sequence;
Calculating the similarity between each tag in the resume tag sequence and the tags of each position, and determining, from the resumes to be parsed, the resumes that match each position according to the calculated similarity.
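As an illustration of the final matching step above, the following Python sketch scores each resume against each position. This passage does not name the similarity measure, so cosine similarity over tag vectors is assumed here; the `vectors` lookup, the threshold of 0.8, the averaging rule, and all function and tag names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def match_score(resume_tags, position_tags, vectors):
    """Average, over position tags, of the best similarity to any resume tag.

    Assumption: averaging the per-tag maxima is one plausible way to
    aggregate tag-to-tag similarities into a single score.
    """
    if not resume_tags or not position_tags:
        return 0.0
    best = [
        max(cosine(vectors[p], vectors[r]) for r in resume_tags)
        for p in position_tags
    ]
    return sum(best) / len(best)


def match_resumes(resumes, positions, vectors, threshold=0.8):
    """Return, for each position, the ids of resumes scoring at least threshold."""
    return {
        position: [
            resume_id
            for resume_id, tags in resumes.items()
            if match_score(tags, position_tags, vectors) >= threshold
        ]
        for position, position_tags in positions.items()
    }


# Hypothetical tag vectors and data, for illustration only.
demo_vectors = {
    "python": [1.0, 0.0],
    "pandas": [0.9, 0.1],
    "java": [0.0, 1.0],
    "spring": [0.1, 0.9],
}
demo_resumes = {"r1": ["python"], "r2": ["java"]}
demo_positions = {"data_engineer": ["pandas"], "backend": ["spring"]}
demo_matches = match_resumes(demo_resumes, demo_positions, demo_vectors)
```

In this toy data, resume `r1` is matched to `data_engineer` and `r2` to `backend`, because each resume tag lies close to the corresponding position tag in the assumed vector space.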
For the specific manner in which the processor 13 implements the above instructions, reference may be made to the description of the relevant steps in the embodiment corresponding to FIG. 1, which is not repeated here.
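For readers without the FIG. 1 embodiment at hand, the keyword step among the instructions above can be sketched in Python as follows. The sketch assumes a sliding co-occurrence window of two tokens and ranks words by their total co-occurrence count; the window size, the ranking rule, and the function names are illustrative assumptions and may differ from the procedure of the FIG. 1 embodiment.

```python
from collections import defaultdict


def cooccurrence(tokens, window=2):
    """Count how often two distinct tokens appear within `window` positions."""
    matrix = defaultdict(lambda: defaultdict(int))
    for i, word in enumerate(tokens):
        for j in range(i + 1, min(i + window + 1, len(tokens))):
            other = tokens[j]
            if other != word:
                # The matrix is symmetric, so both cells are updated.
                matrix[word][other] += 1
                matrix[other][word] += 1
    return matrix


def top_keywords(tokens, k=3, window=2):
    """Rank words by the sum of their co-occurrence counts (an assumed rule)."""
    matrix = cooccurrence(tokens, window)
    scores = {word: sum(row.values()) for word, row in matrix.items()}
    # Ties are broken alphabetically so the result is deterministic.
    return sorted(scores, key=lambda word: (-scores[word], word))[:k]


# Hypothetical segmented resume text, for illustration only.
demo_tokens = ["python", "data", "analysis", "python", "model", "data", "python"]
demo_matrix = cooccurrence(demo_tokens)
demo_keywords = top_keywords(demo_tokens, k=2)
```

A graph-ranking method could be substituted for the simple sum without changing how the matrix itself is built.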
In the several embodiments provided by the present invention, it should be understood that the disclosed systems, devices, and methods can be implemented in other ways. For example, the device embodiments described above are merely illustrative: the division into modules is only a division by logical function, and other divisions are possible in actual implementation.
The modules described as separate components may or may not be physically separate, and the components shown as modules may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional modules in the embodiments of the present invention may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware, or in the form of hardware plus software functional modules.
It is obvious to those skilled in the art that the present invention is not limited to the details of the above exemplary embodiments, and that the present invention can be implemented in other specific forms without departing from its spirit or essential characteristics.
Therefore, the embodiments should be regarded in all respects as exemplary and non-restrictive. The scope of the present invention is defined by the appended claims rather than by the above description, and all changes that fall within the meaning and scope of equivalents of the claims are therefore intended to be included in the present invention. No reference sign in the claims should be regarded as limiting the claim concerned.
In addition, the word "comprising" does not exclude other units or steps, and the singular does not exclude the plural. Multiple units or devices recited in a system claim may also be implemented by a single unit or device through software or hardware. Terms such as "second" are used to denote names and do not indicate any particular order.
Finally, it should be noted that the above embodiments are intended only to illustrate the technical solution of the present invention and not to limit it. Although the present invention has been described in detail with reference to the preferred embodiments, those of ordinary skill in the art should understand that the technical solution of the present invention can be modified or replaced by equivalents without departing from the spirit and scope of the technical solution of the present invention.
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202010151399.9A (CN111428488B) | 2020-03-06 | 2020-03-06 | Resume data information parsing and matching method, device, electronic device and medium |
| PCT/CN2020/131916 (WO2021174919A1) | 2020-03-06 | 2020-11-26 | Method and apparatus for analysis and matching of resume data information, electronic device, and medium |
| Publication Number | Publication Date |
|---|---|
| CN111428488A | 2020-07-17 |
| CN111428488B | 2024-10-22 |
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202010151399.9A (CN111428488B, Active) | 2020-03-06 | 2020-03-06 | Resume data information parsing and matching method, device, electronic device and medium |
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN108874928A (en)* | 2018-05-31 | 2018-11-23 | 平安科技(深圳)有限公司 | Resume data information analyzing and processing method, device, equipment and storage medium |
| CN109710930A (en)* | 2018-12-20 | 2019-05-03 | 重庆邮电大学 | A Chinese resume parsing method based on deep neural network |
| Publication number | Publication date |
|---|---|
| CN111428488A (en) | 2020-07-17 |
| WO2021174919A1 (en) | 2021-09-10 |
| Code | Title | Description |
|---|---|---|
| PB01 | Publication | |
| REG | Reference to a national code | Ref country code: HK; Ref legal event code: DE; Ref document number: 40030827; Country of ref document: HK |
| SE01 | Entry into force of request for substantive examination | |
| GR01 | Patent grant | |