Disclosure of Invention
To solve the above problems, the invention provides a data query method, system, terminal and medium based on a large language model, which use natural language processing technology, a large language model and semantic layer technology to convert natural language query statements into Elasticsearch query DSL, thereby simplifying the data query process, improving the efficiency and accuracy of users' data queries, reducing cost and meeting query requirements.
In a first aspect, the present invention provides a data query method based on a large language model, including the following steps:
 receiving a natural language query statement input by a user;
 analyzing the natural language query statement by using natural language processing technology to obtain query parameters, including a query entity, a sentence dependency relationship and a user intention;
 generating a logical SQL query statement according to the query parameters by using a pre-trained large language model;
 converting the logical SQL query statement into a physical SQL query statement by using semantic layer technology;
 converting the physical SQL query statement into a domain-specific language for an Elasticsearch query by using an SQL parser;
 and executing the Elasticsearch query, expressed in the domain-specific language, on an Elasticsearch cluster to obtain a query result.
In an alternative embodiment, the natural language query sentence is analyzed by using a natural language processing technology to obtain a query parameter, which specifically includes:
 performing text cleaning on the natural language query sentence to remove irrelevant characters from the query sentence;
 splitting the text-cleaned natural language query sentence into independent words or phrases by using an NLP framework;
 assigning a part-of-speech tag to each word obtained by splitting;
 based on the word segmentation results, identifying query entities in the query sentence by using a pre-trained named entity recognition model;
 analyzing the dependency relationships of the natural language query sentence by using a dependency syntax analyzer, generating a dependency syntax tree, and extracting the dependency relationships from the dependency syntax tree;
 and performing intention classification on the natural language query sentence by using a pre-trained BERT intention classification model to obtain the query intention of the user.
In an alternative embodiment, the method further comprises the steps of:
 periodically extracting names, aliases and values from processing results of natural language processing technology;
 dictionary matching is performed by adopting an n-gram dictionary detection mechanism to construct a knowledge base.
In an alternative embodiment, the method for generating the logical SQL query statement according to the query parameters by utilizing the pre-trained large language model specifically comprises the following steps:
 inputting the query parameters into a pre-trained large language model to generate an initial logical SQL query statement;
 parsing the domain nouns out of the initial logical SQL query statement;
 checking the validity of the parsed domain nouns against a knowledge base, and correcting the wrong fields with an error correction tool;
 and obtaining the corrected logical SQL query statement.
In an alternative embodiment, converting the logical SQL query statement into the physical SQL query statement using semantic layer techniques specifically comprises:
 The association relation and the operation formula between the technical terms and the business terms are managed by utilizing a semantic layer technology;
 based on the association relation and operation formula between technical terms and business terms, the technical terms in the logical SQL query statement are converted into the business terms to generate a physical SQL query statement.
In an alternative embodiment, converting the physical SQL query statement into the domain-specific language for the Elasticsearch query by using the SQL parser specifically comprises:
 mapping the SELECT clause to the aggs portion of Elasticsearch;
 mapping the WHERE clause to the query portion of Elasticsearch;
 mapping the GROUP BY clause to the terms portion of Elasticsearch;
 mapping the ORDER BY clause to the sort portion of Elasticsearch.
In a second aspect, the present invention provides a data query system based on a large language model, including:
 The query sentence receiving module is used for receiving a natural language query sentence input by a user;
 The query parameter acquisition module is used for analyzing the natural language query sentence by utilizing a natural language processing technology to acquire a query parameter, wherein the query parameter comprises a query entity, a sentence dependency relationship and a user intention;
 The logical SQL query statement generation module is used for generating a logical SQL query statement according to the query parameters by utilizing the pre-trained large language model;
 The physical SQL query statement generation module is used for converting the logical SQL query statement into a physical SQL query statement by utilizing a semantic layer technology;
 The statement conversion module is used for converting the physical SQL query statement into a domain-specific language for an Elasticsearch query by utilizing an SQL parser;
 And the query execution module is used for executing the Elasticsearch query, expressed in the domain-specific language, on an Elasticsearch cluster to obtain a query result.
In an alternative embodiment, the system further comprises:
 And the knowledge base construction module is used for periodically extracting names, aliases and values from the processing results of the natural language processing technology, and performing dictionary matching by adopting an n-gram dictionary detection mechanism to construct a knowledge base.
In a third aspect, a technical solution of the present invention provides a terminal, including:
 a memory for storing a data query program based on a large language model;
 a processor for implementing the large language model based data query method according to any one of the above steps when executing the large language model based data query program.
In a fourth aspect, the present invention provides a computer-readable storage medium storing a data query program based on a large language model, where the program, when executed by a processor, implements the steps of the data query method based on the large language model according to any one of the above.
Compared with the prior art, the data query method, system, terminal and medium based on a large language model provided by the invention have the following advantages: query parameters are extracted from the natural language query statement by natural language processing technology; the large language model and semantic layer technology are used to generate a physical SQL query statement from the query parameters; the physical SQL query statement is then converted into Elasticsearch query DSL; and the DSL is executed to obtain the query result. The user only needs to input a natural language query statement, which is automatically converted into DSL by means of natural language processing technology, the large language model and semantic layer technology, so no DSL code needs to be written. This improves the convenience, efficiency and accuracy of data query.
Detailed Description
In order to make the objects, features and advantages of the present invention more apparent and understandable, the technical solutions of the present invention are described clearly and completely below with reference to the drawings of this specific embodiment. The embodiments described below are only some, not all, of the embodiments of the present invention. All other embodiments that a person of ordinary skill in the art obtains from the embodiments in this patent without creative effort fall within the scope of protection of this patent.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. The terminology used herein in the description of the invention is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention.
The following explains key terms appearing in the present invention.
SQL: Structured Query Language.
Elasticsearch (ES for short): a distributed, highly scalable, real-time search and data analysis engine.
DSL: Domain-Specific Language.
NER: Named Entity Recognition.
Fig. 1 is a schematic flow chart of a data query method based on a large language model according to an embodiment of the present invention. The execution subject of fig. 1 may be a data query system based on a large language model. The data query method based on the large language model provided by the embodiment of the invention is executed by the computer equipment, and correspondingly, the data query system based on the large language model is operated in the computer equipment. The order of the steps in the flow chart may be changed and some may be omitted according to different needs.
As shown in fig. 1, the method includes the following steps.
S1, receiving a natural language query sentence input by a user.
In the step, a user inputs a natural language query sentence at a front end query interface of a platform, a background receives the natural language query sentence input by the user and executes subsequent processing, and finally, a query result is fed back to the front end for the user to check.
S2, analyzing the natural language query sentence by utilizing a natural language processing technology to obtain query parameters including a query entity, sentence dependency relationship and user intention.
S2.1, performing text cleaning on the natural language query sentence to remove irrelevant characters in the query sentence.
S2.2, splitting the natural language query sentence after text cleaning into independent words or phrases by using the NLP framework.
S2.3, assigning a part-of-speech tag to each word obtained by splitting.
S2.4, based on the word segmentation result, identifying the query entity in the query sentence by utilizing a pre-trained named entity recognition model.
S2.5, analyzing the dependency relationship of the natural language query sentence by using a dependency syntax analyzer, generating a dependency syntax tree, and extracting the dependency relationship from the dependency syntax tree.
S2.6, performing intention classification on the natural language query sentence by using a pre-trained intention classification BERT model to obtain the query intention of the user.
In this embodiment, natural language processing technology is used to analyze the natural language query sentence input by the user and extract its key elements. For example, if the user inputs "query the error log of the past week", this step identifies core information such as "past week" and "error log".
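Steps S2.1 and S2.2, together with a crude stand-in for entity extraction, can be sketched as follows. This is only an illustrative sketch under stated assumptions: the function names and the `TIME_PHRASES` table are hypothetical, whitespace splitting stands in for the NLP framework, and a phrase lookup stands in for the pre-trained named entity recognition and intention classification models, which are not reproduced here.

```python
import re
import unicodedata

# Hypothetical phrase table standing in for a trained NER model.
TIME_PHRASES = {"past week": 7, "past day": 1, "past month": 30}


def clean_text(query):
    """S2.1: normalize Unicode, lowercase, and drop punctuation and extra spaces."""
    normalized = unicodedata.normalize("NFKC", query).lower()
    no_punctuation = re.sub(r"[^\w\s]", " ", normalized)
    return re.sub(r"\s+", " ", no_punctuation).strip()


def tokenize(text):
    """S2.2: split on whitespace (a real NLP framework would segment properly)."""
    return text.split()


def extract_parameters(query):
    """Return a small dictionary of query parameters found by simple lookup."""
    text = clean_text(query)
    tokens = tokenize(text)
    days = next((d for phrase, d in TIME_PHRASES.items() if phrase in text), None)
    level = "error" if "error" in tokens else None
    return {"tokens": tokens, "time_range_days": days, "log_level": level}
```

For the input "Query error log in past week!!", the sketch yields the tokens of the cleaned sentence, a time range of 7 days and the log level "error".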
S3, generating a logical SQL query statement according to the query parameters by utilizing the pre-trained large language model.
S3.1, inputting the query parameters into the pre-trained large language model to generate an initial logical SQL query statement.
S3.2, parsing the domain nouns out of the initial logical SQL query statement.
S3.3, checking the validity of the parsed domain nouns against a knowledge base, and correcting the wrong fields with an error correction tool.
S3.4, obtaining the corrected logical SQL query statement.
This embodiment uses a large language model to convert the natural language query into a logical SQL query statement. The large language model, for example the GPT-4 model, is trained in advance, and its language understanding and generation capability is used to convert the query parameters of the natural language query into a query statement that conforms to SQL grammar.
Illustratively, for "find error log over the past week", the generated logical SQL query statement is:
SELECT *
FROM logs
WHERE log_level = 'error' AND timestamp >= NOW() - INTERVAL 7 DAY
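One way step S3.1 might be wired up is sketched below. It is an assumption-laden illustration, not the method of this embodiment: the prompt wording, the function names and the `llm` callable (any function that takes a prompt string and returns the model's text) are all hypothetical, and no particular model API is implied.

```python
import json
from typing import Callable

# Hypothetical prompt template; the wording is illustrative only.
PROMPT_TEMPLATE = (
    "You translate structured query parameters into one logical SQL statement.\n"
    "Use only these logical tables and fields: {schema}\n"
    "Query parameters (JSON): {params}\n"
    "Return only the SQL statement."
)


def build_prompt(params: dict, schema: dict) -> str:
    """Serialize the query parameters and the logical schema into the prompt."""
    return PROMPT_TEMPLATE.format(
        schema=json.dumps(schema, sort_keys=True),
        params=json.dumps(params, sort_keys=True),
    )


def generate_logical_sql(params: dict, schema: dict, llm: Callable[[str], str]) -> str:
    """Call the model and strip surrounding whitespace and a trailing semicolon."""
    return llm(build_prompt(params, schema)).strip().rstrip(";")
```

Passing the model as a callable keeps the sketch independent of any vendor client and lets the step be exercised with a stub.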
In order to improve query accuracy, after the initial logical SQL query statement is generated, the method checks the validity of the domain nouns in it and corrects them based on a knowledge base. Nouns such as tables, fields and values are parsed out of the generated SQL and checked for validity one by one. For an invalid noun, the internal knowledge base is queried in a manner similar to schema mapping to find the correct match, and the SQL is rewritten. For example, a large language model may map a value to an incorrect field; the corrector module then attempts to find the correct field mapping and rewrites the SQL to ensure query accuracy.
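The validity check and correction can be illustrated with a small sketch. It assumes a hypothetical in-memory knowledge base of legal table and field names and swaps in fuzzy string matching from the Python standard library (`difflib.get_close_matches`) for the lookup that the corrector module performs; the actual matching strategy is not specified in this embodiment.

```python
import difflib
import re

# Hypothetical knowledge base of legal table and field names.
KNOWLEDGE_BASE = {"logs", "log_level", "timestamp", "message"}
SQL_KEYWORDS = {"select", "from", "where", "and", "or", "now", "interval",
                "day", "order", "by", "group", "limit"}


def extract_nouns(sql):
    """Collect identifiers outside string literals that are not SQL keywords."""
    without_literals = re.sub(r"'[^']*'", "", sql)
    words = re.findall(r"[A-Za-z_][A-Za-z0-9_]*", without_literals)
    return [w for w in words if w.lower() not in SQL_KEYWORDS]


def correct_sql(sql):
    """Replace each unknown noun with its closest knowledge-base entry, if any."""
    corrected = sql
    for noun in extract_nouns(sql):
        if noun in KNOWLEDGE_BASE:
            continue
        match = difflib.get_close_matches(noun, sorted(KNOWLEDGE_BASE), n=1, cutoff=0.6)
        if match:
            # Note: this simple rewrite would also touch an identical word
            # inside a string literal; a real corrector would work on a parse tree.
            corrected = re.sub(rf"\b{re.escape(noun)}\b", match[0], corrected)
    return corrected
```

With this knowledge base, a statement that names the table `log` and the field `loglevel` is rewritten to use `logs` and `log_level`, while nouns with no close match are left unchanged.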
In an alternative embodiment, names, aliases and values are periodically extracted from the processing results of the natural language processing technology, and an n-gram dictionary detection mechanism is used for dictionary matching to construct the knowledge base.
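The n-gram dictionary detection can be sketched as a longest-match lookup over word n-grams. The dictionary contents, the function names and the maximum n-gram length of three are illustrative assumptions; how the dictionary is populated and persisted is not shown.

```python
def ngrams(tokens, max_n=3):
    """Yield (start, end, phrase) for each n-gram, longest first at each start."""
    for start in range(len(tokens)):
        for n in range(min(max_n, len(tokens) - start), 0, -1):
            yield start, start + n, " ".join(tokens[start:start + n])


def match_dictionary(text, dictionary, max_n=3):
    """Return (phrase, canonical name) pairs, preferring the longest match
    at each position and skipping n-grams that overlap an earlier match."""
    tokens = text.lower().split()
    matches = []
    covered_until = 0
    for start, end, phrase in ngrams(tokens, max_n):
        if start < covered_until:
            continue
        if phrase in dictionary:
            matches.append((phrase, dictionary[phrase]))
            covered_until = end
    return matches
```

Trying the longest n-gram first means that "error log" is matched as a whole before the shorter alias "error" is considered.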
S4, converting the logical SQL query statement into a physical SQL query statement by utilizing a semantic layer technology.
In this embodiment, the logical SQL query statement is converted into the physical SQL query statement through semantic layer technology. First, the query definitions are managed, including the association relations between technical terms and business terms and the calculation formulas, which ensures the consistency and accuracy of data, facilitates checking and comparison, and eliminates ambiguity in definitions. Then the technical names (such as table names and field names) are translated into business terms (such as dimensions, indexes and labels) according to the association relations and calculation formulas to obtain the physical SQL query statement.
For example, the logical SQL query statement for "find error log over the past week" described above is converted into the following physical SQL query statement:
SELECT *
FROM actual_logs_table
WHERE actual_log_level_field = 'error' AND actual_timestamp_field >= CURRENT_TIMESTAMP - INTERVAL '7 days'
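The name translation in this example can be sketched as a lookup over a mapping table. The two dictionaries below are hypothetical and only mirror the names used in the example; a real semantic layer would also hold calculation formulas and would rewrite dialect-specific expressions such as the date arithmetic, which this sketch leaves untouched.

```python
import re

# Hypothetical semantic-layer mappings, mirroring the example names.
TABLE_MAP = {"logs": "actual_logs_table"}
FIELD_MAP = {
    "log_level": "actual_log_level_field",
    "timestamp": "actual_timestamp_field",
}


def to_physical_sql(logical_sql):
    """Rename mapped identifiers while leaving quoted string literals intact."""
    mapping = {**TABLE_MAP, **FIELD_MAP}
    names = "|".join(re.escape(name) for name in mapping)
    # The first alternative consumes string literals so they are never renamed.
    pattern = re.compile(r"'[^']*'|\b(" + names + r")\b")

    def replace(match):
        name = match.group(1)
        return mapping[name] if name else match.group(0)

    return pattern.sub(replace, logical_sql)
```

Matching string literals in the same pattern is a simple way to keep a value such as 'logs' from being renamed along with the table of the same name.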
S5, converting the physical SQL query statement into a domain-specific language for the Elasticsearch query by using an SQL parser.
This embodiment uses an SQL parser to convert the generated SQL statement into Elasticsearch query DSL. The SQL parser uses Apache Calcite or JSqlParser to parse the SQL, and generates the corresponding Elasticsearch query DSL by analyzing the structure and content of the SQL statement.
The generated DSL includes the query, aggs, terms and sort portions of Elasticsearch. The specific parsing process maps the SELECT clause to the aggs portion, the WHERE clause to the query portion, the GROUP BY clause to the terms portion, and the ORDER BY clause to the sort portion of Elasticsearch.
Here, mapping the SELECT clause maps its aggregation operations to the scripted_metric aggregation of Elasticsearch to generate the DSL.
Illustratively, the above-described physical SQL query statement "find error log over the past week" is converted into:
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "actual_log_level_field": "error"
          }
        },
        {
          "range": {
            "actual_timestamp_field": {
              "gte": "now-7d/d",
              "lt": "now/d"
            }
          }
        }
      ]
    }
  }
}
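The WHERE and ORDER BY mappings can be sketched as follows, assuming the SQL has already been parsed into simple (field, operator, value) conditions. The parsing itself, which the embodiment delegates to Apache Calcite or JSqlParser, is not reproduced, and the function names are hypothetical. Equality is mapped to a match clause and comparisons to a range clause, following the shape of the example above.

```python
# SQL comparison operators and their Elasticsearch range-query keys.
RANGE_OPS = {">": "gt", ">=": "gte", "<": "lt", "<=": "lte"}


def condition_to_clause(field, op, value):
    """Map one parsed WHERE condition to a query clause."""
    if op == "=":
        return {"match": {field: value}}
    if op in RANGE_OPS:
        return {"range": {field: {RANGE_OPS[op]: value}}}
    raise ValueError(f"unsupported operator: {op}")


def build_dsl(conditions, order_by=None):
    """AND-ed WHERE conditions go to bool.must; ORDER BY goes to sort."""
    dsl = {"query": {"bool": {"must": [condition_to_clause(*c) for c in conditions]}}}
    if order_by:
        dsl["sort"] = [{field: {"order": direction}} for field, direction in order_by]
    return dsl
```

Only conjunctions of simple comparisons are handled; OR, nested conditions and aggregations would need additional cases.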
S6, executing the Elasticsearch query, expressed in the domain-specific language, on the Elasticsearch cluster to obtain a query result.
In this embodiment, the generated Elasticsearch query DSL is executed on the Elasticsearch cluster, and the system obtains the documents that meet the query conditions by calling the Elasticsearch API. This step covers connecting to the Elasticsearch cluster, executing the query and obtaining the results.
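A minimal sketch of this step is given below, using only the Python standard library to call the Elasticsearch `_search` REST endpoint. The base URL and index name are supplied by the caller and are assumptions; authentication, TLS configuration and error handling are omitted, and a production system would more likely use an official client library.

```python
import json
import urllib.request


def build_search_request(base_url, index, dsl):
    """Build a POST request to <base_url>/<index>/_search carrying the DSL as JSON."""
    url = f"{base_url.rstrip('/')}/{index}/_search"
    body = json.dumps(dsl).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def execute_search(base_url, index, dsl, timeout=10.0):
    """Send the query and return the source documents of the matching hits."""
    request = build_search_request(base_url, index, dsl)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        payload = json.load(response)
    return [hit.get("_source", {}) for hit in payload["hits"]["hits"]]
```

Separating request construction from sending makes it possible to inspect the URL and body without a running cluster.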
The embodiment of the data query method based on the large language model is described in detail above, and the data query system based on the large language model corresponding to the method is also provided.
Fig. 2 is a schematic block diagram of a data query system based on a large language model according to an embodiment of the present invention, where in this embodiment, the data query system 200 based on the large language model may be divided into a plurality of functional modules according to functions performed by the data query system, as shown in fig. 2. The module referred to in the present invention refers to a series of computer program segments capable of being executed by at least one processor and of performing a fixed function, stored in a memory.
The query sentence receiving module 210 is configured to receive a natural language query sentence input by a user.
The query parameter obtaining module 220 is configured to analyze a natural language query sentence by using a natural language processing technology to obtain a query parameter, including a query entity, a sentence dependency relationship, and a user intention.
The logical SQL query statement generation module 230 is configured to generate a logical SQL query statement according to the query parameters using the pre-trained large language model.
The physical SQL query statement generation module 240 is configured to convert a logical SQL query statement into a physical SQL query statement using semantic layer technology.
The statement conversion module 250 is configured to convert the physical SQL query statement into a domain-specific language for the Elasticsearch query by using an SQL parser.
The query execution module 260 is configured to execute the Elasticsearch query, expressed in the domain-specific language, on the Elasticsearch cluster to obtain a query result.
In an alternative embodiment, the system 200 further includes a knowledge base construction module 270 for periodically extracting names, aliases, and values from the processing results of the natural language processing technique, and performing dictionary matching using an n-gram dictionary detection mechanism to construct a knowledge base.
The data query system based on the large language model of this embodiment is used to implement the foregoing data query method based on the large language model. Its functions correspond to those of the method, and its specific implementation can be found in the description of the corresponding parts of the foregoing method embodiments, so it is not described again here.
Fig. 3 is a schematic structural diagram of a terminal 300 according to an embodiment of the present invention, which includes a processor 310, a memory 320, and a communication unit 330. The processor 310 is configured to implement the following steps when executing the large language model based data query program stored in the memory 320:
 receiving a natural language query statement input by a user;
 analyzing the natural language query statement by using natural language processing technology to obtain query parameters, including a query entity, a sentence dependency relationship and a user intention;
 generating a logical SQL query statement according to the query parameters by using a pre-trained large language model;
 converting the logical SQL query statement into a physical SQL query statement by using semantic layer technology;
 converting the physical SQL query statement into a domain-specific language for an Elasticsearch query by using an SQL parser;
 and executing the Elasticsearch query, expressed in the domain-specific language, on an Elasticsearch cluster to obtain a query result.
The components of the terminal 300 may communicate via one or more buses. It will be appreciated by those skilled in the art that the configuration of the terminal shown in the drawings does not limit the invention: it may be a bus-like structure or a star-like structure, may include more or fewer components than shown, or may combine certain components or arrange the components differently.
The memory 320 may be used to store instructions for execution by the processor 310, and the memory 320 may be implemented by any type of volatile or non-volatile memory device or combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic disk, or optical disk. When the instructions in the memory 320 are executed by the processor 310, the terminal 300 is enabled to perform some or all of the steps in the foregoing method embodiments.
The processor 310 is the control center of the terminal. It connects the various parts of the entire terminal using various interfaces and lines, and performs the functions of the terminal and/or processes data by running or executing software programs and/or modules stored in the memory 320 and invoking data stored in the memory. The processor may consist of an integrated circuit (IC), for example a single packaged IC, or of multiple packaged ICs with the same or different functions connected together. For example, the processor 310 may include only a central processing unit (CPU). In the embodiment of the invention, the CPU may have a single computing core or multiple computing cores.
The communication unit 330 is used for establishing a communication channel so that the terminal can communicate with other terminals, receiving user data sent by other terminals or sending user data to other terminals.
The invention also provides a computer storage medium, which can be a magnetic disk, an optical disk, a read-only memory (ROM), a random access memory (RAM) or the like.
The computer storage medium stores a data query program based on a large language model, which when executed by a processor, implements the steps of:
 receiving a natural language query statement input by a user;
 analyzing the natural language query statement by using natural language processing technology to obtain query parameters, including a query entity, a sentence dependency relationship and a user intention;
 generating a logical SQL query statement according to the query parameters by using a pre-trained large language model;
 converting the logical SQL query statement into a physical SQL query statement by using semantic layer technology;
 converting the physical SQL query statement into a domain-specific language for an Elasticsearch query by using an SQL parser;
 and executing the Elasticsearch query, expressed in the domain-specific language, on an Elasticsearch cluster to obtain a query result.
It will be apparent to those skilled in the art that the techniques of the embodiments of the present invention may be implemented in software plus a necessary general-purpose hardware platform. Based on this understanding, the technical solution in the embodiments of the present invention, in essence or in the part contributing to the prior art, may be embodied in the form of a software product. The software product is stored in a storage medium, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk or another medium capable of storing program code, and includes several instructions for causing a computer terminal (which may be a personal computer, a server, a second terminal, a network terminal, etc.) to execute all or part of the steps of the method described in the embodiments of the present invention.
In the several embodiments provided by the present invention, it should be understood that the disclosed systems, devices, and methods may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of the units is merely a logical function division, and there may be additional divisions when actually implemented, e.g., multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiments of the present invention may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.