This application claims priority to U.S. Provisional Application No. 63/070,074, entitled “MACHINE LEARNING MODEL SELECTION AND EXPLANATION FOR MULTI-DIMENSIONAL DATASETS AND CONVERSATIONAL SYNTAX USING CONSTRAINED NATURAL LANGUAGE PROCESSING FOR ACCESSING DATASETS,” filed Aug. 25, 2020, the contents of which are hereby incorporated by reference as if set out in their entirety herein.
TECHNICAL FIELD
This disclosure relates to computing and data analytics systems, and more specifically, systems using natural language processing.
BACKGROUND
Natural language processing generally refers to a technical field in which computing devices process user inputs provided by users via conversational interactions using human languages. For example, a device may prompt a user for various inputs, present clarifying questions, present follow-up questions, or otherwise interact with the user in a conversational manner to elicit the input. The user may likewise enter the inputs as sentences or even fragments, thereby establishing a simulated dialog with the device to specify one or more intents (which may also be referred to as “tasks”) to be performed by the device.
During this process the device may present various interfaces by which to present the conversation. An example interface may act as a so-called “chatbot,” which often is configured to attempt to mimic human qualities, including personalities, voices, preferences, humor, etc. in an effort to establish a more conversational tone, and thereby facilitate interactions with the user by which to more naturally receive the input. Examples of chatbots include “digital assistants” (which may also be referred to as “virtual assistants”), which are a subset of chatbots focused on a set of tasks dedicated to assistance (such as scheduling meetings, making hotel reservations, and scheduling food delivery).
There are a number of different natural language processing algorithms utilized to parse the inputs to identify intents, some of which depend upon machine learning. However, natural languages often do not follow precise formats, and various users may have slightly different ways of expressing inputs that result in the same general intent, resulting in so-called “edge cases” that many natural language algorithms, including those that depend upon machine learning, are not programmed (or, in the context of machine learning, trained) to specifically address.
SUMMARY
In general, this disclosure describes techniques for constrained natural language processing (CNLP) that expose language sub-surfaces in a constrained manner, thereby removing ambiguity and aiding discoverability. In general, a natural language surface refers to the permitted set of potential user inputs (e.g., utterances), i.e., the set of utterances that the natural language processing system is capable of correctly processing.
Various aspects of the techniques are described by which to access datasets, including multi-dimensional datasets having two or more dimensions (e.g., rows and/or columns), using CNLP. Rather than require users to understand formal (and, often, rigid) syntaxes employed by formal databases, such as the structured query language (SQL), Pandas, and other database programming languages, various aspects of the techniques may enable a device to provide an interface by which less formal, more conversational queries may be received and processed to retrieve data from datasets that meet certain requirements. The device may transform the informal, more conversational queries into formal statements that adhere to the formal syntax associated with the datasets.
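To make the transformation concrete, the following is a minimal sketch, not the disclosed implementation: it maps one illustrative constrained utterance shape to a SQL statement. The utterance pattern, table name, and column names are all assumptions introduced for illustration.

```python
import re

# Hypothetical constrained utterance shape: "show <column> for <filter_col> <value>".
PATTERN = re.compile(r"show (?P<column>\w+) for (?P<filter_col>\w+) (?P<value>\w+)")

def to_sql(utterance: str) -> str:
    """Transform a constrained conversational query into a formal SQL statement."""
    match = PATTERN.match(utterance.lower())
    if match is None:
        raise ValueError("utterance does not conform to the exposed sub-surface")
    return (
        f"SELECT {match['column']} FROM dataset "
        f"WHERE {match['filter_col']} = '{match['value']}'"
    )

print(to_sql("show sales for region west"))
# SELECT sales FROM dataset WHERE region = 'west'
```

Because the utterance is constrained to a known shape, the transformation is deterministic; a production system would support many patterns and parameterized statements rather than string interpolation.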
Facilitating such access to datasets may enable users to more efficiently operate devices used to retrieve data relevant to queries. The efficiencies may occur as a result of not having to process additional commands or operations in a trial-and-error approach while also ensuring adequate confidence in query results, as the device(s) may augment or otherwise transform query results into results that include a plain-language explanation.
As fewer attempts to access datasets may occur as a result of such transformations, the devices may operate more efficiently. That is, the devices may receive fewer queries in order to successfully access databases to provide results that may potentially result in less consumption of resources, such as processor cycles, memory, memory bandwidth, etc., and thereby result in less power consumption.
Further, the devices may determine a correlation of one or more dimensions (e.g., a selected row or column) of the multi-dimensional datasets stored to the databases to query results provided in response to transformed queries output by machine learning models (MLMs). The device may invoke, responsive to queries, multiple MLMs that analyze query results resulting from accessing the datasets to obtain results. Based on the determined correlation, the device may select one or more of the MLMs to obtain the result (e.g., selecting an MLM having the determined correlation above a threshold correlation as one or more sources of the result). The device may output the result for each of the selected one or more MLMs.
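The threshold-based selection can be sketched as follows. This is an illustrative sketch only: the model names, correlation values, and results are hypothetical, and the correlation is assumed to have been determined already (e.g., between a selected dimension and each model's output).

```python
# Hypothetical model records: each MLM is paired with the correlation
# determined between a selected dimension and that model's result.
models = [
    {"name": "linear_model", "correlation": 0.91, "result": 42.0},
    {"name": "tree_model", "correlation": 0.34, "result": 17.0},
    {"name": "ensemble_model", "correlation": 0.78, "result": 40.5},
]

def select_models(models, threshold=0.7):
    """Select the subset of MLMs whose determined correlation exceeds the threshold."""
    return [m for m in models if m["correlation"] > threshold]

for m in select_models(models):
    print(m["name"], m["result"])
```

Only the models above the threshold contribute results, which is what allows the device to output a smaller, higher-confidence subset.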
The devices may determine a sentence in plain language explaining why the one or more MLMs were selected, utilizing the determined correlation to facilitate generation and/or determination of the sentence. The device may include the sentence explaining why the one or more MLMs were selected as part of the results, thereby potentially enabling users to better trust the result. Such trust may enable users, whether experienced data scientists or new users, to gain confidence in the result such that the user may reduce a number of interactions with the device.
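One simple way such a plain-language sentence could be generated is via a template keyed on the determined correlation; the wording and function name below are assumptions for illustration, not the disclosed mechanism.

```python
def explain_selection(model_name: str, dimension: str, correlation: float) -> str:
    """Generate a plain-language sentence explaining why an MLM was selected."""
    return (
        f"The model '{model_name}' was selected because its result is strongly "
        f"correlated (correlation {correlation:.2f}) with the '{dimension}' dimension."
    )

print(explain_selection("linear_model", "monthly_sales", 0.91))
```

Including such a sentence alongside the numeric result gives the user the rationale without requiring them to inspect the underlying correlations.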
Again, as fewer attempts to access the databases may occur as a result of such explanation, the device may operate more efficiently. That is, the device(s) may receive fewer queries in order to successfully access databases to provide results that may, again, potentially result in less consumption of resources, such as processor cycles, memory, memory bandwidth, etc., and thereby result in less power consumption.
In one example, various aspects of the techniques are directed to a device configured to interpret a multi-dimensional dataset, the device comprising: a memory configured to store the multi-dimensional dataset; and one or more processors configured to: apply a plurality of machine learning models to the multi-dimensional dataset to obtain a result output by each of the plurality of machine learning models; determine a correlation of one or more dimensions of the multi-dimensional dataset to the results output by each of the plurality of machine learning models; select, based on the correlation determined between the one or more dimensions and the result output by each of the plurality of machine learning models, a subset of the plurality of machine learning models to obtain the result for each of the subset of the plurality of machine learning models; and output the result for each of the subset of the plurality of machine learning models.
In another example, various aspects of the techniques are directed to a method of interpreting a multi-dimensional dataset, the method comprising: applying a plurality of machine learning models to the multi-dimensional dataset to obtain a result output by each of the plurality of machine learning models; determining a correlation of the one or more dimensions of the multi-dimensional dataset to the results output by each of the plurality of machine learning models; selecting, based on the correlation determined between the one or more dimensions and the result output by each of the plurality of machine learning models, a subset of the plurality of machine learning models to obtain the result for each of the subset of the plurality of machine learning models; and outputting the result for each of the subset of the plurality of machine learning models.
In another example, various aspects of the techniques are directed to a non-transitory computer-readable storage medium storing instructions that, when executed, cause one or more processors to: apply a plurality of machine learning models to a multi-dimensional dataset to obtain a result output by each of the plurality of machine learning models; determine a correlation of the one or more dimensions of the multi-dimensional dataset to the result output by each of the plurality of machine learning models; select, based on the correlation determined between the one or more dimensions and the result output by each of the plurality of machine learning models, a subset of the plurality of machine learning models to obtain the result for each of the subset of the plurality of machine learning models; and output the result for each of the subset of the plurality of machine learning models.
In another example, various aspects of the techniques are directed to a device configured to access a dataset, the device comprising: a memory configured to store the dataset; and one or more processors configured to: expose a language sub-surface specifying a natural language containment hierarchy defining a grammar for a natural language as a hierarchical arrangement of a plurality of language sub-surfaces; receive a query to access the dataset, the query conforming to a portion of the natural language provided by the exposed language sub-surface; transform the query into one or more statements that conform to a formal syntax associated with the dataset; access, based on the one or more statements, the dataset to obtain a query result; and output the query result.
In another example, various aspects of the techniques are directed to a method of accessing a dataset, the method comprising: exposing a language sub-surface specifying a natural language containment hierarchy defining a grammar for a natural language as a hierarchical arrangement of a plurality of language sub-surfaces; receiving a query to access the dataset, the query conforming to a portion of the language provided by the exposed language sub-surface; transforming the query into one or more statements that conform to a formal syntax associated with the dataset; accessing, based on the one or more statements, the dataset to obtain a query result; and outputting the query result.
In another example, various aspects of the techniques are directed to a non-transitory computer-readable storage medium storing instructions that, when executed, cause one or more processors to: expose a language sub-surface specifying a natural language containment hierarchy defining a grammar for a natural language as a hierarchical arrangement of a plurality of language sub-surfaces; receive a query to access a dataset, the query conforming to a portion of the language provided by the exposed language sub-surface; transform the query into one or more statements that conform to a formal syntax associated with the dataset; access, based on the one or more statements, the dataset to obtain a query result; and output the query result.
The details of one or more aspects of the techniques are set forth in the accompanying drawings and the description below. Other features, objects, and advantages of these techniques will be apparent from the description and drawings, and from the claims.
BRIEF DESCRIPTION OF DRAWINGS
FIG. 1 is a block diagram illustrating a system that may perform various aspects of the techniques described in this disclosure.
FIGS. 2A-2I are diagrams illustrating an example interface presented by the interface unit of the host device shown in FIG. 1 that includes a number of different applications executed by the execution platforms of the host device.
FIGS. 3A-3G are diagrams illustrating interfaces presented by the interface unit of the host device shown in FIG. 1 that facilitate sales manager productivity analytics via the sales manager productivity application shown in FIG. 2 in accordance with various aspects of the CNLP techniques described in this disclosure.
FIG. 4 is a block diagram illustrating a data structure used to represent the language surface shown in the example of FIG. 1 in accordance with various aspects of the techniques described in this disclosure.
FIG. 5 is a block diagram illustrating example components of the devices shown in the example of FIG. 1.
FIG. 6 is a flowchart illustrating example operation of the host device of FIG. 1 in performing various aspects of the techniques described in this disclosure.
FIG. 7 is another flowchart illustrating example operation of the host device of FIG. 1 in performing additional aspects of the techniques described in this disclosure.
DETAILED DESCRIPTION
FIG. 1 is a diagram illustrating a system 10 that may perform various aspects of the techniques described in this disclosure for constrained natural language processing (CNLP). As shown in the example of FIG. 1, system 10 includes a host device 12 and a client device 14. Although shown as including two devices, i.e., host device 12 and client device 14 in the example of FIG. 1, system 10 may include a single device that incorporates the functionality described below with respect to both of host device 12 and client device 14, or multiple clients 14 that each interface with one or more host devices 12 that share a mutual database hosted by one or more of the host devices 12.
Host device 12 may represent any form of computing device capable of implementing the techniques described in this disclosure, including a handset (or cellular phone), a tablet computer, a so-called smart phone, a desktop computer, and a laptop computer to provide a few examples. Likewise, client device 14 may represent any form of computing device capable of implementing the techniques described in this disclosure, including a handset (or cellular phone), a tablet computer, a so-called smart phone, a desktop computer, a laptop computer, a so-called smart speaker, so-called smart headphones, and so-called smart televisions, to provide a few examples.
As shown in the example of FIG. 1, host device 12 includes a server 28, a CNLP unit 22, one or more execution platforms 24, and a database 26. Server 28 may represent a unit configured to maintain a conversational context as well as coordinate the routing of data between CNLP unit 22 and execution platforms 24.
Server 28 may include an interface unit 20, which may represent a unit by which host device 12 may present one or more interfaces 21 to client device 14 in order to elicit data 19 indicative of an input and/or present results 25. Data 19 may be indicative of speech input, text input, image input (e.g., representative of text or capable of being reduced to text), or any other type of input capable of facilitating a dialog with host device 12. Interface unit 20 may generate or otherwise output various interfaces 21, including graphical user interfaces (GUIs), command line interfaces (CLIs), or any other interface by which to present data or otherwise provide data to a user 16. Interface unit 20 may, as one example, output a chat interface 21 in the form of a GUI with which the user 16 may interact to input data 19 indicative of the input (i.e., text inputs in the context of the chat server example). Server 28 may output the data 19 to CNLP unit 22 (or otherwise invoke CNLP unit 22 and pass data 19 via the invocation).
CNLP unit 22 may represent a unit configured to perform various aspects of the CNLP techniques as set forth in this disclosure. CNLP unit 22 may maintain a number of interconnected language sub-surfaces (shown as “SS”) 18A-18G (“SS 18”). Language sub-surfaces 18 may collectively represent a language, while each of the language sub-surfaces 18 may provide a portion (which may be different portions or overlapping portions) of the language. Each portion may specify a corresponding set of syntax rules and strings permitted for the natural language with which user 16 may interface to enter data 19 indicative of the input. CNLP unit 22 may, as described below in more detail, perform CNLP, based on the language sub-surfaces 18 and data 19, to identify one or more intents 23. CNLP unit 22 may output the intents 23 to server 28, which may in turn invoke one of execution platforms 24 associated with the intents 23, passing the intents 23 to one of the execution platforms 24 for further processing.
Execution platforms 24 may represent one or more platforms configured to perform various processes associated with the identified intents 23. The processes may each perform a different set of operations with respect to, in the example of FIG. 1, databases 26. In some examples, execution platforms 24 may each include processes corresponding to different categories, such as different categories of data analysis including sales data analytics, health data analytics, or loan data analytics, different forms of machine learning, etc. In some examples, execution platforms 24 may perform general data analysis that allows various different combinations of data stored to databases 26 to undergo complex processing and display via charts, graphs, etc. Execution platforms 24 may process the intents 23 to obtain results 25, which execution platforms 24 may return to server 28. Interface unit 20 may generate a GUI 21 that presents results 25, transmitting the GUI 21 to client device 14.
In this respect, execution platforms 24 may generally represent different platforms that support applications to perform analysis of underlying data stored to databases 26, where the platforms may offer extensible application development to accommodate evolving collection and analysis of data or perform other tasks/intents. For example, execution platforms 24 may include such platforms as Postgres (which may also be referred to as PostgreSQL, and represents an example of a relational database that performs data loading and manipulation), TensorFlow™ (which may perform machine learning in a specialized machine learning engine), and Amazon Web Services (or AWS, which performs large scale data analysis tasks that often utilize multiple machines, referred to generally as the cloud).
The client device 14 may include a client 30 (which may in the context of a chatbot interface be referred to as a “chat client 30”). Client 30 may represent a unit configured to present interface 21 and allow entry of data 19. Client 30 may execute within the context of a browser, as a dedicated third-party application, as a first-party application, or as an integrated component of an operating system (not shown in FIG. 1) of client device 14.
Returning to natural language processing, CNLP unit 22 may perform a balanced form of natural language processing compared to other forms of natural language processing. Natural language processing may refer to a process by which host device 12 attempts to process data 19 indicative of inputs (which may also be referred to as “inputs 19” or, in other words, “queries 19” for ease of explanation purposes) provided via a conversational interaction with client device 14. Host device 12 may dynamically prompt user 16 for various inputs 19, present clarifying questions, present follow-up questions, or otherwise interact with the user in a conversational manner to elicit input 19. User 16 may likewise enter the inputs 19 as sentences or even fragments, thereby establishing a simulated dialog with host device 12 to identify one or more intents 23 (which may also be referred to as “tasks 23”).
Host device 12 may present various interfaces 21 by which to present the conversation. An example interface may act as a so-called “chatbot,” which may attempt to mimic human qualities, including personalities, voices, preferences, humor, etc. in an effort to establish a more conversational tone, and thereby facilitate interactions with the user by which to more naturally receive the input. Examples of chatbots include “digital assistants” (which may also be referred to as “virtual assistants”), which are a subset of chatbots focused on a set of tasks dedicated to assistance (such as scheduling meetings, making hotel reservations, and scheduling food delivery).
A number of different natural language processing algorithms exist to parse the inputs 19 to identify intents 23, some of which depend upon machine learning. However, natural language may not always follow a precise format, and various users may have slightly different ways of expressing inputs 19 that result in the same general intent 23, some of which may result in so-called “edge cases” that many natural language algorithms, including those that depend upon machine learning, are not programmed (or, in the context of machine learning, trained) to specifically address. Machine learning based natural language processing may value naturalness over predictability and precision, thereby encountering edge cases more frequently when the trained naturalness of language differs from the user's perceived naturalness of language. Such edge cases can sometimes be identified by the system and reported as an inability to understand and proceed, which may frustrate the user. On the other hand, it may also be the case that the system proceeds with an imprecise understanding of the user's intent, causing actions or results that may be undesirable or misleading.
Other types of natural language processing algorithms utilized to parse inputs 19 to identify intents 23 may rely on keywords. While keyword based natural language processing algorithms may be accurate and predictable, keyword based natural language processing algorithms are not precise in that keywords do not provide much if any nuance in describing different intents 23.
In other words, various natural language processing algorithms fall within two classes. In the first class, machine learning-based algorithms for natural language processing rely on statistical machine learning processes, such as deep neural networks and support vector machines. Both of these machine learning processes may suffer from limited ability to discern nuances in the user utterances. Furthermore, while the machine learning based algorithms allow for a wide variety of natural-sounding utterances for the same intent, such machine learning based algorithms can often be unpredictable, parsing the same utterance differently in successive versions, in ways that are hard for developers and users to understand. In the second class, simple keyword-based algorithms for natural language processing may match the user's utterance against a predefined set of keywords and retrieve the associated intent.
CNLP unit 22 may parse inputs 19 (which may, as one example, include natural language statements that may also be referred to as “utterances”) in a manner that balances accuracy, precision, and predictability. CNLP unit 22 may achieve the balance through various design decisions when implementing the underlying language surface (which is another way of referring to the collection of sub-surfaces 18, or the “language”). Language surface 18 may represent a set of potential user utterances for which server 28 is capable of parsing (or, in more anthropomorphic terms, “understanding”) the intent of the user 16.
The design decisions may negotiate a tradeoff between competing priorities, including accuracy (e.g., how frequently server 28 is able to correctly interpret the utterances), precision (e.g., how nuanced the utterances can be in expressing the intent of user 16), and naturalness (e.g., how diverse the various phrasings of an utterance that map to the same intent of user 16 can be). The CNLP techniques may allow CNLP unit 22 to unambiguously parse inputs 19 (which may also be denoted as the “utterances 19”), thereby potentially ensuring predictable, accurate parsing of precise (though constrained) natural language utterances 19.
In operation, CNLP unit 22 may expose, to an initial user (which user 16 may be assumed to be for purposes of illustration) a select one of language sub-surfaces 18 in a constrained manner, potentially only exposing the select one of the language sub-surfaces 18. CNLP unit 22 may receive via interface unit 20 input 19 that conforms with the portion of the language provided by the exposed language sub-surface, and process input 19 to identify intent 23 of user 16 from a plurality of intents 23 associated with the language. That is, when designing CNLP unit 22 in support of server 28, a designer may select a set of intents 23 that the server 28 supports (in terms of supporting parsing of input 19 via CNLP unit 22).
Further, CNLP unit 22 may optionally increase precision with respect to each of intents 23 by supporting one or more entities. To illustrate, consider an intent of scheduling a meeting, which may have entities, such as a time and/or venue associated with the meeting scheduling intent, a frequency of repetition of the meeting scheduling intent (if any), and other participants (if any) to the schedule meeting intent. CNLP unit 22 may perform the process of parsing to identify that utterance 19 belongs to a certain one of the set of intents 23, and thereafter to extract any entities that may have occurred in the utterance 19.
CNLP unit 22 may associate each of intents 23 provided by the language 18 with one or more patterns. Each pattern supported by CNLP unit 22 may include one or more of the following components:
- a) A non-empty set of identifiers that may be present in the utterance for it to be parsed as belonging to this intent. Each identifier may be associated with one or more synonyms whose presence is treated equivalently to the presence of the identifier;
- b) An optional set of positional entities which CNLP unit 22 may parse based on where the positional entities occur in the utterance, relative to the identifiers;
- c) An optional set of keyword entities, each associated with a keyword (and possibly synonyms thereof). These keyword entities may occur anywhere in the utterance relative to each other; instead of their position, the keyword entities are parsed based on the occurrence of the associated keyword nearby (either before or after) in the utterance;
- d) An optional set of prepositional phrase entities, each associated with one or more prepositions (which may include terms such as “for each”). These prepositional phrase entities may be parsed based on the occurrence of the corresponding prepositional phrase;
- e) A set of ignored words, which may refer to words that occur commonly in natural language or otherwise carry little utility to interpreting the utterance, such as “the,” “a,” etc.;
- f) A prompt optionally associated with each entity, providing both a description of the entity, as well as a statement that CNLP unit 22 may use to query user 16 and elicit a value for the entity when the value may not be parsed from the utterance; and
- g) A pattern statement describing a relative order in which the identifiers and entities may occur in the pattern.
As an example, consider that in order to schedule a meeting, CNLP unit 22 may define a pattern as follows. The identifiers may be “schedule” and “meeting”, where the word “meeting” may have a synonym “appointment.” CNLP unit 22 may capture a meeting frequency as a positional entity from input 19 occurring in the form “schedule a daily meeting” or “schedule a weekly appointment.” Such statements may instead be captured as a keyword entity (with keyword “frequency”) as in “schedule a meeting at daily frequency” or “schedule an appointment with frequency weekly.” CNLP unit 22 may use a prepositional phrase to parse the timing using the preposition “at” as in “I want to schedule the meeting at 5 PM” or “schedule an appointment at noon.”
The above examples included a number of words that CNLP unit 22 may be programmed to ignore when parsing, including “a”, “an”, “the”, “I”, “want”, “to”, etc. The timing entity may also include a prompt such as “At what time would you like to have the meeting?,” where server 28 may initiate a query asking user 16 if they did not specify a timing in utterance 19. The pattern statement may describe that this pattern requires the identifier “schedule” to occur before “meeting” (or its synonym “appointment”) as well as all other entities.
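The scheduling pattern described above, with its identifiers, synonyms, ignored words, and prepositional timing entity, can be sketched in simplified form as follows. The data structures are illustrative assumptions, not the disclosed implementation, and the sketch handles only this single pattern.

```python
# Identifiers (with synonyms) that must appear, in order, for the
# "schedule meeting" pattern; ignored words carry little parsing utility.
IDENTIFIERS = {"schedule": {"schedule"}, "meeting": {"meeting", "appointment"}}
IGNORED = {"a", "an", "the", "i", "want", "to"}

def parse_schedule(utterance: str):
    """Return the parsed intent and timing entity, or None if no match."""
    words = [w for w in utterance.lower().rstrip(".?!").split() if w not in IGNORED]
    positions = []
    for synonyms in IDENTIFIERS.values():
        hits = [i for i, w in enumerate(words) if w in synonyms]
        if not hits:
            return None  # a required identifier (or synonym) is missing
        positions.append(hits[0])
    if positions != sorted(positions):
        return None  # identifiers occur out of the required order
    # Prepositional phrase entity: timing introduced by the preposition "at".
    timing = words[words.index("at") + 1] if "at" in words else None
    return {"intent": "schedule_meeting", "timing": timing}

print(parse_schedule("I want to schedule the meeting at 5PM"))
# {'intent': 'schedule_meeting', 'timing': '5pm'}
```

Both “schedule a meeting at noon” and “schedule an appointment at noon” parse to the same intent because the synonym set treats “appointment” as equivalent to “meeting.”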
As such, CNLP unit 22 may process input 19 to identify a pattern from a plurality of patterns associated with the language 18, each of the plurality of patterns associated with a different one of the plurality of intents 23. CNLP unit 22 may then identify, based on the identified pattern, intent 23 of user 16 from the plurality of intents associated with the portion of the language.
The pattern may, as noted above, include an identifier. To identify the pattern, CNLP unit 22 may parse input 19 to identify the identifier, and then identify, based on the identifier, the pattern. The pattern may include both the identifier and a positional entity. In these instances, CNLP unit 22 may parse input 19 to identify the positional entity, and identify, based on the identifier and the positional entity, the pattern.
Additionally, the pattern may, as noted above, include a keyword. CNLP unit 22 may parse input 19 to identify the keyword, and then identify, based on the keyword, the pattern in the manner illustrated in the examples below.
The pattern may, as noted above, include an entity. When the pattern includes an entity, CNLP unit 22 may determine that input 19 corresponds to the pattern but does not include the entity. CNLP unit 22 may interface with interface unit 20 to output, based on the determination that input 19 corresponds to the pattern but does not include the entity, a prompt via an interface 21 requesting data indicative of additional input specifying the entity. User 16 may enter data 19 indicative of the additional input (which may be denoted for ease of expression as “additional input 19”) specifying the entity. Interface unit 20 may receive the additional input 19 and pass the additional input 19 to CNLP unit 22, which may identify, based on the input 19 and additional input 19, the pattern.
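The prompt-on-missing-entity flow can be sketched as below. The prompt text mirrors the scheduling example above; the `ask` callable stands in for the chat interface and is an assumption introduced for illustration.

```python
# Prompts associated with entities that may be missing from a parsed utterance.
PROMPTS = {"timing": "At what time would you like to have the meeting?"}

def complete_entities(parsed: dict, ask) -> dict:
    """Fill any unspecified entities by prompting the user via the callable `ask`."""
    for entity, prompt in PROMPTS.items():
        if parsed.get(entity) is None:
            parsed[entity] = ask(prompt)
    return parsed

# Simulated user response in place of an interactive chat client.
result = complete_entities({"intent": "schedule_meeting", "timing": None},
                           ask=lambda prompt: "noon")
print(result)
# {'intent': 'schedule_meeting', 'timing': 'noon'}
```

Because prompting only occurs for entities the parse could not fill, an utterance that already specifies every entity completes without any follow-up question.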
CNLP unit 22 may provide a platform by which to execute pattern parsers to identify different intents 23. The platform provided by CNLP unit 22 may be extensible, allowing for development, refinement, addition, or removal of pattern parsers. CNLP unit 22 may utilize entity parsers embedded in the pattern parsers to extract various entities. When various entities are not specified in input 19, CNLP unit 22 may invoke prompts, which are also embedded in the pattern parsers. CNLP unit 22 may receive, responsive to outputting the prompts, additional inputs 19 specifying the unspecified entities, and thereby parse input 19 to identify patterns, which may be associated with intent 23.
In this way, CNLP unit 22 may parse various inputs 19 to identify intent 23. CNLP unit 22 may provide intent 23 to server 28, which may invoke one or more of execution platforms 24, passing the intent 23 to the execution platforms 24 in the form of a pattern and associated entities, keywords, and the like. The invoked ones of execution platforms 24 may execute a process associated with intent 23 to perform an operation with respect to corresponding ones of databases 26 and thereby obtain result 25. The invoked ones of execution platforms 24 may provide result 25 (of performing the operation) to server 28, which may provide result 25, via interface 21, to client device 14 interfacing with host device 12 to enter input 19.
Associated with each pattern may be a function (i.e., a procedure) that can identify whether that pattern is to be exposed to the user at this point in the current user session. For instance, a “plot_bubble_chart” pattern may be associated with a procedure that determines whether there is at least one dataset previously loaded by the user (possibly using the “load_data_from_file” pattern, or a “load_data_from_database” pattern that works similarly but loads data from a database instead of a file). When such a procedure is associated with every data visualization pattern in the system (such as “plot_histogram” and “plot_line_chart”, etc.), the data visualization patterns may be conceptualized as forming a language sub-surface.
Because these patterns are only exposed after the user has invoked one of the data loading patterns (which form another language sub-surface), the CNLP unit 22 may effectively link language sub-surfaces to each other. Because the user is only able to execute an utterance belonging to the data visualization language sub-surface after the above prerequisite has been met, the user is provided structure with regard to a so-called “thought process” in executing tasks of interest, allowing the user to (naturally) discover the capabilities of the system in a gradual manner, and reducing cognitive overhead during the discovery process.
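The gating of one sub-surface on another can be sketched with an exposure predicate over session state. The session structure and pattern names below follow the examples above but are otherwise illustrative assumptions.

```python
# Session state tracks what the user has done so far in the conversation.
session = {"loaded_datasets": []}

def visualization_exposed(session: dict) -> bool:
    """Predicate deciding whether the visualization sub-surface is exposed."""
    return len(session["loaded_datasets"]) > 0

def exposed_patterns(session: dict) -> list:
    """Return the patterns currently exposed, linking sub-surfaces via predicates."""
    patterns = ["load_data_from_file", "load_data_from_database"]
    if visualization_exposed(session):
        patterns += ["plot_histogram", "plot_line_chart", "plot_bubble_chart"]
    return patterns

print(exposed_patterns(session))           # data loading sub-surface only
session["loaded_datasets"].append("sales.csv")
print(exposed_patterns(session))           # visualization sub-surface now exposed
```

Autocomplete and suggestions would then draw only from `exposed_patterns`, which is what lets the user discover capabilities gradually rather than all at once.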
As such, CNLP unit 22 may promote better operation of host device 12 that interfaces with user 16 according to a natural language interface, such as so-called “digital assistants” or “chatbots.” Rather than consume processing cycles attempting to process ambiguous inputs from which multiple different meanings can be parsed, and presenting follow-up questions to ascertain the precise meaning the user intended by the input, CNLP unit 22 may result in more efficient processing of input 19 by limiting the available language to one or more sub-surfaces 18. The reduction in processing cycles may improve the operation of host device 12 as less power is consumed, less state is generated resulting in reduced memory consumption and less memory bandwidth is utilized (both of which also further reduce power consumption), and more processing bandwidth is preserved for other processes.
CNLP unit 22 may introduce different language sub-surfaces 18 through autocomplete, prompts, questions, or dynamic suggestion mechanisms, thereby exposing the user to additional language sub-surfaces in a more natural (or, in other words, conversational) way. The natural exploration that results through linked sub-surfaces may promote user acceptability and natural learning of the language used by the CNLP techniques, which may avoid frustration due to frequent encounters with edge cases that generally appear due to user inexperience through inadequate understanding of the language by which the CNLP techniques operate. In this sense, the CNLP techniques may balance naturalness, precision and accuracy by naturally allowing a user to expose sub-surfaces utilizing a restricted unambiguous portion of the language to allow for precision and accuracy in expressing intents that avoid ambiguous edge cases.
For example, consider a chatbot designed to perform various categories of data analysis, including loading and cleaning data, slicing and dicing it to answer various business-relevant questions, visualizing data to recognize patterns, and using machine learning techniques to project trends into the future. Using the techniques described herein, the designers of such a system can specify a large language surface that allows users to express intents corresponding to these diverse tasks, while potentially constraining the utterances to only those that can be unambiguously understood by the system, thereby avoiding the edge-cases. Further, the language surface can be tailored to ensure that, using the auto-complete mechanism, even a novice user can focus on the specific task they want to perform, without being overwhelmed by all the other capabilities in the system. For instance, once the user starts to express their intent to plot a chart summarizing their data, the system can suggest the various chart formats from which the user can make their choice. Once the user selects the chart format (e.g., a line chart), the system can suggest the axes, colors and other options the user can configure.
The system designers can specify language sub-surfaces (e.g., utterances for data loading, for data visualization, and for machine learning), and the conditions under which they would be exposed to the user. For instance, the data visualization sub-surface may only be exposed once the user has loaded some data into the system, and the machine learning sub-surface may only be exposed once the user acknowledges that they are aware of the subtleties and pitfalls in building and interpreting machine learning models. That is, this process of gradually revealing details and complexity in the natural language utterances extends both across language sub-surfaces as well as within them.
Taken together, the CNLP techniques can be used to build systems with user interfaces that are easy-to-use (e.g., possibly requiring little training and limiting cognitive overhead), while potentially programmatically recognizing a large variety of intents with high precision, to support users with diverse needs and levels of sophistication. As such, these techniques may permit novel system designs achieving a balance of capability and usability that is difficult or even impossible otherwise. More information regarding CNLP techniques can be found in U.S. application Ser. No. 16/441,915, entitled “CONSTRAINED NATURAL LANGUAGE PROCESSING,” filed Jun. 14, 2019, the entire contents of which are hereby incorporated by reference as if set forth in its entirety.
In the context of these CNLP techniques, various queries 19 may require interfacing with one or more databases 26 that adhere to a formal syntax. For example, one or more of databases 26 may represent a structured query language (SQL) database that has a formal syntax (known by the acronym SQL, which was formerly referred to as SEQUEL) for accessing data stored to the database 26. As another example, one or more of databases 26 may represent a so-called Pandas dataframe accessible via a formal Pandas syntax. Such formal syntaxes may limit accessibility to databases 26 whether user 16 is a less experienced user or an experienced data scientist. Requiring user 16 to understand and correctly define queries 19 using appropriate commands in accordance with the formal syntax may contravene the accessible nature of the CNLP techniques discussed above.
In addition, server 28 may output results 25 obtained via application of machine learning models to multi-dimensional data stored by databases 26. That is, one or more of execution platforms 24 may implement a machine learning model, which are shown as machine learning models (MLM) 44 in the example of FIG. 1. In some instances, MLM 44 are trained using training data to produce a trained model able to generalize properties of data based on similar patterns with the training data. Training MLM 44 may involve learning model parameters by optimizing an objective function, thus optimizing a likelihood of observing the training data given the model. Given variabilities in the training data, the extent of training samples within the training data, other limitations to training, and the complexity of modern machine learning models, it is often difficult to explain results 25 that appear erratic or fail to meet expectations, particularly using plain language that less experienced users (represented by user 16 in some instances) can understand.
In accordance with various aspects of the techniques described in this disclosure, server 28 may receive a query 19 via CNL sub-surface 18 exposed by CNLP unit 22 via interface 21 that includes a plain language request for data stored to databases 26. Such queries 19 may, in other words, request access to databases 26 so as to retrieve data stored to the databases 26 as a dataset. As noted above, such queries 19 may conform to a plain conversational language having various inputs that are translated, by CNLP unit 22, into intents 23. Server 28 may redirect intents 23 to execution platforms 24 that apply transformations to the intents 23 that transform intents 23 (representative of queries 19) into one or more statements 27 that conform to a formal syntax associated with the dataset stored to databases 26. Execution platforms 24 may access, based on statements 27, the dataset stored to databases 26 to obtain a query result 29 providing portions of the dataset relevant to initial queries 19. Execution platforms 24 may obtain query result 29 that execution platforms 24 may use when forming results 25.
As such, host device 12 may maintain the accessibility of the foregoing CNLP techniques in terms of allowing user 16 to define queries 19 in plain conversational language, thereby potentially avoiding user 16 having to have a broad understanding of the formal syntax of SQL, Pandas, or other formal database syntax. In this manner, both experienced data scientists and new users with little data science experience (or training) may access complicated datasets having formal (or, in other words, rigid) syntax using plain language. Facilitating such access to datasets may enable user 16 to more efficiently operate client device 14 and host device 12 to retrieve relevant data (in terms of relevance to queries 19). The efficiencies may occur as a result of not having to process additional commands or operations in a trial-and-error approach while also ensuring adequate confidence in query results 29, as execution platforms 24 may augment or otherwise transform query results 29 into results 25 that include an explanation of query results 29 in plain language.
As fewer attempts to access databases 26 may occur as a result of such transformations, both client device 14 and host device 12 may operate more efficiently. That is, client device 14 and host device 12 may receive fewer queries 19 in order to successfully access databases 26 to provide results 25, which may potentially result in less consumption of resources, such as processor cycles, memory, memory bandwidth, etc., and thereby result in less power consumption.
Further, execution platforms 24 may determine a correlation of one or more dimensions (e.g., a selected row or column) of the multi-dimensional datasets stored to databases 26 to query results 29—provided in response to transformed intents 23 (which are represented by statements 27)—output by MLM 44. Execution platforms 24 may invoke multiple MLM 44 responsive to intents 23 (or transformed intents 23 represented by statements 27) that analyze query results 29 resulting from accessing databases 26, based on statements 27, to obtain results 25. Based on the determined correlation, execution platforms 24 may select one or more of MLM 44 to obtain result 25 (e.g., selecting MLM 44 having the determined correlation above a threshold correlation as one or more sources of result 25). Execution platforms 24 may output result 25 for each of the one or more of MLM 44 to server 28, which may provide result 25 via interface 21.
Execution platforms 24 may determine a sentence in plain language explaining why one or more of MLM 44 were selected, utilizing the determined correlation to facilitate generation and/or determination of the sentence. Execution platforms 24 may include the sentence explaining why one or more of MLM 44 were selected as part of results 25 provided by way of interface 21 to client device 14, thereby potentially enabling user 16 to trust result 25. Such trust may enable user 16, whether an experienced data scientist or a new user, to gain confidence in result 25 such that user 16 may reduce a number of interactions with client device 14 to receive result 25.
Again, as fewer attempts to access databases 26 may occur as a result of such explanation, both client device 14 and host device 12 may operate more efficiently. That is, client device 14 and host device 12 may receive fewer queries 19 in order to successfully access databases 26 to provide results 25 that may, again, potentially result in less consumption of resources, such as processor cycles, memory, memory bandwidth, etc., and thereby result in less power consumption.
In operation, server 28 may expose a language sub-surface 18 via interface 21 by which to receive a query 19 conforming to the portion of the language provided by exposed language sub-surface 18. Server 28 may invoke CNLP unit 22 to reduce query 19 to intents 23 as described above, where such intents 23 are representative of query 19. Server 28 may obtain intents 23 and invoke execution platforms 24 to process intents 23, passing intents 23 to execution platforms 24.
Execution platforms 24 may, responsive to receiving intents 23, invoke one or more of transform units 34 that apply one or more transforms to intents 23 that convert intents 23 into one or more statements 27 that conform to the formal syntax associated with the dataset stored to databases 26. In some examples, execution platform 24 may categorize or, in other words, classify intents 23 to identify which of transform units 34 to invoke. To illustrate, one or more of intents 23 may indicate one or more rows along with an operation, such as an indication that rows containing a particular value in an identified column (e.g., identified by name or variable) of the dataset are to be “kept” while the remaining rows are to be removed from the working dataset, as will be explained in more detail below. Execution platform 24 may categorize these intents 23 as a database query that requests only a subset of the rows of the dataset that meet the condition (e.g., having the identified value), thereby invoking certain ones of transform units 34. The invoked ones of transform units 34 may transform the one or more of intents 23 into statements 27 that conform to the formal SQL syntax.
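A minimal sketch of such a transform for the “keep rows” case might look like the following; the intent structure (its `table`, `column`, and `value` fields) and the function name are illustrative assumptions, not details from this disclosure.

```python
# Hypothetical sketch of a transform unit that converts a classified
# "keep rows" intent into a formal SQL statement.

def keep_rows_to_sql(intent):
    """Transform a 'keep' intent into a SELECT statement that retains
    only the rows whose identified column holds the identified value."""
    table = intent["table"]
    column = intent["column"]
    value = intent["value"]
    # A parameterized statement keeps the user-supplied value out of
    # the SQL text itself.
    statement = f"SELECT * FROM {table} WHERE {column} = ?"
    return statement, (value,)
```

The statement and parameter tuple could then be handed to any SQL driver; the transform itself is just a mapping from the categorized intent to the formal syntax.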
In this respect, the foregoing example enables host device 12 to receive a query 19 that identifies one or more dimensions of the dataset to “keep” in the working dataset. Execution platforms 24 may invoke transform units 34 to apply transforms that convert intents 23 (representative of query 19) into statements 27 that conform to the formal SQL syntax associated with the underlying dataset stored to databases 26. Execution platforms 24 may then access, based on statements 27, the dataset stored to databases 26 to obtain a query result 29 (that in this example includes the one or more dimensions of the dataset identified by query 19), and output the query result 29 as part of result 25.
Further, as noted above, execution platforms 24 may apply a number of different MLM 44 to the multi-dimensional dataset stored to databases 26 to obtain a result output by each of different MLM 44. Examples of MLM 44 include a neural network, a support vector machine, a naïve Bayes model, a linear regression model, a linear discriminant analysis model, a light gradient boosted machine (lightGBM) model, a decision tree, etc.
Execution platform 24 may determine a correlation between each dimension of the multi-dimensional dataset and the result output by each of MLM 44. Correlation may refer to a statistical association that represents a degree to which a pair of variables are linearly related. As such, execution platform 24 determines a correlation coefficient for each dimension (e.g., column or row) of the multi-dimensional dataset relative to the result output by each of MLM 44. Execution platform 24 may determine this correlation to evaluate which dimension most accurately forms the result output by each of MLM 44, thereby enabling selection of a subset (meaning, less than all but not none) of MLM 44 having an associated correlation with a meaningful dimension (as measured in terms of randomness, uniqueness, entropy—as understood in the context of information theory—etc.) of the multi-dimensional dataset that exceeds some threshold correlation.
In this way, execution platform 24 may select, based on the correlation, a subset of MLM 44 to obtain a result 25 for each of the subset of MLM 44. Execution platform 24 may output result 25 for each of the subset of MLM 44, where such result 25 may include a sentence explaining the result using plain language. Execution platform 24 may also include, in result 25, a graph identifying a relevance of each of the one or more dimensions of the multi-dimensional dataset to the result for each of the subset of MLM 44. More information concerning the foregoing aspects of the techniques is provided below with respect to FIGS. 2A-3G.
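The correlation-driven model selection described above might be sketched as follows; the 0.65 threshold, the column-oriented data layout, and the function names are illustrative assumptions.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def select_models(dimensions, model_outputs, threshold=0.65):
    """Select models whose output correlates with at least one dimension
    above the threshold; returns {model name: best-correlated dimension}."""
    selected = {}
    for model, outputs in model_outputs.items():
        best = max(dimensions,
                   key=lambda d: abs(pearson(dimensions[d], outputs)))
        if abs(pearson(dimensions[best], outputs)) >= threshold:
            selected[model] = best
    return selected
```

A model whose output tracks a meaningful dimension closely survives the threshold; a model correlated only weakly with every dimension is dropped from the subset.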
FIGS. 2A-2I are diagrams illustrating interface 21 presented by host device 12 for machine learning model selection and evaluation in accordance with various aspects of the techniques described in this disclosure. In the example of FIG. 2A, a screenshot 200 of interface 21 (which may be denoted as “interface 21A”) is shown in which host device 12 presents an initial prompt 201 by which to engage user 16. User 16 then enters a command 202 in conversational language to “[l]oad data from the file telcoCustomerChurn.csv” that results in result 25A being displayed that lists the data from the telcoCustomerChurn.csv dataset. Host device 12 may include in result 25A, explanation 203, in plain language, that explains the dataset along with suggestions for suggested inputs 209, including a suggested input 209A to analyze churn for customers of the telecommunications service.
In the example of FIG. 2B, user 16 has selected suggested input 209A, which results in host device 12 providing interface 21B that is illustrated by screenshot 210 shown in the example of FIG. 2B. That is, user 16 has selected suggested input 209A that results in input 211 to “Analyze Churn,” whereupon host device 12 has processed input 211 to generate chart 212 representative, in part, of result 25. Host device 12 provides chart 212 (which is another way to refer to a graph, hence chart 212 may be referred to as “graph 212”) along with explanation 213 as result 25 in this example, where explanation 213 describes chart 212. Explanation 213 states that “[d]etected that 1 categorical column(s) have unique values that are nearly equal to the size of the dataset.” Explanation 213 continues to note that “[p]lease consider dropping these columns from the analysis to get a more meaningful model,” where “[y]ou can do this by asking:” with a link to a suggested input to “Analyze Churn excluding customerID.”
In this example, host device 12 has invoked execution platforms 24 to analyze the telcoCustomerChurn.csv dataset to determine a correlation of the columns (or other dimensions) relative to results provided by each of MLM 44. Execution platforms 24 may execute a dimension reduction algorithm that detects unique and/or random values for certain dimensions, and thereby identified that the customerID column appears to have little relevance (due to the random and/or unique nature of the underlying customerID data) to any analysis resulting from application of one or more of MLM 44. Host device 12 may then provide explanation 213 with a suggested input (noted above) automatically explaining that a better result may be achieved using the suggested input, all of which occurs via plain language sentences.
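The check behind explanation 213, flagging categorical columns whose unique values are nearly equal to the size of the dataset, could be sketched as follows; the 0.95 ratio and the column-oriented table layout are illustrative assumptions.

```python
# Hypothetical sketch: flag columns (such as an ID column) whose
# unique-value count nearly equals the row count of the dataset.

def near_unique_columns(table, ratio=0.95):
    """Return column names whose unique-value count is at least
    `ratio` times the row count, suggesting an identifier column
    with little relevance to any model's result."""
    n_rows = len(next(iter(table.values())))
    return [
        name for name, values in table.items()
        if len(set(values)) >= ratio * n_rows
    ]
```

A column like customerID, being unique per row, trips the check, while a low-cardinality column like gender does not.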
Explanation 213 continues, noting that host device 12 has “trained a Lightgbm classifier and saved it as BestFit1,” which “achieved a validation Accuracy of 80.0%,” providing a further explanation that “[t]he model predicted correctly 80% of the time.” In this respect, host device 12 has further explained result 25 using plain language that allows user 16 to trust result 25. Moreover, explanation 213 notes that the “values for Churn are most impacted by: Contract, tenure, TotalCharges, MonthlyCharges,” which are labels (or, in other words, names or variable names) for dimensions (i.e., columns in this example) of the telcoCustomerChurn.csv dataset. Explanation 213 also notes that the “bar chart [chart 212] shows the full impact scores; detailed scores are in the dataset ImpactList1,” where ImpactList1 is a suggested link for viewing the impact of each dimension on the result output by the lightGBM one of MLM 44. In other words, chart 212 indicates an impact (or, in other words, correlation) of each dimension (which in this instance refers to columns of the dataset) on the result output by the lightGBM model of MLM 44.
In other words, execution platforms 24 may invoke each of MLM 44, training MLM 44 for the underlying dataset, and then apply each of MLM 44 to determine a respective result. Execution platforms 24 may next determine a correlation between each dimension of the dataset and the result output by each of MLM 44, selecting the result of each MLM 44 having a corresponding correlation that exceeds a high correlation threshold (e.g., which may be 60-70%). Execution platform 24 may provide explanation 213 to explain chart 212 in plain language to facilitate easy understanding of chart 212 while also providing links to allow user 16 to further explore and/or understand the creation of chart 212.
FIG. 2C is a diagram illustrating another example of interface 21 (which may be denoted as interface 21C), where screenshot 220 of interface 21C presents a further explanation 221 indicating, in plain language, that the “bar chart [chart 212] shows the full impact scores,” and noting that the “detailed scores are in the dataset ImpactList1” with a hyperlink to facilitate access to the dataset ImpactList1. In this respect, explanation 221 explains that chart 212 represents an impact graph (or, in other words, an impact chart), and further explains, as noted below and in plain language, the formulation of chart 212 (which may also be referred to as a visual chart).
Explanation 221 further indicates that three additional models were built, “called SimpleFit1A (Accuracy: 75.0), SimpleFit1B (Accuracy: 75.0), SimpleFit1C (Accuracy: 77.0) with increasing levels of detail.” In the example of FIG. 2C, each of the model names, SimpleFit1A, SimpleFit1B, and SimpleFit1C, is referenced via a hyperlink to enable user 16 to quickly view the results of the three additional models, each of which may represent another iteration of MLM 44. In this respect, execution platforms 24 have generated and trained a complicated model referred to as Lightgbm and additional models that are simple fit models (being less complicated, or simpler, than the Lightgbm model).
In addition, explanation 221 also states that there are 3 additional “charts to visualize the data,” referencing each of Chart1A, Chart1B and Chart1C as hyperlinks to again facilitate access by user 16 to the additional charts. Explanation 221 also describes each of Chart1A-Chart1C, noting that Chart1A is a “bubble chart with x-axis Contract y-axis tenureInt20 bubble color Churn bubble size NumRecords,” Chart1B is a “scatter chart with x-axis Contract y-axis NumRecords for each Churn,” and Chart1C is a “stacked bar chart with x-axis Contract y-axis NumRecords for each tenureInt20.”
In this respect, execution platforms 24 may determine, based on the results for each of MLM 44, one or more charts to explain the corresponding result output by each of MLM 44, such as chart 212 and Chart1A-Chart1C. Execution platforms 24 may rank the charts to identify the highest ranked chart (which in the example of FIG. 2C is chart 212), selecting the highest ranked chart for output via interface 21. Execution platforms 24 may rank the charts based on model accuracy as discussed in explanation 221 (where the accuracy is provided next to the lightGBM model and each model SimpleFit1A-SimpleFit1C).
Although not shown in the example of FIG. 2C, execution platforms 24 may also identify dimensions of the dataset that have low correlation. Execution platform 24 may identify the dimensions that have low correlation by comparing the correlation for each dimension to a low correlation threshold (which may also be referred to as a relevance threshold). That is, execution platform 24 may determine, based on a comparison of the correlation determined between the one or more dimensions and the result output by each of MLM 44 to the relevance threshold, one or more low relevance dimensions of the multi-dimensional dataset that have low relevance to the result output by each of MLM 44.
Execution platform 24 may provide an explanation, for example, that indicates that various dimensions, such as the dimension denoted by the name “gender,” do not relate to the churn analysis performed by the lightGBM model of MLM 44. Such explanation may be different from denoting that customerID does not appear to have much relevance to the result produced by the lightGBM model, as execution platform 24 may perform a different analysis on customerID to determine that customerID appears to be a random, unique number assigned to each row of the dataset.
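The relevance-threshold comparison reduces to a short check once per-dimension correlations have been computed; the 0.1 threshold and the function name below are illustrative assumptions.

```python
# Hypothetical sketch: given per-dimension correlations to a model's
# output, flag the dimensions falling below a relevance threshold.

def low_relevance_dimensions(correlations, relevance_threshold=0.1):
    """Return the dimensions whose absolute correlation to the model
    result falls below the relevance threshold."""
    return [
        dim for dim, corr in correlations.items()
        if abs(corr) < relevance_threshold
    ]
```

The flagged names could then feed a plain language sentence such as the “gender does not relate to the churn analysis” explanation described above.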
In the example of FIG. 2D, user 16 has entered input 226 indicating that host device 12 should “Analyze Churn using Contract, gender, tenure,” where “Contract,” “gender,” and “tenure” refer to labels assigned to dimensions (again, columns in this example) of the telcoCustomerChurn.csv dataset. Host device 12 may process input 226 in the manner described above in which execution platforms 24 may process intents 23 representative of input 226 to obtain chart 227 and explanation 229, providing chart 227 and explanation 229 as result 25 via interface 21 to client device 14.
Chart 227 represents another impact graph that is focused on the three identified dimensions in input 226, indicating that the “Contract” dimension has the most impact (of the three identified dimensions) followed by the “tenure” dimension, and then the “gender” dimension. Execution platforms 24 may build another lightGBM model that assesses the three identified dimensions to determine whether such dimensions impact customer churn (for the telecommunication contract).
Explanation 229 explains chart 227, stating that “I've trained a Lightgbm classifier and saved it as BestFit2,” which “achieved a validation accuracy of 76%.” Explanation 229 further notes that the “model predicted correctly 76% of the time,” before continuing to note that the “values for Churn are most impacted by: Contract, tenure, gender,” naming the dimensions in order of impact on (or, in other words, correlation to) Churn. Explanation 229 concludes by stating that the “bar chart shows the full impact scores,” indicating that “detailed scores are in the dataset ImpactList2.” Explanation 229 provides the ImpactList2 as a hyperlink that user 16 may quickly access to more fully explore the detailed impact scores.
Referring next to the example of FIG. 2E, screenshot 230 represents another example of interface 21 in which user 16 has entered input 231 to “Visualize the model SimpleFit1A.” Responsive to input 231, host device 12 may invoke execution platforms 24 to process intents 23 parsed from input 231 to generate a result 25 that includes diagram 232 and explanation 234.
Execution platforms 24 may build and train a decision tree that can be visualized as diagram 232 in which the dark circles represent “No” customer churn, while the light circles indicate “Yes,” or in other words are indicative of an impact, in terms of customer churn. Starting from the initial dataset, diagram 232 indicates that there are two initial branches related to whether the contract is or is not a month-to-month contract. In the “Contract is not Month-to-month” branch, diagram 232 includes two sub-branches indicating whether the contract is less than one year or the contract is one year, with both being unrelated (“No”) to customer churn. In the “Contract is Month-to-month” branch, diagram 232 includes two sub-branches based on whether the monthly charge is less than or greater than $68.6, with charges less than $68.6 being unrelated to customer churn and churn occurring (“Yes”) when the monthly charge is equal to or greater than $68.6.
Execution platform 24 may also translate diagram 232 into explanation 234, which provides the following “key insights:”
- When Contract is Month-to-month, MonthlyCharges is greater than $69, Churn is Yes (32% of total samples).
- When Contract is Month-to-month, MonthlyCharges is less than or equal to $69, Churn is no (23% of total samples).
- When Contract is not Month-to-month, Churn is No (45% of total samples).
Using these key insights provided by explanation 234 and reviewing diagram 232 may enable user 16 to better understand diagram 232 in terms of analyzing customer churn.
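Translating a decision tree into plain language key insights like those above could be sketched as follows; the nested-dictionary tree encoding, branch labels, and function name are illustrative assumptions, not the disclosed implementation.

```python
# Hypothetical sketch: walk a small decision tree and emit one plain
# language sentence per leaf, combining the branch labels on the path.

def tree_insights(node, path=()):
    """Yield a "key insight" sentence for each leaf of the tree.

    A leaf is {"leaf": (label, pct)}; an inner node is
    {"branches": [(branch_label, child), ...]}.
    """
    if "leaf" in node:
        label, pct = node["leaf"]
        yield (f"When {', '.join(path)}, "
               f"Churn is {label} ({pct}% of total samples).")
        return
    for branch_label, child in node["branches"]:
        yield from tree_insights(child, path + (branch_label,))
```

Applied to a tree mirroring diagram 232, the walk reproduces sentences of the same shape as the three insights listed above.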
In the example of FIG. 2F, screenshot 235 provides another example of interface 21 in which user 16 has entered input 236 to “Visualize the model SimpleFit1B.” Responsive to input 236, host device 12 may invoke execution platforms 24 to process intents 23 parsed from input 236 to generate a result 25 that includes diagram 237 and explanation 239.
Execution platforms 24 may build and train decision trees having various degrees of granularity, where the levels of granularity may be controlled by user 16 setting a skill level to a value between 1 and 3 (1 being a novice, and 3 being an expert, where higher levels of granularity are provided as the skill level increases). In the example of FIG. 2F, the decision tree has four levels, starting with “Dataset” and moving down three additional node levels, providing an additional level of granularity compared to the decision tree shown in diagram 232. The decision tree visualized in diagram 237 begins with two branches from “Dataset” that are the same as those presented in diagram 232. However, in the “Contract is not Month-to-month” branch, diagram 237 provides two sub-branches regarding tenure being either 70 or 71. In the “tenure 70” branch, diagram 237 provides two additional sub-branches directed to whether tenure is less than 32 or between 33 and 70. In the “tenure 71” branch, diagram 237 provides two sub-branches directed to whether the contract is less than a year or the contract is one year. In each instance of the “Contract is not Month-to-month” branch, the dark circles (or, in other words, nodes) represent no customer churn.
In the “Contract is Month-to-month” branch, diagram 237 provides two sub-branches directed to whether the “tenure is less than or equal to 5” or “tenure 6.” The light node representative of “tenure is less than or equal to 5” in diagram 237 represents relatively higher correlation to the customer churn analysis, thereby indicating that tenure less than or equal to 5 may result in customer churn. The dark node representative of “tenure 6” (or greater) may indicate a relatively low correlation to customer churn.
Under the “tenure is less than or equal to 5” branch, diagram 237 provides two sub-branches of “tenure is less than or equal to 1” and “tenure is between 2 and 5,” with both having relatively higher correlation to customer churn as indicated by the light nodes. Under the “tenure 6” branch, diagram 237 provides for two sub-branches that indicate “gender is not Male” and “gender is Male,” but neither of these nodes has a relatively high correlation to customer churn as indicated by the dark nodes.
Again, execution platform 24 may also translate diagram 237 into explanation 239, which provides the following “key insights:”
- When Contract is Month-to-month, tenure is greater than 6, Churn is No (36% of total samples).
- When Contract is Month-to-month, tenure is less than or equal to 5, Churn is Yes (19% of total samples).
- When Contract is not Month-to-month, Churn is No (45% of total samples).
Using these key insights provided by explanation 239 and reviewing diagram 237 may enable user 16 to better understand diagram 237 in terms of analyzing customer churn.
Moreover, as can be seen throughout the examples of FIGS. 2A-2F, user 16 may enter a simple query 19 (such as input 236 shown in FIG. 2F) that results in host device 12 automatically, without any additional input from user 16, creating charts, diagrams, and other results in a visual manner to assist user 16 in interpreting results 25. Moreover, host device 12 automatically, without any additional input from user 16, may provide the explanation in plain language that allows user 16 to gain confidence with results 25, as well as link through to additional datasets, charts, models, etc.
In the example of FIG. 2G, screenshot 250 provides another example of interface 21 in which user 16 has entered input 251 to “Plot Chart Chart1A,” referring to the Chart1A discussed above with respect to explanation 221 shown in the example of FIG. 2C. Responsive to input 251, host device 12 may invoke execution platforms 24 to process intents 23 parsed from input 251 to generate a result 25 that includes chart 252 and explanation 254.
Execution platforms 24 may build and train simpler models (in terms of complexity) than the lightGBM model that result in different charts, such as chart 252, which presents bubbles indicative of “Yes” or “No” churn similar to the visualization of the decision trees. The size of the bubbles indicates the relative number of “Yes” or “No” customer churn. Execution platform 24 also provides explanation 254 that explains the formulation of chart 252 as follows.
- First, I distributed tenure into several buckets, each with size 20, and named the new column tenureint20.
- Then, I computed the count of records for each Churn, Contract and tenureint20, calling the output column NumRecords.
- Finally, I plotted a bubble chart with Contract as the x-axis, tenureint20 as the y-axis, the bubble color was set using Churn, NumRecords was used to set the size of the bubble.
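The first two steps quoted above might be sketched in plain Python as follows (the system itself may use Pandas internally; the row layout and function name here are illustrative assumptions, and the plotting step is omitted).

```python
from collections import Counter

# Hypothetical sketch of the chart-preparation steps in explanation 254:
# bucket tenure into bands of 20 (tenureInt20), then count records per
# (Churn, Contract, tenureInt20) combination as NumRecords.

def prepare_bubble_data(rows, bucket_size=20):
    """Return one record per (Churn, Contract, tenureInt20) group with
    its NumRecords count, ready to drive a bubble chart."""
    counts = Counter()
    for row in rows:
        bucket = (row["tenure"] // bucket_size) * bucket_size
        counts[(row["Churn"], row["Contract"], bucket)] += 1
    return [
        {"Churn": churn, "Contract": contract,
         "tenureInt20": bucket, "NumRecords": n}
        for (churn, contract, bucket), n in sorted(counts.items())
    ]
```

Each output record supplies one bubble: Contract on the x-axis, tenureInt20 on the y-axis, Churn as the color, and NumRecords as the size.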
As noted above, user 16 may set different levels of skill (from 1 to 3). Screenshot 250 shows interface 21 when configured to present information at a skill level of 1. In the example of FIG. 2H, screenshot 260 shows interface 21 when configured to present information at a skill level of 2 or 3. Screenshot 260 includes the same chart 252, but provides a much more comprehensive explanation 264.
Explanation264 states the following:
<not shown in FIG. 2H> Detected that 1 categorical column(s) have unique values that are nearly equal to the size of the dataset. Please consider dropping these columns from the analysis to get a more meaningful model. You can do this by asking:
Analyze Churn excluding customerID [which is a hyperlink to allow user 16 to quickly enter this as an input/query]
I've trained a Lightgbm Classifier and saved it as BestFit1. The model's validation scores are:
- Accuracy: 80.0%
- AUC: 0.85
The values for Churn are most impacted by: Contract, tenure, TotalCharges, MonthlyCharges. </not shown in FIG. 2H>
The bar chart shows the full impact scores; detailed scores are in the dataset ImpactList1.
I've also built 3 model(s) called SimpleFit1A (AUC: 0.71, Accuracy: 75.0), SimpleFit1B (AUC: 0.71, Accuracy: 75.0), SimpleFit1C (AUC: 0.66, Accuracy: 77.0) with increasing levels of detail.
Here are 3 charts to visualize the data:
- Chart1A (bubble chart with x-axis Contract, y-axis tenureInt20, bubble color Churn, bubble size NumRecords)
- Chart1B (scatter chart with x-axis Contract, y-axis NumRecords for each Churn)
- Chart1C (stacked bar chart with x-axis Contract, y-axis NumRecords for each tenureInt20)
The parameters used to model are saved in the dataset named PipelineReport [which is provided as a hyperlink to quickly allow user 16 to enter a query/input for the PipelineReport].
In this example, execution platforms 24 have provided more insight into how the models were constructed and also allowed user 16 to retrieve the PipelineReport. In the example of FIG. 2I, a screenshot 270 shows interface 21 after requesting the PipelineReport via input 271. Host device 12 may invoke execution platforms 24 to interface with databases 26 to retrieve pipeline report 272, returning pipeline report 272 as a result to client device 14. Pipeline report 272 may enable data scientists to better understand how MLM 44 are created, trained, and employed in order to retrieve the various results. In this manner, host device 12 may generate a pipeline report 272 explaining, in more technical detail and not necessarily in plain language, how host device 12 produced MLM 44, and output pipeline report 272 for review by user 16 (e.g., when the skill level is set to a level of 2 or 3). Such pipeline reports 272 may facilitate audits and other internal reviews.
FIGS. 3A-3G are additional diagrams illustrating interface 21 presented by host device 12 for accessing datasets in accordance with various aspects of the constrained natural language processing techniques. Referring first to the example of FIG. 3A, a screenshot 300 may represent another example of interface 21 in which user 16 has entered query 301 to “Load data from the file sba.csv,” which host device 12 processes using CNLP unit 22 to determine intents 23 that execution platforms 24 may transform (through invocation of transform units 34) into formal statements 27 that conform to the formal syntax associated with the datasets (loaded from file sba.csv). Execution platforms 24 may access databases 26 using statements 27 to obtain query results 29 from which preview 302 is formed. Execution platforms 24 return preview 302 as result 25 to client device 14.
In addition, user 16 may interface with client device 14 to enter query 303 that indicates in plain language to “[k]eep the rows where BorrState contains WI.” Again, host device 12 may invoke CNLP unit 22 to process query 303 to parse one or more intents 23 from query 303, providing the intents 23 to execution platforms 24. Execution platforms 24 may invoke transform unit 34 to transform intents 23 (representative of query 303) into one or more select statements 27 (such as SQL select statements) that conform to the formal syntax associated with the dataset.
Execution platforms 24 may access, based on select statements 27, the dataset to obtain query result 29 that includes the one or more dimensions of the dataset identified by query 303. That is, query 303 indicates to keep the rows where BorrState (which is an example label identifying a dimension) contains a value “WI.” As such, execution platforms 24 obtain any rows having a value of “WI” for the column BorrState, returning query results 29 from which preview 304 is generated.
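For instance, the intent parsed from query 303 might be rendered as a select statement along the following lines, illustrated here against an in-memory SQLite table. The table name, columns, and sample rows are assumptions for illustration, not the actual sba.csv contents:

```python
import sqlite3

# Hypothetical rows standing in for the loaded sba.csv dataset.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sba (BorrCity TEXT, BorrState TEXT)")
conn.executemany("INSERT INTO sba VALUES (?, ?)",
                 [("Madison", "WI"), ("Chicago", "IL"), ("Milwaukee", "WI")])

# One plausible SQL rendering of "Keep the rows where BorrState contains WI".
statement = "SELECT * FROM sba WHERE BorrState LIKE '%WI%'"
rows = conn.execute(statement).fetchall()
print(rows)
```

Only the rows whose BorrState value contains “WI” survive, matching the behavior of preview 304.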
In the example of FIG. 3B, a screenshot 310 illustrates interface 21 in which user 16 has entered a query 311 requesting to “[k]eep the columns BorrCity, BorrState, BorrStreet.” Again, host device 12 invokes, responsive to query 311, CNLP unit 22 to parse intents 23 from query 311, providing intents 23 to execution platform 24. Execution platform 24 may invoke transform unit 34 to transform intents 23 into select statements 27 that select the columns identified by query 311 (i.e., BorrCity, BorrState, BorrStreet in this example) using the formal syntax associated with the dataset.
Execution platform 24 may access databases 26 using select statements 27 to obtain query results 29 from which a preview 312 is formed. Host device 12 may then provide preview 312 via interface 21 to client device 14 as part of result 25, which presents preview 312.
In the example of FIG. 3C, a screenshot 315 of interface 21 shows that user 16 has entered a query 316 to “[k]eep the rows where ApprovalFiscalYear is greater than 2009.” Again, host device 12 invokes, responsive to query 316, CNLP unit 22 to parse intents 23 from query 316, providing intents 23 to execution platform 24. Execution platform 24 may invoke transform unit 34 to transform intents 23 into select statements 27 that select the rows identified by query 316 (i.e., rows having a value in the column labeled ApprovalFiscalYear with a value greater than 2009) and that conform to the SQL syntax.
Execution platform 24 may access databases 26 using select statements 27 to obtain query results 29 from which a preview is formed. Host device 12 may then provide the preview via interface 21 to client device 14 as part of result 25, which presents the preview.
In the example of FIG. 3D, a screenshot 320 of interface 21 shows that user 16 has entered a query 321 to “[c]ompute count of records, total GrossApproval, total GrossChargeOffAmount, maximum GrossApproval for each ApprovalFiscalYear, BorrState calling the output columns NumberOfLoansMade, TotalApproved, TotalLost, MaximumLoan.” Again, host device 12 invokes, responsive to query 321, CNLP unit 22 to parse intents 23 from query 321, providing intents 23 to execution platform 24. Execution platform 24 may invoke transform unit 34 to transform intents 23 into statements 27 that compute a number of loans made, a total amount approved, a total amount lost, and a maximum loan value and that conform to the SQL syntax.
In computing a number of loans made, a total amount approved, a total amount lost, and a maximum loan value, execution platform 24 may determine that another dataset can be used to satisfy query 321 (i.e., the sba_sample dataset in the example of FIG. 3D). Execution platform 24 may compose feedback 323 indicating that the sba_sample dataset may be used to answer query 321, providing feedback 323 as part of result 25. Execution platform 24 may then return to processing query 321 using the most recently loaded dataset (from the file sba.csv).
Execution platform 24 may access databases 26 (storing the most recently loaded dataset) using statements 27 to obtain query results 29 from which a preview 322 is formed. Host device 12 may then provide preview 322 via interface 21 to client device 14 as part of result 25, which presents preview 322.
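One plausible SQL rendering of the multi-part compute in query 321 is shown below, demonstrated against an in-memory SQLite table. The sample rows are invented for illustration, and the exact statement 27 the system would generate is an assumption:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE sba (
    ApprovalFiscalYear INT, BorrState TEXT,
    GrossApproval REAL, GrossChargeOffAmount REAL)""")
conn.executemany("INSERT INTO sba VALUES (?, ?, ?, ?)", [
    (2010, "WI", 100.0, 10.0),
    (2010, "WI", 300.0, 0.0),
    (2011, "IL", 200.0, 50.0),
])

# Aggregate per (ApprovalFiscalYear, BorrState), naming the output
# columns as query 321 requests.
statement = """
SELECT ApprovalFiscalYear, BorrState,
       COUNT(*)                  AS NumberOfLoansMade,
       SUM(GrossApproval)        AS TotalApproved,
       SUM(GrossChargeOffAmount) AS TotalLost,
       MAX(GrossApproval)        AS MaximumLoan
FROM sba
GROUP BY ApprovalFiscalYear, BorrState
"""
rows = conn.execute(statement).fetchall()
print(rows)
```

Note that the four computes are order-independent in this rendering, consistent with the multi-part query behavior described below.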
In addition, execution platform 24 may maintain a working dataset formed from query results 29, referencing the working dataset in response to subsequent queries 19. The working dataset may represent an example of the most recently loaded dataset. Execution platform 24 may determine, however, that a dimension (e.g., a row and/or column) of the working dataset is not present but is referenced in subsequent queries 19. Execution platform 24 may then invoke transform unit 34 to transform this additional query into one or more additional statements 27, and then automatically (without requiring any additional input from user 16) access, based on the one or more additional statements 27 and responsive to determining that the identified dimension is not present in previous query result 29, the underlying dataset to obtain an additional query result 29. Execution platform 24 may provide these additional query results 29 as part of result 25.
In the foregoing example, query 321 is relatively complex in terms of computing a number of different values across a number of different dimensions of the underlying dataset. Using a formal syntax, such as SQL, execution platforms 24 may return different results depending on the ordering of the different computes within query 321, which may reduce a confidence by user 16 in results 25. However, execution platform 24 may return the same results regardless of the order in which the various operations are to be performed.
In other words, query 321 may represent a multi-part query having multiple query statements (e.g., compute the number of loans made, compute a total amount approved, compute the total amount lost, compute the maximum loan value, etc.). CNLP unit 22 may however process query 321 by exposing language sub-surfaces 18 in a manner that removes ambiguity in defining query 321 such that multiple query statements forming the multi-part query are definable in any order, but result in the same intents 23. As such, execution platform 24 may transform multi-part query 321 into the same one or more statements 27 regardless of the order in which the multiple query statements are defined to form multi-part query 321.
In the example of FIG. 3E, a screenshot 330 of interface 21 shows that user 16 has entered a query 331 to “[k]eep the rows where GrossApproval is less than the aggregate value median GrossApproval.” Again, host device 12 invokes, responsive to query 331, CNLP unit 22 to parse intents 23 from query 331, providing intents 23 to execution platform 24. Execution platform 24 may invoke transform unit 34 to transform intents 23 into statements 27 that compute the median GrossApproval, that select rows with a GrossApproval less than the aggregate median GrossApproval, and that conform to the SQL syntax.
Execution platform 24 may access databases 26 (storing the most recently loaded dataset) using statements 27 to obtain query results 29 from which a preview 332 is formed. Host device 12 may then provide preview 332 via interface 21 to client device 14 as part of result 25, which presents preview 332.
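A minimal sketch of query 331's aggregate-then-filter behavior, run against an in-memory SQLite table (sample values are invented; SQLite lacks a built-in MEDIAN aggregate, so this sketch computes the aggregate first and binds it as a parameter, whereas an engine with a median aggregate could use a single statement with a scalar subquery):

```python
import sqlite3
from statistics import median

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sba (GrossApproval REAL)")
conn.executemany("INSERT INTO sba VALUES (?)", [(100.0,), (200.0,), (900.0,)])

# First compute the aggregate (median GrossApproval) ...
med = median(v for (v,) in conn.execute("SELECT GrossApproval FROM sba"))

# ... then keep only the rows below that aggregate value.
rows = conn.execute(
    "SELECT GrossApproval FROM sba WHERE GrossApproval < ?", (med,)
).fetchall()
print(rows)
```

The aggregate mentioned last in the plain-language query is evaluated first, mirroring the ordering behavior discussed below.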
In the foregoing example, query 331 is relatively complex in that various computations that are mentioned last in query 331 are required to be performed before selecting the rows. Using a formal syntax, such as SQL, execution platforms 24 may return different results depending on the ordering of the different query statements within query 331, which may reduce a confidence by user 16 in results 25. However, execution platform 24 may return the same results regardless of the order in which the various operations are to be performed.
In other words, query 331 may represent another example of a multi-part query having multiple query statements. CNLP unit 22 may however process query 331 by exposing language sub-surfaces 18 in a manner that removes ambiguity in defining query 331 such that multiple query statements forming the multi-part query are definable in any order, but result in the same intents 23. As such, execution platform 24 may transform multi-part query 331 into the same one or more statements 27 regardless of the order in which the multiple query statements are defined to form multi-part query 331.
In the example of FIG. 3F, a screenshot 340 of interface 21 shows that user 16 has entered a query 341 to “[c]reate a new window column TestWindow as average Fare computed over rows 10 before 3 after for each Parch sorted by Age.” Again, host device 12 invokes, responsive to query 341, CNLP unit 22 to parse intents 23 from query 341, providing intents 23 to execution platform 24. Execution platform 24 may invoke transform unit 34 to transform intents 23 into statements 27 that, using the formal SQL syntax, compute an average Fare over a window spanning 10 rows before through 3 rows after the current row, for each Parch, when the rows are sorted by the value of the Age column.
Execution platform 24 may access databases 26 (storing the most recently loaded dataset) using statements 27 to obtain query results 29 from which a preview 342 is formed. Host device 12 may then provide preview 342 via interface 21 to client device 14 as part of result 25, which presents preview 342.
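A plausible SQL rendering of the window column in query 341, sketched against an in-memory SQLite table (table name and sample rows are assumptions; SQLite window functions require SQLite 3.25 or later):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE titanic (Parch INT, Age REAL, Fare REAL)")
conn.executemany("INSERT INTO titanic VALUES (?, ?, ?)", [
    (0, 20.0, 10.0), (0, 30.0, 20.0), (0, 40.0, 30.0),
    (1, 25.0, 50.0),
])

# Average Fare over a frame of 10 rows before through 3 rows after,
# partitioned by Parch and ordered by Age, as query 341 requests.
statement = """
SELECT Parch, Age, Fare,
       AVG(Fare) OVER (
           PARTITION BY Parch ORDER BY Age
           ROWS BETWEEN 10 PRECEDING AND 3 FOLLOWING
       ) AS TestWindow
FROM titanic
"""
rows = conn.execute(statement).fetchall()
print(rows)
```

With only three Parch=0 rows, every frame covers the whole partition, so TestWindow equals the partition average there.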
Referring next to the example of FIG. 3G, execution platforms 24 may receive intents 23 that require accessing multiple datasets. Execution platforms 24 may determine, based on intents 23, whether query 19 includes query statements that identify dimensions of datasets other than the current working dataset. When query statements identify dimensions of datasets other than the current working dataset, execution platforms 24 may automatically (without requiring any further input from user 16) join the current working dataset and the other datasets to obtain a joined dataset.
A screenshot 350 shown in the example of FIG. 3G illustrates how multiple datasets may be joined (as shown via arrows) to obtain the joined dataset (which becomes the working dataset). Execution platforms 24 may automatically create join statements 27, which conform to the formal SQL syntax, to join multiple datasets and thereby obtain the joined dataset.
In some examples, execution platforms 24 may formulate a graph data structure based on the relationships (again, similar to what is shown in example screenshot 350), where the graph data structure has nodes representative of each of the multiple datasets and edges representative of the relationship between the dimensions of the multiple datasets. Execution platforms 24 may traverse, based on intents 23, the graph data structure to identify a shortest path through the graph data structure by which to satisfy underlying query 19, automatically joining the datasets along the shortest path to obtain the joined dataset.
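The shortest-path traversal described here can be sketched as a breadth-first search over a relationship graph; the dataset names and join-key labels below are invented for illustration:

```python
from collections import deque

# Hypothetical relationship graph: nodes are datasets, and each edge is
# labeled with the shared dimension on which two datasets can be joined.
edges = {
    "loans":     {"borrowers": "BorrID", "banks": "BankID"},
    "borrowers": {"loans": "BorrID", "states": "BorrState"},
    "banks":     {"loans": "BankID"},
    "states":    {"borrowers": "BorrState"},
}

def shortest_join_path(start, goal):
    """BFS over the dataset graph; returns the datasets to join, in order."""
    queue, seen = deque([[start]]), {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in edges.get(path[-1], {}):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None

# Satisfying a query that touches "loans" and "states" requires joining
# through "borrowers", the intermediate node on the shortest path.
path = shortest_join_path("loans", "states")
print(path)
```

Join statements 27 would then be generated along this path, one join per edge.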
Moreover, execution platforms 24 may identify, when traversing the graph data structure, additional paths through the graph data structure that would satisfy query 19. Responsive to identifying additional paths, execution platform 24 may formulate an indication identifying the additional path through the graph data structure, providing the indication as part of results 25 that are presented to user 16 via client device 14. The indication may include a link for a revised query that would result in traversing the additional path through the graph data structure. In other examples, the indication may indicate that there is ambiguity that user 16 needs to resolve before completing query 19.
Execution platform 24 may output, responsive to query 19, a diagram similar to that shown in screenshot 350 that identifies the relationships between the one or more dimensions of the datasets (which represents a visualization of the graph data structure in the example of FIG. 3G). That is, execution platform 24 may identify relationships between one or more dimensions of the multiple datasets, and generate the diagram illustrating the relationship between the one or more dimensions of the datasets. Execution platform 24 may output the diagram as shown in the example of screenshot 350.
In any event, execution platforms 24 may then access, using various statements 27, the joined dataset (assuming automatic joins occur via the shortest path through the graph data structure) to obtain query results 29. Execution platforms 24 may update result 25 to include the query results 29, where host device 12 provides result 25 via interface 21 to client device 14.
FIG. 4 is a block diagram illustrating a data structure 700 used to represent the language surface 18 shown in the example of FIG. 1 in accordance with various aspects of the techniques described in this disclosure. As shown in the example of FIG. 4, the data structure 700 may include a sub-surface root node 702 (“SS-RN 702”), a number of hierarchically arranged sub-surface child nodes 704A-704N (“SS-CNs 704”), 706A-706N (“SS-CNs 706”), 708A-708N (“SS-CNs 708”), and sub-surface leaf nodes 710A-710N (“SS-LNs 710”).
Sub-surface root node 702 may represent an initial starting node that exposes a basic sub-surface, thereby constraining exposure to the sub-surfaces dependent therefrom, such as SS-CNs 704. Initially, CNLP unit 22, for a new user 16, may only expose a limited set of patterns, each of which, as noted above, include identifiers, positional and keyword entities and ignored words. CNLP unit 22 may traverse from SS-RN 702 to one of SS-CNs 704 based on a context (which may refer to one of or a combination of a history of the current session, identified user capabilities, user preferences, etc.). As such, CNLP unit 22 may traverse hierarchically arranged nodes 702-710 (e.g., from SS-RN 702 to one of SS-CNs 704 to one of SS-CNs 706/708 to one of SS-LNs 710) in order to balance discoverability with cognitive overhead.
As described above, all of the patterns in the language surface may begin with an identifier, and these identifiers are reused across patterns to group them into language sub-surfaces 702-710. For example, all the data visualization intents begin with “Plot.” When beginning to enter an utterance in the text box, user 16 may view an auto-complete suggestions list containing one of the first identifiers (like “Plot”, “Load” etc.). Once user 16 completes the first identifier, CNLP unit 22 may only expose other patterns belonging to that language sub-surface as further completions. In the above example, only when user 16 specifies “Plot” as the first word does CNLP unit 22 invoke the auto-complete mechanism to propose various chart formats (such as line chart, bubble chart, histogram, etc.). Responsive to user 16 specifying one of the auto-complete suggestions (e.g., line chart), CNLP unit 22 may expose the entities that user 16 would need to specify to configure the chart (like the columns on the axes, colors, sliders, etc.).
Conceptually, the set of all utterances (the language surface) may be considered as being decomposed into subsets (sub-surfaces) which are arranged hierarchically (based on the identifiers and entities in the utterances), where each level of the hierarchy contains all the utterances/patterns that form the underlying subsets. Using the auto-complete mechanism, user 16 navigates this hierarchy top-to-bottom, one step at a time. At each step, user 16 may only be shown a small set of next-steps as suggestions. This allows CNLP unit 22 to balance discoverability with cognitive overhead. In other words, this aspect of the techniques concerns how to structure the patterns using the pattern specification language: the design choices here (like “patterns begin with identifiers”) are not imposed by the pattern specification language itself.
Additionally, certain language sub-surfaces are exposed only when corresponding conditions are met. For example, CNLP unit 22 may only expose the data visualization sub-surface when there is at least one dataset already loaded. This is achieved by associating each pattern with a function/procedure that looks at the current context (the history of this session, the capabilities and preferences of the user, etc.) to decide whether that pattern is exposed at the current time.
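The guarded, hierarchical exposure of patterns might be sketched as follows; the pattern table, completions, and guard conditions below are illustrative assumptions, not the actual pattern specification language:

```python
# Each pattern begins with an identifier and carries a guard
# function/procedure that inspects the current context to decide
# whether the pattern (and its sub-surface) is exposed right now.
patterns = {
    "Load": {"next": ["data from the file <name>"],
             "guard": lambda ctx: True},
    "Plot": {"next": ["line chart", "bubble chart", "histogram"],
             "guard": lambda ctx: ctx.get("datasets_loaded", 0) > 0},
}

def suggestions(prefix_words, ctx):
    """Return auto-complete suggestions for the next step in the hierarchy."""
    if not prefix_words:                     # top of the hierarchy
        return [w for w, p in patterns.items() if p["guard"](ctx)]
    first = prefix_words[0]
    p = patterns.get(first)
    if p and p["guard"](ctx):                # descend one sub-surface
        return p["next"]
    return []
```

With no dataset loaded, only “Load” is suggested; once a dataset is loaded, completing “Plot” exposes the chart-format sub-surface.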
FIG. 5 is a block diagram illustrating example components of the host device 12 and/or the client device 14 shown in the example of FIG. 1. In the example of FIG. 5, the device 12/14 includes a processor 412, a graphics processing unit (GPU) 414, system memory 416, a display processor 418, one or more integrated speakers 105, a display 103, a user interface 420, and a transceiver module 422. In examples where the source device 12 is a mobile device, the display processor 418 is a mobile display processor (MDP). In some examples, such as examples where the source device 12 is a mobile device, the processor 412, the GPU 414, and the display processor 418 may be formed as an integrated circuit (IC).
For example, the IC may be considered as a processing chip within a chip package and may be a system-on-chip (SoC). In some examples, two of the processor 412, the GPU 414, and the display processor 418 may be housed together in the same IC and the other in a different integrated circuit (i.e., different chip packages), or all three may be housed in different ICs or on the same IC. However, it may be possible that the processor 412, the GPU 414, and the display processor 418 are all housed in different integrated circuits in examples where the source device 12 is a mobile device.
Examples of the processor 412, the GPU 414, and the display processor 418 include, but are not limited to, one or more digital signal processors (DSPs), general purpose microprocessors, application specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), or other equivalent integrated or discrete logic circuitry. The processor 412 may be the central processing unit (CPU) of the source device 12. In some examples, the GPU 414 may be specialized hardware that includes integrated and/or discrete logic circuitry that provides the GPU 414 with massive parallel processing capabilities suitable for graphics processing. In some instances, GPU 414 may also include general purpose processing capabilities, and may be referred to as a general-purpose GPU (GPGPU) when implementing general purpose processing tasks (i.e., non-graphics related tasks). The display processor 418 may also be specialized integrated circuit hardware that is designed to retrieve image content from the system memory 416, compose the image content into an image frame, and output the image frame to the display 103.
The processor 412 may execute various types of the applications 20. Examples of the applications 20 include web browsers, e-mail applications, spreadsheets, video games, other applications that generate viewable objects for display, or any of the application types listed in more detail above. The system memory 416 may store instructions for execution of the applications 20. The execution of one of the applications 20 on the processor 412 causes the processor 412 to produce graphics data for image content that is to be displayed. The processor 412 may transmit graphics data of the image content to the GPU 414 for further processing based on instructions or commands that the processor 412 transmits to the GPU 414.
The processor 412 may communicate with the GPU 414 in accordance with a particular application processing interface (API). Examples of such APIs include the DirectX API by Microsoft®, the OpenGL® or OpenGL ES® by the Khronos group, and the OpenCL™ API; however, aspects of this disclosure are not limited to the DirectX, the OpenGL, or the OpenCL APIs, and may be extended to other types of APIs. Moreover, the techniques described in this disclosure are not required to function in accordance with an API, and the processor 412 and the GPU 414 may utilize any technique for communication.
The system memory 416 may be the memory for the source device 12. The system memory 416 may comprise one or more computer-readable storage media. Examples of the system memory 416 include, but are not limited to, a random-access memory (RAM), an electrically erasable programmable read-only memory (EEPROM), flash memory, or other medium that can be used to carry or store desired program code in the form of instructions and/or data structures and that can be accessed by a computer or a processor.
In some examples, the system memory 416 may include instructions that cause the processor 412, the GPU 414, and/or the display processor 418 to perform the functions ascribed in this disclosure to the processor 412, the GPU 414, and/or the display processor 418. Accordingly, the system memory 416 may be a computer-readable storage medium having instructions stored thereon that, when executed, cause one or more processors (e.g., the processor 412, the GPU 414, and/or the display processor 418) to perform various functions.
The system memory 416 may include a non-transitory storage medium. The term “non-transitory” indicates that the storage medium is not embodied in a carrier wave or a propagated signal. However, the term “non-transitory” should not be interpreted to mean that the system memory 416 is non-movable or that its contents are static. As one example, the system memory 416 may be removed from the source device 12 and moved to another device. As another example, memory, substantially similar to the system memory 416, may be inserted into the devices 12/14. In certain examples, a non-transitory storage medium may store data that can, over time, change (e.g., in RAM).
The user interface 420 may represent one or more hardware or virtual (meaning a combination of hardware and software) user interfaces by which a user may interface with the source device 12. The user interface 420 may include physical buttons, switches, toggles, lights or virtual versions thereof. The user interface 420 may also include physical or virtual keyboards, touch interfaces such as a touchscreen, haptic feedback, and the like.
The processor 412 may include one or more hardware units (including so-called “processing cores”) configured to perform all or some portion of the operations discussed above with respect to one or more of the various units/modules/etc. The transceiver module 422 may represent a unit configured to establish and maintain the wireless connection between the devices 12/14. The transceiver module 422 may represent one or more receivers and one or more transmitters capable of wireless communication in accordance with one or more wireless communication protocols.
In each of the various instances described above, it should be understood that the devices 12/14 may perform a method or otherwise comprise means to perform each step of the method for which the devices 12/14 are described above as performing. In some instances, the means may comprise one or more processors. In some instances, the one or more processors may represent a special purpose processor configured by way of instructions stored to a non-transitory computer-readable storage medium. In other words, various aspects of the techniques in each of the sets of encoding examples may provide for a non-transitory computer-readable storage medium having stored thereon instructions that, when executed, cause the one or more processors to perform the method for which the devices 12/14 have been configured to perform.
FIG. 6 is a flowchart illustrating example operation of the host device of FIG. 1 in performing various aspects of the techniques described in this disclosure. Initially, CNLP unit 22 of host device 12 may expose a language sub-surface, e.g., language sub-surface 18A, specifying a natural language containment hierarchy defining a grammar for a natural language as a hierarchical arrangement of language sub-surfaces 18 (800). Server 28 of host device 12 may receive a query 19 via CNL sub-surface 18 exposed by CNLP unit 22 via interface 21 that includes a plain language request for data stored to databases 26. Such queries 19 may, in other words, request access to databases 26 so as to retrieve data stored to the databases 26 as a dataset. In this way, server 28 may receive a query 19 to access the dataset (802).
As noted above, such queries 19 may conform to a plain conversational language having various inputs that are translated, by CNLP unit 22, into intents 23. Server 28 may redirect intents 23 to execution platforms 24 that apply transformations to the intents 23 that transform intents 23 (representative of queries 19) into one or more statements 27 that conform to a formal syntax associated with the dataset stored to databases 26 (804). Execution platforms 24 may access, based on statements 27, the dataset stored to databases 26 to obtain a query result 29 providing portions of the dataset relevant to initial queries 19 (806). Execution platforms 24 may obtain query result 29 that execution platforms 24 may use when forming results 25. Execution platforms 24 may output results 25 (808).
FIG. 7 is another flowchart illustrating example operation of the host device of FIG. 1 in performing additional aspects of the techniques described in this disclosure. As described above, execution platforms 24 of host device 12 apply a plurality of MLM 44 to a multi-dimensional dataset stored to databases 26 to obtain query results 29 output by each of the plurality of MLM 44 (900). Execution platforms 24 may next determine a correlation of one or more dimensions (e.g., a selected row or column) of the multi-dimensional datasets stored to databases 26 to query results 29—provided in response to transformed intents 23 (which are represented by statements 27)—output by MLM 44 (902).
Execution platforms 24 may invoke multiple MLM 44 responsive to intents 23 (or transformed intents 23 represented by statements 27) that analyze query results 29 resulting from accessing the datasets, based on statements 27, to obtain results 25. Execution platforms 24 may select, based on the correlation determined between the one or more dimensions and query result 29 output by each of the plurality of MLM 44, a subset of MLM 44 to obtain result 25 for each of the subset of the plurality of MLM 44 (904). Execution platforms 24 may output result 25 for each of the one or more of MLM 44 to server 28, which may provide result 25 via interface 21 (906).
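The apply/correlate/select steps (900-904) might be sketched roughly as follows; the correlation measure (Pearson) and the relevance threshold are assumptions, since the disclosure does not fix a specific statistic:

```python
import statistics

def pearson(xs, ys):
    """Pearson correlation; returns 0.0 for a degenerate (constant) series."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = (sum((x - mx) ** 2 for x in xs)
           * sum((y - my) ** 2 for y in ys)) ** 0.5
    return num / den if den else 0.0

def select_models(models, dataset, threshold=0.5):
    """Apply each model, score dimension-to-output correlation, and keep
    only models whose strongest dimension correlation clears the threshold."""
    selected = {}
    for name, model in models.items():
        preds = [model(row) for row in dataset]          # step 900
        score = max(abs(pearson([row[d] for row in dataset], preds))
                    for d in dataset[0])                 # step 902
        if score >= threshold:                           # step 904
            selected[name] = preds
    return selected

# Illustrative use: a model tracking the "tenure" dimension is kept,
# while a constant model (no correlation with any dimension) is dropped.
data = [{"tenure": 1, "charges": 5},
        {"tenure": 2, "charges": 3},
        {"tenure": 3, "charges": 9}]
models = {"m1": lambda r: r["tenure"] * 2.0, "m2": lambda r: 1.0}
sel = select_models(models, data)
```

The surviving subset would then feed the result-formation and output steps (906).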
In this way, various aspects of the techniques may enable the following examples.
Example 1A. A device configured to interpret a multi-dimensional dataset, the device comprising: a memory configured to store the multi-dimensional dataset; and one or more processors configured to: apply a plurality of machine learning models to the multi-dimensional dataset to obtain a result output by each of the plurality of machine learning models; determine a correlation of one or more dimensions of the multi-dimensional dataset to the results output by each of the plurality of machine learning models; select, based on the correlation determined between the one or more dimensions and the result output by each of the plurality of machine learning models, a subset of the plurality of machine learning models to obtain the result for each of the subset of the plurality of machine learning models; and output the result for each of the subset of the plurality of machine learning models.
Example 2A. The device of example 1A, wherein the one or more processors are configured to output the result as a sentence using plain language.
Example 3A. The device of any combination of examples 1A and 2A, wherein the one or more processors are configured to output the result for at least one of the subset of the plurality of machine learning models as a graph identifying a relevance of each of the one or more dimensions to the result for each of the subset of the plurality of machine learning models.
Example 4A. The device of example 3A, wherein the graph comprises an impact graph.
Example 5A. The device of any combination of examples 1A-4A, wherein the one or more processors are configured to output the result for each of the subset of the plurality of machine learning models as a graphical representation of a decision tree.
Example 6A. The device of any combination of examples 1A-5A, wherein the one or more processors are further configured to: determine, based on a comparison of the correlation determined between the one or more dimensions and the result output by each of the plurality of machine learning models to a relevance threshold, one or more low relevance dimensions of the multi-dimensional dataset that have low relevance to the result output by each of the plurality of machine learning models; and output an indication explaining that the one or more low relevance dimensions have low relevance to the result.
Example 7A. The device of example 6A, wherein the one or more processors are configured to output a sentence in plain language that explains the one or more low relevance dimensions having low relevance to the result.
Example 8A. The device of any combination of examples 1A-7A, wherein the one or more processors are further configured to refrain from transforming the one or more dimensions of the multi-dimensional dataset prior to application of the plurality of machine learning models.
Example 9A. The device of any combination of examples 1A-8A, wherein the one or more processors are further configured to: determine, based on the results for each of the one or more of the plurality of machine learning models, one or more of a plurality of charts to explain the corresponding result; rank the one or more of the plurality of charts to identify a highest ranked chart; select the highest ranked chart; and output the highest ranked chart as a visual chart.
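The chart-ranking step of Example 9A can be approximated with a simple rule-based scorer; the scoring rules and chart names below are hypothetical stand-ins for whatever ranking the device applies.

```python
# Illustrative sketch of Example 9A: score candidate chart types for a
# result and pick the highest-ranked one. The scoring rules here are
# hypothetical stand-ins for whatever ranking the device applies.
def pick_chart(result_is_timeseries, n_dims):
    candidates = [("table", 1)]          # always-applicable fallback
    if result_is_timeseries:
        candidates.append(("line chart", 3))
    if n_dims == 1:
        candidates.append(("histogram", 2))
    elif n_dims == 2:
        candidates.append(("scatter plot", 2))
    # Rank by score and select the highest-ranked chart.
    return max(candidates, key=lambda c: c[1])[0]
```

A production ranker would score against richer signals (cardinality, data types, result shape), but the shape of the algorithm — enumerate candidates, rank, select the top, render — matches the recited steps.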
Example 10A. The device of example 9A, wherein the one or more processors are further configured to: generate an explanation in plain language explaining a formulation of the visual chart; and output the explanation.
Example 11A. The device of any combination of examples 1A-10A, wherein the one or more processors are further configured to: generate a pipeline report explaining how the device produced the plurality of machine learning models; and output the pipeline report.
Example 12A. A method of interpreting a multi-dimensional dataset, the method comprising: applying a plurality of machine learning models to the multi-dimensional dataset to obtain a result output by each of the plurality of machine learning models; determining a correlation of one or more dimensions of the multi-dimensional dataset to the results output by each of the plurality of machine learning models; selecting, based on the correlation determined between the one or more dimensions and the result output by each of the plurality of machine learning models, a subset of the plurality of machine learning models to obtain the result for each of the subset of the plurality of machine learning models; and outputting the result for each of the subset of the plurality of machine learning models.
Example 13A. The method of example 12A, wherein outputting the result comprises outputting the result as a sentence using plain language.
Example 14A. The method of any combination of examples 12A and 13A, wherein outputting the result comprises outputting the result for at least one of the subset of the plurality of machine learning models as a graph identifying a relevance of each of the one or more dimensions to the result for each of the subset of the plurality of machine learning models.
Example 15A. The method of example 14A, wherein the graph comprises an impact graph.
Example 16A. The method of any combination of examples 12A-15A, wherein outputting the result comprises outputting the result for each of the subset of the plurality of machine learning models as a graphical representation of a decision tree.
Example 17A. The method of any combination of examples 12A-16A, further comprising: determining, based on a comparison of the correlation determined between the one or more dimensions and the result output by each of the plurality of machine learning models to a relevance threshold, one or more low relevance dimensions of the multi-dimensional dataset that have low relevance to the result output by each of the plurality of machine learning models; and outputting an indication explaining that the one or more low relevance dimensions have low relevance to the result.
Example 18A. The method of example 17A, wherein outputting the indication comprises outputting a sentence in plain language that explains that the one or more low relevance dimensions have low relevance to the result.
Example 19A. The method of any combination of examples 12A-18A, further comprising refraining from transforming the one or more dimensions of the multi-dimensional dataset prior to application of the plurality of machine learning models.
Example 20A. The method of any combination of examples 12A-19A, further comprising: determining, based on the results for each of the one or more of the plurality of machine learning models, one or more of a plurality of charts to explain the corresponding result; ranking the one or more of the plurality of charts to identify a highest ranked chart; selecting the highest ranked chart; and outputting the highest ranked chart as a visual chart.
Example 21A. The method of example 20A, further comprising: generating an explanation in plain language explaining a formulation of the visual chart; and outputting the explanation.
Example 22A. The method of any combination of examples 12A-21A, further comprising: generating a pipeline report explaining how the plurality of machine learning models was produced; and outputting the pipeline report.
Example 23A. A non-transitory computer-readable storage medium storing instructions that, when executed, cause one or more processors to: apply a plurality of machine learning models to a multi-dimensional dataset to obtain a result output by each of the plurality of machine learning models; determine a correlation of one or more dimensions of the multi-dimensional dataset to the result output by each of the plurality of machine learning models; select, based on the correlation determined between the one or more dimensions and the result output by each of the plurality of machine learning models, a subset of the plurality of machine learning models to obtain the result for each of the subset of the plurality of machine learning models; and output the result for each of the subset of the plurality of machine learning models.
Example 1B. A device configured to access a dataset, the device comprising: a memory configured to store the dataset; and one or more processors configured to: expose a language sub-surface specifying a natural language containment hierarchy defining a grammar for a natural language as a hierarchical arrangement of a plurality of language sub-surfaces; receive a query to access the dataset, the query conforming to a portion of the natural language provided by the exposed language sub-surface; transform the query into one or more statements that conform to a formal syntax associated with the dataset; access, based on the one or more statements, the dataset to obtain a query result; and output the query result.
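The transformation step of Example 1B can be illustrated with a minimal sketch: a single "keep <dimension>, <dimension>, ..." form from an assumed language sub-surface is rewritten into a statement in a formal syntax (SQL here). The grammar fragment, function names, and table name are hypothetical; an actual sub-surface would expose many such forms.

```python
# Illustrative sketch of Example 1B's transformation step: a query in
# an assumed "keep <dim>, <dim>, ..." sub-surface form is rewritten
# into a formal-syntax statement. Grammar and names are hypothetical.
import re

def to_sql(query, table):
    m = re.match(r"keep\s+(.+)", query.strip(), re.IGNORECASE)
    if not m:
        # Queries outside the exposed sub-surface are rejected rather
        # than guessed at, which is what constrains the grammar.
        raise ValueError("query falls outside the exposed sub-surface")
    cols = ", ".join(c.strip() for c in m.group(1).split(","))
    return f"SELECT {cols} FROM {table}"
```

The same transformation could target a Pandas dataframe expression instead of SQL, consistent with Example 11B.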
Example 2B. The device of example 1B, wherein the one or more processors are configured to: receive the query that identifies one or more dimensions of the dataset to keep; transform the query into one or more select statements that conform to the formal syntax associated with the dataset; and access, based on the one or more select statements, the dataset to obtain the query result that includes the one or more dimensions of the dataset identified by the query.
Example 3B. The device of any combination of examples 1B and 2B, wherein the one or more processors are further configured to: receive an additional query to access the dataset, the additional query conforming to the portion of the language provided by the exposed language sub-surface and identifying a dimension in the dataset that is not present in the query result; determine that the identified dimension is not present in the query result; transform the additional query into one or more additional statements that conform to the formal syntax; access, based on the one or more additional statements and responsive to determining that the identified dimension is not present in the query result, the dataset to obtain an additional query result; and output the additional query result along with an indication that the additional query result was obtained from the dataset rather than the query result.
Example 4B. The device of any combination of examples 1B-3B, wherein the dataset is a dataset of a plurality of datasets, wherein the one or more processors are further configured to: determine whether the query applies to multiple datasets of the plurality of datasets; and output, responsive to determining that the query applies to the multiple datasets of the plurality of datasets, an indication that the query applies to the multiple datasets.
Example 5B. The device of any combination of examples 1B-4B, wherein the query includes a multi-part query having multiple query statements, wherein the exposed language sub-surface removes ambiguity in defining the query such that the multiple query statements forming the multi-part query are definable in any order, and wherein the one or more processors are configured to transform the multi-part query into the same one or more statements regardless of the order in which the multiple query statements are defined to form the multi-part query.
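One way to realize the order invariance recited in Example 5B — sketched here under assumptions, not as the claimed implementation — is to impose a canonical ordering on the parts of a multi-part query before transformation, so the same statements are emitted no matter how the user ordered the parts. The part keywords below are hypothetical.

```python
# Illustrative sketch of Example 5B: canonically order the parts of a
# multi-part query so the downstream transformation is order-invariant.
# The part keywords ("keep", "filter", "sort") are hypothetical.
CANONICAL = {"keep": 0, "filter": 1, "sort": 2}

def canonicalize(parts):
    """parts: list of sub-surface statements such as 'keep a, b'."""
    return sorted(parts, key=lambda p: CANONICAL[p.split()[0]])
```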
Example 6B. The device of any combination of examples 1B-5B, wherein the dataset is a first dataset of a plurality of datasets, and wherein the one or more processors are further configured to: determine whether the query includes query statements that identify dimensions of a second dataset of the plurality of datasets; automatically join, responsive to determining that the query includes query statements that identify dimensions of the second dataset, the first dataset and the second dataset to obtain a joined dataset; and access, based on the one or more statements, the joined dataset to obtain the query result.
Example 7B. The device of any combination of examples 1B-6B, wherein the dataset is a dataset of a plurality of datasets, and wherein the one or more processors are further configured to: identify relationships between one or more dimensions of the plurality of datasets; generate a diagram illustrating the relationships between the one or more dimensions of the plurality of datasets; and output the diagram.
Example 8B. The device of any combination of examples 1B-7B, wherein the dataset is a dataset of a plurality of datasets, and wherein the one or more processors are further configured to: identify relationships between one or more dimensions of the plurality of datasets; generate, based on the identified relationships, a graph data structure having nodes representative of each of the plurality of datasets and edges representative of the relationships between the one or more dimensions of the plurality of datasets; traverse, based on the query, the graph data structure to identify a shortest path through the graph data structure by which to satisfy the query; automatically join the dataset and one or more additional datasets of the plurality of datasets identified along the shortest path to obtain a joined dataset; and access, based on the one or more statements, the joined dataset to obtain the query result.
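The traversal in Example 8B can be sketched as a breadth-first search over a graph whose nodes are datasets and whose edges are shared dimensions, returning the shortest chain of joins connecting the queried datasets; the dataset names in the usage example are hypothetical.

```python
# Illustrative sketch of Example 8B: breadth-first search over a graph
# whose nodes are datasets and whose edges are shared dimensions,
# returning the shortest chain of joins that connects two datasets.
from collections import deque

def shortest_join_path(edges, start, goal):
    """edges: {dataset: iterable of datasets sharing a dimension}."""
    prev = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:                 # reconstruct the join chain
            path = []
            while node is not None:
                path.append(node)
                node = prev[node]
            return path[::-1]
        for neighbor in edges.get(node, ()):
            if neighbor not in prev:
                prev[neighbor] = node
                queue.append(neighbor)
    return None                          # no join chain satisfies query
```

Because breadth-first search visits nodes in order of path length, the first path found to the goal is a shortest one; any longer paths found by continuing the traversal correspond to the additional paths of Example 9B.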
Example 9B. The device of example 8B, wherein the one or more processors are further configured to: traverse, based on the query, the graph data structure to identify an additional path through the graph data structure that would satisfy the query; and output an indication identifying the additional path through the graph data structure.
Example 10B. The device of example 9B, wherein the indication is a link for a revised query that would result in traversing the additional path through the graph data structure.
Example 11B. The device of any combination of examples 1B-10B, wherein the formal syntax includes a structured query language syntax or a Pandas dataframe syntax.
Example 12B. A method of accessing a dataset, the method comprising: exposing a language sub-surface specifying a natural language containment hierarchy defining a grammar for a natural language as a hierarchical arrangement of a plurality of language sub-surfaces; receiving a query to access the dataset, the query conforming to a portion of the language provided by the exposed language sub-surface; transforming the query into one or more statements that conform to a formal syntax associated with the dataset; accessing, based on the one or more statements, the dataset to obtain a query result; and outputting the query result.
Example 13B. The method of example 12B, wherein receiving the query comprises receiving the query that identifies one or more dimensions of the dataset to keep, wherein transforming the query comprises transforming the query into one or more select statements that conform to the formal syntax associated with the dataset, and wherein accessing the dataset comprises accessing, based on the one or more select statements, the dataset to obtain the query result that includes the one or more dimensions of the dataset identified by the query.
Example 14B. The method of any combination of examples 12B and 13B, further comprising: receiving an additional query to access the dataset, the additional query conforming to the portion of the language provided by the exposed language sub-surface and identifying a dimension in the dataset that is not present in the query result; determining that the identified dimension is not present in the query result; transforming the additional query into one or more additional statements that conform to the formal syntax; accessing, based on the one or more additional statements and responsive to determining that the identified dimension is not present in the query result, the dataset to obtain an additional query result; and outputting the additional query result along with an indication that the additional query result was obtained from the dataset rather than the query result.
Example 15B. The method of any combination of examples 12B-14B, wherein the dataset is a dataset of a plurality of datasets, and wherein the method further comprises: determining whether the query applies to multiple datasets of the plurality of datasets; and outputting, responsive to determining that the query applies to the multiple datasets of the plurality of datasets, an indication that the query applies to the multiple datasets.
Example 16B. The method of any combination of examples 12B-15B, wherein the query includes a multi-part query having multiple query statements, wherein the exposed language sub-surface removes ambiguity in defining the query such that the multiple query statements forming the multi-part query are definable in any order, and wherein transforming the query comprises transforming the multi-part query into the same one or more statements regardless of the order in which the multiple query statements are defined to form the multi-part query.
Example 17B. The method of any combination of examples 12B-16B, wherein the dataset is a first dataset of a plurality of datasets, and wherein the method further comprises: determining whether the query includes query statements that identify dimensions of a second dataset of the plurality of datasets; automatically joining, responsive to determining that the query includes query statements that identify dimensions of the second dataset, the first dataset and the second dataset to obtain a joined dataset; and accessing, based on the one or more statements, the joined dataset to obtain the query result.
Example 18B. The method of any combination of examples 12B-17B, wherein the dataset is a dataset of a plurality of datasets, and wherein the method further comprises: identifying relationships between one or more dimensions of the plurality of datasets; generating a diagram illustrating the relationships between the one or more dimensions of the plurality of datasets; and outputting the diagram.
Example 19B. The method of any combination of examples 12B-18B, wherein the dataset is a dataset of a plurality of datasets, and wherein the method further comprises: identifying relationships between one or more dimensions of the plurality of datasets; generating, based on the identified relationships, a graph data structure having nodes representative of each of the plurality of datasets and edges representative of the relationships between the one or more dimensions of the plurality of datasets; traversing, based on the query, the graph data structure to identify a shortest path through the graph data structure by which to satisfy the query; automatically joining the dataset and one or more additional datasets of the plurality of datasets identified along the shortest path to obtain a joined dataset; and accessing, based on the one or more statements, the joined dataset to obtain the query result.
Example 20B. The method of example 19B, further comprising: traversing, based on the query, the graph data structure to identify an additional path through the graph data structure that would satisfy the query; and outputting an indication identifying the additional path through the graph data structure.
Example 21B. The method of example 20B, wherein the indication is a link for a revised query that would result in traversing the additional path through the graph data structure.
Example 22B. The method of any combination of examples 12B-21B, wherein the formal syntax includes a structured query language syntax or a Pandas dataframe syntax.
Example 23B. A non-transitory computer-readable storage medium storing instructions that, when executed, cause one or more processors to: expose a language sub-surface specifying a natural language containment hierarchy defining a grammar for a natural language as a hierarchical arrangement of a plurality of language sub-surfaces; receive a query to access a dataset, the query conforming to a portion of the language provided by the exposed language sub-surface; transform the query into one or more statements that conform to a formal syntax associated with the dataset; access, based on the one or more statements, the dataset to obtain a query result; and output the query result.
In one or more examples, the functions described may be implemented in hardware, software, firmware, or any combination thereof. If implemented in software, the functions may be stored on or transmitted over as one or more instructions or code on a computer-readable medium and executed by a hardware-based processing unit. Computer-readable media may include computer-readable storage media, which corresponds to a tangible medium such as data storage media. Data storage media may be any available media that can be accessed by one or more computers or one or more processors to retrieve instructions, code and/or data structures for implementation of the techniques described in this disclosure. A computer program product may include a computer-readable medium.
Likewise, in each of the various instances described above, it should be understood that the host device 12 may perform a method or otherwise comprise means to perform each step of the method for which the host device 12 is configured to perform. In some instances, the means may comprise one or more processors. In some instances, the one or more processors may represent a special purpose processor configured by way of instructions stored to a non-transitory computer-readable storage medium. In other words, various aspects of the techniques in each of the sets of encoding examples may provide for a non-transitory computer-readable storage medium having stored thereon instructions that, when executed, cause the one or more processors to perform the method for which the host device 12 has been configured to perform.
By way of example, and not limitation, such computer-readable storage media can comprise RAM, ROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage, or other magnetic storage devices, flash memory, or any other medium that can be used to store desired program code in the form of instructions or data structures and that can be accessed by a computer. It should be understood, however, that computer-readable storage media and data storage media do not include connections, carrier waves, signals, or other transitory media, but are instead directed to non-transitory, tangible storage media. Disk and disc, as used herein, includes compact disc (CD), laser disc, optical disc, digital versatile disc (DVD), floppy disk and Blu-ray disc, where disks usually reproduce data magnetically, while discs reproduce data optically with lasers. Combinations of the above should also be included within the scope of computer-readable media.
Instructions may be executed by one or more processors, such as one or more digital signal processors (DSPs), general purpose microprocessors, application specific integrated circuits (ASICs), field programmable gate arrays (FPGAs), or other equivalent integrated or discrete logic circuitry. Accordingly, the term “processor,” as used herein may refer to any of the foregoing structure or any other structure suitable for implementation of the techniques described herein. In addition, in some examples, the functionality described herein may be provided within dedicated hardware and/or software modules configured for encoding and decoding or incorporated in a combined codec. Also, the techniques could be fully implemented in one or more circuits or logic elements.
The techniques of this disclosure may be implemented in a wide variety of devices or apparatuses, including a wireless handset, an integrated circuit (IC) or a set of ICs (e.g., a chip set). Various components, modules, or units are described in this disclosure to emphasize functional aspects of devices configured to perform the disclosed techniques, but do not necessarily require realization by different hardware units. Rather, as described above, various units may be combined in a codec hardware unit or provided by a collection of interoperative hardware units, including one or more processors as described above, in conjunction with suitable software and/or firmware.
Various aspects of the techniques have been described. These and other aspects of the techniques are within the scope of the following claims.