TECHNIQUES FOR OPTIMIZED ROUTING OF INPUTS TO MACHINE LEARNING MODELS
CROSS-REFERENCE TO RELATED APPLICATIONS
[0001] This application claims benefit of the United States Provisional Patent Application titled “ROUTING OVER LLMS USING PROXY METRICS FOR RELATIVE QUALITY ESTIMATION,” filed December 5, 2023, and having serial number 63/606,339, and claims benefit of the United States Patent Application titled “TECHNIQUES FOR OPTIMIZED ROUTING OF INPUTS TO MACHINE LEARNING MODELS,” filed September 11, 2024, and having serial number 18/882,561. The subject matter of these related applications is hereby incorporated herein by reference.
BACKGROUND
Field of the Various Embodiments
[0002] Embodiments of the present disclosure relate generally to computer science, machine learning, and artificial intelligence (AI) and, more specifically, to techniques for optimized routing of inputs to machine learning models.
Description of the Related Art
[0003] Machine learning can be used to discover trends, patterns, relationships, and/or other attributes related to large sets of complex, interconnected, and/or multidimensional data. To glean insights from large data sets, regression models, artificial neural networks, support vector machines, decision trees, naive Bayes classifiers, and/or other types of machine learning models can be trained using input-output pairs in the data. In turn, the trained machine learning models can be used to guide decisions and/or perform actions related to the data or similar data.
[0004] Within machine learning, neural networks can be trained to perform a wide range of tasks with a high degree of accuracy. Neural networks are therefore becoming more widely adopted in the field of artificial intelligence. Neural networks can have a diverse range of network architectures. In more complex scenarios, the network architecture for a neural network can include many different types of layers with an intricate topology of connections among the different layers. For example, some neural networks can have ten or more layers, where each layer can include hundreds or thousands of neurons and can be coupled to one or more other layers via hundreds or thousands of individual connections. Weights and biases associated with those connections, which are also sometimes referred to as “parameters” of the neural network, control the strength of the connections and affect the activation of neurons.
[0005] One drawback of conventional machine learning models, and neural networks in particular, is that these models can be very computationally expensive to execute, both in terms of the computational resources and the time that are required to train and execute such models. In addition, training and executing conventional machine learning models oftentimes consumes a significant amount of energy. As a general matter, conventional machine learning models become more computationally expensive and require more energy to execute as the models grow in size, such as when the number of parameters within a neural network increases. However, the smaller the model size, the less reliable or accurate the outputs of the model become. Currently, there are few, if any, robust ways to automatically manage the tradeoff between the size of a machine learning model and the accuracy of a machine learning model, especially for machine learning models that are able to generate multiple types of outputs with different accuracies.
[0006] As the foregoing illustrates, what is needed in the art are more effective techniques for implementing machine learning models.
SUMMARY
[0007] One embodiment of the present disclosure sets forth a computer-implemented method for routing inputs to machine learning models for execution. The method includes computing one or more metric values based on an input. The method also includes determining, for at least one trained machine learning model included in a plurality of trained machine learning models, a corresponding output quality degradation based on the one or more metric values, wherein the corresponding output quality degradation is relative to a most computationally expensive trained machine learning model included in the plurality of trained machine learning models. The method further includes selecting a first trained machine learning model included in the plurality of trained machine learning models based on the corresponding output quality degradations. In addition, the method includes transmitting the input to the first trained machine learning model for execution.
[0008] Other embodiments of the present disclosure include, without limitation, one or more computer-readable media including instructions for performing one or more aspects of the disclosed techniques as well as a computing device for performing one or more aspects of the disclosed techniques.
[0009] At least one technical advantage of the disclosed techniques relative to the prior art is that the disclosed techniques robustly and reliably balance the tradeoff between the size of a machine learning model and the accuracy of a machine learning model in a way that can be implemented for many, if not all, machine learning model applications. In this regard, the disclosed techniques route inputs to a first machine learning model within a set of machine learning models for execution, where the first machine learning model is the least computationally expensive machine learning model within the set of machine learning models having a predicted output quality degradation relative to the output quality of the most computationally expensive machine learning model within the set of machine learning models that satisfies a desired constraint. By using the least computationally expensive machine learning model, while maintaining a desired level of output quality, the disclosed techniques can save computational resources, computational time, and energy. In addition, the predicted relative output quality degradation between the least expensive machine learning model and the most computationally expensive machine learning model within the set of machine learning models can be computed in an efficient manner using relatively simple metrics, including metrics specific to particular domains, while still being able to predict the relative output accuracy for different types of inputs that can be processed using the set of machine learning models. These technical advantages represent one or more technological improvements over prior art approaches.
BRIEF DESCRIPTION OF THE DRAWINGS
[0010] So that the manner in which the above recited features of the present disclosure can be understood in detail, a more particular description of the disclosure, briefly summarized above, may be had by reference to embodiments, some of which are illustrated in the appended drawings. It is to be noted, however, that the appended drawings illustrate only typical embodiments of this disclosure and are therefore not to be considered limiting of its scope, for the disclosure may admit to other equally effective embodiments.
[0011] Figure 1 is a block diagram illustrating a computer system configured to implement one or more aspects of the various embodiments;
[0012] Figure 2 is a more detailed illustration of the server of Figure 1, according to various embodiments;
[0013] Figure 3 is a more detailed illustration of the computing device of Figure 1, according to various embodiments;
[0014] Figure 4 is a more detailed illustration of the relative degradation application of Figure 1, according to various embodiments;
[0015] Figure 5A illustrates relative output quality degradations of exemplar machine learning models for a query length metric, according to various embodiments;
[0016] Figure 5B illustrates relative output quality degradations of exemplar machine learning models for a reading time metric, according to various embodiments;
[0017] Figure 5C illustrates relative output quality degradations of exemplar machine learning models for a number of named entities metric, according to various embodiments;
[0018] Figure 6 is a more detailed illustration of the routing application of Figure 1, according to various embodiments;
[0019] Figure 7 illustrates a flow diagram of method steps for determining the relative output quality degradations of the different machine learning models included in a set of machine learning models, according to various embodiments; and
[0020] Figure 8 illustrates a flow diagram of method steps for routing an input for execution by a given machine learning model included in a set of machine learning models based on relative output quality degradation, according to various embodiments.
DETAILED DESCRIPTION
[0021] As described, conventional machine learning models, and neural networks in particular, can be very computationally expensive to execute, both in terms of the computational resources and the time that are required to execute such models. Execution of conventional machine learning models can also consume a significant amount of energy. As a general matter, conventional machine learning models become more computationally expensive and require more energy to execute as the models grow in size, such as when the number of parameters within a neural network increases. However, the smaller the model size, the less reliable or accurate the outputs of the model become. Currently, there are few, if any, robust ways to automatically manage the tradeoff between the size of a machine learning model and the accuracy of a machine learning model, especially for machine learning models that are able to process multiple types of inputs to generate multiple types of outputs with different accuracies.
[0022] The disclosed techniques route inputs over machine learning models for processing. In some embodiments, a relative degradation application receives example inputs and computes one or more metric values for the example inputs. The relative degradation application determines, for each of the example inputs, a relative degradation between the quality of the output of each machine learning model in a set of machine learning models and the quality of the output of a most computationally expensive machine learning model in the set of machine learning models. Associations between the metric value(s) and the relative output quality degradations of the machine learning models are stored. Given an input, a routing application computes the same metric value(s) for the input and selects, based on the associations between metric value(s) and relative output quality degradations of the machine learning models, a least computationally expensive machine learning model from the set of machine learning models that is associated with a relative output quality degradation that satisfies a predefined constraint. Then, the routing application processes the input using the selected machine learning model to generate an output.

[0023] At least one technical advantage of the disclosed techniques relative to the prior art is that the disclosed techniques robustly and reliably balance the tradeoff between the size of a machine learning model and the accuracy of a machine learning model in a way that can be implemented for many, if not all, machine learning model applications. In this regard, the disclosed techniques route inputs to a first machine learning model within a set of machine learning models for execution, where the first machine learning model is the least computationally expensive machine learning model within the set of machine learning models having a predicted output quality degradation relative to the output quality of the most computationally expensive machine learning model within the set of machine learning models that satisfies a desired constraint. By using the least computationally expensive machine learning model, while maintaining a desired level of output quality, the disclosed techniques can save computational resources, computational time, and energy. In addition, the predicted relative output quality degradation between the least expensive machine learning model and the most computationally expensive machine learning model within the set of machine learning models can be computed in an efficient manner using relatively simple metrics, including metrics specific to particular domains, while still being able to predict the relative output accuracy for different types of inputs that can be processed using the set of machine learning models.
System Overview
[0024] Figure 1 illustrates a block diagram of a computer-based system 100 configured to implement one or more aspects of various embodiments. As shown, the system 100 includes a server 110, a data store 120, and a computing system 140 in communication over a network 130, which can be a wide area network (WAN) such as the Internet, a local area network (LAN), a cellular network, and/or any other suitable network.
[0025] As shown, a relative degradation application 116 executes on one or more processors 112 of the server 110 and is stored in a system memory 114 of the server 110. The processor(s) 112 receive user input from input devices, such as a keyboard or a mouse. In operation, the processor(s) 112 may include one or more primary processors of the server 110, controlling and coordinating operations of other system components. In particular, the processor(s) 112 can issue commands that control the operation of one or more graphics processing units (GPUs) (not shown) and/or other parallel processing circuitry (e.g., parallel processing units, deep learning accelerators, etc.) that incorporates circuitry optimized for graphics and video processing, including, for example, video output circuitry. The GPU(s) can deliver pixels to a display device that can be any conventional cathode ray tube, liquid crystal display, light-emitting diode display, and/or the like.
[0026] The system memory 114 of the server 110 stores content, such as software applications and data, for use by the processor(s) 112 and the GPU(s) and/or other processing units. The system memory 114 can be any type of memory capable of storing data and software applications, such as a random-access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash ROM), or any suitable combination of the foregoing. In some embodiments, a storage (not shown) can supplement or replace the system memory 114. The storage can include any number and type of external memories that are accessible to the processor 112 and/or the GPU. For example, and without limitation, the storage can include a Secure Digital Card, an external Flash memory, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, and/or any suitable combination of the foregoing.
[0027] The server 110 shown herein is for illustrative purposes only, and variations and modifications are possible without departing from the scope of the present disclosure. For example, the number of processors 112, the number of GPUs and/or other processing unit types, the number of system memories 114, and/or the number of applications included in the system memory 114 can be modified as desired. Further, the connection topology between the various units in Figure 1 can be modified as desired. In some embodiments, any combination of the processor(s) 112, the system memory 114, and/or GPU(s) can be included in and/or replaced with any type of virtual computing system, distributed computing system, and/or cloud computing environment, such as a public, private, or a hybrid cloud system.
[0028] In some embodiments, the relative degradation application 116 is configured to compute metrics associated with example inputs and the relative quality degradation of the output of each machine learning model in a set of machine learning models 150(1)-150(N) (referred to herein collectively as machine learning models 150 and individually as a machine learning model 150) relative to the output of a most computationally expensive machine learning model 150 in the set of machine learning models 150. Techniques that the relative degradation application 116 can employ to determine relative output quality degradations are discussed in greater detail below in conjunction with Figures 4 and 7. Associations between the values of metrics that are computed for the example inputs and relative output quality degradations of the machine learning models 150 for those example inputs can be stored (e.g., in the data store 120 or elsewhere) and thereafter used to route inputs to the machine learning models 150. In some embodiments, the data store 120 can include any storage device or devices, such as fixed disc drive(s), flash drive(s), optical storage, network attached storage (NAS), and/or a storage area network (SAN). Although shown as accessible over the network 130, in at least one embodiment the server 110 can include the data store 120.
[0029] As shown, a routing application 146 is stored in a system memory 144, and executes on a processor 142, of the computing system 140. The routing application 146 can be any technically feasible application that routes inputs, such as user inputs, for processing by the machine learning models 150. Techniques that the routing application 146 can use to route inputs based on stored associations between metric values and relative output quality degradations of the machine learning models 150 are discussed in greater detail below in conjunction with Figures 6 and 8.
[0030] Figure 2 is a more detailed illustration of the server 110 of Figure 1, according to various embodiments. In some embodiments, the server 110 can include any type of computing system, including, without limitation, a server machine, a server platform, a desktop machine, a laptop machine, a hand-held/mobile device, a digital kiosk, an in-vehicle infotainment system, and/or a wearable device. In some embodiments, the server 110 is a server machine operating in a data center or a cloud computing environment that provides scalable computing resources as a service over a network.
[0031] In some embodiments, the server 110 includes, without limitation, the processor(s) 112 and the memory(ies) 114 coupled to a parallel processing subsystem 212 via a memory bridge 205 and a communication path 213. Memory bridge 205 is further coupled to an I/O (input/output) bridge 207 via a communication path 206, and I/O bridge 207 is, in turn, coupled to a switch 216.

[0032] In some embodiments, the I/O bridge 207 is configured to receive user input information from optional input devices 208, such as a keyboard, mouse, touch screen, sensor data analysis (e.g., evaluating gestures, speech, or other information about one or more users in a field of view or sensory field of one or more sensors), and/or the like, and forward the input information to the processor(s) 112 for processing. In some embodiments, the server 110 can be a server machine in a cloud computing environment. In such embodiments, the server 110 might not include input devices 208, but can receive equivalent input information by receiving commands (e.g., responsive to one or more inputs from a remote computing device) in the form of messages transmitted over a network and received via a network adapter 218. In some embodiments, the switch 216 is configured to provide connections between I/O bridge 207 and other components of the server 110, such as a network adapter 218 and various add-in cards 220 and 221.
[0033] In some embodiments, the I/O bridge 207 is coupled to a system disk 214 that may be configured to store content and applications and data for use by the processor(s) 112 and the parallel processing subsystem 212. In some embodiments, the system disk 214 provides non-volatile storage for applications and data and may include fixed or removable hard disk drives, flash memory devices, and CD-ROM (compact disc read-only-memory), DVD-ROM (digital versatile disc-ROM), Blu-ray, HD-DVD (high-definition DVD), or other magnetic, optical, or solid state storage devices. In some embodiments, other components, such as universal serial bus or other port connections, compact disc drives, digital versatile disc drives, film recording devices, and the like, may be connected to the I/O bridge 207 as well.
[0034] In some embodiments, the memory bridge 205 may be a Northbridge chip, and the I/O bridge 207 may be a Southbridge chip. In addition, the communication paths 206 and 213, as well as other communication paths within the server 110, can be implemented using any technically suitable protocols, including, without limitation, AGP (Accelerated Graphics Port), HyperTransport, or any other bus or point-to-point communication protocol known in the art.
[0035] In some embodiments, the parallel processing subsystem 212 comprises a graphics subsystem that delivers pixels to an optional display device 210 that may be any conventional cathode ray tube, liquid crystal display, light-emitting diode display, and/or the like. In such embodiments, the parallel processing subsystem 212 may incorporate circuitry optimized for graphics and video processing, including, for example, video output circuitry. Such circuitry may be incorporated across one or more parallel processing units (PPUs), also referred to herein as parallel processors, included within the parallel processing subsystem 212.
[0036] In some embodiments, the parallel processing subsystem 212 incorporates circuitry optimized (e.g., that undergoes optimization) for general purpose and/or compute processing. Again, such circuitry may be incorporated across one or more PPUs included within the parallel processing subsystem 212 that are configured to perform such general purpose and/or compute operations. In yet other embodiments, the one or more PPUs included within the parallel processing subsystem 212 may be configured to perform graphics processing, general purpose processing, and/or compute processing operations. The system memory 114 includes at least one device driver configured to manage the processing operations of the one or more PPUs within the parallel processing subsystem 212. In addition, the system memory 114 includes the relative degradation application 116, which is discussed in greater detail below in conjunction with Figures 5 and 7. Although described herein primarily with respect to the relative degradation application 116, techniques disclosed herein can also be implemented, either entirely or in part, in other software and/or hardware, such as in the parallel processing subsystem 212.
[0037] In some embodiments, the parallel processing subsystem 212 can be integrated with one or more of the other elements of Figure 2 to form a single system. For example, the parallel processing subsystem 212 can be integrated with the processor(s) 112 and other connection circuitry on a single chip to form a system on a chip (SoC).
[0038] In some embodiments, the processor(s) 112 includes the primary processor of the server 110, controlling and coordinating operations of other system components. In some embodiments, the processor(s) 112 issues commands that control the operation of PPUs. In some embodiments, the communication path 213 is a PCI Express link, in which dedicated lanes are allocated to each PPU. Other communication paths may also be used. The PPU advantageously implements a highly parallel processing architecture, and the PPU may be provided with any amount of local parallel processing memory (PP memory).

[0039] It will be appreciated that the system shown herein is illustrative and that variations and modifications are possible. The connection topology, including the number and arrangement of bridges, the number of processor(s) 112, and the number of parallel processing subsystems 212, can be modified as desired. For example, in some embodiments, the system memory 114 could be connected to the processor(s) 112 directly rather than through the memory bridge 205, and other devices can communicate with the system memory 114 via the memory bridge 205 and the processor(s) 112. In other embodiments, the parallel processing subsystem 212 can be connected to the I/O bridge 207 or directly to the processor(s) 112, rather than to the memory bridge 205. In still other embodiments, the I/O bridge 207 and the memory bridge 205 can be integrated into a single chip instead of existing as one or more discrete devices. In certain embodiments, one or more components shown in Figure 2 may not be present. For example, the switch 216 could be eliminated, and the network adapter 218 and add-in cards 220, 221 would connect directly to the I/O bridge 207. Lastly, in certain embodiments, one or more components shown in Figure 2 may be implemented as virtualized resources in a virtual computing environment, such as a cloud computing environment. In particular, the parallel processing subsystem 212 may be implemented as a virtualized parallel processing subsystem in some embodiments. For example, the parallel processing subsystem 212 may be implemented as a virtual graphics processing unit(s) (vGPU(s)) that renders graphics on a virtual machine(s) (VM(s)) executing on a server machine(s) whose GPU(s) and other physical resources are shared across one or more VMs.
[0040] Figure 3 is a more detailed illustration of the computing system 140 of Figure 1, according to various embodiments. In some embodiments, the computing system 140 can include any type of computing system, including, without limitation, a server machine, a server platform, a desktop machine, a laptop machine, a hand-held/mobile device, a digital kiosk, an in-vehicle infotainment system, and/or a wearable device. In some embodiments, the computing system 140 is a server machine operating in a data center or a cloud computing environment that provides scalable computing resources as a service over a network.
[0041] In some embodiments, the computing system 140 includes, without limitation, the processor(s) 142 and the memory(ies) 144 coupled to a parallel processing subsystem 312 via a memory bridge 305 and a communication path 313. Memory bridge 305 is further coupled to an I/O (input/output) bridge 307 via a communication path 306, and I/O bridge 307 is, in turn, coupled to a switch 316.
[0042] In some embodiments, the I/O bridge 307 is configured to receive user input information from optional input devices 308, such as a keyboard, mouse, touch screen, sensor data analysis (e.g., evaluating gestures, speech, or other information about one or more users in a field of view or sensory field of one or more sensors), and/or the like, and forward the input information to the processor(s) 142 for processing. In some embodiments, the computing system 140 can be a server machine in a cloud computing environment. In such embodiments, the computing system 140 might not include the input devices 308, but can receive equivalent input information by receiving commands (e.g., responsive to one or more inputs from a remote computing device) in the form of messages transmitted over a network and received via a network adapter 318. In some embodiments, the switch 316 is configured to provide connections between I/O bridge 307 and other components of the computing system 140, such as a network adapter 318 and various add-in cards 320 and 321.
[0043] In some embodiments, the I/O bridge 307 is coupled to a system disk 314 that may be configured to store content and applications and data for use by the processor(s) 142 and the parallel processing subsystem 312. In some embodiments, the system disk 314 provides non-volatile storage for applications and data and may include fixed or removable hard disk drives, flash memory devices, and CD-ROM (compact disc read-only-memory), DVD-ROM (digital versatile disc-ROM), Blu-ray, HD-DVD (high-definition DVD), or other magnetic, optical, or solid state storage devices. In some embodiments, other components, such as universal serial bus or other port connections, compact disc drives, digital versatile disc drives, film recording devices, and the like, may be connected to the I/O bridge 307 as well.
[0044] In some embodiments, the memory bridge 305 may be a Northbridge chip, and the I/O bridge 307 may be a Southbridge chip. In addition, the communication paths 306 and 313, as well as other communication paths within the computing system 140, can be implemented using any technically suitable protocols, including, without limitation, AGP (Accelerated Graphics Port), HyperTransport, or any other bus or point-to-point communication protocol known in the art.

[0045] In some embodiments, the parallel processing subsystem 312 comprises a graphics subsystem that delivers pixels to an optional display device 310 that may be any conventional cathode ray tube, liquid crystal display, light-emitting diode display, and/or the like. In such embodiments, the parallel processing subsystem 312 may incorporate circuitry optimized for graphics and video processing, including, for example, video output circuitry. Such circuitry may be incorporated across one or more parallel processing units (PPUs), also referred to herein as parallel processors, included within the parallel processing subsystem 312.
[0046] In some embodiments, the parallel processing subsystem 312 incorporates circuitry optimized (e.g., that undergoes optimization) for general purpose and/or compute processing. Again, such circuitry may be incorporated across one or more PPUs included within the parallel processing subsystem 312 that are configured to perform such general purpose and/or compute operations. In yet other embodiments, the one or more PPUs included within the parallel processing subsystem 312 may be configured to perform graphics processing, general purpose processing, and/or compute processing operations. The system memory 144 includes at least one device driver configured to manage the processing operations of the one or more PPUs within the parallel processing subsystem 312. In addition, the system memory 144 includes the routing application 146, discussed in greater detail in conjunction with Figures 5-6 and 8. Although described herein primarily with respect to the routing application 146, techniques disclosed herein can also be implemented, either entirely or in part, in other software and/or hardware, such as in the parallel processing subsystem 312.
[0047] In some embodiments, the parallel processing subsystem 312 can be integrated with one or more of the other elements of Figure 3 to form a single system. For example, the parallel processing subsystem 312 can be integrated with the processor(s) 142 and other connection circuitry on a single chip to form a system on a chip (SoC).
[0048] In some embodiments, the processor(s) 142 includes the primary processor of the computing system 140, controlling and coordinating operations of other system components. In some embodiments, the processor(s) 142 issues commands that control the operation of PPUs. In some embodiments, the communication path 313 is a PCI Express link, in which dedicated lanes are allocated to each PPU. Other communication paths may also be used. The PPU advantageously implements a highly parallel processing architecture, and the PPU may be provided with any amount of local parallel processing memory (PP memory).
[0049] It will be appreciated that the system shown herein is illustrative and that variations and modifications are possible. The connection topology, including the number and arrangement of bridges, the number of processor(s) 142, and the number of parallel processing subsystems 312, can be modified as desired. For example, in some embodiments, the system memory 144 could be connected to the processor(s) 142 directly rather than through the memory bridge 305, and other devices can communicate with system memory 144 via the memory bridge 305 and the processor(s) 142. In other embodiments, the parallel processing subsystem 312 can be connected to the I/O bridge 307 or directly to the processor(s) 142, rather than to the memory bridge 305. In still other embodiments, I/O bridge 307 and the memory bridge 305 can be integrated into a single chip instead of existing as one or more discrete devices. In certain embodiments, one or more components shown in Figure 3 may not be present. For example, the switch 316 could be eliminated, and the network adapter 318 and add-in cards 320, 321 would connect directly to the I/O bridge 307. Lastly, in certain embodiments, one or more components shown in Figure 3 may be implemented as virtualized resources in a virtual computing environment, such as a cloud computing environment. In particular, the parallel processing subsystem 312 may be implemented as a virtualized parallel processing subsystem in some embodiments. For example, the parallel processing subsystem 312 may be implemented as a virtual graphics processing unit(s) (vGPU(s)) that renders graphics on a virtual machine(s) (VM(s)) executing on a server machine(s) whose GPU(s) and other physical resources are shared across one or more VMs.
Optimized Routing of Inputs to Machine Learning Models
[0050] Figure 4 is a more detailed illustration of the relative degradation application 116 of Figure 1, according to various embodiments. As shown, the relative degradation application 116 includes a metric computation module 404, a model processing module 406, a degradation computation module 412, and an association module 418. In operation, the relative degradation application 116 (1) receives example inputs 402; (2) computes one or more metric values 416 for the example inputs 402; (3) determines, for each of the example inputs 402, a quality degradation of an output of each of the machine learning models 150 relative to the output of a most computationally expensive one of the machine learning models 150; and (4) stores associations between the metric value(s) and the relative output quality degradations of the machine learning models 150.
[0051] The metric computation module 404 computes metric value(s) 416 for each of the example inputs 402. Any technically feasible metric(s) can be used in some embodiments. In some embodiments, the metric(s) can include domain-specific complexity measures. In some embodiments, the metric(s) can include lexical analysis techniques. For example, one of the metric(s) could be a length measure of an example input. In such cases, the length metric can be computed as a character count of the example input and can serve as a simple measure of query complexity. As another example, one of the metric(s) could be a number of nouns in an example input. In such cases, the number of nouns can be computed by counting the named entities, such as references to persons or places that can imply complex logical reasoning tasks or queries that draw on general knowledge and, therefore, inform the understanding of input complexity. As a further example, one of the metric(s) could be a reading time duration associated with an example input. In such cases, the reading time can be computed using a known reading time formula and can serve as a measure of lexical complexity. As yet another example, for an example input that includes a math problem, one of the metric(s) could be a symbol count.
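By way of illustration only, the following Python sketch shows one possible way to compute such lightweight metric values; the function names, the assumed words-per-minute constant, and the capitalized-token heuristic used as a stand-in for named-entity counting are illustrative assumptions rather than requirements of the disclosed embodiments.

    import re

    WORDS_PER_MINUTE = 238  # assumed average reading rate; any reading-time formula could be used

    def query_length(text: str) -> int:
        # Character count, used as a simple measure of query complexity.
        return len(text)

    def reading_time_minutes(text: str) -> float:
        # Approximate reading-time duration as a measure of lexical complexity.
        return len(text.split()) / WORDS_PER_MINUTE

    def named_entity_count(text: str) -> int:
        # Crude stand-in for a named-entity count: capitalized tokens after the first token.
        tokens = re.findall(r"\b\w+\b", text)
        return sum(1 for i, tok in enumerate(tokens) if i > 0 and tok[0].isupper())

    def symbol_count(text: str) -> int:
        # Count of non-alphanumeric, non-whitespace characters (e.g., for math problems).
        return sum(1 for ch in text if not ch.isalnum() and not ch.isspace())

    def compute_metrics(text: str) -> dict:
        return {
            "length": query_length(text),
            "reading_time": reading_time_minutes(text),
            "named_entities": named_entity_count(text),
            "symbols": symbol_count(text),
        }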
[0052] The model processing module 406 processes the example inputs 402 using the machine learning models 150 to generate outputs 410. In some embodiments, any technically feasible trained machine learning models, such as trained large language models (LLMs) or other artificial neural networks that include different numbers of parameters, can be used as the machine learning models 150. In such cases, processing the example inputs 402 using the machine learning models 150 can include inputting the example inputs 402 into the machine learning models 150 that generate corresponding outputs 410.
[0053] The degradation computation module 412 computes, for each example input 402, a degradation in the quality of the output of each machine learning model 150 relative to an output of a most computationally expensive machine learning model 150, shown as relative degradations 414 for the machine learning outputs 410 relative to outputs generated by the most computationally expensive machine learning model 150. In some embodiments, the degradation computation module 412 first computes a quality of each of the outputs generated by the machine learning models 150. Any technically feasible quality metric, including domain-specific quality metrics, can be used in some embodiments to compute the quality of each output. For example, in some embodiments, the quality metric can be a score generated by a chatbot that is asked to assess the quality of an output. As another example, in some embodiments, the quality metric can be a measure of accuracy of an output. Then, the degradation computation module 412 computes a relative output quality degradation between a most computationally expensive one of the machine learning models 150 and each of the other machine learning models 150. Returning to the example of scores generated by a chatbot, the relative output quality degradation can be a chatbot-assessed loss that is the difference between the score for an output of the most computationally expensive machine learning model 150 and the score for the output of another machine learning model 150. The output quality degradation can be represented in any technically feasible manner, such as using a value between 0% and 100%, with 0% being no quality degradation and 100% being complete quality degradation. It should be noted that use of relative output quality degradation of each machine learning model 150 compared to the most computationally expensive machine learning model 150, rather than the absolute quality of outputs of each machine learning model 150, can result in simpler, more computationally efficient, more accurate, and/or less domain-specific predictions because of the difficulty of developing accurate metrics for predicting the absolute quality of outputs of machine learning models, especially machine learning models that are able to generate multiple types of outputs with different accuracies, such as LLMs that can generate code, parse natural language text, etc. Experience has shown that relatively simple domain-specific metrics, such as length of an input, number of nouns in an input, reading time duration associated with an input, and symbol count in an input, described above, can be used to predict the relative output quality of machine learning models regardless of the type of output being generated, even if the output is in a different domain.
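A minimal Python sketch of one possible relative degradation computation follows, assuming each output has already been assigned a quality score on a common positive scale (for example, by a judge chatbot); the normalized-difference formulation, the clamping to the 0%-100% range, and the model identifiers are illustrative assumptions rather than part of the disclosed embodiments.

    def relative_degradations(scores: dict, baseline_model: str) -> dict:
        # `scores` maps a model identifier to the quality score of its output for one
        # example input; the baseline is the most computationally expensive model.
        baseline_score = scores[baseline_model]
        degradations = {}
        for model, score in scores.items():
            if baseline_score <= 0:
                degradations[model] = 0.0  # degenerate case: no usable baseline signal
            else:
                drop = (baseline_score - score) / baseline_score
                degradations[model] = min(max(drop, 0.0), 1.0)  # clamp to 0%-100%
        return degradations

    # Example on an assumed 0-10 judge scale; degradations are roughly 0.0, 0.1, and 0.3.
    print(relative_degradations({"llm_70b": 9.0, "llm_13b": 8.1, "llm_7b": 6.3}, "llm_70b"))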
[0054] The association module 418 stores an association between buckets of metric values computed by the metric computation module 404 and relative output quality degradations computed by the degradation computation module 412. For example, in some embodiments, the buckets can group metric values by quantiles (e.g., deciles), and each quantile bucket for a given metric can be associated with a relative output quality degradation that is an average of the relative output quality degradations computed for example inputs 402 whose computed metric values belong to the quantile bucket. The stored association between buckets of metric values and relative output quality degradations can then be used in any technically feasible manner, such as by the routing application 146 executing in the computing device 140 to route inputs to the machine learning models 150, or by an application executing in the server 110 itself.
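One way the association between metric-value buckets and average relative degradations could be built is sketched below; the use of NumPy, the decile default, and the handling of empty buckets are assumptions made for illustration only.

    import numpy as np

    def build_degradation_table(metric_values, degradations, num_buckets=10):
        # Associate quantile buckets of one metric with the average relative output
        # quality degradation of one (non-baseline) model over the example inputs.
        metric_values = np.asarray(metric_values, dtype=float)
        degradations = np.asarray(degradations, dtype=float)
        # Quantile (e.g., decile) edges over the observed metric values.
        edges = np.quantile(metric_values, np.linspace(0.0, 1.0, num_buckets + 1))
        # Assign each example input to a bucket and average the degradations per bucket.
        bucket_ids = np.clip(np.searchsorted(edges, metric_values, side="right") - 1,
                             0, num_buckets - 1)
        means = np.array([degradations[bucket_ids == b].mean()
                          if np.any(bucket_ids == b) else 0.0
                          for b in range(num_buckets)])
        return edges, means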
[0055] Although described herein primarily with respect to associating buckets of metric values with relative output quality degradations, in some embodiments, metric values can be associated with relative output quality degradations of machine learning models in any technically feasible manner. For example, in some embodiments, the association module 418 can fit a function to a curve formed by the metric values on one axis and the relative output quality degradations of a machine learning model on another axis. In such cases, the fitted function can be stored as the association between metric values and relative output quality degradations for the machine learning model.
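For the curve-fitting alternative, a brief sketch using an ordinary polynomial fit is shown below; the choice of a polynomial and its degree are arbitrary illustrative assumptions, and any technically feasible fitting procedure could be substituted.

    import numpy as np

    def fit_degradation_curve(metric_values, degradations, degree=3):
        # Fit a function mapping a metric value to a predicted relative degradation,
        # as an alternative to storing per-bucket averages.
        coefficients = np.polyfit(metric_values, degradations, deg=degree)
        return np.poly1d(coefficients)

    # Usage: predicted = fit_degradation_curve(values, drops)(metric_value_for_new_input)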
[0056] More formally, for a given query q (or other input), the goal is to select and apply the model m with the highest inference performance that meets some user-defined quality-degradation threshold ε on a task metric μ. This goal can be expressed as:

    m_opt(ε) = arg max_m P(m, q)   subject to   μ(m(q)) meets ε,     (1)

where P refers to inference performance. The key challenge in the optimization problem of equation (1) is assessing the constraint μ(m(q)) meets ε. Computing μ in the online setting is generally infeasible, since doing so can require access to a ground-truth label or an expensive verification process (e.g., for code generation by an LLM). Further, training some verifier/scorer T as a proxy for μ, and then transforming the optimization problem into

    m_opt(ε) = arg max_m P(m, q)   subject to   T(m(q)) > ε,     (2)

is a non-trivial task. The approach of the relative degradation application 116 is instead to substitute the constraint to no longer rely on the relationship T(m(q)) ≈ μ(m(q)) or even a predictor for μ. Doing so avoids computing m(q) and sidesteps challenges faced in applications with more complex and nuanced μ metrics (e.g., code generation by an LLM) where the metric is difficult to predict directly. More specifically, the relative degradation application 116 exploits and re-frames the semantics of ε in the objective function. Suppose a candidate set of machine learning models (e.g., machine learning models 150) exists. If a baseline machine learning model m' that is most computationally expensive is selected from the candidate set, then the constraint can be defined as follows:

    μ(m(q)) / μ(m'(q)) ≥ ε',     (3)

where ε' is a relative degradation measure between the machine learning models in the set of machine learning models rather than an absolute quality threshold. While μ might be defined in a way that makes μ difficult to approximate, such as a human- (or LLM-) evaluated quality score or a set of unit tests for code generation, measuring relative quality degradation only requires predicting the relationship between two machine learning models on the given task. In other words, while μ(m(q)) and μ(m'(q)) are hard to predict individually, the ratio μ(m(q)) / μ(m'(q)) can be informed by the relationship between m and m'.
[0057] The relative degradation application 116 uses an approximation function θ(m, q) ≈ μ(m(q)) / μ(m'(q)). Deriving a theoretical solution for θ is non-trivial. Instead, the relative degradation application 116 empirically evaluates candidate metrics c such that c(q) correlates with 1 - μ(m(q)) / μ(m'(q)). These metrics, such as the input length, number of nouns, reading time, and symbol count, described above, can be straightforward yet still serve as useful approximations. Functions c such that c(q) correlates with 1 - μ(m(q)) / μ(m'(q)) provide a basis for the approximation function θ. More specifically, each metric c(q) can be used to create the approximation function θ(m, q) as follows. Given a query q, the metric c(q) can be computed and mapped into a quantile on a curve correlating c(q) with 1 - μ(m(q)) / μ(m'(q)). Then, the value associated with the upper-bound of the quantile can be taken to provide an upper-bound estimate of the relative output quality degradation.
[0058] Figure 5A illustrates relative output quality degradations of exemplar machine learning models for a query length metric, according to various embodiments. As shown, a graph 500 of query length (x-axis) versus percentage average quality degradation (y-axis) indicates the percentage average quality degradation of the outputs of three different large language models (LLMs) for queries of different lengths that are input into the three LLMs. The three LLMs include different numbers of parameters, with a most computationally expensive LLM that includes a largest number of parameters being used as the baseline for comparison. In the examples of Figures 5A-5C, the baseline LLM that is most computationally expensive includes 70 billion parameters, a second most computationally expensive LLM includes 13 billion parameters, and a least computationally expensive LLM includes 7 billion parameters. The queries are assigned to quantile buckets based on the length of the queries. In addition, a Pearson correlation coefficient has been computed between the query length metric and the quality degradation to show the correlation between the metric and the degradation, and the y-axes are inverted to make the curves easier to understand visually. Experience has shown that the metrics of input length, reading time, and number of nouns are strongly correlated with output quality degradation relative to a baseline machine learning model.
[0059] Figure 5B illustrates relative output quality degradations of exemplar machine learning models for a reading time metric, according to various embodiments. As shown, a graph 510 of reading time duration in minutes (x-axis) versus percentage average quality degradation (y-axis) indicates the percentage average quality degradation of the outputs of three different large language models (LLMs) for queries associated with different reading time durations that are input into the three LLMs. The graph 510 is similar to the graph 500, described above in conjunction with Figure 5A, and shows the relative output quality degradation for the same three LLMs, except the metric used is reading time duration rather than query length.
[0060] Figure 5C illustrates relative output quality degradations of exemplar machine learning models for a number of named entities metric, according to various embodiments. As shown, a graph 520 of number of named entities (x-axis) versus percentage average quality degradation (y-axis) indicates the percentage average quality degradation of the outputs of three different large language models (LLMs) for queries having different numbers of named entities that are input into the three LLMs. In some embodiments, the number of named entities can be obtained by counting the number of nouns in a query. The graph 520 is similar to the graph 500, described above in conjunction with Figure 5A, and shows the relative output quality degradation for the same three LLMs, except the metric used is the number of named entities rather than query length.
[0061] Figure 6 is a more detailed illustration of the routing application 146 of Figure 1, according to various embodiments. As shown, the routing application 146 includes a metric computation module 604, a relative degradation module 608, and a routing module 612. In operation, the routing application 146 (1) receives an input 602, (2) selects one of the machine learning models 150 that is least computationally expensive and associated with a relative output quality degradation with respect to a most computationally expensive machine learning model 150 that satisfies a constraint on the relative output quality degradation, and (3) processes the input 602 using the selected machine learning model 150. Any suitable input 602, such as user input entered via a user interface (UI), can be received in some embodiments.
[0062] The metric computation module 604 computes values 606(1) to 606(N) of one or more metrics (referred to herein collectively as metric values 606 and individually as a metric value 606) based on the input 602. Any technically feasible metrics can be used in some embodiments. In some embodiments, the metric computation module 604 can use the same metrics that are computed by the metric computation module 404 of the relative degradation application 116, described above in conjunction with Figure 4. In some embodiments, the metrics can include domain-specific complexity measures. In some embodiments, the metric(s) can include lexical analysis techniques. For example, in some embodiments, the metrics can include a length of the input 602, a number of nouns in the input 602, a reading time duration associated with the input 602, a symbol count in the input 602, and/or the like.
[0063] The relative degradation module 608 determines a predicted relative quality degradation of the output generated by each of the machine learning models 150 for the input 602, shown as relative degradations 610(1) to 610(N) (referred to herein collectively as relative degradations 610 and individually as a relative degradation 610). In some embodiments, the relative degradation module 608 first determines a bucket, such as a quantile bucket, that the computed metric value belongs to. Then, the relative degradation module 608 determines a relative output quality degradation associated with the bucket for each of the machine learning models 150. The relative output quality degradation associated with the bucket for each machine learning model 150 can be pre-computed by the relative degradation application 116, as described above in conjunction with Figure 4.
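Continuing the earlier bucketing sketch, the per-input lookup performed by the relative degradation module could resemble the following; the table layout (quantile edges plus per-bucket means) is an assumption carried over from that sketch rather than a required data structure.

    import numpy as np

    def predict_degradation(metric_value, edges, bucket_means):
        # Look up the pre-computed relative degradation for the quantile bucket that
        # the metric value computed for the input falls into.
        num_buckets = len(bucket_means)
        bucket = int(np.clip(np.searchsorted(edges, metric_value, side="right") - 1,
                             0, num_buckets - 1))
        return bucket_means[bucket]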
[0064] The routing module 612 routes the input 602 for processing by a least computationally expensive machine learning model that satisfies a constraint on the relative output quality degradation. Such processing generates a machine learning model output 614. In some embodiments, the routing module 612 selects, for each of the machine learning models 150, a largest relative output quality degradation from the relative output quality degradations determined by the relative degradation module 608. Then, the routing module 612 determines a least computationally expensive machine learning model 150 whose largest relative output quality degradation satisfies a constraint on the relative output quality degradation. How computationally expensive each machine learning model 150 is to execute can be known. For example, in some embodiments, machine learning models 150 that include more parameters can be considered to be more computationally expensive to execute, and vice versa.
[0065] In some embodiments, the constraint on the relative output quality degradation can be a largest acceptable relative output quality degradation, such as a largest percentage relative quality degradation, that the relative output quality degradations 610 for a machine learning model 150 need to be below in order to select that machine learning model 150 for use. In some embodiments, the constraint on the relative output quality degradation can be a quality constraint required to satisfy a service-level objective (SLO). In some other embodiments, rather than requiring that the relative output quality degradation predicted from each of the metric values satisfy a constraint for a machine learning model 150, the routing application 146 could employ a more permissive rule, such as requiring the relative output quality degradation predicted from any of the metric values to satisfy a constraint for a machine learning model 150, requiring an average (e.g., a weighted average) of the relative output quality degradation predicted from the metric values to satisfy a constraint for a machine learning model 150, or the like. It should be noted that the more restrictive rule of requiring the relative output quality degradation predicted from each of the metric values to satisfy a constraint errs on the side of generating higher quality outputs, whereas more permissive rules would allow lower quality outputs in order to improve performance/reduce computational expense.
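The constraint checks described above could be expressed along the following lines; the rule names and the treatment of weights are illustrative assumptions, with the stricter "all" rule corresponding to requiring every metric's prediction to satisfy the constraint.

    def satisfies_constraint(per_metric_degradations, max_degradation, rule="all", weights=None):
        # `per_metric_degradations` holds one predicted relative degradation per metric
        # for a candidate model; `max_degradation` is the largest acceptable degradation.
        values = list(per_metric_degradations)
        if rule == "all":   # stricter rule: every metric's prediction must satisfy the constraint
            return all(d <= max_degradation for d in values)
        if rule == "any":   # permissive rule: at least one prediction must satisfy the constraint
            return any(d <= max_degradation for d in values)
        if rule == "mean":  # permissive rule: a (possibly weighted) average must satisfy the constraint
            weights = weights or [1.0] * len(values)
            average = sum(w * d for w, d in zip(weights, values)) / sum(weights)
            return average <= max_degradation
        raise ValueError(f"unknown rule: {rule}")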
[0066] More formally, the derivation of θ, described above in conjunction with Figure 4, allows the objective function of routing to be defined as follows:

    m_opt(ε') = arg max_m P(m, q)   subject to   θ(m, q) ≥ ε'.     (4)

In the worst case, equation (4) invokes m'(q) (the biggest/slowest model). As described above in conjunction with Figure 4, candidate metrics c can be used to compute corresponding θ estimators. Using an ensemble of θ estimators, as shown in equation (5), can mitigate the risk of an inaccurate routing decision:

    m_opt(ε') = arg max_m P(m, q)   subject to   θ_i(m, q) ≥ ε' for all θ_i ∈ {θ_0, θ_1, ...}.     (5)
When each θ-function is computationally light, computing multiple θ-functions together should still not induce significant overheads. To solve equation (5), the arg max function needs to be computed, which requires iterating over the model list M. In some embodiments, the models can be sorted in order of size (which serves as a proxy for P(m, q)). For each model, θ(m, q) is computed. If the constraints are satisfied, m(q) is computed and the output returned. The most computationally expensive model (e.g., a largest model) will yield 100% quality for every θ function, since the degradation estimate is computed in relation to such a model. The user-defined ε' threshold is constrained so that the threshold cannot exceed 100%; thus, some viable candidate model is guaranteed to be found to serve an input.
[0067] Algorithm 1 presents the routing technique used by the routing application 146 in pseudocode. In effect, Algorithm 1 iteratively computes and moves along the θ-approximated Pareto frontier between performance and quality for a given input and metric. Algorithm 1 then returns the point along the frontier that satisfies the quality requirement across all metrics.

    Algorithm 1: θ-Proxy Routing Procedure
    for m ∈ M do
        for θ ∈ {θ_0, θ_1, ...} do
            if θ(m, q) < ε' then
                skip to next m
            end if
        end for
        return m, m(q)
    end for
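A Python transliteration of Algorithm 1 might look as follows, where, as in equations (4) and (5), each θ estimator returns an estimated relative quality in [0, 1] (one minus the predicted degradation); the ordering of the model list, the callable interface, and the generate() method are illustrative assumptions.

    def route_query(query, models, theta_estimators, epsilon_prime):
        # `models` is assumed to be sorted from least to most computationally expensive,
        # with the baseline model m' last; `theta_estimators` is a list of callables
        # theta(model, query) returning an estimated relative quality in [0, 1].
        for model in models:
            if all(theta(model, query) >= epsilon_prime for theta in theta_estimators):
                return model, model.generate(query)  # hypothetical generate() interface
        # Unreachable when the baseline model is included, since every theta estimator
        # reports 100% relative quality for the baseline and epsilon_prime <= 1.0.
        raise RuntimeError("no candidate model satisfied the quality threshold")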
[0068] Figure 7 illustrates a flow diagram of method steps for determining the relative output quality degradations of the different machine learning models included in a set of machine learning models, according to various embodiments. Although the method steps are described in conjunction with the systems of Figures 1-4 and 6, persons of ordinary skill in the art will understand that any system configured to perform the method steps, in any order, is within the scope of the inventions.
[0069] As shown, a method 700 begins at step 702, where the relative degradation application 116 receives example inputs. At step 704, the relative degradation application 116 selects one of the example inputs for processing.
[0070] At step 706, the relative degradation application 116 computes one or more metric values based on the selected example input. Any technically feasible metrics can be used in some embodiments. For example, in some embodiments, the metrics can include a length of the selected example input, a number of nouns in the selected example input, and/or a reading time associated with the selected example input.
[0071] At step 708, the relative degradation application 116 processes the selected example input using a set of machine learning models to generate outputs. The set of machine learning models can include some machine learning models that are more computationally expensive and/or consume more energy to execute than some other machine learning models.
[0072] At step 710, the relative degradation application 116 computes a quality of each of the outputs generated by the machine learning models. Any technically feasible quality metric can be used in some embodiments. For example, in some embodiments, the quality metric can be a score generated for each output by a chatbot that is asked to assess the quality of the output. As another example, in some embodiments, the quality metric can be a measure of accuracy of the output.
[0073] At step 712, the relative degradation application 116 computes a relative quality degradation between the output of a most computationally expensive machine learning model and the output of each other machine learning model in the set of machine learning models. The relative output quality degradation is a difference in quality of the output from a machine learning model in the set of machine learning models compared to an output from the most computationally expensive machine learning model in the set of machine learning models. Returning to the example of scores generated by a chatbot, the relative output quality degradation can be a chatbot-assessed loss that is the difference between the score for an output of the most computationally expensive machine learning model and the score for the output of another machine learning model in the set of machine learning models. The output quality degradation can be represented in any technically feasible manner, such as using a value between 0% and 100% to indicate the percent degradation.
[0074] At step 714, if there are more example inputs, then the method 700 returns to step 704, where the relative degradation application 116 selects another one of the example inputs for processing.
[0075] On the other hand, if there are no more example inputs, then the method 700 continues to step 716, where the relative degradation application 116 stores an association between buckets of the computed metric values and the computed relative output quality degradations. The computed metric values can be grouped into any suitable buckets, such as quantile buckets, in some embodiments. In some other embodiments, metric values can be associated with relative output quality degradations in any technically feasible manner, such as by fitting a function to a curve formed by the metric values on one axis and the relative output quality degradations on another axis.
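The steps of the method 700 could be strung together in an offline profiling pass along the lines of the sketch below; the score_output callable, the generate() interface, and the helper functions compute_metrics, relative_degradations, and build_degradation_table from the earlier sketches are assumptions rather than required interfaces.

    def profile_models(example_inputs, models, baseline_model, score_output, num_buckets=10):
        # Offline pass (method 700): compute metric values, run every model on each
        # example input, score the outputs, derive per-input relative degradations,
        # and bucket the degradations by metric value.
        metric_values = {}  # metric name -> list of values, one per example input
        drops = {}          # model identifier -> list of relative degradations

        for text in example_inputs:
            for name, value in compute_metrics(text).items():
                metric_values.setdefault(name, []).append(value)
            scores = {model_id: score_output(text, model.generate(text))  # hypothetical interfaces
                      for model_id, model in models.items()}
            for model_id, drop in relative_degradations(scores, baseline_model).items():
                drops.setdefault(model_id, []).append(drop)

        # One (quantile edges, per-bucket mean degradation) table per metric and per model.
        return {name: {model_id: build_degradation_table(values, drops[model_id], num_buckets)
                       for model_id in models}
                for name, values in metric_values.items()}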
[0076] Figure 8 illustrates a flow diagram of method steps for routing an input for execution by a given machine learning model included in a set of machine learning models based on relative output quality degradation, according to various embodiments. Although the method steps are described in conjunction with the systems of Figures 1-4 and 6, persons of ordinary skill in the art will understand that any system configured to perform the method steps, in any order, is within the scope of the inventions.
[0077] As shown, a method 800 begins at step 802, where the routing application 146 receives an input. Any suitable input, such as user input entered via a UI, can be received in some embodiments.
[0078] At step 804, the routing application 146 computes a metric value based on the input. The value of any technically feasible metric can be computed in some embodiments. In some embodiments, the same metric(s) can be used that were used in the method 700, described above in conjunction with Figure 7, to associate metric values with relative output quality degradations of a set of machine learning models. In some embodiments, the metric(s) can include domain-specific complexity measures. In some embodiments, the metric(s) can be computed using lexical analysis techniques. For example, in some embodiments, the metric(s) can include a length of the input, a number of nouns in the input, and/or a reading time associated with the input.
[0079] At step 806, the routing application 146 determines a bucket to which the metric value computed at step 804 belongs. The computed metric value can be assigned to any suitable bucket, such as a quantile bucket, in some embodiments. In some embodiments, the same buckets can be used that were used in the method 700, described above in conjunction with Figure 7, to associate metric values with relative output quality degradations of a set of machine learning models.
[0080] At step 808, the routing application 146 determines a relative output quality degradation associated with the bucket for each machine learning model in a set of machine learning models. In some embodiments, the routing application 146 can look up the relative output quality degradation associated with the bucket for each machine learning model, as determined and stored according to the method 700, described above in conjunction with Figure 7.
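A sketch of steps 806 and 808, assuming the bucket edges and per-model degradation tables were produced and stored according to the method 700, could look as follows; the edge values and table contents shown are hypothetical placeholders.

```python
import numpy as np


def lookup_degradation(metric_value: float, edges, bucket_table: dict[int, float]) -> float:
    """Find the bucket that a metric value falls into (step 806) and return
    the stored relative degradation for that bucket (step 808)."""
    bucket = int(np.digitize([metric_value], edges)[0])
    return bucket_table[bucket]


# Hypothetical tables produced offline for a single metric (prompt length):
# bucket edges and, per model, the average degradation observed in each bucket.
LENGTH_EDGES = np.array([50.0, 150.0, 400.0])
DEGRADATION_TABLES = {
    "small":  {0: 3.0, 1: 7.0, 2: 15.0, 3: 24.0},
    "medium": {0: 1.0, 1: 3.0, 2: 6.0, 3: 11.0},
}

if __name__ == "__main__":
    for model, table in DEGRADATION_TABLES.items():
        print(model, lookup_degradation(220.0, LENGTH_EDGES, table))
```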
[0081] At step 810, if there are more metrics, then the method 800 returns to step 804, where the routing application 146 computes the value of another metric based on the input. Although described with respect to steps 804, 806, and 808 being performed serially for different metrics, in some embodiments, steps 804, 806, and 808 can be performed in parallel for multiple metrics.
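One way that steps 804, 806, and 808 could be performed in parallel for multiple metrics is sketched below using a thread pool; the per-metric pipeline functions are trivial placeholders standing in for the real metric-to-bucket-to-degradation lookups.

```python
from concurrent.futures import ThreadPoolExecutor


# Placeholder per-metric pipelines: each takes the raw input and returns the
# predicted degradation for one model (the real versions would compute the
# metric value, find its bucket, and look up the stored degradation).
def degradation_from_length(text: str) -> float:
    return 0.05 * len(text.split())


def degradation_from_reading_time(text: str) -> float:
    return 0.02 * len(text.split())


METRIC_PIPELINES = [degradation_from_length, degradation_from_reading_time]


def predict_degradations_parallel(text: str) -> list[float]:
    """Run steps 804-808 for every metric concurrently rather than serially."""
    with ThreadPoolExecutor(max_workers=len(METRIC_PIPELINES)) as pool:
        return list(pool.map(lambda fn: fn(text), METRIC_PIPELINES))


if __name__ == "__main__":
    print(predict_degradations_parallel("Explain the differences between the two proposals."))
```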
[0082] On the other hand, if there are no more metrics, then the method 800 continues to step 812, where the routing application 146 selects, for each of the machine learning models, a largest relative output quality degradation from the relative output quality degradations determined for different metrics at step 808.
[0083] At step 814, the routing application 146 determines a least computationally expensive machine learning model in the set of machine learning models that satisfies a constraint on the relative output quality degradation for each of the metrics for which a value was computed at step 804. Any suitable constraint can be used in some embodiments, such as a threshold that the relative output quality degradation cannot exceed, a constraint required to satisfy an SLO (service level objective), and/or the like. In some other embodiments, rather than requiring the relative output quality degradation predicted from each of the metric values to satisfy a constraint, the routing application 146 could employ a more permissive rule, such as requiring the relative output quality degradation predicted from any of the metric values to satisfy a constraint, requiring an average (e.g., a weighted average) of the relative output quality degradations predicted from the metric values to satisfy a constraint, or the like.
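Steps 812 and 814 could be sketched as follows, assuming percent degradations and a simple upper-bound threshold as the constraint; the cost values, degradation values, and the fallback to the most expensive model when no model satisfies the constraint are illustrative assumptions, not requirements.

```python
# Hypothetical predicted degradations per model, one value per metric
# (e.g., length, noun count, reading time), plus relative compute costs.
PREDICTED = {
    "small":  {"cost": 1.0,  "degradations": [18.0, 12.0, 15.0]},
    "medium": {"cost": 3.0,  "degradations": [7.0, 5.0, 6.0]},
    "large":  {"cost": 10.0, "degradations": [0.0, 0.0, 0.0]},
}

MAX_DEGRADATION = 10.0  # assumed constraint, e.g., derived from an SLO


def select_model(predicted: dict, max_degradation: float) -> str:
    """Pick the least costly model whose worst-case predicted degradation
    across all metrics satisfies the constraint (steps 812 and 814)."""
    candidates = []
    for name, info in predicted.items():
        worst_case = max(info["degradations"])   # step 812
        if worst_case <= max_degradation:        # step 814 constraint
            candidates.append((info["cost"], name))
    # Illustrative fallback: use the most expensive model if nothing qualifies.
    if not candidates:
        return max(predicted, key=lambda n: predicted[n]["cost"])
    return min(candidates)[1]


if __name__ == "__main__":
    print(select_model(PREDICTED, MAX_DEGRADATION))  # -> "medium"
```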
[0084] At step 816, the routing application 146 routes the input for processing by the least computationally expensive machine learning model. The least computationally expensive machine learning model then generates an output that can be returned to a user, such as by displaying the output via a display device, or otherwise used by the routing application 146 or another application.
[0085] In sum, techniques are disclosed for routing inputs over machine learning models for processing. In some embodiments, a relative degradation application receives example inputs and computes one or more metric values for the example inputs. The relative degradation application determines, for each of the example inputs, a relative degradation between the quality of the output of each machine learning model in a set of machine learning models and the output of a most computationally expensive machine learning model in the set of machine learning models. Associations between the metric value(s) and the relative output quality degradations of the machine learning models are stored. Given an input, a routing application computes the same metric value(s) for the input and selects, based on the associations between metric value(s) and relative output quality degradations of the machine learning models, a least computationally expensive machine learning model from the set of machine learning models that is associated with a relative output quality degradation that satisfies a predefined constraint. Then, the application processes the input using the selected machine learning model to generate an output.
[0086] At least one technical advantage of the disclosed techniques relative to the prior art is that the disclosed techniques robustly and reliably balance the tradeoff between the size of a machine learning model and the accuracy of a machine learning model in a way that can be implemented for many, if not all, machine learning model applications. In this regard, the disclosed techniques route inputs to a first machine learning model within a set of machine learning models for execution, where the first machine learning model is the least computationally expensive machine learning model within the set of machine learning models having a predicted output quality degradation, relative to the output quality of the most computationally expensive machine learning model within the set of machine learning models, that satisfies a desired constraint. By using the least computationally expensive machine learning model, while maintaining a desired level of output quality, the disclosed techniques can save computational resources, computational time, and energy. In addition, the predicted relative output quality degradation between the least computationally expensive machine learning model and the most computationally expensive machine learning model within the set of machine learning models can be computed in an efficient manner using relatively simple metrics, including metrics specific to particular domains, while still being able to predict the relative output accuracy for different types of inputs that can be processed using the set of machine learning models. These technical advantages represent one or more technological improvements over prior art approaches.
[0087] 1. In some embodiments, a computer-implemented method for routing inputs to machine learning models for execution comprises computing one or more metric values based on an input, determining, for at least one trained machine learning model included in a plurality of trained machine learning models, a corresponding output quality degradation based on the one or more metric values, wherein the corresponding output quality degradation is relative to a most computationally expensive trained machine learning model included in the plurality of trained machine learning models, selecting a first trained machine learning model included in the plurality of trained machine learning models based on the corresponding output quality degradations, and transmitting the input to the first trained machine learning model for execution.
[0088] 2. The computer-implemented method of clause 1, wherein determining, for the at least one trained machine learning model included in the plurality of machine learning models, a corresponding output quality degradation comprises, for each trained machine learning model included in the at least one trained machine learning model, for each metric value included in the one or more metric values, determining a corresponding intermediate output quality degradation, and selecting a largest output quality degradation from the corresponding intermediate output quality degradations as the corresponding output quality degradation.
[0089] 3. The computer-implemented method of clauses 1 or 2, wherein for each metric value included in the one or more metric values, determining a corresponding intermediate output quality degradation comprises determining a bucket included in a plurality of buckets to which the metric value belongs, and determining the corresponding intermediate output quality degradation based on the bucket.
[0090] 4. The computer-implemented method of any of clauses 1-3, wherein the first trained machine learning model comprises a least computationally expensive trained machine learning model included in the plurality of machine learning models having a corresponding output quality degradation that satisfies a predefined condition.
[0091] 5. The computer-implemented method of any of clauses 1-4, wherein the one or more metric values include a count of a number of words included in the input.
[0092] 6. The computer-implemented method of any of clauses 1-5, wherein the one or more metric values include a count of a number of nouns included in the input.
[0093] 7. The computer-implemented method of any of clauses 1-6, wherein the one or more metric values include a reading time duration associated with the input.
[0094] 8. The computer-implemented method of any of clauses 1-7, wherein each trained machine learning model included in the plurality of trained machine learning models comprises a language model.
[0095] 9. The computer-implemented method of any of clauses 1-8, further comprising computing a plurality of additional metric values based on a plurality of example inputs, determining, for each trained machine learning model included in the one or more trained machine learning models, an additional corresponding output quality degradation for each example input included in the plurality of example inputs, and storing one or more associations between the plurality of additional metric values and the additional corresponding output quality degradations.
[0096] 10. The computer-implemented method of any of clauses 1-9, wherein determining, for each trained machine learning model included in the one or more trained machine learning models, an additional corresponding output quality degradation for each example input comprises computing a difference between an accuracy of the trained machine learning model for the example input and an accuracy of the most computationally expensive trained machine learning model included in the plurality of trained machine learning models for the example input.
[0097] 11. In some embodiments, one or more non-transitory computer-readable media store instructions that, when executed by at least one processor, cause the at least one processor to perform steps comprising computing one or more metric values based on an input, determining, for at least one trained machine learning model included in a plurality of trained machine learning models, a corresponding output quality degradation based on the one or more metric values, wherein the corresponding output quality degradation is relative to a most computationally expensive trained machine learning model included in the plurality of trained machine learning models, selecting a first trained machine learning model included in the plurality of trained machine learning models based on the corresponding output quality degradations, and transmitting the input to the first trained machine learning model for execution.
[0098] 12. The one or more non-transitory computer-readable media of clause 11, wherein determining, for one or more trained machine learning models included in the plurality of machine learning models, a corresponding output quality degradation comprises, for each trained machine learning model included in the one or more trained machine learning models, for each metric value included in the one or more metric values, determining a corresponding intermediate output quality degradation, and selecting a largest output quality degradation from the corresponding intermediate output quality degradations as the corresponding output quality degradation.
[0099] 13. The one or more non-transitory computer-readable media of clauses 11 or 12, wherein for each metric value included in the one or more metric values, determining a corresponding intermediate degradation comprises determining a bucket included in a plurality of buckets to which the metric value belongs, and determining the corresponding intermediate output quality degradation based on the bucket.
[0100] 14. The one or more non-transitory computer-readable media of any of clauses 11-13, wherein the plurality of buckets include a plurality of quantile buckets.
[0101] 15. The one or more non-transitory computer-readable media of any of clauses 11-14, wherein the first trained machine learning model comprises a least computationally expensive trained machine learning model included in the plurality of machine learning models having a corresponding output quality degradation that satisfies a predefined condition.
[0102] 16. The one or more non-transitory computer-readable media of any of clauses 11-15, wherein the one or more metric values include at least one of a count of a number of words included in the input, a count of a number of nouns included in the input, or a reading time duration associated with the input.
[0103] 17. The one or more non-transitory computer-readable media of any of clauses 11-16, wherein each trained machine learning model included in the plurality of trained machine learning models comprises a language model.
[0104] 18. The one or more non-transitory computer-readable media of any of clauses 11-17, wherein the instructions, when executed by the at least one processor, further cause the at least one processor to perform the steps of computing a plurality of additional metric values based on a plurality of example inputs, determining, for each trained machine learning model included in the one or more trained machine learning models, an additional corresponding output quality degradation for each example input included in the plurality of example inputs, and storing one or more associations between the plurality of additional metric values and the additional corresponding output quality degradations.
[0105] 19. The one or more non-transitory computer-readable media of any of clauses 11-18, wherein determining the one or more associations comprises determining, for each additional metric value included in the plurality of additional metric values, a bucket included in a plurality of buckets that the additional metric value belongs to, and storing an association between each bucket included in the plurality of buckets and an average of the additional corresponding output quality degradations that are determined for example inputs whose additional metric values belong to the bucket.
[0106] 20. In some embodiments, a system comprises one or more memories storing instructions, and one or more processors coupled to the one or more memories that, when executing the instructions, perform the steps of computing one or more metric values based on an input, determining, for at least one trained machine learning model included in a plurality of trained machine learning models, a corresponding output quality degradation based on the one or more metric values, wherein the corresponding output quality degradation is relative to a most computationally expensive trained machine learning model included in the plurality of trained machine learning models, selecting a first trained machine learning model included in the plurality of trained machine learning models based on the corresponding output quality degradations, and transmitting the input to the first trained machine learning model for execution.
[0107] Any and all combinations of any of the claim elements recited in any of the claims and/or any elements described in this application, in any fashion, fall within the contemplated scope of the present disclosure and protection.
[0108] The descriptions of the various embodiments have been presented for purposes of illustration, but are not intended to be exhaustive or limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments.
[0109] Aspects of the present embodiments may be embodied as a system, method or computer program product. Accordingly, aspects of the present disclosure may take the form of an entirely hardware embodiment, an entirely software embodiment (including firmware, resident software, micro-code, etc.) or an embodiment combining software and hardware aspects that may all generally be referred to herein as a “module” or “system.” Furthermore, aspects of the present disclosure may take the form of a computer program product embodied in one or more computer readable medium(s) having computer readable program code embodied thereon.
[0110] Any combination of one or more computer readable medium(s) may be utilized. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
[0111] Aspects of the present disclosure are described above with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the disclosure. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine. The instructions, when executed via the processor of the computer or other programmable data processing apparatus, enable the implementation of the functions/acts specified in the flowchart and/or block diagram block or blocks. Such processors may be, without limitation, general-purpose processors, special-purpose processors, application-specific processors, or field-programmable gate arrays.
[0112] The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
[0113] While the preceding is directed to embodiments of the present disclosure, other and further embodiments of the disclosure may be devised without departing from the basic scope thereof, and the scope thereof is determined by the claims that follow.