<storage system Info>
- System Hardware :<hardware>
- System Platform: <platform>
- Drive Information:<Drive>
- average_physical_space_usage=40(percentage)
<runtime performance Info>
- average_total_IOPS=60(K)
- average_throughput=250(MB/S)
- average_CPU_Util=70 (percentage)
- average_Latency=2(ms)
<IO pattern combination Info>
- average_IO_size =8(KB)
- Read/write ratio=95(Read percentage)
- Random/Sequential ratio=75(Random percentage)

It is to be appreciated, however, that different configurations of S_t and additional or alternative components can be used in other embodiments.

The action space will now be described. Thereinforcement learning agent401, as noted above, observes the current state S_t at each time step t and takes an action A_t. In some embodiments, the action A_t involves modifying a single throttling value (e.g., increasing or decreasing IOPS, or increasing or decreasing throughput) based at least in part on an IO throttling policy.

FIG.5 shows an example IO throttling policy in an illustrative embodiment. In this example, the IO throttling policy shown in table500 includes two IO parameters - IOPS and throughput, although additional or alternative parameters can be used in other embodiments. The table500 includes, for each such IO parameter, an associated state space, applicable increase/decrease values, and corresponding actions for that IO parameter.

The IO throttling policy illustrated inFIG.5 has a total of five possible actions, which are shown in table600 ofFIG.6. Additional or alternative actions and associated IO throttling policies may be used in other embodiments.

The reward space will now be described. A reward function R is defined to guide thereinforcement learning agent401 towards good solutions for a given objective. For example, the objective for the agent in some embodiments is to achieve the best possible storage system performance (e.g., minimizing latency and maximizing throughput) with throttling of minimal IO loads. The reward R_t+1 may thus be defined as:

\begin{array}{l} R_{t + 1} = W_{1} * (\frac{- L a t e n c y_{a v e r a g e} + L a t e n c y_{i n i t i a l}}{L a t e n c y_{i n i t i a l}}) + \\ W_{2} * (\frac{T h r o u g h p u t_{a v e r a g e} - T h r o u g h p u t_{i n i t i a l}}{T h r o u g h p u t_{i n i t i a l}}) \end{array}

where an initial performance of the storage system has latency given by Latency_initial and throughput given by Throughput_initial, and W₁ and W₂ denote weights applied to the respective latency and throughput parameters. Such weights can be adjusted depending upon the relative importance of latency and throughput within a given storage system implementation, and are illustratively set to 0.5 and 0.5 to represent an equal importance of these two example parameters. Also, additional or alternative key performance indicators (KPIs) or other parameters can be used to define the reward function in other embodiments.

As one possible example of a reward function that utilizes additional KPIs other than latency and throughput, the following reward function utilizes a combination of latency, throughput, CPU utilization and memory utilization, weighted by respective weights W₁, W₂, W₃ and W₄:

\begin{array}{l} R_{t + 1} = W_{1} * (\frac{- L a t e n c y_{a v e r a g e} + L a t e n c y_{i n i t i a l}}{L a t e n c y_{i n i t i a l}}) + \\ W_{2} * (\frac{T h r o u g h p u t_{a v e r a g e} - T h r o u g h p u t_{i n i t i a l}}{T h r o u g h p u t_{i n i t i a l}}) + \\ W_{3} * (\frac{- C P U_{a v e r a g e} + C P U_{i n i t i a l}}{C P U_{i n i t i a l}}) + \\ W_{4} * (\frac{- M e m o r y_{a v e r a g e} + M e m o r y_{i n i t i a l}}{M e m o r y_{i n i t i a l}}) \end{array}

Again, these are only example reward functions, and other types and configurations of reward functions can be used in other embodiments.

Thereinforcement learning agent401 tunes the IO throttling setting of the storage system utilizing the IO throttling policy and associated actions set forth inFIGS.5 and6. At time step t, Latency_average is the average latency of the storage system and Throughput_average is the average throughput of the storage system. In the example reward function, the lower the latency and the higher the throughput observed, compared with the initial system performance, the greater the reward that will be generated at time step t.

FIG.7 shows aninformation processing system700 in which astorage system702 interacts with a storage systemIO throttling agent704 that is implemented externally from thestorage system702. For example, the storage systemIO throttling agent704 may be implemented at least in part on one or more external servers of thesystem700 and/or on one or more host devices of thesystem700. In other embodiments, the storage systemIO throttling agent704 can be implemented internally to thestorage system702.

Thestorage system702 in this embodiment issues IO throttling requests to the storage systemIO throttling agent704, which utilizes a reinforcement learning framework of the type previously described to generate recommended IO throttling actions which are returned to thestorage system702 for execution. The storage systemIO throttling agent704 is illustratively implemented as an autonomous agent that automates storage system monitoring, learning and decision making in relation to IO throttling in order to achieve the best storage system performance. It may be deployed as a service accessible to thestorage system702, for example, via one or more external servers as noted above.

The storage systemIO throttling agent704 implements a number of functional modules which are utilized in implementing a reinforcement learning framework that generates the recommended IO throttling actions which are provided back to the requestingstorage system702. Such functional modules includestate collection module706,action selection module708,reward computation module710,experience module712,initial training module714 and IO throttlingaction recommendation module716.

Thestate collection module706 obtains a current state of thestorage system702 in conjunction with receipt of an IO throttling request. The state illustratively includes static and runtime information such as storage system information, runtime performance information and IO pattern combinations, as previously described.

Theaction selection module708 observes the current state (e.g., S_t) and provides a recommended IO throttling action A_t to thestorage system702.

Thereward computation module710 calculates the reward R_t+1 for performing action A_t selected for state S_t based on the specified storage system performance goal, which is illustratively achieving the best storage system performance (e.g., providing minimal IO latency and maximal IO throughput) while throttling minimal IO loads.

Theexperience module712 uses a reinforcement learning algorithm to update the experience according to the current state, action, reward and next state. The experience Q(S_i, A_i) is a mapping between the storage system environment states and actions that maximize a long-term reward. Such experience in some embodiments is also referred to herein as an “experience network.”

Theinitial training module714 gathers some initial IO throttling experience to build an initial experience model which can be leveraged directly for upcoming new IO throttling tasks. With theinitial training module714, the storage systemIO throttling agent704 can find the “good” IO patterns and combinations of IO patterns with fewer trials, since upcoming tasks can leverage existing learned experience. It should be noted that use of theinitial training module714 is optional, and may be deployed as an advanced service in some embodiments. Such an “optional” designation should not be viewed as an indication that other components in this and other embodiments are required.

The IO throttlingaction recommendation module716 illustratively sends a given recommended IO throttling action to thestorage system702 in response to a given IO throttling request received therefrom.

FIG.8 shows aprocess flow800 for the storage systemIO throttling agent704 to generate IO throttling action recommendations. The process flow800 starts as indicated atstep801, and includessteps803 through815 before ending atstep817.

Instep803, an IO throttling policy is customized for the particular storage system implementation. An example IO throttling policy was previously described in conjunction withFIG.5, but additional or alternative policies can be used, and can be individually customized for particular storage systems that are subjected to autonomous IO throttling using a reinforcement learning framework as disclosed herein.

Instep805, a determination is made as to whether or not an offline training service is enabled (e.g., whether the functionality of theinitial training module714 is enabled). If the offline training service is enabled, the process moves to step807, and otherwise moves to step811 as indicated.

Instep807, theinitial training module714 initiates performance of offline training.

Instep809, the offline training initiated instep807 is utilized to obtain some initial IO throttling experience, which is then used to guide online training to hit the system performance goals quicker (e.g., with fewer iterations). The offline training illustratively includes the following training steps:

T1. Thestate collection module706 monitors the storage system state, and once it detects a significant change in IOPS and throughput, it obtains an initial state S_t and the new state S_t+1 as previously described.

T2. Theaction selection module708 determines an action A_t based on the IO throttling policy and its associated set of available actions, as previously described in conjunction with the examples ofFIGS.5 and6.

T3. Thereward computation module710 calculates the reward R_t+1 in the manner previously described.

T4. Theexperience module712 utilizes a reinforcement learning algorithm and records of (S_t, A_t, R_t+1, S_t+1) to update IO throttling experience Q(S_i, A_i) in order to approximate an optimal IO throttling policy. Examples of reinforcement algorithms that can be used include but are not limited to Q-learning algorithms, DQN algorithms, DDQN algorithms, etc.

The records of (S_t, A_t, R_t+1, S_t+1) are examples of what are more generally referred to herein as “state-action records.” Other types and configurations of state-action records can be used in other embodiments. For example, in some embodiments, such records can include a reward R_t in place of or in addition to a reward R_t+1.

The experience Q(S_i, A_i) is an example action-value mapping which illustratively represents the long-term value of action A_i at any state S_i. The long-term value refers to the possibility of hitting the desired storage system performance goal in the future after taking action A_i, even if the goal is not achieved immediately after taking this action.

Referring now toFIG.9, an example action-value mapping for long-term values of actions is shown. This action-value mapping shows various actions that may be taken from astate S₁901. Atstate S₁901, after taking a first action A₁ astate S₂902 is reached. Fromstate S₂902, there is no possibility of hitting the performance goal (from the experience learned thus far). Thus, Q(S₁,A₁) = 0, which means the first action A₁ does not have long-term value. Atstate S₁901, after taking a second action A₂ astate S₃903 is reached. In state S₃903 the performance goal is not achieved, but upcoming actions starting from the state S₃903 do eventually lead to achieving the performance goal. Thus, the second action A₂ has value for the long term instead of the short term, and Q(S₁, A₂) = 2. Atstate S₁901, after taking a third action A₃ thestate S₄904 is reached where the performance goal is achieved immediately, and thus Q(S_1,A₃) = 10. The experience Q(S_i, A_i) will get more and more accurate with every training iteration. If enough training is performed, it will converge and represent a true Q-value.

Returning toFIG.8, instep811, the storage systemIO throttling agent704 receives an IO throttling request from thestorage system702. Such a request is also referred to herein as an “online” request, as it may be received from thestorage system702 while the system is experiencing conditions that appear to require IO throttling. For example, if thestorage system702 experiences at least a specified threshold amount of performance degradation, the online request can be triggered automatically.

Step811 may be performed following

steps

807 and809, or followingstep803 if the result of thestep805 determination is negative. The IO throttling request received from thestorage system702 illustratively includes information characterizing the current state S_t of thestorage system702, such as the above-described state information:

\{s t o r a g e_s y s t e m_i n f o_{t}, r u n t i m e_p e r f o r m a n c e_i n f o_{t}, I O_p a t t e r n_c o m b i n a t i o n_i n f o_{t}\},

although additional or alternative types of state information can be used in other embodiments. Such information can illustratively be extracted from the online request by thestate collection module706.

Instep813, the storage systemIO throttling agent704 adaptively reuses learned knowledge or experience to tune IO throttling to achieve the system performance goal.

In some embodiments, there are multiple distinct modes for adaptively reusing the experience. The modes include an exploitation mode, an exploration mode, and a mode that utilizes a combination of exploitation and exploration. Selection between the modes is illustratively controlled by an exploitation and exploration tradeoff parameter ε(t), which can take on values from 0 to 1, with a value of 0 indicating the exploitation mode, a value of 1 indicating the exploration mode, and values between 0 and 1 indicating different combinations of exploration and exploitation.

The value of the exploitation and exploration tradeoff parameter ε(t) is illustratively set at a given time step t, and varies over time. For example, it may decrease over time as more experience is obtained. At time step t, the storage systemIO throttling agent704 will with probability ε(t) select a random action from the action space, and otherwise selects the best action (e.g., with the highest Q(S_i, A_i) value) from the action space. Accordingly, after gaining enough experience, the storage systemIO throttling agent704 tends to leverage the learned experience via exploitation, while before having enough experience, the storage systemIO throttling agent704 tends to select random actions via exploration, where the value of ε(t) at time step t denotes the probability of selecting a random action for that time step.

The selected IO throttling action A_t for state S_t is provided to thestorage system702 as an IO throttling action recommendation for execution, and a corresponding record of (S_t, A_t, R_t+1 S_t+1) is determined for the iteration.

Instep815, theexperience module712 keeps using the reinforcement learning algorithm to record additional (S_t, A_t, R_t+1, S_t+1) records and to update Q(S_i, A_i). In this way, the learned experience keeps updating over time. Thus, over time better recommendations for IO throttling actions are provided which improve storage system performance. The process flow800 then ends instep817. For example, it can terminate responsive to the storage system obtaining an acceptable performance level relative to its performance goal, or upon reaching a specified maximum number of tuning attempts (e.g., three attempts). Such tuning attempts are considered examples of what are more generally referred to herein as “iterations” and different types and arrangements of iterations can be used in other embodiments.

Regardless of whether or not the performance goal is achieved in a given iteration, the additional experience obtained with each iteration will enhance the future decision-making ability of the storage systemIO throttling agent704.

FIG.10 shows a more detailed view of an example implementation of a portion of theFIG.8 process, the portion including atleast steps811 to815 of theFIG.8 process. TheFIG.10 flow diagram illustrates aprocess flow1000 that starts as indicated atstep1001, and includessteps1002 through1013 before ending atstep1014.

Instep1002, training is initialized, including initializing experience Q(S_i, A_i) and a maximum number of tuning attempts.

Instep1003, an online IO throttling request is received from thestorage system702.

Instep1004, the current state S_t of thestorage system702 is obtained.

Instep1005, a determination is made as to whether or not the state S_t exists in the experience Q(S_i,A_i). If the determination is affirmative, the process moves to step1006, and otherwise moves to step1007 as indicated.

Instep1006, which is reached if the state S_t exists in the experience Q(S_i,A_i), the exploitation and exploration tradeoff parameter ε(t) is set to a value between 0 and 1 that illustratively decreases over multiple throttling attempts. The process then moves to step1008 as indicated.

Instep1007, which is reached if the state S_t does not exist in the experience Q(S_i,A_i), the exploitation and exploration tradeoff parameter ε(t) is set to a value of 1, meaning that exploration will be performed by randomly selecting an action to take for the state S_t.

Instep1008, based on the state S_t, with probability ε(t), a random action is selected from the action space, and otherwise the best action, having the highest Q(S_i,A_i) observed thus far, is selected.

Instep1009, the selected IO throttling action A_t is provided to thestorage system702 for execution, and reward R_t+1 and next state S_t+1 are determined.

Instep1010, the reinforcement learning algorithm and records of (S_t, A_t, R_t+1, S_t+1) are used to update Q(S_i,A_i) in order to approximate the optimal IO throttling policy.

Instep1011, a determination is made as to whether or not an acceptable system performance in terms of a Quality of Service (QoS) level is obtained. If the determination is affirmative, the process ends atstep1014, and otherwise moves to step1012 as indicated.

Instep1012, a determination is made as to whether or not the specified maximum number of tuning attempts has been reached. If the determination is affirmative, the process ends atstep1014, and otherwise moves to step1013 as indicated.

Instep1013, the next state S_t+1 is set as the new current state S_t, and the process returns to step1005 for another tuning attempt. The process then proceeds throughsteps1005 through1011 or1012 as previously described.

As in other flow diagrams provided herein, the particular steps of the flow diagrams ofFIGS.8 and10 are presented in sequential order for clarity and simplicity of illustration only, and certain steps can at least partially overlap with other steps. Additional or alternative steps can be used in other embodiments.

It is also to be appreciated that the particular functionality, features and advantages described above and elsewhere herein are associated with particular illustrative embodiments and need not be present in other embodiments.

Illustrative embodiments of processing platforms utilized to implement functionality for storage system IO throttling utilizing a reinforcement learning framework will now be described in greater detail with reference toFIGS.11 and12. Although described in the context ofsystem100, these platforms may also be used to implement at least portions of other information processing systems in other embodiments.

FIG.11 shows an example processing platform comprisingcloud infrastructure1100. Thecloud infrastructure1100 comprises a combination of physical and virtual processing resources that may be utilized to implement at least a portion of theinformation processing system100 inFIG.1. Thecloud infrastructure1100 comprises multiple virtual machines (VMs) and/or container sets1102-1,1102-2, . . .1102-L implemented usingvirtualization infrastructure1104. Thevirtualization infrastructure1104 runs onphysical infrastructure1105, and illustratively comprises one or more hypervisors and/or operating system level virtualization infrastructure. The operating system level virtualization infrastructure illustratively comprises kernel control groups of a Linux operating system or other type of operating system.

Thecloud infrastructure1100 further comprises sets of applications1110-1,1110-2, . . .1110-L running on respective ones of the VMs/container sets1102-1,1102-2, . . .1102-L under the control of thevirtualization infrastructure1104. The VMs/container sets1102 may comprise respective VMs, respective sets of one or more containers, or respective sets of one or more containers running in VMs.

In some implementations of theFIG.11 embodiment, the VMs/container sets1102 comprise respective VMs implemented usingvirtualization infrastructure1104 that comprises at least one hypervisor. A hypervisor platform may be used to implement a hypervisor within thevirtualization infrastructure1104, where the hypervisor platform has an associated virtual infrastructure management system. The underlying physical machines may comprise one or more distributed processing platforms that include one or more storage systems.

In other implementations of theFIG.11 embodiment, the VMs/container sets1102 comprise respective containers implemented usingvirtualization infrastructure1104 that provides operating system level virtualization functionality, such as support for Docker containers running on bare metal hosts, or Docker containers running on VMs. The containers are illustratively implemented using respective kernel control groups of the operating system.

As is apparent from the above, one or more of the processing modules or other components ofsystem100 may each run on a computer, server, storage device or other processing platform element. A given such element may be viewed as an example of what is more generally referred to herein as a “processing device.” Thecloud infrastructure1100 shown inFIG.11 may represent at least a portion of one processing platform. Another example of such a processing platform is processingplatform1200 shown inFIG.12.

Theprocessing platform1200 in this embodiment comprises a portion ofsystem100 and includes a plurality of processing devices, denoted1202-1,1202-2,1202-3, . . .1202-K, which communicate with one another over anetwork1204.

Thenetwork1204 may comprise any type of network, including by way of example a global computer network such as the Internet, a WAN, a LAN, a satellite network, a telephone or cable network, a cellular network, a wireless network such as a WiFi or WiMAX network, or various portions or combinations of these and other types of networks.

The processing device1202-1 in theprocessing platform1200 comprises aprocessor1210 coupled to amemory1212.

Theprocessor1210 may comprise a microprocessor, a microcontroller, an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), a central processing unit (CPU), a graphical processing unit (GPU), a tensor processing unit (TPU), a video processing unit (VPU) or other type of processing circuitry, as well as portions or combinations of such circuitry elements.

Thememory1212 may comprise random access memory (RAM), read-only memory (ROM), flash memory or other types of memory, in any combination. Thememory1212 and other memories disclosed herein should be viewed as illustrative examples of what are more generally referred to as “processor-readable storage media” storing executable program code of one or more software programs.

Articles of manufacture comprising such processor-readable storage media are considered illustrative embodiments. A given such article of manufacture may comprise, for example, a storage array, a storage disk or an integrated circuit containing RAM, ROM, flash memory or other electronic memory, or any of a wide variety of other types of computer program products. The term “article of manufacture” as used herein should be understood to exclude transitory, propagating signals. Numerous other types of computer program products comprising processor-readable storage media can be used.

Also included in the processing device1202-1 isnetwork interface circuitry1214, which is used to interface the processing device with thenetwork1204 and other system components, and may comprise conventional transceivers.

Theother processing devices1202 of theprocessing platform1200 are assumed to be configured in a manner similar to that shown for processing device1202-1 in the figure.

Again, theparticular processing platform1200 shown in the figure is presented by way of example only, andsystem100 may include additional or alternative processing platforms, as well as numerous distinct processing platforms in any combination, with each such platform comprising one or more computers, servers, storage devices or other processing devices.

For example, other processing platforms used to implement illustrative embodiments can comprise converged infrastructure.

It should therefore be understood that in other embodiments different arrangements of additional or alternative elements may be used. At least a subset of these elements may be collectively implemented on a common processing platform, or each such element may be implemented on a separate processing platform.

As indicated previously, components of an information processing system as disclosed herein can be implemented at least in part in the form of one or more software programs stored in memory and executed by a processor of a processing device. For example, at least portions of the functionality for storage system IO throttling utilizing a reinforcement learning framework as disclosed herein are illustratively implemented in the form of software running on one or more processing devices.

It should again be emphasized that the above-described embodiments are presented for purposes of illustration only. Many variations and other alternative embodiments may be used. For example, the disclosed techniques are applicable to a wide variety of other types of information processing systems, host devices, storage systems, IO throttling actions, IO throttling policies, reinforcement learning frameworks, and additional or alternative components. Also, the particular configurations of system and device elements and associated processing operations illustratively shown in the drawings can be varied in other embodiments. Moreover, the various assumptions made above in the course of describing the illustrative embodiments should also be viewed as exemplary rather than as requirements or limitations of the disclosure. Numerous other alternative embodiments within the scope of the appended claims will be readily apparent to those skilled in the art.