Detailed Description
The present invention is described in detail below with reference to the accompanying drawings. Fig. 1 is a schematic flow chart of a cloud-computing-based full flash data storage optimization method according to an embodiment of the present invention; the method is described in detail below.
Step S110, collecting multidimensional performance data sets of all nodes in the full flash storage cluster in real time, wherein the multidimensional performance data sets comprise node storage space fragmentation rate, input and output request queue depth and network link bandwidth utilization rate.
In detail, for a data center of a certain enterprise, a full flash storage cluster may be deployed to store massive business data, such as financial data, customer information, sales records, and the like of the enterprise. The full flash cluster contains a plurality of nodes, each of which is responsible for storing and processing a portion of data.
In this scenario, the node storage space fragmentation rate is collected as follows. As the enterprise continuously performs data storage, deletion, and modification operations, the storage space in each node gradually becomes fragmented. For example, when the financial department frequently updates financial statement data, many small blocks of free space may be formed in the storage space instead of one continuous large block of free space; the node storage space fragmentation rate is obtained by calculating the proportion of these fragmented spaces.
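As a minimal sketch of this calculation (the extent sizes, the 64-block contiguity cutoff, and the function name are illustrative assumptions, not specifics of the embodiment), the fragmentation rate can be taken as the share of free space held in small, non-contiguous extents:

```python
def fragmentation_rate(free_extents, min_contiguous=64):
    """Fraction of free space held in extents smaller than `min_contiguous` blocks.

    `free_extents` lists the size (in blocks) of each free region on the node;
    the 64-block cutoff is an assumed tunable, not a value from the embodiment.
    """
    total_free = sum(free_extents)
    if total_free == 0:
        return 0.0
    fragmented = sum(size for size in free_extents if size < min_contiguous)
    return fragmented / total_free
```

For example, a node whose free space consists of two 10-block fragments plus one 100-block region would report a fragmentation rate of 20/120, roughly 16.7%.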
For the input/output request queue depth, assume the sales departments of the enterprise compile and report sales data at the end of each month, during which a large number of read/write requests are sent to the full flash storage cluster. Each node receives these requests and places them in its input/output request queue in order of arrival. The number of requests in the queue, i.e., the input/output request queue depth, is monitored in real time. For example, during peak sales data processing, the input/output request queue depth of a node may reach hundreds or even thousands of requests.
The collection of network link bandwidth utilization is related to the transmission of data between nodes. In detail, data may be shared between different departments of an enterprise, and when a customer service department obtains customer information from the storage cluster to provide services to customers, the data needs to be transmitted between nodes through network links. In this process, the ratio of the actual transmission data amount of the network link to the total bandwidth of the link in unit time can be monitored; this is the bandwidth utilization of the network link. For example, if the total bandwidth of the network link is 1000 Mbps and the actual transmission data amount at a certain time is 500 Mbps, the bandwidth utilization of the network link at that time is 50%.
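This ratio can be sketched as follows (the function name and byte-based sampling interface are assumptions; the 1000 Mbps / 500 Mbps figures mirror the example above):

```python
def bandwidth_utilization(bytes_transferred, window_seconds, link_capacity_mbps):
    """Ratio of actual transmission rate to total link bandwidth over a window."""
    # Convert the byte count in the window to an average rate in Mbit/s.
    rate_mbps = bytes_transferred * 8 / 1_000_000 / window_seconds
    return rate_mbps / link_capacity_mbps
```

With 62,500,000 bytes transferred in one second on a 1000 Mbps link (i.e., 500 Mbps), the function returns 0.5, matching the 50% figure above.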
Step S120, a storage operation request stream sent by a user terminal is received, and a to-be-stored data block identification set and a corresponding operation mode label in the storage operation request stream are analyzed.
Continuing with the above-described enterprise data center as an example, the user terminal may be a computer, a server, or another device used by employees of the various departments within the enterprise. Assume that the marketing department of the enterprise wants to store a new market research report into the full flash storage cluster. Marketing department personnel send a stream of storage operation requests to the storage cluster through specific software on the user terminal.
The storage operation request stream contains a set of identifiers of data blocks to be stored; for example, a market research report may be stored as a plurality of data blocks, each with a unique identifier. The request stream also carries a corresponding operation mode tag, which in this case is "write", indicating a store (write) operation. After the storage operation request stream is received, it is parsed, and the set of to-be-stored data block identifiers and the "write" operation mode label are accurately extracted so that the request can be processed according to this information.
Step S130, traversing a historical access record library based on the set of to-be-stored data block identifiers, and extracting the historical access timing characteristics of the data blocks to be stored and the physical storage location proximity of associated data blocks.
Still based on the enterprise data center scenario, consider the store operation for the market research report. Matching may be performed in the historical access record library according to each data block identifier in the set of to-be-stored data block identifiers. It is assumed that the historical access record library records all past accesses to the enterprise's data.
For the extraction of the historical access timing characteristics, take one of the data blocks as an example: if the past access records show that it is frequently accessed at 9 to 10 am every workday, these accesses constitute an access timestamp sequence. Based on this sequence, a periodic access interval distribution parameter can be calculated; for example, an access peak is found on average every 7 days, and this is the periodic access interval distribution parameter. Meanwhile, if there is a sudden surge of accesses during a particular event (such as a new product release by the company), the timestamp set of that burst access event is recorded.
For the physical storage location proximity of the associated data blocks, the identifiers of data blocks that have been accessed together with the data block to be stored more than a synergy threshold number of times (say 10) within a preset time window (e.g., the past month) are selected from the subset of associated data block identifiers. For example, some market analysis data blocks related to the market research report have been accessed together with it 15 times in the past month. The physical storage coordinate set of these associated data block identifiers in the full flash storage cluster is then acquired. Suppose a data block of the market research report is stored at logical unit address 100 on node A, and associated market analysis data blocks are stored at logical unit address 105 on node A and logical unit address 200 on node B. Based on the topological connection relation between the storage node identifiers (node A and node B are connected through a high-speed network) and the distance differences between the logical unit addresses (addresses 100 and 105 are close to each other and relatively far from 200), a physical storage location proximity matrix of the associated data blocks is generated.
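One way this matrix could be assembled is sketched below, under stated assumptions: the 256-address normalization window, the [0, 1] proximity scale, and the topology link weights are all illustrative choices, not values from the embodiment.

```python
def proximity(coord_a, coord_b, topology, max_lua_gap=256):
    """Proximity in [0, 1] between two (node_id, logical_unit_address) coordinates.

    Same-node closeness decays with the logical-unit-address gap; cross-node
    closeness is additionally scaled by an assumed topology link weight.
    """
    node_a, lua_a = coord_a
    node_b, lua_b = coord_b
    lua_score = max(0.0, 1.0 - abs(lua_a - lua_b) / max_lua_gap)
    if node_a == node_b:
        return lua_score
    # Link weight models the topological connection (e.g., high-speed network).
    link_weight = topology.get((node_a, node_b), topology.get((node_b, node_a), 0.0))
    return link_weight * lua_score

def proximity_matrix(coords, topology):
    """Pairwise physical-storage-location proximity matrix for associated blocks."""
    return [[proximity(a, b, topology) for b in coords] for a in coords]
```

With the example coordinates above — (A, 100), (A, 105), (B, 200) and an assumed A-B link weight of 0.8 — the (A, 100)/(A, 105) pair scores about 0.98, well above the (A, 100)/(B, 200) pair.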
Step S140, constructing a dynamic access heat prediction model of the data block to be stored in a preset time period according to the multidimensional performance data set and the historical access timing characteristics.
Still taking the above scenario as an example, a periodic access peak interval (e.g., 9 to 10 am every workday), a burst access event timestamp (e.g., the access peak during the new product release), and an access interval distribution statistic (e.g., an access peak every 7 days on average) are extracted from the historical access timing characteristics.
Assuming the operation mode label corresponding to the market research report is "write", some of the data contained in the report may later be read frequently for adjusting market strategy. The read/write operation proportion corresponding to the operation mode label is identified, and the hot spot data prediction mark is activated when the read operation proportion exceeds a preset threshold (assume 60%). For example, after analysis it is found that the market share analysis portion of the market research report has an expected read proportion of 80%, exceeding the 60% preset threshold, so the hot spot data prediction mark is activated.
And generating a time dimension access probability density function by combining the periodic access peak interval and the sudden access event time stamp. For example, if the access probability at 9 to 10 am is 0.3, the access probability corresponding to the sudden access event during the release of the new product is 0.2, and the time-dimension access probability density function is constructed based on these data.
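A simplified, hour-granular sketch of such a function follows; the 0.3 peak probability and 0.2 burst probability mirror the numbers above, while the 0.05 baseline, the additive burst combination, and the function name are assumptions.

```python
def access_pdf(hours, peak_hours, burst_hours,
               base_p=0.05, peak_p=0.3, burst_p=0.2):
    """Per-hour access probability combining periodic peaks and burst events.

    `peak_hours` marks the periodic access peak interval; `burst_hours` marks
    burst access event windows. Probabilities are capped at 1.0.
    """
    pdf = {}
    for h in hours:
        p = peak_p if h in peak_hours else base_p
        if h in burst_hours:
            p += burst_p
        pdf[h] = min(p, 1.0)
    return pdf
```

For instance, marking hour 9 as the periodic peak and hour 14 as a burst window yields 0.3 at 9 am and 0.25 (baseline plus burst) at 2 pm.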
And correcting the weight parameter of the time dimension access probability density function based on the access interval distribution statistic value and the hot spot data predictive marker. The access interval distribution statistics are assumed to indicate that the access probability gradually decreases between two access peaks, and the weight parameters of the time-dimension access probability density function are adjusted according to the situation.
The corrected time dimension access probability density function is then coupled with the network link bandwidth utilization in the multidimensional performance data set. For example, a bandwidth fluctuation data set of network link bandwidth utilization over a plurality of historical time windows is extracted from the multidimensional performance data set; assume a bandwidth utilization peak interval of 80% from 9 am to 10 am on weekdays and an average transmission rate of 500 Mbps. The bandwidth utilization peak interval is aligned and mapped onto the time axis of the time dimension access probability density function to generate a time-synchronized bandwidth utilization distribution sequence. Based on the average transmission rate, a bandwidth weight factor of the time dimension access probability density function at each time unit is calculated. The access probability value in the corresponding time unit is then weighted and corrected according to the bandwidth weight factor, generating a bandwidth-aware access probability density distribution function. This function is normalized in the time dimension, and the heat level parameter of the dynamic access heat prediction model is generated based on the normalized maximum probability value and the slope change of the probability distribution curve; the heat level parameter characterizes the access heat of the market research report data blocks in a future preset time period.
Step S150, generating a cross-node sliced storage topological graph and a cache preloading strategy matrix based on the dynamic access heat prediction model and the physical storage location proximity.
In detail, for the storage of the market research report data blocks, a slicing redundancy threshold and a minimum number of copies of the data block to be stored are determined according to the heat level parameter output by the dynamic access heat prediction model. Assuming a higher heat level, the slicing redundancy threshold is determined to be 3 (indicating that a data block can be stored in up to 3 slices), and the minimum number of copies is 2 (at least 2 copies per slice).
The storage space fragmentation rate and input/output request queue depth of all nodes in the full flash storage cluster are traversed, and a candidate node subset meeting the slicing capacity constraint is screened out. For example, node A may be selected into the candidate node subset if its storage space fragmentation rate is low, its input/output request queue depth is within an acceptable range, and it has sufficient space to store slices of the market research report data blocks.
Based on the physical storage location proximity, a storage location association score is calculated for each node in the candidate node subset. Assume node A is closer to the associated data block storage locations, so its storage location association score is higher. A storage path weight table among the nodes is then constructed according to the storage location association scores and the network link bandwidth utilization in the multidimensional performance data set. For example, if the node A to node B network link bandwidth utilization is high and the storage location association score is high, then the storage path weight value between them is high.
Based on the storage path weight table and the minimum number of copies, a sliced storage topological graph including redundant path cross-connections is generated. For example, according to the inter-node storage path weight values in the storage path weight table, a candidate storage path set whose weight values exceed a preset weight threshold (assumed to be 0.6) is screened out. Based on the minimum number of copies, it is determined that the market research report data block has 2 sliced copies, and an initial storage path is allocated to each sliced copy. For the first sliced copy, the storage path weight values in the candidate storage path set are traversed; the node A to node B path is found to have the highest weight value, so it is taken as the default storage path of that sliced copy. The inter-node connection state of the default storage path is then detected; if only one connection path from node A to node B exists, there is a single-point failure risk, so a standby storage path with the next-highest weight value (assume the node A to node C path) is selected from the candidate storage path set as a cross redundant path. The cross redundant path and the default storage path are bidirectionally connected to generate a sliced storage topological graph containing redundant path cross-connections. Finally, whether the path redundancy of each sliced copy in the sliced storage topological graph meets the redundancy constraint corresponding to the minimum number of copies is verified; if not, the standby storage path is reselected and the cross-connection relation of the sliced storage topological graph is updated.
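The path-selection part of this step could be sketched as follows; the round-robin assignment across copies, the tuple-keyed weight table, and the function name are assumptions layered on the 0.6 threshold and default-plus-backup scheme described above.

```python
def plan_shard_paths(path_weights, num_copies, threshold=0.6):
    """Assign each sliced copy a default storage path plus a cross redundant backup.

    `path_weights` maps an inter-node path (a tuple of node ids) to its storage
    path weight value. Paths at or below the threshold are discarded; the
    highest-weight survivor becomes the default route and the next-highest
    serves as the backup. Round-robin assignment is an assumed simplification.
    """
    ranked = sorted(((w, p) for p, w in path_weights.items() if w > threshold),
                    reverse=True)
    if len(ranked) < 2:
        raise ValueError("redundancy unsatisfiable: need >= 2 candidate paths")
    paths = [p for _, p in ranked]
    return [(paths[i % len(paths)], paths[(i + 1) % len(paths)])
            for i in range(num_copies)]
```

With the example weights — A-B at 0.9, A-C at 0.8, A-D at 0.5 — the A-D path is screened out, the first copy takes A-B with A-C as its cross redundant path, and the second copy takes the reverse pairing.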
A cache preloading strategy matrix is generated based on the dynamic access heat prediction model and the physical storage location proximity. A sequence of target data blocks spatially associated with the market research report data blocks is identified based on the physical storage location proximity, such as the market analysis data blocks mentioned earlier. Based on the dynamic access heat prediction model, the concurrent access probability distribution of the target data block sequence within a preset time period is predicted; assume that 60% of the market analysis data blocks will be accessed simultaneously with the market research report data blocks within the next week. The preloading priority coefficient of each target data block is calculated according to the concurrent access probability distribution and the node storage space fragmentation rate in the multidimensional performance data set; the lower the node storage space fragmentation rate, the higher the preloading priority coefficient may be. Based on the preloading priority coefficient, differentiated cache retention periods and compression grade parameters are allocated to the target data block sequence. For example, for market analysis data blocks with high preloading priority coefficients, the current cache space occupancy and the historical cache replacement frequency of the edge nodes in the full flash storage cluster are obtained. When the concurrent access probability is higher than a first preset threshold (assume 50%), a fixed retention period (e.g., 3 days) is allocated to the corresponding data block and its cache space is locked. When the concurrent access probability is lower than a second preset threshold (assume 30%), the retention period decay rate is dynamically adjusted based on the historical cache replacement frequency.
And performing variable compression rate processing on the low-priority data blocks according to the compression grade parameters, and recording metadata verification information of the compressed data blocks.
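The priority and retention rules above can be sketched as follows; the multiplicative priority formula, the 0.2/day base decay, and the replacement-frequency scaling are assumptions, while the 50%/30% thresholds and 3-day lock come from the example.

```python
def preload_priority(concurrent_access_prob, node_fragmentation_rate):
    """Higher concurrent-access probability and lower node fragmentation give a
    higher preloading priority coefficient (multiplicative form is assumed)."""
    return concurrent_access_prob * (1.0 - node_fragmentation_rate)

def retention_policy(concurrent_access_prob, replace_freq=0.0,
                     hi_threshold=0.5, lo_threshold=0.3,
                     fixed_days=3, base_decay=0.2):
    """Cache retention rule per the thresholds above (50% / 30%, 3-day lock)."""
    if concurrent_access_prob > hi_threshold:
        # High concurrency: fixed retention period with locked cache space.
        return {"mode": "locked", "days": fixed_days}
    if concurrent_access_prob < lo_threshold:
        # Low concurrency: decay faster on nodes whose cache is replaced
        # more often (assumed linear scaling).
        return {"mode": "decaying", "decay_rate": base_decay * (1.0 + replace_freq)}
    return {"mode": "default"}
```

For instance, a 60% concurrent access probability on a node with 10% fragmentation yields a priority of 0.54 and a 3-day locked retention period.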
Based on the above steps, after the multidimensional performance data sets of all nodes in the full flash storage cluster are collected in real time, a storage operation request stream from a user terminal is received, and the set of to-be-stored data block identifiers and the operation mode label are parsed from it. The historical access record library is traversed based on the set of to-be-stored data block identifiers, and the historical access timing characteristics and the physical storage location proximity of associated data blocks are extracted, deeply mining the historical access characteristics of the data blocks and their spatial relations with surrounding data blocks. Furthermore, a dynamic access heat prediction model of the data blocks to be stored within a preset time period is constructed according to the multidimensional performance data set and the historical access timing characteristics. This model fuses real-time performance data with historical access patterns, can dynamically and accurately predict the access heat of data blocks in a future period, better adapts to the variability and complexity of data access in a cloud computing environment, and effectively improves the ability to anticipate data access trends. Finally, a cross-node sliced storage topological graph and a cache preloading strategy matrix are generated based on the dynamic access heat prediction model and the physical storage location proximity, realizing optimal allocation and efficient utilization of storage resources.
The cross-node fragmented storage topological graph considers the access heat and the physical storage position relation of the data blocks, reasonably plans the storage distribution of the data among different nodes, effectively balances the load pressure of each node, reduces the fragmentation degree of the storage space and improves the read-write performance of the whole storage cluster. Meanwhile, the cache preloading strategy matrix loads the data blocks which are likely to be frequently accessed into the cache in advance according to the predicted access heat, so that the waiting time of data access is greatly reduced, and the response speed and the user experience of the system are remarkably improved.
For example, in one possible implementation, step S130 includes:
Step S131, according to each data block identifier in the set of data block identifiers to be stored, matching a corresponding history access record in the history access record library, where the history access record includes an access timestamp set and an associated data block identifier subset.
In the enterprise data center scenario, during the storage of the market research report data blocks, the corresponding historical access records are first matched in the historical access record library according to each data block identifier of the market research report. Since the historical access record library records in detail the past access conditions of all of the enterprise's data, the records corresponding to the market research report data blocks can be accurately found. These historical access records contain an access timestamp set and an associated data block identifier subset. For example, for a particular data block in the market research report, the access timestamp set shows the specific time of each past access; after the enterprise launched a large marketing campaign, it may include a series of time points such as 10:15 on March 1, 2023 and 14:30 on March 5, 2023. The associated data block identifier subset contains the identifiers of the other data blocks that tend to be involved when this data block is accessed.
Step S132, extracting an access time stamp sequence in the history access record, and calculating a periodic access interval distribution parameter and a time stamp set of the sudden access event based on the access time stamp sequence.
For example, the access timestamp sequence of the above data block of the market research report includes 10:15 on March 1, 2023, 14:30 on March 5, 2023, and so on. A periodic access interval distribution parameter is calculated based on this sequence. The calculation process is as follows: the time interval between adjacent access timestamps is computed; for example, the interval from 10:15 on March 1, 2023 to 14:30 on March 5, 2023 is 4 days, 4 hours and 15 minutes, and this is done for all pairs of adjacent access timestamps. These intervals are then analyzed and the frequency of occurrence of each interval is counted; if the interval of 4 days, 4 hours and 15 minutes occurs most frequently, say 3 times, it is taken as an important periodic access interval reference. By counting and analyzing all the time intervals in this way, the periodic access interval distribution parameter is obtained, which reflects the periodic pattern of the data block being accessed over different time periods. In this process, the timestamp set of burst access events can also be determined; for example, when the enterprise performs its annual financial review, a large amount of data in the market research report needs to be consulted, so there is a large volume of concentrated access from April 1, 2023 to April 5, 2023, and the timestamps of this period form the timestamp set of the burst access event.
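Taking the mode of the adjacent-timestamp gaps, as described above, could be sketched like this (treating the most frequent gap as the dominant period is an assumed simplification of the full distribution parameter):

```python
from collections import Counter
from datetime import datetime, timedelta

def periodic_interval(timestamps):
    """Most frequent gap between consecutive access timestamps.

    Sorts the access timestamp sequence, computes all adjacent gaps, and
    returns the gap that occurs most often, as a timedelta.
    """
    ts = sorted(timestamps)
    gaps = [later - earlier for earlier, later in zip(ts, ts[1:])]
    return Counter(gaps).most_common(1)[0][0]
```

With accesses at 10:15 on March 1, 2023 and repeated 4-day-4-hour-15-minute gaps thereafter, the function recovers that interval as the dominant period.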
Step S133, screening out, from the associated data block identifier subset, the set of associated data block identifiers that have been accessed together with the data block to be stored more than a synergy threshold number of times within a preset time window.
Assume the preset time window is the past 6 months and the synergy threshold is set to 15. The co-access count corresponding to each identifier in the associated data block identifier subset is checked; for example, a market analysis data block has been accessed together with the market research report data block 20 times in the past 6 months, exceeding the synergy threshold of 15, so it is screened out, and the screened data block identifiers form the associated data block identifier set.
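This screening is a simple threshold filter; a sketch follows (the dict-of-counts input shape and function name are assumptions, the strict "more than the threshold" comparison follows the step's wording):

```python
def filter_associated(co_access_counts, synergy_threshold=15):
    """Keep identifiers whose co-access count within the window exceeds the threshold."""
    return {blk for blk, count in co_access_counts.items()
            if count > synergy_threshold}
```

A block co-accessed 20 times passes the 15-time threshold; one co-accessed exactly 15 times does not.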
Step S134, obtaining a physical storage coordinate set of the associated data block identifier set in the full flash storage cluster, where the physical storage coordinate set includes a storage node identifier and a logical unit address.
For example, for the filtered market analysis data block, its storage node stored in the full flash storage cluster is identified as node a, and the logical unit address is 105. There may be other associated data blocks stored at the node B, logical unit address 200, etc., which storage node identities and logical unit addresses constitute a set of physical storage coordinates.
Step S135, generating a physical storage position proximity matrix of the associated data block based on the topological connection relation between storage node identifiers in the physical storage coordinate set and the distance difference value between the logic unit addresses.
For example, regarding the topological connection between storage node identifiers, if node A and node B are connected by a high-speed network with low network delay, the topological connection between them is tighter. Regarding the distance difference between logical unit addresses, the distance between logical unit address 105 and logical unit address 110 on node A is relatively small, while the distance to logical unit address 200 on node B is relatively large. The relation between each pair of associated data blocks is quantitatively evaluated by comprehensively considering these factors, thereby constructing the physical storage location proximity matrix of the associated data blocks. This matrix accurately reflects the proximity of the associated data blocks in physical storage location and provides an important basis for subsequent data storage and management strategies.
In one possible implementation, step S140 includes:
step S141, extracting periodic access peak intervals, sudden access event time stamps and access interval distribution statistic values from the historical access time sequence characteristics.
In detail, regarding the periodic access peak interval: reviewing the historical access records, the data block is found to have a large access volume between 9 and 10 am on every workday, and this period is the periodic access peak interval. Regarding burst access event timestamps: for example, while the enterprise adjusted its quarterly sales strategy, the market research report data blocks received a large number of burst accesses from June 15, 2023 to June 20, 2023, and these time points are the burst access event timestamps. For the access interval distribution statistic, the time interval between successive accesses is counted, e.g., the first and second accesses are 3 days apart, the second and third accesses are 5 days apart, and so on; the access interval distribution statistic is obtained by statistically analyzing a large amount of such interval data.
Step S142, the read-write operation proportion corresponding to the operation mode label is identified, and when the read operation proportion in the read-write operation proportion exceeds a preset threshold value, the hot spot data prediction mark is activated.
In detail, the market research report data block carries an operation mode tag; assume analysis shows that 70% of its operations are read operations. The preset threshold is set to 60%; since the 70% read proportion exceeds the 60% threshold, the hot spot data prediction mark is activated.
Step S143, generating a time dimension access probability density function in combination with the periodic access peak interval and the burst access event time stamp.
To illustrate with a simple calculation, assume the access probability in the periodic access peak interval (9 to 10 am on workdays) is initially set to 0.3, and the access probability in the June 15 to June 20, 2023 window corresponding to the burst access event timestamps is set to 0.2. For other time periods, corresponding probability values are set according to factors such as historical access frequency, thereby constructing a preliminary time dimension access probability density function.
And step S144, correcting the weight parameter of the time dimension access probability density function based on the access interval distribution statistical value and the hot spot data prediction mark.
Assume the access interval distribution statistic indicates that the access probability gradually decreases between access peaks, e.g., by 0.05 per day in the period after a periodic access peak interval. Because the hot spot data prediction mark is activated, indicating that the data block is likely hot spot data, the weight parameter needs to be adjusted so that the post-peak decay of the access probability slows; for example, the original decay rate of 0.05 per day is adjusted to 0.03 per day, thereby correcting the weight parameter of the time dimension access probability density function.
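The correction above reduces to damping the decay rate when the flag is set; a sketch follows, where the 0.6 damping factor is an assumption chosen only to reproduce the worked numbers (0.05/day becoming 0.03/day):

```python
def corrected_decay(base_decay_per_day, hotspot_flag, damping=0.6):
    """Slow the post-peak probability decay when the hot-spot prediction flag
    is active; the 0.6 damping factor is an illustrative assumption."""
    return base_decay_per_day * damping if hotspot_flag else base_decay_per_day
```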
And step S145, performing coupling calculation on the corrected time dimension access probability density function and the network link bandwidth utilization rate in the multi-dimensional performance data set to generate the dynamic access heat prediction model, and outputting the heat level parameter of the data block to be stored by the dynamic access heat prediction model.
For example, step S145 includes:
step S1451, extracting a bandwidth fluctuation data set of the network link bandwidth utilization within a plurality of historical time windows from the multi-dimensional performance data set, wherein the bandwidth fluctuation data set comprises a bandwidth utilization peak interval and an average transmission rate.
Step S1452, performing alignment mapping on the bandwidth utilization peak interval in the bandwidth fluctuation dataset and the time axis of the time dimension access probability density function, and generating a time-synchronized bandwidth utilization distribution sequence.
Step S1453, calculating a bandwidth weight factor of the time dimension access probability density function at each time unit based on the average transmission rate in the bandwidth utilization distribution sequence.
Step S1454, performing weighted correction on the access probability value of the time dimension access probability density function in the corresponding time unit according to the bandwidth weight factor, so as to generate a bandwidth-aware access probability density distribution function.
Step S1455, performing a time dimension normalization process on the bandwidth-aware access probability density distribution function, and generating a heat level parameter of the dynamic access heat prediction model based on the maximum probability value of the normalized bandwidth-aware access probability density distribution function and the slope change of the probability distribution curve.
For example, the past 10 workdays are selected as the historical time windows, and the network link bandwidth utilization of each workday is counted. The bandwidth utilization peak interval may occur from 10 to 11 am, with a peak of 80% and an average transmission rate of 500 Mbps. The bandwidth utilization peak interval in the bandwidth fluctuation data set is aligned and mapped onto the time axis of the time dimension access probability density function to generate a time-synchronized bandwidth utilization distribution sequence. Assume the time unit of the time dimension access probability density function is one hour, the access probability from 9 to 10 am is 0.3, and the bandwidth utilization peak from 10 to 11 am is 80%. The bandwidth weight factor of the time dimension access probability density function at each time unit is calculated as follows: first, the degree to which the relation between the average transmission rate and the bandwidth utilization peak influences the access probability is determined. If the average transmission rate is higher and the bandwidth utilization peak is higher, the data transmission demand is large and network resources are fully utilized during this period, so the bandwidth weight factor of this time unit should be larger. For example, in a simple calculation, if the average transmission rate is 500 Mbps, the bandwidth utilization peak is 80%, and the base weight is set to 1, then the bandwidth weight factor calculated from these two values may be 1.2 (this is just an example; the actual calculation would follow more complex logic).
The access probability value of the time dimension access probability density function in the corresponding time unit is then weighted and corrected by the bandwidth weight factor: the original access probability from 9 am to 10 am is 0.3, and after the weighted correction it becomes 0.3 × 1.2 = 0.36, thereby generating the bandwidth-aware access probability density distribution function.
Further, the normalization process adjusts the access probability values of all time units to a specific range, for example between 0 and 1. The calculation proceeds as follows: first, the maximum probability value in the bandwidth-aware access probability density distribution function is found, assumed to be 0.4. The access probability value of each time unit is then divided by this maximum; for example, if the original access probability of a certain time unit is 0.2, the normalized value becomes 0.2 / 0.4 = 0.5. The heat level parameter of the dynamic access heat prediction model is generated based on the maximum probability value of the normalized bandwidth-aware access probability density distribution function and the slope change of the probability distribution curve. If the maximum probability value is close to 1 and the slope of the probability distribution curve is steep, indicating that the data block has high access heat and significant heat change in certain time periods, the heat level parameter may be set to a higher level, such as level 3 (assuming the heat level is classified into levels 1 to 5); if the maximum probability value is small and the slope change is gentle, the heat level parameter may be set to a lower level, such as level 1. The heat level parameter accurately reflects the access heat of the market research report data block within the preset time period and provides an important basis for the subsequent data storage management strategy.
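The weighting, normalization and level-mapping steps above can be sketched in Python as follows. The weight formula and the level-scoring weights are illustrative assumptions (the text itself notes the real calculation follows more complex logic); only the worked numbers (500 Mbps, 80% peak, factor 1.2, normalization by the maximum) come from the example.

```python
def bandwidth_weight(avg_rate_mbps, peak_util, base=1.0):
    """Bandwidth weight factor for one time unit (illustrative formula:
    500 Mbps at 80% peak utilization yields 1.2 as in the example)."""
    return base + 0.2 * (avg_rate_mbps / 500.0) * (peak_util / 0.8)


def bandwidth_aware_pdf(access_probs, weights):
    """Weight each time unit's access probability by its bandwidth factor."""
    return [p * w for p, w in zip(access_probs, weights)]


def heat_level(bw_probs, levels=5):
    """Derive a heat level (1..levels) from the normalized peak and the
    steepest slope change of the probability curve (scoring is assumed)."""
    peak = max(bw_probs)
    norm = [p / peak for p in bw_probs]                 # values now in [0, 1]
    slopes = [abs(norm[i + 1] - norm[i]) for i in range(len(norm) - 1)]
    steepness = max(slopes, default=0.0)
    score = 0.6 * peak + 0.4 * steepness                # illustrative blend
    return max(1, min(levels, round(score * levels)))
```

A flat, low-probability curve thus lands near level 1, while a high, sharply peaked curve lands near the top of the scale.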
In a possible embodiment, after step S143, the method further includes:
Step S210, dividing a plurality of data access areas according to the node topology structure of the full flash storage cluster, and configuring an independent access frequency monitor for each area.
The node topology of a full flash storage cluster may be a complex network structure consisting of multiple storage nodes. For example, nodes storing enterprise financial data are divided into one area, nodes storing sales data into another, and nodes storing the data blocks related to market research reports into a specific area. Each such area is equipped with an independent access frequency monitor that can accurately count the accesses to that area.
And step S220, collecting actual access times distribution data of each area in a historical time window through the access frequency monitor.
Taking the area where the market research report data block is located as an example, the historical time window is set to the past month, during which the access frequency monitor records the actual access counts of the area for each day or each specific time period. For example, in the first few days of the month, when each department of the enterprise is still drawing up its working plan, the market research report data block is accessed rarely, perhaps only 10 times per day; during the mid-month adjustment period of the marketing plan, the access count rises to 30 times per day; and near the end of the month, during the monthly summary, it changes again to about 20 times per day. In this way, the actual access times distribution data of the area in the historical time window is obtained.
And step S230, carrying out fitting verification on the actual access times distribution data and the time dimension access probability density function, and adjusting the curve form of the probability density function.
To illustrate with a simple calculation, assume the access probability set by the time dimension access probability density function for the beginning of the month is 0.1, while the actual access times distribution data show that accesses at the beginning of the month were few. The difference between the actual number of accesses and the number predicted from the probability density function is calculated: for example, 15 accesses per day were predicted for the beginning of the month, whereas only 10 actually occurred, giving a difference of 5. This difference calculation is performed for each time period within the entire historical time window. A large difference in a given time period indicates that the probability density function predicted that period inaccurately. For example, for the mid-month marketing-plan adjustment period, the access count corresponding to the predicted access probability is 25 while 30 accesses actually occurred, again a difference of 5. The curve morphology of the probability density function is then adjusted based on these differences: if the actual access count of a time period exceeds the predicted count, the access probability of that period is raised appropriately, so the curve is adjusted upward in that period; otherwise, it is adjusted downward.
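The fitting verification and curve adjustment can be sketched as follows. The learning rate `lr` and the clamping at zero are assumptions introduced for the sketch; the per-period difference (actual minus predicted count) and the up/down adjustment direction follow the text.

```python
def fit_and_adjust(pdf, actual_counts, total_accesses, lr=0.5):
    """Nudge each period's access probability toward the observed counts.

    pdf: predicted access probability per period; actual_counts: observed
    accesses per period; total_accesses: expected accesses over the window.
    """
    adjusted = []
    for prob, actual in zip(pdf, actual_counts):
        predicted = prob * total_accesses   # e.g. 0.1 * 150 = 15 accesses/day
        diff = actual - predicted           # positive: under-predicted period
        adjusted.append(max(0.0, prob + lr * diff / total_accesses))
    return adjusted
```

Periods where the model under-predicted are raised, over-predicted periods are lowered, and accurately predicted periods are left unchanged.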
Step S240, updating the heat level parameter calculation rule of the dynamic access heat prediction model based on the adjusted probability density function.
Before the probability density function is adjusted, the heat level parameter calculation rule may be based on certain characteristic values of the original function, such as the maximum probability value and the slope of the probability distribution curve. For example, the original rule might state that the heat level parameter is level 3 when the maximum probability value is greater than 0.3 and the slope is greater than a certain value. After the probability density function is adjusted, these characteristic values change: new values are determined from the adjusted function, for instance the maximum probability value becomes 0.35 and the slope changes accordingly. The heat level parameter calculation rule is updated based on these new characteristic values; for example, the new rule may state that the heat level parameter is level 3 when the maximum probability value is greater than 0.35 and the slope satisfies the new condition. In this way, the dynamic access heat prediction model more accurately reflects the access heat of the market research report data block under different conditions and provides a more accurate basis for operations such as data storage, management and prefetching.
In one possible implementation, step S150 includes:
And step S151, determining a fragmentation redundancy threshold and the minimum copy number of the data block to be stored according to the heat level parameter output by the dynamic access heat prediction model.
It is assumed that the heat level parameter of the market research report data block is higher, indicating that it may have a higher access frequency in the future. Based on this, the slice redundancy threshold is determined to be 3, which means that the data block can be divided into at most 3 slices for storage, so as to improve the availability and reliability of the data. At the same time, the minimum number of copies is determined to be 2, i.e. there are at least 2 copies per slice. The purpose of this is to ensure data accessibility in the face of node failure, data corruption, etc.
Step S152, traversing the storage space fragmentation rates and the input/output request queue depths of all nodes in the full flash storage cluster, and screening candidate node subsets meeting the fragmentation capacity constraint.
For example, there are multiple nodes in the full flash storage cluster, such as node A, node B and node C. At the current time, node A's storage space fragmentation rate is 20% and its input/output request queue depth is 50 requests; node B's fragmentation rate is 30% and its queue depth is 80 requests. Assuming the shard capacity constraint requires a storage space fragmentation rate of no more than 30% and an input/output request queue depth of no more than 100 requests, both node A and node B satisfy the constraint and are selected into the candidate node subset. If node C's storage space fragmentation rate is 40% or its input/output request queue depth is 150 requests, it does not satisfy the shard capacity constraint and is not selected into the candidate node subset.
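The screening in step S152 reduces to a simple filter over the per-node metrics; this sketch uses the constraint values from the example (30% fragmentation, 100 requests).

```python
def filter_candidates(nodes, max_frag=0.30, max_queue=100):
    """nodes: {name: (fragmentation_rate, io_queue_depth)}; keep the nodes
    that satisfy the shard capacity constraint on both metrics."""
    return [name for name, (frag, queue) in nodes.items()
            if frag <= max_frag and queue <= max_queue]
```

With the example values, node C fails on both metrics and is excluded from the candidate subset.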
Step S153, calculating storage location association scores of all nodes in the candidate node subset based on the physical storage location proximity.
Taking the previously mentioned association of the market analysis data block with the market research report data block as an example, assuming that the market analysis data block is stored in the node a and a certain associated data block of the market research report data block is stored in a position adjacent to the logical unit address of the node a, the association score of the node a and the storage position of the market research report data block is higher. If another node B is farther from the storage location of the associated data block of the market research report data block, its storage location association score is relatively low. By taking these factors into account in combination, an accurate storage location association score is calculated for each node in the subset of candidate nodes.
And step S154, constructing a storage path weight table among nodes according to the storage position association scores and the network link bandwidth utilization rate in the multidimensional performance data set.
For example, the storage location association between node a and node B is scored higher while the network link bandwidth utilization between them is 70%, indicating that there is a higher efficiency in transferring data between the two nodes. Based on these factors, a higher weight value is given to the storage path between node a and node B. If the storage location association score between node a and node C is low, the network link bandwidth utilization is 50%, and the storage path weight value between them is relatively low. By such evaluation of the relationships between all nodes in the candidate node subset, a complete stored path weight table between nodes is constructed.
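Step S154 combines the two signals into a single per-path weight. The linear blend and the 50/50 weighting below are assumptions for illustration; the source only states that a higher association score and higher bandwidth utilization both raise the weight.

```python
def path_weight(assoc_score, bw_util, alpha=0.5):
    """Blend the storage location association score (0..1) and the network
    link bandwidth utilization (0..1) into one path weight (assumed linear)."""
    return alpha * assoc_score + (1 - alpha) * bw_util


def build_weight_table(pairs):
    """pairs: {(src, dst): (assoc_score, bw_util)} -> {(src, dst): weight}"""
    return {edge: path_weight(score, bw) for edge, (score, bw) in pairs.items()}
```

A path like A-B with a high association score and 70% utilization thus outranks A-C with a low score and 50% utilization.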
Step S155, based on the stored path weight table and the minimum copy number, generating a sliced stored topology graph containing redundant path cross-connections.
For example, step S155 includes:
step S1551, screening candidate storage path sets with the storage path weight value exceeding a preset weight threshold according to the inter-node storage path weight values in the storage path weight table.
And step S1552, determining the number of the partitioned copies of the data block to be stored based on the minimum number of copies, and distributing an initial storage path for each partitioned copy.
And step S1553, traversing the storage path weight values in the candidate storage path set, and selecting a main storage path with the highest weight value as a default storage path of the fragmented copy.
Step S1554, detecting the inter-node connection state of the default storage path, and if there is a single node connection path, selecting a standby storage path with a next highest weight value from the candidate storage path set as a cross redundant path.
And step S1555, performing bidirectional connection on the cross redundant paths and the default storage paths to generate a segmented storage topological graph containing the cross connection of the redundant paths.
Assuming the preset weight threshold is 0.6, the storage path weight value from node A to node B is 0.7 and that from node B to node C is 0.8, so the paths from node A to node B and from node B to node C are selected into the candidate storage path set. Based on the minimum number of copies, the number of sharded copies of the data block to be stored is determined to be 2, and an initial storage path is allocated for each sharded copy; for example, the first sharded copy is allocated the storage path from node A to node B. The storage path weight values in the candidate storage path set are then traversed and, among the paths available to this copy, the path from node A to node B has the highest weight value, so it is used as the default storage path of the sharded copy. The inter-node connection state of the default storage path is detected; if only one connection path from node A to node B is found, there is a single-point failure risk. In that case, the backup storage path with the next highest weight value (assuming the path from node A to node C is next highest) is selected from the candidate storage path set as the cross redundant path. The cross redundant path and the default storage path are connected bidirectionally to generate a sharded storage topology graph containing redundant path cross connections. For example, in the sharded storage topology, the first sharded copy has the primary storage path from node A to node B and the cross redundant path from node A to node C, both connected bidirectionally, to ensure that the data is still accessible through node C in the event of a failure of node A or node B.
And step S1556, verifying whether the path redundancy of each partitioned copy in the partitioned storage topological graph meets the redundancy constraint condition corresponding to the minimum copy number, and if not, reselecting a standby storage path and updating the cross connection relation of the partitioned storage topological graph.
In detail, since the minimum number of copies is 2, the path redundancy of each sharded copy should be at least 1, i.e., there is at least one backup storage path in addition to the primary storage path. The path condition of each sharded copy in the sharded storage topology graph is checked; if the path redundancy of some sharded copy is not satisfied, for example it has only a primary storage path and no backup storage path, a backup storage path is reselected and the cross connection relationship of the sharded storage topology graph is updated. Assuming a sharded copy originally has only the primary storage path from node A to node B and thus fails the redundancy constraint, a path from node A to node C is reselected from the candidate storage path set as the backup storage path, and the sharded storage topology graph is updated so that the A-to-B and A-to-C paths establish a correct cross connection relationship. This ensures the reliability and availability of the whole sharded storage topology graph, meets the enterprise data center's requirements for storing the market research report data block, and improves the safety and access efficiency of the data.
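Steps S1551 through S1556 can be condensed into one path-planning routine per sharded copy. This is a simplified sketch: the threshold filter, the heaviest-path default, the next-heaviest cross redundant path, and the redundancy check mirror the steps, while the flat dictionary representation of the topology is an assumption.

```python
def plan_copy_paths(weight_table, threshold=0.6, min_copies=2):
    """Filter candidate paths above the weight threshold (S1551), take the
    heaviest as the default path (S1553), the next-heaviest as the cross
    redundant path (S1554), and verify redundancy (S1556: at least
    min_copies - 1 backup paths)."""
    candidates = {edge: w for edge, w in weight_table.items() if w > threshold}
    ranked = sorted(candidates, key=candidates.get, reverse=True)
    primary = ranked[0] if ranked else None
    backups = ranked[1:min_copies]              # cross redundant path(s)
    return {"primary": primary,
            "backups": backups,
            "redundancy_ok": len(backups) >= min_copies - 1}
```

When `redundancy_ok` is false, a backup path would be reselected and the cross connections updated, as step S1556 prescribes.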
In one possible embodiment, the method further comprises:
step S310, detecting whether a single point failure risk path exists in the partitioned storage topology graph.
For example, for a sharded copy of the market research report data block whose primary storage path runs from node A to node B, if this is the only storage path and no other backup path is connected to it, it is a single-point-failure risk path: should node A or node B fail, for instance if node A's power supply fails or node B's storage medium is damaged, the sharded copy becomes inaccessible, affecting the availability of the entire data block.
Step S320, if there is a single point failure risk path, dynamically inserting a backup storage node to form a ring redundant path based on the real-time performance data of the candidate node subset.
Assume the candidate node subset contains node A, node B, node C, and so on. The real-time performance data of these nodes are checked: node C, for example, has a low real-time storage space fragmentation rate, an input/output request queue depth within a reasonable range, and high network link bandwidth utilization, indicating a good performance state. Node C is therefore dynamically inserted into the path that has a single-point-failure risk: the previous path from node A to node B is extended into a ring redundant path from node A through node C to node B. In this way, even if node A or node B fails, data can still be transmitted through node C, improving data reliability. During this process, the connection relationships and data flow directions between the nodes must be considered carefully to ensure the rationality and effectiveness of the ring redundant path.
And step S330, optimizing the copy synchronization priority in the partitioned storage topological graph according to the inter-node delay data of the annular redundant path.
For example, delay data between the nodes on the ring redundant path are measured: the delay from node A to node C is 5 milliseconds, from node C to node B is 3 milliseconds, and from node B back to node A is 4 milliseconds. The priority of replica synchronization is determined from these delay data. If a sharded copy is updated on node A, then because the delay from node C to node B is relatively small, the update can first be synchronized to node C, then from node C to node B, and finally from node B back to node A, ensuring the consistency and timeliness of the data. This process must accurately calculate and balance the influence of the delays of different paths on data synchronization, so that efficient synchronization is achieved while data reliability is improved.
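Steps S310 to S330 can be sketched together: detect single-path copies, splice a backup node into the risky path to form the ring, and order synchronization hops by measured delay. Ordering hops by ascending delay is one simple policy assumed here; the source only states that lower-delay links are preferred.

```python
def single_point_paths(topology):
    """topology: {copy_id: [paths]}; return copies with only one path."""
    return [c for c, paths in topology.items() if len(paths) == 1]


def insert_backup_node(path, backup):
    """Extend a risky direct path (src, dst) into a ring src -> backup -> dst."""
    src, dst = path
    return (src, backup, dst)


def sync_order(link_delays):
    """Order replica-synchronization hops on the ring by ascending delay
    (one simple prioritization policy, assumed for this sketch)."""
    return sorted(link_delays, key=link_delays.get)
```

With the measured delays from the example, the C-to-B hop (3 ms) is synchronized first and the A-to-C hop (5 ms) last.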
And step S340, performing association mapping on the optimized fragment storage topological graph and the cache preloading strategy matrix to generate a joint storage strategy configuration file.
The cache preloading policy matrix contains the cache policy information of the data blocks related to the market research report data blocks, such as preloading priority coefficient, cache retention period, compression level parameter and the like. And carrying out association mapping on the optimized fragment storage topological graph and the cache policy information. For example, for a sharded copy stored on node a in the sharded storage topology graph, if the corresponding associated data block has a higher preloading priority coefficient in the cache preloading policy matrix, then the node a where the sharded copy is located is marked in the joint storage policy configuration file to need to perform the cache preloading operation more preferentially. Through the association mapping, a comprehensive joint storage strategy configuration file is generated, and the joint storage strategy configuration file guides the whole data storage system how to store, cache, synchronize data and the like on the market research report data blocks and related data blocks thereof.
In one possible embodiment, the method further comprises:
step S410, a storage life cycle tracking log is created for the data block to be stored, so as to record the creation time stamp, migration event and cache state change history of the sharded copy through the storage life cycle tracking log.
In detail, when a sharded copy of the market research report data block is created at a certain time, the storage lifecycle tracking log records the creation timestamp, e.g., 10:15 am on July 1, 2023. During the storage of the data block, if a migration event occurs to the sharded copy, due to a performance problem on some node or a load-balancing requirement, such as a migration from node A to node C, the event is recorded in detail in the log, including the source node, the target node, and the migration time. The change history of the cache state is recorded at the same time: if the cache of some sharded copy changes from the initial uncached state to the cached state, or the retention period of the cache changes, this information is recorded as well.
Step S420, analyzing the abnormal event sequence in the storage life cycle tracking log, and identifying the potential cause of the storage performance bottleneck.
For example, the storage lifecycle tracking log may show that, during a certain period, the sharded copies of the market research report data blocks frequently underwent migration events while the cache state was also unstable. Analyzing this abnormal event sequence may reveal that the storage space fragmentation rate of some node (such as node A) is too high, reducing the storage and access efficiency of the data block and thereby causing the frequent migrations and unstable cache state. Alternatively, the analysis may find that the network link bandwidth utilization dropped suddenly at some point, affecting the transfer and caching of data; either finding is a potential cause of a storage performance bottleneck.
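A minimal form of this log analysis is counting migration events per data block and flagging blocks that migrate too often. The event-record shape and the threshold of 3 are assumptions for this sketch; the source only specifies that frequent migrations in the log signal a potential bottleneck.

```python
from collections import Counter


def suspect_blocks(log_events, migration_threshold=3):
    """Scan lifecycle-log events and flag data blocks whose migration count
    reaches the threshold -- a hint of node fragmentation or bandwidth
    trouble worth investigating further."""
    migrations = Counter(e["block"] for e in log_events
                         if e["event"] == "migration")
    return sorted(b for b, n in migrations.items()
                  if n >= migration_threshold)
```

Flagged blocks would then be correlated with node fragmentation rates and bandwidth data to isolate the root cause.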
Step S430, adjusting the parameter updating frequency of the dynamic access heat prediction model and the calculation logic of the fragment redundancy threshold according to the potential reasons.
If a problem is found that is caused by too high a storage space fragmentation rate of the node, the parameter update frequency of the dynamic access heat prediction model can be adjusted. For example, the original parameter update frequency is once a day, and the change of the fragmentation rate of the storage space can have a large influence on the data access heat, so that the parameter update frequency is increased to be once an hour, so as to reflect the actual access condition of the data more timely. For the calculation logic of the segment redundancy threshold, if the instability of the network link bandwidth utilization is found to have a larger influence on the data availability, the consideration weight on the network link bandwidth utilization is increased when the segment redundancy threshold is calculated. For example, the slice redundancy threshold is determined to be 3 only according to the heat level parameter, and the slice redundancy threshold may be adjusted to be 4 according to the fluctuation condition of the network link bandwidth utilization rate due to the influence of network factors, so as to improve the reliability of data.
Step S440, the adjusted parameter updating frequency and the calculation logic are synchronized to all the management nodes of the all-flash storage cluster in real time.
Each management node in the full flash storage cluster needs to obtain the adjusted information to ensure that the whole storage system performs data storage and management according to a new strategy. For example, the management node a, the management node B and the like need to receive and update the parameters, so that when the market research report data block and other data blocks are subjected to storage operation, decision can be made according to the new dynamic access heat prediction model parameter updating frequency and the calculation logic of the fragmentation redundancy threshold value, and the high efficiency, the reliability and the availability of the data of the whole storage system are ensured.
In one possible implementation, step S150 further includes:
Step S156, identifying a target data block sequence spatially associated with the data block to be stored according to the physical storage location proximity.
In detail, in a full flash storage cluster, market research report data blocks are stored at specific node and logical unit addresses. Other data blocks stored adjacent to the market research report data block are determined by analyzing the physical storage location proximity, such as looking at the relationship of storage node identification and logical unit address. It is assumed that the market research report data block is stored at the logical unit address 100 of the node a, and may be auxiliary data blocks related to the market research report, such as raw data collection records, preliminary analysis results and the like, which are in the same node and have similar logical unit addresses (e.g., 101-105), and these data blocks form a target data block sequence spatially related to the market research report data block.
Step S157, predicting a concurrent access probability distribution of the target data block sequence in the preset time period based on the dynamic access popularity prediction model.
The dynamic access heat prediction model outputs information such as heat level parameters and the like for the market research report data block, and predicts concurrent access probability distribution of the target data block sequence based on the information. For example, according to the high access heat of the market research report data block from 9 to 10 am on the working day, it is presumed that the original data acquisition record data block spatially associated with the market research report data block has a high concurrent access probability in the time period, and the probability may reach 0.4. And for the primary analysis result data block, the concurrent access probability may be slightly lower, which is 0.3. And for other time periods, the concurrent access probability of each time period is calculated according to the access heat trend of the market research report data block and the association tightness degree of each target data block and the market research report data block, so that the concurrent access probability distribution of the target data block sequence in the preset time period is obtained.
Step S158, calculating a preloading priority coefficient of each target data block according to the concurrent access probability distribution and the node storage space fragmentation rate in the multi-dimensional performance data set.
Taking the raw data acquisition record data block as an example, its concurrent access probability is 0.4 and the storage space fragmentation rate of node A is assumed to be 20%. The calculation proceeds as follows: a low storage space fragmentation rate indicates that the node has more space available for caching, which has a positive impact on the preloading priority. A base score can be set from the concurrent access probability, e.g., a probability of 0.4 gives 40 points, and then adjusted according to the storage space fragmentation rate. Since the fragmentation rate of 20% is at a low level, a bonus of, say, 10 points is granted, so the preloading priority coefficient of the raw data acquisition record data block is 50. For the preliminary analysis result data block, the concurrent access probability is 0.3 and the fragmentation rate of its node is assumed to be 30%; a rate of 30% is relatively high and has a certain negative impact on the preloading priority. By the same calculation, the base score is 30 points, and because of the high fragmentation rate a deduction of perhaps 5 points is applied, giving a preloading priority coefficient of 25. In this way, a preloading priority coefficient is calculated for each target data block.
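The scoring above can be written out directly; the 20% cutoff and the +10/-5 adjustments are the example's illustrative values, not a prescribed formula.

```python
def preload_priority(concurrent_prob, frag_rate):
    """Preloading priority coefficient: base score from the concurrent access
    probability (x100), adjusted by the node's storage space fragmentation
    rate (bonus/penalty values taken from the worked example)."""
    base = concurrent_prob * 100             # 0.4 -> 40 points
    bonus = 10 if frag_rate <= 0.20 else -5  # low fragmentation aids caching
    return base + bonus
```

This reproduces the example's coefficients: 50 for the raw data acquisition record block and 25 for the preliminary analysis result block.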
Step S159, based on the preloading priority coefficient, allocating differentiated buffer reservation period and compression level parameters for the target data block sequence, integrating the buffer reservation period and compression level parameters according to node dimensions, and generating a multi-layer structure of the buffer preloading policy matrix.
In a full flash storage cluster there are multiple nodes, and for each node the cache retention periods and compression level parameters of the target data blocks stored on it are integrated. For example, at node A the raw data acquisition record data block has a cache retention period of 3 hours and a low compression level, while the preliminary analysis result data block has a dynamically adjusted cache retention period (assumed to be 1 hour at the current time) and a high compression level. This information is organized by node dimension to form a cache preloading policy matrix with a multi-layer structure. The matrix clearly guides the cache preloading operations of the full flash storage cluster for the target data block sequence on different nodes, including when to cache, how long to cache, and what compression rate to adopt, thereby improving data access efficiency and storage resource utilization.
Wherein, step S159 includes:
Step S1591, obtaining the current buffer space occupancy rate and the history buffer replacement frequency of the edge node in the full flash storage cluster.
And step S1592, when the concurrent access probability distribution is higher than a first preset threshold, allocating a fixed reservation period for the corresponding data block and locking a cache space.
And step S1593, dynamically adjusting the retention period attenuation rate based on the history buffer replacement frequency when the concurrent access probability distribution is lower than a second preset threshold.
And step S1594, performing variable compression rate processing on the low-priority data block according to the compression grade parameter, and recording metadata verification information of the compressed data block.
Assume the current cache space occupancy of the edge node is 60% and the historical cache replacement frequency is one replacement every 2 hours. For the raw data acquisition record data block with the higher preloading priority coefficient, when its concurrent access probability distribution is above the first preset threshold (assumed to be 0.35), a fixed retention period is allocated and the cache space is locked: for example, a fixed cache retention period of 3 hours, during which the data block is not replaced even if cache space is tight. For the preliminary analysis result data block, whose concurrent access probability distribution is below the second preset threshold (say 0.25), the retention period decay rate is dynamically adjusted based on the historical cache replacement frequency. Since the historical replacement frequency is once every 2 hours, when cache space is tight its cache retention period decays at a faster rate, for example shortened by 10 minutes every 30 minutes, so that cache space is freed as soon as possible for higher-priority data blocks that need it more. Variable compression rate processing is performed on the low-priority data blocks according to the compression level parameter: a low-priority block such as the preliminary analysis result is compressed at a higher compression rate according to its (assumed higher) compression level parameter. During compression, metadata verification information of the compressed data block, such as a checksum, is recorded so that data integrity can be verified when the block is accessed later.
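The threshold-based branching of steps S1592 to S1594 can be captured as a small policy function. The thresholds and the returned fields mirror the worked example (0.35 / 0.25, 3-hour lock, 10-minute decay per half hour); the dictionary shape is an assumption of this sketch.

```python
def cache_policy(concurrent_prob, hi=0.35, lo=0.25,
                 fixed_period_min=180, decay_min_per_30min=10):
    """Map a block's concurrent access probability to a cache action:
    lock with a fixed retention period above `hi`, decaying retention plus
    high compression below `lo`, default behavior in between."""
    if concurrent_prob > hi:
        return {"action": "lock", "retention_min": fixed_period_min}
    if concurrent_prob < lo:
        return {"action": "decay",
                "decay_min_per_30min": decay_min_per_30min,
                "compress": "high"}
    return {"action": "default"}
```

Blocks between the two thresholds keep the default caching behavior, which the source leaves unspecified.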
In one possible embodiment, the method further comprises:
and step S510, deploying a dynamic load balancing controller in the full-flash memory cluster, and collecting the performance data fluctuation trend of each node in real time.
For example, for each node storing market research report data blocks and their associated data blocks, the dynamic load balancing controller continuously tracks performance status, recording how each performance metric of nodes such as node A and node B changes over time. Taking node A as an example, the controller monitors the change of its storage space fragmentation rate across different periods, such as 15% in the morning, 18% at noon, and 20% in the afternoon. It also monitors fluctuations in the input/output request queue depth: for example, 30 requests in the early morning when business has just started, rising to 80 requests at the morning business peak as the market department frequently accesses the market research report data block and its related data blocks, and then gradually falling back to 50 requests in the afternoon. By recording data at these different time points, the fluctuation trend of each node's performance data is analyzed.
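The sampling described above can be sketched as a small time series per node; the sample tuples and the trend helper below are hypothetical, using the 15%/18%/20% and 30/80/50 figures from the example.

```python
# Illustrative samples for node A: (hour_of_day, fragmentation_pct, queue_depth),
# matching the morning/noon/afternoon figures in the example above.
node_a_samples = [(9, 15, 30), (11, 18, 80), (15, 20, 50)]

def per_hour_trend(samples, field):
    """Average per-hour change of one metric between the first and last
    samples (field 1 = fragmentation rate, field 2 = queue depth)."""
    hours = samples[-1][0] - samples[0][0]
    return (samples[-1][field] - samples[0][field]) / hours
```

A first-to-last average deliberately smooths out the midday peak; a real controller would also track short-window rates, as the overload check in step S520 does.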
And step S520, identifying potential overload nodes according to the performance data fluctuation trend, and triggering a data block copy migration early warning mechanism.
For example, it is observed that node A's input/output request queue depth is increasing quickly, from 30 requests to 80 requests over the past hour. Its growth rate (the number of requests added divided by the time interval) is therefore 50 requests per hour, which is assumed to reach the preset queue depth early warning line (e.g., 50 requests per hour). Meanwhile, the change gradient of node A's storage space fragmentation rate is also large, rising from 15% to 20% and reaching the fragmentation threshold (assumed to correspond to this 15%-20% change range). At this point, node A is identified as a potentially overloaded node, and the data block copy migration early warning mechanism is triggered.
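The overload check can be sketched as a pair of threshold comparisons. The function name and the default thresholds (50 requests per hour, a 5-percentage-point fragmentation change) are assumptions taken from the worked example, not fixed parameters of the method.

```python
def is_potentially_overloaded(depth_start, depth_end, hours,
                              frag_start, frag_end,
                              queue_warning=50.0, frag_gradient=5.0):
    """Flag a node when its queue-depth growth rate reaches the warning
    line AND its fragmentation-rate change reaches the gradient threshold
    (threshold values are illustrative)."""
    growth_rate = (depth_end - depth_start) / hours   # requests per hour
    frag_change = frag_end - frag_start               # percentage points
    return growth_rate >= queue_warning and frag_change >= frag_gradient
```

On node A's figures this returns True: (80 - 30) / 1 h = 50 requests per hour reaches the warning line, and the 5-point fragmentation change reaches the assumed gradient threshold.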
Step S530, selecting a migration target node and calculating an optimal delay parameter of the migration path based on the redundant path cross connection relationship in the partitioned storage topology.
In the partitioned storage topology, node A has redundant path cross-connection relationships with other nodes (e.g., node B, node C, etc.). The real-time performance data of these nodes is checked, including storage space fragmentation rate, input/output request queue depth, network link bandwidth utilization, and so on. Assuming that node B has a low storage space fragmentation rate, a small input/output request queue depth, and a low network link bandwidth utilization (i.e., ample spare bandwidth), node B is selected as the migration target node. The optimal delay parameter of the migration path from node A to node B is then calculated. The various links between node A and node B are considered, such as those passing through intermediate nodes, network devices, etc. Delay data for each link is measured, e.g., a 3 ms delay from node A to an intermediate node and a 2 ms delay from the intermediate node to node B, for a total delay of 5 ms. The possible influence on delay of factors such as network congestion and data transmission rate during migration is also considered, and the optimal delay parameter of the migration path from node A to node B is determined by comprehensively analyzing these data.
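The path selection above can be sketched as a minimum-total-delay choice over the candidate redundant paths. The path names, hop delays, and the optional congestion penalty are hypothetical stand-ins for the measured values the description mentions.

```python
# Candidate migration paths from node A to node B, each a list of per-hop
# delays in milliseconds (hypothetical measurements; the first path matches
# the 3 ms + 2 ms = 5 ms example above).
candidate_paths = {
    "A->X->B": [3, 2],
    "A->Y->Z->B": [2, 2, 3],
}

def best_path(paths, congestion_penalty_ms=None):
    """Pick the path with the smallest total delay, optionally adding a
    per-path congestion penalty (all names and values illustrative)."""
    congestion_penalty_ms = congestion_penalty_ms or {}
    totals = {name: sum(hops) + congestion_penalty_ms.get(name, 0)
              for name, hops in paths.items()}
    name = min(totals, key=totals.get)
    return name, totals[name]
```

Here the 5 ms path through the single intermediate node would be chosen; a congestion penalty on that path could tip the choice to the longer route.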
Step S540, during the migration operation, maintaining the access availability of the original data block and synchronously updating the federated storage policy configuration file.
When the fragmented copy of the market research report data block on node A begins migrating to node B, it must be ensured that, during migration, the market department or any other department requiring the data block can still access the data normally. This may require technical means such as temporary data copies or multi-path data access. Meanwhile, during migration, the federated storage policy configuration file is synchronously updated. This configuration file contains information such as the partitioned storage topology map and the cache preloading policy matrix of the market research report data block. Because the storage location of the data block changes, the relevant information in the federated storage policy configuration file needs to be updated. For example, the original storage information about the data block on node A is modified to the storage information on node B, including updating the storage path of the fragmented copy on node B, its cache policy, and other related information, so that the policy of the whole storage system matches the actual storage condition of the data.
Wherein, step S520 includes:
Step S521, monitoring the growth rate of the input/output request queue depth and the change gradient of the storage space fragmentation rate of the potentially overloaded node.
Step S522, when the growth rate exceeds the queue depth early warning line and the change gradient reaches the fragmentation threshold, generating a migration task queue.
Step S523, sorting the migration execution sequence and distributing the migration bandwidth resources according to the data block priority labels in the migration task queues.
For example, as described above, for node A, the significant increase in its input/output request queue depth and the significant rise in its storage space fragmentation rate are recorded in detail. When the growth rate exceeds the queue depth early warning line and the change gradient reaches the fragmentation threshold, a migration task queue is generated. Assume that the market research report data block has multiple fragmented copies on node A, each labeled with a data block priority label according to factors such as its importance. When the migration task queue is generated, it is ordered according to these priority labels: for example, a fragmented copy associated with the core data of the market research report is marked high priority, one associated with auxiliary data is marked medium priority, and other copies with weaker associations are marked low priority. In priority order, the high-priority copies are placed at the front of the migration task queue, the medium-priority copies in the middle, and the low-priority copies at the end. Meanwhile, migration bandwidth resources are allocated to the copies in the migration task queue according to the current network bandwidth resources of the full flash storage cluster. If the total bandwidth is 1000 Mbps, then depending on the priority and size of the data blocks, 500 Mbps of bandwidth may be allocated to the high-priority copies, 300 Mbps to the medium-priority copies, and 200 Mbps to the low-priority copies.
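The ordering and bandwidth split above can be sketched as follows; the fixed 50/30/20 shares reproduce the 500/300/200 Mbps example and are illustrative, not a required allocation policy.

```python
def plan_migration(tasks, total_bandwidth_mbps=1000, shares=None):
    """Order copies high -> medium -> low and split bandwidth by fixed
    per-priority shares (values illustrative, matching the example above)."""
    shares = shares or {"high": 0.5, "medium": 0.3, "low": 0.2}
    order = {"high": 0, "medium": 1, "low": 2}
    queue = sorted(tasks, key=lambda t: order[t["priority"]])
    for t in queue:
        t["bandwidth_mbps"] = total_bandwidth_mbps * shares[t["priority"]]
    return queue
```

Because Python's sort is stable, copies with equal priority keep their original relative order; a real planner might further weight the split by data block size.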
Step S524, after the migration is completed, verifying the data consistency of the target node and updating the node state identification of the fragment storage topological graph.
After the data block copy has been completely migrated from node A to node B, the data on node B is verified for consistency. This may include comparing checksums of the data blocks and checking that the size, content, etc. of the data blocks are consistent with those on node A. If the data is consistent, the migration is successful. The node state identification of the sharded storage topology map is then updated: the state identification of the data block copy on node A is modified to "migrated out," and the state identification on node B is modified to "received and stored." Through these operations, the sharded storage topology map accurately reflects the storage location and state of the data block copies, providing an accurate basis for subsequent storage management operations.
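A minimal sketch of the post-migration check, using SHA-256 as an assumed checksum algorithm and a plain dict as a stand-in for the topology map's node state identifications:

```python
import hashlib

def verify_and_update(src_bytes, dst_bytes, topology, block_id):
    """Compare size and checksum of the source and migrated copies; on a
    match, mark the block migrated out on node A and received on node B
    (topology is an illustrative dict stand-in for the topology map)."""
    ok = (len(src_bytes) == len(dst_bytes) and
          hashlib.sha256(src_bytes).hexdigest() ==
          hashlib.sha256(dst_bytes).hexdigest())
    if ok:
        topology[("A", block_id)] = "migrated_out"
        topology[("B", block_id)] = "received"
    return ok
```

On a mismatch the topology is left untouched, so a failed migration never marks the target copy as received.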
In one possible embodiment, the method further comprises:
Step S610, implementing a data integrity verification cycle in the full flash storage cluster, periodically scanning checksum information of the partitioned copies, and initiating a copy repair request based on a redundant path cross connection relationship in the partitioned storage topological graph when detecting that the checksum is abnormal.
In this embodiment, taking the market research report data block as an example, the full flash storage cluster may scan the partitioned copies of the market research report data block stored on each node for checksum information at predetermined time intervals, for example, every hour or every day. The checksum is a value representing the characteristics of the content of the data block calculated by a specific algorithm. For each sliced copy of the market research report data block, its checksum may be recalculated and then compared to previously stored checksum information.
Assuming that in one scan, the checksum of a sliced copy of the market research report data block stored on node a is found to not match the previously stored value, this indicates that the data of that sliced copy may be corrupted or erroneous. At this time, the redundant path cross-connect relationship in the sharded storage topology may be referred to. For example, the sharded storage topology shows that the sharded copy has a primary storage path on node A while being connected to node B by a redundant path and that node B has a redundant copy of the sharded copy stored thereon. Based on this, a copy repair request may be initiated to node B, requiring that node B provide the correct fragmented copy data to repair the corrupted copy on node a.
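The periodic scan and repair request can be sketched as below. The dict structures, SHA-256 choice, and the single redundant-path partner per node are illustrative simplifications of the topology relationships described above.

```python
import hashlib

def scan_copies(copies, stored_checksums, redundant_paths):
    """Recompute each fragmented copy's checksum; where it mismatches the
    stored value, emit a repair request naming the redundant-path partner
    node as the source (all structures are illustrative stand-ins)."""
    requests = []
    for (node, block_id), data in copies.items():
        digest = hashlib.sha256(data).hexdigest()
        if digest != stored_checksums[(node, block_id)]:
            partner = redundant_paths[node]   # e.g. node A's partner is B
            requests.append({"repair_on": node, "block": block_id,
                             "source": partner})
    return requests
```

Running this every hour or every day, as the description suggests, yields an empty request list when all copies are intact.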
Step S620, performing recompression and cache state refresh operations on the repaired data block according to the compression level parameter in the cache preloading policy matrix, updating the metadata mapping relation in the federated storage policy configuration file, and feeding back the repair result to the user terminal.
In the cache preloading policy matrix, each data block has a corresponding compression level parameter. Assume that the compression level parameter of the fragmented copy of the market research report data block is a medium compression level. After repair is completed, the repaired data block is recompressed at that medium compression level. During compression, redundant data in the data block is processed by the compression algorithm to reduce the storage space the block occupies. Meanwhile, because the content of the data block has changed (after repair and recompression), a cache state refresh operation is required. If the data block was previously cached with a certain cache retention period and cache policy, all cache-related states need to be updated to the new situation. For example, the cache retention period may need to be recalculated, or the cache priority adjusted.
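A sketch of the recompression and cache refresh step, using zlib as an assumed compression algorithm; the low/medium/high level mapping, the CRC32 checksum, and the 180-minute reset retention are all hypothetical example choices.

```python
import zlib

# Hypothetical mapping from the matrix's compression level parameter to a
# zlib level (the parameter names and levels are illustrative).
LEVELS = {"low": 3, "medium": 6, "high": 9}

def recompress_and_refresh(data, level_param, cache_entry):
    """Recompress a repaired block at its configured level, record a fresh
    checksum, and reset the cache entry's retention clock."""
    compressed = zlib.compress(data, LEVELS[level_param])
    cache_entry["checksum"] = zlib.crc32(compressed)
    cache_entry["retention_min"] = 180   # recalculated; value illustrative
    return compressed
```

The recorded checksum covers the compressed bytes, so the next integrity scan can verify the block without decompressing it first.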
The federated storage policy configuration file contains various information about the storage of the market research report data block, in which the metadata mapping relation records the associations between the data block and information such as its storage location, cache policy, and compression level. This information changes as the data block undergoes repair, recompression, and cache state refreshing. For example, the location of the data block on a storage node may change due to a repair operation, or a change in compression level may change its entry in the cache preloading policy. Therefore, the metadata mapping relation in the federated storage policy configuration file needs to be updated to ensure that the information in the file matches the storage and management state of the actual data blocks. Finally, the repair result is fed back to the user terminal. Users of the market research report data block, such as market department staff of the enterprise, receive notification of the repair result through the user terminal. If the repair succeeded, the notification may state that the fragmented copy of the market research report data block on node A has been successfully repaired and that data integrity has been restored so the block can be used normally; if the repair failed, the notification may state that repair of the fragmented copy on node A failed and that an administrator should be contacted for further inspection. In this way, the relevant personnel of the enterprise can learn the status of the data block in time and take further action when required.
FIG. 2 illustrates a schematic diagram of exemplary hardware and software components of a cloud computing based data all-flash memory optimization system 100 that may implement the concepts of the present application, provided by some embodiments of the present application. For example, the processor 120 may be used on the cloud computing based data all-flash memory optimization system 100 and to perform the functions of the present application.
The cloud computing-based data full-flash memory optimization system 100 may be a general-purpose server or a special-purpose server, either of which may be used to implement the cloud computing-based data full-flash memory optimization method of the present application. Although only one server is shown for convenience, the functionality described herein may be implemented in a distributed fashion across multiple similar platforms to balance the processing load.
For example, the cloud computing-based data all-flash memory optimization system 100 can include a network port 110 connected to a network, one or more processors 120 for executing program instructions, a communication bus 130, and various forms of storage media 140, such as magnetic disk, ROM, or RAM, or any combination thereof. By way of example, the cloud computing-based data all-flash memory optimization system 100 can also include program instructions stored in ROM, RAM, or other types of non-transitory storage media, or any combination thereof. The method of the present application may be implemented in accordance with these program instructions. The cloud computing based data full flash memory optimization system 100 also includes an Input/Output (I/O) interface 150 between the computer and other Input/Output devices.
For ease of illustration, only one processor is depicted in the cloud computing-based data full-flash memory optimization system 100. It should be noted, however, that the cloud computing-based data full-flash memory optimization system 100 of the present application may also include multiple processors, and thus the steps described in the present application as performed by one processor may also be performed jointly or separately by multiple processors. For example, if the processor of the cloud computing-based data full-flash memory optimization system 100 performs steps A and B, it should be understood that steps A and B may be performed together by two different processors or performed separately within one processor. For example, a first processor performs step A and a second processor performs step B, or the first processor and the second processor together perform steps A and B.
In addition, an embodiment of the invention further provides a readable storage medium in which computer-executable instructions are stored; when a processor executes the computer-executable instructions, the cloud computing-based data full flash memory optimization method is implemented.
It should be noted that in order to simplify the presentation of the disclosure and thereby aid in understanding one or more embodiments of the invention, various features are sometimes grouped together in a single embodiment, figure, or description thereof.