Disclosure of Invention
Therefore, the invention aims to provide a system and a method for on-time cleaning of data lake files based on storage-compute separation, which solve the problem that current file cleaning cannot quickly respond to and process the files that need to be cleaned.
In order to achieve the above object, one aspect of the present invention provides a system for on-time cleaning of data lake files based on storage-compute separation, including:
The identification module is used for scanning all files in the data lake in real time, sequentially identifying target files according to a scanning result, judging the cleaning type of the target files, recording the actual number of the target files, and determining an actual cleaning mode according to the cleaning type and the actual number;
the preprocessing module is used for generating a cleaning set for each target file obtained by scanning according to the actual cleaning mode, sequentially evaluating the cleaning urgency of each file to be cleaned in the cleaning set, and sequencing each file to be cleaned according to an evaluation result so as to obtain an actual cleaning sequence;
The cleaning module is used for cleaning according to the actual cleaning sequence, acquiring actual cleaning parameters in the cleaning process, analyzing the actual cleaning parameters to judge the consistency of the cleaning process, and determining a corresponding process control mode according to a judging result;
and the feedback correction module is used for periodically acquiring a cleaning result and carrying out feedback adjustment on the cleaning condition according to the cleaning result.
Further, the identification module comprises a scanning unit and a judging unit;
The scanning unit is used for scanning all files in the data lake in real time according to a scanning standard so as to obtain an actual scanning result;
the judging unit is used for analyzing each file according to the actual scanning result and combining with a preset target cleaning strategy so as to identify and obtain a plurality of target files;
The scanning standard comprises a time standard, a frequency standard, a capacity standard, a format standard and a reference standard, and the target cleaning strategy comprises a standard deadline strategy, a standard use strategy, a standard capacity strategy and a standard format strategy.
Further, the identification module further comprises a marking unit and a dividing unit;
The marking unit is used for marking each target file according to the judging result so as to obtain isolated files with different cleaning types;
the dividing unit is used for dividing each isolated file according to the cleaning type so as to obtain a plurality of subsets to be cleaned.
Further, the identification module further comprises a counting unit and a first determining unit;
the counting unit is used for accumulating the actual number of marks of each subset to be cleaned;
The first determining unit is configured to determine, according to the actual number of marks of each subset to be cleaned and the cleaning type, that the actual cleaning mode is a threshold cleaning mode or a periodic cleaning mode.
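The dividing and counting units described above can be illustrated with a minimal sketch. The `IsolatedFile` record, the function names and the cleaning-type labels are hypothetical and are not taken from the invention; they only show one way of grouping marked files by cleaning type and accumulating the number of marks per subset.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class IsolatedFile:
    path: str
    cleaning_type: str  # hypothetical label, e.g. "expired" or "cold"


def divide_into_subsets(files):
    """Dividing unit: group marked files by their cleaning type."""
    subsets = defaultdict(list)
    for f in files:
        subsets[f.cleaning_type].append(f)
    return dict(subsets)


def count_marks(subsets):
    """Counting unit: actual number of marks in each subset to be cleaned."""
    return {ctype: len(members) for ctype, members in subsets.items()}
```

For example, three marked files of which two are expired and one is cold yield the counts `{"expired": 2, "cold": 1}`, which the first determining unit can then use to choose the cleaning mode.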
Further, the preprocessing module comprises a generating unit, a calculating unit and a comparing unit;
The generating unit is used for determining the number of single cleaning according to the actual cleaning mode and generating the cleaning set based on the number of single cleaning;
The calculating unit is used for calculating the cleaning urgency of each isolated file according to the marking result and the actual number of marks;
the comparison unit is used for sorting according to the numerical value of the cleaning urgency of each isolated file so as to obtain the actual cleaning sequence.
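The preprocessing flow above can be sketched as follows. The text does not give the urgency formula, so the one used here (a per-type weight multiplied by the mark count of the file's subset) is purely an assumption for illustration, as are the weights, the `IsolatedFile` record and the rule that the cleaning set takes the first files in marked order.

```python
from dataclasses import dataclass

# Hypothetical weights; the source does not say how a marking result
# is converted into a number.
TYPE_WEIGHT = {"expired": 3.0, "capacity_abnormal": 2.0, "cold": 1.0}


@dataclass(frozen=True)
class IsolatedFile:
    path: str
    cleaning_type: str


def cleaning_urgency(f, mark_counts):
    """Assumed formula: type weight times the mark count of the file's subset."""
    return TYPE_WEIGHT.get(f.cleaning_type, 1.0) * mark_counts.get(f.cleaning_type, 0)


def build_cleaning_sequence(files, mark_counts, single_clean_count):
    """Generate the cleaning set, then order it from most to least urgent."""
    # Generating unit: take at most single_clean_count files (assumed: marked order).
    cleaning_set = files[:single_clean_count]
    # Comparison unit: sort by urgency value; ties keep their original order.
    return sorted(cleaning_set,
                  key=lambda f: cleaning_urgency(f, mark_counts),
                  reverse=True)
```

Any monotonic urgency score could replace the assumed formula without changing the sorting step.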
Further, the cleaning module comprises a monitoring unit and a first judging unit;
The monitoring unit is used for monitoring the cleaning process in real time so as to obtain the actual cleaning parameters;
the first judging unit is used for judging whether the actual cleaning process data and the metadata are synchronized, and sending a first alarm signal when they are not synchronized;
The actual cleaning parameters comprise actual cleaning target data, actual cleaning operation results and the actual cleaning process data.
Further, the cleaning module further comprises a second judging unit and a third judging unit;
The second judging unit is used for determining a single cleaning success rate according to the actual cleaning operation result and the target cleaning strategy, and sending a second alarm signal when the single cleaning success rate is not equal to 100%;
The third judging unit is used for judging whether conflict data exist according to the actual cleaning target data and the current execution task data, and sending a third alarm signal when the conflict data exist.
Further, the cleaning module further includes a second determining unit;
the second determining unit is capable of determining that the process control mode is data re-reading according to the first alarm signal;
the second determining unit is capable of determining that the process control mode is data rollback recovery according to the second alarm signal;
The second determining unit is capable of determining the process control mode as a data locking mechanism based on the third alarm signal.
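The relationship between the three alarm signals and the process control modes can be sketched as a simple lookup. The mapping itself follows the text; the function signature, the input names, and the definition of the single cleaning success rate as cleaned files divided by targeted files are assumptions made for illustration.

```python
from enum import Enum


class Alarm(Enum):
    FIRST = 1   # cleaning process data and metadata are not synchronized
    SECOND = 2  # single cleaning success rate is not 100%
    THIRD = 3   # cleaning targets conflict with a currently executing task


class ControlMode(Enum):
    DATA_REREAD = "data re-reading"
    ROLLBACK_RECOVERY = "data rollback recovery"
    DATA_LOCK = "data locking mechanism"


# Mapping stated in the text: each alarm selects one process control mode.
CONTROL_MODE_FOR = {
    Alarm.FIRST: ControlMode.DATA_REREAD,
    Alarm.SECOND: ControlMode.ROLLBACK_RECOVERY,
    Alarm.THIRD: ControlMode.DATA_LOCK,
}


def check_cleaning(metadata_in_sync, cleaned_count, target_count,
                   cleaning_targets, active_task_files):
    """Return the alarms raised for one cleaning pass (inputs are hypothetical)."""
    alarms = []
    if not metadata_in_sync:
        alarms.append(Alarm.FIRST)
    # Assumed definition: success rate = cleaned / targeted, so "not 100%"
    # is equivalent to the two counts differing.
    if cleaned_count != target_count:
        alarms.append(Alarm.SECOND)
    # Conflict data: a file is both a cleaning target and in use by a running task.
    if set(cleaning_targets) & set(active_task_files):
        alarms.append(Alarm.THIRD)
    return alarms
```

A caller would look up `CONTROL_MODE_FOR[alarm]` for each returned alarm and apply the corresponding control measure.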
Further, the feedback correction module comprises an acquisition unit, an evaluation unit and an adjustment unit;
The acquisition unit is used for periodically acquiring the cleaning result;
the evaluation unit is used for determining a cleaning grade according to a first difference absolute value and a preset first evaluation value;
The adjusting unit is used for determining that the feedback adjusting mode is single feedback adjustment or integral feedback adjustment according to the cleaning grade;
The first difference absolute value is the absolute value of the difference between the actual storage release amount and the preset standard storage release amount.
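The evaluation and adjustment steps can be illustrated with a short sketch. Only the definition of the first difference absolute value comes from the text; the two-level grading and the assignment of each grade to single or integral feedback adjustment are assumptions, since the grading rule is not stated here.

```python
def first_difference_abs(actual_release, standard_release):
    """|actual storage release amount - preset standard storage release amount|."""
    return abs(actual_release - standard_release)


def cleaning_grade(diff_abs, first_evaluation_value):
    """Assumed grading: grade 1 if the deviation is within the evaluation value, else 2."""
    return 1 if diff_abs <= first_evaluation_value else 2


def feedback_mode(grade):
    """Assumed mapping: small deviation -> single adjustment, large -> integral adjustment."""
    if grade == 1:
        return "single feedback adjustment"
    return "integral feedback adjustment"
```

With a standard release of 100 units, an actual release of 80 units and a first evaluation value of 25, the deviation is 20, which under these assumptions gives grade 1 and a single feedback adjustment.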
The invention also provides a method for on-time cleaning of data lake files based on storage-compute separation, which comprises the following steps:
step S1, scanning all files in the data lake, determining files to be cleaned according to a scanning result, and marking to obtain a plurality of isolated files;
Step S2, accumulating the isolated files to obtain the actual number of marks, and cleaning the data lake in batches according to the actual number of marks or a preset initial cleaning period;
Step S3, for any cleaning batch, generating a cleaning set from the cleaning batch and its corresponding isolated files, and arranging the cleaning sequence of the files to be cleaned in the cleaning set;
Step S4, when cleaning is performed based on the cleaning sequence arrangement result, monitoring a cleaning process in real time, sequentially judging the consistency of cleaning target data, cleaning process data and cleaning operation, and determining a corresponding process control mode according to the judging result;
Step S5, periodically acquiring a cleaning result, and carrying out feedback adjustment on cleaning conditions according to the cleaning result;
the cleaning conditions comprise the initial cleaning period, a total marking threshold value and a sub marking threshold value.
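How the cleaning conditions might drive the choice in step S2 can be sketched as below. The comparison rule (threshold cleaning when the total mark count reaches the total marking threshold or any subset reaches the sub marking threshold, periodic cleaning otherwise) is an assumption; the text names the thresholds but does not state the exact rule, and the function name is hypothetical.

```python
def choose_cleaning_mode(mark_counts, total_threshold, sub_threshold):
    """Return "threshold" or "periodic" for the current mark counts.

    mark_counts maps each cleaning type to its actual number of marks.
    """
    total = sum(mark_counts.values())
    if total >= total_threshold:
        return "threshold"
    if any(n >= sub_threshold for n in mark_counts.values()):
        return "threshold"
    # Otherwise wait for the next tick of the initial cleaning period.
    return "periodic"
```

Feedback adjustment in step S5 would then amount to changing `total_threshold`, `sub_threshold` or the period between runs.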
Compared with the prior art, the invention has the following beneficial effects. Scanning the files in the data lake in real time ensures that target files are found promptly, which helps the system respond to and process files to be cleaned quickly and reduces data redundancy and wasted storage space. Determining the actual cleaning mode from the scanning result, the cleaning type and the actual number makes the cleaning strategy more flexible and targeted, so that it adapts to different cleaning requirements and situations. Evaluating the cleaning urgency and sorting the files to be cleaned into an actual cleaning sequence ensures that important or urgent files are cleaned first and optimizes resource use and processing efficiency. Acquiring and analyzing the actual cleaning parameters during cleaning to judge the consistency of the cleaning process, and determining the corresponding process control mode from the judging result, helps guarantee the stability and reliability of the cleaning process and reduces the risk of errors and data loss. Periodically acquiring the cleaning result and feeding it back into the cleaning conditions allows the system to keep optimizing the cleaning strategy according to the actual cleaning effect, which reduces the need for manual intervention, improves management efficiency and accuracy, and prolongs the service life of the data lake.
In particular, scanning the files in the data lake in real time against the time, frequency, capacity, format and reference standards ensures full coverage and accurate identification, and the multi-dimensional scan can capture the various kinds of files that may need cleaning. Analyzing the scanning result in combination with the preset target cleaning strategy (such as the standard deadline, use, capacity and format strategies) accurately identifies the target files to be cleaned, and the strategy can be adjusted flexibly to different management requirements, which improves judgment accuracy. Marking the target files according to the judging result clearly distinguishes files of different cleaning types (such as expired files, infrequently used files, large-capacity files and files that do not conform to the format specification), and the classified marks let subsequent cleaning work proceed in an orderly way. Dividing the marked isolated files by cleaning type into a plurality of subsets to be cleaned refines the cleaning work and makes it easier to apply different cleaning strategies to different types of files. Accumulating the actual number of marks of each subset to be cleaned provides data support, quantifies the workload of the cleaning task, and facilitates resource allocation and task scheduling. Determining the actual cleaning mode (threshold cleaning mode or periodic cleaning mode) from the actual number of marks and the cleaning type of each subset allows the cleaning strategy to be adjusted to actual conditions and improves cleaning efficiency and effect. Because the system automatically performs the scanning, judging, marking, dividing and counting operations, manual intervention is reduced and the automation level and processing efficiency of cleaning work are improved. Accurate identification and classified marking thereby improve the accuracy, efficiency and flexibility of data lake file cleaning, reduce data redundancy, optimize the storage structure and management efficiency of the data lake, and prolong its service life.
In particular, using the multi-dimensional time, frequency, capacity, format and reference scanning standards together with the specific target cleaning strategies (such as the deadline, frequency of use, capacity limit and format requirement) allows different types of target files to be identified accurately, and the fine-grained conditions keep the cleaning operation accurate and targeted. Whether a file meets the cleaning standard is judged from its characteristics, such as creation time, frequency of use, capacity and format, against the respective standard strategy; this multi-dimensional identification covers the various file types in the data lake that may need cleaning and improves cleaning efficiency. According to the analysis result of the judging unit, the target files can be given different types of marks (such as expired, infrequently used, cold, capacity abnormal and format nonconforming), so that subsequent cleaning can proceed in an orderly way, and the flexible classification helps optimize the cleaning flow and resource allocation. The marked files are divided by cleaning type into different subsets to be cleaned, each with its own target and strategy, which makes the cleaning operation more efficient. The actual cleaning mode is then matched to the number of marked files: cleaning is triggered automatically when a preset threshold is met, or is otherwise carried out according to the cleaning period, and both the thresholds and the period can be adjusted. In this way the system manages the files in the data lake effectively, optimizes the storage structure and management efficiency, prolongs the service life of the data lake, improves the overall data management level and the cost efficiency of operation, and supports the long-term operation and management of the data lake.
In particular, automatically determining the number of single cleaning from the actual cleaning mode and generating the cleaning set reduces manual intervention and improves the automation and efficiency of cleaning tasks. Calculating the cleaning urgency of each isolated file from the marking result and the actual number of marks ensures that files with high urgency are cleaned first, which optimizes resource use and keeps cleaning effective and timely. Sorting the isolated files by cleaning urgency produces the actual cleaning sequence; this prioritized sequence ensures that the files most in need of cleaning are processed first, gives clear guidance to subsequent cleaning operations, and reduces complexity and uncertainty in decision-making. Calculating and sorting by urgency also makes better use of system resources, because low-priority files do not occupy excessive resources, which improves overall system performance. Dynamically adjusting the number of single cleaning according to the actual cleaning mode lets the system cope flexibly with different cleaning requirements. The preprocessing module as a whole makes data-driven decisions, which improves the accuracy and reliability of the cleaning work.
In particular, monitoring the cleaning process in real time yields the actual cleaning parameters, which comprise the actual cleaning target data, the actual cleaning operation result and the actual cleaning process data; this keeps the cleaning process transparent and traceable, allows problems to be found and handled in time, and improves the reliability of the cleaning operation. Judging whether the actual cleaning process data and the metadata are synchronized, and sending a first alarm signal when they are not, guarantees the consistency and integrity of the data and avoids errors and confusion caused by unsynchronized data. Determining the single cleaning success rate from the actual cleaning operation result and the target cleaning strategy, and sending a second alarm signal when the success rate is not equal to 100%, ensures the effectiveness and accuracy of the cleaning operation and identifies unsuccessful cleaning operations promptly. Judging from the actual cleaning target data and the current execution task data whether conflict data exist, and sending a third alarm signal if so, prevents and resolves data conflicts so that cleaning proceeds smoothly. The process control mode is determined by the alarm signal: the first alarm signal triggers data re-reading, which ensures the accuracy and consistency of the data; the second alarm signal triggers data rollback recovery, which avoids data loss caused by a failed cleaning; and the third alarm signal triggers the data locking mechanism, which prevents conflict data from affecting subsequent operations. By raising distinct alarm signals for different types of errors and applying the corresponding control measure, the system can respond quickly to abnormal conditions, shorten fault time during cleaning, and reduce manual intervention, which improves the automation, stability and reliability of the cleaning operation.
In particular, acquiring the cleaning result periodically ensures that the actual effect of the cleaning operation and the amount of storage released are obtained in time, so that the feedback correction module evaluates and adjusts on the basis of the latest data, which improves the timeliness and accuracy of feedback. The cleaning grade is determined from the absolute value of the difference between the actual storage release amount and the preset standard storage release amount (the first difference absolute value) in combination with the preset first evaluation value, which gives an objective measure of the cleaning effect and a basis for subsequent adjustment. The feedback adjustment mode, either single feedback adjustment or integral feedback adjustment, is then selected according to the cleaning grade, so that adjustment can be targeted to different cleaning conditions and requirements. Through periodic acquisition, evaluation and adjustment, the system optimizes the use of storage resources, releasing as much storage as possible while minimizing the impact on normal operation of the data lake, keeps the cleaning strategy sound and effective, and supports the long-term management and continuous improvement of the data lake.
In particular, scanning and marking the files and then cleaning in batches according to the actual number of marks or the initial cleaning period gives the administrator a flexible cleaning strategy, in which the cleaning frequency and scale can be adjusted to the actual condition of the files in the data lake. Generating a cleaning set and arranging the files in sequence keeps the cleaning process orderly and avoids confusion and conflicts when files are cleaned. Real-time monitoring and consistency judgment ensure the accuracy of the cleaning process and the safety of the data: the cleaning operation follows the set rules, and important data are not deleted by mistake. Feeding the cleaning result back into the cleaning conditions lets the cleaning strategy adapt to the actual use of the data lake, which keeps the cleaning operation effective and the performance of the data lake optimized. Regular and on-demand cleaning releases unneeded space in the data lake, improves the utilization of its storage space, and helps prevent the performance degradation caused by the accumulation of invalid isolated data.
Detailed Description
The invention will be further described with reference to examples for the purpose of making the objects and advantages of the invention more apparent, it being understood that the specific examples described herein are given by way of illustration only and are not intended to be limiting.
Preferred embodiments of the present invention are described below with reference to the accompanying drawings. It should be understood by those skilled in the art that these embodiments are merely for explaining the technical principles of the present invention, and are not intended to limit the scope of the present invention.
It should be noted that, in the description of the present invention, terms such as "upper," "lower," "left," "right," "inner," "outer," and the like indicate directions or positional relationships based on the directions or positional relationships shown in the drawings, which are merely for convenience of description, and do not indicate or imply that the apparatus or elements must have a specific orientation, be constructed and operated in a specific orientation, and thus should not be construed as limiting the present invention.
Referring to Figs. 1-4, Fig. 1 is a schematic structural diagram of a system for on-time cleaning of data lake files based on storage-compute separation according to an embodiment of the present invention, Fig. 2 is a schematic structural diagram of the preprocessing module in the system, Fig. 3 is a schematic structural diagram of the feedback correction module in the system, and Fig. 4 is a flowchart of a method for on-time cleaning of data lake files based on storage-compute separation according to an embodiment of the present invention.
The invention provides a system for on-time cleaning of data lake files based on storage-compute separation, comprising:
the identification module 1 is used for scanning all files in the data lake in real time, sequentially identifying target files according to scanning results, judging cleaning types of the target files, recording the actual number of the target files, and determining an actual cleaning mode according to the cleaning types and the actual number;
the preprocessing module 2 is used for generating a cleaning set for each target file obtained by scanning according to the actual cleaning mode, sequentially evaluating the cleaning urgency of each file to be cleaned in the cleaning set, and sequencing each file to be cleaned according to the evaluation result so as to obtain an actual cleaning sequence;
the cleaning module 3 is used for cleaning according to the actual cleaning sequence, acquiring actual cleaning parameters in the cleaning process, analyzing the actual cleaning parameters to judge the consistency of the cleaning process, and determining a corresponding process control mode according to a judging result;
and the feedback correction module 4 is used for periodically acquiring a cleaning result and carrying out feedback adjustment on the cleaning condition according to the cleaning result.
In this embodiment of the invention, scanning the files in the data lake in real time ensures that target files are found promptly, which helps the system respond to and process files to be cleaned quickly and reduces data redundancy and wasted storage space. Determining the actual cleaning mode from the scanning result, the cleaning type and the actual number makes the cleaning strategy more flexible and targeted, so that it adapts to different cleaning requirements and situations. Evaluating the cleaning urgency and sorting the files to be cleaned into an actual cleaning sequence ensures that important or urgent files are cleaned first and optimizes resource use and processing efficiency. Acquiring and analyzing the actual cleaning parameters during cleaning to judge the consistency of the cleaning process, and determining the corresponding process control mode from the judging result, guarantees the stability and reliability of the cleaning process and reduces the risk of errors and data loss. Periodically acquiring the cleaning result and feeding it back into the cleaning conditions forms a closed-loop feedback mechanism through which the system keeps optimizing the cleaning strategy according to the actual cleaning effect; this improves cleaning efficiency and effect, reduces the need for manual intervention, improves management efficiency and accuracy, and prolongs the service life of the data lake.
Specifically, the identification module 1 in this embodiment includes a scanning unit, a judging unit, a marking unit, a dividing unit, a counting unit, and a first determining unit;
The scanning unit is used for scanning all files in the data lake in real time according to a scanning standard so as to obtain an actual scanning result;
the judging unit is used for analyzing each file according to the actual scanning result and combining with a preset target cleaning strategy so as to identify and obtain a plurality of target files;
The target cleaning strategy comprises a standard deadline strategy, a standard use strategy, a standard capacity strategy and a standard format strategy;
The marking unit is used for marking each target file according to the judging result so as to obtain isolated files with different cleaning types;
The dividing unit is used for dividing each isolated file according to the cleaning type to obtain a plurality of subsets to be cleaned;
the counting unit is used for accumulating the actual number of marks of each subset to be cleaned;
The first determining unit is configured to determine, according to the actual number of target files, the actual number of marks of each subset to be cleaned and the cleaning type, in combination with a preset initial cleaning period, that the actual cleaning mode is a threshold cleaning mode or a periodic cleaning mode.
Specifically, in this embodiment of the invention, scanning the files in the data lake in real time against the time, frequency, capacity, format and reference standards ensures full coverage and accurate identification, and the multi-dimensional scan can capture the various kinds of files that may need cleaning. Analyzing the scanning result in combination with the preset target cleaning strategy (such as the standard deadline, use, capacity and format strategies) accurately identifies the target files to be cleaned, and the strategy can be adjusted flexibly to different management requirements, which improves judgment accuracy. Marking the target files according to the judging result clearly distinguishes files of different cleaning types (such as expired files, infrequently used files, large-capacity files and files that do not conform to the format specification), and the classified marks let subsequent cleaning work proceed in an orderly way. Dividing the marked isolated files by cleaning type into a plurality of subsets to be cleaned refines the cleaning work and makes it easier to apply different cleaning strategies to different types of files. Accumulating the actual number of marks of each subset to be cleaned provides data support, quantifies the workload of the cleaning task, and facilitates resource allocation and task scheduling. Determining the actual cleaning mode (threshold cleaning mode or periodic cleaning mode) from the actual number of marks and the cleaning type of each subset allows the cleaning strategy to be adjusted to actual conditions and improves cleaning efficiency and effect. The units work cooperatively and automatically perform the scanning, judging, marking, dividing and counting operations, which reduces manual intervention and improves the automation level and processing efficiency of cleaning work. Accurate identification and classified marking thereby improve the accuracy, efficiency and flexibility of data lake file cleaning, reduce data redundancy, optimize the storage structure and management efficiency of the data lake, and prolong its service life.
Specifically, in this embodiment, for any file in the data lake, when the scanning standard is the time standard, the target cleaning strategy is the standard deadline strategy; the scanning unit obtains the creation time of the file and calculates the actual creation duration from the creation time and the current time;
the judging unit judges whether the file is the target file according to the actual creation duration and the standard deadline strategy;
if the actual creation duration exceeds the duration set by the standard deadline strategy, the judging unit judges that the file is the target file;
the marking unit marks the target file as an expired isolated file;
If the actual creation duration does not exceed the duration set by the standard deadline strategy, the judging unit judges that the file is not the target file;
in this embodiment, the standard deadline strategy is a creation duration of three months;
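The deadline check described above can be sketched as follows. Approximating "three months" as 90 days is an assumption made here for illustration, and the function and constant names are hypothetical.

```python
from datetime import datetime, timedelta

# "Three months" approximated as 90 days (assumption; the text gives no day count).
STANDARD_DEADLINE = timedelta(days=90)


def is_expired(created_at, now, deadline=STANDARD_DEADLINE):
    """True when the actual creation duration exceeds the standard deadline."""
    actual_creation_duration = now - created_at
    return actual_creation_duration > deadline
```

A file whose creation duration equals the deadline exactly is not marked, matching the "does not exceed" branch in the text.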
Based on the standard usage policy of the target cleaning policy when the scanning standard is the frequency standard, the scanning unit obtains the actual usage frequency of the file;
the judging unit judges whether the file is the target file according to the actual use frequency and the standard use strategy;
If the actual use frequency is smaller than or equal to the standard use policy, the judging unit judges that the file is the target file;
the marking unit marks the target file as an unusual isolated file;
If the actual use frequency is greater than the standard use strategy, the judging unit judges that the file is not the target file;
Based on the target cleaning strategy being the standard use strategy when the scanning standard is the reference standard, the scanning unit obtains the actual reference times of the file;
the judging unit judges whether the file is the target file according to the actual reference times and the standard use strategy;
If the actual reference number is smaller than or equal to the standard use policy, the judging unit judges that the file is the target file;
the marking unit marks the target file as a cold isolated file;
If the actual reference number is greater than the standard use policy, the judging unit judges that the file is not the target file;
in this embodiment, the standard use strategy is a use frequency of 10 and a reference count of 8 within three months;
When the scanning standard is the capacity standard, the target cleaning strategy is the standard capacity strategy, and the scanning unit acquires the actual capacity of the file;
the judging unit judges whether the file is the target file according to the actual capacity and the standard capacity strategy;
If the actual capacity is smaller than or equal to the minimum capacity value in the standard capacity strategy, the judging unit judges that the file is the target file;
if the actual capacity is greater than or equal to the maximum capacity value in the standard capacity strategy, the judging unit judges that the file is the target file;
the marking unit marks the target file as a capacity abnormality isolated file;
If the actual capacity is larger than the minimum capacity value and smaller than the maximum capacity value, the judging unit judges that the file is not the target file;
in this embodiment, the standard capacity policy is [minimum capacity value = 3 KB, maximum capacity value = 3 GB];
based on the target cleaning policy being the standard format policy when the scanning standard is the format standard, the scanning unit obtains the actual format of the file;
the judging unit judges whether the file is the target file according to the actual format and the standard format strategy;
if the actual format is the standard format policy, the judging unit judges that the file is the target file;
The marking unit marks the target file as a failure isolated file;
if the actual format is not the standard format policy, the judging unit judges that the file is not the target file;
in this embodiment, the standard format policy is an operation failure format;
the dividing unit divides each target file according to the cleaning type,
Based on the cleaning type being the expired isolated file, a first subset to be cleaned is divided and formed, and the counting unit accumulates the first mark actual number of the first subset to be cleaned;
Dividing and forming a second subset to be cleaned based on the cleaning type as the unusual isolated file, and accumulating the actual number of second marks of the second subset to be cleaned by the counting unit;
Dividing and forming a third subset to be cleaned based on the cleaning type as the cold isolated file, and accumulating the third mark actual number of the third subset to be cleaned by the counting unit;
dividing and forming a fourth subset to be cleaned based on the cleaning type as the capacity abnormality isolated file, and accumulating the fourth mark actual number of the fourth subset to be cleaned by the counting unit;
Dividing and forming a fifth subset to be cleaned based on the cleaning type as the failure isolated file, and accumulating the actual number of fifth marks of the fifth subset to be cleaned by the counting unit;
Based on the fact that the actual number exceeds a preset total mark threshold and the initial cleaning period is not reached, the first determining unit determines that the actual cleaning mode is a first threshold cleaning mode;
Based on the fact that the actual number does not exceed a preset total mark threshold value and the initial cleaning period is not reached, the first determining unit determines that cleaning is not started at the moment;
based on the fact that the actual number does not exceed a preset total mark threshold and the initial cleaning period is reached, the first determining unit determines that the actual cleaning mode is the period cleaning mode;
the first threshold cleaning mode is that cleaning is started when the number of marked target files reaches the total marking threshold;
the periodic cleaning mode is that each time the data lake runs one initial cleaning period, cleaning is started once;
For any subset to be cleaned, if the actual number of marks corresponding to the subset to be cleaned exceeds a preset sub-mark threshold value, judging that the subset to be cleaned accords with a single judgment condition;
If the actual number of marks corresponding to the subset to be cleaned does not exceed the sub-mark threshold, judging that the subset to be cleaned does not meet the single judgment condition;
based on the actual number not exceeding the total marking threshold and not reaching the initial cleaning period,
If the number of the items meeting the single judgment condition in the first subset to be cleaned, the second subset to be cleaned, the third subset to be cleaned, the fourth subset to be cleaned and the fifth subset to be cleaned is greater than or equal to 2, the first determining unit judges that the actual cleaning mode is a second threshold cleaning mode;
If the number of items, which meet the single judgment condition, in the first subset to be cleaned, the second subset to be cleaned, the third subset to be cleaned, the fourth subset to be cleaned and the fifth subset to be cleaned is smaller than 2, the first determining unit judges that cleaning is not started at the moment;
the second threshold cleaning mode is that cleaning is started when the number of marks in any two subsets to be cleaned exceeds the sub-mark threshold;
In this embodiment, the initial cleaning period is set to 2h, the total marking threshold is set to 30, and the sub-marking threshold is set to 10.
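The mode selection rules above can be illustrated with a minimal Python sketch. It is not part of the claimed system: the function and constant names are hypothetical, "the actual number" is assumed to be the sum of the mark counts of all subsets, "exceeds" is read as strictly greater than, and the case in which the total mark threshold is exceeded after the initial cleaning period has already been reached (which the embodiment does not describe) is assumed to fall under the periodic cleaning mode.

```python
from enum import Enum

# Values stated in this embodiment; the constant names are illustrative.
INITIAL_CLEANING_PERIOD_H = 2.0
TOTAL_MARK_THRESHOLD = 30
SUB_MARK_THRESHOLD = 10


class CleaningMode(Enum):
    FIRST_THRESHOLD = "first threshold cleaning mode"
    SECOND_THRESHOLD = "second threshold cleaning mode"
    PERIODIC = "periodic cleaning mode"
    NOT_STARTED = "cleaning not started"


def determine_cleaning_mode(subset_counts, hours_since_last_cleaning):
    """Select the actual cleaning mode.

    subset_counts maps each cleaning type to the actual number of marks
    in its subset to be cleaned (assumed input shape).
    """
    # Assumption: the "actual number" is the total over all subsets.
    total = sum(subset_counts.values())
    period_reached = hours_since_last_cleaning >= INITIAL_CLEANING_PERIOD_H

    if period_reached:
        # Assumption: once the period is reached, cleaning is periodic
        # regardless of the total (only the "not exceeded" case is stated).
        return CleaningMode.PERIODIC
    if total > TOTAL_MARK_THRESHOLD:
        return CleaningMode.FIRST_THRESHOLD

    # Count the subsets that meet the single judgment condition.
    qualifying = sum(1 for n in subset_counts.values() if n > SUB_MARK_THRESHOLD)
    if qualifying >= 2:
        return CleaningMode.SECOND_THRESHOLD
    return CleaningMode.NOT_STARTED
```

For example, with 11 expired and 12 unusual isolated files half an hour into the period, the total of 23 stays below 30 but two subsets exceed 10, so the sketch selects the second threshold cleaning mode.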
Specifically, by combining the scanning standards of multiple dimensions (time, frequency, capacity, format and reference) with specific target cleaning strategies (such as deadline, frequency of use, capacity limit and format requirement), the embodiment of the invention can accurately identify different types of target files, and the fine-grained condition settings ensure the accuracy and pertinence of the cleaning operation. Whether a file meets the cleaning standard is judged from its characteristics, such as creation time, frequency of use, capacity and format, against the respective standard strategy; this multi-dimensional identification effectively covers the various file types in the data lake that may need cleaning and improves cleaning efficiency. According to the analysis result of the judging unit, the target files are given different types of marks (such as expired, unusual, capacity abnormality and failure), so that subsequent cleaning work proceeds in an orderly manner and the cleaning flow and resource allocation can be optimized. The marked files are divided by cleaning type into different subsets to be cleaned, each with its own cleaning target and strategy, which makes the cleaning work more precise. The actual cleaning mode is determined automatically by comparing the actual number of marks with the preset total mark threshold and sub-mark threshold and by checking whether the initial cleaning period has been reached, so that cleaning starts when a threshold condition or the period condition is met and can be adapted to actual conditions. The system can thus effectively manage the files in the data lake, optimize the storage structure and management efficiency, and prolong the service life of the data lake, thereby improving the overall data management level and cost efficiency, making the cleaning operation efficient and intelligent, and providing strong support for the long-term operation and management of the data lake.
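The per-file identification, marking, dividing and counting described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the `FileRecord` fields and all function names are hypothetical, "three months" is approximated as 90 days, 1 KB is taken as 1024 bytes, the operation failure format is modeled as a boolean flag, and a file is allowed to carry more than one mark (the embodiment does not say whether marks are exclusive).

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Policy values stated in this embodiment; the names are illustrative.
MAX_AGE = timedelta(days=90)        # "three months", approximated (assumption)
STANDARD_USE_FREQUENCY = 10         # uses within three months
STANDARD_REFERENCE_COUNT = 8        # references within three months
MIN_CAPACITY_BYTES = 3 * 1024       # 3 KB, taking 1 KB = 1024 bytes (assumption)
MAX_CAPACITY_BYTES = 3 * 1024 ** 3  # 3 GB


@dataclass
class FileRecord:
    """Hypothetical metadata record produced by the scanning unit."""
    path: str
    created_at: datetime
    use_frequency: int
    reference_count: int
    size_bytes: int
    operation_failed: bool  # stands in for the "operation failure format"


def cleaning_types(f, now):
    """Return the marks that apply to one file (may be empty)."""
    marks = []
    if now - f.created_at > MAX_AGE:
        marks.append("expired")
    if f.use_frequency <= STANDARD_USE_FREQUENCY:
        marks.append("unusual")
    if f.reference_count <= STANDARD_REFERENCE_COUNT:
        marks.append("cold")
    if f.size_bytes <= MIN_CAPACITY_BYTES or f.size_bytes >= MAX_CAPACITY_BYTES:
        marks.append("capacity_abnormal")
    if f.operation_failed:
        marks.append("failed")
    return marks


def divide_and_count(files, now):
    """Divide marked files into subsets and count the marks per subset."""
    subsets = {}
    for f in files:
        for mark in cleaning_types(f, now):
            subsets.setdefault(mark, []).append(f.path)
    counts = {mark: len(paths) for mark, paths in subsets.items()}
    return subsets, counts
```

The `counts` dictionary corresponds to the actual number of marks of each subset to be cleaned and is the quantity that the thresholds of this embodiment are compared against.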
Specifically, the preprocessing module 2 in the present embodiment includes a generating unit, a calculating unit, and a comparing unit;
The generating unit is used for determining the number of single cleaning according to the actual cleaning mode and generating the cleaning set based on the number of single cleaning;
The calculating unit is used for calculating the cleaning urgency of each isolated file according to the marking result and the marking actual quantity;
the comparison unit is used for sorting according to the numerical value of the cleaning urgency of each isolated file so as to obtain the actual cleaning sequence.
Specifically, cleaning urgency = feature value corresponding to the cleaning type × actual number of marks of the cleaning type;
in this embodiment, the feature value corresponding to the expired isolated file is set to 1.5, the feature value corresponding to the unusual isolated file is set to 1.3, the feature value corresponding to the cold isolated file is set to 1.2, the feature value corresponding to the capacity abnormality isolated file is set to 1.2, and the feature value corresponding to the failure isolated file is set to 1.3.
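The urgency formula and the sorting performed by the comparison unit can be sketched as follows. The function names and the representation of the cleaning set as (path, cleaning type) pairs are assumptions made for illustration; the feature values are those of this embodiment, and the sort is descending so that files with higher urgency come first.

```python
# Feature values stated in this embodiment, keyed by illustrative type names.
FEATURE_VALUES = {
    "expired": 1.5,
    "unusual": 1.3,
    "cold": 1.2,
    "capacity_abnormal": 1.2,
    "failed": 1.3,
}


def cleaning_urgency(cleaning_type, mark_counts):
    """Urgency = feature value of the type x actual number of marks of the type."""
    return FEATURE_VALUES[cleaning_type] * mark_counts[cleaning_type]


def cleaning_order(cleaning_set, mark_counts):
    """Sort (path, cleaning_type) pairs by descending cleaning urgency.

    sorted() is stable, so files with equal urgency keep their input order.
    """
    return sorted(
        cleaning_set,
        key=lambda item: cleaning_urgency(item[1], mark_counts),
        reverse=True,
    )
```

With 4 expired, 5 unusual and 2 cold marks, for instance, the urgencies are 1.5 × 4 = 6.0, 1.3 × 5 = 6.5 and 1.2 × 2 = 2.4, so unusual isolated files are cleaned first, then expired ones, then cold ones.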
The embodiment of the invention automatically determines the single cleaning number according to the actual cleaning mode and generates the cleaning set, which reduces manual intervention and improves the automation and efficiency of cleaning tasks. The cleaning urgency of each isolated file is calculated from the marking result and the actual number of marks, so that files with high urgency are cleaned first, resource use is optimized, and cleaning is effective and timely. The isolated files are sorted by cleaning urgency to generate the actual cleaning sequence, which ensures that the files most in need of cleaning are processed first, gives clear guidance to subsequent cleaning operations, and reduces complexity and uncertainty in the decision process. Calculating and sorting by cleaning urgency also makes better use of system resources and prevents low-priority files from occupying excessive resources, which improves the overall performance of the system. Because the single cleaning number is adjusted dynamically according to the actual cleaning mode, the system can cope flexibly with different cleaning requirements. The preprocessing module as a whole thereby improves the accuracy and reliability of the cleaning task.
Specifically, the cleaning module 3 in this embodiment includes a monitoring unit, a first judging unit, a second judging unit, a third judging unit, and a second determining unit;
The monitoring unit is used for monitoring the cleaning process in real time so as to obtain the actual cleaning parameters;
the first judging unit is used for judging whether the actual cleaning process data and the metadata are synchronous or not and sending a first alarm signal based on the fact that the actual cleaning process data and the metadata are not synchronous;
the actual cleaning parameters comprise actual cleaning target data, actual cleaning operation results and actual cleaning process data;
The second judging unit is used for determining a single cleaning success rate according to the actual cleaning operation result and the target cleaning strategy, and sending a second alarm signal when the single cleaning success rate is not equal to 100%;
The third judging unit is used for judging whether conflict data exist according to the actual cleaning target data and the current execution task data, and sending a third alarm signal when the conflict data exist;
the second determining unit is capable of determining that the process control mode is data re-reading according to the first alarm signal;
the second determining unit is capable of determining that the process control mode is data rollback recovery according to the second alarm signal;
The second determining unit is capable of determining the process control mode as a data locking mechanism based on the third alarm signal.
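The three consistency checks and the mapping from alarm signals to process control modes can be sketched as follows. The `CleaningParameters` fields and the function name are hypothetical, and the single cleaning success rate is assumed here to be the share of targeted files that were actually cleaned; the embodiment only states that it is determined from the actual cleaning operation result and the target cleaning strategy.

```python
from dataclasses import dataclass
from enum import Enum


class ProcessControl(Enum):
    DATA_REREAD = "data re-reading"              # response to the first alarm
    ROLLBACK_RECOVERY = "data rollback recovery"  # response to the second alarm
    DATA_LOCKING = "data locking mechanism"       # response to the third alarm


@dataclass
class CleaningParameters:
    """Hypothetical snapshot of the actual cleaning parameters."""
    process_data_synced: bool  # process data synchronous with metadata?
    files_targeted: int        # files the cleaning strategy selected
    files_cleaned: int         # files actually cleaned
    target_paths: frozenset    # actual cleaning target data


def determine_process_controls(params, running_task_paths):
    """Return the process control modes triggered by the three checks."""
    controls = []
    # First judging unit: process data and metadata must be synchronous.
    if not params.process_data_synced:
        controls.append(ProcessControl.DATA_REREAD)
    # Second judging unit: success rate must equal 100 % (assumed definition).
    if params.files_cleaned != params.files_targeted:
        controls.append(ProcessControl.ROLLBACK_RECOVERY)
    # Third judging unit: targets must not overlap currently executing tasks.
    if params.target_paths & running_task_paths:
        controls.append(ProcessControl.DATA_LOCKING)
    return controls
```

Returning a list rather than a single mode is a design choice of this sketch, since more than one alarm may be raised during the same cleaning pass.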
The embodiment of the invention monitors the cleaning process in real time to obtain the actual cleaning parameters, including the actual cleaning target data, the actual cleaning operation result and the actual cleaning process data, which ensures the transparency and traceability of the cleaning process, allows problems to be found and handled in time, and improves the reliability of the cleaning operation. By judging whether the actual cleaning process data and the metadata are synchronous and sending the first alarm signal when they are not, the consistency and integrity of the data are ensured and errors and confusion caused by unsynchronized data are avoided. The single cleaning success rate is determined according to the actual cleaning operation result and the target cleaning strategy, and the second alarm signal is sent when the success rate is not equal to 100%; this ensures the effectiveness and accuracy of the cleaning operation, identifies and handles unsuccessful cleaning operations in time, and improves the overall cleaning effect. Whether conflict data exist is judged by combining the actual cleaning target data with the current execution task data, and the third alarm signal is sent if they do, which prevents and resolves data conflicts and keeps data cleaning running smoothly. The corresponding process control mode is determined according to the alarm signal: the first alarm signal triggers data re-reading, which ensures the accuracy and consistency of the data; the second alarm signal triggers data rollback recovery, which avoids data loss caused by cleaning failure; and the third alarm signal triggers the data locking mechanism, which prevents conflict data from affecting subsequent operations. By sending alarm signals for different types of errors and adopting the corresponding control measures, the system can respond quickly to various abnormal conditions, shorten the fault time in the cleaning process, and reduce manual intervention, which improves the stability, reliability and automation level of the cleaning operation.
Specifically, the feedback correction module 4 in this embodiment includes an acquisition unit, an evaluation unit, and an adjustment unit;
The acquisition unit is used for periodically acquiring the cleaning result;
the evaluation unit is used for determining a cleaning grade according to the absolute value of the first difference value and a preset first evaluation value;
The adjusting unit is used for determining that the feedback adjusting mode is single feedback adjustment or integral feedback adjustment according to the cleaning grade;
The first difference absolute value is the absolute value of the difference between the actual storage release amount and the preset standard storage release amount.
Specifically, the first difference absolute value= |actual storage release amount-standard storage release amount|;
If the first difference absolute value is smaller than or equal to the first evaluation value, judging that the cleaning grade is grade two;
If the first difference absolute value is larger than the first evaluation value and the actual storage release amount is larger than the standard storage release amount, judging that the cleaning grade is grade one;
If the first difference absolute value is larger than the first evaluation value and the actual storage release amount is smaller than the standard storage release amount, judging that the cleaning grade is grade three, and determining that the feedback adjustment mode is the integral feedback adjustment;
In this embodiment, the standard storage release amount is set to 40%, and the first evaluation value is set to 8%.
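The grade evaluation above can be sketched as follows, with both quantities expressed in percentage points. The function and constant names are hypothetical. Only the grade at which integral feedback adjustment is selected is stated in this embodiment, so the sketch returns no mode for the other two grades rather than guessing one.

```python
# Values stated in this embodiment, in percentage points.
STANDARD_STORAGE_RELEASE = 40.0
FIRST_EVALUATION_VALUE = 8.0


def cleaning_grade(actual_storage_release):
    """Return the cleaning grade (1, 2 or 3) for a release amount in percent."""
    first_difference_abs = abs(actual_storage_release - STANDARD_STORAGE_RELEASE)
    if first_difference_abs <= FIRST_EVALUATION_VALUE:
        return 2
    if actual_storage_release > STANDARD_STORAGE_RELEASE:
        return 1
    return 3


def feedback_adjustment_mode(grade):
    """Map a grade to a feedback adjustment mode where the source states one."""
    if grade == 3:
        return "integral feedback adjustment"
    # The mapping for grades one and two is not given in this embodiment.
    return None
```

An actual storage release amount of 45% therefore gives |45 - 40| = 5 ≤ 8 and grade two, 55% gives grade one, and 20% gives grade three with integral feedback adjustment.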
The embodiment of the invention periodically acquires the cleaning result, so that the actual effect of the cleaning operation and the storage release amount are obtained in time and the feedback correction module evaluates and adjusts on the basis of the latest data, which improves the timeliness and accuracy of feedback. The cleaning grade is determined from the first difference absolute value (the absolute value of the difference between the actual storage release amount and the preset standard storage release amount) in combination with the preset first evaluation value; this evaluation judges the cleaning effect objectively and provides a basis for subsequent adjustment. The feedback adjustment mode, either single feedback adjustment or integral feedback adjustment, is determined according to the cleaning grade, so that adjustment can be targeted to different cleaning conditions and requirements and the cleaning effect is optimized. Through periodic acquisition, evaluation and adjustment, the utilization of storage resources is optimized and the storage release effect is maximized while the influence on the normal operation of the data lake is minimized, which keeps the cleaning strategy effective and supports the long-term management and continuous improvement of the data lake.
On the other hand, the invention also provides a method for cleaning the data lake files on time based on calculation separation, which comprises the following steps:
step S1, scanning all files in the data lake, determining files to be cleaned according to a scanning result, and marking to obtain a plurality of isolated files;
Step S2, accumulating the isolated files to obtain the actual number of marks, and cleaning the data lake in batches according to the actual number of marks or the initial cleaning period;
Step S3, for any cleaning batch, generating a cleaning set from the cleaning batch and the corresponding isolated files, and arranging the cleaning sequence of the files to be cleaned in the cleaning set;
Step S4, when cleaning is performed based on the cleaning sequence arrangement result, monitoring a cleaning process in real time, sequentially judging the consistency of cleaning target data, cleaning process data and cleaning operation, and determining a corresponding process control mode according to the judging result;
Step S5, periodically acquiring a cleaning result, and carrying out feedback adjustment on cleaning conditions according to the cleaning result;
the cleaning conditions comprise the initial cleaning period, a total marking threshold value and a sub marking threshold value.
The method of the invention rapidly locates and marks the files to be cleaned through scanning, which improves the efficiency of cleaning work and avoids the time wasted in checking files one by one. Batch cleaning according to the actual number of marks or the initial cleaning period gives an administrator a flexible cleaning strategy, in which the frequency and scale of cleaning can be adjusted to the actual condition of the files in the data lake. Generating a cleaning set and arranging the files in sequence keeps the cleaning process orderly and avoids confusion and conflict during cleaning. Real-time monitoring and consistency judgment ensure the accuracy of the cleaning process and the safety of the data, ensure that cleaning operations follow the established rules, and avoid mistaken deletion of important data. Feedback adjustment of the cleaning conditions according to the cleaning result allows the cleaning strategy to adapt to the actual use of the data lake, which ensures the effectiveness of cleaning and the optimization of data lake performance. Regular cleaning releases unnecessary space in the data lake as required, improves the utilization rate of the storage space, helps to reduce the amount of stored data, and prevents invalid data from accumulating and degrading the data lake.
The compensation parameters and adjustment parameters used in the calculations of the invention serve two functions: the first is to balance the dimensions of the left and right sides of a formula, and the second is to adjust the numerical result. No specific values are assigned to them in the embodiments. The calculation formulas in the embodiments are intended to reflect intuitively the adjustment relation among the values, such as positive correlation and negative correlation, and, unless otherwise stated, the values of parameters that are not specifically limited are positive.
Thus far, the technical solution of the present invention has been described in connection with the preferred embodiments shown in the drawings, but it is easily understood by those skilled in the art that the scope of protection of the present invention is not limited to these specific embodiments. Equivalent modifications and substitutions for related technical features may be made by those skilled in the art without departing from the principles of the present invention, and such modifications and substitutions will be within the scope of the present invention.
The above description is only of the preferred embodiments of the present invention and is not intended to limit the present invention, and various modifications and variations of the present invention will be apparent to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention should be included in the protection scope of the present invention.