Detailed Description
The embodiments of the present invention are described below clearly and completely with reference to the accompanying drawings. The embodiments described are obviously only some, not all, of the embodiments of the present invention. All other embodiments obtained by those skilled in the art based on these embodiments without inventive effort fall within the scope of the invention.
As shown in Fig. 1, the automatic time series regression method according to an embodiment of the present invention includes the following steps.
S1, acquiring a time series data set and preprocessing the time series data set.
Time series data is a sequence of values of the same index recorded in time order. In one embodiment of the invention, outliers in the time series data set can be smoothed so as to reduce their influence on model accuracy. Outliers occur frequently in time series tasks, and their handling has a considerable impact on the result, which makes outlier processing more challenging for time series data than for non-time-series data. Time series data is usually strongly correlated with time, and the target value may drift to a different value range over time; if the global mean and standard deviation were applied directly, some non-outliers would be flagged and processed. In view of this, the embodiment of the invention adopts a combined global and local outlier smoothing scheme.
This scheme considers the global mean and standard deviation together with the mean and standard deviation of a time window around the current point and the values of neighboring points; the fold threshold for deviation from the global standard deviation is set larger so that normal values are not processed. Note also that the training set and the test set are handled somewhat differently: since the test data arrives step by step over time, time steps after the current point are not yet visible, so the test data set is processed based only on the adjacent time window before the current point. After an outlier is detected both globally and locally, a value in a relatively normal range can be computed from the local mean, the local standard deviation, and the left and right neighbors of the current point, and reassigned as the new value of the current point.
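As an illustration, the global-and-local smoothing described above may be sketched as follows (the function name, window size, and fold thresholds are assumptions for the sketch, not values fixed by the embodiment; a test-set variant would use only the window before the current point):

```python
import statistics

def smooth_outliers(values, window=5, global_k=5.0, local_k=3.0):
    # Global statistics; the fold (k) for the global deviation is set
    # larger so normal values that merely drift with time are not touched.
    g_mean = statistics.mean(values)
    g_std = statistics.pstdev(values) or 1e-9
    out = list(values)
    for i, x in enumerate(values):
        lo, hi = max(0, i - window), min(len(values), i + window + 1)
        local = values[lo:i] + values[i + 1:hi]
        if len(local) < 2:
            continue
        l_mean = statistics.mean(local)
        l_std = statistics.pstdev(local) or 1e-9
        # A point is smoothed only if it deviates both globally and locally.
        if abs(x - g_mean) > global_k * g_std and abs(x - l_mean) > local_k * l_std:
            left = values[i - 1] if i > 0 else l_mean
            right = values[i + 1] if i + 1 < len(values) else l_mean
            # Reassign a value in a relatively normal range from the local
            # mean and the left/right neighbors of the current point.
            out[i] = (left + right + l_mean) / 3.0
    return out
```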
S2, carrying out automatic time sequence characteristic engineering processing and data sampling on the preprocessed time sequence data set.
In time-series-related tasks, what happened within a certain past time window strongly influences future predictions, and the relevant window differs for data of different time granularities. Therefore, the embodiment of the invention mainly constructs features from time sliding windows over the series itself. The features produced by the automatic time series feature engineering include target features based on a time sliding window, target statistical features based on the time sliding window, target trend features based on the time sliding window, important original features based on the time sliding window, statistical features based on the time sliding window, and other features.
For the target features based on the time sliding window: in time series data the target is generally close to its values at adjacent time steps, and they are strongly correlated, so the recent past targets can be used directly as features. In addition, the time step interval of the data set is identified, i.e., whether the step is hours, minutes, days, weeks, or months, and the size of the feature window is determined from this interval by a model-validated search.
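A minimal sketch of taking adjacent past targets as features (the lag set is an illustrative assumption):

```python
def lag_features(target, lags=(1, 2, 3)):
    # Row i receives the target observed 1, 2, ... steps earlier;
    # None marks positions where that much history does not exist yet.
    return [[target[i - k] if i - k >= 0 else None for k in lags]
            for i in range(len(target))]
```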
For the target statistical features based on the time sliding window: after the sliding-window targets are built, further statistics are computed over them. There are two statistical schemes. The first computes statistics over the last N steps, with N varying by time-step granularity; for daily data, statistics are typically taken over the last 2, 3, 5, and 7 days, subject to memory limitations. The second divides one large time window into N segments and computes statistics over each segment separately. The statistics include the maximum, minimum, mean, standard deviation, and the like.
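The two statistical schemes above may be sketched as follows (window sizes and the statistic set are illustrative assumptions; only history before step i is used):

```python
import statistics

def rolling_target_stats(target, i, spans=(2, 3, 5, 7)):
    # Scheme 1: max/min/mean/std of the target over the last N steps,
    # for several window sizes N.
    feats = {}
    for n in spans:
        hist = target[max(0, i - n):i]
        if hist:
            feats[f"last{n}_max"] = max(hist)
            feats[f"last{n}_min"] = min(hist)
            feats[f"last{n}_mean"] = statistics.mean(hist)
            feats[f"last{n}_std"] = statistics.pstdev(hist)
    return feats

def segmented_stats(target, i, window=8, segments=4):
    # Scheme 2: split one large history window into equal segments and
    # compute the mean of each segment separately.
    hist = target[max(0, i - window):i]
    size = max(1, len(hist) // segments)
    return [statistics.mean(hist[s:s + size])
            for s in range(0, len(hist), size)][:segments]
```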
For the target trend feature based on the time sliding window, the rate of change of the target is calculated, which reflects its trend of change:
ri=(ti-1−ti-2)/ti-2
where ri represents the rate of change of the target at the current time, ti-1 represents the target of the previous time node, and ti-2 represents the target of the time node before the previous one.
For the important original features based on the time sliding window, the model may first be trained on the raw features to obtain feature importances, and the features are then ranked by importance. Since other raw features matter less than the historical target, a smaller time window than that of the target can be selected, and the number of features used is then determined by the window size and the available system resources.
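The importance-based selection step may be sketched as follows, assuming the importance scores have already been obtained from a trained model (the feature names and scores below are hypothetical):

```python
def select_important(importances, top_k=3):
    # `importances` maps feature name -> model-derived importance score
    # (e.g. from a trained gradient-boosting model); keep the top_k names,
    # where top_k is bounded by the window size and system resources.
    return [f for f, _ in sorted(importances.items(),
                                 key=lambda kv: kv[1], reverse=True)[:top_k]]
```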
For the statistical features based on the time sliding window, statistics are computed separately for categorical features and numerical features. For categorical features, the frequency and ratio of occurrence of each feature value within the time window are counted. For numerical features, the computation is the same as the target-based statistics, i.e., maximum, minimum, mean, and standard deviation, but the time window is kept smaller.
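The categorical frequency-and-ratio statistics within a window may be sketched as (window size is an illustrative assumption):

```python
from collections import Counter

def window_cat_freq(values, i, window=5):
    # Frequency and ratio of each categorical value inside the window
    # of `window` steps before position i.
    hist = values[max(0, i - window):i]
    counts = Counter(hist)
    total = len(hist) or 1
    return {v: (c, c / total) for v, c in counts.items()}
```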
As for other features, besides those above, statistics computed on the training set are also tried directly as features of the whole data set. For example, the frequency and ratio of categorical feature values are counted globally over the training set; the frequencies and ratios of combinations of two categorical features of high importance are counted; and a categorical feature of high importance is combined with a numerical feature of high importance, computing statistics of the numerical feature grouped by the categorical feature. Cross combinations of the historical target with other features are also considered, such as multiplying or dividing the target by other, more important numerical features.
The automatic feature engineering and automatic feature selection phases are typically time- and memory-consuming, and the data may be sampled to speed up this process. Time series sampling requires care in the sampling scheme: if rows are sampled at random directly, data for the same ID at different timestamps are lost, the series become incomplete, the results deteriorate, and they are no longer comparable with results on the full data. In view of this, the embodiment of the invention samples the IDs in the time series data set at random, using different sampling ratios for different data volumes (the larger the data, the smaller the ratio); when the data volume is very large, the data is additionally truncated by time step and only the later time steps are retained. This sampling scheme is essentially consistent in effect with using the full data, and the final feature selection is relatively stable.
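The ID-level sampling may be sketched as follows (the record layout, ratio, and seed are illustrative assumptions):

```python
import random

def sample_by_id(rows, ratio, seed=0):
    # Sample whole IDs (keeping every timestamp of a sampled ID) rather
    # than random rows, so each sampled series stays complete.
    # `rows` is a list of (id, timestamp, ...) records.
    ids = sorted({r[0] for r in rows})
    rng = random.Random(seed)
    keep = set(rng.sample(ids, max(1, int(len(ids) * ratio))))
    return [r for r in rows if r[0] in keep]
```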
S3, building machine learning models of different types.
In an embodiment of the invention, two models with large differences, a linear model and a tree model, can be established; in particular, a linear regression model and a LightGBM model may be established.
S4, calculating dynamic weights based on a time sliding window from the time series data set after preprocessing, automatic time series feature engineering, and data sampling, so as to fuse the different types of machine learning models.
The above linear regression and LightGBM models perform quite differently across time series data sets: on some data sets the two are close, on some linear regression is better, and on others LightGBM is better. Analysis shows that these data sets vary significantly over time; the targets of some tasks keep increasing with time, and such data is often poorly fitted by tree models. Moreover, for the same data set, the relative performance of different models can also change greatly across different time periods.
Time series data is strongly tied to time, so in order to reduce the influence of the time factor on the model, the models can be fused by calculating dynamic weights based on a time sliding window.
Specifically, an initial fusion weight w0 may first be determined on the validation set; then a time window is set for the test set, and testing is performed with the initial fusion weight w0 in the first time window. After each time window ends, the optimal fusion weight for that window is obtained from its test results and updated according to a set rule, and the updated fusion weight is used for testing in the next time window. That is, testing is performed with the initial weight w0 in the first time window; when the first window ends, the optimal fusion weight w1 of that window is obtained from its test results, and w1 is then updated using the following formula:
w′1=r×w0+(1-r)×w1
where r is a memory factor, i.e., the proportion that the previous time window's weight contributes to the update of the current time window's weight.
The second window is then tested with w′1 as the fusion weight, the weight is updated again from its results, and so on. In this way, as time passes, results from longer ago have a smaller and smaller effect on the fusion.
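The per-window weight search and the memory-factor update may be sketched as follows (the grid search is one illustrative way to obtain a window's optimal weight; the embodiment does not fix the search method):

```python
def best_weight(y_true, pred_a, pred_b, grid=11):
    # Grid-search the fusion weight minimizing squared error on one window.
    best, best_err = 0.0, float("inf")
    for k in range(grid):
        w = k / (grid - 1)
        err = sum((w * a + (1 - w) * b - t) ** 2
                  for a, b, t in zip(pred_a, pred_b, y_true))
        if err < best_err:
            best, best_err = w, err
    return best

def update_weight(w_prev, w_best, r=0.5):
    # w' = r * w_prev + (1 - r) * w_best; r is the memory factor, i.e.
    # the share the previous window's weight keeps in the update.
    return r * w_prev + (1 - r) * w_best
```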
According to the automatic time series regression method of the embodiment of the invention, the time series data set is preprocessed, subjected to automatic time series feature engineering and data sampling, and different types of machine learning models are fused by calculating dynamic weights based on a time sliding window. Thus, in machine learning applications involving time series data, an application model can be obtained conveniently without relying on the accumulated experience and knowledge of a data scientist, and more accurate outputs can be obtained with the model.
Corresponding to the automatic time series regression method of the above embodiment, the invention also provides an automatic time series regression apparatus.
As shown in Fig. 2, the automatic time series regression apparatus according to the embodiment of the present invention includes a preprocessing module 10, a feature engineering and sampling module 20, a model building module 30, and a fusion module 40. The preprocessing module 10 is configured to acquire a time series data set and preprocess it; the feature engineering and sampling module 20 is configured to perform automatic time series feature engineering and data sampling on the preprocessed data set; the model building module 30 is configured to build different types of machine learning models; and the fusion module 40 is configured to calculate dynamic weights based on a time sliding window from the preprocessed, feature-engineered, and sampled time series data set, so as to fuse the different types of machine learning models.
Time series data is a sequence of values of the same index recorded in time order. In one embodiment of the present invention, the preprocessing module 10 may smooth outliers in the time series data set so as to reduce their influence on model accuracy. Outliers occur frequently in time series tasks, and their handling has a considerable impact on the result, which makes outlier processing more challenging for time series data than for non-time-series data. Time series data is usually strongly correlated with time, and the target value may drift to a different value range over time; if the global mean and standard deviation were applied directly, some non-outliers would be flagged and processed. In view of this, the embodiment of the invention adopts a combined global and local outlier smoothing scheme.
This scheme considers the global mean and standard deviation together with the mean and standard deviation of a time window around the current point and the values of neighboring points; the fold threshold for deviation from the global standard deviation is set larger so that normal values are not processed. Note also that the training set and the test set are handled somewhat differently: since the test data arrives step by step over time, time steps after the current point are not yet visible, so the test data set is processed based only on the adjacent time window before the current point. After an outlier is detected both globally and locally, a value in a relatively normal range can be computed from the local mean, the local standard deviation, and the left and right neighbors of the current point, and reassigned as the new value of the current point.
In time-series-related tasks, what happened within a certain past time window strongly influences future predictions, and the relevant window differs for data of different time granularities. Therefore, the feature engineering and sampling module 20 of the embodiment of the invention mainly constructs features from time sliding windows over the series itself. The features produced by the automatic time series feature engineering include target features based on a time sliding window, target statistical features based on the time sliding window, target trend features based on the time sliding window, important original features based on the time sliding window, statistical features based on the time sliding window, and other features.
For the target features based on the time sliding window: in time series data the target is generally close to its values at adjacent time steps, and they are strongly correlated, so the recent past targets can be used directly as features. In addition, the time step interval of the data set is identified, i.e., whether the step is hours, minutes, days, weeks, or months, and the size of the feature window is determined from this interval by a model-validated search.
For the target statistical features based on the time sliding window: after the sliding-window targets are built, further statistics are computed over them. There are two statistical schemes. The first computes statistics over the last N steps, with N varying by time-step granularity; for daily data, statistics are typically taken over the last 2, 3, 5, and 7 days, subject to memory limitations. The second divides one large time window into N segments and computes statistics over each segment separately. The statistics include the maximum, minimum, mean, standard deviation, and the like.
For the target trend feature based on the time sliding window, the rate of change of the target is calculated, which reflects its trend of change:
ri=(ti-1−ti-2)/ti-2
where ri represents the rate of change of the target at the current time, ti-1 represents the target of the previous time node, and ti-2 represents the target of the time node before the previous one.
For the important original features based on the time sliding window, the model may first be trained on the raw features to obtain feature importances, and the features are then ranked by importance. Since other raw features matter less than the historical target, a smaller time window than that of the target can be selected, and the number of features used is then determined by the window size and the available system resources.
For the statistical features based on the time sliding window, statistics are computed separately for categorical features and numerical features. For categorical features, the frequency and ratio of occurrence of each feature value within the time window are counted. For numerical features, the computation is the same as the target-based statistics, i.e., maximum, minimum, mean, and standard deviation, but the time window is kept smaller.
As for other features, besides those above, statistics computed on the training set are also tried directly as features of the whole data set. For example, the frequency and ratio of categorical feature values are counted globally over the training set; the frequencies and ratios of combinations of two categorical features of high importance are counted; and a categorical feature of high importance is combined with a numerical feature of high importance, computing statistics of the numerical feature grouped by the categorical feature. Cross combinations of the historical target with other features are also considered, such as multiplying or dividing the target by other, more important numerical features.
The automatic feature engineering and automatic feature selection phases are typically time- and memory-consuming, and the data may be sampled by the feature engineering and sampling module 20 to speed up this process. Time series sampling requires care in the sampling scheme: if rows are sampled at random directly, data for the same ID at different timestamps are lost, the series become incomplete, the results deteriorate, and they are no longer comparable with results on the full data. In view of this, the feature engineering and sampling module 20 of the embodiment of the invention samples the IDs in the time series data set at random, using different sampling ratios for different data volumes (the larger the data, the smaller the ratio); when the data volume is very large, the data is additionally truncated by time step and only the later time steps are retained. This sampling scheme is essentially consistent in effect with using the full data, and the final feature selection is relatively stable.
In an embodiment of the present invention, the model building module 30 may build two models with large differences, a linear model and a tree model; in particular, a linear regression model and a LightGBM model may be established.
The above linear regression and LightGBM models perform quite differently across time series data sets: on some data sets the two are close, on some linear regression is better, and on others LightGBM is better. Analysis shows that these data sets vary significantly over time; the targets of some tasks keep increasing with time, and such data is often poorly fitted by tree models. Moreover, for the same data set, the relative performance of different models can also change greatly across different time periods.
Time series data is strongly tied to time, so in order to reduce the influence of the time factor on the model, the fusion module 40 can fuse the models by calculating dynamic weights based on a time sliding window.
Specifically, the fusion module 40 may first determine an initial fusion weight w0 on the validation set, then set a time window for the test set, and test with the initial fusion weight w0 in the first time window. After each time window ends, the fusion module 40 obtains the optimal fusion weight for that window from its test results, updates it according to a set rule, and tests with the updated fusion weight in the next time window. That is, the fusion module 40 tests with the initial fusion weight w0 in the first time window; when the first window ends, the optimal fusion weight w1 of that window is obtained from its test results, and w1 is then updated using the following formula:
w′1=r×w0+(1-r)×w1
where r is a memory factor, i.e., the proportion that the previous time window's weight contributes to the update of the current time window's weight.
The second window is then tested with w′1 as the fusion weight, the weight is updated again from its results, and so on. In this way, as time passes, results from longer ago have a smaller and smaller effect on the fusion.
According to the automatic time series regression apparatus of the embodiment of the invention, the time series data set is preprocessed, subjected to automatic time series feature engineering and data sampling, and different types of machine learning models are fused by calculating dynamic weights based on a time sliding window. Thus, in machine learning applications involving time series data, an application model can be obtained conveniently without relying on the accumulated experience and knowledge of a data scientist, and more accurate outputs can be obtained with the model.
Corresponding to the embodiment, the invention also provides a computer device.
The computer device according to the embodiment of the present invention includes a memory, a processor, and a computer program stored in the memory and executable on the processor, and when the processor executes the computer program, the automatic time series regression method according to the above embodiment of the present invention can be implemented.
According to the computer device of the embodiment of the invention, when the processor executes the computer program stored in the memory, the time series data set is preprocessed, subjected to automatic time series feature engineering and data sampling, and different types of machine learning models are fused by calculating dynamic weights based on a time sliding window. Thus, in machine learning applications involving time series data, an application model can be obtained conveniently without relying on the accumulated experience and knowledge of a data scientist, and more accurate outputs can be obtained with the model.
The present invention also proposes a non-transitory computer-readable storage medium corresponding to the above-described embodiments.
The non-transitory computer-readable storage medium of the embodiment of the present invention has stored thereon a computer program which, when executed by a processor, can implement the automatic time series regression method according to the above-described embodiment of the present invention.
According to the non-transitory computer-readable storage medium of the embodiment of the invention, when a processor executes the computer program stored thereon, the time series data set is first preprocessed, subjected to automatic time series feature engineering and data sampling, and different types of machine learning models are fused by calculating dynamic weights based on a time sliding window. Thus, in machine learning applications involving time series data, an application model can be obtained conveniently without relying on the accumulated experience and knowledge of a data scientist, and more accurate outputs can be obtained with the model.
The invention also provides a computer program product corresponding to the above embodiment.
The automatic time series regression method according to the above-described embodiments of the present invention may be performed when instructions in the computer program product of the embodiments of the present invention are executed by a processor.
According to the computer program product of the embodiment of the invention, when the processor executes the instructions in the computer program product, the time series data set is first preprocessed, subjected to automatic time series feature engineering and data sampling, and different types of machine learning models are fused by calculating dynamic weights based on a time sliding window. Thus, in machine learning applications involving time series data, an application model can be obtained conveniently without relying on the accumulated experience and knowledge of a data scientist, and more accurate outputs can be obtained with the model.
In the description of the present invention, the terms "first," "second," and the like are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defining "a first" or "a second" may explicitly or implicitly include one or more such feature. The meaning of "a plurality of" is two or more, unless specifically defined otherwise.
In the present invention, unless explicitly specified and limited otherwise, the terms "mounted," "connected," "secured," and the like are to be construed broadly, and may be, for example, fixedly connected, detachably connected, or integrally formed; can be mechanically or electrically connected; can be directly connected or indirectly connected through an intermediate medium, and can be communicated with the inside of two elements or the interaction relationship of the two elements. The specific meaning of the above terms in the present invention can be understood by those of ordinary skill in the art according to the specific circumstances.
In the present invention, unless expressly stated or limited otherwise, a first feature "up" or "down" a second feature may be the first and second features in direct contact, or the first and second features in indirect contact via an intervening medium. Moreover, a first feature being "above," "over" and "on" a second feature may be a first feature being directly above or obliquely above the second feature, or simply indicating that the first feature is level higher than the second feature. The first feature being "under", "below" and "beneath" the second feature may be the first feature being directly under or obliquely below the second feature, or simply indicating that the first feature is less level than the second feature.
In the description of the present specification, a description referring to terms "one embodiment," "some embodiments," "examples," "specific examples," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present invention. In this specification, schematic representations of the above terms are not necessarily for the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, the different embodiments or examples described in this specification and the features of the different embodiments or examples may be combined and combined by those skilled in the art without contradiction.
Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code which include one or more executable instructions for implementing specific logical functions or steps of the process, and further implementations are included within the scope of the preferred embodiment of the present invention in which functions may be executed out of order from that shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those reasonably skilled in the art of the present invention.
Logic and/or steps represented in the flowcharts or otherwise described herein, e.g., an ordered listing of executable instructions for implementing logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, a processor-containing system, or another system that can fetch the instructions from the instruction execution system, apparatus, or device and execute them. For the purposes of this description, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of the computer-readable medium include the following: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CD-ROM). In addition, the computer-readable medium may even be paper or another suitable medium on which the program is printed, as the program may be electronically captured, e.g., via optical scanning of the paper or other medium, then compiled, interpreted, or otherwise processed in a suitable manner if necessary, and then stored in a computer memory.
It is to be understood that portions of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above-described embodiments, the various steps or methods may be implemented in software or firmware stored in a memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, they may be implemented using any one of, or a combination of, the following techniques well known in the art: discrete logic circuits having logic gates for implementing logic functions on data signals, application-specific integrated circuits having suitable combinational logic gates, programmable gate arrays (PGAs), field-programmable gate arrays (FPGAs), and the like.
Those of ordinary skill in the art will appreciate that all or a portion of the steps carried out in the method of the above-described embodiments may be implemented by a program to instruct related hardware, where the program may be stored in a computer readable storage medium, and where the program, when executed, includes one or a combination of the steps of the method embodiments.
In addition, each functional unit in the embodiments of the present invention may be integrated in one processing module, or each unit may exist alone physically, or two or more units may be integrated in one module. The integrated modules may be implemented in hardware or in software functional modules. The integrated modules may also be stored in a computer readable storage medium if implemented in the form of software functional modules and sold or used as a stand-alone product.
The above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, or the like. While embodiments of the present invention have been shown and described above, it will be understood that the above embodiments are illustrative and not to be construed as limiting the invention, and that variations, modifications, alternatives and variations may be made to the above embodiments by one of ordinary skill in the art within the scope of the invention.