Disclosure of Invention
The invention aims to address the shortcomings of the prior art by providing an online early-warning method for big-data anomalies on a medicine cloud platform, based on a statistical generative model. The method adopts a direction-smoothing feature-point filtering method, which removes the large amount of slowly varying spatio-temporal feature data and retains a small number of feature points. To find anomalous early-warning samples, the method provides an online Gaussian mixture statistical generative model that fits the probability distribution of medical data over their full life cycle, computes the occurrence probability of real-time sequence samples, and selects low-probability sequences as early-warning samples.
The purpose of the invention is achieved by the following technical solution: an online early-warning method for big-data anomalies on a medicine cloud platform, based on a statistical generative model, comprising the following steps:
(1) Feature filtering, including affine transformation and direction-smoothing filtering, as follows:
(1.1) The medicine cloud spatio-temporal data consist of a time series of fixed-length feature vectors. Let the feature vector at time $t$ be $D_t = \langle d_{t1}, d_{t2}, \ldots, d_{tp} \rangle$, where $p$ is the vector length; then $D = \langle D_1, D_2, \ldots, D_T \rangle$ forms a sequence segment, where $T$ is the maximum length of the sequence segment.
(1.2) Apply an affine transformation to each feature vector to map it into a p-dimensional finite space; the feature vector at time $t$ after the affine transformation is denoted $D'_t$.
(1.3) Perform feature filtering in the mapped pixel space, as follows:
(1.3.1) Input: the time-series segment $D = \langle D_1, D_2, \ldots, D_T \rangle$ and the affine-transformed time-series segment $D' = \langle D'_1, D'_2, \ldots, D'_T \rangle$;
Output: the filtered time-series segment $DA = \langle Da_{r_1}, Da_{r_2}, \ldots, Da_{r_k} \rangle$, where $r_1, r_2, \ldots, r_k \in \{1, 2, \ldots, T\}$ and $k \le T$;
(1.3.2) Traverse each component $D'_i$ ($i = 1, 2, \ldots, T$) of $D'$ in order;
(1.3.2.1) If $i = 1$ or $i = T$, add $D_i$ to $DA$;
(1.3.2.2) Compute the Euclidean distance between the vectors $D'_{i-1}$ and $D'_i$; if it is greater than the distance threshold minDis, add $D_i$ to $DA$.
(1.4) Direction-smoothing filtering: first find the weighted main direction of the time-series segment, then filter according to that direction, as follows:
(1.4.1) Input: the time-series segment $DA$ produced by the previous filtering step; Output: the direction-smoothed time-series segment $DA'$;
(1.4.2) Add $Da_{r_1}$ to $DA'$;
(1.4.3) Define the variable index with value $r_1$ and the variable lastAngle with value -1;
(1.4.4) Traverse each component $Da_{r_i}$ ($i = 2, \ldots, k-1$) of $DA$ in order;
(1.4.4.1) Compute the distance from $Da_{index}$ to $Da_{r_i}$, denoted $DIS_{r_i}$;
(1.4.4.2) Compute the weighted angle from $Da_{index}$ to $Da_{r_i}$, denoted $Angle_{r_i}$;
(1.4.4.3) If lastAngle is not equal to -1 and the absolute value of the difference between lastAngle and $Angle_{r_i}$ is greater than the angle threshold, add $Da_{r_i}$ to $DA'$ and set index to $r_i$; otherwise, filter out this point;
(1.4.4.4) Set lastAngle to $Angle_{r_i}$;
(1.4.5) Finally, add $Da_{r_k}$ to $DA'$.
(2) Statistical generative model calculation: a probability distribution model of the time-series segments is generated from historical data. The probability distribution of a time-series segment is assumed a priori to be a Gaussian mixture function, defined as follows:
$$p(D) = \sum_{i=1}^{M} k_i \, N(D \mid u_i, \Sigma_i)$$
where $M$ is the number of Gaussian components in the Gaussian mixture function; $k_i$ is the weight of the i-th Gaussian component and satisfies $\sum_{i=1}^{M} k_i = 1$; $N(D \mid u_i, \Sigma_i)$ is the i-th Gaussian function; $u_i$ is the mean of the i-th Gaussian component; and $\Sigma_i$ is the covariance matrix of the i-th Gaussian component. A real-time online learning method is adopted so that the Gaussian mixture model is dynamically corrected as data accumulate. The specific process is as follows:
(2.1) The initial $M$ takes a value in $[1, 5]$. $N$ time-series segments $D^{(1)}, D^{(2)}, \ldots, D^{(N)}$ are selected from historical data, and an initial Gaussian mixture model is generated using the standard EM algorithm.
(2.2) The initial Gaussian mixture model is updated continuously as new time-series segment data arrive. The update process is as follows:
(2.2.1) Wait until $R$ new time-series segments have arrived, denoted $ND^{(1)}, ND^{(2)}, \ldots, ND^{(R)}$;
(2.2.2) Let $j = 1$ and $L = \{\}$, and let $H$ be the current Gaussian mixture model;
(2.2.3) $E^{(j)} = \{E_1, E_2, \ldots, E_M\} = \{N(ND^{(j)} \mid u_i, \Sigma_i) \mid i = 1, 2, \ldots, M\}$, that is, for each newly arrived segment $ND^{(j)}$, compute the value of each Gaussian component;
(2.2.4) Normalize $E^{(j)}$;
(2.2.5) $I = \arg\max(E^{(j)})$, $V = \max(E^{(j)})$;
(2.2.6) If $V > 0.5$, then $L = L \cup \{ND^{(j)}\}$; otherwise, execute step (2.2.8);
(2.2.7) If $|L| \ge N$, perform Gaussian mixture clustering on all data in $L$ using the EM algorithm to obtain a new model $HL$, let $H = H \cup HL$, and let $L = \{\}$;
(2.2.8) Assign $ND^{(j)}$ to the I-th Gaussian component of $H$ and recalculate the mean of the I-th Gaussian component;
(2.2.9) Let $j = j + 1$; if $j > R$, the algorithm ends; otherwise, return to step (2.2.3).
(3) Early-warning judgment. If the size of the set $L$ remains smaller than $N$ after $T$ batches of new data have arrived, early-warning judgment starts and an early warning is issued for low-probability time-series segments.
Further, in step (1.2), each feature vector is affine-transformed to map it into a p-dimensional finite space. Let the maximum length of each dimension be $L_i$, $i \in \{1, 2, \ldots, p\}$, so that the value range of each dimension is $[0, L_i]$. With the affine-transformed feature vector at time $t$ denoted $D'_t$, the affine transformation is defined by the following formula:
where $d'_{ti}$ ($i = 1, 2, \ldots, p$) is the i-th component of $D'_t$.
Further, in step (1.4.4.2), the weighted angle $Angle_{r_i}$ is calculated by the following formula:
where x denotes the dot product of vectors and d denotes the Euclidean distance between two vectors.
Further, in step (2.2.8), the mean of the I-th component is recalculated according to the following formula:
Further, in step (3), the early-warning judgment is made by substituting each new time-series segment into the Gaussian mixture model; if the calculated value is less than 0.1, a low-probability time-series segment has occurred and an early warning is issued for that segment.
The beneficial effects of the invention are as follows:
1. The method filters sequence segment data by a two-step filtering method comprising affine transformation and direction-smoothing filtering. This removes similar points from the sequence segment data, retains a small number of feature points, reduces the volume of data to be analyzed, and provides the data basis for the statistical generative model.
2. An online Gaussian mixture statistical generative model is further adopted. The model fits the probability distribution of the time-series segment data, which makes it possible to estimate the occurrence probability of a time-series segment and to issue early warnings.
Detailed Description
To make the above objects, features, and advantages of the present invention easier to understand, embodiments are described in detail below with reference to the accompanying figures.
Numerous specific details are set forth in the following description to provide a thorough understanding of the present invention. The invention may, however, be practiced in ways other than those specifically described here, and those of ordinary skill in the art can make similar extensions without departing from the spirit of the invention. The invention is therefore not limited to the specific embodiments disclosed below.
The invention provides an online early-warning method for big-data anomalies on a medicine cloud platform, based on a statistical generative model, comprising the following steps:
(1) Feature filtering method
(1.1) The medicine cloud spatio-temporal data consist of a time series of fixed-length feature vectors. Let the feature vector at time $t$ be $D_t = \langle d_{t1}, d_{t2}, \ldots, d_{tp} \rangle$, where $p$ is the vector length; then $D = \langle D_1, D_2, \ldots, D_T \rangle$ forms a sequence segment, where $T$ is the maximum length of the sequence segment.
(1.2) Apply an affine transformation to each feature vector to map it into a p-dimensional finite space. Let the maximum length of each dimension be $L_i$, $i \in \{1, 2, \ldots, p\}$, so that the value range of each dimension is $[0, L_i]$. With the affine-transformed feature vector at time $t$ denoted $D'_t$, the affine transformation is defined by the following formula:
where $d'_{ti}$ ($i = 1, 2, \ldots, p$) is the i-th component of $D'_t$.
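As a rough illustration of step (1.2), the sketch below maps each dimension linearly onto $[0, L_i]$. The transformation formula itself is not reproduced in this text, so the per-dimension min-max rescaling used here is only an assumption (it is one affine map with the stated value range), and the function name `rescale_to_box` is hypothetical.

```python
def rescale_to_box(vectors, lengths):
    """Map each feature dimension linearly onto [0, lengths[i]].

    Assumption: per-dimension min-max rescaling. The source's own affine
    formula is not available, so this is a stand-in with the same range.
    """
    p = len(lengths)
    lows = [min(v[i] for v in vectors) for i in range(p)]
    highs = [max(v[i] for v in vectors) for i in range(p)]
    mapped = []
    for v in vectors:
        row = []
        for i in range(p):
            span = highs[i] - lows[i]
            # A constant dimension carries no information; map it to 0.
            row.append(0.0 if span == 0 else (v[i] - lows[i]) / span * lengths[i])
        mapped.append(row)
    return mapped


# Three 2-dimensional feature vectors mapped into [0, 100] x [0, 50].
mapped = rescale_to_box([[0, 10], [5, 20], [10, 30]], [100, 50])
```

Any other affine map into the same box could be substituted without changing the later filtering steps.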
(1.3) The affine transformation converts the feature vectors into a pseudo-pixel space. Points that lie too close together in this space are strongly similar, so only one of them is retained, which achieves the purpose of feature filtering. The specific process is as follows:
(1.3.1) Input: the time-series segment $D = \langle D_1, D_2, \ldots, D_T \rangle$ and the affine-transformed time-series segment $D' = \langle D'_1, D'_2, \ldots, D'_T \rangle$;
Output: the filtered time-series segment $DA = \langle Da_{r_1}, Da_{r_2}, \ldots, Da_{r_k} \rangle$, where $r_1, r_2, \ldots, r_k \in \{1, 2, \ldots, T\}$ and $k \le T$;
(1.3.2) Traverse each component $D'_i$ ($i = 1, 2, \ldots, T$) of $D'$ in order;
(1.3.2.1) If $i = 1$ or $i = T$, add $D_i$ to $DA$;
(1.3.2.2) Compute the Euclidean distance between the vectors $D'_{i-1}$ and $D'_i$; if it is greater than the distance threshold minDis, add $D_i$ to $DA$. minDis usually takes a value in $[5, 25]$.
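Steps (1.3.1) to (1.3.2.2) can be sketched as follows. This is a minimal illustration rather than a definitive implementation; the function name `distance_filter` and the choice to return each kept vector together with its 1-based index $r$ are assumptions made for readability.

```python
import math


def distance_filter(D, D_mapped, min_dis):
    """Keep D[i] if it is an endpoint, or if its mapped vector lies more
    than min_dis from the mapped vector immediately before it."""
    T = len(D)
    kept = []  # (r, D_r) pairs, with r counted from 1 as in the text
    for i in range(T):
        if i == 0 or i == T - 1:
            # Step (1.3.2.1): the first and last points are always kept.
            kept.append((i + 1, D[i]))
        elif math.dist(D_mapped[i - 1], D_mapped[i]) > min_dis:
            # Step (1.3.2.2): keep points that moved far enough.
            kept.append((i + 1, D[i]))
    return kept


points = [[0, 0], [1, 0], [20, 0], [21, 0], [50, 0]]
kept = distance_filter(points, points, min_dis=5)
```

As in the text, the distance is measured to the immediately preceding point $D'_{i-1}$, not to the last point that was kept.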
(1.4) Direction-smoothing filtering. This filtering method considers the angle between successive feature vectors. Unlike other smoothing methods, direction smoothing first finds the weighted main direction of the time-series segment and then filters according to that direction. The steps are as follows:
(1.4.1) Input: the time-series segment $DA$ produced by the previous filtering step; Output: the direction-smoothed time-series segment $DA'$;
(1.4.2) Add $Da_{r_1}$ to $DA'$;
(1.4.3) Define the variable index with value $r_1$ and the variable lastAngle with value -1;
(1.4.4) Traverse each component $Da_{r_i}$ ($i = 2, \ldots, k-1$) of $DA$ in order;
(1.4.4.1) Compute the distance from $Da_{index}$ to $Da_{r_i}$, denoted $DIS_{r_i}$;
(1.4.4.2) Compute the weighted angle from $Da_{index}$ to $Da_{r_i}$, denoted $Angle_{r_i}$, by the following formula:
where x denotes the dot product of vectors and d denotes the Euclidean distance between two vectors;
(1.4.4.3) If lastAngle is not equal to -1 and the absolute value of the difference between lastAngle and $Angle_{r_i}$ is greater than the angle threshold, add $Da_{r_i}$ to $DA'$ and set index to $r_i$; otherwise, filter out this point;
(1.4.4.4) Set lastAngle to $Angle_{r_i}$;
(1.4.5) Finally, add $Da_{r_k}$ to $DA'$.
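The control flow of steps (1.4.2) to (1.4.5) can be sketched as below. Neither the weighted-angle formula nor the angle threshold is reproduced in this text, so both are assumptions here: `cosine_angle` is a hypothetical stand-in that returns the plain angle between two vectors, and `angle_threshold` is left as a caller-supplied parameter. The distance $DIS_{r_i}$ is assumed to enter only through the weighted-angle formula and is therefore left to the `angle_fn` argument.

```python
import math


def cosine_angle(a, b):
    """Hypothetical stand-in for the weighted angle: the unweighted angle
    (in radians) between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.hypot(*a) * math.hypot(*b)
    if norm == 0:
        return 0.0
    return math.acos(max(-1.0, min(1.0, dot / norm)))


def direction_smooth(DA, angle_threshold, angle_fn=cosine_angle):
    """Keep interior points where the angle changes by more than the threshold."""
    if len(DA) <= 2:
        return list(DA)
    out = [DA[0]]      # step (1.4.2)
    index = 0          # position of the last kept point, step (1.4.3)
    last_angle = None  # None plays the role of the -1 sentinel in the text
    for i in range(1, len(DA) - 1):  # step (1.4.4): interior points only
        angle = angle_fn(DA[index], DA[i])
        if last_angle is not None and abs(last_angle - angle) > angle_threshold:
            out.append(DA[i])  # step (1.4.4.3): direction changed, keep it
            index = i
        last_angle = angle     # step (1.4.4.4)
    out.append(DA[-1])         # step (1.4.5)
    return out


segment = [(1, 0), (1, 0.01), (1, 0.02), (0, 1), (1, 0)]
smoothed = direction_smooth(segment, angle_threshold=0.5)
```

Passing the angle function as an argument keeps the loop unchanged when the actual weighted-angle formula is substituted.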
(2) Statistical generative model calculation method: a probability distribution model of the time-series segments is generated from historical data. The probability distribution of a time-series segment is assumed a priori to be a Gaussian mixture function, defined as follows:
$$p(D) = \sum_{i=1}^{M} k_i \, N(D \mid u_i, \Sigma_i)$$
where $M$ is the number of Gaussian components in the Gaussian mixture function; $k_i$ is the weight of the i-th Gaussian component and satisfies $\sum_{i=1}^{M} k_i = 1$; $N(D \mid u_i, \Sigma_i)$ is the i-th Gaussian function; $u_i$ is the mean of the i-th Gaussian component; and $\Sigma_i$ is the covariance matrix of the i-th Gaussian component. $M$ and all $k_i$, $u_i$, $\Sigma_i$ are unknown and must be learned from historical data. Because system data keep growing and changing in practical applications, a real-time online learning method is designed so that the Gaussian mixture model can be dynamically corrected as data accumulate. The specific process is as follows:
(2.1) The initial $M$ takes a value in $[1, 5]$. $N$ time-series segments $D^{(1)}, D^{(2)}, \ldots, D^{(N)}$ are selected from historical data, and an initial Gaussian mixture model is generated using the standard EM algorithm.
(2.2) The initial Gaussian mixture model is updated continuously as new time-series segment data arrive. The update process is as follows:
(2.2.1) Wait until $R$ new time-series segments have arrived, denoted $ND^{(1)}, ND^{(2)}, \ldots, ND^{(R)}$;
(2.2.2) Let $j = 1$ and $L = \{\}$, and let $H$ be the current Gaussian mixture model;
(2.2.3) $E^{(j)} = \{E_1, E_2, \ldots, E_M\} = \{N(ND^{(j)} \mid u_i, \Sigma_i) \mid i = 1, 2, \ldots, M\}$, that is, for each newly arrived segment $ND^{(j)}$, compute the value of each Gaussian component;
(2.2.4) Normalize $E^{(j)}$:
$$E^{(j)} = \left\{ \frac{E_1 - \min(E^{(j)})}{\max(E^{(j)}) - \min(E^{(j)})}, \ldots, \frac{E_M - \min(E^{(j)})}{\max(E^{(j)}) - \min(E^{(j)})} \right\}$$
where min and max are the functions returning the minimum and maximum values, respectively;
(2.2.5) $I = \arg\max(E^{(j)})$, $V = \max(E^{(j)})$;
(2.2.6) If $V > 0.5$, then $L = L \cup \{ND^{(j)}\}$; otherwise, execute step (2.2.8);
(2.2.7) If $|L| \ge N$, perform Gaussian mixture clustering on all data in $L$ using the EM algorithm to obtain a new model $HL$, let $H = H \cup HL$, and let $L = \{\}$;
(2.2.8) Assign $ND^{(j)}$ to the I-th Gaussian component of $H$ and recalculate the mean of the I-th component according to the following formula:
(2.2.9) Let $j = j + 1$; if $j > R$, the algorithm ends; otherwise, return to step (2.2.3).
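The per-sample arithmetic in steps (2.2.4), (2.2.5) and (2.2.8) can be sketched as follows. The min-max normalization follows the formula in step (2.2.4). The mean-update formula of step (2.2.8) is not reproduced in this text, so `update_mean` assumes a standard running-mean update; the function names and the handling of the degenerate case where all component values are equal are likewise assumptions.

```python
def min_max_normalize(values):
    """Step (2.2.4): rescale component values to [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: when all values coincide the formula is undefined,
        # so the sketch returns zeros instead of dividing by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def best_component(values):
    """Step (2.2.5): index I and value V of the largest normalized entry."""
    normalized = min_max_normalize(values)
    V = max(normalized)
    I = normalized.index(V)
    return I, V


def update_mean(mean, count, x):
    """Step (2.2.8), assumed form: fold sample x into a mean that was
    computed from `count` earlier samples."""
    new_count = count + 1
    new_mean = [m + (xi - m) / new_count for m, xi in zip(mean, x)]
    return new_mean, new_count


I, V = best_component([1.0, 3.0, 2.0])
mean, count = update_mean([2.0, 4.0], 3, [6.0, 8.0])
```

With min-max scaling the largest entry equals 1 whenever the component values differ, so $V$ takes the value 1 in that case.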
(3) Early-warning judgment. If the size of the set $L$ remains smaller than $N$ after $T$ batches of new data have arrived ($T$ usually takes a value from 2R to 10R), the early-warning judgment process can start. Each new time-series segment is substituted into the Gaussian mixture model; if the calculated value is less than 0.1, a low-probability time-series segment has occurred and an early warning is issued for that segment.
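The judgment in step (3) amounts to evaluating the mixture density for each new segment and comparing it with 0.1. The sketch below illustrates this with the standard library only. To stay dependency-free it assumes diagonal covariance matrices, which is a simplification of the general $\Sigma_i$ in the text, and all function names are hypothetical.

```python
import math


def gaussian_pdf_diag(x, mean, var):
    """Density of a Gaussian with diagonal covariance (a simplifying
    assumption; the text allows a full covariance matrix)."""
    log_p = 0.0
    for xi, m, v in zip(x, mean, var):
        log_p += -0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
    return math.exp(log_p)


def mixture_density(x, weights, means, variances):
    """Weighted sum of the component densities."""
    return sum(
        k * gaussian_pdf_diag(x, m, v)
        for k, m, v in zip(weights, means, variances)
    )


def early_warnings(segments, weights, means, variances, threshold=0.1):
    """Return the segments whose mixture density is below the threshold."""
    return [
        s for s in segments
        if mixture_density(s, weights, means, variances) < threshold
    ]


# One standard normal component: a segment near the mean passes,
# a segment three standard deviations away is flagged.
flagged = early_warnings([[0.0], [3.0]], [1.0], [[0.0]], [[1.0]])
```

The threshold of 0.1 is the value given in the text; it is exposed as a parameter only so that the sketch can be tried with other values.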
A specific application example of the present invention is given below. Some acute infectious diseases spread quickly, have long incubation periods, and are easily misdiagnosed. For example, tuberculosis, a Class B infectious disease, is spread by droplets, has an incubation period of 2-3 weeks after infection, and is easily misdiagnosed as a viral cold. These characteristics make such diseases difficult to prevent and treat, and timely early warning is especially necessary when an infectious disease is spreading rapidly on a large scale.
By applying the method, the regional dosages of anti-tuberculosis drugs and antiviral cold drugs, such as ethambutol, quinolone, and loratadine, are monitored online, and a statistical generative model is built to find low-probability anomalous time-series data, which provides an early-warning capability for the potential spread of disease. The steps are as follows:
1. Seven years of data for 34 anti-tuberculosis drugs and antiviral cold drugs in a certain region are selected. For effective monitoring, the dosage is computed per hour, with the hour as the basic unit, and the dosage over 24 hours forms one minimum time-series segment. This gives 34 × 7 × 365 = 86,870 time-series segments and 34 × 7 × 365 × 24 = 2,084,880 data items in total.
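The two counts above follow directly from the stated dimensions, as the short check below shows. The variable names are illustrative only.

```python
drugs = 34
years = 7
days_per_year = 365
hours_per_day = 24

# One minimum time-series segment per drug per day.
segments = drugs * years * days_per_year
# One data item per drug per hour.
data_items = segments * hours_per_day

print(segments, data_items)  # prints: 86870 2084880
```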
2. Because dosage data are affected by external factors such as population and the economy, the data must be normalized to remove the influence of these factors. Specifically, the mean and standard deviation are computed for each whole year, and each data item is normalized by subtracting the mean and dividing by the standard deviation.
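A minimal sketch of this per-year normalization is given below. The text does not say whether the population or the sample standard deviation is used, so the population form (`statistics.pstdev`) is an assumption, as are the function name and the dictionary layout keyed by year.

```python
import statistics


def normalize_by_year(values_by_year):
    """Subtract each year's mean and divide by that year's standard deviation.

    Assumption: population standard deviation; a year with zero spread
    is mapped to zeros to avoid division by zero.
    """
    normalized = {}
    for year, values in values_by_year.items():
        mean = statistics.fmean(values)
        std = statistics.pstdev(values)
        normalized[year] = [
            0.0 if std == 0 else (v - mean) / std for v in values
        ]
    return normalized


# Two illustrative hourly dosage values for one hypothetical year label.
result = normalize_by_year({"year_1": [1.0, 3.0]})
```

Normalizing within each year means that a year-on-year rise in overall consumption does not by itself register as an anomaly.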
3. Time-series segments in units of years (12 minimum time-series segments) are feature-filtered using the feature filtering method of the present invention; Fig. 1 shows the difference before and after filtering. The filtering method preserves the direction-change characteristics of the time-series data and deletes data items that change only gently.
4. The method of the invention is then used to estimate the probability distribution of the time-series segments, with the minimum time-series segment as the basic unit of estimation. The probability distribution data are shown in Fig. 2.
All time-series segments with probability density values below 0.1 are selected; there are two in this example. In the first, the dosage of quinolone in the region in month 11 showed an unusual pattern of marked rises and falls that went beyond the dosage of quinolone in previous years (marked (1) in Fig. 2); the average probability density value of this time-series segment was 0.061. In the second, the dosage of cycloserine in the same month showed a sudden increase relative to previous years (marked (2) in Fig. 2); the average probability density value of this time-series segment was 0.0396. The anomalies of the two drugs can be displayed visually according to the difference in probability density values, and early warnings are issued automatically to the relevant industry managers, helping them extract more valuable data from a large amount of drug information.
The foregoing is only a preferred embodiment of the present invention. Although the invention has been disclosed through the preferred embodiment, the embodiment is not intended to limit it. Using the methods and technical content disclosed above, those skilled in the art can make many possible variations and modifications to the technical solution of the invention, or modify it into equivalent embodiments, without departing from its scope. Therefore, any simple modification, equivalent change, or refinement made to the above embodiments according to the technical essence of the invention, without departing from the content of its technical solution, still falls within the scope of protection of the technical solution of the invention.