Disclosure of Invention
In order to overcome the above-mentioned drawbacks of the prior art, embodiments of the present invention provide a client interaction data retrieval method based on dynamic data indexing to solve the above-mentioned problems set forth in the background art.
In order to achieve the above purpose, the present invention provides the following technical solutions:
A client interaction data retrieval method based on a dynamic data index comprises the following steps:
S1, performing segment aggregation analysis on historical customer interaction data over a long-term range, and extracting stable feature values from the long-term data as stable feature baselines;
S2, applying an adaptive multi-scale burst feature detection algorithm to recent customer interaction data, and identifying irregular short-term burst features by adjusting the time window and the data scale;
S3, comparing the short-term burst features with the stable feature baselines, and removing feature items that do not meet the requirements according to a preset quality threshold to form a composite feature weight distribution;
S4, performing dynamic stability monitoring on feature groups with a low degree of association in the index, evaluating, based on weak co-occurrence probability analysis between features, the potential risk that weakly correlated features cause burst behavior in subsequent data, and recording potential causes;
S5, updating the dynamic data index structure in real time based on the composite feature weight distribution and the potential causes, and retrieving newly input customer interaction data based on the dynamic data index structure updated in real time.
In a preferred embodiment, performing segment aggregation analysis on historical customer interaction data over a long-term range and extracting stable feature values from the long-term data as stable feature baselines specifically includes:
S101, acquiring historical customer interaction data over a long-term range, dividing the historical customer interaction data into a plurality of continuous time periods according to a preset time interval, and marking the start and end time of each time period;
S102, extracting data features from the historical customer interaction data in each time period according to preset feature indexes, wherein the data features comprise the number and duration of customer interaction behaviors;
S103, performing statistical processing on the data features in all time periods, analyzing the distribution of the data features in each time period, and calculating the time variation range of each data feature;
S104, screening out data features with a smaller time variation range according to the result of the statistical analysis, and marking them as stable feature candidate values;
S105, performing cluster analysis on the stable feature candidate values, extracting the center value of the clustering result as the stable feature baseline, and recording it as the stable feature value of the long-term data.
In a preferred embodiment, applying an adaptive multi-scale burst feature detection algorithm to recent customer interaction data and identifying irregular short-term burst features by adjusting the time window and the data scale specifically includes:
S201, acquiring recent customer interaction data, and dividing the recent customer interaction data into a plurality of continuous time windows according to a preset time range;
S202, calculating recent feature values for the recent customer interaction data of each time window according to a multi-scale feature extraction rule, wherein the recent feature values comprise interaction frequency and behavior persistence;
S203, dynamically adjusting the time window length and the data scale parameters, and calculating the multi-scale feature values under each time window to form a multi-scale feature matrix;
S204, analyzing the change pattern of the multi-scale feature matrix, and screening out and marking features with a larger mutation amplitude as short-term burst features.
In a preferred embodiment, comparing the short-term burst features with the stable feature baselines and removing feature items that do not meet the requirements according to a preset quality threshold to form a composite feature weight distribution specifically includes:
S301, acquiring a short-term burst feature set and a stable feature baseline set, and applying uniform formatting;
S302, pairing short-term burst features with stable feature baselines item by item according to feature type and time window to form a set of comparable feature pairs;
S303, calculating, for each feature pair in the feature pair set, the feature difference degree between the short-term burst feature and the stable feature baseline;
S304, comparing the feature difference degree with a preset quality threshold, eliminating feature items whose difference degree exceeds the threshold range, and retaining only the feature items that meet the requirements;
S305, calculating weights for the retained feature items according to the feature difference degree, and generating the composite feature weight distribution.
In a preferred embodiment, performing dynamic stability monitoring on feature groups with a low degree of association in the index, evaluating, based on weak co-occurrence probability analysis between features, the potential risk that weakly correlated features cause burst behavior in subsequent data, and recording potential causes specifically includes:
S401, extracting feature pairs with lower weights from the composite feature weight distribution to form a low-association feature group set;
S402, calculating the co-occurrence probability between features based on the feature items in the low-association feature group set to obtain a co-occurrence matrix over each pair of features;
S403, analyzing potential associations between low-association features according to the weak co-occurrence probabilities in the co-occurrence matrix, and identifying potential causes;
S404, dynamically monitoring the potential causes in the low-association feature group, tracking the trend of weak changes between features, and evaluating the potential risk that weakly correlated features cause burst behavior in subsequent data.
In a preferred embodiment, calculating the co-occurrence probability between features based on the feature items in the low-association feature group set to obtain a co-occurrence matrix over each pair of features specifically includes:
counting the number of times the two features of a feature pair occur together across all historical time windows, and counting the total number of occurrences of each individual feature in the feature pair;
calculating the co-occurrence probability from the co-occurrence count and the total occurrence counts, according to the formula:
$$P_{q,co} = \frac{C_q}{\min(N_1, N_2)}$$
wherein $P_{q,co}$ represents the co-occurrence probability of the q-th feature pair, $N_1$ represents the total number of occurrences of the first feature in the feature pair, $N_2$ represents the total number of occurrences of the second feature in the feature pair, $\min(N_1, N_2)$ represents the smaller of the two total occurrence counts, and $C_q$ represents the number of times the two features of the feature pair occur together in the history;
the calculation result is expressed as a co-occurrence matrix $M = \{ P_{ab,co} \mid a, b \in G \}$, wherein $M$ represents the co-occurrence matrix of the low-association feature group, $P_{ab,co}$ represents the co-occurrence probability of the feature pair $(a, b)$, $a$ and $b$ represent any two features in the low-association feature group, and $G$ represents the low-association feature group set.
In a preferred embodiment, the potential correlation between the low-correlation features is analyzed according to the weak co-occurrence probability in the co-occurrence matrix, and potential causes are identified, specifically:
setting a co-occurrence probability threshold, extracting feature pairs with the co-occurrence probability smaller than the co-occurrence probability threshold and marking the feature pairs as low co-occurrence probability feature pairs;
for each group of low co-occurrence probability feature pairs, judging whether potential causes exist or not by combining the feature value fluctuation in the time window;
The specific potential cause identification rule is that if the co-occurrence probability of the feature pair is smaller than the co-occurrence probability threshold value and the fluctuation modes of the two features in the feature pair show related trends, the potential cause is marked.
In a preferred embodiment, updating the dynamic data index structure in real time based on the composite feature weight distribution and the potential causes, and retrieving newly input customer interaction data based on the dynamic data index structure updated in real time, specifically includes:
S501, extracting the weight of each feature pair and the potential risk probability of burst behavior from the composite feature weight distribution to generate an update data set;
S502, adjusting the priority of feature nodes according to the weights of the feature pairs in the update data set, and redefining the association strength between feature pairs in combination with the potential risk probability of burst behavior, so as to optimize the index structure;
S503, removing low-weight feature nodes from the dynamic data index structure and sorting high-weight feature nodes to improve retrieval efficiency and the compactness of the structure;
S504, retrieving the newly input customer interaction data using the dynamic data index structure updated in real time, matching the high-weight feature pairs associated with the input data, and returning the retrieval result.
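Steps S501 to S504 can be illustrated with a minimal sketch. The text does not state how the weight and the risk probability are combined or how the index is stored, so the scoring rule `weight * (1 + risk)`, the list-based index, and the names `update_index`, `retrieve` and `prune_below` are all assumptions made for illustration only.

```python
def update_index(weights, risks, prune_below):
    """Rebuild a priority-ordered index of feature pairs.

    weights: dict mapping a feature pair (tuple of names) to its weight.
    risks:   dict mapping a feature pair to a burst-risk probability in [0, 1].
    The combined score weight * (1 + risk) is an assumed rule, not from the text.
    """
    scored = {pair: w * (1.0 + risks.get(pair, 0.0)) for pair, w in weights.items()}
    # S503: drop low-scoring nodes, then sort the remainder by descending score.
    kept = [(pair, score) for pair, score in scored.items() if score >= prune_below]
    return sorted(kept, key=lambda item: item[1], reverse=True)


def retrieve(index, input_features):
    """S504: return indexed feature pairs fully contained in the input features."""
    present = set(input_features)
    return [pair for pair, _ in index if set(pair) <= present]
```

Because the index is kept sorted, the pairs returned by `retrieve` are already ordered from highest to lowest priority.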
The client interaction data retrieval method based on a dynamic data index has the following technical effects and advantages:
1. Stable feature baselines of the long-term data are extracted through segment aggregation analysis, and short-term burst features are identified by the adaptive multi-scale burst feature detection algorithm. Features on different time scales in the customer interaction data are thereby separated and processed together: short-term burst features and long-term stable features can each be identified, compared and fused. This addresses the inability of conventional dynamic index strategies to accommodate multiple feature patterns and provides a feature basis for the subsequent optimization of the index structure.
2. Dynamic stability monitoring of the low-association feature group, combined with weak co-occurrence probability analysis between features, allows the burst behavior risk that weakly correlated features may cause to be estimated in advance. Real-time updating of, and retrieval from, the dynamic data index structure then ensures that burst information and potential causes in dynamic data are captured and processed promptly. The index optimization strategy based on feature weights and risk probabilities improves retrieval accuracy and timeliness, overcomes the shortcomings of conventional methods in processing complex customer interaction data, and suits scenarios with high-frequency changes and complex features.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Fig. 1 shows a client interaction data retrieval method based on dynamic data index, which comprises the following steps:
S1, performing segment aggregation analysis on historical customer interaction data over a long-term range, and extracting stable feature values from the long-term data as stable feature baselines.
S2, applying an adaptive multi-scale burst feature detection algorithm to recent customer interaction data, and identifying irregular short-term burst features by adjusting the time window and the data scale.
S3, comparing the short-term burst features with the stable feature baselines, and removing feature items that do not meet the requirements according to a preset quality threshold to form a composite feature weight distribution.
S4, performing dynamic stability monitoring on feature groups with a low degree of association in the index, evaluating, based on weak co-occurrence probability analysis between features, the potential risk that weakly correlated features cause burst behavior in subsequent data, and recording potential causes.
S5, updating the dynamic data index structure in real time based on the composite feature weight distribution and the potential causes, and retrieving newly input customer interaction data based on the dynamic data index structure updated in real time.
Performing a segment aggregation analysis on historical customer interaction data in a long-term range, and extracting stable characteristic values in the long-term data as stable characteristic baselines, wherein the method specifically comprises the following steps of:
S101, acquiring historical customer interaction data over a long-term range, dividing the historical customer interaction data into a plurality of continuous time periods according to a preset time interval, and marking the start and end time of each time period:
For example, if the time interval is one day, the daily data corresponds to one time period, and if the time interval is one week, the weekly data corresponds to one time period. The time period division rules should ensure that the time periods are continuous and non-overlapping.
S102, extracting data features from historical customer interaction data in each time period according to preset feature indexes, wherein the data features comprise the times and duration of customer interaction behaviors:
After the time periods are divided, features are extracted from the historical customer interaction data in each time period. Feature extraction is performed according to preset indexes and specifically includes:
Counting the customer interaction behaviors, namely calculating the total number of customer interaction behaviors in a time period and taking it as a basic index of customer activity.
Counting the duration of the customer interaction behaviors, namely calculating the total duration of all customer interaction behaviors in a time period and taking it as an index of the depth of customer engagement.
The feature values of each time period are calculated one by one by analyzing the specific content of each piece of interaction data.
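As a minimal sketch of S101 and S102, assume each interaction record is a hypothetical `(start, end)` pair of timestamps in seconds and that an interaction belongs to the time period containing its start time. Neither detail is fixed by the text, and the name `extract_period_features` is illustrative only.

```python
from collections import defaultdict


def extract_period_features(interactions, period_seconds):
    """Split interactions into contiguous, non-overlapping periods and compute
    the interaction count and total duration for each period."""
    features = defaultdict(lambda: {"count": 0, "total_duration": 0.0})
    for start, end in interactions:
        period_index = int(start // period_seconds)  # period containing the start time
        features[period_index]["count"] += 1
        features[period_index]["total_duration"] += end - start
    # Attach the start and end time of each period, as S101 requires.
    return {
        idx: {
            "period_start": idx * period_seconds,
            "period_end": (idx + 1) * period_seconds,
            **values,
        }
        for idx, values in sorted(features.items())
    }
```

With `period_seconds=86400` each period corresponds to one day, matching the daily interval mentioned under S101.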
S103, carrying out statistical processing on the data features in all time periods, analyzing the distribution condition of the data features in each time period, and calculating the time variation range of each data feature:
Statistical processing is performed on the feature values extracted in each time period, the distribution of the feature values across all time periods is analyzed, and the time variation range of each feature is calculated. The analysis specifically includes: counting the distribution of the feature values in each time period and determining their central tendency and dispersion; calculating the fluctuation range of each feature value across time periods to evaluate the stability of the feature in the time dimension; and evaluating the feature values of all time periods as a whole and recording the variation amplitude of each feature.
S104, screening out data features with smaller time variation range according to the time variation range result of the statistical analysis, and marking the data features as stable feature candidate values:
The data features are screened according to the result of the statistical analysis. The screening rule is set with reference to the variation range of the feature values: for example, only feature values with a smaller time variation range are selected and marked as stable feature candidate values, and feature values with a larger time variation range are excluded to ensure the stability of the candidate features.
The specific screening condition for the time variation range should be set according to service requirements and data characteristics, for example by setting a predefined fluctuation threshold.
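The screening in S103 and S104 might look like the following sketch. The text does not define how the time variation range is measured; here it is assumed to be the spread (maximum minus minimum) relative to the mean across periods, and `fluctuation_threshold` is a hypothetical parameter standing in for the predefined fluctuation threshold.

```python
def select_stable_candidates(feature_series, fluctuation_threshold):
    """Keep features whose relative variation across periods is small.

    feature_series: dict mapping a feature name to its per-period values.
    The relative range (max - min) / mean is an assumed stability measure.
    """
    candidates = {}
    for name, values in feature_series.items():
        mean = sum(values) / len(values)
        value_range = max(values) - min(values)
        # A zero mean makes the ratio undefined; treat such features as unstable.
        relative_range = value_range / mean if mean else float("inf")
        if relative_range <= fluctuation_threshold:
            candidates[name] = values
    return candidates
```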
S105, performing cluster analysis on the stable characteristic candidate values, extracting a central value of a cluster result as a stable characteristic baseline, and recording the central value as a stable characteristic value of long-term data:
The cluster analysis process comprises the following steps:
Dividing the stable feature candidate values into a plurality of categories, and calculating a cluster center according to the distribution of the feature values in each category.
The cluster center is the stable feature baseline and represents the stable feature value of the data over the long-term range.
The cluster analysis may use any algorithm suited to the service requirements, such as a mean-based clustering method driven by the feature value distribution, or another computationally efficient clustering tool. Finally, the stable feature baseline is recorded as the stable feature value of the long-term data.
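The text leaves the clustering algorithm open. As one possible mean-based choice, the sketch below uses a small one-dimensional k-means with deterministic initialization; the function name `kmeans_1d` and the parameters `k` and `iterations` are assumptions for illustration, and the returned centers play the role of the stable feature baselines.

```python
def kmeans_1d(values, k, iterations=20):
    """Cluster scalar candidate values and return the sorted cluster centers."""
    ordered = sorted(values)
    # Deterministic start: k evenly spaced picks from the sorted data.
    centers = [ordered[(len(ordered) - 1) * i // max(k - 1, 1)] for i in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for value in values:
            nearest = min(range(k), key=lambda c: abs(value - centers[c]))
            clusters[nearest].append(value)
        # Recompute each center as the cluster mean; keep the old center if empty.
        centers = [
            sum(cluster) / len(cluster) if cluster else centers[i]
            for i, cluster in enumerate(clusters)
        ]
    return sorted(centers)
```

In practice the number of clusters would need to be chosen from the data or the service requirements, which the text does not specify.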
Applying a self-adaptive multi-scale burst feature detection algorithm to the recent customer interaction data, and identifying irregular short-term burst features by adjusting a time window and a data scale, wherein the method specifically comprises the following steps of:
S201, acquiring recent customer interaction data, and dividing the recent customer interaction data into a plurality of continuous time windows according to a preset time range:
Each time window has a definite start time and end time, and the length of the time window is determined by the actual analysis requirements.
S202, calculating recent feature values according to a multi-scale feature extraction rule aiming at the recent client interaction data of each time window, wherein the recent feature values comprise interaction frequency and action persistence:
The multi-scale feature extraction rules are as follows:
First, different time scale parameters are selected, such as a small time period (e.g., 10 minutes) and a large time period (e.g., 1 hour), and features are extracted at each scale.
Then, the average feature value and the variation feature value are calculated at each time scale to describe the distribution of customer behavior at short-term and long-term scales.
Finally, the data of the different scales are combined to form a comprehensive multi-scale feature representation.
The interaction frequency is expressed as:
$$F_{i,k} = \frac{1}{S_k} \sum_{j=1}^{n_{i,k}} 1$$
The behavior persistence is expressed as:
$$D_{i,k} = \frac{1}{n_{i,k}} \sum_{j=1}^{n_{i,k}} (e_j - s_j)$$
wherein $F_{i,k}$ represents the interaction frequency of the i-th time window under the k-th scale parameter, $D_{i,k}$ represents the behavior persistence of the i-th time window under the k-th scale parameter, $n_{i,k}$ represents the total number of interaction behaviors of the i-th time window under the k-th scale parameter, $S_k$ represents the k-th scale parameter, $j$ is the index of an interaction behavior within a time window, $e_j$ represents the end time of the j-th interaction behavior, and $s_j$ represents the start time of the j-th interaction behavior.
The 1 in the interaction frequency expression is the count contributed by each interaction behavior and is used to tally the interaction frequency.
The integrated multi-scale feature representation includes two dimensions of interaction frequency and behavior persistence. In each time window, feature sets are generated through multi-scale feature extraction and are respectively expressed as an interaction frequency feature set and a behavior persistence feature set, wherein the interaction frequency feature set comprises interaction frequency values under different scales, the behavior persistence feature set comprises behavior duration time values under different scales, and the two sets together form comprehensive feature representation of each time window and are used for describing complete feature distribution of customer interaction behaviors under multiple scales.
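The two feature sets can be sketched as follows, under several assumptions not fixed by the text: interaction frequency is taken as the interaction count divided by the scale parameter, behavior persistence as the mean duration of the interactions, and the interactions considered at scale `scale` are those that start within `[window_start, window_start + scale)`. The function names are hypothetical.

```python
def features_at_scale(interactions, window_start, scale):
    """Return (frequency, persistence) for one window at one scale.

    interactions: list of (start, end) timestamps.
    Selecting interactions by start time within the scale span is an assumption.
    """
    selected = [(s, e) for s, e in interactions if window_start <= s < window_start + scale]
    count = len(selected)
    frequency = count / scale  # each interaction contributes a count of 1
    persistence = sum(e - s for s, e in selected) / count if count else 0.0
    return frequency, persistence


def multi_scale_features(interactions, window_start, scales):
    """Build the frequency set and the persistence set across all scales."""
    pairs = [features_at_scale(interactions, window_start, scale) for scale in scales]
    return [f for f, _ in pairs], [d for _, d in pairs]
```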
S203, dynamically adjusting the length of the time window and the data scale parameters, and respectively calculating the multi-scale characteristic values under each time window to form a multi-scale characteristic matrix:
Adjusting the time window length: the time window length is controlled by a parameter sequence, and data features of different time granularities are generated by setting different time intervals.
Adjusting the data scale parameter: the columns of the feature value matrix at different scales are generated by changing the data scale parameter.
Calculating the feature value of the i-th time window under the k-th scale parameter: the expression is $M_{ik} = f(W_i, S_k)$, wherein $M_{ik}$ represents the feature value of the i-th time window under the k-th scale parameter, $W_i$ represents the i-th time window, $S_k$ represents the k-th scale parameter, and $f(W_i, S_k)$ represents the feature extraction function, which is jointly determined by the time window and the data scale.
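A minimal sketch of assembling the multi-scale feature matrix is shown below. The extraction function is passed in as a parameter because the text leaves its form open; `build_feature_matrix` and `extract` are illustrative names.

```python
def build_feature_matrix(windows, scales, extract):
    """Return M with M[i][k] = extract(windows[i], scales[k]).

    Rows correspond to time windows and columns to scale parameters.
    """
    return [[extract(window, scale) for scale in scales] for window in windows]
```

For example, passing `lambda window, scale: len(window) / scale` as `extract` would fill the matrix with count-per-scale values.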
S204, analyzing the change mode of the multi-scale feature matrix, and screening and marking the features with larger mutation amplitude as short-term burst features:
Feature values whose change rate exceeds the corresponding preset threshold are screened out, and the feature values meeting this condition are marked as short-term burst features.
The preset change-rate threshold is a fixed value set according to specific service requirements and data characteristics and is used to judge whether the change rate of a feature value is abnormal. Each threshold corresponds to a specific time window and scale parameter and represents the minimum amplitude at which a feature change is considered significant under that condition.
Recording the screened short-term burst characteristics, and marking the corresponding time window and related behavior data. For example, the time window number, rate of change, and associated behavioral characteristics of the burst characteristic are recorded for subsequent use.
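The screening in S204 could be sketched as below. The text does not define the change rate; this sketch assumes it is the relative change of a feature value against the previous time window at the same scale, and it simplifies the thresholds to one value per scale. The names `detect_bursts` and `thresholds` are hypothetical.

```python
def detect_bursts(matrix, thresholds):
    """Flag entries of the multi-scale feature matrix whose change rate is large.

    matrix[i][k]:  feature value of window i at scale k.
    thresholds[k]: minimum change rate treated as significant at scale k.
    """
    bursts = []
    for i in range(1, len(matrix)):
        for k, threshold in enumerate(thresholds):
            previous, current = matrix[i - 1][k], matrix[i][k]
            if previous == 0:
                continue  # relative change is undefined; skipped in this sketch
            rate = abs(current - previous) / abs(previous)
            if rate > threshold:
                # Record the window number, scale and change rate for later steps.
                bursts.append({"window": i, "scale": k, "change_rate": rate})
    return bursts
```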
Comparing the short-term burst characteristic with a stable characteristic baseline, and removing characteristic items which do not meet the requirements according to a preset quality threshold to form composite characteristic weight distribution, wherein the method specifically comprises the following steps of:
S301, acquiring a short-term burst feature set and a stable feature baseline set, and carrying out unified formatting treatment to ensure that feature data can be directly compared:
A short-term burst feature set and a stable feature baseline set are respectively acquired, wherein the feature set comprises different types of data features (such as interaction frequency, action persistence and the like). To ensure comparability between feature sets, the feature sets need to be formatted. The formatting process comprises the following steps:
Normalization: all feature values are normalized to the same numerical range, for example [0,1].
Classification: the feature sets are sorted by feature type to ensure that features of the same type can be compared directly.
Time alignment: the feature sets are aligned by time window to ensure that the short-term burst features and the stable feature baselines correspond at the same time points.
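The normalization step can be illustrated with min-max scaling, which is one common way to map values into [0,1]; the text does not prescribe a particular normalization, so this choice and the name `min_max_normalize` are assumptions.

```python
def min_max_normalize(values):
    """Scale a list of feature values linearly into the range [0, 1]."""
    low, high = min(values), max(values)
    if high == low:
        # All values are identical; map them to 0.0 to avoid division by zero.
        return [0.0 for _ in values]
    return [(value - low) / (high - low) for value in values]
```

For a fair comparison, burst features and baselines of the same type would need to be scaled with the same minimum and maximum.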
S302, pairing short-term burst features with stationary feature baselines item by item according to feature types and time windows to form a comparable feature pair set:
A short-term burst feature is paired with a stable feature baseline if the two are of the same feature type and belong to the same time window; if several baselines satisfy these conditions, the baseline closest in time is selected for pairing.
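A possible sketch of the pairing rule follows. Features are represented as hypothetical dictionaries with `type`, `window` and `value` keys. Choosing the baseline with the smallest window distance selects a same-window baseline whenever one exists; falling back to the nearest baseline of the same type when none shares the window is an assumption beyond the text.

```python
def pair_features(bursts, baselines):
    """Pair each burst feature with the closest baseline of the same type."""
    pairs = []
    for burst in bursts:
        same_type = [b for b in baselines if b["type"] == burst["type"]]
        if not same_type:
            continue  # no comparable baseline for this feature type
        nearest = min(same_type, key=lambda b: abs(b["window"] - burst["window"]))
        pairs.append((burst, nearest))
    return pairs
```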
S303, calculating the feature difference degree between the short-term burst feature and the stable feature baseline for each group of feature pairs in the feature pair set:
The difference between the short-term burst feature and the stable feature baseline in the feature pair is calculated, and its absolute value is taken as the feature difference degree.
The calculation of the feature difference degree needs to be carried out on feature pairs in all feature pair sets one by one, and a complete difference degree set is generated.
S304, comparing the feature difference degree with a preset quality threshold, eliminating feature items with the difference degree exceeding the threshold range, and only retaining feature items meeting the requirements:
If the feature difference is smaller than or equal to a preset quality threshold, the feature pair is reserved, and if the feature difference is larger than the preset quality threshold, the feature pair is removed.
The preset quality threshold is a fixed value set according to specific service requirements and feature attributes and is used to judge whether the difference between a short-term burst feature and its stable feature baseline is acceptable. Each feature type may have its own quality threshold, which defines the acceptable range of the feature difference degree and keeps the screening result reasonable and accurate.
S305, calculating weights of the reserved characteristic items according to the characteristic difference degree, and generating composite characteristic weight distribution:
The weight of a feature pair is the reciprocal of the feature difference degree of that feature pair.
The weights of all retained feature pairs are combined to form the composite feature weight distribution.
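Steps S303 to S305 can be sketched together as follows, assuming normalized feature values. The reciprocal is undefined when the difference is exactly zero, a case the text does not address, so the sketch clamps the difference with a small hypothetical `epsilon`; the function and parameter names are illustrative.

```python
def composite_weights(pairs, quality_threshold, epsilon=1e-9):
    """Compute weights for the feature pairs that pass the quality threshold.

    pairs: dict mapping a pair id to (burst_value, baseline_value).
    Returns a dict mapping each retained pair id to 1 / difference.
    """
    weights = {}
    for pair_id, (burst, baseline) in pairs.items():
        difference = abs(burst - baseline)      # S303: feature difference degree
        if difference > quality_threshold:      # S304: reject pairs over the threshold
            continue
        # S305: weight is the reciprocal of the difference (clamped, by assumption).
        weights[pair_id] = 1.0 / max(difference, epsilon)
    return weights
```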
Performing dynamic stability monitoring on feature groups with a low degree of association in the index, evaluating, based on weak co-occurrence probability analysis between features, the potential risk that weakly correlated features cause burst behavior in subsequent data, and recording potential causes, specifically comprises the following steps:
S401, extracting feature pairs with lower weight from composite feature weight distribution to form a low-association feature group set:
In the composite feature weight distribution, each feature pair has a corresponding weight value. The weight value is used for reflecting the association strength between the feature pairs, and the lower weight value represents weaker association between the two features. Firstly, feature pairs with weight values lower than a preset threshold value are extracted from composite feature weight distribution to form a low-association feature group set.
The extraction rules are as follows:
A weight threshold is set, and the set of feature pairs satisfying the condition $K_q < T_K$ is extracted.
The extraction result is expressed as $G = \{ P_q \mid K_q < T_K \}$, wherein $G$ represents the low-association feature group set, $P_q$ represents the q-th feature pair, $K_q$ represents the weight of the q-th feature pair, and $T_K$ represents the weight threshold.
These rules ensure that only feature pairs with lower weights enter the subsequent analysis.
$T_K$ is a fixed value set according to the specific service scenario and the feature association strength and is used to distinguish low-association features from high-association features. When the weight of a feature pair is below this threshold, its degree of association is considered low, and the feature pair is included in the low-association feature group for further analysis and monitoring.
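The extraction rule maps directly onto a set comprehension. In the sketch below the function name and the dictionary representation of the composite feature weight distribution are assumptions.

```python
def low_association_group(weights, weight_threshold):
    """Return the set G of feature pairs whose weight K_q is below T_K.

    weights: dict mapping each feature pair to its weight.
    """
    return {pair for pair, weight in weights.items() if weight < weight_threshold}
```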
S402, calculating the co-occurrence probability among the features based on the feature items in the low-association degree feature group set, and obtaining a co-occurrence matrix among each pair of features:
In the low-association feature group set, statistics is carried out on the historical data of each group of feature pairs, the co-occurrence probability of the feature pairs in the historical data is calculated, and the co-occurrence probability is used for quantifying the frequency of the simultaneous occurrence of two features, and the specific process is as follows:
Co-occurrence count statistics: the number of times the two features of a feature pair occur together across all historical time windows is counted.
Total occurrence count statistics: the total number of occurrences of each individual feature in the feature pair is counted.
Co-occurrence probability calculation: the co-occurrence probability is calculated from the co-occurrence count and the total occurrence counts, according to the formula:
$$P_{q,co} = \frac{C_q}{\min(N_1, N_2)}$$
wherein $P_{q,co}$ represents the co-occurrence probability of the q-th feature pair and quantifies how often the two features of the pair occur together, $N_1$ represents the total number of occurrences of the first feature in the feature pair, $N_2$ represents the total number of occurrences of the second feature in the feature pair, $\min(N_1, N_2)$ represents the smaller of the two total occurrence counts and normalizes the co-occurrence probability, and $C_q$ represents the number of times the two features of the feature pair occur together in the history.
The calculation result is expressed as a co-occurrence matrix $M = \{ P_{ab,co} \mid a, b \in G \}$, wherein $M$ represents the co-occurrence matrix of the low-association feature group, $P_{ab,co}$ represents the co-occurrence probability of the feature pair $(a, b)$ and quantifies how often feature $a$ and feature $b$ occur together in the historical data, and $a$ and $b$ represent any two features in the low-association feature group.
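A minimal sketch of the co-occurrence calculation follows, assuming the probability is the co-occurrence count divided by the smaller of the two total occurrence counts, and assuming that a feature "occurs" when it is present in a historical time window. Windows are modeled as sets of feature names; all names here are hypothetical.

```python
from itertools import combinations


def cooccurrence_matrix(windows, features):
    """Return a dict mapping each feature pair (a, b) to its co-occurrence probability.

    windows:  list of sets, each holding the features observed in one time window.
    features: list of feature names in the low-association feature group.
    """
    totals = {f: sum(1 for w in windows if f in w) for f in features}
    matrix = {}
    for a, b in combinations(features, 2):
        together = sum(1 for w in windows if a in w and b in w)
        denominator = min(totals[a], totals[b])
        # A feature that never occurs yields a probability of 0.0 by convention.
        matrix[(a, b)] = together / denominator if denominator else 0.0
    return matrix
```

Normalizing by the smaller count means a rare feature that always appears alongside a common one still reaches a probability of 1.0.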
S403, analyzing potential association between low-association-degree features according to weak co-occurrence probability in the co-occurrence matrix, and identifying potential causes:
Each co-occurrence probability in the co-occurrence matrix is analyzed, together with the historical distribution of the feature pairs, to identify potential associations between low-association features. The potential association analysis comprises the following steps:
A co-occurrence probability threshold is set; feature pairs whose co-occurrence probability is smaller than the threshold are extracted and marked as low co-occurrence probability feature pairs.
For each low co-occurrence probability feature pair, whether a potential cause exists is judged from the fluctuation of the feature values within the time window.
The specific identification rule is as follows: if the co-occurrence probability of a feature pair is smaller than the co-occurrence probability threshold and the fluctuation patterns of its two features show a correlated trend, the pair is marked as a potential cause.
The co-occurrence probability threshold is the criterion for screening feature pairs. It is set according to the historical co-occurrence characteristics of the feature pairs and distinguishes significant co-occurrence from weak co-occurrence: when the co-occurrence probability of a feature pair is below the threshold, its co-occurrence association is considered weak and further analysis is needed.
The correlation trend of the feature values is calculated using the following formula: $D_q = \frac{\sum_t (X_t - \bar{X})(Y_t - \bar{Y})}{\sqrt{\sum_t (X_t - \bar{X})^2 \, \sum_t (Y_t - \bar{Y})^2}}$, where $D_q$ represents the correlation trend coefficient of the q-th feature pair and quantifies the degree of linear correlation between the two feature values; $X_t$ and $Y_t$ respectively represent the feature values of the two features in the feature pair at time t; and $\bar{X}$ and $\bar{Y}$ respectively represent the averages of the two feature values.
Here, the feature value is a quantity that characterizes the feature at time t, typically the interaction frequency, the duration of an action, and the like.
Whether the fluctuation patterns of the two features in a feature pair show a correlated trend is judged from the value of the correlation trend coefficient. When the coefficient is greater than 0, the fluctuations of the two features are positively correlated, that is, the feature values rise or fall together over time. When the coefficient is less than 0, the fluctuations are negatively correlated, that is, one feature value rises while the other falls. When the coefficient is equal to 0, the two features are uncorrelated and their fluctuation patterns do not affect each other.
Positive and negative correlation thresholds are set, which are important indicators for judging the correlation trend of the feature pair. The positive correlation threshold represents the lowest coefficient value for which the positive correlation of the two features is significant, and the negative correlation threshold represents the highest coefficient value for which the negative correlation of the two features is significant. Both together are used to screen feature pairs with significant correlation trends.
If the correlation trend coefficient is greater than or equal to the positive correlation threshold, or less than or equal to the negative correlation threshold, the fluctuation patterns of the two features in the feature pair are considered to show a correlated trend. This determination can be used for further analysis of potential feature associations or for risk assessment.
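The potential-cause rule of S403 can be sketched as below. The sketch assumes that the correlation trend coefficient is the Pearson correlation coefficient of the two feature-value series, which is consistent with the description of a linear correlation measure built from the feature values and their averages. The function names and all threshold values are hypothetical placeholders, since the source leaves the thresholds to be preset by the practitioner.

```python
import math


def correlation_trend(xs, ys):
    """Pearson correlation of two equal-length feature-value series (assumed form of D_q)."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("series must have equal length of at least 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant series carries no trend information
    return cov / math.sqrt(var_x * var_y)


def is_potential_cause(p_co, d_q, p_threshold=0.3, pos_threshold=0.6, neg_threshold=-0.6):
    """Mark a pair when co-occurrence is weak AND the fluctuations are clearly correlated.

    All three thresholds are illustrative values, not taken from the source.
    """
    weak_cooccurrence = p_co < p_threshold
    correlated_trend = d_q >= pos_threshold or d_q <= neg_threshold
    return weak_cooccurrence and correlated_trend


d = correlation_trend([1, 2, 3, 4], [2, 4, 6, 8])
print(d, is_potential_cause(0.1, d))
```

Keeping the two conditions as separate named booleans mirrors the two-part rule in the text: a pair must be both a low co-occurrence pair and a pair with a significant positive or negative trend.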
S404, dynamically monitoring potential causes in the low-association feature group, tracking the trend of weak change among features, and evaluating the potential risk of the weak correlation feature causing sudden behavior in subsequent data:
For the identified potential causes, dynamic monitoring is performed, the variation trend among the features is tracked, and the potential risk probability of causing sudden behavior in the future is calculated based on the feature fluctuation in the time window.
For each set of potential incentive feature pairs, the feature value changes are continuously tracked, and the fluctuation amplitude in different time windows is recorded.
Based on the fluctuation amplitude and the time weight, the potential risk probability of the sudden behavior is calculated; this probability is the product of the co-occurrence probability of the feature pair and the time weight.
The time weight is a dynamic parameter set according to the timeliness of a time window and reflects the influence of the most recent data on the feature analysis. It is usually calculated through a predefined decay function in which windows closer to the current time receive a higher weight; common choices include exponential decay and linear decay models.
The greater the potential risk probability of the sudden behavior, the more likely the co-occurrence relationship and time trend among the weakly correlated features are to cause abnormal fluctuation in subsequent data, and the more these features need focused monitoring and evaluation.
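The risk calculation of S404 can be illustrated with a short sketch. It assumes an exponential decay model for the time weight, which is one of the two options the text names; the decay rate, the unit of `age` (number of windows back from the current time), and the function names are hypothetical.

```python
import math


def time_weight(age, decay_rate=0.1):
    """Exponential-decay time weight: age 0 (the current window) gets weight 1.0."""
    if age < 0:
        raise ValueError("age must be non-negative")
    return math.exp(-decay_rate * age)


def burst_risk(p_co, age, decay_rate=0.1):
    """Potential risk probability = co-occurrence probability x time weight."""
    return p_co * time_weight(age, decay_rate)


# The same weak co-occurrence observed in the current window and ten windows ago.
print(burst_risk(0.2, 0))
print(burst_risk(0.2, 10))
```

Because the weight never exceeds 1, the risk probability stays bounded by the co-occurrence probability, and older observations contribute progressively less.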
The dynamic data index structure is updated in real time based on the composite feature weight distribution and the potential causes, and the newly input client interaction data is retrieved based on the dynamic data index structure updated in real time, which concretely comprises the following steps:
S501, extracting the weight of each feature pair and the potential risk probability of the sudden behavior from the composite feature weight distribution, and generating an updated data set:
The weight of each feature pair is extracted from the composite feature weight distribution one by one, and the potential risk probability of the sudden behavior is obtained from the potential cause analysis. The feature weight value represents the strength of association of the feature pairs, and the risk probability is used to quantify the risk that the feature pairs may trigger bursty behavior.
The extracted feature weight values and risk probabilities are combined per feature pair to form an update data set, which is used to adjust the dynamic data index structure.
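One possible shape for the update data set of S501 is sketched below. The record type `UpdateRecord`, the function `build_update_set`, and the choice of a default risk of 0.0 for pairs that were never flagged as potential causes are all assumptions made for illustration; the source only states that weights and risk probabilities are combined per feature pair.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Pair = Tuple[str, str]


@dataclass(frozen=True)
class UpdateRecord:
    pair: Pair      # the feature pair
    weight: float   # from the composite feature weight distribution
    risk: float     # potential risk probability of sudden behavior


def build_update_set(weights: Dict[Pair, float],
                     risks: Dict[Pair, float],
                     default_risk: float = 0.0) -> List[UpdateRecord]:
    """Join weights and risk probabilities on the feature pair."""
    return [UpdateRecord(pair, w, risks.get(pair, default_risk))
            for pair, w in weights.items()]


records = build_update_set(
    {("click", "refund"): 0.8, ("click", "chat"): 0.4},
    {("click", "refund"): 0.15},
)
for record in records:
    print(record)
```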
S502, adjusting the priority of feature nodes according to the weight of the feature pairs in the updated data set, and redefining the association strength between the feature pairs by combining the potential risk probability of the sudden behavior to optimize the index structure:
The priority of the corresponding feature nodes in the index structure is dynamically adjusted according to the weight of each feature pair: the higher the weight value, the higher the priority in the index structure.
The adjustment rule is as follows: the priority level of high-weight feature nodes in the index structure is raised so that they participate in retrieval matching sooner, and the priority of low-weight feature nodes is lowered to reduce the resource occupation of the index structure.
The strength of association between feature nodes is redefined based on the potential risk probability of bursty behavior for each feature pair. For feature pairs with higher potential risk probabilities for bursty behavior, the strength of their association in the index structure is enhanced for preferential monitoring and handling.
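The adjustment in S502 could look like the following sketch. The source does not give a formula for either adjustment, so two illustrative rules are assumed here: priority is the rank of the pair by weight (0 is the highest priority), and the association strength is the weight scaled up by the risk probability through a hypothetical `risk_boost` factor. Neither rule should be read as the method of the source.

```python
def adjust_index(update_set, risk_boost=1.0):
    """Assign a priority rank and a risk-adjusted association strength per feature pair.

    update_set: {pair: (weight, risk)}
    returns:    {pair: {"priority": int, "strength": float}}
    """
    # Higher weight -> smaller rank number -> matched earlier during retrieval.
    ranked = sorted(update_set, key=lambda pair: update_set[pair][0], reverse=True)
    index = {}
    for rank, pair in enumerate(ranked):
        weight, risk = update_set[pair]
        # Assumed rule: strengthen the association in proportion to the risk.
        index[pair] = {"priority": rank,
                       "strength": weight * (1.0 + risk_boost * risk)}
    return index


idx = adjust_index({("a", "b"): (0.4, 0.5), ("a", "c"): (0.8, 0.0)})
for pair, entry in idx.items():
    print(pair, entry)
```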
S503, cleaning low-weight feature nodes in the dynamic data index structure, and sequencing high-weight feature nodes to improve retrieval efficiency and compactness of the structure:
A cleaning operation is performed on feature nodes in the dynamic data index structure whose weight is below a preset cleaning threshold: if the weight of a feature pair is lower than the preset cleaning threshold, the corresponding feature node is removed from the index structure; feature nodes whose feature-pair weight is greater than or equal to the threshold are retained, so that only important features remain in the index structure.
On the cleaned index structure, the remaining feature nodes are arranged in descending order of the feature-pair weights in the update data set. The sorting rule is that feature nodes are ordered by feature-pair weight so that high-priority nodes occupy the leading positions in the index structure, which improves retrieval efficiency.
After the cleaning and sorting operation, the resource positions in the index structure are redistributed, so that the high-weight characteristic nodes occupy the key positions, and meanwhile, the resource consumption of the low-priority nodes is reduced, and a more compact index structure is formed.
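The cleaning and sorting of S503 reduce to a filter followed by a descending sort, as in this minimal sketch. The function name, the node names, and the threshold value 0.3 are hypothetical.

```python
def clean_and_sort(nodes, cleaning_threshold):
    """Drop nodes below the cleaning threshold, then sort the rest by weight, descending.

    nodes: {node_name: weight}
    """
    # Nodes with weight >= threshold are retained, matching the rule in the text.
    kept = [(name, weight) for name, weight in nodes.items()
            if weight >= cleaning_threshold]
    kept.sort(key=lambda item: item[1], reverse=True)
    return kept


compact_index = clean_and_sort({"A": 0.8, "B": 0.6, "C": 0.1, "D": 0.3}, 0.3)
print(compact_index)
```

Node D sits exactly on the threshold and is kept, while node C falls below it and is removed.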
S504, searching the newly input customer interaction data by utilizing a dynamic data index structure updated in real time, matching high-weight feature pairs associated with the input data, and returning a search result:
The newly input customer interaction data is preprocessed, and feature values are extracted according to the feature rules in the dynamic data index structure. For example, the newly input data may include user operation records (such as click frequency and access duration); by analyzing these records, feature values such as "the number of accesses is 5" and "the stay time is 120 seconds" can be extracted.
The extracted feature values are matched one by one against the sorted high-weight feature nodes in the dynamic data index structure; the matching rule is based on the correlation between the feature values and the high-priority features in the index.
For example, a high-weight feature node in the index structure may record the feature "more than 3 clicks and a dwell time of over 100 seconds"; if the newly input data contains the feature values "5 clicks" and "dwell time of 120 seconds", the match with that node succeeds.
All successfully matched feature nodes are sorted in descending order of weight, and the most relevant feature nodes are selected and returned as part of the retrieval result.
For example, if the newly input data matches two feature nodes at the same time (node A and node B), where node A has a weight of 0.8 and node B has a weight of 0.6, node A is returned first in the retrieval result. The final retrieval result may include the numbers of the matched feature nodes and their associated context information (e.g., feature description, time range).
The returned retrieval result is provided to a subsequent analysis module to support user behavior pattern analysis, anomaly detection, or other business logic. The format of the retrieval result may include the matched feature node numbers, related descriptions, matched feature values, and similar information.
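The retrieval flow of S504 can be sketched using the example values from the text (5 clicks, a dwell time of 120 seconds, node A with weight 0.8, node B with weight 0.6). The `FeatureNode` type, the representation of a node's feature rule as a predicate function, the rule attached to node B, and the extra node C are all assumptions added for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class FeatureNode:
    node_id: str
    weight: float
    description: str
    matches: Callable[[Dict[str, float]], bool]  # assumed form of a feature rule


def retrieve(index: List[FeatureNode],
             features: Dict[str, float],
             top_k: Optional[int] = None) -> List[FeatureNode]:
    """Return the matching nodes, highest weight first (all of them if top_k is None)."""
    hits = [node for node in index if node.matches(features)]
    hits.sort(key=lambda node: node.weight, reverse=True)
    return hits[:top_k]


index = [
    FeatureNode("B", 0.6, "at least 5 clicks",
                lambda f: f.get("clicks", 0) >= 5),
    FeatureNode("A", 0.8, "more than 3 clicks and dwell time over 100 s",
                lambda f: f.get("clicks", 0) > 3 and f.get("dwell_seconds", 0) > 100),
    FeatureNode("C", 0.9, "dwell time over 300 s",
                lambda f: f.get("dwell_seconds", 0) > 300),
]
results = retrieve(index, {"clicks": 5, "dwell_seconds": 120})
for node in results:
    print(node.node_id, node.weight, node.description)
```

Node C has the highest weight but does not match the input, so it is excluded; nodes A and B both match and are returned in descending order of weight, with node A first.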
The above formulas are all dimensionless numerical calculations. They were obtained by collecting a large amount of data and performing software simulation to approximate the real situation as closely as possible, and the preset parameters and thresholds in the formulas are set by those skilled in the art according to the actual situation.
The above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, the above-described embodiments may be implemented in whole or in part in the form of a computer program product. The computer program product comprises one or more computer instructions or computer programs. When the computer instructions or computer programs are loaded or executed on a computer, the processes or functions described in accordance with embodiments of the present application are produced in whole or in part. The computer may be a general purpose computer, a special purpose computer, a network of computers, or another programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wired means or wireless means (e.g., infrared, radio, microwave, etc.). The computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device such as a server or data center that contains one or more sets of available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium. The semiconductor medium may be a solid state disk.
Those of ordinary skill in the art will appreciate that the various illustrative modules and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
It will be clearly understood by those skilled in the art that, for convenience and brevity of description, specific working procedures of the above-described system, apparatus and module may refer to corresponding procedures in the foregoing method embodiments, which are not repeated herein.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, and for example, the division of the modules is merely a logical function division, and there may be additional divisions when actually implemented, for example, multiple modules or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or modules, which may be in electrical, mechanical, or other forms.
The modules described as separate components may or may not be physically separate, and components shown as modules may or may not be physical modules, may be located in one place, or may be distributed over multiple network modules. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional module in each embodiment of the present application may be integrated into one processing module, or each module may exist alone physically, or two or more modules may be integrated into one module.
The functions, if implemented in the form of software functional modules and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present application, in essence, or the part of it that contributes over the prior art, or a part of the technical solution, may be embodied in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present application. The storage medium includes a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or other various media capable of storing program code.
The foregoing is merely illustrative of the present application, and the present application is not limited thereto; any variation or substitution readily conceivable by a person skilled in the art falls within the scope of the present application. Therefore, the protection scope of the application is subject to the protection scope of the claims.
Finally, the foregoing description of the preferred embodiment of the invention is provided for the purpose of illustration only, and is not intended to limit the invention to the particular form disclosed, but on the contrary, the intention is to cover all modifications, equivalents, and alternatives falling within the spirit and scope of the invention.