Summary of the invention
The present invention aims to address one or more of the above defects and proposes a 4G Internet user complaint model based on key-indicator correlation analysis.
To achieve the above object, the following technical solution is adopted:
A method for building a 4G Internet user complaint model based on key-indicator correlation analysis, comprising the following steps:
S1: Explore the influencing factors and conditions of complaining users; using a logistic regression model, identify several factors that influence user complaints and build a decision-tree model;
S2: Extract user data containing the factors described in step S1, including raw complaining-user data and non-complaining-user data, and clean and merge the data;
S3: Perform a T-test on the raw complaining-user data and non-complaining-user data obtained in step S2, compare the indicators of the complaint group with those of the non-complaint group, and preliminarily identify the factors that influence complaints;
S4: Set a training data set and build a logistic regression model using the R language, where whether the user complains is the dependent variable, taking the values 0 and 1; after the preliminary model is built, optimize it by backward stepwise logistic regression according to the returned results, and obtain the final result to determine the final model;
S5: Perform a chi-square test on the model to ensure that, in addition to each variable passing its significance test, the model as a whole is significant;
S6: Use the model to predict on the test data set, and cross-tabulate the predicted results against the actual results.
Further, the factors influencing user complaints described in step S1 include attach success rate, attach delay, default bearer success rate, default bearer delay, Tcp23 handshake success rate, and Tcp23 handshake delay.
Further, the data cleaning described in step S2 comprises the following steps:
S2.1: Remove the percent sign from all success rates, keeping a number between 0 and 100 with 2 decimal places;
S2.2: Discard records in which more than 5 indicators are missing;
S2.3: For records with 1 to 5 missing values, fill in the missing values using the k-nearest-neighbor method;
S2.4: Randomly select 80% of the complaint records and 80% of the non-complaint records for training the model; the remaining 20% are used for prediction.
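Step S2.3 can be illustrated with a minimal Python sketch of k-nearest-neighbor imputation. The source does not state the distance measure, the value of k, or whether indicators are rescaled first; the function name `knn_impute`, the Euclidean distance over shared indicators, and `k=3` are all assumptions made here for illustration.

```python
import math

def knn_impute(records, k=3):
    """Fill None entries with the mean of the k nearest neighbours' values.

    Assumptions (not from the source): distance is the root-mean-square
    difference over indicators present in both records, and indicators
    are used unscaled.
    """
    filled = [row[:] for row in records]  # leave the input untouched
    for i, row in enumerate(records):
        for j, value in enumerate(row):
            if value is not None:
                continue
            candidates = []
            for m, other in enumerate(records):
                if m == i or other[j] is None:
                    continue
                shared = [c for c in range(len(row))
                          if row[c] is not None and other[c] is not None]
                if not shared:
                    continue
                dist = math.sqrt(
                    sum((row[c] - other[c]) ** 2 for c in shared) / len(shared))
                candidates.append((dist, other[j]))
            candidates.sort(key=lambda pair: pair[0])
            nearest = [v for _, v in candidates[:k]]
            if nearest:
                filled[i][j] = round(sum(nearest) / len(nearest), 2)
    return filled

sample = [[100.0, 50.0], [99.0, 52.0], [10.0, 500.0], [99.5, None]]
completed = knn_impute(sample, k=2)
```

In practice the indicators have very different ranges (percentages versus milliseconds), so some form of rescaling before computing distances would likely be needed; that choice is left open here.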
The final model is as follows:
Compared with the prior art, the beneficial effects of the present invention are:
The present invention provides a 4G Internet user complaint model based on key-indicator correlation analysis, which identifies the causes of 4G Internet user complaints and allows preventive measures to be taken in advance for users who are likely to complain.
Embodiment 1
A method for building a 4G Internet user complaint model based on key-indicator correlation analysis, referring to FIG. 1, comprises the following steps:
S1: Explore the influencing factors and conditions of complaining users; using a logistic regression model, identify several factors that influence user complaints and build a decision-tree model;
Explore the influencing factors and conditions of complaining users. A T-test was performed on 15 indicators between the complaint group and the non-complaint group, and attach success rate, attach delay, default bearer success rate, default bearer delay, Tcp23 handshake success rate, and Tcp23 handshake delay were found to differ significantly. Using a logistic regression model, 5 factors influencing user complaints were identified: attach success rate, attach delay, default bearer delay, Tcp23 handshake success rate, and Tcp23 handshake delay. A decision-tree model was built from these 5 factors, yielding the conditions under which a user is likely to complain:
1) Tcp23 handshake success rate < 100, Tcp23 handshake delay < 80, and attach delay < 178;
2) Tcp23 handshake success rate < 100 and Tcp23 handshake delay >= 80;
3) Tcp23 handshake success rate = 100, attach delay < 374, attach delay >= 179, and default bearer delay >= 191;
4) Tcp23 handshake success rate = 100 and attach delay < 179;
5) Tcp23 handshake success rate = 100 and attach delay >= 374.
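The five conditions above can be read as a single rule function. The following Python sketch encodes them as written; the function and parameter names are hypothetical, and the thresholds (including the 178 in condition 1 and the 179 in conditions 3 and 4) are copied from the list without adjustment.

```python
def tree_predicts_complaint(tcp23_success, tcp23_delay, attach_delay, bearer_delay):
    """Return True when any of the five listed complaint conditions holds."""
    if tcp23_success < 100:
        if tcp23_delay >= 80:           # condition 2)
            return True
        return attach_delay < 178       # condition 1)
    if tcp23_success == 100:
        if attach_delay < 179:          # condition 4)
            return True
        if attach_delay >= 374:         # condition 5)
            return True
        return bearer_delay >= 191      # condition 3): 179 <= attach delay < 374
    return False                        # rates above 100 are not expected

examples = [
    tree_predicts_complaint(99.0, 90, 500, 0),
    tree_predicts_complaint(99.0, 50, 200, 0),
    tree_predicts_complaint(100.0, 0, 200, 200),
]
```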
The decision-tree model has a prediction accuracy of 65.7%, but the complaint conditions it yields are not satisfactory. Finally, a single-factor analysis was performed on each indicator to determine whether a threshold exists beyond which users are likely to complain. The conclusions are as follows:
1) When the attach success rate is less than or equal to 60%, the user is likely to complain;
2) When the attach delay is greater than or equal to 1500 ms, the user is likely to complain;
3) When the default bearer success rate is less than or equal to 20%, the user is likely to complain;
4) When the default bearer delay is greater than or equal to 1000 ms, the user is likely to complain;
5) When the Tcp23 handshake success rate is less than or equal to 90%, the user is likely to complain.
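As a minimal Python sketch, the five single-factor thresholds can be held in two lookup tables and checked per user record. The dictionary keys and the function name `triggered_rules` are hypothetical; only the numeric thresholds come from the conclusions above.

```python
# Thresholds from the single-factor conclusions: rates in percent, delays in ms.
LOW_RATE_LIMITS = {"attach_success": 60.0, "bearer_success": 20.0, "tcp23_success": 90.0}
HIGH_DELAY_LIMITS = {"attach_delay": 1500.0, "bearer_delay": 1000.0}

def triggered_rules(record):
    """Return the indicator names whose threshold is crossed; missing indicators are skipped."""
    hits = []
    for name, limit in LOW_RATE_LIMITS.items():
        value = record.get(name)
        if value is not None and value <= limit:
            hits.append(name)
    for name, limit in HIGH_DELAY_LIMITS.items():
        value = record.get(name)
        if value is not None and value >= limit:
            hits.append(name)
    return hits

flags = triggered_rules({"attach_success": 60.0, "attach_delay": 1499.0, "tcp23_success": 95.0})
```

Returning the list of triggered indicators, rather than a single yes/no flag, keeps visible which indicator pushed a user over a threshold.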
S2: Extract user data containing the factors described in step S1, including raw complaining-user data and non-complaining-user data, and clean and merge the data;
This embodiment extracts 2336 raw complaining-user records and 2993 non-complaining-user records, 5329 records in total. The data are merged into one table containing the following indicators:
Attach success rate, attach delay, default bearer success rate, default bearer delay, DNS success rate, DNS delay, Tcp12 handshake success rate, Tcp12 handshake delay, Tcp23 handshake success rate, Tcp23 handshake delay, Get response success rate, Get response delay, Post response success rate, Post response delay, and large-packet (greater than 500 KB) download rate.
For modeling, the data are processed as follows:
1) Remove the percent sign from all success rates, keeping a number between 0 and 100 with 2 decimal places. For example, 99.5% is converted to 99.50;
2) Discard records in which more than 5 indicators are missing: first count the missing indicators of each record (out of 15 indicators); if more than 5 are missing (a record with exactly 5 missing is kept), the record is discarded outright, since it would otherwise harm subsequent modeling. After this step, 4819 records remain (1999 complaints and 2820 non-complaints);
3) For records with 1 to 5 missing values, fill in the missing values using the k-nearest-neighbor method, so that the data to be modeled contain no missing values, which simplifies modeling;
4) Randomly select 80% of the complaint records and 80% of the non-complaint records for training the model (3820 records in total); the remaining 20% are used for prediction.
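Processing steps 1), 2), and 4) can be sketched in Python as follows. This is an illustrative sketch only: the helper names, the use of `None` for a missing indicator, and the fixed random seed are assumptions, and the k-nearest-neighbor fill of step 3) is not covered here.

```python
import random

def parse_rate(text):
    """Convert '99.5%' to 99.5 (2 decimal places); blank or None becomes None."""
    if text is None or text.strip() == "":
        return None
    return round(float(text.strip().rstrip("%")), 2)

def drop_sparse(records, max_missing=5):
    """Keep records with at most max_missing missing (None) indicators."""
    return [r for r in records if sum(v is None for v in r) <= max_missing]

def stratified_split(labels, train_fraction=0.8, seed=0):
    """Return (train_idx, test_idx), sampling train_fraction within each class."""
    rng = random.Random(seed)
    train, test = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        cut = int(len(idx) * train_fraction)
        train.extend(idx[:cut])
        test.extend(idx[cut:])
    return sorted(train), sorted(test)

labels = [1] * 10 + [0] * 20            # 1 = complaint, 0 = non-complaint
train_idx, test_idx = stratified_split(labels)
```

Splitting within each class keeps the complaint share roughly equal in the training and test sets, which matches the "80% of each group" wording of step 4).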
S3: Perform a T-test on the raw complaining-user data and non-complaining-user data obtained in step S2, compare the indicators of the complaint group with those of the non-complaint group, and preliminarily identify the factors that influence complaints;
The interpretation of the P value in the T-test is as follows:
| P value | Probability of occurring by chance | Null hypothesis | Statistical significance |
| P>0.05 | The probability of occurring by chance is greater than 5% | The null hypothesis cannot be rejected | The difference between the two groups is not significant |
| P<0.05 | The probability of occurring by chance is less than 5% | The null hypothesis can be rejected | The difference between the two groups is significant |
| P<0.01 | The probability of occurring by chance is less than 1% | The null hypothesis can be rejected | The difference between the two groups is very significant |
Comparing the indicators of the complaint group with those of the non-complaint group by means of the T-test allows the factors that influence complaints to be identified preliminarily. The data used are the 2336 raw complaining-user records and the 2993 non-complaining-user records, 5329 records in total (including missing values). Each indicator is tested independently, and records with a missing value for that indicator are ignored automatically. The results are shown in the table below:
As the table above shows, attach success rate, attach delay, default bearer delay, Tcp23 handshake success rate, and Tcp23 handshake delay differ very significantly between complaining and non-complaining users (99% confidence level). Default bearer success rate differs significantly at the 95% confidence level but not at the 99% confidence level. The other indicators show no significant difference.
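The per-indicator comparison can be sketched in Python as below. The source does not say which T-test variant was used, so Welch's unequal-variance form is assumed here. The two-sided P value is taken from the standard normal distribution rather than the t distribution, an approximation that is only reasonable for groups of thousands of records. All function names are hypothetical.

```python
import math
from statistics import NormalDist, mean, variance

def welch_t_test(group_a, group_b):
    """Welch's t statistic and an approximate two-sided P value.

    None entries (missing values) are ignored. The P value uses the normal
    approximation, which is an assumption suited to large samples only.
    """
    a = [x for x in group_a if x is not None]
    b = [x for x in group_b if x is not None]
    standard_error = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    t = (mean(a) - mean(b)) / standard_error
    p = 2.0 * (1.0 - NormalDist().cdf(abs(t)))
    return t, p

def significance(p):
    """Map a P value to the wording of the interpretation table."""
    if p < 0.01:
        return "very significant"
    if p < 0.05:
        return "significant"
    return "not significant"

t_same, p_same = welch_t_test([1, 2, 3, 4, 5, None], [1, 2, 3, 4, 5])
t_diff, p_diff = welch_t_test([0.0, 1.0] * 50, [5.0, 6.0] * 50)
```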
S4: Set a training data set and build a logistic regression model using the R language, where whether the user complains is the dependent variable, taking the values 0 and 1; after the preliminary model is built, optimize it by backward stepwise logistic regression according to the returned results, and obtain the final result to determine the final model;
A logistic regression model is now built in R from the training data set (3820 records). Whether the user complained is the dependent variable, taking only the values 0 and 1 (0 for non-complaint, 1 for complaint), and the 15 indicators are the independent variables. After the preliminary model is built, R returns the following result:
Call:
glm(formula = Complaint ~ Attach success rate + Attach delay + Default bearer success rate +
    Default bearer delay + DNS success rate + DNS delay + Tcp12 handshake success rate + Tcp12 handshake delay +
    Tcp23 handshake success rate + Tcp23 handshake delay + Get response success rate + Get response delay + Post response success rate +
    Post response delay + Large packet (greater than 500KB) download rate,
    family = "binomial", data = train.dt)
Deviance Residuals:
    Min       1Q   Median       3Q      Max
-3.2577  -0.9793  -0.9161   1.3126   2.3556
Coefficients:
Large packet (greater than 500KB) download rate  -1.491e-06  2.603e-06  -0.573  0.566695
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
    Null deviance: 5182.4 on 3819 degrees of freedom
Residual deviance: 5006.4 on 3804 degrees of freedom
AIC: 5038.4
Number of Fisher Scoring iterations: 5
In the result above, *** marks a highly significant coefficient, ** a very significant one, * a significant one, and . a marginally significant one; a coefficient with no mark is not significant. The coefficients of several indicators are not significant, so the model needs to be optimized.
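The R output reports a number of Fisher scoring iterations, which is the iterative procedure used to fit a logistic regression. A minimal Python sketch of that procedure for a single predictor plus intercept is given below, to show what each iteration computes. It is not the source's 15-indicator fit: the function name, the toy data, and the closed-form 2x2 solve are assumptions for illustration.

```python
import math

def fit_logistic_1d(xs, ys, iterations=25, tol=1e-10):
    """Fit P(y=1) = 1 / (1 + exp(-(b0 + b1*x))) by Fisher scoring."""
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        # Score vector (g) and Fisher information matrix (h) at the current estimate.
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            w = p * (1.0 - p)
            g0 += y - p
            g1 += (y - p) * x
            h00 += w
            h01 += w * x
            h11 += w * x * x
        # Solve the 2x2 system h * step = g in closed form.
        det = h00 * h11 - h01 * h01
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        b0, b1 = b0 + d0, b1 + d1
        if abs(d0) + abs(d1) < tol:
            break
    return b0, b1

# Toy data: 1 of 4 positives at x=0 and 3 of 4 positives at x=1.
b0, b1 = fit_logistic_1d([0, 0, 0, 0, 1, 1, 1, 1], [0, 0, 0, 1, 0, 1, 1, 1])
```

For this toy data the fitted probabilities reproduce the observed proportions (1/4 and 3/4), so the estimates approach -ln 3 and 2 ln 3. The procedure does not converge when the classes are perfectly separated, which is why the toy data mixes outcomes at each x.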
The model is optimized by backward stepwise logistic regression (each step removes the factor with the least evident influence, so that the model improves on the previous one, and in the end only factors with significant coefficients are retained). The final result of the backward stepwise regression is as follows:
Call:
glm(formula = Complaint ~ Attach success rate + Attach delay + Default bearer delay + Tcp12 handshake delay +
    Tcp23 handshake success rate + Tcp23 handshake delay, family = "binomial", data = train.dt)
Deviance Residuals:
    Min       1Q   Median       3Q      Max
-3.2489  -0.9789  -0.9198   1.3174   1.7080
Coefficients:
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
    Null deviance: 5182.4 on 3819 degrees of freedom
Residual deviance: 5012.0 on 3813 degrees of freedom
AIC: 5026
Number of Fisher Scoring iterations: 5
The model finally retains 6 indicators, but the coefficient of Tcp12 handshake delay is still not significant, so this indicator is removed and the model is rebuilt with the remaining 5 indicators:
Call:
glm(formula = Complaint ~ Attach success rate + Attach delay + Default bearer delay +
    Tcp23 handshake success rate + Tcp23 handshake delay, family = "binomial", data = train.dt)
Deviance Residuals:
    Min       1Q   Median       3Q      Max
-3.2410  -0.9790  -0.9201   1.3180   1.5413
Coefficients:
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
    Null deviance: 5182.4 on 3819 degrees of freedom
Residual deviance: 5014.2 on 3814 degrees of freedom
AIC: 5026.2
Number of Fisher Scoring iterations: 5
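The AIC values in the three fits can be cross-checked from the reported residual deviances. For a logistic regression on individual 0/1 records, the AIC equals the residual deviance plus twice the number of estimated parameters (the predictors plus the intercept). The short Python sketch below applies that relation; the function and variable names are hypothetical.

```python
def aic(residual_deviance, n_predictors):
    """AIC for a binary logistic model: deviance + 2 * (predictors + intercept)."""
    return residual_deviance + 2 * (n_predictors + 1)

# (residual deviance, number of predictors) as reported in the three fits.
fits = {
    "15 indicators": (5006.4, 15),
    "6 indicators": (5012.0, 6),
    "5 indicators": (5014.2, 5),
}
aic_values = {name: round(aic(dev, k), 1) for name, (dev, k) in fits.items()}
for name, value in aic_values.items():
    print(f"{name}: AIC = {value:.1f}")
```

The same bookkeeping explains the residual degrees of freedom: 3819 minus the number of predictors gives 3804, 3813, and 3814 respectively.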
At this point every coefficient is significant and the relatively important variables are retained. The resulting model is:
+ 0.0005773 * default bearer delay - 0.0666100 * Tcp23 handshake success rate + 0.0038896 * Tcp23 handshake delay
When the P calculated by the model is greater than 0.5, the user is predicted to complain; otherwise the user is predicted not to complain.
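The decision rule can be sketched in Python as follows. A logistic regression turns a linear combination z of the indicators into a probability P = 1 / (1 + exp(-z)), so P > 0.5 is equivalent to z > 0. The coefficients in the example are toy values chosen for illustration, not the fitted coefficients of the model, and all names are hypothetical.

```python
import math

def complaint_probability(intercept, coefficients, indicators):
    """P = 1 / (1 + exp(-z)), where z = intercept + sum(coefficient * indicator)."""
    z = intercept + sum(coefficients[name] * indicators[name] for name in coefficients)
    return 1.0 / (1.0 + math.exp(-z))

def predicts_complaint(probability, threshold=0.5):
    """Apply the P > 0.5 decision rule."""
    return probability > threshold

# Toy values, NOT the fitted coefficients of the source model.
toy_coefficients = {"delay_ms": 0.004, "success_rate": -0.07}
toy_user = {"delay_ms": 300.0, "success_rate": 100.0}
p = complaint_probability(6.0, toy_coefficients, toy_user)
will_complain = predicts_complaint(p)
```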
S5: Perform a chi-square test on the model to ensure that, in addition to each variable passing its significance test, the model as a whole is significant;
In addition to each variable passing its significance test, the model as a whole must also be significant; only then can the model be considered correct and meaningful. A chi-square test is performed on the model, with the following result:
Analysis of Deviance Table
Model: binomial, link: logit
Response: Complaint
Terms added sequentially (first to last)
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The model has passed the overall significance test, which shows that the model composed of the above variables is meaningful.
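An overall check of this kind can be reproduced from the deviances reported for the 5-indicator fit: the drop from null deviance to residual deviance is compared against a chi-square distribution whose degrees of freedom equal the number of predictors. The Python sketch below does this with the standard library only. It tests the model as a whole, whereas the R table adds terms one at a time, so it is a related check rather than a reproduction of that table. The function name `chi2_sf` is hypothetical.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with integer df >= 1."""
    if x <= 0:
        return 1.0
    y = x / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-y)               # Q(1, y)
    else:
        a, q = 0.5, math.erfc(math.sqrt(y))    # Q(1/2, y)
    # Recurrence: Q(a + 1, y) = Q(a, y) + y**a * exp(-y) / Gamma(a + 1)
    while a < df / 2.0:
        q += math.exp(a * math.log(y) - y - math.lgamma(a + 1.0))
        a += 1.0
    return q

# Deviances and degrees of freedom reported for the 5-indicator model.
statistic = 5182.4 - 5014.2    # null deviance minus residual deviance
df = 3819 - 3814               # difference in residual degrees of freedom
p_value = chi2_sf(statistic, df)
```

With a deviance drop of about 168 on 5 degrees of freedom, the resulting P value is far below 0.001, consistent with the conclusion that the model as a whole is significant.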
S6: Use the model to predict on the test data set, and cross-tabulate the predicted results against the actual results.
The model is now used to predict on the test data set (956 records in total: 396 complaints and 560 non-complaints), that is, to predict whether an unknown user may complain. When the P calculated by the model is greater than 0.5, the user is predicted to complain; otherwise the user is predicted not to complain. The predicted results are cross-tabulated against the actual results, as shown in the table below:
| | Complaining users | Non-complaining users |
| Predicted to complain | 162 | 94 |
| Predicted not to complain | 234 | 466 |
Accuracy of complaint predictions: 162 / (162 + 94) × 100% = 63.3%;
Accuracy of non-complaint predictions: 466 / (466 + 234) × 100% = 66.6%;
Overall prediction accuracy: (466 + 162) / (466 + 162 + 234 + 94) × 100% = 65.7%;
Recall: 162 / (162 + 234) × 100% = 40.9%.
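The four ratios above can be recomputed directly from the cross-tabulation counts. The Python sketch below does so; the function name and dictionary keys are hypothetical, and the counts are those of the table (162 correctly predicted complaints, 94 false alarms, 234 missed complaints, 466 correctly predicted non-complaints).

```python
def confusion_metrics(tp, fp, fn, tn):
    """Compute the four ratios reported for the cross-tabulation."""
    total = tp + fp + fn + tn
    return {
        "complaint_prediction_accuracy": tp / (tp + fp),
        "non_complaint_prediction_accuracy": tn / (tn + fn),
        "overall_accuracy": (tp + tn) / total,
        "recall": tp / (tp + fn),
    }

metrics = confusion_metrics(tp=162, fp=94, fn=234, tn=466)
for name, value in metrics.items():
    print(f"{name}: {value:.1%}")
```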
Summary: the model's predictive ability is reasonably good, with an overall prediction accuracy of 65.7%.
Obviously, the above embodiment of the present invention is merely an example given to illustrate the invention clearly and does not limit its embodiments. Those of ordinary skill in the art can make other changes or variations in different forms on the basis of the above description. It is neither necessary nor possible to enumerate all embodiments here. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the claims of the present invention.