Summary of the invention
The object of the invention is to overcome the deficiencies of the prior art by providing a method for predicting network connection speed based on machine learning.
The machine-learning-based method for predicting network connection speed comprises the following steps:
1) using a custom browser, record the connection speed of the websites the user has browsed, and use these records as the training set and the test set;
2) using the recorded website connection speeds, train a neural network and use it to predict the connection speed between the user and each website in the training set;
3) depending on how far the prediction error of the neural network has decreased, either proceed to step 4), or divide the training set into smaller training sets and return to step 2) for each of them;
4) use a decision tree to test the prediction performance of the neural network;
5) use the decision tree and the neural network to predict the connection speed between the user and any unknown website.
The step of using a custom browser to record the connection speed of the websites the user has browsed, as the training set and the test set, comprises:
(a) for each website the user visits, record the time interval from the moment the user sends an access request to the website until the user receives the response, and denote it the user connection time of the website;
(b) for each website the user visits, record the download speed each time the user downloads data from the website, and denote it the user bandwidth of the website;
(c) if the user visits a website repeatedly, take the mean of the user connection times measured in the most recent week, or over the most recent 10 visits, as the user connection time of the website, and take the mean of the user bandwidths measured in the most recent week, or over the most recent 10 visits, as the user bandwidth of the website;
(d) randomly select 10% of the user's historical data as the test set, and use the remaining 90% as the training set.
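Steps (c) and (d) can be illustrated with a minimal Python sketch. The function names, the tuple layout of a sample, and the rule for choosing between the two averaging windows (use the past week if it contains any visits, otherwise the last 10 visits) are assumptions made for illustration; the text only says "the most recent week or the most recent 10 visits".

```python
import random
from datetime import datetime, timedelta
from statistics import mean

def aggregate_site(samples, now, max_recent=10, window=timedelta(days=7)):
    """samples: list of (timestamp, connect_ms, bandwidth_kbps) for one website.

    Returns (mean connection time, mean bandwidth) for that website.
    """
    ordered = sorted(samples, key=lambda s: s[0])
    recent = [s for s in ordered if now - s[0] <= window]
    # Assumption: fall back to the last `max_recent` visits when the
    # past week contains no visit at all.
    chosen = recent if recent else ordered[-max_recent:]
    return mean(s[1] for s in chosen), mean(s[2] for s in chosen)

def split_history(records, test_fraction=0.10, seed=0):
    """Randomly split records into (training set, test set), 90% / 10% by default."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]
```

A fixed seed is used here only so that the split is reproducible; any source of randomness would serve.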
The step of using the recorded website connection speeds to train a neural network and predict the connection speed between the user and each website in the training set comprises:
(e) build an artificial neural network whose input is the feature data of a website, comprising the network IP address expressed as a 32-bit integer and one integer between 0 and 23 giving the hour of the current time; its output is 2 real numbers, representing the estimated connection time and the estimated bandwidth between the user and the website;
(f) using the historical user connection time and user bandwidth data obtained in steps (a)-(d) as the training set, train the neural network built in step (e) with the back-propagation algorithm, and save the trained neural network.
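The input encoding of step (e) might look as follows in Python. This is a sketch only: `encode_features` and `scale_features` are hypothetical names, and the scaling of both inputs to the range 0 to 1 is an assumption that the text does not state (it is a common precaution when sigmoid neurons are fed large integers).

```python
import ipaddress

def encode_features(ip_text, hour):
    """Return (IP address as a 32-bit integer, hour of day) for one website."""
    if not 0 <= hour <= 23:
        raise ValueError("hour must be between 0 and 23")
    ip_int = int(ipaddress.IPv4Address(ip_text))
    return ip_int, hour

def scale_features(ip_int, hour):
    # Assumption: inputs are scaled to [0, 1] before training;
    # the source text does not describe any scaling.
    return ip_int / (2**32 - 1), hour / 23
```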
The step of either proceeding to step 4), or dividing the training set into smaller training sets and returning to step 2) for each of them, depending on how far the prediction error of the neural network has decreased, comprises:
(g) use the neural network trained in step (f) to predict the user connection time and user bandwidth of each website in the training set, and calculate the error e between the predicted and actual values for the website:
e = t + Kb * b
where t is the prediction error of the user connection time, in milliseconds; b is the prediction error of the user bandwidth, in kilobits per second; and Kb is a coefficient whose typical value is 200 to 1000;
(h) if step (g) is not being executed for the first time, and the total prediction error differs from the previous total prediction error by no more than 3%, jump to step (k);
(i) sort the website data in the training set in ascending order of the prediction error from step (g), and use the k-nearest-neighbor clustering algorithm to divide the websites into m groups, where m is the integer between 1 and 5 that maximizes the difference in mean prediction error between the groups;
(j) for each group of websites obtained in step (i), use it as a training set and jump to step (e).
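The error measure of step (g) and the stopping test of step (h) can be written out as a short Python sketch. Two points are assumptions rather than statements from the text: the prediction errors t and b are taken as absolute differences, and the 3% threshold is measured relative to the previous total. The default Kb = 500 is simply a value inside the stated 200 to 1000 range.

```python
def site_error(pred_ms, actual_ms, pred_kbps, actual_kbps, kb=500.0):
    """Combined error e = t + Kb * b for one website."""
    t = abs(pred_ms - actual_ms)        # connection-time error, milliseconds
    b = abs(pred_kbps - actual_kbps)    # bandwidth error, kilobits per second
    return t + kb * b

def has_converged(previous_total, current_total, tolerance=0.03):
    """True when the total error changed by no more than 3% since the last pass."""
    if previous_total is None:          # step (g) is running for the first time
        return False
    return abs(current_total - previous_total) <= tolerance * previous_total
```

For example, a 20 ms connection-time error and a 10 kbit/s bandwidth error with Kb = 200 give e = 20 + 200 * 10 = 2020.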
The step of using a decision tree to test the prediction performance of the neural network comprises:
(k) divide the websites in the test set obtained in step (d) into 1000 groups by network IP address, numbered 1 to 1000; supposing that n neural networks are in use at the end of steps (e)-(j), record for each website the number, between 1 and n, of the neural network last used to train it when steps (e)-(j) finish;
(l) build a decision tree whose input is the network IP address group number, with a value from 1 to 1000, and whose output is the neural network number, with a value from 1 to n;
(m) using the test set data obtained in step (d), train the decision tree built in step (l) with the C4.5 decision tree algorithm, and save the trained decision tree.
The step of using the decision tree and the neural network to predict the connection speed between the user and any unknown website comprises:
(n) for any website whose connection speed is unknown, obtain its network IP address group number, between 1 and 1000, as described in step (k), and use the decision tree obtained in step (m) to obtain the number of its corresponding neural network;
(o) use that neural network to predict the user connection time and user bandwidth of the website.
The invention makes effective use of artificial intelligence techniques, applying several machine learning methods to predict the connection speed between the user and each website. This improves the accuracy with which network conditions are assessed, allows network resources to be used in a way that exploits the user's bandwidth more fully, and gives the user a better Internet experience.
Embodiment
As shown in Fig. 1, the method comprises two parts, a training stage and a prediction stage. The training stage comprises user historical data 10, training set 20, test set 30, artificial neural network 40, error judgment 50, training set splitting 60, artificial neural network group 70 and C4.5 decision tree 80. The prediction stage comprises unknown website 90, C4.5 decision tree 80, artificial neural network group 70 and connection speed predicted value 99.
User historical data 10: the data on the user connection time of each site visit and the user bandwidth during data transfer. The user connection time of a website is defined as the time interval between the user sending an access request to the website and the user receiving the website's response; if the user visits the website repeatedly, the mean of the user connection times in the most recent week, or over the most recent 10 visits, is used. The user bandwidth of a website is defined as the download speed when the user downloads data from the website; if the user visits the website repeatedly, the mean of the user bandwidths in the most recent week, or over the most recent 10 visits, is used.
Training set 20: after 10% of user historical data 10 has been selected at random as the test set, the remaining 90% is used as the training set.
Test set 30: 10% of user historical data 10, selected at random.
Artificial neural network 40: in this embodiment a 4-layer artificial neural network is used. The input layer takes the feature data of a website: the network IP address expressed as a 32-bit integer and one integer between 0 and 23 giving the hour of the current time. The output layer consists of 2 real numbers, representing the estimated connection time and the estimated bandwidth between the user and the website. Each neuron in the other two layers uses a sigmoid function, and every pair of neurons in adjacent layers is connected. The network is trained continuously in the background on user historical data 10 using the back-propagation algorithm.
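The forward pass of such a 4-layer, fully connected network can be sketched in plain Python as below. The hidden-layer width of 8, the weight initialization range, the linear output layer and the names `make_network` and `forward` are all assumptions for illustration; the text fixes only the 2 inputs, the 2 real-valued outputs and the sigmoid neurons in the two middle layers. Back-propagation training is not shown.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def make_network(sizes, seed=0):
    """Build random weights for layer sizes such as [2, 8, 8, 2].

    Each neuron is a list of its input weights followed by one bias term.
    """
    rng = random.Random(seed)
    return [
        [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)] for _ in range(n_out)]
        for n_in, n_out in zip(sizes, sizes[1:])
    ]

def forward(network, inputs):
    """Propagate inputs through the network and return the 2 output values."""
    activations = list(inputs)
    for index, layer in enumerate(network):
        # zip stops at the last input weight; neuron[-1] is the bias.
        sums = [sum(w * a for w, a in zip(neuron, activations)) + neuron[-1]
                for neuron in layer]
        is_output = index == len(network) - 1
        # Assumption: hidden layers are sigmoid, the output layer is linear.
        activations = sums if is_output else [sigmoid(s) for s in sums]
    return activations

# Inputs assumed pre-scaled to [0, 1]: (IP integer / 2**32, hour / 23).
estimates = forward(make_network([2, 8, 8, 2]), [0.75, 14 / 23])
```

Here `estimates` holds two numbers standing for the connection time and bandwidth estimates; with untrained random weights they carry no meaning yet.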
Error judgment 50: the trained artificial neural network 40 is used to predict the user connection time and user bandwidth of each website in training set 20, and the error e between the predicted and actual values for the website is calculated:
e = t + Kb * b
where t is the prediction error of the user connection time, in milliseconds; b is the prediction error of the user bandwidth, in kilobits per second; and Kb is a coefficient whose typical value is 200 to 1000. If this step is not being executed for the first time and the current total prediction error differs from the previous total prediction error by no more than 3%, training of the neural networks ends and all neural networks existing at that point are saved, giving artificial neural network group 70.
Training set splitting 60: the website data in training set 20 is sorted in ascending order of its prediction error under artificial neural network 40, and the k-nearest-neighbor clustering algorithm is used to divide the websites into m groups, where m is the integer between 1 and 5 that maximizes the difference in mean prediction error between the groups. Each of the m groups is then treated as a new training set 20, and execution jumps back to artificial neural network 40.
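The choice of m can be illustrated with a simplified Python sketch. It does not implement the k-nearest-neighbor clustering named in the text: as a stand-in, the sorted websites are cut into m contiguous groups of nearly equal size, and m is chosen to maximize the mean gap between the mean errors of adjacent groups. Both the stand-in partitioning and this particular reading of "difference in mean prediction error between the groups" are assumptions.

```python
from statistics import mean

def chunk(sorted_items, m):
    """Cut a sorted list into m contiguous, nearly equal-sized groups."""
    n = len(sorted_items)
    return [sorted_items[i * n // m:(i + 1) * n // m] for i in range(m)]

def split_by_error(site_errors, max_groups=5):
    """site_errors: list of (site, error). Returns a list of groups of sites."""
    ordered = sorted(site_errors, key=lambda pair: pair[1])
    best_groups, best_score = [ordered], 0.0   # m = 1: no split
    for m in range(2, min(max_groups, len(ordered)) + 1):
        groups = chunk(ordered, m)
        means = [mean(err for _, err in group) for group in groups]
        # Mean gap between the mean errors of neighbouring groups.
        score = mean(b - a for a, b in zip(means, means[1:]))
        if score > best_score:
            best_groups, best_score = groups, score
    return [[site for site, _ in group] for group in best_groups]
```

With errors 1, 2, 3, 100, 101, 102 for sites a to f, this sketch selects m = 2 and returns the groups [a, b, c] and [d, e, f]; when all errors are equal it leaves the training set unsplit.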
Artificial neural network group 70: in error judgment 50, if the total prediction error of all artificial neural networks over all training data differs by no more than 3% from the total prediction error before the last training set splitting 60, all artificial neural networks existing at that point form the artificial neural network group. Fig. 4 shows the prediction error when the number of neural networks in the artificial neural network group is limited, which demonstrates the need to build a neural network group and to split the training set.
C4.5 decision tree 80: all websites in test set 30 are divided into 1000 groups by network IP address, numbered 1 to 1000. Supposing that artificial neural network group 70 contains n artificial neural networks in total, the number, between 1 and n, of the neural network in artificial neural network group 70 that each website in training set 20 was last used to train is recorded. The input of the decision tree is the network IP address group number, with a value from 1 to 1000, and its output is the neural network number, with a value from 1 to n. The decision tree is trained on test set 30 with the C4.5 decision tree algorithm, and the trained decision tree is saved.
Unknown website 90: a website on the Internet whose connection speed is unknown.
Connection speed predicted value 99: for unknown website 90, its network IP address group number, between 1 and 1000, is obtained; C4.5 decision tree 80 is used to obtain the number of its corresponding neural network; and the corresponding neural network in artificial neural network group 70 is then used to predict the user connection time and user bandwidth of the website.
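The prediction stage can be summarized in a few lines of Python. The text does not say how IP addresses are assigned to the 1000 groups, so the equal-width partition of the 32-bit address space in `ip_group` is an assumption, as are the function names and the representation of the trained decision tree and neural networks as plain callables.

```python
import ipaddress

def ip_group(ip_text, groups=1000):
    """Map an IPv4 address to a group number between 1 and `groups`."""
    # Assumption: groups are equal-width ranges of the 32-bit address space.
    return int(ipaddress.IPv4Address(ip_text)) * groups // 2**32 + 1

def predict_speed(ip_text, hour, tree, networks):
    """Route a website to its neural network and return its prediction.

    tree:     callable mapping an IP group number to a network number (1..n)
    networks: dict mapping a network number to a callable that returns
              (connection time in ms, bandwidth in kbit/s)
    """
    network_id = tree(ip_group(ip_text))
    return networks[network_id](ip_text, hour)
```

A trained C4.5 tree and trained networks would take the place of the callables; stubs are enough to exercise the routing.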
An important application of the invention is a personalized network resource recommendation system, whose flow structure is shown in Fig. 2. The personalized recommendation system comprises two parts, a front end and a back end. The back end comprises custom browser 100 and resource recommendation result 700; the front end comprises user historical data 200, machine-learning-based network connection speed prediction 300, general search engine 400, basic search results 500, and merging and adjustment of search results 600.
Custom browser 100: a module, embedded as a plug-in in an existing web browser such as Firefox or Internet Explorer, that records the user connection time of each site visit and the user bandwidth during data transfer.
User historical data 200: the data, obtained by custom browser 100, on the user connection time of each site visit and the user bandwidth during data transfer. The user connection time of a website is defined as the time interval between the user sending an access request to the website and the user receiving the website's response; if the user visits the website repeatedly, the mean of the user connection times in the most recent week, or over the most recent 10 visits, is used. The user bandwidth of a website is defined as the download speed when the user downloads data from the website; if the user visits the website repeatedly, the mean of the user bandwidths in the most recent week, or over the most recent 10 visits, is used.
Machine-learning-based network connection speed prediction 300: the machine-learning-based network connection speed prediction method of the invention is used to predict the user connection time and user bandwidth of each website in the basic search results.
General search engine 400: provides a user interface that calls the network resource search service; in this embodiment the interface is implemented in JSP. When the user submits a query, a general web search engine (such as Google) is called to obtain the search results.
Basic search results 500: after the search with general search engine 400, the result page is parsed and the first 100 returned results are extracted; if a resource is a document, the document is downloaded and stored locally.
Merging and adjustment of search results 600: 1) Merging: if the network resource the user needs is text, then for every two items in the search results, the "Code for estimating document similarity" in Microsoft's open source code package is used to calculate the text similarity between them; if the similarity is greater than 95%, the two items are marked as having identical content. If the network resource the user needs is in another form, then for every two items in the search results, 10 offset positions are picked at random and 1 KB of data at each offset in the two data files is compared; if the two files are identical at all 10 positions, they are marked as having identical content. All websites with identical content are then merged into the search result item in which that content first appears, forming a single search result item. 2) Adjustment: if the size of the network resource the user needs is less than 100 KB, then within each search result item that contains two or more websites, the websites are re-sorted in ascending order of the user connection time estimated by machine-learning-based network connection speed prediction 300; if the size is greater than 100 KB, then within each such search result item the websites are re-sorted in descending order of the user bandwidth estimated by machine-learning-based network connection speed prediction 300.
Resource recommendation result 700: after merging and adjustment of search results 600, a personalized, user-oriented resource recommendation result is obtained. This result takes full account of the user's own network conditions, allows network resources to be used in a way that exploits the user's bandwidth more fully, and gives the user a better Internet experience.
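The non-text comparison and the re-sorting rule of merging and adjustment of search results 600 can be sketched in Python as follows. The function names and the tuple layout of a mirror are hypothetical. Two details are assumptions because the text leaves them open: files of different length are treated as different content, and a resource of exactly 100 KB is handled like a larger one.

```python
import random

def same_content(data_a, data_b, probes=10, length=1024, seed=0):
    """Compare `length` bytes at `probes` random offsets of two byte strings."""
    if len(data_a) != len(data_b):
        return False  # Assumption: different sizes imply different content.
    rng = random.Random(seed)
    limit = max(len(data_a) - length, 0)
    for _ in range(probes):
        offset = rng.randint(0, limit)
        if data_a[offset:offset + length] != data_b[offset:offset + length]:
            return False
    return True

def order_mirrors(mirrors, resource_bytes, threshold=100 * 1024):
    """mirrors: list of (site, predicted_connect_ms, predicted_bandwidth_kbps)."""
    if resource_bytes < threshold:
        # Small resource: shortest predicted connection time first.
        return sorted(mirrors, key=lambda m: m[1])
    # Large resource: highest predicted bandwidth first.
    return sorted(mirrors, key=lambda m: m[2], reverse=True)
```

The design reflects the reasoning in the text: for small resources the connection set-up time dominates the total, while for large resources the transfer rate does.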
The experimental results in Tables 1 and 2 clearly demonstrate the advantage of the method.
Table 1 gives data from experiments in a virtual network that simulates Internet conditions; the structure of this virtual network is shown in Fig. 3. Each website sits beneath a multi-tier network structure organized by a number of gateways. The virtual network contains about 30000 computers in total, distributed among three different Internet service providers (ISPs). For the gateways at each tier, the closer a gateway is to the network root, the smaller its latency and the larger the bandwidth of its nearby links; the farther it is from the network root, the larger its latency and the smaller the bandwidth of its nearby links. The latency of the backbone gateways between different ISPs is about 1/100 of the latency of the gateways inside a single ISP, and their bandwidth is about 50 times as large. In the experiment, 500 different resource data items were created, each was replicated 2000 times, and the copies were distributed at random over the computers in the virtual network. For each user resource query, the search engine is assumed to return 90% of the websites holding the resource, in random order. The probability that the user clicks the i-th item in the returned site list is assumed to be

Table 1 lists the total time the user spends obtaining the required resource before and after using the network resource recommendation system that incorporates the method of this embodiment. The columns give the time needed to obtain the resource when its size is 10 KB, 1 MB and 100 MB. The rows give the experimental data when the recommendation system is not used, and when it is used with the artificial neural network group limited to 1, 5, 10, 50 and 100 neural networks. Each value is the average over 100 repetitions of the same experimental procedure, in milliseconds (ms). In the experiments of Table 1 (a)-(c), the user is placed on a computer far from the network root, simulating a dial-up user, with 10000, 50000 and 100000 user history records available in advance. In the experiments of Table 1 (d)-(f), the user is placed on a computer close to the network root, simulating a broadband user, again with 10000, 50000 and 100000 user history records available in advance. All data in Table 1 (a)-(f) are also shown graphically, as percentage performance improvement, in the corresponding panels of Fig. 5 (a)-(f).
Table 1
(a) Simulated dial-up network, 10000 user history records
| Number of neural networks | 10 KB | 1 MB | 100 MB |
| 0 (invention not used) | 285.3 ms | 23472 ms | 2290176 ms |
| 1 | 233.7 ms | 17205 ms | 1662668 ms |
| 5 | 117.8 ms | 10445 ms | 846298 ms |
| 10 | 97.0 ms | 8098 ms | 785530 ms |
| 50 | 87.1 ms | 6957 ms | 695418 ms |
| 100 | 88.3 ms | 6329 ms | 718152 ms |
(b) Simulated dial-up network, 50000 user history records
| Number of neural networks | 10 KB | 1 MB | 100 MB |
| 0 (invention not used) | 296.6 ms | 23631 ms | 2394101 ms |
| 1 | 238.8 ms | 18149 ms | 1941615 ms |
| 5 | 118.6 ms | 8791 ms | 945567 ms |
| 10 | 77.1 ms | 6940 ms | 825965 ms |
| 50 | 76.3 ms | 5334 ms | 631479 ms |
| 100 | 73.3 ms | 5404 ms | 598086 ms |
(c) Simulated dial-up network, 100000 user history records
| Number of neural networks | 10 KB | 1 MB | 100 MB |
| 0 (invention not used) | 269.8 ms | 22255 ms | 2250904 ms |
| 1 | 195.6 ms | 17381 ms | 1609396 ms |
| 5 | 80.1 ms | 7077 ms | 841838 ms |
| 10 | 54.8 ms | 4874 ms | 567228 ms |
| 50 | 41.5 ms | 4015 ms | 457065 ms |
| 100 | 41.5 ms | 3387 ms | 338572 ms |
(d) Simulated broadband network, 10000 user history records
| Number of neural networks | 10 KB | 1 MB | 100 MB |
| 0 (invention not used) | 135.8 ms | 3808 ms | 486680 ms |
| 1 | 133.1 ms | 4116 ms | 476460 ms |
| 5 | 123.1 ms | 3610 ms | 435579 ms |
| 10 | 124.5 ms | 3397 ms | 346516 ms |
| 50 | 102.5 ms | 3046 ms | 305148 ms |
| 100 | 98.9 ms | 2871 ms | 279206 ms |
(e) Simulated broadband network, 50000 user history records
| Number of neural networks | 10 KB | 1 MB | 100 MB |
| 0 (invention not used) | 140.2 ms | 4597 ms | 369387 ms |
| 1 | 142.4 ms | 4951 ms | 361999 ms |
| 5 | 125.9 ms | 4275 ms | 340575 ms |
| 10 | 108.7 ms | 3480 ms | 271869 ms |
| 50 | 106.1 ms | 3273 ms | 253768 ms |
| 100 | 103.5 ms | 3167 ms | 249336 ms |
(f) Simulated broadband network, 100000 user history records
| Number of neural networks | 10 KB | 1 MB | 100 MB |
| 0 (invention not used) | 175.8 ms | 8012 ms | 494828 ms |
| 1 | 180.9 ms | 7403 ms | 510168 ms |
| 5 | 152.2 ms | 6794 ms | 422088 ms |
| 10 | 114.8 ms | 3332 ms | 220693 ms |
| 50 | 101.8 ms | 2660 ms | 196942 ms |
| 100 | 97.2 ms | 2732 ms | 191004 ms |
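The entries of Table 1 can be converted into a percentage improvement over the baseline row, the form in which Fig. 5 is said to present them. The formula below, (baseline - measured) / baseline, is an assumed definition, since the text does not state how the percentage in Fig. 5 is calculated.

```python
def improvement_percent(baseline_ms, measured_ms):
    """Relative reduction in download time, as a percentage of the baseline."""
    return (baseline_ms - measured_ms) / baseline_ms * 100.0

# Table 1(a), 10 KB column: invention not used vs. 100 neural networks.
dialup_small = improvement_percent(285.3, 88.3)
print(f"{dialup_small:.1f}%")
```

Under this definition the 10 KB dial-up case with 10000 history records improves by roughly 69%.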
Table 2 gives experimental data comparing the invention with the Xunlei (Thunder) download software in the virtual network. Table 2 (a)-(c) shows the data when the user historical data is limited to 10000, 50000 and 100000 records respectively. In each table, the columns give the time needed to obtain the resource when the size of the required resource data is 10 KB, 1 MB and 100 MB. The rows give the experimental data when the system of this embodiment is not used and when the network resource recommendation system incorporating this embodiment is used. To show the effect of the method more clearly, the experimental data for the Xunlei software under the same conditions (simulating its single-site download mode) is also listed in the table for comparison. Each value is the average over 100 repetitions for the same kind of resource, in milliseconds (ms).
Table 2
(a) 10000 user history records
| Configuration | 10 KB | 1 MB | 100 MB |
| Invention not used | 285.3 ms | 23472 ms | 2290176 ms |
| Invention used | 88.3 ms | 6329 ms | 718152 ms |
| Xunlei used | 254.4 ms | 19603 ms | 1951160 ms |
(b) 50000 user history records
| Configuration | 10 KB | 1 MB | 100 MB |
| Invention not used | 296.6 ms | 23631 ms | 2394101 ms |
| Invention used | 73.3 ms | 5404 ms | 590086 ms |
| Xunlei used | 253.6 ms | 18858 ms | 1855428 ms |
(c) 100000 user history records
| Configuration | 10 KB | 1 MB | 100 MB |
| Invention not used | 269.8 ms | 22255 ms | 2250904 ms |
| Invention used | 41.5 ms | 3387 ms | 338572 ms |
| Xunlei used | 204.8 ms | 15710 ms | 1482256 ms |
Fig. 6 shows experimental data for this embodiment on the real Internet in China. In the experiment of Fig. 6, 10 dial-up users from different regions (users #1-#10) and 10 broadband users from different regions (users #11-#20) used the network resource recommendation system incorporating this embodiment. After two weeks of use, they accessed three typical kinds of Internet resource: text, PDF documents and online game installation files. Fig. 6 (a)-(c) shows the time the users needed to download text, PDF documents and online game installation files respectively. In each figure, the bars show the experimental data when the network resource recommendation system incorporating this embodiment is not used and when it is used. To show the effect of the method more clearly, the experimental data for the Xunlei software under the same conditions (restricted to its single-site download mode) is also plotted as bars for comparison. The experimental data of every user is shown in the figures. Each value is the average over 100 repetitions for the same kind of resource. The average data sizes of the three kinds of resource are 10.6 KB, 3.49 MB and 784 MB respectively. The experimental data is given in seconds (sec.).
The above experiments show that the invention makes effective use of the user's historical web access records and applies artificial intelligence methods to predict the connection speed between the user and each website. It brings the user's own network conditions into the network access process, allows network resources to be used in a way that exploits the user's bandwidth more fully, and gives the user a better Internet experience.
The above is only a preferred embodiment of the machine-learning-based network connection speed prediction method of the invention and is not intended to limit the scope of the substantive technical content of the invention. The substantive technical content of the machine-learning-based network connection speed prediction method of the invention is defined broadly in the claims. Any technical entity or method completed by others that is identical to what is defined in the claims, or is an equivalent change thereof, shall be regarded as falling within the scope of protection of this patent.