Base station co-site identification method based on big data

Technical Field
The application relates to the technical field of base station co-site identification, in particular to a base station co-site identification method based on big data.
Background
The wireless environment measurement report (MR) data in a mobile communication network accurately reflects the coverage condition of the network and provides operators with good tool support for understanding wireless network coverage. Good network coverage is a fundamental guarantee of an operator's survival. However, as the mobile communication network evolves further, in particular with the gradual transition from 4G to 5G, the frequency bands adopted by the wireless network use ever-shorter wavelengths, which multiplies the required construction scale. According to incomplete statistics, there are already more than 4 million 4G sites nationwide, and the number of 5G sites will be more than 3 times that of the 4G sites, which directly increases the total investment cost of operators.
Co-site sharing is a good cost-optimization strategy. The three major operators (China Mobile, China Telecom and China Unicom) have already jointly established a tower group (China Tower), which handles base station site construction and leases the sites to the three operators, which pay according to their usage. Owing to historical legacy issues, however, the three operators still own a large number of their own sites, so existing operator-owned sites and shared sites cannot be well distinguished and classified, which in turn affects the tower group's site cost allocation. For example, existing sites are distinguished and classified mainly by their longitude and latitude, but because the basic site information in the operators' resource management systems is inconsistent with the actual site information in a large number of cases (mainly because the resource management system is not updated in time after later site relocation), the existing site classification is inaccurate.
Disclosure of Invention
In order to overcome the above-mentioned drawbacks of the prior art, an embodiment of the present application provides a method for identifying co-sited base stations based on big data, which cleans the data measured on different network frequency bands in the wireless environment measurement report (MR) and then applies a machine learning method to classify co-sited sites, so as to solve the problems set forth in the background art.
In order to achieve the above purpose, the present application provides the following technical solutions: a base station co-site identification method based on big data comprises the following steps:
s1, data collection: collecting multi-day wireless measurement report (MR) data and engineering parameter data; the main index variables used are: time, base station SiteId, own-cell Id, own-cell TA, own-cell RSRP, own-cell frequency point, neighbor-cell NCellId, neighbor-cell frequency point, neighbor-cell RSRP, user longitude, user latitude, own-cell longitude, own-cell latitude, and a co-site flag;
s2, data processing: processing the MR data and the engineering parameter data to obtain new data, selecting from the new data the MR sampling points whose own-cell RSRP value lies within a certain range, counting the MR sampling points of each base station by SiteId, and retaining only the base stations whose number of MR sampling points exceeds a set value;
s3, feature extraction: for each base station (by SiteId), calculating the mean, variance and coefficient of variation of the own-cell RSRP and the neighbor-cell RSRP and the correlation coefficient between the own-cell RSRP and the own-cell TA, and calculating the same statistics separately for each distinct own-cell TA value; these values constitute the feature data of each base station, the co-site flag constitutes its label data, and the two together form a new data set;
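As an illustration only, the per-SiteId statistics of step S3 might be computed as in the following sketch; the function and feature names are assumptions of this sketch, not part of the application:

```python
# Sketch of the step S3 feature extraction for one base station (SiteId).
# Feature names are illustrative; the statistics match the text: mean,
# variance, coefficient of variation, and the RSRP-TA correlation.
from statistics import mean, pvariance
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    denom = sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))
    return cov / denom if denom else 0.0

def station_features(own_rsrp, neigh_rsrp, own_ta):
    """Per-station features from the MR samples of one SiteId."""
    feats = {}
    for name, vals in (("own_rsrp", own_rsrp), ("neigh_rsrp", neigh_rsrp)):
        m, v = mean(vals), pvariance(vals)
        feats[name + "_mean"] = m
        feats[name + "_var"] = v
        # coefficient of variation: std / |mean|
        feats[name + "_cv"] = sqrt(v) / abs(m) if m else 0.0
    feats["rsrp_ta_corr"] = pearson(own_rsrp, own_ta)
    return feats
```

The same function would be applied once per base station and again per distinct TA value to obtain the per-TA statistics described above.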
s4, modeling by an algorithm: dividing the feature-extracted data into a training set and a test set in a certain proportion, training models on the training set with classification algorithms (random forest, GBDT, XGBoost), and verifying the trained models on the test set;
s5, selecting a model: training on the training data with the random forest, GBDT and XGBoost algorithms respectively, obtaining an optimal model for each algorithm through continuous parameter tuning, and verifying each trained model on the test set;
s6, model application: after the final model is selected as above, saving the model, collecting new MR measurement report data and engineering parameter data, processing the data, classifying the base stations with the saved model, and outputting the identification results of all base stations.
Further, the step S2 includes the following substeps:
s21, matching the MR data with the engineering parameter data through the cell Id to obtain the base station to which each cell belongs and the position coordinates (longitude and latitude) of the cell, and deleting from the matched records the entries whose position coordinates are empty;
s22, for the data processed in step S21, calculating the distance from each MR sampling point to the base station from the position coordinates (longitude and latitude) of the user and of the cell, deleting MR sampling points that are too far from the base station and sampling points whose distance is inconsistent with the TA, and obtaining new data;
s23, for the data obtained in step S22, selecting the MR sampling points whose own-cell RSRP value lies within a certain range, counting the MR sampling points of each base station by SiteId, and retaining only the base stations whose number of MR sampling points exceeds a set value.
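The distance and TA consistency checks of substep S22 can be sketched as below. The TA step size (about 78.12 m per TA unit in LTE) and both thresholds are assumptions of this sketch; the application does not specify its own values:

```python
# Illustrative sketch of substep S22: drop MR samples whose user-to-site
# distance is implausible, either absolutely or relative to the reported TA.
from math import radians, sin, cos, asin, sqrt

TA_STEP_M = 78.12       # assumed LTE TA granularity in metres
MAX_DIST_M = 10_000     # assumed cutoff for "too far from the base station"
TA_TOLERANCE_M = 500    # assumed allowed gap between TA-implied and geometric distance

def haversine_m(lon1, lat1, lon2, lat2):
    """Great-circle distance in metres between two (lon, lat) points."""
    lon1, lat1, lon2, lat2 = map(radians, (lon1, lat1, lon2, lat2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6_371_000 * asin(sqrt(a))

def keep_sample(user_lon, user_lat, cell_lon, cell_lat, ta):
    """True if the MR sample survives both S22 filters."""
    dist = haversine_m(user_lon, user_lat, cell_lon, cell_lat)
    if dist > MAX_DIST_M:
        return False                                  # too far from the site
    return abs(dist - ta * TA_STEP_M) <= TA_TOLERANCE_M  # TA consistency check
```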
Further, the algorithm of the random forest in the step S4 includes the following steps:
s411, applying the bootstrap method to draw, randomly and with replacement, K new bootstrap sample sets from the training set and constructing K classification trees from them; the samples not drawn each time form the K out-of-bag data sets;
s412, at each node of each tree, randomly selecting m (m < M) of the M variables, calculating the information content of each, and choosing among the m variables the one with the strongest classification ability for node splitting;
s413, completely generating all decision trees without pruning;
s414, determining the category of the terminal node by the mode category corresponding to the node;
s415, classifying the new observation points by using all trees, wherein the classification is generated by a majority decision principle.
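The loop of steps S411 to S415 can be sketched in plain Python. As a simplifying assumption, a one-feature decision stump stands in for the full unpruned classification tree of steps S412 and S413, and the number of trees and the random seed are illustrative:

```python
# Minimal sketch of the random-forest procedure S411-S415: bootstrap
# sampling with replacement, per-tree training, and majority voting.
import random
from collections import Counter

def train_stump(samples):
    """Toy stand-in for one tree: threshold feature 0 between class means."""
    pos = [x[0] for x, y in samples if y == 1]
    neg = [x[0] for x, y in samples if y == 0]
    if not pos or not neg:            # degenerate bootstrap: constant tree
        label = 1 if pos else 0
        return lambda x: label
    thr = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
    return lambda x: 1 if x[0] >= thr else 0

def random_forest(train, n_trees=25, seed=0):
    rng = random.Random(seed)
    trees = []
    for _ in range(n_trees):
        boot = [rng.choice(train) for _ in train]   # bootstrap sample (S411)
        trees.append(train_stump(boot))             # grow one tree (S412-S413)
    def predict(x):                                 # majority vote (S415)
        votes = Counter(t(x) for t in trees)
        return votes.most_common(1)[0][0]
    return predict
```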
Further, the algorithm of GBDT in step S4 includes the following steps:
s421, initializing the estimates Fk(X) of all samples on the K categories; Fk(X) is a matrix that can be initialized to 0 or set randomly;
s422, repeating the following learning and updating process M times;
s423, performing a Logistic transformation on the function estimate of each sample, converting the estimate into the probability that the sample belongs to a certain class through the following transformation formula:
p_k(x) = exp(F_k(x)) / Σ_{l=1..K} exp(F_l(x))
At initialization every category estimate of a sample is 0, so the probabilities of belonging to each category are equal; as the estimates are updated, the probabilities change correspondingly;
s424, for all samples, computing the probability of each category; note that this step iterates over the categories, not over the samples;
s425, computing the probability gradient of each sample on the K-th class; the gap between the probability that a sample belongs to class K and the indicator of whether it truly belongs to class K is fitted by a regression tree, learned by building a cost function and descending along its gradient; the log-likelihood form of the cost function is:
L = − Σ_{k=1..K} y_k log p_k(x)
Differentiating the cost function yields the negative gradient, i.e. the residual:
ỹ_k = y_k − p_k(x)
s426, learning a regression tree of J leaf nodes along the gradient direction:
taking all samples and the probability residual of each sample on the K-th category as the update direction, a regression tree with J leaves is learned; the basic learning process is the same as for an ordinary regression tree: traverse the feature dimensions of the samples and select a feature as the split point according to the minimum mean-square-error principle, stopping once J leaf nodes have been learned;
s427, obtaining the gain of each leaf node; the gain of leaf node j for class k is computed from the residuals of the samples falling in that leaf:
γ_jk = ((K−1)/K) · Σ_{x_i∈R_jk} ỹ_ik / Σ_{x_i∈R_jk} |ỹ_ik| (1 − |ỹ_ik|)
s428, updating the estimates of all samples under the K-th class; the gain obtained in the previous step is based on the gradient, and the sample estimates are updated with it:
F_k(x) = F_k(x) + Σ_j γ_jk · 1(x ∈ R_jk)
Under the K-th class in the m-th iteration, the estimates F of all samples are obtained by adding, to the estimates of the previous iteration m−1, a gain vector formed by multiplying the gain value of each of the J leaf nodes with the indicator vector of the samples falling in that leaf. After M rounds of iterative learning, the final estimate matrix of all samples under all classes is obtained, and multi-class classification can be performed on the basis of this estimate matrix.
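The probability transform of S423 and the residual of S425 can be sketched directly; this is standard multi-class gradient boosting machinery, not code from the application:

```python
# Sketch of S423/S425: softmax ("Logistic transformation") of the per-class
# scores F_k(x), and the per-class residual y_k - p_k that the next
# regression tree fits as the negative gradient of the log-loss.
from math import exp

def softmax(scores):
    """Convert K class scores F_k(x) into probabilities p_k(x)."""
    exps = [exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def residuals(scores, true_class):
    """Negative gradient of the multi-class log-loss: y_k - p_k."""
    probs = softmax(scores)
    return [(1.0 if k == true_class else 0.0) - p for k, p in enumerate(probs)]
```

With all scores initialized to 0 (step S421), every class probability starts equal, exactly as the text describes.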
Further, the algorithm of XGBoost in step S4 includes the following steps:
s431, defining complexity of the tree: splitting the tree into a structural part q and a leaf node weight part w, wherein w is a vector and represents the output value in each leaf node;
introducing a regularization term Ω(f_t) to control the complexity of the tree and thereby effectively control over-fitting of the model;
s432, the Boosting Tree model in XGBoost: like GBDT, the boosting model of XGBoost also fits residuals; the difference is that minimum square loss is not necessarily used when selecting split nodes. The loss function is as follows; compared with GBDT, a regularization term based on the complexity of the tree model is added:
s433, rewriting the objective function: in XGBoost the loss function is expanded directly into a quadratic function by a second-order Taylor expansion (provided the loss function has continuous first- and second-order derivatives), and the set of samples falling in leaf j is defined as I_j = { i | q(x_i) = j }.
The objective function can then be converted into:
Obj = Σ_{j=1..T} [ G_j w_j + (1/2)(H_j + λ) w_j² ] + γT, where G_j = Σ_{i∈I_j} g_i, H_j = Σ_{i∈I_j} h_i and T is the number of leaves.
At this point, taking the derivative with respect to w_j and setting it to 0 gives:
w_j* = − G_j / (H_j + λ), and hence Obj* = −(1/2) Σ_{j=1..T} G_j² / (H_j + λ) + γT
s434, the scoring function of the tree structure: the Obj value above represents the largest reduction of the objective achievable once a tree structure is fixed, and can be called a structure score; like the Gini index, it can be regarded as a more general function that scores a tree structure. All possible tree structures could be enumerated, their structure scores compared, and the one with the smallest Obj chosen as the optimal structure, but this is computationally expensive. The more common approach is greedy: each step attempts to split an existing leaf node (the first leaf node being the root), and the gain after splitting is:
Gain = (1/2) [ G_L²/(H_L+λ) + G_R²/(H_R+λ) − (G_L+G_R)²/(H_L+H_R+λ) ] − γ
Gain is used as the criterion for whether to split: if Gain < 0, the leaf node is not split. All candidate split schemes still have to be evaluated for each split; in practice the gradients g_i of all samples are sorted first, so that the G_L and G_R of every split position can be obtained in a single scan over the samples, after which the split is chosen by its Gain score.
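The split gain and optimal leaf weight of S433 and S434 follow the standard XGBoost formulas and can be sketched as below; `lam` (λ) and `gamma` (γ) are the regularization constants:

```python
# Sketch of the S434 split gain and the S433 optimal leaf weight.
# G and H are the sums of first- and second-order gradients of the samples
# falling in each child; lam and gamma regularize leaf weights and count.
def split_gain(g_left, h_left, g_right, h_right, lam=1.0, gamma=0.0):
    def score(g, h):
        return g * g / (h + lam)
    return 0.5 * (score(g_left, h_left) + score(g_right, h_right)
                  - score(g_left + g_right, h_left + h_right)) - gamma

def leaf_weight(g, h, lam=1.0):
    """Optimal leaf weight w_j* = -G_j / (H_j + lambda), from dObj/dw_j = 0."""
    return -g / (h + lam)
```

A split is kept only when `split_gain(...) > 0`, matching the Gain < 0 rule in the text.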
Further, the verification in step S5 is to calculate the precision, recall and F1 value of each model; the calculation formulas are as follows:
precision = TP / (TP + FP), recall = TP / (TP + FN), F1 = 2 · precision · recall / (precision + recall)
wherein TP is the number of positive samples correctly predicted as positive, FP is the number of negative samples wrongly predicted as positive, and FN is the number of positive samples wrongly predicted as negative;
As the definitions of recall and precision show, improving one tends, to a certain extent, to lower the other, so the F1 value combines the two to reflect the identification effect. The F1 values of the three models on the test set are compared, the model with the largest F1 value is selected as the final model, and its classification result is output.
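The model-selection metrics of step S5 reduce to three one-line functions over the TP/FP/FN counts:

```python
# Sketch of the step S5 metrics: precision, recall and F1 from the
# true-positive, false-positive and false-negative counts.
def precision(tp, fp):
    return tp / (tp + fp) if tp + fp else 0.0

def recall(tp, fn):
    return tp / (tp + fn) if tp + fn else 0.0

def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall."""
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r) if p + r else 0.0
```

The model with the largest `f1_score` on the test set would be retained as the final model.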
The application has the technical effects and advantages that:
compared with the prior art, the method and the device have the advantages that the data measured by the different-network frequency band signals in the wireless environment measurement report MR are cleaned, and then the classification of the co-sited sites is realized by adopting a machine learning method. Through verification, the method successfully overcomes the influence of inaccurate site information in the resource management system, can accurately identify whether the base station is a shared base station, provides powerful support for the landing shared by operators, and is a scientific, effective and low-cost solution.
Drawings
FIG. 1 is a flow chart of the present application.
Detailed Description
The following clearly and completely describes the technical solutions in the embodiments of the present application with reference to the accompanying drawings; the described embodiments are apparently only some, not all, of the embodiments of the application. All other embodiments obtained by those skilled in the art on the basis of the embodiments of the application without inventive effort fall within the protection scope of the application.
The method for identifying co-sited base stations based on big data shown in FIG. 1 comprises the following steps:
s1, data collection: collecting multi-day wireless measurement report (MR) data and engineering parameter data; the main index variables used are: time, base station SiteId, own-cell Id, own-cell TA, own-cell RSRP, own-cell frequency point, neighbor-cell NCellId, neighbor-cell frequency point, neighbor-cell RSRP, user longitude, user latitude, own-cell longitude, own-cell latitude, and a co-site flag;
s2, data processing: processing the MR data and the engineering parameter data to obtain new data, selecting from the new data the MR sampling points whose own-cell RSRP value lies within a certain range, counting the MR sampling points of each base station by SiteId, and retaining only the base stations whose number of MR sampling points exceeds a set value;
step S2 comprises the following sub-steps:
s21, matching the MR data with the engineering parameter data through the cell Id to obtain the base station to which each cell belongs and the position coordinates (longitude and latitude) of the cell, and deleting from the matched records the entries whose position coordinates are empty;
s22, for the data processed in step S21, calculating the distance from each MR sampling point to the base station from the position coordinates (longitude and latitude) of the user and of the cell, deleting MR sampling points that are too far from the base station and sampling points whose distance is inconsistent with the TA, and obtaining new data;
s23, for the data obtained in step S22, selecting the MR sampling points whose own-cell RSRP value lies within a certain range, counting the MR sampling points of each base station by SiteId, and retaining only the base stations whose number of MR sampling points exceeds a set value;
s3, feature extraction: for each base station (by SiteId), calculating the mean, variance and coefficient of variation of the own-cell RSRP and the neighbor-cell RSRP and the correlation coefficient between the own-cell RSRP and the own-cell TA, and calculating the same statistics separately for each distinct own-cell TA value; these values constitute the feature data of each base station, the co-site flag constitutes its label data, and the two together form a new data set;
s4, modeling by an algorithm: dividing the feature-extracted data into a training set and a test set in a certain proportion, training models on the training set with classification algorithms (random forest, GBDT, XGBoost), and verifying the trained models on the test set;
the algorithm of the random forest comprises the following steps:
s411, applying the bootstrap method to draw, randomly and with replacement, K new bootstrap sample sets from the training set and constructing K classification trees from them; the samples not drawn each time form the K out-of-bag data sets;
s412, at each node of each tree, randomly selecting m (m < M) of the M variables, calculating the information content of each, and choosing among the m variables the one with the strongest classification ability for node splitting;
s413, completely generating all decision trees without pruning;
s414, determining the category of the terminal node by the mode category corresponding to the node;
s415, classifying the new observation points by using all trees, wherein the classification is generated by a majority decision principle;
the algorithm of GBDT comprises the following steps:
s421, initializing the estimates Fk(X) of all samples on the K categories; Fk(X) is a matrix that can be initialized to 0 or set randomly;
s422, repeating the following learning and updating process M times;
s423, performing a Logistic transformation on the function estimate of each sample, converting the estimate into the probability that the sample belongs to a certain class through the following transformation formula:
p_k(x) = exp(F_k(x)) / Σ_{l=1..K} exp(F_l(x))
At initialization every category estimate of a sample is 0, so the probabilities of belonging to each category are equal; as the estimates are updated, the probabilities change correspondingly;
s424, for all samples, computing the probability of each category; note that this step iterates over the categories, not over the samples;
s425, computing the probability gradient of each sample on the K-th class; the gap between the probability that a sample belongs to class K and the indicator of whether it truly belongs to class K is fitted by a regression tree, learned by building a cost function and descending along its gradient; the log-likelihood form of the cost function is:
L = − Σ_{k=1..K} y_k log p_k(x)
Differentiating the cost function yields the negative gradient, i.e. the residual:
ỹ_k = y_k − p_k(x)
s426, learning a regression tree of J leaf nodes along the gradient direction:
taking all samples and the probability residual of each sample on the K-th category as the update direction, a regression tree with J leaves is learned; the basic learning process is the same as for an ordinary regression tree: traverse the feature dimensions of the samples and select a feature as the split point according to the minimum mean-square-error principle, stopping once J leaf nodes have been learned;
s427, obtaining the gain of each leaf node; the gain of leaf node j for class k is computed from the residuals of the samples falling in that leaf:
γ_jk = ((K−1)/K) · Σ_{x_i∈R_jk} ỹ_ik / Σ_{x_i∈R_jk} |ỹ_ik| (1 − |ỹ_ik|)
s428, updating the estimates of all samples under the K-th class; the gain obtained in the previous step is based on the gradient, and the sample estimates are updated with it:
F_k(x) = F_k(x) + Σ_j γ_jk · 1(x ∈ R_jk)
Under the K-th class in the m-th iteration, the estimates F of all samples are obtained by adding, to the estimates of the previous iteration m−1, a gain vector formed by multiplying the gain value of each of the J leaf nodes with the indicator vector of the samples falling in that leaf. After M rounds of iterative learning, the final estimate matrix of all samples under all classes is obtained, and multi-class classification can be performed on the basis of this estimate matrix;
the algorithm of Xgboost comprises the following steps:
s431, defining complexity of the tree: splitting the tree into a structural part q and a leaf node weight part w, wherein w is a vector and represents the output value in each leaf node;
introducing a regularization term Ω(f_t) to control the complexity of the tree and thereby effectively control over-fitting of the model;
s432, the Boosting Tree model in XGBoost: like GBDT, the boosting model of XGBoost also fits residuals; the difference is that minimum square loss is not necessarily used when selecting split nodes. The loss function is as follows; compared with GBDT, a regularization term based on the complexity of the tree model is added:
s433, rewriting the objective function: in XGBoost the loss function is expanded directly into a quadratic function by a second-order Taylor expansion (provided the loss function has continuous first- and second-order derivatives), and the set of samples falling in leaf j is defined as I_j = { i | q(x_i) = j }.
The objective function can then be converted into:
Obj = Σ_{j=1..T} [ G_j w_j + (1/2)(H_j + λ) w_j² ] + γT, where G_j = Σ_{i∈I_j} g_i, H_j = Σ_{i∈I_j} h_i and T is the number of leaves.
At this point, taking the derivative with respect to w_j and setting it to 0 gives:
w_j* = − G_j / (H_j + λ), and hence Obj* = −(1/2) Σ_{j=1..T} G_j² / (H_j + λ) + γT
s434, the scoring function of the tree structure: the Obj value above represents the largest reduction of the objective achievable once a tree structure is fixed, and can be called a structure score; like the Gini index, it can be regarded as a more general function that scores a tree structure. All possible tree structures could be enumerated, their structure scores compared, and the one with the smallest Obj chosen as the optimal structure, but this is computationally expensive. The more common approach is greedy: each step attempts to split an existing leaf node (the first leaf node being the root), and the gain after splitting is:
Gain = (1/2) [ G_L²/(H_L+λ) + G_R²/(H_R+λ) − (G_L+G_R)²/(H_L+H_R+λ) ] − γ
Gain is used as the criterion for whether to split: if Gain < 0, the leaf node is not split. All candidate split schemes still have to be evaluated for each split; in practice the gradients g_i of all samples are sorted first, so that the G_L and G_R of every split position can be obtained in a single scan over the samples, after which the split is chosen by its Gain score;
s5, selecting a model: training on the training data with the random forest, GBDT and XGBoost algorithms respectively, obtaining an optimal model for each algorithm through continuous parameter tuning, and verifying each trained model on the test set;
the verification in step S5 is to calculate the accuracy, recall and F1 of each model, and the calculation formula is as follows:
wherein TP is the number of positive classes, FP is the number of negative classes, FN is the number of positive classes;
the definition of the recall rate and the accuracy rate shows that the improvement of a certain accuracy rate can cause the reduction of another accuracy rate to a certain extent, so that the F1 value can be used for comprehensively displaying the identification effect, the sizes of the three models are compared according to the F1 values of the three models on the test set, the model with the largest F1 value is selected as the final model, and the classification result is output;
s6, model application: after the final model is selected as above, saving the model, collecting new MR measurement report data and engineering parameter data, processing the data, classifying the base stations with the saved model, and outputting the identification results of all base stations.
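The save-and-reuse flow of step S6 might look like the following sketch; the file path, the stored parameter layout and the toy threshold classifier are illustrative assumptions standing in for the trained final model:

```python
# Sketch of step S6: persist the chosen model's parameters and reuse them
# to classify freshly collected per-station features.
import pickle

def save_model(model_params, path):
    """Persist the trained model (here: any picklable parameter object)."""
    with open(path, "wb") as f:
        pickle.dump(model_params, f)

def load_model(path):
    with open(path, "rb") as f:
        return pickle.load(f)

def classify_stations(model_params, station_features):
    """Toy stand-in for the saved final model: flag a station as co-sited
    (label 1) when its first feature exceeds the stored threshold."""
    thr = model_params["threshold"]
    return {site: int(feats[0] > thr) for site, feats in station_features.items()}
```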
The last points to be described are: first, in the description of the present application, it should be noted that, unless otherwise specified and defined, the terms "mounted," "connected," and "connected" are to be construed broadly, and may be mechanical or electrical, or may be a direct connection between two elements, and "upper," "lower," "left," "right," etc. are merely used to indicate relative positional relationships, which may be changed when the absolute position of the object being described is changed;
secondly: in the drawings of the disclosed embodiments, only the structures related to the embodiments of the present disclosure are referred to, and other structures can refer to the common design, so that the same embodiment and different embodiments of the present disclosure can be combined with each other under the condition of no conflict;
finally: the foregoing description of the preferred embodiments of the application is not intended to limit the application to the precise form disclosed, and any such modifications, equivalents, and alternatives falling within the spirit and principles of the application are intended to be included within the scope of the application.