Summary of the invention
The technical task of the present invention is to remedy the deficiencies of the prior art by providing a practical method for the automatic capture and intelligent analysis of Internet-based tax data.
The technical scheme of the present invention is realized as follows. The concrete steps of this method for the automatic capture and intelligent analysis of Internet-based tax data are:
Step one: build an e-commerce tax source management cloud platform, and set up a big-data search engine on this cloud platform based on web crawler technology;
Step two: deploy a Hadoop cluster, then carry out data acquisition;
Step three: dynamically capture tax-related data on Internet commerce websites and web pages;
Step four: carry out data acquisition and extraction based on "vertical search crawler" technology, converting the target data from "heterogeneous" to "homogeneous" form;
Step five: store the acquired data on the nodes of the cloud platform servers, so that massive tax-related data is stored in a distributed manner and managed uniformly;
Step six: carry out intelligent data analysis: extract, clean, transform, and load the data, and extract the data that satisfy the specified conditions.
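As a rough illustration of the distributed storage in step five, the following sketch assigns each collected record to storage nodes by hashing a record key. The function `assign_nodes`, the node names, and the `replicas` parameter are hypothetical and do not come from the invention; in an actual Hadoop deployment, HDFS decides block placement and replication itself, so this only shows the general idea of spreading data across nodes.

```python
import hashlib

def assign_nodes(record_key, nodes, replicas=2):
    """Pick `replicas` distinct nodes for a record, starting at a hashed position.

    Hypothetical placement rule for illustration only; HDFS uses its own policy.
    """
    if not nodes:
        raise ValueError("at least one storage node is required")
    replicas = min(replicas, len(nodes))
    # A stable hash keeps the same key on the same nodes across runs.
    digest = hashlib.sha256(record_key.encode("utf-8")).hexdigest()
    start = int(digest, 16) % len(nodes)
    return [nodes[(start + i) % len(nodes)] for i in range(replicas)]
```

Calling `assign_nodes("shop-001", ["node-a", "node-b", "node-c"])` returns two distinct node names, and the same key always maps to the same nodes.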
The detailed steps for building the e-commerce tax source management cloud platform in step one are:
First step: based on the cloud computing platform, build an e-commerce data search engine, deploy search components in a distributed manner at backbone network nodes, collect business operation data from domestic e-commerce websites, and establish an enterprise operation database;
Second step: analyze the trading rules of the e-commerce platforms and, by cleaning the collected data and applying rule conversion, process business operation data of different formats into a unified enterprise tax-related database;
Third step: based on the enterprise tax-related database, link each enterprise's tax registration, tax declaration, identification, and invoice data, establish a system of data analysis models organized by industry, product, region, and enterprise, and build a specialized tax source management platform.
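The analysis models organized by industry, product, region, and enterprise can be pictured as aggregations of the unified records along one dimension at a time. The sketch below is a minimal illustration under assumed field names (`industry`, `region`, `amount`, and so on); the source does not define the record schema or the models themselves.

```python
from collections import defaultdict

def total_by(records, dimension):
    """Sum the `amount` field of each record, grouped by the given dimension."""
    totals = defaultdict(float)
    for record in records:
        totals[record[dimension]] += record["amount"]
    return dict(totals)

# Hypothetical unified records; the field names are assumptions.
sample = [
    {"enterprise": "E1", "industry": "apparel", "region": "north", "amount": 100.0},
    {"enterprise": "E2", "industry": "apparel", "region": "south", "amount": 50.0},
    {"enterprise": "E1", "industry": "apparel", "region": "north", "amount": 25.0},
]
by_region = total_by(sample, "region")
by_enterprise = total_by(sample, "enterprise")
```

The same function serves all four dimensions because only the grouping key changes.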
The detailed process of step three is:
First step: determine the acquisition tasks;
Second step: for each acquisition task, determine the target data sources available for collection;
Third step: configure acquisition separately for each target data source, to ensure that the data can be collected;
Fourth step: schedule the acquisition tasks, keep them synchronized with updates on the target sites, and crawl incrementally;
Fifth step: collect the resulting data and complete the conversion from heterogeneous to homogeneous data;
Sixth step: publish the data to the application platform through a publishing server.
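The configuration and scheduling steps above can be sketched as a small task table plus a function that selects the tasks due for a re-crawl. The class `AcquisitionTask`, its fields, and the interval-based rule are assumptions made for illustration; the source does not say how tasks are scheduled.

```python
from dataclasses import dataclass

@dataclass
class AcquisitionTask:
    name: str               # acquisition task (first step)
    source_url: str         # target data source (second step)
    interval_s: int         # per-source re-sync interval (third and fourth steps)
    last_run: float = 0.0   # time of the previous crawl; 0 means never crawled

def due_tasks(tasks, now):
    """Return tasks whose re-crawl interval has elapsed, least recently run first."""
    due = [t for t in tasks if now - t.last_run >= t.interval_s]
    return sorted(due, key=lambda t: t.last_run)

# Hypothetical tasks; the URLs are placeholders.
tasks = [
    AcquisitionTask("A", "https://example.com/a", interval_s=600, last_run=500.0),
    AcquisitionTask("B", "https://example.com/b", interval_s=300, last_run=600.0),
    AcquisitionTask("C", "https://example.com/c", interval_s=60),
]
ready = [t.name for t in due_tasks(tasks, now=1000.0)]
```

A scheduler loop would crawl each returned task and then set its `last_run` to the current time.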
The detailed process of step four is: using vertical search crawler technology, the vertical search engine extracts specific structured information from the unstructured data of web pages. By addressing, collecting, extracting, cleaning, mining, and processing data in real time for specific trading platforms and industries, the target data is converted from "heterogeneous" to "homogeneous" form, and the result data is finally stored locally. After deep processing, the data forms usable data in both unstructured and structured forms.
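One way to picture the conversion from "heterogeneous" to "homogeneous" is a per-site mapping from page markup to a shared set of field names. The sketch below uses Python's standard `html.parser`; the class names in the sample pages, the mapping dictionaries, and the output fields `seller` and `amount` are all hypothetical, and a real crawler would need far more robust extraction rules.

```python
from html.parser import HTMLParser

class FieldExtractor(HTMLParser):
    """Collect the text of elements whose class attribute names a wanted field."""

    def __init__(self, class_to_field):
        super().__init__()
        self.class_to_field = class_to_field
        self.current = None   # unified field name awaiting its text, if any
        self.record = {}

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class")
        if cls in self.class_to_field:
            self.current = self.class_to_field[cls]

    def handle_data(self, data):
        if self.current and data.strip():
            self.record[self.current] = data.strip()
            self.current = None

def extract(html, class_to_field):
    """Parse one page with a site-specific mapping and return a unified record."""
    parser = FieldExtractor(class_to_field)
    parser.feed(html)
    parser.close()
    return parser.record

# Two hypothetical sites with different markup yield the same record shape.
site_a = '<div><span class="shop">Acme Store</span><b class="price">19.90</b></div>'
site_b = '<tr><td class="vendor">Acme Store</td><td class="cost">19.90</td></tr>'
rec_a = extract(site_a, {"shop": "seller", "price": "amount"})
rec_b = extract(site_b, {"vendor": "seller", "cost": "amount"})
```

Supporting another site then means adding a mapping rather than changing the downstream storage or analysis code.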
The beneficial effects of the present invention compared with the prior art are:
The method for the automatic capture and intelligent analysis of Internet-based tax data of the present invention is aimed mainly at online shop data, which is large in volume and concentrated on the network. It provides high data reliability, scalability, efficiency, and fault tolerance, and allows massive tax-related data to be stored in a distributed manner and managed uniformly. It improves the response speed and processing capacity of data conversion, loading, data access, data query, and multidimensional analysis. Tax data is captured and stored, and is then analyzed intelligently according to the conditions the client requires. This meets the current market demand for tax data and promotes continuous improvement in the level of specialized tax source management. The method is practical and easy to promote.
Embodiment
The method for the automatic capture and intelligent analysis of Internet-based tax data of the present invention is described in detail below with reference to the accompanying drawing.
The key technology adopted by the present invention is distributed processing with Hadoop: a Hadoop cluster is deployed, and the data is then stored on the nodes of the cloud platform servers, so that massive tax-related data is stored in a distributed manner and managed uniformly. Intelligent data analysis is then carried out. The big-data search engine used for data filtering is built on the cloud platform and, by means of technologies such as a distributed file system, distributed data storage, and database clusters, cleans the collected data and applies rule conversion, processing business operation data of different formats into a unified enterprise tax-related database. As shown in Figure 1, a method for the automatic capture and intelligent analysis of Internet-based tax data is now provided, the concrete steps of which are:
Step one: build an e-commerce tax source management cloud platform, and set up a big-data search engine on this cloud platform based on web crawler technology.
Step two: deploy a Hadoop cluster, then carry out data acquisition.
Step three: dynamically capture tax-related data on Internet commerce websites and web pages.
Step four: because domestic e-commerce websites do not follow international e-commerce standards, a separate "crawler" specification must be drawn up for each Internet site. Once the specifications are in place, data acquisition and extraction are carried out on the "vertical search crawler" principle, completing the complex conversion of the target data from "heterogeneous" to "homogeneous" form.
Step five: store the acquired data on the nodes of the cloud platform servers, so that massive tax-related data is stored in a distributed manner and managed uniformly.
Step six: carry out intelligent data analysis: extract, clean, transform, and load the data, and extract the data that satisfy the specified conditions.
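The extract, clean, transform, load sequence of step six can be sketched in a few lines. The function `run_etl`, the field names, and the example condition are assumptions for illustration; the source does not specify the cleaning rules or the conditions the data must satisfy.

```python
def run_etl(raw_rows, predicate):
    """Clean and convert raw rows, then keep only records satisfying `predicate`."""
    loaded = []
    for row in raw_rows:
        # Clean: trim whitespace and drop thousands separators.
        seller = (row.get("seller") or "").strip()
        amount_text = str(row.get("amount", "")).replace(",", "").strip()
        if not seller or not amount_text:
            continue  # incomplete rows are discarded
        # Transform: convert the amount to a number, skipping unparseable values.
        try:
            amount = float(amount_text)
        except ValueError:
            continue
        record = {"seller": seller, "amount": amount}
        # Load: keep only the records that satisfy the caller's condition.
        if predicate(record):
            loaded.append(record)
    return loaded

# Hypothetical raw rows and a hypothetical condition (amount of at least 100).
raw = [
    {"seller": " Acme ", "amount": "1,200.50"},
    {"seller": "", "amount": "5"},
    {"seller": "Beta", "amount": "n/a"},
    {"seller": "Gamma", "amount": "80"},
]
selected = run_etl(raw, lambda r: r["amount"] >= 100)
```

Passing the condition as a function keeps the cleaning logic independent of any particular client requirement.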
The e-commerce tax source management cloud platform in step one uses technologies such as a cloud computing search engine, a distributed file system, and distributed data storage, and integrates them fully with the business of specialized tax source management, achieving unified collection, centralized storage, and risk analysis and early warning for tax source data. In addition, the characteristics of e-commerce enterprises are studied in depth, the trading rules of the existing domestic B2B, B2C, and C2C e-commerce platforms are mastered, industry and product classifications are sorted out, and statistical standards for e-commerce products are established. The framework of the specialized e-commerce tax source management cloud platform is thus constructed, and the platform is built in three steps. The detailed steps of step one are:
First step: based on the cloud computing platform, build an e-commerce data search engine, deploy search components in a distributed manner at domestic backbone network nodes, collect business operation data from domestic e-commerce websites, and establish an enterprise operation database.
Second step: analyze the trading rules of the B2B, B2C, and C2C e-commerce platforms and, by cleaning the collected data and applying rule conversion, process business operation data of different formats into a unified enterprise tax-related database.
Third step: based on the enterprise tax-related database, link data such as each enterprise's tax registration, tax declaration, identification, and invoices, establish a system of data analysis models organized by industry, product, region, and enterprise, and build a specialized tax source management platform. These steps provide technical support for e-commerce tax source management.
In step three, the key technology of data acquisition is the use of Internet "crawler" technology to dynamically capture tax-related data on Internet commerce websites and web pages. The crawler captures Internet tax-related data according to configured data collection rules, which are configured in the following steps:
First step: determine the acquisition tasks;
Second step: for each acquisition task, determine the target data sources available for collection;
Third step: configure acquisition separately for each target data source, to ensure that the data can be collected;
Fourth step: schedule the acquisition tasks, keep them synchronized with updates on the target sites, and crawl incrementally;
Fifth step: collect the resulting data and complete the conversion from heterogeneous to homogeneous data;
Sixth step: publish the data to the application platform through a publishing server.
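Incremental crawling in the fourth step means reprocessing only the pages that have changed since the previous pass. A minimal sketch, assuming change detection by content fingerprint (the source does not state how changes are detected), is shown here; `changed_pages` and the page dictionaries are hypothetical.

```python
import hashlib

def changed_pages(pages, seen):
    """Return URLs whose content differs from the stored fingerprint.

    `pages` maps URL to page text; `seen` maps URL to the last fingerprint
    and is updated in place so the next pass compares against this one.
    """
    changed = []
    for url, body in pages.items():
        fingerprint = hashlib.sha256(body.encode("utf-8")).hexdigest()
        if seen.get(url) != fingerprint:
            seen[url] = fingerprint
            changed.append(url)
    return changed

# Hypothetical crawl passes over placeholder URLs.
seen = {}
first = changed_pages({"https://example.com/1": "v1", "https://example.com/2": "v1"}, seen)
second = changed_pages({"https://example.com/1": "v1", "https://example.com/2": "v2"}, seen)
```

On the second pass only the page whose content changed is returned, so unchanged pages need not be extracted or stored again.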
The detailed process of step four is: "vertical search crawler" technology is introduced into data acquisition to address, collect, extract, clean, mine, and process data in real time, completing the complex conversion of the target data from "heterogeneous" to "homogeneous" form, and the result data is finally stored in a local structured database. The big-data search engine used for data filtering is built on a public service cloud platform and, by means of technologies such as a distributed file system, distributed data storage, and NoSQL database clusters, cleans the collected data and applies rule conversion, processing business operation data of different formats into a unified enterprise tax-related database.
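Storing the unified result data in a local structured database can be illustrated with Python's built-in `sqlite3` module. SQLite is used here only as a stand-in; the source names no specific database product, and the table name `tax_records` and its columns are assumptions.

```python
import sqlite3

def store_records(conn, records):
    """Create the (hypothetical) table if needed and insert unified records."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS tax_records ("
        "seller TEXT NOT NULL, platform TEXT NOT NULL, amount REAL NOT NULL)"
    )
    conn.executemany(
        "INSERT INTO tax_records (seller, platform, amount) VALUES (?, ?, ?)",
        [(r["seller"], r["platform"], r["amount"]) for r in records],
    )
    conn.commit()

def total_for_seller(conn, seller):
    """Return the summed amount for one seller, or 0 if there are no rows."""
    row = conn.execute(
        "SELECT COALESCE(SUM(amount), 0) FROM tax_records WHERE seller = ?",
        (seller,),
    ).fetchone()
    return row[0]

# In-memory database with hypothetical records from two platforms.
conn = sqlite3.connect(":memory:")
store_records(conn, [
    {"seller": "Acme", "platform": "site-a", "amount": 100.0},
    {"seller": "Acme", "platform": "site-b", "amount": 50.0},
    {"seller": "Beta", "platform": "site-a", "amount": 70.0},
])
acme_total = total_for_seller(conn, "Acme")
```

Because records from every platform share one schema, a single query covers all sources.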
The Internet tax data automatic capture and intelligent data analysis of the present invention automatically compile enterprise tax-related data. On the one hand, the data obtained by the cloud platform is verified; on the other hand, the organizational structure, operating characteristics, and management style of e-commerce enterprises are examined, their differences from the physical (brick-and-mortar) operating model are analyzed, and tax risk points are identified.
As the exploration and practice of tax source management for e-commerce enterprises continue to deepen, the specialized e-commerce tax source management cloud platform based on cloud computing technology will gradually become an important tool for specialized tax source management. Deep mining, analysis, and use of Internet data will promote continuous improvement in the level of specialized tax source management.
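The source does not describe how the collected data is verified or how risk points are found. One possible reading, shown here purely as an assumption, is to compare the sales observed on the platforms with the revenue each enterprise declared and to flag large gaps. The function `flag_risk`, the 10% tolerance, and the sample figures are all hypothetical.

```python
def flag_risk(collected, declared, tolerance=0.10):
    """Flag enterprises whose observed sales exceed declared revenue by more
    than `tolerance` (a hypothetical threshold, expressed as a fraction)."""
    flagged = []
    for enterprise, observed in collected.items():
        reported = declared.get(enterprise, 0.0)  # no declaration counts as zero
        if observed > reported * (1 + tolerance):
            flagged.append(enterprise)
    return sorted(flagged)

# Hypothetical figures: observed online sales versus declared revenue.
collected = {"A": 1000.0, "B": 500.0, "C": 300.0}
declared = {"A": 950.0, "B": 300.0}
risky = flag_risk(collected, declared)
```

Enterprise A stays within the tolerance, while B and C are flagged: B under-declares relative to observed sales and C has no declaration at all.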
The foregoing is merely a preferred embodiment of the present invention and is not intended to limit it. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.