CN106570153A - Data extraction method and system for mass URLs - Google Patents

Data extraction method and system for mass URLs

Info

Publication number
CN106570153A
CN106570153A
Authority
CN
China
Prior art keywords
data
text data
file system
hive
hdfs1
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201610970427.3A
Other languages
Chinese (zh)
Inventor
欧阳涛 (Ouyang Tao)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Feixun Data Communication Technology Co Ltd
Original Assignee
Shanghai Feixun Data Communication Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Feixun Data Communication Technology Co Ltd
Priority to CN201610970427.3A
Publication of CN106570153A
Legal status: Pending

Abstract

The invention discloses a data extraction method for mass URLs. The method comprises the following steps: S10, collecting each piece of text data into a local file pool using a distributed web server framework; S20, uploading the total text data accumulated in the local file pool to the Hadoop cloud distributed file system hdfs1; and S40, using the Hadoop data warehouse tool Hive to extract URL keywords in a distributed manner from the total text data in the cloud distributed file system. With this method, in a big-data application scenario, each piece of text data is first gathered into the local file pool, the accumulated total text data is then uploaded to the cloud distributed file system, and Hive performs distributed computation for distributed extraction; the method therefore has the advantages of high efficiency and low resource consumption.

Description

Data extraction method and system for mass URLs
Technical field
The invention belongs to the field of data extraction techniques, and more particularly relates to a data extraction method and system for mass URLs.
Background technology
With the rapid development of the Internet, analyzing the patterns and personalized habits that users exhibit when using Internet resources (that is, user behavior analysis) makes it possible to extract and recognize user interests. On the one hand, content can be personalized and pushed to users, providing more proactive, intelligent services for website visitors. On the other hand, by discovering interests and preferences from the different manifestations of user behavior, the membership relations between pages can be optimized and the website system architecture improved, reducing the burden on users of finding information, making operation simpler, and saving time and effort.
At present, when analyzing user behavior, a large-scale website typically has a huge number of online users and generates a huge amount of real-time behavior and context information. The system therefore needs high storage capacity and computing speed so that analysis results can be fed back to users in time.
In the prior art, most user behavior analysis systems directly use local relational database technology and traditional data extraction methods. However, as data grows to massive scale, prior-art data extraction methods consume large amounts of resources and memory and are inefficient, and thus cannot adequately support the efficient analysis of mass data.
Summary of the invention
The technical scheme provided by the present invention is as follows:
The present invention provides a data extraction method for mass URLs, comprising the following steps: S10, collecting each piece of text data into a local file pool using a distributed web server framework; S20, uploading the total text data accumulated in the local file pool to the Hadoop cloud distributed file system hdfs1; S40, using the Hadoop data warehouse tool Hive to extract URL keywords in a distributed manner from the total text data in the cloud distributed file system hdfs1.
Further, the method also comprises the following steps: S30, the data warehouse tool Hive sends a computation request to the open-source computing framework TEZ; S31, the open-source computing framework TEZ compresses and encodes the total text data into compressed text data and stores it in the database of the cloud distributed file system hdfs1.
Further, step S40 further comprises: S41, extracting URL keywords from the compressed text data using the UDF functions of the data warehouse tool Hive, and outputting the type and frequency of the data accessed by each user.
Further, step S20 further comprises: S21, extracting the text data from the local file pool; S22, accumulating and merging the text data into total text data according to the block size of the cloud distributed file system hdfs1; S23, uploading the total text data to the cloud distributed file system hdfs1 using the local distributed file system hdfs2.
Further, step S21 further comprises: S211, extracting the router MAC address and timestamp from the filename of the text data; S212, identifying whether the router MAC address and timestamp contain garbled characters; S213, when the router MAC address or timestamp contains garbled characters, cleaning the garbled characters and then jumping to step S22; otherwise, jumping directly to step S22.
Further, before step S10 the method also comprises: S01, building a Hadoop cluster environment and configuring the data warehouse tool Hive, the cloud distributed file system hdfs1, and the local distributed file system hdfs2; S02, building a web server distributed cluster at each node in the cluster environment and adding load balancing; S03, establishing table-building associations among the data warehouse tool Hive, the cloud distributed file system hdfs1, and the local distributed file system hdfs2, and reconstructing the UDF functions of the data warehouse tool Hive.
Further, step S01 further comprises: S011, building a first predetermined number of master nodes and a second predetermined number of slave nodes on Hadoop; the master nodes are interconnected with each other, and each master node is connected to each slave node.
Further, step S01 further comprises: S012, setting up a metadata service component metastore and a relational database mysql on each master node.
The present invention also provides a system implementing the data extraction method for mass URLs, comprising: a web server framework, which collects each piece of text data into a local file pool using a distributed web server framework; a local distributed file system hdfs2, which uploads the total text data accumulated in the local file pool to the Hadoop cloud distributed file system hdfs1; and a data warehouse tool Hive, which extracts URL keywords in a distributed manner from the total text data in the cloud distributed file system hdfs1.
Further, the system also comprises the open-source computing framework TEZ; the data warehouse tool Hive sends computation requests to the open-source computing framework TEZ, which compresses and encodes the total text data into compressed text data and stores it in the database of the cloud distributed file system hdfs1.
Compared with the prior art, the data extraction method and system for mass URLs provided by the present invention have the following beneficial effects:
1) In the present invention, in a big-data application scenario, each piece of text data is gathered into the local file pool, the total text data is uploaded to the cloud distributed file system, and Hive then performs distributed computation for distributed extraction; this gives the advantages of high efficiency and low resource consumption.
2) In the present invention, the total text data is compressed and encoded; the compressed total text data occupies less space, effectively solving the resource consumption and memory problems of a local relational database, and the encoded total text data can be extracted in order, ensuring that extraction runs smoothly.
3) In the present invention, the text data is accumulated and merged into total text data according to the block size of the cloud distributed file system before the total text data is uploaded; this prevents the total text data from becoming too large and blocking the distributed file system.
4) In the present invention, the router MAC address and timestamp are extracted from the filename of the text data, and any garbled characters encountered are cleaned, providing a guarantee that data extraction proceeds smoothly.
Description of the drawings
In the following, the preferred embodiments are described with reference to the drawings in a clear and easily understood way, further explaining the above-mentioned characteristics, technical features, advantages, and implementations of the data extraction method and system for mass URLs.
Fig. 1 is a schematic flowchart of a data extraction method for mass URLs according to the invention;
Fig. 2 is a schematic flowchart of another data extraction method for mass URLs according to the invention;
Fig. 3 is a schematic flowchart of step S20 in the present invention;
Fig. 4 is a partial schematic flowchart of the data extraction method for mass URLs in the present invention;
Fig. 5 is a partial schematic flowchart of step S01 in the present invention;
Fig. 6 is a schematic structural diagram of a data extraction system for mass URLs according to the invention;
Fig. 7 is a schematic diagram of yet another data extraction method for mass URLs according to the invention;
Fig. 8 is a schematic structural diagram of the composition of yet another data extraction method for mass URLs according to the invention.
Specific embodiment
In order to illustrate the embodiments of the present invention, or the technical schemes in the prior art, more clearly, specific embodiments of the present invention are described below with reference to the drawings. Evidently, the drawings in the following description are only some embodiments of the present invention; for those of ordinary skill in the art, other drawings, and from them other embodiments, can be obtained without creative work.
For simplicity of form, each figure schematically shows only the parts relevant to the present invention; they do not represent the actual structure of the product. In addition, for simplicity and ease of understanding, in figures where several parts have the same structure or function, only one of them is schematically depicted or marked. Herein, "one" does not only mean "only this one"; it can also cover the case of "more than one".
As shown in Fig. 1, according to one embodiment of the present invention, a data extraction method for mass URLs comprises the following steps: S10, after collecting text sub-data with Flume, aggregating it into text data and transmitting the text data; then collecting each piece of text data into a local file pool using the distributed web server framework.
S20, uploading the total text data accumulated in the local file pool to the Hadoop cloud distributed file system hdfs1; Hadoop is a distributed system infrastructure.
S40, using the Hadoop data warehouse tool Hive to extract URL keywords in a distributed manner from the total text data in the cloud distributed file system hdfs1.
Specifically, Flume is a highly available, highly reliable distributed system provided by Cloudera for massive log collection, aggregation, and transmission. Flume supports customizing various data senders in the log system to collect data; at the same time, Flume provides the ability to perform simple processing on data and write it to various (customizable) data receivers.
Hive is a data warehouse tool based on Hadoop. It can map a structured data file to a database table and provides complete SQL query functions; SQL statements can be converted into MapReduce tasks to run. Hive uses Hadoop's HDFS (the Hadoop distributed file system), and the computation model Hive uses is MapReduce.
As shown in Fig. 2, according to another embodiment of the present invention, a data extraction method for mass URLs comprises the following steps: S10, after collecting text sub-data with Flume, aggregating it into text data and transmitting the text data; then collecting each piece of text data into a local file pool using the distributed web server framework.
S20, uploading the total text data accumulated in the local file pool to the Hadoop cloud distributed file system hdfs1; Hadoop is a distributed system infrastructure.
S30, the data warehouse tool Hive sends a computation request to the open-source computing framework TEZ;
S31, the open-source computing framework TEZ compresses and encodes the total text data into compressed text data and stores it in the database of the cloud distributed file system hdfs1; here the compression is ORC compression, and the file storage format is ORC.
S41, extracting URL keywords from the compressed text data using the UDF functions of the data warehouse tool Hive, and outputting the type and frequency of the data accessed by each user. For example: 8CAB8EC6CC40 186222539** {"mobile phone":3, "health":27}; 8CAB8EB144A8 138350092** {"automobile":11, "picture":127, "music":26, "recruitment":1, "mobile phone":13, "health":2907, "life":8, "video":4, "shopping":7, "social":84, "live":1}; 8CAB8EC00880 136605272** {"music":1, "mobile phone":4, "health":54, "video":2, "shopping":1, "social":4}.
Specifically, Tez is Apache's newest open-source computing framework supporting DAG jobs. Tez is not aimed directly at end users; in fact it allows developers to build applications for end users with better performance and scalability. The goal of the Tez project is to support a high degree of customization, so that it can meet the needs of various use cases and people can complete their work without resorting to other external means. If projects such as Hive and Pig use Tez rather than MapReduce as the backbone of their data processing, their response times will be noticeably improved. Tez is built on YARN, the new resource management framework used by Hadoop.
Hive file storage formats:
1. textfile, the default format. Storage mode: row storage; disk overhead and data parsing overhead are large; compressed text files cannot be merged or split by Hive.
2. sequencefile, a binary file serialized into the file in <key,value> form. Storage mode: row storage; splittable and compressible; block compression is typically chosen, and its advantage is that the files are compatible with mapfile in the Hadoop API.
3. rcfile. Storage mode: data is partitioned into blocks by row, and each block is stored by column; compression is fast and column access is fast; reading a record touches as few blocks as possible, and reading only the required columns merely requires reading the header definition of each row group; however, reading full data may offer no obvious performance advantage over sequencefile.
4. orc. Storage mode: data is partitioned into blocks by row, and each block is stored by column; compression is fast and column access is fast; it is more efficient than rcfile and is an improved version of rcfile.
5. User-defined formats. Users can customize input/output formats by implementing inputformat and outputformat.
Among these, textfile consumes relatively large storage space, and its compressed text cannot be split or merged; its query efficiency is the lowest, but it can be stored directly and its data-loading speed is the highest. Sequencefile consumes the most storage space; its compressed files can be split and merged, its query efficiency is high, and it must be loaded by conversion from text files. Rcfile consumes the least storage space and has the highest query efficiency; it must be loaded by conversion from text files, and its loading speed is the lowest.
As shown in Fig. 2 and Fig. 3, according to yet another embodiment of the present invention, a data extraction method for mass URLs comprises the following steps: S10, after collecting text sub-data with Flume, aggregating it into text data and transmitting the text data; then collecting each piece of text data into a local file pool using the distributed web server framework.
S21, extracting the text data from the local file pool;
Step S21 further comprises: S211, extracting the router MAC address and timestamp from the filename of the text data;
S212, identifying whether the router MAC address and timestamp contain garbled characters;
S213, when the router MAC address or timestamp contains garbled characters, cleaning the garbled characters using Python and then jumping to step S22; otherwise, jumping directly to step S22. Python is an object-oriented, interpreted computer programming language.
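Steps S211-S213 can be sketched in Python as follows. This is a hedged illustration only: the patent does not specify the filename convention, so the `<MAC>_<timestamp>.txt` layout, the `parse_filename` helper, and the definition of a "garbled character" used here are all assumptions.

```python
import re

# Assumed filename layout: <12-hex-digit router MAC>_<10-digit unix timestamp>.txt
FILENAME_RE = re.compile(r"^([0-9A-Fa-f]{12})_(\d{10})\.txt$")

def clean_token(token, alphabet=r"[^0-9A-Fa-f]"):
    """Drop any character outside the expected alphabet; a stand-in
    for the garbled-character cleaning of step S213."""
    return re.sub(alphabet, "", token)

def parse_filename(name):
    """Steps S211-S213: extract (router_mac, timestamp) from a
    filename, cleaning garbled characters when the strict pattern
    does not match, before proceeding to step S22."""
    m = FILENAME_RE.match(name)
    if m:  # no garbled characters: jump directly to S22
        return m.group(1), m.group(2)
    stem = name.rsplit(".", 1)[0]
    mac_part, _, ts_part = stem.partition("_")
    return clean_token(mac_part), re.sub(r"\D", "", ts_part)

print(parse_filename("8CAB8EC6CC40_1477900800.txt"))
print(parse_filename("8CAB8EC6CC40\ufffd_14779\ufffd00800.txt"))
```

The second call simulates a file whose name picked up replacement characters (mojibake) and shows the cleaned MAC and timestamp being recovered.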
S22, accumulating and merging the text data into total text data according to the block size of the cloud distributed file system hdfs1;
S23, uploading the total text data to the Hadoop cloud distributed file system hdfs1 using the local distributed file system hdfs2; Hadoop is a distributed system infrastructure.
S30, the data warehouse tool Hive sends a computation request to the open-source computing framework TEZ;
S31, the open-source computing framework TEZ compresses and encodes the total text data into compressed text data and stores it in the database of the cloud distributed file system hdfs1; here the compression is ORC compression, and the file storage format is ORC.
S41, extracting URL keywords from the compressed text data using the UDF functions of the data warehouse tool Hive, and outputting the type and frequency of the data accessed by each user. For example: 8CAB8EC6CC40 186222539** {"mobile phone":3, "health":27}; 8CAB8EB144A8 138350092** {"automobile":11, "picture":127, "music":26, "recruitment":1, "mobile phone":13, "health":2907, "life":8, "video":4, "shopping":7, "social":84, "live":1}; 8CAB8EC00880 136605272** {"music":1, "mobile phone":4, "health":54, "video":2, "shopping":1, "social":4}.
Specifically, Hadoop and MapReduce are the foundation of the Hive architecture. The Hive architecture includes the following components: CLI (command line interface), JDBC/ODBC, Thrift Server, WEB GUI, metastore, and Driver (Compiler, Optimizer, and Executor). These components can be divided into two broad classes: server-side components and client components.
Server-side components: 1. The Driver component: this component includes the Compiler, Optimizer, and Executor. Its role is to parse the HiveQL (SQL-like) statements we write, perform compilation and optimization, generate an execution plan, and then call the underlying MapReduce computing framework.
2. The Metastore component: the metadata service component, which stores Hive's metadata in a relational database; the relational databases supported by Hive include derby and mysql. Metadata is particularly important to Hive, so Hive allows the metastore service to run independently, installed in a remote server cluster, decoupling the Hive service from the metastore service and ensuring the robustness of Hive operation.
3. The Thrift service: Thrift is a software framework developed by Facebook for developing scalable, cross-language services. Hive integrates this service, allowing different programming languages to call Hive's interfaces.
Client components: 1. CLI: the command line interface.
2. The Thrift client: the Thrift client is not drawn in the architecture diagram above, but many of the client interfaces of the Hive architecture are built on the Thrift client, including the JDBC and ODBC interfaces.
3. WEB GUI: Hive provides a way for clients to access the services provided by Hive through a web page. This interface corresponds to Hive's hwi component (hive web interface); the hwi service must be started before use.
As shown in Fig. 2, Fig. 3, and Fig. 4, according to still another embodiment of the present invention, a data extraction method for mass URLs comprises the following steps: S01, building a Hadoop cluster environment and configuring the data warehouse tool Hive, the cloud distributed file system hdfs1, and the local distributed file system hdfs2; Namenode HA and ResourceManager HA are configured.
S02, building a web server distributed cluster at each node in the cluster environment and adding load balancing. Load balancing is built on the existing network infrastructure; it provides a cheap, effective, and transparent way to extend the bandwidth of network devices and servers, increase throughput, strengthen network data-handling capacity, and improve the flexibility and availability of the network.
S03, establishing table-building associations among the data warehouse tool Hive, the cloud distributed file system hdfs1, and the local distributed file system hdfs2; reconstructing the UDF functions of the data warehouse tool Hive.
S10, after collecting text sub-data with Flume, aggregating it into text data and transmitting the text data; then collecting each piece of text data into a local file pool using the distributed web server framework.
S21, extracting the text data from the local file pool;
Step S21 further comprises: S211, extracting the router MAC address and timestamp from the filename of the text data;
S212, identifying whether the router MAC address and timestamp contain garbled characters;
S213, when the router MAC address or timestamp contains garbled characters, cleaning the garbled characters using Python and then jumping to step S22; otherwise, jumping directly to step S22. Python is an object-oriented, interpreted computer programming language.
S22, accumulating and merging the text data into total text data according to the block size of the cloud distributed file system hdfs1;
S23, uploading the total text data to the Hadoop cloud distributed file system hdfs1 using the local distributed file system hdfs2; Hadoop is a distributed system infrastructure.
S30, the data warehouse tool Hive sends a computation request to the open-source computing framework TEZ;
S31, the open-source computing framework TEZ compresses and encodes the total text data into compressed text data and stores it in the database of the cloud distributed file system hdfs1; here the compression is ORC compression, and the file storage format is ORC.
S41, extracting URL keywords from the compressed text data using the UDF functions of the data warehouse tool Hive, and outputting the type and frequency of the data accessed by each user. For example: 8CAB8EC6CC40 186222539** {"mobile phone":3, "health":27}; 8CAB8EB144A8 138350092** {"automobile":11, "picture":127, "music":26, "recruitment":1, "mobile phone":13, "health":2907, "life":8, "video":4, "shopping":7, "social":84, "live":1}; 8CAB8EC00880 136605272** {"music":1, "mobile phone":4, "health":54, "video":2, "shopping":1, "social":4}.
Specifically, load balancing is abbreviated SLB (Server Load Balancing), and its main algorithms are as follows. Weighted round robin (WRR) algorithm: each server is assigned a weight, which represents its ability to process connections relative to the other servers. For a weight of n, SLB assigns this server n new connections before allocating traffic to the next server.
Weighted least connections (WLC) algorithm: SLB distributes new connections to the real server with the fewest active connections. Each real server is assigned a weight m; the server's capacity for active connections equals m divided by the sum of the weights of all servers. SLB distributes new connections to real servers whose number of active connections is far below their capacity.
When using the weighted least connections (WLC) algorithm, SLB controls access to newly added real servers using a slow-start mode. "Slow start" limits the rate at which new connections are established and allows it to increase gradually, thereby preventing the server from being overloaded.
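As a rough illustration of the two SLB algorithms just described, the following Python sketch implements weighted round robin and the WLC selection rule; the server names, weights, and connection counts are hypothetical, and real SLB devices implement this in the data plane rather than in application code.

```python
from itertools import cycle

def weighted_round_robin(servers):
    """Yield servers in WRR order: a server with weight n receives n
    consecutive new connections before SLB moves to the next server."""
    expanded = [name for name, weight in servers for _ in range(weight)]
    return cycle(expanded)

def weighted_least_connections(servers, active):
    """WLC selection: pick the real server with the lowest ratio of
    active connections to weight, i.e. the one furthest below its
    weighted capacity."""
    return min(servers, key=lambda sw: active[sw[0]] / sw[1])[0]

servers = [("web1", 3), ("web2", 1)]  # hypothetical pool: web1 is 3x as capable
rr = weighted_round_robin(servers)
print([next(rr) for _ in range(8)])   # web1 three times, then web2, repeating
active = {"web1": 5, "web2": 2}
print(weighted_least_connections(servers, active))
```

Slow start, the third mechanism described above, would wrap the WLC pick with a per-server cap on new-connection rate that ramps up over time for freshly added servers.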
The configuration of Namenode HA is as follows:
1.1 Unzip hadoop-2.3.0-cdh5.0.0.tar.gz to /opt/boh/, rename it to hadoop, and modify etc/hadoop/core-site.xml.
1.2 Modify hdfs-site.xml.
1.3 Edit /etc/hadoop/slaves; add hadoop3 and hadoop4.
1.4 Edit /etc/profile; add HADOOP_HOME=/opt/boh/hadoop and PATH=$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$PATH; copy the above configuration to all nodes.
1.5 Start the respective services:
1.5.1 Start journalnode: execute sbin/hadoop-daemon.sh start journalnode on hadoop0, hadoop1, and hadoop2.
1.5.2 Format zookeeper: execute bin/hdfs zkfc -formatZK on hadoop1.
1.5.3 Format and start the hadoop1 node: bin/hdfs namenode -format; sbin/hadoop-daemon.sh start namenode.
1.5.4 Format and start the hadoop2 node: bin/hdfs namenode -bootstrapStandby; sbin/hadoop-daemon.sh start namenode.
1.5.5 Start the zkfc service on hadoop1 and hadoop2: sbin/hadoop-daemon.sh start zkfc; at this point one node of hadoop1 and hadoop2 becomes active.
1.5.6 Start datanode: execute sbin/hadoop-daemons.sh start datanode on hadoop1.
1.5.7 Verify success: open a browser and access hadoop1:50070 and hadoop2:50070; one of the two namenodes is active and the other is standby. Then kill the namenode process of the active one; the standby namenode automatically converts to the active state.
The configuration of ResourceManager HA is as follows:
2.1 Modify mapred-site.xml.
2.2 Modify yarn-site.xml.
2.3 Distribute the configuration files to each node.
2.4 Modify yarn-site.xml on hadoop2.
2.5 Create directories and grant permissions:
2.5.1 Create the local directories.
2.5.2 After starting hdfs, execute the following commands to create the log directory and create /tmp under hdfs. If /tmp is not created with the specified permissions, the other CDH components will have problems; in particular, if it is not created, other processes will automatically create this directory with strict permissions, which affects whether other programs run properly. hadoop fs -mkdir /tmp; hadoop fs -chmod -R 777 /tmp.
2.6 Start yarn and the jobhistory server:
2.6.1 Start on hadoop1: sbin/start-yarn.sh; this script starts the resourcemanager on hadoop1 and all of the nodemanagers.
2.6.2 Start the resourcemanager on hadoop2: yarn-daemon.sh start resourcemanager.
2.6.3 Start the jobhistory server on hadoop2: sbin/mr-jobhistory-daemon.sh start historyserver.
2.7 Verify the configuration: open a browser and access hadoop1:23188 or hadoop2:23188.
As shown in Fig. 2 to Fig. 5, according to yet another embodiment of the present invention, a data extraction method for mass URLs comprises the following steps: S01, building a Hadoop cluster environment and configuring the data warehouse tool Hive, the cloud distributed file system hdfs1, and the local distributed file system hdfs2; Namenode HA and ResourceManager HA are configured.
Step S01 further comprises: S011, building a first predetermined number (for example, 4) of master nodes and a second predetermined number (for example, 7) of slave nodes on Hadoop; the master nodes are interconnected with each other, and each master node is connected to each slave node.
S012, setting up the metadata service component metastore, the relational database mysql, and HiveServer2 on each master node. Through HiveServer2, clients can operate on the data in Hive without starting the CLI; it also allows remote clients to submit requests to Hive and fetch results using various programming languages such as Java and Python. HiveServer2 is based on Thrift; it supports multi-client concurrency and authentication and provides better support for open API clients such as JDBC and ODBC.
S02, building a web server distributed cluster at each node in the cluster environment and adding load balancing. Load balancing is built on the existing network infrastructure; it provides a cheap, effective, and transparent way to extend the bandwidth of network devices and servers, increase throughput, strengthen network data-handling capacity, and improve the flexibility and availability of the network.
S03, establishing table-building associations among the data warehouse tool Hive, the cloud distributed file system hdfs1, and the local distributed file system hdfs2; reconstructing the UDF functions of the data warehouse tool Hive.
S10, after collecting text sub-data with Flume, aggregating it into text data and transmitting the text data; then collecting each piece of text data into a local file pool using the distributed web server framework.
S21, extracting the text data from the local file pool;
Step S21 further comprises: S211, extracting the router MAC address and timestamp from the filename of the text data;
S212, identifying whether the router MAC address and timestamp contain garbled characters;
S213, when the router MAC address or timestamp contains garbled characters, cleaning the garbled characters using Python and then jumping to step S22; otherwise, jumping directly to step S22. Python is an object-oriented, interpreted computer programming language.
S22, accumulating and merging the text data into total text data according to the block size of the cloud distributed file system hdfs1;
S23, uploading the total text data to the Hadoop cloud distributed file system hdfs1 using the local distributed file system hdfs2; Hadoop is a distributed system infrastructure.
S30, the data warehouse tool Hive sends a computation request to the open-source computing framework TEZ;
S31, the open-source computing framework TEZ compresses and encodes the total text data into compressed text data and stores it in the database of the cloud distributed file system hdfs1; here the compression is ORC compression, and the file storage format is ORC.
S41, extracting URL keywords from the compressed text data using the UDF functions of the data warehouse tool Hive, and outputting the type and frequency of the data accessed by each user. For example: 8CAB8EC6CC40 186222539** {"mobile phone":3, "health":27}; 8CAB8EB144A8 138350092** {"automobile":11, "picture":127, "music":26, "recruitment":1, "mobile phone":13, "health":2907, "life":8, "video":4, "shopping":7, "social":84, "live":1}; 8CAB8EC00880 136605272** {"music":1, "mobile phone":4, "health":54, "video":2, "shopping":1, "social":4}.
Specifically, the Master/Slave concept is equivalent to server and agent. The master provides a web interface through which users manage jobs and slaves; a job may run on the master machine itself or be dispatched to run on a slave. One master can be associated with multiple slaves in order to serve different jobs, or different configurations of the same job.
The metastore component of Hive is where the hive metadata is stored. It consists of two parts: the metastore service and the back-end data store. The medium of the back-end store is a relational database, such as hive's default embedded disk database Derby, or a mysql database. The metastore service is built on top of the back-end store and is the service component through which the hive service interacts with the metadata. By default, the metastore service and the hive service are installed together and run in the same process. The metastore service can also be stripped out of the hive service and installed independently in the cluster, with hive calling it remotely. In this way the metadata layer can be placed behind a firewall: clients access only the hive service, which connects to the metadata layer on their behalf, providing better manageability and security. Using a remote metastore service lets the metastore service and the hive service run in different processes, which also ensures the stability of hive and improves the efficiency of the hive service.
As shown in Fig. 6, according to one embodiment of the present invention, a data extraction system for mass URLs includes: Hadoop, on which the cluster environment is built and the data warehouse tool hive, the cloud distributed file system hdfs1 and the local distributed file system hdfs2 are configured; Namenode HA and ResourceManager HA are also configured.
Preferably, a first predetermined number of master nodes (for example, 4) and a second predetermined number of slave nodes (for example, 7) are built on Hadoop; the master nodes are connected with each other, and each master node is connected with each slave node.
On each master node, the metadata service component metastore, the relational database mysql, and HiveServer2 are set up. Through HiveServer2, a client can operate on the data in Hive without starting the CLI; it also allows remote clients to submit requests to hive and fetch results using various programming languages such as Java and Python. HiveServer2 is based on Thrift, supports multi-client concurrency and authentication, and provides better support for open API clients such as JDBC and ODBC.
A distributed web server cluster is built on each node of the cluster environment, and load balancing is added. Load balancing is built on the existing network infrastructure and provides a cheap, effective and transparent way to extend the bandwidth of network devices and servers, increase throughput, strengthen network data-handling capacity, and improve the flexibility and availability of the network.
Table associations are established among the data warehouse tool hive, the cloud distributed file system hdfs1 and the local distributed file system hdfs2; the UDF functions of the data warehouse tool hive are reconstructed.
The distributed web server framework collects the text sub-data with Flume, aggregates it into text data, and transmits the text data; the distributed web server framework then collects each text data into the local file pool.
The distributed web server framework extracts the text data from the local file pool; extracts the router MAC address and the timestamp from the filename of the text data; checks whether the router MAC address and the timestamp contain garbled characters; and, when they do, cleans the garbled characters with Python. Python is an object-oriented, interpreted computer programming language.
The distributed web server framework accumulates and merges the text data into the total text data according to the block size of the cloud distributed file system hdfs1.
The local distributed file system hdfs2 uploads the total text data accumulated in the local file pool to the Hadoop cloud distributed file system hdfs1; Hadoop is a distributed system infrastructure.
The data warehouse tool hive sends a computation request to the open-source computing framework TEZ; the open-source computing framework TEZ compresses and encodes the total text data into compressed text data and stores it in the database of the cloud distributed file system hdfs1; the compression is ORC compression, and the file storage format is ORC.
The data warehouse tool hive extracts the URL keywords from the compressed text data with its UDF functions, and outputs the type and frequency of each user's access data. For example: 8CAB8EC6CC40 186222539** {"mobile phone": 3, "health": 27}; 8CAB8EB144A8 138350092** {"automobile": 11, "picture": 127, "music": 26, "recruitment": 1, "mobile phone": 13, "health": 2907, "life": 8, "video": 4, "shopping": 7, "social": 84, "live": 1}; 8CAB8EC00880 136605272** {"music": 1, "mobile phone": 4, "health": 54, "video": 2, "shopping": 1, "social": 4}.
As shown in Fig. 7 and Fig. 8, according to still another embodiment, a data extraction method for mass URLs includes: building the cluster environment of Hadoop 2.7.1 (deploying 4 masters and 7 slaves) and configuring environments such as HIVE and HDFS (with the Hive metastore, mysql and hiveserver2 set up on a master). Namenode HA and ResourceManager HA are set so that the distributed system meets high availability. A tomcat distributed cluster is built on each node, and load balancing is added. Table associations between hive and hdfs are established, the corresponding hive UDF functions are developed, and the extraction function is tested to work normally.
Through the distributed web server framework, the text data is collected into the local file pool; the files in the file pool are cleaned, extracted and merged, with the merging accumulated according to the HDFS block size; the local HDFS completes the efficient upload of the merged data; the self-developed Hive UDF function getNUM then completes the URL keyword extraction over the data, achieving efficient extraction from mass URL data.
The programs for cleaning, merging, uploading, high-compression encoding and distributed extraction are run, and the results are output for further in-depth analysis. Through the hive UDF functions and the distributed computation of the Hadoop cluster, the extraction computation over the mass data is completed; the type and frequency of each user's access data quickly yield the user's online profile, providing a basis for product recommendations and services for the user.
hive: an open-source Apache technology; a data warehouse software that provides querying and management of large data sets stored in distributed storage, itself built on Apache Hadoop. Hive SQL is an SQL-like language with traditional MapReduce at its core.
The present invention mainly combines the adaptively developed UDF functions of hive, the Python-based cleaning, merging and uploading, and the ORC compression of hive, defining a high-performance data extraction method.
It should be noted that the above embodiments can be freely combined as needed. The above are only the preferred embodiments of the present invention. It should be pointed out that, for those of ordinary skill in the art, several improvements and modifications can be made without departing from the principle of the present invention, and these improvements and modifications should also be regarded as falling within the protection scope of the present invention.

Claims (10)

CN201610970427.3A | Filed 2016-10-28 | Data extraction method and system for mass URLs | Status: Pending | Publication: CN106570153A (en)

Priority Applications (1)

Application Number: CN201610970427.3A | Priority Date: 2016-10-28 | Filing Date: 2016-10-28 | Title: Data extraction method and system for mass URLs


Publications (1)

Publication Number: CN106570153A | Publication Date: 2017-04-19

Family

ID=58541622

Family Applications (1)

Application Number: CN201610970427.3A | Status: Pending | Publication: CN106570153A (en) | Priority Date: 2016-10-28 | Filing Date: 2016-10-28

Country Status (1)

Country: CN | Publication: CN106570153A (en)


Citations (5)

* Cited by examiner, † Cited by third party
Publication numberPriority datePublication dateAssigneeTitle
CN103955502A (en)*2014-04-242014-07-30科技谷(厦门)信息技术有限公司Visualized on-line analytical processing (OLAP) application realizing method and system
CN104111996A (en)*2014-07-072014-10-22山大地纬软件股份有限公司Health insurance outpatient clinic big data extraction system and method based on hadoop platform
CN104301182A (en)*2014-10-222015-01-21赛尔网络有限公司Method and device for inquiring slow website access abnormal information
CN105512336A (en)*2015-12-292016-04-20中国建设银行股份有限公司Method and device for mass data processing based on Hadoop
CN105677842A (en)*2016-01-052016-06-15北京汇商融通信息技术有限公司Log analysis system based on Hadoop big data processing technique


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
CHEN476328361: "tez: controlling whether output files are compressed and specifying the file name", 《HTTPS://BLOG.CSDN.NET/THINKING2013/ARTICLE/DETAILS/48133137》*
XIAOJUN_0820: "Learning flume: writing log4j log data into hdfs with flume", 《HTTPS://BLOG.CSDN.NET/XIAO_JUN_0820/ARTICLE/DETAILS/38110323》*

Cited By (8)

* Cited by examiner, † Cited by third party
Publication numberPriority datePublication dateAssigneeTitle
CN107145542A (en)*2017-04-252017-09-08上海斐讯数据通信技术有限公司The high efficiency extraction subscription client ID method and system from URL
CN107193903A (en)*2017-05-112017-09-22上海斐讯数据通信技术有限公司The method and system of efficient process IP address zone location
CN107256206A (en)*2017-05-242017-10-17北京京东尚科信息技术有限公司The method and apparatus of character stream format conversion
CN107256206B (en)*2017-05-242021-04-30北京京东尚科信息技术有限公司Method and device for converting character stream format
CN108133050A (en)*2018-01-172018-06-08北京网信云服信息科技有限公司A kind of extracting method of data, system and device
CN111427884A (en)*2020-03-032020-07-17中国平安人寿保险股份有限公司 Form data processing method, device, electronic device and storage medium
CN111935215A (en)*2020-06-292020-11-13广东科徕尼智能科技有限公司Internet of things data management method, terminal, system and storage device
CN114138724A (en)*2021-11-262022-03-04浪潮软件科技有限公司 A file uploading method, device and computer readable medium


Legal Events

PB01 — Publication (application publication date: 2017-04-19)
SE01 — Entry into force of request for substantive examination
RJ01 — Rejection of invention patent application after publication

