CN106570153A - Data extraction method and system for mass URLs - Google Patents

Data extraction method and system for mass URLs

Info

Publication number
CN106570153A
CN106570153A
Authority
CN
China
Prior art keywords
data
text data
file system
hive
hdfs1
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201610970427.3A
Other languages
Chinese (zh)
Inventor
欧阳涛 (Ouyang Tao)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Feixun Data Communication Technology Co Ltd
Original Assignee
Shanghai Feixun Data Communication Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Feixun Data Communication Technology Co Ltd
Priority to CN201610970427.3A
Publication of CN106570153A
Legal status: Pending

Abstract

The invention discloses a data extraction method for mass URLs. The method comprises the following steps: S10, collecting each piece of text data into a local file pool using a distributed web server framework; S20, uploading the total text data accumulated in the local file pool to the Hadoop cloud distributed file system hdfs1; and S40, using the Hadoop data warehouse tool Hive to extract URL keywords in a distributed manner from the total text data in the cloud distributed file system. With this method, in a big-data application scenario, each piece of text data is first gathered into the local file pool, the accumulated total text data is then uploaded to the cloud distributed file system, and Hive performs distributed computation for distributed extraction; the method therefore has the advantages of high efficiency and low resource consumption.

Description

Data extraction method and system for mass URLs
Technical field
The invention belongs to the field of data extraction techniques, and more particularly relates to a data extraction method and system for mass URLs.
Background technology
With the rapid development of the Internet, analyzing the patterns and personalized habits that users exhibit when using Internet resources (that is, user behavior analysis) makes it possible to extract and recognize user interests. On the one hand, content can be personalized and pushed to users, providing more proactive, intelligent services for website visitors. On the other hand, by discovering interests and preferences from the different manifestations of user behavior, the membership relations between pages can be optimized and the website system architecture improved, reducing the burden on users of finding information, making operation simpler, and saving time and effort.
At present, when analyzing user behavior, a large-scale website typically has a huge number of online users and generates a huge amount of real-time behavior and context information. The system therefore needs high storage capacity and computing speed so that analysis results can be fed back to users in time.
In the prior art, most user behavior analysis systems directly use local relational database technology and traditional data extraction methods. However, as data grows to massive scale, prior-art data extraction methods consume large amounts of resources and memory and are inefficient, and thus cannot adequately support the efficient analysis of mass data.
Summary of the invention
The technical scheme provided by the present invention is as follows:
The present invention provides a data extraction method for mass URLs, comprising the following steps: S10, collecting each piece of text data into a local file pool using a distributed web server framework; S20, uploading the total text data accumulated in the local file pool to the Hadoop cloud distributed file system hdfs1; S40, using the Hadoop data warehouse tool Hive to extract URL keywords in a distributed manner from the total text data in the cloud distributed file system hdfs1.
Further, the method also comprises the following steps: S30, the data warehouse tool Hive sends a computation request to the open-source computing framework TEZ; S31, the open-source computing framework TEZ compresses and encodes the total text data into compressed text data and stores it in the database of the cloud distributed file system hdfs1.
Further, step S40 further comprises: S41, extracting URL keywords from the compressed text data using the UDF functions of the data warehouse tool Hive, and outputting the type and frequency of the data accessed by each user.
Further, step S20 further comprises: S21, extracting the text data from the local file pool; S22, accumulating and merging the text data into total text data according to the block size of the cloud distributed file system hdfs1; S23, uploading the total text data to the cloud distributed file system hdfs1 using the local distributed file system hdfs2.
Further, step S21 further comprises: S211, extracting the router MAC address and timestamp from the filename of the text data; S212, identifying whether the router MAC address and timestamp contain garbled characters; S213, when the router MAC address or timestamp contains garbled characters, cleaning the garbled characters and then jumping to step S22; otherwise, jumping directly to step S22.
Further, before step S10 the method also comprises: S01, building a Hadoop cluster environment and configuring the data warehouse tool Hive, the cloud distributed file system hdfs1, and the local distributed file system hdfs2; S02, building a web server distributed cluster at each node in the cluster environment and adding load balancing; S03, establishing table-building associations among the data warehouse tool Hive, the cloud distributed file system hdfs1, and the local distributed file system hdfs2, and reconstructing the UDF functions of the data warehouse tool Hive.
Further, step S01 further comprises: S011, building a first predetermined number of master nodes and a second predetermined number of slave nodes on Hadoop; the master nodes are interconnected with each other, and each master node is connected to each slave node.
Further, step S01 further comprises: S012, setting up a metadata service component metastore and a relational database mysql on each master node.
The present invention also provides a system implementing the data extraction method for mass URLs, comprising: a web server framework, which collects each piece of text data into a local file pool using a distributed web server framework; a local distributed file system hdfs2, which uploads the total text data accumulated in the local file pool to the Hadoop cloud distributed file system hdfs1; and a data warehouse tool Hive, which extracts URL keywords in a distributed manner from the total text data in the cloud distributed file system hdfs1.
Further, the system also comprises the open-source computing framework TEZ; the data warehouse tool Hive sends computation requests to the open-source computing framework TEZ, which compresses and encodes the total text data into compressed text data and stores it in the database of the cloud distributed file system hdfs1.
Compared with the prior art, the data extraction method and system for mass URLs provided by the present invention have the following beneficial effects:
1) In the present invention, in a big-data application scenario, each piece of text data is gathered into the local file pool, the total text data is uploaded to the cloud distributed file system, and Hive then performs distributed computation for distributed extraction; this gives the advantages of high efficiency and low resource consumption.
2) In the present invention, the total text data is compressed and encoded; the compressed total text data occupies less space, effectively solving the resource consumption and memory problems of a local relational database, and the encoded total text data can be extracted in order, ensuring that extraction runs smoothly.
3) In the present invention, the text data is accumulated and merged into total text data according to the block size of the cloud distributed file system before the total text data is uploaded; this prevents the total text data from becoming too large and blocking the distributed file system.
4) In the present invention, the router MAC address and timestamp are extracted from the filename of the text data, and any garbled characters encountered are cleaned, providing a guarantee that data extraction proceeds smoothly.
Description of the drawings
In the following, the preferred embodiments are described with reference to the drawings in a clear and easily understood way, further explaining the above-mentioned characteristics, technical features, advantages, and implementations of the data extraction method and system for mass URLs.
Fig. 1 is a schematic flowchart of a data extraction method for mass URLs according to the invention;
Fig. 2 is a schematic flowchart of another data extraction method for mass URLs according to the invention;
Fig. 3 is a schematic flowchart of step S20 in the present invention;
Fig. 4 is a partial schematic flowchart of the data extraction method for mass URLs in the present invention;
Fig. 5 is a partial schematic flowchart of step S01 in the present invention;
Fig. 6 is a schematic structural diagram of a data extraction system for mass URLs according to the invention;
Fig. 7 is a schematic diagram of yet another data extraction method for mass URLs according to the invention;
Fig. 8 is a schematic structural diagram of the composition of yet another data extraction method for mass URLs according to the invention.
Specific embodiment
In order to illustrate the embodiments of the present invention, or the technical schemes in the prior art, more clearly, specific embodiments of the present invention are described below with reference to the drawings. Evidently, the drawings in the following description are only some embodiments of the present invention; for those of ordinary skill in the art, other drawings, and from them other embodiments, can be obtained without creative work.
For simplicity of form, each figure schematically shows only the parts relevant to the present invention; they do not represent the actual structure of the product. In addition, for simplicity and ease of understanding, in figures where several parts have the same structure or function, only one of them is schematically depicted or marked. Herein, "one" does not only mean "only this one"; it can also cover the case of "more than one".
As shown in Fig. 1, according to one embodiment of the present invention, a data extraction method for mass URLs comprises the following steps: S10, after collecting text sub-data with Flume, aggregating it into text data and transmitting the text data; then collecting each piece of text data into a local file pool using the distributed web server framework.
S20, uploading the total text data accumulated in the local file pool to the Hadoop cloud distributed file system hdfs1; Hadoop is a distributed system infrastructure.
S40, using the Hadoop data warehouse tool Hive to extract URL keywords in a distributed manner from the total text data in the cloud distributed file system hdfs1.
Specifically, Flume is a highly available, highly reliable distributed system provided by Cloudera for massive log collection, aggregation, and transmission. Flume supports customizing various data senders in the log system to collect data; at the same time, Flume provides the ability to perform simple processing on data and write it to various (customizable) data receivers.
Hive is a data warehouse tool based on Hadoop. It can map a structured data file to a database table and provides complete SQL query functions; SQL statements can be converted into MapReduce tasks to run. Hive uses Hadoop's HDFS (the Hadoop distributed file system), and the computation model Hive uses is MapReduce.
As shown in Fig. 2, according to another embodiment of the present invention, a data extraction method for mass URLs comprises the following steps: S10, after collecting text sub-data with Flume, aggregating it into text data and transmitting the text data; then collecting each piece of text data into a local file pool using the distributed web server framework.
S20, uploading the total text data accumulated in the local file pool to the Hadoop cloud distributed file system hdfs1; Hadoop is a distributed system infrastructure.
S30, the data warehouse tool Hive sends a computation request to the open-source computing framework TEZ;
S31, the open-source computing framework TEZ compresses and encodes the total text data into compressed text data and stores it in the database of the cloud distributed file system hdfs1; here the compression is ORC compression, and the file storage format is ORC.
S41, extracting URL keywords from the compressed text data using the UDF functions of the data warehouse tool Hive, and outputting the type and frequency of the data accessed by each user. For example: 8CAB8EC6CC40 186222539** {"mobile phone":3, "health":27}; 8CAB8EB144A8 138350092** {"automobile":11, "picture":127, "music":26, "recruitment":1, "mobile phone":13, "health":2907, "life":8, "video":4, "shopping":7, "social":84, "live":1}; 8CAB8EC00880 136605272** {"music":1, "mobile phone":4, "health":54, "video":2, "shopping":1, "social":4}.
Specifically, Tez is Apache's newest open-source computing framework supporting DAG jobs. Tez is not aimed directly at end users; in fact it allows developers to build applications for end users with better performance and scalability. The goal of the Tez project is to support a high degree of customization, so that it can meet the needs of various use cases and people can complete their work without resorting to other external means. If projects such as Hive and Pig use Tez rather than MapReduce as the backbone of their data processing, their response times will be noticeably improved. Tez is built on YARN, the new resource management framework used by Hadoop.
Hive file storage formats:
1. textfile, the default format. Storage mode: row storage; disk overhead and data parsing overhead are large; compressed text files cannot be merged or split by Hive.
2. sequencefile, a binary file serialized into the file in <key,value> form. Storage mode: row storage; splittable and compressible; block compression is typically chosen, and its advantage is that the files are compatible with mapfile in the Hadoop API.
3. rcfile. Storage mode: data is partitioned into blocks by row, and each block is stored by column; compression is fast and column access is fast; reading a record touches as few blocks as possible, and reading only the required columns merely requires reading the header definition of each row group; however, reading full data may offer no obvious performance advantage over sequencefile.
4. orc. Storage mode: data is partitioned into blocks by row, and each block is stored by column; compression is fast and column access is fast; it is more efficient than rcfile and is an improved version of rcfile.
5. User-defined formats. Users can customize input/output formats by implementing inputformat and outputformat.
Among these, textfile consumes relatively large storage space, and its compressed text cannot be split or merged; its query efficiency is the lowest, but it can be stored directly and its data-loading speed is the highest. Sequencefile consumes the most storage space; its compressed files can be split and merged, its query efficiency is high, and it must be loaded by conversion from text files. Rcfile consumes the least storage space and has the highest query efficiency; it must be loaded by conversion from text files, and its loading speed is the lowest.
As shown in Fig. 2 and Fig. 3, according to yet another embodiment of the present invention, a data extraction method for mass URLs comprises the following steps: S10, after collecting text sub-data with Flume, aggregating it into text data and transmitting the text data; then collecting each piece of text data into a local file pool using the distributed web server framework.
S21, extracting the text data from the local file pool;
Step S21 further comprises: S211, extracting the router MAC address and timestamp from the filename of the text data;
S212, identifying whether the router MAC address and timestamp contain garbled characters;
S213, when the router MAC address or timestamp contains garbled characters, cleaning the garbled characters using Python and then jumping to step S22; otherwise, jumping directly to step S22. Python is an object-oriented, interpreted computer programming language.
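Steps S211-S213 can be sketched in Python as follows. This is a hedged illustration only: the patent does not specify the filename convention, so the `<MAC>_<timestamp>.txt` layout, the `parse_filename` helper, and the definition of a "garbled character" used here are all assumptions.

```python
import re

# Assumed filename layout: <12-hex-digit router MAC>_<10-digit unix timestamp>.txt
FILENAME_RE = re.compile(r"^([0-9A-Fa-f]{12})_(\d{10})\.txt$")

def clean_token(token, alphabet=r"[^0-9A-Fa-f]"):
    """Drop any character outside the expected alphabet; a stand-in
    for the garbled-character cleaning of step S213."""
    return re.sub(alphabet, "", token)

def parse_filename(name):
    """Steps S211-S213: extract (router_mac, timestamp) from a
    filename, cleaning garbled characters when the strict pattern
    does not match, before proceeding to step S22."""
    m = FILENAME_RE.match(name)
    if m:  # no garbled characters: jump directly to S22
        return m.group(1), m.group(2)
    stem = name.rsplit(".", 1)[0]
    mac_part, _, ts_part = stem.partition("_")
    return clean_token(mac_part), re.sub(r"\D", "", ts_part)

print(parse_filename("8CAB8EC6CC40_1477900800.txt"))
print(parse_filename("8CAB8EC6CC40\ufffd_14779\ufffd00800.txt"))
```

The second call simulates a file whose name picked up replacement characters (mojibake) and shows the cleaned MAC and timestamp being recovered.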
S22, accumulating and merging the text data into total text data according to the block size of the cloud distributed file system hdfs1;
S23, uploading the total text data to the Hadoop cloud distributed file system hdfs1 using the local distributed file system hdfs2; Hadoop is a distributed system infrastructure.
S30, the data warehouse tool Hive sends a computation request to the open-source computing framework TEZ;
S31, the open-source computing framework TEZ compresses and encodes the total text data into compressed text data and stores it in the database of the cloud distributed file system hdfs1; here the compression is ORC compression, and the file storage format is ORC.
S41, extracting URL keywords from the compressed text data using the UDF functions of the data warehouse tool Hive, and outputting the type and frequency of the data accessed by each user. For example: 8CAB8EC6CC40 186222539** {"mobile phone":3, "health":27}; 8CAB8EB144A8 138350092** {"automobile":11, "picture":127, "music":26, "recruitment":1, "mobile phone":13, "health":2907, "life":8, "video":4, "shopping":7, "social":84, "live":1}; 8CAB8EC00880 136605272** {"music":1, "mobile phone":4, "health":54, "video":2, "shopping":1, "social":4}.
Specifically, Hadoop and MapReduce are the foundation of the Hive architecture. The Hive architecture includes the following components: CLI (command line interface), JDBC/ODBC, Thrift Server, WEB GUI, metastore, and Driver (Compiler, Optimizer, and Executor). These components can be divided into two broad classes: server-side components and client components.
Server-side components: 1. The Driver component: this component includes the Compiler, Optimizer, and Executor. Its role is to parse the HiveQL (SQL-like) statements we write, perform compilation and optimization, generate an execution plan, and then call the underlying MapReduce computing framework.
2. The Metastore component: the metadata service component, which stores Hive's metadata in a relational database; the relational databases supported by Hive include derby and mysql. Metadata is particularly important to Hive, so Hive allows the metastore service to run independently, installed in a remote server cluster, decoupling the Hive service from the metastore service and ensuring the robustness of Hive operation.
3. The Thrift service: Thrift is a software framework developed by Facebook for developing scalable, cross-language services. Hive integrates this service, allowing different programming languages to call Hive's interfaces.
Client components: 1. CLI: the command line interface.
2. The Thrift client: the Thrift client is not drawn in the architecture diagram above, but many of the client interfaces of the Hive architecture are built on the Thrift client, including the JDBC and ODBC interfaces.
3. WEB GUI: Hive provides a way for clients to access the services provided by Hive through a web page. This interface corresponds to Hive's hwi component (hive web interface); the hwi service must be started before use.
As shown in Fig. 2, Fig. 3, and Fig. 4, according to still another embodiment of the present invention, a data extraction method for mass URLs comprises the following steps: S01, building a Hadoop cluster environment and configuring the data warehouse tool Hive, the cloud distributed file system hdfs1, and the local distributed file system hdfs2; Namenode HA and ResourceManager HA are configured.
S02, building a web server distributed cluster at each node in the cluster environment and adding load balancing. Load balancing is built on the existing network infrastructure; it provides a cheap, effective, and transparent way to extend the bandwidth of network devices and servers, increase throughput, strengthen network data-handling capacity, and improve the flexibility and availability of the network.
S03, establishing table-building associations among the data warehouse tool Hive, the cloud distributed file system hdfs1, and the local distributed file system hdfs2; reconstructing the UDF functions of the data warehouse tool Hive.
S10, after collecting text sub-data with Flume, aggregating it into text data and transmitting the text data; then collecting each piece of text data into a local file pool using the distributed web server framework.
S21, extracting the text data from the local file pool;
Step S21 further comprises: S211, extracting the router MAC address and timestamp from the filename of the text data;
S212, identifying whether the router MAC address and timestamp contain garbled characters;
S213, when the router MAC address or timestamp contains garbled characters, cleaning the garbled characters using Python and then jumping to step S22; otherwise, jumping directly to step S22. Python is an object-oriented, interpreted computer programming language.
S22, accumulating and merging the text data into total text data according to the block size of the cloud distributed file system hdfs1;
S23, uploading the total text data to the Hadoop cloud distributed file system hdfs1 using the local distributed file system hdfs2; Hadoop is a distributed system infrastructure.
S30, the data warehouse tool Hive sends a computation request to the open-source computing framework TEZ;
S31, the open-source computing framework TEZ compresses and encodes the total text data into compressed text data and stores it in the database of the cloud distributed file system hdfs1; here the compression is ORC compression, and the file storage format is ORC.
S41, extracting URL keywords from the compressed text data using the UDF functions of the data warehouse tool Hive, and outputting the type and frequency of the data accessed by each user. For example: 8CAB8EC6CC40 186222539** {"mobile phone":3, "health":27}; 8CAB8EB144A8 138350092** {"automobile":11, "picture":127, "music":26, "recruitment":1, "mobile phone":13, "health":2907, "life":8, "video":4, "shopping":7, "social":84, "live":1}; 8CAB8EC00880 136605272** {"music":1, "mobile phone":4, "health":54, "video":2, "shopping":1, "social":4}.
Specifically, load balancing is abbreviated SLB (Server Load Balancing), and its main algorithms are as follows. Weighted round robin (WRR) algorithm: each server is assigned a weight, which represents its ability to process connections relative to the other servers. For a weight of n, SLB assigns this server n new connections before allocating traffic to the next server.
Weighted least connections (WLC) algorithm: SLB distributes new connections to the real server with the fewest active connections. Each real server is assigned a weight m; the server's capacity for active connections equals m divided by the sum of the weights of all servers. SLB distributes new connections to real servers whose number of active connections is far below their capacity.
When using the weighted least connections (WLC) algorithm, SLB controls access to newly added real servers using a slow-start mode. "Slow start" limits the rate at which new connections are established and allows it to increase gradually, thereby preventing the server from being overloaded.
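As a rough illustration of the two SLB algorithms just described, the following Python sketch implements weighted round robin and the WLC selection rule; the server names, weights, and connection counts are hypothetical, and real SLB devices implement this in the data plane rather than in application code.

```python
from itertools import cycle

def weighted_round_robin(servers):
    """Yield servers in WRR order: a server with weight n receives n
    consecutive new connections before SLB moves to the next server."""
    expanded = [name for name, weight in servers for _ in range(weight)]
    return cycle(expanded)

def weighted_least_connections(servers, active):
    """WLC selection: pick the real server with the lowest ratio of
    active connections to weight, i.e. the one furthest below its
    weighted capacity."""
    return min(servers, key=lambda sw: active[sw[0]] / sw[1])[0]

servers = [("web1", 3), ("web2", 1)]  # hypothetical pool: web1 is 3x as capable
rr = weighted_round_robin(servers)
print([next(rr) for _ in range(8)])   # web1 three times, then web2, repeating
active = {"web1": 5, "web2": 2}
print(weighted_least_connections(servers, active))
```

Slow start, the third mechanism described above, would wrap the WLC pick with a per-server cap on new-connection rate that ramps up over time for freshly added servers.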
The configuration of Namenode HA is as follows:
1.1 Unzip hadoop-2.3.0-cdh5.0.0.tar.gz to /opt/boh/, rename it to hadoop, and modify etc/hadoop/core-site.xml.
1.2 Modify hdfs-site.xml.
1.3 Edit /etc/hadoop/slaves; add hadoop3 and hadoop4.
1.4 Edit /etc/profile; add HADOOP_HOME=/opt/boh/hadoop and PATH=$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$PATH; copy the above configuration to all nodes.
1.5 Start the respective services:
1.5.1 Start journalnode: execute sbin/hadoop-daemon.sh start journalnode on hadoop0, hadoop1, and hadoop2.
1.5.2 Format zookeeper: execute bin/hdfs zkfc -formatZK on hadoop1.
1.5.3 Format and start the hadoop1 node: bin/hdfs namenode -format; sbin/hadoop-daemon.sh start namenode.
1.5.4 Format and start the hadoop2 node: bin/hdfs namenode -bootstrapStandby; sbin/hadoop-daemon.sh start namenode.
1.5.5 Start the zkfc service on hadoop1 and hadoop2: sbin/hadoop-daemon.sh start zkfc; at this point one node of hadoop1 and hadoop2 becomes active.
1.5.6 Start datanode: execute sbin/hadoop-daemons.sh start datanode on hadoop1.
1.5.7 Verify success: open a browser and access hadoop1:50070 and hadoop2:50070; one of the two namenodes is active and the other is standby. Then kill the namenode process of the active one; the standby namenode automatically converts to the active state.
The configuration of ResourceManager HA is as follows:
2.1 Modify mapred-site.xml.
2.2 Modify yarn-site.xml.
2.3 Distribute the configuration files to each node.
2.4 Modify yarn-site.xml on hadoop2.
2.5 Create directories and grant permissions:
2.5.1 Create the local directories.
2.5.2 After starting hdfs, execute the following commands to create the log directory and create /tmp under hdfs. If /tmp is not created with the specified permissions, the other CDH components will have problems; in particular, if it is not created, other processes will automatically create this directory with strict permissions, which affects whether other programs run properly. hadoop fs -mkdir /tmp; hadoop fs -chmod -R 777 /tmp.
2.6 Start yarn and the jobhistory server:
2.6.1 Start on hadoop1: sbin/start-yarn.sh; this script starts the resourcemanager on hadoop1 and all of the nodemanagers.
2.6.2 Start the resourcemanager on hadoop2: yarn-daemon.sh start resourcemanager.
2.6.3 Start the jobhistory server on hadoop2: sbin/mr-jobhistory-daemon.sh start historyserver.
2.7 Verify the configuration: open a browser and access hadoop1:23188 or hadoop2:23188.
As shown in Fig. 2 to Fig. 5, according to yet another embodiment of the present invention, a data extraction method for mass URLs comprises the following steps: S01, building a Hadoop cluster environment and configuring the data warehouse tool Hive, the cloud distributed file system hdfs1, and the local distributed file system hdfs2; Namenode HA and ResourceManager HA are configured.
Step S01 further comprises: S011, building a first predetermined number (for example, 4) of master nodes and a second predetermined number (for example, 7) of slave nodes on Hadoop; the master nodes are interconnected with each other, and each master node is connected to each slave node.
S012, setting up the metadata service component metastore, the relational database mysql, and HiveServer2 on each master node. Through HiveServer2, clients can operate on the data in Hive without starting the CLI; it also allows remote clients to submit requests to Hive and fetch results using various programming languages such as Java and Python. HiveServer2 is based on Thrift; it supports multi-client concurrency and authentication and provides better support for open API clients such as JDBC and ODBC.
S02, building a web server distributed cluster at each node in the cluster environment and adding load balancing. Load balancing is built on the existing network infrastructure; it provides a cheap, effective, and transparent way to extend the bandwidth of network devices and servers, increase throughput, strengthen network data-handling capacity, and improve the flexibility and availability of the network.
S03, establishing table-building associations among the data warehouse tool Hive, the cloud distributed file system hdfs1, and the local distributed file system hdfs2; reconstructing the UDF functions of the data warehouse tool Hive.
S10, after collecting text sub-data with Flume, aggregating it into text data and transmitting the text data; then collecting each piece of text data into a local file pool using the distributed web server framework.
S21, extracting the text data from the local file pool;
Step S21 further comprises: S211, extracting the router MAC address and timestamp from the filename of the text data;
S212, identifying whether the router MAC address and timestamp contain garbled characters;
S213, when the router MAC address or timestamp contains garbled characters, cleaning the garbled characters using Python and then jumping to step S22; otherwise, jumping directly to step S22. Python is an object-oriented, interpreted computer programming language.
S22, accumulating and merging the text data into total text data according to the block size of the cloud distributed file system hdfs1;
S23, uploading the total text data to the Hadoop cloud distributed file system hdfs1 using the local distributed file system hdfs2; Hadoop is a distributed system infrastructure.
S30, the data warehouse tool Hive sends a computation request to the open-source computing framework TEZ;
S31, the open-source computing framework TEZ compresses and encodes the total text data into compressed text data and stores it in the database of the cloud distributed file system hdfs1; here the compression is ORC compression, and the file storage format is ORC.
S41, extracting URL keywords from the compressed text data using the UDF functions of the data warehouse tool Hive, and outputting the type and frequency of the data accessed by each user. For example: 8CAB8EC6CC40 186222539** {"mobile phone":3, "health":27}; 8CAB8EB144A8 138350092** {"automobile":11, "picture":127, "music":26, "recruitment":1, "mobile phone":13, "health":2907, "life":8, "video":4, "shopping":7, "social":84, "live":1}; 8CAB8EC00880 136605272** {"music":1, "mobile phone":4, "health":54, "video":2, "shopping":1, "social":4}.
Specifically, the Master/Slave concept is equivalent to server and agent. The master provides a web interface through which users manage jobs and slaves; a job may run on the master machine itself or be dispatched to run on a slave. One master can be associated with multiple slaves in order to serve different jobs, or different configurations of the same job.
The metastore component of Hive is where the hive metadata is stored. It consists of two parts: the metastore service and the back-end data store. The medium of the back-end store is a relational database, such as hive's default embedded disk database Derby, or a mysql database. The metastore service is built on top of the back-end store and is the service component through which the hive service interacts with the metadata. By default, the metastore service and the hive service are installed together and run in the same process. The metastore service can also be stripped out of the hive service and installed independently in the cluster, with hive calling it remotely. In this way the metadata layer can be placed behind a firewall: clients access only the hive service, which connects to the metadata layer on their behalf, providing better manageability and security. Using a remote metastore service lets the metastore service and the hive service run in different processes, which also ensures the stability of hive and improves the efficiency of the hive service.
As shown in Fig. 6, according to one embodiment of the present invention, a data extraction system for mass URLs includes: Hadoop, on which the cluster environment is built and the data warehouse tool hive, the cloud distributed file system hdfs1 and the local distributed file system hdfs2 are configured; Namenode HA and ResourceManager HA are also configured.
Preferably, a first predetermined number of master nodes (for example, 4) and a second predetermined number of slave nodes (for example, 7) are built on Hadoop; the master nodes are connected with each other, and each master node is connected with each slave node.
On each master node, the metadata service component metastore, the relational database mysql, and HiveServer2 are set up. Through HiveServer2, a client can operate on the data in Hive without starting the CLI; it also allows remote clients to submit requests to hive and fetch results using various programming languages such as Java and Python. HiveServer2 is based on Thrift, supports multi-client concurrency and authentication, and provides better support for open API clients such as JDBC and ODBC.
A distributed web server cluster is built on each node of the cluster environment, and load balancing is added. Load balancing is built on the existing network infrastructure and provides a cheap, effective and transparent way to extend the bandwidth of network devices and servers, increase throughput, strengthen network data-handling capacity, and improve the flexibility and availability of the network.
Table associations are established among the data warehouse tool hive, the cloud distributed file system hdfs1 and the local distributed file system hdfs2; the UDF functions of the data warehouse tool hive are reconstructed.
The distributed web server framework collects the text sub-data with Flume, aggregates it into text data, and transmits the text data; the distributed web server framework then collects each text data into the local file pool.
The distributed web server framework extracts the text data from the local file pool; extracts the router MAC address and the timestamp from the filename of the text data; checks whether the router MAC address and the timestamp contain garbled characters; and, when they do, cleans the garbled characters with Python. Python is an object-oriented, interpreted computer programming language.
The distributed web server framework accumulates and merges the text data into the total text data according to the block size of the cloud distributed file system hdfs1.
The local distributed file system hdfs2 uploads the total text data accumulated in the local file pool to the Hadoop cloud distributed file system hdfs1; Hadoop is a distributed system infrastructure.
The data warehouse tool hive sends a computation request to the open-source computing framework TEZ; the open-source computing framework TEZ compresses and encodes the total text data into compressed text data and stores it in the database of the cloud distributed file system hdfs1; the compression is ORC compression, and the file storage format is ORC.
The data warehouse tool hive extracts the URL keywords from the compressed text data with its UDF functions, and outputs the type and frequency of each user's access data. For example: 8CAB8EC6CC40 186222539** {"mobile phone": 3, "health": 27}; 8CAB8EB144A8 138350092** {"automobile": 11, "picture": 127, "music": 26, "recruitment": 1, "mobile phone": 13, "health": 2907, "life": 8, "video": 4, "shopping": 7, "social": 84, "live": 1}; 8CAB8EC00880 136605272** {"music": 1, "mobile phone": 4, "health": 54, "video": 2, "shopping": 1, "social": 4}.
As shown in Fig. 7 and Fig. 8, according to still another embodiment, a data extraction method for mass URLs includes: building the cluster environment of Hadoop 2.7.1 (deploying 4 masters and 7 slaves) and configuring environments such as HIVE and HDFS (with the Hive metastore, mysql and hiveserver2 set up on a master). Namenode HA and ResourceManager HA are set so that the distributed system meets high availability. A tomcat distributed cluster is built on each node, and load balancing is added. Table associations between hive and hdfs are established, the corresponding hive UDF functions are developed, and the extraction function is tested to work normally.
Through the distributed web server framework, the text data is collected into the local file pool; the files in the file pool are cleaned, extracted and merged, with the merging accumulated according to the HDFS block size; the local HDFS completes the efficient upload of the merged data; the self-developed Hive UDF function getNUM then completes the URL keyword extraction over the data, achieving efficient extraction from mass URL data.
The programs for cleaning, merging, uploading, high-compression encoding and distributed extraction are run, and the results are output for further in-depth analysis. Through the hive UDF functions and the distributed computation of the Hadoop cluster, the extraction computation over the mass data is completed; the type and frequency of each user's access data quickly yield the user's online profile, providing a basis for product recommendations and services for the user.
hive: an open-source Apache technology; a data warehouse software that provides querying and management of large data sets stored in distributed storage, itself built on Apache Hadoop. Hive SQL is an SQL-like language with traditional MapReduce at its core.
The present invention mainly combines the adaptively developed UDF functions of hive, the Python-based cleaning, merging and uploading, and the ORC compression of hive, defining a high-performance data extraction method.
It should be noted that the above embodiments can be freely combined as needed. The above are only the preferred embodiments of the present invention. It should be pointed out that, for those of ordinary skill in the art, several improvements and modifications can be made without departing from the principle of the present invention, and these improvements and modifications should also be regarded as falling within the protection scope of the present invention.

Claims (10)

CN201610970427.3A | Filed 2016-10-28 | Data extraction method and system for mass URLs | Status: Pending | Publication: CN106570153A (en)

Priority Applications (1)

Application Number: CN201610970427.3A | Priority Date: 2016-10-28 | Filing Date: 2016-10-28 | Title: Data extraction method and system for mass URLs


Publications (1)

Publication Number: CN106570153A | Publication Date: 2017-04-19

Family

ID=58541622

Family Applications (1)

Application Number: CN201610970427.3A | Status: Pending | Publication: CN106570153A (en) | Priority Date: 2016-10-28 | Filing Date: 2016-10-28

Country Status (1)

Country: CN | Publication: CN106570153A (en)


Citations (5)

* Cited by examiner, † Cited by third party
Publication numberPriority datePublication dateAssigneeTitle
CN103955502A (en)*2014-04-242014-07-30科技谷(厦门)信息技术有限公司Visualized on-line analytical processing (OLAP) application realizing method and system
CN104111996A (en)*2014-07-072014-10-22山大地纬软件股份有限公司Health insurance outpatient clinic big data extraction system and method based on hadoop platform
CN104301182A (en)*2014-10-222015-01-21赛尔网络有限公司Method and device for inquiring slow website access abnormal information
CN105512336A (en)*2015-12-292016-04-20中国建设银行股份有限公司Method and device for mass data processing based on Hadoop
CN105677842A (en)*2016-01-052016-06-15北京汇商融通信息技术有限公司Log analysis system based on Hadoop big data processing technique


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
CHEN476328361: "tez: controlling whether output files are compressed and specifying the file name", 《HTTPS://BLOG.CSDN.NET/THINKING2013/ARTICLE/DETAILS/48133137》*
XIAOJUN_0820: "Learning flume: writing log4j log data into hdfs with flume", 《HTTPS://BLOG.CSDN.NET/XIAO_JUN_0820/ARTICLE/DETAILS/38110323》*

Cited By (8)

* Cited by examiner, † Cited by third party
Publication numberPriority datePublication dateAssigneeTitle
CN107145542A (en)*2017-04-252017-09-08上海斐讯数据通信技术有限公司The high efficiency extraction subscription client ID method and system from URL
CN107193903A (en)*2017-05-112017-09-22上海斐讯数据通信技术有限公司The method and system of efficient process IP address zone location
CN107256206A (en)*2017-05-242017-10-17北京京东尚科信息技术有限公司The method and apparatus of character stream format conversion
CN107256206B (en)*2017-05-242021-04-30北京京东尚科信息技术有限公司Method and device for converting character stream format
CN108133050A (en)*2018-01-172018-06-08北京网信云服信息科技有限公司A kind of extracting method of data, system and device
CN111427884A (en)*2020-03-032020-07-17中国平安人寿保险股份有限公司 Form data processing method, device, electronic device and storage medium
CN111935215A (en)*2020-06-292020-11-13广东科徕尼智能科技有限公司Internet of things data management method, terminal, system and storage device
CN114138724A (en)*2021-11-262022-03-04浪潮软件科技有限公司 A file uploading method, device and computer readable medium


Legal Events

PB01 — Publication (application publication date: 2017-04-19)
SE01 — Entry into force of request for substantive examination
RJ01 — Rejection of invention patent application after publication

