CN105139281A

Info

Publication number: CN105139281A
Application number: CN201510515456.6A
Authority: CN
Inventors: 刘建; 赵加奎; 唐文升; 方学民; 王锦志; 欧阳红; 方红旺; 刘玉玺; 高士杰
Original assignee: State Grid Information and Telecommunication Group Co Ltd; Beijing China Power Information Technology Co Ltd; State Grid Corp of China SGCC
Current assignee: State Grid Information and Telecommunication Group Co Ltd; Beijing China Power Information Technology Co Ltd; State Grid Corp of China SGCC
Priority date: 2015-08-20
Filing date: 2015-08-20
Publication date: 2015-12-09

Abstract

This application discloses a method and system for processing electric power marketing big data. The method includes: importing structured business data and external data from the database of the marketing basic data platform into the Hive database of the Hadoop ecosystem, where the external data is data collected from external websites; merging small files of unstructured marketing basic data into one large HAR file and storing it in the distributed file system HDFS; and using the MapReduce programming framework to preprocess the data in the Hive database and HDFS, then analyzing and mining the preprocessed data. The method addresses both the storage of massive structured and unstructured electric power marketing data and the computation and analysis of that data. It also makes full use of external data such as weather and news when analyzing and mining electric power marketing big data, meeting the needs of intelligent marketing management and decision support.

Description

Translated from Chinese

A method and system for processing electric power marketing big data

Technical Field

This application relates to the technical field of data processing, and more specifically to a method and system for processing electric power marketing big data.

Background

With the steady progress of power grid marketing informatization and marketing automation, the marketing field has accumulated massive amounts of data. In particular, data related to business expansion, metering, electricity billing, and customer service is large in volume, varied in type, highly real-time, and of great analytical value.

At present, the marketing basic data platform stores the massive structured business data integrated from the various business systems in the platform's Oracle database, supporting the sharing of basic business data and the integration and interaction of business processes. Because the platform stores data in a relational database, it cannot store unstructured data such as customer service recordings and system logs, so big data analysis projects such as voice mining and log mining cannot be carried out on it. In addition, the platform integrates and stores only business data and has no external data such as weather, social events, and economic development. Such external data has a major impact on the power industry, and without it the results of the related data analysis and mining lose their meaning. Finally, the platform performs statistical analysis with database software and BI tools and was not designed for complex analysis and mining of massive data, so it does not provide the computing platform that complex computation over massive data requires.

Summary of the Invention

In view of this, the present application provides a method and system for processing electric power marketing big data, offering a scheme that fuses unstructured data and external data to analyze and mine electric power marketing big data.

To achieve the above purpose, the following scheme is proposed:

A method for processing electric power marketing big data, comprising:

importing structured business data and external data from the database of the marketing basic data platform into the Hive database of the Hadoop ecosystem, the external data being external data, including weather and news, collected from external websites by web crawler technology;

merging small files of unstructured marketing basic data into one large HAR file and storing it in HDFS, the distributed file system of the Hadoop ecosystem;

using MapReduce, the programming framework of the Hadoop ecosystem, to preprocess the data in the Hive database and HDFS, analyzing and mining the preprocessed data, and saving the analysis and mining results.

Preferably, importing the structured business data and external data from the database of the marketing basic data platform into the Hive database of the Hadoop ecosystem is specifically:

using the Sqoop tool to import the structured business data and external data from the database of the marketing basic data platform into the Hive database of the Hadoop ecosystem.

Preferably, merging the small files of unstructured marketing basic data into one large HAR file is specifically:

using Hadoop Archive file archiving technology to merge the small files of unstructured marketing basic data into one large HAR file.

Preferably, using MapReduce, the programming framework of the Hadoop ecosystem, to preprocess the data in the Hive database and HDFS comprises:

using MapReduce, the programming framework of the Hadoop ecosystem, to perform data cleaning, data transformation, and data reduction on the data in the Hive database and HDFS.

A system for processing electric power marketing big data, comprising a marketing basic data platform and a Hadoop ecosystem, wherein:

the Hadoop ecosystem imports structured business data and external data from the database of the marketing basic data platform into the Hive database of the Hadoop ecosystem, the external data being external data, including weather and news, collected from external websites by web crawler technology;

the Hadoop ecosystem merges small files of unstructured marketing basic data into one large HAR file and stores it in HDFS, the distributed file system of the Hadoop ecosystem;

the Hadoop ecosystem uses the MapReduce programming framework to preprocess the data in the Hive database and HDFS, analyzes and mines the preprocessed data, and saves the analysis and mining results.

Preferably, the process by which the Hadoop ecosystem imports the structured business data and external data from the database of the marketing basic data platform into the Hive database of the Hadoop ecosystem is specifically:

using the Sqoop tool of the Hadoop ecosystem to import the structured business data and external data from the database of the marketing basic data platform into the Hive database of the Hadoop ecosystem.

Preferably, the process by which the Hadoop ecosystem merges the small files of unstructured marketing basic data into one large HAR file is specifically:

the Hadoop ecosystem uses Hadoop Archive file archiving technology to merge the small files of unstructured marketing basic data into one large HAR file.

Preferably, the process by which the Hadoop ecosystem uses the MapReduce programming framework to preprocess the data in the Hive database and HDFS is specifically: performing data cleaning, data transformation, and data reduction on that data, as in the corresponding preferred step of the method.

As the above technical scheme shows, the method for processing electric power marketing big data provided by the embodiments of this application first imports the structured business data and external data from the database of the marketing basic data platform into the Hive database of the Hadoop ecosystem, where the external data is external data, including weather and news, collected from external websites by web crawler technology. It then merges the small files of unstructured marketing basic data into one large HAR file and stores it in HDFS, the distributed file system of the Hadoop ecosystem. Finally, it uses MapReduce, the programming framework of the Hadoop ecosystem, to preprocess the data in the Hive database and HDFS, analyzes and mines the preprocessed data, and saves the analysis and mining results. By fusing the marketing basic data platform with the Hadoop ecosystem, the proposed method addresses both the storage of massive structured and unstructured electric power marketing data and the computation and analysis of that data. It also makes full use of external data such as weather and news when analyzing and mining electric power marketing big data, meeting the needs of intelligent marketing management and decision support.

Brief Description of the Drawings

To explain the technical schemes in the embodiments of this application or in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below are merely embodiments of this application; those of ordinary skill in the art can obtain other drawings from the provided drawings without creative effort.

Figure 1 is an architecture diagram of the commonly used components of the existing Hadoop ecosystem;

Figure 2 is a flow chart of a method for processing electric power marketing big data disclosed in an embodiment of this application;

Figure 3 is a schematic diagram of external data collection and storage disclosed in an embodiment of this application;

Figure 4 is a schematic diagram of the process of importing structured data into a Hadoop cluster disclosed in an embodiment of this application;

Figure 5 is a schematic diagram of the process of importing unstructured data into a Hadoop cluster disclosed in an embodiment of this application;

Figure 6 is a schematic structural diagram of a system for processing electric power marketing big data disclosed in an embodiment of this application.

Detailed Description

The technical schemes in the embodiments of this application are described clearly and completely below with reference to the drawings in the embodiments. The described embodiments are only some of the embodiments of this application, not all of them. All other embodiments obtained from the embodiments in this application by those of ordinary skill in the art without creative effort fall within the scope of protection of this application.

With the steady progress of power grid marketing informatization and marketing automation, the marketing field has accumulated massive amounts of data. In particular, data related to business expansion, metering, electricity billing, and customer service is large in volume, varied in type, highly real-time, and of great analytical value. At present, the marketing business application system and the metering production scheduling platform store structured data such as records, business expansion, metering, and electricity billing for about 360 million customers, growing by about 50 TB per year. The electricity consumption information collection system stores structured data such as active energy, reactive energy, and power quality for about 200 million customers, growing by about 500 TB per year. Within the core information systems of the customer service center, the 95598 call platform stores structured data and voice data for about 300,000 customer service calls per day, growing by about 10 TB per year; the 95598 intelligent interactive website stores structured data such as basic information and site visit records for about 1.8 million website customers, growing by about 500 GB per year; the 95598 business support system stores structured data such as basic information and routing information for about 250,000 customer service work orders per day, growing by about 2 TB per year; and the 95598 operation management system stores structured data mainly concerning internal management of the customer service center, growing by about 50 GB per year. Once the centralization of all 95598 customer service business is complete, the annual data growth of the customer service center's core information systems will rise substantially.

The State Grid Corporation of China has developed and is gradually rolling out marketing management and decision support systems such as the marketing analysis and decision support system, the headquarters marketing business management and control platform, and the provincial (municipal) marketing business management platforms, which support marketing management and decision-making to a certain extent. However, these systems still face the following three major problems:

(1) The data used by the marketing management and decision support systems comes from different business systems. Each business system stores its data separately and exposes a different external access interface; data is stored redundantly and inconsistently across business systems; and even the same business system deployed at headquarters and in the various provinces (autonomous regions, centrally administered municipalities) uses inconsistent data formats. As a direct result, each marketing management and decision support system has to fetch data from several business systems at once and may also have to integrate the fetched data to make its format consistent.

(2) The marketing management and decision support systems are based mainly on data warehouse, online analytical processing (OLAP), expert base, and inference engine technologies. Data warehouse and OLAP technologies analyze data statistically along different dimensions and granularities by rolling up, drilling down, slicing, and dicing the data cube; expert base and inference engine technologies perform logical reasoning from expert experience. From the standpoint of intelligent decision-making, both are relatively low-level forms of intelligence: neither deeply processes and analyzes the massive marketing data to discover the hidden "knowledge" that simple statistical analysis and expert experience cannot find, nor uses such "knowledge" to support intelligent decisions. As a direct result, current marketing management and decision support is inefficient and has a low level of intelligence.

(3) Although the power grid marketing field has accumulated massive structured and unstructured data, the marketing management and decision support systems are constrained by their data analysis methods and technologies: when analyzing and deciding, they select only part of the data from the data warehouse holding the massive data for statistical analysis or logical reasoning. On the one hand, the amount of data used in each analysis is far less than the amount the systems have accumulated that could be used for that analysis. On the other hand, much of the accumulated data is never used for analysis and decision-making at all, especially the unstructured data accumulated by the systems, such as the content of accepted work orders and recording data. As a direct result, the massive data that the systems spend resources to collect, integrate, and store is poorly utilized.

To address scattered storage, redundant storage, inconsistent data, and inconsistent storage formats, the company has developed the marketing basic data platform, which collects and integrates data from the various business systems. However, the platform stores structured data in a relational database; it neither integrates and stores unstructured data such as customer service voice recordings and system operation logs nor provides a computing platform for analyzing and mining massive data.

Therefore, this application proposes a method for collecting, integrating, storing, and processing big data for intelligent electric power marketing that fuses the marketing basic data platform with the Hadoop ecosystem. Building on the data collection and integration completed by the marketing basic data platform, it uses Hadoop's HDFS for distributed storage of massive business data and Hadoop's MapReduce computing framework for distributed processing, analysis, and mining of that data. This provides the data storage and computing environment needed for analyzing and mining massive data, so that the data can be used effectively, the efficiency and intelligence of marketing management and decision support can be raised, the improvement and process re-engineering of marketing-related businesses such as customer service, electricity billing, and economic analysis can be supported, and the service capability of electric power marketing can be strengthened.

Before introducing the scheme of this application, the terms and concepts used in this document are explained. The architecture of the Hadoop ecosystem is shown in Figure 1.

Hadoop: Apache Hadoop is an open-source software framework that supports data-intensive distributed applications and is released under the Apache 2.0 license. It supports applications running on large clusters built from commodity hardware. Hadoop was implemented independently on the basis of the MapReduce and Google File System papers published by Google. The Apache Hadoop "platform" is now generally considered to comprise the Hadoop kernel, MapReduce, the Hadoop Distributed File System (HDFS), and a number of related projects such as Apache Hive and Apache HBase.

HDFS: HDFS (Hadoop Distributed File System) is a core subproject of the Hadoop project and the foundation of data storage management in distributed computing. It was developed to meet the need to access and process very large files in a streaming data pattern, and it can run on inexpensive commodity servers. Its high fault tolerance, reliability, scalability, availability, and throughput give massive data failure-tolerant storage and greatly simplify application processing of large data sets.

MapReduce: MapReduce is a programming model for parallel computation over large data sets (larger than 1 TB). The concepts "Map" and "Reduce", and their main ideas, are borrowed from functional programming languages, along with features borrowed from vector programming languages. The model makes it much easier for programmers with no experience in distributed parallel programming to run their programs on a distributed system. In current implementations, the programmer specifies a Map function, which maps a set of key-value pairs to a new set of key-value pairs, and a concurrent Reduce function, which is applied to each group of mapped key-value pairs that share the same key.
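The Map and Reduce roles described above can be illustrated with a minimal single-process sketch. It is not Hadoop code and does not reflect the applicant's implementation; the record layout (customer number, kWh reading) and all function names are assumptions chosen for illustration.

```python
from collections import defaultdict


def map_reading(record):
    """Map: turn one (customer_number, kwh) reading into a key-value pair."""
    customer_number, kwh = record
    yield customer_number, kwh


def reduce_usage(customer_number, values):
    """Reduce: combine all values that share the same key."""
    return customer_number, sum(values)


def run_mapreduce(records, mapper, reducer):
    """Run map, group by key (the shuffle), then reduce, all in one process."""
    grouped = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            grouped[key].append(value)
    return dict(reducer(key, values) for key, values in sorted(grouped.items()))


# Hypothetical meter readings: (customer number, kWh).
readings = [("0000000001", 12.5), ("0000000002", 3.0), ("0000000001", 7.5)]
totals = run_mapreduce(readings, map_reading, reduce_usage)
```

In a real cluster the grouping step is performed by the framework across machines; here it is a dictionary, which is enough to show that the Reduce function only ever sees values sharing one key.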

Hive: Hive is a data warehouse tool built on Hadoop. It maps structured data files to database tables, provides simple SQL query capability, and converts SQL statements into MapReduce jobs for execution. Its advantage is a low learning cost: simple MapReduce statistics can be implemented quickly with SQL-like statements, without developing dedicated MapReduce applications, which makes it well suited to statistical analysis in a data warehouse.

Sqoop: Sqoop is an open-source tool used mainly to transfer data between Hadoop and traditional databases (MySQL, Oracle, and so on). It can import data from a relational database into Hadoop's HDFS, and it can export data from HDFS into a relational database.

HBase: HBase is a distributed, column-oriented open-source database. The technology derives from the Google paper "Bigtable: A Distributed Storage System for Structured Data" by Fay Chang et al.; HBase provides Bigtable-like capabilities on top of Hadoop. HBase is a subproject of Apache's Hadoop project and is a database suited to storing unstructured data. HBase tables can serve as the input and output of MapReduce jobs, and data can be accessed through the Java API or through the REST, Avro, or Thrift APIs.

Spark: Spark is an in-memory distributed computing framework and programming model that executes job flows as DAGs. Its core data structure is the resilient distributed dataset (RDD). In Spark, a driver program is written as a series of RDD transformations together with the associated actions. Another advantage of Spark is that a set of RDDs can easily be shared with other Spark projects. Because RDDs are used throughout the Spark software stack, Spark SQL, machine learning, stream computing, and graph computing can easily be combined in one program.

Mahout: Mahout is an open-source project of the Apache Software Foundation (ASF) that provides scalable implementations of classic machine learning algorithms, with the aim of helping developers build intelligent applications more easily and quickly. Mahout includes many implementations, covering clustering, classification, recommendation filtering, and frequent itemset mining. Its greatest advantage is that it is implemented on Hadoop: many algorithms that previously ran on a single machine have been converted to the MapReduce model, which greatly increases the amount of data the algorithms can handle and their processing performance.

The marketing basic data platform is a platform for serving, analyzing, and using full-business, full-detail, and summary marketing data. It covers full-business data including basic customer records, information collection data, electric energy metering data, electricity fee collection data, 95598 customer service data, smart electricity consumption data such as electric vehicles and distributed energy, and energy efficiency management data.

Figure 2 is a flow chart of a method for processing electric power marketing big data disclosed in an embodiment of this application.

As shown in Figure 2, the method includes:

Step S200: import the structured business data and external data from the database of the marketing basic data platform into the Hive database of the Hadoop ecosystem;

Here, the external data is external data, including weather and news, collected from external websites by web crawler technology. When analyzing and mining big data in the power industry, external factors such as weather, social events, and economic development cannot be ignored. Web crawler technology can therefore be used to collect external data periodically from relevant websites, such as weather websites, news websites, and statistics bureau websites, and import it into the Hive database of the Hadoop ecosystem. The Sqoop tool can be used to import the structured data and the external data, as shown in Figure 3 and Figure 4.
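As a rough illustration of the Sqoop-based import in step S200, the sketch below only assembles a `sqoop import` command line and does not execute it. The JDBC URL, user name, password file path, and table names are hypothetical placeholders, not values from this application; the options used (`--connect`, `--username`, `--password-file`, `--table`, `--hive-import`, `--hive-table`, `--num-mappers`) are standard Sqoop import options.

```python
import shlex


def build_sqoop_import(jdbc_url, username, password_file,
                       source_table, hive_table, num_mappers=4):
    """Assemble a `sqoop import` command that loads one table into Hive."""
    return [
        "sqoop", "import",
        "--connect", jdbc_url,             # source relational database
        "--username", username,
        "--password-file", password_file,  # avoids a password on the command line
        "--table", source_table,
        "--hive-import",                   # load the imported data into Hive
        "--hive-table", hive_table,
        "--num-mappers", str(num_mappers),
    ]


# All values below are hypothetical examples.
command = build_sqoop_import(
    jdbc_url="jdbc:oracle:thin:@//marketing-db.example:1521/MKT",
    username="etl_user",
    password_file="/user/etl/.sqoop_password",
    source_table="CUSTOMER_PROFILE",
    hive_table="marketing.customer_profile",
)
command_line = shlex.join(command)  # printable form for a job script or log
```

The same builder could be called once per source table, including the tables that hold the crawled weather and news data, assuming those have already been stored in the platform database.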

Step S210: merge the small files of unstructured marketing basic data into one large HAR file and store it in HDFS, the distributed file system of the Hadoop ecosystem;

Specifically, besides massive structured business data, the business systems also generate massive unstructured data. For example, the 95598 call platform adds about 300,000 customer service recording files every day, which amounts to more than a hundred million recording files per year. Unstructured data such as the recording files and system log files generated by the business systems usually consists of fairly small files, and HDFS, the distributed file system of the Hadoop ecosystem, is not suited to storing massive numbers of small files directly. The many small files can therefore be merged into one large HAR file before being stored in HDFS, and the packed recording files can still be read directly from the HAR file.
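The scale mentioned above can be checked with simple arithmetic. The sketch assumes one archive per day, which is one possible grouping and is not mandated by the text.

```python
FILES_PER_DAY = 300_000   # new recording files per day, from the text
DAYS_PER_YEAR = 365

# Small files per year if every recording is stored on its own.
files_per_year = FILES_PER_DAY * DAYS_PER_YEAR

# Archives per year under the assumed one-archive-per-day grouping.
archives_per_year = DAYS_PER_YEAR

reduction_factor = files_per_year // archives_per_year
```

Here `files_per_year` comes to 109,500,000, which matches the "more than a hundred million" figure, while daily archiving would leave only 365 top-level archives to manage.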

Optionally, Hadoop Archive file archiving technology can be used to merge the small files.

For example, customer service recording files follow a naming convention in which each day's file names are prefixed with that day's date. The archive of each day's recording files can therefore be given a name containing that date. When a particular recording file is looked up, its file name identifies the HAR file that contains it, and the required recording file is then read from that HAR file. In addition, metadata such as the recording duration and the number of recording files can be added to the names of the recording files and the HAR files, respectively.
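The date-based lookup described above can be sketched as follows. The file name pattern (an eight-digit `YYYYMMDD` prefix), the `recordings_` archive prefix, and the HDFS directory are assumptions made for illustration; the sketch also assumes each archive is built so that recordings sit at its top level. Only the `har://` URI scheme itself comes from Hadoop Archives.

```python
import re
from datetime import datetime

# Assumption: recording names start with the date as YYYYMMDD.
DATE_PREFIX = re.compile(r"^([0-9]{8})")


def archive_name_for(recording_name):
    """Derive the daily HAR file name from a recording file name."""
    match = DATE_PREFIX.match(recording_name)
    if match is None:
        raise ValueError(f"no date prefix in {recording_name!r}")
    date_str = match.group(1)
    datetime.strptime(date_str, "%Y%m%d")  # raises ValueError for an invalid date
    return f"recordings_{date_str}.har"


def har_uri_for(recording_name, archive_dir="/marketing/recordings/archive"):
    """Build the har:// URI under which the recording can be read."""
    return f"har://{archive_dir}/{archive_name_for(recording_name)}/{recording_name}"


uri = har_uri_for("20150820_0001.wav")
```

Because the archive name is a pure function of the file name, no separate index is needed to find the HAR file holding a given recording.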

Because the Hive database of the Hadoop ecosystem can store only structured data, unstructured data must be imported into HDFS. The import process is shown in Figure 5.

Step S220: use MapReduce, the programming framework of the Hadoop ecosystem, to preprocess the data in the Hive database and HDFS, analyze and mine the preprocessed data, and save the analysis and mining results.

The method for processing electric power marketing big data provided by the embodiments of this application first imports the structured business data and external data from the database of the marketing basic data platform into the Hive database of the Hadoop ecosystem, where the external data is external data, including weather and news, collected from external websites by web crawler technology. It then merges the small files of unstructured marketing basic data into one large HAR file and stores it in HDFS, the distributed file system of the Hadoop ecosystem. Finally, it uses MapReduce, the programming framework of the Hadoop ecosystem, to preprocess the data in the Hive database and HDFS, analyzes and mines the preprocessed data, and saves the analysis and mining results. By fusing the marketing basic data platform with the Hadoop ecosystem, the proposed method addresses both the storage of massive structured and unstructured electric power marketing data and the computation and analysis of that data. It also makes full use of external data such as weather and news when analyzing and mining electric power marketing big data, meeting the needs of intelligent marketing management and decision support.

Step S220 above includes a data preprocessing stage, which may include operations such as data cleaning, data transformation, and data reduction.

Data cleaning can be understood as removing non-conforming data. For example, if customer numbers in the customer file are specified as 10 digits, any imported record whose customer number is not 10 digits long must be removed.
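
The cleaning rule in this example could be sketched as follows. This is only an illustrative, in-memory sketch and not the implementation of the application: the record layout, the field name `customer_no`, and the function names are assumptions introduced here, and in practice the same check would run inside a MapReduce job over the Hive or HDFS data.

```python
import re

# Assumption: a valid customer number is exactly 10 ASCII digits,
# as in the example given in the text.
CUSTOMER_NO_PATTERN = re.compile(r"[0-9]{10}")


def is_valid_customer_no(value):
    """Return True only for a string made of exactly 10 digits."""
    return isinstance(value, str) and CUSTOMER_NO_PATTERN.fullmatch(value) is not None


def clean_records(records):
    """Keep only the records whose (hypothetical) 'customer_no' field is valid."""
    return [r for r in records if is_valid_customer_no(r.get("customer_no"))]
```

Records with a missing, too short, or too long customer number are dropped, and the remaining records are passed on unchanged.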

Data transformation can be understood as unifying data formats. For example, in the data imported into the Hadoop ecosystem, the number of electricity customers of a given power supply unit may be expressed in households in some records and in units of ten thousand households in others. Data transformation then consists of converting these counts to a single unit.
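
A minimal sketch of this unit unification is given below. The field names `count` and `unit` and the unit labels `"household"` and `"10k_households"` are hypothetical and chosen only for illustration; the application does not specify how the unit is recorded.

```python
# Assumption: each record stores a numeric count and a unit label.
# The labels and conversion factors below are illustrative only.
UNIT_FACTORS = {
    "household": 1,
    "10k_households": 10_000,
}


def normalize_household_count(record):
    """Return a copy of the record with the count expressed in households."""
    factor = UNIT_FACTORS[record["unit"]]
    normalized = dict(record)  # leave the input record untouched
    normalized["count"] = record["count"] * factor
    normalized["unit"] = "household"
    return normalized
```

A record with an unknown unit label raises a `KeyError`, which makes unexpected formats visible instead of silently passing them through.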

Data reduction can be understood as deleting parameters that can be derived from others. For example, suppose the imported data contains an electricity customer's electricity price, electricity consumption, and electricity charge. Knowing any two of these three parameters is enough to derive the third from existing knowledge, so one of them can be deleted.
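
The reduction in this example could look like the sketch below. It assumes, purely for illustration, that the three quantities are related by charge = price × consumption (a single flat price with no tiered tariff or surcharges); the text only says the third parameter can be derived from existing knowledge. The field names `price`, `quantity`, and `fee` are likewise hypothetical.

```python
def drop_derivable_fee(record):
    """Return a copy of the record without the derivable 'fee' field."""
    reduced = dict(record)
    reduced.pop("fee", None)
    return reduced


def derive_fee(record):
    """Recompute the charge from price and consumption.

    Assumption: fee = price * quantity (flat price, no tiers or surcharges).
    """
    return record["price"] * record["quantity"]
```

Dropping the charge reduces the stored data by one column, and `derive_fee` recovers it on demand under the stated assumption.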

Only preprocessed data can be used for the subsequent data analysis and mining.

It should be noted that Mahout, a component of the Hadoop ecosystem, may provide some commonly used analysis algorithms, such as clustering and classification. If Mahout has an algorithm that corresponds to the required data processing operation, that algorithm can be called directly for analysis and mining. If the required algorithm does not exist in Mahout, distributed analysis and mining of the data can be implemented by writing a MapReduce program.
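
To illustrate what writing a MapReduce program involves, the sketch below imitates the map, shuffle, and reduce phases in a single Python process. It does not use the Hadoop API and is not the program of the application; the analysis task (total electricity consumption per power supply unit) and the field names `supply_unit` and `quantity` are assumptions chosen for illustration.

```python
from collections import defaultdict


def map_record(record):
    """Map phase: emit (key, value) pairs, here consumption keyed by supply unit."""
    yield record["supply_unit"], record["quantity"]


def reduce_values(key, values):
    """Reduce phase: combine all values that share one key."""
    return key, sum(values)


def run_mapreduce(records):
    """Run map, group the pairs by key (shuffle), then reduce each group."""
    groups = defaultdict(list)
    for record in records:
        for key, value in map_record(record):
            groups[key].append(value)
    return dict(reduce_values(key, values) for key, values in groups.items())
```

On a real cluster the framework performs the grouping step and distributes the map and reduce calls across nodes; only the two functions carrying the analysis logic need to be written.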

The method for processing electric power marketing big data provided by this application integrates the marketing basic data platform with the Hadoop ecosystem. Structured business data from the business systems of the State Grid headquarters and of the provinces (autonomous regions and municipalities), together with external data such as weather, social events, and economic development, is integrated and stored in the marketing basic data platform, and the Sqoop tool is then used to import this data into the Hive data warehouse of the Hadoop cluster. Unstructured data such as voice data and system logs is taken directly from the business systems of the headquarters and the provinces (autonomous regions and municipalities) and saved to the HDFS of the Hadoop cluster. Once data preparation is complete, Hadoop's MapReduce framework is used for distributed processing, analysis, and mining of the massive data. The results of data analysis and mining are used to support marketing management and decision-making.

The system for processing electric power marketing big data provided by an embodiment of the present application is described below. The system described below and the method described above may be read with reference to each other.

Referring to FIG. 6, FIG. 6 is a schematic structural diagram of a system for processing electric power marketing big data disclosed in an embodiment of the present application.

As shown in FIG. 6, the system includes:

a marketing basic data platform 61 and a Hadoop ecosystem 62;

the Hadoop ecosystem 62 imports the structured business data and external data in the database of the marketing basic data platform 61 into the Hive database of the Hadoop ecosystem 62;

Here, the external data is data, including weather and news, collected from external websites by web crawler technology. When analyzing and mining big data in the electric power industry, external factors such as weather, social events, and economic development cannot be ignored. For this reason, web crawler technology can be used to collect external data periodically from relevant websites such as weather websites, news websites, and statistics bureau websites, and to import it into the Hive database of the Hadoop ecosystem.

the Hadoop ecosystem 62 merges the small files of unstructured marketing basic data into one large HAR file and stores it in HDFS, the distributed file system of the Hadoop ecosystem;

the Hadoop ecosystem 62 uses the MapReduce programming and computing framework to preprocess the data in the Hive database and HDFS, analyzes and mines the preprocessed data, and saves the analysis and mining results.

Optionally, the process by which the Hadoop ecosystem 62 imports the structured business data and external data in the database of the marketing basic data platform 61 into the Hive database of the Hadoop ecosystem may specifically be:

Optionally, the process by which the Hadoop ecosystem 62 uses the MapReduce programming and computing framework to preprocess the data in the Hive database and HDFS may specifically be:

Here, data cleaning can be understood as removing non-conforming data. For example, if customer numbers in the customer file are specified as 10 digits, any imported record whose customer number is not 10 digits long must be removed.

The system for processing electric power marketing big data provided by this embodiment of the present application imports the structured business data and external data in the database of the marketing basic data platform into the Hive database of the Hadoop ecosystem, where the external data is data, including weather and news, collected from external websites by web crawler technology. It then merges the small files of unstructured marketing basic data into one large HAR file and stores that file in HDFS, the distributed file system of the Hadoop ecosystem. Finally, it uses MapReduce, the programming and computing framework of the Hadoop ecosystem, to preprocess the data in the Hive database and HDFS, analyzes and mines the preprocessed data, and saves the analysis and mining results. The processing approach proposed in this application, which integrates the marketing basic data platform with the Hadoop ecosystem, solves the problems of storing massive structured and unstructured electric power marketing data and of computing over and analyzing massive data. It also makes full use of external data such as weather and news to analyze and mine electric power marketing big data, meeting the needs of intelligent marketing management and decision support.

Finally, it should also be noted that, in this document, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between those entities or operations. Furthermore, the terms "comprise", "include", and any other variants thereof are intended to cover a non-exclusive inclusion, so that a process, method, article, or device that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or device. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of additional identical elements in the process, method, article, or device that comprises the element.

The embodiments in this specification are described in a progressive manner. Each embodiment focuses on its differences from the other embodiments, and for the parts that are the same or similar the embodiments may be read with reference to one another.

The above description of the disclosed embodiments enables a person skilled in the art to implement or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of the present application. Therefore, the present application is not limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.