Embulk, an open-source plugin-based parallel bulk data loader

The document discusses Embulk, an open-source, plugin-based parallel bulk data loader. Embulk loads records from various sources ("A") to various targets ("B") using plugins for each kind of source and target, turning painful data integration work into something more relaxed. Embulk executes in parallel, validates data, handles errors, behaves deterministically, and allows idempotent retries of bulk loads.

Embulk
An open-source plugin-based parallel bulk data loader that makes painful data integration work relaxed.
Sharing our knowledge on RubyGems to manage arbitrary files.

Sadayuki Furuhashi
Founder & Software Architect, Treasure Data, inc.
A little about me...
> Sadayuki Furuhashi
> github/twitter: @frsyuki
> Treasure Data, Inc.
  > Founder & Software Architect
> Open-source hacker
  > MessagePack - Efficient object serializer
  > Fluentd - A unified data collection tool
  > Prestogres - PostgreSQL protocol gateway for Presto
  > Embulk - A plugin-based parallel bulk data loader
  > ServerEngine - A Ruby framework to build multiprocess servers
  > LS4 - A distributed object storage with cross-region replication
  > kumofs - A distributed strong-consistent key-value data store
Today’s talk
> What’s Embulk?
> How Embulk works
> The architecture
> Writing Embulk plugins
> Roadmap & Development
> Q&A + Discussion
What’s Embulk?
> An open-source parallel bulk data loader
> using plugins
> to make data integration relaxed.
What’s Embulk?
> An open-source parallel bulk data loader
> loads records from “A” to “B”
> using plugins
> for various kinds of “A” and “B”
> to make data integration relaxed, which was very painful:
  broken records, transactions (idempotency), performance, …
(“A” and “B”: storage, RDBMS, NoSQL, cloud services, etc.)
The pains of bulk data loading
Example: load a 10GB CSV file to PostgreSQL
> 1. First attempt → fails
> 2. Write a script to clean the records
  • Convert “20150127T190500Z” → “2015-01-27 19:05:00 UTC”
  • Convert “\N” → “”
  • many cleanings…
> 3. Second attempt → another error
  • Convert “Inf” → “Infinity”
> 4. Fix the script, retry, retry, retry…
> 5. Oh, some data got loaded twice!?
The pains of bulk data loading
Example: load a 10GB CSV file to PostgreSQL
> 6. OK, the script worked.
> 7. Register it to cron to sync data every day.
> 8. One day… it fails with another error
  • Convert invalid UTF-8 byte sequences to U+FFFD
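The cleanup steps in the example above can be sketched in plain Ruby. The helper names below are made up for this illustration and are not part of Embulk or any library:

```ruby
require 'time'

# Illustrative cleanup helpers for the conversions in the slides above.
# The function names are hypothetical, invented for this sketch.
def normalize_timestamp(value)
  # "20150127T190500Z" -> "2015-01-27 19:05:00 UTC"
  Time.strptime(value, '%Y%m%dT%H%M%S%z').utc.strftime('%Y-%m-%d %H:%M:%S UTC')
end

def normalize_value(value)
  case value
  when '\N'  then ''          # NULL marker commonly seen in database dumps
  when 'Inf' then 'Infinity'  # PostgreSQL spells infinity out
  else value
  end
end

def normalize_utf8(value)
  # Replace invalid UTF-8 byte sequences with U+FFFD
  value.scrub("\u{FFFD}")
end
```

Each broken file tends to need another helper like these, which is exactly the effort Embulk packages into reusable plugins.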
The pains of bulk data loading
Example: load 10GB CSV × 720 files
> Most scripts are slow.
  • People have little time to optimize bulk load scripts
> One file takes 1 hour → 720 files take 1 month (!?)
A lot of integration effort for each storage format and service:
> XML, JSON, Apache log format (+ some custom), …
> SAM, BED, BAI2, HDF5, TDE, SequenceFile, RCFile, …
> MongoDB, Elasticsearch, Redshift, Salesforce, …
The problems:
> Data cleaning (normalization)
  > How to normalize broken records?
> Error handling
  > How to remove broken records?
> Idempotent retrying
  > How to retry without duplicated loading?
> Performance optimization
  > How to optimize the code or parallelize it?
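One common remedy for the idempotent-retry problem is to record which inputs were committed, atomically with the load itself, so a retried run skips anything already loaded. The class below is a hypothetical sketch of that idea, not Embulk's implementation:

```ruby
require 'set'

# Hypothetical sketch of idempotent retrying: remember which files were
# committed so a retried run skips them. In a real loader the commit
# marker must be persisted in the same transaction as the loaded data.
class IdempotentLoader
  def initialize
    @committed = Set.new
  end

  def load(path)
    return :skipped if @committed.include?(path)
    # ... bulk-load the file's records here ...
    @committed << path  # commit marker, written atomically with the load
    :loaded
  end
end
```

Embulk takes a similar approach by writing run state (such as last_paths) back into the config file, so reruns are deterministic.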
The problems at Treasure Data
What is the Treasure Data Service?
> “Fast, powerful SQL access to big data from connected applications and products, with no new infrastructure or special skills required.”
> Customers want to try Treasure Data, but
  > SEs write scripts to bulk load their data. Hard work :(
> Customers want to migrate their big data, but
  > Hard work :(
> Fluentd solved streaming data collection, but
  > bulk data loading is another problem.
A solution:
> Package the efforts as a plugin.
  > data cleaning, error handling, retrying
> Share & reuse the plugin.
  > don’t repeat the pains!
> Keep improving the plugin code
  > rather than throwing away the efforts every time,
  > using OSS-style pull requests & frequent releases.
Embulk
Embulk is an open-source, plugin-based parallel bulk data loader that makes data integration work relaxed.
[Diagram] Embulk bulk-loads data between systems such as HDFS, MySQL, Amazon S3, CSV files, SequenceFile, Salesforce.com, Elasticsearch, Cassandra, Hive, and Redis.
✓ Parallel execution  ✓ Data validation  ✓ Error recovery  ✓ Deterministic behavior  ✓ Idempotent retrying
Both the input side and the output side are handled by plugins.
How Embulk works
Installing Embulk
Embulk is released on Bintray.

# install
$ wget https://bintray.com/artifact/download/embulk/maven/embulk-0.2.0.jar -O embulk.jar
$ chmod 755 embulk.jar
Guess format & schema

# guess
$ vi partial-config.yml
$ ./embulk guess partial-config.yml -o config.yml

partial-config.yml:
in:
  type: file
  paths: [data/examples/]
out:
  type: example

guessed config.yml (by guess plugins):
in:
  type: file
  paths: [data/examples/]
  decoders:
  - {type: gzip}
  parser:
    charset: UTF-8
    newline: CRLF
    type: csv
    delimiter: ','
    quote: '"'
    header_line: true
    columns:
    - name: time
      type: timestamp
      format: '%Y-%m-%d %H:%M:%S'
    - name: account
      type: long
    - name: purchase
      type: timestamp
      format: '%Y%m%d'
    - name: comment
      type: string
out:
  type: example
Preview & fix config

# preview
$ ./embulk preview config.yml
$ vi config.yml  # if necessary

+-------------------------+----------+-------------+
| time:timestamp          | uid:long | word:string |
+-------------------------+----------+-------------+
| 2015-01-27 19:23:49 UTC |   32,864 | embulk      |
| 2015-01-27 19:01:23 UTC |   14,824 | jruby       |
| 2015-01-28 02:20:02 UTC |   27,559 | plugin      |
| 2015-01-29 11:54:36 UTC |   11,270 | fluentd     |
+-------------------------+----------+-------------+
Deterministic run

# run
$ ./embulk run config.yml -o config.yml

config.yml is rewritten after the run (note last_paths):
in:
  type: file
  paths: [data/examples/]
  decoders:
  - {type: gzip}
  parser:
    charset: UTF-8
    newline: CRLF
    type: csv
    delimiter: ','
    quote: '"'
    header_line: true
    columns:
    - name: time
      type: timestamp
      format: '%Y-%m-%d %H:%M:%S'
    - name: account
      type: long
    - name: purchase
      type: timestamp
      format: '%Y%m%d'
    - name: comment
      type: string
  last_paths: [data/examples/sample_001.csv.gz]
out:
  type: example
Repeat

# repeat
$ ./embulk run config.yml -o config.yml
$ ./embulk run config.yml -o config.yml

Each run updates last_paths in config.yml (e.g. to [data/examples/sample_002.csv.gz]), so repeated runs pick up only files added since the previous run.
The architecture
The Embulk core runs an executor plugin between two plugins:
InputPlugin → records → executor plugin → records → OutputPlugin
The InputPlugin reads records; the OutputPlugin writes records. Either side can be MySQL, Cassandra, HBase, Elasticsearch, Treasure Data, …
For file-based sources and targets, input and output are decomposed further:

FileInputPlugin (read files: HDFS, S3, Riak CS, …)
→ DecoderPlugin (decompress buffers: gzip, bzip2, 3des, …)
→ ParserPlugin (parse files into records: CSV, JSON, RCFile, …)
→ executor plugin (records)
→ FormatterPlugin (format records into files)
→ EncoderPlugin (compress buffers)
→ FileOutputPlugin (write files)

Data flows as buffers between the file-stage plugins and as records through the executor.
Writing Embulk plugins
InputPlugin

module Embulk
  class InputExample < InputPlugin
    Plugin.register_input('example', self)

    def self.transaction(config, &control)
      # read config
      task = {
        'message' => config.param('message', :string, default: nil)
      }
      threads = config.param('threads', :int, default: 2)
      columns = [
        Column.new(0, 'col0', :long),
        Column.new(1, 'col1', :double),
        Column.new(2, 'col2', :string),
      ]
      # BEGIN here
      commit_reports = yield(task, columns, threads)
      # COMMIT here
      puts "Example input finished"
      return {}
    end

    def run(task, schema, index, page_builder)
      puts "Example input thread #{index}…"
      10.times do |i|
        page_builder.add([i, 10.0, "example"])
      end
      page_builder.finish
      commit_report = {}
      return commit_report
    end
  end
end
OutputPlugin

module Embulk
  class OutputExample < OutputPlugin
    Plugin.register_output('example', self)

    def self.transaction(config, schema, processor_count, &control)
      # read config
      task = {
        'message' => config.param('message', :string, default: "record")
      }
      puts "Example output started."
      commit_reports = yield(task)
      puts "Example output finished. Commit reports = #{commit_reports.to_json}"
      return {}
    end

    def initialize(task, schema, index)
      puts "Example output thread #{index}..."
      super
      @message = task.prop('message', :string)
      @records = 0
    end

    def add(page)
      page.each do |record|
        hash = Hash[schema.names.zip(record)]
        puts "#{@message}: #{hash.to_json}"
        @records += 1
      end
    end

    def finish
    end

    def abort
    end

    def commit
      commit_report = {"records" => @records}
      return commit_report
    end
  end
end
GuessPlugin

# guess_gzip.rb
module Embulk
  class GzipGuess < GuessPlugin
    Plugin.register_guess('gzip', self)

    GZIP_HEADER = "\x1f\x8b".force_encoding('ASCII-8BIT').freeze

    def guess(config, sample_buffer)
      if sample_buffer[0, 2] == GZIP_HEADER
        return {"decoders" => [{"type" => "gzip"}]}
      end
      return {}
    end
  end
end

# guess_
module Embulk
  class GuessNewline < TextGuessPlugin
    Plugin.register_guess('newline', self)

    def guess_text(config, sample_text)
      cr_count = sample_text.count("\r")
      lf_count = sample_text.count("\n")
      crlf_count = sample_text.scan(/\r\n/).length
      if crlf_count > cr_count / 2 && crlf_count > lf_count / 2
        return {"parser" => {"newline" => "CRLF"}}
      elsif cr_count > lf_count / 2
        return {"parser" => {"newline" => "CR"}}
      else
        return {"parser" => {"newline" => "LF"}}
      end
    end
  end
end
Releasing to RubyGems
Examples:
> embulk-plugin-postgres-json.gem
  > https://github.com/frsyuki/embulk-plugin-postgres-json
> embulk-plugin-redis.gem
  > https://github.com/komamitsu/embulk-plugin-redis
> embulk-plugin-input-sfdc-event-log-files.gem
  > https://github.com/nahi/embulk-plugin-input-sfdc-event-log-files
Roadmap & Development
Roadmap
> Add missing JRuby Plugin APIs
  > ParserPlugin, FormatterPlugin
  > DecoderPlugin, EncoderPlugin
> Add Executor plugin SPI
> Add ssh distributed executor
  > embulk run --command ssh %host embulk run %task
> Add MapReduce executor
> Add support for nested records (?)
Contributing to the Embulk project
> Pull requests & issues on GitHub
> Posting blogs
  > “I tried Embulk. Here is how it worked”
  > “I read Embulk code. Here is how it’s written”
  > “Embulk is good because… but bad because…”
> Talking on Twitter with the word “embulk”
> Writing & releasing plugins
> Windows support
> Integration with other software
  > ETL tools, Fluentd, Hadoop, Presto, …
Q&A + Discussion
Embulk committers:
> Sadayuki Furuhashi @frsyuki
> Muga Nishizawa @muga_nishizawa
> Hiroshi Nakamura @nahi
We’re hiring!
Cloud service for the entire data pipeline.
https://jobs.lever.co/treasure-data
