
















Guess format & schema (guess by guess plugins)

    # install
    $ wget https://bintray.com/artifact/download/embulk/maven/embulk-0.2.0.jar -O embulk.jar
    $ chmod 755 embulk.jar

    # guess
    $ vi partial-config.yml
    $ ./embulk.jar guess partial-config.yml -o config.yml

Partial config given to guess:

    in:
      type: file
      paths: [data/examples/]
    out:
      type: example

Config produced by guess:

    in:
      type: file
      paths: [data/examples/]
      decoders:
      - {type: gzip}
      parser:
        charset: UTF-8
        newline: CRLF
        type: csv
        delimiter: ','
        quote: '"'
        header_line: true
        columns:
        - name: time
          type: timestamp
          format: '%Y-%m-%d %H:%M:%S'
        - name: account
          type: long
        - name: purchase
          type: timestamp
          format: '%Y%m%d'
        - name: comment
          type: string
    out:
      type: example
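
The guess step can be pictured as folding each guess plugin's result into the `in` section of the partial config. The following is only a minimal sketch of that idea in plain Ruby: `deep_merge` and the literal hashes are hypothetical illustrations, not Embulk's actual implementation.

```ruby
# Hypothetical helper: recursively merge guessed settings into a config hash.
# Nested hashes are merged key by key; any other value is overwritten.
def deep_merge(base, guessed)
  base.merge(guessed) do |_key, old, new|
    old.is_a?(Hash) && new.is_a?(Hash) ? deep_merge(old, new) : new
  end
end

# The "in" section of the partial config from the slide.
partial_in = { "type" => "file", "paths" => ["data/examples/"] }

# Results shaped like what guess plugins might return (assumed values).
guesses = [
  { "decoders" => [{ "type" => "gzip" }] },
  { "parser" => { "newline" => "CRLF" } },
  { "parser" => { "charset" => "UTF-8" } },
]

# Fold every guess into the partial config, one after another.
guessed_in = guesses.reduce(partial_in) { |acc, g| deep_merge(acc, g) }
```

Merging recursively matters here because several guess plugins may each contribute a different key under `parser`; a shallow merge would let the last one discard the others.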

Deterministic run

    # install
    $ wget https://bintray.com/artifact/download/embulk/maven/embulk-0.2.0.jar -O embulk.jar
    $ chmod 755 embulk.jar

    # guess
    $ vi partial-config.yml
    $ ./embulk.jar guess partial-config.yml -o config.yml

    # preview
    $ ./embulk.jar preview config.yml
    $ vi config.yml # if necessary

    # run
    $ ./embulk.jar run config.yml -o config.yml

Config after the run:

    in:
      type: file
      paths: [data/examples/]
      decoders:
      - {type: gzip}
      parser:
        charset: UTF-8
        newline: CRLF
        type: csv
        delimiter: ','
        quote: '"'
        header_line: true
        columns:
        - name: time
          type: timestamp
          format: '%Y-%m-%d %H:%M:%S'
        - name: account
          type: long
        - name: purchase
          type: timestamp
          format: '%Y%m%d'
        - name: comment
          type: string
      last_paths: [data/examples/sample_001.csv.gz]
    out:
      type: example

Repeat

    # install
    $ wget https://bintray.com/artifact/download/embulk/maven/embulk-0.2.0.jar -O embulk.jar
    $ chmod 755 embulk.jar

    # guess
    $ vi partial-config.yml
    $ ./embulk.jar guess partial-config.yml -o config.yml

    # preview
    $ ./embulk.jar preview config.yml
    $ vi config.yml # if necessary

    # run
    $ ./embulk.jar run config.yml -o config.yml

    # repeat
    $ ./embulk.jar run config.yml -o config.yml
    $ ./embulk.jar run config.yml -o config.yml

Config after the repeated run:

    in:
      type: file
      paths: [data/examples/]
      decoders:
      - {type: gzip}
      parser:
        charset: UTF-8
        newline: CRLF
        type: csv
        delimiter: ','
        quote: '"'
        header_line: true
        columns:
        - name: time
          type: timestamp
          format: '%Y-%m-%d %H:%M:%S'
        - name: account
          type: long
        - name: purchase
          type: timestamp
          format: '%Y%m%d'
        - name: comment
          type: string
      last_paths: [data/examples/sample_002.csv.gz]
    out:
      type: example
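
The way `last_paths` moves from `sample_001.csv.gz` to `sample_002.csv.gz` between runs suggests that each run records where it stopped and the next run picks up from there. The sketch below models that in plain Ruby under a loud assumption: that files are taken in lexicographic order and anything up to the recorded path is skipped. The function names `pending_paths` and `next_last_paths` are hypothetical and not part of Embulk.

```ruby
# Hypothetical: list the files a run should load, given the paths recorded
# by the previous run. Assumes lexicographic ordering of file paths.
def pending_paths(all_paths, last_paths)
  last = last_paths.max
  sorted = all_paths.sort
  last.nil? ? sorted : sorted.select { |path| path > last }
end

# Hypothetical: compute the value to record for the next run.
# If nothing was loaded, the previous value is kept unchanged.
def next_last_paths(loaded, last_paths)
  loaded.empty? ? last_paths : [loaded.max]
end

files = ["data/examples/sample_001.csv.gz", "data/examples/sample_002.csv.gz"]

# Second run: sample_001 was already recorded, so only sample_002 is pending.
second_run = pending_paths(files, ["data/examples/sample_001.csv.gz"])
recorded = next_last_paths(second_run, ["data/examples/sample_001.csv.gz"])

# Third run with no new files: nothing to load, recorded value stays put.
third_run = pending_paths(files, recorded)
```

Because the selection depends only on the file list and the recorded value, rerunning with the same config gives the same selection, which is what makes a retry safe in this model.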






InputPlugin

    module Embulk
      class InputExample < InputPlugin
        Plugin.register_input('example', self)

        def self.transaction(config, &control)
          # read config
          task = {
            'message' => config.param('message', :string, default: nil)
          }
          threads = config.param('threads', :int, default: 2)

          columns = [
            Column.new(0, 'col0', :long),
            Column.new(1, 'col1', :double),
            Column.new(2, 'col2', :string),
          ]

          # BEGIN here
          commit_reports = yield(task, columns, threads)
          # COMMIT here

          puts "Example input finished"
          return {}
        end

        def run(task, schema, index, page_builder)
          # use the method arguments; no instance variables are set here
          puts "Example input thread #{index}…"
          10.times do |i|
            page_builder.add([i, 10.0, "example"])
          end
          page_builder.finish
          commit_report = { }
          return commit_report
        end
      end
    end

OutputPlugin

    module Embulk
      class OutputExample < OutputPlugin
        Plugin.register_output('example', self)

        def self.transaction(config, schema, processor_count, &control)
          # read config
          task = {
            'message' => config.param('message', :string, default: "record")
          }

          puts "Example output started."
          commit_reports = yield(task)
          puts "Example output finished. Commit reports = #{commit_reports.to_json}"

          return {}
        end

        def initialize(task, schema, index)
          puts "Example output thread #{index}..."
          super
          @message = task.prop('message', :string)
          @records = 0
        end

        def add(page)
          page.each do |record|
            hash = Hash[schema.names.zip(record)]
            puts "#{@message}: #{hash.to_json}"
            @records += 1
          end
        end

        def finish
        end

        def abort
        end

        def commit
          commit_report = {
            "records" => @records
          }
          return commit_report
        end
      end
    end

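
The `transaction` method wraps the whole bulk load: everything before `yield` is the BEGIN side, the block runs the parallel tasks, and everything after `yield` sees their commit reports. To make that control flow concrete, here is a self-contained sketch in plain Ruby. `SketchOutput` and `run_bulk` are hypothetical stand-ins written for this illustration; they do not use or reproduce Embulk's classes, and running one thread per page is an assumption.

```ruby
# Stand-in plugin (not an Embulk class) that mirrors the transaction/yield shape.
class SketchOutput
  def self.transaction(config)
    task = { "message" => config.fetch("message", "record") }
    # The block runs every task and returns one commit report per task.
    commit_reports = yield(task)
    # After the block returns, all tasks are done and can be committed together.
    { "total" => commit_reports.sum { |report| report["records"] } }
  end

  def initialize(task, index)
    @task = task
    @index = index
    @records = 0
  end

  # Count the records in one page.
  def add(page)
    @records += page.size
  end

  def commit
    { "records" => @records }
  end
end

# Hypothetical runner: one thread per page, reports collected in page order.
def run_bulk(plugin_class, config, pages)
  plugin_class.transaction(config) do |task|
    threads = pages.each_with_index.map do |page, index|
      Thread.new do
        output = plugin_class.new(task, index)
        output.add(page)
        output.commit
      end
    end
    threads.map(&:value) # Thread#value joins and returns the block's result
  end
end

result = run_bulk(SketchOutput, {}, [[[1, "a"], [2, "b"]], [[3, "c"]]])
```

Keeping the per-task work inside the block means a failure in any thread surfaces before the code after `yield` runs, so the final commit step only happens when every task has reported.
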
GuessPlugin

    # guess_gzip.rb
    module Embulk
      class GzipGuess < GuessPlugin
        Plugin.register_guess('gzip', self)

        GZIP_HEADER = "\x1f\x8b".force_encoding('ASCII-8BIT').freeze

        def guess(config, sample_buffer)
          if sample_buffer[0,2] == GZIP_HEADER
            return {"decoders" => [{"type" => "gzip"}]}
          end
          return {}
        end
      end
    end

    # guess_
    module Embulk
      class GuessNewline < TextGuessPlugin
        Plugin.register_guess('newline', self)

        def guess_text(config, sample_text)
          cr_count = sample_text.count("\r")
          lf_count = sample_text.count("\n")
          crlf_count = sample_text.scan(/\r\n/).length
          if crlf_count > cr_count / 2 && crlf_count > lf_count / 2
            return {"parser" => {"newline" => "CRLF"}}
          elsif cr_count > lf_count / 2
            return {"parser" => {"newline" => "CR"}}
          else
            return {"parser" => {"newline" => "LF"}}
          end
        end
      end
    end
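
The two guess heuristics above can be tried without Embulk by lifting them into standalone functions. This is a sketch for experimentation only: `gzip?` and `guess_newline` are hypothetical names, and the logic is copied from the slide rather than from any Embulk release.

```ruby
# First two bytes of a gzip stream, as a binary (ASCII-8BIT) string.
GZIP_HEADER = "\x1f\x8b".b.freeze

# True when the sample starts with the gzip magic bytes.
# String#b gives a binary copy so the comparison ignores source encoding.
def gzip?(sample_buffer)
  sample_buffer.b[0, 2] == GZIP_HEADER
end

# Same counting heuristic as the slide: CRLF wins if it outnumbers half of
# the CR count and half of the LF count; otherwise CR versus LF is decided
# by comparing the CR count with half of the LF count.
def guess_newline(sample_text)
  cr_count = sample_text.count("\r")
  lf_count = sample_text.count("\n")
  crlf_count = sample_text.scan(/\r\n/).length
  if crlf_count > cr_count / 2 && crlf_count > lf_count / 2
    "CRLF"
  elsif cr_count > lf_count / 2
    "CR"
  else
    "LF"
  end
end

windows_style = guess_newline("a,b\r\nc,d\r\n")
unix_style = guess_newline("a,b\nc,d\n")
old_mac_style = guess_newline("a,b\rc,d\r")
```

Note that a sample with no line breaks at all falls through to `"LF"`, since both comparisons are strict.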







This presentation introduces Embulk, an open-source, plugin-based parallel bulk data loader. Embulk loads records from various sources ("A") to various targets ("B"), with plugins handling each source and target type, which is meant to take much of the pain out of data integration. Embulk executes in parallel, validates data, handles errors, behaves deterministically, and allows idempotent retries of bulk loads.






















