Embulk, an open-source plugin-based parallel bulk data loader

The document discusses Embulk, an open-source, plugin-based parallel bulk data loader. Embulk loads records from various sources ("A") to various targets ("B") using plugins for each kind of source and target, turning painful data integration work into something more relaxed. Embulk executes in parallel, validates data, handles errors, behaves deterministically, and allows idempotent retries of bulk loads.

Embulk
An open-source plugin-based parallel bulk data loader that makes painful data integration work relaxed.
Sharing our knowledge on RubyGems to manage arbitrary files.

Sadayuki Furuhashi
Founder & Software Architect, Treasure Data, inc.
A little about me...
> Sadayuki Furuhashi
> github/twitter: @frsyuki
> Treasure Data, Inc.
  > Founder & Software Architect
> Open-source hacker
  > MessagePack - Efficient object serializer
  > Fluentd - A unified data collection tool
  > Prestogres - PostgreSQL protocol gateway for Presto
  > Embulk - A plugin-based parallel bulk data loader
  > ServerEngine - A Ruby framework to build multiprocess servers
  > LS4 - A distributed object storage with cross-region replication
  > kumofs - A distributed strong-consistent key-value data store
Today’s talk
> What’s Embulk?
> How Embulk works
> The architecture
> Writing Embulk plugins
> Roadmap & Development
> Q&A + Discussion
What’s Embulk?
> An open-source parallel bulk data loader
> using plugins
> to make data integration relaxed.
What’s Embulk?
> An open-source parallel bulk data loader
> loads records from “A” to “B”
> using plugins
> for various kinds of “A” and “B”
> to make data integration relaxed, which was very painful:
  broken records, transactions (idempotency), performance, …
(“A” and “B”: storage, RDBMS, NoSQL, cloud services, etc.)
The pains of bulk data loading
Example: load a 10GB CSV file to PostgreSQL
> 1. First attempt → fails
> 2. Write a script to clean the records
  • Convert “20150127T190500Z” → “2015-01-27 19:05:00 UTC”
  • Convert “\N” → “”
  • many cleanings…
> 3. Second attempt → another error
  • Convert “Inf” → “Infinity”
> 4. Fix the script, retry, retry, retry…
> 5. Oh, some data got loaded twice!?
The pains of bulk data loading
Example: load a 10GB CSV file to PostgreSQL
> 6. OK, the script worked.
> 7. Register it to cron to sync data every day.
> 8. One day… it fails with another error
  • Convert invalid UTF-8 byte sequences to U+FFFD
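The cleanup steps in the example above can be sketched in plain Ruby. The helper names below are made up for this illustration and are not part of Embulk or any library:

```ruby
require 'time'

# Illustrative cleanup helpers for the conversions in the slides above.
# The function names are hypothetical, invented for this sketch.
def normalize_timestamp(value)
  # "20150127T190500Z" -> "2015-01-27 19:05:00 UTC"
  Time.strptime(value, '%Y%m%dT%H%M%S%z').utc.strftime('%Y-%m-%d %H:%M:%S UTC')
end

def normalize_value(value)
  case value
  when '\N'  then ''          # NULL marker commonly seen in database dumps
  when 'Inf' then 'Infinity'  # PostgreSQL spells infinity out
  else value
  end
end

def normalize_utf8(value)
  # Replace invalid UTF-8 byte sequences with U+FFFD
  value.scrub("\u{FFFD}")
end
```

Each broken file tends to need another helper like these, which is exactly the effort Embulk packages into reusable plugins.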
The pains of bulk data loading
Example: load 10GB CSV × 720 files
> Most scripts are slow.
  • People have little time to optimize bulk load scripts
> One file takes 1 hour → 720 files take 1 month (!?)
A lot of integration effort for each storage format and service:
> XML, JSON, Apache log format (+ some custom), …
> SAM, BED, BAI2, HDF5, TDE, SequenceFile, RCFile, …
> MongoDB, Elasticsearch, Redshift, Salesforce, …
The problems:
> Data cleaning (normalization)
  > How to normalize broken records?
> Error handling
  > How to remove broken records?
> Idempotent retrying
  > How to retry without duplicated loading?
> Performance optimization
  > How to optimize the code or parallelize it?
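One common remedy for the idempotent-retry problem is to record which inputs were committed, atomically with the load itself, so a retried run skips anything already loaded. The class below is a hypothetical sketch of that idea, not Embulk's implementation:

```ruby
require 'set'

# Hypothetical sketch of idempotent retrying: remember which files were
# committed so a retried run skips them. In a real loader the commit
# marker must be persisted in the same transaction as the loaded data.
class IdempotentLoader
  def initialize
    @committed = Set.new
  end

  def load(path)
    return :skipped if @committed.include?(path)
    # ... bulk-load the file's records here ...
    @committed << path  # commit marker, written atomically with the load
    :loaded
  end
end
```

Embulk takes a similar approach by writing run state (such as last_paths) back into the config file, so reruns are deterministic.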
The problems at Treasure Data
What is the Treasure Data Service?
> “Fast, powerful SQL access to big data from connected applications and products, with no new infrastructure or special skills required.”
> Customers want to try Treasure Data, but
  > SEs write scripts to bulk load their data. Hard work :(
> Customers want to migrate their big data, but
  > Hard work :(
> Fluentd solved streaming data collection, but
  > bulk data loading is another problem.
A solution:
> Package the efforts as a plugin.
  > data cleaning, error handling, retrying
> Share & reuse the plugin.
  > don’t repeat the pains!
> Keep improving the plugin code
  > rather than throwing away the efforts every time,
  > using OSS-style pull requests & frequent releases.
Embulk
Embulk is an open-source, plugin-based parallel bulk data loader that makes data integration work relaxed.
[Diagram] Embulk bulk-loads data between systems such as HDFS, MySQL, Amazon S3, CSV files, SequenceFile, Salesforce.com, Elasticsearch, Cassandra, Hive, and Redis.
✓ Parallel execution  ✓ Data validation  ✓ Error recovery  ✓ Deterministic behavior  ✓ Idempotent retrying
Both the input side and the output side are handled by plugins.
How Embulk works
Installing Embulk
Embulk is released on Bintray.

# install
$ wget https://bintray.com/artifact/download/embulk/maven/embulk-0.2.0.jar -O embulk.jar
$ chmod 755 embulk.jar
Guess format & schema

# guess
$ vi partial-config.yml
$ ./embulk guess partial-config.yml -o config.yml

partial-config.yml:
in:
  type: file
  paths: [data/examples/]
out:
  type: example

guessed config.yml (by guess plugins):
in:
  type: file
  paths: [data/examples/]
  decoders:
  - {type: gzip}
  parser:
    charset: UTF-8
    newline: CRLF
    type: csv
    delimiter: ','
    quote: '"'
    header_line: true
    columns:
    - name: time
      type: timestamp
      format: '%Y-%m-%d %H:%M:%S'
    - name: account
      type: long
    - name: purchase
      type: timestamp
      format: '%Y%m%d'
    - name: comment
      type: string
out:
  type: example
Preview & fix config

# preview
$ ./embulk preview config.yml
$ vi config.yml  # if necessary

+-------------------------+----------+-------------+
| time:timestamp          | uid:long | word:string |
+-------------------------+----------+-------------+
| 2015-01-27 19:23:49 UTC |   32,864 | embulk      |
| 2015-01-27 19:01:23 UTC |   14,824 | jruby       |
| 2015-01-28 02:20:02 UTC |   27,559 | plugin      |
| 2015-01-29 11:54:36 UTC |   11,270 | fluentd     |
+-------------------------+----------+-------------+
Deterministic run

# run
$ ./embulk run config.yml -o config.yml

config.yml is rewritten after the run (note last_paths):
in:
  type: file
  paths: [data/examples/]
  decoders:
  - {type: gzip}
  parser:
    charset: UTF-8
    newline: CRLF
    type: csv
    delimiter: ','
    quote: '"'
    header_line: true
    columns:
    - name: time
      type: timestamp
      format: '%Y-%m-%d %H:%M:%S'
    - name: account
      type: long
    - name: purchase
      type: timestamp
      format: '%Y%m%d'
    - name: comment
      type: string
  last_paths: [data/examples/sample_001.csv.gz]
out:
  type: example
Repeat

# repeat
$ ./embulk run config.yml -o config.yml
$ ./embulk run config.yml -o config.yml

Each run updates last_paths in config.yml (e.g. to [data/examples/sample_002.csv.gz]), so repeated runs pick up only files added since the previous run.
The architecture
The Embulk core runs an executor plugin between two plugins:
InputPlugin → records → executor plugin → records → OutputPlugin
The InputPlugin reads records; the OutputPlugin writes records. Either side can be MySQL, Cassandra, HBase, Elasticsearch, Treasure Data, …
For file-based sources and targets, input and output are decomposed further:

FileInputPlugin (read files: HDFS, S3, Riak CS, …)
→ DecoderPlugin (decompress buffers: gzip, bzip2, 3des, …)
→ ParserPlugin (parse files into records: CSV, JSON, RCFile, …)
→ executor plugin (records)
→ FormatterPlugin (format records into files)
→ EncoderPlugin (compress buffers)
→ FileOutputPlugin (write files)

Data flows as buffers between the file-stage plugins and as records through the executor.
Writing Embulk plugins
InputPlugin

module Embulk
  class InputExample < InputPlugin
    Plugin.register_input('example', self)

    def self.transaction(config, &control)
      # read config
      task = {
        'message' => config.param('message', :string, default: nil)
      }
      threads = config.param('threads', :int, default: 2)
      columns = [
        Column.new(0, 'col0', :long),
        Column.new(1, 'col1', :double),
        Column.new(2, 'col2', :string),
      ]
      # BEGIN here
      commit_reports = yield(task, columns, threads)
      # COMMIT here
      puts "Example input finished"
      return {}
    end

    def run(task, schema, index, page_builder)
      puts "Example input thread #{index}…"
      10.times do |i|
        page_builder.add([i, 10.0, "example"])
      end
      page_builder.finish
      commit_report = {}
      return commit_report
    end
  end
end
OutputPlugin

module Embulk
  class OutputExample < OutputPlugin
    Plugin.register_output('example', self)

    def self.transaction(config, schema, processor_count, &control)
      # read config
      task = {
        'message' => config.param('message', :string, default: "record")
      }
      puts "Example output started."
      commit_reports = yield(task)
      puts "Example output finished. Commit reports = #{commit_reports.to_json}"
      return {}
    end

    def initialize(task, schema, index)
      puts "Example output thread #{index}..."
      super
      @message = task.prop('message', :string)
      @records = 0
    end

    def add(page)
      page.each do |record|
        hash = Hash[schema.names.zip(record)]
        puts "#{@message}: #{hash.to_json}"
        @records += 1
      end
    end

    def finish
    end

    def abort
    end

    def commit
      commit_report = {"records" => @records}
      return commit_report
    end
  end
end
GuessPlugin

# guess_gzip.rb
module Embulk
  class GzipGuess < GuessPlugin
    Plugin.register_guess('gzip', self)

    GZIP_HEADER = "\x1f\x8b".force_encoding('ASCII-8BIT').freeze

    def guess(config, sample_buffer)
      if sample_buffer[0, 2] == GZIP_HEADER
        return {"decoders" => [{"type" => "gzip"}]}
      end
      return {}
    end
  end
end

# guess_
module Embulk
  class GuessNewline < TextGuessPlugin
    Plugin.register_guess('newline', self)

    def guess_text(config, sample_text)
      cr_count = sample_text.count("\r")
      lf_count = sample_text.count("\n")
      crlf_count = sample_text.scan(/\r\n/).length
      if crlf_count > cr_count / 2 && crlf_count > lf_count / 2
        return {"parser" => {"newline" => "CRLF"}}
      elsif cr_count > lf_count / 2
        return {"parser" => {"newline" => "CR"}}
      else
        return {"parser" => {"newline" => "LF"}}
      end
    end
  end
end
Releasing to RubyGems
Examples:
> embulk-plugin-postgres-json.gem
  > https://github.com/frsyuki/embulk-plugin-postgres-json
> embulk-plugin-redis.gem
  > https://github.com/komamitsu/embulk-plugin-redis
> embulk-plugin-input-sfdc-event-log-files.gem
  > https://github.com/nahi/embulk-plugin-input-sfdc-event-log-files
Roadmap & Development
Roadmap
> Add missing JRuby Plugin APIs
  > ParserPlugin, FormatterPlugin
  > DecoderPlugin, EncoderPlugin
> Add Executor plugin SPI
> Add ssh distributed executor
  > embulk run --command ssh %host embulk run %task
> Add MapReduce executor
> Add support for nested records (?)
Contributing to the Embulk project
> Pull requests & issues on GitHub
> Posting blogs
  > “I tried Embulk. Here is how it worked”
  > “I read Embulk code. Here is how it’s written”
  > “Embulk is good because… but bad because…”
> Talking on Twitter with the word “embulk”
> Writing & releasing plugins
> Windows support
> Integration with other software
  > ETL tools, Fluentd, Hadoop, Presto, …
Q&A + Discussion
Embulk committers:
> Sadayuki Furuhashi @frsyuki
> Muga Nishizawa @muga_nishizawa
> Hiroshi Nakamura @nahi
We’re hiring!
Cloud service for the entire data pipeline.
https://jobs.lever.co/treasure-data
