Apache HBase at Airbnb

HBaseCon

The document discusses Airbnb's data infrastructure and the use of Apache HBase for efficient data management, event logging, and real-time data processing. It highlights the architecture for batch and streaming data workflows, including the integration of various technologies like Spark, Kafka, and Druid. Key features of HBase, its case studies, and the advantages of its stateful computation and scalability within the Hadoop ecosystem are also covered.

Apache HBase at Airbnb
Jingwei Lu, Liyin Tang, and Jason Zhang
Data Infrastructure at Airbnb
Batch Infrastructure (diagram): Event Logs, MySQL Dumps, Kafka, Sqoop, Gold Cluster, Silver Cluster, HDFS, Hive, Yarn, ReAir, Spark Cluster, Spark, Airflow Scheduling, S3, Presto Cluster, AirPal, Caravel, Tableau
Streaming at Airbnb (diagram): Event Logging, MySQL Binlog, Spinaltap, Kafka, Spark Streaming, HBase, Druid, Datadog, Cluster, HDFS, Hive, Yarn, Presto Cluster
Growing Pain
Stateless (diagram): Source, DStream, DF, Computation, DF, Sink
Stateful (diagram): Source, DStream, DF, Computation, DF, Sink 1, Sink 2, Sink N, State Storage, RDD
Multiple Streams (diagram): Source, DStream, …, Align by Time, DataFrame, State Storage, Process A … Process N, Sink 1, Sink 2, Sink 3, …, Sink N
Streaming + Batch (diagram): Source, DStream, …, Align by Time, DataFrame, State Storage, Process A, Sink 1, Sink 2, Sink 3, …, Sink N
Simplify and Unify
AirStream Architecture (diagram): Sources, Stream #1 … Stream #N, Hive Tables, HBase Tables, Virtual Table Views for Computation, Customized Computation, Spark SQL, Simple Config, Sinks, HBase Services, Streaming Sources, Druid, …
AirStream Architecture, continued (diagram): the same components, with the note "Same computation for batch processing"
Stateful
State Store
  • Merge changes
  • Provide fast lookup
  • Fast persistent storage across streaming and batch jobs
Why HBase
  • Rich Functionalities
  • Rich Integration with Hadoop Ecosystem
  • Easy Management
  • Strong Community
  • Reliable and Scalable
HBase State Store Operators in AirStream
  • Full Table Scan
  • Simple Aggregation
  • Bulk Upload
  • Key/Prefix Lookup
  • Update
Computation DAG (diagram): Input Data, Key Lookup, Left Outer Join, Result
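The computation DAG labels (input data, key lookup, left outer join, result) can be read as a lookup join against the state store. The following is a minimal sketch of that reading, not AirStream's implementation: a plain dict stands in for an HBase table, and the names `left_outer_join`, `state_store`, and `key_field` are hypothetical.

```python
def left_outer_join(input_rows, state_store, key_field):
    """Yield every input row, enriched with the state found by key lookup.

    Rows without a match are kept, with None as the looked-up state,
    which is what makes this a left outer join.
    """
    for row in input_rows:
        # dict.get stands in for a point lookup (an HBase Get) by row key.
        state = state_store.get(row[key_field])
        yield {**row, "state": state}


# Hypothetical sample data, for illustration only.
events = [{"user": "u1", "action": "book"}, {"user": "u2", "action": "book"}]
store = {"u1": {"experiment": "treatment"}}
joined = list(left_outer_join(events, store, "user"))
```

Every input row survives the join; the row for `u2` simply carries `None` because the store has no entry for it.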
Key Space Design
  • Hash-partition the key space for load balancing
  • Composite key for K -> V
  • Support full key lookup
  • Prefix lookup is supported for all keys used in the hash function
  Row key layout: Hash | key1 | key2 | key3 (hash based on key prefix)
  Lookup key: Hash | key1 | key2 (lookup based on key prefix, e.g. key1 = ‘value1’ and key2 = ‘value2’)
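The key layout above can be sketched as follows. This is an illustration under stated assumptions rather than the scheme Airbnb uses: the slide does not name the hash function, the bucket count, or the encoding, so MD5, 16 buckets, a NUL separator, and the function names `row_key` and `prefix_for` are all hypothetical. Key parts are assumed not to contain the separator byte.

```python
import hashlib

SEP = b"\x00"      # assumed separator between key parts
NUM_BUCKETS = 16   # assumed number of hash buckets


def _bucket(hashed_parts):
    """Map the hashed key parts to a bucket number in [0, NUM_BUCKETS)."""
    digest = hashlib.md5(SEP.join(p.encode() for p in hashed_parts)).digest()
    return digest[0] % NUM_BUCKETS


def row_key(key1, key2, key3):
    """Full composite key: the hash covers only the (key1, key2) prefix."""
    body = SEP.join(k.encode() for k in (key1, key2, key3))
    return bytes([_bucket([key1, key2])]) + body


def prefix_for(key1, key2):
    """Scan prefix matching every key3 stored under (key1, key2)."""
    body = SEP.join(k.encode() for k in (key1, key2))
    return bytes([_bucket([key1, key2])]) + body + SEP


full = row_key("value1", "value2", "value3")
prefix = prefix_for("value1", "value2")
```

Because the hash is computed only from the prefix keys, a caller who knows `key1` and `key2` can recompute the bucket and scan one contiguous key range. A lookup that omits any of the hashed keys cannot do so, which matches the slide's restriction on prefix lookups.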
Write Performance
  • Partition based on key before write
  • Use bulk upload for large-volume updates
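One way to picture "partition based on key before write" is to group records by the region that owns each row key and sort them within the group, so that each writer touches a single region. The sketch below assumes the region boundaries are known as a sorted list of split keys; `partition_by_region` and `split_points` are hypothetical names, and no HBase client is involved.

```python
from bisect import bisect_right
from collections import defaultdict


def partition_by_region(records, split_points):
    """Group (row_key, value) pairs by region index, sorted by key in each group.

    split_points is the sorted list of region start keys, excluding the
    empty start key of the first region. A region covers [start, next_start),
    so a key equal to a split point belongs to the region that starts there.
    """
    groups = defaultdict(list)
    for key, value in records:
        groups[bisect_right(split_points, key)].append((key, value))
    return {
        region: sorted(rows, key=lambda row: row[0])
        for region, rows in sorted(groups.items())
    }


# Hypothetical sample data, for illustration only.
records = [(b"z", 1), (b"a", 2), (b"h", 3), (b"g", 4)]
partitions = partition_by_region(records, split_points=[b"g", b"p"])
```

Sorted, per-region batches are also the shape a bulk upload needs, since store files hold rows in key order.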
Case Study: Experiment realtime feedback (diagram, one AirStream job): Experiment Assignment Event, Update, HBase with TTL, Booking Event, Lookup, Druid, Datadog
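A plausible reading of this case study is that experiment assignment events update an HBase table whose cells expire after a TTL, and booking events look up the assignment before results go to Druid and Datadog. The sketch below illustrates only that update/lookup pattern under that assumption. `TtlStore` is an in-memory stand-in rather than an HBase client, and all names and the one-hour TTL are hypothetical.

```python
class TtlStore:
    """In-memory stand-in for an HBase table whose cells expire after a TTL."""

    def __init__(self, ttl_seconds):
        self.ttl_seconds = ttl_seconds
        self._cells = {}  # key -> (write_time, value)

    def update(self, key, value, now):
        """Record the latest value for a key, stamped with the write time."""
        self._cells[key] = (now, value)

    def lookup(self, key, now):
        """Return the value for a key, or None if it is missing or expired."""
        cell = self._cells.get(key)
        if cell is None or now - cell[0] >= self.ttl_seconds:
            return None
        return cell[1]


def attribute_booking(store, booking, now):
    """Attach the user's experiment assignment (if still live) to a booking."""
    return {**booking, "experiment": store.lookup(booking["user"], now)}


store = TtlStore(ttl_seconds=3600)
store.update("u1", "treatment", now=0)
fresh = attribute_booking(store, {"user": "u1"}, now=10)
stale = attribute_booking(store, {"user": "u1"}, now=4000)
```

Time is passed in explicitly so the expiry behaviour is easy to follow; in HBase the TTL is a column family setting and expiry is handled by the server.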
Realtime Data Ingestion
Realtime Ingestion on HBase (diagram): stage labels Source, Ingest, Realtime Query, Snapshot, Batch Query; components MySQL, Analytical Events, Kafka, Spark Streaming, HBase, HDFS, Presto/Hive/Spark
Access Data in HBase (diagram): HBase, HDFS Snapshot, Hive, Presto, Spark SQL, Spark Streaming; access modes Batch Jobs, Interactive Query, Streaming; Table Mapping / Unified View on realtime data
Snapshot & Reseed (diagram): HBase, HDFS, Snapshot (HFile Links), Bulk Upload
Case Study 1: Events Ingestion (diagram): Kafka (topic, topic, …, topic), Spark (Executor 1, Executor 2, …, Executor K), HBase (Region 1, Region 2, …, Region M), DeDup, HDFS, Daily Snapshot, Realtime Query, Hive, Presto, Events Partition
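The slide only labels a "DeDup" step next to HBase and does not say how it works. One common approach, shown here purely as an assumption, is to derive the row key deterministically from an event identifier so that a replayed event overwrites its earlier copy instead of creating a new row. A dict stands in for the table, and `event_row_key` and `ingest` are hypothetical names.

```python
import hashlib


def event_row_key(event_id):
    """Deterministic row key: the same event always maps to the same row."""
    return hashlib.sha1(event_id.encode()).hexdigest()


def ingest(table, events):
    """Write events into the table; writing a duplicate overwrites in place."""
    for event in events:
        table[event_row_key(event["id"])] = event
    return table


# Hypothetical sample data: event "e1" is delivered twice.
delivered = [
    {"id": "e1", "type": "search"},
    {"id": "e2", "type": "booking"},
    {"id": "e1", "type": "search"},
]
table = ingest({}, delivered)
```

With this design, at-least-once delivery from the stream does not inflate row counts, because the write is idempotent.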
Case Study 2: Streaming DB Export (diagram): RDS (Table1, Table2, …, TableN), Kafka (Spinaltap.Table1, Spinaltap.Table2, …, Spinaltap.TableN), Spark (Executor 1, Executor 2, …, Executor K), HBase (Region 1, Region 2, …, Region M), HDFS (Region 1, Region 2, …, Region M), Daily Snapshot, Realtime Query
Case Study: Streaming DB Export (cell versions from commit timestamps)
  Rows                              CF: Columns  Version                   Value
  <ShardKey><DB_TABLE_#1><PK_a=A>   id           Fri May 19 00:33:19 2016  101
  <ShardKey><DB_TABLE_#1><PK_a=A>   city         Fri May 19 00:33:19 2016  San Francisco
  <ShardKey><DB_TABLE_#1><PK_a=A>   city         Fri May 10 00:34:19 2016  New York
  <ShardKey><DB_TABLE_#2><PK_a=A’>  id           Fri May 19 00:33:19 2016  1
Case Study: Streaming DB Export (binlog order): TXN 1 (Commit_TS: 101), TXN 2 (Commit_TS: 102), TXN 3 (Commit_TS: 103), …, TXN N (Commit_TS: N’)
Case Study: Streaming DB Export (binlog order, slide label: NTP): TXN 1 (Commit_TS: 101), TXN 2 (Commit_TS: 103), TXN 3 (Commit_TS: 102), …, TXN N (Commit_TS: N’). Commit timestamps no longer follow binlog order.
Case Study: Streaming DB Export (binlog order, point-in-time restore on TS 102): TXN 1 (Commit_TS: 101), TXN 2 (Commit_TS: 103), TXN 3 (Commit_TS: 102), …, TXN N (Commit_TS: N’)
Case Study: Streaming DB Export (cell versions from binlog offsets)
  Rows                              CF: Columns  Version  Value
  <ShardKey><DB_TABLE_#1><PK_a=A>   id           bin100   101
  <ShardKey><DB_TABLE_#1><PK_a=A>   city         bin101   San Francisco
  <ShardKey><DB_TABLE_#1><PK_a=A>   city         bin102   New York
  <ShardKey><DB_TABLE_#2><PK_a=A’>  id           bin100   1
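The move from commit timestamps to binlog offsets as the cell version can be illustrated with a small model. HBase reads return the cell with the highest version by default, so the version must increase in the order changes were applied. The sketch below reuses the out-of-order commit timestamps from the slides (101, 103, 102); the offsets, the sequence of city values, and the names `Change` and `latest_value` are hypothetical.

```python
from collections import namedtuple

Change = namedtuple("Change", "binlog_offset commit_ts value")


def latest_value(changes, version_field):
    """Model an HBase read: the cell with the highest version wins."""
    newest = max(changes, key=lambda change: getattr(change, version_field))
    return newest.value


# Changes to one column, listed in binlog (applied) order.
# The second and third commit timestamps are out of order.
changes = [
    Change(binlog_offset=100, commit_ts=101, value="San Francisco"),
    Change(binlog_offset=101, commit_ts=103, value="New York"),
    Change(binlog_offset=102, commit_ts=102, value="San Francisco"),
]

by_timestamp = latest_value(changes, "commit_ts")   # stale: picks commit_ts 103
by_offset = latest_value(changes, "binlog_offset")  # correct: picks offset 102
```

Versioning by commit timestamp returns the second change even though a third was applied after it, while versioning by binlog offset returns the value the source database actually holds.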
Case Study: Streaming DB Export (change log keyed by hour and logical offset)
  Rows                                          Version (Logical Offset)  Value
  <ShardKey><DB_TABLE_#1><2016-05-23 23><100>   100                       mysql-bin.00000:100
  <ShardKey><DB_TABLE_#1><2016-05-23 23><101>   101                       mysql-bin.00000:101
  <ShardKey><DB_TABLE_#1><2016-05-23 23><103>   103                       mysql-bin.00000:103
  <ShardKey><DB_TABLE_#1><2016-05-24 00><102>   102                       mysql-bin.00000:102
Summary
  • Scalable and Reliable
  • Rich Stateful Computation
  • Rich Integration with Hadoop Ecosystem
  • Easy Operation

Editor's Notes

  • #4: Disaster recovery. High/low SLA job isolation.
  • #15: Slide on why stateful processing rather than stateless.
  • #17: Use diagram to show operators
  • #22: Realtime ingestion provides a fast feedback loop. Advanced monitoring infrastructure. Tracking changes instead of taking a full snapshot for the RDS dump.
  • #23: Goals of realtime ingestion: (1) a fast feedback loop for experiments, to reduce the testing cycle; (2) a realtime view of the production database for many offline workloads (for example, machine learning).
  • #24: Table mapping provides a unified view for accessing realtime ingested data.
  • #25: Taking a snapshot with a scan takes 10-30 minutes per table, which does not scale. The link and restore takes 10 minutes, and all tables can be accessed afterward.
  • #27: Backup-based DB export and restore takes 9-12 hours and depends on AWS network conditions, so it has long latency and is fragile. We only need to track changes and apply them to a snapshot. This provides a near-realtime snapshot of the DB and unifies the approach across MySQL and DynamoDB.
