
Testing Distributed Query Engine as a Service
Deliver our service to customers as safely as possible
Naoki Takezoe
Presto Conference Tokyo 2020
Nov 20, 2020

© 2020 Treasure Data
Who am I?
• Naoki Takezoe
• Joined Treasure Data in 2018
• Work on Presto / Apache Spark
• Open Source
  • GitBucket
  • Scalatra
  • Apache PredictionIO
• Books
  • Japanese translation of Scala Puzzlers
  • Scala 300 recipes, etc.
Twitter: @takezoen
GitHub: https://github.com/takezoe

Treasure Data
Ready to use Cloud Data Platform
[Diagram labels: Logs, Device Data, Batch Data; Data Collection, Cloud Storage (PlazmaDB, Table Schema), Distributed Data Processing (Jobs); Job Management, SQL Editor, Scheduler, Workflows, Machine Learning; legend: Treasure Data OSS, Third Party OSS]

Presto at Treasure Data
• 2013
  • Presto, developed at Facebook, was open-sourced
  • Treasure Data was providing Impala As A Service
• 2014
  • Launched Presto As A Service as a replacement for Impala
• 2015
  • 20,000 queries / day
• 2019
  • Reached 1,000,000 queries / day
  • Presto creators (Martin, Dain and David) left Facebook and founded an NPO, the Presto Software Foundation (prestosql), then joined Starburst
  • Hosted Presto Conference in Tokyo

Testing can be more important when upgrading Presto
• Presto development is super active
  • 27 releases in 2019
  • 18 releases in 2020 at this point (Nov 14)
• No stable version
  • Incompatible updates come with bug fixes
• Sticking to one version cannot be an option
  • Backporting bug fixes and new features from newer versions also gets challenging over time
How can we upgrade Presto safely?

In order to minimize the risk
[Diagram: practices grouped under Test, Release process, and Monitoring]
• Unit test
• Integration test
• System test
• Regular performance proving
• Gradual migration for big updates
• Internal dogfooding
• Cluster status monitoring

What is missing?
• Covering a variety of use cases
  • Performance degradation in corner cases
  • Unknown compatibility issues
• Production-scale environment
  • Data size and characteristics
  • Number of queries, cluster size, etc.

presto-query-simulator
Test using production data and queries with security and safety
[Diagram: a Query Set built from the Query Log runs on a Base Cluster and a Target Cluster, reading from the Real Database and writing to a Test Database; Hashed Results and Query Metrics feed the Report]
• Security: We don't see customer data and query results
• Safety: We don't cause any side effects on customer data
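One way to compare results between the base and target clusters without reading them is to store only digests. The sketch below is an illustration of that idea, not the presto-query-simulator implementation: the function name `hash_result`, the choice of SHA-256, and the order-insensitive combination are all assumptions, and the sample rows are made up.

```python
import hashlib
from typing import Iterable, Sequence


def hash_result(rows: Iterable[Sequence[object]]) -> str:
    """Return an order-insensitive digest of a query result (hypothetical helper).

    Each row is hashed on its own, the row digests are sorted, and the
    sorted digests are hashed again, so only digests need to be kept.
    """
    row_digests = sorted(
        hashlib.sha256(repr(tuple(row)).encode("utf-8")).hexdigest()
        for row in rows
    )
    combined = hashlib.sha256()
    for digest in row_digests:
        combined.update(digest.encode("ascii"))
    return combined.hexdigest()


# Made-up results of one query on the base and target clusters.
base_rows = [(1, "/index.html"), (2, "/about.html")]
target_rows = [(2, "/about.html"), (1, "/index.html")]  # same rows, other order

print(hash_result(base_rows) == hash_result(target_rows))           # True
print(hash_result(base_rows) == hash_result([(1, "/index.html")]))  # False
```

Sorting the row digests keeps the comparison stable for queries without ORDER BY, where two clusters may legitimately return the same rows in a different order.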

Challenges in query-simulator
• Query simulation takes a very long time
  • Testing 1 day of queries will take at least 1 day, theoretically
  • Not only time, but also the cost of test clusters matters
• Result verification is not straightforward
  • Many false positives and duplications
  • Result analysis tends to depend on personal knowledge

Make query simulation faster
• Reduce number of queries by grouping by query signature (up to -90%)
• Reduce amount of data by narrowing table scan ranges (up to -80%)
• Use multiple Presto clusters
• Test only long-running queries

Query signature
Simplified expression of query structure
Query:
  SELECT time, path, user_agent
  FROM access
  WHERE TD_INTERVAL(time, '-1M')
Signature:
  S(T) access->#
Query:
  SELECT time, path, user_agent
  FROM access a
  INNER JOIN account b ON a.account_id = b.account_id
Signature:
  S(J(T,T)) access->#,account->#
Open-source Scala implementation is included in Airframe:
https://github.com/wvlet/airframe/blob/master/airframe-sql/src/main/scala/wvlet/airframe/sql/analyzer/QuerySignature.scala
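To show how grouping by signature shrinks a query set, here is a toy sketch in Python. It is not the Airframe `QuerySignature` implementation, which works on a parsed SQL plan; this version only pattern-matches FROM and JOIN clauses, and the names `toy_signature` and `group_by_signature` are hypothetical. It reproduces the two signatures on the slide but would not handle subqueries, CTEs, or other constructs.

```python
import re
from collections import defaultdict
from typing import Dict, List

# Assumption: table references appear directly after FROM or JOIN.
TABLE_REF = re.compile(r"\b(?:FROM|JOIN)\s+([A-Za-z_][\w.]*)", re.IGNORECASE)


def toy_signature(sql: str) -> str:
    """Build a simplified structural signature, ignoring literals and columns."""
    tables = TABLE_REF.findall(sql)
    # Fold the table list into nested joins: T, J(T,T), J(J(T,T),T), ...
    relation = "T"
    for _ in tables[1:]:
        relation = f"J({relation},T)"
    scans = ",".join(f"{name}->#" for name in tables)
    return f"S({relation}) {scans}"


def group_by_signature(queries: List[str]) -> Dict[str, List[str]]:
    """Group queries so that one representative per signature can be tested."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for sql in queries:
        groups[toy_signature(sql)].append(sql)
    return dict(groups)


queries = [
    "SELECT time, path, user_agent FROM access WHERE TD_INTERVAL(time, '-1M')",
    "SELECT time, path, user_agent FROM access a "
    "INNER JOIN account b ON a.account_id = b.account_id",
    "SELECT time, path, user_agent FROM access WHERE TD_INTERVAL(time, '-1d')",
]

for signature, members in group_by_signature(queries).items():
    print(len(members), signature)
```

The first and third queries differ only in the interval literal, so they fall into the same group and only one of them needs to run.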

Narrowing scan ranges
Use only x% of total records by adding a time range predicate
[Figure: time distribution of records, with the original scan range and the narrower "use this range only" range marked]
Original query:
  SELECT time, path, user_agent
  FROM access
Rewritten query:
  SELECT time, path, user_agent
  FROM (SELECT time, path, user_agent FROM access)
  WHERE TD_TIME_RANGE(time, from, to)
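The rewrite above can be sketched in two steps: pick a time range that holds roughly x% of the records, then wrap the query with a `TD_TIME_RANGE` predicate. The sketch below makes several assumptions that the slide does not state: the record distribution is available as a histogram of fixed-width time buckets, the most recent buckets are the ones kept, and the query output contains a `time` column. The helper names are hypothetical.

```python
from typing import List, Tuple


def choose_time_range(
    histogram: List[Tuple[int, int]], bucket_seconds: int, fraction: float
) -> Tuple[int, int]:
    """Pick the most recent range holding at least `fraction` of the records.

    `histogram` is a non-empty list of (bucket_start_unixtime, record_count)
    pairs sorted by bucket start.
    """
    if not histogram:
        raise ValueError("histogram must not be empty")
    needed = sum(count for _, count in histogram) * fraction
    covered = 0
    start = histogram[-1][0]
    # Walk backwards from the newest bucket until enough records are covered.
    for bucket_start, count in reversed(histogram):
        start = bucket_start
        covered += count
        if covered >= needed:
            break
    end = histogram[-1][0] + bucket_seconds
    return start, end


def narrow_scan(sql: str, time_from: int, time_to: int) -> str:
    """Wrap a query so that only rows inside the chosen time range remain."""
    return (
        f"SELECT * FROM ({sql}) "
        f"WHERE TD_TIME_RANGE(time, {time_from}, {time_to})"
    )


# Made-up hourly histogram: most records are in the latest buckets.
histogram = [(0, 100), (3600, 100), (7200, 300), (10800, 500)]
time_from, time_to = choose_time_range(histogram, 3600, 0.2)
print(narrow_scan("SELECT time, path, user_agent FROM access", time_from, time_to))
```

Because records are rarely spread evenly over time, choosing the range from record counts rather than from the raw time span keeps the retained fraction close to the intended x%.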

We choose these options depending on the purpose of query simulation
• Reduce number of queries by grouping by query signature (up to -90%)
• Reduce amount of data by narrowing table scan ranges (up to -80%)
• Use multiple Presto clusters
• Test only long-running queries
Is the simulation for checking compatibility, or for checking performance differences?

Make result verification easier
• Auto detect non-deterministic query results
  • Run a query multiple times to see if the results are the same
• Group similar errors
  • Fuzzy comparison of error messages
• List problematic queries based on internal metrics
  • Performance, resource usage, scan ranges, worker distribution, etc.
• Finally, a human checks the problematic queries
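The first two ideas above can be sketched with the Python standard library. This is an illustration only: the slides do not say which similarity measure or threshold the simulator uses, so `difflib.SequenceMatcher`, the 0.6 threshold, the helper names, and the sample error messages are all assumptions.

```python
from difflib import SequenceMatcher
from typing import List


def is_nondeterministic(result_hashes: List[str]) -> bool:
    """True if repeated runs of the same query produced different result hashes."""
    return len(set(result_hashes)) > 1


def group_similar_errors(messages: List[str], threshold: float = 0.6) -> List[List[str]]:
    """Greedily group error messages by fuzzy similarity.

    Each message joins the first group whose first member is similar enough,
    otherwise it starts a new group.
    """
    groups: List[List[str]] = []
    for message in messages:
        for group in groups:
            if SequenceMatcher(None, group[0], message).ratio() >= threshold:
                group.append(message)
                break
        else:
            groups.append([message])
    return groups


# Made-up error messages for illustration.
errors = [
    "line 1:8: Column 'parh' cannot be resolved",
    "line 3:15: Column 'user_agnt' cannot be resolved",
    "Query exceeded per-node memory limit of 8GB",
]

for group in group_similar_errors(errors):
    print(len(group), group[0])

print(is_nondeterministic(["abc", "abc", "abd"]))  # True
```

Grouping lets a reviewer look at one representative per failure type instead of every failed query, which addresses the duplication problem mentioned earlier.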

We just need to check queries listed on the report
[Screenshot: simulation report, with these annotations]
• Gives a possible reason for the inconsistent result
• Failures are grouped by the similarity of error messages
• Lists only queries more than 5 min slower
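The "more than 5 min slower" rule on the report amounts to a simple filter over per-query timings. A minimal sketch, assuming each query has one elapsed time per cluster; the `QueryRun` record, its field names, and the sample timings are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class QueryRun:
    query_id: str
    base_seconds: float    # elapsed time on the base cluster
    target_seconds: float  # elapsed time on the target cluster

    @property
    def slowdown(self) -> float:
        return self.target_seconds - self.base_seconds


def slower_queries(runs: List[QueryRun], min_slowdown_seconds: float = 300.0) -> List[QueryRun]:
    """Return runs that got slower by more than the threshold, worst first."""
    slow = [run for run in runs if run.slowdown > min_slowdown_seconds]
    return sorted(slow, key=lambda run: run.slowdown, reverse=True)


# Made-up timings for illustration.
runs = [
    QueryRun("q1", base_seconds=120.0, target_seconds=130.0),
    QueryRun("q2", base_seconds=600.0, target_seconds=1000.0),
    QueryRun("q3", base_seconds=60.0, target_seconds=900.0),
]

for run in slower_queries(runs):
    print(run.query_id, run.slowdown)
```

An absolute threshold like this keeps short queries with large relative but small absolute changes out of the report.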

Future work for further improvement
• Run query simulation more frequently (hopefully regularly)
  • Further speed-up is required
  • Maintain small but effective query sets for quick tests
  • Automate test environment provisioning
• Improve test coverage
  • Overcome some system-level restrictions
  • Test with schema and data of that time (like time travel)
  • Improve the resolution of query grouping
• ...and more!!

Related Work
• Snowtrail: Testing with Production Queries on a Cloud Database
  https://resources.snowflake.com/report/snowtrail-testing-with-production-series-on-a-cloud-database
• Load testing Aurora MySQL using query logs (クエリログを使ったAurora MySQLの負荷テスト)
  https://techlife.cookpad.com/entry/2020/10/13/090000
• Building an Automated Testing Framework Based on Chaos Mesh and Argo
  https://pingcap.com/blog/building-automated-testing-framework-based-on-chaos-mesh-and-argo
