Movatterモバイル変換


[0]ホーム

URL:


Skip to main content

Advertisement

Springer Nature Link
Log in

An Executable Sequential Specification for Spark Aggregation

  • Conference paper
  • First Online:
Networked Systems(NETYS 2017)

Part of the book series:Lecture Notes in Computer Science ((LNCCN,volume 10299))

Included in the following conference series:

Abstract

Spark is a new promising platform for scalable data-parallel computation. It provides several high-level application programming interfaces (APIs) to perform parallel data aggregation. Since execution of parallel aggregation in Spark is inherently non-deterministic, a natural requirement for Spark programs is to give the same result for any execution on the same data set. We presentPureSpark, an executable formal Haskell specification for Spark aggregate combinators. Our specification allows us to deduce the precise condition for deterministic outcomes from Spark aggregation. We report case studies analyzing deterministic outcomes and correctness of Spark programs.

This is a preview of subscription content,log in via an institution to check access.

Access this chapter

Subscribe and save

Springer+ Basic
¥17,985 /Month
  • Get 10 units per month
  • Download Article/Chapter or eBook
  • 1 Unit = 1 Article or 1 Chapter
  • Cancel anytime
Subscribe now

Buy Now

Chapter
JPY 3498
Price includes VAT (Japan)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
eBook
JPY 5719
Price includes VAT (Japan)
  • Available as EPUB and PDF
  • Read on any device
  • Instant download
  • Own it forever
Softcover Book
JPY 7149
Price includes VAT (Japan)
  • Compact, lightweight edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide -see info

Tax calculation will be finalised at checkout

Purchases are for personal use only

Similar content being viewed by others

References

  1. Apache Spark.https://github.com/apache/spark

  2. IBM DB2 Version 9.7. Partitioned Tables.https://ibm.biz/BdHyYR

  3. The Scalaz project.https://github.com/scalaz

  4. PureSpark.https://github.com/guluchen/purespark

  5. Bennett, J., Grout, R., Pebay, P., Roe, D., Thompson, D.: Numerically stable, single-pass, parallel statistics algorithms. In: CLUSTER, pp. 1–8 (2009)

    Google Scholar 

  6. Bird, R.S.: An introduction to the theory of lists. In: Broy, M. (eds) Logic of Programming and Calculi of Discrete Design. NATO ASI Series (Series F: Computer and Systems Sciences), vol. 36, pp. 5–42. Springer, Heidelberg (1987)

    Google Scholar 

  7. Bocchino Jr., R.L., Adve, V.S., Dig, D., Adve, S.V., Heumann, S., Komuravelli, R., Overbey, J., Simmons, P., Sung, H., Vakilian, M.: A type and effect system for deterministic parallel Java. In: OOPSLA, pp. 97–116 (2009)

    Google Scholar 

  8. Bocchino Jr., R.L., Heumann, S., Honarmand, N., Adve, S.V., Adve, V.S., Welc, A., Shpeisman, T.: Safe nondeterminism in a deterministic-by-default parallel language. SIGPLAN Not.46(1), 535–548 (2011)

    Google Scholar 

  9. Budimlic, Z., Burke, M.G., Cavé, V., Knobe, K., Lowney, G., Newton, R., Palsberg, J., Peixotto, D.M., Sarkar, V., Schlimbach, F., Tasirlar, S.: Concurrent collections. Sci. Program.18(3–4), 203–217 (2010)

    Google Scholar 

  10. Burnim, J., Sen, K.: Asserting and checking determinism for multithreaded programs. Commun. ACM53(6), 97–105 (2010)

    Article  Google Scholar 

  11. Chaudhuri, S.: An overview of query optimization in relational systems. In: PODS 1998 (1998)

    Google Scholar 

  12. Chen, Y., Hong, C., Lengál, O., Mu, S., Sinha, N., Wang, B.: An executable sequential specification for Spark aggregationarXiv:1702.02439 [cs.DC] (2017)

  13. Chen, Y.-F., Hong, C.-D., Sinha, N., Wang, B.-Y.: Commutativity of reducers. In: Baier, C., Tinelli, C. (eds.) TACAS 2015. LNCS, vol. 9035, pp. 131–146. Springer, Heidelberg (2015). doi:10.1007/978-3-662-46681-0_9

    Google Scholar 

  14. Chu, C., Kim, S.K., Lin, Y., Yu, Y., Bradski, G.R., Ng, A.Y., Olukotun, K.: Map-Reduce for machine learning on multicore. In: NIPS, pp. 281–288 (2006)

    Google Scholar 

  15. Dean, J., Ghemawat, S.: MapReduce: a flexible data processing tool. Commun. ACM53(1), 72–77 (2010)

    Article  Google Scholar 

  16. Dörre, J., Apel, S., Lengauer, C.: Modeling and optimizing MapReduce programs. Concurrency Comput. Pract. Experience27(7), 1734–1766 (2015)

    Article  Google Scholar 

  17. Emoto, K., Fischer, S., Hu, Z.: Generate, test, and aggregate. In: Seidl, H. (ed.) ESOP 2012. LNCS, vol. 7211, pp. 254–273. Springer, Heidelberg (2012). doi:10.1007/978-3-642-28869-2_13

    Chapter  Google Scholar 

  18. Herodotou, H., Babu, S.: Profiling, what-if analysis, and cost-based optimization of MapReduce programs. Proc. VLDB Endowment4(11), 1111–1122 (2011)

    Google Scholar 

  19. Herodotou, H., Borisov, N., Babu, S.: Query optimization techniques for partitioned tables. In: SIGMOD 2011, pp. 49–60 (2011)

    Google Scholar 

  20. Ioannidis, Y.E.: Query optimization. ACM Comput. Surv.28(1), 121–123 (1996)

    Article  Google Scholar 

  21. Karloff, H., Suri, S., Vassilvitskii, S.: A model of computation for MapReduce. In: SODA, pp. 938–948 (2010)

    Google Scholar 

  22. Leijen, D., Fähndrich, M., Burckhardt, S.: Prettier concurrency: Purely functional concurrent revisions. In: Haskell, pp. 83–94 (2011)

    Google Scholar 

  23. Liu, C., Zhang, J., Zhou, H., McDirmid, S., Guo, Z., Moscibroda, T.: Automating distributed partial aggregation. In: SoCC, pp. 1:1–1:12 (2014)

    Google Scholar 

  24. Radoi, C., Fink, S.J., Rabbah, R.M., Sridharan, M.: Translating imperative code to MapReduce. In: OOPSLA, pp. 909–927 (2014)

    Google Scholar 

  25. Sakr, S., Liu, A., Fayoumi, A.G.: The family of MapReduce and large-scale data processing systems. ACM Comput. Surv.46(1), 11:1–11:44 (2013)

    Google Scholar 

  26. Tian, Y., Tatikonda, S., Reinwald, B.: Scalable and numerically stable descriptive statistics in SystemML. In: ICDE, pp. 1351–1359 (2012)

    Google Scholar 

  27. Xiao, T., Zhang, J., Zhou, H., Guo, Z., McDirmid, S., Lin, W., Chen, W., Zhou, L.: Nondeterminism in MapReduce considered harmful? an empirical study on non-commutative aggregators in MapReduce programs. In: Companion Proceedings of ICSE, pp. 44–53 (2014)

    Google Scholar 

  28. Xu, Z., Hirzel, M., Rothermel, G.: Semantic characterization of MapReduce workloads. In: IISWC, pp. 87–97 (2013)

    Google Scholar 

  29. Zaharia, M., Chowdhury, M., Das, T., Dave, A., Ma, J., McCauly, M., Franklin, M.J., Shenker, S., Stoica, I.: Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. In: NSDI, pp. 15–28 (2012)

    Google Scholar 

  30. Zaharia, M., Xin, R.S., Wendell, P., Das, T., Armbrust, M., Dave, A., Meng, X., Rosen, J., Venkataraman, S., Franklin, M.J., Ghodsi, A., Gonzalez, J., Shenker, S., Stoica, I.: Apache spark: a unified engine for big data processing. Commun. ACM59(11), 56–65 (2016)

    Article  Google Scholar 

  31. Zhang, Z., Cherkasova, L., Verma, A., Loo, B.T.: Performance modeling and optimization of deadline-driven Pig programs. ACM Trans. Auton. Adapt. Syst.8(3), 14:1–14:28 (2013)

    Google Scholar 

Download references

Acknowledgement

This work was supported by the Czech Science Foundation (project 17-12465S), the BUT FIT project FIT-S-17-4014, the IT4IXS: IT4Innovations Excellence in Science project (LQ1602), and Ministry of Science and Technology, R.O.C. (MOST projects 103-2221-E-001-019-MY3 and 103-2221-E-001-020-MY3).

Author information

Authors and Affiliations

  1. Academia Sinica, Taipei, Taiwan

    Yu-Fang Chen, Chih-Duo Hong, Ondřej Lengál, Shin-Cheng Mu & Bow-Yaw Wang

  2. Brno University of Technology, Brno, Czech Republic

    Ondřej Lengál

  3. IBM Research, New Delhi, India

    Nishant Sinha

Authors
  1. Yu-Fang Chen

    You can also search for this author inPubMed Google Scholar

  2. Chih-Duo Hong

    You can also search for this author inPubMed Google Scholar

  3. Ondřej Lengál

    You can also search for this author inPubMed Google Scholar

  4. Shin-Cheng Mu

    You can also search for this author inPubMed Google Scholar

  5. Nishant Sinha

    You can also search for this author inPubMed Google Scholar

  6. Bow-Yaw Wang

    You can also search for this author inPubMed Google Scholar

Corresponding author

Correspondence toOndřej Lengál.

Editor information

Editors and Affiliations

  1. Department of Computer Science, University of California, Santa Barbara, Santa Barbara, California, USA

    Amr El Abbadi

  2. Université de Lausanne, Lausanne, Switzerland

    Benoît Garbinato

Rights and permissions

Copyright information

© 2017 Springer International Publishing AG

About this paper

Cite this paper

Chen, YF., Hong, CD., Lengál, O., Mu, SC., Sinha, N., Wang, BY. (2017). An Executable Sequential Specification for Spark Aggregation. In: El Abbadi, A., Garbinato, B. (eds) Networked Systems. NETYS 2017. Lecture Notes in Computer Science(), vol 10299. Springer, Cham. https://doi.org/10.1007/978-3-319-59647-1_31

Download citation

Publish with us

Access this chapter

Subscribe and save

Springer+ Basic
¥17,985 /Month
  • Get 10 units per month
  • Download Article/Chapter or eBook
  • 1 Unit = 1 Article or 1 Chapter
  • Cancel anytime
Subscribe now

Buy Now

Chapter
JPY 3498
Price includes VAT (Japan)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
eBook
JPY 5719
Price includes VAT (Japan)
  • Available as EPUB and PDF
  • Read on any device
  • Instant download
  • Own it forever
Softcover Book
JPY 7149
Price includes VAT (Japan)
  • Compact, lightweight edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide -see info

Tax calculation will be finalised at checkout

Purchases are for personal use only


[8]ページ先頭

©2009-2025 Movatter.jp