Movatterモバイル変換


[0]ホーム

URL:


Skip to main content

Advertisement

Springer Nature Link
Log in

Scientific Workflow Scheduling with Provenance Data in a Multisite Cloud

  • Chapter
  • First Online:

Part of the book series:Lecture Notes in Computer Science ((TLDKS,volume 10430))

Abstract

Recently, some Scientific Workflow Management Systems (SWfMSs) with provenance support (e.g. Chiron) have been deployed in the cloud. However, they typically use a single cloud site. In this paper, we consider a multisite cloud, where the data and computing resources are distributed at different sites (possibly in different regions). Based on a multisite architecture of SWfMS,i.e. multisite Chiron, and its provenance model, we propose a multisite task scheduling algorithm that considers the time to generate provenance data. We performed an extensive experimental evaluation of our algorithm using Microsoft Azure multisite cloud and two real-life scientific workflows (Buzz and Montage). The results show that our scheduling algorithm is up to 49.6% better than baseline algorithms in terms of total execution time.

This is a preview of subscription content,log in via an institution to check access.

Access this chapter

Subscribe and save

Springer+ Basic
¥17,985 /Month
  • Get 10 units per month
  • Download Article/Chapter or eBook
  • 1 Unit = 1 Article or 1 Chapter
  • Cancel anytime
Subscribe now

Buy Now

Chapter
JPY 3498
Price includes VAT (Japan)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
eBook
JPY 5719
Price includes VAT (Japan)
  • Available as EPUB and PDF
  • Read on any device
  • Instant download
  • Own it forever
Softcover Book
JPY 7149
Price includes VAT (Japan)
  • Compact, lightweight edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide -see info

Tax calculation will be finalised at checkout

Purchases are for personal use only

Similar content being viewed by others

Notes

  1. 1.

    In a shared file system, all the computing nodes of the cluster share some data storage that are generally remotely located [22].

  2. 2.

    When each site has the same computing capacity, this ism. But when not all the sites have the same computing capacity, this should be\(m^2\).

  3. 3.

    For instance, the time to execute “SELECT count(*) from eactivity” at the provenance database from each site: 0.0027s from the WEU site, 0.0253s from the NEU site and 0.1117s from the CUS site.

References

  1. Azure service bus.http://azure.microsoft.com/en-us/services/service-bus/

  2. DBLP Computer Science Bibliography.http://dblp.uni-trier.de/

  3. Microsoft Azure.http://azure.microsoft.com

  4. Montage.http://montage.ipac.caltech.edu/docs/gridtools.html

  5. Parameters of different types of VMS in microsoft Azure.https://azure.microsoft.com/en-us/pricing/details/virtual-machines/

  6. Bhuvaneshwar, K., Sulakhe, D., Gauba, R., Rodriguez, A., Madduri, R., Dave, U., Lacinski, L., Foster, I., Gusev, Y., Madhavan, S.: A case study for cloud based high throughput analysis of NGS data using the globus genomics system. Comput. Struct. Biotechnol. J.13, 64–74 (2015)

    Article  Google Scholar 

  7. Bouganim, L., Fabret, F., Mohan, C., Valduriez, P.: Dynamic query scheduling in data integration systems. In: Proceedings of the 16th International Conference on Data Engineering, pp. 425–434 (2000)

    Google Scholar 

  8. Bouganim, L., Kapitskaia, O., Valduriez, P.: Memory-adaptive scheduling for large query execution. In: Proceedings of the 1998 ACM CIKM International Conference on Information and Knowledge Management, pp. 105–115 (1998)

    Google Scholar 

  9. Cala, J., Xu, Y., Wijaya, E.A., Missier, P.: From scripted HPC-based NGS pipelines to workflows on the cloud. In: 14th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid), pp. 694–700 (2014)

    Google Scholar 

  10. Calheiros, R.N., Buyya, R.: Meeting deadlines of scientific workflows in public clouds with tasks replication. IEEE Trans. Parallel Distrib. Syst.25(7), 1787–1796 (2014)

    Article  Google Scholar 

  11. de Oliveira, D., Ocaña, K.A.C.S., Baião, F., Mattoso, M.: A provenance-based adaptive scheduling heuristic for parallel scientific workflows in clouds. J. Grid Comput.10(3), 521–552 (2012)

    Article  Google Scholar 

  12. Dean, J., Ghemawat, S.: Mapreduce: simplified data processing on large clusters. In: 6th Symposium on Operating System Design and Implementation (OSDI), pp. 137–150 (2004)

    Google Scholar 

  13. Deelman, E., Gannon, D., Shields, M., Taylor, I.: Workflows and e-science: an overview of workflow system features and capabilities. Future Gener. Comput. Syst.25(5), 528–540 (2009)

    Article  Google Scholar 

  14. Deelman, E., Singh, G., Livny, M., Berriman, B., Good, J.: The cost of doing science on the cloud: the montage example. In: International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1–12 (2008)

    Google Scholar 

  15. Deelman, E., Singh, G., Su, M.-H., Blythe, J., Gil, Y., Kesselman, C., Mehta, G., Vahi, K., Berriman, G.B., Good, J., Laity, A., Jacob, J.C., Katz, D.S.: Pegasus: a framework for mapping complex scientific workflows onto distributed systems. Sci. Program.13(3), 219–237 (2005)

    Google Scholar 

  16. Dias, J., Ogasawara, E.S., de Oliveira, D., Porto, F., Valduriez, P., Mattoso, M.: Algebraic dataflows for big data analysis. In: IEEE International Conference on Big Data, pp. 150–155 (2013)

    Google Scholar 

  17. Duan, R., Prodan, R., Li, X.: Multi-objective game theoretic scheduling of bag-of-tasks workflows on hybrid clouds. IEEE Trans. Cloud Comput.2(1), 29–42 (2014)

    Article  Google Scholar 

  18. Etminani, K., Naghibzadeh, M.: A min-min max-min selective algorithm for grid task scheduling. In: The Third IEEE/IFIP International Conference in Central Asia on Internet (ICI 2007), pp. 1–7 (2007)

    Google Scholar 

  19. Hiden, H., Watson, P., Woodman, S., Leahy, D.: E-science central: cloud-based e-science and its application to chemical property modelling. Technical report CS-TR-1227 (2010)

    Google Scholar 

  20. Hiden, H., Woodman, S., Watson, P., Cala, J.: Developing cloud applications using the e-science central platform. Philos. Trans. R. Soc. London A Math. Phys. Eng. Sci.371, 2012 (1983)

    Google Scholar 

  21. Liu, J., Pacitti, E., Valduriez, P., de Oliveira, D., Mattoso, M.: Multi-objective scheduling of scientific workflows in multisite clouds. Future Gener. Comput. Syst.63, 76–95 (2016)

    Article  Google Scholar 

  22. Liu, J., Pacitti, E., Valduriez, P., Mattoso, M.: Parallelization of scientific workflows in the cloud. Research report RR-8565 (2014)

    Google Scholar 

  23. Liu, J., Pacitti, E., Valduriez, P., Mattoso, M.: A survey of data-intensive scientific workflow management. J. Grid Comput.13, 457–493 (2015)

    Article  Google Scholar 

  24. Liu, J., Pacitti, E., Valduriez, P., Mattoso, M.: Scientific workflow scheduling with provenance support in multisite cloud. In: 12th International Meeting on High Performance Computing for Computational Science VECPAR, p. 8 (2016)

    Google Scholar 

  25. Liu, J., Silva, V., Pacitti, E., Valduriez, P., Mattoso, M.: Scientific workflow partitioning in multisite cloud. In: Lopes, L., et al. (eds.) Euro-Par 2014. LNCS, vol. 8805, pp. 105–116. Springer, Cham (2014). doi:10.1007/978-3-319-14325-5_10

    Google Scholar 

  26. Maheswaran, M., Ali, S., Siegel, H.J., Hensgen, D., Freund, R.F.: Dynamic matching and scheduling of a class of independent tasks onto heterogeneous computing systems. In: 8th Heterogeneous Computing Workshop, p. 30 (1999)

    Google Scholar 

  27. Martins, V., Pacitti, E., Dick, M.E., Jiménez-Peris, R.: Scalable and topology-aware reconciliation on P2P networks. Distrib. Parallel Databases24(1–3), 1–43 (2008)

    Article  Google Scholar 

  28. Mattoso, M., Dias, J., Ocana, K.A., Ogasawara, E., Costa, F., Horta, F., Silva, V., de Oliveira, D.: Dynamic steering of HPC scientific workflows: a survey. Future Gener. Comput. Syst.46, 100–113 (2014)

    Article  Google Scholar 

  29. Ogasawara, E.S., Dias, J., Silva, V., Chirigati, F.S., de Oliveira, D., Porto, F., Valduriez, P., Mattoso, M.: Chiron: a parallel engine for algebraic scientific workflows. Concurr. Comput. Pract. Exp.25(16), 2327–2341 (2013)

    Article  Google Scholar 

  30. Özsu, M.T., Valduriez, P.: Principles of Distributed Database Systems. Springer, New York (2011). doi:10.1007/978-1-4419-8834-8

    Google Scholar 

  31. Pacitti, E., Akbarinia, R., Dick, M.E.: P2P Techniques for Decentralized Applications. Synthesis Lectures on Data Management. Morgan & Claypool Publishers, San Rafael (2012)

    Google Scholar 

  32. Pineda-Morales, L., Costan, A., Antoniu, G.: Towards multi-site metadata management for geographically distributed cloud workflows. In: 2015 IEEE International Conference on Cluster Computing, CLUSTER, pp. 294–303 (2015)

    Google Scholar 

  33. Sandberg, R., Golgberg, D., Kleiman, S., Walsh, D., Lyon, B.: Design and implementation of the sun network filesystem. In: Innovations in Internetworking, pp. 379–390 (1988)

    Google Scholar 

  34. Schenk, O., Gärtner, K.: Two-level dynamic scheduling in PARDISO: improved scalability on shared memory multiprocessing systems. Parallel Comput.28(2), 187–197 (2002)

    Article MATH  Google Scholar 

  35. Smanchat, S., Indrawan, M., Ling, S., Enticott, C., Abramson, D.: Scheduling multiple parameter sweep workflow instances on the grid. In: 5th IEEE International Conference on E-Science, pp. 300–306 (2009)

    Google Scholar 

  36. Tarapanoff, K., Quoniam, L., de Araújo Júnior, R.H., Alvares, L.: Intelligence obtained by applying data mining to a database of french theses on the subject of brazil. Inf. Res.7(1), 41–53 (2001)

    Google Scholar 

  37. Topcuouglu, H., Hariri, S., Wu, M.: Performance-effective and low-complexity task scheduling for heterogeneous computing. IEEE Trans. Parallel Distrib. Syst.13(3), 260–274 (2002)

    Article  Google Scholar 

  38. Wieczorek, M., Prodan, R., Fahringer, T.: Scheduling of scientific workflows in the askalon grid environment. SIGMOD Rec.34(3), 56–62 (2005)

    Article  Google Scholar 

  39. Yu, Z., Shi, W.: An adaptive rescheduling strategy for grid workflow applications. In: IEEE International Parallel and Distributed Processing Symposium (IPDPS), pp. 1–8 (2007)

    Google Scholar 

Download references

Acknowledgment

Work partially funded by EU H2020 Programme and MCTI/RNP-Brazil (HPC4E grant agreement number 689772), CNPq, FAPERJ, and INRIA (SciDISC project), Microsoft (ZcloudFlow project) and performed in the context of the Computational Biology Institute (www.ibc-montpellier.fr). We would like to thank Weiwei Chen and the Pegasus project for their help in modeling and executing the Montage SWf.

Author information

Authors and Affiliations

  1. Inria, Microsoft-Inria Joint Centre, LIRMM and University of Montpellier, Montpellier, France

    Ji Liu, Esther Pacitti & Patrick Valduriez

  2. COPPE, Federal University of Rio de Janeiro, Rio de Janeiro, Brazil

    Marta Mattoso

Authors
  1. Ji Liu

    You can also search for this author inPubMed Google Scholar

  2. Esther Pacitti

    You can also search for this author inPubMed Google Scholar

  3. Patrick Valduriez

    You can also search for this author inPubMed Google Scholar

  4. Marta Mattoso

    You can also search for this author inPubMed Google Scholar

Corresponding author

Correspondence toJi Liu.

Editor information

Editors and Affiliations

  1. IRIT, Paul Sabatier University, Toulouse, France

    Abdelkader Hameurlain

  2. FAW, University of Linz, Linz, Austria

    Josef Küng

  3. FAW, University of Linz, Linz, Austria

    Roland Wagner

  4. Inria and LIRMM, University of Montpellier, Montpellier, France

    Reza Akbarinia

  5. Inria and LIRMM, University of Montpellier, Montpellier, France

    Esther Pacitti

Appendix

Appendix

In this section, we analyze the convergence of the DIM algorithm. Since there are finite tasks in the bag of tasks scheduled at Site\(s_i\), Algorithm 2 always converges. Next, let us analyze the convergence of Algorithm 1. In order to make the problem simple, we assume that the total time at the two sites are the same after executing Algorithm 2 and we denote the time to transfer data, including input data and provenance data, as\(\alpha * ExecTime(T,s)\) in Formula3. Thus, we have:

$$\begin{aligned} \begin{aligned} TotalTime( T, s ) =&(1+\alpha ) * ExecTime( T, s ) = C * \frac{|T|}{CC(s)}\\ C =&(1 + \alpha ) * AvgWorkload(T)\\ CC(s) =&\sum _{VM_i \in s}ComputingCapacity( VM_i ) \end{aligned} \end{aligned}$$
(10)

TotalTime(Ts), |T|,AvgWorkload(T) and\(ComputingCapacity(VM_i)\) represent the same values as those in Formula3.CC(s) represents the computing capacity at each site.C andCC(s) are constant values. We denote the total execution time at each site by\(T_{opt}\) when all the sites are in the optimal situation,i.e. each site has the same total execution time. We denote a cost function in Formula11 to measure the distance of current scheduling plan (SP) and the optimal scheduling plan.

$$\begin{aligned} \begin{aligned} J( SP ) = \sum _{i = 1}^{m}(J( s_i )) = \sum _{i = 1}^{m}(TotalTime( T_i, s_i) - T_{opt})^2 \end{aligned} \end{aligned}$$
(11)

where\(J(s_i)\) represents the cost function of Site\(s_i\).\(T_i\) and\(s_i\) are defined by the scheduling planSP.

First, let us analyze the situation when all the sites have the same computing capacity. In each iteration of Algorithm 1, we choose two sites (\(s_i\),\(s_j\)) with the maximum total execution time and minimum total execution time. That means that we choose at least one site\(s_i\), which has the biggest distance between the current situation and the optimal situation among all the sites as shown in Formula12. In addition, the total execution time of the maximum total execution time should be bigger than or equal to\(T_{opt}\) and the total execution time of the minimum total execution time should be smaller than or equal toTopt as shown in Formula13.

$$\begin{aligned} \begin{aligned} J( s_i ) = \max _{j = 1}^{m}(TotalTime( T_j, s_j) - T_{opt})^2 \end{aligned} \end{aligned}$$
(12)
$$\begin{aligned} \begin{aligned} (TotalTime( T_i, s_i ) - T_{opt}) * (T_{opt} - TotalTime( T_j, s_j )) \ge 0 \end{aligned} \end{aligned}$$
(13)

We denote the other site by\(s_j\). Since the two sites have the same computing capacity, we denote the total execution time of each site by\(TotalTime'( T, s_{ij} )\) after executing Algorithm 2. Since the two sites have the same execution time, according to Formula10, they should have the same number of tasks, which is denoted byT. Thus, we have Formula14.

$$\begin{aligned} \begin{aligned} TotalTime( T_i, s_i) =&\,C * \frac{|T_i|}{CC(s)}\\ TotalTime( T_j, s_j) =&\,C * \frac{|T_j|}{CC(s)}\\ TotalTime'( T, s_{ij} ) =&\,C * \frac{|T|}{CC(s)} \\ 2 * |T| =&|T_i| + |T_j| \end{aligned} \end{aligned}$$
(14)

According this formula, we can get Formula15.

$$\begin{aligned} \begin{aligned} TotalTime'( T, s_{ij} ) = \frac{TotalTime( T_i, s_i) + TotalTime( T_j, s_j)}{2} \end{aligned} \end{aligned}$$
(15)

Thus, the cost function of the two selected sites after one iteration can be expressed as Formula16.

$$\begin{aligned} \begin{aligned} J'( SP', s_i, s_j ) =\,&(TotalTime'( T, s_i) - T_{opt})^2 + (TotalTime'( T, s_j) - T_{opt})^2\\ =\,&2 * (TotalTime'( T, s_{ij}) - T_{opt})^2\\ =\,&2 * (\frac{TotalTime( T_i, s_i) + TotalTime( T_j, s_j)}{2} - T_{opt})^2\\ =\,&2 * (\frac{(TotalTime( T_i, s_i) - T_{opt}) - (T_{opt} - TotalTime( T_j, s_j)}{2}))^2\\ =\,&\frac{(TotalTime( T_i, s_i) - T_{opt})^2 + (T_{opt} - TotalTime( T_j, s_j)^2}{2} \\&\,- (TotalTime( T_i, s_i) - T_{opt}) * (T_{opt} - TotalTime( T_j, s_j)\\ =\,&\frac{J( SP, s_i, s_j )}{2} \\&\,- (TotalTime( T_i, s_i) - T_{opt}) * (T_{opt} - TotalTime( T_j, s_j) \end{aligned} \end{aligned}$$
(16)

where we denote the scheduling plan after the iteration bySP’. We denote the cost function of the two selected sites before the iteration as\(J(SP, s_i, s_j)\). According to Formula13, we get the Formula17.

$$\begin{aligned} \begin{aligned} J'( SP', s_i, s_j ) <\frac{J( SP, s_i, s_j )}{2} \end{aligned} \end{aligned}$$
(17)

Thus, afterm iterations,J(SP) becomes less than\(\frac{J(SP)}{2}\). The minimum modification in one iteration should be bigger than\(\frac{C}{CC(s)}\). Thus, after at most\(m*\log _2 (J(SP) - \frac{C}{CC(s)})\) iterations, Algorithm 1 terminates.

Then, let us consider a situation where some sites (\(\rho m\),\(>1\rho >0\)) have much bigger computing capacity than other sites (\((1-\rho )m\)). We assume that after executing Algorithm 2 between two sites of different computing capacity, the total execution time of the two sites becomes the original total execution time of the site, which has bigger computing capacity. In this situation, in order to reduceJ(SP) to\(\frac{J(SP)}{2}\), we need at most\(\rho * (1-\rho ) * m^2\). Thus, after at most\(\rho * (1-\rho )*m^2*\log _2 (J(SP) - \frac{C}{CC(s)})\) iterations, Algorithm 1 terminates. Furthermore, the other situations are between the first situation where all the sites have the same computing capacity and this situation.

Rights and permissions

Copyright information

© 2017 Springer-Verlag GmbH Germany

About this chapter

Cite this chapter

Liu, J., Pacitti, E., Valduriez, P., Mattoso, M. (2017). Scientific Workflow Scheduling with Provenance Data in a Multisite Cloud. In: Hameurlain, A., Küng, J., Wagner, R., Akbarinia, R., Pacitti, E. (eds) Transactions on Large-Scale Data- and Knowledge-Centered Systems XXXIII. Lecture Notes in Computer Science(), vol 10430. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-55696-2_3

Download citation

Publish with us

Access this chapter

Subscribe and save

Springer+ Basic
¥17,985 /Month
  • Get 10 units per month
  • Download Article/Chapter or eBook
  • 1 Unit = 1 Article or 1 Chapter
  • Cancel anytime
Subscribe now

Buy Now

Chapter
JPY 3498
Price includes VAT (Japan)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
eBook
JPY 5719
Price includes VAT (Japan)
  • Available as EPUB and PDF
  • Read on any device
  • Instant download
  • Own it forever
Softcover Book
JPY 7149
Price includes VAT (Japan)
  • Compact, lightweight edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide -see info

Tax calculation will be finalised at checkout

Purchases are for personal use only


[8]ページ先頭

©2009-2025 Movatter.jp