Part of the book series:Lecture Notes in Computer Science ((TLDKS,volume 10430))
542Accesses
Abstract
Recently, some Scientific Workflow Management Systems (SWfMSs) with provenance support (e.g. Chiron) have been deployed in the cloud. However, they typically use a single cloud site. In this paper, we consider a multisite cloud, where the data and computing resources are distributed at different sites (possibly in different regions). Based on a multisite architecture of SWfMS,i.e. multisite Chiron, and its provenance model, we propose a multisite task scheduling algorithm that considers the time to generate provenance data. We performed an extensive experimental evaluation of our algorithm using Microsoft Azure multisite cloud and two real-life scientific workflows (Buzz and Montage). The results show that our scheduling algorithm is up to 49.6% better than baseline algorithms in terms of total execution time.
This is a preview of subscription content,log in via an institution to check access.
Access this chapter
Subscribe and save
- Get 10 units per month
- Download Article/Chapter or eBook
- 1 Unit = 1 Article or 1 Chapter
- Cancel anytime
Buy Now
- Chapter
- JPY 3498
- Price includes VAT (Japan)
- eBook
- JPY 5719
- Price includes VAT (Japan)
- Softcover Book
- JPY 7149
- Price includes VAT (Japan)
Tax calculation will be finalised at checkout
Purchases are for personal use only
Similar content being viewed by others
Notes
- 1.
In a shared file system, all the computing nodes of the cluster share some data storage that are generally remotely located [22].
- 2.
When each site has the same computing capacity, this ism. But when not all the sites have the same computing capacity, this should be\(m^2\).
- 3.
For instance, the time to execute “SELECT count(*) from eactivity” at the provenance database from each site: 0.0027s from the WEU site, 0.0253s from the NEU site and 0.1117s from the CUS site.
References
Azure service bus.http://azure.microsoft.com/en-us/services/service-bus/
DBLP Computer Science Bibliography.http://dblp.uni-trier.de/
Microsoft Azure.http://azure.microsoft.com
Parameters of different types of VMS in microsoft Azure.https://azure.microsoft.com/en-us/pricing/details/virtual-machines/
Bhuvaneshwar, K., Sulakhe, D., Gauba, R., Rodriguez, A., Madduri, R., Dave, U., Lacinski, L., Foster, I., Gusev, Y., Madhavan, S.: A case study for cloud based high throughput analysis of NGS data using the globus genomics system. Comput. Struct. Biotechnol. J.13, 64–74 (2015)
Bouganim, L., Fabret, F., Mohan, C., Valduriez, P.: Dynamic query scheduling in data integration systems. In: Proceedings of the 16th International Conference on Data Engineering, pp. 425–434 (2000)
Bouganim, L., Kapitskaia, O., Valduriez, P.: Memory-adaptive scheduling for large query execution. In: Proceedings of the 1998 ACM CIKM International Conference on Information and Knowledge Management, pp. 105–115 (1998)
Cala, J., Xu, Y., Wijaya, E.A., Missier, P.: From scripted HPC-based NGS pipelines to workflows on the cloud. In: 14th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid), pp. 694–700 (2014)
Calheiros, R.N., Buyya, R.: Meeting deadlines of scientific workflows in public clouds with tasks replication. IEEE Trans. Parallel Distrib. Syst.25(7), 1787–1796 (2014)
de Oliveira, D., Ocaña, K.A.C.S., Baião, F., Mattoso, M.: A provenance-based adaptive scheduling heuristic for parallel scientific workflows in clouds. J. Grid Comput.10(3), 521–552 (2012)
Dean, J., Ghemawat, S.: Mapreduce: simplified data processing on large clusters. In: 6th Symposium on Operating System Design and Implementation (OSDI), pp. 137–150 (2004)
Deelman, E., Gannon, D., Shields, M., Taylor, I.: Workflows and e-science: an overview of workflow system features and capabilities. Future Gener. Comput. Syst.25(5), 528–540 (2009)
Deelman, E., Singh, G., Livny, M., Berriman, B., Good, J.: The cost of doing science on the cloud: the montage example. In: International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1–12 (2008)
Deelman, E., Singh, G., Su, M.-H., Blythe, J., Gil, Y., Kesselman, C., Mehta, G., Vahi, K., Berriman, G.B., Good, J., Laity, A., Jacob, J.C., Katz, D.S.: Pegasus: a framework for mapping complex scientific workflows onto distributed systems. Sci. Program.13(3), 219–237 (2005)
Dias, J., Ogasawara, E.S., de Oliveira, D., Porto, F., Valduriez, P., Mattoso, M.: Algebraic dataflows for big data analysis. In: IEEE International Conference on Big Data, pp. 150–155 (2013)
Duan, R., Prodan, R., Li, X.: Multi-objective game theoretic scheduling of bag-of-tasks workflows on hybrid clouds. IEEE Trans. Cloud Comput.2(1), 29–42 (2014)
Etminani, K., Naghibzadeh, M.: A min-min max-min selective algorithm for grid task scheduling. In: The Third IEEE/IFIP International Conference in Central Asia on Internet (ICI 2007), pp. 1–7 (2007)
Hiden, H., Watson, P., Woodman, S., Leahy, D.: E-science central: cloud-based e-science and its application to chemical property modelling. Technical report CS-TR-1227 (2010)
Hiden, H., Woodman, S., Watson, P., Cala, J.: Developing cloud applications using the e-science central platform. Philos. Trans. R. Soc. London A Math. Phys. Eng. Sci.371, 2012 (1983)
Liu, J., Pacitti, E., Valduriez, P., de Oliveira, D., Mattoso, M.: Multi-objective scheduling of scientific workflows in multisite clouds. Future Gener. Comput. Syst.63, 76–95 (2016)
Liu, J., Pacitti, E., Valduriez, P., Mattoso, M.: Parallelization of scientific workflows in the cloud. Research report RR-8565 (2014)
Liu, J., Pacitti, E., Valduriez, P., Mattoso, M.: A survey of data-intensive scientific workflow management. J. Grid Comput.13, 457–493 (2015)
Liu, J., Pacitti, E., Valduriez, P., Mattoso, M.: Scientific workflow scheduling with provenance support in multisite cloud. In: 12th International Meeting on High Performance Computing for Computational Science VECPAR, p. 8 (2016)
Liu, J., Silva, V., Pacitti, E., Valduriez, P., Mattoso, M.: Scientific workflow partitioning in multisite cloud. In: Lopes, L., et al. (eds.) Euro-Par 2014. LNCS, vol. 8805, pp. 105–116. Springer, Cham (2014). doi:10.1007/978-3-319-14325-5_10
Maheswaran, M., Ali, S., Siegel, H.J., Hensgen, D., Freund, R.F.: Dynamic matching and scheduling of a class of independent tasks onto heterogeneous computing systems. In: 8th Heterogeneous Computing Workshop, p. 30 (1999)
Martins, V., Pacitti, E., Dick, M.E., Jiménez-Peris, R.: Scalable and topology-aware reconciliation on P2P networks. Distrib. Parallel Databases24(1–3), 1–43 (2008)
Mattoso, M., Dias, J., Ocana, K.A., Ogasawara, E., Costa, F., Horta, F., Silva, V., de Oliveira, D.: Dynamic steering of HPC scientific workflows: a survey. Future Gener. Comput. Syst.46, 100–113 (2014)
Ogasawara, E.S., Dias, J., Silva, V., Chirigati, F.S., de Oliveira, D., Porto, F., Valduriez, P., Mattoso, M.: Chiron: a parallel engine for algebraic scientific workflows. Concurr. Comput. Pract. Exp.25(16), 2327–2341 (2013)
Özsu, M.T., Valduriez, P.: Principles of Distributed Database Systems. Springer, New York (2011). doi:10.1007/978-1-4419-8834-8
Pacitti, E., Akbarinia, R., Dick, M.E.: P2P Techniques for Decentralized Applications. Synthesis Lectures on Data Management. Morgan & Claypool Publishers, San Rafael (2012)
Pineda-Morales, L., Costan, A., Antoniu, G.: Towards multi-site metadata management for geographically distributed cloud workflows. In: 2015 IEEE International Conference on Cluster Computing, CLUSTER, pp. 294–303 (2015)
Sandberg, R., Golgberg, D., Kleiman, S., Walsh, D., Lyon, B.: Design and implementation of the sun network filesystem. In: Innovations in Internetworking, pp. 379–390 (1988)
Schenk, O., Gärtner, K.: Two-level dynamic scheduling in PARDISO: improved scalability on shared memory multiprocessing systems. Parallel Comput.28(2), 187–197 (2002)
Smanchat, S., Indrawan, M., Ling, S., Enticott, C., Abramson, D.: Scheduling multiple parameter sweep workflow instances on the grid. In: 5th IEEE International Conference on E-Science, pp. 300–306 (2009)
Tarapanoff, K., Quoniam, L., de Araújo Júnior, R.H., Alvares, L.: Intelligence obtained by applying data mining to a database of french theses on the subject of brazil. Inf. Res.7(1), 41–53 (2001)
Topcuouglu, H., Hariri, S., Wu, M.: Performance-effective and low-complexity task scheduling for heterogeneous computing. IEEE Trans. Parallel Distrib. Syst.13(3), 260–274 (2002)
Wieczorek, M., Prodan, R., Fahringer, T.: Scheduling of scientific workflows in the askalon grid environment. SIGMOD Rec.34(3), 56–62 (2005)
Yu, Z., Shi, W.: An adaptive rescheduling strategy for grid workflow applications. In: IEEE International Parallel and Distributed Processing Symposium (IPDPS), pp. 1–8 (2007)
Acknowledgment
Work partially funded by EU H2020 Programme and MCTI/RNP-Brazil (HPC4E grant agreement number 689772), CNPq, FAPERJ, and INRIA (SciDISC project), Microsoft (ZcloudFlow project) and performed in the context of the Computational Biology Institute (www.ibc-montpellier.fr). We would like to thank Weiwei Chen and the Pegasus project for their help in modeling and executing the Montage SWf.
Author information
Authors and Affiliations
Inria, Microsoft-Inria Joint Centre, LIRMM and University of Montpellier, Montpellier, France
Ji Liu, Esther Pacitti & Patrick Valduriez
COPPE, Federal University of Rio de Janeiro, Rio de Janeiro, Brazil
Marta Mattoso
- Ji Liu
You can also search for this author inPubMed Google Scholar
- Esther Pacitti
You can also search for this author inPubMed Google Scholar
- Patrick Valduriez
You can also search for this author inPubMed Google Scholar
- Marta Mattoso
You can also search for this author inPubMed Google Scholar
Corresponding author
Correspondence toJi Liu.
Editor information
Editors and Affiliations
IRIT, Paul Sabatier University, Toulouse, France
Abdelkader Hameurlain
FAW, University of Linz, Linz, Austria
Josef Küng
FAW, University of Linz, Linz, Austria
Roland Wagner
Inria and LIRMM, University of Montpellier, Montpellier, France
Reza Akbarinia
Inria and LIRMM, University of Montpellier, Montpellier, France
Esther Pacitti
Appendix
Appendix
In this section, we analyze the convergence of the DIM algorithm. Since there are finite tasks in the bag of tasks scheduled at Site\(s_i\), Algorithm 2 always converges. Next, let us analyze the convergence of Algorithm 1. In order to make the problem simple, we assume that the total time at the two sites are the same after executing Algorithm 2 and we denote the time to transfer data, including input data and provenance data, as\(\alpha * ExecTime(T,s)\) in Formula3. Thus, we have:
TotalTime(T, s), |T|,AvgWorkload(T) and\(ComputingCapacity(VM_i)\) represent the same values as those in Formula3.CC(s) represents the computing capacity at each site.C andCC(s) are constant values. We denote the total execution time at each site by\(T_{opt}\) when all the sites are in the optimal situation,i.e. each site has the same total execution time. We denote a cost function in Formula11 to measure the distance of current scheduling plan (SP) and the optimal scheduling plan.
where\(J(s_i)\) represents the cost function of Site\(s_i\).\(T_i\) and\(s_i\) are defined by the scheduling planSP.
First, let us analyze the situation when all the sites have the same computing capacity. In each iteration of Algorithm 1, we choose two sites (\(s_i\),\(s_j\)) with the maximum total execution time and minimum total execution time. That means that we choose at least one site\(s_i\), which has the biggest distance between the current situation and the optimal situation among all the sites as shown in Formula12. In addition, the total execution time of the maximum total execution time should be bigger than or equal to\(T_{opt}\) and the total execution time of the minimum total execution time should be smaller than or equal toTopt as shown in Formula13.
We denote the other site by\(s_j\). Since the two sites have the same computing capacity, we denote the total execution time of each site by\(TotalTime'( T, s_{ij} )\) after executing Algorithm 2. Since the two sites have the same execution time, according to Formula10, they should have the same number of tasks, which is denoted byT. Thus, we have Formula14.
According this formula, we can get Formula15.
Thus, the cost function of the two selected sites after one iteration can be expressed as Formula16.
where we denote the scheduling plan after the iteration bySP’. We denote the cost function of the two selected sites before the iteration as\(J(SP, s_i, s_j)\). According to Formula13, we get the Formula17.
Thus, afterm iterations,J(SP) becomes less than\(\frac{J(SP)}{2}\). The minimum modification in one iteration should be bigger than\(\frac{C}{CC(s)}\). Thus, after at most\(m*\log _2 (J(SP) - \frac{C}{CC(s)})\) iterations, Algorithm 1 terminates.
Then, let us consider a situation where some sites (\(\rho m\),\(>1\rho >0\)) have much bigger computing capacity than other sites (\((1-\rho )m\)). We assume that after executing Algorithm 2 between two sites of different computing capacity, the total execution time of the two sites becomes the original total execution time of the site, which has bigger computing capacity. In this situation, in order to reduceJ(SP) to\(\frac{J(SP)}{2}\), we need at most\(\rho * (1-\rho ) * m^2\). Thus, after at most\(\rho * (1-\rho )*m^2*\log _2 (J(SP) - \frac{C}{CC(s)})\) iterations, Algorithm 1 terminates. Furthermore, the other situations are between the first situation where all the sites have the same computing capacity and this situation.
Rights and permissions
Copyright information
© 2017 Springer-Verlag GmbH Germany
About this chapter
Cite this chapter
Liu, J., Pacitti, E., Valduriez, P., Mattoso, M. (2017). Scientific Workflow Scheduling with Provenance Data in a Multisite Cloud. In: Hameurlain, A., Küng, J., Wagner, R., Akbarinia, R., Pacitti, E. (eds) Transactions on Large-Scale Data- and Knowledge-Centered Systems XXXIII. Lecture Notes in Computer Science(), vol 10430. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-55696-2_3
Download citation
Published:
Publisher Name:Springer, Berlin, Heidelberg
Print ISBN:978-3-662-55695-5
Online ISBN:978-3-662-55696-2
eBook Packages:Computer ScienceComputer Science (R0)
Share this chapter
Anyone you share the following link with will be able to read this content:
Sorry, a shareable link is not currently available for this article.
Provided by the Springer Nature SharedIt content-sharing initiative