Movatterモバイル変換

Part of the book series:Lecture Notes in Computer Science ((TLDKS,volume 10430))

542Accesses
5Citations

Abstract

Recently, some Scientific Workflow Management Systems (SWfMSs) with provenance support (e.g. Chiron) have been deployed in the cloud. However, they typically use a single cloud site. In this paper, we consider a multisite cloud, where the data and computing resources are distributed at different sites (possibly in different regions). Based on a multisite architecture of SWfMS,i.e. multisite Chiron, and its provenance model, we propose a multisite task scheduling algorithm that considers the time to generate provenance data. We performed an extensive experimental evaluation of our algorithm using Microsoft Azure multisite cloud and two real-life scientific workflows (Buzz and Montage). The results show that our scheduling algorithm is up to 49.6% better than baseline algorithms in terms of total execution time.

This is a preview of subscription content,log in via an institution to check access.

Access this chapter

Subscribe and save

Springer+ Basic

¥17,985 /Month

Get 10 units per month
Download Article/Chapter or eBook
1 Unit = 1 Article or 1 Chapter
Cancel anytime

Subscribe now

Buy Now

Chapter: JPY 3498; Price includes VAT (Japan)

eBook: JPY 5719; Price includes VAT (Japan)

Softcover Book: JPY 7149; Price includes VAT (Japan)

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Scientific Workflow Scheduling with Provenance Support in Multisite Cloud

Big Data Scientific Workflows in the Cloud: Challenges and Future Prospects

Distributed Caching of Scientific Workflows in Multisite Cloud

Notes

1.
In a shared file system, all the computing nodes of the cluster share some data storage that are generally remotely located [22].
2.
When each site has the same computing capacity, this ism. But when not all the sites have the same computing capacity, this should be$m^2$.
3.
For instance, the time to execute “SELECT count(*) from eactivity” at the provenance database from each site: 0.0027s from the WEU site, 0.0253s from the NEU site and 0.1117s from the CUS site.

References

Azure service bus.http://azure.microsoft.com/en-us/services/service-bus/
DBLP Computer Science Bibliography.http://dblp.uni-trier.de/
Microsoft Azure.http://azure.microsoft.com
Montage.http://montage.ipac.caltech.edu/docs/gridtools.html
Parameters of different types of VMS in microsoft Azure.https://azure.microsoft.com/en-us/pricing/details/virtual-machines/
Bhuvaneshwar, K., Sulakhe, D., Gauba, R., Rodriguez, A., Madduri, R., Dave, U., Lacinski, L., Foster, I., Gusev, Y., Madhavan, S.: A case study for cloud based high throughput analysis of NGS data using the globus genomics system. Comput. Struct. Biotechnol. J.13, 64–74 (2015)
Article Google Scholar
Bouganim, L., Fabret, F., Mohan, C., Valduriez, P.: Dynamic query scheduling in data integration systems. In: Proceedings of the 16th International Conference on Data Engineering, pp. 425–434 (2000)
Google Scholar
Bouganim, L., Kapitskaia, O., Valduriez, P.: Memory-adaptive scheduling for large query execution. In: Proceedings of the 1998 ACM CIKM International Conference on Information and Knowledge Management, pp. 105–115 (1998)
Google Scholar
Cala, J., Xu, Y., Wijaya, E.A., Missier, P.: From scripted HPC-based NGS pipelines to workflows on the cloud. In: 14th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid), pp. 694–700 (2014)
Google Scholar
Calheiros, R.N., Buyya, R.: Meeting deadlines of scientific workflows in public clouds with tasks replication. IEEE Trans. Parallel Distrib. Syst.25(7), 1787–1796 (2014)
Article Google Scholar
de Oliveira, D., Ocaña, K.A.C.S., Baião, F., Mattoso, M.: A provenance-based adaptive scheduling heuristic for parallel scientific workflows in clouds. J. Grid Comput.10(3), 521–552 (2012)
Article Google Scholar
Dean, J., Ghemawat, S.: Mapreduce: simplified data processing on large clusters. In: 6th Symposium on Operating System Design and Implementation (OSDI), pp. 137–150 (2004)
Google Scholar
Deelman, E., Gannon, D., Shields, M., Taylor, I.: Workflows and e-science: an overview of workflow system features and capabilities. Future Gener. Comput. Syst.25(5), 528–540 (2009)
Article Google Scholar
Deelman, E., Singh, G., Livny, M., Berriman, B., Good, J.: The cost of doing science on the cloud: the montage example. In: International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1–12 (2008)
Google Scholar
Deelman, E., Singh, G., Su, M.-H., Blythe, J., Gil, Y., Kesselman, C., Mehta, G., Vahi, K., Berriman, G.B., Good, J., Laity, A., Jacob, J.C., Katz, D.S.: Pegasus: a framework for mapping complex scientific workflows onto distributed systems. Sci. Program.13(3), 219–237 (2005)
Google Scholar
Dias, J., Ogasawara, E.S., de Oliveira, D., Porto, F., Valduriez, P., Mattoso, M.: Algebraic dataflows for big data analysis. In: IEEE International Conference on Big Data, pp. 150–155 (2013)
Google Scholar
Duan, R., Prodan, R., Li, X.: Multi-objective game theoretic scheduling of bag-of-tasks workflows on hybrid clouds. IEEE Trans. Cloud Comput.2(1), 29–42 (2014)
Article Google Scholar
Etminani, K., Naghibzadeh, M.: A min-min max-min selective algorithm for grid task scheduling. In: The Third IEEE/IFIP International Conference in Central Asia on Internet (ICI 2007), pp. 1–7 (2007)
Google Scholar
Hiden, H., Watson, P., Woodman, S., Leahy, D.: E-science central: cloud-based e-science and its application to chemical property modelling. Technical report CS-TR-1227 (2010)
Google Scholar
Hiden, H., Woodman, S., Watson, P., Cala, J.: Developing cloud applications using the e-science central platform. Philos. Trans. R. Soc. London A Math. Phys. Eng. Sci.371, 2012 (1983)
Google Scholar
Liu, J., Pacitti, E., Valduriez, P., de Oliveira, D., Mattoso, M.: Multi-objective scheduling of scientific workflows in multisite clouds. Future Gener. Comput. Syst.63, 76–95 (2016)
Article Google Scholar
Liu, J., Pacitti, E., Valduriez, P., Mattoso, M.: Parallelization of scientific workflows in the cloud. Research report RR-8565 (2014)
Google Scholar
Liu, J., Pacitti, E., Valduriez, P., Mattoso, M.: A survey of data-intensive scientific workflow management. J. Grid Comput.13, 457–493 (2015)
Article Google Scholar
Liu, J., Pacitti, E., Valduriez, P., Mattoso, M.: Scientific workflow scheduling with provenance support in multisite cloud. In: 12th International Meeting on High Performance Computing for Computational Science VECPAR, p. 8 (2016)
Google Scholar
Liu, J., Silva, V., Pacitti, E., Valduriez, P., Mattoso, M.: Scientific workflow partitioning in multisite cloud. In: Lopes, L., et al. (eds.) Euro-Par 2014. LNCS, vol. 8805, pp. 105–116. Springer, Cham (2014). doi:10.1007/978-3-319-14325-5_10
Google Scholar
Maheswaran, M., Ali, S., Siegel, H.J., Hensgen, D., Freund, R.F.: Dynamic matching and scheduling of a class of independent tasks onto heterogeneous computing systems. In: 8th Heterogeneous Computing Workshop, p. 30 (1999)
Google Scholar
Martins, V., Pacitti, E., Dick, M.E., Jiménez-Peris, R.: Scalable and topology-aware reconciliation on P2P networks. Distrib. Parallel Databases24(1–3), 1–43 (2008)
Article Google Scholar
Mattoso, M., Dias, J., Ocana, K.A., Ogasawara, E., Costa, F., Horta, F., Silva, V., de Oliveira, D.: Dynamic steering of HPC scientific workflows: a survey. Future Gener. Comput. Syst.46, 100–113 (2014)
Article Google Scholar
Ogasawara, E.S., Dias, J., Silva, V., Chirigati, F.S., de Oliveira, D., Porto, F., Valduriez, P., Mattoso, M.: Chiron: a parallel engine for algebraic scientific workflows. Concurr. Comput. Pract. Exp.25(16), 2327–2341 (2013)
Article Google Scholar
Özsu, M.T., Valduriez, P.: Principles of Distributed Database Systems. Springer, New York (2011). doi:10.1007/978-1-4419-8834-8
Google Scholar
Pacitti, E., Akbarinia, R., Dick, M.E.: P2P Techniques for Decentralized Applications. Synthesis Lectures on Data Management. Morgan & Claypool Publishers, San Rafael (2012)
Google Scholar
Pineda-Morales, L., Costan, A., Antoniu, G.: Towards multi-site metadata management for geographically distributed cloud workflows. In: 2015 IEEE International Conference on Cluster Computing, CLUSTER, pp. 294–303 (2015)
Google Scholar
Sandberg, R., Golgberg, D., Kleiman, S., Walsh, D., Lyon, B.: Design and implementation of the sun network filesystem. In: Innovations in Internetworking, pp. 379–390 (1988)
Google Scholar
Schenk, O., Gärtner, K.: Two-level dynamic scheduling in PARDISO: improved scalability on shared memory multiprocessing systems. Parallel Comput.28(2), 187–197 (2002)
Article MATH Google Scholar
Smanchat, S., Indrawan, M., Ling, S., Enticott, C., Abramson, D.: Scheduling multiple parameter sweep workflow instances on the grid. In: 5th IEEE International Conference on E-Science, pp. 300–306 (2009)
Google Scholar
Tarapanoff, K., Quoniam, L., de Araújo Júnior, R.H., Alvares, L.: Intelligence obtained by applying data mining to a database of french theses on the subject of brazil. Inf. Res.7(1), 41–53 (2001)
Google Scholar
Topcuouglu, H., Hariri, S., Wu, M.: Performance-effective and low-complexity task scheduling for heterogeneous computing. IEEE Trans. Parallel Distrib. Syst.13(3), 260–274 (2002)
Article Google Scholar
Wieczorek, M., Prodan, R., Fahringer, T.: Scheduling of scientific workflows in the askalon grid environment. SIGMOD Rec.34(3), 56–62 (2005)
Article Google Scholar
Yu, Z., Shi, W.: An adaptive rescheduling strategy for grid workflow applications. In: IEEE International Parallel and Distributed Processing Symposium (IPDPS), pp. 1–8 (2007)
Google Scholar

Download references

Acknowledgment

Work partially funded by EU H2020 Programme and MCTI/RNP-Brazil (HPC4E grant agreement number 689772), CNPq, FAPERJ, and INRIA (SciDISC project), Microsoft (ZcloudFlow project) and performed in the context of the Computational Biology Institute (www.ibc-montpellier.fr). We would like to thank Weiwei Chen and the Pegasus project for their help in modeling and executing the Montage SWf.

Author information

Authors and Affiliations

Inria, Microsoft-Inria Joint Centre, LIRMM and University of Montpellier, Montpellier, France
Ji Liu, Esther Pacitti & Patrick Valduriez
COPPE, Federal University of Rio de Janeiro, Rio de Janeiro, Brazil
Marta Mattoso

Authors

Ji Liu
View author publications
You can also search for this author inPubMed Google Scholar
Esther Pacitti
View author publications
You can also search for this author inPubMed Google Scholar
Patrick Valduriez
View author publications
You can also search for this author inPubMed Google Scholar
Marta Mattoso
View author publications
You can also search for this author inPubMed Google Scholar

Corresponding author

Correspondence toJi Liu.

Editor information

Editors and Affiliations

IRIT, Paul Sabatier University, Toulouse, France
Abdelkader Hameurlain
FAW, University of Linz, Linz, Austria
Josef Küng
FAW, University of Linz, Linz, Austria
Roland Wagner
Inria and LIRMM, University of Montpellier, Montpellier, France
Reza Akbarinia
Inria and LIRMM, University of Montpellier, Montpellier, France
Esther Pacitti

Appendix

In this section, we analyze the convergence of the DIM algorithm. Since there are finite tasks in the bag of tasks scheduled at Site$s_i$, Algorithm 2 always converges. Next, let us analyze the convergence of Algorithm 1. In order to make the problem simple, we assume that the total time at the two sites are the same after executing Algorithm 2 and we denote the time to transfer data, including input data and provenance data, as$\alpha * ExecTime(T,s)$ in Formula3. Thus, we have:

$$\begin{aligned} \begin{aligned} TotalTime( T, s ) =&(1+\alpha ) * ExecTime( T, s ) = C * \frac{|T|}{CC(s)}\\ C =&(1 + \alpha ) * AvgWorkload(T)\\ CC(s) =&\sum _{VM_i \in s}ComputingCapacity( VM_i ) \end{aligned} \end{aligned}$$

(10)

TotalTime(T, s), |T|,AvgWorkload(T) and$ComputingCapacity(VM_i)$ represent the same values as those in Formula3.CC(s) represents the computing capacity at each site.C andCC(s) are constant values. We denote the total execution time at each site by$T_{opt}$ when all the sites are in the optimal situation,i.e. each site has the same total execution time. We denote a cost function in Formula11 to measure the distance of current scheduling plan (SP) and the optimal scheduling plan.

$$\begin{aligned} \begin{aligned} J( SP ) = \sum _{i = 1}^{m}(J( s_i )) = \sum _{i = 1}^{m}(TotalTime( T_i, s_i) - T_{opt})^2 \end{aligned} \end{aligned}$$

(11)

where$J(s_i)$ represents the cost function of Site$s_i$.$T_i$ and$s_i$ are defined by the scheduling planSP.

First, let us analyze the situation when all the sites have the same computing capacity. In each iteration of Algorithm 1, we choose two sites ($s_i$,$s_j$) with the maximum total execution time and minimum total execution time. That means that we choose at least one site$s_i$, which has the biggest distance between the current situation and the optimal situation among all the sites as shown in Formula12. In addition, the total execution time of the maximum total execution time should be bigger than or equal to$T_{opt}$ and the total execution time of the minimum total execution time should be smaller than or equal toTopt as shown in Formula13.

$$\begin{aligned} \begin{aligned} J( s_i ) = \max _{j = 1}^{m}(TotalTime( T_j, s_j) - T_{opt})^2 \end{aligned} \end{aligned}$$

(12)

$$\begin{aligned} \begin{aligned} (TotalTime( T_i, s_i ) - T_{opt}) * (T_{opt} - TotalTime( T_j, s_j )) \ge 0 \end{aligned} \end{aligned}$$

(13)

We denote the other site by$s_j$. Since the two sites have the same computing capacity, we denote the total execution time of each site by$TotalTime'( T, s_{ij} )$ after executing Algorithm 2. Since the two sites have the same execution time, according to Formula10, they should have the same number of tasks, which is denoted byT. Thus, we have Formula14.

$$\begin{aligned} \begin{aligned} TotalTime( T_i, s_i) =&\,C * \frac{|T_i|}{CC(s)}\\ TotalTime( T_j, s_j) =&\,C * \frac{|T_j|}{CC(s)}\\ TotalTime'( T, s_{ij} ) =&\,C * \frac{|T|}{CC(s)} \\ 2 * |T| =&|T_i| + |T_j| \end{aligned} \end{aligned}$$

(14)

According this formula, we can get Formula15.

$$\begin{aligned} \begin{aligned} TotalTime'( T, s_{ij} ) = \frac{TotalTime( T_i, s_i) + TotalTime( T_j, s_j)}{2} \end{aligned} \end{aligned}$$

(15)

Thus, the cost function of the two selected sites after one iteration can be expressed as Formula16.

$$\begin{aligned} \begin{aligned} J'( SP', s_i, s_j ) =\,&(TotalTime'( T, s_i) - T_{opt})^2 + (TotalTime'( T, s_j) - T_{opt})^2\\ =\,&2 * (TotalTime'( T, s_{ij}) - T_{opt})^2\\ =\,&2 * (\frac{TotalTime( T_i, s_i) + TotalTime( T_j, s_j)}{2} - T_{opt})^2\\ =\,&2 * (\frac{(TotalTime( T_i, s_i) - T_{opt}) - (T_{opt} - TotalTime( T_j, s_j)}{2}))^2\\ =\,&\frac{(TotalTime( T_i, s_i) - T_{opt})^2 + (T_{opt} - TotalTime( T_j, s_j)^2}{2} \\&\,- (TotalTime( T_i, s_i) - T_{opt}) * (T_{opt} - TotalTime( T_j, s_j)\\ =\,&\frac{J( SP, s_i, s_j )}{2} \\&\,- (TotalTime( T_i, s_i) - T_{opt}) * (T_{opt} - TotalTime( T_j, s_j) \end{aligned} \end{aligned}$$

(16)

where we denote the scheduling plan after the iteration bySP’. We denote the cost function of the two selected sites before the iteration as$J(SP, s_i, s_j)$. According to Formula13, we get the Formula17.

$$\begin{aligned} \begin{aligned} J'( SP', s_i, s_j ) <\frac{J( SP, s_i, s_j )}{2} \end{aligned} \end{aligned}$$

(17)

Thus, afterm iterations,J(SP) becomes less than$\frac{J(SP)}{2}$. The minimum modification in one iteration should be bigger than$\frac{C}{CC(s)}$. Thus, after at most$m*\log _2 (J(SP) - \frac{C}{CC(s)})$ iterations, Algorithm 1 terminates.

Then, let us consider a situation where some sites ($\rho m$,$>1\rho >0$) have much bigger computing capacity than other sites ($(1-\rho )m$). We assume that after executing Algorithm 2 between two sites of different computing capacity, the total execution time of the two sites becomes the original total execution time of the site, which has bigger computing capacity. In this situation, in order to reduceJ(SP) to$\frac{J(SP)}{2}$, we need at most$\rho * (1-\rho ) * m^2$. Thus, after at most$\rho * (1-\rho )*m^2*\log _2 (J(SP) - \frac{C}{CC(s)})$ iterations, Algorithm 1 terminates. Furthermore, the other situations are between the first situation where all the sites have the same computing capacity and this situation.

Rights and permissions

Reprints and permissions

Copyright information

About this chapter

Cite this chapter

Liu, J., Pacitti, E., Valduriez, P., Mattoso, M. (2017). Scientific Workflow Scheduling with Provenance Data in a Multisite Cloud. In: Hameurlain, A., Küng, J., Wagner, R., Akbarinia, R., Pacitti, E. (eds) Transactions on Large-Scale Data- and Knowledge-Centered Systems XXXIII. Lecture Notes in Computer Science(), vol 10430. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-55696-2_3

Download citation

DOI:https://doi.org/10.1007/978-3-662-55696-2_3
Published:08 August 2017
Publisher Name:Springer, Berlin, Heidelberg
Print ISBN:978-3-662-55695-5
Online ISBN:978-3-662-55696-2
eBook Packages:Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics

Movatterモバイル変換

Scientific Workflow Scheduling with Provenance Data in a Multisite Cloud

Abstract

Access this chapter

Subscribe and save

Buy Now

Similar content being viewed by others

Scientific Workflow Scheduling with Provenance Support in Multisite Cloud

Big Data Scientific Workflows in the Cloud: Challenges and Future Prospects

Distributed Caching of Scientific Workflows in Multisite Cloud

Notes

References

Acknowledgment

Author information

Authors and Affiliations

Corresponding author

Editor information

Editors and Affiliations

Appendix

Appendix

Rights and permissions

Copyright information

About this chapter

Cite this chapter

Download citation

Share this chapter

Publish with us

Access this chapter

Subscribe and save

Buy Now