I have a Spark 2.1 job where I maintain multiple Dataset objects/RDD's that represent different queries over our underlying Hive/HDFS datastore. I've noticed that if I simply iterate over the List of Datasets, they execute one at a time. Each individual query operates in parallel, but I feel that we are not maximizing our resources by not running the different datasets in parallel as well.
There doesn't seem to be a lot out there regarding doing this, as most questions appear to be around parallelizing a single RDD or Dataset, not parallelizing multiple within the same job.
Is this inadvisable for some reason? Can I just use an executor service, thread pool, or futures to do this?
Thanks!
- You can find multiple questions and answers on Stack Overflow itself, for example stackoverflow.com/questions/31757737/… and stackoverflow.com/questions/30214474/…, and there are a lot of materials on the web explaining how to do this as well. – Anahcolus, Feb 17, 2018 at 12:14
- Yes, you can do this; the easiest way is to use Scala's parallel collections. – Raphael Roth, Feb 17, 2018 at 20:39
- @RameshMaharjan Upon review – yes, those questions are relevant, but without knowing that that is the question I should be asking, it's hard to find those answers :). – Brian, Feb 18, 2018 at 2:50
1 Answer
Yes, you can use multithreading in the driver code, but normally this does not increase performance unless your queries operate on very skewed data and/or cannot be parallelized well enough on their own to fully utilize the cluster's resources.
You can do something like this:

val datasets: Seq[Dataset[_]] = ???
datasets.par // transform to a parallel Seq
  .foreach(ds => ds.write.saveAsTable(...))
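The same idea also works with plain Scala Futures, which the question mentions as an option. A minimal runnable sketch of the driver-side concurrency pattern, with a simulated `runQuery` standing in for a Spark action (the sleep and the numeric workload are placeholders, not real Spark calls):

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.duration._
import scala.concurrent.ExecutionContext.Implicits.global

// Hypothetical stand-in for a blocking Spark action such as ds.write.saveAsTable(...).
def runQuery(id: Int): Int = { Thread.sleep(100); id * 10 }

val queryIds = Seq(1, 2, 3)

// Launch all queries concurrently from the driver. Each Future submits its
// action independently, so the Spark scheduler can interleave the jobs.
val futures = queryIds.map(id => Future(runQuery(id)))

// Block until every job has finished; results keep the original order.
val results = Await.result(Future.sequence(futures), 10.seconds)
```

Note that each Future only submits a job; the actual work still runs on the executors, so this helps mainly when individual jobs leave the cluster underutilized.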