
NOTE: These scripts are now deprecated, because the information that they show is now part of Apache Spark's UI. To view how each task in a stage spent its time, click on the stage detail page in the Spark UI, and then click "Event timeline". This will display a (much nicer-looking version of) the plot output by the scripts here.

This repository contains scripts to understand the performance of jobs run with Apache Spark.

## Configuring Spark to log performance data

In order to use these tools, you'll first need to configure Spark to log performance data while jobs are running by setting the Spark configuration parameter `spark.eventLog.enabled` to `true`. This configuration parameter causes the Spark master to write a log with information about each completed task to a file on the master. The master already tracks this information (much of it is displayed in Spark's web UI); setting this configuration option just causes the master to output all of the data for later consumption. By default, the event log is written to a series of files in the folder `/tmp/spark-events/` on the machine where the Spark master runs. Spark creates a folder within that directory for each application, and logs are stored in a file named `EVENT_LOG_1` within the application's folder. You can change the parameter `spark.eventLog.dir` to write the event log elsewhere (e.g., to HDFS). See the Spark configuration documentation for more information about configuring logging.
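For example, here is a minimal sketch of enabling event logging when submitting a job (the application script name is just a placeholder; the two `spark.eventLog.*` parameters are the ones described above, and can equally be set in `spark-defaults.conf`):

# enable event logging and (optionally) override where the logs are written
spark-submit \
  --conf spark.eventLog.enabled=true \
  --conf spark.eventLog.dir=/tmp/spark-events \
  your_application.py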

These scripts are most accurate with more recent versions of Spark because of instrumentation inaccuracies that were recently fixed (e.g., SPARK-2570 was only included in 1.3.1, and fixes a problem where not all of the time to write shuffle files to disk was recorded). If you use these scripts with an older version of Spark, the compute time may include time that was actually spent doing I/O (in addition to the inaccuracies in compute time mentioned in the Missing data section).

## Analyzing performance data

After you have collected an event log file with JSON data about the job(s) you'd like to understand, run the `parse_logs.py` script to generate a visualization of the jobs' performance:

python parse_logs.py EVENT_LOG_1 --waterfall-only

The `--waterfall-only` flag tells the script to just generate the visualization, and skip more complex performance analysis. To see all available options, use the `--help` flag.
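For example, to see the full list of options for the script:

python parse_logs.py --help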

For each job in the `EVENT_LOG_1` file, the Python script will output a gnuplot file that, when plotted, will generate a waterfall depicting how time was spent by each of the tasks in the job. The plot files are named `[INPUT_FILENAME]_[JOB_ID]_waterfall.gp`. To plot the waterfall for job 0, for example:

gnuplot EVENT_LOG_1_0_waterfall.gp

will create a file `EVENT_LOG_1_0_waterfall.pdf`. The waterfall plots each task as a horizontal line, which is colored according to how the task spent its time. Tics on the y-axis delineate the different stages of tasks.

Here's an example waterfall:

Waterfall example

This waterfall shows the runtime of a job that sorts a small amount of input data. The job has two stages that each have 40 tasks. The first stage reads input data and saves the data to disk, sorted based on which reduce task will read the data. Tasks in the second stage read the data saved by the previous stage over the network and then sort that partition of the data. One thing that stands out for this job is that tasks in the first stage sometimes spend a lot of time writing output data to disk (shown in teal). In this case, this is because the job was running on top of the ext3 file system, which performs poorly when writing many small files; once we upgraded to ext4, the job completed much more quickly and most of the teal-colored time spent writing shuffle output data disappeared.

Keep in mind that these plots do not depict exactly when a task was doing what. For example, for the "Output write wait" in the above plot, each task writes output data many times during task execution, and not only at the very end. Spark only logs the total time spent writing output data, and does not log exactly when during execution tasks block writing output, because logging the latter would require saving significantly more information. So, the placement of each component of execution within a task's runtime in the graph is purely fictional; only the total amount of time spent in each part of task execution and the start and end time of the task are accurate.

One thing to keep in mind is that Spark does not currently include instrumentation to measure the time spent reading input data from disk or writing job output to disk (the "Output write wait" shown in the waterfall is time to write shuffle output to disk, which Spark does have instrumentation for); as a result, the time shown as "Compute" may include time using the disk. We have a custom Hadoop branch that measures the time Hadoop spends transferring data to/from disk, and we are hopeful that similar timing metrics will someday be included in the Hadoop FileStatistics API. In the meantime, it is not currently possible to understand how much of a Spark task's time is spent reading from disk via HDFS.

## Missing data

Parts of the visualization are currently inaccurate due to incomplete parts of Spark's logging. In particular, the HDFS read time and output write time (when writing to HDFS) are only accurate if you are running a special version of Spark and HDFS. Contact Kay Ousterhout if you are interested in doing this; otherwise, just be aware that part of the pink compute time may be spent reading from or writing to HDFS. (In the future, we're hoping that this time will be exposed in the default metrics reported by HDFS; see HADOOP-11873 to track progress on adding such metrics.)

Another problem is that the shuffle write time is currently incorrect (it doesn't include much of the time spent writing shuffle output) for many versions of Spark. This Spark JIRA search tracks the various issues with the shuffle write time. This will result in the shuffle write time showing up as compute time.

Finally, Spark does not currently expose metrics about the amount of time spent spilling intermediate data to disk. Spilling happens when a task uses more memory than is available on the machine, so it needs to temporarily store intermediate data on disk. You can tell if your tasks are spilling because the stage page in Spark's UI reports the bytes spilled to disk by each task. Until a fix for SPARK-3577 is merged, spill time is not logged by Spark tasks, so time spent spilling data to disk will show up as compute time.

## FAQ

#### How do I upgrade my version of gnuplot to support generating PDF output?

The gnuplot file generated by these scripts uses the pdfcairo terminal device to generate slightly nicer-looking files, but many versions of gnuplot (especially on Macs) do not include pdfcairo by default.
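One way to check whether your gnuplot build already includes pdfcairo (a quick sketch, not from the original instructions) is to ask gnuplot to list its terminal types and search for it:

# "set terminal" with no arguments prints the list of available terminal types
gnuplot -e "set terminal" 2>&1 | grep -i pdfcairo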

If you'd like a quick fix to this problem, you can just change the gnuplot file generated by the scripts to generate postscript output instead of pdf output. To do this, change the line at the top that reads `set terminal pdfcairo ...` to instead say `set terminal postscript ...`, and at the very bottom of the file, change the line that reads `set output X.pdf` to instead say `set output X.ps`. The graph won't look quite as nice, but you can get a very basic version working.
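As a rough sketch, assuming the generated file is named `EVENT_LOG_1_0_waterfall.gp` as in the example above, both edits can be made with sed (a backup copy is kept with a `.bak` suffix):

# switch the terminal from pdfcairo to postscript and the output extension from .pdf to .ps
sed -i.bak -e 's/^set terminal pdfcairo/set terminal postscript/' \
           -e 's/\.pdf/.ps/' EVENT_LOG_1_0_waterfall.gp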

For Mac users, to update your version of gnuplot to include pdfcairo, I recommend first uninstalling your current version of gnuplot, and then using Homebrew to install gnuplot with pdfcairo enabled:

brew install gnuplot --cairo --pdf --tutorial

For Linux users, I recommend uninstalling your current version of gnuplot and then installing gnuplot 5.0.1 with cairo:

Download gnuplot-5 from http://www.gnuplot.info/download.html
tar -zxf gnuplot-5.0.1.tar.gz
cd gnuplot-5.0.1
./configure --with-cairo
make
make install

I've found that trying to upgrade existing versions of gnuplot to include pdfcairo is much more difficult than just re-installing gnuplot.

In theory, you can also use MacPorts to install gnuplot with pdfcairo (described here), but I've found that this often fails.

#### This graph is way too hard to read! How do I make it bigger?

The first line of the gnuplot file includes a size (by default, `size 5,5`). The two coordinates describe the plot's width and height; increase these to generate a larger graph.
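For example, one way to double the default dimensions in a generated file (using the same hypothetical filename as above) is:

# rewrite "size 5,5" to "size 10,10"; a backup copy is kept with a .bak suffix
sed -i.bak 's/size 5,5/size 10,10/' EVENT_LOG_1_0_waterfall.gp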

#### I'm getting an error that says "AttributeError: 'module' object has no attribute 'percentile'"

If you get an error that ends with:

median_runtime = numpy.percentile(runtimes, 50)
AttributeError: 'module' object has no attribute 'percentile'

you need to upgrade your version of numpy to at least 1.5.
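A minimal way to check the installed version and upgrade it with pip (assuming pip manages the Python installation you use to run the script):

python -c "import numpy; print(numpy.__version__)"
pip install --upgrade numpy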

#### Parts of my plot are outside of the plot area, and/or some tasks seem to be overlapping others.

This typically happens when you try to plot multiple gnuplot files with one command, e.g., with a command like:

gnuplot *.gp

Gnuplot will put all of the data into a single plot, rather than in separate PDFs. Try plotting each gnuplot file separately.
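For example, a small shell loop that runs gnuplot once per waterfall file (the `*_waterfall.gp` pattern matches the file names produced by `parse_logs.py`, as described above):

# plot each file in its own gnuplot invocation so each job gets its own PDF
for f in *_waterfall.gp; do
    gnuplot "$f"
done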

#### How can I figure out which parts of the code each stage executes?

While this is a common cause of confusion in Spark, this question is unfortunately not answered by this tool. Right now, the event logs don't include this information. There have been murmurs about adding more detail about this to the Spark UI, but as far as I know, this hasn't been done yet.

#### The scheduler delay (yellow) seems to take a large amount of time. What might the cause be?

At a very high level, usually the best way to reduce scheduler delay is to consolidate jobs into fewer tasks.
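As a hedged sketch (this configuration parameter is not mentioned in the text above), one knob that controls how many tasks shuffle stages create by default is `spark.default.parallelism`; the value below is purely illustrative, and calling `coalesce()` or `repartition()` with fewer partitions has a similar effect for a specific RDD:

# fewer partitions means fewer tasks for the scheduler to launch and track
spark-submit --conf spark.default.parallelism=100 your_application.py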

The scheduler delay is essentially message propagation delay to (1) send a message from the scheduler to an executor to launch a task and (2) send a message from the executor back to the scheduler stating that the task has completed. This can be high when the task is large or the task result is large, because then it takes longer for the scheduler to ship the task to the executor, and vice versa for the result. To diagnose this problem, take a look at the Spark UI (or directly look at the JSON in your event log) to check the result size of each task, to see if it is large.

Another reason the scheduler delay can be high is if the scheduler is launching a large number of tasks over a short period. In this case, the task completed messages get queued at the scheduler and can't be processed immediately, which also increases scheduler delay. When I've benchmarked the Spark scheduler in the past, I've found it can handle about 1.5K tasks / second (see section 7.6 in this paper). This was for a now-antiquated version of Spark, but in theory this shouldn't have changed much, because the Spark performance regression tests run before each release have a test that measures this.

One last reason we've sometimes seen in the AMPLab is that it can take a while for the executor to actually launch the task (which involves getting a thread -- possibly a new one -- from a thread pool for the task to run in). This is currently included in scheduler delay. A few months ago, I proposed adding metrics about this to the Spark UI, but it was deemed too confusing and not useful to a sufficiently broad audience (see discussion here: apache/spark#2832). If you want to understand this metric, you can implement the reverse of this commit (which is part of the aforementioned pull request) to measure whether this time is the cause of the long scheduler delay.
