Monte Carlo methods using Dataproc and Apache Spark

Dataproc and Apache Spark provide infrastructure and capacity that you can use to run Monte Carlo simulations written in Java, Python, or Scala.

Monte Carlo methods can help answer a wide range of questions in business, engineering, science, mathematics, and other fields. By using repeated random sampling to create a probability distribution for a variable, a Monte Carlo simulation can provide answers to questions that might otherwise be impossible to answer. In finance, for example, pricing an equity option requires analyzing the thousands of ways the price of the stock could change over time. Monte Carlo methods provide a way to simulate those stock price changes over a wide range of possible outcomes, while maintaining control over the domain of possible inputs to the problem.

In the past, running thousands of simulations could take a very long time and accrue high costs. Dataproc enables you to provision capacity on demand and pay for it by the minute. Apache Spark lets you use clusters of tens, hundreds, or thousands of servers to run simulations in a way that is intuitive and scales to meet your needs. This means that you can run more simulations more quickly, which can help your business innovate faster and manage risk better.

Security is always important when working with financial data. Dataproc runs on Google Cloud, which helps to keep your data safe, secure, and private in several ways. For example, all data is encrypted during transmission and when at rest, and Google Cloud is ISO 27001, SOC3, and PCI compliant.

Objectives

  • Create a managed Dataproc cluster with Apache Spark pre-installed.
  • Run a Monte Carlo simulation using Python that estimates the growth of a stock portfolio over time.
  • Run a Monte Carlo simulation using Scala that simulates how a casino makes money.

Costs

In this document, you use the following billable components of Google Cloud:

To generate a cost estimate based on your projected usage, use the pricing calculator.

New Google Cloud users might be eligible for a free trial.

When you finish the tasks that are described in this document, you can avoid continued billing by deleting the resources that you created. For more information, see Clean up.

Before you begin

Create a Dataproc cluster

Follow the steps to create a Dataproc cluster from the Google Cloud console. The default cluster settings, which include two worker nodes, are sufficient for this tutorial.

Disable logging for warnings

By default, Apache Spark prints verbose logging in the console window. For the purpose of this tutorial, change the logging level to log only errors. Follow these steps:

Use ssh to connect to the Dataproc cluster's primary node

The primary node of the Dataproc cluster has the -m suffix on its VM name.

  1. In the Google Cloud console, go to the VM instances page.

    Go to VM instances

  2. In the list of virtual machine instances, click SSH in the row of the instance that you want to connect to.


An SSH window opens connected to the primary node.

Connected, host fingerprint: ssh-rsa 2048 ...
...
user@clusterName-m:~$

Change the logging setting

  1. From the primary node's home directory, edit /etc/spark/conf/log4j.properties.

    sudo nano /etc/spark/conf/log4j.properties
  2. Set log4j.rootCategory equal to ERROR.

    # Set only errors to be logged to the console
    log4j.rootCategory=ERROR, console
    log4j.appender.console=org.apache.log4j.ConsoleAppender
    log4j.appender.console.target=System.err
    log4j.appender.console.layout=org.apache.log4j.PatternLayout
    log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
  3. Save the changes and exit the editor. If you want to enable verbose logging again, reverse the change by restoring the value of log4j.rootCategory to its original (INFO) value.
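If you only want to quiet a single interactive session, you can also set the log level from inside the Spark shell instead of editing the configuration file. The following minimal sketch assumes the pyspark interpreter; the same sc.setLogLevel call is available in the Scala shell.

    # Inside an interactive pyspark session, the SparkContext is available as `sc`.
    # This changes the log level for the current session only; it doesn't modify
    # /etc/spark/conf/log4j.properties.
    sc.setLogLevel("ERROR")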

Spark programming languages

Spark supports Python, Scala, and Java as programming languages for standalone applications, and provides interactive interpreters for Python and Scala. The language you choose is a matter of personal preference. This tutorial uses the interactive interpreters because you can experiment by changing the code, trying different input values, and then viewing the results.

Estimate portfolio growth

In finance, Monte Carlo methods are sometimes used to run simulations that try to predict how an investment might perform. By producing random samples of outcomes over a range of probable market conditions, a Monte Carlo simulation can answer questions about how a portfolio might perform on average or in worst-case scenarios.

Follow these steps to create a simulation that uses Monte Carlo methods to try to estimate the growth of a financial investment based on a few common market factors.

Note: This code is provided only as an example. Don't use this code to make investment decisions.
  1. Start the Python interpreter from the Dataproc primary node.

    pyspark

    Wait for the Spark prompt >>>.

  2. Enter the following code. Make sure you maintain the indentation in the function definition.

    import random
    import time
    from operator import add

    def grow(seed):
        random.seed(seed)
        portfolio_value = INVESTMENT_INIT
        for i in range(TERM):
            growth = random.normalvariate(MKT_AVG_RETURN, MKT_STD_DEV)
            portfolio_value += portfolio_value * growth + INVESTMENT_ANN
        return portfolio_value
  3. Press return until you see the Spark prompt again.

    The preceding code defines a function that models what might happen when an investor has an existing retirement account that is invested in the stock market, to which they add additional money each year. The function generates a random return on the investment, as a percentage, every year for the duration of a specified term. The function takes a seed value as a parameter. This value is used to reseed the random number generator, which ensures that the function doesn't get the same list of random numbers each time it runs. The random.normalvariate function ensures that random values occur across a normal distribution for the specified mean and standard deviation. The function increases the value of the portfolio by the growth amount, which could be positive or negative, and adds a yearly sum that represents further investment.

    You define the required constants in an upcoming step.

  4. Create many seeds to feed to the function. At the Spark prompt, enter the following code, which generates 10,000 seeds:

    seeds = sc.parallelize([time.time() + i for i in range(10000)])

    The result of the parallelize operation is a resilient distributed dataset (RDD), which is a collection of elements that are optimized for parallel processing. In this case, the RDD contains seeds that are based on the current system time.

    When creating the RDD, Spark slices the data based on the number of workers and cores available. In this case, Spark chooses to use eight slices, one slice for each core. That's fine for this simulation, which has 10,000 items of data. For larger simulations, each slice might be larger than the default limit. In that case, specifying a second parameter to parallelize can increase the number of slices, which can help to keep the size of each slice manageable, while Spark still takes advantage of all eight cores. The sketch after these steps shows one way to set the slice count explicitly.

  5. Feed the RDD that contains the seeds to the growth function.

    results = seeds.map(grow)

    The map method passes each seed in the RDD to the grow function and appends each result to a new RDD, which is stored in results. Note that this operation, which performs a transformation, doesn't produce its results right away. Spark won't do this work until the results are needed. This lazy evaluation is why you can enter code without the constants being defined.

  6. Specify some values for the function.

    INVESTMENT_INIT = 100000  # starting amount
    INVESTMENT_ANN = 10000    # yearly new investment
    TERM = 30                 # number of years
    MKT_AVG_RETURN = 0.11     # percentage
    MKT_STD_DEV = 0.18        # standard deviation
  7. Call reduce to aggregate the values in the RDD. Enter the following code to sum the results in the RDD:

    sum = results.reduce(add)
  8. Estimate and display the average return:

    print(sum/10000.)

    Be sure to include the dot (.) character at the end. It signifies floating-point arithmetic.

  9. Now change an assumption and see how the results change. For example, you can enter a new value for the market's average return:

    MKT_AVG_RETURN = 0.07
  10. Run the simulation again.

    print(sc.parallelize([time.time() + i for i in range(10000)]) \
      .map(grow).reduce(add) / 10000.)
  11. When you're done experimenting, press CTRL+D to exit the Python interpreter.
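If you want to explore further, the following optional sketch, entered at the same >>> prompt, pulls a few of the preceding ideas together. It passes a second argument to parallelize to request an explicit number of slices, as mentioned in the earlier step about slices, and it uses the RDD's built-in numeric helpers to look at the spread of simulated outcomes rather than only the average. The sketch assumes that the grow function and the constants you defined above are still in scope; the partition count of 16 is only an illustrative value.

    # Re-create the seeds, this time asking Spark for 16 partitions explicitly.
    seeds = sc.parallelize([time.time() + i for i in range(10000)], 16)
    print(seeds.getNumPartitions())   # confirm how many slices Spark created

    # map() is lazy: this line returns immediately without running grow().
    results = seeds.map(grow)

    # Actions such as stats() trigger the actual computation across the cluster
    # and return the count, mean, standard deviation, min, and max in one pass.
    print(results.stats())

    # A coarse view of the distribution: five equal-width buckets of final values.
    print(results.histogram(5))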

Program a Monte Carlo simulation in Scala

Monte Carlo, of course, is famous as a gambling destination. In this section, you use Scala to create a simulation that models the mathematical advantage that a casino enjoys in a game of chance. The "house edge" at a real casino varies widely from game to game; it can be over 20% in keno, for example. This tutorial creates a simple game where the house has only a one-percent advantage. Here's how the game works:

  • The player places a bet, consisting of a number of chips from a bankroll fund.
  • The player rolls a 100-sided die (how cool would that be?).
  • If the result of the roll is a number from 1 to 49, the player wins.
  • For results 50 to 100, the player loses the bet.

You can see that this game creates a one-percent disadvantage for the player: in 51 of the 100 possible outcomes for each roll, the player loses.

Follow these steps to create and run the game:

  1. Start the Scala interpreter from the Dataproc primary node.

    spark-shell
  2. Copy and paste the following code to create the game. Scala doesn't have the same requirements as Python when it comes to indentation, so you can simply copy and paste this code at the scala> prompt.

    val STARTING_FUND = 10
    val STAKE = 1   // the amount of the bet
    val NUMBER_OF_GAMES = 25

    // Roll a 100-sided die: returns a value from 1 to 100
    def rollDie: Int = {
        val r = scala.util.Random
        r.nextInt(100) + 1
    }

    // A roll from 1 to 49 wins double the stake; 50 to 100 loses it
    def playGame(stake: Int): (Int) = {
        val faceValue = rollDie
        if (faceValue < 50)
            (2 * stake)
        else
            (0)
    }

    // Function to play the game multiple times
    // Returns the final fund amount
    def playSession(
        startingFund: Int = STARTING_FUND,
        stake: Int = STAKE,
        numberOfGames: Int = NUMBER_OF_GAMES): (Int) = {

        // Initialize values
        var (currentFund, currentStake, currentGame) = (startingFund, 0, 1)

        // Keep playing until number of games is reached or funds run out
        while (currentGame <= numberOfGames && currentFund > 0) {
            // Set the current bet and deduct it from the fund
            currentStake = math.min(stake, currentFund)
            currentFund -= currentStake
            // Play the game
            val (winnings) = playGame(currentStake)
            // Add any winnings
            currentFund += winnings
            // Increment the loop counter
            currentGame += 1
        }

        (currentFund)
    }
  3. Press return until you see the scala> prompt.

  4. Enter the following code to play the game 25 times, which is the default value for NUMBER_OF_GAMES.

    playSession()

    Your bankroll started with a value of 10 units. Is it higher or lower now?

  5. Now simulate 10,000 players betting 100 chips per game. Play 10,000 games in a session. This Monte Carlo simulation calculates the probability of losing all your money before the end of the session. Enter the following code:

    (sc.parallelize(1 to 10000, 500)
      .map(i => playSession(100000, 100, 250000))
      .map(i => if (i == 0) 1 else 0)
      .reduce(_+_) / 10000.0)

    Note that the syntax .reduce(_+_) is shorthand in Scala for aggregating by using a summing function. It is functionally equivalent to the .reduce(add) syntax that you saw in the Python example. A rough Python version of this whole session appears after these steps.

    The preceding code performs the following steps:

    • Creates an RDD with the results of playing the session.
    • Replaces bankrupt players' results with the number 1 and nonzero results with the number 0.
    • Sums the count of bankrupt players.
    • Divides the count by the number of players.

    A typical result might be:

    0.998

    This result represents a near guarantee of losing all your money, even though the casino had only a one-percent advantage.
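If you'd rather stay in the pyspark interpreter for this part, the following sketch is a rough Python equivalent of the Scala session above. The play_session function and its variable names are only illustrative, not part of the Scala code; the game rules and parameter values mirror it.

    import random
    from operator import add

    def play_session(starting_fund=100000, stake=100, number_of_games=250000):
        """Play rounds until the games run out or the fund is exhausted."""
        fund = starting_fund
        game = 1
        while game <= number_of_games and fund > 0:
            bet = min(stake, fund)
            fund -= bet
            # Roll a 100-sided die: 1-49 wins double the bet, 50-100 loses it.
            if random.randint(1, 100) < 50:
                fund += 2 * bet
            game += 1
        return fund

    # Estimate the probability that a player goes broke before the session ends.
    bankruptcies = (sc.parallelize(range(10000), 500)
                    .map(lambda i: play_session())
                    .map(lambda fund: 1 if fund == 0 else 0)
                    .reduce(add))
    print(bankruptcies / 10000.0)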

Clean up

Delete the project

    Caution: Deleting a project has the following effects:
    • Everything in the project is deleted. If you used an existing project for the tasks in this document, when you delete it, you also delete any other work you've done in the project.
    • Custom project IDs are lost. When you created this project, you might have created a custom project ID that you want to use in the future. To preserve the URLs that use the project ID, such as an appspot.com URL, delete selected resources inside the project instead of deleting the whole project.
  1. In the Google Cloud console, go to the Manage resources page.

    Go to Manage resources

  2. In the project list, select the project that you want to delete, and then click Delete.
  3. In the dialog, type the project ID, and then click Shut down to delete the project.

What's next

  • For more on submitting Spark jobs to Dataproc without having to use ssh to connect to the cluster, read Dataproc—Submit a job.
