Tutorial: Create a Scala Maven application for Apache Spark in HDInsight using IntelliJ

In this tutorial, you learn how to create an Apache Spark application written in Scala using Apache Maven with IntelliJ IDEA. The article uses Apache Maven as the build system and starts with an existing Maven archetype for Scala provided by IntelliJ IDEA. Creating a Scala application in IntelliJ IDEA involves the following steps:

  • Use Maven as the build system.
  • Update the Project Object Model (POM) file to resolve Spark module dependencies.
  • Write your application in Scala.
  • Generate a jar file that can be submitted to HDInsight Spark clusters.
  • Run the application on the Spark cluster using Livy.

In this tutorial, you learn how to:

  • Install Scala plugin for IntelliJ IDEA
  • Use IntelliJ to develop a Scala Maven application
  • Create a standalone Scala project

Prerequisites

Install Scala plugin for IntelliJ IDEA

Do the following steps to install the Scala plugin:

  1. Open IntelliJ IDEA.

  2. On the welcome screen, navigate to Configure > Plugins to open the Plugins window.

    Screenshot showing IntelliJ Welcome Screen.

  3. Select Install for Azure Toolkit for IntelliJ.

    Screenshot showing IntelliJ Azure Tool Kit.

  4. Select Install for the Scala plugin that is featured in the new window.

    Screenshot showing IntelliJ Scala Plugin.

  5. After the plugin installs successfully, you must restart the IDE.

Use IntelliJ to create application

  1. Start IntelliJ IDEA, and select Create New Project to open the New Project window.

  2. Select Apache Spark/HDInsight from the left pane.

  3. Select Spark Project (Scala) from the main window.

  4. From the Build tool drop-down list, select one of the following values:

    • Maven for Scala project-creation wizard support.
    • SBT for managing the dependencies and building for the Scala project.

    Screenshot showing create application.

  5. Select Next.

  6. In the New Project window, provide the following information:

    Property | Description
    Project name | Enter a name.
    Project location | Enter the location to save your project.
    Project SDK | This field will be blank on your first use of IDEA. Select New... and navigate to your JDK.
    Spark Version | The creation wizard integrates the proper version for Spark SDK and Scala SDK. If the Spark cluster version is earlier than 2.0, select Spark 1.x. Otherwise, select Spark 2.x. This example uses Spark 2.3.0 (Scala 2.11.8).

    IntelliJ IDEA Selecting the Spark SDK.

  7. Select Finish.
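
The Spark and Scala versions are chosen together because Spark artifacts are published per Scala binary version, which is why the pom.xml later in this tutorial refers to `spark-core_${scala.binary.version}`. The following is a minimal sketch of that relationship and of the Spark 1.x / Spark 2.x rule from the table. The object and function names are hypothetical, and the binary-version rule shown (major.minor) is assumed to apply to Scala 2.x releases only.

```scala
object SparkVersions {
  // "2.11.8" -> "2.11": keep only the major and minor components.
  def scalaBinaryVersion(scalaVersion: String): String =
    scalaVersion.split('.').take(2).mkString(".")

  // Artifact ID that Maven resolves for spark-core with the given Scala version.
  def sparkCoreArtifactId(scalaVersion: String): String =
    s"spark-core_${scalaBinaryVersion(scalaVersion)}"

  // Wizard option to pick for a given cluster Spark version, per the table above.
  def wizardOption(clusterSparkVersion: String): String = {
    val major = clusterSparkVersion.takeWhile(_ != '.').toInt
    if (major < 2) "Spark 1.x" else "Spark 2.x"
  }

  def main(args: Array[String]): Unit = {
    println(sparkCoreArtifactId("2.11.8")) // spark-core_2.11
    println(wizardOption("2.3.0"))         // Spark 2.x
  }
}
```

With the versions used in this tutorial, the dependency resolves to `spark-core_2.11` at version 2.3.0.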

Create a standalone Scala project

  1. Start IntelliJ IDEA, and select Create New Project to open the New Project window.

  2. Select Maven from the left pane.

  3. Specify a Project SDK. If blank, select New... and navigate to the Java installation directory.

  4. Select the Create from archetype checkbox.

  5. From the list of archetypes, select org.scala-tools.archetypes:scala-archetype-simple. This archetype creates the right directory structure and downloads the required default dependencies to write a Scala program.

    Screenshot shows the selected archetype in the New Project window.

  6. Select Next.

  7. Expand Artifact Coordinates. Provide relevant values for GroupId and ArtifactId. Name and Location will autopopulate. The following values are used in this tutorial:

    • GroupId: com.microsoft.spark.example
    • ArtifactId: SparkSimpleApp

    Screenshot shows the Artifact Coordinates option in the New Project window.

  8. Select Next.

  9. Verify the settings and then select Next.

  10. Verify the project name and location, and then select Finish. The project will take a few minutes to import.

  11. Once the project has imported, from the left pane navigate to SparkSimpleApp > src > test > scala > com > microsoft > spark > example. Right-click MySpec, and then select Delete.... You don't need this file for the application. Select OK in the dialog box.

  12. In the later steps, you update the pom.xml to define the dependencies for the Spark Scala application. For those dependencies to be downloaded and resolved automatically, you must configure Maven.

  13. From the File menu, select Settings to open the Settings window.

  14. From the Settings window, navigate to Build, Execution, Deployment > Build Tools > Maven > Importing.

  15. Select the Import Maven projects automatically checkbox.

  16. Select Apply, and then select OK. You'll then be returned to the project window.

    Configure Maven for automatic downloads.

  17. From the left pane, navigate to src > main > scala > com.microsoft.spark.example, and then double-click App to open App.scala.

  18. Replace the existing sample code with the following code and save the changes. This code reads the data from HVAC.csv (available on all HDInsight Spark clusters), retrieves the rows that have only one digit in the seventh column, and writes the output to /HVACout under the default storage container for the cluster.

    package com.microsoft.spark.example

    import org.apache.spark.SparkConf
    import org.apache.spark.SparkContext

    /**
      * Test IO to wasb
      */
    object WasbIOTest {
        def main (arg: Array[String]): Unit = {
            val conf = new SparkConf().setAppName("WASBIOTest")
            val sc = new SparkContext(conf)

            val rdd = sc.textFile("wasb:///HdiSamples/HdiSamples/SensorSampleData/hvac/HVAC.csv")

            //find the rows which have only one digit in the 7th column in the CSV
            val rdd1 = rdd.filter(s => s.split(",")(6).length() == 1)

            rdd1.saveAsTextFile("wasb:///HVACout")
        }
    }

  19. In the left pane, double-click pom.xml.

  20. Within <project>\<properties> add the following segments:

    <scala.version>2.11.8</scala.version>
    <scala.compat.version>2.11.8</scala.compat.version>
    <scala.binary.version>2.11</scala.binary.version>

  21. Within <project>\<dependencies> add the following segments:

    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_${scala.binary.version}</artifactId>
        <version>2.3.0</version>
    </dependency>

    Save changes to pom.xml.

  22. Create the .jar file. IntelliJ IDEA enables creation of JAR as an artifact of a project. Do the following steps.

    1. From the File menu, select Project Structure....

    2. From the Project Structure window, navigate to Artifacts > the plus symbol + > JAR > From modules with dependencies....

      IntelliJ IDEA project structure add jar.

    3. In the Create JAR from Modules window, select the folder icon in the Main Class text box.

    4. In the Select Main Class window, select the class that appears by default and then select OK.

      IntelliJ IDEA project structure select class.

    5. In the Create JAR from Modules window, ensure the extract to the target JAR option is selected, and then select OK. This setting creates a single JAR with all dependencies.

      IntelliJ IDEA project structure jar from module.

    6. The Output Layout tab lists all the jars that are included as part of the Maven project. You can select and delete the ones on which the Scala application has no direct dependency. For the application you're creating here, you can remove all but the last one (SparkSimpleApp compile output). Select the jars to delete and then select the minus symbol (-).

      IntelliJ IDEA project structure delete output.

      Ensure the Include in project build checkbox is selected. This option ensures that the jar is created every time the project is built or updated. Select Apply and then OK.

    7. To create the jar, navigate to Build > Build Artifacts > Build. The project will compile in about 30 seconds. The output jar is created under \out\artifacts.

      IntelliJ IDEA project artifact output.
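
To see what the filter in App.scala selects without running a cluster job, the predicate can be tried on a few strings in plain Scala. The sketch below is not part of the tutorial's application: the object name and the sample rows are made up for illustration, and the length guard is an addition (the tutorial's version indexes field 6 directly, which throws on rows with fewer than seven fields). Note that the predicate checks that the seventh field is one character long; it doesn't verify that the character is a digit.

```scala
object OneDigitFilter {
  // Same test as the tutorial's rdd.filter: index 6 is the seventh
  // comma-separated field. The length guard is an added safety check.
  def hasOneCharSeventhField(line: String): Boolean = {
    val fields = line.split(",")
    fields.length > 6 && fields(6).length == 1
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical rows; only their shape matters for this sketch.
    val sample = Seq(
      "6/1/13,0:00:01,66,58,13,20,4",
      "6/2/13,1:00:01,69,68,3,20,17",
      "malformed,row"
    )
    // Prints only the first row, whose seventh field is "4".
    sample.filter(hasOneCharSeventhField).foreach(println)
  }
}
```

The same function could be passed to `rdd.filter` in place of the inline lambda if skipping short rows is the desired behavior.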

Run the application on the Apache Spark cluster

To run the application on the cluster, submit the jar remotely by using Livy, as covered in the next article.
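
As a rough illustration of the Livy route, Livy's REST API accepts batch jobs as a JSON body posted to its `/batches` endpoint, with `file` pointing at the jar and `className` naming the main class. The sketch below only builds that body and makes no HTTP call. The object name and the jar location are assumptions for illustration; the class name is the one defined in App.scala.

```scala
object LivyBatchPayload {
  // Escape backslashes and double quotes so the value is a valid JSON string.
  private def quote(s: String): String =
    "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\""

  // Minimal body for a Livy batch submission: jar location plus main class.
  def build(file: String, className: String): String =
    s"""{"file": ${quote(file)}, "className": ${quote(className)}}"""

  def main(args: Array[String]): Unit = {
    // Hypothetical jar path: use wherever the jar was uploaded in cluster storage.
    val payload = build(
      "wasb:///example/jars/SparkSimpleApp.jar",
      "com.microsoft.spark.example.WasbIOTest"
    )
    println(payload)
  }
}
```

Authentication, the cluster endpoint, and any extra request headers are left out here and are covered by the Livy article that follows this tutorial.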

Clean up resources

If you're not going to continue to use this application, delete the cluster that you created with the following steps:

  1. Sign in to the Azure portal.

  2. In the Search box at the top, type HDInsight.

  3. Select HDInsight clusters under Services.

  4. In the list of HDInsight clusters that appears, select the ... next to the cluster that you created for this tutorial.

  5. Select Delete. Select Yes.

Screenshot showing how to delete an HDInsight cluster via the Azure portal.

Next step

In this article, you learned how to create an Apache Spark Scala application. Advance to the next article to learn how to run this application on an HDInsight Spark cluster using Livy.

