Upkar Lidder

A Gentle Intro to Apache Spark for Developers

I recently hosted an online meetup on Apache Spark with IBM Developer. Spark has been around for a few years, but the interest is still growing, to my surprise. Apache Spark was developed at the University of California, Berkeley's AMPLab. The Spark codebase was open sourced in 2010 and donated to the Apache Software Foundation in 2013.

Apache Spark

The background of the attendees was quite diverse:

  • Developer (25%)
  • Architect (12.5%)
  • Data Scientist (41.7%)
  • Other (12.8%)

We looked at the WHAT and WHY of Spark and then dove into the three data structures that you might encounter when working with Spark …
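
For orientation before the examples below, here is a minimal sketch of my own (not from the webinar slides), assuming an existing SparkSession named spark; it contrasts a low-level RDD with a DataFrame and shows that transformations are lazy while actions trigger computation:

# An RDD is the low-level distributed collection; a DataFrame adds a schema
rdd = spark.sparkContext.parallelize([("Mr.", 34), ("Mrs.", 29)])
df = spark.createDataFrame(rdd, ["GenderCode", "AGE"])

# Transformations (filter, select, ...) are lazy -- nothing executes yet
adults = df.filter(df["AGE"] > 18).select("GenderCode")

# Actions (count, show, collect, ...) trigger the actual computation
print(adults.count())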

We also looked at some Transformations, Actions, built-in functions and UDFs (user defined functions). For example, the following function creates a new column called GENDER based on the contents of the GenderCode column.

# ------------------------------
# Derive gender from salutation
# ------------------------------
# (imports added for completeness; in the notebook these typically live in an earlier cell)
from pyspark.sql import functions as func
from pyspark.sql import types

def deriveGender(col):
    """ input: salutation string from the GenderCode column
        output: "male", "female" or "unknown"
    """
    if col in ['Mr.', 'Master.']:
        return 'male'
    elif col in ['Mrs.', 'Miss.']:
        return 'female'
    else:
        return 'unknown'

# Wrap the plain Python function as a Spark UDF that returns a string
deriveGenderUDF = func.udf(lambda c: deriveGender(c), types.StringType())

# Add the derived column and cache the result for reuse
customer_df = customer_df.withColumn("GENDER", deriveGenderUDF(customer_df["GenderCode"]))
customer_df.cache()

withColumn creates a new column in the customer_df DataFrame with the values from deriveGenderUDF (our user defined function). deriveGenderUDF is essentially the deriveGender function, wrapped so that Spark can apply it to each row. If this does not make sense, watch the webinar, where we go into a lot more detail.
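
As an aside that goes beyond what we showed in the webinar, the same GENDER column can typically be derived with the built-in when/otherwise and isin functions instead of a UDF, which keeps the logic visible to Spark's optimizer. A minimal sketch, reusing the func alias from the block above:

# Built-in alternative to the UDF: express the mapping as a column expression
customer_df = customer_df.withColumn(
    "GENDER",
    func.when(func.col("GenderCode").isin("Mr.", "Master."), "male")
        .when(func.col("GenderCode").isin("Mrs.", "Miss."), "female")
        .otherwise("unknown")
)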

Finally, we created a Spark cluster environment on IBM Cloud and used a Jupyter notebook to explore customer data with the following columns …

"CUST_ID", "CUSTNAME", "ADDRESS1", "ADDRESS2", "CITY", "POSTAL_CODE", "POSTAL_CODE_PLUS4", "STATE", "COUNTRY_CODE", "EMAIL_ADDRESS", "PHONE_NUMBER","AGE","GenderCode","GENERATION","NATIONALITY", "NATIONAL_ID", "DRIVER_LICENSE"

After cleaning the data using built-in and user defined methods, we used PixieDust to visualize the data. The cool thing about PixieDust is that you don't need to set it up or configure it. You just pass it a Spark DataFrame or a Pandas DataFrame and you are good to go! You can find the complete notebook here.
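
For reference, the PixieDust step really is that short. A minimal sketch, assuming a Jupyter notebook with the pixiedust package installed and the customer_df DataFrame from above:

# PixieDust provides an interactive display() helper for notebooks
import pixiedust

# Pass the Spark DataFrame directly; display() renders an interactive
# table/chart widget with no further configuration
display(customer_df)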

Thank you IBM Developer and Max Katz for the opportunity to present, and special thanks to Lisa Jung for being a patient co-presenter!

Upkar Lidder is a Full Stack Developer and Data Wrangler with a decade of development experience in a variety of roles. Educated in Canada and currently residing in the USA. <3 Python and JS!