

I recently hosted an online meetup on Apache Spark with IBM Developer. Spark has been around for a few years, but the interest is still growing, to my surprise. Apache Spark was developed at the University of California, Berkeley’s AMPLab. The Spark codebase was open sourced in 2010 and donated to the Apache Software Foundation in 2013.
The background of the attendees was quite diverse:
- Developer (25%)
- Architect (12.5%)
- Data Scientist (41.7%)
- Other (12.8%)
We looked at the WHAT and WHY of Spark and then dove into the three data structures that you might encounter when working with Spark …
We also looked at some Transformations, Actions, built-in functions, and UDFs (user defined functions). For example, the following function creates a new column called GENDER based on the contents of the column GenderCode.
```python
from pyspark.sql import functions as func
from pyspark.sql import types

# ------------------------------
# Derive gender from salutation
# ------------------------------
def deriveGender(col):
    """
    input: pyspark.sql.types.Column
    output: "male", "female" or "unknown"
    """
    if col in ['Mr.', 'Master.']:
        return 'male'
    elif col in ['Mrs.', 'Miss.']:
        return 'female'
    else:
        return 'unknown'

deriveGenderUDF = func.udf(lambda c: deriveGender(c), types.StringType())

customer_df = customer_df.withColumn("GENDER", deriveGenderUDF(customer_df["GenderCode"]))
customer_df.cache()
```
withColumn creates a new column in the customer_df DataFrame with the values from deriveGenderUDF (our user defined function). The deriveGenderUDF is essentially the deriveGender function wrapped as a Spark UDF. If this does not make sense, watch the webinar, as we go into a lot more detail.
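If you want to see the mapping logic in isolation, without a Spark cluster, here is a plain-Python sketch of the same salutation-to-gender rule. The sample values are hypothetical; in the notebook this logic runs inside the UDF shown above.

```python
# Plain-Python version of the salutation-to-gender mapping.
# Runs without Spark; in the notebook the same logic is wrapped in a UDF.
def derive_gender(salutation):
    """Map a salutation string to 'male', 'female' or 'unknown'."""
    if salutation in ('Mr.', 'Master.'):
        return 'male'
    elif salutation in ('Mrs.', 'Miss.'):
        return 'female'
    return 'unknown'

# Hypothetical sample column values
samples = ['Mr.', 'Miss.', 'Dr.']
print([derive_gender(s) for s in samples])  # ['male', 'female', 'unknown']
```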
Finally, we created a Spark cluster environment on IBM Cloud and used a Jupyter notebook to explore customer data with the following columns …
"CUST_ID", "CUSTNAME", "ADDRESS1", "ADDRESS2", "CITY", "POSTAL_CODE", "POSTAL_CODE_PLUS4", "STATE", "COUNTRY_CODE", "EMAIL_ADDRESS", "PHONE_NUMBER","AGE","GenderCode","GENERATION","NATIONALITY", "NATIONAL_ID", "DRIVER_LICENSE"
After cleaning the data using built-in and user defined methods, we used PixieDust to visualize the data. The cool thing about PixieDust is that you don’t need to set it up or configure it. You just pass it a Spark DataFrame or a Pandas DataFrame and you are good to go! You can find the complete notebook here.
Thank you IBM Developer and Max Katz for the opportunity to present, and special thanks to Lisa Jung for being a patient co-presenter!