

I recently hosted an online meetup on Apache Spark with IBM Developer. Spark has been around for a few years, but the interest is still growing, to my surprise. Apache Spark was developed at the University of California, Berkeley’s AMPLab. The Spark codebase was open sourced in 2010 and donated to the Apache Software Foundation in 2013.
The background of the attendees was quite diverse:
- Developer (25%)
- Architect (12.5%)
- Data Scientist (41.7%)
- Other (12.8%)
We looked at the WHAT and WHY of Spark and then dove into the three data structures that you might encounter when working with Spark …
We also looked at some Transformations, Actions, built-in functions, and UDFs (user defined functions). For example, the following function creates a new column called GENDER based on the contents of the column GenderCode.
```python
from pyspark.sql import functions as func
from pyspark.sql import types

# ------------------------------
# Derive gender from salutation
# ------------------------------
def deriveGender(col):
    """
    input: pyspark.sql.types.Column
    output: "male", "female" or "unknown"
    """
    if col in ['Mr.', 'Master.']:
        return 'male'
    elif col in ['Mrs.', 'Miss.']:
        return 'female'
    else:
        return 'unknown'

deriveGenderUDF = func.udf(lambda c: deriveGender(c), types.StringType())

customer_df = customer_df.withColumn("GENDER", deriveGenderUDF(customer_df["GenderCode"]))
customer_df.cache()
```
withColumn creates a new column in the customer_df DataFrame with the values from deriveGenderUDF (our user defined function). The deriveGenderUDF is essentially the deriveGender function wrapped as a Spark UDF. If this does not make sense, watch the webinar, as we go into a lot more detail.
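If you want to see the mapping logic in isolation, without a Spark cluster, here is a plain-Python sketch of the same salutation-to-gender rule. The sample values are hypothetical; in the notebook this logic runs inside the UDF shown above.

```python
# Plain-Python version of the salutation-to-gender mapping.
# Runs without Spark; in the notebook the same logic is wrapped in a UDF.
def derive_gender(salutation):
    """Map a salutation string to 'male', 'female' or 'unknown'."""
    if salutation in ('Mr.', 'Master.'):
        return 'male'
    elif salutation in ('Mrs.', 'Miss.'):
        return 'female'
    return 'unknown'

# Hypothetical sample column values
samples = ['Mr.', 'Miss.', 'Dr.']
print([derive_gender(s) for s in samples])  # ['male', 'female', 'unknown']
```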
Finally, we created a Spark cluster environment on IBM Cloud and used a Jupyter notebook to explore customer data with the following columns …
"CUST_ID", "CUSTNAME", "ADDRESS1", "ADDRESS2", "CITY", "POSTAL_CODE", "POSTAL_CODE_PLUS4", "STATE", "COUNTRY_CODE", "EMAIL_ADDRESS", "PHONE_NUMBER","AGE","GenderCode","GENERATION","NATIONALITY", "NATIONAL_ID", "DRIVER_LICENSE"
After cleaning the data using built-in and user defined methods, we used PixieDust to visualize the data. The cool thing about PixieDust is that you don’t need to set it up or configure it. You just pass it a Spark DataFrame or a Pandas DataFrame and you are good to go! You can find the complete notebook here.
Thank you IBM Developer and Max Katz for the opportunity to present, and special thanks to Lisa Jung for being a patient co-presenter!