mahmoudparsian/data-algorithms-with-spark
Data Algorithms with Spark by Mahmoud Parsian
"... This book will be a great resource for both readers looking to implement existing algorithms in a scalable fashion and readers who are developing new, custom algorithms using Spark. ..."
Dr. Matei Zaharia, Original Creator of Apache Spark
Foreword by Dr. Matei Zaharia (Original Creator of Apache Spark)
Author: Mahmoud Parsian
This new O'Reilly book is the successor edition of Data Algorithms (published by O'Reilly).
This book uses PySpark (much simpler and more readable).
@OReillyMedia: Data Algorithms with Spark, By @mahmoudparsian
Author Contact: [ Email ] [ Mahmoud Parsian @LinkedIn ] [ Mahmoud Parsian @GitHub ]
This GitHub repository will host all source code and scripts for Data Algorithms with Spark.
Chapter solutions are provided in PySpark and Scala:
- PySpark solutions are provided by Mahmoud Parsian
- Scala solutions are provided by Deepak Kumar and Biman Mandal
All programs are tested with the following software:
| Spark | Python | Scala | Java |
|---|---|---|---|
| Apache Spark 3.4.0 | Python 3.10.5 | Scala 2.13 | Java 11 |
| Chapter | Title |
|---|---|
| Glossary | Glossary of Big Data, MapReduce, Spark |
| Chapter 1 | Introduction to Data Algorithms |
| Chapter 2 | Transformations in Action |
| Chapter 3 | Mapper Transformations |
| Chapter 4 | Reductions in Spark |
| Chapter 5 | Partitioning Data |
| Chapter 6 | Graph Algorithms |
| Chapter 7 | Interacting with External Data Sources |
| Chapter 8 | Ranking Algorithms |
| Chapter 9 | Fundamental Data Design Patterns |
| Chapter 10 | Common Data Design Patterns |
| Chapter 11 | Join Design Patterns |
| Chapter 12 | Feature Engineering in PySpark |
| Bonus Chapter | Title / Description |
|---|---|
| Glossary | Glossary of Big Data, MapReduce, Spark |
| Word Count | Solutions for Word Count using RDDs and DataFrames |
| Anagrams | Find words that are anagrams |
| Lambda Expressions | Using Lambda Expressions in PySpark programs |
| TF-IDF | Term Frequency - Inverse Document Frequency |
| K-mers | K-mers for DNA Sequences |
| Correlation | All vs. All Correlation |
| Mapping Partitions | mapPartitions() Complete Example |
| UDF | User-Defined Function Examples |
| DataFrames Transformations | Examples on Creation and Transformation of DataFrames |
| DataFrames Tutorials | DataFrames Tutorials: from collections and CSV text files |
| Join Operations | Examples on join of RDDs and DataFrames |
| PySpark Tutorial 101 | Examples on using PySpark RDDs and DataFrames |
| Physical Data Partitioning | Tutorial of Physical Data Partitioning |
| Monoids and Combiners | Monoid as a Design Principle |
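To give a flavor of the "Monoids and Combiners" entry above, here is a minimal sketch in plain Python (no Spark required). It is not taken from the repository. The data, the `partitions` layout, and the helper names `to_pair`, `combine`, and `partition_summary` are all assumptions made for illustration. The idea shown is a common one: a mean cannot be combined safely across partitions, but `(sum, count)` pairs can, because pairwise addition is associative and has an identity element.

```python
from functools import reduce

# Hypothetical data, split into uneven "partitions" as a cluster might hold it.
partitions = [[1, 2, 3], [4, 5], [6]]

# Identity element of the (sum, count) monoid.
IDENTITY = (0, 0)

def to_pair(x):
    """Lift a single value into a (sum, count) pair."""
    return (x, 1)

def combine(a, b):
    """Associative combine step: add sums and counts component-wise."""
    return (a[0] + b[0], a[1] + b[1])

def partition_summary(values):
    """Reduce one partition locally, as a combiner would."""
    return reduce(combine, map(to_pair, values), IDENTITY)

# Step 1: each partition produces its own partial (sum, count).
partials = [partition_summary(p) for p in partitions]

# Step 2: partial results are merged; grouping does not affect the result.
total, count = reduce(combine, partials, IDENTITY)
mean = total / count

# For contrast: averaging the per-partition means gives a different answer
# when partitions have different sizes.
naive = sum(sum(p) / len(p) for p in partitions) / len(partitions)

print(mean, naive)
```

In PySpark the same `combine` function could plausibly be passed to a reduction such as `reduceByKey()`, with the final division done once at the end. See the bonus chapter itself for the author's own treatment.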


