
Apache Spark at High-Speed

This is Spark running at 10Gb/s!

Note: The research that created this work concluded with presentations at ApacheCon Europe 2015.

Project Goal

The goal of this project is to drive a single stream in Apache Spark Streaming to a processing throughput of ~10Gb/s. To be clear, this is a single "processor" or "thread" (one map-reduce pipeline). It is independent of the inherent parallelism of Apache Spark's map-reduce style processing.

Why? Given that Apache Spark uses map-reduce style programming, why set that aside to focus on a single pipeline? In order to target the next generation of distributed computing, parallelism alone is not enough. Each individual pipeline must first be optimized for maximum throughput and then parallelized. In order to support a project running on 40Gb/s networks and producing 13+ Gb/s of data, individual streams must process at ~1Gb/s.

Implementation Quirks

The JVM network stack provides an inefficient interface. The interface allows the reading of individual bytes from a Socket, and thus requires many function calls and a lot of copying of single bytes.

Thus a JNI solution that reads blocks of data off a Berkeley socket was used to ensure that the networking layer runs at peak performance.
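
To make the cost of per-byte reads concrete, the sketch below counts how many read calls it takes to drain the same payload one byte at a time versus one block at a time. It is only an illustration and not the project's JNI reader: the class name `BlockReadSketch`, the method names, the 1 MiB payload, and the 64 KiB block size are all assumptions, and an in-memory `ByteArrayInputStream` stands in for a real socket.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

// Illustrative sketch only; this is not the project's JNI socket reader.
final class BlockReadSketch {

    // Drains the stream with the single-byte read() and returns the number of
    // calls made, including the final call that reports end-of-stream.
    static long countSingleByteReads(InputStream in) {
        long calls = 0;
        try {
            while (true) {
                calls++;
                if (in.read() == -1) {
                    return calls;
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Drains the stream into a reusable block buffer and returns the number of
    // calls made, including the final call that reports end-of-stream.
    static long countBlockReads(InputStream in, int blockSize) {
        if (blockSize <= 0) {
            throw new IllegalArgumentException("blockSize must be positive");
        }
        byte[] block = new byte[blockSize];
        long calls = 0;
        try {
            while (true) {
                calls++;
                if (in.read(block) == -1) {
                    return calls;
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] payload = new byte[1 << 20]; // hypothetical 1 MiB payload
        long single = countSingleByteReads(new ByteArrayInputStream(payload));
        long blocked = countBlockReads(new ByteArrayInputStream(payload), 1 << 16);
        System.out.println("single-byte read calls: " + single);  // 1048577
        System.out.println("64 KiB block read calls: " + blocked); // 17
    }
}
```

On a real socket a block read may return fewer bytes than requested, so the block count would vary, but the gap of several orders of magnitude in call count is the point being made above.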

Originally, the implementation was intended to run Fourier transforms at high speed. However, in order to achieve bare-bones efficiency, this code has been disabled.

Current State

At the conclusion of the research, an individual Spark pipeline was recorded processing around 500MB/s. Much of the Fourier processing code is commented out in order to get bare-bones efficiency.
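
Because the goal is stated in gigabits per second and the measurement in megabytes per second, a quick unit conversion helps compare the two. The sketch below assumes decimal units (1 MB = 10^6 bytes, 1 Gb = 10^9 bits); the class and method names are hypothetical and exist only for this illustration.

```java
import java.util.Locale;

// Illustrative arithmetic only; names are hypothetical.
final class ThroughputMath {

    // Converts megabytes per second to gigabits per second (decimal units).
    static double megabytesPerSecToGigabitsPerSec(double megabytesPerSec) {
        return megabytesPerSec * 8.0 / 1000.0;
    }

    // Returns the measured rate as a fraction of the target rate.
    static double fractionOfTarget(double measuredGbps, double targetGbps) {
        return measuredGbps / targetGbps;
    }

    public static void main(String[] args) {
        double measuredGbps = megabytesPerSecToGigabitsPerSec(500.0); // recorded rate
        double fraction = fractionOfTarget(measuredGbps, 10.0);       // ~10Gb/s goal
        System.out.println(String.format(Locale.ROOT,
                "%.1f Gb/s, %.0f%% of target", measuredGbps, fraction * 100.0));
    }
}
```

Under these assumptions, 500MB/s corresponds to roughly 4Gb/s, or about 40% of the ~10Gb/s goal.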

Supported Research

High-Throughput Apache Spark Streaming
- At ApacheCon Europe (http://sched.co/400u)
High-Throughput Kafka and Kafka in HPC
- https://github.com/LeStarch/kafka-benchmarking

LeStarch/hyper-spark

Folders and files

Latest commit

History

Repository files navigation

Apache Spark at High-Speed

Project Goal

Implementation Quirks

Current State

Supported Research

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages0

Languages

Packages