This repository was archived by the owner on Jan 31, 2025. It is now read-only.

hazelcast/hazelcast-jet (public archive)

Distributed Stream and Batch Processing

Note on Hazelcast 5

With the release of Hazelcast 5.0, development of Jet has moved to the core Hazelcast repository. Please follow that repository for details on how to use Hazelcast for building data pipelines.

Hazelcast 5 also comes with extensive documentation, replacing the existing Jet docs: https://docs.hazelcast.com/hazelcast/latest/index.html

What is Jet

Jet is an open-source, in-memory, distributed batch and stream processing engine. You can use it to process large volumes of real-time events or huge batches of static datasets. To give a sense of scale, a single node of Jet has been proven to aggregate 10 million events per second with latency under 10 milliseconds.

It provides a Java API to build stream and batch processing applications through the use of a dataflow programming model. After you deploy your application to a Jet cluster, Jet will automatically use all the computational resources on the cluster to run your application.

If you add more nodes to the cluster while your application is running, Jet automatically scales up your application to run on the new nodes. If you remove nodes from the cluster, it scales it down seamlessly without losing the current computational state, providing exactly-once processing guarantees.

For example, you can represent the classical word count problem that reads some local files and outputs the frequency of each word to the console using the following API:

JetInstance jet = Jet.bootstrappedInstance();

Pipeline p = Pipeline.create();
p.readFrom(Sources.files("/path/to/text-files"))
 .flatMap(line -> traverseArray(line.toLowerCase().split("\\W+")))
 .filter(word -> !word.isEmpty())
 .groupingKey(word -> word)
 .aggregate(counting())
 .writeTo(Sinks.logger());

jet.newJob(p).join();

and then deploy the application to the cluster:

bin/jet submit word-count.jar
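
To make the stages of the word count pipeline concrete, the following is a minimal single-JVM sketch of the same computation using only `java.util.stream`. It is not the Jet API and does nothing distributed; the class name `WordCountSketch` and the sample lines are invented for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Illustrative only: a local analogue of the pipeline above, not Jet code.
final class WordCountSketch {

    // Mirrors the pipeline stages: flatMap -> filter -> groupingKey -> counting.
    static Map<String, Long> countWords(List<String> lines) {
        return lines.stream()
                // split each lower-cased line on runs of non-word characters
                .flatMap(line -> Arrays.stream(line.toLowerCase().split("\\W+")))
                // a line starting with punctuation or spaces yields a leading empty token
                .filter(word -> !word.isEmpty())
                // group by the word itself and count the occurrences
                .collect(Collectors.groupingBy(word -> word, TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        List<String> lines = List.of("The quick brown fox", "the lazy dog; the end");
        countWords(lines).forEach((word, count) -> System.out.println(word + "=" + count));
    }
}
```

The Jet version expresses the same steps, but the engine decides how to spread them across the cluster.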

Another application, which aggregates millions of sensor readings per second with 10-millisecond resolution from Kafka, looks like the following:

Pipeline p = Pipeline.create();
p.readFrom(KafkaSources.<String, Reading>kafka(kafkaProperties, "sensors"))
 .withTimestamps(event -> event.getValue().timestamp(), 10) // use event timestamp, allowed lag in ms
 .groupingKey(event -> event.getValue().sensorId()) // items are Kafka key/value entries
 .window(sliding(1_000, 10)) // sliding window of 1s by 10ms
 .aggregate(averagingDouble(event -> event.getValue().temperature()))
 .writeTo(Sinks.logger());

jet.newJob(p).join();
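
As a rough illustration of what a sliding window of size 1 s sliding by 10 ms computes, here is a plain-Java sketch that averages readings per window. It assumes that a window ending at time E covers event times in [E - size, E) and that windows end at multiples of the slide step. The `Reading` class and method names are hypothetical, grouping by sensor is left out for brevity, and the naive loop is not meant to reflect how Jet implements windowing.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: a naive, single-threaded sliding-window average.
final class SlidingWindowSketch {

    /** Hypothetical reading type: event time in milliseconds plus a temperature. */
    static final class Reading {
        final long timestamp;
        final double temperature;

        Reading(long timestamp, double temperature) {
            this.timestamp = timestamp;
            this.temperature = temperature;
        }
    }

    /** Returns the average temperature per window, keyed by the window's end time. */
    static Map<Long, Double> slidingAverages(List<Reading> readings, long size, long slide) {
        Map<Long, double[]> sumAndCount = new TreeMap<>();
        for (Reading r : readings) {
            // smallest window end that is strictly greater than the event time
            long firstEnd = Math.floorDiv(r.timestamp, slide) * slide + slide;
            // the event belongs to every window with timestamp < end <= timestamp + size
            for (long end = firstEnd; end <= r.timestamp + size; end += slide) {
                double[] acc = sumAndCount.computeIfAbsent(end, k -> new double[2]);
                acc[0] += r.temperature; // running sum
                acc[1] += 1;             // running count
            }
        }
        Map<Long, Double> averages = new TreeMap<>();
        sumAndCount.forEach((end, acc) -> averages.put(end, acc[0] / acc[1]));
        return averages;
    }
}
```

With `size = 1_000` and `slide = 10`, each reading contributes to 100 overlapping windows in this sketch, which is why a production engine would likely use a more economical scheme.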

Jet comes with out-of-the-box support for many kinds of data sources and sinks, including:

Apache Kafka
Local Files (Text, Avro, JSON)
Apache Hadoop (Azure Data Lake, S3, GCS)
Apache Pulsar
Debezium
Elasticsearch
JDBC
JMS
InfluxDB
Hazelcast
Redis
MongoDB
Twitter

When Should You Use Jet

Jet is a good fit when you need to process large amounts of data in a distributed fashion. You can use it to build a variety of data-processing applications, such as:

Low-latency stateful stream processing. For example, detecting trends in 100 Hz sensor data from 100,000 devices and sending corrective feedback within 10 milliseconds.
High-throughput, large-state stream processing. For example, tracking GPS locations of millions of users and inferring their velocity vectors.
Batch processing of big data volumes. For example, analyzing a day's worth of stock trading data to update the risk exposure of a given portfolio.

Key Features

Predictable Latency Under Load

Jet uses a unique execution model with cooperative multithreading and can achieve extremely low latencies while processing millions of items per second on just a single node.

The engine is able to run anywhere from tens to thousands of jobs concurrently on a fixed number of threads.
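
The general idea behind cooperative multithreading can be sketched in a few lines of plain Java: each unit of work does a small, non-blocking step and then yields, so one thread can interleave many units. The `Tasklet` interface, `counter` helper, and round-robin loop below are assumptions made for illustration and are not taken from Jet's code base.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Illustrative only: a toy cooperative scheduler running on the calling thread.
final class CooperativeSketch {

    /** A unit of work that performs one short, non-blocking step and reports whether it is finished. */
    interface Tasklet {
        boolean step();
    }

    /** Hypothetical tasklet that appends "name:i" to a shared log, one entry per step. */
    static Tasklet counter(String name, int steps, List<String> log) {
        int[] done = {0};
        return () -> {
            log.add(name + ":" + done[0]);
            return ++done[0] >= steps;
        };
    }

    /** Runs all tasklets round-robin until every one of them reports completion. */
    static void runAll(List<Tasklet> tasklets) {
        Deque<Tasklet> queue = new ArrayDeque<>(tasklets);
        while (!queue.isEmpty()) {
            Tasklet tasklet = queue.poll();
            if (!tasklet.step()) {
                queue.add(tasklet); // not finished yet: go to the back of the line
            }
        }
    }

    public static void main(String[] args) {
        List<String> log = new ArrayList<>();
        runAll(List.of(counter("a", 2, log), counter("b", 3, log)));
        System.out.println(log); // the two tasklets' steps are interleaved on a single thread
    }
}
```

Because no tasklet blocks, the number of concurrent tasklets is decoupled from the number of threads, which is the property the paragraph above describes.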

Fault Tolerance With No Infrastructure

Jet stores computational state in a distributed, replicated in-memory store and does not require a distributed file system or infrastructure like ZooKeeper to provide high availability and fault tolerance.

Jet implements a version of the Chandy-Lamport algorithm to provide exactly-once processing in the face of failures. When interfacing with external transactional systems like databases, it can provide end-to-end processing guarantees using two-phase commit.
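
For readers unfamiliar with two-phase commit, the generic shape of the protocol looks roughly like the sketch below: every participant first prepares, and the work is committed only if all of them vote yes. The `Participant` interface and the `recording` helper are hypothetical; how Jet ties this protocol to its snapshots is not shown here.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: the textbook protocol, not Jet's implementation.
final class TwoPhaseCommitSketch {

    /** A hypothetical transactional participant, for example a sink writing to a database. */
    interface Participant {
        boolean prepare(); // phase 1: make the pending work durable and vote yes or no
        void commit();     // phase 2: make the prepared work visible
        void rollback();   // phase 2 (abort path): discard the prepared work
    }

    /** Test helper: a participant that records each call in a shared log and votes as told. */
    static Participant recording(String name, boolean vote, List<String> log) {
        return new Participant() {
            @Override public boolean prepare() { log.add(name + ":prepare"); return vote; }
            @Override public void commit() { log.add(name + ":commit"); }
            @Override public void rollback() { log.add(name + ":rollback"); }
        };
    }

    /** Returns true if everything was committed, false if the transaction was rolled back. */
    static boolean run(List<Participant> participants) {
        List<Participant> prepared = new ArrayList<>();
        for (Participant p : participants) {
            if (!p.prepare()) {
                // a single "no" vote aborts: undo everyone who had already prepared
                prepared.forEach(Participant::rollback);
                return false;
            }
            prepared.add(p);
        }
        prepared.forEach(Participant::commit);
        return true;
    }
}
```

A real coordinator also has to persist its decision and survive crashes between the two phases; that part is omitted from this sketch.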

Advanced Event Processing

Event data can often arrive out of order and Jet has first-class support for dealing with this disorder. Jet implements a technique called distributed watermarks to treat disordered events as if they were arriving in order.
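
One common way to think about a watermark is "the highest event timestamp seen so far, minus an allowed lag": events older than the watermark are considered late. The sketch below illustrates that idea in plain Java. The class and method names are invented, and taking the minimum across parallel inputs in `coalesce` is an assumption about how per-input watermarks are typically combined, not a description of Jet's internals.

```java
// Illustrative only: a single-input watermark tracker plus a simple way to combine inputs.
final class WatermarkSketch {
    private final long allowedLagMillis;
    private long maxTimestamp = Long.MIN_VALUE;

    WatermarkSketch(long allowedLagMillis) {
        this.allowedLagMillis = allowedLagMillis;
    }

    /** Current watermark: no event older than this is expected any more. */
    long watermark() {
        return maxTimestamp == Long.MIN_VALUE ? Long.MIN_VALUE : maxTimestamp - allowedLagMillis;
    }

    /** Observes an event; returns true if it is on time, false if it fell behind the watermark. */
    boolean observe(long timestamp) {
        if (timestamp < watermark()) {
            return false; // late event: out of order by more than the allowed lag
        }
        maxTimestamp = Math.max(maxTimestamp, timestamp);
        return true;
    }

    /** Combines watermarks of several parallel inputs by taking the slowest one (assumed policy). */
    static long coalesce(long... upstreamWatermarks) {
        long min = Long.MAX_VALUE; // returned unchanged if there are no inputs
        for (long wm : upstreamWatermarks) {
            min = Math.min(min, wm);
        }
        return min;
    }
}
```

With an allowed lag of 10 ms, an event stamped 95 that arrives after an event stamped 100 is still accepted, while one stamped 105 that arrives after 120 is treated as late.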

How Do I Get Started

Follow the Get Started guide to start using Jet.

Download

You can download Jet from https://jet-start.sh.

Alternatively, you can use the latest Docker image:

docker run -p 5701:5701 hazelcast/hazelcast-jet

Use the following Maven coordinates to add Jet to your application:

<groupId>com.hazelcast.jet</groupId>
<artifactId>hazelcast-jet</artifactId>
<version>4.2</version>

Tutorials

See the tutorials to learn how to use Jet.

Reference

Jet supports a variety of transforms and operators. These include:

Stateless transforms such as mapping and filtering.
Stateful transforms such as aggregations and stateful mapping.
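
The difference between the two kinds of transform can be shown with a small plain-Java sketch: a stateless transform looks only at the current item, while a stateful one also consults something remembered from earlier items. The class and method names are hypothetical and the code does not use the Jet API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;

// Illustrative only: local analogues of stateless and stateful transforms.
final class TransformSketch {

    /** Stateless: filtering and mapping, where each output depends only on the current item. */
    static List<Integer> doubledEvens(List<Integer> items) {
        return items.stream()
                .filter(i -> i % 2 == 0)
                .map(i -> i * 2)
                .collect(Collectors.toList());
    }

    /** Stateful mapping: each output is the change relative to the previously seen item. */
    static List<Integer> deltas(List<Integer> items) {
        List<Integer> out = new ArrayList<>();
        Integer previous = null; // the state carried from one item to the next
        for (int item : items) {
            if (previous != null) {
                out.add(item - previous);
            }
            previous = item;
        }
        return out;
    }
}
```

In a distributed engine the state of a stateful transform is what has to be saved and restored for fault tolerance, whereas stateless transforms need no such bookkeeping.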

Community

The Hazelcast Jet team actively answers questions on Stack Overflow and the Hazelcast Community Slack.

You are also encouraged to join the hazelcast-jet mailing list if you are interested in community discussions.

How Can I Contribute

Thanks for your interest in contributing! The easiest way is to just send a pull request. Have a look at the issues marked as good first issue for some guidance.

Building From Source

To build, use:

./mvnw clean package -DskipTests

Use Latest Snapshot Release

You can always use the latest snapshot release if you want to try the features currently under development.

Maven snippet:

<repositories>
    <repository>
        <id>snapshot-repository</id>
        <name>Maven2 Snapshot Repository</name>
        <url>https://oss.sonatype.org/content/repositories/snapshots</url>
        <snapshots>
            <enabled>true</enabled>
            <updatePolicy>daily</updatePolicy>
        </snapshots>
    </repository>
</repositories>
<dependencies>
    <dependency>
        <groupId>com.hazelcast.jet</groupId>
        <artifactId>hazelcast-jet</artifactId>
        <version>4.3-SNAPSHOT</version>
    </dependency>
</dependencies>

Trigger Phrases in the Pull Request Conversation

When you create a pull request (PR), it must pass a build-and-test procedure. Maintainers will be notified about your PR, and they can trigger the build using special comments. These are the phrases you may see used in the comments on your PR:

verify - run the default PR builder, equivalent to mvn clean install
run-nightly-tests - use the settings for the nightly build (mvn clean install -Pnightly). This includes slower tests in the run, which we don't normally run on every PR
run-windows - run the tests on a Windows machine (HighFive is not supported here)
run-cdc-debezium-tests - run all tests in the extensions/cdc-debezium module
run-cdc-mysql-tests - run all tests in the extensions/cdc-mysql module
run-cdc-postgres-tests - run all tests in the extensions/cdc-postgres module

Where not indicated, the builds run on a Linux machine with Oracle JDK 8.

License

Source code in this repository is covered by one of two licenses:

The default license throughout the repository is Apache License 2.0 unless the header specifies another license. Please see the Licensing section for more information.