- Notifications
You must be signed in to change notification settings - Fork207
Distributed Stream and Batch Processing
License
hazelcast/hazelcast-jet
Folders and files
Name | Name | Last commit message | Last commit date | |
---|---|---|---|---|
Repository files navigation
With the release of Hazelcast 5.0, development of Jet has been moved to thecore Hazelcast Repository - pleasefollow the repository for details on how to use Hazelcast for building data pipelines.
Hazelcast 5 also comes with extensive documentation, replacing the existing Jetdocs:https://docs.hazelcast.com/hazelcast/latest/index.html
Jet is an open-source, in-memory, distributedbatch and stream processing engine. You can use it to process largevolumes of real-time events or huge batches of static datasets. To givea sense of scale, a single node of Jet has been proven toaggregate 10million events persecond withlatency under 10 milliseconds.
It provides a Java API to build stream and batch processing applicationsthrough the use of adataflow programmingmodel. After you deploy yourapplication to a Jet cluster, Jet will automatically use all thecomputational resources on the cluster to run your application.
If you add more nodes to the cluster while your application is running,Jet automatically scales up your application to run on the new nodes. Ifyou remove nodes from the cluster, it scales it down seamlessly withoutlosing the current computational state, providingexactly-onceprocessingguarantees.
For example, you can represent the classical word count problem thatreads some local files and outputs the frequency of each word to consoleusing the following API:
JetInstancejet =Jet.bootstrappedInstance();Pipelinep =Pipeline.create();p.readFrom(Sources.files("/path/to/text-files")) .flatMap(line ->traverseArray(line.toLowerCase().split("\\W+"))) .filter(word -> !word.isEmpty()) .groupingKey(word ->word) .aggregate(counting()) .writeTo(Sinks.logger());jet.newJob(p).join();
and then deploy the application to the cluster:
bin/jet submit word-count.jar
Another application which aggregates millions of sensor readings persecond with 10-millisecond resolution from Kafka looks like thefollowing:
Pipelinep =Pipeline.create();p.readFrom(KafkaSources.<String,Reading>kafka(kafkaProperties,"sensors")) .withTimestamps(event ->event.getValue().timestamp(),10)// use event timestamp, allowed lag in ms .groupingKey(reading ->reading.sensorId()) .window(sliding(1_000,10))// sliding window of 1s by 10ms .aggregate(averagingDouble(reading ->reading.temperature())) .writeTo(Sinks.logger());jet.newJob(p).join();
Jet comes with out-of-the-box support for many kinds ofdata sourcesand sinks, including:
- Apache Kafka
- Local Files (Text, Avro, JSON)
- Apache Hadoop (Azure Data Lake, S3, GCS)
- Apache Pulsar
- Debezium
- Elasticsearch
- JDBC
- JMS
- InfluxDB
- Hazelcast
- Redis
- MongoDB
Jet is a good fit when you need to process large amounts of data in adistributed fashion. You can use it to build a variety ofdata-processing applications, such as:
- Low-latency stateful stream processing. For example, detecting trendsin 100 Hz sensor data from 100,000 devices and sending correctivefeedback within 10 milliseconds.
- High-throughput, large-state stream processing. For example,tracking GPS locations of millions of users, inferring their velocityvectors.
- Batch processing of big data volumes, for example analyzing aday's worth of stock trading data to update the risk exposure of agiven portfolio.
Jet uses a unique execution model withcooperativemultithreadingand can achieveextremely lowlatencies whileprocessing millions of items per second on just a single node:
The engine is able to run anywhere from tens to thousands of jobsconcurrently on a fixed number of threads.
Jet stores computational state in a distributed, replicatedin-memorystore anddoes not require the presence of a distributed file system norinfrastructure like Zookeeper to provide high-availability andfault-tolerance.
Jet implements a version of theChandy-Lamportalgorithm to provideexactly-once processing under the face offailures. When interfacing with external transactional systems likedatabases, it can provide end-to-end processing guarantees usingtwo-phasecommit.
Event data can often arriveout oforder and Jet hasfirst-class support for dealing with this disorder. Jet implements atechnique calleddistributedwatermarksto treat disordered events as if they were arriving in order.
Follow theGet Startedguide to start using Jet.
You can download Jet fromhttps://jet-start.sh.
Alternatively, you can use the latestdockerimage:
dockerrun -p5701:5701hazelcast/hazelcast-jet
Use the following Maven coordinates to add Jet to your application:
<groupId>com.hazelcast.jet</groupId><artifactId>hazelcast-jet</artifactId><version>4.2</version>
See thetutorials fortutorials on using Jet. Some examples:
Jet supports a variety of transforms and operators. These include:
- Statelesstransforms suchas mapping and filtering.
- Statefultransforms such asaggregations and stateful mapping.
Hazelcast Jet team actively answers questions onStackOverflow andHazelcast Community Slack.
You are also encouraged to join thehazelcast-jet mailinglist if you areinterested in community discussions
Thanks for your interest in contributing! The easiest way is to justsend a pull request. Have a look at the issues marked asgood firstissuefor some guidance.
To build, use:
./mvnw clean package -DskipTests
You can always use the latest snapshot release if you want to try thefeatures currently under development.
Maven snippet:
<repositories> <repository> <id>snapshot-repository</id> <name>Maven2 Snapshot Repository</name> <url>https://oss.sonatype.org/content/repositories/snapshots</url> <snapshots> <enabled>true</enabled> <updatePolicy>daily</updatePolicy> </snapshots> </repository></repositories><dependencies> <dependency> <groupId>com.hazelcast.jet</groupId> <artifactId>hazelcast-jet</artifactId> <version>4.3-SNAPSHOT</version> </dependency></dependencies>
When you create a pull request (PR), it must pass a build-and-testprocedure. Maintainers will be notified about your PR, and they cantrigger the build using special comments. These are the phrases you maysee used in the comments on your PR:
verify
- run the default PR builder, equivalent tomvn clean install
run-nightly-tests
- use the settings for the nightly build (mvn clean install -Pnightly
). This includes slower tests in the run,which we don't normally run on every PRrun-windows
- run the tests on a Windows machine (HighFive is notsupported here)run-cdc-debezium-tests
- run all tests in theextensions/cdc-debezium
modulerun-cdc-mysql-tests
- run all tests in theextensions/cdc-mysql
modulerun-cdc-postgres-tests
- run all tests in theextensions/cdc-postgres
module
Where not indicated, the builds run on a Linux machine with Oracle JDK8.
Source code in this repository is covered by one of two licenses:
The default license throughout the repository is Apache License 2.0unless theheader specifies another license. Please see theLicensingsection for more information.
We owe (the good parts of) our CLI tool's user experience topicocli.
Copyright (c) 2008-2021, Hazelcast, Inc. All Rights Reserved.
Visitwww.hazelcast.com for more info.
About
Distributed Stream and Batch Processing