googleapis/java-pubsublite-spark
Java idiomatic client for Pub/Sub Lite Spark Connector.
If you are using Maven, add this to your pom.xml file:

    <dependency>
      <groupId>com.google.cloud</groupId>
      <artifactId>pubsublite-spark-sql-streaming</artifactId>
      <version>1.0.0</version>
    </dependency>

If you are using Gradle without BOM, add this to your dependencies:

    implementation 'com.google.cloud:pubsublite-spark-sql-streaming:1.0.0'

If you are using SBT, add this to your dependencies:

    libraryDependencies += "com.google.cloud" % "pubsublite-spark-sql-streaming" % "1.0.0"

See the Authentication section in the base directory's README.
The client application making API calls must be granted the authorization scopes required for the desired Pub/Sub Lite Spark Connector APIs, and the authenticated principal must have the IAM role(s) required to access GCP resources using the Pub/Sub Lite Spark Connector API calls.
You will need a Google Cloud Platform Console project with the Pub/Sub Lite Spark Connector API enabled. You will need to enable billing to use Google Pub/Sub Lite Spark Connector. Follow these instructions to get your project set up. You will also need to set up the local development environment by installing the Google Cloud SDK and running the following commands in command line: `gcloud auth login` and `gcloud config set project [YOUR PROJECT ID]`.
You'll need to obtain the `pubsublite-spark-sql-streaming` library. See the Quickstart section to add `pubsublite-spark-sql-streaming` as a dependency in your code.
Google Cloud Pub/Sub Lite is a zonal, real-time messaging service that lets you send and receive messages between independent applications. You can manually configure the throughput and storage capacity for Pub/Sub Lite systems.
The Pub/Sub Lite Spark connector supports Pub/Sub Lite as an input source to Apache Spark Structured Streaming in both the default micro-batch processing mode and the experimental continuous processing mode. The connector works in all Apache Spark distributions, including Google Cloud Dataproc and manual Spark installations.
Follow the instructions to create a new subscription or use an existing subscription. If using an existing subscription, the connector will read from the oldest unacknowledged message in the subscription.
If you do not have an Apache Spark environment, you can create a Cloud Dataproc cluster with pre-configured auth. The following examples assume you are using Cloud Dataproc, but you can use `spark-submit` on any cluster.

    MY_CLUSTER=...
    gcloud dataproc clusters create "$MY_CLUSTER"

The latest version of the connector is publicly available from the Maven Central repository. You can download it and pass it in the `--jars` option when using the `spark-submit` command.
| Connector version | Spark version |
|---|---|
| ≤0.3.4 | 2.4.X |
| Current | 3.X.X |
There are three Java samples (word count, simple write, simple read) under samples that show how to use the connector inside Dataproc.
Here is an example in Python:

    # `spark` is an existing SparkSession.
    df = spark.readStream \
        .format("pubsublite") \
        .option("pubsublite.subscription", "projects/$PROJECT_NUMBER/locations/$LOCATION/subscriptions/$SUBSCRIPTION_ID") \
        .load()

Here is an example in Java:

    Dataset<Row> df = spark
        .readStream()
        .format("pubsublite")
        .option("pubsublite.subscription", "projects/$PROJECT_NUMBER/locations/$LOCATION/subscriptions/$SUBSCRIPTION_ID")
        .load();

Note that the connector supports both MicroBatch Processing and Continuous Processing.
Here is an example in Python:

    df.writeStream \
        .format("pubsublite") \
        .option("pubsublite.topic", "projects/$PROJECT_NUMBER/locations/$LOCATION/topics/$TOPIC_ID") \
        .option("checkpointLocation", "path/to/HDFS/dir") \
        .outputMode("complete") \
        .trigger(processingTime="2 seconds") \
        .start()

Here is an example in Java:

    df.writeStream()
        .format("pubsublite")
        .option("pubsublite.topic", "projects/$PROJECT_NUMBER/locations/$LOCATION/topics/$TOPIC_ID")
        .option("checkpointLocation", "path/to/HDFS/dir")
        .outputMode(OutputMode.Complete())
        .trigger(Trigger.ProcessingTime(2, TimeUnit.SECONDS))
        .start();

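
The read and write examples above use placeholder paths of the form `projects/$PROJECT_NUMBER/locations/$LOCATION/...`. As a minimal sketch, the helpers below assemble such paths from their parts. The function names and the sample values are hypothetical and are not part of the connector.

```python
def subscription_path(project_number: int, location: str, subscription_id: str) -> str:
    """Build a full subscription path in the format used by pubsublite.subscription."""
    return f"projects/{project_number}/locations/{location}/subscriptions/{subscription_id}"


def topic_path(project_number: int, location: str, topic_id: str) -> str:
    """Build a full topic path in the format used by pubsublite.topic."""
    return f"projects/{project_number}/locations/{location}/topics/{topic_id}"


# Hypothetical project number, zone, and resource IDs, for illustration only.
sub = subscription_path(123456789, "us-central1-a", "my-subscription")
topic = topic_path(123456789, "us-central1-a", "my-topic")
print(sub)
print(topic)
```

The resulting strings can be passed as the values of the `pubsublite.subscription` and `pubsublite.topic` options.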
When reading from Pub/Sub Lite, the connector supports a number of configuration options:
| Option | Type | Required | Default Value | Meaning |
|---|---|---|---|---|
| pubsublite.subscription | String | Y | | Full subscription path that the connector will read from. |
| pubsublite.flowcontrol.byteoutstandingperpartition | Long | N | 50_000_000 | Max number of bytes per partition that will be cached in workers before Spark processes the messages. |
| pubsublite.flowcontrol.messageoutstandingperpartition | Long | N | Long.MAX | Max number of messages per partition that will be cached in workers before Spark processes the messages. |
| pubsublite.flowcontrol.maxmessagesperbatch | Long | N | Long.MAX | Max number of messages in micro batch. |
| gcp.credentials.key | String | N | Application Default Credentials | Service account JSON in base64. |
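
One way to keep the read options in one place is to collect them in a dictionary before handing them to the reader. The sketch below is only illustrative: `build_read_options` and its parameter names are hypothetical, while the option keys come from the table above. Values are converted to strings, and options left unset fall back to the connector defaults.

```python
from typing import Dict, Optional


def build_read_options(
    subscription: str,
    max_bytes_per_partition: Optional[int] = None,
    max_messages_per_batch: Optional[int] = None,
) -> Dict[str, str]:
    """Collect read options; only the subscription path is required."""
    if not subscription:
        raise ValueError("pubsublite.subscription is required")
    options = {"pubsublite.subscription": subscription}
    # Optional flow-control settings are added only when explicitly given.
    if max_bytes_per_partition is not None:
        options["pubsublite.flowcontrol.byteoutstandingperpartition"] = str(max_bytes_per_partition)
    if max_messages_per_batch is not None:
        options["pubsublite.flowcontrol.maxmessagesperbatch"] = str(max_messages_per_batch)
    return options


# Hypothetical subscription path and limits, for illustration only.
read_options = build_read_options(
    "projects/123456789/locations/us-central1-a/subscriptions/my-subscription",
    max_bytes_per_partition=10_000_000,
    max_messages_per_batch=1000,
)
print(read_options)
```

In PySpark, a dictionary like this can be applied with `spark.readStream.format("pubsublite").options(**read_options).load()`.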
When writing to Pub/Sub Lite, the connector supports a number of configuration options:
| Option | Type | Required | Default Value | Meaning |
|---|---|---|---|---|
| pubsublite.topic | String | Y | | Full topic path that the connector will write to. |
| gcp.credentials.key | String | N | Application Default Credentials | Service account JSON in base64. |
When reading from Pub/Sub Lite, the connector has a fixed data schema as follows:
| Data Field | Spark Data Type | Notes |
|---|---|---|
| subscription | StringType | Full subscription path |
| partition | LongType | |
| offset | LongType | |
| key | BinaryType | |
| data | BinaryType | |
| attributes | MapType[StringType, ArrayType[BinaryType]] | |
| publish_timestamp | TimestampType | |
| event_timestamp | TimestampType | Nullable |
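
Because `key`, `data`, and the attribute values are binary, they usually need decoding before use. The sketch below shows one way to do that for a single record represented as a plain Python dictionary (for example, a collected row converted to a dict). It assumes the payloads are UTF-8 text, which the schema does not guarantee, and `decode_message` is a hypothetical helper rather than part of the connector.

```python
from typing import Any, Dict, List, Optional


def _to_text(value: Optional[bytes]) -> Optional[str]:
    """Decode a binary field as UTF-8, passing None through unchanged."""
    return None if value is None else bytes(value).decode("utf-8")


def decode_message(row: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of the record with binary fields decoded to text."""
    attributes: Dict[str, List[str]] = {
        name: [bytes(v).decode("utf-8") for v in values]
        for name, values in (row.get("attributes") or {}).items()
    }
    return {
        "partition": row["partition"],
        "offset": row["offset"],
        "key": _to_text(row.get("key")),
        "data": _to_text(row.get("data")),
        "attributes": attributes,
    }


# Hypothetical record shaped like the read schema above.
sample_row = {
    "partition": 0,
    "offset": 42,
    "key": b"user-1",
    "data": b"hello",
    "attributes": {"source": [b"sensor-a", b"sensor-b"]},
}
decoded = decode_message(sample_row)
print(decoded)
```

For large streams, casting the columns inside Spark is generally preferable to decoding rows on the driver; this sketch only illustrates the shape of the data.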
When writing to Pub/Sub Lite, the connector matches the following data fields and data types:
| Data Field | Spark Data Type | Required |
|---|---|---|
| key | BinaryType | N |
| data | BinaryType | N |
| attributes | MapType[StringType, ArrayType[BinaryType]] | N |
| event_timestamp | TimestampType | N |
Note that when a data field is present in the table but its data type does not match, the connector throws an `IllegalArgumentException` that terminates the query.
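
Since a type mismatch terminates the query, it can help to check the column types before starting the stream. The sketch below is not the connector's own validation logic. It is a hypothetical pre-flight check that compares `(name, type)` pairs, in the string form that PySpark's `DataFrame.dtypes` reports, against the table above.

```python
from typing import Dict, List, Tuple

# Expected type names for the fields the connector recognizes when writing.
EXPECTED_WRITE_TYPES: Dict[str, str] = {
    "key": "binary",
    "data": "binary",
    "attributes": "map<string,array<binary>>",
    "event_timestamp": "timestamp",
}


def find_type_mismatches(dtypes: List[Tuple[str, str]]) -> List[str]:
    """Describe every recognized field whose type differs from the table."""
    problems = []
    for name, actual in dtypes:
        expected = EXPECTED_WRITE_TYPES.get(name)
        # Columns that are not in the table are not checked by this sketch.
        if expected is not None and actual != expected:
            problems.append(f"{name}: expected {expected}, got {actual}")
    return problems


# Hypothetical column types, shaped like the output of df.dtypes.
sample_dtypes = [("key", "binary"), ("data", "string"), ("word", "string")]
mismatches = find_type_mismatches(sample_dtypes)
print(mismatches)
```

Here the `data` column would be reported because it is a string rather than binary; casting it to binary before calling `writeStream` would resolve the mismatch.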
The connector is built using Maven. The following command creates a JAR file with shaded dependencies:

    mvn package

For pricing, see the Pub/Sub Lite pricing documentation.
The number of Spark partitions is not configurable: it is set to the number of Pub/Sub Lite partitions of the topic that the subscription is attached to.
Use a service account JSON key and `GOOGLE_APPLICATION_CREDENTIALS` as described here.
Credentials can also be provided with the `gcp.credentials.key` option, which must be passed in as a base64-encoded string.
Example:

    spark.readStream.format("pubsublite").option("gcp.credentials.key", "<SERVICE_ACCOUNT_JSON_IN_BASE64>")

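
The base64 string for `gcp.credentials.key` can be produced with the Python standard library. The sketch below assumes standard (not URL-safe) base64 over the UTF-8 bytes of the key file contents. The helper name and the sample key are hypothetical, and a real key file contains more fields than shown.

```python
import base64
import json


def encode_service_account_key(key_json: str) -> str:
    """Return the base64 form of a service account JSON key string."""
    # Parse first so that a malformed key fails here rather than inside Spark.
    json.loads(key_json)
    return base64.b64encode(key_json.encode("utf-8")).decode("ascii")


# Hypothetical, heavily abbreviated key used only for illustration.
sample_key = json.dumps({"type": "service_account", "project_id": "my-project"})
encoded = encode_service_account_key(sample_key)
print(encoded)
```

The value of `encoded` is what would replace `<SERVICE_ACCOUNT_JSON_IN_BASE64>` in the example above. In practice the JSON would be read from the downloaded key file rather than built inline.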
Samples are in the `samples/` directory.
| Sample | Source Code |
|---|---|
| Admin Utils | source code |
| Common Utils | source code |
| Publish Words | source code |
| Read Results | source code |
| Simple Read | source code |
| Simple Write | source code |
| Word Count | source code |
To get help, follow the instructions in the shared Troubleshooting document.
Pub/Sub Lite Spark Connector uses gRPC for the transport layer.
Java 8 or above is required for using this client.
Google's Java client libraries, Google Cloud Client Libraries and Google Cloud API Libraries, follow the Oracle Java SE support roadmap (see the Oracle Java SE Product Releases section).
In general, new feature development occurs with support for the lowest Java LTS version covered by Oracle's Premier Support (which typically lasts 5 years from initial General Availability). If the minimum required JVM for a given library is changed, it is accompanied by a semver major release.
Java 11 and (in September 2021) Java 17 are the best choices for new development.
Google tests its client libraries with all current LTS versions covered by Oracle's Extended Support (which typically lasts 8 years from initial General Availability).
Google's client libraries support legacy versions of Java runtimes with long term stable libraries that don't receive feature updates on a best efforts basis as it may not be possible to backport all patches.
Google provides updates on a best efforts basis to apps that continue to use Java 7, though apps might need to upgrade to current versions of the library that supports their JVM.
The latest versions and the supported Java versions are identified on the individual GitHub repository github.com/GoogleAPIs/java-SERVICENAME and on google-cloud-java.
This library follows Semantic Versioning.
Contributions to this library are always welcome and highly encouraged.
See CONTRIBUTING for more information on how to get started.
Please note that this project is released with a Contributor Code of Conduct. By participating in this project you agree to abide by its terms. See Code of Conduct for more information.
Apache 2.0 - See LICENSE for more information.
| Java Version | Status |
|---|---|
| Java 8 | |
| Java 8 OSX | |
| Java 8 Windows | |
| Java 11 | |
Java is a registered trademark of Oracle and/or its affiliates.
