Stream logs from Google Cloud to Splunk

Last reviewed 2024-10-24 UTC

This document describes a reference architecture that helps you create a production-ready, scalable, fault-tolerant log export mechanism that streams logs and events from your resources in Google Cloud into Splunk. Splunk is a popular analytics tool that offers a unified security and observability platform. You can export the logging data to either Splunk Enterprise or Splunk Cloud Platform. If you're an administrator, you can also use this architecture for either IT operations or security use cases.

This reference architecture assumes a resource hierarchy that is similar to the following diagram. All the Google Cloud resource logs from the organization, folder, and project levels are gathered into an aggregated sink. Then, the aggregated sink sends these logs to a log export pipeline, which processes the logs and exports them to Splunk.

Organization-level aggregated log sink for export to Splunk.

Architecture

The following diagram shows the reference architecture that you use when you deploy this solution. This diagram demonstrates how log data flows from Google Cloud to Splunk.

Flow of logs from Google Cloud to Splunk.

This architecture includes the following components:

  • Cloud Logging: To start the process, Cloud Logging collects the logs into an organization-level aggregated log sink and sends the logs to Pub/Sub.
  • Pub/Sub: The Pub/Sub service then creates a single topic and subscription for the logs and forwards the logs to the main Dataflow pipeline.
  • Dataflow: There are two Dataflow pipelines in this reference architecture:
    • Primary Dataflow pipeline: At the center of the process, the main Dataflow pipeline is a Pub/Sub to Splunk streaming pipeline, which pulls logs from the Pub/Sub subscription and delivers them to Splunk.
    • Secondary Dataflow pipeline: Parallel to the primary Dataflow pipeline, the secondary Dataflow pipeline is a Pub/Sub to Pub/Sub streaming pipeline to replay messages if a delivery fails.
  • Splunk: At the end of the process, Splunk Enterprise or Splunk Cloud Platform acts as an HTTP Event Collector (HEC) and receives the logs for further analysis. You can deploy Splunk on-premises, in Google Cloud as SaaS, or through a hybrid approach.

Use case

This reference architecture uses a cloud, push-based approach. In this push-based method, you use the Pub/Sub to Splunk Dataflow template to stream logs to a Splunk HTTP Event Collector (HEC). The reference architecture also discusses Dataflow pipeline capacity planning and how to handle potential delivery failures when there are transient server or network issues.

While this reference architecture focuses on Google Cloud logs, the same architecture can be used to export other Google Cloud data, such as real-time asset changes and security findings. By integrating logs from Cloud Logging, you can continue to use existing partner services like Splunk as a unified log analytics solution.

The push-based method to stream Google Cloud data into Splunk has thefollowing advantages:

  • Managed service. As a managed service, Dataflow maintains the required resources in Google Cloud for data processing tasks such as log export.
  • Distributed workload. This method lets you distribute workloads across multiple workers for parallel processing, so there is no single point of failure.
  • Security. Because Google Cloud pushes your data to Splunk HEC, there's none of the maintenance and security burden associated with creating and managing service account keys.
  • Autoscaling. The Dataflow service autoscales the number of workers in response to variations in incoming log volume and backlog.
  • Fault-tolerance. If there are transient server or network issues, the push-based method automatically tries to resend the data to the Splunk HEC. It also supports unprocessed topics (also known as dead-letter topics) for any undeliverable log messages to avoid data loss.
  • Simplicity. You avoid the management overhead and the cost of running one or more heavy forwarders in Splunk.

This reference architecture applies to businesses in many different industry verticals, including regulated ones such as pharmaceutical and financial services. When you choose to export your Google Cloud data into Splunk, you might choose to do so for the following reasons:

  • Business analytics
  • IT operations
  • Application performance monitoring
  • Security operations
  • Compliance

Design alternatives

An alternative method for log export to Splunk is one where you pull logs from Google Cloud. In this pull-based method, you use Google Cloud APIs to fetch the data through the Splunk Add-on for Google Cloud. You might choose to use the pull-based method in the following situations:

  • Your Splunk deployment does not offer a Splunk HEC endpoint.
  • Your log volume is low.
  • You want to export and analyze Cloud Monitoring metrics, Cloud Storage objects, Cloud Resource Manager API metadata, Cloud Billing data, or low-volume logs.
  • You already manage one or more heavy forwarders in Splunk.
  • You use the hosted Inputs Data Manager for Splunk Cloud.

Also, keep in mind the additional considerations that arise when you use thispull-based method:

  • A single worker handles the data ingestion workload, which does not offer autoscaling capabilities.
  • In Splunk, the use of a heavy forwarder to pull data might cause a single point of failure.
  • The pull-based method requires you to create and manage the service account keys that you use to configure the Splunk Add-on for Google Cloud.

Before using the Splunk Add-on, log entries must first be routed to Pub/Sub using a log sink. To create a log sink with a Pub/Sub topic as the destination, see create a sink. Make sure to grant the Pub/Sub Publisher role (roles/pubsub.publisher) to the sink's writer identity on that Pub/Sub topic destination. For more information about configuring sink destination permissions, see Set destination permissions.

To enable the Splunk Add-on, perform the following steps:

  1. In Splunk, follow the Splunk instructions to install the Splunk Add-on for Google Cloud.
  2. Create a Pub/Sub pull subscription for the Pub/Sub topic where the logs are routed, if you don't have one already.
  3. Create a service account.
  4. Create a service account key for the service account that you just created.
  5. Grant the Pub/Sub Viewer (roles/pubsub.viewer) and Pub/Sub Subscriber (roles/pubsub.subscriber) roles to the service account to let the account receive messages from the Pub/Sub subscription.
  6. In Splunk, follow the Splunk instructions to configure a new Pub/Sub input in the Splunk Add-on for Google Cloud.

    The Pub/Sub messages from the log export appear in Splunk.

To verify that the add-on is working, perform the following steps:

  1. In Cloud Monitoring, open Metrics Explorer.
  2. In the Resources menu, select pubsub_subscription.
  3. In the Metric categories, select pubsub/subscription/pull_message_operation_count.
  4. Monitor the number of message-pull operations for one to two minutes.

Design considerations

The following guidelines can help you to develop an architecture that meets your organization's requirements for security, privacy, compliance, operational efficiency, reliability, fault tolerance, performance, and cost optimization.

Security, privacy, and compliance

The following sections describe the security considerations for this reference architecture:

Use private IP addresses to secure the VMs that support the Dataflow pipeline

You should restrict access to the worker VMs that are used in the Dataflow pipeline. To restrict access, deploy these VMs with private IP addresses. However, these VMs also need to be able to use HTTPS to stream the exported logs into Splunk and access the internet. To provide this HTTPS access, you need a Cloud NAT gateway, which automatically allocates Cloud NAT IP addresses to the VMs that need them. Make sure to map the subnet that contains the VMs to the Cloud NAT gateway.

Enable Private Google Access

When you create a Cloud NAT gateway, Private Google Access becomes enabled automatically. However, to allow Dataflow workers with private IP addresses to access the external IP addresses that Google Cloud APIs and services use, you must also manually enable Private Google Access for the subnet.

Restrict Splunk HEC ingress traffic to known IP addresses used by Cloud NAT

If you want to restrict traffic into the Splunk HEC to a subset of known IP addresses, you can reserve static IP addresses and manually assign them to the Cloud NAT gateway. Depending on your Splunk deployment, you can then configure your Splunk HEC ingress firewall rules using these static IP addresses. For more information about Cloud NAT, see Set up and manage network address translation with Cloud NAT.

Store the Splunk HEC token in Secret Manager

When you deploy the Dataflow pipeline, you can pass the token value in one of the following ways:

  • Plaintext
  • Ciphertext encrypted with a Cloud Key Management Service key
  • Secret version encrypted and managed by Secret Manager

In this reference architecture, you use the Secret Manager option because this option offers the least complex and most efficient way to protect your Splunk HEC token. This option also prevents leakage of the Splunk HEC token from the Dataflow console or the job details.

A secret in Secret Manager contains a collection of secret versions. Each secret version stores the actual secret data, such as the Splunk HEC token. If you later choose to rotate your Splunk HEC token as an added security measure, you can add the new token as a new secret version to this secret. For general information about the rotation of secrets, see About rotation schedules.

Create a custom Dataflow worker service account to follow least privilege best practices

Workers in the Dataflow pipeline use the Dataflow worker service account to access resources and execute operations. By default, the workers use your project's Compute Engine default service account as the worker service account, which grants them broad permissions to all resources in your project. However, to run Dataflow jobs in production, we recommend that you create a custom service account with a minimum set of roles and permissions. You can then assign this custom service account to your Dataflow pipeline workers.

The following diagram lists the required roles that you must assign to a service account to enable Dataflow workers to run a Dataflow job successfully.

Roles required to assign to a Dataflow worker service account.

As shown in the diagram, you need to assign the following roles to the service account for your Dataflow worker:

  • Dataflow Admin
  • Dataflow Worker
  • Storage Object Admin
  • Pub/Sub Subscriber
  • Pub/Sub Viewer
  • Pub/Sub Publisher
  • Secret Accessor

Configure SSL validation with an internal root CA certificate if you use a private CA

By default, the Dataflow pipeline uses the Dataflow worker's default trust store to validate the SSL certificate for your Splunk HEC endpoint. If you use a private certificate authority (CA) to sign an SSL certificate that is used by the Splunk HEC endpoint, you can import your internal root CA certificate into the trust store. The Dataflow workers can then use the imported certificate for SSL certificate validation.

You can use and import your own internal root CA certificate for Splunk deployments with self-signed or privately signed certificates. You can also disable SSL validation entirely, but only for internal development and testing purposes. This internal root CA method works best for non-internet-facing, internal Splunk deployments.

For more information, see the Pub/Sub to Splunk Dataflow template parameters rootCaCertificatePath and disableCertificateValidation.

Operational efficiency

The following sections describe the operational efficiency considerations for this reference architecture:

Use UDF to transform logs or events in-flight

The Pub/Sub to Splunk Dataflow template supports user-defined functions (UDF) for custom event transformation. Example use cases include enriching records with additional fields, redacting some sensitive fields, or filtering out undesired records. UDF enables you to change the Dataflow pipeline's output format without having to recompile or to maintain the template code itself. This reference architecture uses a UDF to handle messages that the pipeline isn't able to deliver to Splunk.
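A UDF for this template is a JavaScript function that receives each message as a JSON string and returns the transformed JSON string. The following is a minimal sketch of such a function; the function name `process`, the enrichment field, and the redacted field are illustrative assumptions, not part of this reference architecture:

```javascript
// Hypothetical UDF sketch for the Pub/Sub to Splunk Dataflow template.
// The template passes each message in as a JSON string and expects a
// JSON string back. The function name and field names here are examples.
function process(inJson) {
  var obj = JSON.parse(inJson);

  // Example enrichment: tag each event with a static label.
  obj.inputSubscription = "splunk-dataflow-pipeline";

  // Example redaction: drop a hypothetical sensitive field if present.
  delete obj.sensitiveField;

  return JSON.stringify(obj);
}
```

You point the template at a UDF like this through its UDF-related template parameters when you deploy the pipeline.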

Replay unprocessed messages

Sometimes, the pipeline receives delivery errors and doesn't try to deliver the message again. In this case, Dataflow sends these unprocessed messages to an unprocessed topic as shown in the following diagram. After you fix the root cause of the delivery failure, you can then replay the unprocessed messages.

Error handling when exporting logs to Splunk.

The following steps outline the process shown in the previous diagram:

  1. The main delivery pipeline from Pub/Sub to Splunk automatically forwards undeliverable messages to the unprocessed topic for user investigation.
  2. The operator or site reliability engineer (SRE) investigates the failed messages in the unprocessed subscription. The operator troubleshoots and fixes the root cause of the delivery failure. For example, fixing an HEC token misconfiguration might enable the messages to be delivered.

    Note: To avoid data backlog or data loss, you must regularly inspect any failed messages in the unprocessed subscription and resolve any issues before Pub/Sub discards the messages. The maximum message retention in a Pub/Sub subscription is seven days, which is the default setting for both the input and the unprocessed subscriptions in this reference architecture. You can monitor and receive alerts on the unacknowledged messages in the unprocessed subscription. For more information about monitoring and alerting for Pub/Sub subscriptions, see Monitor message backlog.
  3. The operator triggers the replay failed message pipeline. This Pub/Sub to Pub/Sub pipeline (highlighted in the dotted section of the preceding diagram) is a temporary pipeline that moves the failed messages from the unprocessed subscription back to the original log sink topic.

  4. The main delivery pipeline re-processes the previously failed messages. This step requires the pipeline to use a UDF for correct detection and decoding of failed message payloads. The following code is an example function that implements this conditional decoding logic, including a tally of delivery attempts for tracking purposes:

    // If the message has already been converted to a Splunk HEC object
    // with a stringified obj.event JSON payload, then it's a replay of
    // a previously failed delivery.
    // Unnest and parse the obj.event. Drop the previously injected
    // obj.attributes such as errorMessage and timestamp.
    if (obj.event) {
      try {
        event = JSON.parse(obj.event);
        redelivery = true;
      } catch (e) {
        event = obj;
      }
    } else {
      event = obj;
    }

    // Keep a tally of delivery attempts.
    event.delivery_attempt = event.delivery_attempt || 1;
    if (redelivery) {
      event.delivery_attempt += 1;
    }

Reliability and fault tolerance

In regard to reliability and fault tolerance, the following table, Table 1, lists some possible Splunk delivery errors. The table also lists the corresponding errorMessage attributes that the pipeline records with each message before forwarding these messages to the unprocessed topic.

Table 1: Splunk delivery error types

| Delivery error type | Automatically retried by pipeline? | Example errorMessage attribute |
| --- | --- | --- |
| Transient network error | Yes | Read timed out or Connection reset |
| Splunk server 5xx error | Yes | Splunk write status code: 503 |
| Splunk server 4xx error | No | Splunk write status code: 403 |
| Splunk server down | No | The target server failed to respond |
| Splunk SSL certificate invalid | No | Host name X does not match the certificate |
| JavaScript syntax error in the user-defined function (UDF) | No | ReferenceError: X is not defined |

In some cases, the pipeline applies exponential backoff and automatically tries to deliver the message again. For example, when the Splunk server generates a 5xx error code, the pipeline needs to re-deliver the message. These error codes occur when the Splunk HEC endpoint is overloaded.

Alternatively, there could be a persistent issue that prevents a message from being submitted to the HEC endpoint. For such persistent issues, the pipeline does not try to deliver the message again. The following are examples of persistent issues:

  • A syntax error in the UDF function.
  • An invalid HEC token that causes the Splunk server to generate a 4xx "Forbidden" server response.
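The retry policy described above can be sketched as follows. This is an illustrative simplification, not the template's actual implementation, and the base and cap delay values are assumptions: 5xx responses (and transient network errors) are retried with exponentially growing delays, while 4xx responses are treated as persistent failures and forwarded to the unprocessed topic.

```javascript
// Illustrative sketch of the retry behavior described in Table 1.
// Assumption: base delay of 1 s, doubling per attempt, capped at 60 s.
function isRetryable(statusCode) {
  // 5xx server errors are transient and worth retrying; 4xx are persistent.
  return statusCode >= 500 && statusCode < 600;
}

function backoffDelayMs(attempt, baseMs = 1000, maxMs = 60000) {
  // Delay doubles with each attempt, capped at maxMs.
  return Math.min(baseMs * 2 ** (attempt - 1), maxMs);
}
```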

Performance and cost optimization

In regard to performance and cost optimization, you need to determine the maximum size and throughput for your Dataflow pipeline. You must calculate the correct size and throughput values so that your pipeline can handle peak daily log volume (GB/day) and log message rate (events per second, or EPS) from the upstream Pub/Sub subscription.

You must select the size and throughput values so that the system doesn't incur either of the following issues:

  • Delays caused by message backlogging or message throttling.
  • Extra costs from overprovisioning a pipeline.

After you perform the size and throughput calculations, you can use the results to configure an optimal pipeline that balances performance and cost. To configure your pipeline capacity, you use the following settings:

  • Machine type
  • Machine count
  • Parallelism
  • Batch count

The following sections provide an explanation of these settings. When applicable, these sections also provide formulas and example calculations that use each formula. These example calculations and resulting values assume an organization with the following characteristics:

  • Generates 1 TB of logs daily.
  • Has an average message size of 1 KB.
  • Has a sustained peak message rate that is two times the average rate.

Because your Dataflow environment is unique, substitute the example values with values from your own organization as you work through the steps.

Machine type

Best practice: Set the --worker-machine-type flag to n2-standard-4 to select a machine size that provides the best performance to cost ratio.

Because the n2-standard-4 machine type can handle 12k EPS, we recommend that you use this machine type as a baseline for all of your Dataflow workers.

For this reference architecture, set the --worker-machine-type flag to a value of n2-standard-4.

Machine count

Best practice: Set the --max-workers flag to control the maximum number of workers needed to handle expected peak EPS.

Dataflow autoscaling allows the service to adaptively change the number of workers used to execute your streaming pipeline when there are changes to resource usage and load. To avoid over-provisioning when autoscaling, we recommend that you always define the maximum number of virtual machines that are used as Dataflow workers. You define the maximum number of virtual machines with the --max-workers flag when you deploy the Dataflow pipeline.

Dataflow statically provisions the storage component as follows:

  • An autoscaling pipeline deploys one data persistent disk for each potential streaming worker. The default persistent disk size is 400 GB, and you set the maximum number of workers with the --max-workers flag. The disks are mounted to the running workers at any point in time, including startup.

  • Because each worker instance is limited to 15 persistent disks, the minimum number of starting workers is ⌈--max-workers/15⌉. So, if the default value is --max-workers=20, the pipeline usage (and cost) is as follows:

    • Storage: static with 20 persistent disks.
    • Compute: dynamic with a minimum of 2 worker instances (⌈20/15⌉ = 2), and a maximum of 20.

    This value is equivalent to 8 TB of Persistent Disk. This size of Persistent Disk could incur unnecessary cost if the disks are not fully used, especially if only one or two workers are running the majority of the time.
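The storage arithmetic above can be sketched in a few lines, using the example values from the list (--max-workers=20, 400 GB default disk, 15-disk limit per worker):

```javascript
// Static storage provisioning math for an autoscaling pipeline,
// using the example values from the list above.
const maxWorkers = 20;            // --max-workers
const diskSizeGB = 400;           // default persistent disk size
const disksPerWorkerLimit = 15;   // persistent disks per worker instance

const totalDiskGB = maxWorkers * diskSizeGB;                            // 8000 GB ≈ 8 TB
const minStartingWorkers = Math.ceil(maxWorkers / disksPerWorkerLimit); // ⌈20/15⌉ = 2
```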

To determine the maximum number of workers that you need for your pipeline, use the following formulas in sequence:

  1. Determine the average events per second (EPS) using the following formula:

    \( {AverageEventsPerSecond}\simeq\frac{TotalDailyLogsInTB}{AverageMessageSizeInKB}\times\frac{10^9}{24\times3600} \)

    Example calculation: Given the example values of 1 TB of logs per day with an average message size of 1 KB, this formula generates an average EPS value of 11.5k EPS.

  2. Determine the sustained peak EPS by using the following formula, where the multiplier N represents the bursty nature of logging:

    \( {PeakEventsPerSecond = N \times AverageEventsPerSecond} \)

    Example calculation: Given an example value of N=2 and the average EPS value of 11.5k that you calculated in the previous step, this formula generates a sustained peak EPS value of 23k EPS.

  3. Determine the maximum required number of vCPUs by using the following formula:

    \( {maxCPUs = \lceil PeakEventsPerSecond / 3k \rceil} \)

    Example calculation: Using the sustained peak EPS value of 23k that you calculated in the previous step, this formula generates a maximum of ⌈23 / 3⌉ = 8 vCPU cores.

    Note: A single vCPU in a Splunk Dataflow pipeline can generally process 3k EPS, assuming there are no artificially low rate limits. So, one Dataflow VM worker of machine type (n2-standard-4) is generally enough to process up to 12k EPS.
  4. Determine the maximum number of Dataflow workers by using the following formula:

    \( {maxWorkers = \lceil maxCPUs / vCPUsPerWorker \rceil} \)

    Example calculation: Using the example maximum vCPUs value of 8 that was calculated in the previous step, this formula ⌈8/4⌉ generates a maximum number of 2 for an n2-standard-4 machine type, which has 4 vCPUs per worker.

For this example, you would set the --max-workers flag to a value of 2 based on the previous set of example calculations. However, remember to use your own unique values and calculations when you deploy this reference architecture in your environment.
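The four steps above collapse into a short calculation. The following sketch uses the example values from this section (1 TB/day, 1 KB messages, N=2, ~3k EPS per vCPU, and 4 vCPUs per n2-standard-4 worker):

```javascript
// Capacity-planning arithmetic from the steps above, with example values.
const totalDailyLogsInTB = 1;     // daily log volume
const averageMessageSizeInKB = 1; // average message size
const peakMultiplier = 2;         // N: bursty nature of logging
const epsPerVCPU = 3000;          // ~3k EPS per vCPU
const vCPUsPerWorker = 4;         // n2-standard-4

const averageEPS =
  (totalDailyLogsInTB / averageMessageSizeInKB) * 1e9 / (24 * 3600); // ≈ 11.5k EPS
const peakEPS = peakMultiplier * averageEPS;                         // ≈ 23k EPS
const maxCPUs = Math.ceil(peakEPS / epsPerVCPU);        // ⌈23k/3k⌉ = 8 vCPUs
const maxWorkers = Math.ceil(maxCPUs / vCPUsPerWorker); // ⌈8/4⌉ = 2 → --max-workers=2
```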

Parallelism

Best practice: Set the parallelism parameter in the Pub/Sub to Splunk Dataflow template to twice the number of vCPUs used by the maximum number of Dataflow workers.

The parallelism parameter helps maximize the number of parallel Splunk HEC connections, which in turn maximizes the EPS rate for your pipeline.

The default parallelism value of 1 disables parallelism and limits the output rate. You need to override this default setting to account for 2 to 4 parallel connections per vCPU, with the maximum number of workers deployed. As a rule, you calculate the override value for this setting by multiplying the maximum number of Dataflow workers by the number of vCPUs per worker, and then doubling this value.

To determine the total number of parallel connections to the Splunk HEC across all Dataflow workers, use the following formula:

\( {parallelism = maxCPUs \times 2} \)

Example calculation: Using the example maximum vCPUs of 8 that was previously calculated for machine count, this formula generates the number of parallel connections to be 8 x 2 = 16.

For this example, you would set the parallelism parameter to a value of 16 based on the previous example calculation. However, remember to use your own unique values and calculations when you deploy this reference architecture in your environment.
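The parallelism setting follows directly from the machine-count result. Continuing the example:

```javascript
// parallelism = 2 connections per vCPU across the maximum worker fleet.
const maxWorkers = 2;        // from the machine-count calculation
const vCPUsPerWorker = 4;    // n2-standard-4
const maxCPUs = maxWorkers * vCPUsPerWorker; // 8
const parallelism = maxCPUs * 2;             // 16 parallel HEC connections
```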

Batch count

Best practice: To enable the Splunk HEC to process events in batches rather than one at a time, set the batchCount parameter to a value between 10 and 50 events per request for logs.

Configuring the batch count helps to increase EPS and reduce the load on the Splunk HEC endpoint. The setting combines multiple events into a single batch for more efficient processing. We recommend that you set the batchCount parameter to a value between 10 and 50 events per request for logs, provided the maximum buffering delay of two seconds is acceptable.

\( {batchCount \geq 10} \)

Because the average log message size is 1 KB in this example, we recommend that you batch at least 10 events per request. For this example, you would set the batchCount parameter to a value of 10. However, remember to use your own unique values and calculations when you deploy this reference architecture in your environment.
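As a rough sanity check on the two-second buffering constraint, you can estimate how long one connection takes to fill a batch at peak load. The peak EPS and parallelism values below are carried over from the earlier example calculations and are illustrative assumptions:

```javascript
// Approximate time to fill one batch on a single HEC connection at peak load.
const peakEPS = 23000;    // sustained peak EPS from the earlier example
const parallelism = 16;   // parallel HEC connections from the earlier example
const batchCount = 10;

const epsPerConnection = peakEPS / parallelism;           // ≈ 1437 events/s
const secondsToFillBatch = batchCount / epsPerConnection; // ≈ 0.007 s, well under 2 s
```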

For more information about these performance and cost optimization recommendations, see Plan your Dataflow pipeline.
