Troubleshoot IBM Spectrum Symphony connectors

This document helps you resolve common issues with the IBM Spectrum Symphony integration for Google Cloud. Specifically, this document provides troubleshooting guidance for the IBM Spectrum Symphony host factory service, the connectors for the Compute Engine and GKE providers, and the Symphony Operator for Kubernetes.

Symphony host factory service issues

These issues relate to the central Symphony host factory service. You can find the main log file for this service at the following location on Linux:

$EGO_TOP/hostfactory/log/hostfactory.hostname.log

You set the $EGO_TOP environment variable when you load the host factory environment variables. In IBM Spectrum Symphony, $EGO_TOP points to the installation root of the Enterprise Grid Orchestrator (EGO), which is the core resource manager for the cluster. The default installation path for $EGO_TOP on Linux is typically /opt/ibm/spectrumcomputing.
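
For reference, the following minimal sketch shows one way to load the environment and confirm the value of $EGO_TOP. The profile file name and path are assumptions based on the default installation directory and might differ in your environment:

# Load the EGO environment variables. The profile path is an assumption
# based on the default installation directory; adjust it for your cluster.
source /opt/ibm/spectrumcomputing/profile.platform

# Confirm that EGO_TOP is set.
echo "$EGO_TOP"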

Cluster doesn't add new VMs for pending workloads

This issue occurs when the Symphony queue contains jobs, but the host factory fails to provision new virtual machines (VMs) to manage the load. The host factory log file contains no SCALE-OUT messages.

This issue usually occurs when the Symphony requestor isn't correctly configured or enabled. To resolve the issue, check the status of the configured requestor to verify that it is enabled and that there is a pending workload.

  1. Locate the requestor configuration file. The file is typically located at:

    $HF_TOP/conf/requestors/hostRequestors.json

    The $HF_TOP environment variable is defined in your environment when you use the source command. The value is the path to the top-level installation directory for the IBM Spectrum Symphony host factory service.

  2. Open the hostRequestors.json file and locate the symAinst entry. In that section, verify that the enabled parameter is set to a value of 1 and that the providers list includes the name of your configured Google Cloud provider instance (see the sample snippet after these steps).

    • For Compute Engine configurations, the provider list must show the name of the Compute Engine provider that you created in Enable the provider instance during the Compute Engine provider installation.
    • For GKE configurations, the provider list must show the name of the GKE provider that you created in Enable the provider instance during the GKE provider installation.
  3. After you confirm that the symAinst requestor is enabled, check if a consumer has a pending workload that requires a scale-out.

    View a list of all consumers and their workload status:

    egosh consumer list
  4. In the output, look for the consumer associated with your workload and verify that the workload is pending. If the requestor is enabled and a workload is pending, but the host factory service does not initiate scale-out requests, then check the HostFactory service logs for errors.
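
The following snippet is a hedged illustration of what an enabled symAinst entry might look like in hostRequestors.json. Only the fields relevant to this check are shown, the surrounding structure and field format (string versus list) are assumptions that depend on your installation, and the provider name gcpgceinst is a placeholder for your own provider instance name:

{
    "requestors": [
        {
            "name": "symAinst",
            "enabled": 1,
            "providers": "gcpgceinst"
        }
    ]
}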

Host factory service not starting

If the host factory service doesn't run, follow these steps to resolve the issue:

  1. Check the status of the HostFactory service:

    egosh service list

    In the output, locate the HostFactory service and check that the STATE field shows a status of STARTED.

  2. If the HostFactory service is not started, restart it:

    egosh service stop HostFactory
    egosh service start HostFactory

Other errors and logging

If you encounter other errors with the host factory service, then increase the log verbosity to get more detailed logs. To do so, complete the following steps:

  1. Open the hostfactoryconf.json file for editing. The file is typically located at:

    $EGO_TOP/hostfactory/conf/

    For more information about the value of the $EGO_TOP environment variable, see Symphony host factory service issues.

  2. Update the HF_LOGLEVEL value from LOG_INFO to LOG_DEBUG:

    {
        ...
        "HF_LOGLEVEL": "LOG_DEBUG",
        ...
    }
  3. Save the file after you make the change.

  4. To make the change take effect, restart the HostFactory service:

    egosh service stop HostFactory
    egosh service start HostFactory

After you restart, the HostFactory service generates more detailed logs, which you can use to troubleshoot complex issues. You can view these logs in the main host factory log file, located at $EGO_TOP/hostfactory/log/hostfactory.hostname.log on Linux.
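
To confirm that the service now writes debug-level entries, you can follow the log file. This is a minimal sketch that assumes the file name follows the hostfactory.<hostname>.log pattern shown earlier:

# Follow the host factory log; the file name pattern is an assumption
# based on hostfactory.<hostname>.log.
tail -f "$EGO_TOP/hostfactory/log/hostfactory.$(hostname).log"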

Host factory provider issues

The following issues occur within the host factory provider scripts for Compute Engine or Google Kubernetes Engine.

Check the provider logs (hf-gce.log or hf-gke.log) for detailed error messages. The location of the hf-gce.log and hf-gke.log files is determined by the LOGFILE variable set in the provider's configuration file in Enable the provider instance.

Virtual machine or pod is not provisioned

This issue might occur after the host factory provider logs show a call to the requestMachines.sh script, but the resource doesn't appear in your Google Cloud project.

To resolve this issue, follow these steps:

  1. Check the provider script logs (hf-gce.log or hf-gke.log) for error messages from the Google Cloud API. The location of the hf-gce.log and hf-gke.log files is determined by the LOGFILE variable set in the provider's configuration file in Enable the provider instance.

  2. Verify that the service account has the correct IAM permissions:

    1. Follow the instructions in View current access.
    2. Verify that the service account has the Compute Instance Admin (v1) (roles/compute.instanceAdmin.v1) IAM role on the project (see the example commands after these steps). For more information about how to grant roles, see Manage access to projects, folders, and organizations.
  3. To ensure that the Compute Engine parameters in your host template are valid, you must verify the following:

    1. The host template parameters must be in the gcpgceinstprov_templates.json file that you created when you set up a provider instance during the Compute Engine provider installation. The most common parameters to validate are gcp_zone and gcp_instance_group.

    2. Verify that the instance group set by the gcp_instance_group parameter exists. To confirm the instance group, follow the instructions in View a MIG's properties, by using the gcp_instance_group and gcp_zone values from the template file.
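
If you prefer the command line, the following hedged sketch shows one way to perform the checks in steps 2 and 3 with the gcloud CLI. PROJECT_ID, SA_EMAIL, INSTANCE_GROUP_NAME, and ZONE are placeholders for your own values:

# List the IAM roles granted to the provider's service account on the project.
gcloud projects get-iam-policy PROJECT_ID \
    --flatten="bindings[].members" \
    --filter="bindings.members:serviceAccount:SA_EMAIL" \
    --format="table(bindings.role)"

# Confirm that the managed instance group referenced by gcp_instance_group
# exists in the zone set by gcp_zone.
gcloud compute instance-groups managed describe INSTANCE_GROUP_NAME \
    --zone=ZONE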

Pod gets stuck in Pending or Error state on GKE

This issue might occur after the hf-gke.log file shows that the provider created the GCPSymphonyResource resource, but the corresponding pod in the GKE cluster never reaches a Running state and might show a status like Pending, ImagePullBackOff, or CrashLoopBackOff.

This issue occurs if there is a problem within the Kubernetes cluster, such as an invalid container image name, insufficient CPU or memory resources, or a misconfigured volume or network setting.

To resolve this issue, use kubectl describe to inspect the events for both the custom resource and the pod to identify the root cause:

kubectl describe gcpsymphonyresource RESOURCE_NAME
kubectl describe pod POD_NAME

Replace the following:

  • RESOURCE_NAME: the name of the resource.
  • POD_NAME: the name of the pod.
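
As an additional check, you can list recent events to see scheduling, image pull, or resource errors. This sketch assumes the compute pods run in the gcp-symphony namespace used by other commands in this document:

# Show the most recent events in the gcp-symphony namespace, sorted by time.
kubectl get events -n gcp-symphony --sort-by=.lastTimestamp | tail -n 20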

Troubleshoot Kubernetes operator issues

The Kubernetes operator manages the lifecycle of a Symphony pod. The following sections can help you troubleshoot common issues you might encounter with the operator and its custom resources.

Diagnose issues with resource status fields

The Kubernetes operator manages Symphony workloads in GKE with two primary resource types:

  • The GCPSymphonyResource (GCPSR) resource manages the lifecycle of compute pods for Symphony workloads.
  • The MachineReturnRequest (MRR) resource handles the return and cleanup of compute resources.

Use these status fields to diagnose issues with the GCPSymphonyResource resource:

  • phase: The current lifecycle phase of the resource. The options are Pending, Running, WaitingCleanup, or Completed.
  • availableMachines: The number of compute pods that are ready.
  • conditions: Detailed status conditions with timestamps.
  • returnedMachines: A list of returned pods.

Use these status fields to diagnose issues with the MachineReturnRequest resource (an example query for both resource types follows this list):

  • phase: The current phase of the return request. The options are Pending, InProgress, Completed, Failed, or PartiallyCompleted.
  • totalMachines: The total number of machines to return.
  • returnedMachines: The number of successfully returned machines.
  • failedMachines: The number of machines that failed to return.
  • machineEvents: Per-machine status details.
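
The following hedged sketch shows one way to read these status fields directly with kubectl. The field paths are assumptions based on the fields listed above, and RESOURCE_NAME and MRR_NAME are placeholders; add a namespace flag (for example, -n gcp-symphony) if your resources are namespaced:

# Print the phase and the number of available machines for a GCPSymphonyResource.
kubectl get gcpsymphonyresource RESOURCE_NAME \
    -o jsonpath='{.status.phase}{" "}{.status.availableMachines}{"\n"}'

# Print the return progress of a MachineReturnRequest.
kubectl get mrr MRR_NAME \
    -o jsonpath='{.status.phase}{" "}{.status.returnedMachines}/{.status.totalMachines}{"\n"}'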

GCPSymphonyResource resource stuck in the Pending state

This issue occurs when the GCPSymphonyResource resource remains in the Pending state and the value of availableMachines does not increase.

This issue might occur for one of these reasons:

  • Insufficient node capacity in your cluster.
  • Problems with pulling the container image.
  • Resource quota limitations.

To resolve this issue:

  1. Check the status of the pods to identify any issues with image pulls or resource allocation:

    kubectl describe pods -n gcp-symphony -l symphony.requestId=REQUEST_ID

    Replace REQUEST_ID with your request ID.

  2. Inspect nodes to ensure sufficient capacity:

    kubectl get nodes -o wide
  3. Pods might show a Pending status. This issue usually occurs when the Kubernetes cluster needs to scale up and takes longer than expected. Monitor the nodes to ensure the control plane can scale out, as shown in the sketch that follows.
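
To monitor the scale-up, you can watch node status until the new nodes report Ready and review recent cluster events. A minimal sketch:

# Watch node status until the new nodes report Ready (press Ctrl+C to stop).
kubectl get nodes -w

# Review recent events across all namespaces for scale-up or scheduling messages.
kubectl get events -A --sort-by=.lastTimestamp | tail -n 20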

Pods are not returned

This issue occurs when you create a MachineReturnRequest (MRR), but the number of returnedMachines does not increase.

This issue can occur for these reasons:

  • Pods are stuck in a Terminating state.
  • There are node connectivity issues.

To resolve this issue:

  1. Check for pods stuck in the Terminating state:

    kubectl get pods -n gcp-symphony --field-selector=status.phase=Terminating
  2. Describe the MachineReturnRequest to get details about the return process:

    kubectl describe mrr MRR_NAME -n gcp-symphony

    Replace MRR_NAME with the name of your MachineReturnRequest.

  3. Manually delete the custom resource object. This deletion activates the final cleanup logic:

    kubectl delete gcpsymphonyresource RESOURCE_NAME

    Replace RESOURCE_NAME with the name of the GCPSymphonyResource resource.

High number of failed machines in a MachineReturnRequest

This issue occurs when the failedMachines count in the MachineReturnRequest status is greater than 0. This issue can occur for these reasons:

  • Pod deletion has timed out.
  • A node is unavailable.

To resolve this issue:

  1. Check the machineEvents in the MachineReturnRequest status for specific error messages:

    kubectl describe mrr MRR_NAME -n gcp-symphony
  2. Look for node failure events or control plane performance issues:

    1. Get the status of all nodes:

      kubectl get nodes -o wide
    2. Inspect a specific node:

      kubectl describe node NODE_NAME

Pods are not deleted

This issue occurs when deleted pods are stuck in a Terminating or Error state.

This issue can occur for these reasons:

  • An overwhelmed control plane or operator, which can cause timeouts or API throttling events.
  • The manual deletion of the parent GCPSymphonyResource resource.

To resolve this issue:

  1. Check if the parent GCPSymphonyResource resource is still available and not in the WaitingCleanup state:

    kubectl describe gcpsymphonyresource RESOURCE_NAME
  2. If the parent GCPSymphonyResource resource is no longer on the system, manually remove the finalizer from the pod or pods. The finalizer tells Kubernetes to wait for the Symphony operator to complete its cleanup tasks before Kubernetes fully deletes the pod. First, inspect the YAML configuration to find the finalizer:

    kubectl get pods -n gcp-symphony -l symphony.requestId=REQUEST_ID -o yaml

    Replace REQUEST_ID with the request ID associated with the pods.

  3. In the output, look for the finalizers field within the metadata section. You should see an output similar to this snippet:

    metadata:
      ...
      finalizers:
      - symphony-operator/finalizer
  4. To manually remove the finalizer from the pod or pods, use the kubectl patch command:

    kubectl patch pod -n gcp-symphony -l symphony.requestId=REQUEST_ID --type json -p '[{"op": "remove", "path": "/metadata/finalizers", "value": "symphony-operator/finalizer"}]'

    Replace REQUEST_ID with the request ID associated with the pods.
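
After you remove the finalizer, you can verify that the pods are gone. A quick check, using the same label selector and namespace as the previous commands:

# Verify that no pods with the request ID remain.
kubectl get pods -n gcp-symphony -l symphony.requestId=REQUEST_ID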

Old Symphony resources are not automatically deleted from the GKE cluster

After a workload completes and GKE stops its pods, the associated GCPSymphonyResource and MachineReturnRequest objects remain in your GKE cluster for longer than the expected 24-hour cleanup period.

This issue occurs when a GCPSymphonyResource object lacks the required Completed status condition. The operator's automatic cleanup process depends on this status to remove the object. To resolve this issue, complete the following steps:

  1. Review the details of the GCPSymphonyResource resource in question:

    kubectl get gcpsr GCPSR_NAME -o yaml

    Replace GCPSR_NAME with the name of the GCPSymphonyResource resource with this issue.

  2. In the output, review the conditions for one of type Completed with a status of True:

    status:
      availableMachines: 0
      conditions:
      - lastTransitionTime: "2025-04-14T14:22:40.855099+00:00"
        message: GCPSymphonyResource g555dc430-f1a3-46bb-8b69-5c4c481abc25-2pzvc has
          no pods.
        reason: NoPods
        status: "True"        # This condition will ensure this
        type: Completed       # custom resource is cleaned up by the operator
      phase: WaitingCleanup
      returnedMachines:
      - name: g555dc430-f1a3-46bb-8b69-5c4c481abc25-2pzvc-pod-0
        returnRequestId: 7fd6805f-9a00-41f9-afe9-c38aa35002db
        returnTime: "2025-04-14T14:22:39.373216+00:00"

    If the GCPSymphonyResource details don't show this condition, but phase: WaitingCleanup is shown instead, the Completed event has been lost.

  3. Check for pods associated with the GCPSymphonyResource:

    kubectl get pods -l symphony.requestId=REQUEST_ID

    Replace REQUEST_ID with the request ID.

  4. If no pods exist, safely delete the GCPSymphonyResource resource:

    kubectl delete gcpsr GCPSR_NAME

    Replace GCPSR_NAME with the name of your GCPSymphonyResource.

  5. If pods existed before you deleted the GCPSymphonyResource, then you must delete them as well. If the pods still exist, follow the steps in the Pods are not deleted section.

Pod does not join the Symphony cluster

This issue happens when a pod runs in GKE, but it doesn't appear as a valid host in the Symphony cluster.

This issue occurs if the Symphony software running inside the pod is unable to connect and register with the Symphony primary host. This issue is often due to network connectivity issues or misconfiguration of the Symphony client within the container.

To resolve this issue, check the logs of the Symphony services running insidethe pod.

  1. Use SSH or exec to access the pod and view the logs:

    kubectl exec -it POD_NAME -- /bin/bash

    Replace POD_NAME with the name of the pod.

  2. When you have a shell inside the pod, the logs for the EGO and LIM daemons are located in the $EGO_TOP/kernel/log directory. The $EGO_TOP environment variable points to the root of the IBM Spectrum Symphony installation:

    cd $EGO_TOP/kernel/log

    For more information on the value of the $EGO_TOP environment variable, see Symphony host factory service issues.

  3. Examine the logs for configuration or network errors that block the connection from the GKE pod to the on-premises Symphony primary host.
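
As a starting point for step 3, you can search the log directory for common error keywords. This is a rough sketch; the exact log file names depend on your Symphony version and host name:

# Search the EGO and LIM logs for common error keywords.
grep -riE "error|fail|refused|timeout" "$EGO_TOP/kernel/log" | tail -n 50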

Machine return request fails

This issue might occur during scale-in operations when you create a MachineReturnRequest custom resource, but the object gets stuck, and the operator does not terminate the corresponding Symphony pod.

A failure in the operator's finalizer logic prevents the clean deletion of the pod and its associated custom resource. This problem can lead to orphaned resources and unnecessary costs.

To resolve this issue, manually delete the custom resource, which should activate the operator's cleanup logic:

kubectl delete gcpsymphonyresource RESOURCE_NAME

Replace RESOURCE_NAME with the name of the resource.
