Add Cloud Logging example for Ray on GKE #50060
base:master
Signed-off-by: Wei Zhao <weizhaowz@google.com>
@@ -289,6 +289,71 @@ Finally, use a LogQL query to view logs for a specific RayCluster or RayJob, and
[ConfigLink]: https://raw.githubusercontent.com/ray-project/ray/releases/2.4.0/doc/source/cluster/kubernetes/configs/ray-cluster.log.yaml
[KubernetesDownwardAPI]: https://kubernetes.io/docs/concepts/workloads/pods/downward-api/
### Configure logging sidecar with Fluentbit on GKE
If you want to deploy your ray cluster on GKE and use Cloud Logging, you can read the following steps:\
When you create a cluster on GKE using these [instructions](https://github.com/GoogleCloudPlatform/ai-on-gke/blob/main/ray-on-gke/README.md),
I would remove this link and only refer to https://cloud.google.com/kubernetes-engine/docs/add-on/ray-on-gke/how-to/collect-view-logs-metrics, which already has steps for cluster creation.
One concern is that the cluster creation command in that doc is outdated. For example, `--location` is required but missing from the command, and the sample value (1.30.2-gke.1060005) for `--cluster-version` causes an internal error with no detailed message. Enabling Autopilot might provide a better cluster configuration for running a Ray cluster. So I use the doc as a reference for querying logs only. Please let me know if we need to update the Google doc in parallel so that we can also use it as a reference for creating a cluster.
Let's update the Google docs in parallel. In general, we should only reference gcloud for cluster creation examples here, not ai-on-gke (which uses Terraform).
Yes, the Google docs update is ongoing.
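For reference, a gcloud-only cluster creation along the lines discussed in this thread might look like the sketch below. It assumes an Autopilot cluster with the GKE Ray operator add-on. The function name `create_ray_cluster`, the cluster name, and the region are hypothetical, and the `--enable-ray-operator` and `--enable-ray-cluster-logging` flags are assumptions based on the GKE Ray add-on documentation, so check `gcloud container clusters create-auto --help` before relying on them.

```shell
# Sketch only: placeholder names; flags assumed from the GKE Ray add-on docs.
# Set GCLOUD=echo for a dry run that only prints the arguments.
GCLOUD="${GCLOUD:-gcloud}"

create_ray_cluster() {
  cluster_name="$1"   # hypothetical cluster name
  location="$2"       # region such as us-central1; passed explicitly via --location
  "$GCLOUD" container clusters create-auto "$cluster_name" \
    --location="$location" \
    --enable-ray-operator \
    --enable-ray-cluster-logging
}

# Example (not run here): create_ray_cluster my-ray-cluster us-central1
```

Wrapping the call in a function with an overridable `GCLOUD` variable keeps the sketch inspectable without touching a real project.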
If you don't see the logs in GCP Logs Explorer, below is some debugging information.
#### Verify Fluenbit sidecar and Daemonset
When the Ray cluster is created on GEK using the above instructions, a Fluentbit sidecar container should be ready in the Pod and collecting logs from the Ray container.
GEK -> GKE
fixed
Also, Daemonset Fluentbit pods are ready to forward the logs to Cloud Logging as well, you can use these commands to verify it.
* Get the name of the pod. You may need to modify the namespace if you've modified the terraform file in the instruction.
```shell
kubectl get pods -n ai-on-gke
```
Instead of running `kubectl get pods`, can you show the section of the Pod manifest that contains the sidecar (when running `kubectl get pods -o yaml`)?
changed
@@ -289,6 +289,71 @@ Finally, use a LogQL query to view logs for a specific RayCluster or RayJob, and
[ConfigLink]: https://raw.githubusercontent.com/ray-project/ray/releases/2.4.0/doc/source/cluster/kubernetes/configs/ray-cluster.log.yaml
[KubernetesDownwardAPI]: https://kubernetes.io/docs/concepts/workloads/pods/downward-api/
### Configure logging sidecar with Fluentbit on GKE
If you want to deploy your Ray cluster on GKE and use Cloud Logging, you can read the following steps:
When you create a cluster on GKE using these [instructions](https://github.com/GoogleCloudPlatform/ai-on-gke/blob/main/ray-on-gke/README.md),
Suggest removing reference to ai-on-gke here for now
removed
For example, if you submit a Ray job as described in the instructions, you can follow this [document](https://cloud.google.com/kubernetes-engine/docs/add-on/ray-on-gke/how-to/collect-view-logs-metrics#view_ray_logs) to read the job's logs.
If you don't see the logs in GCP Logs Explorer, below is some debugging information.
#### Verify the Fluenbit sidecar and Daemonset
I think just "Verify the Fluenbit sidecar" is fine
updated
DaemonSet Fluentbit pods should also be ready to forward the logs to Cloud Logging. You can use these commands to verify this.
* Get the name of the pod. You may need to modify the namespace if you've modified the terraform file in the instruction.
```shell
kubectl get pods -n ai-on-gke -o yaml
```
Can you remove `-n ai-on-gke`? Assuming namespace `default` in these guides is usually fine.
updated
* Verify that a Fluentbit sidecar is present in the Pod.
```shell
kubectl get pod <pod-name> -n ai-on-gke -o go-template='{{range .spec.containers}}{{.name}}{{"\n"}}{{end}}'
```
Same here, remove `-n ai-on-gke`.
changed
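As a scripted variant of the sidecar check in this thread, the sketch below tests whether a container with a given name appears in a Pod spec. The helper names `list_containers` and `has_container`, and the container name `fluentbit`, are assumptions for illustration. The jsonpath query is an alternative to the go-template in the diff and uses the default namespace, as suggested in the review.

```shell
# Set KUBECTL to another binary or wrapper if needed.
KUBECTL="${KUBECTL:-kubectl}"

# Print one container name per line for the given Pod (default namespace).
list_containers() {
  "$KUBECTL" get pod "$1" -o jsonpath='{range .spec.containers[*]}{.name}{"\n"}{end}'
}

# Read container names on stdin; succeed if one of them equals $1 exactly.
has_container() {
  grep -Fqx -- "$1"
}

# Example (not run here):
#   list_containers <pod-name> | has_container fluentbit && echo "sidecar present"
```

Splitting the lookup from the match keeps the matching step usable on any list of names, for example output saved from an earlier `kubectl` call.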
* Verify that the Fluentbit sidecar has collected logs from the Ray container.
```shell
kubectl logs pod <pod-name> -n ai-on-gke -c fluentbit
```
Likewise, remove `-n ai-on-gke`.
changed
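A hedged sketch of the log check follows. `kubectl logs` takes the Pod name directly (or as `pod/<pod-name>`), so the helper passes the name without the separate `pod` word that appears in the diff. The function name `sidecar_logs`, the default container name `fluentbit`, and the `--tail=20` limit are assumptions for illustration.

```shell
# Set KUBECTL=echo for a dry run that only prints the arguments.
KUBECTL="${KUBECTL:-kubectl}"

# Show the most recent log lines from a sidecar container in a Pod
# (default namespace).
sidecar_logs() {
  pod="$1"
  container="${2:-fluentbit}"   # assumed sidecar container name
  "$KUBECTL" logs "$pod" -c "$container" --tail=20
}

# Example (not run here): sidecar_logs <pod-name>
```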
* Verify that a Fluentbit DaemonSet is ready to forward logs to Cloud Logging.
I would suggest removing the details about the Fluentbit DaemonSet. It's not important here, and it's already covered by verifying the GKE logging configuration below.
removed
Why are these changes needed?
Add a Cloud Logging tutorial for Ray on GKE, including instructions for creating a Ray cluster and debugging information.