Debug NeMo RL Applications#

This guide explains how to debug NeMo RL applications, covering two scenarios. It first outlines the procedure for debugging distributed Ray worker/actor processes using the Ray Distributed Debugger within a SLURM environment, and then details debugging the main driver script.

Debug Worker/Actors on SLURM#

Since Ray programs can spawn multiple workers and actors, using the Ray Distributed Debugger is essential to accurately jump to breakpoints on each worker.

Prerequisites#

  • Install the Ray Debugger VS Code/Cursor extension.

  • Launch the interactive cluster with ray.sub.

  • Launch VS Code/Cursor on the SLURM login node (where squeue/sbatch are available).

  • Add breakpoint() calls inside your actors and tasks (i.e., classes or functions decorated with @ray.remote).

  • Ensure RAY_DEBUG=legacy is not set, since this workflow requires the default distributed debugger.
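The last two prerequisites can be sketched in Python as follows. This is a minimal illustration, not code from NeMo RL: the helper `distributed_debugger_enabled`, the task `double`, and the `main` wrapper are hypothetical names, and the bare `ray.init()` call assumes the script runs somewhere Ray can locate the cluster started by ray.sub.

```python
import os


def distributed_debugger_enabled(env=None):
    """Return True unless RAY_DEBUG=legacy is set in the given environment."""
    env = os.environ if env is None else env
    return env.get("RAY_DEBUG") != "legacy"


def main():
    # Imported here so the helper above also works where Ray is not installed.
    import ray

    if not distributed_debugger_enabled():
        raise SystemExit("Unset RAY_DEBUG=legacy to use the distributed debugger.")

    # Assumption: Ray can discover the running cluster from this process.
    ray.init()

    @ray.remote
    def double(x):  # hypothetical task, for illustration only
        breakpoint()  # the worker pauses here until a debugger attaches
        return 2 * x

    print(ray.get(double.remote(21)))
```

Calling `main()` from a driver script on the cluster should leave the task paused at `breakpoint()` until you attach to it from the extension.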

Forward a Port from the Head Node#

From the SLURM login node, query the nodes used by the interactive ray.sub job as follows:

    teryk@slurm-login:~$ squeue --me
      JOBID PARTITION        NAME   USER ST  TIME NODES NODELIST(REASON)
    2504248     batch ray-cluster terryk  R 15:01     4 node-12,node-[22,30],node-49

The first node is always the head node, so we need to port forward the dashboard port to the login node:

    # Traffic from the login node's $LOCAL_PORT is forwarded to node-12:$DASHBOARD_PORT
    # - If you haven't changed the default DASHBOARD_PORT in ray.sub, it is likely 8265
    # - Choose a LOCAL_PORT that isn't taken. If the cluster is multi-tenant, 8265
    #   on the login node is likely taken by someone else.
    ssh -L $LOCAL_PORT:localhost:$DASHBOARD_PORT -N node-12

    # Example choosing a port other than 8265 for the LOCAL_PORT
    ssh -L 52640:localhost:8265 -N node-12

The ssh port-forwarding command may print output like the following; the warning is expected.

    Warning: Permanently added 'node-12' (ED25519) to the list of known hosts.
    bind [::1]:52640: Cannot assign requested address
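The port-forwarding steps above can also be scripted. The following is a rough sketch, not part of ray.sub: `head_node`, `find_free_port`, and `build_forward_command` are hypothetical helpers, and the NODELIST parser only handles the common forms shown here (a plain hostname, or a prefix followed by a bracketed list or range).

```python
import socket


def head_node(nodelist):
    """Return the first hostname in a SLURM NODELIST string."""
    nodelist = nodelist.strip()
    bracket = nodelist.find("[")
    comma = nodelist.find(",")
    if bracket == -1 or (comma != -1 and comma < bracket):
        # The first entry is a plain hostname such as "node-12".
        return nodelist.split(",", 1)[0]
    # The first entry is bracketed, e.g. "node-[22,30]" or "node-[03-05]".
    prefix = nodelist[:bracket]
    inside = nodelist[bracket + 1 : nodelist.index("]", bracket)]
    first_item = inside.split(",", 1)[0].split("-", 1)[0]
    return prefix + first_item


def find_free_port():
    """Ask the OS for a local port that is unused right now."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind(("127.0.0.1", 0))
        return sock.getsockname()[1]


def build_forward_command(local_port, dashboard_port, node):
    """Build the ssh command that forwards local_port to the dashboard port."""
    return ["ssh", "-L", f"{local_port}:localhost:{dashboard_port}", "-N", node]


nodelist = "node-12,node-[22,30],node-49"
command = build_forward_command(52640, 8265, head_node(nodelist))
print(" ".join(command))  # ssh -L 52640:localhost:8265 -N node-12
```

On a multi-tenant login node, `find_free_port()` can replace the hard-coded 52640. Another user could still claim the port between the check and the ssh call, so treat the result as a suggestion rather than a guarantee.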

Open the Ray Debugger Extension#

In VS Code or Cursor, open the Ray Debugger extension by clicking the Ray icon in the activity bar or searching for “View: Show Ray Debugger” in the Command Palette (Ctrl+Shift+P or Cmd+Shift+P).

Ray Debugger Extension Step 1

Add the Ray Cluster#

Click on the “Add Cluster” button in the Ray Debugger panel.

Ray Debugger Extension Step 2

Enter the address and port you set up in the port forwarding step. If you followed the example above using port 52640, you would enter 127.0.0.1:52640:

Ray Debugger Extension Step 3
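If the extension cannot reach the cluster, it may help to confirm that the forwarded port accepts connections on the login node. The sketch below is one way to check; `port_is_open` is a hypothetical helper, and it only confirms that something is listening on the port, not that the listener is the Ray dashboard.

```python
import socket


def port_is_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

For the example above, `port_is_open("127.0.0.1", 52640)` should return True while the ssh tunnel is running.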

Add a Breakpoint and Run Your Program#

The Ray Debugger panel for cluster 127.0.0.1:52640 lists all active breakpoints. To begin debugging, select a breakpoint from the dropdown and click Start Debugging to jump to that worker.

Note that you can jump between breakpoints across all workers with this process.

Ray Debugger Extension Step 4

Debug with the Legacy Ray Debugger#

There are two ways to use the legacy Ray debugger:

  1. In general, set RAY_DEBUG=legacy and add --ray-debugger-external to your ray start command.

  2. If you are using ray.sub on a SLURM cluster, simply set RAY_DEBUG=legacy before running sbatch ray.sub; the script detects this environment variable and adds --ray-debugger-external automatically.
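The detection logic described in the second option can be pictured roughly as follows. This is an illustrative sketch only, not the actual contents of ray.sub: `ray_start_args` is a hypothetical helper that appends the flag when RAY_DEBUG=legacy is present in the environment.

```python
import os


def ray_start_args(base_args, env=None):
    """Append --ray-debugger-external when RAY_DEBUG=legacy is set."""
    env = os.environ if env is None else env
    args = list(base_args)  # copy so the caller's list is not modified
    if env.get("RAY_DEBUG") == "legacy" and "--ray-debugger-external" not in args:
        args.append("--ray-debugger-external")
    return args


print(ray_start_args(["ray", "start", "--head"], {"RAY_DEBUG": "legacy"}))
# ['ray', 'start', '--head', '--ray-debugger-external']
```

Without RAY_DEBUG=legacy in the environment, the arguments are returned unchanged, which corresponds to the default distributed debugger setup.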

After you start Ray with these changes, add breakpoint() calls to your code. When you run the program, it stops wherever a breakpoint is inserted. Then, from a separate terminal, attach to the head node with bash <JOB_ID>-attach.sh (ray.sub should generate this script automatically) and run ray debug to see all the breakpoints. You can enter any breakpoint and debug interactively. Refer to the Ray documentation for more information on this debugging approach.