This article demonstrates how to debug an application that uses C++ Accelerated Massive Parallelism (C++ AMP) to take advantage of the graphics processing unit (GPU). It uses a parallel-reduction program that sums up a large array of integers. This walkthrough illustrates the following tasks:
- Launching the GPU debugger
- Inspecting GPU threads in the GPU Threads window
- Using the Parallel Stacks window to simultaneously observe the call stacks of multiple GPU threads
- Using the Parallel Watch window to inspect values of one expression across multiple threads at the same time
- Flagging, freezing, thawing, and grouping GPU threads
- Running all the threads of a tile to a specific location in code
Before you start this walkthrough:
- Read C++ AMP Overview.
- Make sure that line numbers are displayed in the code editor, because the steps below refer to specific lines of the sample.
Note
C++ AMP headers are deprecated starting with Visual Studio 2022 version 17.0. Including any AMP headers will generate build errors. Define _SILENCE_AMP_DEPRECATION_WARNINGS before including any AMP headers to silence the warnings.
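For example, a translation unit that still uses the deprecated headers might begin like this (a minimal sketch; the macro must appear before the first AMP include):

```cpp
// Acknowledge the deprecation so that <amp.h> still compiles
// in Visual Studio 2022 version 17.0 and later.
#define _SILENCE_AMP_DEPRECATION_WARNINGS
#include <amp.h>
```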
Note
Your computer might show different names or locations for some of the Visual Studio user interface elements in the following instructions. The Visual Studio edition that you have and the settings that you use determine these elements. For more information, see Personalizing the IDE.
The instructions for creating a project vary depending on which version of Visual Studio you're using. Make sure you have the correct documentation version selected above the table of contents on this page.
In Visual Studio 2019 and later:
On the menu bar, choose File > New > Project to open the Create a New Project dialog box.
At the top of the dialog, set Language to C++, set Platform to Windows, and set Project type to Console.
From the filtered list of project types, choose Console App and then choose Next. On the next page, enter AMPMapReduce in the Name box to specify a name for the project, and specify the project location if you want a different one.
Choose the Create button to create the client project.
In Visual Studio 2017:
Start Visual Studio.
On the menu bar, choose File > New > Project.
Under Installed in the templates pane, choose Visual C++.
Choose Win32 Console Application, type AMPMapReduce in the Name box, and then choose the OK button.
Choose the Next button.
Clear the Precompiled header check box, and then choose the Finish button.
In Solution Explorer, delete stdafx.h, targetver.h, and stdafx.cpp from the project.
Next, open AMPMapReduce.cpp and replace its content with the following code.
```cpp
// AMPMapReduce.cpp defines the entry point for the program.
// The program performs a parallel-sum reduction that computes the sum of an array of integers.

#include <stdio.h>
#include <tchar.h>
#include <amp.h>

const int BLOCK_DIM = 32;

using namespace concurrency;

void sum_kernel_tiled(tiled_index<BLOCK_DIM> t_idx, array<int, 1> &A, int stride_size) restrict(amp)
{
    tile_static int localA[BLOCK_DIM];

    index<1> globalIdx = t_idx.global * stride_size;
    index<1> localIdx = t_idx.local;

    localA[localIdx[0]] = A[globalIdx];

    t_idx.barrier.wait();

    // Aggregate all elements in one tile into the first element.
    for (int i = BLOCK_DIM / 2; i > 0; i /= 2)
    {
        if (localIdx[0] < i)
        {
            localA[localIdx[0]] += localA[localIdx[0] + i];
        }

        t_idx.barrier.wait();
    }

    if (localIdx[0] == 0)
    {
        A[globalIdx] = localA[0];
    }
}

int size_after_padding(int n)
{
    // The extent might have to be slightly bigger than num_stride to
    // be evenly divisible by BLOCK_DIM. You can do this by padding with zeros.
    // The calculation to do this is BLOCK_DIM * ceil(n / BLOCK_DIM)
    return ((n - 1) / BLOCK_DIM + 1) * BLOCK_DIM;
}

int reduction_sum_gpu_kernel(array<int, 1> input)
{
    int len = input.extent[0];

    // Tree-based reduction control that uses the CPU.
    for (int stride_size = 1; stride_size < len; stride_size *= BLOCK_DIM)
    {
        // Number of useful values in the array, given the current
        // stride size.
        int num_strides = len / stride_size;

        extent<1> e(size_after_padding(num_strides));

        // The sum kernel that uses the GPU.
        parallel_for_each(extent<1>(e).tile<BLOCK_DIM>(), [&input, stride_size] (tiled_index<BLOCK_DIM> idx) restrict(amp)
        {
            sum_kernel_tiled(idx, input, stride_size);
        });
    }

    array_view<int, 1> output = input.section(extent<1>(1));
    return output[0];
}

int cpu_sum(const std::vector<int> &arr) {
    int sum = 0;
    for (size_t i = 0; i < arr.size(); i++) {
        sum += arr[i];
    }
    return sum;
}

std::vector<int> rand_vector(unsigned int size) {
    srand(2011);

    std::vector<int> vec(size);
    for (size_t i = 0; i < size; i++) {
        vec[i] = rand();
    }
    return vec;
}

array<int, 1> vector_to_array(const std::vector<int> &vec) {
    array<int, 1> arr(vec.size());
    copy(vec.begin(), vec.end(), arr);
    return arr;
}

int _tmain(int argc, _TCHAR* argv[])
{
    std::vector<int> vec = rand_vector(10000);
    array<int, 1> arr = vector_to_array(vec);

    int expected = cpu_sum(vec);
    int actual = reduction_sum_gpu_kernel(arr);

    bool passed = (expected == actual);
    if (!passed) {
        printf("Actual (GPU): %d, Expected (CPU): %d", actual, expected);
    }
    printf("sum: %s\n", passed ? "Passed!" : "Failed!");

    getchar();

    return 0;
}
```
On the menu bar, choose File > Save All.
In Solution Explorer, open the shortcut menu for AMPMapReduce, and then choose Properties.
In the Property Pages dialog box, under Configuration Properties, choose C/C++ > Precompiled Headers.
For the Precompiled Header property, select Not Using Precompiled Headers, and then choose the OK button.
On the menu bar, choose Build > Build Solution.
In this procedure, you'll use the Local Windows Debugger to make sure that the CPU code in this application is correct. The segment of the CPU code that is especially interesting is the for loop in the reduction_sum_gpu_kernel function. It controls the tree-based parallel reduction that is run on the GPU.
In Solution Explorer, open the shortcut menu for AMPMapReduce, and then choose Properties.
In the Property Pages dialog box, under Configuration Properties, choose Debugging. Verify that Local Windows Debugger is selected in the Debugger to launch list.
Return to the Code Editor.
Set breakpoints on the lines of code shown in the following illustration (approximately lines 67 and 70).
CPU breakpoints
On the menu bar, choose Debug > Start Debugging.
In the Locals window, observe the value for stride_size until the breakpoint at line 70 is reached.
On the menu bar, choose Debug > Stop Debugging.
This section shows how to debug the GPU code, which is the code contained in the sum_kernel_tiled function. The GPU code computes the sum of integers for each "block" in parallel.
In Solution Explorer, open the shortcut menu for AMPMapReduce, and then choose Properties.
In the Property Pages dialog box, under Configuration Properties, choose Debugging.
In the Debugger to launch list, select Local Windows Debugger.
In the Debugger Type list, verify that Auto is selected.
Auto is the default value. In versions before Windows 10, GPU Only is the required value instead of Auto.
Choose the OK button.
Set a breakpoint at line 30, as shown in the following illustration.
GPU breakpoint
On the menu bar, choose Debug > Start Debugging. The breakpoints in the CPU code at lines 67 and 70 don't get executed during GPU debugging because those lines of code run on the CPU.
To open the GPU Threads window, on the menu bar, choose Debug > Windows > GPU Threads.
You can inspect the state of the GPU threads in the GPU Threads window that appears.
Dock the GPU Threads window at the bottom of Visual Studio. Choose the Expand Thread Switch button to display the tile and thread text boxes. The GPU Threads window shows the total number of active and blocked GPU threads, as shown in the following illustration.
GPU Threads window
313 tiles get allocated for this computation. Each tile contains 32 threads. Because local GPU debugging occurs on a software emulator, there are four active GPU threads. The four threads execute the instructions simultaneously and then move on together to the next instruction.
In the GPU Threads window, there are four GPU threads active and 28 GPU threads blocked at the tile_barrier::wait statement defined at about line 21 (t_idx.barrier.wait();). All 32 GPU threads belong to the first tile, tile[0]. An arrow points to the row that includes the current thread. To switch to a different thread, use one of the following methods:
In the row for the thread to switch to in the GPU Threads window, open the shortcut menu and choose Switch To Thread. If the row represents more than one thread, you'll switch to the first thread according to the thread coordinates.
Enter the tile and thread values of the thread in the corresponding text boxes and then choose the Switch Thread button.
The Call Stack window displays the call stack of the current GPU thread.
To open the Parallel Stacks window, on the menu bar, choose Debug > Windows > Parallel Stacks.
You can use the Parallel Stacks window to simultaneously inspect the stack frames of multiple GPU threads.
Dock the Parallel Stacks window at the bottom of Visual Studio.
Make sure that Threads is selected in the list in the upper-left corner. In the following illustration, the Parallel Stacks window shows a call-stack focused view of the GPU threads that you saw in the GPU Threads window.
Parallel Stacks window
32 threads went from _kernel_stub to the lambda statement in the parallel_for_each function call and then to the sum_kernel_tiled function, where the parallel reduction occurs. 28 of the 32 threads have progressed to the tile_barrier::wait statement and remain blocked at line 22, while the other four threads remain active in the sum_kernel_tiled function at line 30.
You can inspect the properties of a GPU thread. They're available in the GPU Threads window and in the rich DataTip of the Parallel Stacks window. To see them, hover the pointer over the stack frame of sum_kernel_tiled. The following illustration shows the DataTip.
GPU thread DataTip
For more information about the Parallel Stacks window, see Using the Parallel Stacks Window.
To open the Parallel Watch window, on the menu bar, choose Debug > Windows > Parallel Watch > Parallel Watch 1.
You can use the Parallel Watch window to inspect the values of an expression across multiple threads.
Dock the Parallel Watch 1 window at the bottom of Visual Studio. There are 32 rows in the table of the Parallel Watch window. Each corresponds to a GPU thread that appeared in both the GPU Threads window and the Parallel Stacks window. Now, you can enter expressions whose values you want to inspect across all 32 GPU threads.
Select the Add Watch column header, enter localIdx, and then choose the Enter key.
Select the Add Watch column header again, enter globalIdx, and then choose the Enter key.
Select the Add Watch column header again, enter localA[localIdx[0]], and then choose the Enter key.
You can sort by a specified expression by selecting its corresponding column header.
Select the localA[localIdx[0]] column header to sort the column. The following illustration shows the results of sorting by localA[localIdx[0]].
Results of sort
You can export the content of the Parallel Watch window to Excel by choosing the Excel button and then choosing Open in Excel. If you have Excel installed on your development computer, the button opens an Excel worksheet that contains the content.
In the upper-right corner of the Parallel Watch window, there's a filter control that you can use to filter the content by using Boolean expressions. Enter localA[localIdx[0]] > 20000 in the filter control text box and then choose the Enter key.
The window now contains only threads on which the localA[localIdx[0]] value is greater than 20000. The content is still sorted by the localA[localIdx[0]] column, which is the sorting action you chose earlier.
You can mark specific GPU threads by flagging them in the GPU Threads window, the Parallel Watch window, or the DataTip in the Parallel Stacks window. If a row in the GPU Threads window contains more than one thread, flagging that row flags all threads contained in the row.
Select the [Thread] column header in the Parallel Watch 1 window to sort by tile index and thread index.
On the menu bar, choose Debug > Continue, which causes the four threads that were active to progress to the next barrier (defined at line 32 of AMPMapReduce.cpp).
Choose the flag symbol on the left side of the row that contains the four threads that are now active.
The following illustration shows the four active flagged threads in the GPU Threads window.
Active threads in the GPU Threads window
The Parallel Watch window and the DataTip of the Parallel Stacks window both indicate the flagged threads.
If you want to focus on the four threads that you flagged, you can choose to show only the flagged threads. It limits what you see in the GPU Threads, Parallel Watch, and Parallel Stacks windows.
Choose the Show Flagged Only button on any of the windows or on the Debug Location toolbar. The following illustration shows the Show Flagged Only button on the Debug Location toolbar.
Show Flagged Only button
Now the GPU Threads, Parallel Watch, and Parallel Stacks windows display only the flagged threads.
You can freeze (suspend) and thaw (resume) GPU threads from either the GPU Threads window or the Parallel Watch window. You can freeze and thaw CPU threads the same way; for information, see How to: Use the Threads Window.
Choose the Show Flagged Only button to display all the threads.
On the menu bar, choose Debug > Continue.
Open the shortcut menu for the active row and then choose Freeze.
The following illustration of the GPU Threads window shows that all four threads are frozen.
Frozen threads in the GPU Threads window
Similarly, the Parallel Watch window shows that all four threads are frozen.
On the menu bar, choose Debug > Continue to allow the next four GPU threads to progress past the barrier at line 22 and to reach the breakpoint at line 30. The GPU Threads window shows that the four previously frozen threads remain frozen and in the active state.
On the menu bar, choose Debug > Continue.
From the Parallel Watch window, you can also thaw individual or multiple GPU threads.
On the shortcut menu for one of the threads in the GPU Threads window, choose Group By > Address.
The threads in the GPU Threads window are grouped by address. The address corresponds to the instruction in disassembly where each group of threads is located. 24 threads are at line 22, where the tile_barrier::wait method is executed. 12 threads are at the instruction for the barrier at line 32; four of these threads are flagged. Eight threads are at the breakpoint at line 30; four of these threads are frozen. The following illustration shows the grouped threads in the GPU Threads window.
Grouped threads in the GPU Threads window
You can also do the Group By operation by opening the shortcut menu for the Parallel Watch window's data grid. Select Group By, and then choose the menu item that corresponds to how you want to group the threads.
You can run all the threads in a given tile to the line that contains the cursor by using Run Current Tile To Cursor.
On the shortcut menu for the frozen threads, choose Thaw.
In the Code Editor, put the cursor on line 30.
On the shortcut menu for the Code Editor, choose Run Current Tile To Cursor.
The 24 threads that were previously blocked at the barrier at line 21 have progressed to line 32, as shown in the GPU Threads window.
See also:
C++ AMP overview
Debugging GPU code
How to: Use the GPU Threads window
How to: Use the Parallel Watch window
Analyzing C++ AMP code with the Concurrency Visualizer