Releases: NVIDIA/cub

This repository was archived by the owner on Mar 21, 2024. It is now read-only.

CUB 2.1.0

08 Mar 22:03

Breaking Changes

  • #553: Deprecate the CUB_USE_COOPERATIVE_GROUPS macro, as all supported CTK distributions provide CG. This macro will be removed in a future version of CUB.

New Features

  • #359: Add new DeviceBatchMemcpy algorithm.
  • #565: Add DeviceMergeSort::StableSortKeysCopy API. Thanks to David Wendt (@davidwendt) for this contribution.
  • #585: Add SM90 tuning policy for DeviceRadixSort. Thanks to Andy Adinets (@canonizer) for this contribution.
  • #586: Introduce a new mechanism to opt out of compiling CDP support in CUB algorithms by defining CUB_DISABLE_CDP (see the sketch after this list).
  • #589: Support 64-bit indexing in DeviceReduce.
  • #607: Support 128-bit integers in radix sort.
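
A minimal sketch of the #586 opt-out. The only assumption is that the macro must be visible before any CUB header is included (passing -DCUB_DISABLE_CDP on the compile line is equivalent):

    // Opt out of CUDA Dynamic Parallelism (CDP) support in CUB algorithms
    // by defining CUB_DISABLE_CDP before including any CUB header.
    #define CUB_DISABLE_CDP
    #include <cub/cub.cuh>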

Bug Fixes

  • #547: Resolve several long-running issues resulting from using multiple versions of CUB within the same process. Adds an inline namespace that encodes CUB version and targeted PTX architectures.
  • #562: Fix bug in BlockShuffle resulting from an invalid thread offset. Thanks to @sjfeng1999 for this contribution.
  • #564: Fix bug in BlockRadixRank when used with blocks that are not a multiple of 32 threads.
  • #579: Ensure that all threads in the logical warp participate in the index-shuffle for BlockRadixRank. Thanks to Andy Adinets (@canonizer) for this contribution.
  • #582: Fix reordering in CUB member initializer lists.
  • #589: Fix DeviceSegmentedSort when used with bool keys.
  • #590: Fix CUB’s CMake install rules. Thanks to Robert Maynard (@robertmaynard) for this contribution.
  • #592: Fix overflow in DeviceReduce.
  • #598: Fix DeviceRunLengthEncode when the first item is a NaN.
  • #611: Fix WarpScanExclusive for vector types.

Contributors

  • @robertmaynard
  • @canonizer
  • @upsj
  • @sjfeng1999
  • @davidwendt

CUB 2.0.1

08 Mar 21:50

Other Enhancements

  • Skip device-side synchronization on SM90+. These syncs are a debugging-only feature and are not required for correctness; a warning is emitted whenever a synchronization is skipped.

CUB 1.17.2

13 Sep 17:05

Summary

CUB 1.17.2 is a minor bugfix release.

  • #547: Introduce an annotated inline namespace to prevent issues with
    collisions and mismatched kernel configurations across libraries. The new
    namespace encodes the CUB version and target SM architectures.
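
The mechanism, roughly, in standard C++; the namespace name below is a made-up placeholder, since the real one is generated from the CUB version and target SM architectures:

    // Hypothetical illustration -- "CUB_101702_SM_70" stands in for the
    // generated name.
    namespace cub {
    inline namespace CUB_101702_SM_70 {
    struct DeviceReduce { /* ... */ };
    }  // inline namespace CUB_101702_SM_70
    }  // namespace cub

    // Callers still write cub::DeviceReduce, but the mangled symbol includes
    // the inline namespace, so two libraries built against different CUB
    // versions or architectures no longer collide at link time.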

CUB 2.0.0

15 Aug 16:53

Summary

The CUB 2.0.0 major release adds a dependency on libcu++ and contains several breaking changes. These include new diagnostics when inspecting device-only lambdas from the host, an updated method of determining accumulator types for algorithms like Reduce and Scan, and a compile-time replacement for the runtime debug_synchronous debugging flags.

This release also includes several new features. DeviceHistogram now supports __half and better handles various edge cases. WarpReduce now performs correctly when restricted to a single-thread “warp”, and will use the __reduce_add_sync accelerated intrinsic (introduced with Ampere) when appropriate. DeviceRadixSort learned to handle the case where begin_bit == end_bit.

Several algorithms also have updated documentation, with a particular focus on clarifying which operations can and cannot be performed in-place.

Breaking Changes

  • #448: Add libcu++ dependency (v1.8.0+).
  • #448: The following macros are no longer defined by default. They can be re-enabled by defining CUB_PROVIDE_LEGACY_ARCH_MACROS. These will be completely removed in a future release.
    • CUB_IS_HOST_CODE: Replace with NV_IF_TARGET.
    • CUB_IS_DEVICE_CODE: Replace with NV_IF_TARGET.
    • CUB_INCLUDE_HOST_CODE: Replace with NV_IF_TARGET.
    • CUB_INCLUDE_DEVICE_CODE: Replace with NV_IF_TARGET.
  • #486: CUB’s CUDA Runtime support macros have been updated to support NV_IF_TARGET. They are now defined consistently across all host/device compilation passes. This should not affect most usages of these macros, but may require changes for some edge cases.
    • CUB_RUNTIME_FUNCTION: Execution space annotations for functions that invoke CUDA Runtime APIs.
      • Old behavior:
        • RDC enabled: Defined to __host__ __device__
        • RDC not enabled:
          • NVCC host pass: Defined to __host__ __device__
          • NVCC device pass: Defined to __host__
      • New behavior:
        • RDC enabled: Defined to __host__ __device__
        • RDC not enabled: Defined to __host__
    • CUB_RUNTIME_ENABLED: No change in behavior, but no longer used in CUB. Provided for legacy support only. Legacy behavior:
      • RDC enabled: Macro is defined.
      • RDC not enabled:
        • NVCC host pass: Macro is defined.
        • NVCC device pass: Macro is not defined.
    • CUB_RDC_ENABLED: New macro, may be combined with NV_IF_TARGET to replace most usages of CUB_RUNTIME_ENABLED. Behavior:
      • RDC enabled: Macro is defined.
      • RDC not enabled: Macro is not defined.
  • #509: A compile-time error is now emitted when a __device__-only lambda’s return type is queried from host code (requires libcu++ ≥ 1.9.0).
    • Due to limitations in the CUDA programming model, the result of this query is unreliable and will silently return an incorrect result. This leads to difficult-to-debug errors.
    • When using libcu++ 1.9.0, an error will be emitted with information about work-arounds:
      • Use a named function object with a __device__-only implementation of operator().
      • Use a __host__ __device__ lambda.
      • Use cuda::proclaim_return_type (added in libcu++ 1.9.0); see the sketch after this list.
  • #509: Use the result type of the binary reduction operator for accumulating intermediate results in the DeviceReduce algorithm, following guidance from http://wg21.link/P2322R6.
    • This change requires host-side introspection of the binary operator’s signature, and device-only extended lambda functions can no longer be used.
    • In addition to the behavioral changes, the interfaces for the Dispatch*Reduce layer have changed:
      • DispatchReduce:
        • Now accepts accumulator type as last parameter.
        • Now accepts initializer type instead of output iterator value type.
        • Constructor now accepts init as initial type instead of output iterator value type.
      • DispatchSegmentedReduce:
        • Accepts accumulator type as last parameter.
        • Accepts initializer type instead of output iterator value type.
    • Thread operators now accept parameters using different types: Equality, Inequality, InequalityWrapper, Sum, Difference, Division, Max, ArgMax, Min, ArgMin.
    • ThreadReduce now accepts accumulator type and uses a different type for prefix.
  • #511: Use the result type of the binary operator for accumulating intermediate results in the DeviceScan, DeviceScanByKey, and DeviceReduceByKey algorithms, following guidance from http://wg21.link/P2322R6.
    • This change requires host-side introspection of the binary operator’s signature, and device-only extended lambda functions can no longer be used.
    • In addition to the behavioral changes, the interfaces for the Dispatch layer have changed:
      • DispatchScan now accepts accumulator type as a template parameter.
      • DispatchScanByKey now accepts accumulator type as a template parameter.
      • DispatchReduceByKey now accepts accumulator type as the last template parameter.
  • #527: Deprecate the debug_synchronous flags on device algorithms.
    • This flag no longer has any effect. Define CUB_DEBUG_SYNC during compilation to enable these checks.
    • Moving this option from run-time to compile-time avoids the compilation overhead of unused debugging paths in production code.
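
A minimal sketch of the cuda::proclaim_return_type work-around mentioned in #509 above. It assumes libcu++ 1.9.0’s <cuda/functional> header and nvcc’s --extended-lambda flag; the reduction itself is illustrative:

    #include <cub/cub.cuh>
    #include <cuda/functional>  // cuda::proclaim_return_type (libcu++ 1.9.0+)

    void sum_with_device_lambda(void *d_temp_storage, size_t &temp_storage_bytes,
                                const int *d_in, int *d_out, int num_items)
    {
      // Proclaiming the return type lets host code pass this __device__-only
      // lambda to CUB without querying its return type (now a compile error).
      auto add = cuda::proclaim_return_type<int>(
          [] __device__(int a, int b) { return a + b; });

      cub::DeviceReduce::Reduce(d_temp_storage, temp_storage_bytes,
                                d_in, d_out, num_items, add, /*init=*/0);
    }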

New Features

  • #514: Support __half in DeviceHistogram.
  • #516: Add support for single-threaded invocations of WarpReduce (see the sketch after this list).
  • #516: Use __reduce_add_sync hardware acceleration for WarpReduce on supported architectures.
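
A sketch of the degenerate single-thread case from #516. The block size is a made-up example; the point is only that LOGICAL_WARP_THREADS == 1 now compiles and acts as an identity reduction:

    #include <cub/cub.cuh>

    __global__ void identity_warp_reduce(const int *d_in, int *d_out)
    {
      // A logical "warp" of one thread: each thread reduces only its own value.
      using WarpReduce = cub::WarpReduce<int, /*LOGICAL_WARP_THREADS=*/1>;
      __shared__ typename WarpReduce::TempStorage temp_storage[128];  // one per thread

      int sum = WarpReduce(temp_storage[threadIdx.x]).Sum(d_in[threadIdx.x]);
      d_out[threadIdx.x] = sum;  // sum == d_in[threadIdx.x]
    }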

Bug Fixes

  • #481: Fix the device-wide radix sort implementations to simply copy the input to the output when begin_bit == end_bit.
  • #487: Fix DeviceHistogram::Even for a variety of edge cases:
    • Bin ids are now correctly computed when mixing different types for SampleT and LevelT.
    • Bin ids are now correctly computed when LevelT is an integral type and the number of levels does not evenly divide the level range.
  • #508: Ensure that temp_storage_bytes is properly set in the AdjacentDifferenceCopy device algorithms.
  • #508: Remove excessive calls to the binary operator given to the AdjacentDifferenceCopy device algorithms.
  • #533: Fix debugging utilities when RDC is disabled.

Other Enhancements

  • #448: Removed special-case code for unsupported CUDA architectures.
  • #448: Replace several usages of __CUDA_ARCH__ with <nv/target> to handle host/device code divergence.
  • #448: Mark unused PTX arch parameters as legacy.
  • #476: Enabled additional debug logging for the onesweep radix sort implementation. Thanks to @canonizer for this contribution.
  • #480: Add CUB_DISABLE_BF16_SUPPORT to avoid including the cuda_bf16.h header or using the __nv_bfloat16 type.
  • #486: Add debug log messages for post-kernel debug synchronizations.
  • #490: Clarify documentation for in-place usage of DeviceScan algorithms.
  • #494: Clarify documentation for in-place usage of DeviceHistogram algorithms.
  • #495: Clarify documentation for in-place usage of DevicePartition algorithms.
  • #499: Clarify documentation for in-place usage of Device*Sort algorithms.
  • #500: Clarify documentation for in-place usage of DeviceReduce algorithms.
  • #501: Clarify documentation for in-place usage of DeviceRunLengthEncode algorithms.
  • #503: Clarify documentation for in-place usage of DeviceSelect algorithms.
  • #518: Fix typo in WarpMergeSort documentation.
  • #519: Clarify segmented sort documentation regarding the handling of elements that are not included in any segment.

Contributors

  • @canonizer

CUB 1.17.1

15 Aug 16:19

Summary

CUB 1.17.1 is a minor bugfix release.

  • #508: Ensure that temp_storage_bytes is properly set in the AdjacentDifferenceCopy device algorithms.
  • #508: Remove excessive calls to the binary operator given to the AdjacentDifferenceCopy device algorithms.
  • Fix device-side debug synchronous behavior in DeviceSegmentedSort.

CUB 1.17.0

09 May 18:07

Summary

CUB 1.17.0 is the final minor release of the 1.X series. It provides a variety of bug fixes and miscellaneous enhancements, detailed below.

Known Issues

“Run-to-run” Determinism Broken

Several CUB device algorithms are documented to provide deterministic results (per device) for non-associative reduction operators (e.g., floating-point addition). Unfortunately, the implementations of these algorithms contain performance optimizations that violate this guarantee. The DeviceReduce::ReduceByKey and DeviceScan algorithms are known to be affected. We’re currently evaluating the scope and impact of correcting this in a future CUB release. See NVIDIA/cub#471 for details.

Bug Fixes

  • #444: Fixed DeviceSelect to work with discard iterators and mixed input/output types.
  • #452: Fixed install issue when CMAKE_INSTALL_LIBDIR contained nested directories. Thanks to @robertmaynard for this contribution.
  • #462: Fixed bug that produced incorrect results from DeviceSegmentedSort on sm_61 and sm_70.
  • #464: Fixed DeviceSelect::Flagged so that flags are normalized to 0 or 1.
  • #468: Fixed overflow issues in DeviceRadixSort given num_items close to 2^32. Thanks to @canonizer for this contribution.
  • #498: Fixed compiler regression in BlockAdjacentDifference. Thanks to @MKKnorr for this contribution.

Other Enhancements

  • #445: Remove device-sync in DeviceSegmentedSort when launched via CDP.
  • #449: Fixed invalid link in documentation. Thanks to @kshitij12345 for this contribution.
  • #450: BlockDiscontinuity: Replaced recursive-template loop unrolling with #pragma unroll. Thanks to @kshitij12345 for this contribution.
  • #451: Replaced the deprecated TexRefInputIterator implementation with an alias to TexObjInputIterator. This fully removes all usages of the deprecated CUDA texture reference APIs from CUB.
  • #456: BlockAdjacentDifference: Replaced recursive-template loop unrolling with #pragma unroll. Thanks to @kshitij12345 for this contribution.
  • #466: The cub::DeviceAdjacentDifference API has been updated to use the new OffsetT deduction approach described in #212.
  • #470: Fix several doxygen-related warnings. Thanks to @karthikeyann for this contribution.

Contributors

  • @robertmaynard
  • @canonizer
  • @karthikeyann
  • @kshitij12345
  • @MKKnorr

CUB 1.16.0

08 Feb 19:34

Summary

CUB 1.16.0 is a major release providing several improvements to the device scope algorithms. DeviceRadixSort now supports large (64-bit indexed) input data. A new UniqueByKey algorithm has been added to DeviceSelect. DeviceAdjacentDifference provides new SubtractLeft and SubtractRight functionality.

This release also deprecates several obsolete APIs, including type traits and BlockAdjacentDifference algorithms. Many bugfixes and documentation updates are also included.

64-bit Offsets in DeviceRadixSort Public APIs

Users frequently want to process large datasets using CUB’s device-scope algorithms, but the current public APIs limit input data sizes to those that can be indexed by a 32-bit integer. Beginning with this release, CUB is updating these APIs to support 64-bit offsets, as discussed in #212.

The device-scope algorithms will be updated with 64-bit offset support incrementally, starting with the cub::DeviceRadixSort family of algorithms. Thanks to @canonizer for contributing this functionality.
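
A usage sketch under the assumption (per #212) that the internal offset width is deduced from the type of the num_items argument:

    #include <cstdint>
    #include <cub/cub.cuh>

    void sort_large(const float *d_keys_in, float *d_keys_out,
                    std::int64_t num_items)  // may exceed 2^32 - 1
    {
      void  *d_temp_storage     = nullptr;
      size_t temp_storage_bytes = 0;

      // First call sizes the temporary storage; second call sorts.
      cub::DeviceRadixSort::SortKeys(d_temp_storage, temp_storage_bytes,
                                     d_keys_in, d_keys_out, num_items);
      cudaMalloc(&d_temp_storage, temp_storage_bytes);
      cub::DeviceRadixSort::SortKeys(d_temp_storage, temp_storage_bytes,
                                     d_keys_in, d_keys_out, num_items);
      cudaFree(d_temp_storage);
    }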

New DeviceSelect::UniqueByKey Algorithm

cub::DeviceSelect now provides a UniqueByKey algorithm, which has been ported from Thrust. Thanks to @zasdfgbnm for this contribution.

New DeviceAdjacentDifference Algorithms

The new cub::DeviceAdjacentDifference interface, also ported from Thrust, provides SubtractLeft and SubtractRight algorithms as CUB kernels.
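
A minimal in-place sketch, assuming CUB’s usual two-phase temp-storage convention and the default difference operator (output[i] = input[i] - input[i - 1], with element 0 left unchanged):

    #include <cub/cub.cuh>

    void left_differences(int *d_data, int num_items)
    {
      void  *d_temp_storage     = nullptr;
      size_t temp_storage_bytes = 0;

      // First call computes the required temporary storage size.
      cub::DeviceAdjacentDifference::SubtractLeft(
          d_temp_storage, temp_storage_bytes, d_data, num_items);
      cudaMalloc(&d_temp_storage, temp_storage_bytes);

      // Second call performs the in-place computation.
      cub::DeviceAdjacentDifference::SubtractLeft(
          d_temp_storage, temp_storage_bytes, d_data, num_items);
      cudaFree(d_temp_storage);
    }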

Deprecation Notices

Synchronous CUDA Dynamic Parallelism Support

A future version of CUB will change the debug_synchronous behavior of device-scope algorithms when invoked via CUDA Dynamic Parallelism (CDP).

This will only affect calls to CUB device-scope algorithms launched from device-side code with debug_synchronous = true. Such invocations will continue to print extra debugging information, but they will no longer synchronize after kernel launches.

Deprecated Traits

CUB provided a variety of metaprogramming type traits in order to support C++03. Since C++14 is now required, these traits have been deprecated in favor of their STL equivalents, as shown below:

  Deprecated CUB Trait     Replacement STL Trait
  cub::If                  std::conditional
  cub::Equals              std::is_same
  cub::IsPointer           std::is_pointer
  cub::IsVolatile          std::is_volatile
  cub::RemoveQualifiers    std::remove_cv
  cub::EnableIf            std::enable_if

CUB now uses the STL traits internally, resulting in a ~6% improvement in compile time.
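
A before/after sketch of the migration. The ::Type and ::VALUE member names follow CUB’s legacy traits, and the accum_t alias is invented for illustration:

    #include <type_traits>

    // Before (deprecated CUB traits):
    //   using accum_t = typename cub::If<UseDouble, double, float>::Type;
    //   static const bool same = cub::Equals<accum_t, double>::VALUE;

    // After (standard <type_traits>):
    template <bool UseDouble>
    using accum_t = typename std::conditional<UseDouble, double, float>::type;

    static_assert(std::is_same<accum_t<true>, double>::value,
                  "accum_t<true> selects double");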

Misnamed cub::BlockAdjacentDifference APIs

The algorithms in cub::BlockAdjacentDifference have been deprecated, as their names did not clearly describe their intent. The FlagHeads method is now SubtractLeft, and FlagTails has been replaced by SubtractRight.
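
A kernel-side sketch of the renamed API, assuming the usual CUB block-primitive pattern (per-thread item arrays plus shared TempStorage); block size, item count, and operator are illustrative:

    #include <cub/cub.cuh>

    __global__ void block_subtract_left(const int *d_in, int *d_out)
    {
      constexpr int BLOCK_THREADS    = 128;
      constexpr int ITEMS_PER_THREAD = 4;
      using BlockAdjDiff = cub::BlockAdjacentDifference<int, BLOCK_THREADS>;

      __shared__ typename BlockAdjDiff::TempStorage temp_storage;

      int items[ITEMS_PER_THREAD];
      cub::LoadDirectBlocked(threadIdx.x, d_in, items);

      // Formerly FlagHeads; the new name says what is computed.
      BlockAdjDiff(temp_storage).SubtractLeft(items, items, cub::Difference());

      cub::StoreDirectBlocked(threadIdx.x, d_out, items);
    }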

Breaking Changes

  • #331: Deprecate the misnamed BlockAdjacentDifference::FlagHeads and FlagTails methods. Use the new SubtractLeft and SubtractRight methods instead.
  • #364: Deprecate some obsolete type traits. These should be replaced by the equivalent traits in <type_traits> as described above.

New Features

  • #331: Port the thrust::adjacent_difference kernel and expose it as cub::DeviceAdjacentDifference.
  • #405: Port the thrust::unique_by_key kernel and expose it as cub::DeviceSelect::UniqueByKey. Thanks to @zasdfgbnm for this contribution.

Enhancements

  • #340: Allow 64-bit offsets in DeviceRadixSort public APIs. Thanks to @canonizer for this contribution.
  • #400: Implement a significant reduction in DeviceMergeSort compilation time.
  • #415: Support user-defined CMAKE_INSTALL_INCLUDEDIR values in Thrust’s CMake install rules. Thanks to @robertmaynard for this contribution.

Bug Fixes

  • #381: Fix shared memory alignment in dyn_smem example.
  • #393: Fix some collisions with the min/max macros defined in windows.h.
  • #404: Fix bad cast in util_device.
  • #410: Fix CDP issues in DeviceSegmentedSort.
  • #411: Ensure that the nv_exec_check_disable pragma is only used on nvcc.
  • #418: Fix -Wsizeof-array-div warning on gcc 11. Thanks to @robertmaynard for this contribution.
  • #420: Fix new uninitialized variable warning in DiscardIterator on gcc 10.
  • #423: Fix some collisions with the small macro defined in windows.h.
  • #426: Fix some issues with version handling in CUB’s CMake packages.
  • #430: Remove documentation for DeviceSpmv parameters that are absent from public APIs.
  • #432: Remove incorrect documentation for DeviceScan algorithms that guaranteed run-to-run deterministic results for floating-point addition.

Contributors

  • @robertmaynard
  • @zasdfgbnm
  • @canonizer

CUB 1.15.0

25 Oct 19:07

Summary

CUB 1.15.0 includes a new cub::DeviceSegmentedSort algorithm, which demonstrates up to 5000x speedup compared to cub::DeviceSegmentedRadixSort when sorting a large number of small segments. A new cub::FutureValue<T> helper allows the cub::DeviceScan algorithms to lazily load the initial_value from a pointer. cub::DeviceScan also added ScanByKey functionality.

The new DeviceSegmentedSort algorithm partitions segments into size groups. Each group is processed with specialized kernels using a variety of sorting algorithms. This approach varies the number of threads allocated for sorting each segment and utilizes the GPU more efficiently.

cub::FutureValue<T> provides the ability to use the result of a previous kernel as a scalar input to a CUB device-scope algorithm without unnecessary synchronization:

    int *d_intermediate_result = ...;
    intermediate_kernel<<<blocks, threads>>>(d_intermediate_result, // output
                                             arg1,                  // input
                                             arg2);                 // input

    // Wrap the intermediate pointer in a FutureValue -- no need to explicitly
    // sync when both kernels are stream-ordered. The pointer is read after
    // the ExclusiveScan kernel starts executing.
    cub::FutureValue<int> init_value(d_intermediate_result);
    cub::DeviceScan::ExclusiveScan(d_temp_storage,
                                   temp_storage_bytes,
                                   d_in,
                                   d_out,
                                   cub::Sum(),
                                   init_value,
                                   num_items);

Previously, an explicit synchronization would have been necessary to obtain the intermediate result, which was passed by value into ExclusiveScan. This new feature enables better performance in workflows that use cub::DeviceScan.

Deprecation Notices

A future version of CUB will change the debug_synchronous behavior of device-scope algorithms when invoked via CUDA Dynamic Parallelism (CDP).

This will only affect calls to CUB device-scope algorithms launched from device-side code with debug_synchronous = true. These algorithms will continue to print extra debugging information, but they will no longer synchronize after kernel launches.

Breaking Changes

  • #305: The template parameters of cub::DispatchScan have changed to support the new cub::FutureValue helper. More details under “New Features”.
  • #377: Remove broken operator->() from cub::TransformInputIterator, since this cannot be implemented without returning a temporary object’s address. Thanks to Xiang Gao (@zasdfgbnm) for this contribution.

New Features

  • #305: Add overloads to cub::DeviceScan algorithms that allow the output of a previous kernel to be used as initial_value without explicit synchronization. See the new cub::FutureValue helper for details. Thanks to Xiang Gao (@zasdfgbnm) for this contribution.
  • #354: Add cub::BlockRunLengthDecode algorithm. Thanks to Elias Stehle (@elstehle) for this contribution.
  • #357: Add cub::DeviceSegmentedSort, an optimized version of cub::DeviceSegmentedRadixSort with improved load balancing and small array performance.
  • #376: Add “by key” overloads to cub::DeviceScan. Thanks to Xiang Gao (@zasdfgbnm) for this contribution.

Bug Fixes

  • #349: Doxygen and unused variable fixes.
  • #363: Maintenance updates for the new cub::DeviceMergeSort algorithms.
  • #382: Fix several -Wconversion warnings. Thanks to Matt Stack (@matt-stack) for this contribution.
  • #388: Fix debug assertion on MSVC when using cub::CachingDeviceAllocator.
  • #395: Support building with __CUDA_NO_HALF_CONVERSIONS__. Thanks to Xiang Gao (@zasdfgbnm) for this contribution.

Contributors

  • @zasdfgbnm
  • @elstehle
  • @matt-stack

CUB 1.14.0 (NVIDIA HPC SDK 21.9)

24 Aug 18:41

CUB 1.14.0 is a major release accompanying the NVIDIA HPC SDK 21.9.

This release provides the often-requested merge sort algorithm, ported from the thrust::sort implementation. Merge sort provides more flexibility than the existing radix sort by supporting arbitrary data types and comparators, though radix sorting is still faster for supported inputs. This functionality is provided through the new cub::DeviceMergeSort and cub::BlockMergeSort algorithms.

The namespace wrapping mechanism has been overhauled for 1.14. The existing macros (CUB_NS_PREFIX/CUB_NS_POSTFIX) can now be replaced by a single macro, CUB_WRAPPED_NAMESPACE, which is set to the name of the desired wrapped namespace. Defining a similar THRUST_CUB_WRAPPED_NAMESPACE macro will embed both thrust:: and cub:: symbols in the same external namespace. The prefix/postfix macros are still supported, but now require a new CUB_NS_QUALIFIER macro to be defined, which provides the fully qualified CUB namespace (e.g. ::foo::cub). See cub/util_namespace.cuh for details.
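
A sketch of the new mechanism; the mylib name is a placeholder:

    // Define before including any CUB (or Thrust) header. The THRUST_ variant
    // wraps both thrust:: and cub:: in the same outer namespace.
    #define CUB_WRAPPED_NAMESPACE mylib
    #include <cub/cub.cuh>

    // CUB symbols now live in ::mylib::cub, so two shared libraries that each
    // wrap their own copy of CUB no longer clash at link time.
    using sum_op_t = ::mylib::cub::Sum;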

Breaking Changes

  • #350: When the CUB_NS_[PRE|POST]FIX macros are set, CUB_NS_QUALIFIER must also be defined to the fully qualified CUB namespace (e.g. #define CUB_NS_QUALIFIER ::foo::cub). Note that this is handled automatically when using the new [THRUST_]CUB_WRAPPED_NAMESPACE mechanism.
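
For the legacy route, the three macros now travel together. A sketch assuming the historical prefix/postfix semantics of opening and closing the wrapping namespace:

    // Legacy wrapping: the prefix/postfix pair opens and closes the outer
    // namespace, and CUB_NS_QUALIFIER must name the wrapped CUB namespace.
    #define CUB_NS_PREFIX    namespace foo {
    #define CUB_NS_POSTFIX   }
    #define CUB_NS_QUALIFIER ::foo::cub
    #include <cub/cub.cuh>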

New Features

  • #322: Ported the merge sort algorithm from Thrust; cub::BlockMergeSort and cub::DeviceMergeSort are now available.
  • #326: Simplify the namespace wrapper macros, and detect when Thrust’s symbols are in a wrapped namespace.

Bug Fixes

  • #160, #163, #352: Fixed several bugs in cub::DeviceSpmv and added basic tests for this algorithm. Thanks to James Wyles and Seunghwa Kang for their contributions.
  • #328: Fixed error handling bug and incorrect debugging output in cub::CachingDeviceAllocator. Thanks to Felix Kallenborn for this contribution.
  • #335: Fixed a compile error affecting clang and NVRTC. Thanks to Jiading Guo for this contribution.
  • #351: Fixed some errors in the cub::DeviceHistogram documentation.

Enhancements

  • #348: Add an example that demonstrates how to use dynamic shared memory with a CUB block algorithm. Thanks to Matthias Jouanneaux for this contribution.

CUB 1.13.1 (CUDA Toolkit 11.5)

25 Oct 19:01

CUB 1.13.1 is a minor release accompanying the CUDA Toolkit 11.5.

This release provides a new hook for embedding the cub:: namespace inside a custom namespace. This is intended to work around various issues related to linking multiple shared libraries that use CUB. The existing CUB_NS_PREFIX and CUB_NS_POSTFIX macros already provided this capability; this update provides a simpler mechanism that is extended to and integrated with Thrust. Simply define THRUST_CUB_WRAPPED_NAMESPACE to a namespace name, and both thrust:: and cub:: will be placed inside the new namespace. Using different wrapped namespaces for each shared library will prevent issues like those reported in NVIDIA/thrust#1401.

New Features

  • #326: Add THRUST_CUB_WRAPPED_NAMESPACE hooks.