- Notifications
You must be signed in to change notification settings - Fork67
Work on property pool HLSL impl#649
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to ourterms of service andprivacy statement. We’ll occasionally send you account related emails.
Already on GitHub?Sign in to your account
base:master
Are you sure you want to change the base?
Uh oh!
There was an error while loading.Please reload this page.
Conversation
Uh oh!
There was an error while loading.Please reload this page.
Uh oh!
There was an error while loading.Please reload this page.
Uh oh!
There was an error while loading.Please reload this page.
Uh oh!
There was an error while loading.Please reload this page.
Uh oh!
There was an error while loading.Please reload this page.
Uh oh!
There was an error while loading.Please reload this page.
| uint64_t endOffset; | ||
| }; | ||
| NBL_CONSTEXPR uint32_t MaxPropertiesPerDispatch =128; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
is there any reason to keep this around anymore?
| // Define the range of invocations (X axis) that will be transfered over in this dispatch | ||
| // May be sectioned off in the case of overflow or any other situation that doesn't allow | ||
| // for a full transfer | ||
| uint64_t beginOffset; | ||
| uint64_t endOffset; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
would be useful to make it clear we're counting in DWORDs or shorts (if you want to do 16bit transfer atoms instead)
Uh oh!
There was an error while loading.Please reload this page.
Uh oh!
There was an error while loading.Please reload this page.
| template<bool Fill,bool SrcIndexIota,bool DstIndexIota> | ||
| struct TransferLoopPermutationDstIota | ||
| { | ||
| voidcopyLoop(uint baseInvocationIndex,uint propertyId, TransferRequest transferRequest,uint dispatchSize) | ||
| { | ||
| if (transferRequest.srcIndexSizeLog2 ==0) { TransferLoopPermutationSrcIndexSizeLog<Fill, SrcIndexIota, DstIndexIota,0> loop; loop.copyLoop(baseInvocationIndex, propertyId, transferRequest, dispatchSize); } | ||
| elseif (transferRequest.srcIndexSizeLog2 ==1) { TransferLoopPermutationSrcIndexSizeLog<Fill, SrcIndexIota, DstIndexIota,1> loop; loop.copyLoop(baseInvocationIndex, propertyId, transferRequest, dispatchSize); } | ||
| elseif (transferRequest.srcIndexSizeLog2 ==2) { TransferLoopPermutationSrcIndexSizeLog<Fill, SrcIndexIota, DstIndexIota,2> loop; loop.copyLoop(baseInvocationIndex, propertyId, transferRequest, dispatchSize); } | ||
| else/*if (transferRequest.srcIndexSizeLog2 == 3)*/ { TransferLoopPermutationSrcIndexSizeLog<Fill, SrcIndexIota, DstIndexIota,3> loop; loop.copyLoop(baseInvocationIndex, propertyId, transferRequest, dispatchSize); } | ||
| } | ||
| }; | ||
| template<bool Fill,bool SrcIndexIota> | ||
| struct TransferLoopPermutationSrcIota | ||
| { | ||
| voidcopyLoop(uint baseInvocationIndex,uint propertyId, TransferRequest transferRequest,uint dispatchSize) | ||
| { | ||
| bool dstIota = transferRequest.dstIndexAddr ==0; | ||
| if (dstIota) { TransferLoopPermutationDstIota<Fill, SrcIndexIota,true> loop; loop.copyLoop(baseInvocationIndex, propertyId, transferRequest, dispatchSize); } | ||
| else { TransferLoopPermutationDstIota<Fill, SrcIndexIota,false> loop; loop.copyLoop(baseInvocationIndex, propertyId, transferRequest, dispatchSize); } | ||
| } | ||
| }; | ||
| template<bool Fill> | ||
| struct TransferLoopPermutationFill | ||
| { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
only use structs instead of templated functions when you need partial specialization
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
I'm not sure what you mean
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
The struct functor only makes sense if:
- you do a partial specialization, i.e.
template<typename ArbitraryType> struct MyFunctor<true,ArbitraryType>because functions can be only fully specialized - you need to pass the functor as a template arg/lambda because HLSL202x doesn't allow function pointers/references or you want a stateful functor
template<typename Accessor,typename Compare>uint32_tfind_first(inout Accessor accessor,const Compare comparator);
if neither of the above applies, just use a templated function
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
ah ok, I think your original comment was the wrong way around
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
ah ok, I think your original comment was the wrong way around
"only use structs instead of templated functions when you need partial specialization"
- you don't need partial specialization
- you're using structs
- but you only use structs when you have partial specialization
- dont use structs here.
Uh oh!
There was an error while loading.Please reload this page.
Uh oh!
There was an error while loading.Please reload this page.
Uh oh!
There was an error while loading.Please reload this page.
Uh oh!
There was an error while loading.Please reload this page.
Uh oh!
There was an error while loading.Please reload this page.
Uh oh!
There was an error while loading.Please reload this page.
Uh oh!
There was an error while loading.Please reload this page.
| staticinlineconstexprauto invalid = PropertyAddressAllocator::invalid_address; | ||
| staticinlineconstexpruint64_t invalid =0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
when did we agree to change theinvalid from the AddressAllocator invalid (0xdeadbeefu) to 0?
Uh oh!
There was an error while loading.Please reload this page.
| // TODO: Reuse asset manager from elsewhere? | ||
| auto assetManager = core::make_smart_refctd_ptr<asset::IAssetManager>(core::smart_refctd_ptr<system::ISystem>(system)); | ||
| auto loadShader = [&](constchar* path) | ||
| { | ||
| asset::IAssetLoader::SAssetLoadParams params = {}; | ||
| auto assetBundle = assetManager->getAsset(path, params); | ||
| auto assets = assetBundle.getContents(); | ||
| assert(!assets.empty()); | ||
| auto cpuShader = asset::IAsset::castDown<asset::ICPUShader>(assets[0]); | ||
| auto shader = m_device->createShader(cpuShader.get()); | ||
| return shader; | ||
| }; | ||
| auto shader =loadShader("../../../include/nbl/builtin/hlsl/property_pool/copy.comp.hlsl"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
I'd rather usesystem->openFile and use thenbl/builtin/hlsl/property_pool/copy.comp.hlsl path directly toi opena memory mapped file and createIGPUShader from it directly
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
look at the oldloadBuiltinData code
| transferRequest.srcIndexAddr = srcRequest->srcAddressesOffset ? addressBufferDeviceAddr + srcRequest->srcAddressesOffset :0; | ||
| transferRequest.dstIndexAddr = srcRequest->dstAddressesOffset ? addressBufferDeviceAddr + srcRequest->dstAddressesOffset :0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
I think I told you explicitly to get rid of theaddresses buffer, and just have each transferRequest supply the correctly offsetted BDA or SBufferBinding instead
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
given how you're doingsrcAddr anddstAddr, theSBufferBinding route is preferrable
Uh oh!
There was an error while loading.Please reload this page.
Uh oh!
There was an error while loading.Please reload this page.
| uint32_t maxScratchSize = MaxPropertiesPerDispatch *sizeof(nbl::hlsl::property_pools::TransferRequest); | ||
| if (scratch.offset + maxScratchSize > scratch.buffer->getSize()) | ||
| logger.log("CPropertyPoolHandler: The scratch buffer binding provided might not be big enough in the worst case! (Scratch buffer size: %i Max scratch size: %i)", | ||
| system::ILogger::ELL_WARNING, | ||
| scratch.buffer->getSize() - scratch.offset, | ||
| maxScratchSize); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
I'd validate better, also right now your code will choke/crash on a scratch buffer thats too small, you're still usingMaxPropertiesPerDispatch to figure out thenumberOfPasses
Uh oh!
There was an error while loading.Please reload this page.
| pushConstants.beginOffset = baseDWORD; | ||
| pushConstants.endOffset = endDWORD; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
this is why naming your variables matters, Offset == ByteOffset, but we're clearly counting in DWORDs?
Uh oh!
There was an error while loading.Please reload this page.
| if (contiguous) | ||
| { | ||
| m_indexToAddr =reinterpret_cast<uint32_t*>(reinterpret_cast<uint8_t*>(reserved)+getReservedSize(capacity)); | ||
| m_indexToAddr =reinterpret_cast<uint64_t*>(reinterpret_cast<uint8_t*>(reserved)+getReservedSize(capacity)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
you can't touch this up without making the reserved mem bigger!
| // TODO: instead use some sort of replace function for getting optimal size? | ||
| [numthreads(512,1,1)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
I already wrote on discord to codegen a 5 line compute shader withmain in the C++ and leavecopy.hlsl as stage agnostic
| transferRequest.srcAddr = vk::RawBufferLoad<uint64_t>(transferCmdAddr,8); | ||
| transferRequest.dstAddr = vk::RawBufferLoad<uint64_t>(transferCmdAddr +sizeof(uint64_t),8); | ||
| transferRequest.srcIndexAddr = vk::RawBufferLoad<uint64_t>(transferCmdAddr +sizeof(uint64_t) *2,8); | ||
| transferRequest.dstIndexAddr = vk::RawBufferLoad<uint64_t>(transferCmdAddr +sizeof(uint64_t) *3,8); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
make a wrapper forvk::RawBufferLoad andStore innbl::hlsl::legacy which uses our type traits to default the alignment
Uh oh!
There was an error while loading.Please reload this page.
Uh oh!
There was an error while loading.Please reload this page.
| // Loading transfer request from the pointer (can't use struct | ||
| // with BDA on HLSL SPIRV) | ||
| static TransferRequest TransferRequest::newFromAddress(const uint64_t transferCmdAddr) | ||
| { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
keep it with the struct
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
The struct is shared with c++ code, so i wouldn't be able to use vk::rawbufferread; I could take the 64 bit value though
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
you can use#ifndef __HLSL_VERSION in the impl of the method
| // TODO: instead use some sort of replace function for getting optimal size? | ||
| NBL_CONSTEXPR uint32_t OptimalDispatchSize =256; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
you can use the device JIT to query the max compute dispatch size, I'd round it down to nearest PoT though, so the divisions aren't expensive
c00ee34 to7ac728bCompare
Description
Implementing
CPropertyPoolHandlerandCPropertyPoolin HLSL, using direct buffer address instead of allocating descriptors sets for buffers.Notes about impl:
Testing
TODO list: