[Xet] Basic shard creation #1633
export function compute_range_verification_hash(chunkHashes: string[]): string;
export function compute_file_hash(chunks_array: Array<{ hash: string; length: number }>): string;
@assafvayner need those two functions from the wasm :)
(also, versions of those two, or at least the last one, with `.update` would be nice eventually)
what do you mean by `.update` for those functions?
where you can feed it data progressively before calling `finalize()` to get the hash.
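For readers unfamiliar with the pattern, here is a minimal sketch of what an `.update` / `finalize()` style hasher looks like. The `IncrementalHasher` interface and `ToyFnv1aHasher` class are hypothetical names invented for illustration; they are not part of the xet-core wasm bindings, and the 32-bit FNV-1a function is only a stand-in for the real hash.

```typescript
// Hypothetical interface, not an existing xet-core or wasm export.
interface IncrementalHasher {
  update(data: Uint8Array): void; // feed the next piece of data
  finalize(): string;             // return the hash of everything fed so far
}

// Toy stand-in using 32-bit FNV-1a, only to show the update/finalize shape.
class ToyFnv1aHasher implements IncrementalHasher {
  private state = 0x811c9dc5; // FNV-1a 32-bit offset basis

  update(data: Uint8Array): void {
    for (let i = 0; i < data.length; i++) {
      this.state ^= data[i];
      // Multiply by the FNV prime and keep the result as an unsigned 32-bit value.
      this.state = Math.imul(this.state, 0x01000193) >>> 0;
    }
  }

  finalize(): string {
    // Zero-pad to 8 hex digits.
    return ("00000000" + this.state.toString(16)).slice(-8);
  }
}

// Feeding the data in pieces gives the same result as feeding it all at once.
const streaming = new ToyFnv1aHasher();
streaming.update(new Uint8Array([1, 2]));
streaming.update(new Uint8Array([3, 4, 5, 6]));
const streamedHash = streaming.finalize();
```

The point of this shape is that the caller never has to hold every input in memory at once, which matters more for a file hash than for a per-xorb hash.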
assafvayner, Jul 17, 2025 (edited)
let's keep the xorb and range hash computation simple and take just an array of items, since those have a roughly reasonable limit of ~1K items
for the file hash I can see the value, but we don't have this feature implemented in xet-core yet, and it might be a while (it's not simple). For now there's just a `compute_file_hash` function that takes all the chunks at once, but we may be able to update that later
coyotte508, Jul 17, 2025 (edited)
Hmm, I don't see the difference between range hash & file hash; they both have all the chunk hashes for a file, no? (the only diff is that file hash has chunk lengths too)
> the file hash I can see the value but we don't have this feature implemented in xet-core yet, and might be a while (it's not simple)

yes, no problem
the range hash is at most 1 xorb's worth of hashes (this is a bit odd to explain, that's why we need to write the whole spec).
let's say a file has the following structure:
- xorb A chunks 0-1024 (out of 1024)
- xorb B chunks 0-500 (out of 1024)
- xorb A chunks 1-44
Then the range hashes for the verification section of the shard containing this file info will need to have:
- range_hash(xorb_A.chunks_hashes.slice(0, 1025))
- range_hash(xorb_B.chunks_hashes.slice(0, 501))
- range_hash(xorb_A.chunks_hashes.slice(1, 45))
notice that every reasonable input to the range_hash function contains at most as many hashes as there are chunks in a xorb
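The mapping from file segments to range-hash inputs described above could be sketched roughly as follows. `FileSegment`, `RangeHashFn`, and `rangeHashesForFile` are hypothetical names used only for illustration; the actual hash function is passed in as a parameter, so the sketch assumes nothing about the wasm export beyond the `(chunkHashes: string[]) => string` signature quoted in the diff.

```typescript
// Hypothetical shape for one contiguous run of chunks taken from a single xorb.
interface FileSegment {
  xorbId: string;
  chunkStart: number; // inclusive
  chunkEnd: number;   // exclusive, same convention as Array.prototype.slice
}

// Same signature as the compute_range_verification_hash declaration in the diff.
type RangeHashFn = (chunkHashes: string[]) => string;

// Produce one range hash per segment, in file order.
function rangeHashesForFile(
  segments: FileSegment[],
  xorbChunkHashes: Record<string, string[]>,
  rangeHash: RangeHashFn
): string[] {
  return segments.map((seg) => {
    const hashes = xorbChunkHashes[seg.xorbId];
    if (hashes === undefined) {
      throw new Error("unknown xorb: " + seg.xorbId);
    }
    if (seg.chunkStart < 0 || seg.chunkStart >= seg.chunkEnd || seg.chunkEnd > hashes.length) {
      throw new Error("segment out of bounds for xorb " + seg.xorbId);
    }
    // Each input is a slice of a single xorb, so it never exceeds one xorb's chunk count.
    return rangeHash(hashes.slice(seg.chunkStart, seg.chunkEnd));
  });
}
```

Because every call receives a slice of one xorb's chunk hashes, the array-based signature stays bounded by the per-xorb chunk limit regardless of file size.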
cc @Kakulukian @assafvayner for viz, follow-up to #1616
Based on https://github.com/huggingface/xet-core/blob/7e41fb0dd7cfb276222b9668d0b97a984647721e/spec/shard.md
Need to handle: