Description
Bug Report for https://neetcode.io/problems/gpt-dataset
Please describe the bug below and include any steps to reproduce the bug or screenshots if possible.
```python
from typing import List, Tuple

import torch


class Solution:
    def batch_loader(self, raw_dataset: str, context_length: int, batch_size: int) -> Tuple[List[List[str]]]:
        torch.manual_seed(0)
        tokenized = raw_dataset.split()
        # Draw batch_size random starting positions for the context windows.
        indices = torch.randint(low=0, high=len(tokenized) - context_length, size=(batch_size,)).tolist()
        X = []
        Y = []
        for idx in indices:
            X.append(tokenized[idx:idx + context_length])
            # Y is X shifted right by one token (next-token targets).
            Y.append(tokenized[idx + 1:idx + 1 + context_length])
        return X, Y
```
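For concreteness, a hypothetical invocation with a tiny sample string (the text and parameters below are illustrative, not from the problem):

```python
raw = "the quick brown fox jumps over the lazy dog"

# Two (x, y) window pairs of three tokens each; every y window is its
# x window shifted right by one token.
X, Y = Solution().batch_loader(raw, context_length=3, batch_size=2)
print(X)
print(Y)
```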
In the provided solution, `high=len(tokenized) - context_length` can result in an invalid index when generating the `Y` output vector. If the random index equals `len(tokenized) - context_length`, then the end index for the slice `idx+1:idx+1+context_length` will be `len(tokenized) - context_length + 1 + context_length - 1`, which equals `len(tokenized)` and is out of bounds.
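A minimal sketch of that boundary case, with the index set manually to the value described above (isolated from `torch.randint` for illustration):

```python
tokenized = "a b c d e f g h".split()  # 8 tokens
context_length = 3
idx = len(tokenized) - context_length  # 5, the index described above

x = tokenized[idx:idx + context_length]          # ['f', 'g', 'h']
y = tokenized[idx + 1:idx + 1 + context_length]  # wants indices 6, 7, 8

# The last index y needs is idx + context_length == len(tokenized), one past
# the final valid index. Python slicing clamps at the end of the list, so y
# comes back silently truncated to 2 tokens instead of raising an IndexError.
print(x, y)  # ['f', 'g', 'h'] ['g', 'h']
```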