
Build SOTA AI Models 80% faster with modular, high-performance, and scalable building blocks!

Docs


After building thousands of neural nets and repeatedly hitting the same annoying bottlenecks of chaotic codebases with no modularity and low-performance modules, I built Zeta to enable me and others to quickly prototype, train, and optimize the latest SOTA neural nets and deploy them into production.

Zeta places a radical emphasis on usability, modularity, and performance. Zeta is currently employed in hundreds of models across my GitHub and others'. Get started below, and let me know if you want my help building any model; I'm here for you 😊 💜

Install

$ pip3 install -U zetascale

Usage

Starting Your Journey

Creating a model empowered with breakthrough research features is a breeze. Here's how to quickly materialize the renowned Multi-Query Attention:

import torch
from zeta import MultiQueryAttention

# Model
model = MultiQueryAttention(
    dim=512,
    heads=8,
)

# Input
text = torch.randn(2, 4, 512)

# Output
output, _, _ = model(text)
print(output.shape)
print(output)

SwiGLU

The SwiGLU activation function takes an input tensor and applies a gating mechanism to selectively pass information. It consists of two parts: the "switch" gate and the "glu" gate. The switch gate controls the flow of information, while the glu gate performs a non-linear transformation on the input.

import torch
from zeta.nn import SwiGLUStacked

x = torch.randn(5, 10)
swiglu = SwiGLUStacked(10, 20)
swiglu(x).shape

In this example, we first import the necessary modules, including torch for tensor operations and SwiGLUStacked from zeta.nn for the SwiGLU activation function.

We then create a random input tensor x with a shape of (5, 10). Next, we instantiate an instance of SwiGLUStacked with an input size of 10 and an output size of 20.

Finally, we pass the input tensor x to the swiglu module, which applies the SwiGLU activation function to it, and print the shape of the resulting output tensor to confirm the transformation.
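For intuition, here is a minimal sketch of the gating computation itself, assuming the standard SwiGLU formulation (a SiLU-gated product of two linear projections followed by an output projection); it is illustrative plain PyTorch, not the zeta.nn.SwiGLUStacked implementation.

import torch
import torch.nn.functional as F
from torch import nn


# Minimal reference sketch (assumed standard formulation):
# SwiGLU(x) = SiLU(x W) * (x V), then project back to the input dimension.
class SwiGLUSketch(nn.Module):
    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.w = nn.Linear(dim, hidden_dim, bias=False)    # "switch" branch
        self.v = nn.Linear(dim, hidden_dim, bias=False)    # "glu" branch
        self.out = nn.Linear(hidden_dim, dim, bias=False)  # output projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.out(F.silu(self.w(x)) * self.v(x))


x = torch.randn(5, 10)
print(SwiGLUSketch(10, 20)(x).shape)  # torch.Size([5, 10])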


RelativePositionBias

RelativePositionBias quantizes the distance between two positions into a certain number of buckets and then uses an embedding to get the relative position bias. This mechanism aids in the attention mechanism by providing biases based on relative positions between the query and key, rather than relying solely on their absolute positions.

import torch
from torch import nn
from zeta.nn import RelativePositionBias

# Initialize the RelativePositionBias module
rel_pos_bias = RelativePositionBias()

# Example 1: Compute bias for a single batch
bias_matrix = rel_pos_bias(1, 10, 10)


# Example 2: Utilize in conjunction with an attention mechanism
# NOTE: This is a mock example, and may not represent an actual attention
# mechanism's complete implementation.
class MockAttention(nn.Module):
    def __init__(self):
        super().__init__()
        self.rel_pos_bias = RelativePositionBias()

    def forward(self, queries, keys):
        bias = self.rel_pos_bias(queries.size(0), queries.size(1), keys.size(1))
        # Further computations with bias in the attention mechanism...
        return None  # Placeholder


# Example 3: Modify default configurations
custom_rel_pos_bias = RelativePositionBias(
    bidirectional=False, num_buckets=64, max_distance=256, num_heads=8
)
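For intuition, here is a hedged sketch in plain PyTorch (not the zeta.nn implementation) of how such a bias is typically consumed: it is added to the raw attention scores before the softmax, so the attention weights depend on the relative distance between query and key positions. A random tensor stands in for the bias matrix.

import torch
import torch.nn.functional as F

heads, q_len, k_len, head_dim = 8, 10, 10, 64
scores = torch.randn(1, heads, q_len, k_len) / head_dim**0.5  # q @ k^T / sqrt(d)
bias = torch.randn(1, heads, q_len, k_len)                    # stand-in for the relative position bias
attn = F.softmax(scores + bias, dim=-1)                       # bias shifts the logits before normalization
print(attn.shape)  # torch.Size([1, 8, 10, 10])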

FeedForward

The FeedForward module performs a feedforward operation on the input tensor x. It consists of a multi-layer perceptron (MLP) with an optional activation function and LayerNorm. It is used in most language, multi-modal, and modern neural networks.

import torch
from zeta.nn import FeedForward

model = FeedForward(256, 512, glu=True, post_act_ln=True, dropout=0.2)

x = torch.randn(1, 256)
output = model(x)
print(output.shape)

BitLinear

The BitLinear module performs a linear transformation on the input data, followed by quantization and dequantization. The quantization is performed using the absmax_quantize function, which quantizes the input tensor based on its absolute maximum value, as described in the paper.
import torch
from torch import nn

import zeta.quant as qt


class MyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = qt.BitLinear(10, 20)

    def forward(self, x):
        return self.linear(x)


# Initialize the model
model = MyModel()

# Create a random tensor of size (128, 10)
input = torch.randn(128, 10)

# Perform the forward pass
output = model(input)

# Print the size of the output
print(output.size())  # torch.Size([128, 20])
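For background, here is a minimal sketch of absmax quantization in plain PyTorch, assuming the usual int8 formulation (scale by the absolute maximum, round, then divide the scale back out). The helper name is illustrative; it is not the zeta absmax_quantize implementation.

import torch


# Hedged sketch of absmax quantization, the idea BitLinear builds on.
def absmax_quantize_sketch(x: torch.Tensor, bits: int = 8):
    qmax = 2 ** (bits - 1) - 1                     # e.g. 127 for int8
    scale = qmax / x.abs().max().clamp(min=1e-8)   # guard against all-zero input
    quantized = (x * scale).round().clamp(-qmax, qmax).to(torch.int8)
    dequantized = quantized.float() / scale
    return quantized, dequantized


x = torch.randn(4, 10)
q, dq = absmax_quantize_sketch(x)
print(q.dtype, dq.shape, (x - dq).abs().max())  # small reconstruction error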

PalmE

This is an implementation of the multi-modal PaLM-E model, using a decoder LLM as the backbone with a ViT image encoder to process vision. It's very similar to GPT-4, Kosmos, RTX2, and many other multi-modality model architectures.

import torch
from zeta.structs import (
    AutoRegressiveWrapper,
    Decoder,
    Encoder,
    Transformer,
    ViTransformerWrapper,
)


class PalmE(torch.nn.Module):
    """
    PalmE is a transformer architecture that uses a ViT encoder and a transformer decoder.

    Args:
        image_size (int): Size of the image.
        patch_size (int): Size of the patch.
        encoder_dim (int): Dimension of the encoder.
        encoder_depth (int): Depth of the encoder.
        encoder_heads (int): Number of heads in the encoder.
        num_tokens (int): Number of tokens.
        max_seq_len (int): Maximum sequence length.
        decoder_dim (int): Dimension of the decoder.
        decoder_depth (int): Depth of the decoder.
        decoder_heads (int): Number of heads in the decoder.
        alibi_num_heads (int): Number of heads in the alibi attention.
        attn_kv_heads (int): Number of heads in the attention key-value projection.
        use_abs_pos_emb (bool): Whether to use absolute positional embeddings.
        cross_attend (bool): Whether to cross attend in the decoder.
        alibi_pos_bias (bool): Whether to use positional bias in the alibi attention.
        rotary_xpos (bool): Whether to use rotary positional embeddings.
        attn_flash (bool): Whether to use attention flash.
        qk_norm (bool): Whether to normalize the query and key in the attention layer.

    Returns:
        torch.Tensor: The output of the model.

    Usage:
        img = torch.randn(1, 3, 256, 256)
        text = torch.randint(0, 20000, (1, 1024))
        model = PalmE()
        output = model(img, text)
        print(output)
    """

    def __init__(
        self,
        image_size=256,
        patch_size=32,
        encoder_dim=512,
        encoder_depth=6,
        encoder_heads=8,
        num_tokens=20000,
        max_seq_len=1024,
        decoder_dim=512,
        decoder_depth=6,
        decoder_heads=8,
        alibi_num_heads=4,
        attn_kv_heads=2,
        use_abs_pos_emb=False,
        cross_attend=True,
        alibi_pos_bias=True,
        rotary_xpos=True,
        attn_flash=True,
        qk_norm=True,
    ):
        super().__init__()

        # vit architecture
        self.encoder = ViTransformerWrapper(
            image_size=image_size,
            patch_size=patch_size,
            attn_layers=Encoder(
                dim=encoder_dim, depth=encoder_depth, heads=encoder_heads
            ),
        )

        # palm model architecture
        self.decoder = Transformer(
            num_tokens=num_tokens,
            max_seq_len=max_seq_len,
            use_abs_pos_emb=use_abs_pos_emb,
            attn_layers=Decoder(
                dim=decoder_dim,
                depth=decoder_depth,
                heads=decoder_heads,
                cross_attend=cross_attend,
                alibi_pos_bias=alibi_pos_bias,
                alibi_num_heads=alibi_num_heads,
                rotary_xpos=rotary_xpos,
                attn_kv_heads=attn_kv_heads,
                attn_flash=attn_flash,
                qk_norm=qk_norm,
            ),
        )

        # autoregressive wrapper to enable generation of tokens
        self.decoder = AutoRegressiveWrapper(self.decoder)

    def forward(self, img: torch.Tensor, text: torch.Tensor):
        """Forward pass of the model."""
        try:
            encoded = self.encoder(img, return_embeddings=True)
            return self.decoder(text, context=encoded)
        except Exception as error:
            print(f"Failed in forward method: {error}")
            raise


# Usage with random inputs
img = torch.randn(1, 3, 256, 256)
text = torch.randint(0, 20000, (1, 1024))

# Initialize the model
model = PalmE()
output = model(img, text)
print(output)

Unet

Unet is a famous convolutional neural network architecture originally used for biomedical image segmentation but soon became the backbone of the generative AI Mega-revolution. The architecture comprises two primary pathways: downsampling and upsampling, followed by an output convolution. Due to its U-shape, the architecture is named U-Net. Its symmetric architecture ensures that the context (from downsampling) and the localization (from upsampling) are captured effectively.

import torch
from zeta.nn import Unet

# Initialize the U-Net model
model = Unet(n_channels=1, n_classes=2)

# Random input tensor with dimensions [batch_size, channels, height, width]
x = torch.randn(1, 1, 572, 572)

# Forward pass through the model
y = model(x)

# Output
print(f"Input shape: {x.shape}")
print(f"Output shape: {y.shape}")

VisionEmbeddings

The VisionEmbedding class is designed for converting images into patch embeddings, making them suitable for processing by transformer-based models. This class plays a crucial role in various computer vision tasks and enables the integration of vision data into transformer architectures!

import torch
from zeta.nn import VisionEmbedding

# Create an instance of VisionEmbedding
vision_embedding = VisionEmbedding(
    img_size=224,
    patch_size=16,
    in_chans=3,
    embed_dim=768,
    contain_mask_token=True,
    prepend_cls_token=True,
)

# Load an example image (3 channels, 224x224)
input_image = torch.rand(1, 3, 224, 224)

# Perform image-to-patch embedding
output = vision_embedding(input_image)

# The output now contains patch embeddings, ready for input to a transformer model
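For background, here is a minimal sketch of what patch embedding amounts to, assuming the common strided-convolution formulation (one embedding vector per non-overlapping patch); it is illustrative and not the zeta.nn.VisionEmbedding implementation.

import torch
from torch import nn

# A strided Conv2d maps each 16x16 patch to an embed_dim vector.
img_size, patch_size, in_chans, embed_dim = 224, 16, 3, 768
proj = nn.Conv2d(in_chans, embed_dim, kernel_size=patch_size, stride=patch_size)

image = torch.rand(1, in_chans, img_size, img_size)
patches = proj(image)                        # (1, 768, 14, 14)
tokens = patches.flatten(2).transpose(1, 2)  # (1, 196, 768): one token per patch
print(tokens.shape)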

niva

Niva quantizes the weights of specific layers (specified by quantize_layers) and is ideal for models whose runtime activations vary. 👁️ Example layers: nn.Embedding, nn.LSTM.

import torch
from torch import nn  # needed for the quantize_layers list below

from zeta import niva

# Load a pre-trained model
model = YourModelClass()

# Quantize the model dynamically, specifying layers to quantize
niva(
    model=model,
    model_path="path_to_pretrainedim_weights.pt",
    output_path="quantizedim.pt",
    quant_type="dynamic",
    quantize_layers=[nn.Linear, nn.Conv2d],
    dtype=torch.qint8,
)
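For comparison, here is a minimal sketch of plain dynamic quantization using PyTorch's built-in torch.quantization.quantize_dynamic; it illustrates the idea niva builds on and is not the niva API itself.

import torch
from torch import nn

# Dynamically quantize the weights of the Linear layers to int8;
# activations are quantized on the fly at inference time.
float_model = nn.Sequential(nn.Linear(10, 20), nn.ReLU(), nn.Linear(20, 5))
quantized_model = torch.quantization.quantize_dynamic(
    float_model, {nn.Linear}, dtype=torch.qint8
)
print(quantized_model)  # Linear layers replaced with dynamically quantized versions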

FusedDenseGELUDense

Increase model speed by 2x with this module, which fuses two hyper-optimized dense ops from bitsandbytes with a GELU!

import torch
from zeta.nn import FusedDenseGELUDense

x = torch.randn(1, 512)
model = FusedDenseGELUDense(512, 1024)
out = model(x)
out.shape

FusedDropoutLayerNorm

FusedDropoutLayerNorm is a fused kernel of dropout and layer norm that speeds up FFNs or MLPs by 2x.

import torch
from torch import nn
from zeta.nn import FusedDropoutLayerNorm

# Initialize the module
model = FusedDropoutLayerNorm(dim=512)

# Create a sample input tensor
x = torch.randn(1, 512)

# Forward pass
output = model(x)

# Check output shape
print(output.shape)  # Expected: torch.Size([1, 512])
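For reference, the unfused composition that such a fused kernel replaces is simply dropout followed by LayerNorm; a minimal sketch, assuming that order of operations:

import torch
from torch import nn

# Unfused equivalent: apply dropout, then normalize over the last dimension.
dropout = nn.Dropout(p=0.1)
norm = nn.LayerNorm(512)

x = torch.randn(1, 512)
print(norm(dropout(x)).shape)  # torch.Size([1, 512])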

Mamba

PyTorch implementation of the new SSM model architecture, Mamba.

import torch
from zeta.nn import MambaBlock

# Initialize Mamba
block = MambaBlock(dim=64, depth=1)

# Random input
x = torch.randn(1, 10, 64)

# Apply the model to the block
y = block(x)

print(y.shape)
# torch.Size([1, 10, 64])

FiLM

import torch
from zeta.nn import Film

# Initialize the Film layer
film_layer = Film(dim=128, hidden_dim=64, expanse_ratio=4)

# Create some dummy data for conditions and hiddens
conditions = torch.randn(10, 128)  # Batch size is 10, feature size is 128
hiddens = torch.randn(10, 1, 128)  # Batch size is 10, sequence length is 1, feature size is 128

# Pass the data through the Film layer
modulated_features = film_layer(conditions, hiddens)

# Print the shape of the output
print(modulated_features.shape)  # Should be [10, 1, 128]

hyper_optimize

A single wrapper for torch.fx, torch.script, torch.compile, dynamic quantization, and mixed precision through torch.amp, with execution time metrics all in one place!

import torch
from zeta.nn import hyper_optimize


@hyper_optimize(
    torch_fx=False,
    torch_script=False,
    torch_compile=True,
    quantize=True,
    mixed_precision=True,
    enable_metrics=True,
)
def model(x):
    return x @ x


out = model(torch.randn(1, 3, 32, 32))
print(out)

DPO - Direct Preference Optimization

Direct Preference Optimization (DPO), employed in many RLHF applications for LLMs.

import torch
from torch import nn
from zeta.rl import DPO


# Define a simple policy model
class PolicyModel(nn.Module):
    def __init__(self, dim, output_dim):
        super().__init__()
        self.fc = nn.Linear(dim, output_dim)

    def forward(self, x):
        return self.fc(x)


dim = 10
output_dim = 5
policy_model = PolicyModel(dim, output_dim)

# Initialize DPO with the policy model
dpo_model = DPO(model=policy_model, beta=0.1)

# Sample preferred and unpreferred sequences
preferred_seq = torch.randint(0, output_dim, (3, dim))
unpreferred_seq = torch.randint(0, output_dim, (3, dim))

# Compute loss
loss = dpo_model(preferred_seq, unpreferred_seq)
print(loss)
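For background, the following is a minimal sketch of the core DPO loss from Rafailov et al. (2023), written directly from per-sequence log-probabilities. The helper name and its inputs are illustrative assumptions, not part of the zeta.rl.DPO API, which computes these quantities internally.

import torch
import torch.nn.functional as F


def dpo_loss_sketch(policy_pref, policy_unpref, ref_pref, ref_unpref, beta=0.1):
    # Each argument: log p(sequence) under the policy / frozen reference model.
    pref_ratio = policy_pref - ref_pref        # how much the policy favors the preferred sequence
    unpref_ratio = policy_unpref - ref_unpref  # how much it favors the unpreferred sequence
    return -F.logsigmoid(beta * (pref_ratio - unpref_ratio)).mean()


logp = torch.randn(4)  # dummy per-sequence log-probs for 4 preference pairs
print(dpo_loss_sketch(logp, logp - 0.5, logp - 0.1, logp - 0.2))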

PyTorch Model Logging

A decorator that logs the execution of a PyTorch model, including parameters, gradients, and memory usage.

import torch
from torch import nn

from zeta.utils import verbose_execution


# Configure Loguru (optional)
@verbose_execution(log_params=True, log_gradients=True, log_memory=True)
class YourPyTorchModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 64, 3)
        self.relu = nn.ReLU()
        self.flatten = nn.Flatten()
        self.fc = nn.Linear(64 * 222 * 222, 10)  # Adjusted input size

    def forward(self, x):
        x = self.conv1(x)
        x = self.relu(x)
        x = self.flatten(x)
        x = self.fc(x)
        return x


# Create and use your model
model = YourPyTorchModel()
input_tensor = torch.randn(1, 3, 224, 224)
output = model(input_tensor)

# If you want to see gradient information, you need to perform a backward pass
loss = output.sum()
loss.backward()

Sigmoid Attention

Attention is 18% faster with sigmoid instead of softmax. Replace the traditional softmax in attention with a sigmoid and a constant (not learned) scalar bias based on the sequence length.

import torch
from loguru import logger
from zeta import SigmoidAttention

batch_size = 32
seq_len = 128
dim = 512
heads = 8

x = torch.rand(batch_size, seq_len, dim)
mask = torch.ones(batch_size, seq_len, seq_len)  # Example mask

sigmoid_attn = SigmoidAttention(dim, heads, seq_len)
output = sigmoid_attn(x, mask)
print(output.shape)
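For intuition, here is a minimal sketch of the underlying idea in plain PyTorch, assuming the sigmoid-attention formulation with a constant bias of -log(seq_len) added to the scores; it is illustrative and not the zeta SigmoidAttention implementation.

import torch

batch, heads, seq_len, head_dim = 2, 8, 128, 64
q = torch.randn(batch, heads, seq_len, head_dim)
k = torch.randn(batch, heads, seq_len, head_dim)
v = torch.randn(batch, heads, seq_len, head_dim)

scores = (q @ k.transpose(-2, -1)) / head_dim**0.5
bias = -torch.log(torch.tensor(float(seq_len)))  # constant, not learned
weights = torch.sigmoid(scores + bias)           # elementwise, no row-wise normalization
out = weights @ v
print(out.shape)  # torch.Size([2, 8, 128, 64])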

Documentation

All classes must have documentation. If you see a class or function without documentation, please report it to me at kye@apac.ai.

Documentation is at zeta.apac.ai


Running tests

You should install the pre-commit hooks with pre-commit install. This will run the linter, mypy, and a subset of the tests on every commit.

For more examples on how to run the full test suite please refer to the CI workflow.

Some examples of running tests locally:

python3 -m pip install -e '.[testing]'  # install extra deps for testing
python3 -m pytest tests/                # whole test suite

Community

Join our growing community around the world, for real-time support, ideas, and discussions on how to build better models 😊


🤝 Schedule a 1-on-1 Session

Want to train a custom AI model for a real-world task like General Multi-Modal Models, Facial Recognition, Drug Discovery, or Humanoid Robotics? I'll help you create the model architecture, train the model, and then optimize it to meet your quality assurance standards.

Book a 1-on-1 Session with Kye, the Creator, here to discuss any issues, provide feedback, or explore how we can improve Zeta for you or help you build your own custom models!

🫶 Contributions:

The easiest way to contribute is to pick any issue with the good first issue tag 💪. Read the Contributing guidelines here. Bug Report? File here | Feature Request? File here

Zeta is an open-source project, and contributions are VERY welcome. If you want to contribute, you can create new features, fix bugs, or improve the infrastructure. Please refer to the CONTRIBUTING.md and our contributing board to participate in Roadmap discussions!


Accelerate Backlog

Help us accelerate our backlog by supporting us financially! Note: we're an open-source corporation, so all of our revenue currently comes from donations ;)

License

  • Apache

Citation

@misc{zetascale,
    title = {Zetascale Framework},
    author = {Kye Gomez},
    year = {2024},
    howpublished = {\url{https://github.com/kyegomez/zeta}},
}
