[FEATURE] Unified embed Interfaces for Vision Transformer Models for MIM Pretrain Research #2540

Open
Labels: enhancement (New feature or request)

Description

@ryan-minato

This feature request is related to challenges in Masked Image Modeling (MIM) pre-training with the vision transformer models in timm. Currently, embedding and feature extraction are tightly coupled within forward_features, making it difficult to inject mask operations after the initial patch embedding and positional encoding but before the transformer stages, which is a common MIM requirement. Researchers need access to the embedded tokens so they can mask them before passing them through the subsequent transformer layers.

Describe the solution you'd like

I propose refactoring all vision transformer models (e.g. vit, swin_transformer, etc.) to uniformly expose two distinct interfaces:

  • embed(self, x): This method should take the input tensor x (e.g., image patches) and return the embedded vectors, including positional encodings if applicable. The output should be ready for the transformer encoder stages.
  • forward_stages(self, x): This method should take the output from embed(x) (i.e., the embedded and position-encoded tokens) and pass them through the transformer encoder layers.

This separation would make the existing forward_features(x) effectively equivalent to forward_stages(embed(x)). This allows researchers to easily perform mask operations on the embedded tokens returned by embed(x) before passing them to forward_stages(x), enabling flexible MIM pre-training experiments.
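
For illustration, here is a rough sketch of how the proposed split could be used in an MIM pre-training loop. Note that embed and forward_stages do not exist in timm today; they, along with the masking ratio, are assumptions about the proposed API rather than working code against the current library.

```python
import torch
import timm

# Hypothetical usage of the proposed split (embed / forward_stages are NOT part
# of timm today; this sketch assumes the refactor described above).
model = timm.create_model('vit_base_patch16_224', pretrained=False)

x = torch.randn(2, 3, 224, 224)              # batch of images
tokens = model.embed(x)                      # (B, N, D): patch + positional embedding
B, N, _ = tokens.shape

# Randomly zero out 75% of the tokens, MAE-style (in practice the class token
# would usually be excluded from masking).
mask = torch.rand(B, N) < 0.75               # True = masked position
tokens = tokens.masked_fill(mask.unsqueeze(-1), 0.0)

# Same code path that forward_features(x) would take after embedding.
features = model.forward_stages(tokens)
```

The key point is that the masking policy lives entirely in user code; forward_features(x) itself would remain unchanged and simply call forward_stages(embed(x)).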

Describe alternatives you've considered

I have considered alternative solutions, such as adding a mask parameter directly to the forward_features and forward methods, similar to vision_transformer's implementation.

```python
def forward_features(self, x: torch.Tensor, attn_mask: Optional[torch.Tensor] = None) -> torch.Tensor:
    """Forward pass through feature layers (embeddings, transformer blocks, post-transformer norm)."""
```

While seemingly straightforward, this approach presents drawbacks:

  • Function Signature: It still alters the method's interface, even if default parameter values mitigate direct breakage.
  • Mask Value Flexibility: More importantly, it limits flexibility in how masked-out positions are handled, i.e. whether masked token values can be learned (e.g., a learnable mask token) or simply set to zero. Separating embed and forward_stages gives the caller full control, as sketched below.
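
To make the flexibility point concrete, the following sketch fills masked positions with a learnable mask token instead of zeros, which is awkward to express through a single attn_mask argument. As before, embed and forward_stages are hypothetical methods from this proposal, not current timm API.

```python
import torch
import torch.nn as nn
import timm

# Hypothetical: with embed() / forward_stages() exposed, the caller decides how
# masked positions are filled (here, with a learnable mask token).
model = timm.create_model('vit_base_patch16_224', pretrained=False)
mask_token = nn.Parameter(torch.zeros(1, 1, model.embed_dim))   # learnable fill value

x = torch.randn(2, 3, 224, 224)
tokens = model.embed(x)                                         # proposed method (not in timm yet)
mask = torch.rand(tokens.shape[:2]) < 0.6                       # True = masked position
tokens = torch.where(mask.unsqueeze(-1), mask_token.to(tokens.dtype), tokens)

features = model.forward_stages(tokens)                         # proposed method (not in timm yet)
```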

Additional context

If this refactoring aligns with the library's design, I would gladly contribute a Pull Request to implement these changes across relevant vision transformer models.
