microsoft/unilm

Large-scale Self-supervised Pre-training Across Tasks, Languages, and Modalities


Hiring

We are hiring at all levels (including FTE researchers and interns)! If you are interested in working with us on Foundation Models (aka large-scale pre-trained models) and General AI, including NLP, MT, Speech, Document AI, and Multimodal AI, please send your resume to fuwei@microsoft.com.

Foundation Architecture

TorchScale - A Library of Foundation Architectures (repo)

Fundamental research to develop new architectures for foundation models and AI, focusing on modeling generality and capability, as well as training stability and efficiency. A minimal usage sketch follows the list below.

Stability - DeepNet: scaling Transformers to 1,000 Layers and beyond

Generality - Foundation Transformers (Magneto): towards true general-purpose modeling across tasks and modalities (including language, vision, speech, and multimodal)

Capability - A Length-Extrapolatable Transformer

Efficiency & Transferability - X-MoE: scalable & finetunable sparse Mixture-of-Experts (MoE)

The Revolution of Model Architecture

BitNet: 1-bit Transformers for Large Language Models

RetNet: Retentive Network: A Successor to Transformer for Large Language Models

LongNet: Scaling Transformers to 1,000,000,000 Tokens

Foundation Models

The Evolution of (M)LLM (Multimodal LLM)

Kosmos-2.5: A Multimodal Literate Model

Kosmos-2: Grounding Multimodal Large Language Models to the World

Kosmos-1: A Multimodal Large Language Model (MLLM)

MetaLM: Language Models are General-Purpose Interfaces

The Big Convergence - Large-scale self-supervised pre-training across tasks (predictive and generative), languages (100+ languages), and modalities (language, image, audio, layout/format + language, vision + language, audio + language, etc.)

Language & Multilingual

UniLM: unified pre-training for language understanding and generation

InfoXLM/XLM-E: multilingual/cross-lingual pre-trained models for 100+ languages

DeltaLM/mT6: encoder-decoder pre-training for language generation and translation for 100+ languages

MiniLM: small and fast pre-trained models for language understanding and generation

AdaLM: domain, language, and task adaptation of pre-trained models

EdgeLM (NEW): small pre-trained models on edge/client devices

SimLM (NEW): large-scale pre-training for similarity matching

E5 (NEW): text embeddings (a usage sketch follows this list)

MiniLLM (NEW): Knowledge Distillation of Large Language Models
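
For a concrete sense of how the released embedding models are used, here is a hedged sketch of encoding texts with an E5 checkpoint from the Hugging Face Hub. The checkpoint id intfloat/e5-base-v2 and the example texts are assumptions for illustration, not part of this repo.

```python
# Hedged sketch: encode texts with a released E5 checkpoint via Hugging Face
# transformers. The model id below is an assumption; substitute any E5 release.
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

model_id = "intfloat/e5-base-v2"  # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)

# E5 expects role prefixes on its inputs: "query: " and "passage: ".
texts = [
    "query: what is self-supervised pre-training",
    "passage: E5 text embeddings are trained with weakly-supervised contrastive pre-training.",
]
batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    hidden = model(**batch).last_hidden_state

# Mean-pool over non-padding tokens, then L2-normalize for cosine similarity.
mask = batch["attention_mask"].unsqueeze(-1).float()
embeddings = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
embeddings = F.normalize(embeddings, p=2, dim=1)
print(embeddings @ embeddings.T)
```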

Vision

BEiT/BEiT-2: generative self-supervised pre-training for vision / BERT Pre-Training of Image Transformers (a usage sketch follows this list)

DiT: self-supervised pre-training for Document Image Transformers

TextDiffuser/TextDiffuser-2 (NEW): Diffusion Models as Text Painters
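
As an illustration of how the released vision checkpoints plug into downstream code, the sketch below classifies an image with a fine-tuned BEiT checkpoint through Hugging Face transformers; the checkpoint id and the image path are assumptions.

```python
# Hedged sketch: image classification with a fine-tuned BEiT checkpoint via
# Hugging Face transformers. The model id and image path are assumptions.
import torch
from PIL import Image
from transformers import BeitImageProcessor, BeitForImageClassification

model_id = "microsoft/beit-base-patch16-224"  # assumed ImageNet-1k fine-tuned release
processor = BeitImageProcessor.from_pretrained(model_id)
model = BeitForImageClassification.from_pretrained(model_id)

image = Image.open("example.jpg").convert("RGB")  # any RGB image
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

predicted = logits.argmax(dim=-1).item()
print(model.config.id2label[predicted])
```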

Speech

WavLM: speech pre-training for full stack tasks

VALL-E: a neural codec language model for TTS

Multimodal (X + Language)

LayoutLM/LayoutLMv2/LayoutLMv3: multimodal (text + layout/format + image) Document Foundation Model for Document AI (e.g. scanned documents, PDF, etc.)

LayoutXLM: multimodal (text + layout/format + image) Document Foundation Model for multilingual Document AI

MarkupLM: markup language model pre-training for visually-rich document understanding

XDoc: unified pre-training for cross-format document understanding

UniSpeech: unified pre-training for self-supervised learning and supervised learning for ASR

UniSpeech-SAT: universal speech representation learning with speaker-aware pre-training

SpeechT5: encoder-decoder pre-training for spoken language processing

SpeechLM: Enhanced Speech Pre-Training with Unpaired Textual Data

VLMo: Unified vision-language pre-training

VL-BEiT (NEW): Generative Vision-Language Pre-training - evolution of BEiT to multimodal

BEiT-3 (NEW): a general-purpose multimodal foundation model, and a major milestone of The Big Convergence of Large-scale Pre-training Across Tasks, Languages, and Modalities.

Toolkits

s2s-ft: sequence-to-sequence fine-tuning toolkit

Aggressive Decoding (NEW): lossless and efficient sequence-to-sequence decoding algorithm

Applications

TrOCR: transformer-based OCR w/ pre-trained models (a usage sketch follows this list)

LayoutReader: pre-training of text and layout for reading order detection

XLM-T: multilingual NMT w/ pretrained cross-lingual encoders
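
To show how the TrOCR releases are typically consumed, here is a hedged sketch of transcribing a single text-line image with a released checkpoint via Hugging Face transformers; the checkpoint id and image path are assumptions.

```python
# Hedged sketch: OCR on one text-line image with a released TrOCR checkpoint
# via Hugging Face transformers. Model id and image path are assumptions.
from PIL import Image
from transformers import TrOCRProcessor, VisionEncoderDecoderModel

model_id = "microsoft/trocr-base-handwritten"  # assumed released checkpoint
processor = TrOCRProcessor.from_pretrained(model_id)
model = VisionEncoderDecoderModel.from_pretrained(model_id)

image = Image.open("text_line.png").convert("RGB")  # a cropped line of text
pixel_values = processor(images=image, return_tensors="pt").pixel_values
generated_ids = model.generate(pixel_values)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])
```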

Links

LLMOps (repo)

General technology for enabling AI capabilities w/ LLMs and MLLMs.

RedStone (repo)

Curating General, Code, Math, and QA Data for Large Language Models.

News

  • December, 2024: RedStone was released!
  • December, 2023: LongNet and LongViT released
  • [Model Release] Dec, 2023: TextDiffuser-2 models, code and demo.
  • Sep, 2023: Kosmos-2.5 - a multimodal literate model for machine reading of text-intensive images.
  • [Model Release] May, 2023: TextDiffuser models and code.
  • [Model Release] March, 2023: BEiT-3 pretrained models and code.
  • March, 2023: Kosmos-1 - a Multimodal Large Language Model (MLLM) that can perceive general modalities, learn in context (i.e., few-shot), and follow instructions (i.e., zero-shot).
  • January, 2023: VALL-E - a language modeling approach for text-to-speech synthesis (TTS) that achieves state-of-the-art zero-shot TTS performance. See https://aka.ms/valle for demos of our work.
  • [Model Release] January, 2023: E5 - Text Embeddings by Weakly-Supervised Contrastive Pre-training.
  • November, 2022: TorchScale 0.1.1 was released!
  • November, 2022: TrOCR was accepted by AAAI 2023.
  • [Model Release] November, 2022: XDoc BASE models for cross-format document understanding.
  • [Model Release] September, 2022: TrOCR BASE and LARGE models for Scene Text Recognition (STR).
  • [Model Release] September, 2022: BEiT v2 code and pretrained models.
  • August, 2022: BEiT-3 - a general-purpose multimodal foundation model, which achieves state-of-the-art transfer performance on both vision and vision-language tasks.
  • July, 2022: SimLM - Large-scale self-supervised pre-training for similarity matching.
  • June, 2022: DiT and LayoutLMv3 were accepted by ACM Multimedia 2022.
  • June, 2022: MetaLM - Language models are general-purpose interfaces to foundation models (language/multilingual, vision, speech, and multimodal).
  • June, 2022: VL-BEiT - bidirectional multimodal Transformer learned from scratch with one unified pretraining task, one shared backbone, and one-stage training, supporting both vision and vision-language tasks.
  • [Model Release] June, 2022: LayoutLMv3 Chinese - Chinese version of LayoutLMv3.
  • [Code Release] May, 2022: Aggressive Decoding - Lossless Speedup for Seq2seq Generation.
  • April, 2022: Transformers at Scale = DeepNet + X-MoE.
  • [Model Release] April, 2022: LayoutLMv3 - Pre-training for Document AI with Unified Text and Image Masking.
  • [Model Release] March, 2022: EdgeFormer - Parameter-efficient Transformer for On-device Seq2seq Generation.
  • [Model Release] March, 2022: DiT - Self-supervised Document Image Transformer. Demos: Document Layout Analysis, Document Image Classification.
  • January, 2022: BEiT was accepted by ICLR 2022 as an oral presentation (54 out of 3391).
  • [Model Release] December 16th, 2021: TrOCR small models for handwritten and printed texts, with 3x inference speedup.
  • November 24th, 2021: VLMo as the new SOTA on the VQA Challenge.
  • November, 2021: Multilingual translation at scale: 10000 language pairs and beyond.
  • [Model Release] November, 2021: MarkupLM - Pre-training for text and markup language (e.g. HTML/XML).
  • [Model Release] November, 2021: VLMo - Unified vision-language pre-training w/ BEiT.
  • October, 2021: WavLM Large achieves state-of-the-art performance on the SUPERB benchmark.
  • [Model Release] October, 2021: WavLM - Large-scale self-supervised pre-trained models for speech.
  • [Model Release] October, 2021: TrOCR is on HuggingFace.
  • September 28th, 2021: T-ULRv5 (aka XLM-E/InfoXLM) as the SOTA on the XTREME leaderboard. // Blog
  • [Model Release] September, 2021: LayoutLM-cased models are on HuggingFace.
  • [Model Release] September, 2021: TrOCR - Transformer-based OCR w/ pre-trained BEiT and RoBERTa models.
  • August, 2021: LayoutLMv2 and LayoutXLM are on HuggingFace.
  • [Model Release] August, 2021: LayoutReader - Built with LayoutLM to improve general reading order detection.
  • [Model Release] August, 2021: DeltaLM - Encoder-decoder pre-training for language generation and translation.
  • August, 2021: BEiT is on HuggingFace.
  • [Model Release] July, 2021: BEiT - Towards the BERT moment for CV.
  • [Model Release] June, 2021: LayoutLMv2, LayoutXLM, MiniLMv2, and AdaLM.
  • May, 2021: LayoutLMv2, InfoXLMv2, MiniLMv2, UniLMv3, and AdaLM were accepted by ACL 2021.
  • April, 2021: LayoutXLM is coming, extending LayoutLM to multilingual support! A multilingual form understanding benchmark, XFUND, is also introduced, which includes forms with human-labeled key-value pairs in 7 languages (Chinese, Japanese, Spanish, French, Italian, German, Portuguese).
  • March, 2021: InfoXLM was accepted by NAACL 2021.
  • December 29th, 2020: LayoutLMv2 is coming with the new SOTA on a wide variety of document AI tasks, including the DocVQA and SROIE leaderboards.
  • October 8th, 2020: T-ULRv2 (aka InfoXLM) as the SOTA on the XTREME leaderboard. // Blog
  • September, 2020: MiniLM was accepted by NeurIPS 2020.
  • July 16, 2020: InfoXLM (Multilingual UniLM) arXiv.
  • June, 2020: UniLMv2 was accepted by ICML 2020; LayoutLM was accepted by KDD 2020.
  • April 5, 2020: Multilingual MiniLM released!
  • September, 2019: UniLMv1 was accepted by NeurIPS 2019.

License

This project is licensed under the license found in the LICENSE file in the root directory of this source tree. Portions of the source code are based on the transformers project.

Microsoft Open Source Code of Conduct

Contact Information

For help or issues using the pre-trained models, please submit a GitHub issue.

For other communications, please contact Furu Wei (fuwei@microsoft.com).

