License

MIT license

CLARIN Embeddings

State-of-the-art Text Representations for Natural Language Processing tasks. The initial version of the library focuses on the Polish language.

This library was used during the development of the LEPISZCZE benchmark (NeurIPS 2022).

Installation

pip install clarinpl-embeddings

Example

Text classification with the polemo2 dataset and transformer-based embeddings

from embeddings.pipeline.lightning_classification import LightningClassificationPipeline

pipeline = LightningClassificationPipeline(
    dataset_name_or_path="clarin-pl/polemo2-official",
    embedding_name_or_path="allegro/herbert-base-cased",
    input_column_name="text",
    target_column_name="target",
    output_path=".",
)

print(pipeline.run())

⚠️ For now, the default pipeline hyperparameters may give poor results. This is subject to change in future releases. We encourage users to use Optimized Pipelines to select appropriate hyperparameters.

Conventions

We use many HuggingFace concepts, such as models (https://huggingface.co/models) and datasets (https://huggingface.co/datasets), to make our library as easy to use as possible. We want to enable users to create, customise, test, and execute NLP / NLU / SLU tasks as quickly as possible. Moreover, we provide easy-to-use static embeddings that were trained by CLARIN-PL.

Pipelines

We share predefined pipelines for common NLP tasks, together with corresponding scripts. For Transformer-based pipelines we use PyTorch Lightning ⚡ trainers with Transformers AutoModels. For static-embedding-based pipelines we use the Flair library under the hood.

Transformer-embedding-based pipelines (e.g. BERT, RoBERTa, HerBERT):

Task	Class	Script
Text classification	LightningClassificationPipeline	evaluate_lightning_document_classification.py
Sequence labelling	LightningSequenceLabelingPipeline	evaluate_lightning_sequence_labeling.py

Run classification task

The example with non-default arguments

python evaluate_lightning_document_classification.py \
    --embedding-name-or-path allegro/herbert-base-cased \
    --dataset-name clarin-pl/polemo2-official \
    --input-columns-name text \
    --target-column-name target

Run sequence labeling task

The example with default language model and dataset.

python evaluate_lightning_sequence_labeling.py

Compatible datasets

Most datasets in the HuggingFace repository should be compatible with our pipelines. The datasets listed below have been tested by the authors.

dataset name	task type	input_column_name(s)	target_column_name	description
clarin-pl/kpwr-ner	sequence labeling (named entity recognition)	tokens	ner	KPWR-NER is a part of the Polish Corpus of Wrocław University of Technology (KPWr). Its objective is recognition of named entities, e.g., people, institutions etc.
clarin-pl/polemo2-official	classification (sentiment analysis)	text	target	A corpus of consumer reviews from 4 domains: medicine, hotels, products and school.
clarin-pl/2021-punctuation-restoration	punctuation restoration	text_in	text_out	Dataset contains original texts and ASR output. It is a part of PolEval 2021 Competition.
clarin-pl/nkjp-pos	sequence labeling (part-of-speech tagging)	tokens	pos_tags	NKJP-POS is a part of the National Corpus of Polish. Its objective is part-of-speech tagging, e.g., nouns, verbs, adjectives, adverbs, etc.
clarin-pl/aspectemo	sequence labeling (sentiment classification)	tokens	labels	AspectEmo Corpus is an extended version of a publicly available PolEmo 2.0 corpus of Polish customer reviews used in many projects on the use of different methods in sentiment analysis.
laugustyniak/political-advertising-pl	sequence labeling (political advertising)	tokens	tags	First publicly open dataset for detecting specific text chunks and categories of political advertising in the Polish language.
laugustyniak/abusive-clauses-pl	classification (abusive-clauses)	text	class	Dataset with Polish abusive clauses examples.
allegro/klej-dyk	pair classification (question answering)*	(question, answer)	target	The Did You Know (pol. Czy wiesz?) dataset consists of human-annotated question-answer pairs.
allegro/klej-psc	pair classification (text summarization)*	(extract_text, summary_text)	label	The Polish Summaries Corpus contains news articles and their summaries.
allegro/klej-cdsc-e	pair classification (textual entailment)*	(sentence_A, sentence_B)	entailment_judgment	Polish sentence pairs that are human-annotated for textual entailment.

* Only the pair classification task is supported for now.
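As a rough illustration, the column names in the table above can be kept in a lookup so that pipeline arguments do not have to be retyped for each dataset. The sketch below only builds a dictionary of keyword arguments and does not import the library. The helper `pipeline_kwargs` and the constant `TESTED_DATASETS` are hypothetical names introduced here, and only the single-column classification and sequence labeling rows are included.

```python
from typing import Dict

# Column names copied from the table above. The lookup and the helper are
# hypothetical conveniences, not part of the library.
TESTED_DATASETS: Dict[str, Dict[str, str]] = {
    "clarin-pl/kpwr-ner": {"input_column_name": "tokens", "target_column_name": "ner"},
    "clarin-pl/polemo2-official": {"input_column_name": "text", "target_column_name": "target"},
    "clarin-pl/nkjp-pos": {"input_column_name": "tokens", "target_column_name": "pos_tags"},
    "clarin-pl/aspectemo": {"input_column_name": "tokens", "target_column_name": "labels"},
    "laugustyniak/political-advertising-pl": {"input_column_name": "tokens", "target_column_name": "tags"},
    "laugustyniak/abusive-clauses-pl": {"input_column_name": "text", "target_column_name": "class"},
}


def pipeline_kwargs(
    dataset_name: str,
    embedding_name_or_path: str = "allegro/herbert-base-cased",
    output_path: str = ".",
) -> Dict[str, str]:
    """Build keyword arguments in the shape used by the pipeline examples."""
    if dataset_name not in TESTED_DATASETS:
        raise KeyError(f"no column mapping recorded for {dataset_name!r}")
    return {
        "dataset_name_or_path": dataset_name,
        "embedding_name_or_path": embedding_name_or_path,
        "output_path": output_path,
        **TESTED_DATASETS[dataset_name],
    }


ner_kwargs = pipeline_kwargs("clarin-pl/kpwr-ner")
print(ner_kwargs)
```

The resulting dictionary could then be unpacked into a pipeline constructor that accepts these argument names, for example `LightningClassificationPipeline(**pipeline_kwargs("clarin-pl/polemo2-official"))` for a classification dataset.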

Passing task model and task training parameters to predefined flair pipelines

Model and training parameters can be controlled via the task_model_kwargs and task_train_kwargs parameters, which can be populated using the advanced config. A tutorial on how to use configs can be found in the /tutorials directory of the repository. Two types of config are defined in our library: BasicConfig and AdvancedConfig. In summary, BasicConfig takes arguments and automatically assigns them to the proper keyword group, while AdvancedConfig takes as input keyword groups that should already be correctly mapped.

The available configs are listed below:

Lightning:

LightningBasicConfig
LightningAdvancedConfig
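The difference between the two config types can be pictured with a small, library-free sketch: flat arguments go in, keyword groups come out. The routing table below is an assumption modelled on the keyword groups that appear in this README's LightningAdvancedConfig example. The function `group_flat_arguments` is a hypothetical name, and the library's real mapping may differ.

```python
from typing import Any, Dict, Set

# Hypothetical routing table, modelled on the keyword groups shown in this
# README's LightningAdvancedConfig example; not the library's own mapping.
KEYWORD_GROUPS: Dict[str, Set[str]] = {
    "task_train_kwargs": {"max_epochs", "devices", "accelerator", "deterministic"},
    "task_model_kwargs": {
        "learning_rate", "use_scheduler", "optimizer",
        "adam_epsilon", "warmup_steps", "weight_decay",
    },
    "datamodule_kwargs": {"downsample_train", "downsample_val", "downsample_test"},
    "dataloader_kwargs": {"num_workers"},
}


def group_flat_arguments(**flat: Any) -> Dict[str, Dict[str, Any]]:
    """Assign each flat argument to the keyword group that lists its name."""
    grouped: Dict[str, Dict[str, Any]] = {group: {} for group in KEYWORD_GROUPS}
    for name, value in flat.items():
        for group, members in KEYWORD_GROUPS.items():
            if name in members:
                grouped[group][name] = value
                break
        else:
            # No group claims this name, so fail loudly instead of dropping it.
            raise KeyError(f"unknown argument: {name}")
    return grouped


grouped = group_flat_arguments(learning_rate=0.01, max_epochs=1, accelerator="cpu")
print(grouped)
```

In this picture, a basic config plays the role of `group_flat_arguments`, while an advanced config corresponds to writing the grouped dictionaries by hand.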

Example with `polemo2` dataset

Lightning pipeline

from embeddings.config.lightning_config import LightningBasicConfig
from embeddings.pipeline.lightning_classification import LightningClassificationPipeline

config = LightningBasicConfig(
    learning_rate=0.01,
    max_epochs=1,
    max_seq_length=128,
    finetune_last_n_layers=0,
    accelerator="cpu",
)

pipeline = LightningClassificationPipeline(
    embedding_name_or_path="allegro/herbert-base-cased",
    dataset_name_or_path="clarin-pl/polemo2-official",
    input_column_name=["text"],
    target_column_name="target",
    load_dataset_kwargs={
        "train_domains": ["hotels", "medicine"],
        "dev_domains": ["hotels", "medicine"],
        "test_domains": ["hotels", "medicine"],
        "text_cfg": "text",
    },
    output_path=".",
    config=config,
)

You can also define an advanced config with populated keyword arguments. In general, the keywords are passed to the object when constructing specific pipelines. You can trace the keyword arguments through the code to find which arguments can be set in the config kwargs.

from embeddings.config.lightning_config import LightningAdvancedConfig

config = LightningAdvancedConfig(
    finetune_last_n_layers=0,
    task_train_kwargs={
        "max_epochs": 1,
        "devices": "auto",
        "accelerator": "cpu",
        "deterministic": True,
    },
    task_model_kwargs={
        "learning_rate": 5e-4,
        "use_scheduler": False,
        "optimizer": "AdamW",
        "adam_epsilon": 1e-8,
        "warmup_steps": 100,
        "weight_decay": 0.0,
    },
    datamodule_kwargs={
        "downsample_train": 0.01,
        "downsample_val": 0.01,
        "downsample_test": 0.05,
    },
    dataloader_kwargs={"num_workers": 0},
)

Available embedding models for Polish

Instead of the allegro/herbert-base-cased model, users can pass any model from the HuggingFace Hub that is compatible with Transformers or with our library.

Embedding	Type	Description
clarin-pl/herbert-kgr10	bert	HerBERT Large trained on supplementary data - the KGR10 corpus.
…

Optimized pipelines

Transformers embeddings

Task	Optimized Pipeline
Lightning Text Classification	OptimizedLightingClassificationPipeline
Lightning Sequence Labeling	OptimizedLightingSequenceLabelingPipeline

Example with Text Classification

Optimized pipelines can be run with the following snippet of code:

from embeddings.config.lighting_config_space import LightingTextClassificationConfigSpace
from embeddings.pipeline.lightning_hps_pipeline import OptimizedLightingClassificationPipeline

pipeline = OptimizedLightingClassificationPipeline(
    config_space=LightingTextClassificationConfigSpace(
        embedding_name_or_path="allegro/herbert-base-cased"
    ),
    dataset_name_or_path="clarin-pl/polemo2-official",
    input_column_name="text",
    target_column_name="target",
).persisting(best_params_path="best_prams.yaml", log_path="hps_log.pickle")

df, metadata = pipeline.run()

Training model with obtained parameters

After the parameter search, we can train the model with the best parameters found. First, however, we have to set the output_path parameter, which is not automatically generated by OptimizedLightingClassificationPipeline.

metadata["output_path"] = "."

Now we are able to train the pipeline

from embeddings.pipeline.lightning_classification import LightningClassificationPipeline

pipeline = LightningClassificationPipeline(**metadata)
results = pipeline.run()
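The metadata step described above can be sketched without the library: copy the dictionary returned by the search, fill in output_path, and check that the expected arguments are present before unpacking it. The helper `prepare_pipeline_kwargs`, the `REQUIRED_KEYS` tuple, and the example metadata are all hypothetical. The key names are taken from the LightningClassificationPipeline calls in this README, and treating them as the required set is an assumption.

```python
from typing import Any, Dict, List

# Argument names taken from the LightningClassificationPipeline calls in this
# README; treating them as the required set is an assumption.
REQUIRED_KEYS = (
    "embedding_name_or_path",
    "dataset_name_or_path",
    "input_column_name",
    "target_column_name",
    "output_path",
)


def prepare_pipeline_kwargs(metadata: Dict[str, Any], output_path: str = ".") -> Dict[str, Any]:
    """Return a copy of metadata with output_path set, failing early on gaps."""
    kwargs = dict(metadata)  # copy, so the caller's dictionary is left untouched
    kwargs.setdefault("output_path", output_path)
    missing: List[str] = [key for key in REQUIRED_KEYS if key not in kwargs]
    if missing:
        raise ValueError(f"metadata is missing: {missing}")
    return kwargs


# Hypothetical stand-in for the metadata returned by the parameter search.
example_metadata = {
    "embedding_name_or_path": "allegro/herbert-base-cased",
    "dataset_name_or_path": "clarin-pl/polemo2-official",
    "input_column_name": "text",
    "target_column_name": "target",
}

kwargs = prepare_pipeline_kwargs(example_metadata)
print(kwargs["output_path"])
```

Working on a copy keeps the search result unchanged, which may help if the same metadata is reused with several output directories.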

Selection of best embedding model

Instead of searching with a single embedding model, we can search across multiple embedding models by passing them as a list to the ConfigSpace.

pipeline = OptimizedLightingClassificationPipeline(
    config_space=LightingTextClassificationConfigSpace(
        embedding_name_or_path=[
            "allegro/herbert-base-cased",
            "clarin-pl/roberta-polish-kgr10",
        ]
    ),
    dataset_name_or_path="clarin-pl/polemo2-official",
    input_column_name="text",
    target_column_name="target",
).persisting(best_params_path="best_prams.yaml", log_path="hps_log.pickle")

Citation

The paper describing the library is available on arXiv or in the proceedings of NeurIPS 2022.

@inproceedings{augustyniak2022lepiszcze,
  author = {Augustyniak, Lukasz and Tagowski, Kamil and Sawczyn, Albert and Janiak, Denis and Bartusiak, Roman and Szymczak, Adrian and Janz, Arkadiusz and Szyma\'{n}ski, Piotr and W\k{a}troba, Marcin and Morzy, Miko\l aj and Kajdanowicz, Tomasz and Piasecki, Maciej},
  booktitle = {Advances in Neural Information Processing Systems},
  editor = {S. Koyejo and S. Mohamed and A. Agarwal and D. Belgrave and K. Cho and A. Oh},
  pages = {21805--21818},
  publisher = {Curran Associates, Inc.},
  title = {This is the way: designing and compiling LEPISZCZE, a comprehensive NLP benchmark for Polish},
  url = {https://proceedings.neurips.cc/paper_files/paper/2022/file/890b206ebb79e550f3988cb8db936f42-Paper-Datasets_and_Benchmarks.pdf},
  volume = {35},
  year = {2022}
}

About

clarin-pl.github.io/embeddings/
