Embeddings: State-of-the-art Text Representations for Natural Language Processing tasks. The initial version of the library focuses on the Polish language.
CLARIN-PL/embeddings
This library was used during the development of the LEPISZCZE benchmark (NeurIPS 2022).
```shell
pip install clarinpl-embeddings
```
Text classification with the polemo2 dataset and transformer-based embeddings
```python
from embeddings.pipeline.lightning_classification import LightningClassificationPipeline

pipeline = LightningClassificationPipeline(
    dataset_name_or_path="clarin-pl/polemo2-official",
    embedding_name_or_path="allegro/herbert-base-cased",
    input_column_name="text",
    target_column_name="target",
    output_path=".",
)

print(pipeline.run())
```
⚠️ For now, the default pipeline hyperparameters may give poor results. This is subject to change in future releases. We encourage users to use Optimized Pipelines to select appropriate hyperparameters.
We use many HuggingFace concepts, such as models (https://huggingface.co/models) and datasets (https://huggingface.co/datasets), to make our library as easy to use as possible. We want to enable users to create, customise, test, and execute NLP / NLU / SLU tasks as quickly as possible. Moreover, we provide easy-to-use static embeddings that were trained by CLARIN-PL.
We share predefined pipelines for common NLP tasks together with corresponding scripts. For Transformer-based pipelines we use PyTorch Lightning ⚡ trainers with Transformers AutoModels. For pipelines based on static embeddings we use the Flair library under the hood.
| Task | Class | Script |
|---|---|---|
| Text classification | LightningClassificationPipeline | evaluate_lightning_document_classification.py |
| Sequence labelling | LightningSequenceLabelingPipeline | evaluate_lightning_sequence_labeling.py |
An example with non-default arguments:

```shell
python evaluate_lightning_document_classification.py \
    --embedding-name-or-path allegro/herbert-base-cased \
    --dataset-name clarin-pl/polemo2-official \
    --input-columns-name text \
    --target-column-name target
```
An example with the default language model and dataset:

```shell
python evaluate_lightning_sequence_labeling.py
```
Most datasets in the HuggingFace repository should be compatible with our pipelines. The datasets listed below have been tested by the authors.
| dataset name | task type | input_column_name(s) | target_column_name | description |
|---|---|---|---|---|
| clarin-pl/kpwr-ner | sequence labeling (named entity recognition) | tokens | ner | KPWR-NER is a part of the Polish Corpus of Wrocław University of Technology (KPWr). Its objective is recognition of named entities, e.g., people, institutions etc. |
| clarin-pl/polemo2-official | classification (sentiment analysis) | text | target | A corpus of consumer reviews from 4 domains: medicine, hotels, products and school. |
| clarin-pl/2021-punctuation-restoration | punctuation restoration | text_in | text_out | Dataset contains original texts and ASR output. It is a part of PolEval 2021 Competition. |
| clarin-pl/nkjp-pos | sequence labeling (part-of-speech tagging) | tokens | pos_tags | NKJP-POS is a part of the National Corpus of Polish. Its objective is part-of-speech tagging, e.g., nouns, verbs, adjectives, adverbs, etc. |
| clarin-pl/aspectemo | sequence labeling (sentiment classification) | tokens | labels | AspectEmo Corpus is an extended version of a publicly available PolEmo 2.0 corpus of Polish customer reviews used in many projects on the use of different methods in sentiment analysis. |
| laugustyniak/political-advertising-pl | sequence labeling (political advertising) | tokens | tags | First publicly open dataset for detecting specific text chunks and categories of political advertising in the Polish language. |
| laugustyniak/abusive-clauses-pl | classification (abusive-clauses) | text | class | Dataset with Polish abusive clauses examples. |
| allegro/klej-dyk | pair classification (question answering)* | (question, answer) | target | The Did You Know (pol. Czy wiesz?) dataset consists of human-annotated question-answer pairs. |
| allegro/klej-psc | pair classification (text summarization)* | (extract_text, summary_text) | label | The Polish Summaries Corpus contains news articles and their summaries. |
| allegro/klej-cdsc-e | pair classification (textual entailment)* | (sentence_A, sentence_B) | entailment_judgment | Polish sentence pairs that are human-annotated for textual entailment. |
*Only the pair classification task is supported for now.
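When switching between the tested datasets, the column names from the table have to be passed to the pipeline constructor. A minimal sketch of a lookup helper is shown below. `TESTED_DATASETS` and `pipeline_kwargs` are hypothetical names used only for illustration and are not part of the library. Only the single-column classification and sequence labeling datasets are included, because how column pairs are passed is not shown in this README.

```python
# Column names copied from the table of tested datasets.
# TESTED_DATASETS and pipeline_kwargs are hypothetical, not library API.
TESTED_DATASETS = {
    "clarin-pl/kpwr-ner": ("tokens", "ner"),
    "clarin-pl/polemo2-official": ("text", "target"),
    "clarin-pl/nkjp-pos": ("tokens", "pos_tags"),
    "clarin-pl/aspectemo": ("tokens", "labels"),
    "laugustyniak/political-advertising-pl": ("tokens", "tags"),
    "laugustyniak/abusive-clauses-pl": ("text", "class"),
}


def pipeline_kwargs(dataset_name, embedding_name="allegro/herbert-base-cased", output_path="."):
    """Build the keyword arguments used by the pipeline examples in this README."""
    if dataset_name not in TESTED_DATASETS:
        raise KeyError(f"{dataset_name!r} is not in the table of tested datasets")
    input_column, target_column = TESTED_DATASETS[dataset_name]
    return {
        "dataset_name_or_path": dataset_name,
        "embedding_name_or_path": embedding_name,
        "input_column_name": input_column,
        "target_column_name": target_column,
        "output_path": output_path,
    }
```

The resulting dictionary could then be unpacked into the matching pipeline class, for example `LightningSequenceLabelingPipeline(**pipeline_kwargs("clarin-pl/kpwr-ner"))`, assuming that class accepts the same argument names as the classification pipeline.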
Model and training parameters can be controlled via the task_model_kwargs and task_train_kwargs parameters, which can be populated using the advanced config. A tutorial on how to use configs can be found in the /tutorials directory of the repository. Two types of config are defined in our library: BasicConfig and AdvancedConfig. In summary, the BasicConfig takes arguments and automatically assigns them to the proper keyword group, while the AdvancedConfig takes as input keyword groups that must already be correctly mapped.
The available configs are listed below:
- LightningBasicConfig
- LightningAdvancedConfig
```python
from embeddings.config.lightning_config import LightningBasicConfig
from embeddings.pipeline.lightning_classification import LightningClassificationPipeline

config = LightningBasicConfig(
    learning_rate=0.01,
    max_epochs=1,
    max_seq_length=128,
    finetune_last_n_layers=0,
    accelerator="cpu",
)

pipeline = LightningClassificationPipeline(
    embedding_name_or_path="allegro/herbert-base-cased",
    dataset_name_or_path="clarin-pl/polemo2-official",
    input_column_name=["text"],
    target_column_name="target",
    load_dataset_kwargs={
        "train_domains": ["hotels", "medicine"],
        "dev_domains": ["hotels", "medicine"],
        "test_domains": ["hotels", "medicine"],
        "text_cfg": "text",
    },
    output_path=".",
    config=config,
)
```
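To illustrate what "automatically assigning arguments to the proper keyword group" could look like, here is a minimal standalone sketch. It is not the library's implementation: `ARGUMENT_GROUPS` and `group_arguments` are hypothetical names, and the routing of the three arguments is an assumption based on where they appear in the advanced config example in this README.

```python
from typing import Any, Dict

# Hypothetical routing table: which flat argument belongs to which keyword group.
# The real mapping lives inside the library and may differ.
ARGUMENT_GROUPS = {
    "learning_rate": "task_model_kwargs",
    "max_epochs": "task_train_kwargs",
    "accelerator": "task_train_kwargs",
}


def group_arguments(**flat: Any) -> Dict[str, Dict[str, Any]]:
    """Sort flat arguments into keyword groups, as a basic config is described to do."""
    groups: Dict[str, Dict[str, Any]] = {"task_model_kwargs": {}, "task_train_kwargs": {}}
    for name, value in flat.items():
        if name not in ARGUMENT_GROUPS:
            raise ValueError(f"no keyword group known for argument {name!r}")
        groups[ARGUMENT_GROUPS[name]][name] = value
    return groups


grouped = group_arguments(learning_rate=0.01, max_epochs=1, accelerator="cpu")
```

Here `grouped` holds the learning rate under `task_model_kwargs` and the two trainer settings under `task_train_kwargs`, which is the shape an advanced config expects to receive directly.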
You can also define an advanced config with populated keyword arguments. In general, the keywords are passed to the corresponding objects when a specific pipeline is constructed. You can trace where each keyword group is passed to find the arguments that can be set in the config kwargs.
```python
from embeddings.config.lightning_config import LightningAdvancedConfig

config = LightningAdvancedConfig(
    finetune_last_n_layers=0,
    task_train_kwargs={
        "max_epochs": 1,
        "devices": "auto",
        "accelerator": "cpu",
        "deterministic": True,
    },
    task_model_kwargs={
        "learning_rate": 5e-4,
        "use_scheduler": False,
        "optimizer": "AdamW",
        "adam_epsilon": 1e-8,
        "warmup_steps": 100,
        "weight_decay": 0.0,
    },
    datamodule_kwargs={
        "downsample_train": 0.01,
        "downsample_val": 0.01,
        "downsample_test": 0.05,
    },
    dataloader_kwargs={"num_workers": 0},
)
```
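One way to trace which keyword arguments a target object accepts is Python's standard `inspect.signature`. The sketch below uses a hypothetical stand-in class, `ExampleTrainer`, because which classes receive each keyword group is not spelled out in this README. The helper name `accepted_keywords` is likewise an assumption made for illustration.

```python
import inspect


class ExampleTrainer:
    """Hypothetical stand-in for an object that receives task_train_kwargs."""

    def __init__(self, max_epochs=1, devices="auto", accelerator="cpu", deterministic=False):
        self.max_epochs = max_epochs
        self.devices = devices
        self.accelerator = accelerator
        self.deterministic = deterministic


def accepted_keywords(callable_obj):
    """Map each keyword-capable parameter of callable_obj to its default value."""
    keyword_kinds = (
        inspect.Parameter.POSITIONAL_OR_KEYWORD,
        inspect.Parameter.KEYWORD_ONLY,
    )
    parameters = inspect.signature(callable_obj).parameters
    return {
        name: parameter.default
        for name, parameter in parameters.items()
        if parameter.kind in keyword_kinds
    }


trainer_keywords = accepted_keywords(ExampleTrainer)
```

If, for instance, task_train_kwargs are forwarded to a PyTorch Lightning trainer (an assumption based on the description of the Transformer-based pipelines), the same helper could be applied to that class to list candidate keys. Parameters without a default are reported with the `inspect.Parameter.empty` sentinel.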
Instead of the allegro/herbert-base-cased model, the user can pass any model from the HuggingFace Hub that is compatible with Transformers or with our library.
| Embedding | Type | Description |
|---|---|---|
| clarin-pl/herbert-kgr10 | bert | HerBERT Large trained on supplementary data - the KGR10 corpus. |
| … |
| Task | Optimized Pipeline |
|---|---|
| Lightning Text Classification | OptimizedLightingClassificationPipeline |
| Lightning Sequence Labeling | OptimizedLightingSequenceLabelingPipeline |
Optimized pipelines can be run with the following snippet:
```python
from embeddings.config.lighting_config_space import LightingTextClassificationConfigSpace
from embeddings.pipeline.lightning_hps_pipeline import OptimizedLightingClassificationPipeline

pipeline = OptimizedLightingClassificationPipeline(
    config_space=LightingTextClassificationConfigSpace(
        embedding_name_or_path="allegro/herbert-base-cased"
    ),
    dataset_name_or_path="clarin-pl/polemo2-official",
    input_column_name="text",
    target_column_name="target",
).persisting(best_params_path="best_prams.yaml", log_path="hps_log.pickle")

df, metadata = pipeline.run()
```
After the parameter search, we can train the model with the best parameters found. First, however, we have to set the output_path parameter, which is not automatically generated by OptimizedLightingClassificationPipeline.
```python
metadata["output_path"] = "."
```

Now we are able to train the pipeline:

```python
from embeddings.pipeline.lightning_classification import LightningClassificationPipeline

pipeline = LightningClassificationPipeline(**metadata)
results = pipeline.run()
```
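If the search metadata should stay untouched, for example because it is reused or logged elsewhere, the output path can be added to a copy instead. The helper below is a hypothetical convenience function, not part of the library, and the keys in the sample dictionary are assumptions about what the metadata might contain.

```python
def with_output_path(metadata, output_path="."):
    """Return a copy of the search metadata with output_path filled in if missing."""
    completed = dict(metadata)  # shallow copy, the original dict is not modified
    completed.setdefault("output_path", output_path)
    return completed


# Stand-in for the metadata returned by the optimized pipeline (keys are assumed).
search_metadata = {
    "embedding_name_or_path": "allegro/herbert-base-cased",
    "dataset_name_or_path": "clarin-pl/polemo2-official",
}
train_kwargs = with_output_path(search_metadata, output_path=".")
```

The completed dictionary can then be unpacked into the pipeline constructor in the same way as `metadata` above.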
Instead of searching with a single embedding model, we can search over multiple embedding models by passing them as a list to the ConfigSpace.
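```python
from embeddings.config.lighting_config_space import LightingTextClassificationConfigSpace
from embeddings.pipeline.lightning_hps_pipeline import OptimizedLightingClassificationPipeline

pipeline = OptimizedLightingClassificationPipeline(
    config_space=LightingTextClassificationConfigSpace(
        # A list of models makes the embedding itself part of the search space.
        embedding_name_or_path=[
            "allegro/herbert-base-cased",
            "clarin-pl/roberta-polish-kgr10",
        ]
    ),
    dataset_name_or_path="clarin-pl/polemo2-official",
    input_column_name="text",
    target_column_name="target",
).persisting(best_params_path="best_prams.yaml", log_path="hps_log.pickle")
```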
The paper describing the library is available on arXiv or in the proceedings of NeurIPS 2022.
```bibtex
@inproceedings{augustyniak2022lepiszcze,
  author = {Augustyniak, Lukasz and Tagowski, Kamil and Sawczyn, Albert and Janiak, Denis and Bartusiak, Roman and Szymczak, Adrian and Janz, Arkadiusz and Szyma\'{n}ski, Piotr and W\k{a}troba, Marcin and Morzy, Miko\l aj and Kajdanowicz, Tomasz and Piasecki, Maciej},
  booktitle = {Advances in Neural Information Processing Systems},
  editor = {S. Koyejo and S. Mohamed and A. Agarwal and D. Belgrave and K. Cho and A. Oh},
  pages = {21805--21818},
  publisher = {Curran Associates, Inc.},
  title = {This is the way: designing and compiling LEPISZCZE, a comprehensive NLP benchmark for Polish},
  url = {https://proceedings.neurips.cc/paper_files/paper/2022/file/890b206ebb79e550f3988cb8db936f42-Paper-Datasets_and_Benchmarks.pdf},
  volume = {35},
  year = {2022}
}
```