CAMeL-Lab/camel_parser

License

MIT license

CamelParser

Introduction

CamelParser is an open-source Python-based Arabic dependency parser targeting two popular Arabic dependency formalisms, the Columbia Arabic Treebank (CATiB) and Universal Dependencies (UD).

The CamelParser pipeline processes raw text and produces tokenization, part-of-speech tags, and rich morphological features. For disambiguation, users can choose between the BERT unfactored disambiguator and a lighter Maximum Likelihood Estimation (MLE) disambiguator, both of which are included in CAMeL Tools. For dependency parsing, we use the SuPar Biaffine Dependency Parser.

Installation

Clone this repo
Set up a virtual environment using Python 3.11.13 (you can follow the tutorial here).

Currently, CamelParser does not work with later versions of Python due to issues with some dependencies.

Install the required packages:

pip install -r requirements.txt

Download dependency parsing models:

python download_models.py

Currently, two Arabic script models, CATiB and UD, will be downloaded from the CAMeL Lab's parser models collection on Hugging Face. More models will be added soon!
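The Python version constraint noted in the installation steps can be checked before installing anything. The following is a minimal sketch, not part of CamelParser: the function name `is_supported_python` is hypothetical, and it assumes that "later versions" means Python 3.12 and newer.

```python
import sys

# Assumption: "later versions" means Python 3.12 and newer; the README only
# names 3.11.13 as the version to use.
MAX_SUPPORTED = (3, 11)


def is_supported_python(version_info=None):
    """Return True if the major.minor version is not newer than 3.11."""
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:2]) <= MAX_SUPPORTED


if __name__ == "__main__":
    if not is_supported_python():
        print("Warning: this Python version is newer than 3.11; "
              "CamelParser dependencies may fail to install.")
```

The check only enforces the upper bound, because the README does not say which earlier versions are supported.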

Examples

CamelParser accepts either a string or a file containing one or more sentences. Below are examples using the different string inputs that CamelParser accepts. We pass each example as a string using -s. However, we recommend using the file method (-i) along with the path to the file when passing multiple sentences.

You can also refer to sample_starting_point.py to use the parser in your code, or to the following scripts for more advanced usage:

text_to_conll_cli.py
handle_multiple_texts.py
handle_multiple_conll_files.py

Passing text

python text_to_conll_cli.py -f text -s "جامعة نيويورك أبو ظبي تنشر أول أطلس لكوكب المريخ باللغة العربية."

The verbose version of the above example (default values are shown):

python text_to_conll_cli.py -f text -b r13 -d bert -m catib -s "جامعة نيويورك أبو ظبي تنشر أول أطلس لكوكب المريخ باللغة العربية."
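When the command is assembled from another Python program, building the argument list explicitly avoids shell quoting problems with Arabic text. The sketch below is a hypothetical helper, not part of the repository. The flags and default values are the ones shown in the verbose example, and the helper only builds the list without running it.

```python
import sys


def build_command(sentence, file_type="text", db="r13",
                  disambiguator="bert", model="catib"):
    """Build the argument list for text_to_conll_cli.py.

    The flags (-f, -b, -d, -m, -s) and their defaults are taken from the
    verbose example in this README; the helper itself is hypothetical.
    """
    return [
        sys.executable, "text_to_conll_cli.py",
        "-f", file_type,
        "-b", db,
        "-d", disambiguator,
        "-m", model,
        "-s", sentence,
    ]


# Example: the same sentence as above, with all defaults.
cmd = build_command(
    "جامعة نيويورك أبو ظبي تنشر أول أطلس لكوكب المريخ باللغة العربية.")
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` from the repository root once the models have been downloaded.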

Passing preprocessed text (cleaned and whitespace tokenized)

python text_to_conll_cli.py -f preprocessed_text -s "جامعة نيويورك أبو ظبي تنشر أول أطلس لكوكب المريخ باللغة العربية ."

Note that the difference between the -f text and -f preprocessed_text input settings is that for text we use several utilities from CAMeL Tools to normalize Unicode, dediacritize, clean the text using arclean, and perform whitespace tokenization.
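To make the difference concrete, here is a rough standard-library approximation of those steps. It is not the CAMeL Tools implementation: it skips the arclean character mapping entirely, the diacritic range and token pattern are assumptions, and the name `rough_preprocess` is hypothetical. It only illustrates why the preprocessed example above has a space before the final period.

```python
import re
import unicodedata

# Assumption: diacritics are the Arabic short vowels, tanween, shadda and
# sukun (U+064B to U+0652) plus superscript alef (U+0670).
_DIACRITICS = re.compile("[\u064B-\u0652\u0670]")
# A token is either a run of word characters or a single punctuation mark.
_TOKEN = re.compile(r"\w+|[^\w\s]")


def rough_preprocess(text):
    """Approximate the text -> preprocessed_text steps (without arclean)."""
    text = unicodedata.normalize("NFKC", text)  # normalize Unicode
    text = _DIACRITICS.sub("", text)            # remove diacritics
    return " ".join(_TOKEN.findall(text))       # separate words and punctuation
```

For real input, prefer the CAMeL Tools utilities that the pipeline itself uses, since their behavior may differ from this sketch in edge cases.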

tokenized is used when 1) the text has already been tokenized, and 2) only dependency relations are needed;the POS tags and features will not be generated.

python text_to_conll_cli.py -f tokenized -s "جامعة نيويورك أبو ظبي تنشر أول أطلس ل+ كوكب المريخ ب+ اللغة العربية ."

tokenized_tagged is used when the user has the tokens and POS tags. They should be passed as tuples.

python text_to_conll_cli.py -f tokenized_tagged -s "(جامعة, NOM) (نيويورك, PROP) (أبو, PROP) (ظبي, PROP) (تنشر, VRB) (أول, NOM) (أطلس, NOM) (ل+, PRT) (كوكب, NOM) (المريخ, PROP) (ب+, PRT) (اللغة, NOM) (العربية, NOM) (., PNX)"
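If the tokens and tags already live in Python data structures, the tuple string can be generated instead of typed by hand. The helper below is a hypothetical sketch that mirrors the format of the example above: each pair is written as `(token, TAG)` and pairs are separated by single spaces.

```python
def format_tagged(pairs):
    """Join (token, tag) pairs as "(token, TAG) (token, TAG) ..."."""
    return " ".join(f"({token}, {tag})" for token, tag in pairs)


# A fragment of the sentence used in the examples above.
pairs = [("ل+", "PRT"), ("كوكب", "NOM"), ("المريخ", "PROP"), (".", "PNX")]
tagged_input = format_tagged(pairs)
```

The resulting string is what would be passed after -s together with -f tokenized_tagged.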

Using a custom model

You can use your own dependency parser model as follows:

Place the model in the models directory (this directory is created when you run download_models.py, but you can also create it yourself).
When running one of the scripts, add -m [model_name]. Use only the model name, WITHOUT the path.
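The rule about passing only the model name can be enforced with a small helper. This is a hypothetical sketch: `model_args` is not part of the repository, the file name `my_parser.model` in the note below is made up, and it assumes the name passed to -m is the file name as stored in the models directory.

```python
from pathlib import Path


def model_args(model_path, models_dir="models"):
    """Return ["-m", name] for a model file stored directly in models_dir."""
    path = Path(model_path)
    # Reject files that are not placed directly inside the models directory.
    if path.parent.name != Path(models_dir).name:
        raise ValueError(f"{path} is not inside the {models_dir} directory")
    # Only the bare name is passed to -m, never the path.
    return ["-m", path.name]
```

For example, `model_args("models/my_parser.model")` gives `["-m", "my_parser.model"]`, while a path outside the models directory raises an error.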

Extending the code

You can also use different parts of the code to create your own pipeline. handle_multiple_texts.py is an example of that. It can be used to parse a directory of text files, saving the resulting CoNLL-X files to a given output directory.
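As an illustration of the directory-to-directory idea, the sketch below drives text_to_conll_cli.py once per file. It is not how handle_multiple_texts.py is implemented. The function names, the `.txt` input pattern, and the `.conllx` output suffix are assumptions, as is the idea that the CLI prints its CoNLL-X result to standard output.

```python
import subprocess
import sys
from pathlib import Path


def plan_outputs(input_dir, output_dir, suffix=".conllx"):
    """Pair every .txt file in input_dir with an output path in output_dir."""
    input_dir, output_dir = Path(input_dir), Path(output_dir)
    return [(src, output_dir / (src.stem + suffix))
            for src in sorted(input_dir.glob("*.txt"))]


def parse_directory(input_dir, output_dir):
    """Run text_to_conll_cli.py once per file and save what it prints."""
    Path(output_dir).mkdir(parents=True, exist_ok=True)
    for src, dst in plan_outputs(input_dir, output_dir):
        # Assumption: the CLI writes the CoNLL-X result to standard output.
        result = subprocess.run(
            [sys.executable, "text_to_conll_cli.py",
             "-f", "text", "-i", str(src)],
            capture_output=True, text=True, check=True)
        dst.write_text(result.stdout, encoding="utf-8")
```

Starting a new process per file reloads the models every time, so for large collections a script that loads them once is likely the better starting point.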

Using another morphology database

Currently, CamelParser uses the default morphology database of CAMeL Tools, morphology-db-msa-r13.

For our paper, we used the calima-msa-s31 database. To use this database, follow these steps (note that you need an account with the LDC):

Install camel_tools v1.5.6 or later (you can check this using camel_data -v)
Download the camel data for the BERT unfactored (MSA) model, as well as the morphology database:

camel_data -i morphology-db-msa-s31
camel_data -i disambig-bert-unfactored-msa

Download LDC2010L01 from the LDC downloads page:
- go to https://catalog.ldc.upenn.edu/organization/downloads
- search for LDC2010L01.tgz and download it
DO NOT EXTRACT LDC2010L01.tgz! We'll use the following command from camel tools to install the db:

camel_data -p morphology-db-msa-s31 /path/to/LDC2010L01.tgz

When running the main script, use -b and pass calima-msa-s31.

Reproducing paper results

To reproduce the results in our paper, CamelParser2.0: A State-of-the-Art Dependency Parser for Arabic, please use the code found in the paper_version branch.

Citation

If you find the CamelParser useful in your research, please cite

@inproceedings{Elshabrawy:2023:camelparser,
    title = "{CamelParser2.0: A State-of-the-Art Dependency Parser for Arabic}",
    author = {Ahmed Elshabrawy and Muhammed AbuOdeh and Go Inoue and Nizar Habash},
    booktitle = {Proceedings of The First Arabic Natural Language Processing Conference (ArabicNLP 2023)},
    year = "2023"
}
