ViralLab/TurkishBERTweet

TurkishBERTweet: Fast and Reliable Large Language Model for Social Media Analysis

License

MIT license


Main Results

Model

Model	#params	Arch.	Max length	Pre-training data
`VRLLab/TurkishBERTweet`	163M	base	128	894M Turkish Tweets (uncased)

LoRA Adapters

Model	train f1	dev f1	test f1	Dataset Size
`VRLLab/TurkishBERTweet-Lora-SA`	0.799	0.687	0.692	42,476 Turkish Tweets
`VRLLab/TurkishBERTweet-Lora-HS`	0.915	0.796	0.831	4,683 Turkish Tweets
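The table above reports F1 scores but does not say which averaging scheme was used. As a reminder of how such a score is computed, here is a minimal sketch of macro-averaged F1 in plain Python. Macro averaging is an assumption made for illustration only, and the labels in the usage lines are invented; they are not taken from the datasets above.

```python
def f1_for_label(y_true, y_pred, label):
    """F1 of a single class, treating `label` as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
    if tp == 0:
        return 0.0  # no true positives: precision or recall is zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    return sum(f1_for_label(y_true, y_pred, l) for l in labels) / len(labels)


# Invented toy labels, for illustration only.
y_true = [0, 0, 1, 1]
y_pred = [0, 1, 1, 1]
score = macro_f1(y_true, y_pred)
print(round(score, 3))
```

A gap between train and dev/test F1, as in the table, is computed the same way on each split separately.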

Example usage

    git clone git@github.com:ViralLab/TurkishBERTweet.git
    cd TurkishBERTweet
    python -m venv venv
    source venv/bin/activate

    pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
    pip install peft
    pip install transformers
    pip install urlextract

Twitter Preprocessor

    from Preprocessor import preprocess

    text = """Lab'ımıza "viral" adını verdik çünkü amacımız disiplinler arası sınırları aşmak ve aralarında yeni bağlantılar kurmak! 🔬 #ViralLab https://varollab.com/"""

    preprocessed_text = preprocess(text)
    print(preprocessed_text)

Output:

    lab'ımıza "viral" adını verdik çünkü amacımız disiplinler arası sınırları aşmak ve aralarında yeni bağlantılar kurmak! <emoji> mikroskop </emoji> <hashtag> virallab </hashtag> <http> varollab.com </http>
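To make the output format above easier to read, here is a rough, standard-library-only sketch that reproduces the lowercasing and the `<hashtag>` / `<http>` wrapping seen in the example. It is not the repository's `Preprocessor` implementation: `sketch_preprocess` is a hypothetical name, emoji handling is left out entirely, and Turkish-specific casing rules (such as `I` to `ı`) are not handled.

```python
import re

# One combined pattern so a '#' inside a URL is consumed as part of the URL
# and never re-tagged as a hashtag.
TOKEN_RE = re.compile(r"https?://(?P<url>\S+)|#(?P<tag>\w+)")


def _wrap(match):
    """Wrap a matched URL or hashtag in the marker tags seen in the output."""
    if match.group("url") is not None:
        return f"<http> {match.group('url').rstrip('/')} </http>"
    return f"<hashtag> {match.group('tag')} </hashtag>"


def sketch_preprocess(text):
    """Hypothetical stand-in for preprocess(): lowercase and tag tokens."""
    text = text.lower()  # plain Unicode lowercasing, not Turkish-aware
    text = TOKEN_RE.sub(_wrap, text)
    return " ".join(text.split())  # collapse repeated whitespace


print(sketch_preprocess("Yeni bağlantılar kurmak! #ViralLab https://varollab.com/"))
# yeni bağlantılar kurmak! <hashtag> virallab </hashtag> <http> varollab.com </http>
```

For real inputs, use the repository's own `preprocess` function so that the text matches what the model saw during pre-training.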

Feature Extraction

    import torch
    from transformers import AutoTokenizer, AutoModel
    from Preprocessor import preprocess

    tokenizer = AutoTokenizer.from_pretrained("VRLLab/TurkishBERTweet")
    turkishBERTweet = AutoModel.from_pretrained("VRLLab/TurkishBERTweet")

    text = """Lab'ımıza "viral" adını verdik çünkü amacımız disiplinler arası sınırları aşmak ve aralarında yeni bağlantılar kurmak! 💥🔬 #ViralLab #DisiplinlerArası #YenilikçiBağlantılar"""

    preprocessed_text = preprocess(text)
    input_ids = torch.tensor([tokenizer.encode(preprocessed_text)])

    with torch.no_grad():
        features = turkishBERTweet(input_ids)  # Models outputs are now tuples
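The snippet above stops at the raw model output, which holds one vector per token. A common way to turn that into a single vector per tweet is masked mean pooling; the README does not say which pooling the authors use, so treat this as one possible choice. The sketch below shows the idea on plain Python lists with invented numbers. With real tensors the same averaging would be done with tensor operations on the model's last hidden state.

```python
def mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask entry is 1 (padding is skipped)."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


# Toy "hidden states": 3 tokens, 2 dimensions; the last token is padding.
hidden = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]
sentence_vector = mean_pool(hidden, mask)
print(sentence_vector)  # [2.0, 3.0]
```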

Sentiment Classification

    import torch
    from peft import (
        PeftModel,
        PeftConfig,
    )
    from transformers import (
        AutoModelForSequenceClassification,
        AutoTokenizer,
    )
    from Preprocessor import preprocess

    peft_model = "VRLLab/TurkishBERTweet-Lora-SA"
    peft_config = PeftConfig.from_pretrained(peft_model)

    # loading Tokenizer
    padding_side = "right"
    tokenizer = AutoTokenizer.from_pretrained(
        peft_config.base_model_name_or_path, padding_side=padding_side
    )
    if getattr(tokenizer, "pad_token_id") is None:
        tokenizer.pad_token_id = tokenizer.eos_token_id

    id2label_sa = {0: "negative", 2: "positive", 1: "neutral"}
    turkishBERTweet_sa = AutoModelForSequenceClassification.from_pretrained(
        peft_config.base_model_name_or_path,
        return_dict=True,
        num_labels=len(id2label_sa),
        id2label=id2label_sa,
    )
    turkishBERTweet_sa = PeftModel.from_pretrained(turkishBERTweet_sa, peft_model)

    sample_texts = [
        "Viral lab da insanlar hep birlikte çalışıyorlar. hepbirlikte çalışan insanlar birbirlerine yakın oluyorlar.",
        "americanin diplatlari turkiyeye gelmesin 😤",
        "Mark Zuckerberg ve Elon Musk'un boks müsabakası süper olacak! 🥷",
        "Adam dun ne yediğini unuttu",
    ]

    preprocessed_texts = [preprocess(s) for s in sample_texts]
    with torch.no_grad():
        for s in preprocessed_texts:
            ids = tokenizer.encode_plus(s, return_tensors="pt")
            label_id = turkishBERTweet_sa(**ids).logits.argmax(-1).item()
            print(id2label_sa[label_id], ":", s)

    positive : viral lab da insanlar hep birlikte çalışıyorlar. hepbirlikte çalışan insanlar birbirlerine yakın oluyorlar.
    negative : americanin diplatlari turkiyeye gelmesin <emoji> burundan_buharla_yüzleşmek </emoji>
    positive : mark zuckerberg ve elon musk'un boks müsabakası süper olacak! <emoji> kadın_muhafız_koyu_ten_tonu </emoji>
    neutral : adam dun ne yediğini unuttu
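The classification code picks the label with `logits.argmax(-1)`, which discards how confident the model was. If a confidence score is wanted, the logits can be passed through a softmax first. The sketch below shows that step in plain Python; the logit values are invented for illustration and are not real model outputs, and `predict_label` is a hypothetical helper name.

```python
import math

# Same id-to-label mapping as the sentiment example.
id2label_sa = {0: "negative", 1: "neutral", 2: "positive"}


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_label(logits, id2label):
    """Return the most likely label and its softmax probability."""
    probs = softmax(logits)
    label_id = max(range(len(probs)), key=probs.__getitem__)
    return id2label[label_id], probs[label_id]


# Invented logits for a single tweet, for illustration only.
label, confidence = predict_label([-1.2, 0.3, 2.1], id2label_sa)
print(label, round(confidence, 3))
```

The predicted label is the same as with a plain argmax, because softmax preserves the ordering of the logits; only the extra probability is new.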

HateSpeech Detection

    import torch  # needed for torch.no_grad() below
    from peft import (
        PeftModel,
        PeftConfig,
    )
    from transformers import (
        AutoModelForSequenceClassification,
        AutoTokenizer,
    )
    from Preprocessor import preprocess

    peft_model = "VRLLab/TurkishBERTweet-Lora-HS"
    peft_config = PeftConfig.from_pretrained(peft_model)

    # loading Tokenizer
    padding_side = "right"
    tokenizer = AutoTokenizer.from_pretrained(
        peft_config.base_model_name_or_path, padding_side=padding_side
    )
    if getattr(tokenizer, "pad_token_id") is None:
        tokenizer.pad_token_id = tokenizer.eos_token_id

    id2label_hs = {0: "No", 1: "Yes"}
    turkishBERTweet_hs = AutoModelForSequenceClassification.from_pretrained(
        peft_config.base_model_name_or_path,
        return_dict=True,
        num_labels=len(id2label_hs),
        id2label=id2label_hs,
    )
    turkishBERTweet_hs = PeftModel.from_pretrained(turkishBERTweet_hs, peft_model)

    sample_texts = [
        "Viral lab da insanlar hep birlikte çalışıyorlar. hepbirlikte çalışan insanlar birbirlerine yakın oluyorlar.",
        "kasmayin artik ya kac kere tanik olduk bu azgin tehlikeli\u201cmultecilerin\u201d yaptiklarina? bir afgan taragindan kafasi tasla ezilip tecavuz edilen kiza da git boyle cihangir solculugu yap yerse?",
    ]

    preprocessed_texts = [preprocess(s) for s in sample_texts]
    with torch.no_grad():
        for s in preprocessed_texts:
            ids = tokenizer.encode_plus(s, return_tensors="pt")
            label_id = turkishBERTweet_hs(**ids).logits.argmax(-1).item()
            print(id2label_hs[label_id], ":", s)

    No : viral lab da insanlar hep birlikte çalışıyorlar. hepbirlikte çalışan insanlar birbirlerine yakın oluyorlar.
    Yes : kasmayin artik ya kac kere tanik olduk bu azgin tehlikeli “multecilerin” yaptiklarina? bir afgan taragindan kafasi tasla ezilip tecavuz edilen kiza da git boyle cihangir solculugu yap yerse?

Citation

    @article{najafi2024turkishbertweet,
      title={Turkishbertweet: Fast and reliable large language model for social media analysis},
      author={Najafi, Ali and Varol, Onur},
      journal={Expert Systems with Applications},
      pages={124737},
      year={2024},
      publisher={Elsevier}
    }

Acknowledgments

We thank Fatih Amasyali for providing access to the Tweet Sentiment datasets from the Kemik group. This material is based upon work supported by the Google Cloud Research Credits program with the award GCP19980904. We also thank TUBITAK (121C220 and 222N311) for funding this project.

Project page: varollab.com/TurkishBERTweet
