Commit 3db857c

Updated readme with LLM fine-tuning for text classification

1 parent 4bbca96 · commit 3db857c

README.md

Lines changed: 364 additions & 0 deletions

```diff
@@ -46,6 +46,8 @@
 - [Text-to-Text Generation](#text-to-text-generation)
 - [Fill-Mask](#fill-mask)
 - [Vector Database](#vector-database)
+- [LLM Fine-tuning](#llm-fine-tuning)
+- [Text Classification](#llm-fine-tuning-text-classification)
 <!-- - [Regression](#regression)
 - [Classification](#classification)-->
```

@@ -866,5 +868,367 @@

# LLM Fine-tuning

In this section, we provide a step-by-step walkthrough for fine-tuning a large language model (LLM) for different tasks.

## Prerequisites

1. Ensure you have the PostgresML extension installed and configured in your PostgreSQL database. You can find installation instructions for PostgresML in the official documentation. A quick way to verify the installation is shown after this list.

2. Obtain a Hugging Face API token to push the fine-tuned model to the Hugging Face Model Hub. Follow the instructions on the [Hugging Face website](https://huggingface.co/settings/tokens) to get your API token.
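
To double-check the first prerequisite, you can query the standard PostgreSQL `pg_extension` catalog; this is a minimal sketch that assumes the extension is registered under the name `pgml`:

```sql
-- Minimal check (assumes the extension is registered as 'pgml')
SELECT extname, extversion
FROM pg_extension
WHERE extname = 'pgml';
```
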
## LLM Fine-tuning Text Classification

### 1. Loading the Dataset

To begin, create a table to store your dataset. In this example, we use the `imdb` dataset from Hugging Face. The IMDB dataset contains three splits: train (25K rows), test (25K rows), and unsupervised (50K rows). In the train and test splits, the negative class has label 0 and the positive class has label 1. All rows in the unsupervised split have a label of -1.

```sql
SELECT pgml.load_dataset('imdb');
```
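
As an optional sanity check (not part of the original walkthrough), you can confirm the rows landed in the `pgml.imdb` table that the later steps read from; with the splits described above you would expect roughly 100K rows in total:

```sql
-- Optional check: train (25K) + test (25K) + unsupervised (50K) should give about 100K rows
SELECT COUNT(*) FROM pgml.imdb;
```
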
### 2. Preparing the Dataset for Fine-tuning

We will create a view of the dataset by performing the following operations:

- Add a new text column named "class" that holds the positive and negative classes.
- Shuffle the rows to ensure randomness in the distribution of data.
- Remove all rows from the unsupervised split (label = -1).

```sql
CREATE VIEW pgml.imdb_shuffled_view AS
SELECT
    label,
    CASE WHEN label = 0 THEN 'negative'
         WHEN label = 1 THEN 'positive'
         ELSE 'neutral'
    END AS class,
    text
FROM pgml.imdb
WHERE label != -1
ORDER BY RANDOM();
```

### 3. Exploratory Data Analysis (EDA) on Shuffled Data

Before splitting the data into training and test sets, it's essential to perform exploratory data analysis (EDA) to understand the distribution of labels and other characteristics of the dataset. In this section, we'll use `pgml.imdb_shuffled_view` to explore the shuffled data.

#### 3.1 Distribution of Labels

To analyze the distribution of labels in the shuffled dataset, you can use the following SQL query:

```sql
-- Count the occurrences of each label in the shuffled dataset
SELECT
    label,
    COUNT(*) AS label_count
FROM pgml.imdb_shuffled_view
GROUP BY label
ORDER BY label;
```

This query provides insights into the distribution of labels, helping you understand the balance or imbalance of classes in your dataset.

#### 3.2 Sample Records

To get a glimpse of the data, you can retrieve a sample of records from the shuffled dataset:

```sql
-- Retrieve a sample of records from the shuffled dataset
SELECT *
FROM pgml.imdb_shuffled_view
LIMIT 10; -- Adjust the limit based on the desired number of records
```

This query allows you to inspect a few records to understand the structure and content of the shuffled data.

#### 3.3 Additional Exploratory Analysis

Feel free to explore other aspects of the data, such as the distribution of text lengths, word frequencies, or any other features relevant to your analysis. Performing EDA is crucial for gaining insights into your dataset and making informed decisions during subsequent steps of the workflow.
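
For example, here is a small illustrative query (not from the original walkthrough) that summarizes review lengths over the same view:

```sql
-- Summarize review lengths, in characters, across the shuffled view
SELECT
    MIN(LENGTH(text))        AS min_chars,
    ROUND(AVG(LENGTH(text))) AS avg_chars,
    MAX(LENGTH(text))        AS max_chars
FROM pgml.imdb_shuffled_view;
```
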
### 4. Splitting Data into Training and Test Sets

Create views for training and test data by splitting the shuffled dataset. In this example, 80% is allocated for training and 20% for testing.

```sql
-- Create a view for training data
CREATE VIEW pgml.imdb_train_view AS
SELECT *
FROM pgml.imdb_shuffled_view
LIMIT (SELECT COUNT(*) * 0.8 FROM pgml.imdb_shuffled_view);

-- Create a view for test data
CREATE VIEW pgml.imdb_test_view AS
SELECT *
FROM pgml.imdb_shuffled_view
OFFSET (SELECT COUNT(*) * 0.8 FROM pgml.imdb_shuffled_view);
```
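
As an optional, illustrative follow-up, you can compare the sizes of the two splits; with the 50K labeled rows described earlier you would expect roughly 40K training and 10K test rows:

```sql
-- Compare the sizes of the train and test views
SELECT
    (SELECT COUNT(*) FROM pgml.imdb_train_view) AS train_rows,
    (SELECT COUNT(*) FROM pgml.imdb_test_view)  AS test_rows;
```
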
### 5. Fine-Tuning the Language Model

Now, fine-tune the language model for text classification using the created training view. The following sections give a detailed explanation of the different parameters used during fine-tuning.

```sql
SELECT pgml.tune(
    'imdb_review_sentiment',
    task => 'text-classification',
    relation_name => 'pgml.imdb_train_view',
    model_name => 'distilbert-base-uncased',
    test_size => 0.2,
    test_sampling => 'last',
    hyperparams => '{
        "training_args" : {
            "learning_rate": 2e-5,
            "per_device_train_batch_size": 16,
            "per_device_eval_batch_size": 16,
            "num_train_epochs": 20,
            "weight_decay": 0.01,
            "hub_token" : "YOUR_HUB_TOKEN",
            "push_to_hub" : true
        },
        "dataset_args" : { "text_column" : "text", "class_column" : "class" }
    }'
);
```

* project_name ('imdb_review_sentiment'): The project_name parameter specifies a unique name for your fine-tuning project. It helps identify and organize different fine-tuning tasks within the PostgreSQL database. In this example, the project is named 'imdb_review_sentiment', reflecting the sentiment analysis task on the IMDB dataset. You can check `pgml.projects` for a list of projects (see the example query after this list).

* task ('text-classification'): The task parameter defines the nature of the machine learning task to be performed. In this case, it is set to 'text-classification', indicating that the fine-tuning is geared towards training a model for text classification.

* relation_name ('pgml.imdb_train_view'): The relation_name parameter identifies the training dataset to be used for fine-tuning. It specifies the view or table containing the training data. In this example, 'pgml.imdb_train_view' is the view created from the shuffled IMDB dataset, and it serves as the source for model training.

* model_name ('distilbert-base-uncased'): The model_name parameter denotes the pre-trained language model architecture to be fine-tuned. In this case, 'distilbert-base-uncased' is selected. DistilBERT is a distilled version of BERT, and the 'uncased' variant indicates that the model does not differentiate between uppercase and lowercase letters.

* test_size (0.2): The test_size parameter determines the proportion of the dataset reserved for testing during fine-tuning. In this example, 20% of the dataset is set aside for evaluation, helping assess the model's performance on unseen data.

* test_sampling ('last'): The test_sampling parameter defines the strategy for sampling test data from the dataset. In this case, 'last' indicates that the last portion of the data, matching the specified test size, is used for testing. Adjusting this parameter might be necessary based on your specific requirements and dataset characteristics.
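
For reference, a quick way to list those projects is to query the `pgml.projects` table mentioned above; the exact columns depend on your PostgresML version, so treat this as a sketch:

```sql
-- List fine-tuning projects recorded by PostgresML
SELECT * FROM pgml.projects;
```
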
#### 5.1 Dataset Arguments (dataset_args)

The dataset_args section allows you to specify critical parameters related to your dataset for language model fine-tuning.

* text_column: The name of the column containing the text data in your dataset. In this example, it is set to "text".
* class_column: The name of the column containing the class labels in your dataset. In this example, it is set to "class".

#### 5.2 Training Arguments (training_args)

Fine-tuning a language model requires careful consideration of the training parameters in the training_args section. Below is a subset of the training args you can pass to fine-tuning. You can find an exhaustive list of parameters in the Hugging Face documentation on [TrainingArguments](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.TrainingArguments).

* learning_rate: The learning rate for the training. It controls the step size during the optimization process. Adjust based on your model's convergence behavior.
* per_device_train_batch_size: The batch size per GPU for training. This parameter controls the number of training samples utilized in one iteration. Adjust based on your available GPU memory.
* per_device_eval_batch_size: The batch size per GPU for evaluation. Similar to per_device_train_batch_size, but used during model evaluation.
* num_train_epochs: The number of training epochs. An epoch is one complete pass through the entire training dataset. Adjust based on the model's convergence and your dataset size.
* weight_decay: The L2 regularization term for weight decay. It helps prevent overfitting. Adjust based on the complexity of your model.
* hub_token: Your Hugging Face API token to push the fine-tuned model to the Hugging Face Model Hub. Replace "YOUR_HUB_TOKEN" with the actual token.
* push_to_hub: A boolean flag indicating whether to push the model to the Hugging Face Model Hub after fine-tuning.

#### 5.3 Monitoring

During training, metrics such as loss and gradient norm are printed as INFO messages and are also logged in the pgml.logs table. Below is a snapshot of such output.

```json
INFO: {
  "loss": 0.3453,
  "grad_norm": 5.230295181274414,
  "learning_rate": 1.9e-05,
  "epoch": 0.25,
  "step": 500,
  "max_steps": 10000,
  "timestamp": "2024-03-07 01:59:15.090612"
}
INFO: {
  "loss": 0.2479,
  "grad_norm": 2.7754225730895996,
  "learning_rate": 1.8e-05,
  "epoch": 0.5,
  "step": 1000,
  "max_steps": 10000,
  "timestamp": "2024-03-07 02:01:12.064098"
}
INFO: {
  "loss": 0.223,
  "learning_rate": 1.6000000000000003e-05,
  "epoch": 1.0,
  "step": 2000,
  "max_steps": 10000,
  "timestamp": "2024-03-07 02:05:08.141220"
}
```

Once the training is completed, the model is evaluated against the validation dataset. You will see output like the following in the client terminal. Accuracy on the evaluation dataset is 0.934 and the F1-score is 0.93.

```json
INFO: {
  "train_runtime": 2359.5335,
  "train_samples_per_second": 67.81,
  "train_steps_per_second": 4.238,
  "train_loss": 0.11267969808578492,
  "epoch": 5.0,
  "step": 10000,
  "max_steps": 10000,
  "timestamp": "2024-03-07 02:36:38.783279"
}
INFO: {
  "eval_loss": 0.3691485524177551,
  "eval_f1": 0.9343711842996372,
  "eval_accuracy": 0.934375,
  "eval_runtime": 41.6167,
  "eval_samples_per_second": 192.23,
  "eval_steps_per_second": 12.014,
  "epoch": 5.0,
  "step": 10000,
  "max_steps": 10000,
  "timestamp": "2024-03-07 02:37:31.762917"
}
```

Once the training is completed, you can query the pgml.logs table using the model_id, or by finding the latest model on the project.

```bash
pgml: SELECT logs->>'epoch' AS epoch, logs->>'step' AS step, logs->>'loss' AS loss FROM pgml.logs WHERE model_id = 993 AND jsonb_exists(logs, 'loss');
 epoch | step  |  loss
-------+-------+--------
 0.25  | 500   | 0.3453
 0.5   | 1000  | 0.2479
 0.75  | 1500  | 0.223
 1.0   | 2000  | 0.2165
 1.25  | 2500  | 0.1485
 1.5   | 3000  | 0.1563
 1.75  | 3500  | 0.1559
 2.0   | 4000  | 0.142
 2.25  | 4500  | 0.0816
 2.5   | 5000  | 0.0942
 2.75  | 5500  | 0.075
 3.0   | 6000  | 0.0883
 3.25  | 6500  | 0.0432
 3.5   | 7000  | 0.0426
 3.75  | 7500  | 0.0444
 4.0   | 8000  | 0.0504
 4.25  | 8500  | 0.0186
 4.5   | 9000  | 0.0265
 4.75  | 9500  | 0.0248
 5.0   | 10000 | 0.0284
```
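
If you also want the evaluation metrics from the same table, a query along the following lines should work; it reuses the `logs` JSONB column, the example `model_id` from the query above, and the `eval_*` keys shown in the evaluation output (treat it as a sketch rather than an official recipe):

```sql
-- Pull the evaluation entry for the same model from pgml.logs
SELECT
    logs->>'eval_accuracy' AS eval_accuracy,
    logs->>'eval_f1'       AS eval_f1,
    logs->>'eval_loss'     AS eval_loss
FROM pgml.logs
WHERE model_id = 993
  AND jsonb_exists(logs, 'eval_accuracy');
```
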
During training, the model is periodically uploaded to the Hugging Face Hub. You will find the model at `https://huggingface.co/<username>/<project_name>`. An example model that was automatically pushed to the Hugging Face Hub is [here](https://huggingface.co/santiadavani/imdb_review_sentiement).

### 6. Inference Using the Fine-tuned Model

Now that we have the fine-tuned model on the Hugging Face Hub, we can use [`pgml.transform`](https://postgresml.org/docs/introduction/apis/sql-extensions/pgml.transform/text-classification) to perform real-time predictions as well as batch predictions.

**Real-time predictions**

Here is an example pgml.transform call for real-time predictions on the newly minted LLM fine-tuned on the IMDB review dataset.

```sql
SELECT pgml.transform(
    task => '{
        "task": "text-classification",
        "model": "santiadavani/imdb_review_sentiement"
    }'::JSONB,
    inputs => ARRAY[
        'I would not give this movie a rating, its not worthy. I watched it only because I am a Pfieffer fan. ',
        'This movie was sooooooo good! It was hilarious! There are so many jokes that you can just watch the'
    ]
);
                                               transform
--------------------------------------------------------------------------------------------------------
 [{"label": "negative", "score": 0.999561846256256}, {"label": "positive", "score": 0.986771047115326}]
(1 row)

Time: 175.264 ms
```

**Batch predictions**

```sql
pgml=# SELECT
    LEFT(text, 100) AS truncated_text,
    class,
    predicted_class[0]->>'label' AS predicted_class,
    (predicted_class[0]->>'score')::float AS score
FROM (
    SELECT
        LEFT(text, 100) AS text,
        class,
        pgml.transform(
            task => '{
                "task": "text-classification",
                "model": "santiadavani/imdb_review_sentiement"
            }'::JSONB,
            inputs => ARRAY[text]
        ) AS predicted_class
    FROM pgml.imdb_test_view
    LIMIT 2
) AS subquery;
                                            truncated_text                                            |  class   | predicted_class |       score
------------------------------------------------------------------------------------------------------+----------+-----------------+--------------------
 I wouldn't give this movie a rating, it's not worthy. I watched it only because I'm a Pfieffer fan. | negative | negative        | 0.9996490478515624
 This movie was sooooooo good! It was hilarious! There are so many jokes that you can just watch the | positive | positive        | 0.9972313046455384

Time: 1337.290 ms (00:01.337)
```

### 7. Restarting Training from a Previously Trained Model

Sometimes it's necessary to restart the training process from a previously trained model. This can be advantageous for various reasons, such as model fine-tuning, hyperparameter adjustments, or addressing interruptions in the training process. `pgml.tune` provides a seamless way to restart training while leveraging the progress made in the existing model. Below is a guide on how to restart training using a previous model as a starting point.

#### Define the Previous Model

Specify the name of the existing model you want to use as a starting point. This is achieved by setting the `model_name` parameter in the `pgml.tune` function. In the example below, it is set to 'santiadavani/imdb_review_sentiement'.

```sql
model_name => 'santiadavani/imdb_review_sentiement',
```

#### Adjust Hyperparameters

Fine-tune hyperparameters as needed for the restarted training process. This might include modifying learning rates, batch sizes, or training epochs. In the example below, the learning rate, batch sizes, and number of epochs are adjusted.

```sql
"training_args": {
    "learning_rate": 2e-5,
    "per_device_train_batch_size": 16,
    "per_device_eval_batch_size": 16,
    "num_train_epochs": 1,
    "weight_decay": 0.01,
    "hub_token": "",
    "push_to_hub": true
},
```

#### Ensure Consistent Dataset Configuration

Confirm that the dataset configuration remains consistent, including specifying the same text and class columns as in the previous training. This ensures compatibility between the existing model and the restarted training process.

```sql
"dataset_args": {
    "text_column": "text",
    "class_column": "class"
},
```

#### Run the pgml.tune Function

Execute the `pgml.tune` function with the updated parameters to initiate the training restart. The function will leverage the existing model and adapt it based on the adjusted hyperparameters and dataset configuration.

```sql
SELECT pgml.tune(
    'imdb_review_sentiement',
    task => 'text-classification',
    relation_name => 'pgml.imdb_train_view',
    model_name => 'santiadavani/imdb_review_sentiement',
    test_size => 0.2,
    test_sampling => 'last',
    hyperparams => '{
        "training_args": {
            "learning_rate": 2e-5,
            "per_device_train_batch_size": 16,
            "per_device_eval_batch_size": 16,
            "num_train_epochs": 1,
            "weight_decay": 0.01,
            "hub_token": "",
            "push_to_hub": true
        },
        "dataset_args": { "text_column": "text", "class_column": "class" }
    }'
);
```

By following these steps, you can effectively restart training from a previously trained model, allowing for further refinement and adaptation of the model based on new requirements or insights. Adjust parameters as needed for your specific use case and dataset.

## Conclusion

By following these steps, you can leverage PostgresML to seamlessly integrate fine-tuning of language models for text classification directly within your PostgreSQL database. Adjust the dataset, model, and hyperparameters to suit your specific requirements.
