|
46 | 46 | - [Text-to-Text Generation](#text-to-text-generation) |
47 | 47 | - [Fill-Mask](#fill-mask) |
48 | 48 | - [Vector Database](#vector-database) |
| 49 | +- [LLM Fine-tuning](#llm-fine-tuning) |
| 50 | +- [Text Classification](#llm-fine-tuning-text-classification) |
49 | 51 | <!-- - [Regression](#regression) |
50 | 52 | - [Classification](#classification)--> |
51 | 53 |
|
@@ -866,5 +868,367 @@ Sentence Similarity involves determining the degree of similarity between two te |
866 | 868 | # Classification--> |
867 | 869 |
|
868 | 870 |
|
| 871 | +# LLM Fine-tuning |
869 | 872 |
|
| 873 | +In this section, we will provide a step-by-step walkthrough for fine-tuning a large language model (LLM) for different tasks. |
| 874 | + |
| 875 | +## Prerequisites |
| 876 | + |
| 877 | +1. Ensure you have the PostgresML extension installed and configured in your PostgreSQL database. You can find installation instructions for PostgresML in the official documentation; a quick verification query is shown after this list. |
| 878 | + |
| 879 | +2. Obtain a Hugging Face API token to push the fine-tuned model to the Hugging Face Model Hub. Follow the instructions on the [Hugging Face website](https://huggingface.co/settings/tokens) to get your API token. |
| 880 | + |
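Before moving on, you can run a quick sanity check. This is a minimal sketch that assumes the extension is already installed on the server; it enables the extension in the current database and reports the installed version:

```sql
-- Enable the PostgresML extension in this database (no-op if already enabled)
CREATE EXTENSION IF NOT EXISTS pgml;

-- Confirm the installed PostgresML version
SELECT pgml.version();
```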
| 881 | +## LLM Fine-tuning Text Classification |
| 882 | + |
| 883 | +### 1. Loading the Dataset |
| 884 | + |
| 885 | +To begin, load the dataset into a table in your database. In this example, we use the 'imdb' dataset from Hugging Face. The IMDB dataset contains three splits: train (25K rows), test (25K rows), and unsupervised (50K rows). In the train and test splits, the negative class has label 0 and the positive class has label 1. All rows in the unsupervised split have a label of -1. |
| 886 | +```sql |
| 887 | +SELECT pgml.load_dataset('imdb'); |
| 888 | +``` |
| 889 | + |
| 890 | +### 2. Prepare dataset for fine-tuning |
| 891 | + |
| 892 | +We will create a view of the dataset by performing the following operations: |
| 893 | + |
| 894 | +- Add a new text column named "class" that has positive and negative classes. |
| 895 | +- Shuffle the rows to ensure randomness in the distribution of data. |
| 896 | +- Remove all rows from the unsupervised split (label = -1). |
| 897 | + |
| 898 | +```sql |
| 899 | +CREATE VIEW pgml.imdb_shuffled_view AS |
| 900 | +SELECT |
| 901 | +    label, |
| 902 | +    CASE WHEN label = 0 THEN 'negative' |
| 903 | +         WHEN label = 1 THEN 'positive' |
| 904 | +         ELSE 'neutral' |
| 905 | +    END AS class, |
| 906 | +    text |
| 907 | +FROM pgml.imdb |
| 908 | +WHERE label != -1 |
| 909 | +ORDER BY RANDOM(); |
| 910 | +``` |
| 911 | + |
| 912 | +### 3. Exploratory Data Analysis (EDA) on Shuffled Data |
| 913 | + |
| 914 | +Before splitting the data into training and test sets, it's essential to perform exploratory data analysis (EDA) to understand the distribution of labels and other characteristics of the dataset. In this section, we'll use the `pgml.imdb_shuffled_view` view to explore the shuffled data. |
| 915 | + |
| 916 | +#### 3.1 Distribution of Labels |
| 917 | + |
| 918 | +To analyze the distribution of labels in the shuffled dataset, you can use the following SQL query: |
| 919 | + |
| 920 | +```sql |
| 921 | +-- Count the occurrences of each label in the shuffled dataset |
| 922 | +SELECT |
| 923 | +    label, |
| 924 | +    COUNT(*) AS label_count |
| 925 | +FROM pgml.imdb_shuffled_view |
| 926 | +GROUP BY label |
| 927 | +ORDER BY label; |
| 928 | +``` |
| 929 | +
| 930 | +This query provides insights into the distribution of labels, helping you understand the balance or imbalance of classes in your dataset. |
| 931 | + |
| 932 | +#### 3.2 Sample Records |
| 933 | +To get a glimpse of the data, you can retrieve a sample of records from the shuffled dataset: |
| 934 | + |
| 935 | +```sql |
| 937 | +-- Retrieve a sample of records from the shuffled dataset |
| 938 | +SELECT * |
| 939 | +FROM pgml.imdb_shuffled_view |
| 940 | +LIMIT 10; -- Adjust the limit based on the desired number of records |
| 941 | +``` |
| 942 | +This query allows you to inspect a few records to understand the structure and content of the shuffled data. |
| 943 | + |
| 944 | +#### 3.3 Additional Exploratory Analysis |
| 945 | +Feel free to explore other aspects of the data, such as the distribution of text lengths, word frequencies, or any other features relevant to your analysis. Performing EDA is crucial for gaining insights into your dataset and making informed decisions during subsequent steps of the workflow. |
| 946 | + |
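For instance, here is a minimal sketch of a per-class text-length summary over the shuffled view; it relies only on the `class` and `text` columns defined above:

```sql
-- Character-length statistics per class, a quick proxy for review length
SELECT
    class,
    COUNT(*)                 AS reviews,
    ROUND(AVG(LENGTH(text))) AS avg_chars,
    MIN(LENGTH(text))        AS min_chars,
    MAX(LENGTH(text))        AS max_chars
FROM pgml.imdb_shuffled_view
GROUP BY class
ORDER BY class;
```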
| 947 | +### 4. Splitting Data into Training and Test Sets |
| 948 | + |
| 949 | +Create views for training and test data by splitting the shuffled dataset. In this example, 80% is allocated for training, and 20% for testing. |
| 950 | + |
| 951 | +```sql |
| 952 | +-- Create a view for training data |
| 953 | +CREATE VIEW pgml.imdb_train_view AS |
| 954 | +SELECT * |
| 955 | +FROM pgml.imdb_shuffled_view |
| 956 | +LIMIT (SELECT COUNT(*) * 0.8 FROM pgml.imdb_shuffled_view); |
| 957 | +
|
| 958 | +-- Create a view for test data |
| 959 | +CREATE VIEW pgml.imdb_test_view AS |
| 960 | +SELECT * |
| 961 | +FROM pgml.imdb_shuffled_view |
| 962 | +OFFSET (SELECT COUNT(*) * 0.8 FROM pgml.imdb_shuffled_view); |
| 963 | +``` |
| 964 | + |
| 965 | +### 5. Fine-Tuning the Language Model |
| 966 | + |
| 967 | +Now, fine-tune the language model for text classification using the created training view. In the following sections, you will see a detailed explanation of the different parameters used during fine-tuning. |
| 968 | + |
| 969 | +```sql |
| 970 | +SELECT pgml.tune( |
| 971 | + 'imdb_review_sentiment', |
| 972 | + task => 'text-classification', |
| 973 | + relation_name => 'pgml.imdb_train_view', |
| 974 | + model_name => 'distilbert-base-uncased', |
| 975 | + test_size => 0.2, |
| 976 | + test_sampling => 'last', |
| 977 | + hyperparams => '{ |
| 978 | + "training_args" : { |
| 979 | + "learning_rate": 2e-5, |
| 980 | + "per_device_train_batch_size": 16, |
| 981 | + "per_device_eval_batch_size": 16, |
| 982 | + "num_train_epochs": 20, |
| 983 | + "weight_decay": 0.01, |
| 984 | + "hub_token" : "YOUR_HUB_TOKEN", |
| 985 | + "push_to_hub" : true |
| 986 | + }, |
| 987 | + "dataset_args" : { "text_column" : "text", "class_column" : "class" } |
| 988 | + }' |
| 989 | +); |
| 990 | +``` |
| 991 | + |
| 992 | +* project_name ('imdb_review_sentiment'): The project_name parameter specifies a unique name for your fine-tuning project. It helps identify and organize different fine-tuning tasks within the PostgreSQL database. In this example, the project is named 'imdb_review_sentiment', reflecting the sentiment analysis task on the IMDb dataset. You can check `pgml.projects` for the list of projects (see the query sketch after this list). |
| 993 | + |
| 994 | +* task ('text-classification'): The task parameter defines the nature of the machine learning task to be performed. In this case, it's set to 'text-classification', indicating that the fine-tuning is geared towards training a model for text classification. |
| 995 | +
|
| 996 | +* relation_name ('pgml.imdb_train_view'): The relation_name parameter identifies the training dataset to be used for fine-tuning. It specifies the view or table containing the training data. In this example, 'pgml.imdb_train_view' is the view created from the shuffled IMDb dataset, and it serves as the source for model training. |
| 997 | +
|
| 998 | +* model_name ('distilbert-base-uncased'): The model_name parameter denotes the pre-trained language model architecture to be fine-tuned. In this case, 'distilbert-base-uncased' is selected. DistilBERT is a distilled version of BERT, and the 'uncased' variant indicates that the model does not differentiate between uppercase and lowercase letters. |
| 999 | +
|
| 1000 | +* test_size (0.2): The test_size parameter determines the proportion of the dataset reserved for testing during fine-tuning. In this example, 20% of the dataset is set aside for evaluation, helping assess the model's performance on unseen data. |
| 1001 | + |
| 1002 | +* test_sampling ('last'): The test_sampling parameter defines the strategy for sampling test data from the dataset. In this case, 'last' indicates that the most recent portion of the data, following the specified test size, is used for testing. Adjusting this parameter might be necessary based on your specific requirements and dataset characteristics. |
| 1003 | + |
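As noted above, fine-tuning projects are tracked in `pgml.projects`. Below is a small sketch for listing them; it assumes the default pgml schema, where `pgml.projects` exposes `id`, `name`, `task`, and `created_at` columns:

```sql
-- List fine-tuning projects, newest first
SELECT id, name, task, created_at
FROM pgml.projects
ORDER BY created_at DESC;
```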
| 1004 | +#### 5.1 Dataset Arguments (dataset_args) |
| 1005 | +The dataset_args section allows you to specify critical parameters related to your dataset for language model fine-tuning. |
| 1006 | + |
| 1007 | +* text_column: The name of the column containing the text data in your dataset. In this example, it's set to "text". |
| 1008 | +* class_column: The name of the column containing the class labels in your dataset. In this example, it's set to "class". |
| 1009 | + |
| 1010 | +#### 5.2 Training Arguments (training_args) |
| 1011 | +Fine-tuning a language model requires careful consideration of training parameters in the training_args section. Below is a subset of training args that you can pass to fine-tuning. You can find an exhaustive list of parameters in the Hugging Face documentation on [TrainingArguments](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.TrainingArguments). |
| 1012 | + |
| 1013 | +* learning_rate: The learning rate for the training. It controls the step size during the optimization process. Adjust based on your model's convergence behavior. |
| 1014 | +* per_device_train_batch_size: The batch size per GPU for training. This parameter controls the number of training samples utilized in one iteration. Adjust based on your available GPU memory. |
| 1015 | +* per_device_eval_batch_size: The batch size per GPU for evaluation. Similar to per_device_train_batch_size, but used during model evaluation. |
| 1016 | +* num_train_epochs: The number of training epochs. An epoch is one complete pass through the entire training dataset. Adjust based on the model's convergence and your dataset size. |
| 1017 | +* weight_decay: L2 regularization term for weight decay. It helps prevent overfitting. Adjust based on the complexity of your model. |
| 1018 | +* hub_token: Your Hugging Face API token to push the fine-tuned model to the Hugging Face Model Hub. Replace "YOUR_HUB_TOKEN" with the actual token. |
| 1019 | +* push_to_hub: A boolean flag indicating whether to push the model to the Hugging Face Model Hub after fine-tuning. |
| 1020 | + |
| 1021 | + |
| 1022 | +#### 5.3 Monitoring |
| 1023 | +During training, metrics like loss and gradient norm will be printed as INFO messages and also logged in the pgml.logs table. Below is a snapshot of such output. |
| 1024 | + |
| 1025 | +```json |
| 1026 | +INFO: { |
| 1027 | + "loss": 0.3453, |
| 1028 | + "grad_norm": 5.230295181274414, |
| 1029 | + "learning_rate": 1.9e-05, |
| 1030 | + "epoch": 0.25, |
| 1031 | + "step": 500, |
| 1032 | + "max_steps": 10000, |
| 1033 | + "timestamp": "2024-03-07 01:59:15.090612" |
| 1034 | +} |
| 1035 | +INFO: { |
| 1036 | + "loss": 0.2479, |
| 1037 | + "grad_norm": 2.7754225730895996, |
| 1038 | + "learning_rate": 1.8e-05, |
| 1039 | + "epoch": 0.5, |
| 1040 | + "step": 1000, |
| 1041 | + "max_steps": 10000, |
| 1042 | + "timestamp": "2024-03-07 02:01:12.064098" |
| 1043 | +} |
| 1044 | +INFO: { |
| 1045 | + "loss": 0.223, |
| 1046 | + "learning_rate": 1.6000000000000003e-05, |
| 1047 | + "epoch": 1.0, |
| 1048 | + "step": 2000, |
| 1049 | + "max_steps": 10000, |
| 1050 | + "timestamp": "2024-03-07 02:05:08.141220" |
| 1051 | +} |
| 1052 | +``` |
| 1053 | + |
| 1054 | +Once the training is completed, the model will be evaluated against the validation dataset. You will see the output below in the client terminal. Accuracy on the evaluation dataset is 0.934 and the F1-score is 0.93. |
| 1055 | + |
| 1056 | +```json |
| 1057 | +INFO: { |
| 1058 | + "train_runtime": 2359.5335, |
| 1059 | + "train_samples_per_second": 67.81, |
| 1060 | + "train_steps_per_second": 4.238, |
| 1061 | + "train_loss": 0.11267969808578492, |
| 1062 | + "epoch": 5.0, |
| 1063 | + "step": 10000, |
| 1064 | + "max_steps": 10000, |
| 1065 | + "timestamp": "2024-03-07 02:36:38.783279" |
| 1066 | +} |
| 1067 | +INFO: { |
| 1068 | + "eval_loss": 0.3691485524177551, |
| 1069 | + "eval_f1": 0.9343711842996372, |
| 1070 | + "eval_accuracy": 0.934375, |
| 1071 | + "eval_runtime": 41.6167, |
| 1072 | + "eval_samples_per_second": 192.23, |
| 1073 | + "eval_steps_per_second": 12.014, |
| 1074 | + "epoch": 5.0, |
| 1075 | + "step": 10000, |
| 1076 | + "max_steps": 10000, |
| 1077 | + "timestamp": "2024-03-07 02:37:31.762917" |
| 1078 | +} |
| 1079 | +``` |
| 1080 | + |
| 1081 | +Once the training is completed, you can query the pgml.logs table using the model_id, or by finding the latest model on the project (a query sketch follows the log output below). |
| 1082 | + |
| 1083 | +```bash |
| 1084 | +pgml=# SELECT logs->>'epoch' AS epoch, logs->>'step' AS step, logs->>'loss' AS loss FROM pgml.logs WHERE model_id = 993 AND jsonb_exists(logs, 'loss'); |
| 1085 | + epoch | step | loss |
| 1086 | +-------+-------+-------- |
| 1087 | + 0.25 | 500 | 0.3453 |
| 1088 | + 0.5 | 1000 | 0.2479 |
| 1089 | + 0.75 | 1500 | 0.223 |
| 1090 | + 1.0 | 2000 | 0.2165 |
| 1091 | + 1.25 | 2500 | 0.1485 |
| 1092 | + 1.5 | 3000 | 0.1563 |
| 1093 | + 1.75 | 3500 | 0.1559 |
| 1094 | + 2.0 | 4000 | 0.142 |
| 1095 | + 2.25 | 4500 | 0.0816 |
| 1096 | + 2.5 | 5000 | 0.0942 |
| 1097 | + 2.75 | 5500 | 0.075 |
| 1098 | + 3.0 | 6000 | 0.0883 |
| 1099 | + 3.25 | 6500 | 0.0432 |
| 1100 | + 3.5 | 7000 | 0.0426 |
| 1101 | + 3.75 | 7500 | 0.0444 |
| 1102 | + 4.0 | 8000 | 0.0504 |
| 1103 | + 4.25 | 8500 | 0.0186 |
| 1104 | + 4.5 | 9000 | 0.0265 |
| 1105 | + 4.75 | 9500 | 0.0248 |
| 1106 | + 5.0 | 10000 | 0.0284 |
| 1107 | +``` |
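If you do not know the model_id, the sketch below locates the most recent model trained under the project. It assumes the default pgml schema, where `pgml.models` carries a `project_id` referencing `pgml.projects` and a `created_at` timestamp:

```sql
-- Find the latest model id for the fine-tuning project
SELECT m.id AS model_id, m.created_at
FROM pgml.models m
JOIN pgml.projects p ON p.id = m.project_id
WHERE p.name = 'imdb_review_sentiment'
ORDER BY m.created_at DESC
LIMIT 1;
```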
| 1108 | + |
| 1109 | +During training, the model is periodically uploaded to the Hugging Face Hub. You will find the model at `https://huggingface.co/<username>/<project_name>`. An example model that was automatically pushed to the Hugging Face Hub is [here](https://huggingface.co/santiadavani/imdb_review_sentiement). |
| 1110 | + |
| 1111 | +### 6. Inference using fine-tuned model |
| 1112 | +Now that we have a fine-tuned model on the Hugging Face Hub, we can use [`pgml.transform`](https://postgresml.org/docs/introduction/apis/sql-extensions/pgml.transform/text-classification) to perform real-time predictions as well as batch predictions. |
| 1113 | + |
| 1114 | +**Real-time predictions** |
| 1115 | +Here is an example pgml.transform call for real-time predictions on the newly minted LLM fine-tuned on the IMDB review dataset. |
| 1116 | +```sql |
| 1117 | + SELECT pgml.transform( |
| 1118 | + task => '{ |
| 1119 | + "task": "text-classification", |
| 1120 | + "model": "santiadavani/imdb_review_sentiement" |
| 1121 | + }'::JSONB, |
| 1122 | + inputs => ARRAY[ |
| 1123 | + 'I would not give this movie a rating, its not worthy. I watched it only because I am a Pfieffer fan. ', |
| 1124 | + 'This movie was sooooooo good! It was hilarious! There are so many jokes that you can just watch the' |
| 1125 | + ] |
| 1126 | +); |
| 1127 | + transform |
| 1128 | +-------------------------------------------------------------------------------------------------------- |
| 1129 | + [{"label": "negative", "score": 0.999561846256256}, {"label": "positive", "score": 0.986771047115326}] |
| 1130 | +(1 row) |
| 1131 | +
|
| 1132 | +Time: 175.264 ms |
| 1133 | +``` |
| 1134 | + |
| 1135 | + |
| 1136 | +**Batch predictions** |
| 1137 | + |
| 1138 | +```sql |
| 1139 | +pgml=# SELECT |
| 1140 | + LEFT(text, 100) AS truncated_text, |
| 1141 | + class, |
| 1142 | + predicted_class[0]->>'label' AS predicted_class, |
| 1143 | + (predicted_class[0]->>'score')::float AS score |
| 1144 | +FROM ( |
| 1145 | + SELECT |
| 1146 | + LEFT(text, 100) AS text, |
| 1147 | + class, |
| 1148 | + pgml.transform( |
| 1149 | + task => '{ |
| 1150 | + "task": "text-classification", |
| 1151 | + "model": "santiadavani/imdb_review_sentiement" |
| 1152 | + }'::JSONB, |
| 1153 | + inputs => ARRAY[text] |
| 1154 | + ) AS predicted_class |
| 1155 | + FROM pgml.imdb_test_view |
| 1156 | + LIMIT 2 |
| 1157 | +) AS subquery; |
| 1158 | + truncated_text | class | predicted_class | score |
| 1159 | +------------------------------------------------------------------------------------------------------+----------+-----------------+-------------------- |
| 1160 | + I wouldn't give this movie a rating, it's not worthy. I watched it only because I'm a Pfieffer fan. | negative | negative | 0.9996490478515624 |
| 1161 | + This movie was sooooooo good! It was hilarious! There are so many jokes that you can just watch the | positive | positive | 0.9972313046455384 |
| 1162 | +
|
| 1163 | + Time: 1337.290 ms (00:01.337) |
| 1164 | +``` |
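The same batch pattern extends to a rough accuracy check on the held-out view. This is a sketch that assumes, as in the example above, that `pgml.transform` returns a JSONB array whose first element holds the predicted label; scoring many rows this way can be slow, so it samples only 100 reviews:

```sql
-- Rough accuracy over a small sample of the test view
SELECT AVG((class = prediction[0]->>'label')::int) AS accuracy
FROM (
    SELECT
        class,
        pgml.transform(
            task => '{
                "task": "text-classification",
                "model": "santiadavani/imdb_review_sentiement"
            }'::JSONB,
            inputs => ARRAY[text]
        ) AS prediction
    FROM pgml.imdb_test_view
    LIMIT 100
) AS sample;
```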
| 1165 | + |
| 1166 | +## 7. Restarting Training from a Previously Trained Model |
| 1167 | + |
| 1168 | +Sometimes, it's necessary to restart the training process from a previously trained model. This can be advantageous for various reasons, such as model fine-tuning, hyperparameter adjustments, or addressing interruptions in the training process. `pgml.tune` provides a seamless way to restart training while leveraging the progress made in the existing model. Below is a guide on how to restart training using a previous model as a starting point: |
| 1169 | +
|
| 1170 | +### Define the Previous Model |
| 1171 | +
|
| 1172 | +Specify the name of the existing model you want to use as a starting point. This is achieved by setting the `model_name` parameter in the `pgml.tune` function. In the example below, it is set to 'santiadavani/imdb_review_sentiement'. |
| 1173 | +
|
| 1174 | +```sql |
| 1175 | +model_name => 'santiadavani/imdb_review_sentiement', |
| 1176 | +``` |
| 1177 | +
|
| 1178 | +### Adjust Hyperparameters |
| 1179 | +Fine-tune hyperparameters as needed for the restarted training process. This might include modifying learning rates, batch sizes, or training epochs. In the example below, hyperparameters such as learning rate, batch sizes, and epochs are adjusted. |
| 1180 | +
|
| 1181 | +```sql |
| 1182 | +"training_args": { |
| 1183 | + "learning_rate": 2e-5, |
| 1184 | + "per_device_train_batch_size": 16, |
| 1185 | + "per_device_eval_batch_size": 16, |
| 1186 | + "num_train_epochs": 1, |
| 1187 | + "weight_decay": 0.01, |
| 1188 | + "hub_token": "", |
| 1189 | + "push_to_hub": true |
| 1190 | +}, |
| 1191 | +``` |
| 1192 | +
|
| 1193 | +### Ensure Consistent Dataset Configuration |
| 1194 | +Confirm that the dataset configuration remains consistent, including specifying the same text and class columns as in the previous training. This ensures compatibility between the existing model and the restarted training process. |
| 1195 | +
|
| 1196 | +```sql |
| 1197 | +"dataset_args": { |
| 1198 | + "text_column": "text", |
| 1199 | + "class_column": "class" |
| 1200 | +}, |
| 1201 | +``` |
| 1202 | +
|
| 1203 | +### Run the pgml.tune Function |
| 1204 | +Execute the `pgml.tune` function with the updated parameters to initiate the training restart. The function will leverage the existing model and adapt it based on the adjusted hyperparameters and dataset configuration. |
| 1205 | +
|
| 1206 | +```sql |
| 1207 | +SELECT pgml.tune( |
| 1208 | +    'imdb_review_sentiement', |
| 1209 | +    task => 'text-classification', |
| 1210 | +    relation_name => 'pgml.imdb_train_view', |
| 1211 | +    model_name => 'santiadavani/imdb_review_sentiement', |
| 1212 | +    test_size => 0.2, |
| 1213 | +    test_sampling => 'last', |
| 1214 | +    hyperparams => '{ |
| 1215 | +        "training_args": { |
| 1216 | +            "learning_rate": 2e-5, |
| 1217 | +            "per_device_train_batch_size": 16, |
| 1218 | +            "per_device_eval_batch_size": 16, |
| 1219 | +            "num_train_epochs": 1, |
| 1220 | +            "weight_decay": 0.01, |
| 1221 | +            "hub_token": "", |
| 1222 | +            "push_to_hub": true |
| 1223 | +        }, |
| 1224 | +        "dataset_args": { "text_column": "text", "class_column": "class" } |
| 1225 | +    }' |
| 1226 | +); |
| 1227 | +``` |
| 1228 | +
|
| 1229 | +By following these steps, you can effectively restart training from a previously trained model, allowing for further refinement and adaptation of the model based on new requirements or insights. Adjust parameters as needed for your specific use case and dataset. |
| 1230 | +
|
| 1231 | +## Conclusion |
| 1232 | +
|
| 1233 | +By following these steps, you can leverage PostgresML to seamlessly integrate fine-tuning of Language Models for text classification directly within your PostgreSQL database. Adjust the dataset, model, and hyperparameters to suit your specific requirements. |
870 | 1234 |
|