
Notes from MLOps ZoomCamp
1.1 What is MLOps
MLOps (Machine Learning Operations) refers to the practices, processes, and tools used to manage the entire lifecycle of machine learning models. It bridges the gap between data scientists, software engineers, and operations teams to ensure successful deployment and maintenance of ML models.
Key Components
- Data Management and Versioning
- Model Training and Evaluation
- Deployment and Infrastructure
- Continuous Integration and Delivery
- Monitoring and Governance
1.2 Environment Preparation
You can use an EC2 instance or your local environment
Step 1
Download and install the Anaconda distribution of Python:
wget https://repo.anaconda.com/archive/Anaconda3-2022.05-Linux-x86_64.sh
bash Anaconda3-2022.05-Linux-x86_64.sh
Step 2
Update existing packages:
sudo apt update
Step 3
Install Docker:
sudo apt install docker.io
Step 4
Create a separate directory for the installation and get the latest release of Docker Compose:
mkdir soft
cd soft
wget https://github.com/docker/compose/releases/download/v2.18.0/docker-compose-linux-x86_64 -O docker-compose
chmod +x docker-compose
nano ~/.bashrc
Add the following line to the .bashrc file:
export PATH="${HOME}/soft:${PATH}"
Save and exit the .bashrc file, then apply the changes:
source ~/.bashrc
Step 5
Run Docker to check if it's working:
docker run hello-world
1.3 Training a ride duration prediction model
Dataset
The dataset used is the 2022 NYC green taxi trip records.
More information on the data is found at https://www.nyc.gov/assets/tlc/downloads/pdf/data_dictionary_trip_records_yellow.pdf
Download the dataset
!wget https://d37ci6vzurychx.cloudfront.net/trip-data/green_tripdata_2022-01.parquet
Imports
Import required packages
import pandas as pd
import pickle
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Lasso
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
Reading the file:
jan_data = pd.read_parquet("./data/green_tripdata_2022-01.parquet")
jan_data.head()
| | VendorID | lpep_pickup_datetime | lpep_dropoff_datetime | store_and_fwd_flag | RatecodeID | PULocationID | DOLocationID | passenger_count | trip_distance | fare_amount | extra | mta_tax | tip_amount | tolls_amount | ehail_fee | improvement_surcharge | total_amount | payment_type | trip_type | congestion_surcharge |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 2022-01-01 00:14:21 | 2022-01-01 00:15:33 | N | 1 | 42 | 42 | 1 | 0.44 | 3.5 | 0.5 | 0.5 | 0 | 0 | None | 0.3 | 4.8 | 2 | 1 | 0 |
| 1 | 1 | 2022-01-01 00:20:55 | 2022-01-01 00:29:38 | N | 1 | 116 | 41 | 1 | 2.1 | 9.5 | 0.5 | 0.5 | 0 | 0 | None | 0.3 | 10.8 | 2 | 1 | 0 |
| 2 | 1 | 2022-01-01 00:57:02 | 2022-01-01 01:13:14 | N | 1 | 41 | 140 | 1 | 3.7 | 14.5 | 3.25 | 0.5 | 4.6 | 0 | None | 0.3 | 23.15 | 1 | 1 | 2.75 |
| 3 | 2 | 2022-01-01 00:07:42 | 2022-01-01 00:15:57 | N | 1 | 181 | 181 | 1 | 1.69 | 8 | 0.5 | 0.5 | 0 | 0 | None | 0.3 | 9.3 | 2 | 1 | 0 |
| 4 | 2 | 2022-01-01 00:07:50 | 2022-01-01 00:28:52 | N | 1 | 33 | 170 | 1 | 6.26 | 22 | 0.5 | 0.5 | 5.21 | 0 | None | 0.3 | 31.26 | 1 | 1 | 2.75 |
jan_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 62495 entries, 0 to 62494
Data columns (total 20 columns):
 #   Column                 Non-Null Count  Dtype
---  ------                 --------------  -----
 0   VendorID               62495 non-null  int64
 1   lpep_pickup_datetime   62495 non-null  datetime64[ns]
 2   lpep_dropoff_datetime  62495 non-null  datetime64[ns]
 3   store_and_fwd_flag     56200 non-null  object
 4   RatecodeID             56200 non-null  float64
 5   PULocationID           62495 non-null  int64
 6   DOLocationID           62495 non-null  int64
 7   passenger_count        56200 non-null  float64
 8   trip_distance          62495 non-null  float64
 9   fare_amount            62495 non-null  float64
 10  extra                  62495 non-null  float64
 11  mta_tax                62495 non-null  float64
 12  tip_amount             62495 non-null  float64
 13  tolls_amount           62495 non-null  float64
 14  ehail_fee              0 non-null      object
 15  improvement_surcharge  62495 non-null  float64
 16  total_amount           62495 non-null  float64
 17  payment_type           56200 non-null  float64
 18  trip_type              56200 non-null  float64
 19  congestion_surcharge   56200 non-null  float64
dtypes: datetime64[ns](2), float64(13), int64(3), object(2)
memory usage: 9.5+ MB
Calculate duration of trip from dropoff and pickup times
jan_dropoff = pd.to_datetime(jan_data["lpep_dropoff_datetime"])
jan_pickup = pd.to_datetime(jan_data["lpep_pickup_datetime"])
jan_data["duration"] = jan_dropoff - jan_pickup

# Convert the values to minutes
jan_data["duration"] = jan_data.duration.apply(lambda td: td.total_seconds() / 60)
Check the distribution of the duration
jan_data.duration.describe(percentiles=[0.95, 0.98, 0.99])
count    62495.000000
mean        19.019387
std         78.215732
min          0.000000
50%         11.583333
95%         35.438333
98%         49.722667
99%         68.453000
max       1439.466667
Name: duration, dtype: float64
sns.distplot(jan_data.duration)
We can see the data is skewed due to the presence of outliers
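Note: sns.distplot is deprecated in recent seaborn releases. If it raises a warning or error for you, sns.histplot gives an equivalent view (a minimal sketch, assuming seaborn >= 0.11):

# Equivalent plot with the current seaborn API
sns.histplot(jan_data.duration, kde=True)
plt.show()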
Keeping only the records with the duration between 1 and 60 minutes
jan_data = jan_data[(jan_data.duration >= 1) & (jan_data.duration <= 60)]
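As a quick sanity check (my own addition, not in the original notebook), you can confirm how much data the filter keeps, using the 62495 rows reported by info() above:

# Share of January records that survive the outlier filter
print(f"{len(jan_data) / 62495:.2%} of records kept")  # roughly 96%, i.e. 59837 rows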
One Hot Encoding
Using Dictionary Vectorizer for One Hot Encoding
The categorical features we will consider are the pickup and dropoff locations.
categorical = ["PULocationID", "DOLocationID"]
numerical = ["trip_distance"]

# Convert the column type from integers to strings
jan_data.loc[:, categorical] = jan_data[categorical].astype(str)

# Change our values to dictionaries
train_jan_data = jan_data[categorical + numerical].to_dict(orient='records')

dv = DictVectorizer()
X_train_jan = dv.fit_transform(train_jan_data)

# Convert the feature matrix to an array
fm_array = X_train_jan.toarray()

# Get the dimensionality of the feature matrix
fm_array.shape
(59837, 471)
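The 471 columns are the one-hot encoded pickup and dropoff location IDs plus the single trip_distance column. You can confirm this by inspecting the fitted vectorizer (a small sketch; feature_names_ is the attribute DictVectorizer uses to store the learned features):

# Inspect the features learned by the DictVectorizer
print(len(dv.feature_names_))   # 471
print(dv.feature_names_[:3])    # e.g. ['DOLocationID=1', 'DOLocationID=10', 'DOLocationID=100']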
A Python function that performs the above steps
Custom function to read and preprocess the data
def read_dataframe(filename):
    # Read the parquet file
    df = pd.read_parquet(filename)

    # Calculate the duration in minutes
    df_dropoff = pd.to_datetime(df["lpep_dropoff_datetime"])
    df_pickup = pd.to_datetime(df["lpep_pickup_datetime"])
    df["duration"] = df_dropoff - df_pickup
    df["duration"] = df.duration.apply(lambda td: td.total_seconds() / 60)

    # Remove outliers
    df = df[(df.duration >= 1) & (df.duration <= 60)]

    # Preparation for one hot encoding using DictVectorizer
    categorical = ["PULocationID", "DOLocationID"]
    df[categorical] = df[categorical].astype(str)

    return df
Fitting Linear Regression Model
# Using January data as train and Feb as validation
df_train = read_dataframe("./data/green_tripdata_2022-01.parquet")
df_val = read_dataframe("./data/green_tripdata_2022-02.parquet")

dv = DictVectorizer()
categorical = ["PULocationID", "DOLocationID"]
numerical = ["trip_distance"]

train_dicts = df_train[categorical + numerical].to_dict(orient='records')
X_train = dv.fit_transform(train_dicts)

val_dicts = df_val[categorical + numerical].to_dict(orient='records')
X_val = dv.transform(val_dicts)

target = 'duration'
y_train = df_train[target].values
y_val = df_val[target].values

lr = LinearRegression()
lr.fit(X_train, y_train)
y_pred = lr.predict(X_val)
mean_squared_error(y_val, y_pred, squared=False)
8.364575685718151
Try other models like Lasso and Ridge.
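A minimal sketch for comparing them against the linear regression baseline (the alpha value here is an arbitrary choice, not one from the course):

# Fit regularized variants and compare validation RMSE
for model in [Lasso(alpha=0.01), Ridge(alpha=0.01)]:
    model.fit(X_train, y_train)
    y_pred = model.predict(X_val)
    print(type(model).__name__, mean_squared_error(y_val, y_pred, squared=False))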
Save the model
with open('models/lin_reg.bin', 'wb') as f_out:
    pickle.dump((dv, lr), f_out)
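To use the model later, load the vectorizer and model back the same way (a sketch mirroring the save step):

# Load the DictVectorizer and fitted model back from disk
with open('models/lin_reg.bin', 'rb') as f_in:
    dv, lr = pickle.load(f_in)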
Cover Photo by Alina Grubnyak on Unsplash