
Notes from MLOps ZoomCamp
1.1 What is MLOps
MLOps (Machine Learning Operations) refers to the practices, processes, and tools used to manage the entire lifecycle of machine learning models. It bridges the gap between data scientists, software engineers, and operations teams to ensure successful deployment and maintenance of ML models.
Key Components
- Data Management and Versioning
- Model Training and Evaluation
- Deployment and Infrastructure
- Continuous Integration and Delivery
- Monitoring and Governance
1.2 Environment Preparation
You can use an EC2 instance or your local environment
Step 1
Download and install the Anaconda distribution of Python:
wget https://repo.anaconda.com/archive/Anaconda3-2022.05-Linux-x86_64.sh
bash Anaconda3-2022.05-Linux-x86_64.sh
Step 2
Update existing packages:
sudo apt update
Step 3
Install Docker:
sudo apt install docker.io
Step 4
Create a separate directory for the installation and get the latest release of Docker Compose:
mkdir soft
cd soft
wget https://github.com/docker/compose/releases/download/v2.18.0/docker-compose-linux-x86_64 -O docker-compose
chmod +x docker-compose
nano ~/.bashrc
Add the following line to the .bashrc file:
export PATH="${HOME}/soft:${PATH}"
Save and exit the .bashrc file, then apply the changes:
source ~/.bashrc
Step 5
Run Docker to check if it's working:
docker run hello-world
1.3 Training a ride duration prediction model
Dataset
The dataset used is the 2022 NYC green taxi trip records.
More information on the data is found at https://www.nyc.gov/assets/tlc/downloads/pdf/data_dictionary_trip_records_yellow.pdf
Download the dataset
!wget https://d37ci6vzurychx.cloudfront.net/trip-data/green_tripdata_2022-01.parquet
Imports
Import required packages
import pandas as pd
import pickle
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Lasso
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
Reading the file:
jan_data = pd.read_parquet("./data/green_tripdata_2022-01.parquet")
jan_data.head()
| | VendorID | lpep_pickup_datetime | lpep_dropoff_datetime | store_and_fwd_flag | RatecodeID | PULocationID | DOLocationID | passenger_count | trip_distance | fare_amount | extra | mta_tax | tip_amount | tolls_amount | ehail_fee | improvement_surcharge | total_amount | payment_type | trip_type | congestion_surcharge |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 2022-01-01 00:14:21 | 2022-01-01 00:15:33 | N | 1 | 42 | 42 | 1 | 0.44 | 3.5 | 0.5 | 0.5 | 0 | 0 | None | 0.3 | 4.8 | 2 | 1 | 0 |
| 1 | 1 | 2022-01-01 00:20:55 | 2022-01-01 00:29:38 | N | 1 | 116 | 41 | 1 | 2.1 | 9.5 | 0.5 | 0.5 | 0 | 0 | None | 0.3 | 10.8 | 2 | 1 | 0 |
| 2 | 1 | 2022-01-01 00:57:02 | 2022-01-01 01:13:14 | N | 1 | 41 | 140 | 1 | 3.7 | 14.5 | 3.25 | 0.5 | 4.6 | 0 | None | 0.3 | 23.15 | 1 | 1 | 2.75 |
| 3 | 2 | 2022-01-01 00:07:42 | 2022-01-01 00:15:57 | N | 1 | 181 | 181 | 1 | 1.69 | 8 | 0.5 | 0.5 | 0 | 0 | None | 0.3 | 9.3 | 2 | 1 | 0 |
| 4 | 2 | 2022-01-01 00:07:50 | 2022-01-01 00:28:52 | N | 1 | 33 | 170 | 1 | 6.26 | 22 | 0.5 | 0.5 | 5.21 | 0 | None | 0.3 | 31.26 | 1 | 1 | 2.75 |
jan_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 62495 entries, 0 to 62494
Data columns (total 20 columns):
 #   Column                 Non-Null Count  Dtype
---  ------                 --------------  -----
 0   VendorID               62495 non-null  int64
 1   lpep_pickup_datetime   62495 non-null  datetime64[ns]
 2   lpep_dropoff_datetime  62495 non-null  datetime64[ns]
 3   store_and_fwd_flag     56200 non-null  object
 4   RatecodeID             56200 non-null  float64
 5   PULocationID           62495 non-null  int64
 6   DOLocationID           62495 non-null  int64
 7   passenger_count        56200 non-null  float64
 8   trip_distance          62495 non-null  float64
 9   fare_amount            62495 non-null  float64
 10  extra                  62495 non-null  float64
 11  mta_tax                62495 non-null  float64
 12  tip_amount             62495 non-null  float64
 13  tolls_amount           62495 non-null  float64
 14  ehail_fee              0 non-null      object
 15  improvement_surcharge  62495 non-null  float64
 16  total_amount           62495 non-null  float64
 17  payment_type           56200 non-null  float64
 18  trip_type              56200 non-null  float64
 19  congestion_surcharge   56200 non-null  float64
dtypes: datetime64[ns](2), float64(13), int64(3), object(2)
memory usage: 9.5+ MB
Calculate duration of trip from dropoff and pickup times
jan_dropoff = pd.to_datetime(jan_data["lpep_dropoff_datetime"])
jan_pickup = pd.to_datetime(jan_data["lpep_pickup_datetime"])
jan_data["duration"] = jan_dropoff - jan_pickup

# Convert the values to minutes
jan_data["duration"] = jan_data.duration.apply(lambda td: td.total_seconds() / 60)
Check the distribution of the duration
jan_data.duration.describe(percentiles=[0.95, 0.98, 0.99])
count    62495.000000
mean        19.019387
std         78.215732
min          0.000000
50%         11.583333
95%         35.438333
98%         49.722667
99%         68.453000
max       1439.466667
Name: duration, dtype: float64
sns.distplot(jan_data.duration)
We can see the data is skewed due to the presence of outliers
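Note: sns.distplot is deprecated in recent seaborn releases. If it raises a warning or error for you, sns.histplot gives an equivalent view (a minimal sketch, assuming seaborn >= 0.11):

# Equivalent plot with the current seaborn API
sns.histplot(jan_data.duration, kde=True)
plt.show()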
Keeping only the records with the duration between 1 and 60 minutes
jan_data = jan_data[(jan_data.duration >= 1) & (jan_data.duration <= 60)]
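As a quick sanity check (my own addition, not in the original notebook), you can confirm how much data the filter keeps, using the 62495 rows reported by info() above:

# Share of January records that survive the outlier filter
print(f"{len(jan_data) / 62495:.2%} of records kept")  # roughly 96%, i.e. 59837 rows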
One Hot Encoding
Using Dictionary Vectorizer for One Hot Encoding
The categorical features we will consider are the pickup and dropoff locations.
categorical = ["PULocationID", "DOLocationID"]
numerical = ["trip_distance"]

# Convert the column type from integers to strings
jan_data.loc[:, categorical] = jan_data[categorical].astype(str)

# Change our values to dictionaries
train_jan_data = jan_data[categorical + numerical].to_dict(orient='records')

dv = DictVectorizer()
X_train_jan = dv.fit_transform(train_jan_data)

# Convert the feature matrix to an array
fm_array = X_train_jan.toarray()

# Get the dimensionality of the feature matrix
fm_array.shape
(59837, 471)
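The 471 columns are the one-hot encoded pickup and dropoff location IDs plus the single trip_distance column. You can confirm this by inspecting the fitted vectorizer (a small sketch; feature_names_ is the attribute DictVectorizer uses to store the learned features):

# Inspect the features learned by the DictVectorizer
print(len(dv.feature_names_))   # 471
print(dv.feature_names_[:3])    # e.g. ['DOLocationID=1', 'DOLocationID=10', 'DOLocationID=100']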
A Python function that performs the above steps
Custom function to read and preprocess the data
def read_dataframe(filename):
    # Read the parquet file
    df = pd.read_parquet(filename)

    # Calculate the duration in minutes
    df_dropoff = pd.to_datetime(df["lpep_dropoff_datetime"])
    df_pickup = pd.to_datetime(df["lpep_pickup_datetime"])
    df["duration"] = df_dropoff - df_pickup
    df["duration"] = df.duration.apply(lambda td: td.total_seconds() / 60)

    # Remove outliers
    df = df[(df.duration >= 1) & (df.duration <= 60)]

    # Preparation for one hot encoding using DictVectorizer
    categorical = ["PULocationID", "DOLocationID"]
    df[categorical] = df[categorical].astype(str)

    return df
Fitting Linear Regression Model
# Using January data as train and Feb as validation
df_train = read_dataframe("./data/green_tripdata_2022-01.parquet")
df_val = read_dataframe("./data/green_tripdata_2022-02.parquet")

dv = DictVectorizer()
categorical = ["PULocationID", "DOLocationID"]
numerical = ["trip_distance"]

train_dicts = df_train[categorical + numerical].to_dict(orient='records')
X_train = dv.fit_transform(train_dicts)

val_dicts = df_val[categorical + numerical].to_dict(orient='records')
X_val = dv.transform(val_dicts)

target = 'duration'
y_train = df_train[target].values
y_val = df_val[target].values

lr = LinearRegression()
lr.fit(X_train, y_train)
y_pred = lr.predict(X_val)
mean_squared_error(y_val, y_pred, squared=False)
8.364575685718151
Try other models like Lasso and Ridge.
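A minimal sketch for comparing them against the linear regression baseline (the alpha value here is an arbitrary choice, not one from the course):

# Fit regularized variants and compare validation RMSE
for model in [Lasso(alpha=0.01), Ridge(alpha=0.01)]:
    model.fit(X_train, y_train)
    y_pred = model.predict(X_val)
    print(type(model).__name__, mean_squared_error(y_val, y_pred, squared=False))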
Save the model
with open('models/lin_reg.bin', 'wb') as f_out:
    pickle.dump((dv, lr), f_out)
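To use the model later, load the vectorizer and model back the same way (a sketch mirroring the save step):

# Load the DictVectorizer and fitted model back from disk
with open('models/lin_reg.bin', 'rb') as f_in:
    dv, lr = pickle.load(f_in)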
Cover Photo by Alina Grubnyak on Unsplash