DEV Community

totalSophie

Demystifying MLOps: Week 1

Notes from MLOps ZoomCamp

1.1 What is MLOps

MLOps (Machine Learning Operations) refers to the practices, processes, and tools used to manage the entire lifecycle of machine learning models. It bridges the gap between data scientists, software engineers, and operations teams to ensure successful deployment and maintenance of ML models.

Key Components

  • Data Management and Versioning
  • Model Training and Evaluation
  • Deployment and Infrastructure
  • Continuous Integration and Delivery
  • Monitoring and Governance

1.2 Environment Preparation

You can use an EC2 instance or your local environment.

Step 1

Download and install the Anaconda distribution of Python:

wget https://repo.anaconda.com/archive/Anaconda3-2022.05-Linux-x86_64.sh
bash Anaconda3-2022.05-Linux-x86_64.sh

Step 2

Update existing packages:

sudo apt update

Step 3

Install Docker:

sudo apt install docker.io

Step 4

Create a separate directory for the installation and get the latest release of Docker Compose:

mkdir soft
cd soft
wget https://github.com/docker/compose/releases/download/v2.18.0/docker-compose-linux-x86_64 -O docker-compose
chmod +x docker-compose
nano ~/.bashrc

Add the following line to the .bashrc file:

export PATH="${HOME}/soft:${PATH}"

Save and exit the .bashrc file, then apply the changes:

source ~/.bashrc

Step 5

Run Docker to check if it's working:

docker run hello-world

1.3 Training a ride duration prediction model

Dataset

The dataset used is the 2022 NYC green taxi trip records.
More information on the data can be found at https://www.nyc.gov/assets/tlc/downloads/pdf/data_dictionary_trip_records_yellow.pdf

Download the dataset

!wget https://d37ci6vzurychx.cloudfront.net/trip-data/green_tripdata_2022-01.parquet

Imports

Import required packages

import pandas as pd
import pickle
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Lasso
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

Reading the file:

jan_data = pd.read_parquet("./data/green_tripdata_2022-01.parquet")
jan_data.head()
The first five rows show the 20 columns: VendorID, lpep_pickup_datetime, lpep_dropoff_datetime, store_and_fwd_flag, RatecodeID, PULocationID, DOLocationID, passenger_count, trip_distance, fare_amount, extra, mta_tax, tip_amount, tolls_amount, ehail_fee, improvement_surcharge, total_amount, payment_type, trip_type, congestion_surcharge.

jan_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 62495 entries, 0 to 62494
Data columns (total 20 columns):
 #   Column                 Non-Null Count  Dtype
---  ------                 --------------  -----
 0   VendorID               62495 non-null  int64
 1   lpep_pickup_datetime   62495 non-null  datetime64[ns]
 2   lpep_dropoff_datetime  62495 non-null  datetime64[ns]
 3   store_and_fwd_flag     56200 non-null  object
 4   RatecodeID             56200 non-null  float64
 5   PULocationID           62495 non-null  int64
 6   DOLocationID           62495 non-null  int64
 7   passenger_count        56200 non-null  float64
 8   trip_distance          62495 non-null  float64
 9   fare_amount            62495 non-null  float64
 10  extra                  62495 non-null  float64
 11  mta_tax                62495 non-null  float64
 12  tip_amount             62495 non-null  float64
 13  tolls_amount           62495 non-null  float64
 14  ehail_fee              0 non-null      object
 15  improvement_surcharge  62495 non-null  float64
 16  total_amount           62495 non-null  float64
 17  payment_type           56200 non-null  float64
 18  trip_type              56200 non-null  float64
 19  congestion_surcharge   56200 non-null  float64
dtypes: datetime64[ns](2), float64(13), int64(3), object(2)
memory usage: 9.5+ MB

Calculate duration of trip from dropoff and pickup times

jan_dropoff = pd.to_datetime(jan_data["lpep_dropoff_datetime"])
jan_pickup = pd.to_datetime(jan_data["lpep_pickup_datetime"])
jan_data["duration"] = jan_dropoff - jan_pickup
# Convert the values to minutes
jan_data["duration"] = jan_data.duration.apply(lambda td: td.total_seconds()/60)
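To see the timedelta-to-minutes conversion in isolation, here is a minimal sketch on two made-up trips (only the column names follow the taxi schema; the timestamps are invented):

```python
import pandas as pd

# Two toy trips: 90 seconds and 15 minutes
df = pd.DataFrame({
    "lpep_pickup_datetime": pd.to_datetime(["2022-01-01 00:00:00", "2022-01-01 01:00:00"]),
    "lpep_dropoff_datetime": pd.to_datetime(["2022-01-01 00:01:30", "2022-01-01 01:15:00"]),
})

# Subtracting two datetime columns yields a timedelta column;
# total_seconds()/60 turns each timedelta into a float number of minutes
df["duration"] = df["lpep_dropoff_datetime"] - df["lpep_pickup_datetime"]
df["duration"] = df.duration.apply(lambda td: td.total_seconds() / 60)

print(df["duration"].tolist())  # [1.5, 15.0]
```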

Check the distribution of the duration

jan_data.duration.describe(percentiles=[0.95, 0.98, 0.99])

count    62495.000000
mean        19.019387
std         78.215732
min          0.000000
50%         11.583333
95%         35.438333
98%         49.722667
99%         68.453000
max       1439.466667
Name: duration, dtype: float64

sns.distplot(jan_data.duration)  # distplot is deprecated in newer seaborn; sns.histplot is the replacement

[Distribution plot of trip duration]

The distribution is skewed due to the presence of outliers.
We keep only the records with a duration between 1 and 60 minutes:

jan_data = jan_data[(jan_data.duration >= 1) & (jan_data.duration <= 60)]

One Hot Encoding

We use scikit-learn's DictVectorizer for one-hot encoding.
The categorical features we will consider are the pickup and dropoff location IDs.

categorical = ["PULocationID", "DOLocationID"]
numerical = ["trip_distance"]

Convert the categorical columns from integers to strings:


jan_data.loc[:, categorical] = jan_data[categorical].astype(str)

# Change our values to dictionaries
train_jan_data = jan_data[categorical + numerical].to_dict(orient='records')

dv = DictVectorizer()
X_train_jan = dv.fit_transform(train_jan_data)

# Convert the feature matrix to an array
fm_array = X_train_jan.toarray()

# Get the dimensionality of the feature matrix
fm_array.shape

(59837, 471)
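To make the 471-column result less mysterious, here is a tiny sketch of what DictVectorizer does, using made-up location IDs: each distinct string value gets its own one-hot column, while numeric values pass through unchanged.

```python
from sklearn.feature_extraction import DictVectorizer

# Made-up records: two categorical IDs (as strings) and one numeric feature
records = [
    {"PULocationID": "43", "DOLocationID": "151", "trip_distance": 1.01},
    {"PULocationID": "166", "DOLocationID": "239", "trip_distance": 2.53},
]

dv = DictVectorizer()
X = dv.fit_transform(records)

# 2 one-hot columns per categorical feature + 1 numeric column = 5 columns
print(X.shape)            # (2, 5)
print(dv.feature_names_)  # e.g. ['DOLocationID=151', 'DOLocationID=239', ...]
```

The full dataset has hundreds of distinct location IDs, which is why the real feature matrix ends up with 471 columns.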

Python function for the above steps

A custom function to read and preprocess the data:

def read_dataframe(filename):
    # Read the parquet file
    df = pd.read_parquet(filename)
    # Calculate the duration in minutes
    df_dropoff = pd.to_datetime(df["lpep_dropoff_datetime"])
    df_pickup = pd.to_datetime(df["lpep_pickup_datetime"])
    df["duration"] = df_dropoff - df_pickup
    df["duration"] = df.duration.apply(lambda td: td.total_seconds()/60)
    # Remove outliers
    df = df[(df.duration >= 1) & (df.duration <= 60)]
    # Preparation for One Hot Encoding using DictVectorizer
    categorical = ["PULocationID", "DOLocationID"]
    df[categorical] = df[categorical].astype(str)
    return df

Fitting Linear Regression Model

# Using January data as train and Feb as validation
df_train = read_dataframe("./data/green_tripdata_2022-01.parquet")
df_val = read_dataframe("./data/green_tripdata_2022-02.parquet")
dv = DictVectorizer()
categorical = ["PULocationID", "DOLocationID"]
numerical = ["trip_distance"]
train_dicts = df_train[categorical + numerical].to_dict(orient='records')
X_train = dv.fit_transform(train_dicts)
val_dicts = df_val[categorical + numerical].to_dict(orient='records')
X_val = dv.transform(val_dicts)
target = 'duration'
y_train = df_train[target].values
y_val = df_val[target].values
lr = LinearRegression()
lr.fit(X_train, y_train)
y_pred = lr.predict(X_val)
# RMSE (in scikit-learn >= 1.4, root_mean_squared_error is preferred)
mean_squared_error(y_val, y_pred, squared=False)

8.364575685718151

Try other models, like Lasso and Ridge.
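A self-contained sketch of how that comparison might look. Synthetic data stands in for the X_train/y_train/X_val/y_val matrices built above, and the alpha values are illustrative, not tuned:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge
from sklearn.metrics import mean_squared_error

# Synthetic stand-in for the train/validation matrices built above
rng = np.random.default_rng(42)
coef = rng.normal(size=10)
X_train = rng.normal(size=(200, 10))
y_train = X_train @ coef + rng.normal(scale=0.1, size=200)
X_val = rng.normal(size=(50, 10))
y_val = X_val @ coef + rng.normal(scale=0.1, size=50)

# Fit each regularized model and record its validation RMSE
rmses = {}
for model in (Lasso(alpha=0.01), Ridge(alpha=1.0)):
    model.fit(X_train, y_train)
    rmses[type(model).__name__] = np.sqrt(mean_squared_error(y_val, model.predict(X_val)))

print(rmses)
```

On the taxi data the interesting part is whether regularization improves on plain LinearRegression's RMSE of ~8.36.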

Save the model

with open('models/lin_reg.bin', 'wb') as f_out:
    pickle.dump((dv, lr), f_out)
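At serving time the saved (dv, lr) pair is loaded back together, so new rides are encoded with exactly the same vectorizer the model was trained on. A self-contained sketch of the round trip, using a temporary file and toy data in place of models/lin_reg.bin:

```python
import os
import pickle
import tempfile
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LinearRegression

# Toy vectorizer + model standing in for the ones trained above
train_dicts = [
    {"PULocationID": "43", "DOLocationID": "151", "trip_distance": 1.0},
    {"PULocationID": "166", "DOLocationID": "239", "trip_distance": 3.0},
]
dv = DictVectorizer()
X = dv.fit_transform(train_dicts)
lr = LinearRegression().fit(X, [7.0, 18.0])

# Save both objects together so inference uses the same feature encoding
with tempfile.NamedTemporaryFile(suffix=".bin", delete=False) as f_out:
    pickle.dump((dv, lr), f_out)
    path = f_out.name

# Load them back and predict on new (here: the same) rides
with open(path, "rb") as f_in:
    dv_loaded, lr_loaded = pickle.load(f_in)

pred = lr_loaded.predict(dv_loaded.transform(train_dicts))
os.remove(path)
print(pred)  # close to [7.0, 18.0]
```

Pickling the vectorizer and the model as one tuple is what the course does here; it prevents the common bug of retraining a fresh DictVectorizer at serving time, which would assign different columns to the same location IDs.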

Cover photo by Alina Grubnyak on Unsplash
