Posted on Feb 1

Predicting House Prices as Your First ML Project

#machinelearning #ai #datascience #python

Machine Learning (ML) can seem intimidating at first, but the best way to learn is by doing. In this article, we’ll walk through a beginner-friendly ML project: predicting house prices using the Boston Housing dataset. By the end of this guide, you’ll have built your first ML model using Python and Scikit-learn. Let’s get started!

What is the Boston Housing Dataset?

The Boston Housing dataset is a classic dataset used for regression problems. It contains information about housing prices in the Boston area, along with features that might influence those prices, such as:

CRIM: Per capita crime rate by town.
RM: Average number of rooms per dwelling.
AGE: Proportion of owner-occupied units built before 1940.
DIS: Weighted distances to five Boston employment centers.
LSTAT: Percentage of lower status of the population.
MEDV: Median value of owner-occupied homes in $1000s (the target variable we want to predict).

Our goal is to build a model that predicts the median house price (MEDV) based on these features.

Step 1: Set Up Your Environment

Before we start, make sure you have the necessary libraries installed. You can install them using pip:

pip install numpy pandas scikit-learn matplotlib

Step 2: Load the Dataset

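Older versions of Scikit-learn shipped the Boston Housing dataset as load_boston, but that function was deprecated in version 1.0 and removed in version 1.2. The code below reads the data from the original StatLib file instead, following the workaround suggested in Scikit-learn’s deprecation notice. Let’s load it and explore the data.

# Import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# load_boston was removed in scikit-learn 1.2, so read the original file instead.
# Each record spans two physical lines: 11 values, then 2 more features plus the target.
data_url = "http://lib.stat.cmu.edu/datasets/boston"
raw_df = pd.read_csv(data_url, sep=r"\s+", skiprows=22, header=None)
features = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
target = raw_df.values[1::2, 2]

# Convert it to a Pandas DataFrame for easier manipulation
feature_names = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT']
data = pd.DataFrame(features, columns=feature_names)
data['MEDV'] = target  # Add the target variable to the DataFrame

# Display the first few rows
print(data.head())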

Step 3: Explore the Data

Before building a model, it’s important to understand the data. Let’s perform some basic exploratory data analysis (EDA).

# Check for missing values
print(data.isnull().sum())

# Get basic statistics
print(data.describe())

# Visualize the relationship between one feature (RM) and the target variable
plt.scatter(data['RM'], data['MEDV'])
plt.xlabel('Average Number of Rooms (RM)')
plt.ylabel('Median House Price (MEDV)')
plt.title('Rooms vs. Price')
plt.show()

From the scatter plot, you can see that houses with more rooms tend to have higher prices. This is a good sign that RM, at least, carries useful signal for predicting the target variable.
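A scatter plot covers one feature at a time. To rank all features at once, one option is to compute each feature's correlation with the target. The sketch below uses a small synthetic DataFrame (the name toy and its values are made up for illustration, not taken from the Boston data); with the real dataset you would call the same method on data.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data (NOT real Boston values): price rises with rooms
# and falls with the lower-status percentage, plus random noise.
rng = np.random.default_rng(0)
rooms = rng.uniform(4, 9, size=200)
lstat = rng.uniform(2, 35, size=200)
price = 9 * rooms - 0.6 * lstat - 20 + rng.normal(0, 3, size=200)
toy = pd.DataFrame({"RM": rooms, "LSTAT": lstat, "MEDV": price})

# Pearson correlation of every feature with the target, strongest first
correlations = (
    toy.corr()["MEDV"]
    .drop("MEDV")
    .sort_values(key=np.abs, ascending=False)
)
print(correlations)
```

A value near +1 or -1 indicates a strong linear relationship, while a value near 0 indicates a weak one. Correlation only captures linear trends, so it complements the plots rather than replacing them.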

Step 4: Prepare the Data

Next, we’ll split the data into features (X) and labels (y), and then split it into training and testing sets.

from sklearn.model_selection import train_test_split

# Features (X) and labels (y)
X = data.drop('MEDV', axis=1)  # All columns except 'MEDV'
y = data['MEDV']  # Only the 'MEDV' column

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
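To see what test_size and random_state do, here is a minimal sketch on ten toy samples (X_toy and y_toy are placeholder arrays, not housing data). It shows that test_size=0.2 holds out 20% of the rows and that a fixed random_state makes the split repeatable.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten toy samples with two features each (placeholder values)
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.arange(10)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42
)
print(X_tr.shape, X_te.shape)  # 8 training rows, 2 test rows

# The same random_state reproduces the same split on every run
_, _, _, y_te_again = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42
)
print(np.array_equal(y_te, y_te_again))
```

Fixing random_state is what lets you compare results across runs, because the model is always evaluated on the same held-out rows.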

Step 5: Build and Train the Model

We’ll use a Linear Regression model, which is a simple and effective algorithm for regression problems.

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Initialize the model
model = LinearRegression()

# Train the model
model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = model.predict(X_test)
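A fitted linear regression is a weighted sum of the features plus an intercept, and you can inspect those weights directly. The sketch below fits a model on synthetic data with a known relationship (toy_X, toy_y, and the chosen coefficients are assumptions made for illustration) and reads the learned values back from coef_ and intercept_.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
toy_X = pd.DataFrame({
    "RM": rng.uniform(4, 9, size=300),
    "LSTAT": rng.uniform(2, 35, size=300),
})
# Known relationship: price = 5 * RM - 0.5 * LSTAT + 10, plus small noise
toy_y = 5 * toy_X["RM"] - 0.5 * toy_X["LSTAT"] + 10 + rng.normal(0, 0.1, size=300)

toy_model = LinearRegression().fit(toy_X, toy_y)

# One coefficient per feature, in column order, plus the intercept
coef_table = pd.Series(toy_model.coef_, index=toy_X.columns)
print(coef_table)
print("Intercept:", toy_model.intercept_)
```

On the housing model, the same pattern would be pd.Series(model.coef_, index=X.columns). Each coefficient estimates the change in predicted MEDV for a one-unit change in that feature with the others held fixed.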

Step 6: Evaluate the Model

To see how well our model performs, we’ll calculate two common metrics: Mean Squared Error (MSE) and R-squared (R²).

# Calculate Mean Squared Error (MSE)
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error: {mse:.2f}")

# Calculate R-squared (R²)
r2 = r2_score(y_test, y_pred)
print(f"R-squared: {r2:.2f}")

MSE: Measures the average squared difference between the predicted and actual values. Lower is better.
R²: Represents the proportion of variance in the target variable that’s explained by the model. Closer to 1 is better.
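To make the two definitions concrete, the sketch below computes both metrics by hand with NumPy and checks them against Scikit-learn. The five actual and predicted values are hand-picked toy numbers, not model output.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Toy values in $1000s (illustrative only)
y_true = np.array([20.0, 25.0, 30.0, 35.0, 40.0])
y_hat = np.array([21.0, 24.0, 32.0, 35.0, 39.0])

# MSE: mean of the squared residuals
mse_manual = np.mean((y_true - y_hat) ** 2)

# R² = 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

# RMSE is the square root of MSE, in the same units as the target
rmse = np.sqrt(mse_manual)

print(f"MSE:  {mse_manual:.2f}")   # 1.40
print(f"R²:   {r2_manual:.3f}")    # 0.972
print(f"RMSE: {rmse:.2f}")         # 1.18

# The manual results match Scikit-learn's functions
print(np.isclose(mse_manual, mean_squared_error(y_true, y_hat)))
print(np.isclose(r2_manual, r2_score(y_true, y_hat)))
```

Because MSE is in squared units, the RMSE is often easier to read: here the typical error is roughly 1.18, which corresponds to about $1,180 when the target is in $1000s.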

Step 7: Interpret the Results

Let’s interpret the results:

A low MSE indicates that the model’s predictions are close to the actual values.
An R² value close to 1 suggests that the model explains a large portion of the variance in house prices.

For example, an R² of 0.75 means that the model explains 75% of the variance in the test-set house prices; the remaining 25% is not captured by a linear combination of these features.
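A useful reference point when reading R² is a model that ignores the features and always predicts the mean, which scores an R² of 0 on the data it was fitted to. The sketch below shows this with Scikit-learn's DummyRegressor on toy values (y_obs and X_unused are placeholders chosen for illustration).

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import r2_score

# Toy target values in $1000s (illustrative only)
y_obs = np.array([20.0, 25.0, 30.0, 35.0, 40.0])
X_unused = np.zeros((5, 1))  # DummyRegressor ignores the feature values

# Baseline that always predicts the mean of the targets it was fitted on
baseline = DummyRegressor(strategy="mean").fit(X_unused, y_obs)
baseline_pred = baseline.predict(X_unused)
baseline_r2 = r2_score(y_obs, baseline_pred)

print(baseline_pred)
print(baseline_r2)
```

Any R² above this baseline means the features are adding predictive value. On a held-out test set, a poorly fitting model can even score below 0.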

Step 8: Make Predictions

Now that the model is trained, you can use it to make predictions on new data. For example, let’s predict the price of a house with the following features:

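# Example input (replace with your own values), in the same column order as X
new_house = pd.DataFrame(
    [[0.02731, 0.0, 7.07, 0.0, 0.469, 6.421, 78.9, 4.9671, 2.0, 242.0, 17.8, 396.90, 9.14]],
    columns=X.columns,  # Reuse the training column names to avoid a feature-name warning
)

# Predict the price (MEDV is in $1000s, so multiply by 1000)
predicted_price = model.predict(new_house)
print(f"Predicted Price: ${predicted_price[0] * 1000:.2f}")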

Real-World Applications

Predicting house prices is just one example of how ML can be applied in the real world. Here are some other applications of regression models:

Stock Price Prediction: Predicting the future price of stocks based on historical data.
Sales Forecasting: Estimating future sales based on past trends and external factors.
Healthcare: Predicting patient outcomes based on medical data.

Conclusion

We just built our first ML project using the Boston Housing dataset. Here’s a quick recap of what we covered:

Loaded and explored the dataset.
Prepared the data for training.
Built and trained a Linear Regression model.
Evaluated the model’s performance.
Made predictions on new data.

This is just the beginning of your ML journey. As you continue learning, you can explore more advanced algorithms, work with larger datasets, and tackle real-world problems.

If you have any questions or want to share your results, feel free to leave a comment below.