Harini Mallawaarachchi

Regression

Supervised machine learning techniques involve training a model to operate on a set of features and predict a label, using a dataset that includes some already-known label values. The training process fits the features to the known labels to define a general function that can then be applied to new features, for which the labels are unknown, to predict them. You can think of this function like this, in which y represents the label we want to predict and x represents the features the model uses to predict it.

y=f(x)

In most cases, x is actually a vector that consists of multiple feature values, so to be a little more precise, the function could be expressed like this:

y=f([x1, x2, x3, ...])

The goal of training the model is to find a function that applies some kind of calculation to the x values to produce the result y. We do this by applying a machine learning algorithm that tries to fit the x values to a calculation that produces y reasonably accurately for all of the cases in the training dataset.
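For instance, here is a minimal sketch of what such a function could look like once training has found suitable values. The weights and bias below are made up purely for illustration and are not learned from any data:

# A hypothetical learned function f: a weighted sum of the features plus a bias.
# The weights and bias are illustrative only, not the result of real training.
weights = [2.5, -1.0, 0.7]
bias = 10.0

def f(x):
    # x is a feature vector such as [x1, x2, x3]
    return sum(w * xi for w, xi in zip(weights, x)) + bias

print(f([3.0, 1.5, 2.0]))   # predicted label for one observation: approximately 17.4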

There are lots of machine learning algorithms for supervised learning, and they can be broadly divided into two types:

  • Regression algorithms: Algorithms that predict a y value that is a numeric value, such as the price of a house or the number of sales transactions.

  • Classification algorithms: Algorithms that predict to which category, or class, an observation belongs. The y value in a classification model is a vector of probability values between 0 and 1, one for each class, indicating the probability of the observation belonging to each class.

Classification and Regression are two fundamental types of supervised learning tasks in machine learning. They differ in their objectives, output types, and the nature of the dependent variable they aim to predict.

Classification involves assigning input data to predefined categories, while regression aims to predict a continuous numerical value.


Classification:

  1. Objective: The primary goal of a classification task is to categorize input data into predefined classes or categories. The model learns to assign each input sample to one of the specified classes based on its features.

  2. Output Type: The output of a classification model is discrete and represents the class label or category to which the input belongs. The classes can be binary (two classes) or multi-class (more than two classes).

  3. Dependent Variable: In classification, the dependent variable is categorical, meaning it has distinct, non-numeric values representing the different classes.

  4. Example Use Cases:

    • Email spam detection: Classifying emails as either spam or non-spam.

    • Image classification: Identifying objects or animals in images from a predefined set of classes.

    • Disease diagnosis: Classifying patient health status as healthy or having a particular disease.
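To make the output type concrete: for a hypothetical three-class problem, a classifier's raw output for a single observation can be a vector of per-class probabilities, from which the discrete class label is taken. The numbers below are purely illustrative:

# Hypothetical class probabilities for one observation (illustrative values only)
probabilities = [0.7, 0.2, 0.1]                            # P(class 0), P(class 1), P(class 2)
predicted_class = probabilities.index(max(probabilities))  # index of the highest probability
print(predicted_class)                                     # 0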


Regression:

  1. Objective: The main objective of a regression task is to predict a continuous numerical value based on input features. The model learns the relationship between the input variables and the continuous target variable.

  2. Output Type: The output of a regression model is a continuous numerical value. It can be any real number within a specific range based on the problem domain.

  3. Dependent Variable: In regression, the dependent variable is continuous, meaning it can take on any numeric value.

  4. Example Use Cases:

    • House price prediction: Predicting the price of a house based on its features such as size, location, and number of bedrooms.

    • Stock market forecasting: Predicting the future price of a stock based on historical market data.

    • Temperature prediction: Forecasting the temperature for the next day based on historical weather data.


Here, we'll focus on regression, using an example based on a real study in which data for a bicycle sharing scheme was collected and used to predict the number of rentals based on seasonality and weather conditions. We'll use a simplified version of the dataset from that study.


Exercise:


The first step in any machine learning project involves exploring the data to understand relationships between attributes, detecting and fixing data issues, performing feature engineering, normalizing numeric features, and encoding categorical features.
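We'll load the data in a moment; once it's loaded, normalizing numeric features and encoding categorical features could look roughly like the sketch below. This is only an illustration (it isn't applied in the simplified walkthrough that follows), and it assumes the bike_data DataFrame and the column names used later in this exercise:

from sklearn.preprocessing import MinMaxScaler
import pandas as pd

# Sketch only: scale the numeric columns to a 0-1 range
scaler = MinMaxScaler()
numeric_cols = ['temp', 'atemp', 'hum', 'windspeed']
bike_data[numeric_cols] = scaler.fit_transform(bike_data[numeric_cols])

# Sketch only: one-hot encode a categorical column such as 'season'
season_dummies = pd.get_dummies(bike_data['season'], prefix='season')
bike_data = pd.concat([bike_data, season_dummies], axis=1)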


Let's start by loading the bicycle sharing data as a Pandas DataFrame and viewing the first few rows.


import pandas as pd

# load the training dataset
!wget https://raw.githubusercontent.com/MicrosoftDocs/mslearn-introduction-to-machine-learning/main/Data/ml-basics/daily-bike-share.csv
bike_data = pd.read_csv('daily-bike-share.csv')
bike_data.head()



In this dataset, rentals represents the label (the y value) our model must be trained to predict. The other columns are potential features (x values).


Let's add a new column named day to the dataframe by extracting the day component from the existing dteday column. The new column represents the day of the month from 1 to 31.


bike_data['day'] = pd.DatetimeIndex(bike_data['dteday']).day
bike_data.head(32)

Let's start our analysis of the data by examining a few key descriptive statistics. We can use the dataframe's describe method to generate these for the numeric features as well as the rentals label column.

numeric_features = ['temp', 'atemp', 'hum', 'windspeed']
bike_data[numeric_features + ['rentals']].describe()

The statistics of the dataset show information about the distribution of data in each numeric field, including 731 observations, mean, standard deviation, minimum and maximum values, and quartile values. The mean number of daily rentals is approximately 848, but the relatively large standard deviation indicates significant variance in the number of rentals per day.

To gain a clearer understanding of the rentals distribution, visualizations like histograms and box plots are helpful. Python's matplotlib library can be used to create these visualizations for the rentals column.


import pandas as pd
import matplotlib.pyplot as plt

# This ensures plots are displayed inline in the Jupyter notebook
%matplotlib inline

# Get the label column
label = bike_data['rentals']


# Create a figure for 2 subplots (2 rows, 1 column)
fig, ax = plt.subplots(2, 1, figsize = (9,12))

# Plot the histogram   
ax[0].hist(label, bins=100)
ax[0].set_ylabel('Frequency')

# Add dashed lines for the mean (magenta) and median (cyan)
ax[0].axvline(label.mean(), color='magenta', linestyle='dashed', linewidth=2)
ax[0].axvline(label.median(), color='cyan', linestyle='dashed', linewidth=2)

# Plot the boxplot   
ax[1].boxplot(label, vert=False)
ax[1].set_xlabel('Rentals')

# Add a title to the Figure
fig.suptitle('Rental Distribution')

# Show the figure
fig.show()


The plots show that the number of daily rentals ranges from 0 to just over 3,400. However, the mean (and median) number of daily rentals is closer to the low end of that range, with most of the data between 0 and around 2,200 rentals. The few values above this are shown in the box plot as small circles, indicating that they are outliers - in other words, unusually high or low values beyond the typical range of most of the data.
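If you want to quantify those outliers rather than just spot them in the box plot, a quick sketch using the conventional 1.5 × IQR rule (the same convention the box plot's whiskers use by default) looks like this:

# Flag rentals values above the conventional 1.5 * IQR upper bound
q1, q3 = label.quantile(0.25), label.quantile(0.75)
upper_bound = q3 + 1.5 * (q3 - q1)
outliers = label[label > upper_bound]
print('Upper bound:', upper_bound, '- outlier count:', len(outliers))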

We can do the same kind of visual exploration of the numeric features. Let's create a histogram for each of these.

# Plot a histogram for each numeric feature
for col in numeric_features:
    fig = plt.figure(figsize=(9, 6))
    ax = fig.gca()
    feature = bike_data[col]
    feature.hist(bins=100, ax = ax)
    ax.axvline(feature.mean(), color='magenta', linestyle='dashed', linewidth=2)
    ax.axvline(feature.median(), color='cyan', linestyle='dashed', linewidth=2)
    ax.set_title(col)
plt.show()


The numeric features show a distribution that is approximately normal, with the mean and median closer to the middle of the value range, coinciding with the most commonly occurring values.


Note: The distributions are not truly normal in the statistical sense, which would result in a smooth, symmetric "bell-curve" histogram with the mean and mode (the most common value) in the center; but they do generally indicate that most of the observations have a value somewhere near the middle.


Next, we'll explore the distribution of categorical features. Since these are discrete values, we can't use histograms. Instead, we can use bar charts to show the count of each discrete value within each category.

import numpy as np

# plot a bar plot for each categorical feature count
categorical_features = ['season','mnth','holiday','weekday','workingday','weathersit', 'day']

for col in categorical_features:
    counts = bike_data[col].value_counts().sort_index()
    fig = plt.figure(figsize=(9, 6))
    ax = fig.gca()
    counts.plot.bar(ax = ax, color='steelblue')
    ax.set_title(col + ' counts')
    ax.set_xlabel(col) 
    ax.set_ylabel("Frequency")
plt.show()

For the numeric features, we can create scatter plots that show the intersection of feature and label values. We can also calculate the correlation statistic to quantify the apparent relationship.

for col in numeric_features:
    fig = plt.figure(figsize=(9, 6))
    ax = fig.gca()
    feature = bike_data[col]
    label = bike_data['rentals']
    correlation = feature.corr(label)
    plt.scatter(x=feature, y=label)
    plt.xlabel(col)
    plt.ylabel('Bike Rentals')
    ax.set_title('rentals vs ' + col + ' - correlation: ' + str(correlation))
plt.show()

Now let's compare the categorical features to the label. We'll do this by creating box plots that show the distribution of rental counts for each category.


# plot a boxplot for the label by each categorical feature
for col in categorical_features:
    fig = plt.figure(figsize=(9, 6))
    ax = fig.gca()
    bike_data.boxplot(column = 'rentals', by = col, ax = ax)
    ax.set_title('Label by ' + col)
    ax.set_ylabel("Bike Rentals")
plt.show()



Train a Regression Model

Now that we've explored the data, it's time to use it to train a regression model that uses the features we've identified as potentially predictive to predict the rentals label. The first thing we need to do is to separate the features we want to use to train the model from the label we want it to predict.


# Separate features and labels
X, y = bike_data[['season','mnth', 'holiday','weekday','workingday','weathersit','temp', 'atemp', 'hum', 'windspeed']].values, bike_data['rentals'].values
print('Features:',X[:10], '\nLabels:', y[:10], sep='\n')

After separating the dataset, we could train a model using all of the data; but it's common practice in supervised learning to split the data into two subsets.

  1. A (typically larger) set with which to train the model

  2. A smaller "hold-back" set with which to validate the trained model


To randomly split the data, we'll use the train_test_split function in the scikit-learn library. This library is one of the most widely used machine learning packages for Python.


from sklearn.model_selection import train_test_split

# Split data 70%-30% into training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=0)

print ('Training Set: %d rows\nTest Set: %d rows' % (X_train.shape[0], X_test.shape[0]))

Now we're ready to train a model by fitting a suitable regression algorithm to the training data. We'll use a linear regression algorithm, a common starting point for regression, which works by trying to find a linear relationship between the X values and the y label. The resulting model is a function that conceptually defines the line that best fits the combinations of X and y values in the training data.


In Scikit-Learn, training algorithms are encapsulated in estimators, and in this case we'll use the LinearRegression estimator to train a linear regression model.

# Train the model
from sklearn.linear_model import LinearRegression

# Fit a linear regression model on the training set
model = LinearRegression().fit(X_train, y_train)
print (model)


Evaluate the Trained Model

Now that we've trained the model, we can use it to predict rental counts for the features we held back in our validation dataset. Then we can compare these predictions to the actual label values to evaluate how well (or not!) the model is working.

import numpy as np

predictions = model.predict(X_test)
np.set_printoptions(suppress=True)
print('Predicted labels: ', np.round(predictions)[:10])
print('Actual labels   : ' ,y_test[:10])


Let's see if we can get a better indication by visualizing a scatter plot that compares the predictions to the actual labels. We'll also overlay a trend line to get a general sense for how well the predicted labels align with the true labels.

import matplotlib.pyplot as plt

%matplotlib inline

plt.scatter(y_test, predictions)
plt.xlabel('Actual Labels')
plt.ylabel('Predicted Labels')
plt.title('Daily Bike Share Predictions')
# overlay the regression line
z = np.polyfit(y_test, predictions, 1)
p = np.poly1d(z)
plt.plot(y_test,p(y_test), color='magenta')
plt.show()

There's a definite diagonal trend, and the intersections of the predicted and actual values generally follow the path of the trend line, but there's a fair amount of difference between the ideal function represented by the line and the results. This variance represents the residuals of the model - in other words, the difference between the label the model predicts when it applies the coefficients learned during training to the validation data, and the actual value of the validation label. Evaluated on the validation data, these residuals indicate the expected level of error when the model is used with new data for which the label is unknown.
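To look at those residuals directly, a short sketch computes them as the difference between the actual and predicted values and plots their distribution:

# Calculate residuals (actual - predicted) for the validation set
residuals = y_test - predictions

# Plot the residual distribution; ideally it is centred around zero
plt.hist(residuals, bins=50)
plt.xlabel('Residual (actual - predicted rentals)')
plt.ylabel('Frequency')
plt.title('Residual Distribution')
plt.show()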

You can quantify the residuals by calculating a number of commonly used evaluation metrics. We'll focus on the following three:

  • Mean Square Error (MSE): The mean of the squared differences between predicted and actual values. This yields a relative metric in which the smaller the value, the better the fit of the model

  • Root Mean Square Error (RMSE): The square root of the MSE. This yields an absolute metric in the same unit as the label (in this case, numbers of rentals). The smaller the value, the better the model (in a simplistic sense, it represents the average number of rentals by which the predictions are wrong!)

  • Coefficient of Determination (usually known as R-squared or R2): A relative metric in which the higher the value, the better the fit of the model. In essence, this metric represents how much of the variance between predicted and actual label values the model is able to explain.
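Expressed as formulas, where yi is an actual label, ŷi is the corresponding prediction, ȳ is the mean of the actual labels, and n is the number of validation observations:

MSE = (1/n) Σ (yi − ŷi)²

RMSE = √MSE

R² = 1 − Σ (yi − ŷi)² / Σ (yi − ȳ)²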

from sklearn.metrics import mean_squared_error, r2_score

mse = mean_squared_error(y_test, predictions)
print("MSE:", mse)

rmse = np.sqrt(mse)
print("RMSE:", rmse)

r2 = r2_score(y_test, predictions)
print("R2:", r2)




Other regression algorithms


To improve performance, we could try other regression algorithms. They fall broadly into three families:

  • Linear algorithms: The simplest, including not only the Linear Regression algorithm we used above (which is technically an Ordinary Least Squares algorithm) but also regularized variants such as Lasso and Ridge.

  • Tree-based algorithms: Algorithms that build a decision tree to reach a prediction, making the prediction step by step based on factors such as season and day of the week.

  • Ensemble algorithms: Algorithms that combine the outputs of multiple base algorithms to improve generalizability. For example, Random Forest builds many decision trees and aggregates their predictions to handle more intricate data.


In practice, data scientists commonly try several algorithms and compare the results to choose the model that works best for the problem.
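As a quick sketch of how easily an alternative estimator can be swapped into the same workflow (here a random forest with default settings, purely for illustration):

from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score

# Fit a random forest regressor on the same training split (default settings, illustrative only)
rf_model = RandomForestRegressor(random_state=0).fit(X_train, y_train)
rf_predictions = rf_model.predict(X_test)

print('RMSE:', np.sqrt(mean_squared_error(y_test, rf_predictions)))
print('R2:', r2_score(y_test, rf_predictions))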


Here we've explored our data and fit a basic regression model. Regression models remain popular because they work well with relatively small datasets, are relatively robust, are easy to interpret, and come in many varieties.

