TP5 - Linear Regression

Course: EDA & Unsupervised Learning
M1-DAS
Lecturer: HAS Sothea, PhD

Objective: In this TP, you will learn how to implement SLR, MLR and enhance your practical skills in pushing the performance of your models on new unseen data using techniques introduced in the course.

The Jupyter Notebook for this Lab can be downloaded here: TP5_Linear_Regression.ipynb.

1. Importing `Abalone dataset`

You need internet to load the data by running the following codes. For more information about this data, read Abalone dataset.

# %pip install kagglehub   # if you have not installed "kagglehub" module yet
import kagglehub

# Download latest version
path = kagglehub.dataset_download("rodolfomendes/abalone-dataset")

# Import data
import pandas as pd
data = pd.read_csv(path + "/abalone.csv")
data.head()

Warning: Looks like you're using an outdated `kagglehub` version (installed: 0.3.6), please consider upgrading to the latest version (0.3.7).

	Sex	Length	Diameter	Height	Whole weight	Shucked weight	Viscera weight	Shell weight	Rings
0	M	0.455	0.365	0.095	0.5140	0.2245	0.1010	0.150	15
1	M	0.350	0.265	0.090	0.2255	0.0995	0.0485	0.070	7
2	F	0.530	0.420	0.135	0.6770	0.2565	0.1415	0.210	9
3	M	0.440	0.365	0.125	0.5160	0.2155	0.1140	0.155	10
4	I	0.330	0.255	0.080	0.2050	0.0895	0.0395	0.055	7

A. What’s the dimension of this dataset? How many quantitative and qualitative variables are there in this dataset?

print(f'dimension of the data: {data.shape}')
data.dtypes

dimension of the data: (4177, 9)

Sex                object
Length            float64
Diameter          float64
Height            float64
Whole weight      float64
Shucked weight    float64
Viscera weight    float64
Shell weight      float64
Rings               int64
dtype: object

print(f'Number of qualitative columns: {len(data.select_dtypes(exclude="number").columns)}')
print(f'Number of quantitative columns: {len(data.select_dtypes(include="number").columns)}')

Number of qualitative columns: 1
Number of quantitative columns: 8

B. Perform univariate analysis (compute statistical values and plot the distribution) on each variables of your dataset. The goal is to understand your data (the scale and how each column is distributed), detect missing values and outliers, etc.

Qualitative columns

data.select_dtypes(exclude='number').value_counts(normalize=True) * 100

Sex
M      36.581278
I      32.128322
F      31.290400
Name: proportion, dtype: float64

import seaborn as sns                 # For plotting
import matplotlib.pyplot as plt       # control panel for graphic
sns.set(style="whitegrid")
_, ax  = plt.subplots(1,1, figsize=(5,3))
sns.countplot(data, x="Sex", ax = ax)
ax.bar_label(ax.containers[0])
ax.set_title("Variable Sex")
plt.show()

Quantitative columns

data.select_dtypes(include='number').describe().drop(index=['count'])

	Length	Diameter	Height	Whole weight	Shucked weight	Viscera weight	Shell weight	Rings
mean	0.523992	0.407881	0.139516	0.828742	0.359367	0.180594	0.238831	9.933684
std	0.120093	0.099240	0.041827	0.490389	0.221963	0.109614	0.139203	3.224169
min	0.075000	0.055000	0.000000	0.002000	0.001000	0.000500	0.001500	1.000000
25%	0.450000	0.350000	0.115000	0.441500	0.186000	0.093500	0.130000	8.000000
50%	0.545000	0.425000	0.140000	0.799500	0.336000	0.171000	0.234000	9.000000
75%	0.615000	0.480000	0.165000	1.153000	0.502000	0.253000	0.329000	11.000000
max	0.815000	0.650000	1.130000	2.825500	1.488000	0.760000	1.005000	29.000000

We will remove abalone with 0 heights.

data.query('Height > 0', inplace=True)
data.describe()

	Length	Diameter	Height	Whole weight	Shucked weight	Viscera weight	Shell weight	Rings
count	4175.000000	4175.00000	4175.000000	4175.000000	4175.000000	4175.000000	4175.000000	4175.000000
mean	0.524065	0.40794	0.139583	0.829005	0.359476	0.180653	0.238834	9.935090
std	0.120069	0.09922	0.041725	0.490349	0.221954	0.109605	0.139212	3.224227
min	0.075000	0.05500	0.010000	0.002000	0.001000	0.000500	0.001500	1.000000
25%	0.450000	0.35000	0.115000	0.442250	0.186250	0.093500	0.130000	8.000000
50%	0.545000	0.42500	0.140000	0.800000	0.336000	0.171000	0.234000	9.000000
75%	0.615000	0.48000	0.165000	1.153500	0.502000	0.253000	0.328750	11.000000
max	0.815000	0.65000	1.130000	2.825500	1.488000	0.760000	1.005000	29.000000

fig, axs = plt.subplots(2, 4, figsize=(15,5))
for i, va in enumerate(data.select_dtypes(include="number").columns):
    sns.histplot(data, x=va, ax=axs[i//4, i%4], kde=True)
    axs[i//4, i%4].set_title(va)

plt.tight_layout()
plt.show()

Note: Most abalones are around 0.001 to 0.25 in height except for a few abalones that are much taller in height than others. THose can be considered outliers.

2. Correlation matrix, SLR and MLR

A. Compute the correlation matrix of this data using pd.corr() function. Describe what you observed from this correlation matrix.

data.select_dtypes(include='number').corr()

	Length	Diameter	Height	Whole weight	Shucked weight	Viscera weight	Shell weight	Rings
Length	1.000000	0.986802	0.828108	0.925217	0.897859	0.902960	0.898419	0.556464
Diameter	0.986802	1.000000	0.834298	0.925414	0.893108	0.899672	0.906084	0.574418
Height	0.828108	0.834298	1.000000	0.819886	0.775621	0.798908	0.819596	0.557625
Whole weight	0.925217	0.925414	0.819886	1.000000	0.969389	0.966354	0.955924	0.540151
Shucked weight	0.897859	0.893108	0.775621	0.969389	1.000000	0.931924	0.883129	0.420597
Viscera weight	0.902960	0.899672	0.798908	0.966354	0.931924	1.000000	0.908186	0.503562
Shell weight	0.898419	0.906084	0.819596	0.955924	0.883129	0.908186	1.000000	0.627928
Rings	0.556464	0.574418	0.557625	0.540151	0.420597	0.503562	0.627928	1.000000

data.select_dtypes(include='number').corr(method="spearman")

	Length	Diameter	Height	Whole weight	Shucked weight	Viscera weight	Shell weight	Rings
Length	1.000000	0.983299	0.888124	0.972600	0.956790	0.952600	0.948662	0.603954
Diameter	0.983299	1.000000	0.895651	0.971290	0.950422	0.948328	0.954896	0.622487
Height	0.888124	0.895651	1.000000	0.915963	0.874163	0.900535	0.922130	0.657403
Whole weight	0.972600	0.971290	0.915963	1.000000	0.977038	0.975222	0.970197	0.630440
Shucked weight	0.956790	0.950422	0.874163	0.977038	1.000000	0.947581	0.918465	0.538956
Viscera weight	0.952600	0.948328	0.900535	0.975222	0.947581	1.000000	0.938893	0.613930
Shell weight	0.948662	0.954896	0.922130	0.970197	0.918465	0.938893	1.000000	0.692991
Rings	0.603954	0.622487	0.657403	0.630440	0.538956	0.613930	0.692991	1.000000

B. Draw the pairs of scatterplots of the target Rings vs the all quantitative inputs.

Comment your findings and handle the problems if there is any in the graph.

sns.pairplot(data=data, hue='Sex')

There seems to be outliers in Height, therefore we should remove them.

data.query("Height < 0.5", inplace=True)
sns.pairplot(data, hue="Sex")

data.select_dtypes(include="number").corr()

	Length	Diameter	Height	Whole weight	Shucked weight	Viscera weight	Shell weight	Rings
Length	1.000000	0.986794	0.900868	0.925328	0.898129	0.903033	0.898363	0.556572
Diameter	0.986794	1.000000	0.907187	0.925499	0.893330	0.899716	0.906026	0.574551
Height	0.900868	0.907187	1.000000	0.888850	0.837485	0.866757	0.891857	0.610107
Whole weight	0.925328	0.925499	0.888850	1.000000	0.969370	0.966290	0.955954	0.540621
Shucked weight	0.898129	0.893330	0.837485	0.969370	1.000000	0.931831	0.883194	0.421156
Viscera weight	0.903033	0.899716	0.866757	0.966290	0.931831	1.000000	0.908133	0.503977
Shell weight	0.898363	0.906026	0.891857	0.955954	0.883194	0.908133	1.000000	0.628169
Rings	0.556572	0.574551	0.610107	0.540621	0.421156	0.503977	0.628169	1.000000

Here, I split the data into two parts. You are allowed to use only the training part for building models. The testing part will be used to evaluate the models’ performance.

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
                                        data.iloc[:,:8], 
                                        data['Rings'], 
                                        test_size=0.2, 
                                        random_state=42)
X_train = pd.concat([pd.get_dummies(X_train[['Sex']], drop_first=True), X_train.iloc[:,1:]], axis=1)
X_test = pd.concat([pd.get_dummies(X_test[['Sex']], drop_first=True), X_test.iloc[:,1:]], axis=1)

C. Fit a SLR model using the most correlated input to predict the Rings of abalone.

Draw the fitted line on the previous scatterplot.
Compute \(R^2\) and explain the observed value.
Compute Mean Suared Error on the test data (Test MSE).
Analyze the residuals and conclude.

from sklearn.linear_model import LinearRegression as LR
slr = LR()
slr = slr.fit(X_train[['Shell weight']], y_train)
y_pred = slr.predict(X_test[['Shell weight']])

ax = sns.scatterplot(data, x="Shell weight", y = "Rings")
plt.plot(X_test['Shell weight'], y_pred, c='r')
ax.set_title("Fitted line")
plt.show()

from sklearn.metrics import mean_squared_error, r2_score
df_mse = pd.DataFrame(
    {'SLR': [
        mean_squared_error(y_test, y_pred), 
        r2_score(y_test, y_pred), 
        'NA']},
     index=['Test MSE', 'R-square', 'Adj-R-Squared'])
df_mse

	SLR
Test MSE	6.960663
R-square	0.419115
Adj-R-Squared	NA

y_pred_tr = slr.predict(X_train[['Shell weight']])
res0 = y_train - y_pred_tr

df_slr = pd.DataFrame(
    {
        'Fitted': y_pred_tr,
        'Residual': res0,
        'Actual': y_train
    }
)

_, axs = plt.subplots(1, 3, figsize=(12, 4))
sns.scatterplot(df_slr, x="Actual", y="Fitted", ax=axs[0])
axs[0].plot(df_slr['Actual'], df_slr['Actual'], 'r')
axs[0].set_title("Actual vs Prediction")

sns.scatterplot(df_slr, x="Fitted", y="Residual", ax=axs[1])
axs[1].set_title("Residual vs Prediction")
axs[1].plot(df_slr['Actual'], [0]*len(df_slr['Actual']), 'r')

sns.histplot(df_slr, x="Residual", ax=axs[2], kde=True)
axs[2].set_title("Residual distribution")
plt.show()

D. Fit MLR using all inputs. Compute:

Compute \(R^2\), adjusted \(R_{\text{adj}}^2\) and explain the observed value.
Compute Mean Suared Error on the test data (Test MSE).
Analyze the residuals and conclude.

mlr = LR().fit(X_train, y_train)
y_pred_mlr = mlr.predict(X_test)

def adjusted_r2(y1, y2, d):
    return 1-(len(y1)-1)/(len(y1)-d-1)*(1-r2_score(y1,y2))

print(f'R-squared: {r2_score(y_test, y_pred_mlr)}')
print(f'R-squared: {adjusted_r2(y_test, y_pred_mlr, X_test.shape[1])}')
print(f'Test MSE: {mean_squared_error(y_test, y_pred_mlr)}')

df_mse = pd.concat([df_mse, pd.DataFrame(
    {'MLR': [
        mean_squared_error(y_test, y_pred_mlr), 
        r2_score(y_test, y_pred_mlr), 
        adjusted_r2(y_test, y_pred_mlr, X_test.shape[1])]},
     index=['Test MSE', 'R-square', 'Adj-R-Squared'])], axis=1)
df_mse

R-squared: 0.5852978270364873
R-squared: 0.5807738033314307
Test MSE: 4.969320719101582

	SLR	MLR
Test MSE	6.960663	4.969321
R-square	0.419115	0.585298
Adj-R-Squared	NA	0.580774

y_pred_tr = mlr.predict(X_train)
res1 = y_train - y_pred_tr

df_mlr = pd.DataFrame(
    {
        'Fitted': y_pred_tr,
        'Residual': res1,
        'Actual': y_train
    }
)

_, axs = plt.subplots(1, 3, figsize=(12, 4))
sns.scatterplot(df_mlr, x="Actual", y="Fitted", ax=axs[0])
axs[0].plot(df_mlr['Actual'], df_mlr['Actual'], 'r')
axs[0].set_title("Actual vs Prediction")

sns.scatterplot(df_mlr, x="Fitted", y="Residual", ax=axs[1])
axs[1].set_title("Residual vs Prediction")
axs[1].plot(df_mlr['Actual'], [0]*len(df_mlr['Actual']), 'r')

sns.histplot(df_mlr, x="Residual", ax=axs[2], kde=True)
axs[2].set_title("Residual distribution")
plt.show()

3. Polynomial Regression and Regularized Linear Models

A. Build polynomial regression with different degree \(n\in\{2,3,...,10\}\) of the best correlated input to predict Rings. Compute Test MSE for each case.

Perform \(10\)-fold Cross-validation to select the best degree \(n\). Hint: see slide 47.

from sklearn.preprocessing import PolynomialFeatures

degrees = list(range(2, 11))
preds = []
preds_test = []
poly_dim = []
mse = []
for deg in degrees:
    poly = PolynomialFeatures(degree=deg)
    X_poly = poly.fit_transform(X_train[['Shell weight']])
    poly_dim.append(X_poly.shape[1])
    model = LR().fit(X_poly, y_train)
    preds.append(model.predict(X_poly))
    X_poly_test = poly.transform(X_test[['Shell weight']])
    y_test_pred = model.predict(X_poly_test)
    preds_test.append(y_test_pred)
    mse.append(mean_squared_error(y_test, y_test_pred))

df_poly = pd.DataFrame(
    {
        'Degree' : degrees,
        'Test MSE': mse
    }
)

ax = sns.lineplot(df_poly, x="Degree", y="Test MSE")
ax.set_title("Test MSE vs Degree")
plt.show()

import numpy as np
ax = sns.scatterplot(x=X_train['Shell weight'], y=y_train)
ax.set_title('Fitted polynomial regression')
cols = sns.color_palette()
id_sort = X_train['Shell weight'].argsort()
for i, pred in enumerate(preds):
    ax.plot(X_train['Shell weight'].iloc[id_sort], pred[id_sort], c=cols[i], label=f"Poly {i+2}")
    df_mse = pd.concat(
        [df_mse, pd.DataFrame(
            {'Poly' + str(i+2): [mean_squared_error(y_test, preds_test[i]), 
                     r2_score(y_test, preds_test[i]), 
                     adjusted_r2(y_test, preds_test[i], poly_dim[i])]},
     index=['Test MSE', 'R-square', 'Adj-R-Squared'])], axis=1)
    ax.legend()

df_mse

	SLR	MLR	Poly2	Poly3	Poly4	Poly5	Poly6	Poly7	Poly8	Poly9	Poly10
Test MSE	6.960663	4.969321	6.753474	6.595747	6.558894	6.581101	6.599169	6.599699	6.579606	6.573815	6.595559
R-square	0.419115	0.585298	0.436406	0.449569	0.452644	0.450791	0.449283	0.449239	0.450915	0.451399	0.449584
Adj-R-Squared	NA	0.580774	0.434371	0.446916	0.449343	0.446811	0.444621	0.443904	0.444925	0.444741	0.442228

Cross-validation

from sklearn.model_selection import cross_val_score
# List to store all losses
loss = [] 
for deg in degrees:
    pf = PolynomialFeatures(degree=deg)
    X_poly = pf.fit_transform(X_train[['Shell weight']])
    model = LR()
    score = -cross_val_score(model, X_poly, y_train, cv=10, 
                scoring='neg_mean_squared_error').mean()
    loss.append(score)

df_loss_cv = pd.DataFrame(
    {
        'CV-MSE': loss,
        'Degree': degrees
    },
    index=['Poly' +str(i) for i in degrees]
)

best_deg = degrees[np.argmin(loss)]
ax = sns.lineplot(df_loss_cv, x="Degree", y="CV-MSE", markers=True)
ax.vlines(x=[best_deg], ymin=np.min(loss), ymax=np.max(loss), color='r')
ax.set_yscale("log")
ax.set_title("CV-MSE vs Degree")
plt.show()

B. Perform \(10\)-fold Cross-validation to tune the best penalization strength \(\alpha\) of Ridge regression model for predicting Rings. Hint: see slide 51.

Compute Test MSE and compare it to the previous models.

from sklearn.linear_model import Ridge
# List of all degrees to search over
alphas = list(np.linspace(0.00001, 3, 50)) + list(np.linspace(3.1, 100, 50))
# List to store all losses
loss = []
test_mse = []
test_r2 = []
test_r2_squared = []
coefficients = {f'alpha={alpha}': [] for alpha in alphas}
for alp in alphas:
    model = Ridge(alpha=alp)
    mse = -cross_val_score(model, X_train, y_train, cv=10, 
                scoring='neg_mean_squared_error').mean()
    loss.append(mse)
    
    # Fit
    model.fit(X_train, y_train)
    coefficients[f'alpha={alp}'] = model.coef_
    y_hat_ridge = model.predict(X_test)
    test_mse.append(mean_squared_error(y_test, y_hat_ridge))
    test_r2.append(r2_score(y_test, y_hat_ridge))
    test_r2_squared.append(adjusted_r2(y_test, y_hat_ridge, X_train.shape[1]))

fig, axes = plt.subplots(1, 3, figsize=(12,4))
pd_coef = pd.DataFrame(coefficients, index=X_train.columns)
# Test error
axes[0].plot(alphas, test_mse, label="Test MSE")
axes[0].set_title("Test MSE at each penalty strength alpha")
axes[0].legend()
axes[0].set_xlabel("alpha")
axes[0].set_xscale('log')
axes[0].set_ylabel("Test MSE")

id_min = np.argmin(loss)
print(f'Optimal regularization strength alpha for Ridge regression: {alphas[id_min]}')
axes[1].plot(alphas, loss, label="CV MSE")
axes[1].vlines(x=alphas[id_min], ymin=np.min(loss), ymax=np.max(loss), label="optimal alpha", color="red", linestyle="--")
axes[1].set_title("CV MSE at each penalty strength alpha")
axes[1].legend()
axes[1].set_xscale('log')
axes[1].set_xlabel("alpha")
axes[1].set_ylabel("CV MSE")

axes[2].plot(alphas, pd_coef.transpose(), label=pd_coef.index)
axes[2].vlines(x=alphas[id_min], ymin=np.min(pd_coef), ymax=np.max(pd_coef), label="optimal alpha", color="red", linestyle="--")
axes[2].set_title("Coefs at each penalty strength alpha")
axes[2].legend()
axes[2].set_xscale('log')
axes[2].set_xlabel("alpha")
axes[2].set_ylabel("Coef")
plt.show()

Optimal regularization strength alpha for Ridge regression: 0.06123428571428571

The first graph is the performance of Ridge regression at each value of on the testing data. On the other hand, the second graph illustrates the performance of Ridge regression on cross-validation subsets. The last graph shows the values of the coefficients of the model at each value of penalty parameter \(\alpha\).

We can see that optimal strength \(\alpha\approx 0.0612\) by cross-validation method. The average CV MSE at this level of penalty strength is given below.

df_mse['Ridge'] = [test_mse[id_min], test_r2[id_min], test_r2_squared[id_min]]
df_mse

	SLR	MLR	Poly2	Poly3	Poly4	Poly5	Poly6	Poly7	Poly8	Poly9	Poly10	Ridge
Test MSE	6.960663	4.969321	6.753474	6.595747	6.558894	6.581101	6.599169	6.599699	6.579606	6.573815	6.595559	4.962656
R-square	0.419115	0.585298	0.436406	0.449569	0.452644	0.450791	0.449283	0.449239	0.450915	0.451399	0.449584	0.585854
Adj-R-Squared	NA	0.580774	0.434371	0.446916	0.449343	0.446811	0.444621	0.443904	0.444925	0.444741	0.442228	0.581336

C. Perform \(10\)-fold Cross-validation to tune the best penalization strength \(\alpha\) of Lasso regression model for predicting Rings. Hint: see slide 51.

Compute Test MSE and compare it to the previous models.
How many inputs are retained by Lasso?

from sklearn.linear_model import Lasso
# List of all degrees to search over
alphas = list(np.linspace(0.001, 3, 100))
# List to store all losses
loss = []
test_mse = []
test_r2 = []
test_r2_squared = []
coefficients = {f'alpha={alpha}': [] for alpha in alphas}
for alp in alphas:
    model = Lasso(alpha=alp)
    mse = -cross_val_score(model, X_train, y_train, cv=10, 
                scoring='neg_mean_squared_error').mean()
    loss.append(mse)

    # Fit
    model.fit(X_train, y_train)
    coefficients[f'alpha={alp}'] = model.coef_
    y_hat_lasso = model.predict(X_test)
    test_mse.append(mean_squared_error(y_test, y_hat_lasso))
    test_r2.append(r2_score(y_test, y_hat_lasso))
    test_r2_squared.append(adjusted_r2(y_test, y_hat_lasso, X_train.shape[1]))

fig, axes = plt.subplots(1, 3, figsize=(12,4))
pd_coef = pd.DataFrame(coefficients, index=X_train.columns)
# Test error
axes[0].plot(alphas, test_mse, label="Test MSE")
axes[0].set_title("Test MSE at each penalty strength alpha")
axes[0].legend()
axes[0].set_xlabel("alpha")
axes[0].set_xscale('log')
axes[0].set_ylabel("Test MSE")

id_min = np.argmin(loss)
print(f'Optimal regularization strength alpha for LASSO regression: {alphas[id_min]}')
axes[1].plot(alphas, loss, label="CV MSE")
axes[1].vlines(x=alphas[id_min], ymin=np.min(loss), ymax=np.max(loss), label="optimal alpha", color="red", linestyle="--")
axes[1].set_title("CV MSE at each penalty strength alpha")
axes[1].legend()
axes[1].set_xscale('log')
axes[1].set_xlabel("alpha")
axes[1].set_ylabel("CV MSE")

axes[2].plot(alphas, pd_coef.transpose(), label=pd_coef.index)
axes[2].vlines(x=alphas[id_min], ymin=np.min(pd_coef), ymax=np.max(pd_coef), label="optimal alpha", color="red", linestyle="--")
axes[2].set_title("Coefs at each penalty strength alpha")
axes[2].legend(loc='upper right')
axes[2].set_xscale('log')
axes[2].set_xlabel("alpha")
axes[2].set_ylabel("Coef")
plt.show()

Optimal regularization strength alpha for LASSO regression: 0.001

df_mse['LASSO'] = [test_mse[id_min], test_r2[id_min], test_r2_squared[id_min]]
df_mse

	SLR	MLR	Poly2	Poly3	Poly4	Poly5	Poly6	Poly7	Poly8	Poly9	Poly10	Ridge	LASSO
Test MSE	6.960663	4.969321	6.753474	6.595747	6.558894	6.581101	6.599169	6.599699	6.579606	6.573815	6.595559	4.962656	4.955233
R-square	0.419115	0.585298	0.436406	0.449569	0.452644	0.450791	0.449283	0.449239	0.450915	0.451399	0.449584	0.585854	0.586473
Adj-R-Squared	NA	0.580774	0.434371	0.446916	0.449343	0.446811	0.444621	0.443904	0.444925	0.444741	0.442228	0.581336	0.581962

4. Auto-MPG Dataset

Apply what you have learn to predict fuel efficiency (Miles Per Gallon, MPG) in vehicles of the kaggle AUTO-MPG Dataset.

import kagglehub

# Download latest version
path = kagglehub.dataset_download("uciml/autompg-dataset")

data = pd.read_csv(path + '/auto-mpg.csv')
data.head()

	mpg	cylinders	displacement	horsepower	weight	acceleration	model year	origin	car name
0	18.0	8	307.0	130	3504	12.0	70	1	chevrolet chevelle malibu
1	15.0	8	350.0	165	3693	11.5	70	1	buick skylark 320
2	18.0	8	318.0	150	3436	11.0	70	1	plymouth satellite
3	16.0	8	304.0	150	3433	12.0	70	1	amc rebel sst
4	17.0	8	302.0	140	3449	10.5	70	1	ford torino

You should try to work on this problem by yourself.

1. Importing Abalone dataset

2. Correlation matrix, SLR and MLR

3. Polynomial Regression and Regularized Linear Models

4. Auto-MPG Dataset

Further readings

1. Importing `Abalone dataset`