
# Machine Learning Lab – All 8 Experiments (Exam Ready)

---

## EXPERIMENT 1(A): Data Preprocessing

### Aim

To perform data preprocessing on regression and classification datasets by handling missing values, scaling features, and splitting the dataset.

### Objectives

* Understand dataset structure
* Handle missing values using imputation
* Perform feature scaling
* Split dataset into training and testing sets

### Theory

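Data preprocessing cleans and transforms raw data into a form a model can use. It includes handling missing values, feature scaling, and dataset splitting. Imputation and scaling statistics should be learned from the training set only and then applied to the test set; otherwise information from the test data leaks into training and the measured accuracy is optimistic. Done properly, preprocessing helps models train reliably and gives an honest estimate of performance on unseen data.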
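To make the order of operations concrete, here is a minimal pure-Python sketch of mean imputation followed by standardization on a single feature. The values and the names `train` and `test` are made up for illustration; the point is only that the mean and standard deviation come from the training split and are reused on the test split.

```python
from statistics import mean, pstdev

# Hypothetical single-feature data; None marks a missing value.
train = [2.0, None, 4.0, 6.0]
test = [None, 8.0]

# Learn the imputation value from the training split only.
train_mean = mean(v for v in train if v is not None)

# Fill missing values in both splits with the training mean.
train_filled = [train_mean if v is None else v for v in train]
test_filled = [train_mean if v is None else v for v in test]

# Learn the scaling statistics from the imputed training split only.
mu = mean(train_filled)
sigma = pstdev(train_filled)  # population standard deviation

train_scaled = [(v - mu) / sigma for v in train_filled]
test_scaled = [(v - mu) / sigma for v in test_filled]

print(train_scaled)
print(test_scaled)
```

The test values are transformed with `mu` and `sigma` from the training data, so they need not have mean 0 themselves.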

### Code

```python
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.datasets import fetch_openml

np.random.seed(42)

# Boston Dataset
housing = fetch_openml(name='boston', version=1, as_frame=True)
df_boston = housing.frame

X_b = df_boston.drop('MEDV', axis=1)
y_b = df_boston['MEDV']

X_train_b, X_test_b, y_train_b, y_test_b = train_test_split(X_b, y_b, test_size=0.25, random_state=42)

scaler = StandardScaler()
X_train_b = scaler.fit_transform(X_train_b)
X_test_b = scaler.transform(X_test_b)

# Pima Dataset
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
names = ['preg','plas','pres','skin','test','mass','pedi','age','class']
df = pd.read_csv(url, names=names)

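# A value of 0 in these columns is not physically plausible, so treat it as missing
cols = ['plas','pres','skin','test','mass']
df[cols] = df[cols].replace(0, np.nan)

X = df.drop('class', axis=1)
y = df['class']

# Split first so the imputer and scaler learn statistics from the training set only
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Replace missing values with the training-set column means
imputer = SimpleImputer(strategy='mean')
X_train = imputer.fit_transform(X_train)
X_test = imputer.transform(X_test)

# Standardize using the training-set mean and standard deviation
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)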
```

---

## EXPERIMENT 1(B): Feature Encoding

### Aim

To apply categorical feature encoding techniques for machine learning.

### Objectives

* Identify categorical features
* Apply Label Encoding
* Apply One-Hot Encoding

### Theory

Most machine learning models require numerical input, so categorical variables must be converted into numerical form. Label encoding assigns an integer to each category, while one-hot encoding creates one binary column per category. Label encoding implies an order (for example, 0 < 1 < 2), so it suits ordinal features and target labels; one-hot encoding avoids imposing a false order on nominal features.
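The difference between the two encodings can be seen on a tiny made-up column. This is a hand-written sketch, not the scikit-learn implementation; the `colors` list is hypothetical, and the categories are sorted alphabetically, which is also how `LabelEncoder` orders its classes.

```python
# Hypothetical categorical column
colors = ["red", "green", "blue", "green"]

# Sorted unique categories: ['blue', 'green', 'red']
categories = sorted(set(colors))

# Label encoding: one integer per category
label_map = {cat: idx for idx, cat in enumerate(categories)}
labels = [label_map[c] for c in colors]

# One-hot encoding: one binary column per category
one_hot = [[int(c == cat) for cat in categories] for c in colors]

print(labels)   # [2, 1, 0, 1]
print(one_hot)  # [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 0]]
```

With label encoding a model could read "red" (2) as greater than "blue" (0); the one-hot rows carry no such order.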

### Code

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
from sklearn.compose import ColumnTransformer
from ucimlrepo import fetch_ucirepo

adult = fetch_ucirepo(id=2)

X = adult.data.features.copy()
y = adult.data.targets.squeeze()
y = y.str.replace('.', '', regex=False)

le = LabelEncoder()
y = le.fit_transform(y)

categorical_features = X.select_dtypes(include=['object']).columns.tolist()
X[categorical_features] = X[categorical_features].fillna('Missing')

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

ohe = ColumnTransformer([
    ('onehot', OneHotEncoder(handle_unknown='ignore'), categorical_features)
], remainder='passthrough')

X_train_ohe = ohe.fit_transform(X_train)
X_test_ohe = ohe.transform(X_test)
```

---

## EXPERIMENT 2(A): Dataset Splitting & Cross Validation

### Aim

To evaluate model performance using hold-out and K-Fold cross validation.

### Objectives

* Split dataset into training and testing sets
* Train classification model
* Evaluate performance using cross-validation

### Theory

Dataset splitting ensures that models are tested on unseen data. The hold-out method divides the data once into a training set and a testing set. K-Fold cross validation divides the data into K folds, trains on K-1 folds, and tests on the remaining fold, repeating until each fold has served as the test set once. Averaging the K scores gives an estimate that depends less on any single split.
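How the folds are formed can be sketched without any library. The helper `k_fold_indices` below is a hypothetical name written for this illustration; it uses contiguous folds with no shuffling, whereas `KFold(shuffle=True)` would permute the indices first.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs for k contiguous folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most 1
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size

folds = list(k_fold_indices(10, 5))
for train_idx, test_idx in folds:
    print("train:", train_idx, "test:", test_idx)
```

Every sample appears in exactly one test fold, so each one is used for testing once and for training K-1 times.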

### Code

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, cross_val_score, KFold
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

data = load_breast_cancer()
X, y = data.data, data.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

model = LogisticRegression(max_iter=5000)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)

print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))

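# 5-fold cross validation: each fold is used once as the test set
kfold = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=kfold)
print("Fold accuracies:", scores)
print("Mean CV accuracy:", scores.mean())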
```

---

## EXPERIMENT 2(B): Feature Scaling

### Aim

To apply normalization and standardization techniques.

### Objectives

* Apply Min-Max scaling
* Apply Standardization
* Compare feature ranges

### Theory

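Feature scaling puts features on comparable scales so that features with large numeric ranges do not dominate distance-based or gradient-based models. Min-Max normalization rescales each feature to the range 0 to 1, while standardization transforms each feature to have mean 0 and standard deviation 1. Scaling typically speeds up the convergence of gradient-based optimizers and often improves the performance of scale-sensitive models such as KNN and SVM.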
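Both formulas are short enough to compute by hand. The sketch below applies them to a made-up list `values`; it uses the population standard deviation, which is also what `StandardScaler` uses.

```python
from statistics import mean, pstdev

# Hypothetical feature values
values = [10.0, 20.0, 30.0, 40.0, 50.0]

# Min-Max normalization: (x - min) / (max - min)
lo, hi = min(values), max(values)
normalized = [(v - lo) / (hi - lo) for v in values]

# Standardization: (x - mean) / standard deviation
mu, sigma = mean(values), pstdev(values)
standardized = [(v - mu) / sigma for v in values]

print(normalized)    # [0.0, 0.25, 0.5, 0.75, 1.0]
print(standardized)  # symmetric around 0
```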

### Code

```python
from sklearn.datasets import load_wine
from sklearn.preprocessing import MinMaxScaler, StandardScaler

wine = load_wine()
X = wine.data

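# Normalization: rescale each feature to [0, 1]
minmax = MinMaxScaler()
X_norm = minmax.fit_transform(X)

# Standardization: mean 0 and standard deviation 1 per feature
std = StandardScaler()
X_std = std.fit_transform(X)

# Compare the range of the first feature before and after scaling
print("Original:    ", X[:, 0].min(), X[:, 0].max())
print("Min-Max:     ", X_norm[:, 0].min(), X_norm[:, 0].max())
print("Standardized:", X_std[:, 0].min(), X_std[:, 0].max())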
```

---

## EXPERIMENT 3: Linear Regression (Gradient Descent)

### Aim

To implement linear regression using gradient descent.

### Objectives

* Understand gradient descent
* Train regression model
* Predict values

### Theory

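Linear regression models the relationship between input and output variables as a straight line (or a hyperplane for several inputs). Gradient descent is an optimization algorithm that minimizes the mean squared error by repeatedly moving the parameters a small step, set by the learning rate, in the direction opposite to the gradient of the error.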
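The update rule can be traced on a toy problem. This sketch fits `y = w*x + b` to made-up points that lie exactly on the line y = 2x + 1; the learning rate and iteration count are arbitrary choices for the illustration, not recommended defaults.

```python
# Hypothetical data on the line y = 2x + 1
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

w, b = 0.0, 0.0   # initial parameters
lr = 0.05         # learning rate (step size)
n = len(xs)

for _ in range(5000):
    # Prediction error for every sample
    errors = [(w * x + b) - y for x, y in zip(xs, ys)]
    # Gradients of the cost (1 / 2n) * sum(error^2) with respect to w and b
    grad_w = sum(e * x for e, x in zip(errors, xs)) / n
    grad_b = sum(errors) / n
    # Step against the gradient
    w -= lr * grad_w
    b -= lr * grad_b

print(round(w, 4), round(b, 4))  # approximately 2.0 and 1.0
```

A learning rate that is too large makes the updates overshoot and diverge; one that is too small needs many more iterations.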

### Code

```python
import numpy as np
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

california = fetch_california_housing()
X = california.data[:, 0:1]
y = california.target

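# Split first, then fit the scaler on the training set only to avoid data leakage
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)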

class GD:
    def __init__(self, lr=0.01, iters=1000):
        self.lr = lr
        self.iters = iters

    def fit(self, X, y):
        n = X.shape[0]
        X_b = np.c_[np.ones((n,1)), X]
        self.theta = np.random.randn(2,1)
        y = y.reshape(-1,1)

        for _ in range(self.iters):
            grad = (1/n) * X_b.T @ (X_b @ self.theta - y)
            self.theta -= self.lr * grad

    def predict(self, X):
        X_b = np.c_[np.ones((X.shape[0],1)), X]
        return X_b @ self.theta

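np.random.seed(42)  # make the random initial theta reproducible
model = GD()
model.fit(X_train, y_train)

# Predict on the test set and report the learned parameters and test error
y_pred = model.predict(X_test)
print("theta:", model.theta.ravel())
print("Test MSE:", np.mean((y_pred.ravel() - y_test) ** 2))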
```

---

## EXPERIMENT 4: Regression Evaluation

### Aim

To evaluate a regression model using MSE, MAE, and R2 score.

### Objectives

* Train regression model
* Calculate evaluation metrics

### Theory

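Evaluation metrics measure model performance. MSE is the average squared error and penalizes large errors heavily, MAE is the average absolute error, and the R2 score is the proportion of variance in the target explained by the model (1 is a perfect fit; 0 matches always predicting the mean).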
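The three metrics can be checked by hand on a few made-up numbers. The lists `y_true` and `y_pred` below are hypothetical; the formulas are the standard definitions.

```python
# Hypothetical true values and predictions
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.0, 5.0, 8.0, 11.0]
n = len(y_true)

# Mean squared error: average of squared differences
mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n

# Mean absolute error: average of absolute differences
mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n

# R2 = 1 - (residual sum of squares / total sum of squares)
mean_true = sum(y_true) / n
ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
ss_tot = sum((t - mean_true) ** 2 for t in y_true)
r2 = 1 - ss_res / ss_tot

print(mse, mae, round(r2, 4))  # 1.5 1.0 0.7
```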

### Code

```python
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.linear_model import LinearRegression
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split

data = fetch_california_housing()
X, y = data.data, data.target

# Fix random_state so the split, and therefore the metrics, are reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

model = LinearRegression()
model.fit(X_train, y_train)

y_pred = model.predict(X_test)

print("MSE:", mean_squared_error(y_test, y_pred))
print("MAE:", mean_absolute_error(y_test, y_pred))
print("R2 :", r2_score(y_test, y_pred))
```

---

## EXPERIMENT 5: KNN Classification

### Aim

To implement K-Nearest Neighbors classification.

### Objectives

* Train KNN model
* Predict class labels
* Evaluate accuracy

### Theory

KNN is a supervised learning algorithm that classifies a point based on its K nearest neighbors in the training data. It computes the distance (usually Euclidean) from the query point to every training point and assigns the class by majority vote among the K closest.
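The distance-then-vote procedure fits in a few lines. This is a from-scratch sketch on made-up 2D points, with `knn_predict` as a hypothetical helper name; it is not how `KNeighborsClassifier` is implemented internally.

```python
from collections import Counter
from math import dist

# Hypothetical labelled training points: (features, label)
train = [
    ((1.0, 1.0), "A"),
    ((1.5, 2.0), "A"),
    ((2.0, 1.0), "A"),
    ((6.0, 6.0), "B"),
    ((7.0, 7.0), "B"),
]

def knn_predict(query, k=3):
    """Return the majority label among the k training points nearest to query."""
    # Sort training points by Euclidean distance to the query and keep the k closest
    neighbors = sorted(train, key=lambda item: dist(item[0], query))[:k]
    # Majority vote over the neighbor labels
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

print(knn_predict((1.2, 1.3)))  # A
print(knn_predict((6.5, 6.5)))  # B
```

Because the vote depends on raw distances, features with large numeric ranges can dominate unless the data is scaled first.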

### Code

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

data = load_iris()
X = pd.DataFrame(data.data)
y = pd.Series(data.target)

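# random_state makes the split reproducible; stratify keeps class proportions equal in both sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)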

model = KNeighborsClassifier(n_neighbors=5)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
print(accuracy_score(y_test, y_pred))
```

---

## EXPERIMENT 6(A): SVM

### Aim

To implement Support Vector Machine classification.

### Objectives

* Train SVM model
* Perform classification

### Theory

SVM is a supervised learning algorithm that separates classes using a hyperplane. It chooses the hyperplane that maximizes the margin, the distance to the closest training points (the support vectors), and is effective for high-dimensional data.
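Once a linear SVM is trained, classification reduces to checking which side of the hyperplane a point falls on. The sketch below assumes a hyperplane is already given: the weights `w` and bias `b` are made-up values, not learned from data, and the training (margin maximization) step is not shown.

```python
from math import hypot

# Hypothetical hyperplane w . x + b = 0 (values chosen for illustration, not learned)
w = (1.0, 1.0)
b = -3.0

def decision(x):
    """Signed score: positive on one side of the hyperplane, negative on the other."""
    return w[0] * x[0] + w[1] * x[1] + b

def classify(x):
    """Assign class +1 or -1 according to the sign of the decision score."""
    return 1 if decision(x) >= 0 else -1

# Width of the margin between the lines w . x + b = +1 and w . x + b = -1
margin_width = 2 / hypot(*w)

print(classify((3.0, 2.0)))    # 1
print(classify((0.0, 1.0)))    # -1
print(round(margin_width, 4))  # 1.4142
```

A smaller norm of `w` means a wider margin, which is why training an SVM amounts to minimizing that norm subject to classifying the training points correctly.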

### Code

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

wine = load_wine()
X, y = wine.data, wine.target

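# test_size=0.25 is the default, stated explicitly; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)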

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

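model = SVC(kernel='linear')
model.fit(X_train, y_train)

# Classify the test set and report accuracy
y_pred = model.predict(X_test)
print("Accuracy:", model.score(X_test, y_test))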
```

---

## EXPERIMENT 6(B): Naive Bayes

### Aim

To implement a Naive Bayes classifier for text classification.

### Objectives

* Convert text to numerical form
* Train classifier
* Predict output

### Theory

Naive Bayes is a probabilistic classifier based on Bayes' theorem. It assumes that features are conditionally independent given the class and is commonly used in text classification problems such as spam detection.
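The calculation behind a multinomial Naive Bayes text classifier can be sketched from word counts. The four messages below are made up, and `predict` is a hypothetical helper; the sketch uses add-one (Laplace) smoothing, which matches the default `alpha=1.0` of `MultinomialNB`, and sums log probabilities to avoid underflow.

```python
from collections import Counter
from math import log

# Hypothetical labelled messages
docs = [
    ("win money now", "spam"),
    ("free money win", "spam"),
    ("meeting at noon", "ham"),
    ("lunch at noon today", "ham"),
]

# Count documents per class and word occurrences per class
class_docs = Counter(label for _, label in docs)
word_counts = {label: Counter() for label in class_docs}
for text, label in docs:
    word_counts[label].update(text.split())
vocab = {word for counts in word_counts.values() for word in counts}

def predict(text):
    """Return the class with the highest log posterior score."""
    scores = {}
    for label in class_docs:
        total_words = sum(word_counts[label].values())
        # Log prior: fraction of training documents in this class
        score = log(class_docs[label] / len(docs))
        for word in text.split():
            if word in vocab:  # words never seen in training are skipped
                # Log likelihood with add-one smoothing
                score += log((word_counts[label][word] + 1) / (total_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

print(predict("free money"))    # spam
print(predict("noon meeting"))  # ham
```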

### Code

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

url = "https://raw.githubusercontent.com/justmarkham/pycon-2016-tutorial/master/data/sms.tsv"
df = pd.read_csv(url, sep='\t', names=['label','text'])

df['label'] = df['label'].map({'ham':0, 'spam':1})

# random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(df['text'], df['label'], test_size=0.2, random_state=42)

vectorizer = CountVectorizer(stop_words='english')
X_train = vectorizer.fit_transform(X_train)
X_test = vectorizer.transform(X_test)

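model = MultinomialNB()
model.fit(X_train, y_train)

# Predict labels for the test messages (0 = ham, 1 = spam) and report accuracy
y_pred = model.predict(X_test)
print("Accuracy:", model.score(X_test, y_test))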
```

---

## EXPERIMENT 7: K-Means Clustering

### Aim

To implement the K-Means clustering algorithm.

### Objectives

* Group data into clusters
* Analyze cluster labels

### Theory

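K-Means is an unsupervised learning algorithm that groups data into K clusters based on similarity. It alternates between assigning each point to its nearest centroid and recomputing each centroid as the mean of its assigned points, which reduces the sum of squared distances between points and their centroids.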
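The assign-and-update loop can be written out directly. This sketch uses made-up 2D points and picks two of them as fixed starting centroids so the result is reproducible; scikit-learn's `KMeans` chooses its starting centroids differently, and the sketch does not handle empty clusters.

```python
from math import dist

# Hypothetical 2D points forming two visible groups
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]

# Fixed initial centroids (an assumption made for reproducibility)
centroids = [points[0], points[3]]

for _ in range(10):
    # Assignment step: index of the nearest centroid for each point
    labels = [
        min(range(len(centroids)), key=lambda j: dist(p, centroids[j]))
        for p in points
    ]
    # Update step: each centroid becomes the mean of its assigned points
    new_centroids = []
    for j in range(len(centroids)):
        members = [p for p, lab in zip(points, labels) if lab == j]
        new_centroids.append(tuple(sum(coord) / len(members) for coord in zip(*members)))
    # Stop when the centroids no longer move
    if new_centroids == centroids:
        break
    centroids = new_centroids

print(labels)     # [0, 0, 0, 1, 1, 1]
print(centroids)
```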

### Code

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

data = load_iris()
X = data.data

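# n_init and random_state are set explicitly so results are reproducible across runs and versions
model = KMeans(n_clusters=3, n_init=10, random_state=42)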
model.fit(X)

print(model.labels_)
```

---

## EXPERIMENT 8: Apriori Algorithm

### Aim

To implement association rule mining using the Apriori algorithm.

### Objectives

* Find frequent itemsets
* Generate association rules

### Theory

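Apriori finds frequent itemsets in transaction data and generates association rules from them. It relies on the fact that every subset of a frequent itemset must also be frequent, which lets it prune candidates level by level. It is widely used in market basket analysis to discover relationships between items.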
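Support, confidence, and the level-wise search can be worked through on four small baskets. This is a simplified sketch, not the mlxtend implementation: candidates at each level are filtered from all combinations rather than built by joining the previous level, and the transactions are made up for the illustration.

```python
from itertools import combinations

# Hypothetical transactions
transactions = [
    {"Milk", "Bread"},
    {"Bread", "Butter"},
    {"Milk", "Bread", "Butter"},
    {"Milk", "Butter"},
]
min_support = 0.5

def support(itemset):
    """Fraction of transactions that contain every item in itemset."""
    return sum(set(itemset) <= t for t in transactions) / len(transactions)

items = sorted(set().union(*transactions))

# Level 1: frequent single items
level = [(item,) for item in items if support((item,)) >= min_support]
frequent = {c: support(c) for c in level}

k = 2
while level:
    prev = set(level)
    # Keep a size-k candidate only if all of its (k-1)-subsets are frequent
    candidates = [
        c for c in combinations(items, k)
        if all(sub in prev for sub in combinations(c, k - 1))
    ]
    level = [c for c in candidates if support(c) >= min_support]
    frequent.update({c: support(c) for c in level})
    k += 1

# Confidence of the rule Milk -> Bread = support(Milk, Bread) / support(Milk)
confidence = frequent[("Bread", "Milk")] / frequent[("Milk",)]

print(frequent)
print(round(confidence, 4))  # 0.6667
```

The three-item set appears in only one of four baskets (support 0.25), so it falls below the threshold and the search stops.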

### Code

```python
# Install once if needed (in a notebook cell: !pip install mlxtend)

import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

data = {
    'Milk': [1,0,1,1],
    'Bread': [1,1,1,0],
    'Butter': [0,1,1,1]
}

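# apriori expects one boolean column per item (True = item present in the transaction)
df = pd.DataFrame(data).astype(bool)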

freq = apriori(df, min_support=0.5, use_colnames=True)
rules = association_rules(freq, metric="confidence", min_threshold=0.5)

print(freq)
print(rules)
```

---
