# Machine Learning Lab – All 8 Experiments (Exam Ready)
---
## EXPERIMENT 1(A): Data Preprocessing
### Aim
To perform data preprocessing on regression and classification datasets by handling missing values, scaling features, and splitting the dataset.
### Objectives
* Understand dataset structure
* Handle missing values using imputation
* Perform feature scaling
* Split dataset into training and testing sets
### Theory
Data preprocessing cleans raw data and transforms it into a form a model can use. It includes handling missing values, feature scaling, and dataset splitting. Careful preprocessing usually improves model accuracy and reduces bias introduced by missing or differently scaled values. Holding out a test set makes it possible to check how the model performs on unseen data.
### Code
```python
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.datasets import fetch_openml
np.random.seed(42)
# Boston Dataset
housing = fetch_openml(name='boston', version=1, as_frame=True)
df_boston = housing.frame
X_b = df_boston.drop('MEDV', axis=1)
y_b = df_boston['MEDV']
X_train_b, X_test_b, y_train_b, y_test_b = train_test_split(X_b, y_b, test_size=0.25, random_state=42)
scaler = StandardScaler()
X_train_b = scaler.fit_transform(X_train_b)
X_test_b = scaler.transform(X_test_b)
# Pima Dataset
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
names = ['preg','plas','pres','skin','test','mass','pedi','age','class']
df = pd.read_csv(url, names=names)
# A zero in these columns is not a valid measurement, so treat it as missing
cols = ['plas','pres','skin','test','mass']
df[cols] = df[cols].replace(0, np.nan)
X = df.drop('class', axis=1)
y = df['class']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
# Fit the imputer and scaler on the training split only, so no test statistics leak into training
imputer = SimpleImputer(strategy='mean')
X_train = imputer.fit_transform(X_train)
X_test = imputer.transform(X_test)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
```
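To make the imputation and scaling steps concrete, the following is a minimal hand-computed sketch using only the standard library. The toy columns `train_col` and `test_col` are hypothetical and not part of the lab datasets; the point is that the fill value, mean, and standard deviation are all learned from the training values and then reused on the test values.

```python
from statistics import mean, pstdev

# Hypothetical toy column; None marks a missing value
train_col = [2.0, None, 4.0, 6.0]
test_col = [None, 8.0]

# Mean imputation: the fill value comes from the observed training values only
fill = mean(v for v in train_col if v is not None)  # 4.0
train_filled = [fill if v is None else v for v in train_col]
test_filled = [fill if v is None else v for v in test_col]

# Standardization: z = (x - mu) / sigma, with mu and sigma from the training split
mu = mean(train_filled)
sigma = pstdev(train_filled)  # population standard deviation, as StandardScaler uses
train_scaled = [(v - mu) / sigma for v in train_filled]
test_scaled = [(v - mu) / sigma for v in test_filled]

print(train_scaled)
print(test_scaled)
```

The scaled training column has mean 0 and standard deviation 1, while the test column generally does not, because it is transformed with the training statistics.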
---
## EXPERIMENT 1(B): Feature Encoding
### Aim
To apply categorical feature encoding techniques for machine learning.
### Objectives
* Identify categorical features
* Apply Label Encoding
* Apply One-Hot Encoding
### Theory
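Most machine learning models require numerical input, so categorical variables must be converted into numerical form. Label encoding assigns an integer to each category, while one-hot encoding creates one binary column per category. Because integer codes imply an order, one-hot encoding is usually preferred for unordered input features; this avoids suggesting relationships between categories that do not exist.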
### Code
```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
from sklearn.compose import ColumnTransformer
from ucimlrepo import fetch_ucirepo
adult = fetch_ucirepo(id=2)
X = adult.data.features.copy()
y = adult.data.targets.squeeze()
y = y.str.replace('.', '', regex=False)
le = LabelEncoder()
y = le.fit_transform(y)
categorical_features = X.select_dtypes(include=['object']).columns.tolist()
X[categorical_features] = X[categorical_features].fillna('Missing')
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
ohe = ColumnTransformer([
('onehot', OneHotEncoder(handle_unknown='ignore'), categorical_features)
], remainder='passthrough')
X_train_ohe = ohe.fit_transform(X_train)
X_test_ohe = ohe.transform(X_test)
```
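The difference between the two encodings can be seen on a tiny example. This is an illustrative sketch in plain Python with a hypothetical `colors` column; it mirrors the idea of label and one-hot encoding rather than the scikit-learn implementation.

```python
# Hypothetical categorical column
colors = ['red', 'green', 'blue', 'green']

# Label encoding: one integer per category, categories taken in sorted order
categories = sorted(set(colors))               # ['blue', 'green', 'red']
label_map = {cat: i for i, cat in enumerate(categories)}
labels = [label_map[c] for c in colors]        # [2, 1, 0, 1]

# One-hot encoding: one binary column per category, exactly one 1 per row
one_hot = [[1 if c == cat else 0 for cat in categories] for c in colors]

print(labels)
for row in one_hot:
    print(row)
```

The integer labels suggest that `red` (2) is "greater" than `blue` (0), which has no meaning for colors; the one-hot rows carry no such order.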
---
## EXPERIMENT 2(A): Dataset Splitting & Cross Validation
### Aim
To evaluate model performance using hold-out and K-Fold cross validation.
### Objectives
* Split dataset into training and testing sets
* Train classification model
* Evaluate performance using cross-validation
### Theory
Dataset splitting ensures that models are tested on unseen data. The hold-out method divides the data once into a training set and a testing set. K-Fold cross validation divides the data into K folds and uses each fold once as the test set while training on the rest, so the averaged score is a more reliable estimate than a single split.
### Code
```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, cross_val_score, KFold
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
data = load_breast_cancer()
X, y = data.data, data.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
model = LogisticRegression(max_iter=5000)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))
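# 5-fold cross validation: each fold serves once as the test set
kfold = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=kfold)
print("Fold accuracies:", scores)
print("Mean CV accuracy:", scores.mean())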
```
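How K-Fold partitions the data can be sketched without any library. The helper `kfold_indices` below is a hypothetical name used for illustration; it produces contiguous, unshuffled folds, whereas the experiment above shuffles the data first.

```python
def kfold_indices(n_samples, n_splits):
    """Yield (train_idx, test_idx) pairs for contiguous, unshuffled folds."""
    indices = list(range(n_samples))
    # The first (n_samples % n_splits) folds receive one extra sample
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        size = base + (1 if fold < extra else 0)
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size

folds = list(kfold_indices(10, 3))
for train_idx, test_idx in folds:
    print("train:", train_idx, "test:", test_idx)
```

Every sample appears in exactly one test fold, so each sample is used for evaluation once and for training in the remaining rounds.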
---
## EXPERIMENT 2(B): Feature Scaling
### Aim
To apply normalization and standardization techniques.
### Objectives
* Apply Min-Max scaling
* Apply Standardization
* Compare feature ranges
### Theory
Feature scaling puts features on comparable scales so that no feature dominates only because of its units. Normalization (Min-Max scaling) maps values to the range 0 to 1, while standardization transforms each feature to have mean 0 and standard deviation 1. Scaling often speeds up convergence of gradient-based training and helps distance-based models.
### Code
```python
from sklearn.datasets import load_wine
from sklearn.preprocessing import MinMaxScaler, StandardScaler
wine = load_wine()
X = wine.data
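# Min-Max normalization: each feature is mapped to [0, 1]
minmax = MinMaxScaler()
X_norm = minmax.fit_transform(X)
# Standardization: each feature gets mean 0 and standard deviation 1
std = StandardScaler()
X_std = std.fit_transform(X)
# Compare feature ranges before and after scaling
print("Original range:", X.min(axis=0), X.max(axis=0))
print("Normalized range:", X_norm.min(axis=0), X_norm.max(axis=0))
print("Standardized mean/std:", X_std.mean(axis=0).round(2), X_std.std(axis=0).round(2))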
```
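The two formulas can also be applied by hand to compare the resulting ranges. This sketch uses a hypothetical list `values` and the standard library only.

```python
from statistics import mean, pstdev

# Hypothetical single feature
values = [10.0, 20.0, 30.0, 50.0]

# Min-Max normalization: x' = (x - min) / (max - min)
lo, hi = min(values), max(values)
normalized = [(v - lo) / (hi - lo) for v in values]    # [0.0, 0.25, 0.5, 1.0]

# Standardization: z = (x - mean) / std
mu, sigma = mean(values), pstdev(values)
standardized = [(v - mu) / sigma for v in values]

print("normalized range:  ", min(normalized), max(normalized))
print("standardized range:", min(standardized), max(standardized))
```

Normalized values always lie between 0 and 1, whereas standardized values are centered on 0 and have no fixed bounds.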
---
## EXPERIMENT 3: Linear Regression (Gradient Descent)
### Aim
To implement linear regression using gradient descent.
### Objectives
* Understand gradient descent
* Train regression model
* Predict values
### Theory
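Linear regression models the relationship between input and output variables as a straight line (or a hyperplane for several inputs). Gradient descent is an optimization algorithm that minimizes the mean squared error by repeatedly moving the model parameters a small step against the gradient of the error. The step size is controlled by the learning rate.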
### Code
```python
import numpy as np
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
california = fetch_california_housing()
X = california.data[:, 0:1]
y = california.target
scaler = StandardScaler()
X = scaler.fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

class GD:
    """Batch gradient descent for single-feature linear regression."""
    def __init__(self, lr=0.01, iters=1000):
        self.lr = lr
        self.iters = iters

    def fit(self, X, y):
        n = X.shape[0]
        X_b = np.c_[np.ones((n, 1)), X]      # prepend a bias column of ones
        self.theta = np.random.randn(2, 1)   # random initial parameters
        y = y.reshape(-1, 1)
        for _ in range(self.iters):
            # Gradient of the halved mean squared error with respect to theta
            grad = (1/n) * X_b.T @ (X_b @ self.theta - y)
            self.theta -= self.lr * grad

    def predict(self, X):
        X_b = np.c_[np.ones((X.shape[0], 1)), X]
        return X_b @ self.theta

model = GD()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
```
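The same update rule can be traced on a tiny dataset where the answer is known. The sketch below is plain Python on hypothetical points that lie exactly on the line y = 2x + 1; the learning rate and iteration count are illustrative choices, not tuned values.

```python
# Hypothetical data lying exactly on y = 2x + 1
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

theta0, theta1 = 0.0, 0.0   # intercept and slope, both start at zero
lr = 0.1                    # learning rate (assumed value)
n = len(xs)

for _ in range(2000):
    # Prediction errors for the current parameters
    errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
    # Gradients of the halved mean squared error
    grad0 = sum(errors) / n
    grad1 = sum(e * x for e, x in zip(errors, xs)) / n
    # Step against the gradient
    theta0 -= lr * grad0
    theta1 -= lr * grad1

print(round(theta0, 4), round(theta1, 4))  # approaches 1.0 and 2.0
```

With a much larger learning rate the updates overshoot and diverge, which is why the step size has to be chosen with some care.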
---
## EXPERIMENT 4: Regression Evaluation
### Aim
To evaluate a regression model using MSE, MAE, and R2 score.
### Objectives
* Train regression model
* Calculate evaluation metrics
### Theory
Evaluation metrics measure model performance. MSE is the average squared error, MAE is the average absolute error, and the R2 score indicates what fraction of the variance in the target the model explains (1 is a perfect fit).
### Code
```python
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.linear_model import LinearRegression
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
data = fetch_california_housing()
X, y = data.data, data.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
# Report each metric on the held-out test set
print("MSE:", mean_squared_error(y_test, y_pred))
print("MAE:", mean_absolute_error(y_test, y_pred))
print("R2:", r2_score(y_test, y_pred))
```
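To see what the three metrics compute, they can be worked out by hand on a few values. The lists `y_true` and `y_pred` below are hypothetical numbers chosen so the arithmetic is easy to follow.

```python
# Hypothetical targets and predictions
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.0, 5.0, 8.0, 11.0]
n = len(y_true)

# MSE: mean of squared errors -> (1 + 0 + 1 + 4) / 4
mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n

# MAE: mean of absolute errors -> (1 + 0 + 1 + 2) / 4
mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n

# R2: 1 - (residual sum of squares / total sum of squares)
mean_true = sum(y_true) / n
ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))   # 6.0
ss_tot = sum((t - mean_true) ** 2 for t in y_true)           # 20.0
r2 = 1 - ss_res / ss_tot

print(mse, mae, r2)  # 1.5 1.0 0.7
```

MSE penalizes the single large error (2 units) more heavily than MAE does, which is why the two values differ.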
---
## EXPERIMENT 5: KNN Classification
### Aim
To implement K-Nearest Neighbors classification.
### Objectives
* Train KNN model
* Predict class labels
* Evaluate accuracy
### Theory
KNN is a supervised learning algorithm that classifies a point based on its nearest neighbors. It computes the distance from the point to every training sample and assigns the majority class among the K closest samples.
### Code
```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
data = load_iris()
X = pd.DataFrame(data.data)
y = pd.Series(data.target)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
model = KNeighborsClassifier(n_neighbors=5)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print(accuracy_score(y_test, y_pred))
```
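The distance-and-vote procedure can be written out in a few lines. This is a minimal sketch with hypothetical 2-D training points and a hypothetical helper `knn_predict`; it uses Euclidean distance and a simple majority vote.

```python
from collections import Counter
from math import dist

# Hypothetical labelled training points: (features, class label)
train = [
    ((1.0, 1.0), 'A'), ((1.5, 2.0), 'A'), ((2.0, 1.0), 'A'),
    ((6.0, 6.0), 'B'), ((7.0, 5.0), 'B'),
]

def knn_predict(query, train, k=3):
    """Return the majority label among the k training points closest to query."""
    # Sort training points by Euclidean distance to the query and keep the k nearest
    neighbours = sorted(train, key=lambda item: dist(query, item[0]))[:k]
    # Majority vote over the neighbours' labels
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

print(knn_predict((1.2, 1.5), train))  # A
print(knn_predict((6.5, 5.5), train))  # B
```

For the second query only two of the three nearest neighbours are class B, yet B still wins the vote; this is why an odd K is a common choice for two-class problems.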
---
## EXPERIMENT 6(A): SVM
### Aim
To implement Support Vector Machine classification.
### Objectives
* Train SVM model
* Perform classification
### Theory
SVM is a supervised learning algorithm that separates data using a hyperplane. It maximizes the margin between classes and is effective for high-dimensional data.
### Code
```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
wine = load_wine()
X, y = wine.data, wine.target
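X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
# Scale features using training statistics only
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
model = SVC(kernel='linear')
model.fit(X_train, y_train)
# Classify the test samples and report accuracy
y_pred = model.predict(X_test)
print("Accuracy:", model.score(X_test, y_test))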
```
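For a linear kernel and two classes, the trained model reduces to a weight vector and a bias that define the separating hyperplane. The sketch below uses hypothetical values for `w` and `b` (they are not learned from the wine data) to show how a point is classified and how the margin width follows from the weights.

```python
from math import hypot

# Hypothetical hyperplane parameters for a 2-D, two-class problem
w = (3.0, 4.0)
b = -12.0

def decision(x):
    """Signed score w . x + b; its sign gives the side of the hyperplane."""
    return w[0] * x[0] + w[1] * x[1] + b

def classify(x):
    return 1 if decision(x) >= 0 else -1

# Width of the margin between the two supporting hyperplanes: 2 / ||w||
margin_width = 2 / hypot(*w)

print(classify((4.0, 4.0)))   # 1
print(classify((1.0, 1.0)))   # -1
print(margin_width)           # 0.4
```

Training an SVM amounts to choosing `w` and `b` so that this margin is as wide as possible while the training points stay on the correct side.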
---
## EXPERIMENT 6(B): Naive Bayes
### Aim
To implement Naive Bayes classifier for text classification.
### Objectives
* Convert text to numerical form
* Train classifier
* Predict output
### Theory
Naive Bayes is a probabilistic classifier based on Bayes' theorem. It assumes that features are conditionally independent given the class and is commonly used in text classification problems such as spam detection.
### Code
```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
url = "https://raw.githubusercontent.com/justmarkham/pycon-2016-tutorial/master/data/sms.tsv"
df = pd.read_csv(url, sep='\t', names=['label','text'])
df['label'] = df['label'].map({'ham':0, 'spam':1})
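X_train, X_test, y_train, y_test = train_test_split(df['text'], df['label'], test_size=0.2, random_state=42)
# Convert text to word-count vectors; the vocabulary is learned from the training messages only
vectorizer = CountVectorizer(stop_words='english')
X_train = vectorizer.fit_transform(X_train)
X_test = vectorizer.transform(X_test)
model = MultinomialNB()
model.fit(X_train, y_train)
# Predict labels for the test messages and report accuracy
y_pred = model.predict(X_test)
print("Accuracy:", model.score(X_test, y_test))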
```
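The probability computation behind a multinomial Naive Bayes text classifier can be sketched by hand. The four messages in `docs` are hypothetical, and the helper names `log_posterior` and `predict` are illustrative; the sketch uses word counts with add-one (Laplace) smoothing and skips words that never appeared in training.

```python
from collections import Counter
from math import log

# Hypothetical labelled messages
docs = [
    ("win cash now", "spam"), ("cash prize win", "spam"),
    ("meeting at noon", "ham"), ("lunch at noon", "ham"),
]

# Count documents per class and words per class
word_counts = {"spam": Counter(), "ham": Counter()}
class_counts = Counter()
for text, label in docs:
    class_counts[label] += 1
    word_counts[label].update(text.split())

vocab = {word for counts in word_counts.values() for word in counts}

def log_posterior(text, label, alpha=1.0):
    """log P(label) + sum of log P(word | label), with Laplace smoothing."""
    total = sum(word_counts[label].values())
    score = log(class_counts[label] / len(docs))
    for word in text.split():
        if word in vocab:  # words unseen in training are ignored
            score += log((word_counts[label][word] + alpha) / (total + alpha * len(vocab)))
    return score

def predict(text):
    return max(class_counts, key=lambda label: log_posterior(text, label))

print(predict("win cash"))             # spam
print(predict("lunch meeting noon"))   # ham
```

Working with log-probabilities turns the product of many small numbers into a sum, which avoids numerical underflow on long messages.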
---
## EXPERIMENT 7: K-Means Clustering
### Aim
To implement the K-Means clustering algorithm.
### Objectives
* Group data into clusters
* Analyze cluster labels
### Theory
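K-Means is an unsupervised learning algorithm that groups data into K clusters based on similarity. It alternates between assigning each point to its nearest centroid and moving each centroid to the mean of its points, which reduces the sum of squared distances between data points and their cluster centroids.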
### Code
```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
data = load_iris()
X = data.data
# Fix n_init and random_state so the cluster labels are reproducible
model = KMeans(n_clusters=3, n_init=10, random_state=42)
model.fit(X)
print(model.labels_)
```
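The two alternating steps of K-Means can be traced on a handful of points. This sketch uses hypothetical 2-D points and a hypothetical initial choice of centroids; it does not handle empty clusters, which cannot occur with this particular data.

```python
from math import dist

# Hypothetical 2-D points forming two obvious groups
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids = [points[0], points[3]]   # assumed initial centroids

for _ in range(10):
    # Assignment step: each point goes to its nearest centroid
    labels = [min(range(len(centroids)), key=lambda j: dist(p, centroids[j]))
              for p in points]
    # Update step: each centroid moves to the mean of its assigned points
    new_centroids = []
    for j in range(len(centroids)):
        members = [p for p, lab in zip(points, labels) if lab == j]
        new_centroids.append(tuple(sum(c) / len(members) for c in zip(*members)))
    if new_centroids == centroids:   # stop once the centroids no longer move
        break
    centroids = new_centroids

print(labels)      # [0, 0, 0, 1, 1, 1]
print(centroids)
```

The result depends on the initial centroids, which is why library implementations typically run several initializations and keep the best one.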
---
## EXPERIMENT 8: Apriori Algorithm
### Aim
To implement association rule mining using the Apriori algorithm.
### Objectives
* Find frequent itemsets
* Generate association rules
### Theory
Apriori is used to find frequent itemsets in datasets and generate association rules. It is widely used in market basket analysis to discover relationships between items.
### Code
```python
# Install the library first (run in a terminal): pip install mlxtend
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules
data = {
'Milk': [1,0,1,1],
'Bread': [1,1,1,0],
'Butter': [0,1,1,1]
}
df = pd.DataFrame(data).astype(bool)  # boolean columns are the recommended input for apriori
freq = apriori(df, min_support=0.5, use_colnames=True)
rules = association_rules(freq, metric="confidence", min_threshold=0.5)
print(freq)
print(rules)
```
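The support and confidence values behind these rules can be checked by hand. The sketch below rewrites the same four transactions from the table above as Python sets; the helper names `support` and `confidence` are illustrative and not part of mlxtend.

```python
# The four transactions from the table above, one set of items per row
transactions = [
    {'Milk', 'Bread'},
    {'Bread', 'Butter'},
    {'Milk', 'Bread', 'Butter'},
    {'Milk', 'Butter'},
]

def support(itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """support(antecedent and consequent together) / support(antecedent)."""
    return support(set(antecedent) | set(consequent)) / support(antecedent)

print(support({'Milk'}))                 # 0.75
print(support({'Milk', 'Bread'}))        # 0.5
print(confidence({'Milk'}, {'Bread'}))   # 2/3, about 0.667
```

With `min_support=0.5`, the pair {Milk, Bread} is just frequent enough to be kept, and the rule Milk -> Bread passes the confidence threshold of 0.5.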