Must-know Data Science Algorithms (Part 4)

Estimated reading time: 7 minutes

Another Top 5 Data Science Algorithms (Part 4)

Hierarchical Clustering

Hierarchical clustering is a cluster analysis method that seeks to build a hierarchy of clusters. It can be either agglomerative (bottom-up) or divisive (top-down).
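To make the bottom-up idea concrete, here is a minimal NumPy sketch of agglomerative merging. It is an illustration only: the `agglomerate` helper is hypothetical, it uses single linkage (the distance between the closest pair of points) for simplicity, and it is far slower than the library routines used later in this section.

```python
import numpy as np

def agglomerate(points, n_clusters):
    """Hypothetical helper: naive single-linkage agglomerative clustering."""
    # Start with every point in its own cluster (lists of point indices)
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        best = None
        # Find the two clusters whose closest members are nearest to each other
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(np.linalg.norm(points[i] - points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        # Merge the closest pair and drop the absorbed cluster
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters

points = np.array([[1, 2], [1.5, 1.8], [5, 8], [8, 8], [1, 0.6], [9, 11]])
clusters = agglomerate(points, n_clusters=3)
print(clusters)
```

Each pass through the loop performs one merge, which is exactly the step a dendrogram draws as a horizontal join.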

Use Cases:

Biological taxonomy.
Document clustering.
Market segmentation.

Sample Data:


import numpy as np
# Features (Feature 1, Feature 2)
cluster_data = np.array([[1, 2], [1.5, 1.8], [5, 8], [8, 8], [1, 0.6], [9, 11]])

cluster_data represents data points in a 2D space to be clustered.

Code Implementation:


from sklearn.cluster import AgglomerativeClustering
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage

X = np.array([[1, 2], [1.5, 1.8], [5, 8], [8, 8], [1, 0.6], [9, 11]])

# Agglomerative clustering
agg_clustering = AgglomerativeClustering(n_clusters=3, linkage='ward')
agg_labels = agg_clustering.fit_predict(X)
print("Agglomerative Clustering Labels:", agg_labels)

# Dendrogram (for visualization)
linked = linkage(X, 'ward')
plt.figure(figsize=(10, 7))
dendrogram(linked, orientation='top')
plt.title('Hierarchical Clustering Dendrogram')
plt.xlabel('Data Points')
plt.ylabel('Distance')
plt.show()

Code Explanation:

from sklearn.cluster import AgglomerativeClustering: Imports the AgglomerativeClustering class for bottom-up hierarchical clustering.
from scipy.cluster.hierarchy import dendrogram, linkage: Imports the functions used to build and draw the dendrogram, which visualizes the order and distance of the merges.
linkage(X, 'ward'): Computes the linkage matrix using Ward's method, which at each step merges the pair of clusters that gives the smallest increase in total within-cluster variance. Other methods include 'average', 'complete', and 'single'.
dendrogram(linked, orientation='top'): Creates and displays the dendrogram.
AgglomerativeClustering(n_clusters=3, linkage='ward'): Initializes the agglomerative clustering model, specifying the number of clusters to find and the linkage method.
fit_predict(X): Performs the clustering and returns the cluster labels for each data point.
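The dendrogram and the flat cluster labels come from the same merge tree. As a small sketch, assuming SciPy is installed, `fcluster` can cut the linkage matrix directly into a chosen number of flat clusters, and the rows of the linkage matrix can be printed to see each merge.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[1, 2], [1.5, 1.8], [5, 8], [8, 8], [1, 0.6], [9, 11]])

# Same Ward linkage matrix that feeds the dendrogram
linked = linkage(X, 'ward')

# Cut the tree so that at most 3 flat clusters remain (labels start at 1)
flat_labels = fcluster(linked, t=3, criterion='maxclust')
print("Flat cluster labels:", flat_labels)

# Each row of the linkage matrix records one merge:
# [cluster index a, cluster index b, merge distance, size of the new cluster]
for a, b, dist, size in linked:
    print(f"merge {int(a)} + {int(b)} at distance {dist:.2f} -> {int(size)} points")
```

The label numbers themselves are arbitrary, so they may differ from those returned by AgglomerativeClustering even when the grouping is the same.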

Principal Component Regression (PCR)

PCR is a regression technique that combines Principal Component Analysis (PCA) with linear regression. It first reduces the dimensionality of the predictor variables using PCA and then uses the principal components as predictors in a linear regression model.

Use Cases:

Regression with multicollinear predictors.
Dimensionality reduction before regression.

Sample Data:


import numpy as np
# Features with potential multicollinearity
pcr_features = np.array([[1, 2, 2.1], [3, 4, 3.9], [5, 6, 6.2], [7, 8, 8.1]])
# Target variable
pcr_target = np.array([3, 7, 11, 15])

pcr_features are the predictor variables, and pcr_target is the target variable for regression.

Code Implementation:


from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
import numpy as np

X = np.array([[1, 2, 2.1], [3, 4, 3.9], [5, 6, 6.2], [7, 8, 8.1]])
y = np.array([3, 7, 11, 15])

# Create a pipeline for PCA followed by Linear Regression
pcr_pipeline = Pipeline([
    ('pca', PCA(n_components=2)),
    ('linear_regression', LinearRegression())
])

# Train the model
pcr_pipeline.fit(X, y)

# Predict on new data
new_X = np.array([[9, 10, 9.9]])
predicted_y = pcr_pipeline.predict(new_X)
print(f"Predicted value for [9, 10, 9.9]: {predicted_y}")

Code Explanation:

from sklearn.decomposition import PCA: Imports the PCA class.
from sklearn.linear_model import LinearRegression: Imports the LinearRegression class.
from sklearn.pipeline import Pipeline: Imports the Pipeline class to chain the PCA and Linear Regression steps.
Pipeline([('pca', PCA(n_components=2)), ('linear_regression', LinearRegression())]): Creates a pipeline where PCA is performed first, reducing the data to 2 principal components, and then a linear regression model is trained on these components.
pcr_pipeline.fit(X, y): Trains the entire pipeline.
pcr_pipeline.predict(new_X): Predicts the target variable for new data after transforming it through the PCA step in the pipeline.
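To show what the two pipeline steps do, the sketch below reproduces the idea with NumPy alone: centre the predictors, take the leading principal directions from an SVD, and fit least squares on the projected scores. The `pcr_fit_predict` helper is hypothetical and written for illustration; it is not how scikit-learn implements the pipeline internally.

```python
import numpy as np

def pcr_fit_predict(X, y, X_new, n_components):
    """Hypothetical helper: PCA via SVD, then least squares on the scores."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc = X - x_mean  # PCA operates on centred data
    # Rows of Vt are the principal directions, ordered by explained variance
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:n_components]
    scores = Xc @ components.T  # training data expressed in component space
    coef, *_ = np.linalg.lstsq(scores, y - y_mean, rcond=None)
    # New data must go through the same centring and projection
    new_scores = (X_new - x_mean) @ components.T
    return new_scores @ coef + y_mean

X = np.array([[1, 2, 2.1], [3, 4, 3.9], [5, 6, 6.2], [7, 8, 8.1]])
y = np.array([3, 7, 11, 15])

pred = pcr_fit_predict(X, y, np.array([[9, 10, 9.9]]), n_components=2)
print(f"Manual PCR prediction: {pred}")
```

Lowering `n_components` discards the low-variance directions, which is what makes PCR robust to multicollinear predictors.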

Partial Least Squares Regression (PLSR)

PLSR is another regression technique that is particularly useful when the predictor variables are multicollinear or when the number of predictors is large compared to the number of observations. Unlike PCR, which chooses components using the predictors alone, PLSR looks for components (linear combinations of the predictors) that have maximum covariance with the response variable.

Use Cases:

Chemometrics.
Quantitative structure-activity relationship (QSAR) modeling.
Sensory analysis.

Sample Data:


import numpy as np
# Predictors: the second column is exactly 0.1 times the first, so they are perfectly collinear
plsr_predictors = np.array([[1, 0.1], [2, 0.2], [3, 0.3], [4, 0.4]])
# Response variable
plsr_response = np.array([2.5, 4.8, 7.1, 9.4])

plsr_predictors are the independent variables, and plsr_response is the dependent variable.

Code Implementation:


from sklearn.cross_decomposition import PLSRegression
import numpy as np

X = np.array([[1, 0.1], [2, 0.2], [3, 0.3], [4, 0.4]])
y = np.array([2.5, 4.8, 7.1, 9.4])

# Initialize and train PLSR model
plsr = PLSRegression(n_components=1)
plsr.fit(X, y)

# Predict on new data
new_X = np.array([[5, 0.5]])
predicted_y = plsr.predict(new_X)
print(f"Predicted value for [5, 0.5]: {predicted_y}")

Code Explanation:

from sklearn.cross_decomposition import PLSRegression: Imports the PLSRegression class.
PLSRegression(n_components=1): Initializes the PLSR model, specifying the number of components to use.
plsr.fit(X, y): Trains the PLSR model.
plsr.predict(new_X): Predicts the response variable for new predictor data.
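The "maximum covariance" idea behind the first PLS component can be sketched in a few lines of NumPy. This is a simplified single-component, single-response illustration: it centres the data but skips the scaling that PLSRegression applies by default, so intermediate values will not match scikit-learn's attributes.

```python
import numpy as np

X = np.array([[1, 0.1], [2, 0.2], [3, 0.3], [4, 0.4]])
y = np.array([2.5, 4.8, 7.1, 9.4])

x_mean, y_mean = X.mean(axis=0), y.mean()
Xc, yc = X - x_mean, y - y_mean

# Weight vector: the unit direction w that maximises cov(Xc @ w, yc)
w = Xc.T @ yc
w /= np.linalg.norm(w)

t = Xc @ w              # component scores for the training data
q = (t @ yc) / (t @ t)  # regress the centred response on the scores

new_X = np.array([[5, 0.5]])
pred = ((new_X - x_mean) @ w) * q + y_mean
print(f"Manual one-component PLS prediction: {pred}")
```

Because the weight vector is built from both X and y, the component is steered toward directions that actually help predict the response.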

Time Series Analysis (ARIMA)

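ARIMA (Autoregressive Integrated Moving Average) is a class of statistical models for analyzing and forecasting time series data. It combines three parts: autoregression (AR) on past values, differencing (I) to remove trends and make the series stationary, and a moving average (MA) of past forecast errors.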

Use Cases:

Stock price forecasting.
Sales forecasting.
Weather forecasting.

Sample Data:


import numpy as np
import pandas as pd
# Sample time series data
time_series_data = pd.Series([10, 12, 15, 13, 16, 18, 15, 17, 20])

time_series_data is a pandas Series representing a sequence of values over time.

Code Implementation:


from statsmodels.tsa.arima.model import ARIMA
import pandas as pd

# Sample time series data
data = pd.Series([10, 12, 15, 13, 16, 18, 15, 17, 20])

# Fit ARIMA model (p, d, q) - order of AR, differencing, MA
model = ARIMA(data, order=(1, 1, 1))
model_fit = model.fit()

# Forecast the next value
forecast = model_fit.predict(start=len(data), end=len(data))
print(f"Forecast for the next time step: {forecast}")

Code Explanation:

from statsmodels.tsa.arima.model import ARIMA: Imports the ARIMA model from the statsmodels library (install if needed: `pip install statsmodels`).
The sample time series data is a pandas Series.
ARIMA(data, order=(1, 1, 1)): Initializes the ARIMA model with the time series data and the order of the model (p, d, q). These parameters need to be determined based on the characteristics of the time series data (e.g., using ACF and PACF plots).
model_fit = model.fit(): Fits the ARIMA model to the data.
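model_fit.predict(start=len(data), end=len(data)): Predicts the value at index len(data), which is the first time step after the end of the series. The result is a pandas Series holding that single forecast.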
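Two ingredients of choosing (p, d, q) can be inspected with plain NumPy: the differencing that d=1 applies, and the sample autocorrelation that an ACF plot displays. The sketch below is a rough illustration; `sample_acf` is a hypothetical helper written here, not a statsmodels function, and nine observations are far too few for reliable order selection.

```python
import numpy as np

def sample_acf(x, max_lag):
    """Hypothetical helper: sample autocorrelation for lags 0..max_lag."""
    x = np.asarray(x, dtype=float)
    xc = x - x.mean()
    denom = np.sum(xc ** 2)
    # Correlate the centred series with a lagged copy of itself
    return np.array([
        np.sum(xc[lag:] * xc[:len(x) - lag]) / denom
        for lag in range(max_lag + 1)
    ])

series = np.array([10, 12, 15, 13, 16, 18, 15, 17, 20])

# d=1 means the model works on first differences, which removes a linear trend
diffed = np.diff(series)
print("First differences:", diffed)

acf_vals = sample_acf(diffed, max_lag=3)
print("Sample ACF of the differenced series:", np.round(acf_vals, 3))
```

In practice, lags where the autocorrelation stands out from zero are what guide the choice of the AR and MA orders.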

Recommendation Systems (Collaborative Filtering)

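Collaborative filtering is a technique for building recommendation systems in which recommendations are based on users' past interactions with items (such as ratings), without explicit knowledge of item content. The underlying assumption is that users who agreed in the past will tend to agree again.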

Use Cases:

Recommending movies, books, or products to users.
Suggesting friends on social media.

Sample Data:


import numpy as np
# User-item rating matrix (users x items)
user_item_matrix = np.array([
    [5, 3, 0, 1],
    [4, 0, 0, 1],
    [1, 1, 0, 5],
    [1, 0, 0, 4],
    [0, 1, 5, 4],
])

user_item_matrix represents user ratings for different items (0 indicates no rating).

Code Implementation (Basic User-Based Collaborative Filtering):


import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# User-item rating matrix (users x items)
R = np.array([
    [5, 3, 0, 1],
    [4, 0, 0, 1],
    [1, 1, 0, 5],
    [1, 0, 0, 4],
    [0, 1, 5, 4],
])

# Calculate user similarity using cosine similarity
user_similarity = cosine_similarity(R)
print("User Similarity Matrix:\n", user_similarity)

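# Function to predict a user's rating for an item
def predict(ratings, similarity, user_index, item_index):
    # Use only users who actually rated the item (0 means no rating)
    # and who have positive similarity to the target user
    mask = (ratings[:, item_index] > 0) & (similarity[user_index] > 0)
    mask[user_index] = False  # the target user must not vote for themselves
    similar_users_ratings = ratings[mask, item_index]
    similar_users_weights = similarity[user_index][mask]
    sum_of_weights = np.sum(similar_users_weights)
    if sum_of_weights == 0:
        return 0
    # Similarity-weighted average of the neighbours' ratings
    return similar_users_ratings @ similar_users_weights / sum_of_weights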

# Predict rating for user 0 on item 2 (index 2)
prediction = predict(R, user_similarity, 0, 2)
print(f"\nPredicted rating for user 0 on item 2: {prediction}")

Code Explanation:

import numpy as np and from sklearn.metrics.pairwise import cosine_similarity: Import necessary libraries.
R: The user-item rating matrix.
cosine_similarity(R): Calculates the cosine similarity between each pair of users based on their ratings.
The predict function estimates the rating of a specific user for a specific item by considering the ratings of similar users for that item, weighted by their similarity to the target user.
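For intuition about the similarity step, cosine similarity can be computed by hand: normalise each user's rating vector to unit length and take dot products. A minimal NumPy sketch, assuming no user has an all-zero row (which would cause a division by zero):

```python
import numpy as np

R = np.array([
    [5, 3, 0, 1],
    [4, 0, 0, 1],
    [1, 1, 0, 5],
    [1, 0, 0, 4],
    [0, 1, 5, 4],
])

# Scale every row (user) to unit length
norms = np.linalg.norm(R, axis=1, keepdims=True)
unit_rows = R / norms

# Dot products of unit vectors are the cosines of the angles between users
manual_similarity = unit_rows @ unit_rows.T
print(np.round(manual_similarity, 3))
```

Because all ratings are non-negative, every similarity here falls between 0 and 1, with 1 on the diagonal where each user is compared with themselves.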
