Must-know Data Science Algorithms (Part 3)

Estimated reading time: 5 minutes

Another Top 5 Data Science Algorithms (Part 3)

K-Nearest Neighbors (KNN)

KNN is a simple yet effective algorithm for classification and regression. It classifies a new data point based on the majority class among its K nearest neighbors in the feature space.

Use Cases:

  • Image recognition.
  • Recommendation systems.
  • Pattern recognition.

Sample Data:


import numpy as np
# Features (Feature 1, Feature 2)
neighbor_features = np.array([[1, 1], [1, 2], [2, 0], [5, 5], [5, 6], [6, 4]])
# Target (Class)
neighbor_labels = np.array([0, 0, 0, 1, 1, 1])

neighbor_features are the coordinates of data points, and neighbor_labels are their corresponding classes.

Code Implementation (Classification Example):


from sklearn.neighbors import KNeighborsClassifier
import numpy as np

X = np.array([[1, 1], [1, 2], [2, 0], [5, 5], [5, 6], [6, 4]])
y = np.array([0, 0, 0, 1, 1, 1])

model = KNeighborsClassifier(n_neighbors=3)
model.fit(X, y)

new_X = np.array([[3, 3]])
predicted_class = model.predict(new_X)
print(f"Predicted class for [3, 3]: {predicted_class}")
        

Code Explanation:

  1. from sklearn.neighbors import KNeighborsClassifier: Imports the KNeighborsClassifier class for classification. For regression, use KNeighborsRegressor (a short sketch follows after this list).
  2. n_neighbors=3: Specifies the number of nearest neighbors to consider.
  3. The rest of the steps are similar to other classification algorithms.
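
Since step 1 mentions KNeighborsRegressor, here is a minimal regression sketch. It reuses the same feature coordinates with made-up continuous targets, so the numbers are purely illustrative:


from sklearn.neighbors import KNeighborsRegressor
import numpy as np

# Same feature coordinates as above, with made-up continuous targets
X = np.array([[1, 1], [1, 2], [2, 0], [5, 5], [5, 6], [6, 4]])
y = np.array([1.2, 1.5, 1.1, 5.3, 5.8, 5.1])

# The prediction is the average of the targets of the 3 nearest neighbors
model = KNeighborsRegressor(n_neighbors=3)
model.fit(X, y)

predicted_value = model.predict(np.array([[3, 3]]))
print(f"Predicted value for [3, 3]: {predicted_value[0]:.2f}")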

Support Vector Regression (SVR)

SVR is the regression counterpart of Support Vector Machines. It aims to find a function whose predictions deviate from the observed targets by at most a specified margin (epsilon) for all the training data.

Use Cases:

  • Stock price prediction.
  • Financial forecasting.
  • Demand forecasting.

Sample Data:


import numpy as np
# Features (Input feature)
svr_features = np.array([[1], [2], [3], [4], [5]])
# Target (Continuous value)
svr_target = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

svr_features contains the input values, and svr_target the corresponding continuous target values.

Code Implementation:


from sklearn.svm import SVR
import numpy as np

X = np.array([[1], [2], [3], [4], [5]])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

model = SVR(kernel='rbf')
model.fit(X, y)

new_X = np.array([[6]])
predicted_value = model.predict(new_X)
print(f"Predicted value for [6]: {predicted_value}")
        

Code Explanation:

  1. from sklearn.svm import SVR: Imports the SVR class.
  2. kernel='rbf': Specifies the Radial Basis Function kernel. Other kernels such as ‘linear’, ‘poly’, and ‘sigmoid’ can also be used.
  3. The rest of the steps are similar to other regression algorithms.
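
One practical caveat worth adding: the RBF kernel works on distances, so SVR is sensitive to feature scaling. Below is a minimal sketch of a scaled pipeline; the C and epsilon values are illustrative choices, not tuned settings:


from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR
import numpy as np

X = np.array([[1], [2], [3], [4], [5]])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Standardize features before the kernel computation; C controls the
# error penalty and epsilon the width of the no-penalty tube
model = make_pipeline(StandardScaler(), SVR(kernel='rbf', C=100, epsilon=0.1))
model.fit(X, y)

predicted_value = model.predict(np.array([[6]]))
print(f"Predicted value for [6] with scaling: {predicted_value[0]:.2f}")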

Independent Component Analysis (ICA)

ICA is a statistical and computational technique for revealing hidden factors that underlie sets of random variables, measurements, or signals.

Use Cases:

  • Signal processing (e.g., separating audio sources).
  • Biomedical signal analysis (e.g., EEG analysis).
  • Financial data analysis.

Sample Data:


import numpy as np
# Mixed signals (each row is a mixed signal)
mixed_signals = np.array([[1.0, 2.0], [3.0, 1.5], [1.5, 2.5]])

mixed_signals represents data where independent sources have been mixed together.

Code Implementation:


from sklearn.decomposition import FastICA
import numpy as np

X = np.array([[1.0, 2.0], [3.0, 1.5], [1.5, 2.5]])

model = FastICA(n_components=2, whiten='unit-variance', random_state=0)
S = model.fit_transform(X)  # Estimated source signals
A = model.mixing_  # Estimated mixing matrix

print("Estimated source signals:\n", S)
print("\nEstimated mixing matrix:\n", A)
        

Code Explanation:

  1. from sklearn.decomposition import FastICA: Imports the FastICA class, a common implementation of ICA.
  2. n_components=2: Specifies the number of independent components to extract. whiten='unit-variance' makes the whitening step explicit; it is the default in recent versions of scikit-learn.
  3. model.fit_transform(X): Fits the ICA model to the mixed data and transforms it to estimate the independent source signals (S).
  4. model.mixing_: Provides the estimated mixing matrix (A) that combined the original sources.
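
The three-row matrix above is only a placeholder. ICA is easier to appreciate when the ground truth is known, so here is a minimal sketch that mixes two synthetic sources and recovers them; it assumes a recent scikit-learn for the whiten='unit-variance' argument:


import numpy as np
from sklearn.decomposition import FastICA

t = np.linspace(0, 8, 2000)

# Two known independent sources: a sine wave and a square wave
S_true = np.c_[np.sin(2 * t), np.sign(np.sin(3 * t))]

# Simulate observed signals by mixing the sources with a known matrix
A_true = np.array([[1.0, 0.5], [0.5, 1.0]])
X = S_true @ A_true.T

# Recover the sources (up to sign and scale) from the mixtures alone
ica = FastICA(n_components=2, whiten='unit-variance', random_state=0)
S_est = ica.fit_transform(X)
print("Recovered components shape:", S_est.shape)  # (2000, 2)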

Singular Value Decomposition (SVD)

SVD is a matrix factorization technique that decomposes a matrix into the product of three other matrices: a matrix of left singular vectors, a diagonal matrix of singular values, and a matrix of right singular vectors. It has applications in dimensionality reduction and recommendation systems.

Use Cases:

  • Recommendation systems (e.g., collaborative filtering).
  • Dimensionality reduction.
  • Information retrieval (e.g., Latent Semantic Analysis).

Sample Data:


import numpy as np
# A user-item rating matrix (rows are users, columns are items, values are ratings)
ratings_matrix = np.array([[5, 1, 0, 0], [1, 0, 0, 4], [0, 0, 5, 0], [0, 3, 0, 0]])

ratings_matrix represents user ratings for different items, with 0 indicating no rating.

Code Implementation:


import numpy as np
from numpy.linalg import svd

A = np.array([[5, 1, 0, 0], [1, 0, 0, 4], [0, 0, 5, 0], [0, 3, 0, 0]])

U, s, Vh = svd(A)

print("U (User features):\n", U)
print("\ns (Singular values):\n", s)
print("\nVh (Item features):\n", Vh)

# You can use the decomposed matrices for various tasks like recommendation
# by reconstructing a lower-rank approximation of the original matrix.
k = 2  # Number of components to keep
Ak_approx = U[:, :k] @ np.diag(s[:k]) @ Vh[:k, :]
print("\nLower-rank approximation (k=2):\n", Ak_approx)
        

Code Explanation:

  1. from numpy.linalg import svd: Imports the Singular Value Decomposition function from NumPy’s linear algebra module.
  2. A: The matrix to be decomposed (e.g., the user-item rating matrix).
  3. U, s, Vh = svd(A): Performs the SVD, resulting in three matrices:
    • U: The left singular vectors (representing user features).
    • s: The singular values (indicating the importance of each component).
    • Vh: The right singular vectors (representing item features, transposed).
  4. The code then demonstrates how to reconstruct a lower-rank approximation of the original matrix by keeping only the top k singular values and their corresponding singular vectors. This lower-rank matrix can be used for recommendations or dimensionality reduction.
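
As a sanity check and a tiny recommendation example, the sketch below verifies the full reconstruction and then reads an unrated (zero) cell out of the rank-2 approximation; treating that cell as a predicted rating is a deliberately simplified heuristic:


import numpy as np
from numpy.linalg import svd

A = np.array([[5, 1, 0, 0], [1, 0, 0, 4], [0, 0, 5, 0], [0, 3, 0, 0]])
U, s, Vh = svd(A)

# The full product U @ diag(s) @ Vh recovers A exactly
assert np.allclose(U @ np.diag(s) @ Vh, A)

# The rank-2 approximation fills unrated (zero) cells with estimates
k = 2
Ak = U[:, :k] @ np.diag(s[:k]) @ Vh[:k, :]
print("Estimated rating of user 0 for item 2:", round(Ak[0, 2], 2))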

Must-know Data Science Algorithms (Part 4)
