Must-Know Data Science Algorithms and Their Use Cases: Part 1

Estimated reading time: 7 minutes

Top 10 Data Science Algorithms

1. Linear Regression

Linear regression is used for predicting a continuous target variable based on one or more independent variables by fitting a linear relationship.

Use Cases:

  • Predicting house prices based on features like size and location.
  • Forecasting sales based on advertising spend.
  • Estimating the yield of a crop based on rainfall and temperature.

Sample Data:


import numpy as np
# Features (Square Footage)
house_size = np.array([[1500], [2000], [1800], [2200], [1600]])
# Target (Selling Price in $1000s)
selling_price = np.array([250, 320, 280, 350, 260])
        

Here, house_size is the independent variable (feature), and selling_price is the dependent variable (target).

Code Implementation:


from sklearn.linear_model import LinearRegression
import numpy as np

X = np.array([[1], [2], [3], [4], [5]])
y = np.array([2, 4, 5, 4, 5])

model = LinearRegression()
model.fit(X, y)

new_X = np.array([[6]])
predicted_y = model.predict(new_X)
print(f"Predicted value for X=6: {predicted_y}")
        

Code Explanation:

  1. from sklearn.linear_model import LinearRegression: Imports the LinearRegression class from scikit-learn.
  2. import numpy as np: Imports the NumPy library for numerical operations.
  3. X = np.array([[1], [2], [3], [4], [5]]): Creates a NumPy array representing the independent variable.
  4. y = np.array([2, 4, 5, 4, 5]): Creates a NumPy array representing the dependent variable.
  5. model = LinearRegression(): Initializes a Linear Regression model.
  6. model.fit(X, y): Trains the model using the sample data to learn the linear relationship.
  7. new_X = np.array([[6]]): Creates new data for prediction.
  8. predicted_y = model.predict(new_X): Uses the trained model to predict the target variable for the new data.
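
To tie the implementation back to the house-price sample data above, here is a minimal sketch that fits the same model to that data and inspects the fitted line; the 1900 sq ft query point is an arbitrary value chosen for illustration.

from sklearn.linear_model import LinearRegression
import numpy as np

# Sample data from above: square footage vs. selling price in $1000s
house_size = np.array([[1500], [2000], [1800], [2200], [1600]])
selling_price = np.array([250, 320, 280, 350, 260])

model = LinearRegression()
model.fit(house_size, selling_price)

# The fitted line: price = intercept + slope * size
print(f"Slope ($1000s per sq ft): {model.coef_[0]:.4f}")
print(f"Intercept: {model.intercept_:.2f}")
print(f"Predicted price for 1900 sq ft: {model.predict(np.array([[1900]]))[0]:.1f}")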

2. Logistic Regression

Logistic regression is a classification algorithm used for predicting binary outcomes (0 or 1) by modeling the probability of a class with the logistic (sigmoid) function.

Use Cases:

  • Spam email detection (spam or not spam).
  • Disease prediction (positive or negative).
  • Customer churn prediction (will churn or will not churn).

Sample Data:


import numpy as np
# Features (Number of keywords, Has suspicious link)
email_features = np.array([[5, 0], [2, 0], [8, 1], [1, 0], [6, 1]])
# Target (0: Not Spam, 1: Spam)
is_spam = np.array([0, 0, 1, 0, 1])
        

email_features contains the number of spam-related keywords and whether a suspicious link is present. is_spam indicates the class.

Code Implementation:


from sklearn.linear_model import LogisticRegression
import numpy as np

X = np.array([[1, 2], [2, 3], [3, 1], [4, 5], [5, 4]])
y = np.array([0, 0, 1, 1, 0])

model = LogisticRegression()
model.fit(X, y)

new_X = np.array([[3, 2]])
predicted_class = model.predict(new_X)
predicted_probability = model.predict_proba(new_X)
print(f"Predicted class for [3, 2]: {predicted_class}")
print(f"Predicted probabilities for [3, 2]: {predicted_probability}")
        

Code Explanation:

  1. from sklearn.linear_model import LogisticRegression: Imports the LogisticRegression class.
  2. X: Independent features.
  3. y: Binary target variable.
  4. model = LogisticRegression(): Initializes a Logistic Regression model.
  5. model.fit(X, y): Trains the model.
  6. model.predict(new_X): Predicts the class label for new data.
  7. model.predict_proba(new_X): Predicts the probability of each class for new data.
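
The same pattern applies to the spam-detection sample data above. Here is a minimal sketch; the query email with 7 keywords and a suspicious link is invented for illustration. Because predict_proba returns probabilities, a real filter would typically act on a probability threshold rather than the raw 0/1 label.

from sklearn.linear_model import LogisticRegression
import numpy as np

# Sample data from above: (keyword count, has suspicious link) -> spam label
email_features = np.array([[5, 0], [2, 0], [8, 1], [1, 0], [6, 1]])
is_spam = np.array([0, 0, 1, 0, 1])

model = LogisticRegression()
model.fit(email_features, is_spam)

# Score an unseen email: 7 spam keywords, suspicious link present
new_email = np.array([[7, 1]])
print(f"Predicted label: {model.predict(new_email)[0]}")
print(f"[P(not spam), P(spam)]: {model.predict_proba(new_email)[0]}")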

3. Decision Trees

Decision trees are tree-like structures that make decisions based on feature values, used for both classification and regression.

Use Cases:

  • Customer segmentation.
  • Credit risk assessment.
  • Medical diagnosis.

Sample Data:


import numpy as np
# Features (Age, Income)
customer_features = np.array([[25, 50000], [40, 100000], [30, 75000], [50, 120000]])
# Target (0: Low Risk, 1: High Risk)
risk_level = np.array([0, 1, 0, 1])
        

customer_features contains age and income, used to predict the risk_level.

Code Implementation:


from sklearn.tree import DecisionTreeClassifier
import numpy as np

X = np.array([[2, 2], [2, 3], [3, 1], [1, 1]])
y = np.array([0, 0, 1, 1])

model = DecisionTreeClassifier()
model.fit(X, y)

new_X = np.array([[2, 1]])
predicted_class = model.predict(new_X)
print(f"Predicted class for [2, 1]: {predicted_class}")
        

Code Explanation:

  1. from sklearn.tree import DecisionTreeClassifier: Imports the DecisionTreeClassifier class.
  2. X: Features.
  3. y: Target variable.
  4. model = DecisionTreeClassifier(): Initializes a Decision Tree Classifier.
  5. model.fit(X, y): Trains the decision tree.
  6. model.predict(new_X): Predicts the class for new data.
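
One advantage of decision trees is that the learned rules are human-readable. As a minimal sketch using the credit-risk sample data above, scikit-learn's export_text prints the tree as nested if/else rules; the 35-year-old query customer is invented for illustration.

from sklearn.tree import DecisionTreeClassifier, export_text
import numpy as np

# Sample data from above: (age, income) -> credit risk
customer_features = np.array([[25, 50000], [40, 100000], [30, 75000], [50, 120000]])
risk_level = np.array([0, 1, 0, 1])

model = DecisionTreeClassifier(random_state=0)
model.fit(customer_features, risk_level)

# Print the learned decision rules as text
print(export_text(model, feature_names=["age", "income"]))
print(f"Risk for a 35-year-old earning 90000: {model.predict([[35, 90000]])[0]}")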

4. Random Forest

Random forest is an ensemble learning method that builds multiple decision trees and aggregates their predictions to improve accuracy and reduce overfitting.

Use Cases:

  • Classification tasks.
  • Fraud detection.
  • Stock price prediction.

Sample Data:


import numpy as np
# Features (Feature 1, Feature 2)
data_features = np.array([[1, 2], [2, 3], [3, 1], [4, 5], [5, 4]])
# Target (Class)
target_class = np.array([0, 0, 1, 1, 0])
        

data_features contains simplified example features used to predict target_class.

Code Implementation:


from sklearn.ensemble import RandomForestClassifier
import numpy as np

X = np.array([[1, 2], [2, 3], [3, 1], [4, 5], [5, 4]])
y = np.array([0, 0, 1, 1, 0])

model = RandomForestClassifier(n_estimators=100)
model.fit(X, y)

new_X = np.array([[3, 3]])
predicted_class = model.predict(new_X)
print(f"Predicted class for [3, 3]: {predicted_class}")
        

Code Explanation:

  1. from sklearn.ensemble import RandomForestClassifier: Imports the RandomForestClassifier class.
  2. n_estimators=100: Specifies the number of trees in the forest.
  3. The rest of the steps are similar to the Decision Tree Classifier.
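
Beyond the class label, a random forest also reports how strongly each feature drove its splits (feature_importances_) and the fraction of trees voting for each class (predict_proba). A minimal sketch on the same toy data; random_state=42 is an arbitrary seed added for reproducibility.

from sklearn.ensemble import RandomForestClassifier
import numpy as np

X = np.array([[1, 2], [2, 3], [3, 1], [4, 5], [5, 4]])
y = np.array([0, 0, 1, 1, 0])

# Fixing random_state pins the bootstrap sampling, so results repeat across runs
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X, y)

# Importance scores sum to 1 across features
print(f"Feature importances: {model.feature_importances_}")
# Probabilities reflect the share of trees voting for each class
print(f"Class probabilities for [3, 3]: {model.predict_proba(np.array([[3, 3]]))[0]}")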

5. Support Vector Machines (SVM)

SVM is a powerful algorithm for classification and regression that aims to find the optimal hyperplane to separate data points.

Use Cases:

  • Image recognition.
  • Text categorization.
  • Bioinformatics.

Sample Data:


import numpy as np
# Features (Feature A, Feature B)
data_points = np.array([[1, 1], [2, 1], [1, 2], [2, 2]])
# Target (Class)
labels = np.array([0, 1, 1, 0])
        

data_points are the features, and labels are the class assignments.

Code Implementation:


from sklearn.svm import SVC
import numpy as np

X = np.array([[1, 1], [2, 1], [1, 2], [2, 2]])
y = np.array([0, 1, 1, 0])

model = SVC(kernel='rbf')  # the sample labels are XOR-like, so a non-linear kernel is needed
model.fit(X, y)

new_X = np.array([[1.5, 1.5]])
predicted_class = model.predict(new_X)
print(f"Predicted class for [1.5, 1.5]: {predicted_class}")
        

Code Explanation:

  1. from sklearn.svm import SVC: Imports the SVC class (Support Vector Classifier).
  2. kernel='rbf': Specifies the radial basis function (RBF) kernel. The sample labels form an XOR-like pattern that no straight hyperplane can separate, so a non-linear kernel is needed; kernel='linear' is appropriate when the classes are linearly separable.
  3. The rest of the steps are similar to the classifiers above.
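
For intuition about the boundary, the fitted model exposes the support vectors (the training points that pin down the decision surface), and decision_function returns a signed score whose sign determines the predicted class. A minimal sketch on the same toy data:

from sklearn.svm import SVC
import numpy as np

X = np.array([[1, 1], [2, 1], [1, 2], [2, 2]])
y = np.array([0, 1, 1, 0])

model = SVC(kernel='rbf')
model.fit(X, y)

# The support vectors are the training points that define the boundary
print(f"Support vectors:\n{model.support_vectors_}")
# Sign of the decision value -> predicted class; magnitude -> distance from boundary
print(f"Decision value for [1.5, 1.5]: {model.decision_function(np.array([[1.5, 1.5]]))[0]:.3f}")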

6. K-Means Clustering

K-Means is an unsupervised learning algorithm used to partition a dataset into K distinct, non-overlapping clusters.

Use Cases:

  • Customer segmentation.
  • Anomaly detection.
  • Image compression.

Sample Data:


import numpy as np
# Features (Spending, Age)
customer_data = np.array([[50, 25], [200, 45], [60, 30], [180, 55], [70, 35], [220, 60]])
        

customer_data contains spending and age features for customer segmentation.

Code Implementation:


from sklearn.cluster import KMeans
import numpy as np

X = np.array([[1, 1], [1.5, 1.5], [5, 5], [5, 5.5]])

model = KMeans(n_clusters=2, n_init=10, random_state=0)  # fixed seed for reproducible clusters
model.fit(X)

new_X = np.array([[1.2, 1.2], [4.8, 4.8]])
predicted_clusters = model.predict(new_X)
cluster_centers = model.cluster_centers_
print(f"Predicted clusters for new data: {predicted_clusters}")
print(f"Cluster centers: {cluster_centers}")
        

Code Explanation:

  1. from sklearn.cluster import KMeans: Imports the KMeans class.
  2. n_clusters=2: Specifies the number of clusters to form.
  3. model.fit(X): Trains the K-Means model to find the clusters in the data.
  4. model.predict(new_X): Assigns new data points to the nearest cluster.
  5. model.cluster_centers_: Provides the coordinates of the cluster centers.
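
Applied to the customer-spending sample data above, the fitted model also exposes labels_ (the cluster assigned to each training point) and inertia_ (the within-cluster sum of squared distances, where lower means tighter clusters). A minimal sketch; n_init=10 and random_state=0 are added so the result is reproducible.

from sklearn.cluster import KMeans
import numpy as np

# Sample data from above: (spending, age)
customer_data = np.array([[50, 25], [200, 45], [60, 30], [180, 55], [70, 35], [220, 60]])

model = KMeans(n_clusters=2, n_init=10, random_state=0)
model.fit(customer_data)

# Cluster assignment of each training point
print(f"Cluster labels: {model.labels_}")
print(f"Cluster centers (spending, age): {model.cluster_centers_}")
# Within-cluster sum of squared distances
print(f"Inertia: {model.inertia_:.1f}")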

7. Principal Component Analysis (PCA)

PCA is a dimensionality reduction technique that transforms high-dimensional data into a lower-dimensional representation while retaining the most important information.

Use Cases:

  • Data visualization.
  • Feature extraction.
  • Noise reduction.

Sample Data:


import numpy as np
# Features (Feature 1, Feature 2, Feature 3)
data = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]])
        

data is a high-dimensional dataset that PCA will reduce.

Code Implementation:


from sklearn.decomposition import PCA
import numpy as np

X = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])  # illustrative 2-D points

model = PCA(n_components=1)
X_reduced = model.fit_transform(X)

print(f"Reduced data:\n{X_reduced}")
print(f"Explained variance ratio: {model.explained_variance_ratio_}")


Code Explanation:

  1. from sklearn.decomposition import PCA: Imports the PCA class.
  2. n_components=1: Specifies the number of principal components to keep.
  3. model.fit_transform(X): Learns the principal components and projects the data onto them in one step.
  4. model.explained_variance_ratio_: Reports the fraction of the total variance captured by each retained component.
