Must-Know Data Science Algorithms and Their Use Cases: Part 2

Estimated reading time: 6 minutes

Another Top 5 Data Science Algorithms

8. Naive Bayes

Naive Bayes is a probabilistic classification algorithm based on Bayes’ theorem, assuming independence between features.
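
In symbols, the “naive” independence assumption means the probability of a class given the features factorizes over the individual features. The standard form of Bayes’ rule under this assumption (shown here only for reference) is:

P(y \mid x_1, \dots, x_n) \propto P(y) \prod_{i=1}^{n} P(x_i \mid y)

The predicted class is the y that maximizes this product.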

Use Cases:

  • Text classification.
  • Spam filtering.
  • Sentiment analysis.

Sample Data:


import numpy as np
# Features (Word count 1, Word count 2)
text_counts = np.array([[2, 1], [0, 3], [1, 0], [3, 2]])
# Target (Class)
text_labels = np.array([0, 1, 0, 1])
        

text_counts represents the frequency of two words in different documents, and text_labels are the corresponding document classes.

Code Implementation:


from sklearn.naive_bayes import GaussianNB
import numpy as np

X = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])
y = np.array([0, 1, 0, 1])

model = GaussianNB()
model.fit(X, y)

new_X = np.array([[2, 3]])
predicted_class = model.predict(new_X)
predicted_probability = model.predict_proba(new_X)
print(f"Predicted class for [2, 3]: {predicted_class}")
print(f"Predicted probabilities for [2, 3]: {predicted_probability}")
        

Code Explanation:

  1. from sklearn.naive_bayes import GaussianNB: Imports the GaussianNB class, suitable for continuous features. For discrete features like word counts, MultinomialNB or BernoulliNB might be more appropriate (a short sketch follows this list).
  2. X: Independent features.
  3. y: Target variable.
  4. model = GaussianNB(): Initializes a Gaussian Naive Bayes model.
  5. model.fit(X, y): Trains the model.
  6. model.predict(new_X): Predicts the class label for new data.
  7. model.predict_proba(new_X): Predicts the probability of each class for new data.
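
As noted in step 1, word counts are discrete, so MultinomialNB is usually the better fit for data like the text_counts sample above. A minimal sketch, assuming the toy word counts and labels from the sample data (the interface is the same as for GaussianNB):

from sklearn.naive_bayes import MultinomialNB
import numpy as np

# Toy word-count features and document classes from the sample data above
text_counts = np.array([[2, 1], [0, 3], [1, 0], [3, 2]])
text_labels = np.array([0, 1, 0, 1])

nb_model = MultinomialNB()
nb_model.fit(text_counts, text_labels)

# Counts of the two words in a new, unseen document
new_doc = np.array([[1, 2]])
print(f"Predicted class: {nb_model.predict(new_doc)}")
print(f"Predicted probabilities: {nb_model.predict_proba(new_doc)}")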

9. Gradient Boosting Machines (GBM)

Gradient Boosting Machines are an ensemble learning method that builds multiple decision trees sequentially, with each new tree trying to correct the errors made by the previous ones.
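
The core idea can be sketched in a few lines: keep a running prediction and repeatedly fit a small tree to the current errors. This is a minimal illustration with squared-error loss and made-up data, not scikit-learn's internal implementation:

from sklearn.tree import DecisionTreeRegressor
import numpy as np

X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([10.0, 15.0, 20.0, 25.0])

learning_rate = 0.1
prediction = np.full_like(y, y.mean())    # start from a constant prediction

for _ in range(50):
    residuals = y - prediction             # errors of the current ensemble
    tree = DecisionTreeRegressor(max_depth=1)
    tree.fit(X, residuals)                 # each new tree fits the previous errors
    prediction += learning_rate * tree.predict(X)

print(prediction)  # moves steadily closer to y as more trees are added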

Use Cases:

  • Predictive modeling on structured data.
  • Ranking problems.
  • Risk assessment.

Sample Data:


import numpy as np
# Features (Feature 1, Feature 2, Feature 3)
feature_data = np.array([[1, 0.1, 5], [2, 0.5, 3], [3, 0.2, 7], [4, 0.8, 2]])
# Target (Continuous value for regression, class for classification)
target_values = np.array([10, 25, 15, 30])
        

feature_data contains multiple features, and target_values holds the corresponding targets (continuous values, since this example is a regression).

Code Implementation (Regression Example):


from sklearn.ensemble import GradientBoostingRegressor
import numpy as np

X = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])
y = np.array([10, 15, 20, 25])

model = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1, max_depth=3)
model.fit(X, y)

new_X = np.array([[8, 9]])
predicted_value = model.predict(new_X)
print(f"Predicted value for [8, 9]: {predicted_value}")
        

Code Explanation:

  1. from sklearn.ensemble import GradientBoostingRegressor: Imports the GradientBoostingRegressor class for regression tasks. For classification, use GradientBoostingClassifier.
  2. n_estimators=100: The number of boosting stages to perform.
  3. learning_rate=0.1: Shrinks the contribution of each tree to prevent overfitting.
  4. max_depth=3: The maximum depth of the individual regression estimators.
  5. The rest of the steps are similar to other regression algorithms; a brief classification variant is sketched below.
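
For classification, the same pattern applies with GradientBoostingClassifier. A minimal sketch, reusing the toy features and labels from the Naive Bayes example above:

from sklearn.ensemble import GradientBoostingClassifier
import numpy as np

X = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])
y = np.array([0, 1, 0, 1])

clf = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=3)
clf.fit(X, y)

new_X = np.array([[2, 3]])
print(f"Predicted class for [2, 3]: {clf.predict(new_X)}")
print(f"Predicted probabilities for [2, 3]: {clf.predict_proba(new_X)}")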

10. Artificial Neural Networks (ANN) / Multi-layer Perceptron (MLP)

Artificial Neural Networks are a family of algorithms loosely modeled on the human brain and designed to recognize patterns in data. A Multi-layer Perceptron (MLP) is a type of feedforward ANN.
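
Here, “feedforward” means data flows in one direction, from the input layer through hidden layers to the output. A minimal forward pass for a single hidden layer might look like the sketch below; the weights are made up purely for illustration, whereas in practice they are learned during training:

import numpy as np

def relu(z):
    return np.maximum(0, z)

# One input example with two features
x = np.array([1.0, 0.0])

# Made-up weights and biases: two hidden neurons, one output neuron
W1 = np.array([[0.5, -0.3],
               [0.8,  0.2]])
b1 = np.array([0.1, 0.0])
W2 = np.array([0.7, -0.5])
b2 = 0.05

hidden = relu(W1 @ x + b1)                        # hidden layer activations
output = 1 / (1 + np.exp(-(W2 @ hidden + b2)))    # sigmoid output, a score in (0, 1)
print(output)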

Use Cases:

  • Image and speech recognition.
  • Classification and regression on complex, non-linear data.
  • Forecasting and pattern recognition.

Sample Data:


import numpy as np
# Features (Feature 1, Feature 2)
input_features = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
# Target (Binary output for classification)
output_labels = np.array([0, 1, 1, 0]) # Example for XOR
        

input_features are the input to the neural network, and output_labels are the target classes.

Code Implementation (Classification Example):


from sklearn.neural_network import MLPClassifier
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 0])

model = MLPClassifier(hidden_layer_sizes=(2,), activation='relu', solver='adam', max_iter=200)
model.fit(X, y)

new_X = np.array([[1, 0]])
predicted_class = model.predict(new_X)
predicted_probability = model.predict_proba(new_X)
print(f"Predicted class for [1, 0]: {predicted_class}")
print(f"Predicted probabilities for [1, 0]: {predicted_probability}")
        

Code Explanation:

  1. from sklearn.neural_network import MLPClassifier: Imports the MLPClassifier class for classification. For regression, use MLPRegressor.
  2. hidden_layer_sizes=(2,): Defines a single hidden layer with 2 neurons. You can add more layers and neurons.
  3. activation='relu': The activation function used by the neurons. Other options include ‘tanh’, ‘logistic’.
  4. solver='adam': The algorithm used for training the network. Other options include ‘lbfgs’, ‘sgd’.
  5. max_iter=200: Maximum number of iterations for training.
  6. The rest of the steps are similar to other classification algorithms. Note that neural networks often require more data and careful tuning of hyperparameters.
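
Because training is sensitive to feature scales and random initialization, a common pattern is to standardize the inputs and fix a random seed. Below is a minimal sketch using a scikit-learn pipeline; the layer size, iteration count, and seed are arbitrary choices, and a dataset this tiny may still not be learned reliably:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPClassifier
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 0])

# Standardize the features, then train the network in a single fit call
pipeline = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(4,), max_iter=2000, random_state=0)
)
pipeline.fit(X, y)
print(pipeline.predict(X))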

11. Association Rule Mining (Apriori Algorithm)

Apriori is an algorithm for frequent itemset mining and association rule learning over transactional databases.
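
Two standard quantities drive the algorithm and appear as thresholds in the code below (these are the usual textbook definitions, not anything specific to a particular library):

\text{support}(A) = \frac{\text{number of transactions containing } A}{\text{total number of transactions}}, \qquad \text{confidence}(A \Rightarrow B) = \frac{\text{support}(A \cup B)}{\text{support}(A)}

An itemset is “frequent” when its support meets the minimum support threshold, and a rule A ⇒ B is kept when its confidence meets the minimum confidence threshold.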

Use Cases:

  • Market basket analysis (e.g., what items are frequently bought together?).
  • Recommendation systems.
  • Web usage mining.

Sample Data:


# Sample transactional data
transactions = [
    ['milk', 'bread', 'butter'],
    ['bread', 'butter'],
    ['milk', 'butter'],
    ['milk', 'bread', 'eggs'],
    ['milk', 'bread', 'butter', 'eggs'],
]
        

transactions is a list of lists, where each inner list represents a transaction containing items.

Code Implementation:


from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules
import pandas as pd

# Sample transactional data
transactions = [
    ['milk', 'bread', 'butter'],
    ['bread', 'butter'],
    ['milk', 'butter'],
    ['milk', 'bread', 'eggs'],
    ['milk', 'bread', 'butter', 'eggs'],
]

# Convert transaction data into a one-hot encoded DataFrame
from mlxtend.preprocessing import TransactionEncoder
te = TransactionEncoder()
te_array = te.fit(transactions).transform(transactions)
df = pd.DataFrame(te_array, columns=te.columns_)

# Find frequent itemsets
frequent_itemsets = apriori(df, min_support=0.4, use_colnames=True)
print("Frequent Itemsets:\n", frequent_itemsets)

# Generate association rules
rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.6)
print("\nAssociation Rules:\n", rules)
        

Code Explanation:

  1. from mlxtend.frequent_patterns import apriori and association_rules: Imports the necessary functions from the mlxtend library. Make sure you have it installed (`pip install mlxtend pandas`).
  2. import pandas as pd: Imports the pandas library for data manipulation.
  3. The sample transactions data is defined.
  4. TransactionEncoder is used to convert the list of transactions into a one-hot encoded pandas DataFrame, which is required by the apriori function.
  5. apriori(df, min_support=0.4, use_colnames=True): Runs the Apriori algorithm to find frequent itemsets with a minimum support of 40%. use_colnames=True ensures that item names are used instead of column indices.
  6. association_rules(frequent_itemsets, metric="confidence", min_threshold=0.6): Generates association rules from the frequent itemsets with a minimum confidence of 60%. Other metrics like ‘lift’ can also be used.
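
As a follow-up to step 6, lift above 1 indicates items that appear together more often than expected by chance, so rules are often filtered on it. A short sketch on the rules DataFrame from the code above (the column names are those produced by mlxtend's association_rules; the thresholds are arbitrary):

# Keep rules with a positive association (lift > 1) and reasonably high confidence
strong_rules = rules[(rules['lift'] > 1.0) & (rules['confidence'] >= 0.8)]
print(strong_rules[['antecedents', 'consequents', 'support', 'confidence', 'lift']])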
