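📅  Last modified: 2023-12-03 14:55:57.320000             🧑  Author: Mango
This article shows how to implement the naive Bayes algorithm from scratch in Python, in the following main steps:
We use the UCI iris dataset to demonstrate the naive Bayes algorithm. The first step is to split the dataset into a training set and a test set, with the following code: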
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.2, random_state=42)
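To make the split step concrete, here is a rough standard-library sketch of what a shuffled 80/20 hold-out split does. The helper `simple_train_test_split` and the `rows` stand-in are hypothetical names introduced for illustration; this is not scikit-learn's implementation and will not reproduce the same rows as `train_test_split`.

```python
import random


def simple_train_test_split(rows, test_size=0.2, seed=42):
    """Shuffle row indices with a fixed seed and hold out a fraction for testing."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_test = round(len(rows) * test_size)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return [rows[i] for i in train_idx], [rows[i] for i in test_idx]


# Hypothetical stand-in for the 150 iris rows
rows = list(range(150))
train_rows, test_rows = simple_train_test_split(rows)
print(len(train_rows), len(test_rows))  # 120 30
```

Fixing the seed, as `random_state=42` does in the call above, makes the split reproducible from run to run.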
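Next, we need to estimate the probability of each feature under each class. Concretely, for a discrete feature, count how many times each feature value occurs among the samples of each class, then divide by the number of samples in that class, i.e.: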
$$P(feature|class) = \frac{count(feature, class)}{count(class)}$$
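As a minimal sketch of the counting formula above, assume a toy dataset with a single categorical feature. The data and the helper `conditional_probability` are hypothetical and exist only to illustrate the ratio.

```python
from collections import Counter

# Hypothetical toy data: one categorical feature ("weather") and a class label
samples = [
    ("sunny", "play"),
    ("sunny", "play"),
    ("rainy", "play"),
    ("rainy", "stay"),
    ("rainy", "stay"),
]

class_counts = Counter(label for _, label in samples)
pair_counts = Counter(samples)  # counts of (feature value, class) pairs


def conditional_probability(feature, label):
    """P(feature | class) = count(feature, class) / count(class)."""
    return pair_counts[(feature, label)] / class_counts[label]


print(conditional_probability("sunny", "play"))  # 2 of the 3 "play" samples
print(conditional_probability("rainy", "stay"))  # 2 of the 2 "stay" samples
```

A feature value never seen with a class gets probability 0 under this estimate, which is why counting-based variants are commonly combined with Laplace smoothing.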
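The iris features are continuous measurements, so counting exact values does not work well for them. The implementation below is therefore the Gaussian variant: for each class it estimates the mean $\mu$ and variance $\sigma^2$ of every feature, and uses the normal density as the conditional probability:
$$P(x_i|class) = \frac{1}{\sqrt{2\pi\sigma_{class,i}^2}} \exp\left(-\frac{(x_i - \mu_{class,i})^2}{2\sigma_{class,i}^2}\right)$$
The code is as follows: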
import numpy as np


class NaiveBayes:
    def fit(self, X, y):
        n_samples, n_features = X.shape
        self._classes = np.unique(y)
        n_classes = len(self._classes)
        # Per-class mean and variance of every feature, plus per-class sample counts
        self._mean = np.zeros((n_classes, n_features))
        self._variance = np.zeros((n_classes, n_features))
        self._class_count = np.zeros(n_classes)
        # Index rows by position (idx), not by label value, so labels need not be 0..n-1
        for idx, c in enumerate(self._classes):
            X_class = X[y == c]
            self._mean[idx, :] = X_class.mean(axis=0)
            # The small constant is an added safeguard against zero variance
            self._variance[idx, :] = X_class.var(axis=0) + 1e-9
            self._class_count[idx] = X_class.shape[0]
        # Prior probability of each class
        self._class_probabilities = self._class_count / n_samples

    def predict(self, X):
        y_pred = [self._predict(x) for x in X]
        return y_pred

    def _predict(self, x):
        posteriors = []
        for idx, c in enumerate(self._classes):
            # Log prior of the class
            prior = np.log(self._class_probabilities[idx])
            # Sum of the log conditional densities of the features
            posterior = np.sum(np.log(self._pdf(idx, x)))
            posterior = prior + posterior
            posteriors.append(posterior)
        # Return the class with the highest log posterior
        return self._classes[np.argmax(posteriors)]

    def _pdf(self, class_idx, x):
        # Gaussian density of each feature under the given class
        mean = self._mean[class_idx]
        variance = self._variance[class_idx]
        numerator = np.exp(-((x - mean)**2) / (2 * variance))
        denominator = np.sqrt(2 * np.pi * variance)
        return numerator / denominator
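The `_predict` method sums log densities rather than multiplying the densities directly. The sketch below illustrates one reason this is a sensible choice: with many features, a product of small densities can underflow to zero, while the sum of logs stays finite. The helpers `gaussian_pdf` and `gaussian_log_pdf` and the 1000-feature setup are hypothetical; the iris data has only four features, so underflow is unlikely to occur there.

```python
import math


def gaussian_pdf(x, mean, variance):
    """Normal density, the same formula as _pdf above but for a single scalar."""
    return math.exp(-((x - mean) ** 2) / (2 * variance)) / math.sqrt(2 * math.pi * variance)


def gaussian_log_pdf(x, mean, variance):
    """Log of the normal density, computed without forming the density itself."""
    return -0.5 * math.log(2 * math.pi * variance) - ((x - mean) ** 2) / (2 * variance)


# Hypothetical setup: 1000 features, each observed 3 standard deviations from its mean
n_features = 1000
x, mean, variance = 3.0, 0.0, 1.0

product = 1.0
for _ in range(n_features):
    product *= gaussian_pdf(x, mean, variance)

log_sum = sum(gaussian_log_pdf(x, mean, variance) for _ in range(n_features))

print(product)  # underflows to 0.0
print(log_sum)  # a finite negative number
```

Because the logarithm is monotonic, taking the argmax over log posteriors selects the same class as taking it over the posteriors themselves.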
Finally, we can use the trained model to make predictions on the test set. The code is as follows:
nb = NaiveBayes()
nb.fit(X_train, y_train)
y_pred = nb.predict(X_test)
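The walkthrough stops at prediction. One simple way to check the result is accuracy, the fraction of test samples that are labeled correctly. The `accuracy` helper and the example labels below are hypothetical and shown only as a sketch; with the model above one would call `accuracy(y_test, y_pred)` instead.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Hypothetical labels, for illustration only
example_true = [0, 1, 2, 2, 1]
example_pred = [0, 1, 2, 1, 1]
print(accuracy(example_true, example_pred))  # 0.8
```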
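This article showed how to implement the naive Bayes algorithm in Python. Specifically, we split the data into a training set and a test set, estimated the per-class statistics of each feature (here, the mean and variance) along with the class priors, and then classified new samples using Bayes' theorem.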