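Introduction to Explainable AI (XAI) using LIME

Motivating Explainable AI

In recent years, the vast field of Artificial Intelligence (AI) has experienced tremendous growth. Newer and more complex models appear every year, and on a growing number of tasks they match or exceed human performance. But as results become more accurate and precise, it becomes harder to explain the reasoning behind the complex mathematical decisions these models make. This mathematical abstraction also does little to help users maintain their trust in a particular model's decisions.

For example, say a deep learning model takes in an image and predicts with 70% confidence that a patient has lung cancer. Even if the diagnosis is correct, a doctor cannot confidently advise the patient without knowing the reasoning behind the model's diagnosis.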

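This is where Explainable AI (more commonly known as XAI) comes in! Explainable AI refers collectively to techniques or methods that help explain a given AI model's decision-making process. This relatively new branch of AI has shown enormous potential, with newer and more sophisticated techniques appearing every year. Some of the best-known XAI techniques include SHAP (SHapley Additive exPlanations), DeepSHAP, DeepLIFT, CXplain, and LIME. This article covers LIME in detail.

Introducing LIME (Local Interpretable Model-agnostic Explanations)

The beauty of LIME lies in its accessibility and simplicity. The core idea behind LIME, though thorough, is really intuitive and simple! Let's dive in and see what the name itself represents:

Model agnosticism refers to the property of LIME by which it can explain any given supervised learning model by treating it as a "black box". This means that LIME can handle almost any model that exists out there in the wild!
Local explanations means that the explanations LIME gives are locally faithful within the surroundings or vicinity of the observation/sample being explained.

Although LIME in its current state is limited to supervised machine learning and deep learning models, it is one of the most popular and widely used XAI methods. LIME has a rich open-source API available in R and Python, and a large user base, with almost 8k stars and 2k forks on its GitHub repository at the time of writing.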

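How does LIME work?

Broadly speaking, when given a prediction model and a test sample, LIME performs the following steps:

Sampling and obtaining a surrogate dataset: LIME provides locally faithful explanations around the vicinity of the instance being explained. By default, it produces 5000 samples (see the num_samples variable) of the feature vector following the normal distribution. It then obtains the target variable for these 5000 samples using the prediction model whose decisions it is trying to explain.
Feature selection from the surrogate dataset: After obtaining the surrogate dataset, it weighs each row according to how close it is to the original sample/observation. It then uses a feature selection technique such as Lasso to obtain the most important features.

LIME also fits a Ridge Regression model on the samples using only the obtained features. The prediction it outputs should, in theory, be similar in magnitude to the one output by the original prediction model. This is done to stress the relevance and importance of the obtained features.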
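
The steps above can be sketched from scratch with NumPy. This is a simplified illustration under stated assumptions, not LIME's actual implementation: the names lime_sketch and weighted_ridge, the scale argument, and the exponential kernel with its width are choices made for this sketch, and feature selection here keeps the largest coefficients of a weighted ridge fit instead of running Lasso.

```python
import numpy as np


def weighted_ridge(A, y, w, alpha=1.0):
    """Closed-form ridge regression with per-row weights w and a free intercept."""
    A1 = np.hstack([np.ones((A.shape[0], 1)), A])   # prepend an intercept column
    penalty = alpha * np.eye(A1.shape[1])
    penalty[0, 0] = 0.0                              # do not shrink the intercept
    beta = np.linalg.solve(A1.T @ (w[:, None] * A1) + penalty, A1.T @ (w * y))
    return beta[0], beta[1:]


def lime_sketch(predict_fn, x, scale, num_features=5, num_samples=5000, seed=0):
    """Toy walk-through of the LIME steps for one tabular sample x (assumed design)."""
    rng = np.random.default_rng(seed)
    d = x.shape[0]

    # Step 1: sample a neighbourhood around x from a normal distribution
    # and label it with the black-box model (the surrogate dataset).
    Z = rng.normal(size=(num_samples, d))            # standardized perturbations
    Z[0] = 0.0                                       # first row is x itself
    y = predict_fn(x + Z * scale)

    # Step 2: weight each row by its proximity to x (assumed exponential kernel).
    kernel_width = 0.75 * np.sqrt(d)
    dist = np.linalg.norm(Z, axis=1)
    w = np.exp(-(dist ** 2) / kernel_width ** 2)

    # Step 3: feature selection, simplified to "largest weighted-ridge coefficients".
    _, full_coefs = weighted_ridge(Z, y, w)
    top = np.argsort(-np.abs(full_coefs))[:num_features]

    # Step 4: refit a ridge model on the selected features only.
    intercept, coefs = weighted_ridge(Z[:, top], y, w)
    local_pred = intercept + coefs @ Z[0, top]       # local model's prediction at x
    return top, coefs, intercept, local_pred


# A made-up black box that only depends on features 0 and 2
def black_box(X):
    return 3.0 * X[:, 0] - 2.0 * X[:, 2] + 1.0


x0 = np.array([1.0, 5.0, -2.0, 0.5])
top, coefs, intercept, local_pred = lime_sketch(
    black_box, x0, scale=np.ones(4), num_features=2)
print("selected features:", top)
print("local prediction:", local_pred, "black-box prediction:", black_box(x0[None, :])[0])
```

On this toy black box the sketch should pick out features 0 and 2, and the local prediction should land close to the black-box prediction, which is the "similar in magnitude" check described above.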

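In this article we won't go deep into the technical and mathematical details behind the internals of LIME. If you're interested, you can read the foundational research paper listed in the references. Now, on to the more interesting part, the code!

Installing LIME

Coming to the installation part, we can install LIME in Python using either pip or conda.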

pip install lime

or

conda install -c conda-forge lime

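Before going ahead, here are some key points that will help in gaining a better understanding of the overall workflow around LIME.

Dataset description:

LIME in its current state can only give explanations for the following types of datasets:

Tabular datasets (lime.lime_tabular.LimeTabularExplainer): e.g. regression and classification datasets
Image-related datasets (lime.lime_image.LimeImageExplainer)
Text-related datasets (lime.lime_text.LimeTextExplainer)

Since this is an introductory article, we'll keep things simple and use a tabular dataset. More specifically, we'll use the Boston house pricing dataset for our analysis, loaded through Scikit-learn utilities.

Prediction model used:

Since LIME is model agnostic in nature, it can handle almost any model thrown at it. To stress this fact, we'll use an Extra-trees regressor, via Scikit-learn utilities, as the prediction model whose decisions we're trying to investigate.

Introducing LimeTabularExplainer

As mentioned above, we'll use a tabular dataset for our analysis. To handle such datasets, LIME's API offers LimeTabularExplainer.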

Syntax: lime.lime_tabular.LimeTabularExplainer(training_data, mode, feature_names, verbose)

Parameters:

training_data – 2d array consisting of the training dataset
mode – Depends on the problem; “classification” or “regression”
feature_names – list of titles corresponding to the columns in the training dataset. If not mentioned, it uses the column indices.
verbose – if true, print local prediction values from the regression model trained on the samples using only the obtained features

Once instantiated, we'll use a method of the explainer object we defined to explain a given test sample.

Syntax: explain_instance(data_row, predict_fn, num_features=10, num_samples=5000)

Parameters:

data_row – 1d array containing values corresponding to the test sample being explained
predict_fn – Prediction function used by the prediction model
num_features – maximum number of features present in explanation
num_samples – size of the neighborhood to learn the linear model

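For brevity, only some of the parameters are mentioned in the two syntaxes above. The remaining parameters, most of which default to sensibly chosen values, can be checked out in the official LIME documentation.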
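
The predict_fn argument is the only point of contact between LIME and the model. As a rough sketch of the contract described in the LIME documentation (a 2d array of rows goes in; a 1d array of predictions comes out for regression, or a 2d array of class probabilities for classification), the helper check_predict_fn below checks a candidate function before it is handed to explain_instance. The helper and the two toy models are hypothetical and are not part of LIME.

```python
import numpy as np


def check_predict_fn(predict_fn, rows, mode):
    """Hypothetical helper (not part of LIME): sanity-check a prediction function."""
    rows = np.atleast_2d(np.asarray(rows, dtype=float))
    out = np.asarray(predict_fn(rows))
    n = rows.shape[0]
    if mode == "regression":
        # Expect one predicted value per input row
        return out.shape == (n,)
    if mode == "classification":
        # Expect one row of class probabilities per input row, each summing to 1
        return (out.ndim == 2 and out.shape[0] == n
                and bool(np.allclose(out.sum(axis=1), 1.0)))
    raise ValueError("mode must be 'regression' or 'classification'")


# Toy stand-ins for a fitted model's predict / predict_proba methods
def toy_regressor(X):
    return X.sum(axis=1)


def toy_classifier(X):
    p = 1.0 / (1.0 + np.exp(-X.sum(axis=1)))
    return np.column_stack([1.0 - p, p])


rows = np.array([[0.1, 0.2], [1.0, -1.0]])
print(check_predict_fn(toy_regressor, rows, "regression"))
print(check_predict_fn(toy_classifier, rows, "classification"))
```

For the model used in this article, the function to check in "regression" mode would be reg.predict; a scikit-learn classifier would typically pass its predict_proba method instead.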

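Workflow

Data preprocessing
Training an Extra-trees regressor on the dataset
Obtaining explanations for a given test sample

Analysis

1. Importing the data from Scikit-learn utilities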

Python

# Importing the necessary libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
 
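# Loading the dataset using sklearn
# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this import requires scikit-learn < 1.2
from sklearn.datasets import load_boston
data = load_boston()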
 
# Displaying relevant information about the data
print(data['DESCR'][200:1420])

Output:

Jupyter notebook output of the code above

2. Extracting the feature matrix X and target variable y, and making the train-test split

Python

# Separating data into feature variable X and target variable y respectively
from sklearn.model_selection import train_test_split
X = data['data']
y = data['target']
 
# Extracting the names of the features from data
features = data['feature_names']
 
# Splitting X & y into training and testing set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.90, random_state=50)
 
# Creating a dataframe of the data, for a visual check
df = pd.concat([pd.DataFrame(X), pd.DataFrame(y)], axis=1)
df.columns = np.concatenate((features, np.array(['label'])))
print("Shape of data =", df.shape)
 
# Printing the top 5 rows of the dataframe
df.head()

Output:

Jupyter notebook output of the code above

3. Instantiating the prediction model and training it on (X_train, y_train)

Python

# Instantiating the prediction model - an extra-trees regressor
from sklearn.ensemble import ExtraTreesRegressor
reg = ExtraTreesRegressor(random_state=50)
 
# Fitting the prediction model onto the training set
reg.fit(X_train, y_train)
 
# Checking the model's performance on the test set
print('R2 score for the model on test set =', reg.score(X_test, y_test))

Output:

Jupyter notebook output of the code above

4. Instantiating the explainer object

Python

# Importing the module for LimeTabularExplainer
import lime.lime_tabular
 
# Instantiating the explainer object by passing in the training set, and the extracted features
explainer_lime = lime.lime_tabular.LimeTabularExplainer(X_train,
                                                        feature_names=features,
                                                        verbose=True, mode='regression')

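5. Getting explanations by calling the explain_instance() method

Suppose we want to explore the prediction model's reasoning behind the prediction it gave for the i-th test vector.
Moreover, say we want to visualize the top k features that led to this reasoning.

For this article, we give explanations for two combinations of i & k:

5.1 Explaining the decisions for i=10, k=5

We're basically asking LIME to explain the decisions behind the prediction for the test vector at index 10 (X_test[10]) by displaying the top 5 features that contributed to the model's prediction.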

Python

# Index corresponding to the test vector
i = 10
 
# Number denoting the top features
k = 5
 
# Calling the explain_instance method by passing in the:
#    1) ith test vector
#    2) prediction function used by our prediction model('reg' in this case)
#    3) the top features which we want to see, denoted by k
exp_lime = explainer_lime.explain_instance(
    X_test[i], reg.predict, num_features=k)
 
# Finally visualizing the explanations
exp_lime.show_in_notebook()

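Output:

Jupyter notebook output of the code above

Interpreting the output:

LIME outputs a lot of information! Let's go through what it's trying to convey step by step.

First off, we see three values just above the visualizations:
1. Right: This denotes the prediction given by our prediction model (an extra-trees regressor in this case) for the given test vector.
2. Prediction_local: This denotes the value output by a linear model trained on the perturbed samples (obtained by sampling around the test vector following a normal distribution), using only the top k features output by LIME.
3. Intercept: The intercept is the constant part of the prediction that the above linear model gives for the given test vector.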

$\text{prediction\_local} = w_1 x_1 + w_2 x_2 + \dots + w_k x_k + \text{Intercept}$
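
As a small worked example of this formula, the snippet below evaluates it for made-up numbers; none of the values come from the notebook output. It also assumes that each x_i is a 0/1 indicator of whether the sample falls into the feature's displayed range, which is how we understand LIME's tabular explainer to work with its default discretization, so that for the explained sample prediction_local reduces to the intercept plus the sum of the displayed weights.

```python
def prediction_local(weights, values, intercept):
    """Evaluate w_1*x_1 + ... + w_k*x_k + Intercept for the local linear model."""
    return intercept + sum(w * x for w, x in zip(weights, values))


# Made-up illustration values (not taken from the article's output)
weights = [-4.0, -2.5, 0.5]   # w_1 .. w_k reported for the top k features
values = [1, 1, 1]            # assumed 0/1 indicators, all 1 for the explained sample
intercept = 25.0

print(prediction_local(weights, values, intercept))  # 19.0
```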

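Coming to the visualization, we can see the colors blue and orange, depicting negative and positive associations respectively.
- To interpret the results above, we can conclude that the relatively low price of the house depicted by the given vector (denoted by the bar on the left) can be attributed to the following socio-economic reasons:
  - The high value of LSTAT indicates a lower status of the society in terms of education and unemployment
  - The high value of PTRATIO indicates a high number of students per teacher
  - The high value of DIS indicates a large distance from employment centers
  - The low value of RM indicates fewer rooms per dwelling
- We can also see that the low value of NOX, indicating a low concentration of nitric oxides in the air, increases the value of the house to a small extent.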
We can see how easy it has become to relate the decisions taken by a relatively complex prediction model (an extra-trees regressor) to the input features in an interpretable and meaningful way. Let's try this exercise on one more test vector!

5.2 Explaining the decisions for i=47, k=5

Here again we're asking LIME to explain the decisions behind the prediction for the test vector at index 47 (X_test[47]) by displaying the top 5 features that contributed to the model's prediction.

Python

# Index corresponding to the test vector
i = 47
 
# Number denoting the top features
k = 5
 
# Calling the explain_instance method by passing in the:
#    1) ith test vector
#    2) prediction function used by our prediction model('reg' in this case)
#    3) the top features which we want to see, denoted by k
exp_lime = explainer_lime.explain_instance(
    X_test[i], reg.predict, num_features=k)
 
# Finally visualizing the explanations
exp_lime.show_in_notebook()

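Output:

Jupyter notebook output of the code above

Interpreting the output:

From the visualization, we can conclude that the relatively high price of the house depicted by the given vector (denoted by the bar on the left) can be attributed to the following socio-economic reasons:
- The low value of LSTAT indicates a higher status of the society in terms of education and employability
- The high value of RM indicates a high number of rooms per dwelling
- The low value of TAX indicates a low property tax rate
- The low value of AGE indicates that the housing is relatively new
We can also see that the average value of INDUS, which indicates a low number of non-retail businesses near the society, lowers the value of the house to a small extent.

Summary:

This article is a brief introduction to Explainable AI (XAI) using LIME in Python. It's evident how useful LIME can be in giving us a deep intuition behind a given black-box model's decision-making process while providing solid insights into the underlying dataset. This makes LIME a useful resource for AI researchers and data scientists alike!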

References:

https://lime-ml.readthedocs.io/en/latest/
https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.ExtraTreesRegressor.html
https://scikit-learn.org/0.16/modules/generated/sklearn.datasets.load_boston.html#sklearn.datasets.load_boston
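Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. "Why Should I Trust You?": Explaining the Predictions of Any Classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1135-1144. Association for Computing Machinery, 2016.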