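Association mining searches a data set for frequent items. Frequent mining typically finds interesting associations and correlations between itemsets in transactional and relational databases. In short, frequent mining shows which items appear together in a transaction or relation.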
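Need for association mining:
Frequent mining generates association rules from a transactional data set. If two items X and Y are frequently bought together, it is a good idea to place them close to each other in the store, or to offer a discount on one item when the other is purchased. This can increase sales. For example, it is likely to turn out that a customer who buys milk and bread also buys butter.
The association rule is therefore ['milk'] ^ ['bread'] => ['butter']. The seller can then suggest butter to customers who buy milk and bread.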
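Important definitions:
- Support: one of the measures of interestingness. It indicates how useful and how certain a rule is. A support of 5% means that 5% of all transactions in the database follow the rule.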
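Support(A -> B) = Support_count(A ∪ B)
To express support as a fraction or percentage, divide this count by the total number of transactions in the database.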
- Confidence: a confidence of 60% means that 60% of the customers who bought milk and bread also bought butter.
Confidence(A -> B) = Support_count(A ∪ B) / Support_count(A)
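A rule that satisfies both the minimum support and the minimum confidence is a strong rule.
Support_count(X): the number of transactions in which X appears. If X is A ∪ B, it is the number of transactions in which both A and B are present.
- Maximal itemset: a frequent itemset is maximal if none of its supersets is frequent.
- Closed itemset: an itemset is closed if none of its immediate supersets has the same support count as the itemset itself.
- K-itemset: an itemset that contains K items is a K-itemset. An itemset is frequent if its support count is greater than or equal to the minimum support count.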
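The support and confidence definitions above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the helper names (`support_count`, `support`, `confidence`) and the `baskets` data are hypothetical and invented for this example. Support is returned as a fraction of all transactions, which is one common convention.

```python
from typing import Iterable, List, Set


def support_count(itemset: Iterable[str], transactions: List[Set[str]]) -> int:
    """Number of transactions that contain every item of `itemset`."""
    items = set(itemset)
    return sum(1 for t in transactions if items <= t)


def support(antecedent: Set[str], consequent: Set[str],
            transactions: List[Set[str]]) -> float:
    """Support of A -> B as a fraction: Support_count(A ∪ B) / N."""
    return support_count(antecedent | consequent, transactions) / len(transactions)


def confidence(antecedent: Set[str], consequent: Set[str],
               transactions: List[Set[str]]) -> float:
    """Confidence of A -> B: Support_count(A ∪ B) / Support_count(A)."""
    base = support_count(antecedent, transactions)
    if base == 0:
        return 0.0  # assumption: treat a rule whose antecedent never occurs as confidence 0
    return support_count(antecedent | consequent, transactions) / base


# Hypothetical grocery baskets, invented for illustration only.
baskets = [
    {"milk", "bread", "butter"},
    {"milk", "bread"},
    {"milk", "bread", "butter"},
    {"bread", "butter"},
    {"milk"},
]

rule_support = support({"milk", "bread"}, {"butter"}, baskets)        # 2 of 5 baskets
rule_confidence = confidence({"milk", "bread"}, {"butter"}, baskets)  # 2 of the 3 milk+bread baskets
print(f"support={rule_support:.2f}, confidence={rule_confidence:.2f}")
```

With minimum thresholds of, say, 30% support and 60% confidence (also assumed values), this rule would count as strong.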
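Example of finding frequent itemsets:
Consider the given data set with the given transactions.
- Let the minimum support count be 3.
- The following relationship holds: maximal frequent => closed => frequent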
1-frequent:
{A} = 3; // not closed due to {A, C} and not maximal
{B} = 4; // not closed due to {B, D} and not maximal
{C} = 4; // not closed due to {C, D} and not maximal
{D} = 5; // closed itemset since no immediate superset has the same count. Not maximal
2-frequent:
{A, B} = 2 // not frequent because support count < minimum support count so ignore
{A, C} = 3 // not closed due to {A, C, D}
{A, D} = 3 // not closed due to {A, C, D}
{B, C} = 3 // not closed due to {B, C, D}
{B, D} = 4 // closed but not maximal due to {B, C, D}
{C, D} = 4 // closed but not maximal due to {B, C, D}
3-frequent:
{A, B, C} = 2 // ignore not frequent because support count < minimum support count
{A, B, D} = 2 // ignore not frequent because support count < minimum support count
{A, C, D} = 3 // maximal frequent
{B, C, D} = 3 // maximal frequent
4-frequent:
{A, B, C, D} = 2 // ignore, not frequent because support count < minimum support count
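The classification above can be checked with a brute-force sketch. The `transactions` list below is hypothetical: it is not the original data set, only one set of five transactions constructed so that every support count matches the listing above. The names `MIN_SUPPORT_COUNT`, `support_count`, and `immediate_supersets` are likewise chosen for illustration. Enumerating every itemset is only practical for a handful of items.

```python
from itertools import combinations

# Hypothetical transactions, constructed so that the support counts
# agree with the worked example (e.g. {A} = 3, {D} = 5, {B, C, D} = 3).
transactions = [
    {"A", "B", "C", "D"},
    {"A", "B", "C", "D"},
    {"A", "C", "D"},
    {"B", "C", "D"},
    {"B", "D"},
]
MIN_SUPPORT_COUNT = 3
items = sorted(set().union(*transactions))


def support_count(itemset):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)


# Support count of every non-empty itemset (brute force).
counts = {
    frozenset(combo): support_count(frozenset(combo))
    for k in range(1, len(items) + 1)
    for combo in combinations(items, k)
}

# Frequent: support count >= minimum support count.
frequent = {s for s, n in counts.items() if n >= MIN_SUPPORT_COUNT}


def immediate_supersets(itemset):
    """All itemsets obtained by adding exactly one more item."""
    return [itemset | {i} for i in items if i not in itemset]


# Closed: no immediate superset has the same support count.
closed = {
    s for s in frequent
    if all(counts[sup] < counts[s] for sup in immediate_supersets(s))
}

# Maximal: no immediate superset is frequent. Checking immediate supersets
# is enough, because every subset of a frequent itemset is itself frequent.
maximal = {
    s for s in frequent
    if all(sup not in frequent for sup in immediate_supersets(s))
}

for name, group in (("frequent", frequent), ("closed", closed), ("maximal", maximal)):
    print(name, sorted("".join(sorted(s)) for s in group))
```

Under these assumptions the maximal itemsets are {A, C, D} and {B, C, D}, and the closed ones additionally include {D}, {B, D}, and {C, D}, which is consistent with maximal frequent => closed => frequent.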