Association Rule Analysis

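Association Analysis

Association analysis finds association rules (if x occurs, then y occurs) between individual items in a large volume of transaction data. For example, from a supermarket's purchase history, the frequency with which items are sold can reveal a rule such as "people who buy item A tend to buy item B." It is also called Market Basket Analysis.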

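Association Rules

Because a rule is expressed in terms of how often its condition and its result occur, the results are relatively easy to interpret. Netflix has also applied association rules to its recommendation algorithm: it built a content recommendation model on conditional probabilities that measure how much watching movie A affects the likelihood of choosing movie B or C.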

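1. Support

The proportion of all transactions that contain both item A and item B.
e.g., the probability that a shopping list contains both milk and bread

Support(A, B) = (number of transactions containing both A and B) / (total number of transactions)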

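2. Confidence

Among the transactions that contain item A, the probability that item B is also included.
e.g., the probability that bread goes into the basket when milk has been bought

Confidence(A → B) = (number of transactions containing both A and B) / (number of transactions containing A)
= Support(A, B) / P(A)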

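3. Lift

The ratio of the probability of B given A to the probability of B when nothing is known about A. It measures how much knowing A raises the probability of B.

Lift(A → B) = Confidence(A → B) / P(B) = Support(A, B) / (P(A) × P(B))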
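The three measures can be tied together in a few lines of plain Python. This is a minimal sketch on hypothetical toy baskets (they are not the dataset used later in this post), and the function names `support`, `confidence`, and `lift` are illustrative rather than taken from any library. Fractions are used so the ratios stay exact.

```python
from fractions import Fraction

# Hypothetical toy baskets, for illustration only.
baskets = [
    {"milk", "bread"},
    {"milk", "bread", "eggs"},
    {"milk"},
    {"bread"},
    {"eggs"},
]

def support(items, transactions):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    hits = sum(1 for t in transactions if items <= t)
    return Fraction(hits, len(transactions))

def confidence(a, b, transactions):
    """P(B | A) = support(A and B) / support(A)."""
    return support(set(a) | set(b), transactions) / support(a, transactions)

def lift(a, b, transactions):
    """confidence(A -> B) / support(B); a value of 1 means A and B look independent."""
    return confidence(a, b, transactions) / support(b, transactions)

s = support({"milk", "bread"}, baskets)       # 2/5
c = confidence({"milk"}, {"bread"}, baskets)  # 2/3
l = lift({"milk"}, {"bread"}, baskets)        # 10/9
print(s, c, l)
```

Here milk and bread appear together in 2 of 5 baskets, and the lift of 10/9 is only slightly above 1, so in this toy data the two items are only weakly associated.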

Library Import

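mlxtend (machine learning extensions): a Python library of utilities for data analysis and machine learning, including frequent pattern mining

To apply association rules, you need to know how often each item appears and which items it appears with. Checking every possible combination of items is inefficient when the dataset is large, so Apriori, the best-known algorithm for association rule analysis, is used. Apriori relies on the fact that any itemset containing an infrequent itemset must itself be infrequent, which lets it discard most combinations without counting them.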
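The pruning idea behind Apriori can be sketched in plain Python. This is a simplified illustration, not mlxtend's implementation: the function name `apriori_sketch` is hypothetical, and the baskets are the same five transactions used in the Dataset section. The search grows itemsets one level at a time and only counts a candidate when all of its smaller subsets were already frequent.

```python
from itertools import combinations

def apriori_sketch(transactions, min_support):
    """Level-wise frequent itemset search with Apriori pruning."""
    baskets = [frozenset(t) for t in transactions]
    n = len(baskets)

    def support(itemset):
        return sum(1 for b in baskets if itemset <= b) / n

    # Level 1: frequent single items.
    items = sorted({item for b in baskets for item in b})
    current = {}
    for item in items:
        s = support(frozenset([item]))
        if s >= min_support:
            current[frozenset([item])] = s
    frequent = dict(current)

    k = 2
    while current:
        # Join step: union pairs of frequent (k-1)-itemsets into k-itemsets.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: drop candidates with any infrequent (k-1)-subset.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in current for sub in combinations(c, k - 1))
        }
        # Count step: keep only candidates that meet the threshold.
        current = {}
        for c in candidates:
            s = support(c)
            if s >= min_support:
                current[c] = s
        frequent.update(current)
        k += 1
    return frequent

dataset = [
    ['Milk', 'Onion', 'Nutmeg', 'Eggs', 'Yogurt'],
    ['Onion', 'Nutmeg', 'Eggs', 'Yogurt'],
    ['Milk', 'Apple', 'Eggs'],
    ['Milk', 'Unicorn', 'Corn', 'Yogurt'],
    ['Corn', 'Onion', 'Onion', 'Ice cream', 'Eggs'],
]

frequent = apriori_sketch(dataset, min_support=0.5)
for itemset, s in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), s)
```

With a threshold of 0.5, only four single items and the pair (Eggs, Onion) survive, so no three-item candidate is ever counted. That is the saving Apriori offers over checking every combination.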

  
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

Dataset

  
dataset = [
    ['Milk', 'Onion', 'Nutmeg', 'Eggs', 'Yogurt'],
    ['Onion', 'Nutmeg', 'Eggs', 'Yogurt'],
    ['Milk', 'Apple', 'Eggs'],
    ['Milk', 'Unicorn', 'Corn', 'Yogurt'],
    ['Corn', 'Onion', 'Onion', 'Ice cream', 'Eggs']
]

Association Rule Analysis

  
# TransactionEncoder(): converts the transaction lists into an array format suitable for machine learning
te = TransactionEncoder()

# One-hot encoding
te_arr = te.fit_transform(dataset)

df = pd.DataFrame(te_arr, columns=te.columns_)

df

	Apple	Corn	Eggs	Ice cream	Milk	Nutmeg	Onion	Unicorn	Yogurt
0	False	False	True	False	True	True	True	False	True
1	False	False	True	False	False	True	True	False	True
2	True	False	True	False	True	False	False	False	False
3	False	True	False	False	True	False	False	True	True
4	False	True	True	True	False	False	True	False	False
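To make the encoding step concrete, here is a minimal plain-Python sketch that reproduces the table above. It is not how `TransactionEncoder` is implemented internally; the helper name `one_hot_encode` is hypothetical. Each distinct item becomes a column, and each basket becomes a row of booleans.

```python
def one_hot_encode(transactions):
    """Return (sorted item names, one boolean row per transaction)."""
    columns = sorted({item for t in transactions for item in t})
    rows = [[item in set(t) for item in columns] for t in transactions]
    return columns, rows

dataset = [
    ['Milk', 'Onion', 'Nutmeg', 'Eggs', 'Yogurt'],
    ['Onion', 'Nutmeg', 'Eggs', 'Yogurt'],
    ['Milk', 'Apple', 'Eggs'],
    ['Milk', 'Unicorn', 'Corn', 'Yogurt'],
    ['Corn', 'Onion', 'Onion', 'Ice cream', 'Eggs'],
]

columns, rows = one_hot_encode(dataset)

# The mean of a boolean column is the support of that single item.
support_by_item = {
    col: sum(row[i] for row in rows) / len(rows)
    for i, col in enumerate(columns)
}
print(columns)
print(support_by_item["Eggs"])
```

Note that the duplicated 'Onion' in the last basket collapses to a single True, since the encoding records presence rather than quantity.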

The probability of buying Eggs is 0.8.
The probability of buying Apple and Eggs together is 0.2.

  
# min_support: keep only itemsets whose support is at least 0.05
freq_items = apriori(df, min_support=0.05, use_colnames=True)
freq_items

	support	itemsets
0	0.2	(Apple)
1	0.4	(Corn)
2	0.8	(Eggs)
3	0.2	(Ice cream)
4	0.6	(Milk)
5	0.4	(Nutmeg)
6	0.6	(Onion)
7	0.2	(Unicorn)
8	0.6	(Yogurt)
9	0.2	(Eggs, Apple)
10	0.2	(Milk, Apple)
11	0.2	(Corn, Eggs)
12	0.2	(Ice cream, Corn)
13	0.2	(Corn, Milk)
14	0.2	(Onion, Corn)
15	0.2	(Corn, Unicorn)
16	0.2	(Corn, Yogurt)
17	0.2	(Ice cream, Eggs)
18	0.4	(Milk, Eggs)
19	0.4	(Nutmeg, Eggs)
20	0.6	(Onion, Eggs)
21	0.4	(Eggs, Yogurt)
22	0.2	(Ice cream, Onion)
23	0.2	(Milk, Nutmeg)
24	0.2	(Onion, Milk)
25	0.2	(Unicorn, Milk)
26	0.4	(Milk, Yogurt)
27	0.4	(Onion, Nutmeg)
28	0.4	(Nutmeg, Yogurt)
29	0.4	(Onion, Yogurt)
30	0.2	(Unicorn, Yogurt)
31	0.2	(Milk, Eggs, Apple)
32	0.2	(Ice cream, Corn, Eggs)
33	0.2	(Onion, Corn, Eggs)
34	0.2	(Ice cream, Onion, Corn)
35	0.2	(Corn, Unicorn, Milk)
36	0.2	(Corn, Milk, Yogurt)
37	0.2	(Corn, Unicorn, Yogurt)
38	0.2	(Ice cream, Onion, Eggs)
39	0.2	(Nutmeg, Milk, Eggs)
40	0.2	(Onion, Milk, Eggs)
41	0.2	(Milk, Eggs, Yogurt)
42	0.4	(Nutmeg, Onion, Eggs)
43	0.4	(Nutmeg, Eggs, Yogurt)
44	0.4	(Onion, Eggs, Yogurt)
45	0.2	(Onion, Milk, Nutmeg)
46	0.2	(Nutmeg, Milk, Yogurt)
47	0.2	(Onion, Milk, Yogurt)
48	0.2	(Unicorn, Milk, Yogurt)
49	0.4	(Nutmeg, Onion, Yogurt)
50	0.2	(Ice cream, Onion, Corn, Eggs)
51	0.2	(Corn, Unicorn, Milk, Yogurt)
52	0.2	(Nutmeg, Onion, Milk, Eggs)
53	0.2	(Nutmeg, Milk, Eggs, Yogurt)
54	0.2	(Onion, Milk, Eggs, Yogurt)
55	0.4	(Nutmeg, Onion, Eggs, Yogurt)
56	0.2	(Nutmeg, Onion, Milk, Yogurt)
57	0.2	(Nutmeg, Yogurt, Milk, Eggs, Onion)

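The further lift rises above 1, the less likely it is that the co-occurrence happened by chance. When the items are unrelated (independent), lift is 1.

The probability of buying both Apple and Eggs is 0.2.
When Apple is bought, the probability that Eggs is also bought is 1 (100%).
When Eggs is bought, the probability that Apple is also bought is 0.25 (25%).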
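These figures can be checked by hand. The sketch below recomputes them from the five baskets in the Dataset section using exact fractions. The helper `support` is a name introduced here for illustration and is not part of mlxtend.

```python
from fractions import Fraction

dataset = [
    ['Milk', 'Onion', 'Nutmeg', 'Eggs', 'Yogurt'],
    ['Onion', 'Nutmeg', 'Eggs', 'Yogurt'],
    ['Milk', 'Apple', 'Eggs'],
    ['Milk', 'Unicorn', 'Corn', 'Yogurt'],
    ['Corn', 'Onion', 'Onion', 'Ice cream', 'Eggs'],
]
baskets = [set(t) for t in dataset]

def support(*items):
    """Fraction of baskets that contain all of the given items."""
    hits = sum(1 for b in baskets if set(items) <= b)
    return Fraction(hits, len(baskets))

support_both = support("Apple", "Eggs")                 # 1/5  = 0.2
conf_apple_to_eggs = support_both / support("Apple")    # 1    = 100%
conf_eggs_to_apple = support_both / support("Eggs")     # 1/4  = 25%
lift_apple_eggs = support_both / (support("Apple") * support("Eggs"))  # 5/4 = 1.25

print(float(support_both), float(conf_apple_to_eggs),
      float(conf_eggs_to_apple), float(lift_apple_eggs))
```

Lift is symmetric, so both directions of the rule share the value 1.25 even though their confidences differ.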

  
association_rules(freq_items, metric='lift', min_threshold=1)

	antecedents	consequents	antecedent support	consequent support	support	confidence	lift	leverage	conviction
0	(Eggs)	(Apple)	0.8	0.2	0.2	0.250000	1.250000	0.04	1.066667
1	(Apple)	(Eggs)	0.2	0.8	0.2	1.000000	1.250000	0.04	inf
2	(Milk)	(Apple)	0.6	0.2	0.2	0.333333	1.666667	0.08	1.200000
3	(Apple)	(Milk)	0.2	0.6	0.2	1.000000	1.666667	0.08	inf
4	(Ice cream)	(Corn)	0.2	0.4	0.2	1.000000	2.500000	0.12	inf
...	...	...	...	...	...	...	...	...	...
231	(Onion, Eggs)	(Yogurt, Milk, Nutmeg)	0.6	0.2	0.2	0.333333	1.666667	0.08	1.200000
232	(Nutmeg)	(Onion, Eggs, Milk, Yogurt)	0.4	0.2	0.2	0.500000	2.500000	0.12	1.600000
233	(Yogurt)	(Onion, Eggs, Milk, Nutmeg)	0.6	0.2	0.2	0.333333	1.666667	0.08	1.200000
234	(Eggs)	(Onion, Yogurt, Milk, Nutmeg)	0.8	0.2	0.2	0.250000	1.250000	0.04	1.066667
235	(Onion)	(Yogurt, Eggs, Milk, Nutmeg)	0.6	0.2	0.2	0.333333	1.666667	0.08	1.200000

236 rows × 9 columns
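The output also contains `leverage` and `conviction` columns, which are not defined earlier in this post. The sketch below assumes the commonly used definitions, leverage = Support(A, B) - P(A) × P(B) and conviction = (1 - P(B)) / (1 - Confidence(A → B)), and checks them against rows 0 and 1 of the table. The function names are illustrative.

```python
import math
from fractions import Fraction

def leverage(support_ab, support_a, support_b):
    """Observed joint support minus the support expected under independence."""
    return support_ab - support_a * support_b

def conviction(support_b, confidence_ab):
    """(1 - P(B)) / (1 - confidence); infinite when the rule never fails."""
    if confidence_ab == 1:
        return math.inf
    return (1 - support_b) / (1 - confidence_ab)

# Row 0: (Eggs) -> (Apple), with P(Eggs) = 4/5, P(Apple) = 1/5, support = 1/5.
lev_row0 = leverage(Fraction(1, 5), Fraction(4, 5), Fraction(1, 5))  # 1/25 = 0.04
conv_row0 = conviction(Fraction(1, 5), Fraction(1, 4))               # 16/15, about 1.066667

# Row 1: (Apple) -> (Eggs), confidence is 1, so conviction is infinite.
conv_row1 = conviction(Fraction(4, 5), Fraction(1))

print(float(lev_row0), float(conv_row0), conv_row1)
```

A leverage of 0 and a conviction of 1 both correspond to independence, in the same way that a lift of 1 does.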

