[NLP] BoW, DTM, TF-IDF

BoW(Bag of Words)

단어들의 순서와 상관 없이 출현 빈도(frequency)에 따라 표현하는 방법

  
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download('punckt')
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Error loading punckt: Package 'punckt' not found in index
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\USER\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\USER\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!





True

  
from sklearn.feature_extraction.text import CountVectorizer

corpus = ['Think like a man of action and act like man of thought.']
print(corpus)

['Think like a man of action and act like man of thought.']

  
vector = CountVectorizer()
bow = vector.fit_transform(corpus)

print(bow.toarray())
print(vector.vocabulary_)

[[1 1 1 2 2 2 1 1]]
{'think': 6, 'like': 3, 'man': 4, 'of': 5, 'action': 1, 'and': 2, 'act': 0, 'thought': 7}

  
# 영어에 대한 불용어 자동으로 제거
vector = CountVectorizer(stop_words='english')
bow = vector.fit_transform(corpus)

print(bow.toarray())
print(vector.vocabulary_)

[[1 1 2 2 1 1]]
{'think': 4, 'like': 2, 'man': 3, 'action': 1, 'act': 0, 'thought': 5}

  
corpus = ['또또는 고기 반찬을 좋아합니다. 그러나 또또는 사료를 싫어합니다.']

vector = CountVectorizer()
bow = vector.fit_transform(corpus)

print(bow.toarray())
print(vector.vocabulary_)

[[1 1 2 1 1 1 1]]
{'또또는': 2, '고기': 0, '반찬을': 3, '좋아합니다': 6, '그러나': 1, '사료를': 4, '싫어합니다': 5}

문서 단어 행렬(DTM)

문서 단어 행렬은 문서에 등장하는 여러 단어들의 빈도를 행렬로 표현
각 문서에 대한 BoW를 하나의 행렬로 표현한 것

  
corpus = ['Think like a man of action and act like man of thought.',
          'Try not to become a man of success but rather try to becom a man of value.',
          'Give me liberty, of give me death']

vector = CountVectorizer(stop_words='english')
bow = vector.fit_transform(corpus)

print(bow.toarray())
print(vector.vocabulary_)

[[1 1 0 0 0 2 2 0 1 1 0 0]
 [0 0 1 0 0 0 2 1 0 0 2 1]
 [0 0 0 1 1 0 0 0 0 0 0 0]]
{'think': 8, 'like': 5, 'man': 6, 'action': 1, 'act': 0, 'thought': 9, 'try': 10, 'success': 7, 'becom': 2, 'value': 11, 'liberty': 4, 'death': 3}

  
import pandas as pd

columns = []
for k, v in sorted(vector.vocabulary_.items(), key=lambda item:item[1]):
    columns.append(k)

df = pd.DataFrame(bow.toarray(), columns=columns)
df

	act	action	becom	death	liberty	like	man	success	think	thought	try	value
0	1	1	0	0	0	2	2	0	1	1	0	0
1	0	0	1	0	0	0	2	1	0	0	2	1
2	0	0	0	1	1	0	0	0	0	0	0	0

어휘 빈도-문서 역빈도(TF-IDF)

단순히 빈도수가 높은 단어가 핵심이 아닌, 특정 문서에서만 집중적으로 등장할 때 해당 단어가 문서의 주제를 잘 담고 있는 핵심이라고 가정
특정 문서에서 특정 단어가 많이 등장하고, 그 단어가 다른 문서에서 적게 등장할 때, 그 단어를 특정 문서의 핵심어로 간주
어휘 빈도-문서 역빈도는 어휘 빈도와 역문서 빈도를 곱해 계산
어휘 빈도 : 특정 문서에서 특정 단어가 많이 등장하는 것을 의미
역문서 빈도 : 다른 문서에서 등장하지 않는 단어의 빈도를 의미
TF-IDF를 계산하기 위해 scikit-learn의 Tfidfvectorizer 이용
앞서 계산한 단어 빈도수를 입력하여 TF-IDF로 변환

  
from sklearn.feature_extraction.text import TfidfVectorizer

  
tfidf = TfidfVectorizer(stop_words='english').fit(corpus)

print(tfidf.transform(corpus).toarray())
print(tfidf.vocabulary_)

[[0.311383   0.311383   0.         0.         0.         0.62276601
  0.4736296  0.         0.311383   0.311383   0.         0.        ]
 [0.         0.         0.32767345 0.         0.         0.
  0.49840822 0.32767345 0.         0.         0.65534691 0.32767345]
 [0.         0.         0.         0.70710678 0.70710678 0.
  0.         0.         0.         0.         0.         0.        ]]
{'think': 8, 'like': 5, 'man': 6, 'action': 1, 'act': 0, 'thought': 9, 'try': 10, 'success': 7, 'becom': 2, 'value': 11, 'liberty': 4, 'death': 3}

  
import pandas as pd

columns = []
for k, v in sorted(tfidf.vocabulary_.items(), key=lambda item:item[1]):
    columns.append(k)

df = pd.DataFrame(tfidf.transform(corpus).toarray(), columns=columns)
df

	act	action	becom	death	liberty	like	man	success	think	thought	try	value
0	0.311383	0.311383	0.000000	0.000000	0.000000	0.622766	0.473630	0.000000	0.311383	0.311383	0.000000	0.000000
1	0.000000	0.000000	0.327673	0.000000	0.000000	0.000000	0.498408	0.327673	0.000000	0.000000	0.655347	0.327673
2	0.000000	0.000000	0.000000	0.707107	0.707107	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000

[NLP] BoW, DTM, TF-IDF

BoW(Bag of Words)

문서 단어 행렬(DTM)

어휘 빈도-문서 역빈도(TF-IDF)

Further Reading

[NLP] Tokenization, Stopwoard, Lemmatization, Stemming

[NLP] 정규 표현식(Regular Expression)

[NLP] 텍스트 전처리(Text Preprocessing)1