[NLP] Tokenization, Stopword, Lemmatization, Stemming

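Natural Language Processing

NLP (Natural Language Processing)
The task of analyzing and processing the meaning of natural language
Applications: text classification, sentiment analysis, document summarization, translation, question answering, speech recognition, chatbots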

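Stages of Natural Language Processing

Lexical Analysis

Word-level study that identifies and analyzes the structure of words to determine their meaning and part of speech
Morphological analysis: analyzes how words are formed from morphemes, the smallest units of meaning that cannot be broken down further
POS tagging: parts of speech classify words by function, form, and meaning; tagging attaches this extra linguistic information to each word so that the same word form used with different meanings can be told apart

Syntactic Analysis

Constituency (phrase-structure) parsing: parsing based on a phrase-structure grammar
Dependency parsing: analyzes the structure of the whole sentence by analyzing the dependency relations between its words

Semantic Analysis

Word sense disambiguation: resolves ambiguous words in a sentence by matching them to predefined senses
Semantic role labeling: to interpret meaning, identifies how a predicate relates to the elements it governs and classifies the role of each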

Installing Morphological Analyzers

Korean NLP: konlpy
Morphological analyzers: Okt, Kkma, Hannanum, Komoran, MeCab

  
pip install --upgrade pip
pip install JPype1
pip install konlpy --upgrade

Library Call

  
from konlpy.tag import Kkma, Hannanum, Komoran, Okt, Mecab

  
kkma = Kkma()
okt = Okt()
komoran = Komoran()
hannanum = Hannanum()

Tokenization

Handling special characters
Splitting specific words into tokens

Word Tokenization

Use Python's built-in str.split method to tokenize words
Words are split on spaces

  
sentence = 'Time is gold'

token = sentence.split(' ')
token

['Time', 'is', 'gold']

Tokenization can be implemented with the tokenize module of the nltk package
Word tokenization uses the word_tokenize() function

  
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\USER\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.

True

  
from nltk.tokenize import word_tokenize

tokens = word_tokenize(sentence)
tokens

['Time', 'is', 'gold']

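Splitting Korean on spaces still leaves particles, conjunctions, and similar elements attached to the words, which makes analysis harder
Korean tokenizers address this by separating or removing particles and conjunctions
The pos() function returns each morpheme together with its part-of-speech tag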

  
# "If you can always focus on the present, you will be happy."
sentence = '언제나 현재에 집중할 수 있다면 행복할 것이다.'

print('Kkma POS tags : ', kkma.pos(sentence))
print('Okt POS tags : ', okt.pos(sentence))
print('Hannanum POS tags : ', hannanum.pos(sentence))
print('Komoran POS tags : ', komoran.pos(sentence))

Kkma POS tags :  [('언제나', 'MAG'), ('현재', 'NNG'), ('에', 'JKM'), ('집중', 'NNG'), ('하', 'XSV'), ('ㄹ', 'ETD'), ('수', 'NNB'), ('있', 'VA'), ('다면', 'ECE'), ('행복', 'NNG'), ('하', 'XSV'), ('ㄹ', 'ETD'), ('것', 'NNB'), ('이', 'VCP'), ('다', 'EFN'), ('.', 'SF')]
Okt POS tags :  [('언제나', 'Adverb'), ('현재', 'Noun'), ('에', 'Josa'), ('집중', 'Noun'), ('할', 'Verb'), ('수', 'Noun'), ('있다면', 'Adjective'), ('행복할', 'Adjective'), ('것', 'Noun'), ('이다', 'Josa'), ('.', 'Punctuation')]
Hannanum POS tags :  [('언제나', 'M'), ('현재', 'N'), ('에', 'J'), ('집중', 'N'), ('하', 'X'), ('ㄹ', 'E'), ('수', 'N'), ('있', 'P'), ('다면', 'E'), ('행복', 'N'), ('하', 'X'), ('ㄹ', 'E'), ('것', 'N'), ('이', 'J'), ('다', 'E'), ('.', 'S')]
Komoran POS tags :  [('언제나', 'MAG'), ('현재', 'NNG'), ('에', 'JKB'), ('집중', 'NNG'), ('하', 'XSV'), ('ㄹ', 'ETM'), ('수', 'NNB'), ('있', 'VV'), ('다면', 'EC'), ('행복', 'NNG'), ('하', 'XSV'), ('ㄹ', 'ETM'), ('것', 'NNB'), ('이', 'VCP'), ('다', 'EF'), ('.', 'SF')]
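For comparison with the analyzer output above, here is a small sketch of what plain whitespace splitting does to the same sentence. It uses only built-in string methods, and it shows that a particle such as '에' stays glued to its noun when no morphological analyzer is involved.

```python
# Plain whitespace splitting keeps particles attached to the words.
sentence = '언제나 현재에 집중할 수 있다면 행복할 것이다.'

tokens = sentence.split()
print(tokens)

# '현재에' still contains the particle '에'; a morphological analyzer
# would separate it into '현재' + '에'.
print('현재에' in tokens, '현재' in tokens)
```

The second print shows that the bare noun '현재' never appears as a token, which is why counting or looking up words on whitespace-split Korean text tends to miss matches.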

Use the morphs() function when only tokenization is needed

  
sentence = '언제나 현재에 집중할 수 있다면 행복할 것이다.'

print('Kkma morphemes : ', kkma.morphs(sentence))
print('Okt morphemes : ', okt.morphs(sentence))
print('Hannanum morphemes : ', hannanum.morphs(sentence))
print('Komoran morphemes : ', komoran.morphs(sentence))

Kkma morphemes :  ['언제나', '현재', '에', '집중', '하', 'ㄹ', '수', '있', '다면', '행복', '하', 'ㄹ', '것', '이', '다', '.']
Okt morphemes :  ['언제나', '현재', '에', '집중', '할', '수', '있다면', '행복할', '것', '이다', '.']
Hannanum morphemes :  ['언제나', '현재', '에', '집중', '하', 'ㄹ', '수', '있', '다면', '행복', '하', 'ㄹ', '것', '이', '다', '.']
Komoran morphemes :  ['언제나', '현재', '에', '집중', '하', 'ㄹ', '수', '있', '다면', '행복', '하', 'ㄹ', '것', '이', '다', '.']

Use the nouns() function to keep only the nouns, which drops particles and conjunctions

  
sentence = '언제나 현재에 집중할 수 있다면 행복할 것이다.'

print('Kkma nouns : ', kkma.nouns(sentence))
print('Okt nouns : ', okt.nouns(sentence))
print('Hannanum nouns : ', hannanum.nouns(sentence))
print('Komoran nouns : ', komoran.nouns(sentence))

Kkma nouns :  ['현재', '집중', '수', '행복']
Okt nouns :  ['현재', '집중', '수', '것']
Hannanum nouns :  ['현재', '집중', '수', '행복', '것']
Komoran nouns :  ['현재', '집중', '수', '행복', '것']

Sentence Tokenization

Split sentences on the newline character (\n)

  
sentences = 'The world is a beautiful book.\nBut of little use to him who cannot read it'
print(sentences)

tokens = sentences.split('\n')
print(tokens)

The world is a beautiful book.
But of little use to him who cannot read it
['The world is a beautiful book.', 'But of little use to him who cannot read it']

Sentence tokenization uses the sent_tokenize() function

  
from nltk.tokenize import sent_tokenize

sentences = 'The world is a beautiful book.\nBut of little use to him who cannot read it'
tokens = sent_tokenize(sentences)
print(tokens)

['The world is a beautiful book.', 'But of little use to him who cannot read it']

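In sentence tokenization, a binary classifier can also be used to handle periods
A period may separate two sentences or be part of a word inside a sentence, so classifying each period as one case or the other can produce better tokenization
Among the morphological analyzers in the konlpy library, only Kkma can split sentences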

  
text = '진짜? 내일 뭐하지. 이렇게 애매모호한 문장도? 밥은 먹었어. 나는'

print(kkma.sentences(text))

['진짜? 내일 뭐하지. 이렇게 애매모호한 문장도? 밥은 먹었어.', '나는']
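The period-handling idea mentioned earlier in this section (deciding, for each period, whether it ends a sentence or belongs to a word) can be sketched as a yes/no decision function. This is only a minimal illustration: a hand-written rule stands in for a trained binary classifier, and the abbreviation list and the "next token starts with a capital letter" rule are assumptions made for this example.

```python
# Hypothetical abbreviation list, for illustration only.
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "e.g.", "i.e."}


def is_sentence_boundary(token, next_token):
    """Binary decision: does the period ending `token` close a sentence?"""
    if not token.endswith("."):
        return False                      # no trailing period at all
    if token.lower() in ABBREVIATIONS:
        return False                      # the period is part of an abbreviation
    # Treat it as a boundary at end of text or before a capitalized token.
    return next_token is None or next_token[0].isupper()


def split_sentences(text):
    """Group whitespace tokens into sentences using the decision above."""
    tokens = text.split()
    sentences, current = [], []
    for i, tok in enumerate(tokens):
        current.append(tok)
        nxt = tokens[i + 1] if i + 1 < len(tokens) else None
        if is_sentence_boundary(tok, nxt):
            sentences.append(" ".join(current))
            current = []
    if current:                           # trailing words without a final period
        sentences.append(" ".join(current))
    return sentences


result = split_sentences("Dr. Kim paid 3.5 dollars. He left at noon.")
print(result)
```

Here the periods in "Dr." and "3.5" are not treated as boundaries, while the ones after "dollars" and "noon" are. A learned classifier would replace the hand-written rules with features of the surrounding tokens.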

To tokenize Korean sentences, use the kss library

# pip install kss

Even with a library, Korean text is not always split correctly because of inverted (fronted) expressions
For better training results, the user has to handle those cases separately

  
import kss

print(kss.split_sentences(text))

['진짜? 내일 뭐하지.', '이렇게 애매모호한 문장도?', '밥은 먹었어.', '나는']

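Tokenization with Regular Expressions

A tokenizer can be written from scratch, but regular expressions make it simple to implement
The nltk package provides RegexpTokenizer, a tool that tokenizes with regular expressions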

  
from nltk.tokenize import RegexpTokenizer

sentence = 'Where there\'s a will, there\'s a way'
tokenizer = RegexpTokenizer(r'[\w]+') # keep runs of word characters; drops special characters
tokens = tokenizer.tokenize(sentence)
print(tokens)

['Where', 'there', 's', 'a', 'will', 'there', 's', 'a', 'way']

  
# Split on whitespace (with gaps=True the pattern describes the gaps between tokens)
tokenizer = RegexpTokenizer(r'[\s]+', gaps=True)
tokens = tokenizer.tokenize(sentence)
print(tokens)

['Where', "there's", 'a', 'will,', "there's", 'a', 'way']

  
sentence = '안녕하세요 ㅋㅋ 저는 자연어 처리(Natural Language Processing)를 배우고 있습니다.'

# Keep only Korean (Hangul syllables) and drop everything else
tokenizer = RegexpTokenizer('[가-힣]+')
tokens = tokenizer.tokenize(sentence)
print(tokens)

['안녕하세요', '저는', '자연어', '처리', '를', '배우고', '있습니다']

  
# Split on runs of standalone Hangul consonants (such as ㅋㅋ), which act as the gaps
tokenizer = RegexpTokenizer('[ㄱ-ㅎ]+', gaps=True)
tokens = tokenizer.tokenize(sentence)
print(tokens)

['안녕하세요 ', ' 저는 자연어 처리(Natural Language Processing)를 배우고 있습니다.']
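The same patterns can be tried with Python's standard `re` module alone. The sketch below is not how RegexpTokenizer is implemented; it only shows that `re.findall` corresponds to matching tokens and `re.split` corresponds to the `gaps=True` mode, where the pattern describes the separators.

```python
import re

sentence = "Where there's a will, there's a way"

# Match mode: collect every run of word characters.
word_tokens = re.findall(r'\w+', sentence)
print(word_tokens)

# Gap mode: the pattern marks the separators between tokens.
gap_tokens = re.split(r'\s+', sentence)
print(gap_tokens)

# Keep only runs of Hangul syllables.
kor = '안녕하세요 ㅋㅋ 저는 자연어 처리(Natural Language Processing)를 배우고 있습니다.'
hangul_tokens = re.findall(r'[가-힣]+', kor)
print(hangul_tokens)
```

One difference to keep in mind: `re.split` returns empty strings when the text starts or ends with a separator, so those may need to be filtered out.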

Tokenization with TextBlob

  
from textblob import TextBlob

eng = 'Where there\'s a will, there\'s a way'
blob = TextBlob(eng)
blob.words

WordList(['Where', 'there', "'s", 'a', 'will', 'there', "'s", 'a', 'way'])

  
# "The only secret to success: focusing fanatically on what you can do well."
kor = '성공의 비결은 단 한 가지. 잘 할 수 있는 일에 광적으로 집중하는 것이다.'
blob = TextBlob(kor)
blob.words

WordList(['성공의', '비결은', '단', '한', '가지', '잘', '할', '수', '있는', '일에', '광적으로', '집중하는', '것이다'])

Tokenization with Keras

  
# Note: this module belongs to the legacy Keras 2.x API and may be unavailable in newer releases
from keras.preprocessing.text import text_to_word_sequence

print(text_to_word_sequence(eng))
print(text_to_word_sequence(kor))

['where', "there's", 'a', 'will', "there's", 'a', 'way']
['성공의', '비결은', '단', '한', '가지', '잘', '할', '수', '있는', '일에', '광적으로', '집중하는', '것이다']

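Others

WhitespaceTokenizer: tokenizes on whitespace
WordPunctTokenizer: splits text into a list of alphanumeric sequences and non-alphanumeric (punctuation) sequences
MWETokenizer: treats a specific group of several words as a single unit
TweetTokenizer: designed for tweets; keeps emoticons, hashtags, and mentions intact, which helps when analyzing the sentiment of a sentence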
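The four tokenizers listed above all live in `nltk.tokenize` and need no extra data downloads. The following is a short sketch of how each one might be called; the sample sentences and the ("New", "York") multi-word expression are made up for illustration.

```python
from nltk.tokenize import (
    MWETokenizer,
    TweetTokenizer,
    WhitespaceTokenizer,
    WordPunctTokenizer,
)

text = "Good muffins cost $3.88 in New York."

# Splits only on whitespace, so "$3.88" and "York." stay intact.
ws_tokens = WhitespaceTokenizer().tokenize(text)
print(ws_tokens)

# Separates alphanumeric runs from punctuation runs.
wp_tokens = WordPunctTokenizer().tokenize(text)
print(wp_tokens)

# Works on an already tokenized list and merges the listed expressions.
mwe = MWETokenizer([("New", "York")], separator="_")
mwe_tokens = mwe.tokenize("I live in New York".split())
print(mwe_tokens)

# Keeps emoticons, hashtags, and mentions as single tokens.
tweet_tokens = TweetTokenizer().tokenize("NLP is fun :-) #nlp @friend")
print(tweet_tokens)
```

MWETokenizer is the only one of the four that does not take raw text, so it is usually chained after another tokenizer.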

n-gram

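An n-gram is a contiguous sequence of n items such as words or syllables; the text is split into these overlapping sequences and their frequencies are analyzed
n=1 gives a unigram, n=2 a bigram, and n=3 a trigram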
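To make the definition concrete, here is a minimal sketch that builds n-grams with the standard library only and counts how often each one occurs. `make_ngrams` is a helper name invented for this example; it is not part of any library.

```python
from collections import Counter


def make_ngrams(tokens, n):
    """Return all n-grams (as tuples) over a sequence of tokens."""
    # zip over n shifted views of the sequence: tokens[0:], tokens[1:], ...
    return list(zip(*(tokens[i:] for i in range(n))))


# Word-level bigrams and their frequencies.
words = 'to be or not to be'.split()
bigrams = make_ngrams(words, 2)
print(bigrams)

counts = Counter(bigrams)
print(counts.most_common(1))

# Syllable-level bigrams work the same way on a list of characters.
syllable_bigrams = make_ngrams(list('자연어'), 2)
print(syllable_bigrams)
```

A sequence of k tokens yields k - n + 1 n-grams, so six words produce five bigrams here, and ('to', 'be') is the only one that occurs twice.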

  
from nltk import ngrams

sentence = 'There is no royal road to learning'
bigram = list(ngrams(sentence.split(), 2)) # n=2 (bigram)
print(bigram)

[('There', 'is'), ('is', 'no'), ('no', 'royal'), ('royal', 'road'), ('road', 'to'), ('to', 'learning')]

  
sentence = 'There is no royal road to learning'
trigram = list(ngrams(sentence.split(), 3)) # n=3 (trigram)
print(trigram)

[('There', 'is', 'no'), ('is', 'no', 'royal'), ('no', 'royal', 'road'), ('royal', 'road', 'to'), ('road', 'to', 'learning')]

  
from textblob import TextBlob

blob = TextBlob(sentence)
blob.ngrams(n=2) # bigram

[WordList(['There', 'is']),
 WordList(['is', 'no']),
 WordList(['no', 'royal']),
 WordList(['royal', 'road']),
 WordList(['road', 'to']),
 WordList(['to', 'learning'])]

  
from textblob import TextBlob

blob = TextBlob(sentence)
blob.ngrams(n=3) # trigram

[WordList(['There', 'is', 'no']),
 WordList(['is', 'no', 'royal']),
 WordList(['no', 'royal', 'road']),
 WordList(['royal', 'road', 'to']),
 WordList(['road', 'to', 'learning'])]

POS (Part-of-Speech) Tagging

POS stands for part of speech; POS tagging labels each word in a sentence with its part of speech

  
sentence = 'Think like man of action and act line man of thought'
words = word_tokenize(sentence)
print(words)

['Think', 'like', 'man', 'of', 'action', 'and', 'act', 'line', 'man', 'of', 'thought']

  
nltk.download('averaged_perceptron_tagger')

nltk.pos_tag(words)

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\USER\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping taggers\averaged_perceptron_tagger.zip.

[('Think', 'VBP'),
 ('like', 'IN'),
 ('man', 'NN'),
 ('of', 'IN'),
 ('action', 'NN'),
 ('and', 'CC'),
 ('act', 'NN'),
 ('line', 'NN'),
 ('man', 'NN'),
 ('of', 'IN'),
 ('thought', 'NN')]
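Once words are tagged, a common next step is to filter or count them by tag. The sketch below reuses the tagged pairs printed above as a plain list, so it runs without nltk. It relies on the fact that Penn Treebank noun tags all begin with 'NN'.

```python
from collections import Counter

# Tagged pairs copied from the nltk.pos_tag output above.
tagged = [('Think', 'VBP'), ('like', 'IN'), ('man', 'NN'), ('of', 'IN'),
          ('action', 'NN'), ('and', 'CC'), ('act', 'NN'), ('line', 'NN'),
          ('man', 'NN'), ('of', 'IN'), ('thought', 'NN')]

# Keep only the nouns (NN, NNS, NNP, NNPS all start with 'NN').
nouns = [word for word, tag in tagged if tag.startswith('NN')]
print(nouns)

# Count how many tokens received each tag.
tag_counts = Counter(tag for _, tag in tagged)
print(tag_counts.most_common(2))
```

Note that 'act' and 'line' come out as nouns here because the tagger labeled them NN. A filter like this is only as good as the tags it is given.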

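Stopword Removal

Stopwords are words such as particles and suffixes that appear frequently in sentences but contribute little to actual semantic analysis
To keep only the meaningful word tokens in the data, tokens that carry little meaning need to be removed (e.g., 나, 너, 은, 는, 이, 가, 하다, 합니다)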
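For Korean, stopwords are usually removed after morphological analysis, since particles are only visible as separate tokens at that point. Here is a minimal sketch using the Okt morphs() tokens shown earlier in this post; the stopword set is a hypothetical one chosen just for this sentence.

```python
# Tokens copied from the Okt morphs() output shown earlier in this post.
tokens = ['언제나', '현재', '에', '집중', '할', '수', '있다면', '행복할', '것', '이다', '.']

# Hypothetical stopword set, chosen only for this illustration.
stop_words = {'에', '할', '수', '것', '이다', '.'}

# A set gives constant-time membership checks, which matters for long lists.
filtered = [t for t in tokens if t not in stop_words]
print(filtered)
```

In practice the stopword list depends on the task and the analyzer, because each analyzer splits the same sentence into different morphemes.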
User-defined approach

  
# Define the stopwords
stop_words = 'on in the'
stop_words = stop_words.split(' ')
stop_words

['on', 'in', 'the']

  
sentence = 'singer on the stage'
sentence = sentence.split(' ')

filtered = [] # words that are not stopwords
for word in sentence:
    if word not in stop_words: # keep only non-stopwords
        filtered.append(word) # add to the list
print(filtered)

['singer', 'stage']

Using the stopword list from the nltk package

  
from nltk.corpus import stopwords

nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\USER\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.

True

  
# English stopwords
stop_words = stopwords.words('english')
print(stop_words)

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"]

  
s = 'If you do not walk today, you will have to run tomorrow.'

# Before stopword removal
words = word_tokenize(s)
print(words)

['If', 'you', 'do', 'not', 'walk', 'today', ',', 'you', 'will', 'have', 'to', 'run', 'tomorrow', '.']

  
# Remove stopwords (the nltk list is lowercase, so compare in lowercase)
no_stopwords = []
for w in words:
    if w.lower() not in stop_words:
        no_stopwords.append(w)
print(no_stopwords)

['walk', 'today', ',', 'run', 'tomorrow', '.']

Spelling Correction

Text may contain typos and misspellings
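One simple way to think about spelling correction is to pick the known word with the smallest edit distance to the misspelled one. The sketch below illustrates that idea with a toy vocabulary. It is not a description of how any particular library works internally, and `VOCAB` and `correct` are names invented for this example.

```python
def edit_distance(a, b):
    """Levenshtein distance, computed one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute (or match)
        prev = curr
    return prev[-1]


# Hypothetical toy vocabulary; a real corrector uses a large word list.
VOCAB = ['early', 'bird', 'catches', 'the', 'worm']


def correct(word):
    """Return the vocabulary word closest to `word` (first one wins on ties)."""
    return min(VOCAB, key=lambda v: edit_distance(word.lower(), v))


corrected = [correct(w) for w in ['Early', 'biird', 'catchess', 'the', 'womm']]
print(corrected)
```

Practical correctors also weigh candidates by how frequent each word is, since several dictionary words are often equally close to a typo.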

# pip install autocorrect

  
from autocorrect import Speller

spell = Speller('en')

# Misspelled words
print(spell('ppoplle'))
print(spell('peope'))
print(spell('peopae'))

people
people
people

  
s = word_tokenize('Early biird catchess the womm.')

print(s)

# Correct each token, then join the results with spaces
ss = ' '.join([spell(w) for w in s])
print(ss)

['Early', 'biird', 'catchess', 'the', 'womm', '.']
Early bird catches the worm .

Singularization and Pluralization

  
from textblob import TextBlob

words = 'apples bananas oranges' # plural words
textblob = TextBlob(words)

print(textblob.words) # words
print(textblob.words.singularize()) # singular forms

['apples', 'bananas', 'oranges']
['apple', 'banana', 'orange']

  
words = 'car train airplane' # singular words
textblob = TextBlob(words)

print(textblob.words) # words
print(textblob.words.pluralize()) # plural forms

['car', 'train', 'airplane']
['cars', 'trains', 'airplanes']

Stemming

  
stemmer = nltk.stem.PorterStemmer()

  
stemmer.stem('application')

'applic'

  
stemmer.stem('beginning')

'begin'

  
stemmer.stem('education')

'educ'
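Stemming cuts words down by rule rather than by dictionary lookup, which is why results such as 'applic' and 'educ' are not real words. The sketch below is a deliberately naive suffix stripper meant only to show that rule-based flavor. It is far simpler than the Porter algorithm, and the suffix list and `naive_stem` name are assumptions for this example.

```python
# Hypothetical, tiny suffix list; checked in order, longest first.
SUFFIXES = ("ation", "ing", "ed", "es", "s")


def naive_stem(word, min_stem=3):
    """Strip the first matching suffix, then undouble a final consonant."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            stem = word[: -len(suffix)]
            # "beginn" -> "begin": drop one letter of a doubled final
            # consonant (except l, s, z, which are left alone).
            if len(stem) >= 2 and stem[-1] == stem[-2] and stem[-1] not in "aeioulsz":
                stem = stem[:-1]
            return stem
    return word  # no suffix matched, or the stem would be too short


stems = [naive_stem(w) for w in ['application', 'beginning', 'education', 'cats']]
print(stems)
```

On these few words the toy rules happen to agree with the PorterStemmer results above, but on general text they will diverge quickly. The point is only that a stemmer applies string rules and never checks whether the result is a dictionary word.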

Lemmatization

  
from nltk.stem.wordnet import WordNetLemmatizer

nltk.download('wordnet')
lemmatizer = WordNetLemmatizer()

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\USER\AppData\Roaming\nltk_data...

  
lemmatizer.lemmatize('application')

'application'

  
lemmatizer.lemmatize('beginning')

'beginning'

Word Sense Disambiguation

  
from nltk.wsd import lesk

s = 'I saw bats.'

print(word_tokenize(s)) # word tokenization
print(lesk(word_tokenize(s), 'saw'))
print(lesk(word_tokenize(s), 'bats')) # result: 'bats' is interpreted as a racket

['I', 'saw', 'bats', '.']
Synset('saw.v.01')
Synset('squash_racket.n.01')
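The Lesk approach picks the sense whose dictionary definition shares the most words with the surrounding context, so a very short sentence like 'I saw bats.' gives it little to work with. The following is a simplified sketch of that overlap idea. The two glosses in `SENSES` are hypothetical stand-ins for real WordNet definitions, and `simplified_lesk` is a name invented here.

```python
# Hypothetical mini sense inventory; real Lesk uses WordNet glosses.
SENSES = {
    'bat': {
        'bat.animal': 'nocturnal flying mammal that lives in a cave and hunts at night',
        'bat.racket': 'club or racket used to hit the ball in a game such as cricket or squash',
    }
}


def simplified_lesk(context_words, word):
    """Pick the sense whose gloss shares the most words with the context."""
    context = {w.lower() for w in context_words}
    best_sense, best_overlap = None, -1
    for sense, gloss in SENSES[word].items():
        overlap = len(context & set(gloss.split()))
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense


animal = simplified_lesk('I saw bats flying out of the cave at night'.split(), 'bat')
racket = simplified_lesk('He swung the bat and hit the ball'.split(), 'bat')
print(animal)
print(racket)
```

With richer context the overlap counts separate the two senses clearly. With only a few context words the counts are small and ties or poor choices become likely.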
