11. Text Data

nananakh 2023. 7. 6. 13:51

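Now let's extract the nouns that appear most often in the darkknight.txt file.

The Counter class from the collections module makes it easy to find the most frequently used words.

First, let's write the code up to the step just before counting the frequent nouns.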

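import nltk
from nltk.corpus import stopwords
from collections import Counter

# Read only the first line (one review) of the file
with open('darkknight.txt', 'r', encoding='utf-8') as f:
    file_line = f.readline()

token = nltk.word_tokenize(file_line)  # split the review into tokens
token_pos = nltk.pos_tag(token)        # tag each token with its part of speech

# English stopwords, plus the punctuation marks to exclude
stopWords = stopwords.words('english')
stopWords.append('.')
stopWords.append(',')

lemmatizer = nltk.wordnet.WordNetLemmatizer()

# Keep nouns (tags starting with 'N') that are not stopwords, lemmatized
result = []
for a, b in token_pos:
    if b.startswith('N'):
        if a.lower() not in stopWords:
            result.append(lemmatizer.lemmatize(a))
result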

Written this way, the code cannot read multiple reviews. Because readline() returns a single line, it reads only one review.

import nltk
from nltk.corpus import stopwords
from collections import Counter

# Read every line (every review) of the file into a list
with open('darkknight.txt', 'r', encoding='utf-8') as f:
    file_line = f.readlines()
print(file_line)

# English stopwords, plus the punctuation marks to exclude
stopWords = stopwords.words('english')
stopWords.append('.')
stopWords.append(',')

lemmatizer = nltk.wordnet.WordNetLemmatizer()
result = []

# Tokenize and POS-tag each review separately
for line in file_line:
    token = nltk.word_tokenize(line)
    token_pos = nltk.pos_tag(token)

    # Keep nouns (tags starting with 'N') that are not stopwords, lemmatized
    for a, b in token_pos:
        if b.startswith('N'):
            if a.lower() not in stopWords:
                result.append(lemmatizer.lemmatize(a))

print(result)

The file has to be tokenized line by line like this for every review to be read. The output will contain a very large number of nouns.
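The difference between readline() and readlines() can be checked without the real file. The sketch below uses io.StringIO as a stand-in for darkknight.txt; the three sample reviews are made up for illustration and are not taken from the actual data.

```python
import io

# Hypothetical stand-in for darkknight.txt: one review per line.
fake_file = io.StringIO(
    "Great movie.\n"
    "Ledger was amazing.\n"
    "Best Batman film.\n"
)

first_line = fake_file.readline()   # returns only the first line as a string
fake_file.seek(0)                   # rewind to the start of the file
all_lines = fake_file.readlines()   # returns every line as a list of strings

print(repr(first_line))  # 'Great movie.\n'
print(len(all_lines))    # 3
```

Looping over the list returned by readlines() is what lets the second version of the code process each review in turn.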

Now, out of all these nouns, let's use Counter to pick out the 10 that are used most often.

counter = Counter(result)
print(counter.most_common(10))

[('movie', 400), ('film', 283), ('Batman', 280), ('Joker', 180), ('Ledger', 123), ('Dark', 118), ('Knight', 114), ('time', 110), ('Heath', 106), ('performance', 87)]

The same approach can be used to print the most frequent nouns, adjectives, verbs, and so on.
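As a rough sketch of how the same idea extends to other parts of speech, the helper below takes the tag prefix as a parameter. In the Penn Treebank tagset that nltk.pos_tag uses, noun tags start with 'N', adjective tags with 'J', and verb tags with 'V'. The function name top_words_by_pos and the sample pairs are hypothetical, and the stopword filtering and lemmatization steps are left out to keep the example short.

```python
from collections import Counter

def top_words_by_pos(tagged_tokens, tag_prefix, n=10):
    """Return the n most common words whose POS tag starts with tag_prefix."""
    words = [word for word, tag in tagged_tokens if tag.startswith(tag_prefix)]
    return Counter(words).most_common(n)

# Hand-written (word, tag) pairs in the same shape that nltk.pos_tag returns.
sample = [
    ("great", "JJ"), ("movie", "NN"), ("great", "JJ"),
    ("watch", "VB"), ("dark", "JJ"), ("movie", "NN"),
]

print(top_words_by_pos(sample, "J"))  # [('great', 2), ('dark', 1)]
print(top_words_by_pos(sample, "V"))  # [('watch', 1)]
```

If lemmatization is added back for verbs or adjectives, note that WordNetLemmatizer.lemmatize treats words as nouns by default, so a pos argument such as 'v' or 'a' would be needed.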

Next, let's remove all the special characters from the sentences, keep only the word tokens, and check how many distinct tokens remain.

import nltk
from nltk.corpus import stopwords
from collections import Counter
import re

# Read every line (every review) of the file into a list
with open('darkknight.txt', 'r', encoding='utf-8') as f:
    file_line = f.readlines()
print(file_line)

stopWords = stopwords.words('english')
lemmatizer = nltk.wordnet.WordNetLemmatizer()
result = []

for line in file_line:
    token = nltk.word_tokenize(line)
    token_pos = nltk.pos_tag(token)

    for a, b in token_pos:
        if b.startswith('N'):
            if a.lower() not in stopWords:
                # Keep only tokens that start with a letter (drops punctuation and symbols)
                if re.match('^[a-zA-Z]+', a):
                    result.append(lemmatizer.lemmatize(a))

# Number of distinct tokens
print(len(set(result)))

Wrapping the list in set() removes the duplicate tokens first, so the number printed is the count of unique tokens.
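A small sketch of the filter and the set() step on a made-up token list (the tokens are hypothetical, not from the review file). It also shows two details worth knowing: re.match only checks the start of the token, so 'sci-fi' passes, and set() is case-sensitive, so 'Batman' and 'batman' count as two tokens. re.fullmatch is shown as a stricter alternative, not as what the code above uses.

```python
import re

# Hypothetical tokens, including punctuation, a number, and a hyphenated word.
tokens = ["Batman", "batman", "Joker", "--", "2008", "Joker", "sci-fi", "!"]

# Same filter as above: the token must start with at least one ASCII letter.
kept = [t for t in tokens if re.match('^[a-zA-Z]+', t)]

# Stricter alternative: the whole token must consist of ASCII letters.
strict = [t for t in tokens if re.fullmatch('[a-zA-Z]+', t)]

print(kept)              # ['Batman', 'batman', 'Joker', 'Joker', 'sci-fi']
print(len(kept))         # 5 tokens, duplicates included
print(len(set(kept)))    # 4 unique tokens
print(len(set(strict)))  # 3 unique tokens ('sci-fi' is excluded)
```

Lowercasing each token before adding it to the list would merge 'Batman' and 'batman' into one entry, if that is the count you want.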

Here are a couple of extra features that are good to know.

The first is a method that prints words used in similar contexts to a given word.

# tokens must be a list of word tokens from the reviews (this post does not show how it is built)
result = nltk.Text(tokens)

result.similar('batman')

The second shows which words tend to appear together.

# tokens must be a list of word tokens from the reviews (this post does not show how it is built)
result = nltk.Text(tokens)

result.collocations()

Dark Knight; Heath Ledger; Christian Bale; comic book; Harvey Dent;
Christopher Nolan; Bruce Wayne; Aaron Eckhart; Morgan Freeman; Gary
Oldman; Batman Begins; Two Face; Gotham City; Maggie Gyllenhaal;
Rachel Dawes; Michael Caine; special effect; Tim Burton; Jack
Nicholson; dark knight
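To get a feel for what "words that appear together" means, the sketch below counts adjacent word pairs with Counter. This is a simplification for illustration only: NLTK's collocations() ranks pairs with a statistical association score and filters out stopwords, so its results will differ from raw pair counts. The token list is made up.

```python
from collections import Counter

# Hypothetical token list standing in for the tokenized reviews.
tokens = [
    "Heath", "Ledger", "plays", "the", "Joker", "and",
    "Heath", "Ledger", "steals", "the", "movie",
]

# Pair each token with the token that follows it.
bigrams = list(zip(tokens, tokens[1:]))
bigram_counts = Counter(bigrams)

print(bigram_counts.most_common(1))  # [(('Heath', 'Ledger'), 2)]
```

Pairs such as 'Heath Ledger' or 'Dark Knight' rise to the top in the real output for the same basic reason: the two words keep showing up next to each other.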
