1. Cleansing
import re

def data_processing(text):
    # Remove special characters and punctuation
    text_re = re.sub('[-=+,#/\?:^$.@*\"※~&%ㆍ!』\\‘|\(\)\[\]\<\>`\'…》]', ' ', text)
    # Remove single characters left surrounded by spaces
    text_re = re.sub(r"\s+[a-zA-Z]\s+", ' ', text_re)
    # Collapse multiple spaces into one
    text_re = re.sub(r'\s+', ' ', text_re)
    return text_re
text = "DF+#$%^ $^&@$%} a "
data_processing(text)
"""
'DF } '
"""
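The three substitutions above can also be composed into a reusable pipeline of ordered steps. A minimal sketch (the character class here is a simplified assumption, not the full set used above):

```python
import re

# Ordered cleansing steps: (compiled pattern, replacement).
# The character class is a simplified assumption; extend it as needed.
STEPS = [
    (re.compile(r'[-=+,#/\?:^$.@*"~&%!|\(\)\[\]<>`\']'), ' '),  # specials -> space
    (re.compile(r'\s+[a-zA-Z]\s+'), ' '),                        # lone characters
    (re.compile(r'\s+'), ' '),                                   # collapse spaces
]

def clean(text):
    for pattern, repl in STEPS:
        text = pattern.sub(repl, text)
    return text

print(clean("Hello,   world! (a) test."))  # -> 'Hello world test '
```

The order matters: single-character removal relies on the surrounding whitespace that the first step introduces, and whitespace collapsing must come last.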
2. Tokenization
import nltk
nltk.download('punkt')  # sentence tokenizer model used by sent_tokenize
from nltk import sent_tokenize
text_sample = 'The Beluga XL is the successor to the Beluga, or Airbus A300-600ST, which has been in operation since 1995. \
Its design was adapted from an A330 airliner, with Airbus engineers lowering the flight deck and grafting a huge cargo bay onto the fuselage to create its distinctive shape.\
Through an upward-opening forward hatch on the "bubble," completed aircraft wings, fuselage sections and other components easily slide in and out.'
sentences = sent_tokenize(text = text_sample)
print(type(sentences),len(sentences))
print(sentences)
"""
<class 'list'> 2
['The Beluga XL is the successor to the Beluga, or Airbus A300-600ST,
which has been in operation since 1995.', 'Its design was adapted from an A330 airliner,
with Airbus engineers lowering the flight deck and grafting a huge cargo bay onto the
fuselage to create its distinctive shape.Through an upward-opening forward hatch on t
he "bubble," completed aircraft wings, fuselage sections and other components easily
slide in and out.']
"""
from nltk import word_tokenize
sentence = 'The Beluga XL is the successor to the Beluga, or Airbus A300-600ST, which has been in operation since 1995.'
words = word_tokenize(sentence)
print(type(words), len(words))
print(words)
"""
<class 'list'> 22
['The', 'Beluga', 'XL', 'is', 'the', 'successor', 'to', 'the', 'Beluga', ',', 'or', 'Airbus', 'A300-600ST', ',', 'which', 'has', 'been', 'in', 'operation', 'since', '1995', '.']
"""
3. Stopwords
import nltk
nltk.download('stopwords')  # the stopword list must be downloaded once
print("Number of English stopwords:", len(nltk.corpus.stopwords.words('english')))
print(nltk.corpus.stopwords.words('english')[0:30])
"""
Number of English stopwords: 179
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd",
'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers',
'herself', 'it', "it's", 'its', 'itself']
"""
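The list above is typically used to filter tokens after tokenization. A minimal sketch, using a small hand-picked subset here for illustration instead of the full `nltk.corpus.stopwords` list:

```python
# A small subset of English stopwords, hardcoded for illustration;
# in practice use nltk.corpus.stopwords.words('english')
stop_words = {'the', 'is', 'to', 'or', 'which', 'has', 'been', 'in', 'since'}

tokens = ['The', 'Beluga', 'XL', 'is', 'the', 'successor', 'to', 'the', 'Beluga']
# Lowercase before the membership check, since the list is all lowercase
filtered = [t for t in tokens if t.lower() not in stop_words]
print(filtered)  # -> ['Beluga', 'XL', 'successor', 'Beluga']
```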
4. Stemming
from nltk.stem import LancasterStemmer

stemmer = LancasterStemmer()
word_list = ['loved', 'loving', 'lovely']
for word in word_list:
    print(stemmer.stem(word))
"""
lov
lov
lov
"""
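Lancaster is the most aggressive of NLTK's stemmers, which is why all three words collapse to the truncated form `lov`. A quick comparison with the milder PorterStemmer on the same word list (a sketch):

```python
from nltk.stem import PorterStemmer, LancasterStemmer

porter = PorterStemmer()
lancaster = LancasterStemmer()

# Porter applies gentler suffix-stripping rules than Lancaster
for word in ['loved', 'loving', 'lovely']:
    print(word, '->', porter.stem(word), '/', lancaster.stem(word))
```

Porter keeps the stems closer to dictionary words (`love`), while Lancaster truncates harder; which is preferable depends on the downstream task.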
5. Lemmatization
import nltk
nltk.download('wordnet')  # WordNet data required by the lemmatizer
from nltk.stem import WordNetLemmatizer

lemma = WordNetLemmatizer()
print(lemma.lemmatize('amusing', 'v'))    # 'v' = verb
print(lemma.lemmatize('beautiful', 'a'))  # 'a' = adjective
print(lemma.lemmatize('fanciest', 'a'))
"""
amuse
beautiful
fancy
"""