Natural Language Processing - Preprocessing (English)

RosyPark 2020. 1. 15. 20:31

1. Cleansing

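import re

def data_processing(text):
    # Replace punctuation and special symbols with a space
    text_re = re.sub(r"[-=+,#/?:^$.@*\"※~&%ㆍ!』‘|()\[\]<>`'…》]", ' ', text)

    # Remove single letters that are surrounded by whitespace
    text_re = re.sub(r"\s+[a-zA-Z]\s+", ' ', text_re)

    # Collapse runs of whitespace into a single space
    text_re = re.sub(r'\s+', ' ', text_re)

    return text_re


text = "DF+#$%^ $^&@$%}      a       "
data_processing(text)

# '}' is not in the character class above, so it survives the cleansing step
"""
'DF } '
"""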
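
The single-letter pattern above only matches a letter that has whitespace on both sides, so a lone letter at the very start or end of the text is left in place. The sketch below is one possible variant, not part of the original post: `remove_single_letters` is a hypothetical helper that uses the word boundary `\b` instead, and it also trims the result.

```python
import re

def remove_single_letters(text):
    # \b matches at word boundaries, including the start and end of the string,
    # so isolated letters are caught even without whitespace on both sides.
    text = re.sub(r"\b[a-zA-Z]\b", " ", text)
    # Collapse whitespace runs and trim both ends.
    return re.sub(r"\s+", " ", text).strip()

sample = "a DF b } c"
print(re.sub(r"\s+[a-zA-Z]\s+", " ", sample))  # whitespace-delimited pattern: 'a DF } c'
print(remove_single_letters(sample))           # word-boundary pattern: 'DF }'
```

Both patterns also drop meaningful one-letter words such as "I" and "a" when they appear mid-sentence, so whether this step is appropriate depends on the task.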

2. Tokenization

from nltk import sent_tokenize
# sent_tokenize needs the Punkt models: nltk.download('punkt')
# (newer NLTK releases use 'punkt_tab' instead)

# Note: the second line of the string ends with "shape." and no trailing space,
# so the text contains "shape.Through" and the last two sentences are not split.
text_sample = 'The Beluga XL is the successor to the Beluga, or Airbus A300-600ST, which has been in operation since 1995. \
Its design was adapted from an A330 airliner, with Airbus engineers lowering the flight deck and grafting a huge cargo bay onto the fuselage to create its distinctive shape.\
Through an upward-opening forward hatch on the "bubble," completed aircraft wings, fuselage sections and other components easily slide in and out.'
sentences = sent_tokenize(text = text_sample)
print(type(sentences),len(sentences))
print(sentences)

"""
<class 'list'> 2
['The Beluga XL is the successor to the Beluga, or Airbus A300-600ST, which has been in operation since 1995.', 'Its design was adapted from an A330 airliner, with Airbus engineers lowering the flight deck and grafting a huge cargo bay onto the fuselage to create its distinctive shape.Through an upward-opening forward hatch on the "bubble," completed aircraft wings, fuselage sections and other components easily slide in and out.']
"""

from nltk import word_tokenize

sentence = 'The Beluga XL is the successor to the Beluga, or Airbus A300-600ST, which has been in operation since 1995.'
# Punctuation marks such as ',' and '.' become separate tokens
words = word_tokenize(sentence)
print(type(words), len(words))
print(words)

"""
<class 'list'> 22
['The', 'Beluga', 'XL', 'is', 'the', 'successor', 'to', 'the', 'Beluga', ',', 'or', 'Airbus', 'A300-600ST', ',', 'which', 'has', 'been', 'in', 'operation', 'since', '1995', '.']
"""

3. Stopwords

import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')
english_stopwords = stopwords.words('english')
print("Number of English stopwords:", len(english_stopwords))
print(english_stopwords[:30])

# The count can differ with other versions of the NLTK data
"""
Number of English stopwords: 179
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself']
"""
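
The list by itself does nothing until tokens are filtered against it. The sketch below shows one way to do that. `STOPWORDS` here is a small hand-picked subset and `remove_stopwords` is a hypothetical helper, both used only so the snippet runs without any downloads; in practice the set would come from `nltk.corpus.stopwords.words('english')`.

```python
# Hypothetical hand-picked subset so the snippet runs without downloads.
# In practice: STOPWORDS = set(nltk.corpus.stopwords.words('english'))
STOPWORDS = {"the", "is", "to", "or", "which", "has", "been", "in"}

def remove_stopwords(tokens, stopwords=STOPWORDS):
    # Compare in lowercase because the NLTK list is all lowercase.
    return [tok for tok in tokens if tok.lower() not in stopwords]

tokens = ['The', 'Beluga', 'XL', 'is', 'the', 'successor', 'to', 'the', 'Beluga']
print(remove_stopwords(tokens))
# ['Beluga', 'XL', 'successor', 'Beluga']
```

Using a `set` keeps each membership check constant-time, which matters once the token list gets long.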

4. Stemming

from nltk.stem import LancasterStemmer

stemmer = LancasterStemmer()
word_list = ['loved', 'loving', 'lovely']
for word in word_list:
    # All three inflected forms reduce to the same stem
    print(stemmer.stem(word))

"""
lov
lov
lov
"""
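
Stemmers get results like these by stripping suffixes with rules, not by looking words up in a dictionary. The toy function below is only an illustration of that idea under simplified assumptions: `toy_stem` and its suffix list are hypothetical and are not the Lancaster algorithm, which applies a much larger rule table repeatedly.

```python
def toy_stem(word, suffixes=("ing", "ed", "ly", "s")):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for word in ["loved", "loving", "lovely", "is"]:
    print(word, "->", toy_stem(word))
# loved -> lov
# loving -> lov
# lovely -> love
# is -> is
```

Because this toy version strips only one suffix, 'lovely' stops at 'love', whereas the Lancaster output above reduces it further to 'lov'.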

5. Lemmatization

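from nltk.stem import WordNetLemmatizer
# Requires the WordNet corpus: nltk.download('wordnet')

lemma = WordNetLemmatizer()
# The second argument is the part of speech: 'v' = verb, 'a' = adjective
print(lemma.lemmatize('amusing','v'))
print(lemma.lemmatize('beautiful','a'))
print(lemma.lemmatize('fanciest','a'))

"""
amuse
beautiful
fancy
"""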

Rosy's Artificial Intelligence Blog
