NLP: Tokenize : TypeError: expected string or bytes-like object
I also tried .apply(str) and .astype(str) before tokenization, yet I get TypeError: expected string or bytes-like object.
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8 entries, 0 to 7
Data columns (total 3 columns):
 #   Column           Non-Null Count  Dtype
---  ------           --------------  -----
 0   tag              8 non-null      object
 1   clean_patterns   8 non-null      object
 2   clean_responses  8 non-null      object
dtypes: object(3)
memory usage: 320.0+ bytes
I am trying to word_tokenize the data for an NLP chatbot.
print(word_tokenize(data))
TypeError                                 Traceback (most recent call last)
<ipython-input> in <module>
----> 1 print(word_tokenize(data))

D:\anaconda\lib\site-packages\nltk\tokenize\__init__.py in word_tokenize(text, language, preserve_line)
    128     :type preserve_line: bool
    129     """
--> 130     sentences = [text] if preserve_line else sent_tokenize(text, language)
    131     return [
    132         token for sent in sentences for token in _treebank_word_tokenizer.tokenize(sent)

D:\anaconda\lib\site-packages\nltk\tokenize\__init__.py in sent_tokenize(text, language)
    106     """
    107     tokenizer = load("tokenizers/punkt/{0}.pickle".format(language))
--> 108     return tokenizer.tokenize(text)
    109
    110

D:\anaconda\lib\site-packages\nltk\tokenize\punkt.py in tokenize(self, text, realign_boundaries)
   1272         Given a text, returns a list of the sentences in that text.
   1273         """
-> 1274         return list(self.sentences_from_text(text, realign_boundaries))
   1275
   1276     def debug_decisions(self, text):

D:\anaconda\lib\site-packages\nltk\tokenize\punkt.py in sentences_from_text(self, text, realign_boundaries)
   1326         follows the period.
   1327         """
-> 1328         return [text[s:e] for s, e in self.span_tokenize(text, realign_boundaries)]
   1329
   1330     def _slices_from_text(self, text):

D:\anaconda\lib\site-packages\nltk\tokenize\punkt.py in <listcomp>(.0)
   1326         follows the period.
   1327         """
-> 1328         return [text[s:e] for s, e in self.span_tokenize(text, realign_boundaries)]
   1329
   1330     def _slices_from_text(self, text):

D:\anaconda\lib\site-packages\nltk\tokenize\punkt.py in span_tokenize(self, text, realign_boundaries)
   1316         if realign_boundaries:
   1317             slices = self._realign_boundaries(text, slices)
-> 1318         for sl in slices:
   1319             yield (sl.start, sl.stop)
   1320

D:\anaconda\lib\site-packages\nltk\tokenize\punkt.py in _realign_boundaries(self, text, slices)
   1357         """
   1358         realign = 0
-> 1359         for sl1, sl2 in _pair_iter(slices):
   1360             sl1 = slice(sl1.start + realign, sl1.stop)
   1361             if not sl2:

D:\anaconda\lib\site-packages\nltk\tokenize\punkt.py in _pair_iter(it)
    314     it = iter(it)
    315     try:
--> 316         prev = next(it)
    317     except StopIteration:
    318         return

D:\anaconda\lib\site-packages\nltk\tokenize\punkt.py in _slices_from_text(self, text)
   1330     def _slices_from_text(self, text):
   1331         last_break = 0
-> 1332         for match in self._lang_vars.period_context_re().finditer(text):
   1333             context = match.group() + match.group("after_tok")
   1334             if self.text_contains_sentbreak(context):

TypeError: expected string or bytes-like object
Welcome ;)

Given the following dataframe data and the function word_tokenize, you must do the following.

import pandas as pd

def word_tokenize(sentence):
    return sentence.split()

data = pd.DataFrame(data={'col1': ['bar bar bar foo', 'foo foo foo bar', 124],
                          'col2': [12, 13, 14]})

Apply the function on col1 in the simplest way:

data['col1'].astype(str).apply(word_tokenize)
# output
0    [bar, bar, bar, foo]
1    [foo, foo, foo, bar]
2                   [124]
Name: col1, dtype: object

First change the type to str, then apply the function to each element. The output will be a pandas.core.series.Series.
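The same pattern carries over to the question's own frame: word_tokenize expects a single string, so it must be applied per element of a column, never to the whole DataFrame. A minimal sketch, using the column names from data.info() above (the row values are made up for illustration) and a whitespace split() as a stand-in for nltk.word_tokenize:

```python
import pandas as pd

def word_tokenize(sentence):
    # stand-in for nltk.tokenize.word_tokenize: whitespace split
    return sentence.split()

# Minimal frame mirroring the question's columns; values are illustrative
data = pd.DataFrame({
    'tag': ['greeting', 'goodbye'],
    'clean_patterns': ['hi there', 'see you later'],
    'clean_responses': ['hello how can i help', 'bye take care'],
})

# Tokenize one column element-wise; astype(str) guards against
# non-string entries, which would otherwise raise the TypeError
tokens = data['clean_patterns'].astype(str).apply(word_tokenize)
print(tokens.tolist())  # [['hi', 'there'], ['see', 'you', 'later']]
```

With the real NLTK tokenizer, the last line would instead be data['clean_patterns'].astype(str).apply(nltk.word_tokenize), one column at a time.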