re.sub : How to solve TypeError: expected string or bytes-like object
I have a dataframe named tweet with the following contents:
Id Text
0 1281015183687720961 @AngelaRuchTruck has @BubbaWallace beat, by fa...
1 1281015160803667968 I’m an old, white male. I marched in the 60s a...
2 1281014374744891392 This is me and I am saying #EnoughIsEnoughNS L...
3 1281014363193819139 The Ultimate Fighter Finale! Join in on the fu...
4 1281014339433095169 This #blm $hit is about done
... ... ...
12529 1279207822207725569 First thing I see, getting here #BLM #BLMDC #B...
12530 1279206857253543936 So here’s a thought for all of you #BLM people...
12531 1279206802035539969 #campingworld #Hamilton #BreakTheSilenceForSus...
12532 1279205845474127872 #Day 3.168 . . #artmenow #drawmenow #nodapl #n...
12533 1279205399535792128 Oh but wait ....... Breonna Taylor! #BreonnaTa...
I am trying to clean the text in tweet['Text'] with the following code:
tweet['cleaned_text'] = re.sub(r"(?:\@RT|http?\://|https?\://|www)\S+", "", tweet['Text'])
tweet['cleaned_text'] = re.sub(r'^RT[\s]+', '', tweet['cleaned_text'])
But I get this error:
~\AppData\Local\Continuum\anaconda3\lib\re.py in sub(pattern, repl, string, count, flags)
190 a callable, it's passed the Match object and must return
191 a replacement string to be used."""
--> 192 return _compile(pattern, flags).sub(repl, string, count)
193
194 def subn(pattern, repl, string, count=0, flags=0):
TypeError: expected string or bytes-like object
The suggested answer was to use the following code:
cleaned = []
txt = list(tweet['Text'])
for i in txt:
    cleaned.append(re.sub(r"(?:\@RT|http?\://|https?\://|www)\S+", "", i))
tweet['cleaned_text'] = cleaned
The code runs fine. However, tweet['cleaned_text'] is still not a string. For example, when I use the following code:
Blobtweet = TextBlob(tweet["cleaned_text"])
I get this error:
~\AppData\Local\Continuum\anaconda3\lib\site-packages\textblob\blob.py in __init__(self, text, tokenizer, pos_tagger, np_extractor, analyzer, parser, classifier, clean_html)
368 if not isinstance(text, basestring):
369 raise TypeError('The `text` argument passed to `__init__(text)` '
--> 370 'must be a string, not {0}'.format(type(text)))
371 if clean_html:
372 raise NotImplementedError("clean_html has been deprecated. "
TypeError: The `text` argument passed to `__init__(text)` must be a string, not <class 'pandas.core.series.Series'>
Or, with this code:
text=tweet['cleaned_text']
text = text.lower()
tokens = tokenizer.tokenize(text)
I get the following error:
AttributeError: 'Series' object has no attribute 'lower'
All of these examples work fine when I pass a single string.
tweet['cleaned_text'] returns a column (a pandas Series), not a string, so you have to iterate over each element of the column:
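import re

cleaned = []
txt = list(tweet['Text'])
for i in txt:
    # Remove tokens that start with @RT, http(s)://, or www.
    t = re.sub(r"(?:\@RT|http?\://|https?\://|www)\S+", "", i)
    # Remove a leading "RT" marker and the whitespace after it.
    cleaned.append(re.sub(r'^RT[\s]+', '', t))
tweet['cleaned_text'] = cleaned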
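The same per-element idea also covers the later steps from the question (lowercasing, tokenizing, TextBlob): each one expects a single string, so it has to be run once per row. Below is a minimal sketch of that, not a definitive implementation. The helper name clean_tweet, the sample strings, and the use of str.split() as a stand-in tokenizer are assumptions made for illustration; the two regex patterns are the ones used above. A plain list stands in for the column so the snippet runs without pandas.

```python
import re

# Same two patterns as in the answer above, compiled once for reuse.
URL_OR_MENTION = re.compile(r"(?:\@RT|http?\://|https?\://|www)\S+")
LEADING_RT = re.compile(r"^RT[\s]+")


def clean_tweet(text):
    """Clean one tweet; non-string values (e.g. None or NaN) become ''."""
    if not isinstance(text, str):
        return ""
    text = URL_OR_MENTION.sub("", text)
    return LEADING_RT.sub("", text)


# A plain list stands in for the values of tweet['Text'] (made-up samples).
sample_texts = ["RT check this https://example.com/x now", "This #blm post", None]
cleaned = [clean_tweet(t) for t in sample_texts]

# Downstream steps also work on one string at a time, never on the whole column.
# str.split() is only a stand-in for the tokenizer used in the question.
tokens = [t.lower().split() for t in cleaned]
```

With the dataframe from the question, the helper could be applied per row with tweet['cleaned_text'] = tweet['Text'].apply(clean_tweet). Lowercasing the whole column is available as tweet['cleaned_text'].str.lower(), and tweet['cleaned_text'].apply(TextBlob) should build one TextBlob per row instead of passing the Series itself.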