re.sub : How to solve TypeError: expected string or bytes-like object

Question

I have a dataframe named tweet that looks like this:

                        Id                                               Text
0      1281015183687720961  @AngelaRuchTruck has @BubbaWallace beat, by fa...
1      1281015160803667968  I’m an old, white male. I marched in the 60s a...
2      1281014374744891392  This is me and I am saying #EnoughIsEnoughNS L...
3      1281014363193819139  The Ultimate Fighter Finale! Join in on the fu...
4      1281014339433095169                       This #blm $hit is about done
...                    ...                                                ...
12529  1279207822207725569  First thing I see, getting here #BLM #BLMDC #B...
12530  1279206857253543936  So here’s a thought for all of you #BLM people...
12531  1279206802035539969  #campingworld #Hamilton #BreakTheSilenceForSus...
12532  1279205845474127872  #Day 3.168 . . #artmenow #drawmenow #nodapl #n...
12533  1279205399535792128  Oh but wait ....... Breonna Taylor! #BreonnaTa...

I am trying to clean the text in tweet['Text'] using the following code:

tweet['cleaned_text'] = re.sub(r"(?:\@RT|http?\://|https?\://|www)\S+", "", tweet['Text'])

tweet['cleaned_text'] = re.sub(r'^RT[\s]+', '', tweet['cleaned_text'])

But I get this error:

~\AppData\Local\Continuum\anaconda3\lib\re.py in sub(pattern, repl, string, count, flags)
    190     a callable, it's passed the Match object and must return
    191     a replacement string to be used."""
--> 192     return _compile(pattern, flags).sub(repl, string, count)
    193 
    194 def subn(pattern, repl, string, count=0, flags=0):

TypeError: expected string or bytes-like object

The suggested answer was to use the following code:

cleaned = []
txt = list(tweet['Text'])
for i in txt:
    cleaned.append(re.sub(r"(?:\@RT|http?\://|https?\://|www)\S+", "", i))
tweet['cleaned_text'] = cleaned

The code runs fine. However, tweet['cleaned_text'] is still not a string. For example, when I use the following code:

Blobtweet = TextBlob(tweet["cleaned_text"])

I get this error:

~\AppData\Local\Continuum\anaconda3\lib\site-packages\textblob\blob.py in __init__(self, text, tokenizer, pos_tagger, np_extractor, analyzer, parser, classifier, clean_html)
    368         if not isinstance(text, basestring):
    369             raise TypeError('The `text` argument passed to `__init__(text)` '
--> 370                             'must be a string, not {0}'.format(type(text)))
    371         if clean_html:
    372             raise NotImplementedError("clean_html has been deprecated. "

TypeError: The `text` argument passed to `__init__(text)` must be a string, not <class 'pandas.core.series.Series'>

Or, with this code:

text=tweet['cleaned_text']
text = text.lower()  
tokens = tokenizer.tokenize(text)

I get the following error:

AttributeError: 'Series' object has no attribute 'lower'

All of these examples work fine when I pass a single string.

Answer 1

tweet['cleaned_text'] returns a column (a pandas Series), not a string, so you have to iterate over each element of the column.

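import re

cleaned = []
for text in tweet['Text']:
    # Strip @RT mentions, URLs and www links from one tweet at a time
    t = re.sub(r"(?:\@RT|http?\://|https?\://|www)\S+", "", text)
    # Then drop a leading "RT " marker
    cleaned.append(re.sub(r'^RT[\s]+', '', t))
tweet['cleaned_text'] = cleaned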
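
As an alternative to the explicit loop, pandas can apply the same regexes to every element through the .str accessor, and the same idea probably covers the later TextBlob and lower() errors too: anything that expects one string has to be called per row. The sketch below is only an illustration. The sample dataframe and the lower_text and tokens column names are made up, and str.split stands in for TextBlob or your tokenizer so that the example has no extra dependencies.

```python
import pandas as pd

# Hypothetical sample data standing in for the question's `tweet` dataframe.
tweet = pd.DataFrame({
    "Text": [
        "RT  Check this out https://example.com/abc #BLM",
        "Visit www.example.org now",
        "This #blm $hit is about done",
    ]
})

# Same two patterns as in the loop above.
URL_PATTERN = r"(?:\@RT|http?\://|https?\://|www)\S+"
RT_PATTERN = r"^RT[\s]+"

# Series.str.replace applies the regex to each element of the column.
tweet["cleaned_text"] = (
    tweet["Text"]
    .str.replace(URL_PATTERN, "", regex=True)
    .str.replace(RT_PATTERN, "", regex=True)
)

# Element-wise string methods also go through the .str accessor,
# which avoids "'Series' object has no attribute 'lower'".
tweet["lower_text"] = tweet["cleaned_text"].str.lower()

# A function that expects a single string can be applied row by row.
# str.split is only a stand-in here (assumption); with TextBlob installed
# you would pass TextBlob, or your tokenizer's tokenize method, instead.
tweet["tokens"] = tweet["lower_text"].apply(str.split)

print(tweet[["cleaned_text", "tokens"]])
```

Each cell of tweet['cleaned_text'] is then an ordinary str, so something like tweet['cleaned_text'].apply(TextBlob) should work where TextBlob(tweet['cleaned_text']) raised the TypeError.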
