Problems cleaning tweets (emoticons, smileys, ...)
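I'm having trouble cleaning tweets. I have a process that saves tweets to a CSV file, and I then load that data into a pandas DataFrame.
x is a tweet from my DataFrame: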
'b\'RT @LBC: James O\\'Brien on Geoffrey Cox\\'s awaited legal advice: "We are waiting for a single unelected expert to tell us whether or not fore\xe2\x80\xa6\''
More tweets:
b'RT @suzannelynch1: Meanwhile in #Washington... Almost two dozen members of #Congress write to #TheresaMay on eve of #StPatricksDay visit wa\xe2\x80\xa6'
b"RT @KMTV_Kent: #KentTonight Poll:\nKent\'s MPs will be having their say on Theresa May\'s #Brexit deal today. @SirRogerGaleMP said he\'ll back\xe2\x80\xa6"
The result should look like this:
James O'Brien on Geoffrey Cox's awaited legal advice: "We are waiting for a single unelected expert to tell us whether or not for'
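(Keep the hashtags; just remove the UTF-8 escape sequences such as \xe2\x80\xa6.)
I want to clean this tweet. I tried using regular expressions with re.sub(my_regex), re.compile ...
I tried several different patterns: [\U00010000-\U0010ffff], r'@[A-Za-z0-9]+', https?://[A-Za-z0-9./]+
I also tried this: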
x.encode('ascii','ignore').decode('utf-8')
It does not work on x because of the double backslashes, but it does work when I do this:
'to tell us whether or not fore\xe2\x80\xa6'.encode('ascii','ignore').decode('utf-8')
which returns:
'to tell us whether or not fore'
Does anyone know how to clean these tweets? Thanks a lot!
See if this helps:
a = 'b\'RT @LBC: James O\\'Brien on Geoffrey Cox\\'s awaited legal advice: "We are waiting for a single unelected expert to tell us whether or not fore\xe2\x80\xa6\''
import re  # required for re.findall
# Collect each run of word characters that follows whitespace, a quote, or '#'
chars = re.findall(r"""[\s"'#]+\w+""", a)
''.join(chars)
Output:
James O'Brien on Geoffrey Cox's awaited legal advice: "We are waiting for a single unelected expert to tell us whether or not for'
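Another option, as a rough sketch: the values in the question look like bytes objects that were written to the CSV with str(), so each cell holds the text b'...' with literal backslash escapes rather than real bytes. Assuming that is the case, ast.literal_eval can turn the text back into bytes, which can then be decoded as UTF-8 and stripped of non-ASCII characters. The function name clean_tweet and the rule for dropping the leading "RT @user:" are my own assumptions, not something from the question.

```python
import ast
import re


def clean_tweet(raw: str) -> str:
    """Clean one CSV cell that may hold a stringified bytes literal like "b'RT @user: ...'"."""
    try:
        # Safely evaluate the literal; "b'...'" text becomes a real bytes object.
        value = ast.literal_eval(raw)
    except (ValueError, SyntaxError):
        # Not a Python literal, so treat the cell as ordinary text.
        value = raw

    if isinstance(value, bytes):
        text = value.decode("utf-8", errors="ignore")
    else:
        text = str(value)

    # Drop every non-ASCII character (the trailing ellipsis, emoji, and so on).
    text = text.encode("ascii", errors="ignore").decode("ascii")

    # Assumption: a leading "RT @user: " prefix is not wanted; hashtags are left alone.
    text = re.sub(r"^RT @\w+:\s*", "", text)
    return text.strip()


if __name__ == "__main__":
    x = r"""b'RT @LBC: James O\'Brien on Geoffrey Cox\'s awaited legal advice: "We are waiting for a single unelected expert to tell us whether or not fore\xe2\x80\xa6'"""
    print(clean_tweet(x))
```

For the first tweet this prints the text starting at "James O'Brien" and ending at "fore", with the \xe2\x80\xa6 ellipsis gone. On a DataFrame it could be applied with something like df["tweet"].map(clean_tweet), where "tweet" is a placeholder column name. Longer term, decoding the bytes before writing the CSV would avoid the problem at the source.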