Difficulties in removing characters and white space to tokenize text via Spacy

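I am testing the spaCy library, but I am having trouble cleaning up the text (removing characters and white space) before using the library.

I have managed to remove these elements to some extent. However, when I tokenize, I notice extra white space, in addition to terms such as "it's" being split into "it" and "s".

Here is my code and some sample text: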

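import pandas as pd
import spacy

# Assumption: the post does not say which pipeline was loaded; any English pipeline works here.
nlp = spacy.load("en_core_web_sm")

text1 = "[Intro] Well, alright [Chorus] Well, it's 1969, okay? All across the USA It's another year for me and you"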
text2 = "[Verse 1] For fifty years they've been married And they can't wait for their fifty-first to roll around"
text3 = "Passion that shouts And red with anger I lost myself Through alleys of mysteries I went up and down Like a demented train"
df = pd.DataFrame({'text':[text1, text2, text3]})
# Raw strings keep the regex escapes intact; the trailing \\ matches a literal backslash.
replacer = {'\n': ' ', r"\[.*?\]": " ", r'[!"#%\'()*+,\-./:;<=>?@\[\]^_`{|}~1234567890’”“′‘\\]': " "}
df['cleanText'] = df['text'].replace(replacer, regex=True)
df.head()
df['new_col'] = df['cleanText'].apply(lambda x: nlp(x))
df
#Output:
result1 = "  Well  alright   Well  it s       okay  All across the USA It s another year for me and you"
result2 = "  For fifty years they ve been married And they can t wait for their fifty first to roll around"
result3 = "Passion that shouts And red with anger I lost myself Through alleys of mysteries I went up and down Like a demented train"

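When I try to tokenize, I get, for example: ( , Well, , alright, ..., it, s, ...)

I used the same logic to remove characters before tokenizing with nltk, and it worked. Does anyone know what I might be doing wrong?

This regex pattern removes almost all of the extra white space, because I changed the replacement " " to "" and added ' +': ' ' at the end, like this: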

replacer = {'\n': '', r"\[.*?\]": "", r'[!"#%\'()*+,\-./:;<=>?@\[\]^_`{|}~1234567890’”“′‘\\]': "", ' +': ' '}
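
To see why the replacement string matters, here is a minimal sketch using only the standard `re` module. The snippet and the reduced character class are assumptions chosen for illustration; the full pattern above should behave the same way. Replacing each unwanted character with a space leaves one space per removed character, while replacing with an empty string leaves only the spaces that were already there.

```python
import re

# Illustrative snippet and a reduced character class (assumptions for this
# sketch, not the full pattern from the replacer above).
snippet = "Well, it's 1969, okay?"
unwanted = re.compile(r"[',?0-9]")

spaced = unwanted.sub(" ", snippet)       # one space per removed character
emptied = unwanted.sub("", snippet)       # removed characters leave nothing behind
collapsed = re.sub(r" +", " ", emptied)   # squeeze any remaining runs of spaces

print(repr(spaced))
print(repr(emptied))
print(repr(collapsed))
```

With the space replacement, "it's" turns into "it s" and the year leaves a long run of spaces, which matches the extra white space seen in the question.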

Then, after applying the regex pattern, call the strip() method to remove leading and trailing white space.

df['cleanText'] = df['cleanText'].apply(lambda x: x.strip())
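
The same steps can also be sketched as a plain function, which may be easier to check on a single string before applying it to a DataFrame column. The name `clean_lyrics` and the compiled pattern names are hypothetical and not part of the original code; the order of the steps follows the replacer dict plus the strip() call.

```python
import re

# Hypothetical helper (not from the original post): applies the cleaning
# steps in order to a single string, using only the standard library.
BRACKET_TAG = re.compile(r"\[.*?\]")  # section tags such as "[Intro]" or "[Verse 1]"
PUNCT_DIGITS = re.compile(r"[!\"#%'()*+,\-./:;<=>?@\[\]^_`{|}~0-9’”“′‘\\]")
MULTI_SPACE = re.compile(r" +")


def clean_lyrics(text: str) -> str:
    text = text.replace("\n", "")       # drop newlines
    text = BRACKET_TAG.sub("", text)    # drop bracketed section tags
    text = PUNCT_DIGITS.sub("", text)   # drop punctuation and digits
    text = MULTI_SPACE.sub(" ", text)   # collapse runs of spaces
    return text.strip()                 # trim leading and trailing spaces


print(clean_lyrics("[Intro] Well, alright [Chorus] Well, it's 1969, okay?"))
```

Under this assumption it could be applied with `df['text'].apply(clean_lyrics)`. Note that replacing with an empty string also joins hyphenated words, so "fifty-first" becomes "fiftyfirst".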

And when you define the new_col column using nlp():

df['new_col'] = df['cleanText'].apply(lambda x: nlp(x))
>>> df
                                                text                                          cleanText                                            new_col
0  [Intro] Well, alright [Chorus] Well, it's 1969...  Well alright Well its okay All across the USA ...  (Well, alright, Well, its, okay, All, across, ...
1  [Verse 1] For fifty years they've been married...  For fifty years theyve been married And they c...  (For, fifty, years, they, ve, been, married, A...
2  Passion that shouts And red with anger I lost ...  Passion that shouts And red with anger I lost ...  (Passion, that, shouts, And, red, with, anger,...
[3 rows x 3 columns]