Python Pandas - How to format and split text in a column?
I have a set of strings in a dataframe, as shown below:
ID TextColumn
1 This is line number one
2 I love pandas, they are so puffy
3 [This $tring is with specia| characters, yes it is!]
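A. I want to format this string to remove all special characters.
B. After formatting, I'd like to get a list of unique words (space is the only separator).
Here is the code I have written:
The get_df_by_id dataframe holds one selected row, say ID 3.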
#replace all special characters
formatted_title = get_df_by_id['title'].str.replace(r'[\-\!\@\#$\%\^\&\*\(\)\_\+\[\]\;\'\.\,\/\{\}\:\"\<\>\?]' , '')
# then split the words
results = set()
get_df_by_id['title'].str.lower().str.split().apply(results.update)
print results
But when I check the output, I can see that the special characters are still in the list.
Output
set([u'[this', u'is', u'it', u'specia|', u'$tring', u'is!]', u'characters,', u'yes', u'with'])
The expected output should look like this:
set([u'this', u'is', u'it', u'specia', u'tring', u'is', u'characters,', u'yes', u'with'])
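Why does the formatted dataframe still retain the special characters?
You have to assign the formatted values back to the same dataframe column: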
get_df_by_id['title'] = get_df_by_id['title'].str.replace(r'[\-\!\@\#$\%\^\&\*\(\)\_\+\[\]\;\'\.\,\/\{\}\:\"\<\>\?]' , '')
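To make the point above concrete, here is a minimal sketch showing that `str.replace` returns a new Series and leaves the original column untouched until the result is assigned back. The frame name `df` and the simplified pattern `[^A-Za-z\s]` are assumptions made for illustration, not the asker's exact code. `regex=True` is passed explicitly because pandas 2.0 and later treat the pattern as a literal string by default.

```python
import pandas as pd

# Hypothetical sample frame mirroring the question's data
df = pd.DataFrame({
    "ID": [1, 2, 3],
    "title": [
        "This is line number one",
        "I love pandas, they are so puffy",
        "[This $tring is with specia| characters, yes it is!]",
    ],
})

# str.replace returns a NEW Series; df["title"] itself is unchanged here
cleaned = df["title"].str.replace(r"[^A-Za-z\s]", "", regex=True)
still_dirty = df["title"].iloc[2].startswith("[")  # original column unchanged

# Assign the cleaned values back, then build the set of unique words
df["title"] = cleaned
results = set()
for words in df["title"].str.lower().str.split():
    results.update(words)

print(sorted(results))
```

In the question, the cleaned Series was stored in `formatted_title`, but the split was then run on the unmodified `get_df_by_id['title']`, which is why the special characters survived.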
I think you can first replace the special characters (I added \| to the end of the character class), then lower the text and split by \s+ (arbitrary whitespace). The output is a DataFrame, so you can stack it into a Series, drop_duplicates and finally call tolist:
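# regex=True is needed on pandas 2.0+, where str.replace matches literally by default
print (df['title'].str
                  .replace(r'[\-\!\@\#$\%\^\&\*\(\)\_\+\[\]\;\'\.\,\/\{\}\:\"\<\>\?\|]', '', regex=True)
                  .str
                  .lower()
                  .str
                  .split(r'\s+', expand=True)
                  .stack()
                  .drop_duplicates()
                  .tolist())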
['this', 'is', 'line', 'number', 'one', 'i', 'love', 'pandas', 'they', 'are',
'so', 'puffy', 'tring', 'with', 'specia', 'characters', 'yes', 'it']
If you want a list of unique words for each row:
>>> get_df_by_id['title'].str.replace(r'[^a-zA-Z\s]', '', regex=True).str.lower().str.split(r'\s+').apply(lambda x: list(set(x)))
0 [this, is, one, line, number]
1 [love, i, puffy, so, are, they, pandas]
2 [specia, this, is, it, characters, tring, yes, with]
Name: title, dtype: object
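Because `set` does not preserve order, the word order inside each list above can differ between runs. If first-appearance order matters, one possible variant is sketched below. The helper name `unique_words` and the Series `titles` are hypothetical, and `dict.fromkeys` is swapped in for `set` to deduplicate while keeping order.

```python
import re

import pandas as pd

# Hypothetical sample data mirroring the question
titles = pd.Series(
    [
        "This is line number one",
        "I love pandas, they are so puffy",
        "[This $tring is with specia| characters, yes it is!]",
    ],
    name="title",
)


def unique_words(text):
    """Lowercase, strip non-letters, split on whitespace, dedupe in order."""
    words = re.sub(r"[^a-z\s]", "", text.lower()).split()
    # dict keys keep insertion order, so duplicates drop out without reordering
    return list(dict.fromkeys(words))


per_row = titles.apply(unique_words)
print(per_row.tolist())
```

This trades the vectorised `.str` chain for a plain Python function, which is usually fine for small frames and keeps the result deterministic.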