Trimming specific words in a dataframe
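I have a df with some trigrams (and longer n-grams). I want to check whether each sentence starts or ends with a word from a specific list, and remove those rows from my df. For example: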
import pandas as pd
df = pd.DataFrame({'Trigrams+': ['because of tuna', 'to your family', 'pay to you', 'give you in','happy birthday to you'], 'Count': [10,9,8,7,5]})
list_remove = ['of','in','to', 'a']
print(df)
               Trigrams+  Count
0        because of tuna     10
1         to your family      9
2             pay to you      8
3            give you in      7
4  happy birthday to you      5
I tried using strip, but in the example above the first row would come back as "because of tun".
The output should look like this:
list_remove = ['of','in','to', 'a']
               Trigrams+  Count
0        because of tuna     10
1             pay to you      8
2  happy birthday to you      5
Can someone help me? Thanks in advance!
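On why strip gave "because of tun": assuming the attempt passed each word of list_remove to strip in turn (the question does not show the exact call), the likely cause is that strip treats its argument as a set of characters to remove from both ends, not as a whole word. A minimal sketch in plain Python, where the loop is only a guess at what was tried:

```python
list_remove = ['of', 'in', 'to', 'a']

text = 'because of tuna'
for word in list_remove:
    # strip() removes any leading/trailing characters that appear in `word`,
    # so strip('a') eats the final "a" of "tuna" even though "a" is not
    # a separate word there.
    text = text.strip(word)

print(text)  # because of tun
```

pandas' Series.str.strip follows the same character-set semantics, which is why a word-level comparison is needed instead.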
Try:
list_remove = ["of", "in", "to", "a"]
tmp = df["Trigrams+"].str.split()
df = df[~(tmp.str[0].isin(list_remove) | tmp.str[-1].isin(list_remove))]
print(df)
Prints:
               Trigrams+  Count
0        because of tuna     10
2             pay to you      8
4  happy birthday to you      5
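Note that the filtered frame keeps the original index labels (0, 2, 4), while the output asked for in the question is numbered 0, 1, 2. A minimal sketch of the same filter wrapped in a helper that also renumbers the rows; the function name drop_edge_words and its parameters are made up for illustration and are not part of pandas:

```python
import pandas as pd

def drop_edge_words(df, column, words):
    """Drop rows whose first or last whitespace-separated word is in `words`."""
    tokens = df[column].str.split()  # Series of word lists
    mask = tokens.str[0].isin(words) | tokens.str[-1].isin(words)
    # Keep the rows that do not match, then renumber the index from 0
    return df[~mask].reset_index(drop=True)

df = pd.DataFrame({
    'Trigrams+': ['because of tuna', 'to your family', 'pay to you',
                  'give you in', 'happy birthday to you'],
    'Count': [10, 9, 8, 7, 5],
})
list_remove = ['of', 'in', 'to', 'a']

result = drop_edge_words(df, 'Trigrams+', list_remove)
print(result)
```

Because only whole first and last words are compared, "because of tuna" survives even though it ends with the letter "a".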
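You could try something like this:
import numpy as np

def func(x):
    # Split once and look at the first and last words
    words = x.split()
    if words[0] in list_remove or words[-1] in list_remove:
        return np.nan  # mark the row for removal
    return x

df['Trigrams+'] = df['Trigrams+'].apply(func)
# Drop the marked rows and renumber the index
df = df.dropna().reset_index(drop=True)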