How to get the word case from the text while pattern matching in Python
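I have a data frame with two columns, Stg and Txt. The task is to check every Txt row for all of the words in the Stg column and write the matched words to a new column, keeping each word's case exactly as it appears in Txt.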
Example Code:
from pandas import DataFrame
import re

new = {'Stg': ['way', 'Early', 'phone', 'allowed', 'type', 'brand name'],
       'Txt': ['An early term', 'two-way allowed', 'New Phone feature that allowed', 'amazing universe', 'new day', 'the brand name is stage']
       }
df = DataFrame(new, columns=['Stg', 'Txt'])
my_list = df["Stg"].tolist()

def words_in_string(word_list, a_string):
    # Yield each word from word_list found in a_string (case-sensitive, once per word)
    word_set = set(word_list)
    pattern = r'\b({0})\b'.format('|'.join(word_list))
    for found_word in re.finditer(pattern, a_string):
        word = found_word.group(0)
        if word in word_set:
            word_set.discard(word)
            yield word
        if not word_set:
            # Every word has been found, so end the generator.
            # (raise StopIteration inside a generator is an error since Python 3.7)
            return

# Collect the matches for each Txt row into a list
df['new'] = df['Txt'].apply(lambda text: list(words_in_string(my_list, text)))
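The code above returns the words in the case used in the Stg column. Is there a way to get the case from Txt instead? Also, I want to match whole strings only, not substrings: for the text 'two-way', the current code returns the word 'way'.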
Current Output:
          Stg                             Txt             new
0         way                   An early term              []
1       Early                 two-way allowed  [way, allowed]
2       phone  New Phone feature that allowed       [allowed]
3     allowed                amazing universe              []
4        type                         new day              []
5  brand name         the brand name is stage    [brand name]
Expected Output:
          Stg                             Txt               new
0         way                   An early term           [early]
1       Early                 two-way allowed         [allowed]
2       phone  New Phone feature that allowed  [Phone, allowed]
3     allowed                amazing universe                []
4        type                         new day                []
5  brand name         the brand name is stage      [brand name]
You should use Series.str.findall with a negative lookbehind:
import pandas as pd
import re

new = {'Stg': ['way', 'Early', 'phone', 'allowed', 'type', 'brand name'],
       'Txt': ['An early term', 'two-way allowed', 'New Phone feature that allowed', 'amazing universe', 'new day', 'the brand name is stage']
       }
df = pd.DataFrame(new, columns=['Stg', 'Txt'])
# Raw f-string, so that \w and \b reach the regex engine unchanged
# (in a non-raw string, \b is a backspace character).
# The lookbehind rejects a term preceded by a letter or by one of - ; : , / |
pattern = "|".join(rf"\w*(?<![A-Za-z-;:,/|]){i}\b" for i in new["Stg"])
df["new"] = df["Txt"].str.findall(pattern, flags=re.IGNORECASE)
print(df)
Output:
          Stg                             Txt               new
0         way                   An early term           [early]
1       Early                 two-way allowed         [allowed]
2       phone  New Phone feature that allowed  [Phone, allowed]
3     allowed                amazing universe                []
4        type                         new day                []
5  brand name         the brand name is stage      [brand name]
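To see why a plain \b is not enough for 'two-way', and how lookarounds change that, here is a minimal sketch using only the re module. It is a variation on the pattern above, not the same pattern: build_pattern is a hypothetical helper, and it assumes that a hyphen on either side of a term should block a match and that terms may contain regex metacharacters (hence re.escape).

```python
import re

terms = ["way", "Early", "phone", "allowed", "type", "brand name"]
texts = [
    "An early term",
    "two-way allowed",
    "New Phone feature that allowed",
    "amazing universe",
    "new day",
    "the brand name is stage",
]

# \b sees a word boundary between "-" and "w", so "way" is found inside "two-way".
print(re.findall(r"\bway\b", "two-way allowed"))  # ['way']


def build_pattern(words):
    """Hypothetical helper: one case-insensitive pattern for all terms."""
    # Escape each term so regex metacharacters in it are matched literally;
    # put longer terms first so they win over shorter alternatives.
    escaped = sorted((re.escape(w) for w in words), key=len, reverse=True)
    # (?<![\w-]) : not preceded by a word character or a hyphen
    # (?![\w-])  : not followed by a word character or a hyphen
    return re.compile(
        r"(?<![\w-])(?:" + "|".join(escaped) + r")(?![\w-])",
        re.IGNORECASE,
    )


pattern = build_pattern(terms)
# findall returns the text as it appears in the input, so the case comes from Txt.
results = [pattern.findall(text) for text in texts]
for text, found in zip(texts, results):
    print(f"{text!r}: {found}")
```

On the data frame above, the same compiled pattern could be applied row by row with df["Txt"].apply(pattern.findall).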