Python get first and last value from string using dictionary key values
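I have some rather odd data. I have a dictionary of keys and values, and I want to use it to check whether those keywords appear only at the start and/or end of the text, not in the middle of the sentence. Below I create a simple data frame that shows the problem case, together with the Python code I have tried so far. How can I make it search only the beginning or end of the sentence? My current code searches the whole text for substrings.
Code: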
import pandas as pd

d = {'apple corp': 'Company', 'app': 'Application'}  # dictionary: keyword -> label
l1 = [1, 2, 3, 4]
l2 = [
    "The word Apple is commonly confused with Apple Corp which is a business",
    "Apple Corp is a business they make computers",
    "Apple Corp also writes App",
    "The Apple Corp also writes App"
]
df = pd.DataFrame({'id': l1, 'text': l2})
df['text'] = df['text'].str.lower()  # lowercase the text so it matches the lowercase keys
df
Original data frame:
id text
1 The word Apple is commonly confused with Apple Corp which is a business
2 Apple Corp is a business they make computers
3 Apple Corp also writes App
4 The Apple Corp also writes App
Code I tried:
def matcher(k):
    x = (i for i in d if i in k)
    # i.startswith(k) getting error
    return ';'.join(map(d.get, x))

df['text_value'] = df['text'].map(matcher)
df
Errors:
TypeError: 'in <string>' requires string as left operand, not bool
when I use x = (i for i in d if i.startswith(k) in k)
Empty values if I try x = (i for i in d if i.startswith(k) == True in k)
TypeError: sequence item 0: expected str instance, NoneType found
when I use x = (i.startswith(k) for i in d if i in k)
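The errors listed above can be reproduced outside pandas, which may make their cause easier to see. The following is a minimal sketch using one lowercased row of the sample text; the names `first_error`, `second_error`, `starts` and `ends` are illustrative only. It also shows that the operands are reversed in the attempts: the text should be tested against each key, not the key against the text.

```python
d = {'apple corp': 'Company', 'app': 'Application'}
k = "apple corp also writes app"  # one lowercased row of the sample text

# str.startswith returns a bool, so `<bool> in k` raises the first TypeError.
try:
    [i for i in d if i.startswith(k) in k]
    first_error = None
except TypeError as exc:
    first_error = str(exc)

# `a == True in k` is a chained comparison: (a == True) and (True in k).
# Here a is False, so it short-circuits and silently yields nothing.
empty = [i for i in d if i.startswith(k) == True in k]

# Yielding the bools themselves makes d.get return None for each of them,
# and ';'.join then fails on the first None.
try:
    ';'.join(map(d.get, (i.startswith(k) for i in d if i in k)))
    second_error = None
except TypeError as exc:
    second_error = str(exc)

# Corrected operand order: test the text against each key.
starts = [d[i] for i in d if k.startswith(i)]
ends = [d[i] for i in d if k.endswith(i)]
```

Note that `starts` contains both labels here, because 'app' is itself a prefix of 'apple corp'; any solution based on `startswith` has to decide which key wins.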
Result of my original code (the `i in k` version), which creates the new field 'text_value':
id text text_value
1 The word Apple is commonly confused with Apple Corp which is a business Company;Application
2 Apple Corp is a business they make computers Company;Application
3 Apple Corp also writes App Company;Application
4 The Apple Corp also writes App Company;Application
The final output I am trying to get:
id text text_value
1 The word Apple is commonly confused with Apple Corp which is a business NaN
2 Apple Corp is a business they make computers Company
3 Apple Corp also writes App Company;Application
4 The Apple Corp also writes App Application
You need a matcher function that accepts a flag, and then call it twice to get the startswith and endswith results.
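def matcher(s, flag="start"):
    # Return the label of the first key found at the start (or end) of s.
    if flag == "start":
        for i in d:
            if s.startswith(i):
                return d[i]
    else:
        for i in d:
            if s.endswith(i):
                return d[i]
    return None  # no key at the requested position

df['st'] = df['text'].apply(matcher)
df['ed'] = df['text'].apply(matcher, flag="end")
# Join the start and end labels of each row, skipping missing ones.
df['text_value'] = df[['st', 'ed']].apply(lambda x: ';'.join(x.dropna()), axis=1)
df = df[['id', 'text', 'text_value']]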
The text_value column then looks like this:
0
1 Company
2 Company;Application
3 Application
Name: text_value, dtype: object
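The flag-based approach above can also be folded into one function that checks both ends in a single call. This is only a sketch under a few assumptions that are not in the original answer: the name `match_ends` is hypothetical, longer keys are tried first so that 'apple corp' is preferred over its prefix 'app', and `float('nan')` stands in for the NaN shown in the desired output.

```python
d = {'apple corp': 'Company', 'app': 'Application'}

def match_ends(text, mapping=d):
    """Return the labels of the keys found at the start and at the end of text."""
    # Try longer keys first so 'apple corp' wins over its prefix 'app'.
    keys = sorted(mapping, key=len, reverse=True)
    start = next((mapping[k] for k in keys if text.startswith(k)), None)
    end = next((mapping[k] for k in keys if text.endswith(k)), None)
    labels = [label for label in (start, end) if label is not None]
    # NaN marks rows where no key sits at either end (assumption, see above).
    return ';'.join(labels) if labels else float('nan')

texts = [
    "the word apple is commonly confused with apple corp which is a business",
    "apple corp is a business they make computers",
    "apple corp also writes app",
    "the apple corp also writes app",
]
results = [match_ends(t) for t in texts]
```

With the data frame from the question, the analogous call would presumably be `df['text_value'] = df['text'].map(match_ends)`, which avoids the temporary 'st' and 'ed' columns.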
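# Alternative: a single regex with a callable replacement (it skips a leading "the").
joined = "|".join(d.keys())
# Raw strings keep \s and \b as regex escapes instead of Python string escapes.
pat = r'(?i)^(?:the\s*)?(' + joined + r')\b.*?|.*\b(' + joined + r')$' + '|.*'
get = lambda x: d.get(x.group(1), "") + (';' + d.get(x.group(2), "") if x.group(2) else '')
# pandas 2.0 and later default to regex=False, which rejects a callable replacement.
df.text.str.replace(pat, get, regex=True)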
0
1 Company
2 Company;Application
3 Company;Application
Name: text, dtype: object
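Both answers above can be combined into a variant that uses word boundaries without skipping a leading "the", so that a key only counts when it is a whole word at the very start or very end of the text. This is a hedged sketch rather than either answerer's method: `match_ends_re`, `start_re` and `end_re` are hypothetical names, and it relies only on the standard `re` module.

```python
import re

d = {'apple corp': 'Company', 'app': 'Application'}

# Longest keys first, escaped, so the alternation prefers 'apple corp' over 'app'.
keys = sorted(d, key=len, reverse=True)
alternation = '|'.join(re.escape(k) for k in keys)
start_re = re.compile(r'^(' + alternation + r')\b', re.IGNORECASE)
end_re = re.compile(r'\b(' + alternation + r')$', re.IGNORECASE)

def match_ends_re(text):
    """Join the labels of keys matched as whole words at the start and end of text."""
    found = []
    for pattern in (start_re, end_re):
        m = pattern.search(text)
        if m:
            # Keys are lowercase, so normalise the matched text before the lookup.
            found.append(d[m.group(1).lower()])
    return ';'.join(found)
```

Unlike a plain `startswith` check, the `\b` boundary means a text such as "apple pie is tasty" is not labelled Application just because it begins with the letters "app".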