Replace one word of a string if that word is within a specific number of words of another word
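I have a text column called 'DESCRIPTION' in a data frame. I need to find all instances where the word "tile" or "tiles" is within 6 words of the word "roof", and then change only the word "tile/s" to "rooftiles". I need to do the same for "floor" and "tiles" (changing "tiles" to "floortiles"). This will help differentiate which building trade we are looking at when certain words are used in conjunction with other words.
To illustrate what I mean, here is a sample of the data and my latest, incorrect attempt: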
s1=pd.Series(["After the storm the roof was damaged and some of the tiles are missing"])
s2=pd.Series(["I dropped the saw and it fell on the floor and damaged some of the tiles"])
s3=pd.Series(["the roof was leaking and when I checked I saw that some of the tiles were cracked"])
df=pd.DataFrame([list(s1), list(s2), list(s3)], columns = ["DESCRIPTION"])
df
The result I am looking for should look like this (in data frame format):
1. After the storm the roof was damaged and some of the rooftiles are missing
2. I dropped the saw and it fell on the floor and damaged some of the floortiles
3. the roof was leaking and when I checked I saw that some of the tiles were cracked
Here I tried to use a regex pattern to match the word "tiles", but it is completely wrong... Is there a way to do what I am trying to do? I am new to Python.
regex=r"(roof)\b\s+([^\s]+\s+){0,6}\b(.*tiles)"
replacedString=re.sub(regex, r"(roof)\b\s+([^\s]+\s+){0,6}\b(.*rooftiles)", df['DESCRIPTION'])
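Update: solution
Thanks for the help, everyone! I managed to get it working using Jan's code with a few additions and tweaks. The final working code is below (using the real file and data rather than the example):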
import re
import pandas as pd

# project_path, claims_filename and output_filename are set elsewhere (not shown)
claims_file = pd.read_csv(project_path + claims_filename)  # read input file
# some rows contained just 'NA', which was read in as NaN and caused errors later on
claims_file["LOSS_DESCRIPTION"] = claims_file["LOSS_DESCRIPTION"].fillna('NA')
# create the regex
rx = re.compile(r'''
    (                        # outer group
        \b(floor|roof)       # floor or roof
        (?:\W+\w+){0,6}\s*   # up to six "words"
    )
    \b(tiles?)\b             # tile or tiles
    ''', re.VERBOSE)
# create the reverse regex
rx2 = re.compile(r'''
    (                        # outer group
        \b(tiles?)           # tile or tiles
        (?:\W+\w+){0,6}\s*   # up to six "words"
    )
    \b(floor|roof)\b         # roof or floor
    ''', re.VERBOSE)
# apply it to every row of LOSS_DESCRIPTION: context, then floor/roof glued onto tile(s)
claims_file["LOSS_DESCRIPTION"] = claims_file["LOSS_DESCRIPTION"].apply(lambda x: rx.sub(r'\1\2\3', x))
# apply the reverse regex: floor/roof glued onto tile(s), context, then floor/roof itself
claims_file["LOSS_DESCRIPTION"] = claims_file["LOSS_DESCRIPTION"].apply(lambda x: rx2.sub(r'\3\1\3', x))
# write results into a CSV file and check them
claims_file.to_csv(project_path + output_filename, index=False, encoding='utf-8')
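I'll show you a quick-and-dirty, incomplete implementation. You can certainly make it more robust and useful. Let's say s is one of your descriptions: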
s = "I dropped the saw and it fell on the roof and damaged roof " +\
"and some of the tiles"
Let's first break it into words (tokenize; you can strip the punctuation if you want):
import nltk
tokens = nltk.word_tokenize(s)
Now, select the tokens of interest and sort them alphabetically, but remember their original positions in s:
my_tokens = sorted((w.lower(), i) for i, w in enumerate(tokens)
                   if w.lower() in ("roof", "tiles"))
#[('roof', 9), ('roof', 12), ('tiles', 17)]
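Combine identical tokens and build a dictionary with the tokens as keys and lists of their positions as values. Use a dictionary comprehension:
import itertools
token_dict = {name: [p0 for _, p0 in pos]
              for name, pos
              in itertools.groupby(my_tokens, key=lambda a: a[0])}
#{'roof': [9, 12], 'tiles': [17]}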
Go through the list of tiles positions, if there are any, and check whether a roof is nearby. If so, change the word:
for i in token_dict.get('tiles', []):
    for j in token_dict.get('roof', []):
        if abs(i-j) <= 6:
            tokens[i] = 'rooftiles'
Finally, put the words back together:
' '.join(tokens)
#'I dropped the saw and it fell on the roof and damaged roof and some of the rooftiles'
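The steps above can be wrapped into one function. This is only a sketch under a few assumptions: it splits on whitespace with str.split instead of nltk.word_tokenize, strips punctuation only for the comparison, and the names tag_tiles, anchors and max_between are made up for illustration. Note that abs(i-j) <= 6 above allows at most five words in between, so the first example from the question (six words between roof and tiles) would not be changed; the sketch counts the words in between instead.

```python
import string


def tag_tiles(text, anchors=("roof", "floor"), max_between=6):
    """Prefix tile/tiles with the closest anchor word that has at most
    `max_between` words between it and the tile token (either side)."""
    tokens = text.split()  # assumption: plain whitespace tokenization
    # lower-cased copies without surrounding punctuation, used only for comparisons
    words = [t.lower().strip(string.punctuation) for t in tokens]
    anchor_pos = [(i, w) for i, w in enumerate(words) if w in anchors]
    for i, w in enumerate(words):
        if w not in ("tile", "tiles"):
            continue
        # (distance, anchor) pairs for every anchor that is close enough
        near = [(abs(i - j), name) for j, name in anchor_pos
                if abs(i - j) - 1 <= max_between]
        if near:
            tokens[i] = min(near)[1] + tokens[i]  # closest anchor wins
    return " ".join(tokens)


print(tag_tiles("After the storm the roof was damaged and some of the tiles are missing"))
```

Because the distance is taken with abs(), a tile that comes before roof or floor is handled as well.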
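The main problem was the .* in front of tiles in your regex. It allows any number of arbitrary characters at that position and still matches. The \b anchors are unnecessary because they sit between whitespace and non-whitespace anyway. The groups () were not being used either, so I removed them.
r"roof\s+([^\s]+\s+){0,6}tiles" will match tiles only when it is within 6 "words" (runs of non-whitespace characters separated by whitespace) of roof. To do the replacement, take everything except the last 5 characters of the matched string, append "rooftiles", and replace the matched string with the updated one. Alternatively, group everything except tiles with () in the regex and replace that group with itself plus "roof". A re.sub call with a fixed replacement string is not enough here, because it would replace the entire match from roof to tiles rather than just the word tiles.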
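The grouping option can be sketched as follows. With group references in the replacement string, re.sub puts the captured context back unchanged and only adds "roof" in front of the tile word. The pattern below is an illustrative variant rather than the exact regex discussed above, and prefix_roof is a hypothetical helper name.

```python
import re

# Group 1: "roof" plus up to six following whitespace-separated words.
# Group 2: the word that should be renamed.
pattern = re.compile(r"\b(roof\s+(?:\S+\s+){0,6})(tiles?)\b")


def prefix_roof(text):
    # \g<1> restores the untouched context, then "roof" is glued onto \g<2>
    return pattern.sub(r"\g<1>roof\g<2>", text)


print(prefix_roof("After the storm the roof was damaged and some of the tiles are missing"))
```

The \g<1> form is used instead of \1 only to keep the group number visually separate from the literal text that follows it.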
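I could generalize this to more substrings than "roof" and "floor", but this seems like simpler code:
for idx, r in enumerate(df.loc[:, 'DESCRIPTION']):
    if "roof" in r and "tile" in r:
        fill = r[r.find("roof") + 4:]                  # text after the first "roof"
        end = fill.replace(' ', '_', 7).find(' ')      # position of the 8th space, -1 if none
        fill = fill if end == -1 else fill[:end]
        sixWords = fill if fill.find('.') == -1 else ''  # skip windows that cross a full stop
        df.loc[idx, 'DESCRIPTION'] = r.replace(sixWords, sixWords.replace("tile", "rooftile"))
    elif "floor" in r and "tile" in r:
        fill = r[r.find("floor") + 5:]                 # text after the first "floor"
        end = fill.replace(' ', '_', 7).find(' ')
        fill = fill if end == -1 else fill[:end]
        sixWords = fill if fill.find('.') == -1 else ''
        df.loc[idx, 'DESCRIPTION'] = r.replace(sixWords, sixWords.replace("tile", "floortile"))
Note that this also includes a check for a full stop ("."). You can remove that check by dropping the sixWords variable and using fill in its place.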
You can use a regex-based solution here:
(                        # outer group
    \b(floor|roof)       # floor or roof
    (?:\W+\w+){1,6}\s*   # one to six "words"
)
\b(tiles?)\b             # tile or tiles
See a demo for the regex on regex101.com.
Afterwards, just combine the captured parts and put them back together with rx.sub(), applying it to every item of the DESCRIPTION column, so that you end up with the following code:
import re
import pandas as pd

s1 = pd.Series(["After the storm the roof was damaged and some of the tiles are missing"])
s2 = pd.Series(["I dropped the saw and it fell on the floor and damaged some of the tiles"])
s3 = pd.Series(["the roof was leaking and when I checked I saw that some of the tiles were cracked"])
df = pd.DataFrame([list(s1), list(s2), list(s3)], columns=["DESCRIPTION"])
rx = re.compile(r'''
    (                        # outer group
        \b(floor|roof)       # floor or roof
        (?:\W+\w+){1,6}\s*   # one to six "words"
    )
    \b(tiles?)\b             # tile or tiles
    ''', re.VERBOSE)
# apply it to every row of "DESCRIPTION": context, then floor/roof glued onto tile(s)
df["DESCRIPTION"] = df["DESCRIPTION"].apply(lambda x: rx.sub(r'\1\2\3', x))
print(df["DESCRIPTION"])
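Note, since your original question was not entirely clear on this: the solution will only find tile or tiles after roof, meaning that a sentence like Can you give me the tile for the roof, please? will not be matched (even though the word tile is within six words of roof).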
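If both orders matter, one option is to run a second pattern for the case where tile comes first, which is the idea used in the update to the question. The following is a minimal sketch using only the standard library; the names WORDS, after, before and tag_both_directions are chosen here for illustration, and zero to six words in between are allowed as an assumption.

```python
import re

# zero to six "words", followed by the separator in front of the next word
WORDS = r"(?:\W+\w+){0,6}\W+"

# floor/roof ... tile(s): group 1 = context, group 2 = anchor, group 3 = tile(s)
after = re.compile(r"(\b(floor|roof)" + WORDS + r")\b(tiles?)\b")
# tile(s) ... floor/roof: group 1 = tile(s), group 2 = context, group 3 = anchor
before = re.compile(r"\b(tiles?)\b(" + WORDS + r")\b(floor|roof)\b")


def tag_both_directions(text):
    # first pass: keep the context and glue the anchor onto the tile word
    text = after.sub(r"\g<1>\g<2>\g<3>", text)
    # second pass: glue the later anchor onto the earlier tile word
    return before.sub(r"\g<3>\g<1>\g<2>\g<3>", text)


print(tag_both_directions("Can you give me the tile for the roof, please?"))
```

Words already rewritten in the first pass are not touched again, because \b(tiles?) cannot match inside rooftiles or floortiles.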