Search and replace specific strings with floating point representations in python
Question: I am trying to replace multiple specific sequences in a string with another, floating-point representation in Python.
I have an array of strings in a JSON file, which I load into my Python script via the json module.
The string array:
{
"LinesToReplace": [
"_ __ ___ ____ _____ ______ _______ ",
"_._ __._ ___._ ____._ _____._ ______._ ",
"_._ _.__ _.___ _.____ _._____ _.______ ",
"_._ __.__ ___.___ ____.____ _____._____ ",
"_. __. ___. ____. _____. ______. "
]
}
I load the JSON file via the json module:
with open("myFile.json") as jsonFile:
    data = json.load(jsonFile)
I am trying to replace the sequences of _ with specific substrings that describe a floating-point representation.
Specifications:
- The characters to find in a string are a single _ or a sequence of multiple _. The length of a _ sequence is unknown.
- If a single _ or a sequence of _ is followed by a . that is in turn followed by another single _ or sequence of _, then the . is part of the _ sequence. The . is used to specify decimals.
- If the . is not followed by a single _ or a sequence of _, the . does not belong to the _ sequence. A sequence of _ and . is replaced by a floating-point representation, e.g. %f1.0.
- The representation depends on the _ and . sequence.
Examples:
__ will be replaced by %f2.0.
_.___ will be replaced by %f1.3.
____.__ will be replaced by %f4.2.
___. will be replaced by %f3.0 (the trailing . is not part of the sequence and is kept).
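A minimal sketch of this per-token rule (the helper name token_to_format is illustrative, and it assumes at most one . per token, as in the examples above):

def token_to_format(token: str) -> str:
    # "____.__" -> integer part "____" and decimal part "__" -> %f4.2
    if "." in token and not token.endswith("."):
        int_part, dec_part = token.split(".", 1)
        return f"%f{len(int_part)}.{len(dec_part)}"
    # a trailing "." is not part of the sequence, so it stays outside
    if token.endswith("."):
        return f"%f{len(token) - 1}.0."
    return f"%f{len(token)}.0"

print(token_to_format("____.__"))  # %f4.2
print(token_to_format("___."))     # %f3.0.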
For the JSON file above, the result should be:
{
"ReplacedLines": [
"%f1.0 %f2.0 %f3.0 %f4.0 %f5.0 %f6.0 %f7.0 ",
"%f1.1 %f2.1 %f3.1 %f4.1 %f5.1 %f6.1 ",
"%f1.1 %f1.2 %f1.3 %f1.4 %f1.5 %f1.6 ",
"%f1.1 %f2.2 %f3.3 %f4.4 %f5.5 ",
"%f1.0. %f.0. %f3.0. %f4.0. %f5.0. %f6.0. "
]
}
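To end up with a file shaped like this, the replaced lines could be written back out with json.dump. A rough sketch, where replace_line stands in for whichever replacement approach is used and "myResult.json" is just an illustrative file name:

import json

with open("myFile.json") as jsonFile:
    data = json.load(jsonFile)

# replace_line is a placeholder for the actual replacement logic
replaced = [replace_line(line) for line in data["LinesToReplace"]]

with open("myResult.json", "w") as outFile:
    json.dump({"ReplacedLines": replaced}, outFile, indent=4)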
Some code that tries to replace a single _ with %f1.0 (this does not work...):
with open("myFile.json") as jsonFile:
    data = json.load(jsonFile)

strToFind = "_"
for line in data["LinesToReplace"]:
    for idl, l in enumerate(line):
        if (line[idl] == strToFind and line[idl+1] != ".") and (line[idl+1] != strToFind and line[idl-1] != strToFind):
            l = l[:idl] + "%f1.0" + l[idl+1:]  # replace string
Any ideas on how this can be done? I have also considered using regular expressions.
EDIT
The algorithm should also check whether a character actually is a "_", i.e. it should be able to format:
{
"LinesToReplace": [
"Ex1:_ Ex2:_. Ex3:._ Ex4:_._ Ex5:_._. ",
"Ex6:._._ Ex7:._._. Ex8:__._ Ex9: _.__ ",
"Ex10: _ Ex11: _. Ex12: ._ Ex13: _._ ",
"Ex5:._._..Ex6:.._._.Ex7:.._._._._._._._."
]
}
into the following solution:
{
"LinesToReplace": [
"Ex1:%f1.0 Ex2:%f1.0. Ex3:.%f1.0 Ex4:%f1.1 Ex5:%f1.1. ",
"Ex6:.%f1.1 Ex7:.%f1.1. Ex8:%f2.1 Ex9: %f1.2 ",
"Ex10: %f1.0 Ex11: %f1.0. Ex12: .%f1.0 Ex13: %f1.1 ",
"Ex5:.%f1.1..Ex6:..%f1.1.Ex7:..%f1.1.%f1.1.%f1.1.%f1.0."
]
}
I have attempted the following algorithm based on the criteria above, but I do not know how to implement the replacement step (a rough idea for continuing is sketched after the code):
def replaceFunc3(lines: list[str]) -> list[str]:
    result = []
    charToFind = '_'
    charMatrix = []
    # Find indices of all "_" in lines
    for line in lines:
        charIndices = [idx for idx, c in enumerate(line) if c == charToFind]
        charMatrix.append(charIndices)
    for (line, char) in zip(lines, charMatrix):
        if not char:  # No "_" in current line, append the whole line
            result.append(line)
        else:
            pass
            # result.append(Something)
            # TODO: Insert "%fx.x on all the placeholders"
    return result
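One possible way to continue from the per-line index lists is to group consecutive positions into runs; purely a sketch, with group_consecutive being a made-up helper name:

from itertools import groupby

def group_consecutive(indices: list[int]) -> list[list[int]]:
    # Within a run of consecutive indices, index - position_in_list is constant,
    # so groupby splits the list at every gap.
    groups = []
    for _, run in groupby(enumerate(indices), key=lambda t: t[1] - t[0]):
        groups.append([idx for _, idx in run])
    return groups

print(group_consecutive([0, 1, 2, 5, 6, 9]))  # [[0, 1, 2], [5, 6], [9]]

The runs would then still have to be merged across a single "." sitting between two runs, and the replacements inserted from right to left so the remaining indices stay valid.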
Neat question. Personally, I would do it this way:
from pprint import pprint

d = {
    "LinesToReplace": [
        "_ __ ___ ____ _____ ______ _______ ",
        "_._ __._ ___._ ____._ _____._ ______._ ",
        "_._ _.__ _.___ _.____ _._____ _.______ ",
        "_._ __.__ ___.___ ____.____ _____._____ ",
        "_. __. ___. ____. _____. ______. "
    ]
}

def get_replaced_lines(lines: list[str]) -> list[str]:
    result = []
    for line in lines:
        trimmed_line = line.rstrip()
        trailing_spaces = len(line) - len(trimmed_line)
        underscores = trimmed_line.split()
        repl_line = []
        for s in underscores:
            n = len(s)
            if '.' in s:
                if s.endswith('.'):
                    repl_line.append(f'%f{n - 1}.0.')
                else:
                    idx = s.index('.')
                    repl_line.append(f'%f{idx}.{n - idx - 1}')
            else:
                repl_line.append(f'%f{n}.0')
        result.append(' '.join(repl_line) + ' ' * trailing_spaces)
    return result

if __name__ == '__main__':
    pprint(get_replaced_lines(d['LinesToReplace']))
Output:
['%f1.0 %f2.0 %f3.0 %f4.0 %f5.0 %f6.0 %f7.0 ',
'%f1.1 %f2.1 %f3.1 %f4.1 %f5.1 %f6.1 ',
'%f1.1 %f1.2 %f1.3 %f1.4 %f1.5 %f1.6 ',
'%f1.1 %f2.2 %f3.3 %f4.4 %f5.5 ',
'%f1.0. %f2.0. %f3.0. %f4.0. %f5.0. %f6.0. ']
In case you're curious, I also timed it against an alternative regex approach and found the non-regex version to be roughly 40% faster overall. I like this test because it shows that, in general, regex tends to be a bit slower than doing the work by hand. The regex approach is nice, though, because it is definitely shorter :-)
Here is my test code:
import re
from timeit import timeit

d = {
    "LinesToReplace": [
        "_ __ ___ ____ _____ ______ _______ ",
        "_._ __._ ___._ ____._ _____._ ______._ ",
        "_._ _.__ _.___ _.____ _._____ _.______ ",
        "_._ __.__ ___.___ ____.____ _____._____ ",
        "_. __. ___. ____. _____. ______. "
    ]
}

def get_replaced_lines(lines: list[str]) -> list[str]:
    result = []
    dot = '.'
    space = ' '
    for line in lines:
        trimmed_line = line.rstrip()
        trailing_spaces = len(line) - len(trimmed_line)
        underscores = trimmed_line.split()
        repl_line = []
        for s in underscores:
            n = len(s)
            if dot in s:
                if s[n - 1] == dot:  # if last character is a '.'
                    repl_line.append(f'%f{n - 1}.0.')
                else:
                    idx = s.index(dot)
                    repl_line.append(f'%f{idx}.{n - idx - 1}')
            else:
                repl_line.append(f'%f{n}.0')
        result.append(space.join(repl_line) + space * trailing_spaces)
    return result

def get_replaced_lines_regex(lines_to_replace):
    return [re.sub(
        '(_+)([.]_+)?',
        lambda m: f'%f{len(m.group(1))}.{len(m.group(2) or ".")-1}',
        line,
    ) for line in lines_to_replace]

if __name__ == '__main__':
    n = 100_000
    time_1 = timeit("get_replaced_lines(d['LinesToReplace'])", number=n, globals=globals())
    time_2 = timeit("get_replaced_lines_regex(d['LinesToReplace'])", number=n, globals=globals())
    print(f'get_replaced_lines: {time_1:.3f}')
    print(f'get_replaced_lines_regex: {time_2:.3f}')
    print(f'The first (non-regex) approach is faster by {(1 - time_1 / time_2) * 100:.2f}%')
    assert get_replaced_lines(d['LinesToReplace']) == get_replaced_lines_regex(d['LinesToReplace'])
My results on an M1 Mac:
get_replaced_lines: 0.813
get_replaced_lines_regex: 1.359
The first (non-regex) approach is faster by 40.14%
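If the gap matters, one variant worth trying (not part of the timing above, so treat it as an untested sketch) is to pre-compile the pattern once and reuse it; re.sub with a string pattern already uses an internal cache, so any speedup is likely small:

import re

# Pre-compiled variant of the regex approach; gains, if any, should be modest
# because re.sub already caches compiled patterns internally.
PATTERN = re.compile('(_+)([.]_+)?')

def get_replaced_lines_regex_compiled(lines_to_replace):
    repl = lambda m: f'%f{len(m.group(1))}.{len(m.group(2) or ".")-1}'
    return [PATTERN.sub(repl, line) for line in lines_to_replace]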
You can use re.sub with a replacement function that performs the logic on the capture groups:
import re

def replace(line):
    return re.sub(
        '(_+)([.]_+)?',
        lambda m: f'%f{len(m.group(1))}.{len(m.group(2) or ".")-1}',
        line,
    )

lines = [replace(line) for line in lines_to_replace]
Explanation of the regex:
- (_+) matches one or more underscores; the () makes them available as a capture group (the first such group, i.e. m.group(1)).
- ([.]_+)? optionally matches a dot followed by one or more underscores (made optional by the trailing ?); the dot is placed inside a character class ([]) because otherwise it would have the special meaning "any character". The () makes this part available as the second capture group (m.group(2)).
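For example, applying the replace function above to one of the lines from the EDIT in the question gives exactly the expected result (shown here as a usage check, assuming the function as defined):

print(replace("Ex1:_ Ex2:_. Ex3:._ Ex4:_._ Ex5:_._. "))
# Ex1:%f1.0 Ex2:%f1.0. Ex3:.%f1.0 Ex4:%f1.1 Ex5:%f1.1.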