Stripping tabs, newlines, and spaces from string output, but leaving one space so that words are not joined
I have a list, list_3, containing lists of strings:
[['\n\n\n Headquarters or Regional Office\n\n\n\n\n\t\t\t\t\t\t\t\t\tMain Headquarters\t\t\t\t\t\t\t\n\n', '\n\n\n Founders\n\n\n\n\n\t\t\t\t\t\t\t\t\tThomas Lon Van\t\t\t\t\t\t\t\n\n', '\n\n\n Founder Diversity\n\n\n\n\n\t\t\t\t\t\t\t\t\tN/A\t\t\t\t\t\t\t\n\n', '\n\n\n Year Founded\n\n\n\n\n\t\t\t\t\t\t\t\t\t2016\t\t\t\t\t\t\t\n\n', '\n\n\n # of Employees\n\n\n\n\n\t\t\t\t\t\t\t\t\t1-10\t\t\t\t\t\t\t\n\n', '\n\n\n Seeking Funding?\n\n\n\n\n\t\t\t\t\t\t\t\t\tNo \t\t\t\t\t\t\t\n\n', '\n\n\n Funding Phase\n\n\n\n\n\t\t\t\t\t\t\t\t\tN/A\t\t\t\t\t\t\t\n\n'], ['\n\n\n Headquarters or Regional Office\n\n\n\n\n\t\t\t\t\t\t\t\t\tMain Headquarters\t\t\t\t\t\t\t\n\n', '\n\n\n Founders\n\n\n\n\n\t\t\t\t\t\t\t\t\tMacKenzie T Stout,\t\t\t\t\t\t\t\n\n', '\n\n\n Founder Diversity\n\n\n\n\n\t\t\t\t\t\t\t\t\tN/A\t\t\t\t\t\t\t\n\n', '\n\n\n Year Founded\n\n\n\n\n\t\t\t\t\t\t\t\t\t2020\t\t\t\t\t\t\t\n\n', '\n\n\n # of Employees\n\n\n\n\n\t\t\t\t\t\t\t\t\t1-10\t\t\t\t\t\t\t\n\n', '\n\n\n Seeking Funding?\n\n\n\n\n\t\t\t\t\t\t\t\t\tYes\t\t\t\t\t\t\t\n\n', '\n\n\n Funding Phase\n\n\n\n\n\t\t\t\t\t\t\t\t\tPre-Seed\t\t\t\t\t\t\t\n\n']]
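I want to use a regular expression to remove the \n, \t, and \r characters from the output and return the text in an easy-to-read format.
This is what I tried: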
import re

list_33 = []
for i in list_3:
    string = ''.join(i)
    # Removes all whitespace, including the single spaces between words
    list_33.append(re.sub(r'\s+', '', string))
print(list_33)
Output:
['HeadquartersorRegionalOfficeMainHeadquarters', 'FoundersThomasLonVan', 'FounderDiversityN/A', 'YearFounded2016', '#ofEmployees1-10', 'SeekingFunding?No', 'FundingPhaseN/A']
This is almost what I need, but I would like a single space between words and a colon after the first block of text in each string, i.e.:
['Headquarters or Regional Office: Main Headquarters', 'Founders: Thomas Lon Van', 'Founder Diversity: N/A', 'Year Founded: 2016', '# of Employees: 1-10', 'Seeking Funding?: No', 'Funding Phase: N/A']
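Any ideas on how to combine the two regex operations into one?
Thanks.
P.S. I know I don't need a for loop for a list with only one element, but the list will contain more elements in the future, and for now I am trying to generalize the code structure using a single input.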
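You can go through each string in the list and use re.sub to replace every run of two or more whitespace characters with ': '. The single spaces between words are left alone, and strip(': ') removes the leftover colon and space at each end: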
>>> import re
>>> lst = ['\n\n\n Headquarters or Regional Office\n\n\n\n\n\t\t\t\t\t\t\t\t\tMain Headquarters\t\t\t\t\t\t\t\n\n', '\n\n\n Founders\n\n\n\n\n\t\t\t\t\t\t\t\t\tThomas Lon Van\t\t\t\t\t\t\t\n\n', '\n\n\n Founder Diversity\n\n\n\n\n\t\t\t\t\t\t\t\t\tN/A\t\t\t\t\t\t\t\n\n', '\n\n\n Year Founded\n\n\n\n\n\t\t\t\t\t\t\t\t\t2016\t\t\t\t\t\t\t\n\n', '\n\n\n # of Employees\n\n\n\n\n\t\t\t\t\t\t\t\t\t1-10\t\t\t\t\t\t\t\n\n', '\n\n\n Seeking Funding?\n\n\n\n\n\t\t\t\t\t\t\t\t\tNo \t\t\t\t\t\t\t\n\n', '\n\n\n Funding Phase\n\n\n\n\n\t\t\t\t\t\t\t\t\tN/A\t\t\t\t\t\t\t\n\n']
>>> [re.sub(r'\s\s+', ': ', word).strip(': ') for word in lst]
['Headquarters or Regional Office: Main Headquarters', 'Founders: Thomas Lon Van', 'Founder Diversity: N/A', 'Year Founded: 2016', '# of Employees: 1-10', 'Seeking Funding?: No', 'Funding Phase: N/A']
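Since list_3 in the question is a list of lists, the same substitution can be applied one level deeper. The following is only a sketch: the helper names clean_field and clean_records are made up for illustration, and the sample data is a shortened version of the strings in the question.

```python
import re

def clean_field(text):
    """Turn each run of 2+ whitespace characters into ': ' and trim the ends."""
    # Single spaces between words do not match \s{2,}, so they are kept.
    return re.sub(r'\s{2,}', ': ', text).strip(': ')

def clean_records(records):
    """Apply clean_field to every string in a list of lists of strings."""
    return [[clean_field(field) for field in record] for record in records]

# Shortened sample data in the same shape as list_3 from the question.
list_3 = [
    ['\n\n\n Headquarters or Regional Office\n\n\n\n\n\t\t\tMain Headquarters\t\t\n\n',
     '\n\n\n Seeking Funding?\n\n\n\n\n\t\t\tNo \t\t\t\n\n'],
    ['\n\n\n Founders\n\n\n\n\n\t\t\tMacKenzie T Stout,\t\t\t\n\n',
     '\n\n\n Funding Phase\n\n\n\n\n\t\t\tPre-Seed\t\t\t\n\n'],
]

for record in clean_records(list_3):
    print(record)
```

Note that strip(': ') removes any leading or trailing colons and spaces, so a value that legitimately ends in a colon would lose it. Other trailing punctuation, such as the comma after 'MacKenzie T Stout,', is left in place.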