Split unicode by character into list
I have made a program that reads a series of names and then converts them to Unicode. Example:
StevensJohn:-:
WasouskiMike:-:
TimebombTime:-:
etc
Is there any way to create a list with one name per index, like this?
example_list = ["StevensJohn", "WasouskiMike", "TimebombTim"]
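This will be dynamic, so the number of names, and the names themselves, will depend on what the web scrape returns.
Any input would be greatly appreciated.
Code: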
results = unicode("""
Hospitality
Customer Care
Wick , John 12:00-20:00
Wick , John 10:00-17:00
Obama , Barack 06:00-14:00
Musk , Elon 07:00-15:00
Wasouski , Mike 06:30-14:30
Production
Fries
Piper , Billie 12:00-20:00
Tennent , David 06:30-14:30
Telsa, Nikola 11:45-17:00
Beverages & Desserts in a Dual Lane Drive-thru with a split beverage cell
Timebomb , Tim 06:30-14:30
Freeman , Matt 08:00-16:00
Cool , Tre 11:45-17:00
Sausage
Prestly , Elvis 06:30-14:30
Fat , Mike 06:30-14:30
Knoxville , Johnny 06:00-14:00
Man , Wee 05:00-12:00
Heartness , Jack 09:00-16:00
Breakfast BOP
Schofield , Phillip 06:30-14:15
Burns , George 06:30-14:15
Johnson , Boris 06:30-14:30
Milliband, Edd 06:30-14:30
Trump , Donald 10:00-17:00
Biden , Joe 08:00-16:00
Tempering & Prep
Clinton , Hillary 11:00-19:00
""")
for span in results:
    results = results.replace(',', '')
    results = results.replace(" ", "")
    results = results.replace("/r","")
    results = results.replace(":-:", "\r")
    results = ''.join([i for i in results if not i.isdigit()])
print(results)
import re

# Renamed from `input` so the built-in function is not shadowed.
text = 'StevensJohn:-:\nWasouskiMike:-:\nTimebombTime:-:\n'

class Names:
    def __init__(self, text, delimiter=':-:\n'):
        # Split on the delimiter and drop the empty string left after the last one.
        self.names = [x for x in re.split(delimiter, text) if x]
        self.different_names = set(self.names)

    def number_of_names(self):
        return len(self.names)

    def number_of_different_names(self):
        return len(self.different_names)

    def __str__(self):
        return str(self.names)

names = Names(text)
print(names)
print(names.number_of_names())
print(names.number_of_different_names())
unicode_ex = 'StevensJohn:-:\nWasouskiMike:-:\nTimebombTime:-:\n'
splitted = [name.replace(" ", "") for name in unicode_ex.split(":-:\n") if name]
print(splitted)
Output:
['StevensJohn', 'WasouskiMike', 'TimebombTime']
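Your edit reveals that this is really an XY problem. Successively trimming off small substrings will inevitably run into corner cases where a substring should sometimes not be removed. A common alternative is to use a regular expression.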
import re
# re.finditer yields one match object per "Surname , Given HH:MM-HH:MM" entry.
matches = [''.join([m.group(1), m.group(2)]) for m in re.finditer(r"([A-Za-z']+)\s*,\s*([A-Za-z'.]+)\s+\d+:\d+-\d+:\d+", results)]
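As a fuller illustration of the regular-expression approach, here is a minimal, self-contained sketch. It assumes Python 3 and uses a shortened copy of the sample text from the question; the names `NAME_PATTERN` and `distinct_names` are only illustrative. Section headings have no comma followed by a time range, so they are skipped without any explicit stripping.

```python
import re

# A shortened sample of the scraped text from the question.
results = """
Hospitality
Customer Care
Wick , John 12:00-20:00
Wick , John 10:00-17:00
Wasouski , Mike 06:30-14:30
Production
Telsa, Nikola 11:45-17:00
Beverages & Desserts in a Dual Lane Drive-thru with a split beverage cell
Timebomb , Tim 06:30-14:30
"""

# Surname, comma, given name, then a shift time range such as 06:30-14:30.
NAME_PATTERN = re.compile(r"([A-Za-z']+)\s*,\s*([A-Za-z'.]+)\s+\d+:\d+-\d+:\d+")

# One "SurnameGiven" string per matched line, duplicates included.
names = [m.group(1) + m.group(2) for m in NAME_PATTERN.finditer(results)]

# The set removes repeated names; sorting just makes the order stable.
distinct_names = sorted(set(names))

print(names)
print(len(names), len(distinct_names))
```

This gives both counts the question asks for: `len(names)` is the total number of entries and `len(distinct_names)` is the number of different names.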
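A better solution still would be to use the surrounding HTML structure to extract only the specific spans; most modern web sites use CSS selectors for formatting, and those are very useful for scraping too. But since we cannot see the original page you extracted this string from, this is entirely speculative.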
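To make the idea of using the HTML structure concrete, here is a rough sketch under heavy assumptions. The real page was never shown, so the markup in `SAMPLE_HTML`, the `employee` and `shift` class names, and the `EmployeeSpanParser` class are all hypothetical. It uses only the Python 3 standard library (`html.parser`) rather than a CSS-selector library, which would likely be shorter but would add a dependency.

```python
from html.parser import HTMLParser

# Hypothetical markup: the tag and class names are assumptions made
# purely for illustration, not taken from the real page.
SAMPLE_HTML = """
<div class="station">Customer Care</div>
<span class="employee">Wick , John</span><span class="shift">12:00-20:00</span>
<span class="employee">Musk , Elon</span><span class="shift">07:00-15:00</span>
"""


class EmployeeSpanParser(HTMLParser):
    """Collect the text of every <span class="employee"> element."""

    def __init__(self):
        super().__init__()
        self.names = []
        self._in_employee = False

    def handle_starttag(self, tag, attrs):
        # Only spans carrying the (assumed) "employee" class are of interest.
        if tag == "span" and dict(attrs).get("class") == "employee":
            self._in_employee = True

    def handle_endtag(self, tag):
        if tag == "span":
            self._in_employee = False

    def handle_data(self, data):
        if self._in_employee:
            # "Wick , John" -> "WickJohn"
            surname, given = (part.strip() for part in data.split(","))
            self.names.append(surname + given)


parser = EmployeeSpanParser()
parser.feed(SAMPLE_HTML)
print(parser.names)
```

Because the names are selected by their element rather than by stripping characters from a flat string, headings and shift times never need to be filtered out afterwards.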