Revert string to initial casing and punctuation (pandas)
Is there a way to modify this code so that it keeps its logic but restores the strings to their initial case and punctuation?
import pandas as pd

data = {'duplicate_column':["Adidas Women's Womens A004 Snow Boot", 'Amul Milk, 100ml, 100ML', 'L-OCCITANE L´Occitane CREMA MANI', 'Corneto Ice Cream Ice, 300 ml -300ml', 'Béaba BÉABA, Set di 6 Contenitori,set']}
df = pd.DataFrame(data)
punct = '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{}~´'
# translation table that deletes every punctuation character
transtab = str.maketrans(dict.fromkeys(punct, ''))
df['new_column'] = [
    ' '.join(dict.fromkeys(s.translate(transtab).lower().split()))
    for s in df['duplicate_column']
]
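This code removes the duplicate words in the 'duplicate_column' column and creates a new column with the result. I need the strings to keep their initial case and punctuation.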
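To see where the case and punctuation are lost, here is a minimal sketch of the same normalization applied to a single plain string, without pandas. The variable names `tokens` and `result` are illustrative only.

```python
# Minimal sketch: the question's normalization applied to one string.
punct = '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{}~´'
transtab = str.maketrans(dict.fromkeys(punct, ''))

s = 'Amul Milk, 100ml, 100ML'

# translate() deletes punctuation and lower() folds case *before* the
# deduplication step, so only the normalized tokens survive.
tokens = s.translate(transtab).lower().split()

# dict.fromkeys() keeps the first occurrence of each key, in insertion order.
result = ' '.join(dict.fromkeys(tokens))
print(result)  # amul milk 100ml
```

The original spelling is gone by the time the duplicates are dropped, so it cannot be recovered from `result` alone.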
Duplicate column (initial data):
Adidas Women's Womens A004 Snow Boot
Amul Milk, 100ml, 100ML
L-OCCITANE L´Occitane CREMA MANI
Corneto Ice Cream Ice, 300 ml -300ml
Béaba BÉABA, Set di 6 Contenitori,set
New column created with the code above:
adidas womens a004 snow boot
amul milk 100ml
loccitane crema mani
corneto ice cream 300 ml 300ml
béaba set di 6 contenitori set
Desired output:
Adidas Women's A004 Snow Boot
Amul Milk, 100ml
L-OCCITANE CREMA MANI
Corneto Ice Cream, 300 ml
Béaba, Set di 6 Contenitori
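Rather than fixing the output afterwards to restore the punctuation and case, don't discard them in the first place. You can use a custom function based on a set: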
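import re

# character class built from the punct string defined in the question
regex = re.compile('[%s]' % re.escape(punct))

def remove_dup(s):
    seen = set()   # normalized forms already encountered
    keep = []      # original words to keep, in order
    for w in s.split():
        # compare words on their lowercase, punctuation-free form
        w2 = regex.sub('', w.lower())
        if w2 in seen:
            continue
        seen.add(w2)
        keep.append(w)
    # drop punctuation left dangling at either end
    return ' '.join(keep).strip(punct)

df['new_column'] = list(map(remove_dup, df['duplicate_column']))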
Output:
   duplicate_column                       new_column
0  Adidas Women's Womens A004 Snow Boot   Adidas Women's A004 Snow Boot
1  Amul Milk, 100ml, 100ML                Amul Milk, 100ml
2  L-OCCITANE L´Occitane CREMA MANI       L-OCCITANE CREMA MANI
3  Corneto Ice Cream Ice, 300 ml -300ml   Corneto Ice Cream 300 ml -300ml
4  Béaba BÉABA, Set di 6 Contenitori,set  Béaba Set di 6 Contenitori,set
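The same idea can be sketched without pandas by using a dict keyed on the normalized word, so that the first original spelling is stored as the value. This is only an illustration of the technique, not the answer's exact code; the helper name `dedupe_keep_original` is hypothetical and the punctuation set is assumed to be the one from the question.

```python
import re

# Assumption: same punctuation set as in the question.
punct = '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{}~´'
strip_punct = re.compile('[%s]' % re.escape(punct))

def dedupe_keep_original(s):
    """Hypothetical helper: drop repeated words, keep the first original spelling."""
    first_seen = {}
    for word in s.split():
        # comparison key: lowercase, punctuation removed
        key = strip_punct.sub('', word.lower())
        # setdefault stores the word only when the key is new
        first_seen.setdefault(key, word)
    # dicts preserve insertion order, so the words come back in their original order
    return ' '.join(first_seen.values()).strip(punct)

print(dedupe_keep_original("Adidas Women's Womens A004 Snow Boot"))
# Adidas Women's A004 Snow Boot
print(dedupe_keep_original('Amul Milk, 100ml, 100ML'))
# Amul Milk, 100ml
```

Because the comparison key and the stored value are kept separate, nothing has to be restored afterwards.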
Alternative:
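import re

# whitespace or punctuation (punct is the string defined in the question)
pat = r'[\s%s]' % re.escape(punct)
regex = re.compile(pat)
# separators are captured so that split() returns them too;
# the lookahead avoids splitting before a possessive "s" or an "ml" unit
regex2 = re.compile(fr'({pat}+)(?!s\b|\s*ml\b)')

def remove_dup(s):
    seen = set()
    keep = []
    for w in regex2.split(s):
        if len(w) > 1:
            # words and multi-character separators: compare on the normalized form
            w2 = regex.sub('', w.lower())
            if w2 in seen:
                continue
            seen.add(w2)
            keep.append(w.strip())
        else:
            # single characters (e.g. one space) are kept unchanged
            keep.append(w)
    return ''.join(keep).strip(punct)

df['new_column'] = list(map(remove_dup, df['duplicate_column']))
print(df)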
Output:
   duplicate_column                       new_column
0  Adidas Women's Womens A004 Snow Boot   Adidas Women's A004 Snow Boot
1  Amul Milk, 100ml, 100ML                Amul Milk,100ml
2  L-OCCITANE L´Occitane CREMA MANI       L-OCCITANE L´ CREMA MANI
3  Corneto Ice Cream Ice, 300 ml -300ml   Corneto Ice Cream ,300 ml
4  Béaba BÉABA, Set di 6 Contenitori,set  Béaba ,Set di 6 Contenitori
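This alternative relies on the fact that `re.split` also returns the separators when the pattern contains a capturing group, which is why the pieces are joined with `''` rather than `' '`. A minimal sketch of that behaviour, using a simplified illustrative pattern (whitespace and commas only, not the full pattern from the answer):

```python
import re

# Illustrative pattern only: runs of whitespace and commas act as separators.
sep = re.compile(r'([\s,]+)')

# With a capturing group, split() returns words and separators alternately.
parts = sep.split('Amul Milk, 100ml')
print(parts)  # ['Amul', ' ', 'Milk', ', ', '100ml']

# Joining every piece with '' rebuilds the original string exactly.
rebuilt = ''.join(parts)
print(rebuilt)  # Amul Milk, 100ml
```

Since the separators are ordinary list items, they pass through the same loop as the words, which explains why some spacing around punctuation changes in the output above.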
Another version:
import re

# punctuation to ignore when comparing words
# (the range +-. also covers the comma and the hyphen)
remove_punct = re.compile(r"""[!"#$%&'()*+-./:;<=>?@[\]^_`{}~´]""")
millilitres = re.compile(r"(\d+)\s+(ml)", flags=re.I)

def remove_duplicates(x):
    # do some basic preprocessing:
    # treat commas as separators and join "300 ml" into "300ml"
    x = x.replace(",", " ")
    x = millilitres.sub(r"\1\2", x)

    words = x.split()
    words_without_punct = remove_punct.sub("", x).lower().split()

    dupl, out = set(), []
    # pair each original word with its normalized form
    for w, wwp in zip(words, words_without_punct):
        if wwp not in dupl:
            out.append(w)
            dupl.add(wwp)
    return " ".join(out)

df["new_column"] = df["duplicate_column"].apply(remove_duplicates)
print(df)
Prints:
   duplicate_column                       new_column
0  Adidas Women's Womens A004 Snow Boot   Adidas Women's A004 Snow Boot
1  Amul Milk, 100ml, 100ML                Amul Milk 100ml
2  L-OCCITANE L´Occitane CREMA MANI       L-OCCITANE CREMA MANI
3  Corneto Ice Cream Ice, 300 ml -300ml   Corneto Ice Cream 300ml
4  Béaba BÉABA, Set di 6 Contenitori,set  Béaba Set di 6 Contenitori
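The printed result for row 3 (`300ml`) suggests that the preprocessing step joins a number and its unit. A minimal sketch of that step in isolation, assuming the intent is to rewrite `300 ml` as `300ml` with backreferences (the variable names `joined` and `upper` are illustrative):

```python
import re

millilitres = re.compile(r"(\d+)\s+(ml)", flags=re.I)

# \1 is the number and \2 is the unit; the whitespace between them is dropped.
joined = millilitres.sub(r"\1\2", "Ice, 300 ml -300ml")
print(joined)  # Ice, 300ml -300ml

# re.I makes the unit case-insensitive, and the original casing is kept.
upper = millilitres.sub(r"\1\2", "250 ML bottle")
print(upper)  # 250ML bottle
```

After this step `300ml` and `-300ml` normalize to the same key, so the second one is dropped as a duplicate.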