Revert string to initial casing and punctuation (pandas)

Is there a way to modify this code so that it keeps its logic but restores the strings to their initial casing and punctuation?

import pandas as pd

data = {'duplicate_column':["Adidas Women's Womens A004 Snow Boot", 'Amul Milk, 100ml, 100ML', 'L-OCCITANE L´Occitane CREMA MANI', 'Corneto Ice Cream Ice, 300 ml -300ml', 'Béaba BÉABA, Set di 6 Contenitori,set']}
df = pd.DataFrame(data)
punct = '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{}~´'
transtab = str.maketrans(dict.fromkeys(punct, ''))

# Strip punctuation, lowercase, then drop repeated words (dict keys keep first-seen order)
df['new_column'] = [
    ' '.join(dict.fromkeys(s.translate(transtab).lower().split()))
    for s in df['duplicate_column']
]

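This code removes the duplicated words in the column 'duplicate_column' and creates a new column with the result. I need the strings restored to their initial casing and punctuation.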

duplicate_column (initial data):

Adidas Women's Womens A004 Snow Boot
Amul Milk, 100ml, 100ML
L-OCCITANE L´Occitane CREMA MANI
Corneto Ice Cream Ice, 300 ml -300ml
Béaba BÉABA, Set di 6 Contenitori,set

New column created with the code above:

adidas womens a004 snow boot
amul milk 100ml
loccitane crema mani
corneto ice cream 300 ml 300ml
béaba set di 6 contenitori set

Desired output:

Adidas Women's A004 Snow Boot
Amul Milk, 100ml
L-OCCITANE CREMA MANI
Corneto Ice Cream, 300 ml
Béaba, Set di 6 Contenitori

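Rather than fixing the output afterwards to restore the punctuation and case, don't discard them in the first place. You can use a custom function based on a set: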

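import re
regex = re.compile('[%s]' % re.escape(punct))
def remove_dup(s):
    seen = set()   # normalized forms already encountered
    keep = []      # original words, in order of first appearance
    for w in s.split():
        # compare on a lowercase, punctuation-free key, but keep the original word
        w2 = regex.sub('', w.lower())
        if w2 in seen:
            continue
        seen.add(w2)
        keep.append(w)
    # trim punctuation left dangling at either end of the result
    return ' '.join(keep).strip(punct)

df['new_column'] = list(map(remove_dup, df['duplicate_column']))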

Output:

                        duplicate_column                       new_column
0   Adidas Women's Womens A004 Snow Boot    Adidas Women's A004 Snow Boot
1                Amul Milk, 100ml, 100ML                 Amul Milk, 100ml
2       L-OCCITANE L´Occitane CREMA MANI            L-OCCITANE CREMA MANI
3   Corneto Ice Cream Ice, 300 ml -300ml  Corneto Ice Cream 300 ml -300ml
4  Béaba BÉABA, Set di 6 Contenitori,set   Béaba Set di 6 Contenitori,set

Alternative:

import re
pat = r'[\s%s]' % re.escape(punct)
regex = re.compile(pat)
regex2 = re.compile(fr'({pat}+)(?!s\b|\s*ml\b)')

def remove_dup(s):
    seen = set()
    keep = []
    for w in regex2.split(s):
        if len(w)>1:
            w2 = regex.sub('', w.lower())
            if w2 in seen:
                continue 
            seen.add(w2)
            keep.append(w.strip())
        else:
            keep.append(w)
    return ''.join(keep).strip(punct)
        
df['new_column'] = list(map(remove_dup, df['duplicate_column']))

print(df)

Output:

                        duplicate_column                      new_column
0   Adidas Women's Womens A004 Snow Boot  Adidas Women's  A004 Snow Boot
1                Amul Milk, 100ml, 100ML                 Amul Milk,100ml
2       L-OCCITANE L´Occitane CREMA MANI        L-OCCITANE L´ CREMA MANI
3   Corneto Ice Cream Ice, 300 ml -300ml       Corneto Ice Cream ,300 ml
4  Béaba BÉABA, Set di 6 Contenitori,set     Béaba ,Set di 6 Contenitori
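
In rows 0, 1, 3 and 4 the only remaining differences from the desired output are a doubled space and the spacing around commas. A minimal sketch of a cleanup pass that could be run on the result; `tidy` is a hypothetical helper name that does not appear in the code above, and it assumes a comma should never be preceded by a space and should always be followed by exactly one:

```python
import re

def tidy(s):
    # Assumption: normalize every comma to "no space before, one space after".
    s = re.sub(r"\s*,\s*", ", ", s)
    # Collapse runs of whitespace left behind where a duplicate was removed.
    s = re.sub(r"\s{2,}", " ", s)
    return s.strip()

print(tidy("Corneto Ice Cream ,300 ml"))
```

It could be applied with something like `df['new_column'] = df['new_column'].map(tidy)`. It does not address the leftover `L´` in row 2, which comes from the splitting step rather than from spacing.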

Another version:

import re

remove_punct = re.compile(r"""[!"#$%&'()*+,\-./:;<=>?@\[\]^_`{}~´]""")
millilitres = re.compile(r"(\d+)\s+(ml)", flags=re.I)


def remove_duplicates(x):
    # basic preprocessing: treat commas as separators and join "300 ml" into "300ml"
    x = x.replace(",", " ")
    x = millilitres.sub(r"\1\2", x)

    words = x.split()
    words_without_punct = remove_punct.sub("", x).lower().split()
    dupl, out = set(), []
    for w, wwp in zip(words, words_without_punct):
        if wwp not in dupl:
            out.append(w)
            dupl.add(wwp)
    return " ".join(out)


df["new_column"] = df["duplicate_column"].apply(remove_duplicates)
print(df)

Prints:

                        duplicate_column                     new_column
0   Adidas Women's Womens A004 Snow Boot  Adidas Women's A004 Snow Boot
1                Amul Milk, 100ml, 100ML                Amul Milk 100ml
2       L-OCCITANE L´Occitane CREMA MANI          L-OCCITANE CREMA MANI
3   Corneto Ice Cream Ice, 300 ml -300ml        Corneto Ice Cream 300ml
4  Béaba BÉABA, Set di 6 Contenitori,set     Béaba Set di 6 Contenitori
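
One caveat with pairing `words` and `words_without_punct` through `zip`: if a token consists only of punctuation (a lone `-`, for example), it disappears from the stripped list, the two lists end up with different lengths, and words get paired with the wrong keys. A hedged variant that computes the key word by word avoids this. `remove_duplicates_per_word` is a hypothetical name, and dropping punctuation-only tokens is an assumption about the desired behavior:

```python
import re

remove_punct = re.compile(r"""[!"#$%&'()*+,\-./:;<=>?@\[\]^_`{}~´]""")
millilitres = re.compile(r"(\d+)\s+(ml)", flags=re.I)

def remove_duplicates_per_word(x):
    # Same preprocessing idea: commas become separators, "300 ml" becomes "300ml".
    x = millilitres.sub(r"\1\2", x.replace(",", " "))
    seen, out = set(), []
    for w in x.split():
        key = remove_punct.sub("", w).lower()
        # Assumption: skip tokens that are pure punctuation, and skip repeats.
        if not key or key in seen:
            continue
        seen.add(key)
        out.append(w)
    return " ".join(out)

print(remove_duplicates_per_word("Corneto Ice Cream Ice, 300 ml -300ml"))
```

On the five sample rows this should behave like the version above; the difference only shows up on inputs that contain standalone punctuation.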