Keep first occurrence while removing duplicates in pandas

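I need to remove duplicate words case-insensitively, keeping the first occurrence and preserving the word order of the sentence. This needs to be done on every row of a column.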

Initial format:                                        How the output should look:
col_sentence                                                 col_sentence
paper Plastic aluminum paper                                 paper Plastic aluminum 
paper Plastic aluminum Paper                                 paper Plastic aluminum 
Paper tin glass tin PAPER                                    Paper tin glass 
Paper tin glass Paper-tin                                    Paper tin glass

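Can this be done in Python? I have written a function that works and removes duplicates, but only by converting everything to lowercase and changing the word order, which is not acceptable in my case.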

string = "paper Plastic aluminum Paper"
set_string = list()
for s in string.split(' '):
    if s not in set_string:
        set_string.append(s)
    
string = ' '.join(set_string)
print(string)
# output: paper Plastic aluminum Paper

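Here is a sample Python program that keeps the first occurrence of each word and removes the others. You can wrap it in a function and apply it to each row/column.

Note: Python 3.7+ is required, because that is the first version in which dicts are guaranteed to preserve insertion order.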

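import re

def unique_only(sentence):
    # Split on runs of non-word characters (spaces, hyphens, punctuation).
    words = re.split(r'\W+', sentence)
    # Map each lowercased word to its first-seen spelling.
    # Dicts keep insertion order in Python 3.7+.
    unique_words = {}
    for word in words:
        key = word.lower()
        if key not in unique_words:
            unique_words[key] = word
    return ' '.join(unique_words.values())

# Apply to the column (DataFrame.applymap is deprecated since pandas 2.1).
df['col_sentence'] = df['col_sentence'].apply(unique_only)
print(df)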

Sample input:

                   col_sentence
0  paper Plastic aluminum paper
1  paper Plastic aluminum Paper
2     Paper tin glass tin PAPER
3     Paper tin glass Paper-tin

Output:

             col_sentence
0  paper Plastic aluminum
1  paper Plastic aluminum
2         Paper tin glass
3         Paper tin glass
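
One edge case worth knowing: when a sentence starts or ends with punctuation or whitespace, re.split returns an empty string at that end, which can leave a stray space in the joined result. A minimal sketch that sidesteps this by collecting word tokens with re.findall instead of splitting; the function name unique_words is hypothetical, and it assumes a "word" is any run of \w characters:

```python
import re

def unique_words(sentence):
    """Keep the first occurrence of each word, ignoring case."""
    seen = set()   # lowercased words already emitted
    kept = []      # words in their original spelling and order
    # findall returns only the word tokens, so there are no empty strings
    # even if the sentence begins or ends with punctuation.
    for word in re.findall(r"\w+", sentence):
        key = word.lower()
        if key not in seen:
            seen.add(key)
            kept.append(word)
    return " ".join(kept)

print(unique_words("paper Plastic aluminum Paper."))
# paper Plastic aluminum
```

It can be applied to a column the same way, for example with df['col_sentence'].apply(unique_words).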

Assuming that "-" and " " are the only word separators in your column, try this:

def uniqueList(row):
    words = row.split(" ")
    unique = words[0]
    for w in words:
        if w.lower() not in unique.lower():
            unique = unique + " " + w
    return unique

data["col_sentence"].str.replace("-", " ").apply(uniqueList)

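Edit (incorporating @im0j's suggestion): to avoid partial string matches (for example, "pap" being treated as a duplicate because it is contained in "paper"), change the function to the following: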

def uniqueList_full(row):
    words = row.split(" ")
    unique = [words[0]]
    for w in words:
        if w.lower() not in [u.lower() for u in unique]:
            unique = unique + [w]
    return " ".join(unique)

Another approach is to use OrderedDict. Note that this comparison is case-sensitive, as the output below shows:

import pandas as pd
from collections import OrderedDict

df = pd.DataFrame(data = {'col_sentence':['paper Plastic aluminum paper','paper Plastic aluminum Paper','Paper tin glass tin PAPER','Paper tin glass Paper-tin']})
df.col_sentence.apply(lambda x: ' '.join(list(OrderedDict.fromkeys(x.replace('-', ' ').split()))))
0          paper Plastic aluminum
1    paper Plastic aluminum Paper
2           Paper tin glass PAPER
3                 Paper tin glass
Name: col_sentence, dtype: object
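
To make the OrderedDict approach ignore case, one option is to key the dict on a case-folded copy of each word and store the original spelling as the value. This is only a sketch under the same assumption that "-" and whitespace are the only separators; the function name dedupe_case_insensitive is hypothetical:

```python
from collections import OrderedDict

def dedupe_case_insensitive(sentence):
    """Drop later repeats of a word regardless of case, keeping order."""
    first_seen = OrderedDict()
    for word in sentence.replace("-", " ").split():
        # setdefault stores the word only if its case-folded key is new,
        # so the first spelling encountered is the one that is kept.
        first_seen.setdefault(word.casefold(), word)
    return " ".join(first_seen.values())

for s in ["paper Plastic aluminum paper",
          "paper Plastic aluminum Paper",
          "Paper tin glass tin PAPER",
          "Paper tin glass Paper-tin"]:
    print(dedupe_case_insensitive(s))
# paper Plastic aluminum
# paper Plastic aluminum
# Paper tin glass
# Paper tin glass
```

With pandas this could be used as df.col_sentence.apply(dedupe_case_insensitive).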