Keep first occurrence while removing duplicates in pandas
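I need to remove duplicate words case-insensitively, keeping the first occurrence and preserving the word order of the sentence.
This needs to be done on every row of a column.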
Initial format:                  How the output should look:
col_sentence                     col_sentence
paper Plastic aluminum paper     paper Plastic aluminum
paper Plastic aluminum Paper     paper Plastic aluminum
Paper tin glass tin PAPER        Paper tin glass
Paper tin glass Paper-tin        Paper tin glass
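Can this be done in Python? I have written a function that works and removes duplicates, but only by converting everything to lowercase and changing the order, which is not acceptable in my case. My current attempt below keeps the order but compares case-sensitively, so "Paper" survives: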
string = "paper Plastic aluminum Paper"
set_string = list()
for s in string.split(' '):
    if s not in set_string:
        set_string.append(s)
string = ' '.join(set_string)
print(string)
# output: paper Plastic aluminum Paper
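Here is a sample Python program that keeps the first occurrence of each word and removes the others. You can turn it into a function and apply it to each row/column.
Note: Python 3.7+ is required, because the function relies on dicts preserving insertion order.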
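import re

def unique_only(sentence):
    # Split on runs of non-word characters; drop empty strings left by
    # leading or trailing punctuation.
    words = [w for w in re.split(r'\W+', sentence) if w]
    unique_words = {}
    for word in words:
        key = word.lower()  # compare case-insensitively
        if key not in unique_words:
            unique_words[key] = word  # remember the first spelling seen
    return ' '.join(unique_words.values())

# On pandas 2.1+, DataFrame.applymap is deprecated in favor of DataFrame.map.
df.applymap(unique_only)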
Sample input:
col_sentence
0 paper Plastic aluminum paper
1 paper Plastic aluminum Paper
2 Paper tin glass tin PAPER
3 Paper tin glass Paper-tin
Output:
col_sentence
0 paper Plastic aluminum
1 paper Plastic aluminum
2 Paper tin glass
3 Paper tin glass
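The splitting step of unique_only can be looked at in isolation. The sketch below uses a made-up sample string (not from the question) that ends in a period, to show that re.split on a non-word pattern can leave an empty string when a sentence ends with punctuation, and that re.findall on a word pattern is one possible alternative that avoids it:

```python
import re

# Hypothetical sample: the last row from the question with a trailing period.
sample = "Paper tin glass Paper-tin."

# Splitting on runs of non-word characters leaves an empty trailing token.
split_tokens = re.split(r"\W+", sample)

# Collecting runs of word characters yields only the words themselves.
found_tokens = re.findall(r"\w+", sample)

print(split_tokens)  # ['Paper', 'tin', 'glass', 'Paper', 'tin', '']
print(found_tokens)  # ['Paper', 'tin', 'glass', 'Paper', 'tin']
```

Note that \w also matches digits and underscores, so either pattern treats "tin_2" as a single word.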
Assuming that "-" and " " are the only word separators in your column, try this:
def uniqueList(row):
    words = row.split(" ")
    unique = words[0]
    for w in words:
        # Substring test against the string built so far
        if w.lower() not in unique.lower():
            unique = unique + " " + w
    return unique

data["col_sentence"].str.replace("-", " ").apply(uniqueList)
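Edit (incorporating @im0j's suggestion): to avoid partial string matches (for example, pap being treated as a duplicate of paper), change the function to the following: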
def uniqueList_full(row):
    words = row.split(" ")
    unique = [words[0]]
    for w in words:
        # Whole-word comparison against the words kept so far
        if w.lower() not in [u.lower() for u in unique]:
            unique = unique + [w]
    return " ".join(unique)
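The whole-word comparison above rebuilds a lowercased list for every word. A minimal sketch of the same idea with a set of already-seen keys follows; the function name unique_words_seen_set and the test strings are illustrative assumptions, not part of the original answer:

```python
def unique_words_seen_set(row):
    """Keep the first occurrence of each word, comparing case-insensitively."""
    seen = set()   # lowercased words already kept
    kept = []      # words in their original spelling and order
    # Treat "-" as a separator, as the answer above assumes.
    for w in row.replace("-", " ").split():
        key = w.lower()
        if key not in seen:
            seen.add(key)
            kept.append(w)
    return " ".join(kept)


rows = [
    "paper Plastic aluminum paper",
    "paper Plastic aluminum Paper",
    "Paper tin glass tin PAPER",
    "Paper tin glass Paper-tin",
]
results = [unique_words_seen_set(r) for r in rows]
print(results)

# "pap" is a substring of "paper" but a different word, so it is kept.
partial = unique_words_seen_set("paper pap Paper")
print(partial)  # paper pap
```

On a pandas column this could presumably be applied with data["col_sentence"].apply(unique_words_seen_set).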
Another approach is to use OrderedDict. Note that this version compares case-sensitively, as the output below shows:
import pandas as pd
from collections import OrderedDict

df = pd.DataFrame(data={'col_sentence': [
    'paper Plastic aluminum paper',
    'paper Plastic aluminum Paper',
    'Paper tin glass tin PAPER',
    'Paper tin glass Paper-tin',
]})

# OrderedDict.fromkeys keeps each distinct word once, in first-seen order.
df.col_sentence.apply(lambda x: ' '.join(OrderedDict.fromkeys(x.replace('-', ' ').split())))
0 paper Plastic aluminum
1 paper Plastic aluminum Paper
2 Paper tin glass PAPER
3 Paper tin glass
Name: col_sentence, dtype: object
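Because the OrderedDict version keys on the words exactly as written, "Paper" and "PAPER" survive in rows 1 and 2. A minimal sketch of a case-insensitive variant is shown below; it keys a plain dict on the lowercased word and relies on Python 3.7+ insertion ordering. The name dedupe_keep_first is a hypothetical helper, not from the answers above:

```python
def dedupe_keep_first(sentence):
    """Case-insensitive de-duplication that keeps the first spelling seen."""
    first_seen = {}
    for word in sentence.replace("-", " ").split():
        # setdefault stores the word only if its lowercased key is new,
        # so later spellings such as "PAPER" never replace "Paper".
        first_seen.setdefault(word.lower(), word)
    return " ".join(first_seen.values())


sentences = [
    "paper Plastic aluminum paper",
    "paper Plastic aluminum Paper",
    "Paper tin glass tin PAPER",
    "Paper tin glass Paper-tin",
]
deduped = [dedupe_keep_first(s) for s in sentences]
for line in deduped:
    print(line)
```

With the DataFrame above, this could presumably be applied as df.col_sentence.apply(dedupe_keep_first).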