Remove rows from one dataframe based on conditions from another dataframe in pandas Python

Question

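I have two pandas dataframes in Python, each containing millions of rows. I want to remove rows from the first dataframe that contain words from the second dataframe, based on three conditions:

If the word appears at the beginning of the sentence in a row
If the word appears at the end of the sentence in a row
If the word appears in the middle of the sentence in a row (exact word, not a substring)

Example:

First dataframe: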

This is the first sentence
Second this is another sentence
This is the third sentence forth
This is fifth sentence
This is fifth_sentence

Second dataframe:

Second
forth
fifth

Expected output:

This is the first sentence
This is fifth_sentence

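Note that I have millions of records in both dataframes. How can I process them and export the result in the most efficient way?

I tried the following, but it takes a very long time: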

import pandas as pd
import re

bad_words_file_data = pd.read_csv("words.txt", sep = ",", header = None)
sentences_file_data = pd.read_csv("setences.txt", sep = ".", header = None)

bad_words_index = []
for i in sentences_file_data.index:
    print("Processing Sentence:- ", i, "\n")
    single_sentence = sentences_file_data[0][i]
    for j in bad_words_file_data.index:
        word = bad_words_file_data[0][j]
        if single_sentence.endswith(word) or single_sentence.startswith(word) or word in single_sentence.split(" "):
            bad_words_index.append(i)
            break
            
sentences_file_data = sentences_file_data.drop(index=bad_words_index)
sentences_file_data.to_csv("filtered.txt",header = None, index = False)

Thanks

Answer 1

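You can use the numpy.where function to create a column named 'remove' that is set to 1 when any of the conditions you listed is met. First, create a list containing the values of df2.

Condition 1: checks whether the cell value starts with any of the values in the list

Condition 2: same as above, but checks whether the cell value ends with any of the values in the list

Condition 3: splits each cell and checks whether any value from the split string is in your list

After that, you can create the new dataframe by filtering out the rows marked 1: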

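# Imports
import pandas as pd
import numpy as np

# Collect the unique values from df2 in a list
words = list(set(df2['col']))

# Column of sentences to test
c = df['col']

# True where a sentence starts with, ends with, or contains (as a
# space-separated token) any of the words.
# expand=True keeps the original index; any() needs axis=1 as a keyword
# in recent pandas versions.
cond = (c.str.startswith(tuple(words))
        | c.str.endswith(tuple(words))
        | c.str.split(' ', expand=True).isin(words).any(axis=1))

# Assign 1 (remove) or 0 (keep)
df['remove'] = np.where(cond, 1, 0)

# Create the filtered dataframe and drop the helper column
out = df[df['remove'] != 1].drop(['remove'], axis=1)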

Printing out gives:

                          col
0  This is the first sentence
4      This is fifth_sentence
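
If only exact whole-word matches matter, as the question states, a set lookup on the tokens of each sentence may be a lighter alternative for large inputs. This is a minimal sketch under that assumption, not a benchmarked solution. `has_bad_word` is a hypothetical helper name, and the sample frames simply repeat the data from the question. Unlike the startswith/endswith checks above, it would not flag a sentence that begins with a longer word such as 'Secondary', and it splits on any whitespace rather than on single spaces only.

```python
import pandas as pd

# Sample data copied from the question; the column name 'col' follows the answer
df = pd.DataFrame({"col": [
    "This is the first sentence",
    "Second this is another sentence",
    "This is the third sentence forth",
    "This is fifth sentence",
    "This is fifth_sentence",
]})
df2 = pd.DataFrame({"col": ["Second", "forth", "fifth"]})

# A set gives constant-time membership tests, however many words there are
bad_words = set(df2["col"])


def has_bad_word(sentence: str) -> bool:
    """Return True if any whitespace-separated token is in bad_words."""
    return not bad_words.isdisjoint(sentence.split())


# Boolean mask: True for rows that should be removed
mask = df["col"].map(has_bad_word)
out = df[~mask]

print(out)  # rows 0 and 4 remain
```

Because each sentence is handled independently, the same function could be applied chunk by chunk if the sentence file does not fit in memory.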

Reference:

check if a columns contains any str from list
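
Another way to check whether a column contains any string from a list is a single regular expression passed to `Series.str.contains`. The sketch below is an illustration, assuming words are delimited by whitespace or by the start/end of the sentence. The names `alternation`, `pattern` and `out_regex` are made up for this example. It is probably only practical for a fairly short word list, since a pattern built from millions of words would be very large.

```python
import re

import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({"col": [
    "This is the first sentence",
    "Second this is another sentence",
    "This is the third sentence forth",
    "This is fifth sentence",
    "This is fifth_sentence",
]})
df2 = pd.DataFrame({"col": ["Second", "forth", "fifth"]})

# Escape each word so regex metacharacters in the data are matched literally
alternation = "|".join(re.escape(w) for w in df2["col"])

# A word counts only if bounded by whitespace or the sentence start/end,
# so 'fifth' does not match inside 'fifth_sentence'
pattern = rf"(?:^|\s)(?:{alternation})(?:\s|$)"

mask = df["col"].str.contains(pattern, regex=True)
out_regex = df[~mask]

print(out_regex)  # rows 0 and 4 remain
```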

Dataframes used:

>>> df.to_dict()

{'col': {0: 'This is the first sentence',
  1: 'Second this is another sentence',
  2: 'This is the third sentence forth',
  3: 'This is fifth sentence',
  4: 'This is fifth_sentence'}}

>>> df2.to_dict()

Out[80]: {'col': {0: 'Second', 1: 'forth', 2: 'fifth'}}
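
To reproduce the example, the two frames can be rebuilt from the dictionaries printed above. A small sketch; it uses nothing beyond the data already shown.

```python
import pandas as pd

# Outer keys become column names, inner keys become the index
df = pd.DataFrame({"col": {0: "This is the first sentence",
                           1: "Second this is another sentence",
                           2: "This is the third sentence forth",
                           3: "This is fifth sentence",
                           4: "This is fifth_sentence"}})

df2 = pd.DataFrame({"col": {0: "Second", 1: "forth", 2: "fifth"}})

print(df)
print(df2)
```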

python

dataframe

python-3.x

pandas

modin