Extract lines in CSV file which don't have elements in a list

Question

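I have a list of substrings that I need to compare against one column of a CSV file, to check whether any substring from the list occurs in that column. I would like to write out only the lines whose string column contains none of those substrings. The file has many columns, and I am only looking at one of them.

For example, the my_string column has the values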

{ "This is just comparison of likely tokens","what a tough thing?"}

de = ["just","not","really ", "hat"]

I would like to write only the line that contains "What a tough thing?".

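This works fine when the column contains only a word from the list. For example, if the my_string column has "really", the row is not written to the new file. But if the item from the list appears together with other text, it does not work.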

with open(infile, 'rb') as inFile, open(outfile, 'wb') as outfile:
    reader = csv.reader(inFile, delimiter=',')
    writer = csv.writer(outfile, delimiter=',')

    for row[1] in reader:

        if any(d in row[1] for d in de):
            pass
        else:
            writer.writerow(row[1])
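
The symptom can be reproduced without any files. The following is a minimal sketch that assumes the two sample values and the `de` list shown above; `has_substring` is a hypothetical helper name introduced only for illustration.

```python
# Sample data copied from the question.
de = ["just", "not", "really ", "hat"]
samples = [
    "This is just comparison of likely tokens",
    "what a tough thing?",
]

def has_substring(text, needles):
    """Return True if any needle occurs anywhere in text (case-sensitive)."""
    return any(n in text for n in needles)

for s in samples:
    print(has_substring(s, de), repr(s))
```

Both samples are flagged. The second one is flagged because "hat" is a substring of "what", so a plain `in` test cannot tell a whole word from a fragment of one.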

Answer 1

To check whether a string is present in a list of substrings, I usually use sets.

list1 = ['a','b','c']
list2 = ['c','d','e']

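Now, to find the difference,

list3 = list(set(list1) - set(list2))

which gives you ['a','b'] (what is in list1 that is not in list2; the order may vary because sets are unordered), and those are the strings you are interested in. Doing

list(set(list2) - set(list1))

gives you the answer to "what is in list2 that is not in list1?", i.e. ['e','d'].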
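
The steps above can be run as the following sketch. `sorted` is added here only to make the printed order stable, and the last line is an added illustration (not part of the original answer) showing that set operations compare whole elements, so they do not detect a substring inside a longer string.

```python
list1 = ['a', 'b', 'c']
list2 = ['c', 'd', 'e']

# Set difference: elements of one list that are missing from the other.
only_in_list1 = sorted(set(list1) - set(list2))
only_in_list2 = sorted(set(list2) - set(list1))
print(only_in_list1)  # ['a', 'b']
print(only_in_list2)  # ['d', 'e']

# Whole-element comparison: "hat" is not an element of the second set,
# even though it is a substring of that set's only element.
overlap = {"hat"} & {"what a tough thing?"}
print(overlap)  # set()
```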

Answer 2

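It sounds like you want to search for words and not just substrings, so that, for instance, "hat" does not match "What". Word searches get complicated when you want to match plurals, different case, hyphenated strings, and so on. But if you don't mind ignoring those complications, you can use a regular expression to break the column into a list of words, lowercase them, and then use set operations for the check.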

import re
import csv

# TEST: write a sample csv file. using col0 to indicate what should be
# in the outfile
with open('infile.csv', 'w') as f:
    f.write("""exclude,This is just a comparison of likely tokens,col02,col03
include,what a tough thing?,col12,col13""")

# the words to find
de = ["just","not","really", "hat"]

# the files
infile = 'infile.csv'
outfile = 'outfile.csv'

# a "normalized set" of words to search
de = set(word.lower() for word in de)

def normalize_text(text):
    """Return a set of all the words in lowercased text"""
    return set(re.findall(r'\w+', text.lower()))

with open(infile, 'r', newline='') as inFile, open(outfile, 'w', newline='') as outFile:
    reader = csv.reader(inFile, delimiter=',')
    writer = csv.writer(outFile, delimiter=',')
    for row in reader:
        mycol = normalize_text(row[1])
        if not mycol & de:
            writer.writerow(row)

print("---- output file ----")
with open(outfile) as f:
    print(f.read())
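
To see why the word-set check keeps the second sample row, the tokenizing step can be tried on its own. This is a small sketch that reuses `normalize_text` from above on the two sample strings; the variable names `first` and `second` are illustrative only.

```python
import re

def normalize_text(text):
    """Return a set of all the words in lowercased text."""
    return set(re.findall(r'\w+', text.lower()))

# The words to exclude, already lowercased.
de = {"just", "not", "really", "hat"}

first = normalize_text("This is just a comparison of likely tokens")
second = normalize_text("what a tough thing?")

print(first & de)      # {'just'}
print(second & de)     # set()
print(sorted(second))  # ['a', 'thing', 'tough', 'what']
```

Because "what" is compared as a whole token, it never equals "hat", so the intersection is empty and the row is written.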

Answer 3

You can compile the words into a single regular expression, and even make the match case-insensitive, like this:

r = re.compile(r'\b(' + "|".join(map(re.escape, de)) + r')\b', re.IGNORECASE)

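Then your code can simply be:

with open(infile, 'r', newline='') as inFile, open(outfile, 'w', newline='') as outFile:
    reader = csv.reader(inFile, delimiter=',')
    writer = csv.writer(outFile, delimiter=',')

    for row in reader:
        # keep the row only if no listed word occurs in the second column
        if not r.search(row[1]):
            writer.writerow(row)  # write the whole row, not just row[1]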
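
The regex approach can be tried end to end without touching the disk. This is a minimal sketch under a few assumptions: Python 3, in-memory `io.StringIO` buffers standing in for the real files, made-up sample rows, and a `de` list with the trailing space removed from "really". The names `pattern`, `source` and `target` are illustrative.

```python
import csv
import io
import re

de = ["just", "not", "really", "hat"]

# \b anchors each alternative at word boundaries; re.escape guards against
# list entries that contain regex metacharacters.
pattern = re.compile(
    r'\b(?:' + '|'.join(map(re.escape, de)) + r')\b', re.IGNORECASE
)

source = io.StringIO(
    "exclude,This is just a comparison of likely tokens,col02\n"
    "include,What a tough thing?,col12\n"
)
target = io.StringIO()
writer = csv.writer(target, lineterminator='\n')

for row in csv.reader(source):
    # Keep rows whose second column contains none of the listed words.
    if not pattern.search(row[1]):
        writer.writerow(row)

print(target.getvalue())
```

Only the "include" row is written: "hat" inside "What" is not surrounded by word boundaries, so it does not match, while "just" in the first row does.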
