Extract lines in a CSV file that don't contain elements of a list

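I have a list of substrings that I need to compare against one column of a CSV file. If any substring in the list occurs in that column, the row should be skipped; I want to write only the rows whose string column contains none of those substrings. The file has many columns, but I am only looking at one.

For example, the my_string column has the values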

{ "This is just comparison of likely tokens","what a tough thing?"}

de = ["just","not","really ", "hat"]

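I only want to write the row containing "What a tough thing?"

This works fine when a word from the list is the only thing in the column. For example, if the my_string column holds "really", the row is not written to the new file. However, if the list item appears alongside other text, it fails.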

import csv

with open(infile, 'r', newline='') as inFile, open(outfile, 'w', newline='') as outFile:
    reader = csv.reader(inFile, delimiter=',')
    writer = csv.writer(outFile, delimiter=',')

    for row in reader:
        # Substring test: skip the row if any item of de occurs anywhere in column 1
        if any(d in row[1] for d in de):
            pass
        else:
            writer.writerow(row)
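
A minimal sketch of why the substring test above also drops the row that should be kept: Python's `in` on strings matches anywhere inside the text, so "hat" is found inside "what". The helper name `has_substring` is made up for this illustration and is not part of the original code.

```python
de = ["just", "not", "really ", "hat"]

def has_substring(text, needles):
    """Return True if any needle occurs anywhere in text (plain substring test)."""
    return any(n in text for n in needles)

rows = ["This is just comparison of likely tokens", "what a tough thing?"]
results = [has_substring(r, de) for r in rows]
print(results)  # [True, True]: the second row matches because "what" contains "hat"
```

Both rows are reported as matches, so neither would be written to the output file.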

To check whether a string is present in a list of substrings, I usually use sets.

list1 = ['a','b','c']
list2 = ['c','d','e']

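Now, to find the difference,

list3 = list(set(list1) - set(list2))

This gives you ['a','b'] (what is in list1 that is not in list2), which are the strings you are interested in. Doing

list(set(list2) - set(list1))

answers "what is in list2 that is not in list1?", i.e. ['e','d']. Sets are unordered, so the order of the elements in the result may vary.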
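
As a runnable sketch of the set differences described above (the names `only_in_list1` and `only_in_list2` are made up for illustration, and `sorted` is used only to make the printed order predictable):

```python
list1 = ['a', 'b', 'c']
list2 = ['c', 'd', 'e']

# Elements of list1 that do not appear in list2, and the reverse
only_in_list1 = sorted(set(list1) - set(list2))
only_in_list2 = sorted(set(list2) - set(list1))

print(only_in_list1)  # ['a', 'b']
print(only_in_list2)  # ['d', 'e']
```

Note that set operations compare whole elements, so applying this to a CSV column requires splitting the column text into words first.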

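It sounds like you want to search for words rather than just substrings, so that, for example, "hat" does not match "What". Word search gets complicated when you want to match plurals, different cases, hyphenated strings and so on. But if you don't mind ignoring these complications, you can use a regular expression to break the column into a list of words, lowercase them, and then check using set operations.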

import re
import csv

# TEST: write a sample csv file, using col0 to indicate which rows
# should end up in the outfile
with open('infile.csv', 'w') as f:
    f.write(
"""exclude,This is just a comparison of likely tokens,col02,col03
include,what a tough thing?,col12,col13""")

# the words to find
de = ["just","not","really", "hat"]

# the files
infile = 'infile.csv'
outfile = 'outfile.csv'

# a "normalized set" of words to search
de = set(word.lower() for word in de)

def normalize_text(text):
    """Return a set of all the words in lowercased text"""
    return set(re.findall(r'\w+', text.lower()))

with open(infile, 'r', newline='') as inFile, open(outfile, 'w', newline='') as outFile:
    reader = csv.reader(inFile, delimiter=',')
    writer = csv.writer(outFile, delimiter=',')
    for row in reader:
        mycol = normalize_text(row[1])
        if not mycol & de:
            writer.writerow(row)

print("---- output file ----")
print(open(outfile).read())

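You can compile the words into a single regular expression, and even do a case-insensitive match, like this:

# raw strings so that \b means a word boundary rather than a backspace;
# re.escape protects any regex metacharacters in the words
r = re.compile(r'\b(' + "|".join(map(re.escape, de)) + r')\b', re.IGNORECASE)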

Then your code can simply be:

with open(infile, 'r', newline='') as inFile, open(outfile, 'w', newline='') as outFile:
    reader = csv.reader(inFile, delimiter=',')
    writer = csv.writer(outFile, delimiter=',')

    for row in reader:
        # keep the row only if none of the words occurs as a whole word in column 1
        if not r.search(row[1]):
            writer.writerow(row)
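
For a quick check without touching the file system, the same idea can be sketched against in-memory streams. This is only an illustration under a few assumptions: the function name `filter_rows`, the use of `io.StringIO`, and the `re.escape` call are additions for this example, and the sample rows are borrowed from the test data in the previous answer.

```python
import csv
import io
import re

de = ["just", "not", "really", "hat"]

# Raw strings keep \b as a word boundary; re.escape (an addition here)
# guards against regex metacharacters in the words.
pattern = re.compile(r'\b(' + '|'.join(re.escape(w) for w in de) + r')\b',
                     re.IGNORECASE)

def filter_rows(in_stream, out_stream, column=1):
    """Copy rows whose `column` contains none of the words as whole words."""
    reader = csv.reader(in_stream)
    writer = csv.writer(out_stream, lineterminator='\n')
    for row in reader:
        if not pattern.search(row[column]):
            writer.writerow(row)

sample = io.StringIO(
    "exclude,This is just a comparison of likely tokens,col02,col03\n"
    "include,what a tough thing?,col12,col13\n"
)
result = io.StringIO()
filter_rows(sample, result)
print(result.getvalue())
```

Only the "include" row is written, because "hat" no longer matches inside "what" once word boundaries are required.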