Extract lines in a CSV file which don't have elements in a list
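I have a list of substrings that I need to compare against one column of a CSV file. If any substring in the list occurs in that column, the row should be skipped; I want to write only the rows whose string column contains none of those substrings. The file has many columns, but I am only looking at one.
For example, the my_string column has the values
{ "This is just comparison of likely tokens","what a tough thing?"}
de = ["just","not","really ", "hat"]
I only want to write the row that contains "What a tough thing?".
This works fine if the column holds only a word that is in the list. For example, if the my_string column has "really", the row is not written to the new file. However, it does not work when an item from the list appears together with other text.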
with open(infile, 'rb') as inFile, open(outfile, 'wb') as outfile:
    reader = csv.reader(inFile, delimiter=',')
    writer = csv.writer(outfile, delimiter=',')
    for row[1] in reader:
        if any(d in row[1] for d in de):
            pass
        else:
            writer.writerow(row[1])
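The symptom can be reproduced without any file. The following is a minimal sketch, assuming the two sample values and the `de` list from the question; the `is_excluded` helper is a hypothetical name introduced only for illustration. Python's `in` operator on strings is a substring test, so "hat" is found inside "what".

```python
# Sample data copied from the question; is_excluded is a hypothetical helper.
de = ["just", "not", "really ", "hat"]
values = ["This is just comparison of likely tokens", "what a tough thing?"]

def is_excluded(text, substrings):
    """Return True if any of the substrings occurs anywhere in text."""
    return any(s in text for s in substrings)

for value in values:
    # `in` on str is a substring test, so "hat" matches inside "what".
    matched = [s for s in de if s in value]
    print(repr(value), "excluded:", is_excluded(value, de), "matched:", matched)
```

Both sample values end up excluded under a plain substring test, which is why the second row never reaches the output file.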
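To check whether a string is present in a list of substrings, I usually use sets.
list1 = ['a','b','c']
list2 = ['c','d','e']
Now, to find the difference,
list3 = list(set(list1) - set(list2))
This gives you ['a','b'] (what is in list1 that is not in list2), in arbitrary order since sets are unordered, and you have the strings you are interested in. Doing
list(set(list2) - set(list1))
gives you the answer to "what is in list2 that is not in list1?", i.e. ['e','d']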
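As a runnable sketch of the set operations above: set difference and intersection compare whole elements, so to use them on a text column the text would first have to be split into words. The `cell` value and the whitespace tokenisation with `str.split` are assumptions made here for illustration.

```python
list1 = ['a', 'b', 'c']
list2 = ['c', 'd', 'e']

# Set difference; sorted() is used only to get a stable order for printing.
only_in_list1 = sorted(set(list1) - set(list2))
only_in_list2 = sorted(set(list2) - set(list1))
print(only_in_list1)  # ['a', 'b']
print(only_in_list2)  # ['d', 'e']

# Assumed illustration: whitespace-split a cell into words, then check overlap.
de = {"just", "not", "really", "hat"}
cell = "what a tough thing?"
words = set(cell.split())
print(words.isdisjoint(de))  # True: no listed word appears as a whole word
```

Note that a plain split leaves punctuation attached ("thing?"), so a more careful tokeniser may be needed for real data.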
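It sounds like you want to search for words rather than just substrings, so that, for example, "hat" would not match "What". Word searching gets complicated when you want to match plurals, different case, hyphenated strings and so on. But if you don't mind ignoring those complications, you can use a regular expression to break the column into a list of words, lowercase them, and then use set operations for the check.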
import re
import csv

# TEST: write a sample csv file, using col0 to indicate what should be
# in the outfile
with open('infile.csv', 'w') as f:
    f.write(
"""exclude,This is just a comparison of likely tokens,col02,col03
include,what a tough thing?,col12,col13""")

# the words to find
de = ["just","not","really", "hat"]

# the files
infile = 'infile.csv'
outfile = 'outfile.csv'

# a "normalized set" of words to search
de = set(word.lower() for word in de)

def normalize_text(text):
    """Return a set of all the words in lowercased text"""
    # raw string so that \w reaches the regex engine unchanged
    return set(re.findall(r'\w+', text.lower()))

# newline='' is what the csv module expects for file objects in Python 3
with open(infile, 'r', newline='') as inFile, open(outfile, 'w', newline='') as outFile:
    reader = csv.reader(inFile, delimiter=',')
    writer = csv.writer(outFile, delimiter=',')
    for row in reader:
        mycol = normalize_text(row[1])
        # write the row only if it shares no word with de
        if not mycol & de:
            writer.writerow(row)

print("---- output file ----")
with open(outfile) as f:
    print(f.read())
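To see why this keeps the second sample row, here is a small sketch that applies the same word-splitting idea to the two sample strings without touching the filesystem. It redefines `normalize_text` so that it runs standalone; the `samples` list is an assumption made for illustration.

```python
import re

de = {"just", "not", "really", "hat"}

def normalize_text(text):
    """Return the set of lowercased word tokens in text."""
    return set(re.findall(r'\w+', text.lower()))

samples = [
    "This is just a comparison of likely tokens",
    "what a tough thing?",
]

for text in samples:
    words = normalize_text(text)
    overlap = words & de
    # An empty overlap means the row would be written to the output file.
    print(sorted(words), "-> overlap:", sorted(overlap))
```

The first string shares the word "just" with the list, while the second yields the tokens "what", "a", "tough" and "thing", none of which equals "hat".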
You can compile the words into a single regular expression, and even do case-insensitive matching, like this:
r = re.compile(r'\b(' + "|".join(de) + r')\b', re.IGNORECASE)
Then your code can simply be:
with open(infile, 'r', newline='') as inFile, open(outfile, 'w', newline='') as outFile:
    reader = csv.reader(inFile, delimiter=',')
    writer = csv.writer(outFile, delimiter=',')
    for row in reader:
        # keep the row only if none of the words occurs in column 1
        if not r.search(row[1]):
            writer.writerow(row)
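For a self-contained check of the regular-expression approach, the sketch below runs the same filter on in-memory data. Using `io.StringIO` in place of real files, the `filter_rows` name, and wrapping each word in `re.escape` (so that any regex metacharacters in the list are matched literally) are all assumptions added here, not part of the answer above.

```python
import csv
import io
import re

de = ["just", "not", "really", "hat"]

# \b anchors each alternative at word boundaries; the pattern is written with
# raw strings so that \b is not read as a backspace character.
# re.escape is an added precaution, not part of the original answer.
pattern = re.compile(r'\b(' + "|".join(re.escape(d) for d in de) + r')\b',
                     re.IGNORECASE)

def filter_rows(csv_text, column=1):
    """Return the rows whose given column contains none of the words."""
    reader = csv.reader(io.StringIO(csv_text), delimiter=',')
    return [row for row in reader if not pattern.search(row[column])]

sample = (
    "exclude,This is just a comparison of likely tokens,col02,col03\n"
    "include,what a tough thing?,col12,col13\n"
)

kept = filter_rows(sample)

# Write the surviving rows back out as CSV text.
out = io.StringIO()
csv.writer(out, delimiter=',').writerows(kept)
print(out.getvalue())
```

Only the "include" row should survive, because "hat" is not a whole word in "what", while "just" is a whole word in the first row.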