Add quotation marks at the start and end of every other line, ignoring empty lines
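I need help organizing text. I have a list of thousands of vocabulary words in a csv. Each word has a term, a definition, and an example sentence. The term and definition are separated by a tab, and the example sentence is separated by an empty line.
For example: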
exacerbate worsen

This attack will exacerbate the already tense relations between the two communities

exasperate irritate, vex

he often exasperates his mother with pranks

execrable very bad, abominable, utterly detestable

an execrable performance
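I'd like to organize it so that the example sentence is wrapped in double quotes, has no empty line before or after it, and has the term replaced by a hyphen within the sentence. All of this should happen while keeping the tab after the term, a new line at the start of each term, and only a single space between the definition and the example sentence. I need this format to import the list into a flashcard web application.
Expected result using the example above: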
exacerbate worsen "This attack will – the already tense relations between the two communities"
exasperate irritate, vex "he often – his mother with pranks"
execrable very bad, abominable, utterly detestable "an – performance"
I'm on a Mac. I know basic command line (including regex) and Python, but not enough to figure this out myself. I'd greatly appreciate any help.
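The input layout described in the question (a term and a definition separated by a tab, then a blank line, then the example sentence) can be read into triples before any replacement is attempted. The following is only a minimal sketch and not part of the original question: the function name iter_entries is made up for illustration, and it assumes that every entry has a tab on its first line and a non-empty definition.

```python
def iter_entries(text):
    """Yield (term, definition, example) triples from the vocabulary text.

    Assumption: non-empty lines alternate between "term<TAB>definition"
    and an example sentence; blank lines are ignored.
    """
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    # Pair each term/definition line with the example line that follows it.
    for head, example in zip(lines[0::2], lines[1::2]):
        term, definition = head.split("\t", 1)
        yield term, definition.strip(), example


sample = (
    "exacerbate\tworsen\n\n"
    "This attack will exacerbate the already tense relations "
    "between the two communities\n\n"
    "exasperate\tirritate, vex\n\n"
    "he often exasperates his mother with pranks\n"
)

for term, definition, example in iter_entries(sample):
    print(term, "|", definition, "|", example)
```

Separating the parsing from the replacement step makes it easier to try different replacement strategies on the same records.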
Not necessarily bulletproof, but this script will do the job based on your example:
import re
import sys

input_file = sys.argv[1]
is_definition = True  # toggles between term/definition lines and example lines
current_entry = ""
current_definition = ""

with open(input_file, 'r') as f:
    for line in f:
        line = line.strip()
        if line != "":
            if is_definition:
                is_definition = False
                # the term and its definition are separated by the first tab
                current_entry, current_definition = line.split("\t", 1)
            else:
                is_definition = True
                # replace the term, plus any trailing word characters, with a hyphen
                example = re.sub(re.escape(current_entry) + r'\w*', "-", line)
                print(current_entry + "\t" + current_definition + ' "' + example + '"')
Output:
exacerbate worsen "This attack will - the already tense relations between the two communities"
exasperate irritate, vex "he often - his mother with pranks"
execrable very bad, abominable, utterly detestable "an - performance"
The problem with our current approach is that it does not work for irregular verbs, for example "go - went", "bring - brought", or "seek - sought".
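One possible way around this, sketched here as an assumption rather than something taken from the answers, is a small hand-maintained table of irregular forms that is consulted in addition to the base word. The IRREGULAR_FORMS table and the blank_term function below are hypothetical names, and the table would have to be filled in for the verbs that actually occur in the file.

```python
import re

# Hypothetical, hand-maintained table: base form -> irregular inflections.
IRREGULAR_FORMS = {
    "go": ["went", "gone", "goes", "going"],
    "bring": ["brought"],
    "seek": ["sought"],
}


def blank_term(term, sentence):
    """Replace the term, or any listed irregular form of it, with a hyphen."""
    forms = [term] + IRREGULAR_FORMS.get(term, [])
    # Longest forms first so that "going" is tried before "go".
    forms.sort(key=len, reverse=True)
    pattern = r"\b(?:" + "|".join(re.escape(f) for f in forms) + r")\b"
    return re.sub(pattern, "-", sentence, flags=re.IGNORECASE)


print(blank_term("go", "she went home early"))
print(blank_term("seek", "they sought a compromise"))
```

This sketch only covers forms listed in the table; regular inflections such as "seeks" would still need one of the suffix-based approaches.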
Try:
suffixList = ["s", "ed", "es", "ing"]  # et cetera

# Placeholder file names: the original snippet did not show how
# vocab and new_vocab_file were opened.
with open("vocab.txt") as vocab:
    file = vocab.read().split("\n")  # keep the result; str.split returns a new list

# Lines come in groups of four: term/definition, blank, example, blank.
vocab_words = [file[i] for i in range(0, len(file) - 2, 4)]
vocab_defs = ['"' + file[i] + '"' for i in range(2, len(file), 4)]

newFileText = ""
for count in range(len(vocab_words)):
    # split() with no argument also splits on the tab after the term
    vocab_defs[count] = vocab_defs[count].replace(vocab_words[count].split()[0], "-")
    for i in suffixList:
        vocab_defs[count] = vocab_defs[count].replace("-%s" % i, "-")
    newFileText += vocab_words[count] + " " + vocab_defs[count] + "\n"

with open("new_vocab.txt", "w") as new_vocab_file:
    new_vocab_file.write(newFileText)
Output:
============== RESTART: /Users/chervjay/Documents/thingy.py ==============
exacerbate worsen "This attack will - the already tense relations between the two communities"
exasperate irritate, vex "he often - his mother with pranks"
execrable very bad, abominable, utterly detestable "an - performance"
>>>
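The suffix idea above can also be expressed as a single regular expression with word boundaries, which avoids touching hyphens that are already part of the sentence. This is only a sketch under assumptions: the function name blank_inflected is made up, and the handling of a silent final "e" (exacerbate, exacerbating) is a simple heuristic, not a full morphological rule.

```python
import re


def blank_inflected(term, sentence):
    """Replace the term and a few regular inflections of it with a hyphen."""
    forms = {term, term + "s", term + "es", term + "ed", term + "ing"}
    if term.endswith("e"):
        # Heuristic for a silent final "e": exacerbate -> exacerbated, exacerbating
        forms.update({term + "d", term[:-1] + "ing"})
    # Longest alternatives first, wrapped in \b so only whole words match.
    alternatives = "|".join(
        re.escape(f) for f in sorted(forms, key=len, reverse=True)
    )
    return re.sub(r"\b(?:" + alternatives + r")\b", "-", sentence,
                  flags=re.IGNORECASE)


print(blank_inflected("exasperate", "he often exasperates his mother with pranks"))
print(blank_inflected("exacerbate", "the delay is exacerbating a well-known problem"))
```

Because the pattern is anchored on word boundaries, the hyphen in "well-known" is left alone, which a plain string replacement of "-s" or "-ed" could not guarantee.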
Open a terminal in the directory that contains your input file.
Save the following code in a .py file:
import sys
import string
import difflib

with open(sys.argv[1]) as fobj:
    lines = fobj.read().split('\n\n')

with open(sys.argv[2], 'w') as out:
    # step through (term line, example) pairs; stopping at len(lines) - 1
    # skips an unpaired trailing chunk
    for i in range(0, len(lines) - 1, 2):
        line1, example = lines[i:i + 2]
        words = [w.strip(string.punctuation).lower()
                 for w in example.split()]
        # if the target word is not in the example sentence,
        # we will find the most similar one
        target = line1.split('\t')[0]
        if target in words:
            most_similar = target
        else:
            matches = difflib.get_close_matches(target, words, 1)
            # fall back to the target itself when nothing is similar enough
            most_similar = matches[0] if matches else target
        new_example = example.replace(most_similar, '-')
        out.write('{} "{}"\n'.format(line1.strip(), new_example.strip()))
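The program expects the input file name and the output file name as command-line arguments. That is, run the following command from the terminal:
$ python program.py input.txt output.txt
where program.py is the program above, input.txt is your input file, and output.txt is the file that will be created in the format you need.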
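I ran the program against the example you provided. I added the tabs manually, because the question contained only spaces. This is the output the program produced: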
exacerbate worsen "This attack will - the already tense relations between the two communities"
exasperate irritate, vex "he often - his mother with pranks"
execrable very bad, abominable, utterly detestable "an - performance"
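The program correctly replaces exasperates with a dash in the second example, even though the term is exasperate. Without the file, I can't guarantee that this technique will work for every word in it.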
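Two edge cases are worth checking with this approach: difflib.get_close_matches returns an empty list when no word is similar enough (its default cutoff is 0.6), and str.replace is case-sensitive while the word list is lowercased. The sketch below is a variant built on those observations, with the hypothetical name blank_closest; it leaves the sentence unchanged when nothing matches and replaces the word as it is actually spelled in the sentence.

```python
import difflib
import string


def blank_closest(target, example):
    """Replace the word in example that is closest to target with a hyphen."""
    originals = [w.strip(string.punctuation) for w in example.split()]
    lowered = [w.lower() for w in originals]
    if target.lower() in lowered:
        best = target.lower()
    else:
        matches = difflib.get_close_matches(target.lower(), lowered, 1)
        if not matches:
            return example  # nothing similar enough; leave the sentence alone
        best = matches[0]
    # Replace the word with its original capitalisation, first occurrence only.
    return example.replace(originals[lowered.index(best)], "-", 1)


print(blank_closest("exasperate", "Exasperates his mother, he does"))
print(blank_closest("go", "he went home"))
```

Entries that come back unchanged could be collected and reviewed by hand, since they are likely to be the irregular forms.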
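#!/usr/local/bin/python3
import re

with open('yourFile.csv', 'r') as myfile:
    data = myfile.read()

# Groups: 1 = term, 2 = definition, 3 and 4 = example text before and after the term.
# The trailing \n* swallows the blank line that follows each example.
pattern = r'^([A-Za-z]+)\t(.+)\n\n(.*)\1(?:s|ed|es|ing)*(.*)$\n*'
print(re.sub(pattern, r'\1\t\2 "\3-\4"\n', data, flags=re.MULTILINE), end='')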
Output:
exacerbate worsen "This attack will - the already tense relations between the two communities"
exasperate irritate, vex "he often - his mother with pranks"
execrable very bad, abominable, utterly detestable "an - performance"
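When a list of suffixes is written inside a regular expression, square brackets and a group behave differently: [s|ed|es|ing] is a character class that matches any single one of the listed characters (including the | itself), whereas (?:s|ed|es|ing) matches one whole suffix. The short check below illustrates the difference using only Python's re module; the test strings are made up.

```python
import re

char_class = re.compile(r"[s|ed|es|ing]*")
group = re.compile(r"(?:s|ed|es|ing)*")

# The character class accepts any mix of the characters s, |, e, d, i, n, g.
print(bool(char_class.fullmatch("dense")))   # True
print(bool(group.fullmatch("dense")))        # False

# Both accept a genuine suffix.
print(bool(char_class.fullmatch("ing")))     # True
print(bool(group.fullmatch("ing")))          # True
```

With the character class, a word such as "dense" following the term would be swallowed as if it were a suffix, so the group form is the safer choice for this task.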