Add quotation marks at the start and end of every other line, ignoring empty lines

Question

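I need help organizing text. I have several thousand vocabulary entries in a CSV file. Each entry has a term, a definition, and an example sentence. The term and the definition are separated by a tab, and the example sentence is set off by blank lines.

For example: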

exacerbate  worsen

This attack will exacerbate the already tense relations between the two communities

exasperate  irritate, vex

he often exasperates his mother with pranks

execrable   very bad, abominable, utterly detestable

an execrable performance

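I want to reorganize this so that each example sentence is wrapped in double quotes, has no blank line before or after it, and has the term inside the sentence replaced by a hyphen. All of this should change while keeping the tab after the term, a new line at the start of each term, and a single space between the definition and the example sentence. I need this format to import the data into a flashcard web app.

Expected result for the example above: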

exacerbate  worsen "This attack will – the already tense relations between the two communities"
exasperate  irritate, vex "he often – his mother with pranks"
execrable   very bad, abominable, utterly detestable "an – performance"

I am on a Mac. I know basic command line (including regex) and Python, but not enough to work this out on my own. I would be very grateful for your help.

Answer 1

Not necessarily bulletproof, but this script will do the job for your example:

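import sys
import re

input_file = sys.argv[1]

# Non-blank lines alternate: "term<TAB>definition", then the example sentence.
is_definition = True

current_entry = ""
current_definition = ""

with open(input_file, 'r') as f:
    for line in f:
        line = line.strip()

        if line != "":
            if is_definition:
                is_definition = False

                # Split on the first tab only, in case the definition contains a tab.
                current_entry, current_definition = line.split("\t", 1)

            else:
                is_definition = True

                # Replace the term plus any trailing letters (e.g. "exasperates") with "-".
                # re.escape guards against regex metacharacters in the term.
                example = re.sub(re.escape(current_entry) + r'\w*', "-", line)

                print(current_entry + "\t" + current_definition + ' "' + example + '"')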

Output:

exacerbate  worsen "This attack will - the already tense relations between the two communities"
exasperate  irritate, vex "he often - his mother with pranks"
execrable   very bad, abominable, utterly detestable "an - performance"

The problem with this approach is that it does not work for irregular verbs, for example "go - went", "bring - brought", or "seek - sought".
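One way to cover such cases is sketched below, assuming you are willing to maintain a small table of irregular forms by hand. The `IRREGULAR_FORMS` dictionary and the `blank_out` helper are hypothetical names introduced for illustration; they are not part of the script above.

```python
import re

# Hypothetical, hand-maintained table: term -> irregular inflected forms.
IRREGULAR_FORMS = {
    "go": ["went", "gone"],
    "bring": ["brought"],
    "seek": ["sought"],
}


def blank_out(term, sentence):
    """Replace the term, its suffixed forms, and listed irregular forms with "-"."""
    # The term followed by optional word characters covers regular inflections.
    alternatives = [re.escape(term) + r"\w*"]
    # Irregular forms are matched as whole words only.
    alternatives += [re.escape(form) for form in IRREGULAR_FORMS.get(term, [])]
    pattern = r"\b(?:" + "|".join(alternatives) + r")\b"
    return re.sub(pattern, "-", sentence, flags=re.IGNORECASE)


print(blank_out("seek", "They sought shelter from the storm"))
print(blank_out("exasperate", "he often exasperates his mother with pranks"))
```

Note that the `\w*` part is deliberately loose, so a short term such as "go" would also blank out unrelated words that start with the same letters (for example "gold"). For short terms a stricter suffix list may be the safer choice.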

Answer 2

Try:

suffixList = ["s", "ed", "es", "ing"]  # et cetera

# "vocab.txt" and "new_vocab.txt" are placeholder file names.
with open("vocab.txt") as vocab:
    file = vocab.read().split("\n")  # split() returns a new list, so keep it

# Term lines sit at indices 0, 4, 8, ...; example sentences at 2, 6, 10, ...
vocab_words = [file[i] for i in range(0, len(file)-2, 4)]
vocab_defs = [file[i] for i in range(2, len(file), 4)]

for defCount in range(len(vocab_defs)):
    vocab_defs[defCount] = "\"" + vocab_defs[defCount] + "\""

newFileText = ""
for count in range(len(vocab_words)):
    # split() with no argument also splits on the tab, isolating the term
    vocab_defs[count] = vocab_defs[count].replace(vocab_words[count].split()[0], "-")
    for i in suffixList:
        vocab_defs[count] = vocab_defs[count].replace("-%s" % i, "-")
    newFileText += vocab_words[count]
    newFileText += "  "
    newFileText += vocab_defs[count]
    newFileText += "\n"

with open("new_vocab.txt", "w") as new_vocab_file:
    new_vocab_file.write(newFileText)

Output:

============== RESTART: /Users/chervjay/Documents/thingy.py ==============
exacerbate  worsen  "This attack will - the already tense relations between the two communities"
exasperate  irritate, vex  "he often - his mother with pranks"
execrable   very bad, abominable, utterly detestable  "an - performance"

>>>

Answer 3

Open a terminal in the directory that contains your input file. Save the following code in a .py file:

import sys
import string
import difflib


with open(sys.argv[1]) as fobj:
    lines = fobj.read().split('\n\n')

with open(sys.argv[2], 'w') as out:
    for i in range(0, len(lines), 2):
        line1, example = lines[i:i + 2]
        words = [w.strip(string.punctuation).lower()
                 for w in example.split()]

        # if the target word is not in the example sentence,
        # we will find the most similar one
        target = line1.split('\t')[0]
        if target in words:
            most_similar = target
        else:
            most_similar = difflib.get_close_matches(target, words, 1)[0]
        new_example = example.replace(most_similar, '-')
        out.write('{} "{}"\n'.format(line1.strip(), new_example.strip()))

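The program takes the input file name and the output file name as command-line arguments. That is, run the following command from the terminal: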

$ python program.py input.txt output.txt

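where program.py is the program above, input.txt is your input file, and output.txt is the file that will be created in the format you need.

I ran the program against the example you provided. I added the tabs by hand, because the question contains only spaces. This is the output the program produced: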

exacerbate  worsen "This attack will - the already tense relations between the two communities"
exasperate  irritate, vex "he often - his mother with pranks"
execrable   very bad, abominable, utterly detestable "an - performance"

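In the second example the program correctly replaces exasperates with a dash, even though the term is exasperate. Without the actual file, I cannot guarantee that this technique works for every word in it.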
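Two cases the program above does not handle are a sentence in which no word is close enough to the term (the `[0]` lookup then raises an `IndexError`) and a sentence in which the matching word is capitalised (the lowercased match is not found by `replace`). The following is a minimal sketch of a more defensive lookup, not a drop-in replacement; `blank_closest` is a hypothetical helper name, and the `cutoff` of 0.6 is simply the default used by `difflib.get_close_matches`.

```python
import difflib
import re
import string


def blank_closest(target, example, cutoff=0.6):
    """Replace the word in `example` that best matches `target` with "-"."""
    words = [w.strip(string.punctuation) for w in example.split()]
    lowered = [w.lower() for w in words]
    matches = difflib.get_close_matches(target.lower(), lowered, n=1, cutoff=cutoff)
    if not matches:
        # No word is similar enough: leave the sentence unchanged.
        return example
    # Recover the original spelling so capitalised words are replaced too.
    original = words[lowered.index(matches[0])]
    return re.sub(r"\b" + re.escape(original) + r"\b", "-", example)


print(blank_closest("exasperate", "he often exasperates his mother with pranks"))
```

Returning the sentence unchanged when nothing matches makes the problem entries easy to spot afterwards, since they are the only ones without a dash.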

Answer 4

#!/usr/local/bin/python3

import re

with open('yourFile.csv', 'r') as myfile:
    data = myfile.read()    

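# \1 refers back to the term, so its copy in the sentence (plus a common suffix) becomes "-".
print(re.sub(r'^([A-Za-z]+)\t(.+)\n\n(.*)\1(?:s|ed|es|ing)*(.*)$', r'\1\t\2 "\3-\4"', data, flags=re.MULTILINE))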

Output:

exacerbate worsen "This attack will - the already tense relations between the two communities"

exasperate irritate, vex "he often - his mother with pranks"

execrable very bad, abominable, utterly detestable "an - performance"
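The output above still has a blank line between entries, whereas the question asks for none. A minimal sketch of one way to remove them is to let the pattern also consume the newlines that follow each example sentence and write back a single one. It assumes the same layout as the question (term, tab, definition, blank line, sentence); `PATTERN` and `convert` are hypothetical names, and entries whose sentence does not contain the term are left unchanged.

```python
import re

# Term, tab, definition, blank line, then a sentence containing the term
# (optionally with a common suffix). The trailing \n* swallows blank lines.
PATTERN = re.compile(
    r'^([A-Za-z]+)\t(.+)\n\n(.*)\1(?:s|ed|es|ing)*(.*)\n*',
    flags=re.MULTILINE,
)


def convert(data):
    """Join each entry onto one line and end it with exactly one newline."""
    return PATTERN.sub(r'\1\t\2 "\3-\4"\n', data)


sample = (
    "exacerbate\tworsen\n\n"
    "This attack will exacerbate the already tense relations\n\n"
    "exasperate\tirritate, vex\n\n"
    "he often exasperates his mother with pranks\n"
)
print(convert(sample), end="")
```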


Tags: python, regex, macos, textedit