Print all the tokens in the file that are labelled with the morphological tag

Question

I want to print all the tokens in a file that are labelled with a given morphological tag. So far I have written the code shown below.

def index(filepath, string):

    import re
    pattern = re.compile(r'(\w+)+')
    StringList = []
    StringList.append(string)

    with open(filepath) as f:
        for lineno, line in enumerate(f, start=1):
            words = set(m.group(1) for m in pattern.finditer(line))
            matches = [keyword for keyword in StringList if keyword in words]
            if matches:
                result = "{:<15} {}".format(','.join(matches), lineno)
                print(result)

    StringList.clear()



index('deneme.txt', '+Noun')

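The output looks like this. I can find Noun in the tokens along with the line number, but I cannot print the part I actually want: only the word part before the + sign.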

Noun            1
Noun            2
Noun            3
Noun            4
Noun            5
Noun            6
Noun            7

The lines in my file look like this:

Türkiye+Noun ,+Punc terörizm+Noun+Gen ve+Conj kitle+Noun imha+Noun silah+Noun+A3pl+P3sg+Gen küresel+Adj düzey+Noun+Loc oluş+Verb+Caus+PastPart+P3sg tehdit+Noun+Gen boyut+Noun+P3sg karşı+Adj+P3sg+Loc ,+Punc tüm+Det ülke+Noun+A3pl+Gen yay+Verb+Pass+Inf2+Gen önle+Verb+Pass+Inf2+P3sg hedef+Noun+A3pl+P3sg+Acc paylaş+Verb+PastPart+P3pl ,+Punc daha+Noun güven+Noun+With ve+Conj istikrar+Noun+With bir+Num dünya+Noun düzen+Noun+P3sg için+PostpPCGen birlik+Noun+Loc çaba+Noun göster+Verb+PastPart+P3pl bir+Num aşama+Noun+Dat gel+Verb+Pass+Inf2+P3sg+Acc samimi+Adj ol+Verb+ByDoingSo arzula+Verb+Prog2+Cop .+Punc 
Türkiye+Noun+Gen ekonomik+Adj ve+Conj insani+Adj potansiyel+Noun+P3sg ,+Punc güç+Noun+With savun+Verb+Inf2 kapasite+Noun+P3sg ,+Punc ulus+Noun+A3pl+InBetween çatış+Verb+Inf2+A3pl+Gen önle+Verb+Pass+Inf2+P3sg ve+Conj barış+Noun+P3sg inşa+Noun çaba+Noun+A3pl+P3sg+Dat aktif+Adj katılım+Noun+P3sg+Gen yanısıra+PostpPCGen ,+Punc fark+Noun+With kültür+Noun ve+Conj gelenek+Noun+A3pl+Dat ait+PostpPCDat seçkin+Adj özellik+Noun+A3pl+Acc birleş+Verb+Caus+PresPart bir+Num bünye+Noun+Dat sahip+Noun ol+Verb+Inf2+P3sg ,+Punc kendi+Pron+P3sg bölge+Noun+P3sg+Loc ve+Conj öte+Noun+P3sg+Loc önem+Noun+With rol+Noun oyna+Verb+Inf2+P3sg+Acc sağla+Verb+Fut değer+Noun+With özellik+Noun+A3pl+Cop .+Punc 
Türkiye+Noun ,+Punc bu+Det önem+Noun+With katkı+Noun+Acc yap+Verb+Able+Inf1 için+PostpPCGen yeterli+Adj donanım+Noun+P3sg haiz+Adj bir+Num ülke+Noun+Cop ve+Conj gelecek+Noun nesil+Noun+A3pl için+PostpPCGen daha+Noun i+Noun+Acc bir+Num dünya+Noun oluş+Verb+Caus+Inf1 amaç+Noun+P3sg+Ins ,+Punc dost+Noun+A3pl+P3pl ve+Conj müttefik+Adj+A3pl+P3sg+Ins yakın+Noun bir+Num biçim+Noun+Loc çalış+Verb+Inf2+Dat devam+Noun et+Verb+Fut+Cop .+Punc 
Ab+Noun ile+PostpPCNom gümrük+Noun Alan+Noun+P3sg+Loc+Rel kurumsal+Adj ilişki+Noun+A3pl 
club+Noun toplantı+Noun+A3pl+P3sg 
Türkiye+Noun -+Punc At+Noun gümrük+Noun işbirlik+Noun+P3sg komite+Noun+P3sg ,+Punc Ankara+Noun Anlaşma+Noun+P3sg+Gen 6+Num madde+Noun+P3sg uyar+Verb+When ortaklık+Noun rejim+Noun+P3sg+Gen uygula+Verb+Pass+Inf2+P3sg+Acc ve+Conj geliş+Verb+Inf2+P3sg+Acc sağla+Verb+Inf1 üzere+PostpPCNom ortaklık+Noun Konsey+Noun+P3sg+Gen 2+Num /+Punc 69+Num sayılı+Adj karar+Noun+P3sg ile+Conj teknik+Noun komite+Noun mahiyet+Noun+P3sg+Loc kur+Verb+Pass+Narr+Cop .+Punc 
club+Noun toplantı+Noun+A3pl+P3sg 
nispi+Adj 
nisbi+Adj 
görece+Adj+With 
izafi+Adj 
obur+Adj

I want to get the tokens for whatever tag I supply. For example, when I write +Adj, I want to get all the tokens that contain +Adj (nispi, izafi, and so on).

Answer 1

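Matching on \w+ strips the + from the text you are searching, so a keyword such as +Adj can never be found. I split on the spaces between tokens instead. After that it is just a matter of putting the for and in clauses of the list comprehension in the right order.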

def index(filepath, string):
    StringList = [string]

    with open(filepath) as f:
        for lineno, line in enumerate(f, start=1):
            words = line.split()  # split on any whitespace; also drops the trailing newline
            matches = [word for keyword in StringList for word in words if keyword in word]
            if matches:
                result = "{:<15} {}".format(','.join(matches), lineno)
                print(result)


index('deneme.txt', '+Adj')

This produces:

küresel+Adj,karşı+Adj+P3sg+Loc,samimi+Adj 1
ekonomik+Adj,insani+Adj,aktif+Adj,seçkin+Adj 2
yeterli+Adj,haiz+Adj,müttefik+Adj+A3pl+P3sg+Ins 3
kurumsal+Adj    4
sayılı+Adj      6
nispi+Adj       8
nisbi+Adj       9
görece+Adj+With 10
izafi+Adj       11
obur+Adj        12

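I removed the line StringList.clear() because it raised an error for me, most likely because lists have no clear() method in Python 2.7. It is not needed anyway, since StringList is local to the function.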

This works on Python 2.7 and 3.6+, although extended Unicode characters in the text throw off the column alignment under 2.7.
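
The question asks for only the part before the first + sign, whereas the output above prints the whole token. Here is a minimal sketch of that variation, assuming tokens are separated by whitespace and that a tag should match exactly (so +Inf would not match +Inf1). The helper name words_with_tag is hypothetical and is not part of the code above.

```python
def words_with_tag(line, tag):
    """Return the word part of every token on `line` that carries `tag`."""
    tag = tag.lstrip('+')               # accept both '+Adj' and 'Adj'
    found = []
    for token in line.split():          # split on any run of whitespace
        word, _, rest = token.partition('+')   # word = text before the first '+'
        if tag in rest.split('+'):      # exact tag comparison, not a substring test
            found.append(word)
    return found


line = "küresel+Adj düzey+Noun+Loc karşı+Adj+P3sg+Loc ,+Punc"
print(words_with_tag(line, '+Adj'))     # ['küresel', 'karşı']
```

To keep the line numbers, call it once per line inside the enumerate loop and print the result only when the list is non-empty.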

Answer 2

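I think your idea of how to use regular expressions here needs some refinement.

Note that each input line contains multiple "tokens", e.g. terörizm+Noun+Gen. As you can see, a token contains:

the first word, which is the actual word from the text,
a number of classification symbols, each preceded by a + character.

So:

each line should be split into tokens on runs of whitespace characters,
each token should be split into words on + characters,
the first of these words is the "actual" word,
the remaining words (without the +) are the classification symbols.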

It is also good practice to strip trailing whitespace characters (at least the \n).
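
The splitting steps described above can be sketched on a single line as follows. This is only an illustration using plain string methods rather than regular expressions; the helper name parse_line is hypothetical, and the star-unpacking syntax assumes Python 3.

```python
def parse_line(line):
    """Split one line into (word, tags) pairs."""
    pairs = []
    # Strip the trailing newline, then split on runs of whitespace.
    for token in line.rstrip().split():
        word, *tags = token.split('+')   # first piece is the word, the rest are tags
        pairs.append((word, tags))
    return pairs


print(parse_line("terörizm+Noun+Gen ve+Conj\n"))
# [('terörizm', ['Noun', 'Gen']), ('ve', ['Conj'])]
```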

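Also note that your code contains StringList, so you evidently expect that this function may have to look for any of several classification words.

I programmed it a bit differently:

The second parameter (lookFor) is a list of words, which is converted to a set (lookForSet).
The word list of each token (the result of splitting it, minus the first word) is also converted to a set.

Whether a word (the first word of a token) is printed depends on whether at least one of its classification symbols can be found in lookForSet. In other words, whether lookForSet and wordSet have any elements in common (set intersection).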
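
As a small illustration of that test, here is a sketch on two sample tokens. The variable names are hypothetical, and the & operator is equivalent to calling intersection().

```python
look_for = {'Noun', 'Gen'}

# Tags of a token: everything after the first '+'-separated piece.
token_tags = set("terörizm+Noun+Gen".split('+')[1:])
print(sorted(look_for & token_tags))    # ['Gen', 'Noun'] -> token is printed

other_tags = set("ve+Conj".split('+')[1:])
print(bool(look_for & other_tags))      # False -> token is skipped
```

An empty set is falsy, which is why the intersection can be used directly as the condition of an if statement.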

So the whole script looks like this:

import re

def index(fileName, lookFor):
    lookForSet = set(lookFor)  # Set of classification symbols to look for
    pat1 = re.compile(r'\s+')  # Regex to split line into tokens
    pat2 = re.compile(r'\+')   # Regex to split a token into words
    with open(fileName) as f:
        for lineNo, line in enumerate(f, start=1):
            line = line.rstrip()
            tokens = pat1.split(line)
            for token in tokens:
                words = pat2.split(token)
                word1 = words.pop(0)  # Initial word
                wordSet = set(words)  # Classification words
                commonWords = lookForSet.intersection(wordSet)
                if commonWords:
                    print("{:3}: {:<15} {}".format(lineNo, word1, ', '.join(commonWords)))

index('lines.txt', ['Noun', 'Gen'])

A fragment of its output, for my input data (a slightly shortened version of yours), looks like this:

1: Türkiye         Noun
1: terörizm        Noun, Gen
1: kitle           Noun
1: imha            Noun
2: Türkiye         Noun, Gen
2: potansiyel      Noun

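Each output line contains:

the source line number,
the first word of the token,
which of the classification words from lookFor were found in this token.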
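
If several different tags have to be looked up in the same file, one possible variation is to read the input once and build a mapping from each tag to the words that carry it. This is a sketch under that assumption, not part of the script above. The name build_tag_index is hypothetical, the function accepts any iterable of lines (an open file object or a list of strings), and it assumes Python 3.

```python
from collections import defaultdict


def build_tag_index(lines):
    """Map each tag to a list of (line number, word) pairs."""
    tag_index = defaultdict(list)
    for line_no, line in enumerate(lines, start=1):
        for token in line.split():           # whitespace-separated tokens
            word, *tags = token.split('+')   # word first, then its tags
            for tag in set(tags):            # record a repeated tag once per token
                tag_index[tag].append((line_no, word))
    return tag_index


sample = ["nispi+Adj \n", "görece+Adj+With \n", "club+Noun toplantı+Noun+A3pl+P3sg \n"]
tag_index = build_tag_index(sample)
print(tag_index['Adj'])    # [(1, 'nispi'), (2, 'görece')]
```

With a real file, the open file object can be passed in place of the sample list.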
