Print all the tokens in the file that are labelled with the morphological tag
I want to print all the tokens in a file that are labelled with a given morphological tag. So far I have written the code shown below.
def index(filepath, string):
    import re
    pattern = re.compile(r'(\w+)+')
    StringList = []
    StringList.append(string)
    with open(filepath) as f:
        for lineno, line in enumerate(f, start=1):
            words = set(m.group(1) for m in pattern.finditer(line))
            matches = [keyword for keyword in StringList if keyword in words]
            if matches:
                result = "{:<15} {}".format(','.join(matches), lineno)
                print(result)
    StringList.clear()

index('deneme.txt', '+Noun')
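The output looks like this. I can find Noun in the tokens, along with the line number, but I cannot print the part I actually want: I only want the word part before the + sign.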
Noun 1
Noun 2
Noun 3
Noun 4
Noun 5
Noun 6
Noun 7
The lines in my file look like this:
Türkiye+Noun ,+Punc terörizm+Noun+Gen ve+Conj kitle+Noun imha+Noun silah+Noun+A3pl+P3sg+Gen küresel+Adj düzey+Noun+Loc oluş+Verb+Caus+PastPart+P3sg tehdit+Noun+Gen boyut+Noun+P3sg karşı+Adj+P3sg+Loc ,+Punc tüm+Det ülke+Noun+A3pl+Gen yay+Verb+Pass+Inf2+Gen önle+Verb+Pass+Inf2+P3sg hedef+Noun+A3pl+P3sg+Acc paylaş+Verb+PastPart+P3pl ,+Punc daha+Noun güven+Noun+With ve+Conj istikrar+Noun+With bir+Num dünya+Noun düzen+Noun+P3sg için+PostpPCGen birlik+Noun+Loc çaba+Noun göster+Verb+PastPart+P3pl bir+Num aşama+Noun+Dat gel+Verb+Pass+Inf2+P3sg+Acc samimi+Adj ol+Verb+ByDoingSo arzula+Verb+Prog2+Cop .+Punc
Türkiye+Noun+Gen ekonomik+Adj ve+Conj insani+Adj potansiyel+Noun+P3sg ,+Punc güç+Noun+With savun+Verb+Inf2 kapasite+Noun+P3sg ,+Punc ulus+Noun+A3pl+InBetween çatış+Verb+Inf2+A3pl+Gen önle+Verb+Pass+Inf2+P3sg ve+Conj barış+Noun+P3sg inşa+Noun çaba+Noun+A3pl+P3sg+Dat aktif+Adj katılım+Noun+P3sg+Gen yanısıra+PostpPCGen ,+Punc fark+Noun+With kültür+Noun ve+Conj gelenek+Noun+A3pl+Dat ait+PostpPCDat seçkin+Adj özellik+Noun+A3pl+Acc birleş+Verb+Caus+PresPart bir+Num bünye+Noun+Dat sahip+Noun ol+Verb+Inf2+P3sg ,+Punc kendi+Pron+P3sg bölge+Noun+P3sg+Loc ve+Conj öte+Noun+P3sg+Loc önem+Noun+With rol+Noun oyna+Verb+Inf2+P3sg+Acc sağla+Verb+Fut değer+Noun+With özellik+Noun+A3pl+Cop .+Punc
Türkiye+Noun ,+Punc bu+Det önem+Noun+With katkı+Noun+Acc yap+Verb+Able+Inf1 için+PostpPCGen yeterli+Adj donanım+Noun+P3sg haiz+Adj bir+Num ülke+Noun+Cop ve+Conj gelecek+Noun nesil+Noun+A3pl için+PostpPCGen daha+Noun i+Noun+Acc bir+Num dünya+Noun oluş+Verb+Caus+Inf1 amaç+Noun+P3sg+Ins ,+Punc dost+Noun+A3pl+P3pl ve+Conj müttefik+Adj+A3pl+P3sg+Ins yakın+Noun bir+Num biçim+Noun+Loc çalış+Verb+Inf2+Dat devam+Noun et+Verb+Fut+Cop .+Punc
Ab+Noun ile+PostpPCNom gümrük+Noun Alan+Noun+P3sg+Loc+Rel kurumsal+Adj ilişki+Noun+A3pl
club+Noun toplantı+Noun+A3pl+P3sg
Türkiye+Noun -+Punc At+Noun gümrük+Noun işbirlik+Noun+P3sg komite+Noun+P3sg ,+Punc Ankara+Noun Anlaşma+Noun+P3sg+Gen 6+Num madde+Noun+P3sg uyar+Verb+When ortaklık+Noun rejim+Noun+P3sg+Gen uygula+Verb+Pass+Inf2+P3sg+Acc ve+Conj geliş+Verb+Inf2+P3sg+Acc sağla+Verb+Inf1 üzere+PostpPCNom ortaklık+Noun Konsey+Noun+P3sg+Gen 2+Num /+Punc 69+Num sayılı+Adj karar+Noun+P3sg ile+Conj teknik+Noun komite+Noun mahiyet+Noun+P3sg+Loc kur+Verb+Pass+Narr+Cop .+Punc
club+Noun toplantı+Noun+A3pl+P3sg
nispi+Adj
nisbi+Adj
görece+Adj+With
izafi+Adj
obur+Adj
I want to get the tokens when I supply a tag.
For example, when I write +Adj, I want to get all the tokens that contain +Adj (nispi, izafi, and so on).
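Matching on \w+ strips the + part from what you are looking for, so I split on the spaces between the tokens instead. After that, it is only a matter of getting the for and in clauses of the list comprehension in the right order.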
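To make the difference concrete, here is a minimal sketch (Python 3 assumed; the sample string is a short excerpt of the question's data, and the variable names are only illustrative) comparing what \w+ extracts with what a whitespace split keeps:

```python
import re

token_line = "küresel+Adj düzey+Noun+Loc karşı+Adj+P3sg+Loc"

# \w+ matches runs of word characters only, so every "+" is discarded
regex_pieces = re.findall(r"\w+", token_line)

# Splitting on whitespace keeps each token whole, including its "+Tag" parts
tokens = token_line.split()

print(regex_pieces)
print(tokens)

# "+Adj" can never be found among the regex pieces ...
print("+Adj" in regex_pieces)

# ... but a substring test on the whole tokens finds it
adj_tokens = [t for t in tokens if "+Adj" in t]
print(adj_tokens)
```

This is why a keyword such as '+Adj' cannot match anything produced by the \w+ pattern, while the whitespace split still sees it.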
def index(filepath, string):
    StringList = [string]
    with open(filepath) as f:
        for lineno, line in enumerate(f, start=1):
            # split() with no argument splits on any whitespace and
            # also drops the trailing newline from the last token
            words = line.split()
            matches = [word for keyword in StringList for word in words if keyword in word]
            if matches:
                result = "{:<15} {}".format(','.join(matches), lineno)
                print(result)

index('deneme.txt', '+Adj')
This gives the following result:
küresel+Adj,karşı+Adj+P3sg+Loc,samimi+Adj 1
ekonomik+Adj,insani+Adj,aktif+Adj,seçkin+Adj 2
yeterli+Adj,haiz+Adj,müttefik+Adj+A3pl+P3sg+Ins 3
kurumsal+Adj 4
sayılı+Adj 6
nispi+Adj 8
nisbi+Adj 9
görece+Adj+With 10
izafi+Adj 11
obur+Adj 12
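I removed the line StringList.clear() because it raised an error for some reason.
This works on Python 2.7 and 3.6+, although extended Unicode characters in the text break the column alignment on 2.7.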
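I think your idea of how to use regular expressions needs some improvement.
Note that each input line contains multiple "tokens", e.g. terörizm+Noun+Gen.
As you can see, a token contains:
- the first word: the actual word from the text,
- a number of classification symbols, each preceded by a + character.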
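So:
- each line should be split into tokens on runs of whitespace characters,
- each token should be split into words on + characters,
- the first of these words is the "actual" word,
- the remaining words (without the +) are the classification symbols.
It is also good practice to strip the trailing whitespace (at least the \n).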
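The steps above can be sketched on a single in-memory line as follows. This is only an illustration under stated assumptions: parse_line is a hypothetical helper name, and the starred unpacking requires Python 3.

```python
def parse_line(line):
    """Split one line into (word, tags) pairs (hypothetical helper)."""
    pairs = []
    # split() with no argument splits on runs of whitespace and
    # ignores the trailing newline
    for token in line.split():
        # the first piece is the actual word, the rest are its tags
        word, *tags = token.split("+")
        pairs.append((word, tags))
    return pairs


sample = "Türkiye+Noun ,+Punc terörizm+Noun+Gen\n"
for word, tags in parse_line(sample):
    print(word, tags)
```

A token whose word is itself a plus sign would need special handling, which this sketch does not attempt.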
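Note also that your code contains StringList, so you are apparently aware that this function may have to look for one or more of several classification words.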
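I programmed it in a slightly different way:
- the second parameter (lookFor) is a list of words, which is converted to a set (lookForSet),
- the list of words (the result of splitting a token, minus the first word) is also converted to a set.
The decision whether to print a word (the first word of a token) is based on whether at least one of its classification symbols can be found in lookForSet.
In other words: whether lookForSet and wordSet have any common elements (set intersection).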
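As a small illustration of that test in isolation (the helper name common_tags and the sample tag lists are assumptions made for this sketch), a non-empty intersection means the token should be reported:

```python
def common_tags(look_for, tags):
    """Return the tags that occur in both collections, sorted for stable output."""
    return sorted(set(look_for) & set(tags))


look_for = ["Noun", "Gen"]
for tags in (["Noun", "Gen"], ["Verb", "Caus"], ["Noun", "A3pl"]):
    found = common_tags(look_for, tags)
    # An empty list is falsy, so "if found:" plays the role of the print decision
    print(tags, "->", found)
```

Sorting is used here only because sets have no guaranteed iteration order.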
So the whole script looks like this:
import re

def index(fileName, lookFor):
    lookForSet = set(lookFor)    # Set of classification symbols to look for
    pat1 = re.compile(r'\s+')    # Regex to split line into tokens
    pat2 = re.compile(r'\+')     # Regex to split a token into words
    with open(fileName) as f:
        for lineNo, line in enumerate(f, start=1):
            line = line.rstrip()
            tokens = pat1.split(line)
            for token in tokens:
                words = pat2.split(token)
                word1 = words.pop(0)    # Initial word
                wordSet = set(words)    # Classification words
                commonWords = lookForSet.intersection(wordSet)
                if commonWords:
                    print("{:3}: {:<15} {}".format(lineNo, word1, ', '.join(commonWords)))

index('lines.txt', ['Noun', 'Gen'])
A piece of its output, for my input data (a slightly shortened version of yours), is shown below:
1: Türkiye Noun
1: terörizm Noun, Gen
1: kitle Noun
1: imha Noun
2: Türkiye Noun, Gen
2: potansiyel Noun
It contains:
- the source line number,
- the first word of the token,
- which of the classification words from lookFor were found in this token.
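Since the question asks only for the word part before the + sign, a variant that returns just those words might look like the sketch below. It assumes Python 3; words_with_tag is a hypothetical name, the tag is passed without its leading +, and the sample lines are taken from the question's data.

```python
def words_with_tag(lines, tag):
    """Return the bare words of all tokens that carry `tag` (given without '+')."""
    found = []
    for line in lines:
        for token in line.split():
            word, *tags = token.split("+")
            # exact comparison against the token's tags, not a substring test
            if tag in tags:
                found.append(word)
    return found


sample_lines = [
    "nispi+Adj\n",
    "görece+Adj+With\n",
    "club+Noun toplantı+Noun+A3pl+P3sg\n",
    "izafi+Adj\n",
]
print(words_with_tag(sample_lines, "Adj"))
```

Because the function accepts any iterable of lines, an open file object can be passed in place of the list.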