Context-free grammar with feature structures in Python

I'm trying to generate sentences from a defined grammar with Python. To avoid agreement problems, I'm using feature structures.

Here's the code I have so far:

>>> from __future__ import print_function
>>> import nltk
>>> from nltk.featstruct import FeatStruct
>>> from nltk import grammar, parse
>>> from nltk.parse.generate import generate
>>> from nltk import CFG
>>> g = """
% start DP
DP-> D[AGR=[NUM='sg', PERS=3, GND='m']] N[AGR=[NUM='sg', GND='m']]
D[AGR=[NUM='sg', PERS=3, GND='f']] -> 'une' | 'la'
D[AGR=[NUM='sg', PERS=3, GND='m']] -> 'un' | 'le'
D[AGR=[NUM='pl', PERS=3]] -> 'des' | 'les'
N[AGR=[NUM='sg', GND='m']] -> 'garçon'
N[AGR=[NUM='pl', GND='m']] -> 'garçons'
N[AGR=[NUM='sg', GND='f']] -> 'fille'
N[AGR=[NUM='pl', GND='f']] -> 'filles'
"""
>>> for sentence in generate(grammar, n=30):
        print(''.join(sentence))

This is the output I get:

unegarçon
unegarçons
unefille
unefilles
lagarçon
lagarçons
lafille
lafilles
ungarçon
ungarçons
unfille
unfilles
legarçon
legarçons
lefille
lefilles
desgarçon
desgarçons
desfille
desfilles
lesgarçon
lesgarçons
lesfille
lesfilles

But I should be getting output like this:

un garçon
le garçon

The problems I'm running into are:

  1. Agreement isn't being enforced; I get sentences that violate agreement

  2. There's no space between the words in the generated sentences

What am I not seeing?

Let's tackle the easy part of the question first.

Q2. There's no space between the words in the generated sentences

As far as the printing goes, you're really close =)

The problem is how you're using the str.join function.

>>> list_of_str = ['a', 'b', 'c']
>>> ''.join(list_of_str)
'abc'
>>> ' '.join(list_of_str)
'a b c'
>>> '|'.join(list_of_str)
'a|b|c'
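
Applied to your loop, that just means joining the generated tokens with a space instead of the empty string, e.g. (a minimal sketch, assuming grammar has already been built with FeatureGrammar.fromstring as in the snippets further below):

for sentence in generate(grammar, n=30):
    # generate() yields each sentence as a list of terminal strings,
    # so join them with a space to get a readable phrase.
    print(' '.join(sentence))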

Q1. Agreement isn't being enforced; I get sentences that violate agreement

The first warning sign

For a feature-structure grammar to enforce agreement, there should be a rule whose right-hand side (RHS) contains something like D[AGR=?a] N[AGR=?a], e.g.

NP -> D[AGR=?a] N[AGR=?a] 

Since that's missing, there is no real agreement rule in the grammar; see http://www.nltk.org/howto/featgram.html
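
To see why that shared ?a variable matters: the D and the N can only combine when their AGR values unify. Here is a minimal sketch of that unification with nltk.featstruct.FeatStruct (already imported in your snippet); the feature values are just for illustration:

from nltk.featstruct import FeatStruct

masc_sg = FeatStruct("[NUM='sg', GND='m']")
fem_sg = FeatStruct("[NUM='sg', GND='f']")

# Compatible feature bundles unify; the result carries features from both sides.
print(masc_sg.unify(FeatStruct("[NUM='sg']")))

# Clashing values fail to unify and return None, which is what would
# rule out a combination like 'une garcon' during parsing.
print(masc_sg.unify(fem_sg))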

Here comes the problem!

If we look closely at the nltk.parse.generate code, it simply yields every possible combination of terminals and doesn't seem to care about the feature structures: https://github.com/nltk/nltk/blob/develop/nltk/parse/generate.py

(I'd consider that a bug rather than a feature, so it would be good to raise an issue on the NLTK repository.)

So if we do this, it prints every possible combination of terminals (with no regard for agreement):

from nltk import grammar, parse
from nltk.parse.generate import generate

# If person is always 3rd, we can skip the PERSON feature.
g = """
DP -> D[AGR=?a] N[AGR=?a] 
N[AGR=[NUM='sg', GND='m']] -> 'garcon'
N[AGR=[NUM='sg', GND='f']] -> 'fille'
D[AGR=[NUM='sg', GND='m']] -> 'un'
D[AGR=[NUM='sg', GND='f']] -> 'une'

"""

grammar = grammar.FeatureGrammar.fromstring(g)

print(list(generate(grammar, n=30)))

[out]:

[['un', 'garcon'], ['un', 'fille'], ['une', 'garcon'], ['une', 'fille']]

But if we try to parse valid and invalid sentences, the agreement rule does kick in:

from nltk import grammar, parse
from nltk.parse.generate import generate

g = """
DP -> D[AGR=?a] N[AGR=?a] 
N[AGR=[NUM='sg', GND='m']] -> 'garcon'
N[AGR=[NUM='sg', GND='f']] -> 'fille'
D[AGR=[NUM='sg', GND='m']] -> 'un'
D[AGR=[NUM='sg', GND='f']] -> 'une'

"""

grammar = grammar.FeatureGrammar.fromstring(g)

parser = parse.FeatureEarleyChartParser(grammar)

trees = parser.parse('une garcon'.split())  # Invalid sentence.
print("Parses for 'une garcon':", list(trees))

trees = parser.parse('un garcon'.split())  # Valid sentence.
print("Parses for 'un garcon':", list(trees))

[out]:

Parses for 'une garcon': []
Parses for 'un garcon': [Tree(DP[], [Tree(D[AGR=[GND='m', NUM='sg']], ['un']), Tree(N[AGR=[GND='m', NUM='sg']], ['garcon'])])]

To enforce agreement at generation time, one possible solution is to parse each generated sequence and keep only the ones that can be parsed, e.g.

from nltk import grammar, parse
from nltk.parse.generate import generate

g = """
DP -> D[AGR=?a] N[AGR=?a] 
N[AGR=[NUM='sg', GND='m']] -> 'garcon'
N[AGR=[NUM='sg', GND='f']] -> 'fille'
D[AGR=[NUM='sg', GND='m']] -> 'un'
D[AGR=[NUM='sg', GND='f']] -> 'une'

"""

grammar = grammar.FeatureGrammar.fromstring(g)
parser = parse.FeatureEarleyChartParser(grammar)

for tokens in list(generate(grammar, n=30)):
    parsed_tokens = parser.parse(tokens)
    try: 
        first_parse = next(parsed_tokens) # Check if there's a valid parse.
        print(' '.join(first_parse.leaves()))
    except StopIteration:
        continue

[out]:

un garcon
une fille

I suppose the goal is to generate the last two columns of the table:

Without the prepositions:

from nltk import grammar, parse
from nltk.parse.generate import generate

g = """
DP -> D[AGR=?a] N[AGR=?a] 

N[AGR=[NUM='sg', GND='m']] -> 'garcon'
N[AGR=[NUM='sg', GND='f']] -> 'fille'

N[AGR=[NUM='pl', GND='m']] -> 'garcons'
N[AGR=[NUM='pl', GND='f']] -> 'filles'

D[AGR=[NUM='sg', GND='m']] -> 'un'
D[AGR=[NUM='sg', GND='f']] -> 'une'

D[AGR=[NUM='sg', GND='m']] -> 'le'
D[AGR=[NUM='sg', GND='f']] -> 'la'

D[AGR=[NUM='pl', GND='m']] -> 'les'
D[AGR=[NUM='pl', GND='f']] -> 'les'


"""

grammar = grammar.FeatureGrammar.fromstring(g)
parser = parse.FeatureEarleyChartParser(grammar)

valid_productions = set()

for tokens in list(generate(grammar, n=30)):
    parsed_tokens = parser.parse(tokens)
    try: 
        first_parse = next(parsed_tokens) # Check if there's a valid parse.
        valid_productions.add(' '.join(first_parse.leaves()))
    except StopIteration:
        continue

for np in sorted(valid_productions):
    print(np)

[out]:

la fille
le garcon
les filles
les garcons
un garcon
une fille

Now including the prepositions

The TOP (aka START) of the grammar needs to have multiple branches. Right now the DP -> D[AGR=?a] N[AGR=?a] rule sits at the TOP, so to allow the PP constructions we need something like PHRASE -> DP | PP and make PHRASE the new TOP non-terminal, e.g.

from nltk import grammar, parse
from nltk.parse.generate import generate

g = """

PHRASE -> DP | PP 

DP -> D[AGR=?a] N[AGR=?a] 
PP -> P[AGR=?a] N[AGR=?a] 

P[AGR=[NUM='sg', GND='m']] -> 'du' | 'au'

N[AGR=[NUM='sg', GND='m']] -> 'garcon'
N[AGR=[NUM='sg', GND='f']] -> 'fille'

N[AGR=[NUM='pl', GND='m']] -> 'garcons'
N[AGR=[NUM='pl', GND='f']] -> 'filles'

D[AGR=[NUM='sg', GND='m']] -> 'un'
D[AGR=[NUM='sg', GND='f']] -> 'une'

D[AGR=[NUM='sg', GND='m']] -> 'le'
D[AGR=[NUM='sg', GND='f']] -> 'la'

D[AGR=[NUM='pl', GND='m']] -> 'les'
D[AGR=[NUM='pl', GND='f']] -> 'les'

"""

french_grammar = grammar.FeatureGrammar.fromstring(g)
parser = parse.FeatureEarleyChartParser(french_grammar)

valid_productions = set()

for tokens in list(generate(french_grammar, n=100)):
    parsed_tokens = parser.parse(tokens)
    try: 
        first_parse = next(parsed_tokens) # Check if there's a valid parse.
        valid_productions.add(' '.join(first_parse.leaves()))
    except StopIteration:
        continue

for np in sorted(valid_productions):
    print(np)

[out]:

au garcon
du garcon
la fille
le garcon
les filles
les garcons
un garcon
une fille

Getting everything in the table:

from nltk import grammar, parse
from nltk.parse.generate import generate

g = """

PHRASE -> DP | PP 

DP -> D[AGR=?a] N[AGR=?a] 
PP -> P[AGR=[GND='m', NUM='sg']] N[AGR=[GND='m', NUM='sg']]
PP -> P[AGR=[GND='f', NUM='sg']] D[AGR=[GND='f', NUM='sg', DEF='d']] N[AGR=[GND='f', NUM='sg']]
PP -> P[AGR=[GND=?a, NUM='pl']] N[AGR=[GND=?a, NUM='pl']]


P[AGR=[NUM='sg', GND='m']] -> 'du' | 'au'
P[AGR=[NUM='sg', GND='f']] -> 'de' | 'à'
P[AGR=[NUM='pl']] -> 'des' | 'aux'


N[AGR=[NUM='sg', GND='m']] -> 'garcon'
N[AGR=[NUM='sg', GND='f']] -> 'fille'

N[AGR=[NUM='pl', GND='m']] -> 'garcons'
N[AGR=[NUM='pl', GND='f']] -> 'filles'

D[AGR=[NUM='sg', GND='m', DEF='i']] -> 'un'
D[AGR=[NUM='sg', GND='f', DEF='i']] -> 'une'

D[AGR=[NUM='sg', GND='m', DEF='d']] -> 'le'
D[AGR=[NUM='sg', GND='f', DEF='d']] -> 'la'

D[AGR=[NUM='pl', GND='m']] -> 'les'
D[AGR=[NUM='pl', GND='f']] -> 'les'



"""

french_grammar = grammar.FeatureGrammar.fromstring(g)
parser = parse.FeatureEarleyChartParser(french_grammar)

valid_productions = set()

for tokens in list(generate(french_grammar, n=100000)):
    parsed_tokens = parser.parse(tokens)
    try: 
        first_parse = next(parsed_tokens) # Check if there's a valid parse.
        valid_productions.add(' '.join(first_parse.leaves()))
    except StopIteration:
        continue

for np in sorted(valid_productions):
    print(np)

[out]:

au garcon
aux filles
aux garcons
de la fille
des filles
des garcons
du garcon
la fille
le garcon
les filles
les garcons
un garcon
une fille
à la fille

Beyond the table

It's also possible to generate de|à un(e) garcon|fille, i.e.

  • de un garcon
  • de une fille
  • à un garcon
  • à une fille

But I'm not sure whether those are valid French phrases. If they are, you can underspecify the feminine singular PP rule and drop the DEF feature, changing:

PP -> P[AGR=[GND='f', NUM='sg']] D[AGR=[GND='f', NUM='sg', DEF='d']] N[AGR=[GND='f', NUM='sg']]

to:

PP -> P[AGR=[GND='f', NUM='sg']] D[AGR=[GND='f', NUM='sg']] N[AGR=[GND='f', NUM='sg']]

and then add an extra rule to produce the masculine singular indefinite PPs:

PP -> P[AGR=[GND='f', NUM='sg']] D[AGR=[GND='m', NUM='sg', DEF='i']] N[AGR=[GND='m', NUM='sg']]

TL;DR

from nltk import grammar, parse
from nltk.parse.generate import generate

g = """

PHRASE -> DP | PP 

DP -> D[AGR=?a] N[AGR=?a] 
PP -> P[AGR=[GND='m', NUM='sg']] N[AGR=[GND='m', NUM='sg']]
PP -> P[AGR=[GND='f', NUM='sg']] D[AGR=[GND='f', NUM='sg']] N[AGR=[GND='f', NUM='sg']]
PP -> P[AGR=[GND='f', NUM='sg']] D[AGR=[GND='m', NUM='sg', DEF='i']] N[AGR=[GND='m', NUM='sg']]
PP -> P[AGR=[GND=?a, NUM='pl']] N[AGR=[GND=?a, NUM='pl']]


P[AGR=[NUM='sg', GND='m']] -> 'du' | 'au'
P[AGR=[NUM='sg', GND='f']] -> 'de' | 'à'
P[AGR=[NUM='pl']] -> 'des' | 'aux'


N[AGR=[NUM='sg', GND='m']] -> 'garcon'
N[AGR=[NUM='sg', GND='f']] -> 'fille'

N[AGR=[NUM='pl', GND='m']] -> 'garcons'
N[AGR=[NUM='pl', GND='f']] -> 'filles'

D[AGR=[NUM='sg', GND='m', DEF='i']] -> 'un'
D[AGR=[NUM='sg', GND='f', DEF='i']] -> 'une'

D[AGR=[NUM='sg', GND='m', DEF='d']] -> 'le'
D[AGR=[NUM='sg', GND='f', DEF='d']] -> 'la'

D[AGR=[NUM='pl', GND='m']] -> 'les'
D[AGR=[NUM='pl', GND='f']] -> 'les'



"""

french_grammar = grammar.FeatureGrammar.fromstring(g)
parser = parse.FeatureEarleyChartParser(french_grammar)

valid_productions = set()

for tokens in list(generate(french_grammar, n=100000)):
    parsed_tokens = parser.parse(tokens)
    try: 
        first_parse = next(parsed_tokens) # Check if there's a valid parse.
        valid_productions.add(' '.join(first_parse.leaves()))
    except StopIteration:
        continue

for np in sorted(valid_productions):
    print(np)

[out]:

au garcon
aux filles
aux garcons
de la fille
de un garcon
de une fille
des filles
des garcons
du garcon
la fille
le garcon
les filles
les garcons
un garcon
une fille
à la fille
à un garcon
à une fille