Tokenizing and parsing with Python

Question

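I have no code to show you because I don't know how to start. The current goal is to at least be able to create tokens from a file that contains some data, for example: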

file.txt

Name : Sid
data : Lazy Developer

%description 
This is a packaging file 

%install
 Enter the location to install the package.

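The Python code should be able to create tokens from this file and then print the data on request, depending on the input.

If getData() is the function, then

getData('name') should output "Sid", and getData('description') should return the text below it.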

Answer 1

As the commenters said, your question is not a good fit for this site. However, I will try to point you in the right direction.

Your file.txt is actually a YAML file. See this answer.

import yaml
with open('file.txt', 'r') as f:
    doc = yaml.safe_load(f)  # safe_load avoids executing arbitrary tags; plain load() needs an explicit Loader
print(doc["Name"])

I also strongly recommend reading this section of Dive Into Python (and reading the whole book). In the future, try some code first and share it with your question.

Answer 2

To retrieve the data from file.txt:

data = {}
with open('file.txt', 'r') as f:  # opens the file
    for line in f:  # reads line by line
        if ' : ' not in line:  # skips blank lines and lines without a "key : value" pair
            continue
        key, value = line.split(' : ', 1)  # retrieves the key and the value
        data[key.lower()] = value.rstrip()  # key to lower case; removes trailing whitespace and the end-of-line '\n'

Then data['name'] returns 'Sid'.

Edit: since the question has been updated, here is a new solution:

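data = {}
with open('file.txt', 'r') as f:
    # blocks are separated by blank lines; the first block holds the "key : value" pairs
    header, *descriptions = f.read().split('\n\n')
    for line in header.split('\n'):
        key, value = line.split(' : ', 1)
        data[key.lower()] = value.rstrip()
    for description in descriptions:
        # the first line is "%section", the rest is its text
        key, value = description.split('\n', 1)
        data[key[1:].strip()] = value.strip()
print(data)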

If there are extra spaces between the lines or at the end of the keys, you may need to adjust this...
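One possible way to make this more tolerant of stray whitespace is to walk the file line by line and track the current %section. The sketch below is only an illustration: the names parse and get_data and the inline SAMPLE string are made up here (SAMPLE stands in for the contents of file.txt), and it assumes section names start with % at the beginning of a line.

```python
# Stand-in for the contents of file.txt (hypothetical inline copy).
SAMPLE = """Name : Sid
data : Lazy Developer

%description
This is a packaging file

%install
 Enter the location to install the package.
"""

def parse(text):
    """Collect 'key : value' header lines and '%section' blocks into a dict."""
    data = {}
    section = None  # name of the %section currently being collected, if any
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith('%'):
            # a new section starts; its body lines follow
            section = line[1:].strip().lower()
            data[section] = []
        elif section is not None:
            if line:  # ignore blank lines inside a section
                data[section].append(line)
        elif ':' in line:
            # header line before any section: "key : value"
            key, _, value = line.partition(':')
            data[key.strip().lower()] = value.strip()
    # join the collected section lines back into one string per section
    return {k: '\n'.join(v) if isinstance(v, list) else v
            for k, v in data.items()}

def get_data(data, key):
    """Case-insensitive lookup, mirroring the getData() idea from the question."""
    return data[key.strip().lower()]

parsed = parse(SAMPLE)
print(get_data(parsed, 'name'))
print(get_data(parsed, 'Description'))
```

To use it on the real file, read the text with open('file.txt').read() and pass it to parse().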

A shorter approach might be to use a regex and the group() method of the resulting match objects.
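A rough sketch of what the regex idea could look like, using only the standard re module. The patterns, the function name parse_with_regex, and the inline SAMPLE string are assumptions made for illustration (SAMPLE stands in for file.txt); real files may need different patterns.

```python
import re

# Stand-in for the contents of file.txt (hypothetical inline copy).
SAMPLE = """Name : Sid
data : Lazy Developer

%description
This is a packaging file

%install
 Enter the location to install the package.
"""

# "key : value" on a single line; group(1) is the key, group(2) the value
HEADER_RE = re.compile(r'^(\w+)[ \t]*:[ \t]*(.*?)[ \t]*$', re.MULTILINE)
# "%name" followed by everything up to the next line starting with % or the end
SECTION_RE = re.compile(r'^%(\w+)[ \t]*\n(.*?)(?=^%|\Z)',
                        re.MULTILINE | re.DOTALL)

def parse_with_regex(text):
    data = {}
    # only look for "key : value" pairs before the first %section,
    # so a colon inside a section body is not mistaken for a header
    first_section = re.search(r'^%', text, re.MULTILINE)
    header = text[:first_section.start()] if first_section else text
    for m in HEADER_RE.finditer(header):
        data[m.group(1).lower()] = m.group(2)
    for m in SECTION_RE.finditer(text):
        data[m.group(1).lower()] = m.group(2).strip()
    return data

result = parse_with_regex(SAMPLE)
print(result['name'])
print(result['description'])
```

Whether this is actually shorter or clearer than the split-based version is a matter of taste; the regexes are compact but harder to adjust when the file format changes.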
