In PyParsing, how do I specify that a Word must not equal a given literal?
I am trying to parse APK download pages from http://www.apkmirror.com, such as http://www.apkmirror.com/apk/google-inc/gmail/gmail-7-3-26-152772569-release-release/gmail-7-3-26-152772569-release-android-apk-download/. Typically, the "APK details" section has the following structure:
I would like to parse "17329196" as the version_code, "arm" as the architecture, and "com.skype.m2" as the package. Sometimes, however, the architecture line is missing, as shown here:
So far, using the Scrapy selector
apk_details = response.xpath('//*[@title="APK details"]/following-sibling::*[@class="appspec-value"]//text()').extract()
I have been able to extract a list containing the 'lines' shown above. I am trying to write a function parse_apk_details so that the following tests pass:
import pytest


def test_parse_apk_details_with_architecture():
    apk_details = [u'Version: 3.0.38_ww (4030038)',
                   u'arm ',
                   u'Package: com.lenovo.anyshare.gps',
                   u'\n',
                   u'2,239 downloads ']
    version_code, architecture, package = parse_apk_details(apk_details)
    assert version_code == 4030038
    assert architecture == "arm"
    assert package == "com.lenovo.anyshare.gps"


@pytest.mark.skip(reason="This does not work yet, because 'Package:' is interpreted by the parser as the architecture.")
def test_parse_apk_details_without_architecture():
    apk_details = [u'Version: 3.0.38_ww (4030038)',
                   u'Package: com.lenovo.anyshare.gps',
                   u'\n',
                   u'2,239 downloads ']
    version_code, architecture, package = parse_apk_details(apk_details)
    assert version_code == 4030038
    assert package == "com.lenovo.anyshare.gps"


if __name__ == "__main__":
    pytest.main([__file__])
However, as noted above, the second test does not pass yet. Here is the function so far:
from pyparsing import Word, printables, nums, Optional


def parse_apk_details(apk_details):
    apk_details = "\n".join(apk_details)  # The newline character is ignored by PyParsing (by default)
    version_name = Word(printables)  # The version name can consist of all printable, non-whitespace characters
    version_code = Word(nums)  # The version code is expected to be an integer
    architecture = Word(printables)
    package = Word(printables)
    expression = "Version:" + version_name + "(" + version_code("version_code") + ")" + Optional(architecture("architecture")) + "Package:" + package("package")
    result = expression.parseString(apk_details)
    return int(result.get("version_code")), result.get("architecture"), result.get("package")
The error I get when I try to run the second test is:
ParseException: Expected "Package:" (at char 38), (line:2, col:10)
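I believe what is happening is that "Package:" is being 'consumed' as the architecture. One way to fix this would be to change the line architecture = Word(printables) to something like (pseudocode) architecture = Word(printables) + ~"Package:", to indicate that it can be anything made of printable characters except the word "Package:".
How can I make sure that architecture is parsed only when it is not the specific word "Package:"? (I am also interested in alternative, Scrapy-based solutions to the original problem.)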
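A quick check of the diagnosis above, as a minimal sketch: Word(printables) skips leading whitespace (newlines included, by default) and then matches any run of printable characters, so it accepts "Package:" just as readily as "arm". The sample string is taken from the test data; the variable name consumed is only illustrative.

```python
from pyparsing import Word, printables

# The text that follows the closing ")" when the architecture line is missing.
remainder = "\nPackage: com.lenovo.anyshare.gps"

# Word(printables) has no notion of reserved words, so it takes the first
# whitespace-delimited token it finds, which here is the "Package:" label.
consumed = Word(printables).parseString(remainder)[0]
print(consumed)
```

Once that label has been taken as the architecture, the literal "Package:" that the grammar expects next can no longer match, which is consistent with the ParseException shown above.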
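I ended up using a different characteristic of the line containing the architecture (e.g. "arm"): the fact that the architecture, if present, is followed by a newline. I changed the parse_apk_details function to the following: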
from pyparsing import Word, printables, nums, Optional, LineEnd, FollowedBy, Suppress


def parse_apk_details(apk_details):
    apk_details = "\n".join(apk_details)  # The newline character is ignored by PyParsing (by default)
    version_name = Word(printables).setResultsName("version")  # The version name can consist of all printable, non-whitespace characters
    version_code = Word(nums).setResultsName("version_code")  # The version code is expected to be an integer
    architecture = Word(printables).setResultsName("architecture") + Suppress(FollowedBy(LineEnd()))
    package = Word(printables).setResultsName("package")
    expression = "Version:" + version_name + "(" + version_code + ")" + Optional(architecture) + "Package:" + package
    result = expression.parseString(apk_details)
    return int(result.get("version_code")), result.get("architecture"), result.get("package")
Both tests now pass.
You were really close with architecture = Word(printables) + ~Literal("Package:"). To do a negative lookahead, start with the negation, then the match:
architecture = ~Literal("Package:") + Word(printables)
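For completeness, here is one way the negative lookahead might be dropped into the question's function. This is a sketch rather than a definitive version: the grammar is the asker's, and the only deliberate change besides the lookahead is that the results name is attached to the Word itself rather than to the combined expression, on the assumption that this keeps the stored value a plain string.

```python
from pyparsing import Literal, Optional, Word, nums, printables


def parse_apk_details(apk_details):
    text = "\n".join(apk_details)  # newlines are skipped as whitespace by default
    version_name = Word(printables)
    version_code = Word(nums)
    # Negative lookahead first, then the match: fail if "Package:" comes next,
    # otherwise take the next run of printable characters as the architecture.
    architecture = ~Literal("Package:") + Word(printables)("architecture")
    package = Word(printables)
    expression = (
        "Version:" + version_name + "(" + version_code("version_code") + ")"
        + Optional(architecture)
        + "Package:" + package("package")
    )
    result = expression.parseString(text)
    return int(result.get("version_code")), result.get("architecture"), result.get("package")


with_arch = parse_apk_details([u'Version: 3.0.38_ww (4030038)',
                               u'arm ',
                               u'Package: com.lenovo.anyshare.gps',
                               u'\n',
                               u'2,239 downloads '])
without_arch = parse_apk_details([u'Version: 3.0.38_ww (4030038)',
                                  u'Package: com.lenovo.anyshare.gps',
                                  u'\n',
                                  u'2,239 downloads '])
```

When the architecture line is missing, the lookahead makes the Optional match nothing, so result.get("architecture") returns None and "Package:" is left for the literal that follows.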