Sphinx 3 search engine: problems reading JSON from a CSV source

When I try to read JSON content from a field, I get:

WARNING: document 1, attribute assorted: JSON error: syntax error, unexpected TOK_IDENT, expecting $end near 'a:foo'

Details follow.

This is the (super-simplified) CSV file I am trying to read:

1,hello world, document number one,a:foo
22,hello again, document number two,foo:bar
23,hello now, This is some stuff,foo:{bar:baz}
24,hello cow, more test stuff and things,{foo:bar}
55,hello suess, box and sox and goats and moats,[a]
56,hello raven, nevermore said the thing,foo:bar

When I run the indexer, this is what I get:


../bin/indexer --config /home/ec2-user/sphinx/etc/sphinx.conf --all --rotate


Sphinx 3.3.1 (commit b72d67b)
Copyright (c) 2001-2020, Andrew Aksyonoff
Copyright (c) 2008-2016, Sphinx Technologies Inc (http://sphinxsearch.com)

using config file '/home/ec2-user/sphinx/etc/sphinx.conf'...
indexing index 'csvtest'...
WARNING: document 1, attribute assorted: JSON error: syntax error, unexpected TOK_IDENT, expecting $end near 'a:foo'
WARNING: document 22, attribute assorted: JSON error: syntax error, unexpected TOK_IDENT, expecting $end near 'foo:bar'
WARNING: document 23, attribute assorted: JSON error: syntax error, unexpected TOK_IDENT, expecting $end near 'foo:{bar:baz}'
WARNING: document 24, attribute assorted: JSON error: syntax error, unexpected '}', expecting '[' near '}'
WARNING: document 55, attribute assorted: JSON error: syntax error, unexpected ']', expecting '[' near ']'
WARNING: document 56, attribute assorted: JSON error: syntax error, unexpected TOK_IDENT, expecting $end near 'foo:bar'
collected 6 docs, 0.0 MB
sorted 0.0 Mhits, 100.0% done
total 6 docs, 0.1 Kb
total 0.0 sec, 17.7 Kb/sec, 1709 docs/sec
rotating indices: successfully sent SIGHUP to searchd (pid=14393).

This is the entire config file:

source csvsrc
{
    type = csvpipe
    csvpipe_delimiter = ,
    csvpipe_command = cat /home/ec2-user/sphinx/etc/example.csv
    csvpipe_field_string =t
    csvpipe_attr_string =c
    csvpipe_attr_json =assorted
}



index csvtest
{
    source          = csvsrc
    path            = /var/data/test7
    morphology      = stem_en
    rt_field = t
    rt_field = c
    rt_field = assorted

}


indexer
{
    mem_limit       = 128M
}

searchd
{
    listen          = 9312
    listen          = 9306:mysql41
    log             = /var/log/searchd.log
    query_log       = /var/log/query.log
    pid_file        = /var/log/searchd.pid
    binlog_path     = /var/data
}

If I log in and run a query, it is clear that the JSON was not actually indexed (as expected, given the warnings):

 select * from csvtest;
+------+-------------+----------------------------------+----------+
| id   | t           | c                                | assorted |
+------+-------------+----------------------------------+----------+
|    1 | hello world |  document number one             | NULL     |
|   22 | hello again |  document number two             | NULL     |
|   23 | hello now   |  This is some stuff              | NULL     |
|   24 | hello cow   |  more test stuff and things      | NULL     |
|   55 | hello suess |  box and sox and goats and moats | NULL     |
|   56 | hello raven |  nevermore said the thing        | NULL     |
+------+-------------+----------------------------------+----------+
6 rows in set (0.00 sec)

I have tried a few things, but I am just fumbling in the dark. Some of the things I tried:

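  1. Alternative formats for the JSON. I tried {foo:bar}, {[foo:bar]}, and [{foo,bar}], because in my experience other JSON inputs expect an array or a dictionary at the top level. These actually produce slightly different errors: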
WARNING: document 24, attribute assorted: JSON error: syntax error, unexpected '}', expecting '[' near '}'
WARNING: document 55, attribute assorted: JSON error: syntax error, unexpected ']', expecting '[' near ']'
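  2. I tried adding a trailing comma, thinking it might be the $end token the parser is looking for. This produces an actual error, ERROR: index 'csvtest': source 'csvsrc': not all columns found (found=5, total=4, line=1)., which prevents the index from being built. That makes sense to me.

2a) I tried adding a whole extra column after the JSON, so that I could keep the trailing comma without hitting the error that blocks index generation. This did build the index, but it did not supply the $end token the JSON parser is looking for.

I am completely stumped.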

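So a:foo is not a valid JSON value, AFAIK. It looks like it is meant to be an object? If so, it needs {...} around it.

But even {foo:bar} is invalid. At the very least the value has to be quoted: {foo:"bar"}. And really the key has to be quoted as well: {"foo":"bar"}

JavaScript object literals technically allow unquoted key names, but JSON requires the quotes.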
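
To check candidate values before putting them into the CSV, a quick test with a strict JSON parser can help. The sketch below uses Python's standard json module as a stand-in; it is not Sphinx's own parser, which may be more or less lenient, and the helper name is_valid_json is just an illustration.

```python
import json


def is_valid_json(text: str) -> bool:
    """Return True if `text` parses as strict JSON (per Python's json module)."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


# Values taken from the question, plus properly quoted variants.
candidates = [
    'a:foo',                  # bare key:value, no braces, no quotes
    '{foo:bar}',              # braces, but unquoted key and value
    '{foo:"bar"}',            # value quoted, key still unquoted
    '[a]',                    # array with an unquoted identifier
    '{"foo":"bar"}',          # strict JSON object
    '{"foo":{"bar":"baz"}}',  # strict JSON nested object
]

results = {text: is_valid_json(text) for text in candidates}

for text, ok in results.items():
    print(f"{text!r}: {'valid' if ok else 'invalid'}")
```

Only the last two candidates pass, which lines up with the warnings the indexer printed for every row of the original file.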

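...and remember this is CSV as well. Quotes are normally used to quote fields (for example, when a column contains a comma), so the quotes inside the JSON have to be doubled! It ends up a bit messy: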

24,hello cow, more test stuff and things,"{""foo"":""bar""}"
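
Doing the double-quoting by hand is error-prone, so it may be easier to generate the file. The following is a minimal sketch, assuming the CSV is produced by a script and that csvpipe accepts standard CSV quoting (doubled quotes inside a quoted field), as the line above implies. The rows list and the build_csv helper are hypothetical; only the column layout comes from the question.

```python
import csv
import io
import json

# Hypothetical sample rows: (id, t, c, assorted). The last element is a
# Python dict that will be serialized to strict JSON.
rows = [
    (1, "hello world", " document number one", {"a": "foo"}),
    (24, "hello cow", " more test stuff and things", {"foo": "bar"}),
    (23, "hello now", " This is some stuff", {"foo": {"bar": "baz"}}),
]


def build_csv(rows) -> str:
    """Serialize rows to CSV text, letting the csv module escape the JSON column."""
    buf = io.StringIO()
    # QUOTE_MINIMAL quotes only fields that need it; embedded double quotes
    # are doubled because doublequote=True is the csv module's default.
    writer = csv.writer(buf, quoting=csv.QUOTE_MINIMAL, lineterminator="\n")
    for doc_id, title, content, attrs in rows:
        # Compact separators keep the JSON free of extra spaces.
        json_text = json.dumps(attrs, separators=(",", ":"))
        writer.writerow([doc_id, title, content, json_text])
    return buf.getvalue()


csv_text = build_csv(rows)
print(csv_text, end="")
```

The output could be written to example.csv, or the script itself could be used as the csvpipe_command in place of cat.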