How to efficiently split a string into lines in J?

I'm trying to parse a large CSV file in J, and this is the line-splitting routine I came up with:

splitlines =: 3 : 0
                                     NB. y is the input string
nl_positions =. (y = (10 { a.))      NB. 1 if the character in that position is a newline, 0 otherwise
nl_idx =. (# i.@#) nl_positions      NB. A list of newline indexes in the input string
prev_idx =. (# nl_idx) {. 0 , nl_idx NB. The list above, shifted one position to the right, with 0 as the first element
result =. ''
for_i. nl_idx do.                                  NB. For each newline
    to_drop =. i_index { prev_idx                  NB. The number of characters from the start of the string to skip
    to_take =. i - to_drop                         NB. The number of characters in the current line
    result =. result , < (to_take {. to_drop }. y) NB. Take the current line, box it and add to the result
end.
)

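It is really slow, though. The performance monitor shows that line 8 takes the most time, probably because of all the memory allocation involved in dropping and taking elements and in extending the result list: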

 Time (seconds)
┌────────┬────────┬─────┬─────────────────────────────────────────┐
│all     │here    │rep  │splitlines                               │
├────────┼────────┼─────┼─────────────────────────────────────────┤
│0.000011│0.000011│    1│monad                                    │
│0.003776│0.003776│    1│[1] nl_positions=.(y=(10{a.))            │
│0.012429│0.012429│    1│[2] nl_idx=.(#i.@#)nl_positions          │
│0.000144│0.000144│    1│[3] prev_idx =.(#nl_idx){.0,nl_idx       │
│0.000002│0.000002│    1│[4] result=.''                           │
│0.027566│0.027566│    1│[5] for_i. nl_idx do.                    │
│0.940466│0.940466│20641│[6] to_drop=.i_index{prev_idx            │
│0.011238│0.011238│20641│[7] to_take=.i-to_drop                   │
│4.310495│4.310495│20641│[8] result=.result,<(to_take{.to_drop}.y)│
│0.006926│0.006926│20641│[9] end.                                 │
│5.313052│5.313052│    1│total monad                              │
└────────┴────────┴─────┴─────────────────────────────────────────┘

Is there a better way? I'm looking for a way to:

  1. Slice a list without memory allocation
  2. Perhaps replace the whole for loop with a single array instruction

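If I understand correctly, at the moment you just want to split a string containing multiple lines into separate lines. (I assume splitting the lines into fields will be the next step, at a later stage?)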

The key primitive that does the heavy lifting for most of what you want to accomplish is cut (;.). For example:

   <;._2 InputString   NB. box each segment terminated by the last character in the string
   <;._1 InputString   NB. box each segment of InputString starting with the first character in the string
   cut;._2 InputString NB. for each segment terminated by the last character, box the words separated by 1 or more spaces
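
To make that concrete for your splitlines verb, here is a minimal sketch. It assumes the lines are terminated by LF; the names sample, lines and splitlines2 are only for illustration and do not come from any library.

```j
NB. Hypothetical sample input: three lines, each terminated by LF
sample =: 'a,b,c', LF, 'd,e,f', LF, 'g,h,i', LF

NB. Box each LF-terminated segment; the delimiter itself is dropped
lines =: <;._2 sample

NB. Variant that tolerates a missing final newline:
NB. append LF only when the last character is not already LF
splitlines2 =: 3 : '<;._2 y , (LF ~: {: y) # LF'
```

Because the whole split is a single primitive, there is no interpreted loop and no repeated appending to result, which is where I would expect most of the time in your version to go.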

Other related resources that you may find useful are splitstring, freads, and the tables/dsv and tables/csv addons. freads and splitstring are both available in the standard library (post J6).

   'b' freads 'myfile.txt'  NB. returns contents of myfile.txt boxed by the last character (equivalent to <;._2 freads 'myfile.txt')
   '","' splitstring InputString  NB. boxed sub-strings of input string delimited by left argument
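
As a small sketch of splitstring on in-memory data (the strings and the names text, parts and fields are made up for illustration): the left argument can be a multi-character delimiter, and unlike <;._2 the string does not need to end with the delimiter.

```j
NB. Hypothetical input using Windows-style CRLF line endings, no trailing CRLF
text =: 'first line', CRLF, 'second line', CRLF, 'third line'

NB. Split on the two-character delimiter CR,LF
parts =: CRLF splitstring text

NB. Split one record into fields on a multi-character delimiter
fields =: ', ' splitstring 'alpha, beta, gamma'
```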

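The tables/dsv and tables/csv addons can be installed using the Package Manager. Once installed, they can be used to split both the lines and the fields within each line, as follows: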

   require 'tables/csv'
   readcsv 'myfile.csv'
   ',' readdsv 'myfile.txt'
   TAB readdsv 'myfile.txt'
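
If you want to see the idea without the addons, a rough sketch with two nested cuts is shown here. It is not how readcsv is implemented, and it assumes LF-terminated lines and fields that contain no quoted commas or embedded newlines; csvtext and table are hypothetical names.

```j
NB. Hypothetical CSV content held in memory, each line terminated by LF
csvtext =: 'id,name', LF, '10,ann', LF, '20,bob', LF

NB. Outer cut (;._2) visits each LF-terminated line.
NB. For each line, prefix a comma so the inner cut (;._1) can use
NB. the first character as the field delimiter, then box each field.
table =: <;._1@(','&,);._2 csvtext
```

For real CSV files, where quoting matters, I would stay with the addons.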