Number of palindromes in character strings

I am trying to collect a list of 6-letter palindromes and the number of times each one occurs, using Postgres 9.3.5.

This is the query I tried:

SELECT word, count(*)
FROM ( SELECT regexp_split_to_table(read_sequence, '([ATCG])([ATCG])([ATCG])(\3)(\2)(\1)') as word
       FROM reads ) t
GROUP BY word;

However, this returns results that a) are not palindromes and b) are longer or shorter than 6 letters.

\d reads
Table "public.reads"
Column        |  Type   | Modifiers 
--------------+---------+-----------
read_header   | text    | not null
read_sequence | text    | 
option        | text    | 
quality_score | text    | 
pair_end      | text    | not null
species_id    | integer | 

Indexes:
"reads_pkey" PRIMARY KEY, btree (read_header, pair_end)

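read_sequence contains DNA sequences, for example 'ATGCTGATGCGGCGTAGCTGGATCGA'.

I would like to see the number of palindromes in each sequence, so this example would contain 1, another sequence might have 4, another 3, and so on.

Count per row: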

SELECT read_header, pair_end, substr(read_sequence, i, 6) AS word, count(*) AS ct
FROM   reads r
     , generate_series(1, length(r.read_sequence) - 5 ) i
WHERE  substr(read_sequence, i, 6) ~ '([ATCG])([ATCG])([ATCG])\3\2\1'
GROUP  BY 1,2,3
ORDER  BY 1,2,3,4 DESC;

Count per read_header and palindrome:

SELECT read_header, substr(read_sequence, i, 6) AS word, count(*) AS ct
FROM
...
GROUP  BY 1,2
ORDER  BY 1,2,3 DESC;

Count per read_header:

SELECT read_header, count(*) AS ct
FROM
...
GROUP  BY 1
ORDER  BY 1,2 DESC;

Count per palindrome:

SELECT substr(read_sequence, i, 6) AS word, count(*) AS ct
FROM
...
GROUP  BY 1
ORDER  BY 1,2 DESC;

SQL Fiddle.

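Explanation

A palindrome can start at any position except the last 5 characters of the string, which leaves room for a length of 6. Palindromes can overlap. So:

  1. Generate a list of possible starting positions with generate_series() in a LATERAL join, and from those, all possible 6-character strings.

  2. Test for palindromes with a regular expression with back references, similar to what you had, but regexp_split_to_table() is not the right function here. Use a regular expression match (~).

  3. Aggregate, depending on what you actually want.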
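
To sanity-check the logic outside the database, the three steps above can be sketched in Python. This is only an illustration, not part of the SQL solution: the function name palindromes, the header names read_1 and read_2, and the second sequence are made up for the example. Only the first sequence comes from the question.

```python
import re
from collections import Counter

WORD_LEN = 6
# Same idea as the SQL pattern: three bases, then the same three in reverse order.
PALINDROME = re.compile(r"([ATCG])([ATCG])([ATCG])\3\2\1")

def palindromes(seq):
    """Return (1-based start, word) for each 6-letter palindrome, overlaps included."""
    hits = []
    # Step 1: every possible start position, like generate_series(1, length(seq) - 5).
    for i in range(1, len(seq) - WORD_LEN + 2):
        word = seq[i - 1:i - 1 + WORD_LEN]  # like substr(seq, i, 6)
        # Step 2: test the 6-character window with the back-reference pattern.
        if PALINDROME.fullmatch(word):
            hits.append((i, word))
    return hits

# Hypothetical sample data: read_1 is the sequence from the question,
# read_2 is invented to show two overlapping palindromes.
reads = {
    "read_1": "ATGCTGATGCGGCGTAGCTGGATCGA",
    "read_2": "AATTAATT",
}

# Step 3: aggregate, either per read or per palindrome.
per_read = {header: len(palindromes(seq)) for header, seq in reads.items()}
per_word = Counter(word for seq in reads.values() for _, word in palindromes(seq))

print(per_read)
print(per_word)
```

For the sequence from the question this finds the single palindrome GCGGCG starting at position 9, which matches the expected count of 1. In the invented second sequence, AATTAA and TTAATT share four characters, which is why scanning every start position matters.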