Number of palindromes in character strings

Question

I'm trying to gather a list of 6-letter palindromes and the number of times each occurs, using Postgres 9.3.5.

This is the query I've tried:

SELECT word, count(*)
FROM ( SELECT regexp_split_to_table(read_sequence, '([ATCG])([ATCG])([ATCG])(\3)(\2)(\1)') as word
       FROM reads ) t
GROUP BY word;

However, this returns results that a) are not palindromes and b) are longer or shorter than 6 letters.

\d reads
Table "public.reads"
Column        |  Type   | Modifiers 
--------------+---------+-----------
read_header   | text    | not null
read_sequence | text    | 
option        | text    | 
quality_score | text    | 
pair_end      | text    | not null
species_id    | integer | 

Indexes:
"reads_pkey" PRIMARY KEY, btree (read_header, pair_end)

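read_sequence contains DNA sequences, for example 'ATGCTGATGCGGCGTAGCTGGATCGA'.

I'd like to see the number of palindromes in each sequence, so the example would contain 1, another sequence might have 4, another 3, and so on.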

Answer 1

Count per row:

SELECT read_header, pair_end, substr(read_sequence, i, 6) AS word, count(*) AS ct
FROM   reads r
     , generate_series(1, length(r.read_sequence) - 5 ) i
WHERE  substr(read_sequence, i, 6) ~ '([ATCG])([ATCG])([ATCG])\3\2\1'
GROUP  BY 1,2,3
ORDER  BY 1,2,3,4 DESC;

Count per read_header and palindrome:

SELECT read_header, substr(read_sequence, i, 6) AS word, count(*) AS ct
FROM
...
GROUP  BY 1,2
ORDER  BY 1,2,3 DESC;

Count per read_header:

SELECT read_header, count(*) AS ct
FROM
...
GROUP  BY 1
ORDER  BY 1,2 DESC;

Count per palindrome:

SELECT substr(read_sequence, i, 6) AS word, count(*) AS ct
FROM
...
GROUP  BY 1
ORDER  BY 1,2 DESC;
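
For reference, here is how the last variant might look written out in full. This is only a sketch: it assumes the elided "..." stands for the same FROM and WHERE clauses as the first query, with the WHERE clause testing the back references \3\2\1.

```sql
-- Assumption: the "..." in the short forms repeats the FROM and WHERE
-- clauses of the "count per row" query.
SELECT substr(read_sequence, i, 6) AS word, count(*) AS ct
FROM   reads r
     , generate_series(1, length(r.read_sequence) - 5) i   -- every possible start position
WHERE  substr(read_sequence, i, 6) ~ '([ATCG])([ATCG])([ATCG])\3\2\1'
GROUP  BY 1
ORDER  BY 1, 2 DESC;
```

The other variants would differ only in the SELECT list and the GROUP BY / ORDER BY columns.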


Explanation

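A palindrome can start at any position from 1 up to 5 characters before the end of the string, which leaves room for a length of 6. Palindromes can also overlap. So:

Generate the list of possible start positions with generate_series() in a LATERAL join and, based on that, every possible 6-character substring.
Test for palindromes with a regular expression using back references, similar to what you had. But regexp_split_to_table() is not the right function here: it splits the string at each match and returns the pieces in between, not the matches. Use a regular expression match (~) instead.
Aggregate, depending on what you actually want.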
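
The steps above can be tried without the reads table. The following is a minimal sketch that runs the same logic against the sample sequence from the question, supplied as a literal; the aliases s, seq and start_pos are made up for this illustration.

```sql
-- Stand-alone illustration: the sequence comes from a VALUES list
-- instead of the reads table (s, seq, start_pos are hypothetical names).
SELECT i AS start_pos, substr(s.seq, i, 6) AS word
FROM  (VALUES ('ATGCTGATGCGGCGTAGCTGGATCGA'::text)) AS s(seq)
     , generate_series(1, length(s.seq) - 5) AS i           -- start positions 1 .. 21
WHERE  substr(s.seq, i, 6) ~ '([ATCG])([ATCG])([ATCG])\3\2\1';
-- Returns one row: start_pos = 9, word = 'GCGGCG'
```

This agrees with the single palindrome expected for that sequence in the question. Because every start position is tested independently, a run such as 'AAAAAAA' would yield two overlapping matches, at positions 1 and 2.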
