如何有效地找到具有相同字符数的最长子串
How to find the longest substring with equal amount of characters efficiently
我有一个由字符 A、B、C 和 D 组成的字符串,我正在尝试计算最长的子字符串的长度,该字符串中的每一个字符的数量按任意顺序都相等。
例如 ABCDB 将 return 4,ABCC 0 和 ADDBCCBA 8.
我目前的代码:
public int longestSubstring(String word) {
HashMap<Integer, String> map = new HashMap<Integer, String>();
for (int i = 0; i<word.length()-3; i++) {
map.put(i, word.substring(i, i+4));
}
StringBuilder sb;
int longest = 0;
for (int i = 0; i<map.size(); i++) {
sb = new StringBuilder();
sb.append(map.get(i));
int a = 4;
while (i<map.size()-a) {
sb.append(map.get(i+a));
a+= 4;
}
String substring = sb.toString();
if (equalAmountOfCharacters(substring)) {
int length = substring.length();
if (length > longest)
longest = length;
}
}
return longest;
}
如果字符串长度为 10^4,这目前工作得很好,但我正在努力使它成为 10^5。如有任何提示或建议,我们将不胜感激。
如果只有 A 和 B,那么你可以这样做。
def longest_balanced(word):
length = 0
cumulative_difference = 0
first_index = {0: -1}
for index, letter in enumerate(word):
if letter == 'A':
cumulative_difference += 1
elif letter == 'B':
cumulative_difference -= 1
else:
raise ValueError(letter)
if cumulative_difference in first_index:
length = max(length, index - first_index[cumulative_difference])
else:
first_index[cumulative_difference] = index
return length
四个字母的生活更复杂,但想法大同小异。对于 A 对 B,我们不只保留一个累积差异,而是保留三个,对于 A 对 B,A 对 C,以及 A 对 D。
嗯,首先要避免构造任何字符串。
如果你不产生任何(或几乎不产生)垃圾,就没有必要收集它,这是一个很大的优势。
接下来,使用不同的数据结构:
我建议使用 4 个字节数组,将它们各自符号的计数存储在从相应字符串索引开始的 4 个跨度中。
这应该会大大加快速度。
假设cnt(c, i)
是字符c
在长度i
的前缀中出现的次数。
一个子字符串 (low, high]
具有相等数量的两个字符 a
和 b
当且仅当 cnt(a, high) - cnt(a, low) = cnt(b, high) - cnt(b, low)
,或者,换句话说,cnt(b, high) - cnt(a, high) = cnt(b, low) - cnt(a, low)
。因此,每个位置由 cnt(b, i) - cnt(a, i)
的值描述。现在我们可以将它概括为两个以上的字符:每个位置由一个元组 (cnt(a_2, i) - cnt(a_1, i), ..., cnt(a_k, i) - cnt(a_1, i))
描述,其中 a_1 ... a_k
是字母表。
我们可以遍历给定的字符串并维护当前的元组。在每一步,我们应该通过检查 i - first_occurrence(current_tuple)
的值来更新答案,其中 first_occurrence
是一个散列 table,它存储到目前为止看到的每个元组的第一次出现。不要忘记在迭代之前将零元组放入哈希映射(它对应于一个空前缀)。
您可能希望为 String 的每个索引缓存字符的累积计数——这才是真正的瓶颈所在。还没有彻底测试,但像下面这样的东西应该可以工作。
public class Test {
static final int LEN = 4;
static class RandomCharSequence implements CharSequence {
private final Random mRandom = new Random();
private final int mAlphabetLen;
private final int mLen;
private final int mOffset;
RandomCharSequence(int pLen, int pOffset, int pAlphabetLen) {
mAlphabetLen = pAlphabetLen;
mLen = pLen;
mOffset = pOffset;
}
public int length() {return mLen;}
public char charAt(int pIdx) {
mRandom.setSeed(mOffset + pIdx);
return (char) (
'A' +
(mRandom.nextInt() % mAlphabetLen + mAlphabetLen) % mAlphabetLen
);
}
public CharSequence subSequence(int pStart, int pEnd) {
return new RandomCharSequence(pEnd - pStart, pStart, mAlphabetLen);
}
@Override public String toString() {
return (new StringBuilder(this)).toString();
}
}
public static void main(String[] pArgs) {
Stream.of("ABCDB", "ABCC", "ADDBCCBA", "DADDBCCBA").forEach(
pWord -> System.out.println(longestSubstring(pWord))
);
for (int i = 0; ; i++) {
final double len = Math.pow(10, i);
if (len >= Integer.MAX_VALUE) break;
System.out.println("Str len 10^" + i);
for (int alphabetLen = 1; alphabetLen <= LEN; alphabetLen++) {
final Instant start = Instant.now();
final int val = longestSubstring(
new RandomCharSequence((int) len, 0, alphabetLen)
);
System.out.println(
String.format(
" alphabet len %d; result %08d; time %s",
alphabetLen,
val,
formatMillis(ChronoUnit.MILLIS.between(start, Instant.now()))
)
);
}
}
}
static String formatMillis(long millis) {
return String.format(
"%d:%02d:%02d.%03d",
TimeUnit.MILLISECONDS.toHours(millis),
TimeUnit.MILLISECONDS.toMinutes(millis) -
TimeUnit.HOURS.toMinutes(TimeUnit.MILLISECONDS.toHours(millis)),
TimeUnit.MILLISECONDS.toSeconds(millis) -
TimeUnit.MINUTES.toSeconds(TimeUnit.MILLISECONDS.toMinutes(millis)),
TimeUnit.MILLISECONDS.toMillis(millis) -
TimeUnit.SECONDS.toMillis(TimeUnit.MILLISECONDS.toSeconds(millis))
);
}
static int longestSubstring(CharSequence pWord) {
// create array that stores cumulative char counts at each index of string
// idx 0 = char (A-D); idx 1 = offset
final int[][] cumulativeCnts = new int[LEN][];
for (int i = 0; i < LEN; i++) {
cumulativeCnts[i] = new int[pWord.length() + 1];
}
final int[] cumulativeCnt = new int[LEN];
for (int i = 0; i < pWord.length(); i++) {
cumulativeCnt[pWord.charAt(i) - 'A']++;
for (int j = 0; j < LEN; j++) {
cumulativeCnts[j][i + 1] = cumulativeCnt[j];
}
}
final int maxResult = Arrays.stream(cumulativeCnt).min().orElse(0) * LEN;
if (maxResult == 0) return 0;
int result = 0;
for (int initialOffset = 0; initialOffset < LEN; initialOffset++) {
for (
int start = initialOffset;
start < pWord.length() - result;
start += LEN
) {
endLoop:
for (
int end = start + result + LEN;
end <= pWord.length() && end - start <= maxResult;
end += LEN
) {
final int substrLen = end - start;
final int expectedCharCnt = substrLen / LEN;
for (int i = 0; i < LEN; i++) {
if (
cumulativeCnts[i][end] - cumulativeCnts[i][start] !=
expectedCharCnt
) {
continue endLoop;
}
}
if (substrLen > result) result = substrLen;
}
}
}
return result;
}
}
您可以统计word
中字符出现的次数。那么,一个可能的解决方案可能是:
- 如果
min
是word
中任意字符的最小出现次数,那么min
也是我们要查找的子串中每个字符的最大可能出现次数为了。在下面的代码中,min
是 maxCount
.
- 我们迭代
maxCount
的递减值。在每一步中,我们要搜索的字符串的长度都是 maxCount * alphabetSize
。我们可以将其视为滑动 window 的大小,我们可以滑过 word
.
- 我们将 window 滑过
word
,计算 window 中字符的出现次数。如果 window 是我们正在搜索的子字符串,我们 return 结果。否则,我们继续搜索。
[固定]代码:
private static final int ALPHABET_SIZE = 4;
public int longestSubstring(String word) {
// count
int[] count = new int[ALPHABET_SIZE];
for (int i = 0; i < word.length(); i++) {
char c = word.charAt(i);
count[c - 'A']++;
}
int maxCount = word.length();
for (int i = 0; i < count.length; i++) {
int cnt = count[i];
if (cnt < maxCount) {
maxCount = cnt;
}
}
// iterate over maxCount until found
boolean found = false;
while (maxCount > 0 && !found) {
int substringLength = maxCount * ALPHABET_SIZE;
found = findSubstring(substringLength, word, maxCount);
if (!found) {
maxCount--;
}
}
return found ? maxCount * ALPHABET_SIZE : 0;
}
private boolean findSubstring(int length, String word, int maxCount) {
int startIndex = 0;
boolean found = false;
while (startIndex + length <= word.length()) {
int[] count = new int[ALPHABET_SIZE];
for (int i = startIndex; i < startIndex + length; i++) {
char c = word.charAt(i);
int cnt = ++count[c - 'A'];
if (cnt > maxCount) {
break;
}
}
if (equalValues(count, maxCount)) {
found = true;
break;
} else {
startIndex++;
}
}
return found;
}
// Returns true if all values in c are equal to value
private boolean equalValues(int[] count, int value) {
boolean result = true;
for (int i : count) {
if (i != value) {
result = false;
break;
}
}
return result;
}
[MERGED] 这是 Hollis Waite 使用累积计数的解决方案,但考虑了我在第 1 点和第 2 点的观察结果。这可能会提高某些输入的性能:
private static final int ALPHABET_SIZE = 4;
public int longestSubstring(String word) {
// count
int[][] cumulativeCount = new int[ALPHABET_SIZE][];
for (int i = 0; i < ALPHABET_SIZE; i++) {
cumulativeCount[i] = new int[word.length() + 1];
}
int[] count = new int[ALPHABET_SIZE];
for (int i = 0; i < word.length(); i++) {
char c = word.charAt(i);
count[c - 'A']++;
for (int j = 0; j < ALPHABET_SIZE; j++) {
cumulativeCount[j][i + 1] = count[j];
}
}
int maxCount = word.length();
for (int i = 0; i < count.length; i++) {
int cnt = count[i];
if (cnt < maxCount) {
maxCount = cnt;
}
}
// iterate over maxCount until found
boolean found = false;
while (maxCount > 0 && !found) {
int substringLength = maxCount * ALPHABET_SIZE;
found = findSubstring(substringLength, word, maxCount, cumulativeCount);
if (!found) {
maxCount--;
}
}
return found ? maxCount * ALPHABET_SIZE : 0;
}
private boolean findSubstring(int length, String word, int maxCount, int[][] cumulativeCount) {
int startIndex = 0;
int endIndex = (startIndex + length) - 1;
boolean found = true;
while (endIndex < word.length()) {
for (int i = 0; i < ALPHABET_SIZE; i++) {
if (cumulativeCount[i][endIndex] - cumulativeCount[i][startIndex] != maxCount) {
found = false;
break;
}
}
if (found) {
break;
} else {
startIndex++;
endIndex++;
}
}
return found;
}
假设在长度为 N 的字符串中有 K 个可能的字母。我们可以跟踪使用长度为 K 的向量 pos 看到的字母的平衡,更新如下:
- 如果看到字母 1,则添加 (K-1, -1, -1, ...)
- 如果看到字母 2,则添加 (-1, K-1, -1, ...)
- 如果看到字母 3,添加 (-1, -1, K-1, ...)
维护一个散列,将 pos 映射到到达 pos 的第一个字符串位置。只要 hash[pos] 已经存在且子串值为 s[hash[pos]:pos].
,就会出现平衡子串
维护哈希的成本是 O(log N),因此处理字符串需要 O(N log N)。这与目前的解决方案相比如何?这些类型的问题往往有线性解决方案,但我还没有遇到过。
这里有一些代码演示了 3 个字母和 运行 使用偏向随机字符串的想法。 (统一的随机字符串允许大约一半字符串长度的解决方案,这不利于打印)。
#!/usr/bin/python
import random
from time import time
alphabet = "abc"
DIM = len(alphabet)
def random_string(n):
# return a random string over choices[] of length n
# distribution of letters is non-uniform to make matches harder to find
choices = "aabbc"
s = ''
for i in range(n):
r = random.randint(0, len(choices) - 1)
s += choices[r]
return s
def validate(s):
# verify frequencies of each letter are the same
f = [0, 0, 0]
a2f = {alphabet[i] : i for i in range(DIM)}
for c in s:
f[a2f[c]] += 1
assert f[0] == f[1] and f[1] == f[2]
def longest_balanced(s):
"""return length of longest substring of s containing equal
populations of each letter in alphabet"""
slen = len(s)
p = [0 for i in range(DIM)]
vec = {alphabet[0] : [2, -1, -1],
alphabet[1] : [-1, 2, -1],
alphabet[2] : [-1, -1, 2]}
x = -1
best = -1
hist = {str([0, 0, 0]) : -1}
for c in s:
x += 1
p = [p[i] + vec[c][i] for i in range(DIM)]
pkey = str(p)
if pkey not in hist:
hist[pkey] = x
else:
span = x - hist[pkey]
assert span % DIM == 0
if span > best:
best = span
cand = s[hist[pkey] + 1: x + 1]
print("best so far %d = [%d,%d]: %s" % (best,
hist[pkey] + 1,
x + 1,
cand))
validate(cand)
return best if best > -1 else 0
def main():
#print longest_balanced( "aaabcabcbbcc" )
t0 = time()
s = random_string(1000000)
print "generate time:", time() - t0
t1 = time()
best = longest_balanced( s )
print "best:", best
print "elapsed:", time() - t1
main()
样本 运行 输入 10^6 个字母和 3 个字母的字母表:
$ ./bal.py
...
best so far 189 = [847894,848083]: aacacbcbabbbcabaabbbaabbbaaaacbcaaaccccbcbcbababaabbccccbbabbacabbbbbcaacacccbbaacbabcbccaabaccabbbbbababbacbaaaacabcbabcbccbabbccaccaabbcabaabccccaacccccbaacaaaccbbcbcabcbcacaabccbacccacca
best: 189
elapsed: 1.43609690666
我有一个由字符 A、B、C 和 D 组成的字符串,我正在尝试计算最长的子字符串的长度,该字符串中的每一个字符的数量按任意顺序都相等。 例如 ABCDB 将 return 4,ABCC 0 和 ADDBCCBA 8.
我目前的代码:
public int longestSubstring(String word) {
HashMap<Integer, String> map = new HashMap<Integer, String>();
for (int i = 0; i<word.length()-3; i++) {
map.put(i, word.substring(i, i+4));
}
StringBuilder sb;
int longest = 0;
for (int i = 0; i<map.size(); i++) {
sb = new StringBuilder();
sb.append(map.get(i));
int a = 4;
while (i<map.size()-a) {
sb.append(map.get(i+a));
a+= 4;
}
String substring = sb.toString();
if (equalAmountOfCharacters(substring)) {
int length = substring.length();
if (length > longest)
longest = length;
}
}
return longest;
}
如果字符串长度为 10^4,这目前工作得很好,但我正在努力使它成为 10^5。如有任何提示或建议,我们将不胜感激。
如果只有 A 和 B,那么你可以这样做。
def longest_balanced(word):
length = 0
cumulative_difference = 0
first_index = {0: -1}
for index, letter in enumerate(word):
if letter == 'A':
cumulative_difference += 1
elif letter == 'B':
cumulative_difference -= 1
else:
raise ValueError(letter)
if cumulative_difference in first_index:
length = max(length, index - first_index[cumulative_difference])
else:
first_index[cumulative_difference] = index
return length
四个字母的生活更复杂,但想法大同小异。对于 A 对 B,我们不只保留一个累积差异,而是保留三个,对于 A 对 B,A 对 C,以及 A 对 D。
嗯,首先要避免构造任何字符串。
如果你不产生任何(或几乎不产生)垃圾,就没有必要收集它,这是一个很大的优势。
接下来,使用不同的数据结构:
我建议使用 4 个字节数组,将它们各自符号的计数存储在从相应字符串索引开始的 4 个跨度中。
这应该会大大加快速度。
假设
cnt(c, i)
是字符c
在长度i
的前缀中出现的次数。一个子字符串
(low, high]
具有相等数量的两个字符a
和b
当且仅当cnt(a, high) - cnt(a, low) = cnt(b, high) - cnt(b, low)
,或者,换句话说,cnt(b, high) - cnt(a, high) = cnt(b, low) - cnt(a, low)
。因此,每个位置由cnt(b, i) - cnt(a, i)
的值描述。现在我们可以将它概括为两个以上的字符:每个位置由一个元组(cnt(a_2, i) - cnt(a_1, i), ..., cnt(a_k, i) - cnt(a_1, i))
描述,其中a_1 ... a_k
是字母表。我们可以遍历给定的字符串并维护当前的元组。在每一步,我们应该通过检查
i - first_occurrence(current_tuple)
的值来更新答案,其中first_occurrence
是一个散列 table,它存储到目前为止看到的每个元组的第一次出现。不要忘记在迭代之前将零元组放入哈希映射(它对应于一个空前缀)。
您可能希望为 String 的每个索引缓存字符的累积计数——这才是真正的瓶颈所在。还没有彻底测试,但像下面这样的东西应该可以工作。
public class Test {
static final int LEN = 4;
static class RandomCharSequence implements CharSequence {
private final Random mRandom = new Random();
private final int mAlphabetLen;
private final int mLen;
private final int mOffset;
RandomCharSequence(int pLen, int pOffset, int pAlphabetLen) {
mAlphabetLen = pAlphabetLen;
mLen = pLen;
mOffset = pOffset;
}
public int length() {return mLen;}
public char charAt(int pIdx) {
mRandom.setSeed(mOffset + pIdx);
return (char) (
'A' +
(mRandom.nextInt() % mAlphabetLen + mAlphabetLen) % mAlphabetLen
);
}
public CharSequence subSequence(int pStart, int pEnd) {
return new RandomCharSequence(pEnd - pStart, pStart, mAlphabetLen);
}
@Override public String toString() {
return (new StringBuilder(this)).toString();
}
}
public static void main(String[] pArgs) {
Stream.of("ABCDB", "ABCC", "ADDBCCBA", "DADDBCCBA").forEach(
pWord -> System.out.println(longestSubstring(pWord))
);
for (int i = 0; ; i++) {
final double len = Math.pow(10, i);
if (len >= Integer.MAX_VALUE) break;
System.out.println("Str len 10^" + i);
for (int alphabetLen = 1; alphabetLen <= LEN; alphabetLen++) {
final Instant start = Instant.now();
final int val = longestSubstring(
new RandomCharSequence((int) len, 0, alphabetLen)
);
System.out.println(
String.format(
" alphabet len %d; result %08d; time %s",
alphabetLen,
val,
formatMillis(ChronoUnit.MILLIS.between(start, Instant.now()))
)
);
}
}
}
static String formatMillis(long millis) {
return String.format(
"%d:%02d:%02d.%03d",
TimeUnit.MILLISECONDS.toHours(millis),
TimeUnit.MILLISECONDS.toMinutes(millis) -
TimeUnit.HOURS.toMinutes(TimeUnit.MILLISECONDS.toHours(millis)),
TimeUnit.MILLISECONDS.toSeconds(millis) -
TimeUnit.MINUTES.toSeconds(TimeUnit.MILLISECONDS.toMinutes(millis)),
TimeUnit.MILLISECONDS.toMillis(millis) -
TimeUnit.SECONDS.toMillis(TimeUnit.MILLISECONDS.toSeconds(millis))
);
}
static int longestSubstring(CharSequence pWord) {
// create array that stores cumulative char counts at each index of string
// idx 0 = char (A-D); idx 1 = offset
final int[][] cumulativeCnts = new int[LEN][];
for (int i = 0; i < LEN; i++) {
cumulativeCnts[i] = new int[pWord.length() + 1];
}
final int[] cumulativeCnt = new int[LEN];
for (int i = 0; i < pWord.length(); i++) {
cumulativeCnt[pWord.charAt(i) - 'A']++;
for (int j = 0; j < LEN; j++) {
cumulativeCnts[j][i + 1] = cumulativeCnt[j];
}
}
final int maxResult = Arrays.stream(cumulativeCnt).min().orElse(0) * LEN;
if (maxResult == 0) return 0;
int result = 0;
for (int initialOffset = 0; initialOffset < LEN; initialOffset++) {
for (
int start = initialOffset;
start < pWord.length() - result;
start += LEN
) {
endLoop:
for (
int end = start + result + LEN;
end <= pWord.length() && end - start <= maxResult;
end += LEN
) {
final int substrLen = end - start;
final int expectedCharCnt = substrLen / LEN;
for (int i = 0; i < LEN; i++) {
if (
cumulativeCnts[i][end] - cumulativeCnts[i][start] !=
expectedCharCnt
) {
continue endLoop;
}
}
if (substrLen > result) result = substrLen;
}
}
}
return result;
}
}
您可以统计word
中字符出现的次数。那么,一个可能的解决方案可能是:
- 如果
min
是word
中任意字符的最小出现次数,那么min
也是我们要查找的子串中每个字符的最大可能出现次数为了。在下面的代码中,min
是maxCount
. - 我们迭代
maxCount
的递减值。在每一步中,我们要搜索的字符串的长度都是maxCount * alphabetSize
。我们可以将其视为滑动 window 的大小,我们可以滑过word
. - 我们将 window 滑过
word
,计算 window 中字符的出现次数。如果 window 是我们正在搜索的子字符串,我们 return 结果。否则,我们继续搜索。
[固定]代码:
private static final int ALPHABET_SIZE = 4;
public int longestSubstring(String word) {
// count
int[] count = new int[ALPHABET_SIZE];
for (int i = 0; i < word.length(); i++) {
char c = word.charAt(i);
count[c - 'A']++;
}
int maxCount = word.length();
for (int i = 0; i < count.length; i++) {
int cnt = count[i];
if (cnt < maxCount) {
maxCount = cnt;
}
}
// iterate over maxCount until found
boolean found = false;
while (maxCount > 0 && !found) {
int substringLength = maxCount * ALPHABET_SIZE;
found = findSubstring(substringLength, word, maxCount);
if (!found) {
maxCount--;
}
}
return found ? maxCount * ALPHABET_SIZE : 0;
}
private boolean findSubstring(int length, String word, int maxCount) {
int startIndex = 0;
boolean found = false;
while (startIndex + length <= word.length()) {
int[] count = new int[ALPHABET_SIZE];
for (int i = startIndex; i < startIndex + length; i++) {
char c = word.charAt(i);
int cnt = ++count[c - 'A'];
if (cnt > maxCount) {
break;
}
}
if (equalValues(count, maxCount)) {
found = true;
break;
} else {
startIndex++;
}
}
return found;
}
// Returns true if all values in c are equal to value
private boolean equalValues(int[] count, int value) {
boolean result = true;
for (int i : count) {
if (i != value) {
result = false;
break;
}
}
return result;
}
[MERGED] 这是 Hollis Waite 使用累积计数的解决方案,但考虑了我在第 1 点和第 2 点的观察结果。这可能会提高某些输入的性能:
private static final int ALPHABET_SIZE = 4;
public int longestSubstring(String word) {
// count
int[][] cumulativeCount = new int[ALPHABET_SIZE][];
for (int i = 0; i < ALPHABET_SIZE; i++) {
cumulativeCount[i] = new int[word.length() + 1];
}
int[] count = new int[ALPHABET_SIZE];
for (int i = 0; i < word.length(); i++) {
char c = word.charAt(i);
count[c - 'A']++;
for (int j = 0; j < ALPHABET_SIZE; j++) {
cumulativeCount[j][i + 1] = count[j];
}
}
int maxCount = word.length();
for (int i = 0; i < count.length; i++) {
int cnt = count[i];
if (cnt < maxCount) {
maxCount = cnt;
}
}
// iterate over maxCount until found
boolean found = false;
while (maxCount > 0 && !found) {
int substringLength = maxCount * ALPHABET_SIZE;
found = findSubstring(substringLength, word, maxCount, cumulativeCount);
if (!found) {
maxCount--;
}
}
return found ? maxCount * ALPHABET_SIZE : 0;
}
private boolean findSubstring(int length, String word, int maxCount, int[][] cumulativeCount) {
int startIndex = 0;
int endIndex = (startIndex + length) - 1;
boolean found = true;
while (endIndex < word.length()) {
for (int i = 0; i < ALPHABET_SIZE; i++) {
if (cumulativeCount[i][endIndex] - cumulativeCount[i][startIndex] != maxCount) {
found = false;
break;
}
}
if (found) {
break;
} else {
startIndex++;
endIndex++;
}
}
return found;
}
假设在长度为 N 的字符串中有 K 个可能的字母。我们可以跟踪使用长度为 K 的向量 pos 看到的字母的平衡,更新如下:
- 如果看到字母 1,则添加 (K-1, -1, -1, ...)
- 如果看到字母 2,则添加 (-1, K-1, -1, ...)
- 如果看到字母 3,添加 (-1, -1, K-1, ...)
维护一个散列,将 pos 映射到到达 pos 的第一个字符串位置。只要 hash[pos] 已经存在且子串值为 s[hash[pos]:pos].
,就会出现平衡子串维护哈希的成本是 O(log N),因此处理字符串需要 O(N log N)。这与目前的解决方案相比如何?这些类型的问题往往有线性解决方案,但我还没有遇到过。
这里有一些代码演示了 3 个字母和 运行 使用偏向随机字符串的想法。 (统一的随机字符串允许大约一半字符串长度的解决方案,这不利于打印)。
#!/usr/bin/python
import random
from time import time
alphabet = "abc"
DIM = len(alphabet)
def random_string(n):
# return a random string over choices[] of length n
# distribution of letters is non-uniform to make matches harder to find
choices = "aabbc"
s = ''
for i in range(n):
r = random.randint(0, len(choices) - 1)
s += choices[r]
return s
def validate(s):
# verify frequencies of each letter are the same
f = [0, 0, 0]
a2f = {alphabet[i] : i for i in range(DIM)}
for c in s:
f[a2f[c]] += 1
assert f[0] == f[1] and f[1] == f[2]
def longest_balanced(s):
"""return length of longest substring of s containing equal
populations of each letter in alphabet"""
slen = len(s)
p = [0 for i in range(DIM)]
vec = {alphabet[0] : [2, -1, -1],
alphabet[1] : [-1, 2, -1],
alphabet[2] : [-1, -1, 2]}
x = -1
best = -1
hist = {str([0, 0, 0]) : -1}
for c in s:
x += 1
p = [p[i] + vec[c][i] for i in range(DIM)]
pkey = str(p)
if pkey not in hist:
hist[pkey] = x
else:
span = x - hist[pkey]
assert span % DIM == 0
if span > best:
best = span
cand = s[hist[pkey] + 1: x + 1]
print("best so far %d = [%d,%d]: %s" % (best,
hist[pkey] + 1,
x + 1,
cand))
validate(cand)
return best if best > -1 else 0
def main():
#print longest_balanced( "aaabcabcbbcc" )
t0 = time()
s = random_string(1000000)
print "generate time:", time() - t0
t1 = time()
best = longest_balanced( s )
print "best:", best
print "elapsed:", time() - t1
main()
样本 运行 输入 10^6 个字母和 3 个字母的字母表:
$ ./bal.py
...
best so far 189 = [847894,848083]: aacacbcbabbbcabaabbbaabbbaaaacbcaaaccccbcbcbababaabbccccbbabbacabbbbbcaacacccbbaacbabcbccaabaccabbbbbababbacbaaaacabcbabcbccbabbccaccaabbcabaabccccaacccccbaacaaaccbbcbcabcbcacaabccbacccacca
best: 189
elapsed: 1.43609690666