,

Contents ยท Suffix Array + LCP (Kasai)


Overview and Use Cases

  • Suffix array (SA): array of starting indices of all suffixes in lexicographic order.
  • LCP array: LCP[i] = longest common prefix of suffixes SA[i] and SA[i-1].
  • Use cases: substring queries, distinct substrings count, longest repeated substring, string compression.

Building the Suffix Array (Doubling)

// SA construction by doubling + stable sort on (rank[i], rank[i+k])
function buildSA(s) {
  const n = s.length;
  const sa = Array.from({length: n}, (_, i) => i);
  let r = Array.from({length: n}, (_, i) => s.charCodeAt(i));
  let tmp = Array(n).fill(0);
  for (let k = 1; k < n; k <<= 1) {
    sa.sort((a,b) => r[a] - r[b] || ((a + k < n ? r[a+k] : -1) - (b + k < n ? r[b+k] : -1)));
    tmp[sa[0]] = 0;
    for (let i = 1; i < n; i++) {
      const a = sa[i-1], b = sa[i];
      const same = r[a] === r[b] && (a + k < n ? r[a+k] : -1) === (b + k < n ? r[b+k] : -1);
      tmp[b] = tmp[a] + (same ? 0 : 1);
    }
    for (let i = 0; i < n; i++) r[i] = tmp[i];
    if (r[sa[n-1]] === n-1) break; // all ranks unique
  }
  return sa;
}
  • Notes: Use radix/counting sort on pairs to get O(n log n) with good constants.
  • Terminator: commonly append a sentinel like '$' smaller than any char to avoid bounds checks.

LCP Array via Kasai

// Kasai's algorithm: O(n) from SA
function buildLCP(s, sa) {
  const n = s.length;
  const rank = Array(n).fill(0);
  for (let i = 0; i < n; i++) rank[sa[i]] = i;
  const lcp = Array(n).fill(0);
  let k = 0;
  for (let i = 0; i < n; i++) {
    const r = rank[i];
    if (r === 0) { k = 0; continue; }
    const j = sa[r - 1];
    while (i + k < n && j + k < n && s[i + k] === s[j + k]) k++;
    lcp[r] = k;
    if (k > 0) k--;
  }
  return lcp; // lcp[0] is 0 by convention
}
  • Idea: reuse previous LCP minus one as a lower bound when sliding one position forward.

Applications

// Number of distinct substrings of s: n*(n+1)/2 - sum(LCP)
function countDistinctSubstrings(s) {
  const sa = buildSA(s), lcp = buildLCP(s, sa);
  const n = s.length;
  const total = n * (n + 1) / 2;
  const overlap = lcp.reduce((a,b) => a + b, 0);
  return total - overlap;
}

// Longest repeated substring length = max(LCP)
function longestRepeatedSubstringLength(s) {
  const sa = buildSA(s), lcp = buildLCP(s, sa);
  return Math.max(...lcp);
}

// Pattern search via binary search on SA
function findPattern(s, sa, p) {
  let lo = 0, hi = sa.length;
  const cmp = (i) => s.slice(i, i + p.length).localeCompare(p);
  // lower bound
  while (lo < hi) {
    const mid = (lo + hi) >> 1;
    if (cmp(sa[mid]) < 0) lo = mid + 1; else hi = mid;
  }
  const start = lo; hi = sa.length;
  // upper bound
  while (lo < hi) {
    const mid = (lo + hi) >> 1;
    if (cmp(sa[mid]) <= 0) lo = mid + 1; else hi = mid;
  }
  const end = lo;
  return sa.slice(start, end); // starting indices of matches
}

Notes and Optimizations

  • SA-IS / DC3: linear-time construction algorithms with more complex implementations.
  • Memory: prefer typed arrays for large inputs.
  • Unicode: careful with code points vs. code units in JavaScript.

Exercises

  1. Implement SA via radix sort on pairs to speed up the doubling step.
  2. Compute the longest common substring between two strings using SA + LCP on s1#s2.
  3. Use SA to list all occurrences of a pattern and compare vs. KMP.
Suffix structures unlock many substring problems; practice converting ideas into SA + LCP terms.