DP for String Metrics: Longest Common Subsequence

This articles is about longest common subsequence in a series of articles: DP for String Metrics.

Series: DP for String Metrics

Dynamic programming (DP) solves a complicate problem in optimal substructure by breaking it down into simpler sub-problems, and there are DP patterns that are very effective for string metrics. These articles are my summary notes on them. Please feel free to ping me for anything you think I should add.

  • Levenshtein Distance
  • Longest Common Subsequence: this article
  • Regular Expression
  • Distinct Subsequence (coming soon)
  • Longest Repeating Subsequence (coming soon)
  • Hamming Distance (coming soon)

Longest Common Subsequence

Longest common sequence (LCS) is used for things like fuzzy string searches and diff command. It finds the longest sequence common in (typically) two strings where the common sequence doesn’t need to happen consecutively. E.g. LCS between ABCD and ACBD is 3 (either ABD or ACD.)

DP Table

In the table below, “stone” (vertically mapped on left) and “longest” (mapped horizontally on top) are compared. In LCS, we define base cases to be non-character (expressed as ε), so values of the first row and column are all zero.

We go through each cell from position [1, 1] to right every row, comparing each character of two strings. If the characters are identical, we increment a number from left and above by one and update the cell with it. If not, we use the best value up until this point, which is the larger one of either top or left cell. Finally, the cell at the end would be the result count.

Implementation

def lcs(s, target):
    slen = len(s) + 1  # +1 for empty string
    tlen = len(target) + 1
    dp = [[0 for _ in range(tlen)] for _ in range(slen)]  # DP table creation

    for i in range(1, slen):
        for j in range(1, tlen):
            if s[i-1] == target[j-1]:  # if characters are identical, increment number from left top
                dp[i][j] = dp[i-1][j-1] + 1
            else:
                dp[i][j] = max(dp[i-1][j], dp[i][j-1])  # if not, use best value either from top of left

    return dp[-1][-1]

Notes

LCS only returns metrics. So if LCS sequence itself is desired instead of metric, we would need another function to retrieve the characters. This can be achieved by using the DP table created in LCS function, tracking the largest numbers from the end and collecting the characters.