DP for String Metrics: Regular Expression
This articles is about regular expression in a series of articles: DP for String Metrics.
Series: DP for String Metrics
Dynamic programming (DP) solves a complicate problem in optimal substructure by breaking it down into simpler sub-problems, and there are DP patterns that are very effective for string metrics. These articles are my summary notes on them. Please feel free to ping me for anything you think I should add.
- Levenshtein Distance
- Longest Common Subsequence
- Regular Expression: this article
- Distinct Subsequence (coming soon)
- Longest Repeating Subsequence (coming soon)
- Hamming Distance (coming soon)
Regular Expression with DP
This is a problem to check regular expression matching with dynamic programming. It’s actually a simplified regex with limited patterns including .
, *
and characters
only. Real implementations of regular expression support more syntaxes and they are more complicated.
Following is the explanation of symbols:
- character = literal match
.
= any character*
= 0 or more of preceding expression (+
is 1 or more and?
is 0 or 1)
Literal character matching needs characters to be in an exact position and order. For example, a pattern of abc
without any metacharacter matches only with abc
. The metacharacter of .
matches with any character, which seems straightforward. The metacharacter of *
is a little tricky here. It matches with repeated occurrence of previous expression, but it also accepts 0 occurence. For example, having abcccd
, a pattern of abc*d
is a match, but abc*de*
is also a match since e*
means e
can exist 0 time as well.
DP Table
In the table below, “abcccd” (vertically mapped on left) is the string to be checked with the pattern, “a.c*de*” (mapped horizontally on top). We define base cases to be empty character, so only [0, 0]
is true for the character match and the rest of cells in the first row & column are all false.
We go through each cell from position [1, 1]
to right every row, comparing each character against a pattern character. When the two characters match, the cell is updated with the value from above and left. The same goes when a pattern character is .
since it matches with any character. This chain of bool values from top left to bottom right is the basis of this DP to judge if a pattern match happens or not. A pattern character of *
requires checks of different angles. Firstly we need to bring a DP value of 2 cells left from a current cell. This is because *
character accepts 0 occurrence of previous character. This update actually serves for actual occurrence of the previous character as well. When an actual sequence of a previous character happens, we update the current cell from one above, which would make the current cell true only when there’s a match before a sequential pattern such as c*
starts. Finally we get the result at the last cell.
Before closing DP table explanation, there’s an edge case we need to handle actually. When a pattern has only sequential matches like a*
or a*b*c*
for example, we would not get the true value we vertically copied in the above case. So what we need is that 0 occurrence check we did in the example on the first row before starting DP table updates. This can be found in implementation below.
Implementation
def regex(s, pattern):
slen = len(s) + 1
plen = len(pattern) + 1
dp = [[False for _ in range(plen)] for _ in range(slen)]
dp[0][0] = True
for i in range(1, plen): # handles sequential pattern in the first row
if pattern[i-1] == '*':
dp[0][i] = dp[0][i-2]
for i in range(1, slen):
for j in range(1, plen):
if pattern[j-1] == '.' or pattern[j-1] == s[i-1]: # '.' or char match
dp[i][j] = dp[i-1][j-1] # diagnal update
elif pattern[j-1] == '*': # sequence pattern
dp[i][j] = dp[i][j-2] # update for 0 occurence
if pattern[j-2] == '.' or pattern[j-2] == s[i-1]: # match before sequence pattern
dp[i][j] = dp[i-1][j] # vertical update
return dp[-1][-1]