This articles is about regular expression in a series of articles: DP for String Metrics.
Series: DP for String Metrics
Dynamic programming (DP) solves a complicate problem in optimal substructure by breaking it down into simpler sub-problems, and there are DP patterns that are very effective for string metrics. These articles are my summary notes on them. Please feel free to ping me for anything you think I should add.
Regular Expression with DP
This is a problem to check regular expression matching with dynamic programming. It’s actually a simplified regex with limited patterns including
characters only. Real implementations of regular expression support more syntaxes and they are more complicated.
Following is the explanation of symbols:
- character = literal match
.= any character
*= 0 or more of preceding expression (
+is 1 or more and
?is 0 or 1)
Literal character matching needs characters to be in an exact position and order. For example, a pattern of
abc without any metacharacter matches only with
abc. The metacharacter of
. matches with any character, which seems straightforward. The metacharacter of
* is a little tricky here. It matches with repeated occurrence of previous expression, but it also accepts 0 occurence. For example, having
abcccd, a pattern of
abc*d is a match, but
abc*de* is also a match since
e can exist 0 time as well.
In the table below, “abcccd” (vertically mapped on left) is the string to be checked with the pattern, “a.c*de*” (mapped horizontally on top). We define base cases to be empty character, so only
[0, 0] is true for the character match and the rest of cells in the first row & column are all false.
We go through each cell from position
[1, 1] to right every row, comparing each character against a pattern character. When the two characters match, the cell is updated with the value from above and left. The same goes when a pattern character is
. since it matches with any character. This chain of bool values from top left to bottom right is the basis of this DP to judge if a pattern match happens or not. A pattern character of
* requires checks of different angles. Firstly we need to bring a DP value of 2 cells left from a current cell. This is because
* character accepts 0 occurrence of previous character. This update actually serves for actual occurrence of the previous character as well. When an actual sequence of a previous character happens, we update the current cell from one above, which would make the current cell true only when there’s a match before a sequential pattern such as
c* starts. Finally we get the result at the last cell.
Before closing DP table explanation, there’s an edge case we need to handle actually. When a pattern has only sequential matches like
a*b*c* for example, we would not get the true value we vertically copied in the above case. So what we need is that 0 occurrence check we did in the example on the first row before starting DP table updates. This can be found in implementation below.