How to implement Capturing Groups in Regular Expressions.

Capturing groups are noted using parentheses as in example:

'regular ((abab)?cdab(ac)+)* expression'

We take first group, by starting grouped character string from leftmost bracket, and ending it with rightmost bracket. If number of brackets is not even, or they do not match, regular expression does not compile.

In above example, first group would be character string:


We can name it as 'group 1', please note that this is also grouped regular expression.

We create finite state automaton that we'll use to match character strings that are defined by this language (regular expression) to be accepted by finite state automaton.

once we have 'machine' that accepts such strings we can use it in constructing more complex automaton that uses such.

we reduced first regular expression to (nonformal example):

'regular ((abab)?cdab(ac)+)* expression' =
  'regular "group 1"* expression'
  where "group 1" is '(abab)?cdab(ac)+'
    and asterisk (*) applies to 'group 1'.

then we can process the 'group 1' the same way, building a tree of finite state automatons that delegate group matching subtasks to their children.

group number is determined by number of opening bracket starting from the left.

See also: Data Search & Dragonfly.

No comments:

Post a Comment