Regular Expressions
Formal Languages, Regular Expressions, Automata and Transducers
Useful links:
A Formal Language can model a phenomenon, e.g. written English
Examples: All combinations of the letters, any number of As, followed by any number of Bs, mathematical equations, all the sentences of a simplified version of written English, a sequence of musical notation (e.g., the notes in Beethoven's 9th Symphony), etc.
A formal grammar: a set of rules that match all and only instances of a formal language. A formal grammar defines a formal language.
In Computer Science, Formal grammars are used to generate and recognize formal languages (e.g., programming languages)
- Parsing a string of a language involves:
- Recognizing the string and
- Recording the analysis showing it is part of the language
- A compiler translates from language X to language Y, e.g.,This may include parsing language X and generating language Y
- If all natural languages were formal languages, then Machine Translation systems would just be compilers
Symbols that can be replaced by other symbols
Symbols that cannot be replaced by other symbols
Replace the symbol sequence XYZ with abXzY:
XYZ → abXzY
Marks the start of the language
Type0 ⊇ Type1 ⊇ Type2 ⊇ Type3
Equivalent to Turing Machine, general system capable of simulating any algorithm.
αAβ → αγβ
- Greek letters = 0 or more non-terms/terms.
- A = non-terminal
- Rule means: replace A with γ, when A is between α and β
A → αγβ
- Like context-sensitive, except left-hand side can only contain exactly one non-terminal
Example Rule from linguistics:

Context-free Rules
Regular (finite state) grammars
A → βa
orA → ϵ
(left regular)A → aβ
, orA → ϵ
(right regular)
Like Type 2, except:
- Non-terminals can precede terminals in left regular grammar
- Non-terminals can follow terminals in right regular grammar
- Null string is allowed
Type-3 grammars generate the regular languages
Type-3 grammars must have a single non-terminal on the left-hand side and a right-hand side consisting of a single terminal or single terminal followed by a single non-terminal.
The productions must be in the form
X → a
or X → aY
where X, Y ∈ N (Non terminal)
and a ∈ T (Terminal)
The rule S → ε is allowed if S does not appear on the right side of any rule.
Type 3 grammars: Least expressive, Most efficient processors
Type 0 grammars: Most expressive, Least efficient processors
Complexity of recognizer for languages:
- Type 0: exponential
- Type 1: polynomial
- Type 2:
- Type 3:
Type 3 grammars:
- Include regular expressions and finite state automata (aka, finite state machines)
- The focal point of the rest of this talk
Type 2 grammars:
- Commonly used for natural language parsers
- Used to model syntactic structure in many linguistics theories (often supplemented by other mechanisms)
- Important for later talks on constituent structure & parsing
Human Language believed to be “mildly context sensitive”
- Less expressive than type 1 (context sensitive)
- More expressive than type 2 (context-free)
Three Adjoining Grammars
- https://repository.upenn.edu/cgi/viewcontent.cgi?article=1706&context=cis_reports
- https://www.aclweb.org/anthology/H86-1020.pdf
- Formalism by A. Joshi & others
- May be able to handle these cases
Concatenation
- If X is a regexp and Y is a regexp, then XY is a regexp
- Examples:
- If
ABC
andDEF
are regexps, thenABCDEF
is a regexp - If
AB*
andBC*
are regexps, thenAB*BC*
is a regexp Note: Kleene _ is explained below
Disjunction
- If X is a regexp and Y is a regexp, then
X | Y
is a regexp - Example:
ABC|DEF
will match eitherABC
orDEF
Repetition
- If X is a regexp then a repetition of X will also be a regexp
- The Kleene Star:
A*
means 0 or more instances of A - Regexp{number}:
A{2}
means exactly 2 instances of A
Disjunction of characters
[ABC]
– means the same thing asA | B | C
[a-zA-Z0-9]
– character ranges are equivalent to lists:a|b|c|...|A|B|...|0|1|...|9
Negation of character lists/sequences
^
inside bracket means complement of disjunction, e.g.,[^a-z]
means a character that is neither a nor b nor c … nor z
Parentheses
- Disambiguate scope of operators
A(BC)|(DEF)
meansABC
orADEF
- Otherwise defaults apply, e.g.,
ABC|D
meansABC
orABD
?
signifies optionalityABC?
is equivalent to(ABC)|(AB)
*
indiates 1 or moreA(BC)*
is equivalent toA|(A(BC)+)
Special Symbols:
- Period means any character, e.g.,
A.*B
– matches A and B and any characters between - Carrot (
^
) means the beginning of a line, e.g., ^ABC matches ABC at the beginning of a line [*Note dual usage of ^ as negation operator] - Dollar sign (
$
) means the end of a line, .e.g.,[\.?!] *$
matches final punctuation, zero or more spaces and the end of a line
Sets of characters:
\w
=[A-Za-z0-9_]
\W
=[^A-Za-z0-9_]
RegEx Generator
Devices for recognizing finite state grammars (include regexps)
Two types
- Deterministic Finite State Automata (DFSA): Rules are unambiguous
- NonDeterministic FSA (NDFSA): Rules are ambiguous. Sometimes more than one sequence of rules must be attempted to determine if a string matches the grammar. Ways to solve this: Backtracking, Parallel Processing and Look Ahead.
Any NDFSA can be mapped into an equivalent (but larger) DFSA

DFSA Example for RegEx A(ab)*ABB
Algorithm:
D-Recognize(tape, machine)
pointer ← beginning of tape
current state ← initial state Q0
repeat until the end of the input is reached
look up (current state,input symbol) in transition table
if found: set current state as per table look up
advance pointer to next position on tape
else: reject string and exit function
if current state is a final state: accept the string
else: reject the string

NDFSA Example for RegEx A(ab)*ABB
Algorithm:
ND-Recognize(tape, machine)
agenda ← {(initial state, start of tape)}
current state ← next(agenda)
repeat until accept(current state) or agenda is empty
agenda ← Union(agenda,look_up_in_table(current state,next_symbol))
current state ← next(agenda)
if accept(current state): return(True)
else: false
Accept if at the end of the tape and current state is a final state
Next defined differently for different types of search
- Choose most recently added state first (depth first)
- Chose least recently added state first (breadth first)
- Etc.
Last modified 1yr ago