# Regular Expressions

Formal Languages, Regular Expressions, Automata and Transducers

Useful links:

A Formal Language can model a phenomenon, e.g. written English

Examples: All combinations of the letters, any number of As, followed by any number of Bs, mathematical equations, all the sentences of a simplified version of written English, a sequence of musical notation (e.g., the notes in Beethoven's 9th Symphony), etc.

A formal grammar: a set of rules that match all and only instances of a formal language. A formal grammar defines a formal language.

In Computer Science, Formal grammars are used to

**generate**and**recognize**formal languages (e.g., programming languages)- Parsing a string of a language involves:
- Recognizing the string and
- Recording the analysis showing it is part of the language

- A compiler translates from language X to language Y, e.g.,This may include parsing language X and generating language Y
- If all natural languages were formal languages, then Machine Translation systems would just be compilers

Symbols that can be replaced by other symbols

Symbols that cannot be replaced by other symbols

Replace the symbol sequence XYZ with abXzY:

`XYZ → abXzY`

Marks the start of the language

Type0 ⊇ Type1 ⊇ Type2 ⊇ Type3

Equivalent to Turing Machine, general system capable of simulating any algorithm.

`αAβ → αγβ`

- Greek letters = 0 or more non-terms/terms.
- A = non-terminal
- Rule means: replace A with γ, when A is between α and β

`A → αγβ`

- Like context-sensitive, except left-hand side can only contain exactly one non-terminal

Example Rule from linguistics:

Context-free Rules

Regular (finite state) grammars

`A → βa`

or`A → ϵ`

(left regular)`A → aβ`

, or`A → ϵ`

(right regular)

Like Type 2, except:

- Non-terminals can precede terminals in left regular grammar
- Non-terminals can follow terminals in right regular grammar
- Null string is allowed

Type-3 grammars generate the regular languages

Type-3 grammars must have a single non-terminal on the left-hand side and a right-hand side consisting of a single terminal or single terminal followed by a single non-terminal.

The productions must be in the form

`X → a`

or `X → aY`

where X, Y ∈ N (Non terminal)

and a ∈ T (Terminal)

The rule S → ε is allowed if S does not appear on the right side of any rule.

Type 3 grammars: Least expressive, Most efficient processors

Type 0 grammars: Most expressive, Least efficient processors

Complexity of recognizer for languages:

- Type 0: exponential
- Type 1: polynomial
- Type 2:$O(n^3)$
- Type 3:$O(n logn)$

Type 3 grammars:

- Include regular expressions and finite state automata (aka, finite state machines)
- The focal point of the rest of this talk

Type 2 grammars:

- Commonly used for natural language parsers
- Used to model syntactic structure in many linguistics theories (often supplemented by other mechanisms)
- Important for later talks on constituent structure & parsing

Human Language believed to be “mildly context sensitive”

- Less expressive than type 1 (context sensitive)
- More expressive than type 2 (context-free)

Three Adjoining Grammars

- https://repository.upenn.edu/cgi/viewcontent.cgi?article=1706&context=cis_reports
- https://www.aclweb.org/anthology/H86-1020.pdf
- Formalism by A. Joshi & others
- May be able to handle these cases

Concatenation

- If X is a regexp and Y is a regexp, then XY is a regexp
- Examples:
- If
`ABC`

and`DEF`

are regexps, then`ABCDEF`

is a regexp - If
`AB*`

and`BC*`

are regexps, then`AB*BC*`

is a regexp Note: Kleene _ is explained below

Disjunction

- If X is a regexp and Y is a regexp, then
`X | Y`

is a regexp - Example:
`ABC|DEF`

will match either`ABC`

or`DEF`

Repetition

- If X is a regexp then a repetition of X will also be a regexp
- The Kleene Star:
`A*`

means 0 or more instances of A - Regexp{number}:
`A{2}`

means exactly 2 instances of A

Disjunction of characters

`[ABC]`

– means the same thing as`A | B | C`

`[a-zA-Z0-9]`

– character ranges are equivalent to lists:`a|b|c|...|A|B|...|0|1|...|9`

Negation of character lists/sequences

`^`

inside bracket means complement of disjunction, e.g.,`[^a-z]`

means a character that is neither a nor b nor c … nor z

Parentheses

- Disambiguate scope of operators
`A(BC)|(DEF)`

means`ABC`

or`ADEF`

- Otherwise defaults apply, e.g.,
`ABC|D`

means`ABC`

or`ABD`

`?`

signifies optionality`ABC?`

is equivalent to`(ABC)|(AB)`

`*`

indiates 1 or more`A(BC)*`

is equivalent to`A|(A(BC)+)`

Special Symbols:

- Period means any character, e.g.,
`A.*B`

– matches A and B and any characters between - Carrot (
`^`

) means the beginning of a line, e.g., ^ABC matches ABC at the beginning of a line [*Note dual usage of ^ as negation operator] - Dollar sign (
`$`

) means the end of a line, .e.g.,`[\.?!] *$`

matches final punctuation, zero or more spaces and the end of a line

Sets of characters:

`\w`

=`[A-Za-z0-9_]`

`\W`

=`[^A-Za-z0-9_]`

RegEx Generator

Devices for recognizing finite state grammars (include regexps)

Two types

- Deterministic Finite State Automata (DFSA): Rules are unambiguous
- NonDeterministic FSA (NDFSA): Rules are ambiguous. Sometimes more than one sequence of rules must be attempted to determine if a string matches the grammar. Ways to solve this: Backtracking, Parallel Processing and Look Ahead.

Any NDFSA can be mapped into an equivalent (but larger) DFSA

DFSA Example for RegEx A(ab)*ABB

Algorithm:

D-Recognize(tape, machine)

pointer ← beginning of tape

current state ← initial state Q0

repeat until the end of the input is reached

look up (current state,input symbol) in transition table

if found: set current state as per table look up

advance pointer to next position on tape

else: reject string and exit function

if current state is a final state: accept the string

else: reject the string

NDFSA Example for RegEx A(ab)*ABB

Algorithm:

ND-Recognize(tape, machine)

agenda ← {(initial state, start of tape)}

current state ← next(agenda)

repeat until accept(current state) or agenda is empty

agenda ← Union(agenda,look_up_in_table(current state,next_symbol))

current state ← next(agenda)

if accept(current state): return(True)

else: false

Accept if at the end of the tape and current state is a final state

Next defined differently for different types of search

- Choose most recently added state first (depth first)
- Chose least recently added state first (breadth first)
- Etc.

Last modified 1yr ago