Natural Language Processing

Regular Expressions

Formal Languages, Regular Expressions, Automata and Transducers
Useful links:

Formal Language = Set of Strings of Symbols

A Formal Language can model a phenomenon, e.g. written English
Examples: All combinations of the letters, any number of As, followed by any number of Bs, mathematical equations, all the sentences of a simplified version of written English, a sequence of musical notation (e.g., the notes in Beethoven's 9th Symphony), etc.

What is a Formal Grammar for?

A formal grammar: a set of rules that match all and only instances of a formal language. A formal grammar defines a formal language.
In Computer Science, Formal grammars are used to generate and recognize formal languages (e.g., programming languages)
  • Parsing a string of a language involves:
    • Recognizing the string and
    • Recording the analysis showing it is part of the language
  • A compiler translates from language X to language Y, e.g.,This may include parsing language X and generating language Y
  • If all natural languages were formal languages, then Machine Translation systems would just be compilers

A Formal Grammar Consists of

N: a Finite set of non-terminal symbols

Symbols that can be replaced by other symbols

T: a Finite set of terminal symbols

Symbols that cannot be replaced by other symbols

R: a set of rewrite rules

Replace the symbol sequence XYZ with abXzY: XYZ → abXzY

S: A special non-terminal that is the start symbol

Marks the start of the language

The Chomsky Hierarchy

Type0 ⊇ Type1 ⊇ Type2 ⊇ Type3

Type 0: No restrictions on rules

Equivalent to Turing Machine, general system capable of simulating any algorithm.

Type 1: Context-sensitive rules

αAβ → αγβ
  • Greek letters = 0 or more non-terms/terms.
  • A = non-terminal
  • Rule means: replace A with γ, when A is between α and β

Type 2: Context-free rules

A → αγβ
  • Like context-sensitive, except left-hand side can only contain exactly one non-terminal
Example Rule from linguistics:
Context-free Rules

Type 3: Context-free rules with restrictions

Regular (finite state) grammars
  • A → βa or A → ϵ (left regular)
  • A → aβ, or A → ϵ (right regular)
Like Type 2, except:
  • Non-terminals can precede terminals in left regular grammar
  • Non-terminals can follow terminals in right regular grammar
  • Null string is allowed
Type-3 grammars generate the regular languages

Further Simplifications

Type-3 grammars must have a single non-terminal on the left-hand side and a right-hand side consisting of a single terminal or single terminal followed by a single non-terminal.
The productions must be in the form X → a or X → aY
where X, Y ∈ N (Non terminal)
and a ∈ T (Terminal)
The rule S → ε is allowed if S does not appear on the right side of any rule.


Type 3 grammars: Least expressive, Most efficient processors
Type 0 grammars: Most expressive, Least efficient processors
Complexity of recognizer for languages:
  • Type 0: exponential
  • Type 1: polynomial
  • Type 2:
  • Type 3:
    O(nlogn)O(n logn)

CL mainly features Type 2 & 3 Grammars

Type 3 grammars:
  • Include regular expressions and finite state automata (aka, finite state machines)
  • The focal point of the rest of this talk
  • Also see Nooj platform for NLP​
Type 2 grammars:
  • Commonly used for natural language parsers
  • Used to model syntactic structure in many linguistics theories (often supplemented by other mechanisms)
  • Important for later talks on constituent structure & parsing

Type 1.5 Grammars

Human Language believed to be “mildly context sensitive”
  • Less expressive than type 1 (context sensitive)
  • More expressive than type 2 (context-free)
Some complex dependencies cannot be expressed in context free rules, e.g. see this​
Three Adjoining Grammars
  • https://repository.upenn.edu/cgi/viewcontent.cgi?article=1706&context=cis_reports
  • https://www.aclweb.org/anthology/H86-1020.pdf
  • Formalism by A. Joshi & others
  • May be able to handle these cases

Regular Expressions

  • If X is a regexp and Y is a regexp, then XY is a regexp
  • Examples:
    • If ABC and DEF are regexps, then ABCDEF is a regexp
    • If AB* and BC* are regexps, then AB*BC* is a regexp Note: Kleene _ is explained below
  • If X is a regexp and Y is a regexp, then X | Y is a regexp
  • Example: ABC|DEF will match either ABC or DEF
  • If X is a regexp then a repetition of X will also be a regexp
    • The Kleene Star: A* means 0 or more instances of A
    • Regexp{number}: A{2} means exactly 2 instances of A
Disjunction of characters
  • [ABC] – means the same thing as A | B | C
  • [a-zA-Z0-9] – character ranges are equivalent to lists: a|b|c|...|A|B|...|0|1|...|9
Negation of character lists/sequences
  • ^ inside bracket means complement of disjunction, e.g., [^a-z] means a character that is neither a nor b nor c … nor z
  • Disambiguate scope of operators
    • A(BC)|(DEF) means ABC or ADEF
    • Otherwise defaults apply, e.g., ABC|D means ABC or ABD
? signifies optionality
  • ABC? is equivalent to (ABC)|(AB)
* indiates 1 or more
  • A(BC)* is equivalent to A|(A(BC)+)
Special Symbols:
  • Period means any character, e.g., A.*B – matches A and B and any characters between
  • Carrot (^) means the beginning of a line, e.g., ^ABC matches ABC at the beginning of a line [*Note dual usage of ^ as negation operator]
  • Dollar sign ($) means the end of a line, .e.g., [\.?!] *$ matches final punctuation, zero or more spaces and the end of a line
Sets of characters:
  • \w = [A-Za-z0-9_]
  • \W = [^A-Za-z0-9_]


RegEx Generator

Finite State Automata

Devices for recognizing finite state grammars (include regexps)
Two types
  • Deterministic Finite State Automata (DFSA): Rules are unambiguous
  • NonDeterministic FSA (NDFSA): Rules are ambiguous. Sometimes more than one sequence of rules must be attempted to determine if a string matches the grammar. Ways to solve this: Backtracking, Parallel Processing and Look Ahead.
Any NDFSA can be mapped into an equivalent (but larger) DFSA


DFSA Example for RegEx A(ab)*ABB
D-Recognize(tape, machine)
pointer ← beginning of tape
current state ← initial state Q0
repeat until the end of the input is reached
look up (current state,input symbol) in transition table
if found: set current state as per table look up
advance pointer to next position on tape
else: reject string and exit function
if current state is a final state: accept the string
else: reject the string


NDFSA Example for RegEx A(ab)*ABB
ND-Recognize(tape, machine)
agenda ← {(initial state, start of tape)}
current state ← next(agenda)
repeat until accept(current state) or agenda is empty
agenda ← Union(agenda,look_up_in_table(current state,next_symbol))
current state ← next(agenda)
if accept(current state): return(True)
else: false
Accept if at the end of the tape and current state is a final state
Next defined differently for different types of search
  • Choose most recently added state first (depth first)
  • Chose least recently added state first (breadth first)
  • Etc.