Regular Expressions
Formal Languages, Regular Expressions, Automata and Transducers
Last updated
Formal Languages, Regular Expressions, Automata and Transducers
Last updated
Useful links:
(Chinese)
A Formal Language can model a phenomenon, e.g. written English
Examples: All combinations of the letters, any number of As, followed by any number of Bs, mathematical equations, all the sentences of a simplified version of written English, a sequence of musical notation (e.g., the notes in Beethoven's 9th Symphony), etc.
A formal grammar: a set of rules that match all and only instances of a formal language. A formal grammar defines a formal language.
In Computer Science, Formal grammars are used to generate and recognize formal languages (e.g., programming languages)
Parsing a string of a language involves:
Recognizing the string and
Recording the analysis showing it is part of the language
A compiler translates from language X to language Y, e.g.,This may include parsing language X and generating language Y
If all natural languages were formal languages, then Machine Translation systems would just be compilers
Symbols that can be replaced by other symbols
Symbols that cannot be replaced by other symbols
Replace the symbol sequence XYZ with abXzY: XYZ → abXzY
Marks the start of the language
Type0 ⊇ Type1 ⊇ Type2 ⊇ Type3
Equivalent to Turing Machine, general system capable of simulating any algorithm.
αAβ → αγβ
Greek letters = 0 or more non-terms/terms.
A = non-terminal
Rule means: replace A with γ, when A is between α and β
A → αγβ
Like context-sensitive, except left-hand side can only contain exactly one non-terminal
Example Rule from linguistics:
Regular (finite state) grammars
A → βa
or A → ϵ
(left regular)
A → aβ
, or A → ϵ
(right regular)
Like Type 2, except:
Non-terminals can precede terminals in left regular grammar
Non-terminals can follow terminals in right regular grammar
Null string is allowed
Type-3 grammars generate the regular languages
Type-3 grammars must have a single non-terminal on the left-hand side and a right-hand side consisting of a single terminal or single terminal followed by a single non-terminal.
The productions must be in the form X → a
or X → aY
where X, Y ∈ N (Non terminal)
and a ∈ T (Terminal)
The rule S → ε is allowed if S does not appear on the right side of any rule.
Type 3 grammars: Least expressive, Most efficient processors
Type 0 grammars: Most expressive, Least efficient processors
Complexity of recognizer for languages:
Type 0: exponential
Type 1: polynomial
Type 3 grammars:
Include regular expressions and finite state automata (aka, finite state machines)
The focal point of the rest of this talk
Type 2 grammars:
Commonly used for natural language parsers
Used to model syntactic structure in many linguistics theories (often supplemented by other mechanisms)
Important for later talks on constituent structure & parsing
Human Language believed to be “mildly context sensitive”
Less expressive than type 1 (context sensitive)
More expressive than type 2 (context-free)
Three Adjoining Grammars
https://repository.upenn.edu/cgi/viewcontent.cgi?article=1706&context=cis_reports
https://www.aclweb.org/anthology/H86-1020.pdf
Formalism by A. Joshi & others
May be able to handle these cases
Concatenation
If X is a regexp and Y is a regexp, then XY is a regexp
Examples:
If ABC
and DEF
are regexps, then ABCDEF
is a regexp
If AB*
and BC*
are regexps, then AB*BC*
is a regexp Note: Kleene _ is explained below
Disjunction
If X is a regexp and Y is a regexp, then X | Y
is a regexp
Example: ABC|DEF
will match either ABC
or DEF
Repetition
If X is a regexp then a repetition of X will also be a regexp
The Kleene Star: A*
means 0 or more instances of A
Regexp{number}: A{2}
means exactly 2 instances of A
Disjunction of characters
[ABC]
– means the same thing as A | B | C
[a-zA-Z0-9]
– character ranges are equivalent to lists: a|b|c|...|A|B|...|0|1|...|9
Negation of character lists/sequences
^
inside bracket means complement of disjunction, e.g., [^a-z]
means a character that is neither a nor b nor c … nor z
Parentheses
Disambiguate scope of operators
A(BC)|(DEF)
means ABC
or ADEF
Otherwise defaults apply, e.g., ABC|D
means ABC
or ABD
?
signifies optionality
ABC?
is equivalent to (ABC)|(AB)
*
indiates 1 or more
A(BC)*
is equivalent to A|(A(BC)+)
Special Symbols:
Period means any character, e.g., A.*B
– matches A and B and any characters between
Carrot (^
) means the beginning of a line, e.g., ^ABC matches ABC at the beginning of a line [*Note dual usage of ^ as negation operator]
Dollar sign ($
) means the end of a line, .e.g., [\.?!] *$
matches final punctuation, zero or more spaces and the end of a line
Sets of characters:
\w
= [A-Za-z0-9_]
\W
= [^A-Za-z0-9_]
Devices for recognizing finite state grammars (include regexps)
Two types
Deterministic Finite State Automata (DFSA): Rules are unambiguous
NonDeterministic FSA (NDFSA): Rules are ambiguous. Sometimes more than one sequence of rules must be attempted to determine if a string matches the grammar. Ways to solve this: Backtracking, Parallel Processing and Look Ahead.
Any NDFSA can be mapped into an equivalent (but larger) DFSA
Algorithm:
Algorithm:
Accept if at the end of the tape and current state is a final state
Next defined differently for different types of search
Choose most recently added state first (depth first)
Chose least recently added state first (breadth first)
Etc.
Type 2:
Type 3:
Also see
Some complex dependencies cannot be expressed in context free rules, e.g. see