Here we go. I know that probably I should have started this new topic talking about the grammar of the language we are going to implement, but as long as I see grammar strictly related to parsing, I’ve preferred to talk about scanning first.
We will go back to the grammar the next time, when talking about how to parse an input file.
I wrote about scanning (or lexing if you prefer) a while ago, when blogging about expression evalution in ActionScript. Scanning an input file is quite always the same (although some languages might require unusual features), and what I wrote about expressions works also for a general language.
The goal of the scanning process is to group some characters together, skipping the ones that have no meaning for the language like spaces. Each group of characters is usually called token. So a Scanner converts a textual input into a stream of tokens, each one rappresenting a possible valid word for our language. While scanning you don’t take care about the meaning of what you are grouping or about the fact that a sequence of tokens is meaningful. This is a task for the Parser as we will see.
When talking about expressions, I showed a manual implementation for a Lexer. Now I want to take a different approach and show you a possible implementation for a simple dynamic scanner. This scanner will be based on regular expressions: each regular expression will rappresent a given token, and we will be able to assign callbacks to the scanner that will be executed each time a given token is extracted from the input.
Writing a general and reusable scanner is usually a good practice. A common approach is to use some scanner generators, that are usually regular expression based too but are able to generate the code for a scanner at compile time. Our approach is different (and generates slower scanners) because regular expressions are evaluated at runtime; but it is fine for a test project.
You can download the source code for the regular expression based lexer (RegexLexer) here, as long as a simple usage example. Let’s see this example together so we can briefly discuss it (PBLexer.as):