Actionscript parsing experiences: PyBison & PLY

My experiments with text-parsing continue..

PyBison
Last day I founded a python library (pybison) which runs the generated python parser at near the speed of C-based parsers, due to direct hooks into bison-generated C code.
Cool, unfortunately I couldn’t compiled it for Windows and so I made my test on Ubuntu only. What I did was just to export the already written lexer/grammar using bison2py (boundled with pybison) and run it.

If you want to take a look at the python parser try it by downloading the source code here.
The run.py file accept these parameters:
usage: run.py [options]
options:
-h, --help show this help message and exit
-v, --verbose Turn on verbose mode
-o OUTPUT_FILE, --output=OUTPUT_FILE
Don't print to stdout but to the output file
-i INPUT_FILE, --input=INPUT_FILE
Input file to be parsed
-x, --to-xml Returns a human-readable xml representation of the
parse tree
-b, --bison-debug Print the Bison/Flex debug

You can download the source code here

PLY
The second test I did was using PLY, an implementation of lex and yacc parsing tools for Python. Being implemented entirely in python it should be much more slower that pybison, but I didn’t find any difference with the pybison parser version. In fact PLY , like the traditional bison, creates tables starting from the grammar syntax.
Ok, Both of the implementations are slower that the pure C parser, but extremely faster that antlr!
(They took more or less 0.02 to 0.5 secs for parsing and generating the AST.)
Unlike pybison PLY is still mantained and offers more features and a better error handling.. even if the whole grammar has to be rewritten in python, and it can be compiled in Windows too.

To run the test just write:
python run.py {filename}
P.S. Unfortunately the yacc parser isn’t yet complete because I still need to find a way for parsing correctly E4X and XML syntax..

ActionScript Parsing, the YACC revenge :)

After my first attempts with ANTLR scanners in python/java I decided to start back with Bison/Flex again to see the difference in performances.
So first I need to wrote from scratch the grammar/lexer files using only the ECMAScript 4 specifications and much patience (the elastic grammar file help me a lot too).

After finishing a first version of the parser I tested it on the same file (75Kb actionscript file) which both java and python parsed in more than 1 second.
The result was unbelievable: 0.02 seconds for that file!

Then I tested it on multiple files, and for about 320 files of the whole adobe corelib library it took 220ms

Ok, the parser it’s not yet complete and doesn’t care about regexp and xml syntax, but its performance convinced me enough…
Now, the next step is to finish and test the parser and finally create a python library using pyrex, then a benchmark test again.

If someone is interested in testing the parser, download it (use “parser –help” form the command line for usage help), but remember this is only a first test.. not really helpful right now (I just wanted to share my text/parsing experiences).

An experience with antlr, java and python

I just wanted to share a little experience with generating an AS3 parser using antlr and python.
I was trying first to create the parser using GNU Flex and Bison in C, probably the best way for a very performancing parser.
Yeah, that’s right.. but looking at the antlr syntax I realized that’s easier and easier.
Moreover I start using this very useful eclipse plugin for antlr debugging which made my life easier!

The grammar file I created is a compromise between the asdt grammar file and the ECMA-262 grammar specification.

Once finished working on my eclipse project I’ve managed to parse succesfully all the adobe corelibs files using this java test file:
package org.sepy.core.parsers.as3;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import antlr.CommonAST;
import antlr.RecognitionException;
import antlr.TokenStreamException;
public class Application {
public static void main(String argv[])
{
if(argv.length > 0)
{
File file = new File(argv[0]);
if(file.exists())
{
FileInputStream is = null;
try {
is = new FileInputStream(file);
} catch (FileNotFoundException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
AS3Lexer L = new AS3Lexer(is);
AS3Parser P = new AS3Parser(L);
try {
P.compilationUnit();
} catch (RecognitionException e) {
// TODO Auto-generated catch block
System.out.println(" line=" + e.line + ", column="+ e.column);
System.out.println(e.getMessage());
e.printStackTrace(System.err);
} catch (TokenStreamException e) {
// TODO Auto-generated catch block
System.out.println(" line=" + L.getLine() + ", column="+ L.getColumn());
System.out.println(L.getGuessInfo());
System.out.println(e.getMessage());
e.printStackTrace(System.err);
}
CommonAST.setVerboseStringConversion(false, P.getTokenNames());
CommonAST ast = (CommonAST) P.getAST();
System.out.println("Tree:");
System.out.println(ast.toStringTree());
}
}
}
}

Ok, done that I decided to export the grammar file for python (thanks to antrl python export feature).
Everything works fine also for python, but I realized that the python script were so much slower than the java one!
import sys
import antlr
import AS3Parser
import AS3Lexer
L = AS3Lexer.Lexer(filename);
P = AS3Parser.Parser(L);
P.setFilename(filename)
try:
P.compilationUnit();
ast = P.getAST();
except:
pass

On a 75Kb actionscript file the python script took about 7 seconds to run, while the java application only 2 seconds. I know python interpreter caould be slower than many other languages, but I never thought so much slower.
So I run the python hotshot profiler to see which could be the bottleneck in the python script and I found most of the problems were due to unuseless antlr (the python module) method’s calls.
After making corrections to the antlr.py file the same script took exactly half of the time. Now 3 seconds. Wow 🙂
But not fast enough.
So I enabled for the antlr python script psyco module and this time the same script took only 1.6 seconds.
Now the python script is fast enough, even if I’m sure I can make more optimizations in the antlr module…