Tuesday, March 3, 2009

Performance, regular expressions and Ragel

Recently I got a note that SnakeYAML is slower than JYaml. I wrote a small stress test that loads a document in a loop. Indeed, it clearly showed that SnakeYAML performs badly under heavy load.
A profiler helped to find the bottleneck: regular expressions. First, they are slow (I am afraid the Just-in-Time compiler does not help here). Second, they scale badly: the bigger the load, the slower they perform.
Regular expressions are used to support implicit types. Based on the data format SnakeYAML creates an appropriate Java class:
123 -> Integer
1.0 -> Float
false -> Boolean
2009-03-25 -> Date
abc -> String
and so on.
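To make the cost visible, here is a minimal sketch of regex-based implicit typing. The class name, the tag names and the patterns are simplified assumptions for illustration; they are not SnakeYAML's actual resolver, and the real YAML 1.1 patterns are much longer. The point is that every plain scalar is tried against the list until one pattern matches.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical, simplified resolver; not the SnakeYAML implementation.
final class RegexImplicitResolver {
    // Insertion order matters: patterns are tried one after another.
    private static final Map<String, Pattern> PATTERNS = new LinkedHashMap<>();
    static {
        PATTERNS.put("bool", Pattern.compile("(?i)true|false|yes|no|on|off"));
        PATTERNS.put("int", Pattern.compile("[-+]?[0-9][0-9_]*"));
        PATTERNS.put("float", Pattern.compile("[-+]?[0-9][0-9_]*\\.[0-9_]*"));
        PATTERNS.put("timestamp", Pattern.compile("[0-9]{4}-[0-9]{2}-[0-9]{2}"));
        PATTERNS.put("null", Pattern.compile("~|null|Null|NULL"));
    }

    // Returns the name of the first type whose pattern matches the whole scalar.
    static String resolve(String scalar) {
        for (Map.Entry<String, Pattern> entry : PATTERNS.entrySet()) {
            if (entry.getValue().matcher(scalar).matches()) {
                return entry.getKey();
            }
        }
        return "str"; // nothing matched: fall back to a plain String
    }

    public static void main(String[] args) {
        for (String s : new String[] {"123", "1.0", "false", "2009-03-25", "abc"}) {
            System.out.println(s + " -> " + resolve(s));
        }
    }
}
```

In the worst case (a plain string) every pattern in the list is tried before the scalar falls through to "str".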
Of course we could take JYaml's approach and drop regular expressions. Then all scalars become Strings. It works, but then developers must handle all the (weird) formats themselves, like:
'1_000.5000_' -> 1000.5
23:59:59 -> int
Off -> false
~ -> null
etc.
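As a rough idea of what that burden looks like, here is a sketch of the conversions an application would have to write by hand if every scalar arrived as a String. The class and method names are made up for this example, and the code ignores signs and many other cases that YAML 1.1 allows.

```java
// Hypothetical helper; shows the conversions a developer would have to own.
final class ManualScalarConversion {
    // '1_000.5000_' style floats: underscores are only visual separators.
    static double parseFloat(String s) {
        return Double.parseDouble(s.replace("_", ""));
    }

    // 23:59:59 style base-60 integers: each part is one base-60 digit.
    static int parseSexagesimal(String s) {
        int value = 0;
        for (String part : s.split(":")) {
            value = value * 60 + Integer.parseInt(part);
        }
        return value;
    }

    // Off, No, false ... versus On, Yes, true (case-insensitive).
    static boolean parseBool(String s) {
        String lower = s.toLowerCase();
        if (lower.equals("on") || lower.equals("yes") || lower.equals("true")) {
            return true;
        }
        if (lower.equals("off") || lower.equals("no") || lower.equals("false")) {
            return false;
        }
        throw new IllegalArgumentException("not a boolean: " + s);
    }
}
```

And the caller still has to know which of these methods to call for which value, which is exactly what implicit types are for.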
Fortunately I came across Ragel, a state machine compiler. I gave it a try. It is cool. It generates an extremely fast implementation.
The regular expressions for implicit types are gone. First, we no longer need to compile them when an instance of Yaml is created. Second, we no longer match a long list of regular expressions against each and every scalar with an implicit type. As a result the stress test runs 2-3 times faster! This is impressive.
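To show the idea behind the speed-up, here is a tiny hand-written state machine that tells integers from floats in a single pass over the characters. It is only an illustration of the technique: it is not Ragel output and not the SnakeYAML code, the state names are my own, and it covers just two of the types (a timestamp, for example, ends up as a string here).

```java
// Hand-written sketch of a scanner; a generated one would cover all types.
final class ScalarScanner {
    enum Kind { INT, FLOAT, STR }

    private static final int START = 0;
    private static final int SIGN = 1;
    private static final int INT_DIGITS = 2;
    private static final int FRACTION = 3;
    private static final int REJECT = 4;

    static Kind classify(String s) {
        int state = START;
        // Each character is looked at exactly once; there is no backtracking.
        for (int i = 0; i < s.length() && state != REJECT; i++) {
            char c = s.charAt(i);
            boolean digit = c >= '0' && c <= '9';
            switch (state) {
                case START:
                    state = digit ? INT_DIGITS : (c == '+' || c == '-') ? SIGN : REJECT;
                    break;
                case SIGN:
                    state = digit ? INT_DIGITS : REJECT;
                    break;
                case INT_DIGITS:
                    state = (digit || c == '_') ? INT_DIGITS : (c == '.') ? FRACTION : REJECT;
                    break;
                case FRACTION:
                    state = (digit || c == '_') ? FRACTION : REJECT;
                    break;
                default:
                    state = REJECT;
            }
        }
        // Only the accepting states decide the type; everything else is a string.
        if (state == INT_DIGITS) {
            return Kind.INT;
        }
        if (state == FRACTION) {
            return Kind.FLOAT;
        }
        return Kind.STR;
    }
}
```

One pass decides between all candidate types at once, instead of running one pattern after another over the same scalar.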
Of course, if only a single YAML document is loaded, the performance gain will not be that large. This is because SnakeYAML has a number of static initializers for constants. Once the very first Yaml instance is created, other instances are very cheap to create, so creating them should not have any significant influence on performance.
These changes will be introduced soon in SnakeYAML 1.1.

