Tuesday, November 11, 2008

I have created a space in Assembla to look closely at JvYaml and call it SnakeYAML (http://trac-hg.assembla.com/snakeyaml). The source has been migrated from CVS to Mercurial, and the standard Maven folder structure is applied.
It is very convenient that JvYaml is a direct port of PyYAML: it is easy to read the Python implementation and spot the deviations. It is even possible to debug the two implementations in parallel on 2 computers! (Synergy is dead useful.)
Before the code is changed let us contribute tests. A lot of tests have been created from the examples in the specification (http://yaml.org/spec/1.1/). Unfortunately a number of them fail.
These are some deviations from the original Python code:
  • The Reader class is dropped (in favor of java.io.Reader) and the BOM is not respected. The encoding must be known before the stream is read, which is not always possible (and is against the specification).
  • The Scanner implementation is simplified. All the comments are removed.
  • The Python implementation is not followed very closely. For instance, a boolean in Python may be True, False or None, but the Java implementation uses a primitive instead of the class Boolean, so the third state is gone. This causes, for example, trailing spaces in block scalars to be trimmed.
  • A Python module is close to a Java package. It helps to separate the code logically.
  • No tests are imported from PyYAML.
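
The third point can be illustrated with a small sketch. It assumes that the three-state flag in question is the block scalar chomping indicator (None = clip, True = keep, False = strip), which is my reading of the YAML 1.1 specification; the class and method names are hypothetical and are not taken from JvYaml or SnakeYAML.

```java
public class ChompingDemo {
    /**
     * Applies block-scalar chomping to text that may end with line breaks.
     * chomping == null  : "clip"  (no indicator), keep a single final line break
     * chomping == TRUE  : "keep"  ('+' indicator), keep all trailing line breaks
     * chomping == FALSE : "strip" ('-' indicator), remove all trailing line breaks
     * A primitive boolean cannot express the null case.
     */
    static String chomp(String text, Boolean chomping) {
        int end = text.length();
        while (end > 0 && text.charAt(end - 1) == '\n') {
            end--;
        }
        String body = text.substring(0, end);
        boolean hadLineBreak = end < text.length();
        if (chomping == null) {
            return hadLineBreak ? body + "\n" : body;
        }
        return chomping ? text : body;
    }

    public static void main(String[] args) {
        String raw = "text\n\n\n";
        // Print the resulting lengths so the three states can be compared.
        System.out.println(chomp(raw, null).length());
        System.out.println(chomp(raw, Boolean.TRUE).length());
        System.out.println(chomp(raw, Boolean.FALSE).length());
    }
}
```

With a primitive boolean the "clip" case has to collapse into one of the other two, which is how a default block scalar ends up with different trailing whitespace.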

Let us improve the implementation and try to follow the specification as closely as possible.
This is what is done so far:
  • Java does not have multiple inheritance (which is very good!). The way multiple inheritance is used in PyYAML is not very clean. Let us follow a reliable recommendation: "favor composition over inheritance". Now Reader is an instance variable in Scanner.
  • Change the public interface and stay closer to PyYAML. Use Iterator instead of List. The java.io.InputStream is used and the encoding is recognized (and ignored) automatically.
  • Rename classes with respect to "Python module" -> "Java package".
  • Define code formatter which can be imported to Eclipse
  • Go through ScannerImpl and try to stay as close as possible to PyYAML. A number of issues fixed. The size is almost doubled (~2000 lines), mostly because of the comments in the code.
  • some tests are imported from PyYAML
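
The first two points can be sketched as follows. This is only an illustration of the idea (a scanner that holds a reader instead of inheriting from it, and that exposes tokens through an Iterator), not the actual SnakeYAML code. All class names are hypothetical, and the "tokens" here are simply whitespace-separated words rather than real YAML tokens.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

public class CompositionSketch {
    /** Hypothetical character source: owns the buffer and the current position. */
    static class StreamReader {
        private final String buffer;
        private int pointer = 0;

        StreamReader(String buffer) {
            this.buffer = buffer;
        }

        /** Returns the current character, or '\0' at the end of the input. */
        char peek() {
            return pointer < buffer.length() ? buffer.charAt(pointer) : '\0';
        }

        void forward() {
            pointer++;
        }
    }

    /** Hypothetical scanner: it has a reader (composition) and yields tokens lazily. */
    static class Scanner implements Iterator<String> {
        private final StreamReader reader;

        Scanner(StreamReader reader) {
            this.reader = reader;
        }

        private void skipSpaces() {
            while (reader.peek() == ' ') {
                reader.forward();
            }
        }

        @Override
        public boolean hasNext() {
            skipSpaces();
            return reader.peek() != '\0';
        }

        @Override
        public String next() {
            if (!hasNext()) {
                throw new NoSuchElementException();
            }
            StringBuilder token = new StringBuilder();
            while (reader.peek() != ' ' && reader.peek() != '\0') {
                token.append(reader.peek());
                reader.forward();
            }
            return token.toString();
        }
    }

    public static void main(String[] args) {
        Scanner scanner = new Scanner(new StreamReader("key: value"));
        while (scanner.hasNext()) {
            System.out.println(scanner.next());
        }
    }
}
```

Because the scanner produces tokens on demand, nothing forces the whole input to be tokenized into a List before parsing starts.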

Because SnakeYAML provides some improvements over existing YAML libraries I can release the library.
Documentation is much worse than it should be. I will try to improve it later.
If somebody needs a reliable YAML parser for Java, take SnakeYAML!
SnakeYAML 0.4 is born to this beautiful World...

6 comments:

n4te said...

Cool. It is good to see a better parser for Java since JvYaml has some issues and is now basically dead.

It would be nice to replace the JvYaml-based parser/emitter in YamlBeans with SnakeYAML. I tried it but the events received were slightly different than YamlBeans' parser presents them. It shouldn't take much to iron out though.

What do you think about merging the projects? I am open to this if we can keep the YamlBeans API and current feature set unchanged.

BTW, I don't think SnakeYaml's API should deal with InputStreams. I think that if you are reading characters, it should take a Reader. It is the API user's responsibility to provide a reader with the correct encoding.

YamlBeans has the parser keep track of the line number, which can be useful for reporting errors. Sometimes an error cannot be detected until after the line with the actual error, but often this saves time tracking down problems.

Andrey Somov said...

Hi Nate,
sorry for the late reply, I am just back from sunny Munich...
In version 0.7 of SnakeYAML there is nothing left from JvYAML. Everything was rewritten (taken from PyYAML 3.06).
The SnakeYAML API is very close to the PyYAML API. This is because PyYAML is the best library for YAML available. It is very stable and feature rich. There is a big community behind it.
I am also open to cooperation. This is what SnakeYAML can offer:
- comprehensive test suite (test coverage is 97%!)
- any Java object can be constructed (not only JavaBeans): http://trac-hg.assembla.com/snakeyaml/wiki/Documentation#Constructorsrepresentersresolvers
- flexible collection support (it is possible to define which implementation is created for sequences and maps)
- advanced parser and emitter

What can YamlBeans bring to the joint project?
I found this in the spec:
YamlWriter writer = new YamlWriter(new FileWriter("output.yml"));
writer.getConfig().setPropertyElementType(Contact.class, "phoneNumbers", Phone.class);
writer.write(contact);
This is missing in SnakeYAML, but when I have time I plan to import a similar feature from PyYAML. There it is very flexible and somewhat similar to XPath.

I have chosen InputStream instead of Reader because this is the only way to meet the specification. The encoding is defined at the beginning of the stream. It means that a user cannot be responsible for guessing the encoding. Because of a long-standing bug in Java (http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4508058) I have to use my own way.

SnakeYAML not only keeps track of the line with an error but also tries to keep the context. See the Mark class for details.

nate said...

Hi Andrey,

In YamlBeans, the YamlReader and YamlWriter classes are the useful bits. Everything else is just there to parse YAML, so it could be replaced by SnakeYAML.

YamlReader goes from YAML text to a Java object graph. It uses a Java class definition so that the data types in the YAML can be inferred. This means that little or no extra information (eg class name tags) needs to be in the YAML.

YamlWriter goes from a Java object graph and emits YAML text. The text emitted is the minimum necessary to allow YamlReader to reconstruct the object graph.

These two classes allow the YAML to remain uncluttered and make reading, writing, and editing the YAML by hand much easier. Also important is that these classes require very little setup. YamlReader just takes a Java class and YamlWriter just takes a Java object instance.

I think it would be easy to change YamlBeans to use SnakeYAML rather than its current YAML parser/emitter. Both YamlReader and YamlWriter are event based. I gave SnakeYAML a try for parsing, but I received unexpected events in YamlReader and I haven't had the time to determine why.

You may want to use what Python does, though I have not looked at it. I can say that I think YamlBeans is about as simple as it gets and works in a Java-like manner. FWIW, YamlBeans is very similar to the popular XStream project.

I think supporting the use of an InputStream to guess the type is good. However I tried to use SnakeYAML with some existing code that used a Reader and there is no easy way to go from a Reader to an InputStream. Maybe SnakeYAML could support a Reader in addition to an InputStream? It is good for API convenience since usually in Java characters come from a Reader, but also to allow a user to explicitly handle the decoding themselves if they choose (eg, maybe the YAML text is not UTF).

Under the covers you could simply wrap the InputStream in a UnicodeReader class that does the automatic encoding discovery and BOM byte skipping. Google "UnicodeReader", there are many proven examples.

Ah, I didn't notice it was tracking the line numbers for errors. Cool!

Andrey Somov said...

Hi Nate,
it is very difficult to be simple and flexible at the same time. The YamlBeans API is simple but it misses some important features.

JavaBeans:
JavaBeans support is great, but what if the object to be read or written is not a JavaBean?
Another use case: how to parse an object whose scalar representation matches a specific regular expression? For instance, should '123-45-67' call 'new PhoneNumber(String number)'?
How to change the default scalar representation for an object (which may or may not be a JavaBean)?
(You can see how it is done in SnakeYAML here: http://trac-hg.assembla.com/snakeyaml/wiki/Documentation#Constructorsrepresentersresolvers)
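
To make the phone number case concrete, here is a minimal sketch of the idea: a table of regular expressions, each paired with a function that builds the object from the matching scalar. It does not use the SnakeYAML API; the classes ImplicitResolverSketch and PhoneNumber and the methods addRule and construct are hypothetical names invented for this illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;
import java.util.regex.Pattern;

public class ImplicitResolverSketch {
    /** Hypothetical business object built from a single scalar. */
    static final class PhoneNumber {
        final String number;

        PhoneNumber(String number) {
            this.number = number;
        }
    }

    // Rules are tried in insertion order; the first matching pattern wins.
    private final Map<Pattern, Function<String, Object>> rules = new LinkedHashMap<>();

    void addRule(String regex, Function<String, Object> constructor) {
        rules.put(Pattern.compile(regex), constructor);
    }

    /** Builds an object for the scalar, or returns the plain String if no rule matches. */
    Object construct(String scalar) {
        for (Map.Entry<Pattern, Function<String, Object>> rule : rules.entrySet()) {
            if (rule.getKey().matcher(scalar).matches()) {
                return rule.getValue().apply(scalar);
            }
        }
        return scalar;
    }

    public static void main(String[] args) {
        ImplicitResolverSketch resolver = new ImplicitResolverSketch();
        resolver.addRule("\\d{3}-\\d{2}-\\d{2}", PhoneNumber::new);
        System.out.println(resolver.construct("123-45-67").getClass().getSimpleName());
        System.out.println(resolver.construct("hello").getClass().getSimpleName());
    }
}
```

The point is only that the decision is driven by the shape of the scalar and that the target class needs neither a no-arg constructor nor setters.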

Anchors:
YamlBeans treats anchors in a special way. This is due to a bug in JvYaml that emits anchors even for the same integer in a list.
So YamlBeans makes it possible to ignore anchors. Unfortunately this leads to a serious problem when you try to emit recursive objects.
It is not an issue for JvYaml (and for YamlBeans?) because recursive objects are not supported, but in SnakeYAML anchors cannot be switched off.

Config:
Configuration in YamlBeans is simplified. Imagine you wish to output a scalar which is quite long (like a paragraph in a book). Do you split the lines? How do you define the length of the line and the indent?
In SnakeYAML there are 2 ways to output sequences and maps: block style and flow style.
There are other things which are configurable in PyYAML (and SnakeYAML) but not present in YamlBeans.

InputStream:
You should not convert a Reader to an InputStream; it is the other way around. Every input is a stream of bytes (File or Socket) and you need to do some extra work to produce a stream of characters. That is why you should simply drop your Reader and connect SnakeYAML to the source.
SnakeYAML cannot support a Reader in addition to an InputStream. What should happen if a user configures UTF-16 in the Reader but the actual stream is UTF-8 (according to the BOM)?
You are right, I have used an available library (http://koti.mbnet.fi/akini/java/unicodereader/) for BOM recognition.
(According to the specification YAML must be UTF-8 or UTF-16.)
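
For readers who have not seen this technique, here is a minimal sketch of BOM recognition on top of an InputStream. It is neither the linked UnicodeReader nor the SnakeYAML code; the class BomSniffer and its methods are hypothetical names. It assumes only UTF-8 and UTF-16 need to be told apart and falls back to UTF-8 when no BOM is present.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.PushbackInputStream;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class BomSniffer {
    /** Returns a Reader positioned after the BOM (if any), using the charset the BOM implies. */
    static Reader open(InputStream in) throws IOException {
        PushbackInputStream pushback = new PushbackInputStream(in, 3);
        byte[] head = new byte[3];
        int n = readUpTo(pushback, head);
        Charset charset = StandardCharsets.UTF_8; // default when there is no BOM
        int bomLength = 0;
        if (n >= 3 && (head[0] & 0xFF) == 0xEF && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF) {
            bomLength = 3; // UTF-8 BOM
        } else if (n >= 2 && (head[0] & 0xFF) == 0xFE && (head[1] & 0xFF) == 0xFF) {
            charset = StandardCharsets.UTF_16BE;
            bomLength = 2;
        } else if (n >= 2 && (head[0] & 0xFF) == 0xFF && (head[1] & 0xFF) == 0xFE) {
            charset = StandardCharsets.UTF_16LE;
            bomLength = 2;
        }
        // Push back everything that was read beyond the BOM so no content is lost.
        if (n > bomLength) {
            pushback.unread(head, bomLength, n - bomLength);
        }
        return new InputStreamReader(pushback, charset);
    }

    /** Reads until the buffer is full or the stream ends; returns the number of bytes read. */
    private static int readUpTo(InputStream in, byte[] buffer) throws IOException {
        int total = 0;
        while (total < buffer.length) {
            int read = in.read(buffer, total, buffer.length - total);
            if (read < 0) {
                break;
            }
            total += read;
        }
        return total;
    }

    /** Convenience helper for the demo: decodes a whole byte array. */
    static String decode(byte[] bytes) {
        try (Reader reader = open(new ByteArrayInputStream(bytes))) {
            StringBuilder text = new StringBuilder();
            int c;
            while ((c = reader.read()) != -1) {
                text.append((char) c);
            }
            return text.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] utf8WithBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'a', ':', ' ', '1'};
        System.out.println(decode(utf8WithBom));
    }
}
```

The BOM bytes are consumed by hand because the standard UTF-8 decoder in Java does not skip them, which is the bug referred to earlier in this thread.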

n4te said...

JavaBeans:
What is it if not a JavaBean? Direct field assignment could be supported. Otherwise having any framework, YamlBeans or SnakeYAML, calling arbitrary methods for gets and sets doesn't seem like a valid use case. For this situation you would want to read the data as a map and make the calls yourself or write a special handler (Constructor in SnakeYAML, ScalarSerializer in YamlBeans).

"For instance '123-45-67' should call 'new PhoneNumber(String number)'?" It is standard for serialization frameworks to require a no-arg constructor rather than attempt invoking a constructor with arbitrary objects. If your point was "how does it know '123-45-67' is a String and not an int?", then the answer is that YamlBeans knows the type of the field on the class, so therefore knows how to interpret that data in the yaml. In this case, regex should not be used to guess types.

Representers are a neat feature of SnakeYAML. The counterpart in YamlBeans is to implement a ScalarSerializer to determine how a given class should be serialized and deserialized. This is also a workaround to use YamlBeans to invoke a non-no-arg constructor.

ImplicitResolvers seem too error prone for my taste, I would rather the data types be explicitly defined and enforced by a Java class definition.

Anchors:
YamlBeans handles anchors itself, not because of any bug in JvYaml, but because YamlBeans doesn't use the portion of JvYaml that manages anchors.

Recursive objects are an edge case. It is true that YamlBeans does not handle this. This could be fixed in a few ways: detection of the recursion and an exception or omission of an anchor, or manually disabling anchors.

Config:
If SnakeYAML is used under YamlBeans, the SnakeYAML config can be exposed.

InputStream:
"You should not transfer Reader to an InputStream, it is the other way around." I stated the same thing. Because SnakeYAML only accepts InputStream and it is not feasible to go from a Reader to an InputStream, SnakeYAML cannot be used if I only have a Reader. This can happen, for example, if I am obtaining text using some other 3rd party library that only exposes a Reader. This is a standard practice in Java, since it does not make sense for the other 3rd party API to expose an InputStream and force users of the API to do the decoding.

I don't think SnakeYAML and YamlBeans are mismatched. By YamlBeans using SnakeYAML for parsing and emitting, it gains all that you have built into SnakeYAML. I am not proposing that YamlBeans be the only way to round trip object graphs, only that it be available for use on top of SnakeYAML to make dealing with JavaBeans extremely simple. It is true that parsing into JavaBeans isn't the only way to parse yaml, but it is the expected approach in Java land and YamlBeans does it simply by using a Java class definition.

Andrey Somov said...

I came across a complex example (http://yaml.org/spec/1.1/#id859060) and I must say it is a challenge for SnakeYAML to parse it into Java business objects.
I will look at how YamlBeans works to see how it can help. Indeed SnakeYAML and YamlBeans are not that different.
I also agree that it should be possible to use a java.io.Reader as input, provided that it works properly with UTF-8 (the default Java implementation does not).