Wednesday, May 22, 2013

Hans Moleman issues

XML schema validation, and GML schema validation in particular, can be hard. The mysterious Xerces 'honor all schema locations' flag springs to mind (a mystery yet to be fully understood). Often, slow schema validation processes (which seem to fetch schemas from the web) can be traced to Hans Moleman. No, sorry, wrong link: to Hans Moleman.

So what's happening? And what does Hans Moleman have to do with it?

As the GML experts among you may know, GML application schemas depend on the GML schema, which in turn consists of many schemas (how many varies between versions). These depend on further schemas, for example the W3C XLink schema, which in turn imports the W3C XML schema (the schema for the xml namespace itself: http://www.w3.org/XML/1998/namespace).
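
To make that concrete, the chain of imports looks roughly like this. The schemaLocations are illustrative only; the exact URLs and file layout differ between GML versions:

    <!-- in the GML application schema -->
    <import namespace="http://www.opengis.net/gml"
            schemaLocation="http://schemas.opengis.net/gml/3.2.1/gml.xsd"/>

    <!-- somewhere inside the GML schemas -->
    <import namespace="http://www.w3.org/1999/xlink"
            schemaLocation="http://www.w3.org/1999/xlink.xsd"/>

    <!-- in the W3C XLink schema -->
    <import namespace="http://www.w3.org/XML/1998/namespace"
            schemaLocation="http://www.w3.org/2001/xml.xsd"/>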

So even when validating a feature collection against a local copy of a GML application schema, the schema parser may still reach a point where it needs to fetch dependent schemas from the internet. And since xml.xsd is the last one in the chain, it's also the one that gets requested the most.

According to the W3C, they were seeing ~130 million accesses to this file per day, and at some point decided to completely block e.g. the default Java HTTP User-Agent, among others. Apparently they later had a change of heart and don't block it any more, but loading the xml.xsd URL is now delayed by several seconds (see http://www.w3.org/2001/xml.xsd).

So when validating multiple documents that all need xml.xsd, with all schemas loaded freshly every time, you get a delay of several seconds per document during which your computer seems to do nothing at all.
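
One simple mitigation, independent of how we handle it in deegree (see below), is to compile the schema once and reuse it for all documents, so any remote fetches happen at most once. A minimal JAXP sketch; the schema file name here is made up for illustration:

    import java.io.File;
    import javax.xml.XMLConstants;
    import javax.xml.transform.stream.StreamSource;
    import javax.xml.validation.Schema;
    import javax.xml.validation.SchemaFactory;
    import javax.xml.validation.Validator;

    public class ReusedSchemaValidation {
        public static void main(String[] args) throws Exception {
            SchemaFactory factory = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
            // Parse the schema once: remote imports (xlink.xsd, xml.xsd, ...)
            // are fetched here, not again for every document.
            Schema schema = factory.newSchema(new File("MyApplicationSchema.xsd"));
            for (String doc : args) {
                Validator validator = schema.newValidator();
                validator.validate(new StreamSource(new File(doc)));
                System.out.println(doc + ": valid");
            }
        }
    }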

We thought about the problem of remote schemas quite a while ago, and we use a custom Xerces entity resolver to load OGC and W3C schemas from a local artifact which we ship with deegree. There are other solutions as well: our JAXB schema generation, for example, uses standard XML catalog files to avoid fetching schemas from the web.
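
The idea behind such a resolver looks roughly like this. This is a minimal sketch, not deegree's actual implementation, and the classpath layout is an assumption; whether the resolver is consulted for schema documents also depends on how the parser is configured:

    import java.io.InputStream;
    import org.xml.sax.EntityResolver;
    import org.xml.sax.InputSource;

    /** Redirects well-known remote schema URLs to copies bundled on the classpath. */
    public class LocalSchemaResolver implements EntityResolver {

        public InputSource resolveEntity(String publicId, String systemId) {
            if (systemId != null && systemId.startsWith("http://www.w3.org/")) {
                // e.g. http://www.w3.org/2001/xml.xsd -> /schemas/www.w3.org/2001/xml.xsd
                String path = "/schemas/" + systemId.substring("http://".length());
                InputStream is = getClass().getResourceAsStream(path);
                if (is != null) {
                    InputSource source = new InputSource(is);
                    source.setSystemId(systemId);
                    return source;
                }
            }
            return null; // not ours -- let the parser resolve it as usual
        }
    }

An XML catalog achieves the same redirection declaratively. An entry for xml.xsd might look like this (the local uri path is again made up):

    <catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">
      <system systemId="http://www.w3.org/2001/xml.xsd"
              uri="schemas/www.w3.org/2001/xml.xsd"/>
    </catalog>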

But unfortunately the CITE WFS 1.0.0 tests (and others) do nothing of the sort (although newer versions do tend to load required schemas from the classpath as well).

Through reverse engineering with an eclipse plugin (see the other post from today) I was able to fix this (they were already using a custom entity resolver, but one that loaded everything from the web every time). Now a complete deegree build including integration tests needs only 13 minutes on a fast machine!

For those interested, have a look at our deegree-compliance-tests module.

All the library sources

One of the nicer features in eclipse is that it can automatically browse not only your own sources, but library sources as well, if they're available. But what if they're not?

In that case eclipse shows the class in a bytecode view, with a button to attach a source .jar. Unfortunately, it is often the case that you don't have the sources, either because they were never uploaded to maven central, or because they're closed altogether.

In any case, it is obviously often desirable while debugging to see what a library function actually expects, or why it fails. Contrary to popular belief, the actual code is always the ultimate documentation, because it's up to date even if human language docs are not.

So recently, while chasing a Hans Moleman issue (more on that later), I needed the sources of binary classes. Of course my first thought was 'decompiler', so I searched the eclipse marketplace for 'decompiler'.

I found JadClipse for eclipse 4.x, installed it, and voilà: when I double-click a class file with no sources attached, the decompiler automatically decompiles it and shows me the source. Now that's what I call a hassle-free plugin! It's the perfect counterpart to the -DdownloadSources flag of the maven eclipse mojo; now I never need to go without library sources again.
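
For reference, that flag belongs to the (old) maven-eclipse-plugin and is passed on the command line like this:

    mvn eclipse:eclipse -DdownloadSources=true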