Thursday, April 10, 2008

Deserializing non printable characters

Today a problem with our import and export functionality landed on my desk. In an interface open to customers, one of them had entered non-printable characters in a string property, which caused the import to fail for any item that had been exported with that value.

My first thought was that it seemed strange that the XmlSerializer of the .NET Framework didn't automatically encode string properties to proper XML. Looking at the XML file, it did look correct though, or at least properly formatted. To be proper XML, however, it seems a document may not contain control characters other than CR, LF and TAB.

XML Spec 1.0 says:
Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
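
To make the rule concrete, here is a minimal sketch of the Char production as a C# check. The class and method names (XmlChars, IsXmlChar, IndexOfInvalidChar) are my own and not part of the framework; something like this could be used to find the offending characters before exporting.

```csharp
using System;

// Hypothetical helper, not part of the .NET Framework.
public static class XmlChars
{
    // Mirrors the XML 1.0 Char production for a single Unicode code point.
    public static bool IsXmlChar(int codePoint)
    {
        return codePoint == 0x9 || codePoint == 0xA || codePoint == 0xD
            || (codePoint >= 0x20 && codePoint <= 0xD7FF)
            || (codePoint >= 0xE000 && codePoint <= 0xFFFD)
            || (codePoint >= 0x10000 && codePoint <= 0x10FFFF);
    }

    // Returns the index of the first character that the production
    // does not allow, or -1 if the whole string is fine.
    public static int IndexOfInvalidChar(string s)
    {
        for (int i = 0; i < s.Length; i++)
        {
            int start = i;
            int codePoint = s[i];

            // Combine a valid surrogate pair into one code point.
            // A lone surrogate stays in #xD800-#xDFFF and is rejected.
            if (char.IsHighSurrogate(s[i]) && i + 1 < s.Length && char.IsLowSurrogate(s[i + 1]))
            {
                codePoint = char.ConvertToUtf32(s[i], s[i + 1]);
                i++;
            }

            if (!IsXmlChar(codePoint))
            {
                return start;
            }
        }
        return -1;
    }
}
```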

Considering this, it seems like the Serialize method that generated the XML should be the one complaining, but in this case the problem is reported by the Deserialize method. It throws an InvalidOperationException saying it encountered an invalid character in a call to ScanHexEntity.

With the help of Google I did, however, find a simple solution to the problem: calling Deserialize with an XmlTextReader instead of a StringReader. Since the former has a constructor that takes the latter as an argument, it has minimal impact on the code too.

Solution:
xmlSerializer.Deserialize(new XmlTextReader(sr));

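The reason is that this XmlTextReader constructor initializes the reader with the Normalization property set to false, while the XmlTextReader created under the hood by the other Deserialize overloads has the property set to true. With Normalization turned off, the reader also skips character range checking for numeric entities, so the control characters get through.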
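
The difference can be tried out with a small console program. This is only a sketch under a few assumptions: the Item class and its Name property are made up, and the XML string is hand-written to resemble an export with a control character written as a hex entity. It sets Normalization explicitly to compare the two reader configurations. On the .NET Framework I would expect the first read to fail as described above and the second to succeed, but other runtimes may behave differently.

```csharp
using System;
using System.IO;
using System.Xml;
using System.Xml.Serialization;

// Hypothetical exported type; the real class is not shown in this post.
public class Item
{
    public string Name { get; set; }
}

public static class NormalizationDemo
{
    // Deserializes the XML through an XmlTextReader with the given
    // Normalization setting.
    public static Item Read(string xml, bool normalization)
    {
        var serializer = new XmlSerializer(typeof(Item));
        using (var sr = new StringReader(xml))
        using (var reader = new XmlTextReader(sr))
        {
            reader.Normalization = normalization;
            return (Item)serializer.Deserialize(reader);
        }
    }

    public static void Main()
    {
        // Assumed shape of an exported item containing U+0001.
        string xml = "<Item><Name>abc&#x1;def</Name></Item>";

        foreach (bool normalization in new[] { true, false })
        {
            try
            {
                Item item = Read(xml, normalization);
                Console.WriteLine("Normalization=" + normalization
                    + ": read " + item.Name.Length + " characters");
            }
            catch (InvalidOperationException ex)
            {
                // XmlSerializer wraps the reader's XmlException.
                string detail = ex.InnerException != null
                    ? ex.InnerException.Message
                    : ex.Message;
                Console.WriteLine("Normalization=" + normalization
                    + ": failed: " + detail);
            }
        }
    }
}
```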

A more sophisticated solution would be to base64 encode the string, but I'll stick with the simple solution. Besides having very limited impact and being very quick to implement, it also has the benefit that old export files will still work.
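
For completeness, the base64 route could look roughly like the sketch below. I have not used this in our code; the class and property names (EncodedItem, Name, NameBase64) are made up. The readable property is hidden from the serializer and a second property carries the encoded value. Note that this changes the file format, so old export files would need special handling.

```csharp
using System;
using System.Text;
using System.Xml.Serialization;

// Hypothetical type showing the base64 alternative.
public class EncodedItem
{
    // The value the rest of the application works with.
    // Hidden from XmlSerializer so it never reaches the XML as raw text.
    [XmlIgnore]
    public string Name { get; set; }

    // The value that is actually written to and read from the XML.
    public string NameBase64
    {
        get
        {
            return Name == null
                ? null
                : Convert.ToBase64String(Encoding.UTF8.GetBytes(Name));
        }
        set
        {
            Name = value == null
                ? null
                : Encoding.UTF8.GetString(Convert.FromBase64String(value));
        }
    }
}
```

Since base64 output only contains letters, digits, '+', '/' and '=', the serialized element no longer contains anything the Char rule forbids.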
