Returning an XML Encoded String in .NET

XML is not as popular as it once was, but there's still a lot of XML-based configuration and data floating around today. Just today I was working with a conversion routine that needs to generate XML formatted templates, and one thing I needed was an easy way to generate a properly encoded XML string.

Stupid Pet Tricks

I'll preface this by saying that your need for generating XML as standalone strings should be a rare occurrence. In most cases you'll want to use a proper XML processor, whether that's an XmlDocument, XmlWriter or LINQ to XML, to generate your XML. When you use those features, the conversion from string (and most other types) to XML is built in and mostly automatic.

However, in this case I have a huge block of mostly static XML text, and creating the entire document using structured XML documents seems like overkill when really I just need to inject a few simple values.

So in this case I'm looking for a way to format values as XML for which the XmlConvert static class works well.

Should be easy right? Well...

The XmlConvert class works well, except for string conversions, which it doesn't support. XmlConvert.ToString() has overloads for just about all of the common base types, but none that takes a string and returns properly XML encoded content.
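To illustrate what XmlConvert does cover, here is a minimal sketch that formats a few non-string values. The class and method names (XmlConvertDemo, FormatValues) are made up for this example; only the XmlConvert.ToString() overloads come from the framework.

```csharp
using System;
using System.Xml;

public static class XmlConvertDemo
{
    // Hypothetical helper: formats a few base types the way XML expects them.
    public static string[] FormatValues()
    {
        var date = new DateTime(2018, 11, 30, 0, 0, 0, DateTimeKind.Utc);

        return new[]
        {
            XmlConvert.ToString(true),      // lower case "true", not "True"
            XmlConvert.ToString(1234.5m),   // culture-invariant decimal point
            XmlConvert.ToString(date, XmlDateTimeSerializationMode.Utc) // ISO 8601 style date
        };
    }
}
```

There is no matching overload that takes a string, which is why the approaches below are needed.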

Now what?

Reading an encoded XML Value

There are a number of different ways that you can generate XML output and all of them basically involve creating some sort of XML structure and reading the value out of the 'rendered' document.

The most concise way I've found (on StackOverflow from Jon Skeet) is the following:

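public static string XmlString(string text)
{
    // LastNode is null for null or empty content, so return those as is
    if (string.IsNullOrEmpty(text))
        return text;
    return new XElement("t", text).LastNode.ToString();
}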

which you can call with:

void Main()
{
    XmlString("Brackets & stuff <> and \"quotes\" and more 'quotes'.").Dump();
}

and which produces:

Brackets &amp; stuff &lt;&gt; and "quotes" and more 'quotes'.

If you don't want to use LINQ to XML you can use an XmlDocument instead.

// cached so the document is created only once
private static XmlDocument _xmlDoc;

public static string XmlString(string text)
{
	_xmlDoc = _xmlDoc ?? new XmlDocument();
	var el = _xmlDoc.CreateElement("t");
	el.InnerText = text;
	return el.InnerXml;
}

Note that using XmlDocument is considerably slower than XElement even with the document caching used above.

System.Security.SecurityElement.Escape()?

SecurityElement.Escape() is a built-in framework function that performs XML encoding. It's a single function so it's easy to call, but it always encodes both single and double quotes, with no option to turn that off. This is OK, but it can result in extra characters if you're encoding for XML elements, since only attributes need quotes encoded. The function is also considerably slower than the other mechanisms mentioned here.
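For reference, a minimal sketch of calling it. The wrapper class and method names (SecurityElementDemo, Encode) are made up for this example; SecurityElement.Escape() itself is the framework call.

```csharp
using System.Security;

public static class SecurityElementDemo
{
    // Hypothetical wrapper: escapes <, >, &, " and ' unconditionally.
    public static string Encode(string text)
    {
        return SecurityElement.Escape(text);
    }
}
```

Passing the same test string used earlier, Encode("Brackets & stuff <> and \"quotes\" and more 'quotes'.") returns:

Brackets &amp; stuff &lt;&gt; and &quot;quotes&quot; and more &apos;quotes&apos;.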

Just Code

If you don't want to deal with adding a reference to LINQ to XML or even System.Xml you can also create a simple code routine. XML encoding really just escapes 5 characters (3 if you're encoding for elements). In addition, the routine below throws for illegal control characters below CHR(32), with the exception of tabs, carriage returns and line feeds.

The simple code to do this looks like this:

/// <summary>
/// Turns a string into a properly XML Encoded string.
/// Uses simple string replacement.
///
/// Also see XmlUtils.XmlString() which uses XElement
/// to handle additional extended characters.
/// </summary>
/// <param name="text">Plain text to convert to XML Encoded string</param>
/// <param name="encodeQuotes">
/// If true encodes single and double quotes.
/// When embedding element values quotes don't need to be encoded.
/// When embedding attributes quotes need to be encoded.
/// </param>
/// <returns>XML encoded string</returns>
/// <exception cref="InvalidOperationException">Invalid character in XML string</exception>
public static string XmlString(string text, bool encodeQuotes = false)
{
    var sb = new StringBuilder(text.Length);

    foreach (var chr in text)
    {
        if (chr == '<')
            sb.Append("&lt;");
        else if (chr == '>')
            sb.Append("&gt;");
        else if (chr == '&')
            sb.Append("&amp;");
        // special handling for quotes
        else if (encodeQuotes && chr == '\"')
            sb.Append("&quot;");
        else if (encodeQuotes && chr == '\'')
            sb.Append("&apos;");
        // Legal sub-chr32 characters
        else if (chr == '\n')
            sb.Append("\n");
        else if (chr == '\r')
            sb.Append("\r");
        else if (chr == '\t')
            sb.Append("\t");
        else
        {
            if (chr < 32)
                throw new InvalidOperationException("Invalid character in Xml String. Chr " +
                                                    Convert.ToInt16(chr) + " is illegal.");
            sb.Append(chr);
        }
    }

    return sb.ToString();
}

Attributes vs. Elements

Notice that the function above optionally supports quote encoding. By default quotes are not encoded.

That's because elements are not required to have quotes encoded: there are no string delimiters to worry about in an XML element. This is legal XML:

<doc>This a "quoted" string. So is 'this'!</doc>

However, if you are generating an XML string for an attribute you do need to encode quotes because the quotes are the delimiter for the attribute. Makes sense right?

<doc note="This a &quot;quoted&quot; string. So is &apos;this&apos;!" />

Actually, the &apos; is not required in this example because the attribute delimiter is ". So this is all that's strictly needed:

<doc note="This a &quot;quoted&quot; string. So is 'this'!" />

However, both are valid XML. The string function above will encode single and double quotes when the encodeQuotes parameter is set to true to handle setting attribute values.

The following LINQPad code demonstrates:

void Main()
{
	var doc = new XmlDocument();
	doc.LoadXml("<d><t>This is &amp; a \"test\" and a 'tested' test</t></d>");	
	doc.OuterXml.Dump();
	var node = doc.CreateElement("d2");
	node.InnerText = "this & that <doc> and \"test\" and 'tested'";
	doc.DocumentElement.AppendChild(node);
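	// the second parameter of CreateAttribute() is a namespace URI, so set the value separately
	var attr = doc.CreateAttribute("note");
	attr.Value = "this & that <doc> and \"test\" and 'tested'";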
	node.Attributes.Append(attr);
	doc.OuterXml.Dump();
}

The final document looks like this (wrapped here for readability; OuterXml returns it as a single line):

<d><t>This is &amp; a "test" and a 'tested' test</t><d2 note="this &amp; that &lt;doc&gt; and &quot;test&quot; and 'tested'">
this &amp; that &lt;doc&gt; and "test" and 'tested'</d2></d>

Bottom line: Elements don't require quotes to be encoded, but attributes do.
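To tie this back to the template scenario that started this post, here is a rough sketch of injecting one attribute value and one element value into a static XML string. The class name, the Render() method and the template are made up for illustration, and the encoder is a condensed copy of the XmlString() routine above that leaves out the control character check for brevity.

```csharp
using System;
using System.Text;

public static class XmlTemplateDemo
{
    // Condensed variant of XmlString(); omits the check for characters below 32.
    public static string XmlString(string text, bool encodeQuotes = false)
    {
        var sb = new StringBuilder(text.Length);
        foreach (var chr in text)
        {
            if (chr == '<') sb.Append("&lt;");
            else if (chr == '>') sb.Append("&gt;");
            else if (chr == '&') sb.Append("&amp;");
            else if (encodeQuotes && chr == '"') sb.Append("&quot;");
            else if (encodeQuotes && chr == '\'') sb.Append("&apos;");
            else sb.Append(chr);
        }
        return sb.ToString();
    }

    // Hypothetical template: {0} lands in an attribute, {1} in element content.
    public static string Render(string note, string body)
    {
        const string template = "<doc note=\"{0}\">{1}</doc>";
        return string.Format(template,
            XmlString(note, encodeQuotes: true),  // attribute: quotes encoded
            XmlString(body));                     // element: quotes left alone
    }
}
```

Calling Render("say \"hi\" & 'bye'", "a < b \"ok\"") produces:

<doc note="say &quot;hi&quot; &amp; &apos;bye&apos;">a &lt; b "ok"</doc>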

Performance

This falls into the premature optimization bucket, but I was curious how well each of these mechanisms would perform relative to each other. It would seem that XElement and especially XmlDocument would be very slow as they process the element as an XML document/fragment that has to be loaded and parsed.

I was very surprised to find that the fastest and most consistent solution across various sizes of text was XElement, which was faster than my string implementation. For small amounts of text (under a few hundred characters) the string and XElement implementations were roughly the same, but as strings got larger XElement became considerably faster.

As an aside, the custom string version runs considerably faster in Release mode (in LINQPad, with Optimizations On) than in Debug mode, where it was about 3-4x slower. Yikes.

Not surprisingly, XmlDocument, even the cached version, was slower: roughly 50% slower with small strings and many times slower with larger strings, falling further behind as the string size grows.

Surprisingly slowest of them all was SecurityElement.Escape() which was nearly twice as slow as the XmlDocument approach.

Whatever XElement is doing to encode the text, it's very efficient, and it's built into the framework and maintained by Microsoft, so I would recommend that solution. The exception is if you want to avoid the XML assembly references, in which case the custom string solution works about as well with smaller strings and reasonably close with large strings.

Take all of these numbers with a grain of salt. All of these approaches are pretty fast for one-off encoding, and unless you're manually XML encoding strings in loops or large batches, the perf difference is not of concern here.
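If you want to run a rough comparison of your own, a timing harness can be sketched along these lines. This is not the code I used for the numbers above, just an assumed minimal setup; the class and method names (EncodeTiming, Time, ViaXElement, ViaSecurityElement) are made up for the example.

```csharp
using System;
using System.Diagnostics;
using System.Security;
using System.Xml.Linq;

public static class EncodeTiming
{
    // Runs the encoder repeatedly over the same input and returns the elapsed time.
    public static TimeSpan Time(Func<string, string> encode, string input, int iterations)
    {
        encode(input); // warm-up call so JIT compilation is not part of the timing

        var sw = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
            encode(input);
        sw.Stop();

        return sw.Elapsed;
    }

    // XElement approach; assumes non-empty input because LastNode is null otherwise.
    public static string ViaXElement(string text)
    {
        return new XElement("t", text).LastNode.ToString();
    }

    // SecurityElement approach for comparison.
    public static string ViaSecurityElement(string text)
    {
        return SecurityElement.Escape(text);
    }
}
```

For example, EncodeTiming.Time(EncodeTiming.ViaXElement, someText, 100000) can be compared against the same call with EncodeTiming.ViaSecurityElement. Run it with optimizations enabled and try several input sizes, since the relative results change with string length.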

If you want to play around with the different approaches, here's a Gist that you can load into LINQPad that you can just run:

Summary

XML string encoding is something you hopefully won't have to do much of, but it's one thing I've tripped over enough times to take the time to write up here. Again, in most cases my recommendation would be to write strings using one of the framework's XML APIs (XmlDocument or XDocument/XElement). But in the few cases where you just need to jam a couple of values into a large document, nothing beats simple string replacement in the document for simplicity and easy maintenance, and that's the one edge use case where a function like XmlString() makes sense.

© Rick Strahl, West Wind Technologies, 2005-2018
Posted in .NET  C#  XML  