Encoding Unicode Characters as XML Entities in PHP

Written by James Mansson on December 14, 2012 Categories: PHP, XML

Since I started using the jqGrid jQuery plugin, I have had to wrestle with a tricky problem when generating content for the grid on the server in response to AJAX requests from the grid. The key issue is that the response to these requests is in the form of an XML file, and because the text for the grid cells may contain Unicode characters, these need to be converted to the appropriate XML entities.

The server-side code I write tends to be in PHP, so I naturally looked for a standard PHP function that does this job. Unfortunately there are only functions for HTML, in particular htmlentities. My initial solution was to use htmlentities on the text and define the HTML entities in XML. For instance, the following XML defines the HTML nbsp (no-break space) entity:

<!DOCTYPE characters 
[
<!ELEMENT characters (character*) >
<!ELEMENT character  (#PCDATA   ) >

<!ENTITY nbsp   "&#160;" >
]
>

However, this proved a rather tedious approach, as there are 252 HTML entities in the HTML 4.0 standard, and therefore to be comprehensive I would have to have 252 ENTITY entries. Initially I tried just defining what I thought would be the most common, but I periodically had to add new lines, so in the end I created a version with all the entries.
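
The full list of declarations could probably be generated rather than typed. The following is only a sketch: it assumes PHP 5.4 or later (for the ENT_HTML401 flag) and that the mbstring extension is available on the machine producing the list, and htmlEntityDeclarations is a hypothetical helper name chosen for illustration.

```php
<?php
// Hypothetical helper: builds one <!ENTITY ...> line per HTML 4.01 named entity.
function htmlEntityDeclarations()
{
    // Maps each UTF-8 character to its named entity, e.g. "\xC2\xA0" => "&nbsp;"
    $table = get_html_translation_table(HTML_ENTITIES, ENT_COMPAT | ENT_HTML401, 'UTF-8');

    // XML predefines these five names, so they need no declaration
    $predefined = array('amp', 'lt', 'gt', 'quot', 'apos');

    $lines = array();
    foreach ($table as $char => $entity) {
        // Keep only entries of the form &name;
        if (!preg_match('/^&([A-Za-z][A-Za-z0-9]*);$/', $entity, $m)) {
            continue;
        }
        if (in_array($m[1], $predefined, true)) {
            continue;
        }
        // Find the code point by converting the character to UCS-4 (big-endian)
        $cp = unpack('N', mb_convert_encoding((string) $char, 'UCS-4BE', 'UTF-8'));
        $lines[] = sprintf('<!ENTITY %-6s "&#%d;" >', $m[1], $cp[1]);
    }

    return implode("\n", $lines);
}

echo htmlEntityDeclarations(), "\n";
```

The output can then be pasted between the square brackets of the DOCTYPE shown above.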

I found a more elegant solution on the htmlentities PHP manual page. I took the suggested code and wrapped it up in the following PHP function:

/**
 * Returns normal chars as chars and others as numeric html entities.
 */
public function superEntities($str)
{
	// Check that the Multibyte String extension is loaded
	if (!extension_loaded('mbstring'))
		return htmlentities($str);

	// Get rid of existing entities else double-escape
	$str = html_entity_decode(stripslashes($str), ENT_QUOTES, 'UTF-8');

	// Return array of every multi-byte character
	$ar = preg_split('/(?<!^)(?!$)/u', $str);

	$str2 = '';

	foreach ($ar as $c)
	{
		$o = ord($c);

		/* 
			Condition				Description
			---------				-----------
			strlen($c) > 1			multi-byte [unicode]
			$o < 32 || $o > 126		control characters/latin others
			$o > 33 && $o < 40		quotes + ampersand
			$o > 59 && $o < 63		html
		*/

		if ((strlen($c) > 1) || ($o < 32 || $o > 126) || ($o > 33 && $o < 40) || ($o > 59 && $o < 63))
		{
			// Convert to numeric entity; the map spans all of Unicode, not just the BMP
			$c = mb_encode_numericentity($c, array (0x0, 0x10ffff, 0, 0x1fffff), 'UTF-8');
		}

		$str2 .= $c;
	}

	return $str2;
}
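
The call doing the real work in the function is mb_encode_numericentity, and its convert map argument is easy to misread. The snippet below tries the call on its own. It assumes the mbstring extension is loaded, and the map shown is an illustrative choice (every code point from U+0080 upwards), not necessarily the one the function uses.

```php
<?php
// Convert map: array(first code point, last code point, offset, mask).
// The mask is ANDed with the code point, so it must keep all 21 bits.
$convmap = array(0x80, 0x10ffff, 0, 0x1fffff);

// "café €5" written as raw UTF-8 bytes
$text = "caf\xC3\xA9 \xE2\x82\xAC5";

// ASCII falls outside the map and passes through unchanged
$encoded = mb_encode_numericentity($text, $convmap, 'UTF-8');

echo $encoded, "\n"; // caf&#233; &#8364;5
```

Because this map starts at U+0080, characters such as & and < are left alone. The function above handles those itself by testing the byte value of each single-byte character.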

As it was not certain that the Multibyte String (mbstring) extension would always be installed on the target system, I introduced a fall-back to use htmlentities instead when it was not; in that case, I would include the XML entity definitions at the top of the XML file.
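
One way to wire up that fall-back is to decide on the DOCTYPE when the start of the response is written. This is a rough sketch only: xmlProlog is a hypothetical helper, and the root element name and the entity declarations are passed in as assumptions about the surrounding code.

```php
<?php
// Hypothetical helper: returns the XML declaration, plus a DOCTYPE with the
// entity declarations only when the htmlentities fall-back will be in use.
function xmlProlog($rootElement, $entityDeclarations)
{
    $prolog = '<?xml version="1.0" encoding="UTF-8"?>' . "\n";

    // Without mbstring the text will contain named entities such as &nbsp;,
    // so the parser needs their definitions
    if (!extension_loaded('mbstring')) {
        $prolog .= "<!DOCTYPE " . $rootElement . " [\n" . $entityDeclarations . "\n]>\n";
    }

    return $prolog;
}

echo xmlProlog('characters', '<!ENTITY nbsp   "&#160;" >');
```

When mbstring is available the text contains only numeric entities, which need no declarations, so the DOCTYPE can be left out.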
