If you want to make sure your text gets parsed correctly, you will mostly reach for htmlentities. However, this function has two downsides:
1. It does not convert to numeric entities, so you’ll have problems when the text is parsed as XML, which predefines only &amp;, &lt;, &gt;, &quot; and &apos;.
2. It does NOT cover all characters that are likely to show up.
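To make the first downside concrete, here is a minimal sketch. It assumes a UTF-8 source file and the SimpleXML extension; the sample string and the <p> wrapper are made up for illustration.

```php
<?php
// htmlentities() emits HTML named entities for the accented letters.
$text = 'Café & crème';
$escaped = htmlentities($text, ENT_QUOTES, 'UTF-8');
echo $escaped, "\n"; // Caf&eacute; &amp; cr&egrave;me

// Collect libxml errors instead of emitting PHP warnings.
libxml_use_internal_errors(true);
$doc = simplexml_load_string('<p>' . $escaped . '</p>');

// &eacute; and &egrave; are not declared in plain XML, so the parser
// should report them as undefined entities.
foreach (libxml_get_errors() as $error) {
    echo trim($error->message), "\n";
}
libxml_clear_errors();
```

Only &amp; survives here, because it is one of the five entities XML knows without a DTD.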
So, to address these issues, first for point 1:
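function _convertAlphaEntitysToNumericEntitys($entity){
    // Decode the named entity to its UTF-8 character, e.g. &eacute; becomes é.
    $char = html_entity_decode($entity[0], ENT_QUOTES, 'UTF-8');
    // Unknown names come back unchanged, so leave them alone.
    // mb_ord() (PHP 7.2+) returns the full code point; ord() would only read the first byte.
    return $char === $entity[0] ? $entity[0] : '&#'.mb_ord($char, 'UTF-8').';';
}
$content = preg_replace_callback('/&(\w+);/','_convertAlphaEntitysToNumericEntitys',$content);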
Here all “normal” named entities (which you already have from htmlentities) are replaced by their numeric counterparts so they can be parsed as XML. That leaves us with the second problem: htmlentities covers only a small range of characters in the first place:
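function _convertAsciOver127toNumericEntitys($entity){
    // With the /u flag, $entity[0] is one whole UTF-8 character, so mb_ord() (PHP 7.2+) yields its code point.
    if(($asciCode = mb_ord($entity[0], 'UTF-8')) > 127){
        return '&#'.$asciCode.';';
    }else{
        return $entity[0];
    }
}
// Match every non-ASCII character; $content must be valid UTF-8 for the /u flag to work.
$content = preg_replace_callback('/[^\x00-\x7F]/u','_convertAsciOver127toNumericEntitys',$content);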
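As a side note, the mbstring extension ships mb_encode_numericentity(), which can do this second step without a callback. This is a different technique from the one above, shown only as a sketch; it assumes UTF-8 input, and the sample string is made up.

```php
<?php
// Assumes UTF-8 input and the mbstring extension.
// A convmap entry is [first code point, last code point, offset, mask];
// this one covers everything above ASCII.
$convmap = [0x80, 0x10FFFF, 0, 0x1FFFFF];

$encoded = mb_encode_numericentity('Café 日本', $convmap, 'UTF-8');
echo $encoded, "\n"; // Caf&#233; &#26085;&#26412;
```

It does not touch &, < or >, so it only replaces the second step, not htmlentities itself.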
And there you go: the resulting text should have no entity problems in XML.
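Putting both steps together, a self-contained sketch could look like the following. The helper names (namedToNumericEntity, nonAsciiToNumericEntity, xmlSafeEntities) are hypothetical, and the code assumes UTF-8 input, PHP 7.2+ for mb_ord(), and the mbstring and SimpleXML extensions.

```php
<?php
// Hypothetical helpers; assumes UTF-8 input and PHP 7.2+ (mb_ord).

function namedToNumericEntity(array $match): string
{
    $decoded = html_entity_decode($match[0], ENT_QUOTES | ENT_HTML5, 'UTF-8');
    // Unknown names come back unchanged, and a few HTML5 names decode to
    // two code points; leave both cases as they are.
    if ($decoded === $match[0] || mb_strlen($decoded, 'UTF-8') !== 1) {
        return $match[0];
    }
    return '&#' . mb_ord($decoded, 'UTF-8') . ';';
}

function nonAsciiToNumericEntity(array $match): string
{
    // With the /u flag each match is one complete UTF-8 character.
    return '&#' . mb_ord($match[0], 'UTF-8') . ';';
}

function xmlSafeEntities(string $text): string
{
    // Step 0: the usual htmlentities() pass.
    $text = htmlentities($text, ENT_QUOTES, 'UTF-8');
    // Step 1: named entities to numeric entities.
    $text = preg_replace_callback('/&[a-zA-Z][a-zA-Z0-9]*;/', 'namedToNumericEntity', $text);
    // Step 2: whatever non-ASCII characters htmlentities() left alone.
    return preg_replace_callback('/[^\x00-\x7F]/u', 'nonAsciiToNumericEntity', $text);
}

$safe = xmlSafeEntities('Café & 日');
echo $safe, "\n"; // Caf&#233; &#38; &#26085;

// Numeric character references need no DTD, so this should parse cleanly.
$xml = simplexml_load_string('<p>' . $safe . '</p>');
var_dump($xml !== false);
```

The é and & go through step 1, while 日 has no HTML 4.01 entity name and is only caught by step 2.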