Character entities

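When you work with data from the web, such as web scraping results, you may encounter odd character sequences like &amp;, &quot;, &nbsp; or &auml;. These are character entities. They stand for characters that cannot be written literally in an XML or HTML file, either because the markup language reserves them for its own syntax or because the file's character set does not contain them. HTML is a markup language closely related to XML and uses the same mechanism.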

A simple toy XML file would look like this:

<?xml version="1.0" encoding="UTF-8"?>
<message>
  <to>Ramona</to>
  <from>Ben</from>
  <head>Reminder</head>
  <body>I will be home late today</body>
</message>

XML uses tags like <message> or <from> in order to structure the file. However, characters like < or > would confuse the parser that reads these XML files. It wouldn't be able to distinguish between the tags used for structuring the XML file and the characters within the text. That is why certain characters used by XML have to be replaced with their corresponding entity encoding. For example, < in a text becomes &lt;, > becomes &gt; and the ampersand & becomes &amp;.
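The effect on the parser can be tried out directly in R. The following is a minimal sketch, assuming the xml2 package is installed; the sample sentence and the variable names raw_ok, escaped and body_text are made up for illustration.

```r
library(xml2)

# A literal "<" inside the text looks like the start of a tag,
# so the parser gives up. tryCatch() turns the error into FALSE.
raw_ok <- tryCatch(
  {
    read_xml("<body>if a < b then stop</body>")
    TRUE
  },
  error = function(e) FALSE
)
raw_ok
# FALSE

# With the character written as &lt; the document parses,
# and xml_text() returns the decoded text.
escaped <- read_xml("<body>if a &lt; b then stop</body>")
body_text <- xml_text(escaped)
body_text
# "if a < b then stop"
```

The entity only exists in the file on disk. Once the parser has read the document, the text you get back contains the plain character again.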

In other cases, the XML or HTML file has been saved in a character set that does not support certain characters, for example the German umlauts (ä, ö, ü). To use them anyway, the creator of the file has to replace them with their entity encodings, which are &auml;, &ouml; and &uuml;.

Whenever you work with XML files, expect these character entities and check your data for them. You should never translate back and forth between regular text and entity-replaced text by hand. Usually, all programs and functions capable of reading and writing XML should do this for you. Unfortunately, this does not always work, and you may still find these character sequences in your data. To decode a string s that contains XML entities within R, you can use xml_text(read_xml(paste0("<x>", s, "</x>"))) from the xml2 package. XML itself only defines &lt;, &gt;, &amp;, &quot; and &apos;, so for strings that contain named HTML entities such as &auml; or &nbsp;, read_html() from the same package can take the place of read_xml().
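The xml2 call mentioned above can be wrapped in small helper functions. This is a sketch rather than a definitive implementation: the function names decode_xml_entities and decode_html_entities are hypothetical, and the read_html() variant is an addition for named HTML entities such as &auml;, which plain XML does not define.

```r
library(xml2)

# Hypothetical helper: wrap the string in a dummy element <x> and let
# the XML parser decode the entities. Expects a single string.
decode_xml_entities <- function(s) {
  xml_text(read_xml(paste0("<x>", s, "</x>")))
}

# Hypothetical helper: same idea with the HTML parser, which also
# knows named HTML entities such as &auml; or &ouml;.
decode_html_entities <- function(s) {
  xml_text(read_html(paste0("<x>", s, "</x>")))
}

decode_xml_entities("Tom &amp; Jerry &lt;3")
# "Tom & Jerry <3"

decode_html_entities("K&auml;se &amp; Br&ouml;tchen")
# "Käse & Brötchen"
```

Both helpers take one string at a time. For a character vector, they can be applied element by element, for example with vapply(strings, decode_html_entities, character(1), USE.NAMES = FALSE), where strings stands for your own data.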

Takeaways:

  • When working with data from the web, check for encoded entities.

  • Never replace encoded entities by hand. Always use the existing tools for that purpose.
