# Character entities

When you work with data from the web, such as web scraping results, you may encounter odd character sequences like `&amp;`, `&quot;`, `&#160;` or `&auml;`. These are character entities: placeholders for characters that either have a special meaning in XML and HTML markup or cannot be represented in the file's character set. HTML is closely related to XML and uses the same entity syntax.

A simple toy XML file would look like this:

```
<?xml version="1.0" encoding="UTF-8"?>
<message>
  <to>Ramona</to>
  <from>Ben</from>
  <head>Reminder</head>
  <body>I will be home late today</body>
</message>
```

XML uses tags like `<message>` or `<from>` in order to structure the file. However, characters like `<` or `>` would confuse the parser that reads these XML files. It wouldn't be able to distinguish between the tags used for structuring the XML file and the characters within the text. That is why certain characters used by XML have to be replaced with their corresponding entity encoding. For example, `<` in a text becomes `&lt;`, `>` becomes `&gt;` and the ampersand `&` becomes `&amp;`.
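The replacement described above can be sketched in a few lines. This is only an illustration using Python's standard library (`xml.sax.saxutils` and `xml.etree.ElementTree`); the sample text and variable names are made up for the example.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, unescape

# Hypothetical text that contains characters with a special meaning in XML.
raw_text = "Tom & Jerry say: 1 < 2 and 3 > 2"

# escape() replaces &, < and > with &amp;, &lt; and &gt;.
escaped = escape(raw_text)

# The escaped text can now be embedded safely between tags.
body = f"<body>{escaped}</body>"
print(body)

# A parser turns the entities back into the original characters.
parsed_text = ET.fromstring(body).text

# unescape() is the inverse of escape() for these three entities.
restored = unescape(escaped)
```

Without the escaping step, the parser would treat the `<` in the text as the start of a new tag and reject the document.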

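In other cases, the XML or HTML file itself has been saved in a character set that does not support certain characters, for example German umlauts (ä, ö, ü). To use them anyway, the creator of the file has to replace them with their entity encodings, which in HTML are `&auml;`, `&ouml;` and `&uuml;`. Plain XML predefines only five named entities (`&lt;`, `&gt;`, `&amp;`, `&quot;` and `&apos;`), so other characters are written there as numeric references such as `&#228;` for ä.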
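To illustrate how unsupported characters end up as entities, here is a minimal Python sketch. It assumes the target character set is plain ASCII and uses the built-in `xmlcharrefreplace` error handler, which produces numeric references rather than named ones. The sample string is invented for the example.

```python
import html
from html.entities import codepoint2name

# Hypothetical text with characters that ASCII cannot represent.
text = "Grüße aus Köln"

# Every character outside ASCII is replaced by a numeric character reference.
ascii_bytes = text.encode("ascii", errors="xmlcharrefreplace")
print(ascii_bytes)

# The HTML name of a character can be looked up by its code point.
auml_name = codepoint2name[ord("ä")]
print(f"&{auml_name};")

# Decoding the references restores the original text.
round_trip = html.unescape(ascii_bytes.decode("ascii"))
```

The file now contains only ASCII bytes, yet a reader that understands entities can still recover the umlauts.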

Whenever you work with XML files, expect these character entities and check your data for them. You should never translate back and forth between regular text and entity-encoded text by hand. Usually, programs and functions capable of reading and writing XML do this for you. Unfortunately, this does not always work, and you may still find these character sequences in your data. To decode a string `s` that contains XML entities within R, you can use `xml_text(read_xml(paste0("<x>", s, "</x>")))` from the xml2 package. For HTML-only named entities such as `&auml;`, which an XML parser does not know, xml2's `read_html()` can be used in place of `read_xml()`.
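The same check-then-decode workflow can be sketched outside R as well. The following Python example is an illustration only: `contains_entities` is a hypothetical helper based on a simple regular expression, and the decoding is delegated to the standard library's `html.unescape` instead of being done by hand.

```python
import html
import re

# Rough pattern for named (&auml;), decimal (&#160;) and hex (&#xA0;) entities.
ENTITY_PATTERN = re.compile(r"&(#\d+|#[xX][0-9A-Fa-f]+|[A-Za-z][A-Za-z0-9]*);")


def contains_entities(value: str) -> bool:
    """Return True if the string still appears to contain character entities."""
    return ENTITY_PATTERN.search(value) is not None


# Hypothetical scraped string with leftover entities.
scraped = "M&auml;rz &amp; April: &quot;5&#160;&lt;&#160;7&quot;"

# Decode with an existing tool only when entities are detected.
decoded = html.unescape(scraped) if contains_entities(scraped) else scraped
print(decoded)
```

A check like this is useful as a sanity test after import: if it still matches after decoding, the data was probably encoded more than once.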

Takeaways:

* When working with data from the web, check for encoded entities
* Don't ever try to manually replace encoded entities. Always use the existing tools for that purpose

