Character entities

When you work with data from the web, such as web scraping results, you may encounter odd character combinations like &amp; or &quot; or &nbsp; or &auml;. These are character entities. They stand in for characters that either have a special meaning in XML or HTML (a closely related markup language) or cannot be represented in the file's character set.

A simple toy XML file would look like this:

<?xml version="1.0" encoding="UTF-8"?>
<message>
  <to>Ramona</to>
  <from>Ben</from>
  <head>Reminder</head>
  <body>I will be home late today</body>
</message>

XML uses tags like <message> or <from> in order to structure the file. However, characters like < or > would confuse the parser that reads these XML files. It wouldn't be able to distinguish between the tags used for structuring the XML file and the characters within the text. That is why certain characters used by XML have to be replaced with their corresponding entity encoding. For example, < in a text becomes &lt;, > becomes &gt; and the ampersand & becomes &amp;.
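
As an illustration of this replacement, here is a minimal Python sketch using xml.etree.ElementTree from the standard library. The element name body and the sample text are made up for the example. The point is only that the XML writer performs the escaping and the parser reverses it.

```python
import xml.etree.ElementTree as ET

# Hypothetical message text containing characters that XML reserves.
original = "1 < 2 & 3 > 2"

body = ET.Element("body")
body.text = original

# Serializing lets the library replace the reserved characters.
serialized = ET.tostring(body, encoding="unicode")
print(serialized)  # <body>1 &lt; 2 &amp; 3 &gt; 2</body>

# Parsing the result turns the entities back into plain characters.
restored = ET.fromstring(serialized).text
print(restored == original)  # True
```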

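In other cases, the XML or HTML file itself has been saved in a character set that does not support certain characters, for example German umlauts (ä, ö, ü). To use them anyway, the creator of the file has to replace them with entities. In HTML, these are the named entities &auml;, &ouml; and &uuml;. Plain XML predefines only five named entities (&lt;, &gt;, &amp;, &quot; and &apos;), so XML files typically use numeric character references such as &#228; for ä instead.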
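
To see what such a replacement looks like, the following Python sketch encodes a sample string into ASCII, a character set without umlauts. The sample word is an assumption chosen for illustration. The xmlcharrefreplace error handler is part of Python's standard codecs and emits numeric character references rather than named entities like &auml;.

```python
# Hypothetical sample text with characters that ASCII cannot represent.
text = "Grüße"

# Characters outside ASCII are written as numeric character references.
encoded = text.encode("ascii", errors="xmlcharrefreplace")
print(encoded)  # b'Gr&#252;&#223;e'
```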

Whenever you work with XML files, expect these character entities and check your data for them. You should never manually translate back and forth between regular text and entity-replaced text. Usually, all programs and functions capable of reading and writing XML do this for you. Unfortunately, this does not always work, and you may still find these combinations of characters in your data. To decode a string s that contains XML entities within R, you can use xml_text(read_xml(paste0("<x>", s, "</x>"))) from the xml2 package. For HTML named entities such as &auml;, which plain XML does not define, read_html() from the same package may be the better fit.
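
For readers working in Python rather than R, a comparable decoding step can be sketched with html.unescape from the standard library. The sample string is invented for illustration. This function decodes HTML named entities as well as numeric character references.

```python
import html

# Hypothetical scraped string that still contains entities.
raw = "M&auml;rz &amp; April: 5 &lt; 10"

# Let the library translate the entities back into characters.
decoded = html.unescape(raw)
print(decoded)  # März & April: 5 < 10
```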

Takeaways:

  • When working with data from the web, check for encoded entities

  • Don't ever try to manually replace encoded entities. Always use the existing tools for that purpose


Last updated 4 years ago
