# Column type violations

Most CSV parsers allow you to specify the data types of each column. Sometimes, they will even guess the data type based on the content of the first few rows. If the parser hits a record where the data does not fit the assumed column type, it will try to implicitly cast to that type. If this attempt fails, most parsers will abort with an error message or insert a missing value.

The latter option is something you never want to happen! Sure, it sucks when your import runs for an while and then fails with an error message, but at least you know that something went wrong. When some implicit type-casting creates a NULL value, you will probably never know that this happened. You might look at the parsed data, see a bunch of NULL values and assume that they were there from the beginning, while it was your ETL pipeline that created them.

Also, this type of automatic type-casting is somewhat error-prone because you are assuming that an entire column follows a certain format, e.g. containing a floating-point number or a certain date format. Like I have emphasized before, you should never leave your assumptions untested.

My preferred way of importing data is to import all columns as text data types (in R the character class) and run a few tests. Usually, I use regular expressions to test if a date always has the same format or if a column always contains a number. This way, you can test your assumptions and easily find out in which row and which column they don't hold up rather than simply getting an error message. If you are going to type-cast an entire column from one data type to another, always do it explicitly! Don't leave it to chance.

Takeaways:

* Avoid implicit typecasting
* Read columns as text and do the typecasting in an explicit step where you can easily check for problem cases
* Check your assumption regarding the data format in each column


---

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://b-greve.gitbook.io/beginners-guide-to-clean-data/common-csv-problems/column-type-violations.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
