Intro to cleaning data

Common data formats

Governments and other sources share data in a wide variety of formats. Here are some of the most common:

DBF: Database format

The database format is frequently used to store and organize large collections of data. Many applications create or access DBF files, and governments occasionally provide raw data this way. The two most common applications used to read DBF files are Microsoft Access (Windows only) and OpenOffice, a free, open source office suite (Mac, Windows and Linux). Excel and Google Docs cannot read them.

And while the term database may sound scary, the data can usually be read as a two-dimensional grid that can be edited in a spreadsheet.

database

 

CSV: Comma separated values

CSV files may be the most widely distributed data files from governments and can be read into almost every spreadsheet application. A CSV file is a plain text file in which the cells of each row are separated by commas and each row ends with a line break. The comma is the "delimiter." The file above would look like this:

csv example
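As a rough illustration of how a program reads this format, here is a minimal Python sketch using the standard csv module. The sample values and the names SAMPLE_CSV and read_csv_rows are made up for this example and are not taken from the file shown here. Note that a cell containing a comma has to be wrapped in quotes so the comma is not treated as a delimiter.

```python
import csv
import io

# Hypothetical sample data, invented for illustration only.
# The first line is the header row; the quoted cell contains a comma.
SAMPLE_CSV = 'name,city,amount\n"Smith, Jane",Austin,1200\nLee,Dallas,950\n'


def read_csv_rows(text):
    """Parse CSV text into a list of dictionaries keyed by the header row."""
    # csv.DictReader uses the first row as field names and
    # handles quoted cells that contain commas.
    return list(csv.DictReader(io.StringIO(text)))


csv_rows = read_csv_rows(SAMPLE_CSV)
for row in csv_rows:
    print(row["name"], "|", row["city"], "|", row["amount"])
```

Every value comes back as text, so numbers such as the amount column would still need to be converted before doing arithmetic.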

 

TSV: Tab separated values

The TSV format is another text file, but cells are separated with tabs instead of commas.

TSV example
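Reading a TSV file in code is nearly the same as reading a CSV file; only the delimiter changes. The sketch below uses Python's standard csv module with delimiter set to a tab. The sample values and the names SAMPLE_TSV and read_tsv_rows are hypothetical.

```python
import csv
import io

# Hypothetical sample data; "\t" is the tab character between cells.
SAMPLE_TSV = "name\tcity\tamount\nSmith\tAustin\t1200\nLee\tDallas\t950\n"


def read_tsv_rows(text):
    """Parse tab-separated text into a list of rows (each row is a list of cells)."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    return list(reader)


tsv_rows = read_tsv_rows(SAMPLE_TSV)
tsv_header, tsv_records = tsv_rows[0], tsv_rows[1:]
print(tsv_header)
print(tsv_records)
```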

 

Fixed width

Fixed width is also a text file, but it differs from CSV and TSV. Instead of using a delimiter, each column is given a set number of characters, and shorter entries are padded with spaces to create a neatly aligned grid. The example below replaces the spaces with other characters to better illustrate the format. Excel and OpenOffice can open these files.

fixed width example
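Because there is no delimiter, a program reading a fixed width file needs to know where each column starts and ends. The following Python sketch assumes a made-up layout (10 characters for name, 10 for city, 6 for amount); in practice the layout usually comes from documentation supplied with the data. The names COLUMNS, SAMPLE_FIXED and parse_fixed_width are hypothetical.

```python
# Hypothetical column layout: (field name, start position, end position).
COLUMNS = [("name", 0, 10), ("city", 10, 20), ("amount", 20, 26)]

# Build two sample lines padded with spaces to match the layout above.
# Text is left-aligned, the number is right-aligned.
SAMPLE_FIXED = "\n".join(
    [
        "Smith".ljust(10) + "Austin".ljust(10) + "1200".rjust(6),
        "Lee".ljust(10) + "Dallas".ljust(10) + "950".rjust(6),
    ]
)


def parse_fixed_width(text, columns):
    """Slice each line at the given positions and strip the padding spaces."""
    records = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        records.append(
            {name: line[start:end].strip() for name, start, end in columns}
        )
    return records


fixed_records = parse_fixed_width(SAMPLE_FIXED, COLUMNS)
print(fixed_records)
```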

 

XML: Extensible markup language

XML arranges data in a hierarchy, similar to the way HTML works. Each row of data is enclosed in opening and closing tags, and each value in the row is enclosed in its own set of tags named after the column headers. This format is useful for exporting data for use in web pages. Excel will open XML files with some success.

XML
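To show how the row and column tags map back to a grid, here is a small Python sketch using the standard xml.etree.ElementTree module. The tag names (rows, row, name, city, amount) and the values are assumptions chosen for this example; a real file will use whatever tags its publisher picked.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: one <row> element per record, with one child
# element per column, named after the column header.
SAMPLE_XML = """<rows>
  <row><name>Smith</name><city>Austin</city><amount>1200</amount></row>
  <row><name>Lee</name><city>Dallas</city><amount>950</amount></row>
</rows>"""


def xml_to_records(text):
    """Turn each <row> element into a dictionary of column name -> value."""
    root = ET.fromstring(text)
    return [
        {child.tag: child.text for child in row}
        for row in root.findall("row")
    ]


xml_records = xml_to_records(SAMPLE_XML)
print(xml_records)
```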

 

JSON: JavaScript Object Notation

Like XML, JSON is primarily used to export data for use with JavaScript in web pages and applications. Google Refine can open some JSON files.

json example
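JSON can also be read outside of JavaScript. As a minimal sketch, the Python standard json module turns JSON text into ordinary lists and dictionaries. The sample records and the name SAMPLE_JSON are invented for illustration. Unlike the text formats above, JSON distinguishes numbers from text, so the amounts can be added up without conversion.

```python
import json

# Hypothetical sample: a list of records, each with the same three fields.
SAMPLE_JSON = """[
  {"name": "Smith", "city": "Austin", "amount": 1200},
  {"name": "Lee", "city": "Dallas", "amount": 950}
]"""

# json.loads converts the text into a Python list of dictionaries.
json_records = json.loads(SAMPLE_JSON)

# The amounts are already numbers, so they can be summed directly.
json_total = sum(record["amount"] for record in json_records)
print(json_records[0]["name"], json_total)
```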