Intro to cleaning data
Common data formats
Governments and other sources share data in a wide variety of formats. Here are some of the most common:
DBF: Database format
The database format is frequently used to store and organize large collections of data. Many applications create or access DBF files, and governments occasionally provide raw data in this format. The two most common applications used to read DBF files are Microsoft Access (Windows only) and OpenOffice, a free, open source office suite (Mac, Windows and Linux). Excel and Google Docs cannot read them.
And while the term database may sound scary, the data can usually be read as a two-dimensional grid that can be edited in a spreadsheet.

CSV: Comma separated values
CSV files may be the most widely distributed data files from governments and can be read into almost every spreadsheet application. A CSV file is plain text in which each cell is separated by a comma and each row ends with a line break. The comma is called the "delimiter."
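As a rough illustration, here is a minimal Python sketch of what a small CSV file looks like and how it can be read back into rows and cells. The city names and numbers are invented sample data, and Python's standard csv module is only one of many tools that can read such a file.

```python
import csv
import io

# Invented sample data. A real file would be opened with
# open("some_file.csv", newline="") instead of io.StringIO.
csv_text = (
    "city,state,population\n"
    "Springfield,IL,1000\n"
    "Shelbyville,IL,2500\n"
)

# csv.reader splits each line into cells at the comma delimiter.
csv_rows = list(csv.reader(io.StringIO(csv_text)))
csv_header, csv_records = csv_rows[0], csv_rows[1:]

print(csv_header)
for record in csv_records:
    print(record)
```

Note that every cell comes back as text, so numbers such as the population column still need to be converted before doing any arithmetic.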

TSV: Tab separated values
The TSV format is another plain text format, but cells are separated with tabs instead of commas.
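A short Python sketch, again with invented sample data, shows the same idea with a tab as the delimiter. One practical consequence, assuming the data itself contains no tabs, is that commas inside a cell need no special handling.

```python
import csv
import io

# Invented sample data; "\t" is the tab character that separates cells.
tsv_text = (
    "city\tstate\tpopulation\n"
    "Springfield, East\tIL\t1000\n"
    "Shelbyville\tIL\t2500\n"
)

# The same csv.reader works for TSV once the delimiter is set to a tab.
tsv_rows = list(csv.reader(io.StringIO(tsv_text), delimiter="\t"))

for row in tsv_rows:
    print(row)
```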

Fixed width
Fixed width is also a plain text format, but it differs from CSV and TSV. Instead of using a delimiter, each entry is padded with spaces to a set width, which creates a neatly aligned grid. Excel and OpenOffice can open these files.
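The sketch below, in Python, builds two fixed width lines and then cuts them back into cells by position. The column names, the widths (12, 4 and 6 characters) and the sample values are all assumptions made up for this illustration; a real fixed width file normally comes with documentation that lists its own column widths. When printing, the spaces are replaced with dots so the padding is visible.

```python
# Hypothetical layout: (column name, width in characters).
FIELDS = [("city", 12), ("state", 4), ("population", 6)]

def format_row(values):
    """Pad each value with spaces to its column width and join them."""
    return "".join(str(v).ljust(width) for v, (_, width) in zip(values, FIELDS))

def parse_row(line):
    """Slice a line at the known column positions and trim the padding."""
    record, start = {}, 0
    for name, width in FIELDS:
        record[name] = line[start:start + width].strip()
        start += width
    return record

# Invented sample data.
fixed_lines = [
    format_row(("Springfield", "IL", 1000)),
    format_row(("Shelbyville", "IL", 2500)),
]

for line in fixed_lines:
    print(line.replace(" ", "."))  # dots stand in for the padding spaces

fixed_records = [parse_row(line) for line in fixed_lines]
print(fixed_records)
```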

XML: Extensible markup language
XML arranges data in a hierarchy, similar to the way HTML works. Each row of data is wrapped in opening and closing tags, and each value in the row is wrapped in another set of tags named after the column headers. This format is useful for exporting data for use in web pages. Excel will open XML files with some success.
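To make the tag structure concrete, here is a minimal Python sketch that reads a small XML document back into rows. The tag names (rows, row, city, state, population) and the values are invented for this example; real XML exports choose their own tag names, so the code would need to be adjusted to match.

```python
import xml.etree.ElementTree as ET

# Invented sample data: one <row> element per record, with one
# child tag per column.
xml_text = """<rows>
  <row><city>Springfield</city><state>IL</state><population>1000</population></row>
  <row><city>Shelbyville</city><state>IL</state><population>2500</population></row>
</rows>"""

root = ET.fromstring(xml_text)

# Turn each <row> into a dictionary of column name -> cell text.
xml_records = [
    {cell.tag: cell.text for cell in row}
    for row in root.findall("row")
]

for record in xml_records:
    print(record)
```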

JSON: JavaScript Object Notation
Like XML, JSON is primarily used to export data for use with JavaScript in web pages and applications. Google Refine can open some JSON files.
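For comparison, here is a minimal Python sketch that reads a small JSON document and flattens it into a grid of rows and cells. It assumes the JSON is a simple list of records that all share the same keys, which is common but by no means guaranteed; the sample values are invented.

```python
import json

# Invented sample data: a list of records, one object per row.
json_text = """[
  {"city": "Springfield", "state": "IL", "population": 1000},
  {"city": "Shelbyville", "state": "IL", "population": 2500}
]"""

json_records = json.loads(json_text)

# Use the keys of the first record as column headers, then build
# one list of cells per record in that same column order.
json_header = list(json_records[0])
json_grid = [[record[key] for key in json_header] for record in json_records]

print(json_header)
for row in json_grid:
    print(row)
```

Unlike the text formats above, JSON keeps the difference between numbers and text, so the population values come back as numbers rather than strings.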


