intro to cleaning data
Importing CSV files
There is a proper way to import a text data file into Excel or OpenOffice. We're going to work with a CSV file but the same applies to any kind of delimited or fixed width text file. This does not currently work with Google Docs.
Download the data-import.csv file.
Open a new spreadsheet and go to the File menu and select Import. A new window will open. Select CSV and click Import.
Navigate to the file you downloaded and select it. This opens an import wizard. Make sure that Delimited is checked and click Next
Make sure that only the Comma delimiter is checked and click Next.
This takes you to the window where the important work happens. Click the Advanced button and make sure that the Decimal separator is a period and that the Thousands separator is blank. If you don't change the thousands separator, Excel will add commas which you will have to remove if you want to export the completed file as a CSV.
Now look at the Column data format choices. General is checked by default and applied to each of the columns. You can see this reflected in the Data preview.
The General setting does a pretty good job of distinguishing between text and numbers and will work in most cases.
An important exception is where text columns are made of numbers. Scroll over in the preview area until you see the County FIPS column. "001" is the standard federal numeric identifier for Baker County in Oregon. Think of it as an abbreviation to save space — it's really a label.
When numbers are imported into Excel, extra zeros are deleted. The General setting results in a State column that drops the zero and displays a "1." But without the zeros, the column will no longer match up with a Census GIS map file in a process called Joining.
This makes it very important to retain leading zeros, so click on the County column and change the Column data format to Text. Now Excel will keep the zeros on import.
You can also choose to select a column and Skip it. This is valuable with very large data sets but I prefer to import everything when possible.
Click finish. Excel will ask you where to save it. Use the default setting and you should get something that resembles this: