intro to cleaning data

Cleaning data

The most important rule of cleaning data is to always work on a copy. So save a copy. I also like to work in a specific order but you may find something that works for you. I suggest starting with tasks that don't require functions.

Delete empty rows

Click on the row label (a number) to select a row. You can hold down the shift or control keys to select multiple rows. Then right-click or control-click and select Delete. 

Delete cells that contain notes, etc.

Get rid of special characters

This includes &,!,~, html tags or any other junk. Start with the Find command (cmnd-f) and Replace with another value. When in doubt, replace with an under-stroke or hyphen. You can also leave the Replace field empty to delete characters.

Check Spelling

Spelling is important if you plan to join your data with a GIS mapping file - a single mistake can make the whole thing fail. 

Go to the Tools menu and select Spelling. This has to be repeated on each sheet. If there are words that aren't in your dictionary, add them. It's OK to add common header labels like GEOID (Shows up in every census file).  

Capitalization and abbreviations can create the same problem. Convert non-standard abbreviations using the Find and replace tool 

Format number cells

If you plan to export your clean data as a CSV file you must remove all commas from cells that contain numbers. If you don't, those commas will break your data into extra cells. 

Select the cells that you need to change.  Go to the Format menu and select Cells to open a window. Click the Numbers tab and under Category, select Number. Set the number of decimal places and make sure "Use 1000 separator" is unchecked then click OK. 

Copy all the data and paste it on a new sheet 

Delete columns and rows to narrow your data set

Manipulate your data with functions 

Save this step for last. That way you are working on just the data you need.

Copy the resulting data into a new sheet. Use the Paste Special command to preserve the values and get rid of the formulas.

Export the sheet as a CSV file