Tutorial: Intro to cleaning data
Understanding how to clean data is an important skill every reporter needs. Demographic, financial and other data is available on a city, county, state and national level in the United States.
But understanding how to take a large data file and distill it into a usable form can be daunting.
In this tutorial, you'll learn how spreadsheets work, basic data-cleaning workflow and how to use formulas and functions to clean data. This is a general tutorial and it doesn't delve deeply into one program. We'll use Microsoft Excel but most of the same techniques work in Google Spreadsheets and other programs.
I will expand on this tutorial as time permits. If there is something you would like to see included, please leave a suggestion in the comments.
This tutorial does not cover analysis. That can be more effectively done in Google Refine. That tutorial is in the works, but until it's available check out this tutorial from ProPublica.
How spreadsheets work
Spreadsheets with rows and rows of data can be intimidating. But when you break them down to their essential parts they are really quite simple. Let's look at a blank spreadsheet:
Cells are arranged in columns and rows. All columns are identified by a letter and rows are identified by a number. This creates a map that IDs every cell in the spreadsheet. Cell A1 is in the top left corner and cell B2 is down and to the right.
This is critically important and what makes spreadsheets powerful tools. If we know the ID of a cell or range of cells, we can manipulate values without having to manually type them in.
Type 2 in cell A1. Next, go to cell C2 and type =A1 and hit Return. The result is that cell C2 contains the number 2.
The equal ( = ) sign is special and tells Excel that instead of containing text or a number, this cell contains a set of instructions (usually called a formula or function). Notice that cell A1 is highlighted and the formula appears in the Formula Bar. You can write the formula in either place.
Exercise 2: Simple math
We can use formulas to perform simple math.
Type 2 in cell A2. Next, go to cell C3 and type =A1+A2 and hit Return. Notice that cell C3 contains the number 4
Replace the addition ( + ) symbol with the minus ( - ) symbol and the result is 0. Replace with the multiplication symbol ( * - asterisk) and the division symbol ( / - forward slash) to see more results.
Exercise 3: Calculate percentage, percent change
You can also use this to perform more complicated math. If you want to calculate a percentage click on cell C4 and type =A2/A1*100 and the result is 100. This translates to =smaller number/larger number *100 (to move the decimal two places).
Change the value of cell A1 to 4 and the result in cell C4 updates to 50. All the previous formulas update as well.
Percent change is a very useful formula to know.
Click on cell C5 and type =(A2-A1)/A1*100. The result is 100 percent. This translates to =(new value-old value)/old value*100 (to move the decimal two places).
When values are in parenthesis, they are calculated before being fed into the formula. So in this case the result of (A2-A1) equals -2. -2 is then divided by A1 (4) and then multiplied by 100 for a result of -50%.
Formulas are great but what if you want to work with multiple cells? You could write something like=A1+A2+A3+A4+A5+A6 but that quickly gets out of hand. The other way to do this is to use built in functions. Functions are essentially built-in formulas and are common to almost every spreadsheet program.
Type 2 in cell A3 and type 2 in cell A4. Next, go to cell C6, type =SUM(A1:A4) and hit Return.The result should be 10.
The colon is used to define a range. This allows us to avoid typing in every cell that we want to add.
SUM is the name of the function. You can replace it with AVERAGE and MEDIAN to get different values.
Also, most spreadsheet programs will give you hints when you want to type in a function. A list will update as you type. To choose from the list simply click on the one you want and it will be added to the cell.
Functions can also fix common data problems. Let's look at two examples.
When cleaning data, it's best to end up with a single row that labels your data columns. This is called a header row. But many sources release data that includes 2, 3, 4, or 5 header rows. This is pretty easy to fix with cut and paste if your data has only a few columns. But cut and paste takes too long with a large data set. That's where functions come in.
Download and open the data-cleaning.xls file. There are three header rows followed by three blank rows.
Type =CONCATENATE(D1,D2,D3) in cell D4 and hit return.
That takes all the text and puts it in one cell that reads Violentcrimerate.
We need to add spaces between words to make the text readable. To do that we include &" " (a space between the quotes). Change your formula so that it looks like this: =CONCATENATE(D1&" ",D2&" ",D3) to add spaces between the words.
There are two ways of applying a formula to other cells in a row. The first is to click once in the cell with the formula and hover your mouse over the bottom left corner until the cursor changes to a black cross.
Next, click and drag to the left (or right).
When you release the mouse, the formula updates for each column. Disregard the error warnings. we'll get to those in a moment.
Now lets apply the formula to the entire row. Select cell A4, then press and hold the Shift key and select T4. That will select the entire row. Now press Control-r. That applies the formula to all the selected cells as long as the first cell has a formula.
TIP: To do the same thing with a column, hit Control-d. In that case, the top-most cell must have a formula.
Keyboard shortcut: To select a row of data, click in the first cell then hit control-shift-right arrow. You can also use the down, left and up arrows.
Nice, huh? CONCATENATE has a lot of uses, including combining columns of data into a single field. For example, addresses separated into columns for street, city and state can be concatenated into a single column for easier geocoding with Google Maps.
Now let's address those errors. Look at cell A1 — it's empty and that has some ramifications. Click once on cell A4 and the error message tells us that the formula references empty cells. And since we added spaces to those empty cells, our Year label has two extra spaces in front of it.
We'll use the TRIM function to get rid of that. TRIM looks for spaces at the beginning and end of a cell and deletes them.
Click on cell A5 and type =TRIM(A4).
Once you see the spaces are gone, select cell A5, then press control-shift-right arrow (or shift click cell T5) to select the row. Type control-r to apply the formula and Voila!
But wait, it gets better.
You can combine some functions. In this case, you can TRIM and CONCATENATE at once by typing =TRIM(CONCATENATE(A1&" ",A2&" ",A3)). This can be a great time saver.
Save your work.
Now a final bit of clean up. We have a properly formatted header row and need to get rid of all the extra source header rows. If we delete them as shown below, we get error messages. That's because our good header row uses functions that are looking for cells that no longer exist.
We get around this with a little bit of Excel magic called Paste Special.
Click once on the Row 5 label to select the row and copy it to your clipboard. Then click on the Row 6 label to select it. Go to the Edit menu and select Paste Special. A window will open with a bunch of options. Find the Paste area and select Values.
Click OK and the values are pasted into the cells. The equivalent command in Google Spreadsheets is Paste Values.
The functions are gone so you can now delete rows 1-5. Select the first five rows, go to the Edit menu and select Delete
Common data formats
Goverments and other sources share data in a wide variety of formats. Here are some of the most common:
DBF: Database format
The database format is frequently used to store and organize large collections of data. Many applications create or access dbf files and occasionally governments will provide raw data in this way. The two most common applications use to read dbf files are Microsoft Access (Windows only) and OpenOffice, a free, open source office suite (Mac, Win and Linux). Excel and Google Docs cannot read them.
And while the term database may sound scary, the data can usually be read as a two-dimensional grid that can be edited in a spreadsheet.
CSV: Comma separated value
CSV files may be the most widely distributed data files from governments and can be read into almost every spreadsheet application. The data is a text file where each cell has been separated by a comma and each row ends in a paragraph return. The comma is a "delimiter." The file above would look like this:
TSV: Tab separated values
The TSV format is another text file. But instead of using a comma, cells are separated with tabs.
Fixed width is also a text file but it differs from CSV and TSV. In this case, each entry is separated by spaces to create a nicely aligned grid. The example below replaces the spaces with other characters to better illustrate the format. Excel and OpenOffice can open these files.
XML: Extensible markup language
XML arranges data in a hierarchy similar to the way HTML works. Each row of data is described by opening and closing tags. The data in each row is described by another set of tags taken from the column headers. This format is useful to export data to for use in web pages. Excel will open XML files with some success.
JSON: JavsScript Object Notation
Importing CSV files
There is a proper way to import a text data file into Excel or OpenOffice. We're going to work with a CSV file but the same applies to any kind of delimited or fixed width text file. This does not currently work with Google Docs.
Download the data-import.csv file.
Open a new spreadsheet and go to the File menu and select Import. A new window will open. Select CSV and click Import.
Navigate to the file you downloaded and select it. This opens an import wizard. Make sure that Delimited is checked and click Next
Make sure that only the Comma delimiter is checked and click Next.
This takes you to the window where the important work happens. Click the Advanced button and make sure that the Decimal separator is a period and that the Thousands separator is blank. If you don't change the thousands separator, Excel will add commas which you will have to remove if you want to export the completed file as a CSV.
Now look at the Column data format choices. General is checked by default and applied to each of the columns. You can see this reflected in the Data preview.
The General setting does a pretty good job of distinguishing between text and numbers and will work in most cases.
An important exception is where text columns are made of numbers. Scroll over in the preview area until you see the County FIPS column. "001" is the standard federal numeric identifier for Baker County in Oregon. Think of it as an abbreviation to save space — it's really a label.
When numbers are imported into Excel, extra zeros are deleted. The General setting results in a State column that drops the zero and displays a "1." But without the zeros, the column will no longer match up with a Census GIS map file in a process called Joining.
This makes it very important to retain leading zeros, so click on the County column and change the Column data format to Text. Now Excel will keep the zeros on import.
You can also choose to select a column and Skip it. This is valuable with very large data sets but I prefer to import everything when possible.
Click finish. Excel will ask you where to save it. Use the default setting and you should get something that resembles this:
The most important rule of cleaning data is to always work on a copy. So save a copy. I also like to work in a specific order but you may find something that works for you. I suggest starting with tasks that don't require functions.
Delete empty rows
Click on the row label (a number) to select a row. You can hold down the shift or control keys to select multiple rows. Then right-click or control-click and select Delete.
Delete cells that contain notes, etc.
Get rid of special characters
This includes &,!,~, html tags or any other junk. Start with the Find command (cmnd-f) and Replace with another value. When in doubt, replace with an under-stroke or hyphen. You can also leave the Replace field empty to delete characters.
Spelling is important if you plan to join your data with a GIS mapping file - a single mistake can make the whole thing fail.
Go to the Tools menu and select Spelling. This has to be repeated on each sheet. If there are words that aren't in your dictionary, add them. It's OK to add common header labels like GEOID (Shows up in every census file).
Capitalization and abbreviations can create the same problem. Convert non-standard abbreviations using the Find and replace tool
Format number cells
If you plan to export your clean data as a CSV file you must remove all commas from cells that contain numbers. If you don't, those commas will break your data into extra cells.
Select the cells that you need to change. Go to the Format menu and select Cells to open a window. Click the Numbers tab and under Category, select Number. Set the number of decimal places and make sure "Use 1000 separator" is unchecked then click OK.
Copy all the data and paste it on a new sheet
Delete columns and rows to narrow your data set
Manipulate your data with functions
Save this step for last. That way you are working on just the data you need.
Copy the resulting data into a new sheet. Use the Paste Special command to preserve the values and get rid of the formulas.
Export the sheet as a CSV file
This page is a synopsis of the last sections, in the order that I prefer to work.
- If importing CSV, retain leading zeros
- Make sure data was imported properly and column data matches the column headers
- Save a copy immediately
- Delete blank rows within the data
- Consolidate Column headers to a single row
- Format all integer columns to eliminate commas
- Find and replace special characters (&, ! , ~, etc) When in doubt, replace with a space, under-stroke or hyphen
- Check spelling (especially in columns you plan to Join)
- Check capitalization (especially in columns you plan to Join)
- Check abbreviations (especially in columns you plan to Join)
- Keep totals from original data to check against. This ensures data is not lost.
- Manipulate data in columns last.
- Delete what you don’t need
|Get data or apply function to data from a different sheet||=Sheet1!A1|
|Apply function to a column (shift-select cells, function must be in top cell)||Control-d|
|AVERAGE||Calculates the average||=AVERAGE(A2:A10)|
|MEDIAN||Finds the value (number) in the middle of a range.||=MEDIAN(A2:A10)|
|Calculate percent (A2 is the smaller number)||=A2/B2*100|
|Calculate percent change (B2 is the new number, A2 is the old number)||=((B2/A2)-1)*100
||Check for duplicates.||=EXACT(A2:A10)
|COUNTIF||Counts the number of cells where a given value appears. (range, criteria) Criteria can be a number or text that appears in cells. This exmple uses a ZIP Code.||=COUNTIF(A2:A10, 94705 )
||Removes spaces from beginning and end of text.||=TRIM(A2:A10)|
||Removes all non-printable characters from text.||=CLEAN(A2:A10)
||Joins text items into one text item — (&” “] adds a space between cells
||=CONCATENATE(A2&" ", B2)
||Converts text to lowercase.
|UPPER||Converts text to uppercase.
||Capitalizes the first letter in each word of a text value.
||Formats a number as text with a fixed number of decimals.
|MID||Returns a specific number of characters from a text string starting at the position that you specify. Useful for converting nine-diget ZIP codes. (Cell, Start position, number of chars)
||Returns the rightmost characters from a text value.
|LEFT||Returns the leftmost characters from a text value.
||Replaces or add characters (For example, adding digits back to ZIP code).
||Substitutes old text with new text.
||=SUBSTITUTE(A2:A10, ”Oregon”, “OR”)|
About this Tutorial
This tutorial was written for instruction given during Knight Digital Media Center workshops and an Intro to Data Visualization course at the U.C Berkeley Graduate School of Journalism.
This content may not be republished in print or digital form without express written permission from KDMC. Please see our Content Redistribution Policy at kdmc.berkeley.edu/license.