intro to cleaning data

Using functions

Formulas are great but what if you want to work with multiple cells? You could write something like=A1+A2+A3+A4+A5+A6 but that quickly gets out of hand. The other way to do this is to use built in functions. Functions are essentially built-in formulas and are common to almost every spreadsheet program. 

Type 2 in cell A3 and type 2 in cell A4. Next, go to cell C6, type =SUM(A1:A4) and hit Return.The result should be 10. 

Calculate SUM in Excel

The colon is used to define a range. This allows us to avoid typing in every cell that we want to add. 

SUM is the name of the function. You can replace it with AVERAGE and MEDIAN to get different values.

Also, most spreadsheet programs will give you hints when you want to type in a function. A list will update as you type. To choose from the list simply click on the one you want and it will be added to the cell.

Functions can also fix common data problems. Let's look at two examples. 

CONCATENATE

When cleaning data, it's best to end up with a single row that labels your data columns. This is called a header row. But many sources release data that includes 2, 3, 4, or 5 header rows.  This is pretty easy to fix with cut and paste if your data has only a few columns. But cut and paste takes too long with a large data set. That's where functions come in.

Download and open the data-cleaning.xls file. There are three header rows followed by three blank rows. 

Type =CONCATENATE(D1,D2,D3) in cell D4 and hit return. 

concatenate function

That takes all the text and puts it in one cell that reads Violentcrimerate.

We need to add spaces between words to make the text readable. To do that we include &" " (a space between the quotes). Change your formula so that it looks like this: =CONCATENATE(D1&" ",D2&" ",D3) to add spaces between the words. 

concatenate with spaces

 

There are two ways of applying a formula to other cells in a row. The first is to click once in the cell with the formula and hover your mouse over the bottom left corner until the cursor changes to a black cross.

Apply function to neighboring cells

 

Next, click and drag to the left (or right).

apply function to neighboring cells

 

When you release the mouse, the formula updates for each column. Disregard the error warnings. we'll get to those in a moment.

Apply function to neighboring cells

 

Now lets apply the formula to the entire row. Select cell A4, then press and hold the Shift key and select T4. That will select the entire row. Now press Control-r. That applies the formula to all the selected cells as long as the first cell has a formula.

TIP: To do the same thing with a column, hit Control-d. In that case, the top-most cell must have a formula.

Keyboard shortcut: To select a row of data, click in the first cell then hit control-shift-right arrow. You can also use the down, left and up arrows. 

Nice, huh? CONCATENATE has a lot of uses, including combining columns of data into a single field. For example, addresses separated into columns for street, city and state can be concatenated into a single column for easier geocoding with Google Maps.

TRIM

Now let's address those errors. Look at cell A1 — it's empty and that has some ramifications. Click once on cell A4 and the error message tells us that the formula references empty cells. And since we added spaces to those empty cells, our Year label has two extra spaces in front of it.

We'll use the TRIM function to get rid of that. TRIM looks for spaces at the beginning and end of a cell and deletes them.

Click on cell A5 and type =TRIM(A4).  

trim function

Once you see the spaces are gone, select cell A5, then press control-shift-right arrow (or shift click cell T5) to select the row. Type control-r to apply the formula and Voila! 

But wait, it gets better. 

You can combine some functions. In this case, you can TRIM and CONCATENATE at once by typing =TRIM(CONCATENATE(A1&" ",A2&" ",A3)). This can be a great time saver.

Save your work.

Paste Special

Now a final bit of clean up. We have a properly formatted header row and need to get rid of all the extra source header rows. If we delete them as shown below, we get error messages. That's because our good header row uses functions that are looking for cells that no longer exist. 

missing data

We get around this with a little bit of Excel magic called Paste Special. 

Click once on the Row 5 label to select the row and copy it to your clipboard. Then click on the Row 6 label to select it. Go to the Edit menu and select Paste Special. A window will open with a bunch of options. Find the Paste area and select Values.

paste special

Click OK and the values are pasted into the cells. The equivalent command in Google Spreadsheets is Paste Values.

The functions are gone so you can now delete rows 1-5. Select the first five rows, go to the Edit menu and select Delete