Sunday, June 22, 2008

Documenting some chart making in Open Office and Excel.

This post may be hard to follow but it is for myself to document my techniques and not loose them. If you learn from it you may find it useful for your spreadsheet work. It is also a way for me to practice for work and to push my spreadsheet abilities for data handeling.



I downloaded some data that was free. There were only 5 years covered. I have been using years as my X axis for the past two weeks. The data was divided by age groups and I only downloaded 5 year age bands from 45 to 89 and then 90 and over. It was for cause of death data and is published annually. These vital statistics are free for download. I have around 27 causes picked out. I had the data for both sexes and for men and for women. So for each year I had a column and for each age band and each of the 3 gender categories I had the 27 causes in rows. All in all about 750 rows of data.

I downloaded that data into a spreadsheet. I then created a new sheet in that workbook. Editorial Note: When one downloads data it seems the workbook has only the one sheet unlike when one creates a new workbook with three sheets. In the new sheet I copied totals for cause of death for the same age bands and the same gender categories but from all causes of death in Canada. This was copied from a separate download. These two sheets were then not changed and after this step; data were only copied out from these two download sheets. Then in a third sheet I made my own row labels for the genders and age bands matching the second sheet's row ordering. The labels from the database were lengthy and unreadable in the width of my spreadsheet window and the differences between the labels could not be viewed easily. Editorial note: the labels were so long not even a 30 inch monitor would display them completely I am guessing. These new labels were shorter and after much copy and pasting I had made all some 31 rows labels by age bands and gender categories. They were in the same row ordering as the downloaded deaths by all causes. I then made five year columns after these row labels in the first column and copied the totals deaths by all causes by years data to this sheet.

Then I went back to a copy of the death by specific causes download in separate workbook and saved a copy of that download as a text file or comma separated file. I then closed it and imported it. I imported the data as a text file. This is one of my data handling tricks that could also be done with regular expressions if one was using a programming or scripting language. The picking of the appropriate delimiters is the trick. I chose delimiters during the import that would cut the long row labels in this file so that the specific causes' names, age band's names and gender names, were all brought in to this new spreadsheet in separate columns. Actually I used three columns and then copied these labels into a new fourth sheet in my main workbook. The row order for the 750 rows was preserved.

I then went back to the first sheet in the main workbook with all the cause specific data and copied all the data but not the labels into the new fourth sheet. So sheet four was the same as sheet one, but I now had three columns for labels and the labels were short and readable on the screen. I then made new rows in sheet four at the top of the beginning of each specific gender category (both sexes, males, females) and the beginning of each age band (5 year age bands 45-89 and the age band 90 and over). So when for example, the data for both sexes, 45 to 49,and 27 rows of specific causes of death started in the sheet, a new row started the 27 rows for both sexes, 45 to 49 years old. Then going back to sheet three with the total causes of death by genders and age bands, I added two more columns between the labels and the yearly totals columns to make a proper column alignment with the three column labels in sheet four. Then one at a time, the new rows I had created in sheet four were pasted with the rows from the 31 age and gender specific death by all causes data and short labels for gender and age bands from the third sheet.

I was now ready to calculate some results. I created formulas for counts for specific causes over total causes and then made this number a percent. I used the real number for the total but for the specific cause count I used a cell reference. I then filled down the columns into the 27 specific causes effecting the calculation for all five years and all 27 causes for the first set of data which was 45 to 49 both sexes. This was the only data I worked through today. I still had these 27 cause specific rows by 5 years columns of data highlighted after the fill so I then used format cells to change all the numbers to percents with 6 decimal places.

I then charted all non-zero rows basically following the charting techniques in Excel I have learned at work but I did it in Open Office. Somethings are done differently in Open Office charts. I then also did a counts chart that matched the percents charts but makes no reference to total deaths. After a long break, I spent some time formating the charts for printing and printed the two charts out.

I then saved and closed all the workbooks. I thought it would be a good idea and relaxing and good for my intellectual health to document all my work so I have posted this notes.

No comments: