Friday, March 18, 2005

I worked hard this morning on my third assignment in data mining. After looking at it again later in the evening, I finally got the data read in. I was able to do summary statistics but not histograms; the number of variables is too large to plot. I need to divide the data for 10-fold cross-validation. I must also do some dimension reduction.
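For the 10-fold split, a minimal sketch in Python might look like the one below. The function names, the fixed seed, and the row count of 105 are placeholders of mine, not anything from the assignment; the idea is only to shuffle the row indices once and deal them into 10 nearly equal folds, each of which takes a turn as the test set.

```python
import random


def k_fold_indices(n_rows, k=10, seed=0):
    """Split row indices 0..n_rows-1 into k shuffled, nearly equal folds."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index starting at position i.
    return [indices[i::k] for i in range(k)]


def train_test_splits(n_rows, k=10, seed=0):
    """Yield one (train, test) pair of index lists per fold."""
    folds = k_fold_indices(n_rows, k, seed)
    for i, test in enumerate(folds):
        # Training rows are all rows that are not in the held-out fold.
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# Placeholder row count, chosen only to show uneven fold sizes.
splits = list(train_test_splits(n_rows=105, k=10))
```

Each row lands in exactly one test fold, so every observation is used for testing once and for training nine times.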

The problem is email spam. I could use a Principal Components reduction on subsets of the variables. There are variables based on word and character frequency that could be extracted and broken down. There is also a series of capitals variables that could be reduced by PC analysis. Then I might have only 9 dimensions or 9 variables: 3 sets of 3 principal components. It may even end up smaller.
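A rough sketch of the block-wise idea, assuming NumPy is available. The data here is random filler, and the block sizes (20 word columns, 6 character columns, 4 capitals columns) are made up for illustration; the real column layout would come from the assignment data. Each block is centered and projected onto its first 3 principal components, and the three results are stacked side by side.

```python
import numpy as np


def top_components(block, n_components=3):
    """Project one block of columns onto its leading principal components."""
    centered = block - block.mean(axis=0)
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    k = min(n_components, vt.shape[0])
    return centered @ vt[:k].T


# Placeholder data: 200 rows, 30 columns of random numbers (not real spam data).
rng = np.random.default_rng(0)
X = rng.random((200, 30))

# Hypothetical column ranges for the three variable groups.
blocks = {
    "word": slice(0, 20),
    "char": slice(20, 26),
    "capitals": slice(26, 30),
}

# 3 components per block, stacked into one 9-column matrix.
reduced = np.hstack([top_components(X[:, cols]) for cols in blocks.values()])
```

This version only centers each block. If the variables within a block sit on very different scales, dividing each column by its standard deviation before the SVD is a common alternative.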
