Wednesday, December 24, 2008

Adding up prices on a web page.

Task: total prices of a web page list.
Goal: to price the cost of owning every O'Reilly pocket reference and guide.
Software needed:
A web browser
A spreadsheet
R software
a text editor.

Find a well formated web price list such as
http://oreilly.com/store/series/pocketrefs.csp

Step 1: Web browser
Highlight the beginning of the price list (i.e. highlight the beginning of the first item/record on the web page). Scroll to the end of the list or web page. Hold your shift key down and click just after the last item/record. Now copy this highlighted list. You should have all the items highlighted

Step 2 Spreadsheet
Paste this in to an empty spreadsheet. Examine this and see if prices are on one row alone. It does not matter if there are dollar signs with the prices in the cells or also in this case a currency symbol USD. But here we find in the spreadsheet a price such as "$9.99 USD" occurring in every eighth row. The Prices should be in cells by themselves. But there is all the other stuff like the title and the blurb which here we want to parse out.

Step 3 Spreadsheet
Save the spreadsheet as a CSV file comma separated file. Remember the name and directory for the first line of R code below in the next step.

Step 4: R software ( www.r-project.org), text editor
Copy and paste the code below into a text editor
Change the directory and file name to your file and directory name from step 3 in the first and and the directory in the last line of this code. On the last line chose a file name.
Change the line jump <-(seq(x,y,z)
so that x is the first row a price occurs and y the last row a price occurs and
z is the number of rows between each price.

You can also change jump(x,y,z) to parse out all the titles or all the blurbs or all the dates published.


After editing copy the code into the R command line a line at a time
the # lines are comments


Use R software and run this code on the R command line after editing it in the text editor. Stop after each line that begins with # to check the results. good luck and use R help if needed.

###################Code Starts here##################
OPocket <-read.csv(file="/Users/ptimusk/documents/school/books/OReillyPocket08-12-24.csv", header= FALSE)
# read into R and the variable OPocket the csv file created by Excel save as csv.
#change "Users/ptimusk/documents/school/books/...to your directory
# and file name OReillyPocket08-12-24.csv

OPocket
#check variable for rows that contain prices. Every foth row starting at 4 ending at 394
#
jump <- seq(4, 394, 5)
# create an index of rows where prices occur. Start with row 4. End with row 394
#Count by 5 rows

OPocket.Prices <- as.vector (OPocket[jump,1])
#take only the rows(indexed by "jump") with prices out of the variable OPocket

OPocket.Prices
# check to see above variable contains quoted prices in 79 rows

write.csv(OPocket.Prices,file="/Users/ptimusk/documents/school/books/OPocketPrices08-12-24.csv")
#write the prices to a csv text file

###############Code ends here######################

Step 5 text editor
Open with the text editor the file you just wrote on the last line of R code
Save as txt

Step 6 Spreadsheet

With the spreadsheet and a new empty spreadsheet open the txt file you just saved. That should be the text file with the Prices. This file should have the variable OPocket.Prices as an R format variable.

The spreadsheet should take you through some import steps where
you can adjust the separator characters to parse out the prices only without the $
and without the currency symbol. You also parse out the first line and the quotes.

Sum the prices column
Your done!
Answer: all the O'Reilly pocket references and guides will cost 851.69 $ for the 79 books.

No comments: