Introduction to R
Jiang Du
Jan 17th 2008

What is R?
A software package for data analysis and graphical representation
Scripting language
Flexible and customizable
Free
Weaknesses
  Not particularly efficient in handling large data sets
  Slow in executing big loops

Where to get R?
http://www.r-project.org/

Basic operations
> 1+2*3
[1] 7
> log(10)
[1] 2.302585
> 4^2
[1] 16
> sqrt(16)
[1] 4
> pi
[1] 3.141593
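As a small supplement to the arithmetic above, the sketch below illustrates two details that tend to surprise newcomers: `log()` is the natural logarithm unless a base is given, and `^` binds more tightly than unary minus. The variable names are illustrative and not part of the original slides.

```r
# log() defaults to the natural logarithm (base e)
natural_log <- log(10)

# Other bases: dedicated helpers or the `base` argument
log_base10 <- log10(1000)
log_base2  <- log(8, base = 2)

# Operator precedence: ^ is applied before unary minus and before *
neg_square <- -2^2          # parsed as -(2^2)
pos_square <- (-2)^2        # parentheses change the result
mixed      <- 1 + 2 * 3^2   # 3^2 first, then *, then +

print(c(natural_log, log_base10, log_base2))
print(c(neg_square, pos_square, mixed))
```

When in doubt about precedence, adding parentheses costs nothing and makes the intent explicit.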

Basic operations
> x = pi * 2
> x
[1] 6.283185
> floor(x)
[1] 6
> ceiling(x)
[1] 7

Data type: vector
> x = c(1,2,3,5,4)
> x
[1] 1 2 3 5 4
> y = 1:5
> y
[1] 1 2 3 4 5
> x+2
[1] 3 4 5 7 6
> x+y
[1] 2 4 6 9 9
> length(x)
[1] 5
> sorted_x = sort(x)
> sorted_x
[1] 1 2 3 4 5

Data type: vector
> x
[1] 1 2 3 5 4
> x[3]
[1] 3
> x[1:2]
[1] 1 2
> x[-3]
[1] 1 2 5 4
> x[x > 3]
[1] 5 4
> x>3
[1] FALSE FALSE FALSE  TRUE  TRUE
> which(x > 3)
[1] 4 5

Data type: matrix
> m = matrix(1:9, nrow = 3, ncol = 3, byrow = TRUE)
> m
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
[3,]    7    8    9
> m[1, 2]
[1] 2
> m[1:2, 2:3]
     [,1] [,2]
[1,]    2    3
[2,]    5    6
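The `byrow = TRUE` argument above matters because `matrix()` otherwise fills column by column. The following sketch, with illustrative variable names, contrasts the two fill orders and shows how `drop = FALSE` keeps a single selected row as a matrix rather than a plain vector.

```r
# matrix() fills column by column unless byrow = TRUE
by_col <- matrix(1:9, nrow = 3)
by_row <- matrix(1:9, nrow = 3, byrow = TRUE)

# The two layouts are transposes of each other
same_after_transpose <- identical(by_col, t(by_row))

# Selecting a single row returns a plain vector by default ...
row_vec <- by_row[1, ]
# ... unless drop = FALSE is given, which keeps a 1 x 3 matrix
row_mat <- by_row[1, , drop = FALSE]

print(by_col)
print(same_after_transpose)
print(dim(row_mat))
```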

Data type: matrix
> m2 = matrix(c(2,0,0,0,2,0,0,0,2), nrow = 3, byrow = TRUE)
> m2
     [,1] [,2] [,3]
[1,]    2    0    0
[2,]    0    2    0
[3,]    0    0    2
> m * m2
     [,1] [,2] [,3]
[1,]    2    0    0
[2,]    0   10    0
[3,]    0    0   18
> m %*% m2
     [,1] [,2] [,3]
[1,]    2    4    6
[2,]    8   10   12
[3,]   14   16   18

Data type: data frame
> a = c(1:5)
> b = a^2
> df = data.frame(a,b)
> df
  a  b
1 1  1
2 2  4
3 3  9
4 4 16
5 5 25
> df$b
[1]  1  4  9 16 25
> df[3, 2]
[1] 9
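Beyond `df$b` and `df[3, 2]`, a column can be reached in a few equivalent ways, and a new column is created simply by assigning to a new name. The sketch below rebuilds the same data frame so it runs on its own; the `ratio` column is an illustrative addition, not something from the slides.

```r
a <- 1:5
df <- data.frame(a = a, b = a^2)

# Three equivalent ways to pull out column b
col_dollar  <- df$b
col_bracket <- df[["b"]]
col_index   <- df[, 2]

# Adding a derived column is an assignment to a new name
df$ratio <- df$b / df$a

# str() summarises column names and types
str(df)
```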

Data type: data frame
> dim(df)
[1] 5 2
> subset(df, a > 2)
  a  b
3 3  9
4 4 16
5 5 25
> subset(df, a > 2 & b < 10)
  a b
3 3 9

Visualization of data
> x = 1:10
> y = x^2
> plot(x, y)
> z = c(rep(1, 3), rep(5:6, 10), 1:10)
> hist(z)
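To see what `hist(z)` is summarising, it can help to count the values directly. The sketch below rebuilds `z`, tabulates it with `table()`, and asks `hist()` for its bins without drawing anything (`plot = FALSE`). The bin edges are whatever R's default rule picks, so they are printed rather than assumed.

```r
z <- c(rep(1, 3), rep(5:6, 10), 1:10)

# table() counts how often each distinct value occurs
value_counts <- table(z)
print(value_counts)

# hist() can return its bins without drawing (plot = FALSE)
h <- hist(z, plot = FALSE)
print(h$breaks)   # bin edges chosen by R's default rule
print(h$counts)   # observations per bin
```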

Visualization of data
> x = seq(-10, 10, length= 30)
> y = x
> f = function(x,y) { r
> z = outer(x, y, f)
> persp(x, y, z, theta = 30, phi = 30, expand = 0.5, col = "lightblue")
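The body of `f` is cut off in the `persp` snippet above. The surrounding calls resemble the example on R's `persp` help page, so the sketch below assumes the same surface, 10 * sin(r) / r with r = sqrt(x^2 + y^2). This is a guess at the original, not a recovered definition. It also spells out `length.out`, which the slide abbreviates as `length`.

```r
x <- seq(-10, 10, length.out = 30)
y <- x

# Assumed surface (taken from the persp help page, not from the slide)
f <- function(x, y) {
  r <- sqrt(x^2 + y^2)
  10 * sin(r) / r
}

# outer() evaluates f on every (x, y) pair, giving a 30 x 30 matrix
z <- outer(x, y, f)

# Guard against 0/0 at the origin if the grid ever contains (0, 0)
z[is.na(z)] <- 1

# Draw only in an interactive session, so the script also runs in batch mode
if (interactive()) {
  persp(x, y, z, theta = 30, phi = 30, expand = 0.5, col = "lightblue")
}
```

Here `theta` and `phi` set the viewing angles and `expand` scales the z axis relative to x and y.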

Loops, functions, etc.
> x = c(1, 2, 3, 4, 5)
> y = x
> for (i in 1:length(x)) {y[i] = x[i]^2}
> y
[1]  1  4  9 16 25
> apply(as.array(x), 1, "^", 2)
[1]  1  4  9 16 25
> x^2
[1]  1  4  9 16 25
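The three approaches above give the same result, and the vectorized form `x^2` is usually the one to prefer, given that R is slow in executing big loops. The sketch below restates them side by side, using `seq_along()` and `sapply()` as common alternatives to `1:length(x)` and `apply(as.array(x), ...)`. These substitutions are suggestions, not what the slides use.

```r
x <- c(1, 2, 3, 4, 5)

# Explicit loop: preallocate the result, then fill it element by element
squares_loop <- numeric(length(x))
for (i in seq_along(x)) {
  squares_loop[i] <- x[i]^2
}

# sapply() applies a function to each element of a vector
squares_sapply <- sapply(x, function(v) v^2)

# Vectorized arithmetic needs no loop at all
squares_vec <- x^2

print(squares_vec)
```

`seq_along(x)` is safer than `1:length(x)` because it yields an empty sequence when `x` is empty, whereas `1:0` counts down from 1 to 0.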

Loops, functions, etc.
> x = 1:5
> f3 = function(x) {return(x^3)}
> apply(as.array(x), 1, f3)
[1]   1   8  27  64 125
> source("~/test.r")
[1] -1 -1  9 16 25
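The contents of `test.r` are not shown in the slides, so the sketch below uses a stand-in script written to a temporary file to illustrate what `source()` does. It also shows a function with a default argument. The names `power`, `script_path`, and `sourced_value` are hypothetical.

```r
# A function with a default argument; the last expression is the return value
power <- function(x, p = 3) {
  x^p
}
print(power(1:5))

# source() runs an R script file. The script written here is a stand-in
# (hypothetical content), not the test.r used on the slide.
script_path <- tempfile(fileext = ".r")
writeLines(c("sourced_value <- (1:5)^2",
             "print(sourced_value)"), script_path)
source(script_path)
```

By default `source()` evaluates the file in the global workspace, so `sourced_value` is available afterwards.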

One of the most useful commands: ?
> ?apply

Practice: on Bordeaux wines
Problem: Bordeaux wine vintage quality and the weather
Bordeaux wines in different vintage years have different qualities (reflected in prices)
The older the better?
Weather is an important factor
Hot, dry summer preferred

Practice: the data
Variable  Description
WRAIN     Winter (Oct.-March) rain, ML
DEGREES   Average temperature (deg. Cent.), April-Sept.
HRAIN     Harvest (August and Sept.) rain, ML
TIME_SV   Time since vintage (years)

Practice: load the data
> wine_data = read.table("~/wine.data", header = TRUE, na.strings = ".");

Practice: visualization
> plot(wine_data$TIME_SV, wine_data$LPRICE2);
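The `na.strings = "."` argument in the `read.table` call tells R to treat a lone dot as a missing value. The sketch below shows the effect on a tiny inline table. The values are made up for illustration and are not taken from the wine data set, and the column names `id` and `value` are hypothetical.

```r
# Toy input (made-up values, NOT the wine data): "." marks a missing entry
toy_text <- "id value
1 0.5
2 .
3 -1.5"

toy <- read.table(text = toy_text, header = TRUE, na.strings = ".")

# The "." became NA, so the column is still numeric
print(toy)
print(is.na(toy$value))

# Summary functions need na.rm = TRUE to skip the missing entry
toy_median <- median(toy$value, na.rm = TRUE)
print(toy_median)
```

Without `na.strings`, the dot would force the whole column to be read as text, and numeric summaries would fail.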

Practice: visualization
avg_price = median(wine_data$LPRICE2, na.rm = TRUE);
plot(wine_data$DEGREES, wine_data$HRAIN, type = "n",
     xlab = "Temperature", ylab = "Harvest rain");
points(wine_data$DEGREES[wine_data$LPRICE2 >= avg_price],
       wine_data$HRAIN[wine_data$LPRICE2 >= avg_price],
       pch = 19, col = "blue");
points(wine_data$DEGREES[wine_data$LPRICE2 < avg_price],
       wine_data$HRAIN[wine_data$LPRICE2 < avg_price],
       pch = 19, col = "red");
legend(15, 250, c(">= avg price", "< avg price"),
       pch = 19, col = c("blue", "red"));

Practice: linear regression
Find a set of parameters a, ..., e, such that:
LPRICE2 ~ a * WRAIN + b * DEGREES + c * HRAIN + d * TIME_SV + e + error_term
The overall error should be minimized
In this case, the sum/average of squared errors
Sum((prediction - actual_price)^2)

Practice: linear regression
> lmfit = lm(LPRICE2 ~ WRAIN + DEGREES + HRAIN + TIME_SV, wine_data);
> lmfit
Coefficients:
(Intercept)  -12.145334
WRAIN          0.001167
DEGREES        0.616392
HRAIN         -0.003861
TIME_SV        0.023847
> cat("RMS: ", sqrt(sum(lmfit$residuals^2)/length(lmfit$residuals)), "\n");
RMS: 0.2586167

Practice: linear regression
plot(wine_data$VINT, wine_data$LPRICE2,
     xlab = "Vintage year", ylab = "log2 rel. price", pch = 19,
     col = "black");
points(wine_data$VINT[30:38], predict(lmfit, wine_data[30:38,]),
       pch = 19, col = "red");
legend(1965, -0.2, c("old data", "prediction"),
       pch = 19, col = c("black", "red"));
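To make the least-squares criterion concrete, the sketch below fits a small model with `lm()` and checks that the coefficients match those obtained by solving the normal equations directly. The data are synthetic and generated on the spot. They are not the wine data, and the names `toy`, `x1`, `x2`, and `beta` are assumptions made for this illustration.

```r
# Synthetic data (made up for illustration, NOT the wine data)
set.seed(1)
n  <- 50
x1 <- runif(n)
x2 <- runif(n)
y  <- 2 * x1 - 3 * x2 + 1 + rnorm(n, sd = 0.1)
toy <- data.frame(y, x1, x2)

# lm() picks the coefficients that minimize the sum of squared residuals
fit <- lm(y ~ x1 + x2, data = toy)

# The same coefficients from the normal equations (X'X) beta = X'y
X <- cbind(1, toy$x1, toy$x2)
beta <- solve(t(X) %*% X, t(X) %*% toy$y)

# Root-mean-square error, computed the same way as for lmfit
rms <- sqrt(sum(residuals(fit)^2) / length(residuals(fit)))

print(coef(fit))
print(as.vector(beta))
print(rms)
```

In practice `lm()` is preferable to solving the normal equations by hand, since it uses a numerically more stable decomposition and handles missing values and factors.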

Practice: linear regression
Using fewer parameters in the model?
LPRICE2 ~ b * DEGREES + c * HRAIN + d + error_term
lmfit2 = lm(LPRICE2 ~ DEGREES + HRAIN, wine_data);
RMS: 0.349513
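The RMS rises from 0.2586167 to 0.349513 when WRAIN and TIME_SV are dropped. An increase is the expected direction: on the data used for fitting, removing predictors from a least-squares model cannot reduce the residual sum of squares. The sketch below illustrates this on synthetic data (not the wine data) and shows one common way to judge whether the extra terms are worth keeping. The helper `rms` and all variable names are hypothetical.

```r
# Synthetic data (made up for illustration, NOT the wine data)
set.seed(2)
n  <- 40
x1 <- rnorm(n)
x2 <- rnorm(n)
x3 <- rnorm(n)
y  <- 1 + 0.5 * x1 - 0.8 * x2 + 0.3 * x3 + rnorm(n, sd = 0.2)
toy <- data.frame(y, x1, x2, x3)

# Hypothetical helper: RMS of the residuals of a fitted model
rms <- function(fit) sqrt(mean(residuals(fit)^2))

full    <- lm(y ~ x1 + x2 + x3, data = toy)
reduced <- lm(y ~ x1 + x2, data = toy)

# On the data used for fitting, the reduced model cannot have a smaller RMS
print(c(full = rms(full), reduced = rms(reduced)))

# An F-test asks whether the extra term improves the fit by more than chance
print(anova(reduced, full))
```

A lower training RMS alone does not show that the larger model predicts better on new vintages. Comparing predictions on held-out rows is a more direct check.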

Links
Classesv2: http://classesv2.yale.edu/
Course wiki: http://lab.zoo.cs.yale.edu/cs445-wiki/
R: http://www.r-project.org/
Bordeaux wine analysis: http://www.liquidasset.com/orley.htm