This file gives you instructions on how to run linear regression in S-plus.
Enter S-plus by typing Splus followed by return. Wait a few seconds until you get the S-plus prompt >.

HOW DO YOU READ DATA INTO SPLUS?

We are going to work with a "data.frame", which we create from a file that I have sent you, called black.cherry.data. There are several ways of creating a "data.frame". The following instructions give you one of these ways. Type:

> trees.dat_read.table("black.cherry.data",col.names=c("diam","height","vol"))

S-plus goes and gets the file and, with the names that I gave under "col.names", creates a special array called a data.frame with 3 columns called "diam", "height" and "vol". If you type the following

> trees.dat

this is what you get:

   diam height  vol
 1  8.3     70 10.3
 2  8.6     65 10.3
 3  8.8     63 10.2
 4 10.5     72 16.4
 5 10.7     81 18.8
 6 10.8     83 19.7
 7 11.0     66 15.6
 8 11.0     75 18.2
 9 11.1     80 22.6
10 11.2     75 19.9
11 11.3     79 24.2
12 11.4     76 21.0
13 11.4     76 21.4
14 11.7     69 21.3
15 12.0     75 19.1
16 12.9     74 22.2
17 12.9     85 33.8
18 13.3     86 27.4
19 13.7     71 25.7
20 13.8     64 24.9
21 14.0     78 34.5
22 14.2     80 31.7
23 14.5     74 36.3
24 16.0     72 38.3
25 16.3     77 42.6
26 17.3     82 55.4
27 17.5     82 55.7
28 17.9     80 58.3
29 18.0     80 51.5
30 18.0     80 51.0
31 20.6     87 77.0

SOME BASIC NOTIONS

The lm() function is the basic function for doing linear regression. To regress vol on diam you type

> attach(trees.dat)
> trees.fit1_lm(vol~diam,trees.dat)

Anticipating what we have already done in class, if I want to regress vol on diam and height, I will type

> trees.fit_lm(vol~diam+height,trees.dat)

To obtain the results of the regression (here I choose the multiple regression of vol on diam and height), I type

> summary(trees.fit)

The output is

Call: lm(formula = vol ~ diam + height, data = trees.dat)

Residuals:
    Min     1Q  Median    3Q   Max
 -6.439 -2.611 -0.2912 2.174 8.516

Coefficients:
               Value Std. Error  t value Pr(>|t|)
(Intercept) -58.2076     8.5800  -6.7841   0.0000
       diam   4.6988     0.2642  17.7868   0.0000
     height   0.3436     0.1295   2.6530   0.0130

Residual standard error: 3.868 on 28 degrees of freedom
Multiple R-Squared: 0.9483
F-statistic: 256.9 on 2 and 28 degrees of freedom, the p-value is 0

Correlation of Coefficients:
       (Intercept)    diam
  diam      0.1932
height     -0.9342 -0.5237

There is now an object called trees.fit, and within this object there are other objects that S-plus has created. To find out what those objects are, type

> names(trees.fit)
[1] "coefficients"  "residuals"     "fitted.values" "effects"
[5] "R"             "rank"          "assign"        "df.residual"
[9] "contrasts"     "terms"         "call"

You can obtain those objects individually by doing

> trees.fit$resid
        1     2        3        4        5         6          7         8
 5.453528 5.762 5.409481 0.528838 -1.10355 -1.360683 -0.5588335 -1.051454
        9         10       11         12          13        14        15
  1.16054 -0.2912215 2.164397 -0.4746134 -0.07461338 0.8211071 -4.850291
       16       17        18        19        20        21        22        23
 -5.63562 2.184511 -6.438649 -4.863816 -3.728328 0.1211614 -4.305855 0.9462409
       24        25       26       27       28        29        30       31
-3.414766 -2.242539 4.140501 3.500734 4.908448 -2.361436 -2.861436 8.516217

> trees.fit$coeff
(Intercept)     diam    height
  -58.20759 4.698837 0.3436245

etc...

You might have noticed that the "summary()" command that we used before did not give us the sum of squares corresponding to each predictor variable. To obtain these sums of squares, type

> anova(trees.fit)
Analysis of Variance Table
Response: vol
Terms added sequentially (first to last)
          Df Sum of Sq  Mean Sq  F Value     Pr(F)
     diam  1  7581.781 7581.781 506.6764 0.0000000
   height  1   105.317  105.317   7.0382 0.0129969
Residuals 28   418.985   14.964

Question: does the "anova" command give us partial sums of squares or sequential sums of squares? To find out, let us give the predictor variables in a different order.
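If you want to check the arithmetic outside S-plus, the sequential sums of squares can be recomputed by hand: each one is the drop in residual sum of squares when a term enters the model, in the order given. Here is a NumPy sketch (not part of the S-plus session; the names rss, ss_diam_first, etc. are mine) using the black cherry data listed above:

```python
import numpy as np

# Black cherry data transcribed from the handout: columns diam, height, vol.
data = np.array([
    [8.3, 70, 10.3], [8.6, 65, 10.3], [8.8, 63, 10.2], [10.5, 72, 16.4],
    [10.7, 81, 18.8], [10.8, 83, 19.7], [11.0, 66, 15.6], [11.0, 75, 18.2],
    [11.1, 80, 22.6], [11.2, 75, 19.9], [11.3, 79, 24.2], [11.4, 76, 21.0],
    [11.4, 76, 21.4], [11.7, 69, 21.3], [12.0, 75, 19.1], [12.9, 74, 22.2],
    [12.9, 85, 33.8], [13.3, 86, 27.4], [13.7, 71, 25.7], [13.8, 64, 24.9],
    [14.0, 78, 34.5], [14.2, 80, 31.7], [14.5, 74, 36.3], [16.0, 72, 38.3],
    [16.3, 77, 42.6], [17.3, 82, 55.4], [17.5, 82, 55.7], [17.9, 80, 58.3],
    [18.0, 80, 51.5], [18.0, 80, 51.0], [20.6, 87, 77.0]])
diam, height, vol = data.T

def rss(X, y):
    """Residual sum of squares of the least-squares fit of y on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ beta
    return r @ r

ones = np.ones_like(vol)
rss0 = rss(np.column_stack([ones]), vol)                  # intercept only
rss_d = rss(np.column_stack([ones, diam]), vol)           # + diam
rss_dh = rss(np.column_stack([ones, diam, height]), vol)  # + height

# Order diam, then height (as in anova(trees.fit)):
ss_diam_first = rss0 - rss_d      # SS for diam entered first
ss_height_after = rss_d - rss_dh  # SS for height entered after diam

# Reversed order, height first:
rss_h = rss(np.column_stack([ones, height]), vol)
ss_height_first = rss0 - rss_h
ss_diam_after = rss_h - rss_dh
```

Comparing the two orderings shows directly that these sums of squares depend on the order of entry, which is the point of the question above.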
> try_lm(vol~height+diam,trees.dat)
> anova(try)
Analysis of Variance Table
Response: vol
Terms added sequentially (first to last)
          Df Sum of Sq  Mean Sq  F Value        Pr(F)
   height  1  2952.995 2952.995 197.3432 3.319567e-14
     diam  1  4734.104 4734.104 316.3714 1.110200e-16
Residuals 28   418.985   14.964

What is your conclusion? We see that whatever the order in which the variables are given, the coefficients are all "significantly different from zero".

We can observe the impact of deleting or adding predictor variables by using the "add1" or "drop1" function. Let us use the function drop1 on the object trees.fit.

> trees.drop1_drop1(trees.fit)
> trees.drop1
Single term deletions
Model: vol ~ diam + height
       Df Sum of Sq      RSS       Cp
<none>              418.985  508.768
  diam  1  4734.104 5153.089 5212.944
height  1   105.317  524.303  584.158

You see that
1. if we do not remove any variable, the residual sum of squares does not change.
2. if we remove diam, its sum of squares is transferred to SSE: the sum of squares for diam is 4734.104, and we can verify that the new SSE is 418.985+4734.104=5153.089.
3. if we remove height, its corresponding sum of squares 105.317 goes to the old SSE to give 524.303.

Are the sums of squares given by drop1 and add1 sequential or partial?

To use the function add1, we have to start with a number of variables which is not maximal. Let us start with a regression that has no predictor variables; the functional formula is then vol~1.

> trees0.fit_lm(vol~1,trees.dat)
> trees.add1_add1(trees0.fit,~height+diam)
> trees.add1
Single term additions
Model: vol ~ 1
       Df Sum of Sq      RSS       Cp
<none>             8106.084 8646.489
height  1  2952.995 5153.089 6233.900
  diam  1  7581.781  524.303 1605.114

Interpret the different terms of this object as you have done for drop1.

HERE ARE A FEW OTHER MODELS THAT YOU COULD TRY WITH THE GIVEN VARIABLES.
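The Cp column of these tables can also be checked by hand. From the printed values, S-plus appears to compute it on the scale Cp = RSS + 2*p*s^2, where p is the number of coefficients in the candidate model and s^2 is the residual mean square of the model the function started from (14.964 for the full model in drop1, and 8106.084/30 for the intercept-only model in add1). This is an inference from the tables, not a statement from the handout; a small Python check (the function name cp is mine):

```python
# Cp on the scale the drop1/add1 tables above use: RSS + 2*p*s2.
# Assumption inferred from the printed numbers, not stated in the handout.

def cp(rss, p, s2):
    """Mallows' Cp on the RSS scale: RSS + 2 * p * s2."""
    return rss + 2 * p * s2

s2_full = 14.964          # residual mean square of vol ~ diam + height
s2_null = 8106.084 / 30   # residual mean square of vol ~ 1 (SST / 30)

# drop1 table (candidate models have 3, 2 and 2 coefficients):
cp_none = cp(418.985, 3, s2_full)
cp_drop_diam = cp(5153.089, 2, s2_full)
cp_drop_height = cp(524.303, 2, s2_full)

# add1 table (starting from vol ~ 1, which has 1 coefficient):
cp_null = cp(8106.084, 1, s2_null)
cp_add_height = cp(5153.089, 2, s2_null)
cp_add_diam = cp(524.303, 2, s2_null)
```

Each of these reproduces the corresponding Cp entry in the tables above to the printed precision.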
> try1_lm(vol~-1+height+diam)
> try2_lm(vol~height+I(diam^2))
> try3_lm(vol~poly(height,3)+diam)
> try4_lm(vol~height+diam+height*diam)
> try5_lm(vol~height*diam)

TO SEE WHAT THE FUNCTIONAL FORMULAS CORRESPONDING TO THOSE MODELS ARE, DO

> try1
Call: lm(formula = vol ~ -1 + height + diam)
Coefficients:
     height     diam
 -0.4772425 5.045026

Degrees of freedom: 31 total; 29 residual
Residual standard error: 6.180254

etc. for the other formulas.

RESIDUALS, PREDICTION AND DIAGNOSTICS

To obtain the predicted values of Y at different levels of the predictor variables, type

> trees.pred_predict(trees.fit)
> trees.pred
        1     2        3        4        5        6        7        8        9
 4.846472 4.538 4.790519 15.87116 19.90355 21.06068 16.15883 19.25145 21.43946
       10      11       12       13       14       15       16       17
 20.19122 22.0356 21.47461 21.47461 20.47889 23.95029 27.83562 31.61549
       18       19       20       21       22       23       24       25
 33.83865 30.56382 28.62833 34.37884 36.00586 35.35376 41.71477 44.84254
      26       27       28       29       30       31
 51.2595 52.19927 53.39155 53.86144 53.86144 68.48378

To obtain the residuals, type

> trees.res_residuals(trees.fit)
> trees.res
        1     2        3        4        5         6          7         8
 5.453528 5.762 5.409481 0.528838 -1.10355 -1.360683 -0.5588335 -1.051454
        9         10       11         12          13        14        15
  1.16054 -0.2912215 2.164397 -0.4746134 -0.07461338 0.8211071 -4.850291
       16       17        18        19        20        21        22        23
 -5.63562 2.184511 -6.438649 -4.863816 -3.728328 0.1211614 -4.305855 0.9462409
       24        25       26       27       28        29        30       31
-3.414766 -2.242539 4.140501 3.500734 4.908448 -2.361436 -2.861436 8.516217

The "lm.influence" function will give us a series of diagnostics that will allow us to identify influential points and outliers.
> trees.infl_lm.influence(trees.fit)
> trees.infl
$coefficients:
   (Intercept)     diam    height
 1   -60.09476 4.796622 0.3487896
 2   -62.67536 4.756018 0.3895576
 3   -63.50146 4.732589 0.4045843
 4   -58.33777 4.703328 0.3443157
 5   -58.69316 4.676459 0.3544425
 6   -59.04995 4.666995 0.3609355
 7   -57.78942 4.699801 0.3382243
 8   -58.16537 4.688583 0.3453262
 9   -57.81670 4.717713 0.3346521
10   -58.19415 4.696293 0.3440208
11   -57.66866 4.728373 0.3303984
12   -58.21736 4.694558 0.3447100
13   -58.20913 4.698164 0.3437952
14   -58.62811 4.697775 0.3489647
15   -57.86816 4.675723 0.3453260
16   -57.25456 4.703564 0.3327459
17   -56.69993 4.729898 0.3173171
18   -63.14630 4.611026 0.4270661
19   -56.15037 4.741234 0.3113916
20   -54.02946 4.779170 0.2766818
21   -58.19725 4.698693 0.3434604
22   -59.17787 4.697425 0.3585450
23   -58.41670 4.690596 0.3473880
24   -56.69752 4.766783 0.3135564
25   -58.07645 4.730350 0.3374313
26   -57.00878 4.645326 0.3352492
27   -57.21119 4.649739 0.3374322
28   -57.62016 4.605748 0.3497879
29   -58.48353 4.745032 0.3403301
30   -58.54195 4.754813 0.3396326
31   -52.55626 4.466071 0.3051865

$sigma:
        1        2        3        4       5        6        7        8
 3.777861 3.751928 3.768619 3.937887 3.93277 3.928965 3.937627 3.933802
        9       10       11       12       13       14       15       16
 3.932307 3.938866 3.915439 3.938173 3.939258 3.935865 3.822641 3.781275
       17       18       19       20       21       22       23       24
   3.9134 3.705069 3.818223 3.855479 3.939214 3.846929 3.934851 3.877037
       25      26       27     28       29       30     31
 3.913791 3.84962 3.875052 3.8104 3.909718 3.895794 3.4709

$hat:
 [1] 0.11576594 0.14666902 0.17612103 0.05917429 0.12053469 0.15550541
 [7] 0.11465323 0.05147098 0.09185016 0.04795754 0.07368613 0.04805058
[13] 0.04805058 0.07280583 0.03763485 0.03575893 0.13047087 0.14243230
[19] 0.06704405 0.21188939 0.03569177 0.04508901 0.05028648 0.11237164
[25] 0.06966646 0.09083819 0.09572463 0.10661936 0.11005460 0.11005460
[31] 0.22607746

The $coefficients component gives us the coefficients when all points except the i-th point are considered; similarly, $sigma gives us the estimate of sigma when the i-th point is
excluded from the data. The $hat component gives us the diagonal elements of the hat matrix. You can use these and other quantities to compute different diagnostics. For example, to compute the standardized residuals, do the following

> s_summary(trees.fit)$sigma
> h_trees.infl$hat
> trees.res_trees.res/(s*sqrt(1-h))

You will notice that we have given the same name to the residuals and the standardized residuals. This is OK if you keep track of what you do, because now trees.res does not denote residuals but standardized residuals.

To compute the studentized residuals, do

> e_trees.fit$residuals
> si_trees.infl$sigma
> trees.stud_e/(si*(1-h)^.5)
> trees.stud
        1        2      3         4          5          6          7          8
  1.53514 1.662494 1.5814 0.1384538 -0.2992155 -0.3768602 -0.1508312 -0.2744432
        9          10       11         12          13        14        15
0.3096947 -0.07577466 0.574351 -0.1235203 -0.01941312 0.2166577 -1.293404
       16       17        18        19        20        21        22
-1.517786 0.598629 -1.876568 -1.318819 -1.089287 0.0313218 -1.145417
       23         24         25       26        27       28         29
0.2467614 -0.9348573 -0.5940501 1.128014 0.9500167 1.362873 -0.6402492
        30       31
-0.7785855 2.789047

To compute the DFBETAS, i.e. the scaled differences between the betas with all points and the betas without the i-th point, do

> summary(trees.fit)$cov.unscaled
            (Intercept)         diam       height
(Intercept)  4.91969725  0.029259817 -0.069379559
       diam  0.02925982  0.004663833 -0.001197493
     height -0.06937956 -0.001197493  0.001121161

This gives us the matrix (X'X)^-1.
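As a cross-check outside S-plus, the hat diagonal and both kinds of residuals can be recomputed from scratch. Below is a NumPy sketch (variable names are mine); instead of refitting the model 31 times, it gets the leave-one-out sigma from the standard closed-form identity (n-p-1) s_(i)^2 = (n-p) s^2 - e_i^2/(1-h_i):

```python
import numpy as np

# Black cherry data from the handout: columns diam, height, vol.
data = np.array([
    [8.3, 70, 10.3], [8.6, 65, 10.3], [8.8, 63, 10.2], [10.5, 72, 16.4],
    [10.7, 81, 18.8], [10.8, 83, 19.7], [11.0, 66, 15.6], [11.0, 75, 18.2],
    [11.1, 80, 22.6], [11.2, 75, 19.9], [11.3, 79, 24.2], [11.4, 76, 21.0],
    [11.4, 76, 21.4], [11.7, 69, 21.3], [12.0, 75, 19.1], [12.9, 74, 22.2],
    [12.9, 85, 33.8], [13.3, 86, 27.4], [13.7, 71, 25.7], [13.8, 64, 24.9],
    [14.0, 78, 34.5], [14.2, 80, 31.7], [14.5, 74, 36.3], [16.0, 72, 38.3],
    [16.3, 77, 42.6], [17.3, 82, 55.4], [17.5, 82, 55.7], [17.9, 80, 58.3],
    [18.0, 80, 51.5], [18.0, 80, 51.0], [20.6, 87, 77.0]])
diam, height, vol = data.T

n = len(vol)
X = np.column_stack([np.ones(n), diam, height])  # design matrix with intercept
p = X.shape[1]

XtX_inv = np.linalg.inv(X.T @ X)   # (X'X)^-1, cf. summary()$cov.unscaled
h = np.diag(X @ XtX_inv @ X.T)     # hat-matrix diagonal, cf. $hat

beta = XtX_inv @ X.T @ vol
e = vol - X @ beta                 # ordinary residuals
s2 = e @ e / (n - p)               # residual mean square
s = np.sqrt(s2)                    # cf. summary()$sigma

r_std = e / (s * np.sqrt(1 - h))   # standardized residuals

# Leave-one-out sigma without refitting, cf. $sigma:
si = np.sqrt(((n - p) * s2 - e**2 / (1 - h)) / (n - p - 1))
r_stud = e / (si * np.sqrt(1 - h))  # studentized residuals, cf. trees.stud
```

The results agree with the S-plus output above, e.g. h[0] with the first entry of $hat, si[30] with the last entry of $sigma, and r_stud[30] with the last studentized residual.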
> coef(trees.infl)[1,]
(Intercept)     diam    height
  -60.09476 4.796622 0.3487896

> coef(trees.fit)
(Intercept)     diam    height
  -58.20759 4.698837 0.3436245

> b_coef(trees.fit)
> bi_coef(trees.infl)[1,]
> si_trees.infl$sigma
> xxi_summary(trees.fit)$cov.unscaled
> si1_si[1]
> cc1_sqrt(xxi[1,1])
> dfbeta1_(bi[1]-b[1])/(si1*cc1)

And we obtain the value of dfbeta for the intercept coefficient:

> dfbeta1
(Intercept)
 -0.2252136

GRAPHS

To do a graph in S-plus, you have to open a special window; type

> motif()
> par(mfrow=c(2,1))
> attach(trees.dat)
> plot(diam,trees.res,xlab="diametre",ylab="Std.residuals")
> abline(h=0,lty=2)
> title("Std Residuals v diametre")
> plot(height,trees.res,xlab="height",ylab="Std.Residuals")
> abline(h=0,lty=2)
> title("Std Residuals v height")

These commands give us the graph of the residuals vs the predictor variables. To do the graph of the residuals vs the fitted values, type

> par(mfrow=c(1,1))
> plot(trees.pred,trees.res,xlab="predicted volume",ylab="Std.Residuals")
> abline(h=0,lty=2)
> title("Residuals vs fitted values")
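If you would rather reproduce the two-panel residual plot outside S-plus, here is a matplotlib sketch of the same graph (a hedged illustration: the output file name and all variable names are mine, and the standardized residuals are recomputed from the data rather than taken from the S-plus session):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # draw off-screen; no motif()-style window needed
import matplotlib.pyplot as plt

# Black cherry data from the handout: columns diam, height, vol.
data = np.array([
    [8.3, 70, 10.3], [8.6, 65, 10.3], [8.8, 63, 10.2], [10.5, 72, 16.4],
    [10.7, 81, 18.8], [10.8, 83, 19.7], [11.0, 66, 15.6], [11.0, 75, 18.2],
    [11.1, 80, 22.6], [11.2, 75, 19.9], [11.3, 79, 24.2], [11.4, 76, 21.0],
    [11.4, 76, 21.4], [11.7, 69, 21.3], [12.0, 75, 19.1], [12.9, 74, 22.2],
    [12.9, 85, 33.8], [13.3, 86, 27.4], [13.7, 71, 25.7], [13.8, 64, 24.9],
    [14.0, 78, 34.5], [14.2, 80, 31.7], [14.5, 74, 36.3], [16.0, 72, 38.3],
    [16.3, 77, 42.6], [17.3, 82, 55.4], [17.5, 82, 55.7], [17.9, 80, 58.3],
    [18.0, 80, 51.5], [18.0, 80, 51.0], [20.6, 87, 77.0]])
diam, height, vol = data.T

# Refit vol ~ diam + height and form standardized residuals.
X = np.column_stack([np.ones(len(vol)), diam, height])
beta, *_ = np.linalg.lstsq(X, vol, rcond=None)
e = vol - X @ beta
h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)
s = np.sqrt(e @ e / (len(vol) - X.shape[1]))
r_std = e / (s * np.sqrt(1 - h))

# Two stacked panels, like par(mfrow=c(2,1)).
fig, (ax1, ax2) = plt.subplots(2, 1)
ax1.scatter(diam, r_std)
ax1.axhline(0, linestyle="--")  # counterpart of abline(h=0,lty=2)
ax1.set(xlabel="diametre", ylabel="Std. residuals",
        title="Std Residuals v diametre")
ax2.scatter(height, r_std)
ax2.axhline(0, linestyle="--")
ax2.set(xlabel="height", ylabel="Std. residuals",
        title="Std Residuals v height")
fig.tight_layout()
fig.savefig("residual_plots.png")
```

The residuals-vs-fitted plot from the last block of S-plus commands can be produced the same way with `X @ beta` on the horizontal axis.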