This file gives you instructions on how to run linear regression in S-plus.
Enter S-plus by typing Splus followed by return. Wait a few seconds until you get the S-plus prompt >.

HOW DO YOU READ DATA INTO SPLUS?

We are going to work with a "data.frame", which we create from a file that I have sent you, called black.cherry.data. There are several ways of creating a "data.frame". The following instructions give you one of these ways. Type:

> trees.dat_read.table("black.cherry.data",col.names=c("diam","height","vol"))

S-plus goes and gets the file and, with the names that I gave under "col.names", creates a special array called a data.frame with 3 columns called "diam", "height" and "vol". If you type the following

> trees.dat

this is what you get:

   diam height  vol
 1  8.3     70 10.3
 2  8.6     65 10.3
 3  8.8     63 10.2
 4 10.5     72 16.4
 5 10.7     81 18.8
 6 10.8     83 19.7
 7 11.0     66 15.6
 8 11.0     75 18.2
 9 11.1     80 22.6
10 11.2     75 19.9
11 11.3     79 24.2
12 11.4     76 21.0
13 11.4     76 21.4
14 11.7     69 21.3
15 12.0     75 19.1
16 12.9     74 22.2
17 12.9     85 33.8
18 13.3     86 27.4
19 13.7     71 25.7
20 13.8     64 24.9
21 14.0     78 34.5
22 14.2     80 31.7
23 14.5     74 36.3
24 16.0     72 38.3
25 16.3     77 42.6
26 17.3     82 55.4
27 17.5     82 55.7
28 17.9     80 58.3
29 18.0     80 51.5
30 18.0     80 51.0
31 20.6     87 77.0

SOME BASIC NOTIONS

The lm() function is the basic function for doing linear regression. To regress vol on diam you type

> attach(trees.dat)
> trees.fit1_lm(vol~diam,trees.dat)

Anticipating what we have already done in class, if I want to regress vol on diam and height, I will type

> trees.fit_lm(vol~diam+height,trees.dat)

To obtain the results of the regression (here I choose the multiple regression of vol on diam and height), I type

> summary(trees.fit)

The output is

Call: lm(formula = vol ~ diam + height, data = trees.dat)

Residuals:
    Min     1Q  Median    3Q   Max
 -6.439 -2.611 -0.2912 2.174 8.516

Coefficients:
               Value Std. Error  t value Pr(>|t|)
(Intercept) -58.2076     8.5800  -6.7841   0.0000
       diam   4.6988     0.2642  17.7868   0.0000
     height   0.3436     0.1295   2.6530   0.0130

Residual standard error: 3.868 on 28 degrees of freedom
Multiple R-Squared: 0.9483
F-statistic: 256.9 on 2 and 28 degrees of freedom, the p-value is 0

Correlation of Coefficients:
       (Intercept)    diam
  diam      0.1932
height     -0.9342 -0.5237

There is now an object called trees.fit, and within this object there are other objects that S-plus has created. To find out what those objects are, type

> names(trees.fit)
[1] "coefficients"  "residuals"     "fitted.values" "effects"
[5] "R"             "rank"          "assign"        "df.residual"
[9] "contrasts"     "terms"         "call"

You can obtain those objects individually by doing

> trees.fit$resid
        1     2        3        4        5         6          7         8
 5.453528 5.762 5.409481 0.528838 -1.10355 -1.360683 -0.5588335 -1.051454
        9         10       11         12          13        14        15
  1.16054 -0.2912215 2.164397 -0.4746134 -0.07461338 0.8211071 -4.850291
       16       17        18        19        20        21        22        23
 -5.63562 2.184511 -6.438649 -4.863816 -3.728328 0.1211614 -4.305855 0.9462409
       24        25       26       27       28        29        30       31
-3.414766 -2.242539 4.140501 3.500734 4.908448 -2.361436 -2.861436 8.516217

> trees.fit$coeff
(Intercept)     diam    height
  -58.20759 4.698837 0.3436245

etc...

You might have noticed that the "summary()" command that we used before did not give us the sum of squares corresponding to each predictor variable. To obtain these sums of squares, type

> anova(trees.fit)
Analysis of Variance Table
Response: vol
Terms added sequentially (first to last)
          Df Sum of Sq  Mean Sq  F Value     Pr(F)
     diam  1  7581.781 7581.781 506.6764 0.0000000
   height  1   105.317  105.317   7.0382 0.0129969
Residuals 28   418.985   14.964

Question: does the "anova" command give us partial sums of squares or sequential sums of squares? To find out, let us give the predictor variables in a different order.
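If you want to check the arithmetic outside S-plus, the sequential sums of squares can be recomputed by hand: each one is the drop in residual sum of squares when a term enters the model, in the order given. Here is a NumPy sketch (not part of the S-plus session; the names rss, ss_diam_first, etc. are mine) using the black cherry data listed above:

```python
import numpy as np

# Black cherry data transcribed from the handout: columns diam, height, vol.
data = np.array([
    [8.3, 70, 10.3], [8.6, 65, 10.3], [8.8, 63, 10.2], [10.5, 72, 16.4],
    [10.7, 81, 18.8], [10.8, 83, 19.7], [11.0, 66, 15.6], [11.0, 75, 18.2],
    [11.1, 80, 22.6], [11.2, 75, 19.9], [11.3, 79, 24.2], [11.4, 76, 21.0],
    [11.4, 76, 21.4], [11.7, 69, 21.3], [12.0, 75, 19.1], [12.9, 74, 22.2],
    [12.9, 85, 33.8], [13.3, 86, 27.4], [13.7, 71, 25.7], [13.8, 64, 24.9],
    [14.0, 78, 34.5], [14.2, 80, 31.7], [14.5, 74, 36.3], [16.0, 72, 38.3],
    [16.3, 77, 42.6], [17.3, 82, 55.4], [17.5, 82, 55.7], [17.9, 80, 58.3],
    [18.0, 80, 51.5], [18.0, 80, 51.0], [20.6, 87, 77.0]])
diam, height, vol = data.T

def rss(X, y):
    """Residual sum of squares of the least-squares fit of y on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ beta
    return r @ r

ones = np.ones_like(vol)
rss0 = rss(np.column_stack([ones]), vol)                  # intercept only
rss_d = rss(np.column_stack([ones, diam]), vol)           # + diam
rss_dh = rss(np.column_stack([ones, diam, height]), vol)  # + height

# Order diam, then height (as in anova(trees.fit)):
ss_diam_first = rss0 - rss_d      # SS for diam entered first
ss_height_after = rss_d - rss_dh  # SS for height entered after diam

# Reversed order, height first:
rss_h = rss(np.column_stack([ones, height]), vol)
ss_height_first = rss0 - rss_h
ss_diam_after = rss_h - rss_dh
```

Comparing the two orderings shows directly that these sums of squares depend on the order of entry, which is the point of the question above.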
> try_lm(vol~height+diam,trees.dat)
> anova(try)
Analysis of Variance Table
Response: vol
Terms added sequentially (first to last)
          Df Sum of Sq  Mean Sq  F Value        Pr(F)
   height  1  2952.995 2952.995 197.3432 3.319567e-14
     diam  1  4734.104 4734.104 316.3714 1.110200e-16
Residuals 28   418.985   14.964

What is your conclusion? We see that whatever the order in which the variables are given, the coefficients are all "significantly different from zero".

We can observe the impact of deleting or adding predictor variables by using the "add1" or "drop1" function. Let us use the function drop1 on the object trees.fit.

> trees.drop1_drop1(trees.fit)
> trees.drop1
Single term deletions
Model: vol ~ diam + height
       Df Sum of Sq      RSS       Cp
<none>              418.985  508.768
  diam  1  4734.104 5153.089 5212.944
height  1   105.317  524.303  584.158

You see that
1. if we do not remove any variable, the residual sum of squares does not change.
2. if we remove diam, its sum of squares is transferred to SSE: the sum of squares for diam is 4734.104, and we can verify that the new SSE is 418.985+4734.104=5153.089.
3. if we remove height, its corresponding sum of squares 105.317 goes to the old SSE to give 524.303.

Are the sums of squares given by drop1 and add1 sequential or partial?

To use the function add1, we have to start with a number of variables which is not maximal. Let us start with a regression that has no predictor variables; the functional formula is then vol~1.

> trees0.fit_lm(vol~1,trees.dat)
> trees.add1_add1(trees0.fit,~height+diam)
> trees.add1
Single term additions
Model: vol ~ 1
       Df Sum of Sq      RSS       Cp
<none>             8106.084 8646.489
height  1  2952.995 5153.089 6233.900
  diam  1  7581.781  524.303 1605.114

Interpret the different terms of this object as you have done for drop1.

HERE ARE A FEW OTHER MODELS THAT YOU COULD TRY WITH THE GIVEN VARIABLES.
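The Cp column of these tables can also be checked by hand. From the printed values, S-plus appears to compute it on the scale Cp = RSS + 2*p*s^2, where p is the number of coefficients in the candidate model and s^2 is the residual mean square of the model the function started from (14.964 for the full model in drop1, and 8106.084/30 for the intercept-only model in add1). This is an inference from the tables, not a statement from the handout; a small Python check (the function name cp is mine):

```python
# Cp on the scale the drop1/add1 tables above use: RSS + 2*p*s2.
# Assumption inferred from the printed numbers, not stated in the handout.

def cp(rss, p, s2):
    """Mallows' Cp on the RSS scale: RSS + 2 * p * s2."""
    return rss + 2 * p * s2

s2_full = 14.964          # residual mean square of vol ~ diam + height
s2_null = 8106.084 / 30   # residual mean square of vol ~ 1 (SST / 30)

# drop1 table (candidate models have 3, 2 and 2 coefficients):
cp_none = cp(418.985, 3, s2_full)
cp_drop_diam = cp(5153.089, 2, s2_full)
cp_drop_height = cp(524.303, 2, s2_full)

# add1 table (starting from vol ~ 1, which has 1 coefficient):
cp_null = cp(8106.084, 1, s2_null)
cp_add_height = cp(5153.089, 2, s2_null)
cp_add_diam = cp(524.303, 2, s2_null)
```

Each of these reproduces the corresponding Cp entry in the tables above to the printed precision.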
> try1_lm(vol~-1+height+diam)
> try2_lm(vol~height+I(diam^2))
> try3_lm(vol~poly(height,3)+diam)
> try4_lm(vol~height+diam+height*diam)
> try5_lm(vol~height*diam)

TO SEE WHAT THE FUNCTIONAL FORMULAS CORRESPONDING TO THOSE MODELS ARE, DO

> try1
Call: lm(formula = vol ~ -1 + height + diam)
Coefficients:
     height     diam
 -0.4772425 5.045026

Degrees of freedom: 31 total; 29 residual
Residual standard error: 6.180254

etc. for the other formulas.

RESIDUALS, PREDICTION AND DIAGNOSTICS

To obtain the predicted values of Y at different levels of the predictor variables, type

> trees.pred_predict(trees.fit)
> trees.pred
        1     2        3        4        5        6        7        8        9
 4.846472 4.538 4.790519 15.87116 19.90355 21.06068 16.15883 19.25145 21.43946
       10      11       12       13       14       15       16       17
 20.19122 22.0356 21.47461 21.47461 20.47889 23.95029 27.83562 31.61549
       18       19       20       21       22       23       24       25
 33.83865 30.56382 28.62833 34.37884 36.00586 35.35376 41.71477 44.84254
      26       27       28       29       30       31
 51.2595 52.19927 53.39155 53.86144 53.86144 68.48378

To obtain the residuals, type

> trees.res_residuals(trees.fit)
> trees.res
        1     2        3        4        5         6          7         8
 5.453528 5.762 5.409481 0.528838 -1.10355 -1.360683 -0.5588335 -1.051454
        9         10       11         12          13        14        15
  1.16054 -0.2912215 2.164397 -0.4746134 -0.07461338 0.8211071 -4.850291
       16       17        18        19        20        21        22        23
 -5.63562 2.184511 -6.438649 -4.863816 -3.728328 0.1211614 -4.305855 0.9462409
       24        25       26       27       28        29        30       31
-3.414766 -2.242539 4.140501 3.500734 4.908448 -2.361436 -2.861436 8.516217

The "lm.influence" function will give us a series of diagnostics that will allow us to identify influential points and outliers.
> trees.infl_lm.influence(trees.fit)
> trees.infl
$coefficients:
   (Intercept)     diam    height
 1   -60.09476 4.796622 0.3487896
 2   -62.67536 4.756018 0.3895576
 3   -63.50146 4.732589 0.4045843
 4   -58.33777 4.703328 0.3443157
 5   -58.69316 4.676459 0.3544425
 6   -59.04995 4.666995 0.3609355
 7   -57.78942 4.699801 0.3382243
 8   -58.16537 4.688583 0.3453262
 9   -57.81670 4.717713 0.3346521
10   -58.19415 4.696293 0.3440208
11   -57.66866 4.728373 0.3303984
12   -58.21736 4.694558 0.3447100
13   -58.20913 4.698164 0.3437952
14   -58.62811 4.697775 0.3489647
15   -57.86816 4.675723 0.3453260
16   -57.25456 4.703564 0.3327459
17   -56.69993 4.729898 0.3173171
18   -63.14630 4.611026 0.4270661
19   -56.15037 4.741234 0.3113916
20   -54.02946 4.779170 0.2766818
21   -58.19725 4.698693 0.3434604
22   -59.17787 4.697425 0.3585450
23   -58.41670 4.690596 0.3473880
24   -56.69752 4.766783 0.3135564
25   -58.07645 4.730350 0.3374313
26   -57.00878 4.645326 0.3352492
27   -57.21119 4.649739 0.3374322
28   -57.62016 4.605748 0.3497879
29   -58.48353 4.745032 0.3403301
30   -58.54195 4.754813 0.3396326
31   -52.55626 4.466071 0.3051865

$sigma:
        1        2        3        4       5        6        7        8
 3.777861 3.751928 3.768619 3.937887 3.93277 3.928965 3.937627 3.933802
        9       10       11       12       13       14       15       16
 3.932307 3.938866 3.915439 3.938173 3.939258 3.935865 3.822641 3.781275
       17       18       19       20       21       22       23       24
   3.9134 3.705069 3.818223 3.855479 3.939214 3.846929 3.934851 3.877037
       25      26       27     28       29       30     31
 3.913791 3.84962 3.875052 3.8104 3.909718 3.895794 3.4709

$hat:
 [1] 0.11576594 0.14666902 0.17612103 0.05917429 0.12053469 0.15550541
 [7] 0.11465323 0.05147098 0.09185016 0.04795754 0.07368613 0.04805058
[13] 0.04805058 0.07280583 0.03763485 0.03575893 0.13047087 0.14243230
[19] 0.06704405 0.21188939 0.03569177 0.04508901 0.05028648 0.11237164
[25] 0.06966646 0.09083819 0.09572463 0.10661936 0.11005460 0.11005460
[31] 0.22607746

The $coefficients component gives us the coefficients when all points except the i-th point are considered; similarly, $sigma gives us the estimate of sigma when the i-th point is
excluded from the data. The $hat component gives us the diagonal elements of the hat matrix. You can use these and other quantities to compute different diagnostics. For example, to compute the standardized residuals, do the following

> s_summary(trees.fit)$sigma
> h_trees.infl$hat
> trees.res_trees.res/(s*sqrt(1-h))

You will notice that we have given the same name to the residuals and the standardized residuals. This is OK if you keep track of what you do, because now trees.res does not denote residuals but standardized residuals.

To compute the studentized residuals, do

> e_trees.fit$residuals
> si_trees.infl$sigma
> trees.stud_e/(si*(1-h)^.5)
> trees.stud
        1        2      3         4          5          6          7          8
  1.53514 1.662494 1.5814 0.1384538 -0.2992155 -0.3768602 -0.1508312 -0.2744432
        9          10       11         12          13        14        15
0.3096947 -0.07577466 0.574351 -0.1235203 -0.01941312 0.2166577 -1.293404
       16       17        18        19        20        21        22
-1.517786 0.598629 -1.876568 -1.318819 -1.089287 0.0313218 -1.145417
       23         24         25       26        27       28         29
0.2467614 -0.9348573 -0.5940501 1.128014 0.9500167 1.362873 -0.6402492
        30       31
-0.7785855 2.789047

To compute the DFBETAS, i.e. the scaled differences between the betas with all points and the betas without the i-th point, do

> summary(trees.fit)$cov.unscaled
            (Intercept)         diam       height
(Intercept)  4.91969725  0.029259817 -0.069379559
       diam  0.02925982  0.004663833 -0.001197493
     height -0.06937956 -0.001197493  0.001121161

This gives us the matrix (X'X)^-1.
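As a cross-check outside S-plus, the hat diagonal and both kinds of residuals can be recomputed from scratch. Below is a NumPy sketch (variable names are mine); instead of refitting the model 31 times, it gets the leave-one-out sigma from the standard closed-form identity (n-p-1) s_(i)^2 = (n-p) s^2 - e_i^2/(1-h_i):

```python
import numpy as np

# Black cherry data from the handout: columns diam, height, vol.
data = np.array([
    [8.3, 70, 10.3], [8.6, 65, 10.3], [8.8, 63, 10.2], [10.5, 72, 16.4],
    [10.7, 81, 18.8], [10.8, 83, 19.7], [11.0, 66, 15.6], [11.0, 75, 18.2],
    [11.1, 80, 22.6], [11.2, 75, 19.9], [11.3, 79, 24.2], [11.4, 76, 21.0],
    [11.4, 76, 21.4], [11.7, 69, 21.3], [12.0, 75, 19.1], [12.9, 74, 22.2],
    [12.9, 85, 33.8], [13.3, 86, 27.4], [13.7, 71, 25.7], [13.8, 64, 24.9],
    [14.0, 78, 34.5], [14.2, 80, 31.7], [14.5, 74, 36.3], [16.0, 72, 38.3],
    [16.3, 77, 42.6], [17.3, 82, 55.4], [17.5, 82, 55.7], [17.9, 80, 58.3],
    [18.0, 80, 51.5], [18.0, 80, 51.0], [20.6, 87, 77.0]])
diam, height, vol = data.T

n = len(vol)
X = np.column_stack([np.ones(n), diam, height])  # design matrix with intercept
p = X.shape[1]

XtX_inv = np.linalg.inv(X.T @ X)   # (X'X)^-1, cf. summary()$cov.unscaled
h = np.diag(X @ XtX_inv @ X.T)     # hat-matrix diagonal, cf. $hat

beta = XtX_inv @ X.T @ vol
e = vol - X @ beta                 # ordinary residuals
s2 = e @ e / (n - p)               # residual mean square
s = np.sqrt(s2)                    # cf. summary()$sigma

r_std = e / (s * np.sqrt(1 - h))   # standardized residuals

# Leave-one-out sigma without refitting, cf. $sigma:
si = np.sqrt(((n - p) * s2 - e**2 / (1 - h)) / (n - p - 1))
r_stud = e / (si * np.sqrt(1 - h))  # studentized residuals, cf. trees.stud
```

The results agree with the S-plus output above, e.g. h[0] with the first entry of $hat, si[30] with the last entry of $sigma, and r_stud[30] with the last studentized residual.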
> coef(trees.infl)[1,]
(Intercept)     diam    height
  -60.09476 4.796622 0.3487896

> coef(trees.fit)
(Intercept)     diam    height
  -58.20759 4.698837 0.3436245

> b_coef(trees.fit)
> bi_coef(trees.infl)[1,]
> si_trees.infl$sigma
> xxi_summary(trees.fit)$cov.unscaled
> si1_si[1]
> cc1_sqrt(xxi[1,1])
> dfbeta1_(bi[1]-b[1])/(si1*cc1)

And we obtain the value of dfbeta for the intercept coefficient:

> dfbeta1
(Intercept)
 -0.2252136

GRAPHS

To do a graph in S-plus, you have to open a special window; type

> motif()
> par(mfrow=c(2,1))
> attach(trees.dat)
> plot(diam,trees.res,xlab="diametre",ylab="Std.residuals")
> abline(h=0,lty=2)
> title("Std Residuals v diametre")
> plot(height,trees.res,xlab="height",ylab="Std.Residuals")
> abline(h=0,lty=2)
> title("Std Residuals v height")

These commands give us the graph of the residuals vs the predictor variables. To do the graph of the residuals vs the fitted values, type

> par(mfrow=c(1,1))
> plot(trees.pred,trees.res,xlab="predicted volume",ylab="Std.Residuals")
> abline(h=0,lty=2)
> title("Residuals vs fitted values")
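If you would rather reproduce the two-panel residual plot outside S-plus, here is a matplotlib sketch of the same graph (a hedged illustration: the output file name and all variable names are mine, and the standardized residuals are recomputed from the data rather than taken from the S-plus session):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # draw off-screen; no motif()-style window needed
import matplotlib.pyplot as plt

# Black cherry data from the handout: columns diam, height, vol.
data = np.array([
    [8.3, 70, 10.3], [8.6, 65, 10.3], [8.8, 63, 10.2], [10.5, 72, 16.4],
    [10.7, 81, 18.8], [10.8, 83, 19.7], [11.0, 66, 15.6], [11.0, 75, 18.2],
    [11.1, 80, 22.6], [11.2, 75, 19.9], [11.3, 79, 24.2], [11.4, 76, 21.0],
    [11.4, 76, 21.4], [11.7, 69, 21.3], [12.0, 75, 19.1], [12.9, 74, 22.2],
    [12.9, 85, 33.8], [13.3, 86, 27.4], [13.7, 71, 25.7], [13.8, 64, 24.9],
    [14.0, 78, 34.5], [14.2, 80, 31.7], [14.5, 74, 36.3], [16.0, 72, 38.3],
    [16.3, 77, 42.6], [17.3, 82, 55.4], [17.5, 82, 55.7], [17.9, 80, 58.3],
    [18.0, 80, 51.5], [18.0, 80, 51.0], [20.6, 87, 77.0]])
diam, height, vol = data.T

# Refit vol ~ diam + height and form standardized residuals.
X = np.column_stack([np.ones(len(vol)), diam, height])
beta, *_ = np.linalg.lstsq(X, vol, rcond=None)
e = vol - X @ beta
h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)
s = np.sqrt(e @ e / (len(vol) - X.shape[1]))
r_std = e / (s * np.sqrt(1 - h))

# Two stacked panels, like par(mfrow=c(2,1)).
fig, (ax1, ax2) = plt.subplots(2, 1)
ax1.scatter(diam, r_std)
ax1.axhline(0, linestyle="--")  # counterpart of abline(h=0,lty=2)
ax1.set(xlabel="diametre", ylabel="Std. residuals",
        title="Std Residuals v diametre")
ax2.scatter(height, r_std)
ax2.axhline(0, linestyle="--")
ax2.set(xlabel="height", ylabel="Std. residuals",
        title="Std Residuals v height")
fig.tight_layout()
fig.savefig("residual_plots.png")
```

The residuals-vs-fitted plot from the last block of S-plus commands can be produced the same way with `X @ beta` on the horizontal axis.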