R Beginner’s Quick Book Part-2
Linear Regression
We start with basic linear regression, where y is a scalar dependent variable and x1,x2,x3 are independent variables. We start by plotting the variables.
plot(y ~ x1, data)
plot(y ~ x2, data)
…
Once we have identified the variables to include in the model, we fit it:
fit = lm(y ~ x1 + x2 + x3, data)
fitsum = summary(fit)
Individual components of the fit can be extracted from the object fitsum. To see what is available in an object:
names(fitsum)
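The most important items are
fitsum$coefficients
fitsum$residuals
The fitted values are not stored in the summary; they come from the model object itself:
fit$fitted.values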
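A minimal sketch of the workflow above on simulated data. The data-generating step and the names n, coef_table, res, fitted_vals and r2 are assumptions for illustration, not part of the original example. The fitted values are taken from the lm object rather than from its summary.

```r
# Simulated data (an assumption for illustration only)
set.seed(1)
n <- 100
data <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
data$y <- 1 + 2 * data$x1 - 0.5 * data$x2 + rnorm(n)

# Fit the model and summarise it
fit <- lm(y ~ x1 + x2 + x3, data = data)
fitsum <- summary(fit)

names(fitsum)                      # components stored in the summary object
coef_table <- fitsum$coefficients  # estimates, std. errors, t values, p-values
res <- fitsum$residuals            # residuals
fitted_vals <- fitted(fit)         # fitted values, from the lm object
r2 <- fitsum$r.squared             # proportion of variance explained
```

Residuals plus fitted values reproduce the observed y, which is a quick sanity check on any fit.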
ANOVA
Analysis of variance can be carried out with the same lm function; the only extra step is to declare the predictors as factors. We start with a quantitative variable y and factors x1 and x2.
data$x1 = factor(data$x1)
data$x2 = factor(data$x2)
anovafit = lm(y ~ x1 + x2, data)
summary(anovafit)
If we need interactions, we write y ~ x1 + x2 + x3 + x2:x3, or simply y ~ x1 + x2*x3.
In most problems we encounter a combination of quantitative and categorical predictors, a setting commonly known as ANCOVA (analysis of covariance). If x1 is quantitative and x2 is categorical, we simply write
model = lm(y ~ x1 + factor(x2), data)
summary(model)
This will give us the desired output.
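As an illustration of the formulas above, here is a small sketch on two datasets that ship with R (warpbreaks and mtcars). The choice of datasets and the object names additive, interact and ancova are assumptions made for this example only.

```r
# Two-way ANOVA: wool and tension are already factors in warpbreaks
additive <- lm(breaks ~ wool + tension, data = warpbreaks)

# wool * tension expands to wool + tension + wool:tension
interact <- lm(breaks ~ wool * tension, data = warpbreaks)
anova(interact)    # sequential ANOVA table, one row per term

# ANCOVA: quantitative wt plus categorical cyl (converted with factor())
ancova <- lm(mpg ~ wt + factor(cyl), data = mtcars)
summary(ancova)
```

A factor with k levels contributes k - 1 coefficients, so the interaction model has 6 coefficients where the additive one has 4.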
Logistic Regression
For the next method, we need a response variable that is 0-1 valued. The model is no longer linear: it is a generalized linear model from the binomial family with the canonical logit link function. We have a binary variable y and a set of predictors x1, x2, x3 (some of which may be categorical themselves).
logist = glm(y ~ x1 + x2 + factor(x3), family = binomial("logit"))
summary(logist)
If instead the response y comes in the form of success-failure counts, we combine the two counts into a two-column matrix and use that as the response.
y = cbind(suc, fail) # suc is the number of 1's (successes), fail the number of 0's (failures)
logist = glm(y ~ factor(x1) + factor(x2) + factor(x3), family = binomial("logit"))
summary(logist)
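Both forms of the response can be tried on simulated data. In the sketch below, the data-generating steps and the names dose, trials, probs and grouped are hypothetical and serve only to make the example runnable.

```r
set.seed(2)

# Binary (0-1) response with one quantitative and one categorical predictor
n <- 200
x1 <- rnorm(n)
x3 <- sample(c("a", "b", "c"), n, replace = TRUE)
y <- rbinom(n, 1, plogis(-0.5 + 1.2 * x1))
logist <- glm(y ~ x1 + factor(x3), family = binomial("logit"))
probs <- predict(logist, type = "response")   # fitted probabilities

# Grouped response: successes and failures out of 40 trials per dose level
dose <- 1:5
trials <- rep(40, 5)
suc <- rbinom(5, trials, plogis(-2 + 0.7 * dose))
fail <- trials - suc
grouped <- glm(cbind(suc, fail) ~ dose, family = binomial("logit"))
summary(grouped)
```

predict(..., type = "response") returns probabilities; the default type returns values on the logit scale.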
Testing sub-models
R lets us compare nested models (be it linear regression, logistic regression or ANOVA). E.g. x1 + x2 is a sub-model of x1 + x2 + x3. So we can fit the two models and look at the difference between their residual deviances. For a glm, this difference is compared with a chi-square distribution whose degrees of freedom equal the number of parameters dropped; for an lm, the difference first has to be scaled by the estimated error variance (which is what the F-test does). This tells us whether x3 can be dropped or not.
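The comparison can be sketched with the anova function, which accepts two nested fits. The datasets (mtcars and a simulated data frame sim) and all object names below are assumptions for illustration.

```r
# Linear models: F-test for dropping qsec
reduced <- lm(mpg ~ wt + hp, data = mtcars)
full <- lm(mpg ~ wt + hp + qsec, data = mtcars)
f_tab <- anova(reduced, full)

# Logistic models on simulated data: chi-square test on the deviance difference
set.seed(10)
n <- 200
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
sim$y <- rbinom(n, 1, plogis(0.5 * sim$x1 - 0.8 * sim$x2))
g_reduced <- glm(y ~ x1 + x2, family = binomial("logit"), data = sim)
g_full <- glm(y ~ x1 + x2 + x3, family = binomial("logit"), data = sim)

# By hand: deviance difference against a chi-square, upper tail
dev_diff <- deviance(g_reduced) - deviance(g_full)
df_diff <- df.residual(g_reduced) - df.residual(g_full)
p_manual <- pchisq(dev_diff, df = df_diff, lower.tail = FALSE)

# The same test through anova()
chisq_tab <- anova(g_reduced, g_full, test = "Chisq")
```

A large p-value suggests that the dropped term adds little and the sub-model is adequate.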
Collapsing two or more groups:
Suppose that in an ANOVA model we want to test whether two levels of the factor x2, say level 1 and level 4, differ. The factor has 4 levels in all. In a sub-model we collapse the two levels into one, and then test that sub-model against the full model:
model1 = lm(y ~ x1 + I(x2 == 2) + I(x2 == 3) + x3, data)
model2 = lm(y ~ x1 + x2 + x3, data)
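and then
pchisq((sum(resid(model1)^2) - sum(resid(model2)^2)) / summary(model2)$sigma^2, 1, lower.tail = FALSE)
will yield an approximate p-value. The difference in residual sums of squares is divided by the estimated error variance so that it can be compared with a chi-square distribution with 1 degree of freedom, and the upper tail is taken because large differences count against the collapsed model.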
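The collapsed-versus-full comparison can also be run through anova, which gives the exact F-test for linear models. The simulated data and the helper column x2c below are hypothetical; levels 1 and 4 are given the same effect on purpose.

```r
set.seed(3)
n <- 120
data <- data.frame(x1 = rnorm(n),
                   x2 = factor(rep(1:4, each = 30)),
                   x3 = rnorm(n))
# Level effects 0, 1, 2, 0: levels 1 and 4 truly coincide
data$y <- 1 + data$x1 + c(0, 1, 2, 0)[as.integer(data$x2)] + rnorm(n)

# Sub-model: indicators for levels 2 and 3 only, so {1, 4} form the baseline
model1 <- lm(y ~ x1 + I(x2 == 2) + I(x2 == 3) + x3, data = data)
# Full model: all four levels
model2 <- lm(y ~ x1 + x2 + x3, data = data)
tab <- anova(model1, model2)    # F-test on 1 degree of freedom

# Equivalent sub-model: merge the two levels by giving them the same name
data$x2c <- data$x2
levels(data$x2c) <- c("1or4", "2", "3", "1or4")
model1b <- lm(y ~ x1 + x2c + x3, data = data)
```

Merging levels through levels() scales better than writing indicator terms when a factor has many levels.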
Classification Trees
Classification differs from predicting a quantitative response: we have to assign class labels to new cases based on their predictors. Suppose we have a set of predictors x1, x2, x3 and a class variable y with 4 labels.
library(rpart)
tree = rpart(y ~ x1 + x2 + x3, data, method = "class")
We can tune the fit through the control argument, e.g. control = rpart.control(...), with settings such as
minsplit, minbucket, cp, maxdepth etc.
These set the stopping rules and cutoffs used when growing the tree. The method can also be "anova" (for a regression tree), "exp" (for survival data) etc.
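A short sketch of a classification tree with explicit control settings, using the built-in iris data (3 classes rather than 4). The dataset and the particular control values are assumptions chosen for illustration, not recommendations.

```r
library(rpart)

# Grow a classification tree with explicit stopping rules
tree <- rpart(Species ~ ., data = iris, method = "class",
              control = rpart.control(minsplit = 10,  # min cases to attempt a split
                                      minbucket = 5,  # min cases in a leaf
                                      cp = 0.01,      # complexity cutoff
                                      maxdepth = 4))  # max depth of the tree
printcp(tree)    # complexity table, useful for pruning

# Predicted class labels (here on the training data itself)
pred <- predict(tree, newdata = iris, type = "class")
table(predicted = pred, actual = iris$Species)
```

Accuracy measured on the training data is optimistic; a held-out test set gives a fairer picture.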
Nearest Neighbor Classification
Another classification tool is the nearest neighbor classification.
library(class)
Let train denote the training set and test the test set (both containing explanatory variables only). Also, cl denotes the class labels for the training set, and k denotes the number of nearest neighbors that vote on each test case.
classifier = knn(train, test, cl, k)
summary(classifier)
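The call above can be tried on the built-in iris data. The random train/test split and the choice k = 5 are assumptions for this sketch.

```r
library(class)

set.seed(4)
idx <- sample(nrow(iris), 100)     # 100 training rows, 50 test rows
train <- iris[idx, 1:4]            # explanatory variables only
test <- iris[-idx, 1:4]
cl <- iris$Species[idx]            # class labels of the training rows

# Each test case gets the majority label among its 5 nearest training cases
classifier <- knn(train, test, cl, k = 5)
summary(classifier)                # number of test cases assigned to each class
table(predicted = classifier, actual = iris$Species[-idx])
```

Because knn works on Euclidean distances, predictors on very different scales are usually standardized first, e.g. with scale().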
On some problems this can be a more robust classifier than CART.
A very powerful technique known as boosting is now widely used to bolster the performance of classifiers. R functions for boosting are available in the add-on package ada.
Clustering
A clustering tool is needed when the data carry no labels or groups, yet we want to divide the observations into meaningful groups that are not specified in advance. This is known as clustering. The clustering can be based on k-means, which, like nearest neighbor classification above, relies on distances between observations.
kmeans(x, centers, iter.max = 10, nstart = 1)
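A minimal k-means sketch on the four numeric columns of iris. The dataset, the scaling step, and the values centers = 3 and nstart = 20 are assumptions for illustration.

```r
set.seed(5)
x <- scale(iris[, 1:4])            # standardize so no variable dominates

# nstart = 20 tries 20 random starts and keeps the best solution
km <- kmeans(x, centers = 3, iter.max = 50, nstart = 20)

km$size                            # number of observations per cluster
km$centers                         # cluster means (on the scaled variables)
table(cluster = km$cluster, species = iris$Species)
```

The result depends on the random starting centers, so setting a seed and using several starts makes the output reproducible and more stable.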
Other clustering methods are generally hierarchical, in the sense that they are either top-down or bottom-up. Various choices are
agnes # agglomerative nesting (bottom-up), package cluster
diana # divisive analysis (top-down), package cluster
hclust # agglomerative hierarchical clustering (bottom-up), package stats
These clusterings are displayed graphically as a dendrogram, e.g. by calling plot on an hclust object or on the result of as.dendrogram.
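The three hierarchical methods can be sketched on the scaled iris measurements. The dataset, the average linkage and the cut into 3 groups are assumptions for this example; agnes and diana require the cluster package.

```r
library(cluster)                   # provides agnes() and diana()

x <- scale(iris[, 1:4])

# Bottom-up clustering with hclust on a distance matrix
hc <- hclust(dist(x), method = "average")
plot(hc, labels = FALSE)           # draws the dendrogram
groups <- cutree(hc, k = 3)        # cut the tree into 3 clusters

# Bottom-up (agnes) and top-down (diana) from the cluster package
ag <- agnes(x, method = "average")
di <- diana(x)

# Both convert to hclust objects, and from there to dendrograms
plot(as.dendrogram(as.hclust(di)))
```

cutree turns any of these trees into flat cluster labels that can be compared with a k-means solution.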