R-Beginner’s Quick Book Part-1
History of R
R is derived from S, a language developed at Bell Laboratories by John Chambers, Rick Becker and colleagues in the 1970s, and from its commercial version S-PLUS. The R project was initiated by Robert Gentleman and Ross Ihaka at the University of Auckland, New Zealand, in the early 1990s. R is a hybrid of the languages S and Scheme.
The first R source code was released under the GNU General Public License (GPL) in 1995. In 1997, the Comprehensive R Archive Network (CRAN) was established.
Salient features of R
- Open Source Statistics Package (Free/Flexible)
- Object Oriented Programming Language (to some extent)
- Distributed through CRAN, with packages contributed by ANYONE
- Platform independent (Unix, Linux, MS Windows, macOS)
What you can do with R
R can be used for all kinds of statistical analysis, including (but not limited to) hypothesis testing, ANOVA, linear regression, logistic regression, Bayesian analysis, optimization, machine learning (neural networks, support vector machines), modeling and simulation, and reporting (charts and plots).
Why is R so popular?
- Versatility : Easy to write new programs according to the task in hand
- Interactivity : Very user friendly; R runs one command at a time, so the user can check and correct the code at each step.
- Research-oriented : Although SAS is the most widely used package overall, R is particularly well suited to research.
- FREE !!!
Comparison of statistical software
Feature | R | SAS | Excel | SPSS |
Cost | Free | Commercial: ~$6000 per seat (PC version) / ~$28K per processor | $150 Office 2007 | Commercial: $ 1599 |
Open source | Yes | No | No | No |
Interface | CLI/GUI (limited) | CLI/GUI | GUI | CLI/GUI |
Data set limit | Large (some tests may be restricted by data size) | Large | Rows: 1,048,576; Cols: 16,384 | Large |
Installation procedure
- Go to http://www.r-project.org/ to download R from your nearest CRAN mirror.
- Download the BASE distribution. It covers most day-to-day analytics requirements.
- For specialized analyses, download CONTRIBUTED packages such as:
− RODBC: ODBC database connectivity (can also be used to connect to Excel)
− nortest: Tests for normality
− neural: Neural networks
− nleqslv: Solves systems of nonlinear equations
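A contributed package is usually installed once with install.packages() and then loaded in each session with library(). The sketch below wraps both steps in a small helper. The name ensure_package is hypothetical (it is not part of R), and the download step only runs when the package is missing.

```r
# Hypothetical helper: load a package, installing it from CRAN first if needed.
ensure_package <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)            # downloads from the configured CRAN mirror
  }
  library(pkg, character.only = TRUE)
  invisible(TRUE)
}

ensure_package("stats")      # ships with R, so nothing is downloaded
# ensure_package("nortest")  # would install the normality-test package if missing
```

Checking with requireNamespace() first avoids reinstalling a package on every run.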
Getting started
- Initiation:
- File > Source R code : runs the code stored in an existing script file.
- File > New script : write your own code.
- File > Open script : opens code saved from past sessions.
- File > Change dir : sets the working directory. Any folder can be used; data files are read from it and output files are stored in it.
- Reading data :
- Data can be read from any location using the read.table command.
- If the file is in the working directory, the file name alone is enough; the full path is not required.
- If the data set ships with a package, it can be loaded with the function data().
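The steps above can be sketched in code. This example uses a temporary folder as the working directory and first writes a small file so there is something to read. The file name scores.txt and its contents are made up for illustration; mtcars is a data set that ships with R.

```r
# Use a temporary folder as the working directory for this sketch.
old_dir <- setwd(tempdir())
getwd()                                   # shows the current working directory

# Create a small example file (made-up data) so there is something to read.
write.table(data.frame(id = 1:3, score = c(2.5, 3.1, 4.8)),
            file = "scores.txt", row.names = FALSE)

# The file is in the working directory, so the file name alone is enough.
scores <- read.table("scores.txt", header = TRUE)
print(scores)

# Data sets that ship with a package are loaded with data().
data(mtcars)
head(mtcars, 3)

setwd(old_dir)                            # restore the previous working directory
```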
R functions
- R is flexible – write your own function or use in-built functions.
- Use the command window.
- To open the help page of a function whose name you know, use ?, e.g. ?solve
- If you are not sure of the name, search the help system by keyword with help.search("…").
- Load the corresponding package and work with the function.
- Use library() to see what packages you have.
- Numerical Summary :
- Functions for basic use —
summary mean median range quantile $ var cor sort
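A short illustration of these summary functions, using the built-in mtcars data set as a stand-in for your own data:

```r
data(mtcars)                 # built-in example data set
mpg <- mtcars$mpg            # $ extracts one column from a data frame

summary(mpg)                 # minimum, quartiles, mean, maximum
mean(mpg); median(mpg)
range(mpg)                   # smallest and largest value
quantile(mpg, c(0.1, 0.9))   # 10th and 90th percentiles
var(mpg)                     # sample variance
cor(mtcars$mpg, mtcars$wt)   # correlation between two columns
sort(mpg)[1:5]               # five smallest values
```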
Graphical Representation
- Histograms and Boxplots (One variable):
hist(filename, main="…", xlab="…", ylab="…")
boxplot(filename, main="…", xlab="…")
- Scatterplots:
plot(var1 ~ var2, data=filename, xlab="…", ylab="…")
- We can put several plots in the same display using the par(mfrow=…) command.
- Three or more variables can also be plotted together using pairs(data).
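The plotting calls above might look as follows with concrete arguments. This is a sketch on the built-in mtcars data set; the titles and axis labels are arbitrary choices.

```r
data(mtcars)

# One variable: histogram and boxplot
h <- hist(mtcars$mpg, main = "Fuel economy", xlab = "Miles per gallon", ylab = "Count")
boxplot(mtcars$mpg, main = "Fuel economy", xlab = "All cars")

# Two variables: scatterplot using the formula interface (y ~ x)
plot(mpg ~ wt, data = mtcars, xlab = "Weight", ylab = "Miles per gallon")

# Two plots side by side, then restore the previous layout
old_par <- par(mfrow = c(1, 2))
hist(mtcars$mpg, main = "mpg")
hist(mtcars$wt, main = "wt")
par(old_par)

# Scatterplot matrix for several variables at once
pairs(mtcars[, c("mpg", "wt", "hp", "disp")])
```

par() returns the previous settings, so saving them in old_par makes it easy to go back to one plot per display.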
Choosing subsets of data
Rows, columns and elements of data
x = data[3,]
y = data[,7]
a = data[2,4]
Suppose we want to choose rows 3,5,7 or rows 5 to 11
data1 = data[c(3,5,7),]
data2 = data[5:11,]
Columns are selected the same way. A negative index drops rows instead of selecting them:
data3 = data[-c(1:3,6),] # all rows except rows 1, 2, 3 and 6
Subsets based on some condition
data4 = data[y > 0,] or data4 = data[data$V1 > 0,] # the comma is needed to select rows
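The same indexing patterns, applied to the built-in mtcars data set so that they can be run as they are. The object names and the threshold 3.5 are arbitrary.

```r
data(mtcars)
dat <- mtcars                      # 32 rows, 11 columns

row3    <- dat[3, ]                # third row (all columns)
col7    <- dat[, 7]                # seventh column, returned as a vector
elem    <- dat[2, 4]               # row 2, column 4
some    <- dat[c(3, 5, 7), ]       # rows 3, 5 and 7
block   <- dat[5:11, ]             # rows 5 to 11
dropped <- dat[-c(1:3, 6), ]       # everything except rows 1, 2, 3 and 6
heavy   <- dat[dat$wt > 3.5, ]     # rows meeting a condition (note the comma)

dim(dropped)                       # number of rows and columns that remain
```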
Matrix manipulations
One of the main strengths of R lies in the fluent computation of matrix operations.
I = diag(10) # creates identity matrix
A = as.matrix(data) # converts a data frame to a matrix
x= cbind(1,A[,2:5])
c = t(x) %*% x # computes x’x
d = t(x) %*% data[,7]
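beta = solve(c,d) # or, less efficiently: beta = solve(c) %*% d
Other matrix operators and functions include
+ - * %*% dim ncol nrow eigen qr (qr(A)$rank gives the matrix rank; rank() computes sample ranks) %x%/kronecker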
Sub-setting and augmenting matrices are also easy, e.g.
x= cbind(1,A[,2:5])
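The matrix steps above amount to solving the least-squares normal equations $(X'X)\beta = X'y$. A minimal sketch on the built-in mtcars data set follows; the choice of response (mpg) and predictors (wt, hp) is only for illustration, and lm() is used as a cross-check.

```r
data(mtcars)
A <- as.matrix(mtcars)

X <- cbind(1, A[, c("wt", "hp")])  # design matrix with an intercept column
y <- A[, "mpg"]

XtX  <- t(X) %*% X                 # X'X
Xty  <- t(X) %*% y                 # X'y
beta <- solve(XtX, Xty)            # solves (X'X) beta = X'y

# Cross-check against R's built-in linear model fit
fit <- lm(mpg ~ wt + hp, data = mtcars)
cbind(normal_equations = as.vector(beta), lm = unname(coef(fit)))
```

Calling solve(XtX, Xty) solves the linear system directly, which is generally preferable to forming the inverse with solve(XtX) and multiplying.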
Factors (Categorical Variables)
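- The treatment of factors is different from the treatment of quantitative variables.
- E.g. taking the mean of ZIP codes does not make sense.
- Before R 4.0.0, read.table and data.frame converted character strings to factors automatically. From R 4.0.0 onward this is no longer the default, so use stringsAsFactors = TRUE or convert explicitly.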
- For numerical ones,
x = as.factor(x) # transforms into factors
- To make contingency tables,
table(x,y) # prepares joint frequencies for different values of factors x and y
factor(z) # converts into factors
levels(z) # gives all the different levels of z
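A small worked example of these factor functions. The ZIP codes and answers below are made-up values used only to show the mechanics.

```r
# Made-up survey data: ZIP codes are labels, not quantities
zip    <- c(10001, 10001, 94105, 60601, 94105, 10001)
answer <- c("yes", "no", "yes", "yes", "no", "yes")

zip_f    <- as.factor(zip)         # numeric codes become categories
answer_f <- factor(answer)         # character strings become categories

levels(zip_f)                      # the distinct categories, stored as text
table(zip_f)                       # frequency of each category
tab <- table(zip_f, answer_f)      # joint frequencies (contingency table)
tab
```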
Simulations
- R is well suited to simulating from different distributions and to computing quantiles, p-values, density functions etc.
- Suppose we want to generate 500 normal random variables with mean 4.3 and standard deviation 2:
x = rnorm(500,4.3,2)
Similarly, we can generate from other distributions
rexp rbeta rgamma rbinom rmultinom runif rchisq rpois rf rt
- Next, R can evaluate the CDF of a given distribution, with given parameters, at every value of a vector.
pbeta(c(0,0.2,0.4,0.6,0.8,1),3,3) # CDF of the Beta(3,3) distribution at the given values
- We can compute quantiles as well. Check,
qgamma(c(0,0.2,0.4,0.6,0.8,1),3,1)
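The distribution functions follow a naming pattern: the prefixes r, p, q and d give random draws, the CDF, the quantile function and the density. A brief sketch, where the seed and parameter values are arbitrary:

```r
set.seed(1)                               # makes the random draws reproducible

x <- rnorm(500, mean = 4.3, sd = 2)       # 500 normal draws
mean(x); sd(x)                            # should be close to 4.3 and 2

u <- runif(5)                             # uniform draws on (0, 1)
k <- rbinom(5, size = 10, prob = 0.3)     # binomial counts out of 10 trials

p <- pbeta(0.5, 3, 3)                     # P(X <= 0.5) for X ~ Beta(3, 3)
q <- qgamma(0.5, shape = 3, rate = 1)     # median of a Gamma(shape 3, rate 1)
pgamma(q, shape = 3, rate = 1)            # the CDF undoes the quantile function
```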
Loops
In R, a loop is written with a for, while or repeat statement followed by a body in {}.
Suppose we want to simulate 10000 random normal variables and check the consistency of the sample mean and the sample variance.
x = rnorm(10000,5,2) # iid random normal variables
xbar = rep(0,10000) # initialization of the running means
vari = rep(0,10000) # initialization of the running variances (the first one is NA)
for (i in (1:10000))
{
xbar[i] = mean(x[1:i])
vari[i] = var(x[1:i])
}
plot(xbar,type = "l") # type = "l" draws a line
plot(vari,type = "l")
- Convergence will be clear from those plots.
- Similarly we can write while loops…
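As one possible illustration of a while loop, the sketch below searches for the first sample size at which the running mean is within a tolerance of the true mean. The seed and the tolerance 0.01 are arbitrary assumptions. The last line shows that the running means from the for loop above can also be obtained without a loop.

```r
set.seed(42)
x <- rnorm(10000, mean = 5, sd = 2)

# while loop: stop at the first i whose running mean is within 0.01 of 5,
# or at the end of the sample if that never happens
i <- 1
while (i < length(x) && abs(mean(x[1:i]) - 5) >= 0.01) {
  i <- i + 1
}
i

# Vectorized alternative to the for loop: all running means at once
xbar_fast <- cumsum(x) / seq_along(x)
```

The vectorized form is usually much faster in R than recomputing mean(x[1:i]) inside a loop.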
Write your own function
- You can write your own functions and reuse them in future work.
- Suppose we want to write the density function of a gamma distribution :
gdensity = function(x,r,lambda) {(lambda^r)*exp(-lambda*x)*(x^(r-1))/gamma(r)}
- To write more complicated functions :
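gdensity = function(x,r,lambda){
# x must be a single number here: if() does not accept a vector condition
if (x<0) return(0) # the density is zero for negative x
a = lambda^r
b = lambda*x
c = exp(-b)
d = x^(r-1)
e = gamma(r)
s = a*c*d/e
return(s)
}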
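A quick way to check a hand-written function is to compare it with a built-in equivalent. The sketch below repeats the one-line gamma density and compares it with R's dgamma(), treating r as the shape and lambda as the rate. The test points are arbitrary.

```r
# Gamma density with shape r and rate lambda, written by hand
gdensity <- function(x, r, lambda) {
  (lambda^r) * exp(-lambda * x) * x^(r - 1) / gamma(r)
}

x <- c(0.5, 1, 2, 4)
ours    <- gdensity(x, r = 3, lambda = 1.5)
builtin <- dgamma(x, shape = 3, rate = 1.5)
all.equal(ours, builtin)   # compares the two vectors up to numerical tolerance
```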
Merging/joins
- In R, merging/joins can be accomplished by the function merge.
- Syntax:
merge(x, y, by = intersect(names(x), names(y)), by.x = by, by.y = by, all = FALSE, all.x = all, all.y = all, ...)
- x, y = data frames, or objects to be coerced to one
- by, by.x, by.y = specifications of the common columns
- all.x = logical; if TRUE, extra rows are added to the output, one for each row in x that has no matching row in y. These rows have NAs in the columns that are usually filled with values from y. The default is FALSE, so only rows with data from both x and y are included in the output.
- all.y = logical; the same as all.x, but for rows in y that have no matching row in x.
- Join types: all.x = TRUE gives a left outer join, all.y = TRUE a right outer join, and all = TRUE a full outer join.
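The join types can be compared side by side on two tiny tables. The data frames customers and orders below are made up for illustration and share the key column id.

```r
# Made-up example tables sharing the key column "id"
customers <- data.frame(id = c(1, 2, 3), name = c("Ann", "Bob", "Cy"))
orders    <- data.frame(id = c(2, 3, 4), amount = c(50, 75, 20))

inner <- merge(customers, orders, by = "id")                # ids 2, 3
left  <- merge(customers, orders, by = "id", all.x = TRUE)  # ids 1, 2, 3
right <- merge(customers, orders, by = "id", all.y = TRUE)  # ids 2, 3, 4
full  <- merge(customers, orders, by = "id", all = TRUE)    # ids 1, 2, 3, 4

left   # the row for id 1 has NA in the amount column
```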