Posted April 1, 2012 by Team AnalyticpediA in Analytics
 
 

R-Beginner’s Quick Book Part-1


History of R


The R language is derived from S, which was developed at Bell Laboratories by John Chambers, Rick Becker and colleagues in the late 1970s; S-PLUS is a commercial implementation of S. The R project itself was initiated by Robert Gentleman and Ross Ihaka at the University of Auckland, New Zealand, in the early 1990s. R combines the syntax of S with ideas borrowed from Scheme.

The first R source code was released under the GNU General Public License (GPL) in 1995. In 1997, the Comprehensive R Archive Network (CRAN) was established.

Salient features of R


  • Open-source statistics package (free and flexible)
  • Object-oriented programming language (to a degree)
  • Developed by the R Core Team and distributed through CRAN; anyone can contribute packages
  • Platform independent (Unix, Linux, MS Windows, Mac OS)

What you can do with R


R can be used for all kinds of statistical analysis, including (but not limited to) hypothesis testing, ANOVA, linear regression, logistic regression, Bayesian analysis, optimization, machine learning (neural networks, support vector machines), modeling and simulation, and reporting (charts and plots).

Why is R so popular?


  • Versatility: It is easy to write new programs for the task at hand.
  • Interactivity: R is user friendly and runs commands one at a time, so the user can correct the code at each step separately.
  • Research-oriented: Though SAS is the most common package in general, R is well suited to research.
  • FREE!

 Comparison of statistical software


 

Feature        | R                                                 | SAS                                                             | Excel                      | SPSS
Cost           | Free                                              | Commercial: ~$6000 per seat (PC version) / ~$28K per processor  | $150 (Office 2007)         | Commercial: $1599
Open source    | Yes                                               | No                                                              | No                         | No
Interface      | CLI/GUI (limited)                                 | CLI/GUI                                                         | GUI                        | CLI/GUI
Data set limit | Large (some tests may be restricted by data size) | Large                                                           | Rows: 1048576, Cols: 16384 | Large

 Installation procedure


  • Go to http://www.r-project.org/ and download R from your nearest CRAN mirror.
  • Download the base package. It covers most day-to-day analytics requirements.
  • For specific research needs, download contributed packages such as:

−       RODBC: ODBC database connectivity, e.g. to connect to Excel

−       nortest: Tests for normality

−       neural: Neural networks

−       nleqslv: Solves systems of nonlinear equations
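
As a minimal sketch, contributed packages are usually fetched from inside R with install.packages() and loaded with library(). The snippet assumes an internet connection and a configured CRAN mirror, so the install and load lines are left commented out; the vector of package names is only an illustration.

```r
# Packages we would like to have (illustrative selection from the list above)
wanted <- c("RODBC", "nortest", "nleqslv")

# Names of the packages already installed on this machine
have <- rownames(installed.packages())

# Packages that still need to be downloaded
to_install <- setdiff(wanted, have)
print(to_install)

# Uncomment to download from your CRAN mirror and load one of the packages:
# install.packages(to_install)
# library(nortest)
```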

Getting started


  • Initiation:
    • File > Source R code: runs an existing script file.
    • File > New script: opens an editor to write your own code.
    • File > Open script: opens code saved from past sessions.
    • File > Change dir: sets the working directory, where data files are read from and output files are stored. Any folder can be used.
  • Reading data:
    • Data can be read from any location using the read.table command.
    • If the file is in the working directory, the file name alone is enough; the full path is not required.
    • If the data set ships with a package, it can be loaded with the function data().
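
The points on the working directory and on reading data can be sketched as follows. The file name demo.txt and its contents are made up for illustration; the file is written to a temporary folder first so that the example is self-contained.

```r
# Create a small text file to read back (hypothetical data)
path <- file.path(tempdir(), "demo.txt")
write.table(data.frame(id = 1:3, score = c(2.5, 3.1, 4.8)),
            path, row.names = FALSE)

# Read from any location by giving the full path
scores <- read.table(path, header = TRUE)

# Or make that folder the working directory and use the file name alone
old_dir <- setwd(tempdir())     # setwd() returns the previous directory
scores2 <- read.table("demo.txt", header = TRUE)
setwd(old_dir)                  # restore the previous working directory

# Data sets that ship with a package are loaded with data()
data(iris)
head(iris)
```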

R functions


  • R is flexible: write your own functions or use the built-in ones.
  • Type commands in the R console (command window).
  • To open the help page of a function, use ?, e.g. ?solve
  • If you are not sure of the exact name, search the help system with help.search("…").
  • Load the corresponding package with library(packagename), then work with the function.
  • Call library() with no arguments to see which packages you have installed.
  • Numerical summary: functions for basic use include

summary   mean    median   range   quantile   $   var   cor    sort
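
A short sketch of these summary functions in use. It relies on the built-in mtcars data set, which is an assumption made for illustration and not part of the original text.

```r
data(mtcars)

mpg <- mtcars$mpg                 # $ extracts one column from a data frame

summary(mpg)                      # min, quartiles, mean and max in one call
mpg_mean   <- mean(mpg)
mpg_median <- median(mpg)
mpg_range  <- range(mpg)          # smallest and largest value
mpg_q      <- quantile(mpg, c(0.25, 0.75))
mpg_var    <- var(mpg)
mpg_wt_cor <- cor(mtcars$mpg, mtcars$wt)   # correlation between two columns
mpg_sorted <- sort(mpg)           # ascending order by default
```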

Graphical Representation


  • Histograms and boxplots (one variable), where x is a numeric vector such as one column of the data:

hist(x, main="…", xlab="…", ylab="…")

boxplot(x, main="…", xlab="…")

  • Scatterplots, where var1 and var2 are columns of the data frame data:

plot(var1 ~ var2, data=data, xlab="…", ylab="…")

  • We can put several plots in the same display using par(mfrow=…) commands.
  • More than two variables can also be plotted at once, as a matrix of pairwise scatterplots, using pairs(data)
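
The plotting calls above might be combined as in the sketch below. The mtcars data set and all titles and axis labels are assumptions chosen for illustration.

```r
data(mtcars)

# Arrange the next four plots in a 2 x 2 grid; keep the old setting to restore it
old_par <- par(mfrow = c(2, 2))

hist(mtcars$mpg, main = "Fuel economy", xlab = "Miles per gallon", ylab = "Count")
boxplot(mtcars$mpg, main = "Fuel economy", xlab = "All cars")
plot(mpg ~ wt, data = mtcars, xlab = "Weight", ylab = "Miles per gallon")
plot(mpg ~ hp, data = mtcars, xlab = "Horsepower", ylab = "Miles per gallon")

par(old_par)   # back to one plot per display

# Pairwise scatterplots of several variables at once
pairs(mtcars[, c("mpg", "wt", "hp")])
```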

Choosing subsets of data


Rows, columns and elements of data

x = data[3,]      # row 3, all columns

y = data[,7]      # column 7, all rows

a = data[2,4]     # the single element in row 2, column 4

Suppose we want to choose rows 3,5,7 or rows 5 to 11

data1 = data[c(3,5,7),]

data2 = data[5:11,]

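Columns work the same way. Negative indices drop the listed rows instead of keeping them:

data3 = data[-c(1:3,6),]        # every row except rows 1, 2, 3 and 6

Subsets based on some condition (note the comma: the condition selects rows, and all columns are kept)

data4 = data[y > 0,]      or      data4 = data[data$V1 > 0,]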
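
To see these indexing rules at work, here is a small sketch on a made-up data frame. The column names V1 to V3 and the values are hypothetical.

```r
# A hypothetical 7 x 3 data frame
data <- data.frame(V1 = c(-2, 5, 0, 3, -1, 8, 4),
                   V2 = letters[1:7],
                   V3 = seq(10, 70, by = 10))

x <- data[3, ]                 # row 3, all columns
y <- data[, 3]                 # column 3 as a vector
a <- data[2, 3]                # element in row 2, column 3

data1 <- data[c(3, 5, 7), ]    # rows 3, 5 and 7
data2 <- data[2:4, ]           # rows 2 to 4
data3 <- data[-c(1:3, 6), ]    # all rows except 1, 2, 3 and 6
data4 <- data[data$V1 > 0, ]   # rows where V1 is positive
```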

Matrix manipulations


One of the main strengths of R lies in the fluent computation of matrix operations.

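I = diag(10)    # creates a 10 x 10 identity matrix

A = as.matrix(data)      # converts the data frame to a matrix

x = cbind(1,A[,2:5])     # design matrix: a column of 1s plus columns 2 to 5 of A

c = t(x) %*% x     # computes x'x

d = t(x) %*% data[,7]    # computes x'y, with column 7 as the response

beta = solve(c,d)    or      beta = solve(c) %*% d    # solve(c) is the inverse of c; base R has no inv() function

Other matrix operators and functions include

+  -   *   dim   qr(x)$rank   ncol  nrow eigen %x%/kronecker

Note that * multiplies element by element (use %*% for the matrix product), and that rank() ranks the values of a vector; the rank of a matrix is qr(x)$rank.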

Sub-setting and augmenting matrices are also easy, e.g.

x= cbind(1,A[,2:5])
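
The matrix steps above amount to solving the normal equations of a least-squares regression. The sketch below repeats them on simulated data (the sample size, coefficients and seed are assumptions for illustration) and compares the result with R's own lm().

```r
set.seed(1)
n <- 50
A <- matrix(rnorm(n * 3), ncol = 3)                    # three simulated predictors
y <- drop(2 + A %*% c(1, -0.5, 0.25)) + rnorm(n, sd = 0.1)

X   <- cbind(1, A)          # add the intercept column
XtX <- t(X) %*% X           # X'X
Xty <- t(X) %*% y           # X'y

beta    <- solve(XtX, Xty)  # solves (X'X) beta = X'y without forming the inverse
beta_lm <- coef(lm(y ~ A))  # reference fit from lm()

print(cbind(normal_equations = drop(beta), lm = beta_lm))
```

Calling solve(XtX, Xty) is generally preferable to solve(XtX) %*% Xty, because it avoids computing the inverse explicitly.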

Factors (Categorical Variables)


  • Factors are treated differently from quantitative variables.
  • E.g. taking the mean of zip codes does not make sense.
  • When building a data frame, R versions before 4.0.0 convert character strings to factors by default; from R 4.0.0 onward, request this with stringsAsFactors = TRUE.
  • For numerical ones,

    x = as.factor(x)    # transforms into factors

  • To make contingency tables,

 table(x,y)      # prepares joint frequencies for different values of factors x and y

 factor(z)       # converts into factors

 levels(z)        # gives all the different levels of z

Simulations


  • R is well suited to simulating from different distributions and to computing quantiles, p-values, density functions etc.
  • Suppose we want to generate 500 normal random variables with mean 4.3 and standard deviation 2

    x = rnorm(500,4.3,2)

   Similarly, we can generate from other distributions

 rexp  rbeta  rgamma rbinom  rmultinom  runif  rchisq   rpois   rf  rt

  • Next, R can evaluate the CDF of a given distribution, with given parameters, at each element of a vector.
  • pbeta(c(0,0.2,0.4,0.6,0.8,1),3,3)     # computes the probabilities P(X <= q) under the Beta(3,3) distribution for the given set of values
  • We can compute quantiles as well. Check,

    qgamma(c(0,0.2,0.4,0.6,0.8,1),3,1)
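
The r, p and q functions follow one naming pattern per distribution, and the sketch below runs the three calls from this section together. The seed and the round trip through pgamma() are additions for illustration.

```r
set.seed(42)                       # make the random draws reproducible

x <- rnorm(500, mean = 4.3, sd = 2)        # 500 draws from N(4.3, 2^2)

probs <- c(0, 0.2, 0.4, 0.6, 0.8, 1)

p <- pbeta(probs, shape1 = 3, shape2 = 3)  # CDF of Beta(3,3) at each value
q <- qgamma(probs, shape = 3, rate = 1)    # quantiles of Gamma(3,1)

# Quantile and CDF functions are inverses of each other
back <- pgamma(q, shape = 3, rate = 1)

print(p)
print(q)
```

Because Beta(3,3) is symmetric around 0.5, the CDF values at 0.2 and 0.8 add up to 1, and the Gamma quantile at probability 1 is Inf.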

 Loops


In R, a loop is written as a for, while or repeat statement followed by a body in {}.

Suppose we want to simulate 10000 random normal variables, and want to check the consistency of the sample mean and the sample variance.

x = rnorm(10000,5,2)         # iid random normal variables

xbar = rep(0,10000)           # initialisation of the running mean

vari = rep(0,10000)            # initialisation of the running variance

for (i in (1:10000))

{

xbar[i] = mean(x[1:i])       # mean of the first i observations

vari[i] = var(x[1:i])          # variance of the first i observations (NA for i = 1)

}

plot(xbar,type = "l")        # type = "l" draws a line

plot(vari,type = "l")

  • Convergence to the true mean 5 and the true variance 4 will be clear from those plots.
  • Similarly we can write while loops…
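
As a sketch of the while form, the running mean can be computed as below. The smaller sample size and the seed are assumptions for illustration, and the vectorised cumsum() line is an alternative added here, not part of the original recipe.

```r
set.seed(7)
x <- rnorm(1000, 5, 2)

# Running mean with a while loop
xbar_loop <- numeric(length(x))
i <- 1
while (i <= length(x)) {
  xbar_loop[i] <- mean(x[1:i])   # mean of the first i observations
  i <- i + 1                     # the counter must be advanced by hand
}

# Same result without a loop: cumulative sum divided by the count so far
xbar_vec <- cumsum(x) / seq_along(x)

plot(xbar_loop, type = "l")
```

For long vectors the vectorised form tends to be much faster, since the loop recomputes each mean from scratch.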

Write your own function


  • You can write your own functions and reuse them in future jobs.
  • Suppose we want to write the density function of a gamma distribution with shape r and rate lambda:

    gdensity = function(x,r,lambda) {(lambda^r)*exp(-lambda*x)*(x^(r-1))/gamma(r)}

  • To write more complicated functions (this version expects a single number x, because if() tests one value at a time):

   gdensity = function(x,r,lambda){

    if (x<0) return(0)       # the density is 0 for negative x

    a = lambda^r

    b = lambda*x

    c = exp(-b)

    d = x^(r-1)

    e = gamma(r)             # normalising constant

    s = a*c*d/e

    return(s)

    }
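
One possible variant of gdensity that accepts a whole vector of x values is sketched below and checked against R's built-in dgamma(). The use of ifelse() and pmax() is a design choice made here for illustration, not the original author's version.

```r
gdensity <- function(x, r, lambda) {
  xpos <- pmax(x, 0)    # avoid powers of negative numbers
  dens <- lambda^r * exp(-lambda * xpos) * xpos^(r - 1) / gamma(r)
  ifelse(x < 0, 0, dens)   # density is 0 for negative x
}

xs   <- c(-1, 0.5, 1, 2.5)
ours <- gdensity(xs, r = 3, lambda = 2)
ref  <- dgamma(xs, shape = 3, rate = 2)   # built-in gamma density

print(cbind(x = xs, gdensity = ours, dgamma = ref))
```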

Merging/joins


  • In R, merges and joins are done with the function merge.
  • Syntax:

   merge(x, y, by = intersect(names(x), names(y)), by.x = by, by.y = by, all = FALSE, all.x = all, all.y = all, ...)

  • x, y = data frames, or objects to be coerced to one
  • by.x, by.y = specifications of the common columns in x and y
  • all.x = logical. If TRUE, extra rows are added to the output, one for each row in x that has no matching row in y. These rows have NAs in the columns that are usually filled with values from y. The default is FALSE, so that only rows with data from both x and y are included in the output.
  • all.y = logical, analogous to all.x for rows in y that have no matching row in x.
  • Together these give the usual joins: all.x = TRUE is a left outer join, all.y = TRUE a right outer join, and all = TRUE a full join.
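
The four join types can be sketched on two small made-up tables. The data frames customers and orders and their columns are hypothetical.

```r
customers <- data.frame(id = c(1, 2, 3), name = c("Ann", "Bob", "Cy"))
orders    <- data.frame(cust = c(2, 3, 3, 4), amount = c(10, 25, 5, 40))

# The key column has a different name in each table, hence by.x and by.y
inner <- merge(customers, orders, by.x = "id", by.y = "cust")                # matches only
left  <- merge(customers, orders, by.x = "id", by.y = "cust", all.x = TRUE)  # keep all customers
right <- merge(customers, orders, by.x = "id", by.y = "cust", all.y = TRUE)  # keep all orders
full  <- merge(customers, orders, by.x = "id", by.y = "cust", all = TRUE)    # keep everything

print(full)
```

Customer 1 has no order and order 4 has no customer, so those rows appear only in the outer joins, with NA in the columns from the other table.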
