Handling large data sets in R

Background:

Recently, along with my co-author, I gave a presentation at NYC DataScience Academy on options for handling large data sets in R.

You can watch the presentation here.

This blog post presents an overview of the presentation, covering the available options to process large data sets in R efficiently.

The Problem with large data sets in R:

  • R reads the entire data set into RAM all at once; other programs can read file sections on demand.
  • R objects live entirely in memory.
  • R does not have an int64 data type, so it is not possible to index objects with huge numbers of rows and columns even on 64-bit systems (roughly a 2 billion vector index limit). In practice this means a file size limit of around 2-4 GB.
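
A quick way to see that 2 billion index limit from the R console (a minimal illustration using only base R):

# Largest value a standard 32-bit R integer (and hence a classic vector index) can hold
.Machine$integer.max
## [1] 2147483647

# Going past it overflows to NA with a warning
.Machine$integer.max * 2L
## [1] NA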

How big is a large data set:

We can categorize large data sets in R into two broad categories:

  • Medium sized files that can be loaded into R (within the memory limit) but whose processing is cumbersome (typically in the 1-2 GB range).
  • Large files that cannot be loaded into R due to the R / OS limitations discussed above. We can further split this group into two sub-groups:
    • Large files (typically 2-10 GB) that can still be processed locally using some workaround solutions.
    • Very large files (> 10 GB) that need distributed large-scale computing.

We will go through the solution approach for each of these situations in the following sections.

Medium sized datasets (< 2 GB)

Try to reduce the size of the file before loading it into R

  • If you are loading xls files, you can select the specific columns required for the analysis instead of loading the entire data set.
  • You cannot select specific columns while loading a csv or text file – you may want to pre-process the data on the command line using cut or awk and keep only the columns required for analysis (a small sketch follows this list).
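
As a rough sketch of that pre-processing step, run from within R via system() on a Unix-like machine (the file name data/big_input.csv and the choice of columns 1, 3 and 5 are placeholders, not actual files from this post):

# Keep only columns 1, 3 and 5 of a comma-separated file before loading it into R
system("cut -d',' -f1,3,5 data/big_input.csv > data/big_input_subset.csv")
subset.df <- read.csv("data/big_input_subset.csv", stringsAsFactors = FALSE)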

Pre-allocate the number of rows and pre-define the column classes.
Read optimization example:

  1. Read in a few records of the input file, identify the column classes and pass those classes to the read call when reading the entire data set.
  2. Calculate the approximate row count of the data set based on the file size and the number of fields per row (or by using wc on the command line) and set the nrows= parameter.
  3. Set the comment.char parameter (comment.char = "" turns off comment processing).
# Read a small sample to infer the column classes
bigfile.sample <- read.csv("data/SAT_Results2014.csv",  
                           stringsAsFactors=FALSE, header=TRUE, nrows=20)  

bigfile.colclass <- sapply(bigfile.sample, class)

# Read the full file, supplying the pre-computed classes and an approximate row count
library(dplyr)   # for tbl_df
bigfile.raw <- tbl_df(read.csv("data/SAT_Results2014.csv", 
                    stringsAsFactors=FALSE, header=TRUE, nrows=10000, 
                    colClasses=bigfile.colclass, comment.char=""))  

These simple changes will significantly improve the loading operation in R.

Alternatively, use the fread function from the data.table package.

The following shows the optimization steps while reading the file and the relative performance improvement achieved.

url <- "./311_Service_2014.csv"
#File size (MB) : 844
#1,844,515 rows 52 columns


#Standard read.csv ####
#==========================================================================
system.time(DF1 <- read.csv(url,stringsAsFactors=FALSE))
#user  system elapsed 
#243.38    5.49  249.73


#Optimized read.csv ####
#==========================================================================
system.time(length(readLines(url)))
#Number of lines : 1844516
#user  system elapsed 
#106.56    2.47  109.63 

classes <- c("numeric",rep("character",48),rep("numeric",2), "character")

system.time(DF2 <- read.csv(url, header = TRUE, sep = ",",  stringsAsFactors = FALSE, nrows = 1844516, colClasses = classes))
#user  system elapsed 
#173.73    3.43  182.73 

#fread ####
#==========================================================================
library(data.table)

system.time(DT1 <- fread(url))
#user  system elapsed 
#80.10    1.09   81.30 


#Summary ####
#==========================================================================
##    user  system elapsed  Method
##   243.38   5.49   249.73  read.csv (first time)
##   173.73   3.43   182.73  Optimized read.csv
##    80.10    1.09   81.30  fread

Use pipe operators to chain processing steps instead of keeping intermediate copies, minimizing data set duplication through the process steps, if that is an appropriate solution to your processing requirements.
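
A minimal sketch of that idea with the magrittr pipe exported by dplyr (the tiny raw.df data frame below is only a stand-in for a large data set read earlier):

library(dplyr)

# Stand-in data; in practice this would be the large data set already loaded
raw.df <- data.frame(id = c(1, 1, 2), year = c(2014, 2014, 2013), amount = c(10, 20, 30))

# Chaining with %>% avoids keeping named intermediate copies (df1, df2, ...) in memory
result <- raw.df %>%
  filter(year == 2014) %>%
  group_by(id) %>%
  summarise(total = sum(amount))
result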

Parallel Processing

The parallel approach runs several computations at the same time, taking advantage of multiple cores or CPUs on a single system or across systems. The following R packages are used for parallel processing in R.

Explicit parallelism (user controlled)

Examples:
- Rmpi (Message Passing Interface)
- snow (Simple Network of Workstations)

Implicit parallelism (system abstraction)

Examples:
- doMC / foreach

Given below is an example of multi-core registration using doMC.

# enable parallel processing for computationally intensive operations.

library(doMC)
registerDoMC(cores = 4)
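
Once the cores are registered, computations can be spread across them with foreach; a minimal sketch (the sqrt() call is just a placeholder for a computationally intensive task):

library(foreach)

# %dopar% dispatches the iterations to the cores registered with registerDoMC()
result <- foreach(i = 1:4, .combine = c) %dopar% {
  sqrt(i)  # placeholder for an expensive computation
}
result
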
Large datasets (2 – 10 GB)

For data sets that are too big for in-memory processing but too small for distributed computing, the following R packages come in handy.

bigmemory

bigmemory is part of the “big” family, a group of packages for analysis of large data sets. bigmemory provides several matrix object types, but we will focus only on big.matrix.

big.matrix is an R object that uses a pointer to a C++ data structure. The location of the pointer to the C++ matrix can be saved to disk or RAM and shared with other users in different sessions.

By loading the pointer object, users can access the data set without reading the entire set into R.

The following sample code will give a better understanding of how to use bigmemory:

example

# User / Session 1

library(bigmemory)
library(biganalytics)
library(bigtabulate)

#Create big.matrix 

setwd("/Users/sundar/dev")

school.matrix <- read.big.matrix(
    "./numeric_matrix_SAT__College_Board__2010_School_Level_Results.csv", 
    type ="integer", header = TRUE, backingfile = "school.bin", 
    descriptorfile ="school.desc", extraCols =NULL) 

# Get the location of the pointer to school.matrix. 
desc <- describe(school.matrix)

str(school.matrix)
## Formal class 'big.matrix' [package "bigmemory"] with 1 slot
##   ..@ address:<externalptr>
# process big matrix in active session. 

colsums.session1 <- sum(as.numeric(school.matrix[,3])) 
colsums.session1
## [1] 67147
# save the location to disk to share the object .
dput(desc , file="/tmp/A.desc")
# Session 2
setwd("/Users/sundar/dev")

library (bigmemory)
library (biganalytics)

# Read the pointer from disk .
shared.desc <- dget("/tmp/A.desc")

# Attach to the pointer in RAM.
shared.bigobject <- attach.big.matrix(shared.desc)

# Check our results .
colsums.session2 <- sum(shared.bigobject[,3]) 
colsums.session2
## [1] 67147

As one can see, bigmemory is a powerful option for reading and processing big files and for sharing the matrix object across sessions as a pointer, which can then be treated like a normal R data object.

However, bigmemory has a limitation: C++ matrices allow only one type of data, so the whole data set has to consist of a single data class.

That leads us to the next package for handling large data sets in R.

ff

ff is another package for large data sets, similar to bigmemory. It also uses a pointer, but to a flat binary file stored on disk, and the pointer can be shared across different sessions.
One advantage ff has over bigmemory is that it supports multiple data classes within the same data set.

example

library(ff)
                                 
# creating the file
school.ff <- read.csv.ffdf(file="/Users/sundar/dev/mixed_matrix_SAT__College_Board__2010_School_Level_Results.csv")

#creates a ffdf object 
class(school.ff)
## [1] "ffdf"
# ffdf is a virtual dataframe
str(school.ff)
## List of 3
##  $ virtual: 'data.frame':    5 obs. of  7 variables:
##  .. $ VirtualVmode     : chr  "integer" "integer" "integer" "integer" ...
##  .. $ AsIs             : logi  FALSE FALSE FALSE FALSE FALSE
##  .. $ VirtualIsMatrix  : logi  FALSE FALSE FALSE FALSE FALSE
##  .. $ PhysicalIsMatrix : logi  FALSE FALSE FALSE FALSE FALSE
##  .. $ PhysicalElementNo: int  1 2 3 4 5
##  .. $ PhysicalFirstCol : int  1 1 1 1 1
##  .. $ PhysicalLastCol  : int  1 1 1 1 1
##  .. - attr(*, "Dim")= int  157 5
##  .. - attr(*, "Dimorder")= int  1 2
##  $ physical: List of 5
##  .. $ characters           : list()
##  ..  ..- attr(*, "physical")=Class 'ff_pointer' <externalptr> 
##  ..  .. ..- attr(*, "vmode")= chr "integer"
##  ..  .. ..- attr(*, "maxlength")= int 157
##  ..  .. ..- attr(*, "pattern")= chr "ffdf"
##  ..  .. ..- attr(*, "filename")= chr "/private/var/folders/c2/8484fxfn3x30bhw7_3skdc6r0000gp/T/RtmptObzLr/ffdf6dd045531d5b.ff"
##  ..  .. ..- attr(*, "pagesize")= int 65536
##  ..  .. ..- attr(*, "finalizer")= chr "close"
##  ..  .. ..- attr(*, "finonexit")= logi TRUE
##  ..  .. ..- attr(*, "readonly")= logi FALSE
##  ..  .. ..- attr(*, "caching")= chr "mmnoflush"
##  ..  ..- attr(*, "virtual")= list()
##  ..  .. ..- attr(*, "Length")= int 157
##  ..  .. ..- attr(*, "Symmetric")= logi FALSE
##  ..  .. ..- attr(*, "Levels")= chr "aabc"
##  ..  .. ..- attr(*, "ramclass")= chr "factor"
##  .. .. - attr(*, "class") =  chr [1:2] "ff_vector" "ff"
##  .. $ Number.of.Test.Takers: list()
##  ..  ..- attr(*, "physical")=Class 'ff_pointer' <externalptr> 
##  ..  .. ..- attr(*, "vmode")= chr "integer"
##  ..  .. ..- attr(*, "maxlength")= int 157
##  ..  .. ..- attr(*, "pattern")= chr "ffdf"
##  ..  .. ..- attr(*, "filename")= chr "/private/var/folders/c2/8484fxfn3x30bhw7_3skdc6r0000gp/T/RtmptObzLr/ffdf6dd053ac64eb.ff"
##  ..  .. ..- attr(*, "pagesize")= int 65536
##  ..  .. ..- attr(*, "finalizer")= chr "close"
##  ..  .. ..- attr(*, "finonexit")= logi TRUE
##  ..  .. ..- attr(*, "readonly")= logi FALSE
##  ..  .. ..- attr(*, "caching")= chr "mmnoflush"
##  ..  ..- attr(*, "virtual")= list()
##  ..  .. ..- attr(*, "Length")= int 157
##  ..  .. ..- attr(*, "Symmetric")= logi FALSE
##  .. .. - attr(*, "class") =  chr [1:2] "ff_vector" "ff"
##  .. $ Critical.Reading.Mean: list()
##  ..  ..- attr(*, "physical")=Class 'ff_pointer' <externalptr> 
##  ..  .. ..- attr(*, "vmode")= chr "integer"
##  ..  .. ..- attr(*, "maxlength")= int 157
##  ..  .. ..- attr(*, "pattern")= chr "ffdf"
##  ..  .. ..- attr(*, "filename")= chr "/private/var/folders/c2/8484fxfn3x30bhw7_3skdc6r0000gp/T/RtmptObzLr/ffdf6dd05b15ab37.ff"
##  ..  .. ..- attr(*, "pagesize")= int 65536
##  ..  .. ..- attr(*, "finalizer")= chr "close"
##  ..  .. ..- attr(*, "finonexit")= logi TRUE
##  ..  .. ..- attr(*, "readonly")= logi FALSE
##  ..  .. ..- attr(*, "caching")= chr "mmnoflush"
##  ..  ..- attr(*, "virtual")= list()
##  ..  .. ..- attr(*, "Length")= int 157
##  ..  .. ..- attr(*, "Symmetric")= logi FALSE
##  .. .. - attr(*, "class") =  chr [1:2] "ff_vector" "ff"
##  .. $ Mathematics.Mean     : list()
##  ..  ..- attr(*, "physical")=Class 'ff_pointer' <externalptr> 
##  ..  .. ..- attr(*, "vmode")= chr "integer"
##  ..  .. ..- attr(*, "maxlength")= int 157
##  ..  .. ..- attr(*, "pattern")= chr "ffdf"
##  ..  .. ..- attr(*, "filename")= chr "/private/var/folders/c2/8484fxfn3x30bhw7_3skdc6r0000gp/T/RtmptObzLr/ffdf6dd06b9bd698.ff"
##  ..  .. ..- attr(*, "pagesize")= int 65536
##  ..  .. ..- attr(*, "finalizer")= chr "close"
##  ..  .. ..- attr(*, "finonexit")= logi TRUE
##  ..  .. ..- attr(*, "readonly")= logi FALSE
##  ..  .. ..- attr(*, "caching")= chr "mmnoflush"
##  ..  ..- attr(*, "virtual")= list()
##  ..  .. ..- attr(*, "Length")= int 157
##  ..  .. ..- attr(*, "Symmetric")= logi FALSE
##  .. .. - attr(*, "class") =  chr [1:2] "ff_vector" "ff"
##  .. $ Writing.Mean         : list()
##  ..  ..- attr(*, "physical")=Class 'ff_pointer' <externalptr> 
##  ..  .. ..- attr(*, "vmode")= chr "integer"
##  ..  .. ..- attr(*, "maxlength")= int 157
##  ..  .. ..- attr(*, "pattern")= chr "ffdf"
##  ..  .. ..- attr(*, "filename")= chr "/private/var/folders/c2/8484fxfn3x30bhw7_3skdc6r0000gp/T/RtmptObzLr/ffdf6dd04425cc59.ff"
##  ..  .. ..- attr(*, "pagesize")= int 65536
##  ..  .. ..- attr(*, "finalizer")= chr "close"
##  ..  .. ..- attr(*, "finonexit")= logi TRUE
##  ..  .. ..- attr(*, "readonly")= logi FALSE
##  ..  .. ..- attr(*, "caching")= chr "mmnoflush"
##  ..  ..- attr(*, "virtual")= list()
##  ..  .. ..- attr(*, "Length")= int 157
##  ..  .. ..- attr(*, "Symmetric")= logi FALSE
##  .. .. - attr(*, "class") =  chr [1:2] "ff_vector" "ff"
##  $ row.names:  NULL
## - attributes: List of 2
##  .. $ names: chr [1:3] "virtual" "physical" "row.names"
##  .. $ class: chr "ffdf"
# ffdf object can be treated as any other R object
sum(school.ff[,3])
## [1] 66029

Very Large datasets

There are two options to process very large data sets (> 10 GB) in R.

  1. Use integrated environment packages like Rhipe to leverage the Hadoop MapReduce framework.
  2. Use RHadoop directly on a Hadoop distributed system.

Storing large files in databases and connecting through DBI/ODBC calls from R is also an option worth considering.
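
As a rough sketch of that approach (the SQLite file big_data.sqlite, the table service_requests and the borough column are assumed here purely for illustration):

library(DBI)
library(RSQLite)

# Connect to the database that holds the large table
con <- dbConnect(RSQLite::SQLite(), "big_data.sqlite")

# Push the filtering to the database and pull back only the rows that are needed
small.df <- dbGetQuery(con, "SELECT * FROM service_requests WHERE borough = 'QUEENS'")

dbDisconnect(con)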

Conclusion:

As you will have realized by now, R provides many options for handling data files, whatever size they come in – small, medium or large.

Go ahead and analyse that data set in full, the one that you have been holding off till now due to system memory size limitations.

References:

Taking R to the limit

R vs Python

Should you teach Python or R for data science?

http://www.dataschool.io/python-or-r-for-data-science/

Last week, I published a post titled Lessons learned from teaching an 11-week data science course, detailing my experiences and recommendations from teaching General Assembly’s 66-hour introductory data science course.

In the comments, I received the following question:

I’m part of a team developing a course, with NSF support, in data science. The course will have no prerequisites and will be targeted for non-technical majors, with a goal to show how useful data science can be in their own area. Some of the modules we are developing include, for example, data cleansing, data mining, relational databases and NoSQL data stores. We are considering as tools the statistical environment R and Python and will likely develop two versions of this course. For now, we’d appreciate your sense of the relative merits of those two environments. We are hoping to get a sense of what would be more appropriate for computer and non computer science students, so if you have a sense of what colleagues that you know would prefer, that also would be helpful.

That’s an excellent question! It doesn’t have a simple answer (in my opinion) because both languages are great for data science, but one might be better than the other depending upon your students and your priorities.

At General Assembly in DC, we currently teach the course entirely in Python, though we used to teach it in both R and Python. I also mentor data science students in R, and I’m a teaching assistant for online courses in both R and Python. I enjoy using both languages, though I have a slight personal preference for Python specifically because of its machine learning capabilities (more details below).

Here are some questions that might help you (as educators or curriculum developers) to assess which language is a better fit for your students:

Do your students have experience programming in other languages?

If your students have some programming experience, Python may be the better choice because its syntax is more similar to other languages, whereas R’s syntax is thought to be unintuitive by many programmers. If your students don’t have any programming experience, I think both languages have an equivalent learning curve, though many people would argue that Python is easier to learn because its code reads more like regular human language.

Do your students want to go into academia or industry?

In academia, especially in the field of statistics, R is much more widely used than Python. In industry, the data science trend is slowly moving from R towards Python. One contributing factor is that companies using a Python-based application stack can more easily integrate a data scientist who writes Python code, since that eliminates a key hurdle in “productionizing” a data scientist’s work.

Are you teaching “machine learning” or “statistical learning”?

The line between these two terms is blurry, but machine learning is concerned primarily with predictive accuracy over model interpretability, whereas statistical learning places a greater priority on interpretability and statistical inference. To some extent, R “assumes” that you are performing statistical learning and makes it easy to assess and diagnose your models. scikit-learn, by far the most popular machine learning package for Python, is more concerned with predictive accuracy. (For example, scikit-learn makes it very easy to tune and cross-validate your models and switch between different models, but makes it much harder than R to actually “examine” your models.) Thus, R is probably the better choice if you are teaching statistical learning, though Python also has a nice package for statistical modeling (Statsmodels) that duplicates some of R’s functionality.

Do you care more about the ease with which students can get started in machine learning, or the ease with which they can go deeper into machine learning?

In R, getting started with your first model is easy: read your data into a data frame, use a built-in model (such as linear regression) along with R’s easy-to-read formula language, and then review the model’s summary output. In Python, it can be much more of a challenging process to get started simply because there are so many choices to make: How should I read in my data? Which data structure should I store it in? Which machine learning package should I use? What type of objects does that package allow as input? What shape should those objects be in? How do I include categorical variables? How do I access the model’s output? (Et cetera.) Because Python is a general purpose programming language whereas R specializes in a smaller subset of statistically-oriented tasks, those tasks tend to be easier to do (at least initially) in R.
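
To make the R side of that comparison concrete, here is a minimal example (not from the original post) using R's built-in mtcars data:

# Fit a linear regression with the formula language and review the summary output
fit <- lm(mpg ~ wt + hp, data = mtcars)
summary(fit)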

However, once you have mastered the basics of machine learning in Python (using scikit-learn), I find that machine learning is actually a lot easier in Python than in R. scikit-learn provides a clean and consistent interface to tons of different models. It provides you with many options for each model, but also chooses sensible defaults. Its documentation is exceptional, and it helps you to understand the models as well as how to use them properly. It is also actively being developed.

In R, switching between different models usually means learning a new package written by a different author. The interface may be completely different, the documentation may or may not be helpful in learning the package, and the package may or may not be under active development. (caret is an excellent R package that attempts to provide a consistent interface for machine learning models in R, but it’s nowhere near as elegant a solution as scikit-learn.) In summary, machine learning in R tends to be a more tiresome experience than machine learning in Python once you have moved beyond the basics. As such, Python may be a better choice if students are planning to go deeper into machine learning.

Do your students care about learning a “sexy” language?

R is not a sexy language. It feels old, and its website looks like it was created around the time the web was invented. Python is the “new kid” on the data science block, and has far more sex appeal. From a marketing perspective, Python may be the better choice simply because it will attract more students.

How computer savvy are your students?

Installing R is a simple process, and installing RStudio (the de facto IDE for R) is just as easy. Installing new packages or upgrading existing packages from CRAN (R’s package management system) is a trivial process within RStudio, and even installing packages hosted on GitHub is a simple process thanks to the devtools package.

By comparison, Python itself may be easy to install, but installing individual Python packages can be much more challenging. In my classroom, we encourage students to use the Anaconda distribution of Python, which includes nearly every Python package we use in the course and has a package management system similar to CRAN. However, Anaconda installation and configuration problems are still common in my classroom, whereas these problems were much more rare when using R and RStudio. As such, R may be the better choice if your students are not computer savvy.

Is data cleaning a focus of your course?

Data cleaning (also known as “data munging”) is the process of transforming your raw data into a more meaningful form. I find data cleaning to be easier in Python because of its rich set of data structures, as well as its far superior implementation of regular expressions (which are often necessary for cleaning text).

Is data exploration a focus of your course?

The pandas package in Python is an extremely powerful tool for data exploration, though its power and flexibility can also make it challenging to learn. R’s dplyr is more limited in its capabilities than pandas (by design), though I find that its more focused approach makes it easier to figure out how to accomplish a given task. As well, dplyr’s syntax is more readable and thus is easier for me to remember. Although it’s not a clear differentiator, I would consider R a slightly easier environment for getting started in data exploration due to the ease of learning dplyr.
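
For instance, a typical dplyr exploration pipeline reads almost like a sentence (the built-in mtcars data is used here purely for illustration):

library(dplyr)

mtcars %>%
  filter(cyl == 4) %>%              # keep four-cylinder cars
  group_by(gear) %>%                # group by number of gears
  summarise(avg_mpg = mean(mpg)) %>%
  arrange(desc(avg_mpg))            # sort by average mileage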

Is data visualization a focus of your course?

R’s ggplot2 is an excellent package for data visualization. Once you understand its core principles (its “grammar of graphics”), it feels like the most natural way to build your plots, and it becomes easy to produce sophisticated and attractive plots. Matplotlib is the de facto standard for scientific plotting in Python, but I find it tedious both to learn and to use. Alternatives like Seaborn and pandas plotting still require you to know some Matplotlib, and the alternative that I find most promising (ggplot for Python) is still early in development. Therefore, I consider R the better choice for data visualization.
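
A small ggplot2 example of that layered grammar-of-graphics style (again with mtcars, purely for illustration):

library(ggplot2)

ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(cyl))) +
  geom_point() +                              # raw points
  geom_smooth(method = "lm", se = FALSE) +    # a fitted line per cylinder group
  labs(x = "Weight (1000 lbs)", y = "Miles per gallon", colour = "Cylinders")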

Is Natural Language Processing (NLP) part of your curriculum?

Python’s Natural Language Toolkit (NLTK) is a mature, well-documented package for NLP. TextBlob is a simpler alternative, spaCy is a brand new alternative focused on performance, and scikit-learn also provides some supporting functionality for text-based feature extraction. In comparison, I find R’s primary NLP framework (the tm package) to be significantly more limited and harder to use. Even if there are additional R packages that can fill in the gaps, there isn’t one comprehensive package that you can use to get started, and thus Python is the better choice for teaching NLP.

If you are a data science educator, or even just a data scientist who uses R or Python, I’d love to hear from you in the comments! On which points above do you agree or disagree? What are some important factors that I have left out? What language do you teach in the classroom, and why?

I look forward to this conversation!

P.S. Want to hear about new Data School blog posts, video tutorials, and online courses? Subscribe to my newsletter.

Speed comparison using R, awk and Perl

In some of my tutorials (T1, T2), I used awk because it is faster. Here I run some experiments to see how much faster it is. I compared five methods: 1) old R functions, 2) dplyr and readr, 3) data.table, 4) awk and 5) Perl.

The GiantdatAl.txt file has multi-trait GWAS summary statistics for 2,719,717 SNPs. The data contain the rs-id, chromosome, genomic coordinate, 18 GWAS summary statistics and allele information. Part of the data looks like this.

bash-3.2$ head -n 3 GiantdatAl.txt
rs4747841 10 9918166 0.94 0.31 0.68 0.63 0.31 0.50 0.16 0.76 0.47 0.80 0.26 0.38 0.96 0.27 0.49 0.65 0.55 0.75 a g
rs4749917 10 9918296 0.94 0.31 0.68 0.63 0.31 0.50 0.16 0.75 0.47 0.80 0.26 0.38 0.96 0.27 0.49 0.65 0.55 0.75 t c
rs737656 10 98252982 0.70 0.28 0.25 0.27 0.28 0.67 0.25 0.59 0.70 0.74 0.94 0.29 0.34 0.49 0.54 0.35 0.97 0.38 a g

In this example, AGRN.txt has 3 SNPs mapped.

bash-3.2$ cat AGRN.txt
rs3121561
rs2799064
rs3128126

We would like to take the subset of SNPs listed in AGRN.txt from GiantdatAl.txt. This can be done using built-in R functions.

## in R (Old)  ###
t1 <- proc.time()
dat1 <- read.table("GiantdatAl.txt", colClasses=c("character", "character",
               "numeric", "numeric", "numeric", "numeric", "numeric", "numeric",
               "numeric", "numeric", "numeric", "numeric", "numeric", "numeric",
               "numeric", "numeric", "numeric", "numeric", "numeric", "numeric",
               "numeric", "character", "character")  )
snps <- read.table("AGRN.txt", header=FALSE)
sdat <- dat1[ dat1[,1] %in% snps$V1, ]
t2 <- proc.time()
sdat
##                V1 V2      V3   V4    V5   V6   V7   V8    V9  V10  V11  V12
## 1165433 rs2799064  1 1022518 0.76 0.580 0.32 0.18 0.66 0.610 0.72 0.31 0.87
## 1166049 rs3128126  1 1026830 0.53 0.076 0.83 0.12 0.94 0.081 0.67 0.39 0.59
## 1168776 rs3121561  1 1055000 0.45 0.380 0.78 0.14 0.73 0.012 0.40 0.11 0.98
##          V13  V14  V15  V16  V17  V18  V19  V20  V21 V22 V23
## 1165433 0.79 0.96 0.93 0.40 0.83 0.97 0.73 0.82 0.56   t   g
## 1166049 0.17 0.78 0.47 0.73 0.15 0.48 0.81 0.63 0.58   a   g
## 1168776 0.07 0.12 0.33 0.54 0.64 0.98 0.81 0.48 0.59   t   c
t2 - t1
##   user  system elapsed
## 26.844   1.484  28.340

Now, I tried the same job using dplyr and readr. These R packages are written in C++ via Rcpp.

library(readr)
library(dplyr)
### in R , using readr and dplyr ###

t1 <- proc.time()
dat1 <- read_delim("GiantdatAl.txt", delim=" ", col_type=cols("c", "c", "i", "d",
                   "d", "d", "d", "d", "d", "d", "d", "d", "d", "d", "d", "d",
                   "d", "d", "d", "d", "d", "c", "c"), col_names=FALSE )
snps <- read_delim("AGRN.txt", col_names=FALSE, delim= " ", col_type=cols("c") )
sdat <- filter(dat1, X1 %in% snps$X1)
t2 <- proc.time()
sdat
## Source: local data frame [3 x 23]
##
##          X1    X2      X3    X4    X5    X6    X7    X8    X9   X10   X11   X12
##       (chr) (chr)   (int) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl)
## 1 rs2799064     1 1022518  0.76 0.580  0.32  0.18  0.66 0.610  0.72  0.31  0.87
## 2 rs3128126     1 1026830  0.53 0.076  0.83  0.12  0.94 0.081  0.67  0.39  0.59
## 3 rs3121561     1 1055000  0.45 0.380  0.78  0.14  0.73 0.012  0.40  0.11  0.98
## Variables not shown: X13 (dbl), X14 (dbl), X15 (dbl), X16 (dbl), X17 (dbl), X18
##    (dbl), X19 (dbl), X20 (dbl), X21 (dbl), X22 (chr), X23 (chr)
t2 - t1
##   user  system elapsed
## 13.443   0.686  14.169

Using readr and dplyr, this data manipulation can be done about two times faster.

Let’s try data.table next. It is reported that the fread function in the data.table package is faster still, since the function is written in pure C.

library(data.table)
### in R using data.table  ###

t1 <- proc.time()
dat1 <- fread("GiantdatAl.txt", colClasses=c("character", "character", "integer",
              "double", "double", "double", "double", "double", "double", "double",
              "double", "double", "double", "double", "double", "double", "double",
              "double", "double", "double", "double", "character", "character")  )
snps <- fread("AGRN.txt", header=FALSE)
sdat <- dat1[ V1 %chin% snps$V1, ]
t2 <- proc.time()
t2 - t1
##   user  system elapsed
## 10.839   0.331  11.174

We can check that data.table is a bit faster.

Next, I used the awk command for the same data manipulation.

### using awk ####

t1 <- proc.time()
cmd2run <- "awk 'NR==FNR {h[$1] = $1; next} h[$1] !=\"\" {print $0}' AGRN.txt GiantdatAl.txt > exdat1.txt"
system(cmd2run)
sdat <- read.table("exdat1.txt", colClasses=c("character", "character", "numeric", "numeric", "numeric", "numeric", "numeric", "numeric", "numeric", "numeric", "numeric", "numeric", "numeric", "numeric", "numeric", "numeric", "numeric", "numeric", "numeric", "numeric", "numeric", "character", "character")  )
t2 <- proc.time()
t2 - t1
##  user  system elapsed
##  2.771   0.361   3.142

Yes, it is super fast. It is more than 3 times faster than any other method that I compared here.

The next one uses Perl. I was just curious and tried it. I wrote a Perl script, TakeSubDat.pl, for this.

t1 <- proc.time()
cmd2run <- "./TakeSubDat.pl AGRN.txt GiantdatAl.txt > exdat2.txt"
system(cmd2run)
sdat <- read.table("exdat2.txt", colClasses=c("character", "character", "numeric",
              "numeric", "numeric", "numeric", "numeric", "numeric", "numeric",
              "numeric", "numeric", "numeric", "numeric", "numeric", "numeric",
              "numeric", "numeric", "numeric", "numeric", "numeric", "numeric",
              "character", "character")  )
t2 <- proc.time()
t2 - t1
##   user  system elapsed
## 33.550   0.350  33.935

The speed using Perl was even slower than old R. Maybe my code is not efficient.

Next, I used the R microbenchmark package for a more exact speed comparison. f1() uses old R, f2() uses dplyr and readr, f3() uses data.table, f4() uses awk and f5() uses Perl. The same procedure was evaluated 273 times and the summary table is provided below. We can see that awk is very fast compared to the other methods.

> summary(Res)
  expr       min        lq      mean    median        uq       max neval
  1 f1() 20.309837 24.543637 28.156107 26.497877 29.282222 184.19319   273
  2 f2() 10.917517 12.781069 15.563997 14.476658 16.171709 101.48219   273
  3 f3()  8.577753 10.611985 12.609191 11.639268 13.519316  68.04706   273
  4 f4()  2.774250  3.392059  3.926914  3.625753  4.052245  18.72048   273
  5 f5() 29.320233 34.100792 36.069089 34.621827 36.131966  50.47498   273

In conclusion, it would be good to use dplyr/readr and data.table for plain data manipulation since they are easy to use (easier syntax). If speed matters, it would be good to write an awk script.

I thank Irucka Embry for the helpful suggestions to use %chin% and the microbenchmark package. If you have better opinions and suggestions, please e-mail me at ikwak@umn.edu. Thank you! 🙂

 


Data Literacy

You might have heard of “computer literacy”. Data literacy will in the future become as essential as computer literacy is now. The application of data science to various fields of knowledge is still in a nascent state. Companies that realize this and lead innovation in data science will become the leaders of the tech world. IBM’s Watson is just the tip of the iceberg. There is one example that most people can relate to: compare Google’s speech recognition against Siri, and there is a world of difference because of Google’s advanced machine learning algorithms (deep learning), which have overcome the challenges of accent to a great extent.

Machine learning will revolutionize industry like no one has ever imagined and will also displace a significant chunk of the world’s workforce. Not only companies but also the talent pool will need to adapt, just as it adapted with the advent of computers and digital technology. Here is an interesting article titled “Will Your Job Be Done By A Machine?”. The advances in diagnostic medicine made possible through machine learning are awe-inspiring. Take a look at this competition on kaggle.com.

The tools of data science need not be limited to certain niche technology sectors. They can be used in any process that involves computation over more than a few gigabytes of data. They are being used to monitor resource usage and plan for the future, and in the retail sector to predict demand and prepare for the holiday season. There is huge potential in the nexus of data science, real-time graphics and machine learning, ranging from creating a beautiful presentation to building self-driving cars, and from predicting the failure of a machine to predicting the outcome of an election.