Wednesday, March 18, 2015

Comparison of packages RPostgreSQL and RPostgres, vol 1 - connecting

Analysing data often includes that data being stored in a database. In my case it's stored in a PostgreSQL database. My data flow is pull data into R -> analyse it, compute some new features -> return data into database for other people and tools. I could do it via files, but apparently pulling data directly from Postgre into R is really fast.

The first thing though is connecting to said database. There's a choice between two packages, RPostgreSQL and RPostgres.The former is by Dirk Eddelbuettel and the latter is by Hadley Wickham. Very well known names in the R world and they were both also added as members to R Foundation in 2014.

Setting up the connection with RPostgreSQL:

library(RPostgreSQL)

dbhost <- '127.0.0.1'
dbport <- '5432'
dbuser <- 'lauri_koobas'
dbname <- 'my_db'

## loads the PostgreSQL driver
drv <- dbDriver("PostgreSQL")

## Open a connection
con <- dbConnect(drv, host=dbhost, port=dbport, dbname=dbname, user=dbuser, password=scan(what=character(),nmax=1,quiet=TRUE))

The password part works in such a way as to prompt you for it in console window. I usually follow it up by hitting CTRL-L to clear the console as well. No need to save or show your password anywhere for too long.

Setting the connection up with RPostgres is pretty much the same, except you use RPostgres::Postgres() instead of drv as parameter for dbConnect.

But what if you need to run your script automatically? Or have it deployed to a production server where you don't have a way to go and type in your password. Writing passwords directly into code is a very bad idea as the code will end up in a repository and you really don't want it to end up in Github.

Well, there's more than one way to approach this problem as well. You could write the password to a configuration file that is not stored in a repository. Or you could put the whole connection string to a configuration file, so your script only loads it and connects.

But PostgreSQL offers another solution - the .pgpass file. It works a bit diffently in Windows and Unix, but essentially it's a type of configuration file that is especially meant for this exact problem. It's of the following format and you can use wildcards in it:

hostname:port:database:username:password

Using it is a bit tricky though and took me a while to figure out. What you need to do is provide everything BUT the password and then it will go and find the password in that .pgpass file. An example using RPostgres package.

.pgpass file:
*:*:my_db:lauri_koobas:password

R script:
con <- dbConnect(RPostgres::Postgres(), host="127.0.0.1", port="5432", dbname="my_db", user="lauri_koobas")

That's it. Keep your code versionized and your passwords safe!

Tuesday, March 17, 2015

Statistical Learning MOOC by Stanford

I took the course and this is a short summary of my feelings about it.

Pros:

  • there was some humor (or attempts of it) by presenters
  • the book this course is based on is very good, at least the few pages of it that I managed to read
  • the interviews with the original inventors of some of the methods were cool!
Cons:
  • the quizzes were rather cryptic and of very variable difficulty, especially the ones concerning R
  • if you answered incorrectly then "show answer" was sometimes very helpful and other times not at all
If you've done mostly Coursera courses, then the style of quizzes is very different - the current course asks your intuition about exceptions to the rule. It will show you the limitations of what you just learned and is best treated as an additional learning resource. The forums were fairly dead, compared to your regular Coursera course, but still helpful - all you really need is one good answer.

In conclusion, the course helped me see and understand some new aspects of modeling. As it was my main goal anyway, then it's all good.

Monday, February 23, 2015

How to manage a large amount of data with (Postgre)SQL in R

In my R projects I've used data stored in a MySQL database before and it's straightforward. You send a query and you receive the result and that's it. It works well if you work with a small dataset and can manage to write the queries by hand or inserting/updating only tens or hundreds of rows in the database. Just looping through the data and doing the queries one by one is the fast (to write) and easy solution.

Recently I've had a project where the backend is PostgreSQL and the amount of data to pull and push is a few hundred thousand rows at the time. Querying is still fine, but doing updates in a loop is definately not. Even at 1 second per insert it would take days to push the results back into database. I could dump the results into a file and COPY that into database and that's probably the only solution once the number of rows gets into millions, but there is another way for intermediate data size (1k through 1mln).

The package used is RPosgtreSQL (with examples at Google Code). The aspect that makes inserting a large number of rows at once is the only very slightly mentioned multiple insert format - bottom of manual. The idea is that you just list the values one after the other:

INSERT INTO products (product_no, name, price) VALUES
    (1, 'Cheese', 9.99),
    (2, 'Bread', 1.99),
    (3, 'Milk', 2.99);

Sounds simple enough. But I hope you want to be safe in your SQL queries, so you escape all the values. It appears that part is somewhat complicated to do. Luckily the famous dplyr by Hadley comes to resque with functions like sql() and escape(). There isn't a good vignette out there for it, the best is probably just the ?sql help inside R.

The goal is to escape everything that is not a static text. Lets say we want to insert multiple rows into table v_tmp into columns named key_row, value1, value2 and added_time. And here's the code that does it - every value is escaped and put through the sql() function. The output looks properly weird, but when R sends it to database, then the correct things end up in tables.

df <- data.frame(key_row = 1:5, value1 = paste0("It's '", letters[1:5], "'!"), value2 = rnorm(5))
# key_row    value1      value2
# 1       1 It's 'a'!  2.74812502
# 2       2 It's 'b'! -0.06665964
# 3       3 It's 'c'! -0.40579730
# 4       4 It's 'd'! -0.41636723
# 5       5 It's 'e'!  0.56018515

# start off the insert clause
v.insert <- "INSERT INTO v_tmp(key_row, value1, value2, added_time) VALUES "
# pre-make the vector for pieces
v.pieces <- character(nrow(df))
# cycle through the data (in data.frame called df)
for (i in 1:nrow(df)) {
  v.pieces[i] <- sql(paste0(
    '(', escape(df$key_row[i]), ',
    ', escape(df$value1[i]), ',
    ', escape(df$value2[i]), ', now())'
  ))
}
# put it together
v.insert <- paste0(v.insert, paste0(v.pieces, collapse=","))
# v.insert result:
# [1] "INSERT INTO v_tmp(key_row, value1, value2, added_time) VALUES (1,\n    'It''s ''a''!',\n    2.74812501918287, now()),(2,\n    'It''s ''b''!',\n    -0.0666596436163713, now()),(3,\n    'It''s ''c''!',\n    -0.405797295384545, now()),(4,\n    'It''s ''d''!',\n    -0.416367233404705, now()),(5,\n    'It''s ''e''!',\n    0.560185149545129, now())"
# and into the database it goes
dbSendQuery(con, v.insert)


Friday, February 20, 2015

Function to easily draw a scatterplot with polynomial regression lines

In exploring ones data (e.g., for subsequent modeling), it is often useful to fit different order polynomial regression lines to compare how they fit.

Although R is very good for plotting, adding nonlinear regression lines to a plot is a bit tedious. Here’s a simple function 'polyreglines' that plots a scatterplot of x ja y and adds polynomial regression lines up to specified order. It also adds a legend with adjusted R-squared values for the models. When argument “all” is set to FALSE, only one regression line of specified order is drawn.

polyreglines <- function(x.txt, y.txt, data, order=3, all=T, xlab, ylab, leg.pos, ...) {
  x <- with(data, get(x.txt))
  y <- with(data, get(y.txt))
  if(missing(xlab)) xlab <- x.txt
  if(missing(ylab)) ylab <- y.txt
  if(all==T) {
  plot(x, y, xlab=xlab, ylab=ylab , ...)
  R2s <- numeric(length=order)
    for(i in 1:order) {
      fit <- lm(y~poly(x,i), data=data)
      R2s[i] <- round(summary(fit)$adj.r.squared,2)
      x1 <- seq(from = min(x), to = max(x), length.out = 1000)
      g <- data.frame(x = x1)
      lines(x1, predict(fit, g), lty=i)
    }
    if(missing(leg.pos)) leg.pos = "topright"
    legend(leg.pos,legend=R2s,lty=c(1:order),title=expression(Adjusted ~ R^2))
  }
  else {
     plot(x, y, xlab=xlab, ylab=ylab , ...)
     fit <- lm(y~poly(x,order), data=data)
     R2s <- round(summary(fit)$adj.r.squared,2)
     x1 <- seq(from = min(x), to = max(x), length.out = 1000)
     g <- data.frame(x = x1)
     lines(x1, predict(fit, g))
     if(missing(leg.pos)) leg.pos = "topright"
     legend(leg.pos,legend=R2s,lty=1,title=expression(Adjusted ~ R^2))
   }
}

Note that x and y are column names and so have to be within quotation marks. (It is also possible to pass further arguments to plot and also specify the legend position.)

Example:
polyreglines("mpg","hp", mtcars, 2)


polyreglines("mpg","hp",mtcars,2,all=F,leg.pos="bottomleft", main="Example")


If you think this function is useful for you, you can save the code in a text file (e.g. as “polyreglines.R”) in your working directory and load it using source() (e.g. source(“polyreglines.R”))

Wednesday, September 17, 2014

Waking up the blog again!

It's been a good long while since my last post. It feels like it's time to pick it up again much more regularly. Someone said that one should think about the balance between how much they consume and how much they produce. The author of if was talking about intellectual stuff. So here I am - reading massive amounts daily, but writing nothing down. It's about to change since I also know that expressing thoughts verbally is good for retention.

Lets start with an awesome post I read this morning - Modern anti-spam and E2E crypto. It talks at length about how Google and other email service providers have waged war with spammers. He also links to a Google blog post that talks about how they decide if you should be asked additional verification upon log in attempt:
In fact, there are more than 120 variables that can factor into how a decision is made.
Talk about feature engineering :)

But how is that all connected to data science? Well, all of it is classification problems, coupled with mostly reinforcement learning. Most of the stuff the guy writes about has probably never seen the light of day - all proprietary magic. Fascinating! :)

Anyways, my evening watch list has been this - videos from ICML2014. Enjoy.