2012-12-31

How odd was the UEFA draw?

I've been away for some time without closely following the media, and without significant internet access. When such a period is over, it takes some time to regain momentum, so my short exit poll series will be continued in 2013. For now I'm still sorting out what I missed over the last few weeks; today's short post is part of that process.

There has been some discussion while I was away about how probable the UEFA Champions League draw really was. The Daily Mail quoted "leading football statisticians" who came up with odds of 2160900:1, and bookies who would have offered 5000:1.

Several bloggers on R-bloggers contributed articles, and the consensus they arrived at was that the chances were 1:5463, or 0.0001830496. I'll take this as given.

 

The odds are much better

 

Still, 0.0001830496 doesn't seem like much of a chance. But this number doesn't reflect the probability of seeing what happened in Nyon. Here's why:

UEFA made several dry runs, not just the one that has been discussed. According to the German magazine SPIEGEL ONLINE, a UEFA spokesman confirmed that they carry out "numerous tests".

Just one of these "numerous" dry runs turned out to be the same as the final draw. I don't know how many draws were made for testing purposes, but let's assume they routinely do ten dry runs before the final draw.

Let's assume further that there are 5463 different results that are all equally probable. Then the chance that at least one of the ten dry runs matches the final draw is roughly ten times the single-draw probability: about 1:546.3, or 0.001830496.
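The back-of-the-envelope figure above just multiplies the single-draw probability by ten. A quick sanity check in R, assuming ten independent dry runs, gives nearly the same number:

# chance that at least one of ten independent dry runs
# matches the final draw, with 5463 equally likely results
p <- 1/5463
1 - (1 - p)^10
# [1] 0.001828989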

 

From Nyon to Monte Carlo...

 

Let's do some simulation now. I created the function sim.nyon as a quick and dirty way to do the whole job:

sim.nyon <- function(poss, testruns, n.sims) {
  # parameters:
  # poss: number of possible results to draw from
  # testruns: number of dry runs before the final draw
  # n.sims: number of simulations to be run
  runs <- testruns + 1 # dry runs plus final draw
  y <- sample(poss, (n.sims * runs), replace = TRUE)
  y <- matrix(y, nrow = n.sims)  # one row per simulated draw procedure
  y <- y - y[, runs]             # subtract the final draw from every column
  y <- y[, -runs]                # keep only the dry run columns
  # a zero marks a dry run that matched the final draw
  is.like.nyon <- apply(y, 1, function(x) any(x == 0))
  return(table(is.like.nyon))
}

In the UEFA case we want to run 10,000 simulations with 5463 possibilities and assume 10 dry runs:

possibilities <- 5463
testruns <- 10
sims <- 10^4
sim.nyon(possibilities, testruns, sims)
# is.like.nyon
# FALSE  TRUE
#  9981    19

So the UEFA draw doesn't seem to be that improbable after all.

2012-11-22

Washington Gave Thanks To R



The folks over at is.R() brought along an "adorable Turkey". I guess we'll thank them for their gift - and pardon the turkey. Personally, I thank them for a bucket of hints and tricks I have freely used and will continue to use.

Tal brought us a great community of R bloggers. I thank him for the work he invested, and personally, I thank him for allowing me in.

I thank the readers, and the writers, and I thank the R team for giving us "the means we have of acquiring and diffusing useful knowledge". George Washington told us to do so:

That we may then all unite in rendering unto him our sincere and humble thanks, for his kind care and protection of the People of this Country previous to their becoming a Nation, for the signal and manifold mercies, and the favorable interpositions of his providence, which we experienced in the course and conclusion of the late war, for the great degree of tranquility, union, and plenty, which we have since enjoyed, for the peaceable and rational manner, in which we have been enabled to establish constitutions of government for our safety and happiness, and particularly the national One now lately instituted, for the civil and religious liberty with which we are blessed; and the means we have of acquiring and diffusing useful knowledge; and in general for all the great and various favors which he hath been pleased to confer upon us.

(George Washington, on October 3, 1789)

The following produced the plot at the top:

library(gplots)
message <- "...the means\nwe have of\nacquiring\n"
message <- paste0(message, "and\ndiffusing\nuseful")
message <- paste0(message, "\nknowledge...")
textplot(message, "center", "top", col="green")

And here's a way to get the Thanksgiving Date from R:

library(RcppBDT)
thanksgiving <- getNthDayOfWeek(fourth, Thu, Nov, 2012)

Happy Thanksgiving!

2012-11-17

Get the exit polls from CNN using R and Python

Yesterday I posted an example of plotting 2012 U.S. presidential exit poll results using ggplot2. There I took for granted that a data.frame containing all we need resides in a file called "PresExitPolls2012.Rdata". Today I want to show how I scraped the data from CNN.

The challenge


At first I tried to scrape the site using RCurl and the XML package. But the result was very disappointing. I just got empty data.frames while all browsers I used showed the data. Looking at the source code of the page, however, was equally disappointing:

Where I expected the percentage of, say, women voting for Romney, I saw a JavaScript variable name. Only looking at the generated source with Firebug revealed the data: the CNN pages are dynamically created by JavaScript that jQueries the data into variables. There was no way to get at the data with RCurl.

The solution


So I needed a real browser that could be controlled by a script. I decided to use a Python script to read the generated HTML from CNN. The Python code draws heavily on a thread I stumbled upon in a German forum.
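Here is a minimal sketch of such a script, assuming PyQt4's headless WebKit browser; the toolkit choice, the file name render_page.py, and all details are my assumptions, not necessarily what the forum thread used:

# render_page.py - load a URL in a headless WebKit browser and
# print the JavaScript-generated HTML to stdout (PyQt4, Python 2)
import sys
from PyQt4.QtCore import QUrl
from PyQt4.QtGui import QApplication
from PyQt4.QtWebKit import QWebPage

class Render(QWebPage):
    def __init__(self, url):
        self.app = QApplication(sys.argv)
        QWebPage.__init__(self)
        self.loadFinished.connect(self._load_finished)
        self.mainFrame().load(QUrl(url))
        self.app.exec_()

    def _load_finished(self, ok):
        # at this point the DOM is complete, jQueried data included
        self.html = self.mainFrame().toHtml()
        self.app.quit()

if __name__ == '__main__':
    page = Render(sys.argv[1])
    print(unicode(page.html).encode('utf-8'))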


Next I needed a function in R that puts together the URL for one of the CNN state sites, calls the Python script, and returns a page tree of the generated HTML. getStateData() does the job:
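A sketch of what getStateData() might look like under the assumptions above; the CNN URL pattern shown is hypothetical:

getStateData <- function(state) {
  require(XML)
  # hypothetical URL pattern - not necessarily CNN's actual scheme
  url <- paste0("http://www.cnn.com/election/2012/results/state/",
                state, "/president")
  # let the Python/WebKit script render the page, JavaScript included
  rawHTML <- paste(system(paste("python render_page.py", url),
                          intern = TRUE), collapse = "\n")
  htmlParse(rawHTML, asText = TRUE, encoding = "UTF-8")
}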



The page tree getStateData() returns contains a lot of noise, like preliminary county results for some, but only some, of the counties. There are some "fake" exit polls designed to explain "how to read exit polls". And the results for every question appear a couple of times.

Filtering out the noise


To separate the wheat from the chaff, I split the job over two functions, parseEpNode and getExitPolls.

getExitPolls parses the tree using XPath, then calls parseEpNode for each of the nodes containing exit polls. (As an aside: this is an application of the "Split-Apply-Combine Strategy for Data Analysis" (pdf) described by Hadley Wickham when he introduced the plyr package. Ironically, my getExitPolls doesn't use plyr::llply but the standard lapply, though it does make use of plyr::rbind.fill...)

parseEpNode is the real workhorse of the process. It filters out duplicate entries and demo polls. Again it relies on the Split-Apply-Combine strategy without using l*ply. Sometimes lapply is easy enough, and Hadley himself uses it internally in some cases as well.
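A rough sketch of this division of labor; the XPath expression is a made-up placeholder, and parseEpNode (whose body is not shown here) is only referenced:

getExitPolls <- function(state) {
  require(plyr)
  require(XML)
  tree <- getStateData(state)
  # split: one node per exit poll block ('exit-poll' is a hypothetical class)
  nodes <- getNodeSet(tree, "//div[@class='exit-poll']")
  # apply: parseEpNode() drops duplicates and demo polls (definition not shown)
  polls <- lapply(nodes, parseEpNode)
  # combine: rbind.fill() tolerates differing answer categories across polls
  cbind(state = state, rbind.fill(polls))
}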


Putting it all together


This script puts it all together and produces the Rdata file whose existence I simply assumed yesterday. It starts with the list of the 19 states + D.C. where no exit polls were conducted in 2012, taken from the Washington Post, and derives the states of interest, again as a list to which getExitPolls can be lapply'd.
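A sketch of that script's structure; the no-exit-poll list below is my reconstruction of the Washington Post's list and should be checked against the source, and the "US" label for the nationwide data is an assumption:

# 19 states without 2012 exit polls, plus D.C. - reconstruction, please verify
noEP <- c("AK", "AR", "DE", "GA", "HI", "ID", "KY", "LA", "NE", "ND",
          "OK", "RI", "SC", "SD", "TN", "TX", "UT", "WV", "WY", "DC")
# 31 states plus the nationwide data: 32 "states" in all
states <- c(setdiff(state.abb, noEP), "US")
EP <- plyr::rbind.fill(lapply(states, getExitPolls))
save(EP, file = "PresExitPolls2012.Rdata")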



A probably much shorter post will add some improvements to the process. More later...

2012-11-15

Exit PEBOS - Enter exit polls



PEBOS is over. Time to look at the details of the election. The final results are not yet in, but the exit polls are there, and up for grabs. Just to warm up, here's a tiny example.

Obviously Romney had an age problem. But for now I don't want to speculate about political consequences. This is just an example plot.

Let's imagine we have a data.frame "EP" that contains the state level exit polls for the presidential election 2012. (Actually, I have these data, and tomorrow I'll post how I got them using R - and a tiny bit of Python. For today I just let them reside in a file called "PresExitPolls2012.Rdata".)

Update: I've released the code to create the PresExitPolls2012.Rdata file today.

I fire up R, and the first code snippet is:

library(ggplot2)
library(plyr)
library(reshape)
 
load(file="PresExitPolls2012.Rdata")
head(EP)
For now I just concentrate on the "Vote by Age" question. There are two different age groupings for it:

unique(EP$QNo[EP$question=="Vote by Age"])
# 4 category breakdown
head(EP[EP$QNo==2, ])
# 6 category breakdown
head(EP[EP$QNo==3, ])

Today I want to produce a plot of the 6 category breakdown, so I reduce the data and do some checks:

1. There might be some inconsistency between states in the numbering of the questions. There should be 6 categories for each state.

2. This year's exit polls have been conducted in 31 states. In addition to this the reduced dataset should contain the nation wide data. So I expect 32 "states" in the newly created VbA dataset.

Both checks can easily be implemented with the daply function from Hadley's plyr package:

VbA <- EP[EP$QNo==3, ]
unique(daply(VbA, .(state), nrow)) == 6
length(daply(VbA, .(state), nrow)) == 32

The plot needs the data to be in "long format". I let Hadley's melt function (from the reshape package) do the job. Then I remove all candidates except Obama and Romney.

vba <- melt(VbA, id = c("state", "answer"), variable_name = "Candidate")
unique(vba$Candidate)
# we're only interested in Obama and Romney
vba <- vba[vba$Candidate %in% c("Obama", "Romney"), ]

Finally the plot can be created. Initially the plot was a mess with garbled and unreadable text elements. I'm indebted to the people over at is.R() for their most valuable hints that helped me arrive at a readable plot.


But before plotting there's a fix to be applied. In the vba data.frame the candidates' numbers were not numeric but character strings. For some reason I have yet to look into, this made the NAs appear like peaks, with both candidates at roughly the same value of about 70. (Thanks to lemonlaug, whose comment alerted me to the absurdity in the original plot.)

Now to the fix. It's as simple as that:

vba$value <- as.numeric(vba$value) # the values came out of melt as character strings

Here's the final code snippet:

png(file = "VbA2012.png", width = 960, height = 960)
ggplot(vba, aes(answer, value)) +
  geom_line(aes(group = Candidate, color = Candidate)) +
  facet_wrap(~ state, ncol = 4) +
  labs(title = "2012 Presidential Vote by Age\n",
    y = "Percentage\n",
    x = "Age group\n"
  ) +
  theme(axis.text.x = element_text(colour = "black",
          size = 9,
          angle = 45,
          vjust = 1,
          hjust = 1),
        axis.text.y = element_text(colour = "black",
          size = 9,
          angle = 0,
          vjust = 1,
          hjust = 1)
  ) +
  scale_y_continuous(breaks = c(30, 50, 70)) +
  scale_colour_manual(values =  c("darkblue", "darkred"))
dev.off()

2012-11-08

Another crosshairs


C. DeSante over at is.R() has PEBOS as well, but turned it into a great explanation of the way predictions like Nate Silver's work.

For a while the 538 team had PEBOS as well: "The FiveThirtyEight team is still recuperating, but the election provided a fresh supply of data points that we’ll be connecting in the coming days."


This inspired me to add some predictions of the popular vote to my crosshairs plot. The data are from here.

We can see that Nate was closest. Well, how does the old song go?
Pass my window, pass my gate
Pass my window, pass my gate
You can pass my window, can pass my gate
But you'll never pass FiveThirtyEight...
The additional code is simple:
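A sketch of the kind of addition meant, with purely illustrative names and numbers; 'preds' and all its values are placeholders (not the real prediction data), and the axis convention - Romney share on x, Obama share on y - is an assumption:

# hypothetical prediction data, added on top of the existing crosshairs plot
preds <- data.frame(
  who    = c("Forecaster A", "Forecaster B", "Forecaster C"),
  obama  = c(50.8, 51.0, 50.5),   # predicted Obama share (illustrative)
  romney = c(48.3, 48.0, 48.5)    # predicted Romney share (illustrative)
)
points(preds$romney, preds$obama, pch = 19, col = "darkgreen")
text(preds$romney, preds$obama, labels = preds$who, pos = 4, cex = 0.8)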

PEBOS (Post Election Burn Out Syndrome)




I guess that all those who tried to follow the presidential election as closely as possible are more than just a little bit exhausted mentally. I call this PEBOS - Post Election Burn Out Syndrome.

Some of us concentrated on the horserace aspect of the polls. They will wake up a few days from now, released from their fever, and turn their eyes to other things until 2014 (or even 2016) revives their political passion.

Others, like me, will spend weeks and months studying the data amassed and not sufficiently looked into over the last year. When the final results are in, all polls cleanly collected, and the ANES 2012 data have been released, it's time to indulge in the analysis...

For now, I'll try to get some rest.

And this is the code, on github, that generated the crosshairs plot shown above. We can see that the national polls were somewhat biased against Barack Obama, though they mostly underestimated the Romney vote as well.

More to come very soon.

2012-11-01

PrettyR R

When it comes to R blogging I'm a complete newbie. So I'm still struggling with the technical details.

Part of the process is prettifying the code snippets. One of the standard ways of doing this involves copy-and-paste-ing the R code into the Pretty R syntax highlighter.

While assembling the bits to do the posting programmatically I wrote a function that replaces the copy-and-paste part.

Now here's the function prettified by itself:

prettyR <- function(file) {
  require(RCurl)
  require(XML)
  Rcode <- readLines(file)
  Rcode <- paste(Rcode, collapse="\n")
  # assemble the parameters for the http POST to the Pretty R web site
  URL <- "http://www.inside-r.org/pretty-r/tool"
  parameters <- list(
    op = "edit-submit", 
    form_id = "pretty_r_tool_form", 
    code_input = Rcode
  )
  # send the http POST request
  rawHTML <- postForm(URL, .params = parameters)
  parsedHTML <- htmlParse(rawHTML)
 
  # find the node
  prettified <- getNodeSet(parsedHTML, "//div[@class='form-item']/textarea")[[1]]
  prettified <- xmlValue(prettified[[1]])
  return(prettified)
}

2012-10-27

Replicating Krugman (Pt. 1)

I'm a regular reader of Paul Krugman's NYT blog. A couple of months ago I decided to replicate some of the plots Paul published. At the time I was evaluating the quantmod package, which seemed to make things easy.

An easy start

In the first part I'll reproduce two plots from Real Government Spending Per Capita. It's straightforward to get the data using quantmod:

require(quantmod)

# First we get the time series from the Federal Reserve Bank of St. Louis (FRED) 
getSymbols('GEXPND', src = 'FRED')
getSymbols('GDPDEF', src = 'FRED')
getSymbols('SLEXPND', src = 'FRED')
getSymbols('POP', src = 'FRED')
getSymbols('USREC', src = 'FRED')

# Paul Krugman doesn't use the raw series but real per capita values.
# These are the necessary calculations.
log_gov_overall <- log(GEXPND/(GDPDEF*POP), exp(1))
log_gov_state_local <- log(SLEXPND/(GDPDEF*POP), exp(1))

It's just as easy to create the standard plots:

plot(log_gov_overall
  , main = "Real Government Spending Per Capita\nOverall (Log Values)"
)
plot(log_gov_state_local
  , main = "Real Government Spending Per Capita\nState & Local (Log Values)"
)

A closer look, however, reveals that something is clearly missing. Paul took his plots straight from FRED itself, and one of FRED's features is that recessions are shown by graying the background.

FRED style plots

After struggling with the parameters and different quantmod plots for a while I decided to write a function that takes the series to be plotted as a parameter and creates a FRED style plot.

This makes some use of the internal workings of plot.xts:

FREDplot <- function(x
  , main = deparse(substitute(x))
  , las = 1
  , ...) {
  # function to create FRED style plots with shaded regions for U.S. recessions
  # the time series needs to be an xts object
  if (!inherits(x, "xts")) {
    stop("Only xts objects can be properly handled by this function!")
  }
  # get the recession dates from FRED only once
  if  (!("USREC" %in% ls(name = globalenv()) &&
      attributes(USREC)$src == "FRED")){
        getSymbols('USREC', src = 'FRED')
  }

  # transform FRED recession dates to xy.coordinates
  usrec <- xy.coords(.index(USREC), USREC[, 1])

  # prepare plot
  xycoords <- xy.coords(.index(x), x[, 1])
  ep <- axTicksByTime(x, TRUE, format.labels = TRUE)
  plot.zoo(xycoords$x, xycoords$y
    , type = 'n', axes = FALSE, ann = FALSE
  )

  # plot axis & box
  axis(1, at = xycoords$x, labels = FALSE
    , col = "#BBBBBB"
  )

  axis(1, at = xycoords$x[ep], labels = names(ep)
    , las = 1, lwd = 1, mgp = c(3, 2, 0)
  )

  axis(2, las = las)

  # create shaded areas
  xblocks(usrec$x, ifelse(usrec$y == 1, "gray", "white"))

  # add grid
  abline(v = xycoords$x[ep], col = "grey", lty = 4)
  abline(h = seq(min(xycoords$y), max(xycoords$y)
    , length.out = 9), col = "grey", lty = 4
  )

  # plot curve
  lines(xycoords$x, xycoords$y,...)

  # add title
  title(main = main)
  title(sub = "Shaded regions are US recession periods\n(data from research.stlouisfed.org)")

  box()
}

Using this function the plots look much better:

# the first plot
FREDplot(log_gov_overall,
  main = "Real Government Spending Per Capita\nOverall (Log Values)")

# and the second
FREDplot(log_gov_state_local,
  main = "Real Government Spending Per Capita\nState and Local Level (Log Values)")

Actually I like these plots better than the originals here and there. There are, however, a couple of things that could - and should - be improved. More on this in future installments.

I'll call it a day for now.