R Code for Maximum Likelihood

In class we learned about Stata’s ml command.

R actually has several similar functions. Below is an R equivalent for most of what is covered in the Class 3 sheet, using R's optim() function. R is very useful for this sort of thing, so I'll try to walk through it in one of the labs.

ml.logit <- function(specification, data, betas.init, method) { 

	beta.iterations		<- matrix(nrow=0,ncol=dim(data)[2])
	loglik.iterations 	<- numeric(0)
	 
	sumloglik <- function(beta, y, x){ 	
        loglik <- sum(-y*log(1 + exp(-(x%*%beta))) - (1-y)*log(1 + exp(x%*%beta)))
	     	
        loglik.iterations 	<<- c(loglik.iterations, loglik)
        beta.iterations		<<- rbind(beta.iterations, betas=beta)

        return(-loglik) # switch sign b/c optim() minimizes rather than maximizes
	} 

	# get the dependent variable name
	dv = rownames(attr(terms(specification),"factors"))[1]
	# now get dependent variable data
	y = as.numeric(data[,match(dv,colnames(data))])

	# now get all our independent variable data
	# note: model.matrix() creates a matrix with an initial column
	# of 1s plus all columns in "data" corresponding
	# to right-hand side variables in "specification"
	x=model.matrix(specification, model.frame(data))
	
	# name betas
	names(betas.init) = colnames(x) 
	 
	mle = optim(betas.init,sumloglik,x=x,y=y,hessian=T, control=list(trace=6),method=method)
	return(list(betas=mle$par,vcov=solve(mle$hessian),log.likelihood=-mle$value, beta.iterations = beta.iterations, loglik.iterations=loglik.iterations))

}

# create data
n <- 1000
x1 <- rnorm(n,0,3) 
x2 <- rnorm(n,0,5)
require(arm) # for invlogit(); if arm isn't installed, base R's plogis() does the same thing
y <- rbinom(n,1,invlogit(x1-x2))

# specify model, data, etc
data <- data.frame(x1=x1,x2=x2,y=y)
specification <- as.formula("y~x1+x2")
betas.init <- rep(0,dim(data)[2]) # length works b/c y dropped, but intercept added
method <- "Nelder-Mead" # also try "BFGS", "CG", "L-BFGS-B", "SANN", "Brent"

# run model
model <- ml.logit(specification,data,betas.init,method)
model$betas
model$vcov
model$log.likelihood

# check against regular logit
(model2 <- glm(y~x1+x2, family=binomial(link="logit")))

# plot evolution of likelihood and parameters
par(mfrow=c(2,2))
plot(model$loglik.iterations, xlab="Iteration", ylab="Log-Likelihood", cex=.6)
plot(model$beta.iterations[,1], xlab="Iteration", ylab="Intercept Estimate", cex=.6) # evolution of Intercept
plot(model$beta.iterations[,2], xlab="Iteration", ylab="b1 Estimate", cex=.6) # evolution of b1
plot(model$beta.iterations[,3], xlab="Iteration", ylab="b2 Estimate", cex=.6) # evolution of b2

You can download the full code here.

Homework 3 Notes

In general the homeworks were good again.

Just one note: from here on out, always write a brief description for each question.

Even if you’re just writing a function – as in question 1 – be sure to write out at least one or two sentences about what the function is doing. Likewise, if the HW asks for a graph or table, assume that you’re also implicitly being asked to discuss it for a bit. The goal isn’t to trip you up, but to make sure you’ve got some kind of intuition about what you’ve just done.

A couple other notes about the homework follow below … they’re kind of long-ish, so see the sidebar if you want to skip around.

Logit vs. Probit

A few people read too much into the different sizes of the logit vs. probit coefficients.

Remember, for logit the coefficients are log-odds ratios, while for probit, the coefficients are z-scores. In themselves, they’re not going to tell you much. All you should pay attention to are the direction (ie, positive or negative) and, in some cases, the statistical significance.

The coefficients only become meaningful when we move into the probability space, at which point we can in fact start to talk about the size of a parameter’s effect and whether or not it is substantively significant.

(Note: for more on the difference between logit and probit, I highly recommend reading this answer on CrossValidated.)
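
To see this concretely, here's a quick sketch reusing the simulated x1, x2, and y from the maximum likelihood code above (so run that block first). The logit and probit coefficients differ in scale, but the predicted probabilities they imply are nearly identical.

logit.fit  <- glm(y~x1+x2, family=binomial(link="logit"))
probit.fit <- glm(y~x1+x2, family=binomial(link="probit"))

# coefficients differ by a roughly constant scale factor
cbind(logit=coef(logit.fit), probit=coef(probit.fit),
      ratio=coef(logit.fit)/coef(probit.fit))

# but in probability space the two models say nearly the same thing
plot(fitted(logit.fit), fitted(probit.fit),
     xlab="Logit predicted probability",
     ylab="Probit predicted probability", cex=.5)
abline(0,1)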

Gradients

Thankfully pretty much everyone recognized that the gradients were getting smaller as the log-likelihood converged on its maximum.

After digging into gradients a lot more, I think the best way to explain them is to imagine you’re hiking in the Appalachians but you’re not allowed to use a footpath. What’s the best way to reach the nearest peak?

One way would be to take the gradient with respect to height at your starting point – ie, the exact direction in the north-south/east-west plane that the mountain rises the fastest – and begin climbing in that direction.

Pretty soon you’ll face a problem though: how far do you keep travelling in that direction? Unless by some miracle the mountain is symmetrical, the first gradient you measure will not point you in the exact direction of the summit. (And even if it is symmetrical, the gradient still won’t tell you how far to travel to get to the top.)

To reach the peak, then, you need to stop and calculate the gradient every so often. Unfortunately, there’s no perfect way to know how often to do this: depending on the contour of the mountain, it might be better to find the gradient every 100 feet, or it might be better to find it every 1000. Or it might be best to do it every thousand feet at first, and then every few feet at the top. Or it might be some other combination.

You probably already realized where this is heading. Instead of a mountain in a three-dimensional space of length, width and height, imagine plotting a dataset in the three-dimensional space of age, income and education.

The way you know you’ve reached a peak in that space (or a valley; see below) is when the rate of change in each dimension approaches zero at the same time.

The question is how to get to such a peak from any given starting point. As with our mountain, the best way to get there will depend on the specific contour of the surface connecting our data. Sometimes you’ll want to calculate a gradient vector early and often, other times you’ll want to do it less.

Since it's not always clear what the most efficient way is to arrive at the peak, you want multiple ways of trying to get there – which is why multiple algorithms have been developed, such as Newton-Raphson, BFGS, and so on.
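
If you want to see the basic idea in code, below is a bare-bones steepest-ascent sketch for the logit log-likelihood. This is not what optim() actually does, and the fixed step size and iteration count are arbitrary choices; real algorithms pick the step adaptively, which is the whole point of the paragraph above.

gradient.ascent <- function(y, x, beta=rep(0,ncol(x)), step=0.0001, iters=1000) {
  for (i in 1:iters) {
    p    <- 1/(1 + exp(-(x %*% beta)))  # predicted probabilities at the current beta
    grad <- t(x) %*% (y - p)            # gradient of the logit log-likelihood
    beta <- beta + step*grad            # take a small step uphill, then re-measure
  }
  return(beta)
}

# using the simulated data from above; should end up near the optim() estimates
round(gradient.ascent(y, x=cbind(1,x1,x2)), 2)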

Hessian

So where does the Hessian come in?

As I hinted at above, a three-dimensional space of age, income and education is the same, mathematically, as a three-dimensional space of length, width and height. The difference is that when we deal with length, width and height, we’ve got a very clear sense – thanks to both our eyes and the wonders of gravity – of what a peak is and what a valley is.

In any other space though we can’t rely on those senses. Instead we need a way of distinguishing peaks from valleys mathematically.

And the way we do that is with the Hessian matrix. Along its diagonal are the second partial derivatives for each variable, which tell us the rate of change of the rate of change for each dimension.

If the rate of change of our function is zero in all dimensions and the elements along the Hessian's diagonal are all positive, then we know we've hit a valley. If they're all negative, then we know we've hit a peak. (Strictly speaking the Hessian needs to be positive or negative definite, but the signs of the diagonal are where the intuition lives.)
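
As a quick check in code, using the model object returned by ml.logit() above: ml.logit() stored vcov = solve(hessian), where optim() computed the Hessian of the negative log-likelihood, so inverting the vcov and flipping the sign recovers the Hessian of the log-likelihood itself.

H <- -solve(model$vcov)  # Hessian of the log-likelihood at the estimates
diag(H)                  # all negative => the second derivatives say "peak"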

Hessian and Variance

Finally, the other thing the Hessian gives us is the cross-partial second derivatives. These are the entries off the diagonal, and they tell us how the rate of change in one dimension itself changes as we move in another dimension.

If you think of what covariance is – ie, how two variables move together in your space – you can see how such cross-partials would be useful. The information they provide about simultaneous rates of change is what makes it possible to then use the Hessian to calculate the variance-covariance matrix for our parameters.

The ability to use the Hessian in this way is why you'll hear people loosely refer to the Hessian as the "variance" – more precisely, the inverse of the negative Hessian gives the variance-covariance matrix of the estimates.
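
In code, this is exactly what ml.logit() above did: it inverted the Hessian of the negative log-likelihood to get the variance-covariance matrix, so the standard errors are just the square roots of its diagonal, and they should line up closely with what glm() reports for model2.

sqrt(diag(model$vcov))    # standard errors from the inverted Hessian
sqrt(diag(vcov(model2)))  # standard errors from glm(), for comparison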

R Code for HW 2

If you wanted to do HW 2 in R, here’s one way you could have done it:

require(mvtnorm) # for rmvnorm() 
require(Matrix)  # for diagonal matrix


# create the data
n <- 1000
sds <- c(1,2,3,5,2,1)
means <- c(rep(0,length(sds)))
cov.matrix <- as.matrix(Diagonal(length(sds),sds^2)) # diagonal of a covariance matrix = variances, i.e. sds squared
mydata <- rmvnorm(n,means,cov.matrix)

# label the data
colnames(mydata) <- c("alpha","x1","x2","b1","b2","error")
mydata <- data.frame(mydata)
attach(mydata)

# show b1
summary(b1)
plot(density(b1))

# create and show y
y <- alpha + b1*x1 + b2*x2 + error
summary(y)
plot(density(y))

Note that rmvnorm() relies on the covariance matrix rather than the correlation matrix.

If you wanted to use the correlation matrix (as drawnorm does in Stata), here’s how you could do it:

require(mvtnorm) # for rmvnorm()
require(Matrix)  # for diagonal matrix

cor_to_cov <- function(sds, cor.matrix) {
    diag.matrix <- Diagonal(length(sds), sds)
    return(as.matrix(diag.matrix %*% cor.matrix %*% diag.matrix))
}

n <- 1000
cor.matrix <- matrix(c(  1, 0.1, 0.1,
                       0.1,   1, 0.1,
                       0.1, 0.1,   1), 3)
sds <- c(1,2,1)
means <- c(rep(0,length(sds)))
mydata <- rmvnorm(n, means, cor_to_cov(sds, cor.matrix))

colnames(mydata) <- c("alpha","x1","b1")
summary(mydata[,"b1"])
plot(density(mydata[,"b1"]))

The full .R file for all this is here.

Homework 2 Notes

For the most part this homework was good.

In the first two questions, some people didn’t realize that they had to draw the betas from the normal distribution as well, or did draw them from the normal distribution, but forgot to include them when they generated y.

The Stata code should have been something like:

drawnorm x1 x2 b1 b2 alpha error, n(10000) means(2 5 2 -2 1 0) sds(10 2 10 2 10 20)
gen y = alpha + b1*x1 + b2*x2 + error

Social Processes: Observed Variables vs Unobserved Variables

This question tripped some people up, so I want to walk through it in-depth. Put simply, the social processes underlying your observed variables are very likely different from the social processes underlying your unobserved variables.

To get an intuitive sense of this, consider the effect of violent protest on foreign direct investment. A simple, common-sense theory would suggest that the more violent protest events a country experiences, the less likely foreign investors will be to invest in that country. Moreover, it's likely that the effect of violent protest on FDI is fairly constant across countries, enough so anyway that we can reasonably model the effect of violent protest as being normally distributed around some negative mean.

We can thus model this as follows:

$$FDI_i = \alpha + \beta_{protest} \cdot protest_i + \epsilon_i$$

Where:

$$\beta_{protest} \sim \text{Normal}(\mu, \sigma^2), \qquad \mu < 0$$

But what about the violent protests themselves? What are the social processes underlying that data?

We can imagine a lot of reasons why violent protests happen, from a food shortage to government repression. Theoretically, though, violent protest is costly for everyone, so we generally assume violent protests won't happen very often; but when they do (since the government is unlikely to immediately give in), they'll probably happen a lot. A better distribution for modeling such a process is the Poisson. We'll learn more about the Poisson and its variants later, but the gist here is that such a distribution can result in most countries having relatively low levels of violent protest, while a few have quite a lot.

Formally, we would model this as:

$$protest_i \sim \text{Poisson}(\lambda)$$

That's clearly a very different distribution from the Normal we think our $\beta_{protest}$ follows.

To be clear, the differences between the distributions of our observed and unobserved data won't always be this extreme. For example, both your observed and unobserved variables may belong to the same family of distributions – which is what we assumed in the homework when we drew both from the normal distribution – but the key point is that the underlying social processes are generally different, and as a result we ought to model our observed and unobserved variables separately too.
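
If it helps to see the contrast, here's a toy sketch (the lambda, mean, and sd values are made up purely for illustration): the observed protest counts follow a Poisson, while the unobserved effect of protest follows a normal centered on a negative mean.

protest.counts <- rpois(1000, lambda=2)           # mostly small counts, a few large ones
beta.protest   <- rnorm(1000, mean=-0.5, sd=0.2)  # roughly constant negative effect
par(mfrow=c(1,2))
hist(protest.counts, main="Observed: protest events (Poisson)")
hist(beta.protest, main="Unobserved: effect of protest (Normal)")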

Homework 1 Notes

Thankfully everyone did pretty well on this. There are a few minor points that came up often though.

Coefficient Standard Errors

So. Fortunately everyone seemed to understand what happens to b1's standard error when you increase the variance of x1 on the one hand and of the error term on the other. But no one included the formal explanation, so here it is:

$$\widehat{se}(\hat{\beta}_j) = \sqrt{\hat{\sigma}^2 \left[(X'X)^{-1}\right]_{jj}}$$

Note that within the square root you've effectively got a fraction: the model variance is in the numerator, and – because the jj-th element of the inverse of X'X shrinks as the variance of x(j) grows – the variance of a particular x(j) is effectively in the denominator. Since the model variance is on top of the fraction, any increase in it will lead to an increase in all the standard errors, while since the variance of a particular x(j) is on the bottom, increasing it will lead to a decrease in the standard error of the corresponding coefficient.

Finally, while it may be clear from the equation why a 10x increase in the overall model standard error leads to a corresponding 10x increase in the coefficient standard errors (namely, you've got a square within a square root, so the exponents cancel out), it's probably less obvious that the diagonal cells of X'X are sums of squares of x(j), which is why you also get a similar 10x decrease in a coefficient's standard error when you increase the standard deviation of a particular x(j) by exactly 10x.

For more on this, see Wikipedia or work through the R code in this answer to a question about regression coefficient standard errors on Cross Validated.
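
Here's a quick simulation sketch of both claims. The particular numbers are arbitrary, and it re-simulates x1 and x2, so treat it as a self-contained snippet.

set.seed(1)
n <- 10000
x1 <- rnorm(n); x2 <- rnorm(n); e <- rnorm(n)
x1.big <- 10*x1              # same variable, 10x the standard deviation

y0 <- x1     + x2 + e        # baseline
y1 <- x1     + x2 + 10*e     # 10x the error sd
y2 <- x1.big + x2 + e        # 10x the sd of x1

round(summary(lm(y0 ~ x1     + x2))$coefficients[,"Std. Error"], 4)
round(summary(lm(y1 ~ x1     + x2))$coefficients[,"Std. Error"], 4)  # all SEs ~10x larger
round(summary(lm(y2 ~ x1.big + x2))$coefficients[,"Std. Error"], 4)  # SE on x1.big ~10x smaller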

Biased Models (Correlated Errors) vs Inefficient Models (Heteroskedastic Errors)

This one gives me a headache every time I think about it. In fact, I even have a vague memory of confusing them in the first lab.

But here’s the deal.

Correlating the error term with x1 or x2 generates biased estimated coefficients. In a word, your betas themselves are going to be off.

By contrast, heteroskedasticizing (is that a word?) your error term generates inefficient standard errors for your estimated coefficients. This is different from bias. Your estimates can still be right, but it's going to take more information for you to be as confident that they are right. Put differently, with heteroskedastic errors and a large enough N you'll end up with the same coefficients, but your standard errors are going to be bigger than they would be if the model were homoskedastic. The only way to get the standard errors down in that case is to keep increasing your N (the need for more data to reach the same level of confidence is why the term 'inefficient' is used).

One last point about this: OLS can handle some types of heteroskedasticity, provided that you correct for it appropriately in your model. So heteroskedastic errors do not immediately rule out OLS as an option. They only rule it out if there’s no way to correct for them fully (as is the case, say, with binary data).
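
Here's a toy simulation sketch of the bias-versus-inefficiency distinction (all of the numbers are invented). Correlating the error with x1 shifts the estimated slope itself; a heteroskedastic error leaves the slope roughly right but makes it bounce around more from sample to sample.

set.seed(2)
sim.slope <- function(make.y) {
  replicate(500, {
    x1 <- rnorm(500)
    y  <- make.y(x1)
    coef(lm(y ~ x1))[2]
  })
}
b.corr <- sim.slope(function(x1) 2*x1 + rnorm(500) + 1.5*x1)      # error correlated with x1
b.het  <- sim.slope(function(x1) 2*x1 + rnorm(500, sd=abs(x1)))   # error sd depends on x1
b.hom  <- sim.slope(function(x1) 2*x1 + rnorm(500))               # well-behaved baseline

c(corr=mean(b.corr), het=mean(b.het), hom=mean(b.hom))  # only the correlated-error slope is biased
c(het=sd(b.het), hom=sd(b.hom))                         # heteroskedastic slope is unbiased but noisier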

Endogeneity

A few people said something to the effect of, “correlated errors so omitted variable bias so OMG ENDOGENEITY AHHH!”

That’s true as far as it goes. Endogeneity is in fact present when one variable is informed/determined by another variable in your system.

The issue is that there are two very different kinds of endogeneity. The endogeneity that is really problematic is when your y is influencing your x1. Since the whole point of this social science thing is causal inference, when your IVs are endogenous with your DV, that's a HUGE problem. That's why, when a senior prof goes off on endogeneity in someone's model at a job talk and things get super awkward, this is almost always the kind of endogeneity being discussed.

By contrast, when an IV is endogenous with an omitted variable, that’s problematic, but it’s problematic because it’s messing with your estimate of an IV’s causal effect, not because it’s throwing into doubt the entire direction of the causal relationship between your IV and DV.
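
To make that second kind concrete, here's a toy sketch of omitted-variable bias (the numbers are invented): x1 is correlated with an omitted variable z that also affects y, so leaving z out pulls the estimate of x1's effect away from its true value.

set.seed(3)
n  <- 10000
z  <- rnorm(n)            # the omitted variable
x1 <- 0.7*z + rnorm(n)    # x1 is correlated with z
y  <- 1 + 2*x1 + 3*z + rnorm(n)
coef(lm(y ~ x1 + z))      # controlling for z: slope on x1 is ~2
coef(lm(y ~ x1))          # omitting z: slope on x1 is pulled well away from 2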

I guess what I’m saying is: be careful when you throw up your hands and shout ‘endogeneity!’ It only really merits the shout if the dependent variable might be involved.

R Code for Class 1

Below I’m going to walk through some examples for how to do in R the kinds of things we covered in class using Stata.

The whole .R file for this is here.

Simulate Data and Run Linear Regression

Here’s how to simulate some basic data and run a regression:

set.seed(123)
n <- 100
x1 <- rnorm(n,0,1)
x2 <- rnorm(n,3,5)
error <- rnorm(n,0,10)
y <- x1 + x2 + error

model <- lm(y~x1+x2)
summary(model)
vcov(model)

One thing to pay attention to is the last two lines. The variable model stores, in a single object, all the data associated with your regression. summary() and vcov() perform operations on the data in that object, but you can also access those data directly.

Try running the following and see what happens:

model$coefficients
model$residuals

Simulate Data from a Multivariate Normal Distribution

In Stata, we learned how to use drawnorm.

To do the same in R, you first need to install the mvtnorm package. Then run the following:

set.seed(123)
n<-10000
require(mvtnorm)
my.means <- c(x1=0,x2=3,err=0)
my.varcov <- matrix(c(3^2,0,0,0,5^2,0,0,0,10^2),ncol=3)
my.data <- rmvnorm(n,my.means,my.varcov)
my.data <- list(x1=my.data[,1],x2=my.data[,2],err=my.data[,3])
attach(my.data)
y <- x1 + x2 + err
model<-lm(y~x1+x2)
summary(model)
vcov(model)

You should get fairly similar results compared with what you got above. Note that you're passing the variance-covariance matrix here, not the correlation matrix, so the off-diagonal entries aren't bounded between -1 and 1.
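
If you want to see the correlation matrix implied by a covariance matrix, base R's cov2cor() does the conversion for you:

cov2cor(my.varcov)  # rescales the covariance matrix into a correlation matrix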

Checking Your Errors

R actually comes with normal QQ-plot functions built in, which make it easy to check your residuals. Run the following:

set.seed(123)
x1 <- rnorm(1000)
error <- rnorm(1000)
y <- x1 + error
model <- lm(y~x1)
qqnorm(resid(model))
qqline(resid(model))

Now try again with errors that aren't normally distributed:

y <- 2 + x1 + error*error
model <- lm(y~x1)
qqnorm(resid(model))
qqline(resid(model))

Needless to say, for linear models this is a great way to visually check the assumption of normally distributed errors. (Note that squaring the error term skews it rather than making it heteroskedastic; to eyeball heteroskedasticity specifically, plot the residuals against the fitted values, e.g. with plot(model, which=1).)