A primer and user’s guide to the discrete choice inequality model proposed in Dickstein and Morales (QJE, 2018, “What Do Exporters Know?” [link]).

Dickstein and Morales: What is it and why should you care?

Dickstein and Morales (QJE, 2018), hereafter referred to as DM, propose a new estimator for discrete choice models in which agents may have expectations about some of the covariates that enter their utility/profit functions. They derive moment inequalities that are consistent with the underlying model while allowing for arbitrary distributions of expectations, subject to the restriction that agents predict covariates correctly on average (i.e. rational expectations). Their application examines firms’ decisions about whether to export: each firm forms a prediction about the profitability of exporting when deciding whether to do so. Each firm can hold arbitrary individual expectations about some of the covariates in the profit function (such as the anticipated revenue from sending products to a given country), subject to the rational expectations restriction that firms predict those covariates correctly on average. Furthermore, the econometrician does not have to specify or estimate the distribution of expectations. The cost is that the estimates may be set identified, and the econometrician needs access to high-quality instruments that are correlated with the underlying agent beliefs. DM show in both Monte Carlos and their application that the estimator is quite powerful and can provide meaningful estimates of the underlying utility/profit function.

Finally, the estimator can be used to test the information sets of agents. The basic idea is to use a specification test adapted to the setting of moment inequalities, such as that of Andrews and Soares (ECMA, 2010), to assess whether the statistical model is consistent with the assumption that agents took a particular variable into account when making the discrete choice. Under the null hypothesis, violations of the inequalities are orthogonal to the covariates, and specification tests provide a formal mechanism for testing that hypothesis.

Model

The Canonical Discrete Choice Model

We will consider a binary discrete choice setting. Agent \(i\) chooses between a single inside good and an outside option. The agent receives utility from each good; as only the difference in utility between the two choices matters, we normalize the utility of the outside good to zero. We let the inside good have the following utility:

$$u_i = x_{i1} \beta_1 + x_{i2} \beta_2 + \epsilon_i,$$

where \(x_i\) are observable covariates and \(\beta\) are marginal utilities, the parameters of interest in this model. To rationalize the fact that agents sometimes appear to choose “dominated” options, we allow the utility of the inside good to depend on an additive preference shock, \(\epsilon_i\), whose distribution function is \(F(\epsilon_i)\). To emphasize: this agent-choice-specific shock is observed by the agent but not by the econometrician.

The agent chooses the inside option if and only if it maximizes their utility:

$$1(d_i=1|X_i,\beta) = 1(X_i\beta + \epsilon_i > 0).$$

The probability of this event is:

$$Pr(d_i=1|X_i,\beta) = Pr(X_i\beta + \epsilon_i > 0) = Pr(\epsilon_i > -X_i\beta) = 1- F(-X_i\beta).$$

This theoretical choice probability forms the basis for a maximum likelihood or GMM estimation procedure. For example, for a guess of the utility parameters, the log-likelihood contribution of agent \(i\) is:

$$LLH_i(\hat{\beta}) = d_i \ln Pr(d_i = 1 | X_i,\hat{\beta}) + (1-d_i) \ln \left(1-Pr(d_i = 1 | X_i,\hat{\beta})\right).$$

We denote the argument of the LLH with a hat to emphasize that it is a candidate parameter value rather than the truth. Taking the derivative with respect to \(\hat{\beta}_j\) and setting it to zero, we obtain the score condition:

$$d_i \frac{1}{Pr(d_i=1)} \frac{\partial Pr(d_i=1)}{\partial \hat{\beta}_j} - (1-d_i) \frac{1}{1-Pr(d_i=1)} \frac{\partial Pr(d_i=1)}{\partial \hat{\beta}_j} = 0,$$

where the dependence of the probability on \(X\) and \(\hat{\beta}\) has been suppressed for clarity. As long as the density associated with \(F\) is strictly positive (and \(x_{ij} \neq 0\)), \(\frac{\partial Pr(d_i=1)}{\partial \hat{\beta}_j}\neq 0\) and can be cancelled out, which results in the following simplified equation:

$$ d_i \frac{1}{Pr(d_i=1)} - (1-d_i) \frac{1}{1-Pr(d_i=1)} = 0.$$

Plugging in the definition of the choice probability gives:

$$ d_i \frac{1}{1-F(-X_i\hat{\beta})} - (1-d_i) \frac{1}{F(-X_i\hat{\beta})} = 0.$$

This equation is the basis of the GMM estimator; we will look for a parameter that minimizes a function of the above moment, which equals zero at (only) the truth. Identification follows by inspection: under the data-generating process, \(E[d_i|X_i] = 1-F(-X_i\beta)\). Since \(F(\cdot)\) is a monotone function, the first-order condition holds if and only if the parameter equals the truth, \(\hat{\beta}=\beta\).

For reasons that will become clear shortly, we rearrange the above score function in two ways.1 The first way we will denote by \(m_{1i}\):

$$m_{1i}(\hat{\beta}) = d_i \frac{F(-X_i\hat{\beta})}{1-F(-X_i\hat{\beta})} - (1-d_i) = 0,$$

and the second by \(m_{2i}\):

$$m_{2i}(\hat{\beta}) = -d_i + (1-d_i) \frac{1-F(-X_i\hat{\beta})}{F(-X_i\hat{\beta})} = 0.$$

We convert these individual score functions into moments by taking expectations:

$$m_{1i}(\hat{\beta}) = E_{d_i | X_i } \left[ d_i \frac{F(-X_i\hat{\beta})}{1-F(-X_i\hat{\beta})} - (1-d_i) \right] = 0.$$

The expectation here is with respect to the individual’s unknown preference shock. To obtain aggregate moments that we will take to the sample, we further take expectations with respect to the distribution of agents:

$$m_{1}(\hat{\beta}) = E_{X_i} \left[ E_{d_i | X_i} \left[ d_i \frac{F(-X_i\hat{\beta})}{1-F(-X_i\hat{\beta})} - (1-d_i) \right] \right] = 0.$$

To parse this expression, consider it in two parts. In plain English, the innermost expectation says that at the truth, the average value of \(d_i\) conditional on a specific value of \(X_i\) makes the moment equal to zero. Because that conditional restriction holds at every possible value of \(X\), taking the outermost expectation over \(X\) preserves the equality.

This is an important observation, because it says that the model has to hold for every possible combination of inputs we can feed it. However, we typically (and often mindlessly) throw away useful statistical information by using much simpler unconditional moments. In finite samples, the unconditional moment is replaced by the following sample analog:

$$\frac{1}{N} \sum_{i=1}^N \left[ d_i \frac{F(-X_i\hat{\beta})}{1-F(-X_i\hat{\beta})} - (1-d_i) \right]=0.$$

This sum has the appealing property of being simple and easy to understand, but note that we have thrown out information. To see this, consider that the sum is conceptually composed of two different sums that are the direct analogs of the expectations above: an inner sum over all observations with a given value of \(X_i\) and an outer sum over all possible values of \(X_i\). Denoting the set of values that \(X_i\) takes by \(\mathcal{X}\) and the number of observations with \(X_i=x\) by \(N(x)\), that expanded sum is:

$$ \sum_{x \in \mathcal{X}} \frac{N(x)}{N} \left[ \frac{1}{N(x)} \sum_{i:\,X_i = x} \left( d_i \frac{F(-X_i\hat{\beta})}{1-F(-X_i\hat{\beta})} - (1-d_i) \right) \right]=0.$$

The term in brackets is exactly the average moment conditional on a value of \(X_i\). The critical insight is that, under the theoretical data-generating process we have imposed on the economic environment, the conditional moment has to hold for every value of \(X_i\), whereas the sum above imposes only that it hold on average. One could do better by incorporating the level of \(X_i\) into the moments in some fashion. For example, one could split the sample in half in some systematic way and form separate moments using each subsample. The downside is that we have fewer observations per moment, but we are closer to imposing the full extent of the theoretical restrictions. As we introduce the DM model below, we’ll see that being able to impose these theoretical restrictions matters far more in that setting than in typical moment equality settings.

GMM Estimator

To form an estimator that we can take to the data, we typically interact the moments with some function of the covariates. With our two-covariate, two-parameter utility function, two possible moments are:

$$E[X_1 m_1]=0,$$

and

$$E[X_2 m_1]=0.$$

Stacking the moments into \(G(\hat{\beta})\) and minimizing a quadratic form, such as \(G(\hat{\beta})'WG(\hat{\beta})\), where \( W \) is a positive semi-definite weighting matrix (such as the inverse of the covariance of the moments), with respect to \(\hat{\beta}\) leads to our GMM estimate.
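
To make this concrete, here is a minimal sketch (in Python, separate from the DM package) of the moment-equality version of the estimator for a probit-style model with two covariates. The simulated data-generating process, the function names, and the identity weighting matrix are all illustrative assumptions, not part of DM.

```python
# Minimal sketch: GMM estimation of a two-covariate probit-style model using the
# rearranged score moment m1 interacted with each covariate. Illustrative only.
import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize

rng = np.random.default_rng(0)

def simulate_data(n=5000, beta=(1.0, -0.5)):
    """Simulate choices from the canonical model with standard normal shocks."""
    X = rng.normal(size=(n, 2))
    eps = rng.normal(size=n)                      # preference shock, F = standard normal CDF
    d = (X @ np.array(beta) + eps > 0).astype(float)
    return X, d

def m1(beta_hat, X, d):
    """Individual moment: m1_i = d_i * F(-X_i b)/(1 - F(-X_i b)) - (1 - d_i)."""
    F = norm.cdf(-X @ beta_hat)
    return d * F / (1.0 - F) - (1.0 - d)

def gmm_objective(beta_hat, X, d, W=None):
    """Stack E[X_1 * m1] and E[X_2 * m1], then form the quadratic form G' W G."""
    G = (X * m1(beta_hat, X, d)[:, None]).mean(axis=0)
    W = np.eye(G.size) if W is None else W
    return G @ W @ G

X, d = simulate_data()
result = minimize(gmm_objective, x0=np.zeros(2), args=(X, d), method="Nelder-Mead")
print(result.x)    # should land near the true (1.0, -0.5)
```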

Introducing Expectations

DM depart from the canonical discrete choice model by introducing agent expectations about one (or more) of the covariates in the utility function. What exactly does this mean? It means that some of the \(X\) entering the agent’s utility function at the time the decision is made may be perceived differently from what the econometrician observes ex post. To fix ideas, suppose that agents have heterogeneous beliefs about only \(x_1\). We denote the covariate the econometrician observes by the usual \(x_i\), while the agent perceives that covariate to be \(\tilde{x}_i\). We relate the two quantities by:

$$x_i = \tilde{x}_i + \nu_i,$$

where \(\nu_i\) is the agent’s expectational error: the part of the realized covariate that the agent did not anticipate.

While individual agents are allowed to hold any beliefs they want about \(x_1\), DM impose a rational expectations requirement: agents’ expectational errors must be mean zero given their information, so that agents predict \(x_1\) correctly on average. Stated another way, the distribution of \(\nu\) is not specified beyond the requirement that it be mean zero conditional on what the agent knows. Agents may also have information that they use to form expectations about \(x_1\); some of that information may be observable to the econometrician, while some may be private. DM propose a method for both estimating utility parameters in the presence of unknown individual beliefs and testing the information sets of agents.

To see how this works, first consider the moment function above. In the canonical model, the covariates inside \(F\) are the ones the econometrician observes. Now the agent’s decision is driven by covariates the econometrician does not see. Previously, we integrated out the preference shocks when forming expectations of our moments; now we must also deal with each agent’s private beliefs. Denoting the vector of covariates the agent uses to make decisions by \(\tilde{X}_i\) (so that the observed covariates are \(X_i = \tilde{X}_i + \nu_i\)), the finite-sample score condition, written in terms of the agent’s covariates, is:

$$\frac{1}{n} \sum_{i=1}^n \left[ d_i \frac{F(-\tilde{X}_i\hat{\beta})}{1-F(-\tilde{X}_i\hat{\beta})} - (1-d_i) \right]=0 .$$

The trouble is that this moment is infeasible: evaluating it requires knowing each agent’s beliefs. If we instead plug in the observed \(X_i\), we introduce the expectational error \(\nu_i\), which the econometrician has to integrate out. This is a conceptually difficult problem, as a.) we do not want to impose any particular distribution on the beliefs, and b.) we often do not have any idea what the beliefs of individual agents might have been. This is where the first genius insight of DM comes in.

If the distribution of the preference shocks is log-concave (which includes all the usual suspects, such as normal, uniform, and type I extreme value), then the odds ratios appearing in \(m_1\) and \(m_2\), such as \( \frac{F(-t)}{1-F(-t)} \), are convex functions of the index \(t\). Jensen’s inequality then tells us that the expectation of a convex function of a random variable is at least the convex function evaluated at the variable’s expectation. Evaluating the odds at the observed \(X_i = \tilde{X}_i + \nu_i\) and averaging over the expectational error (which is mean zero given the agent’s information) gives:

$$ E_{\nu} \left[ \frac{F(-X_i\hat{\beta})}{1-F(-X_i\hat{\beta})} \right] \geq \frac{F(-\tilde{X}_i\hat{\beta})}{1-F(-\tilde{X}_i\hat{\beta})}. $$

Why is this helpful? Because, whatever the distribution of \(\nu\) may be, swapping the unobserved \(\tilde{X}_i\) for the observed \(X_i\) can only push the term multiplying \(d_i\) upward on average, so the infeasible equality above becomes an inequality stated entirely in terms of observables:

$$\frac{1}{n} \sum_{i=1}^n \left[ d_i \frac{F(-X_i\hat{\beta})}{1-F(-X_i\hat{\beta})} - (1-d_i) \right] \geq 0.$$

Why is this? Because the only piece of the moment touched by the expectational error is the (positive) ratio multiplying \(d_i\), and averaging over the private beliefs can only blow up that positive contribution; the \(-(1-d_i)\) term is unaffected. The equality that holds with the agent’s perceived covariates therefore becomes a weak inequality when evaluated at the covariates we observe.
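
For concreteness, here is a compact version of the argument for \(m_1\), written with \(\mathcal{J}_i\) as shorthand for the agent’s information set (which pins down \(\tilde{X}_i\)) and assuming, as the argument requires, that the preference shock is independent of the expectational error conditional on that information:

$$
\begin{align}
E\left[ d_i \frac{F(-X_i\beta)}{1-F(-X_i\beta)} - (1-d_i) \,\middle|\, \mathcal{J}_i \right]
&= \left(1-F(-\tilde{X}_i\beta)\right) E_{\nu}\left[ \frac{F(-X_i\beta)}{1-F(-X_i\beta)} \,\middle|\, \mathcal{J}_i \right] - F(-\tilde{X}_i\beta) \\
&\geq \left(1-F(-\tilde{X}_i\beta)\right) \frac{F(-\tilde{X}_i\beta)}{1-F(-\tilde{X}_i\beta)} - F(-\tilde{X}_i\beta) = 0.
\end{align}
$$

Averaging over agents preserves the sign, which is exactly the sample inequality displayed above.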

There is simultaneously a subtlety and tremendous power here: we have accounted for the presence of private expectations without ever having to solve for their distribution. This is in contrast to the private information revolution in auctions, for example, where equilibrium conditions for optimal play are imposed on agent behavior in order to infer what their private information had to have been. Here, we never solve for the private information, and yet the inequality is consistent with a model where agents may perceive the values of the covariates to be different from what the econometrician observes. It is an amazing feat that we can recover consistent estimates of \(\beta\) in such an environment.

The downside to this approach is that the restrictions on the data-generating process are now in the form of inequalities, and as such, the econometrician may lose point identification. That is, a range of parameters may all be equally good at satisfying the constraints imposed by the econometric model. The question becomes, how do we impose constraints from the underlying model in such a way that we obtain practically useful estimates? Fortunately, a large literature in econometrics looking at the properties of moment inequalities has some answers for us.

First, we need to think about all the restrictions imposed by the model. Above, I mentioned that the agent may have some sources of information related to their beliefs, and that some of those sources may be observable to the econometrician. Denote the observable set of covariates relevant to agent beliefs by \(W\). The moment inequalities should then hold conditional on those information variables. In plain English: if agents use certain covariates to optimize their behavior, then the assumption we baked into our model, that observed choices are derived from utility-maximizing behavior, implies that agents cannot make systematic mistakes with respect to those covariates. In mathematical terms, the following conditional moment restriction holds:

$$E[m_1(\hat{\beta}) | X, W] \geq 0.$$

While \(W\) does not directly enter the utility function of the agents, under the imposed assumption of utility-maximizing behavior it must be the case that optimization errors are not systematic functions of \(W\); otherwise the agent would have conditioned on that information and changed their behavior. Fundamentally, this is the same insight as in the linear instrumental variables model: while the instruments do not enter the linear equation directly, they cannot be systematically related to the errors.

The case where this is easiest to see is the one in which the econometrician knows the exact expectations of the agent, so that \(W = \tilde{x}\). While we still use the observed \(x\) in forming the moment inequalities, the expectational errors must then be mean independent of the true expectations, \(W\).

One of the issues that practitioners have to confront when using the conditional moment inequality above is that the conditional moment is generally computationally infeasible. If the conditioning variables are continuous, one has to figure out how to convert conditional moments into unconditional moments. Fortunately, Andrews and Shi (2014, J. Econometrics) have a solution.

From Conditional to Unconditional Moments

Andrews and Shi suggest using a mechanism composed of nested hypercubes to convert conditional moments into unconditional moments. Essentially, they suggest transforming the conditional moment restriction into unconditional moments of the form:

$$E[g(X,W) m_1 (\hat{\beta}) ] \geq 0.$$

We interact the moments with functions of the conditioning variables. The intuition is clear: the idea behind conditioning is to evaluate the moment at a given value of the conditioning variables, \((X=x,W=w)\). When the conditioning variables are continuous, there are an infinite number of moments that one can construct this way. So we need to aggregate observations in a way that preserves their information while balancing practical considerations of sample size and computational power. There are many ways to achieve efficiency asymptotically, but there are also important practical considerations in finite samples. Andrews and Shi suggest two approaches.

Fully Interacted Hypercubes

The first approach begins by cutting each \(z \in Z\) into disjoint subintervals; we then assign each observation to a single moment by fully interacting all of the subintervals of \(Z\).

For example, suppose that we have two continuous variables and a single fixed effect in \(Z\). We begin by cutting each of the \(z \in Z\) into subintervals. One possible cut would be to partition above and below the median for each \(z\). For the fixed effect, we separate observations by whether the fixed effect is zero or one. We then fully interact these three conditions, generating moments that look like this:

$$
g(Z) = \begin{cases}
1(z_1 < \text{median}(z_1))\,1(z_2 < \text{median}(z_2))\,1(fe_1 = 0) \\
1(z_1 < \text{median}(z_1))\,1(z_2 < \text{median}(z_2))\,1(fe_1 = 1) \\
1(z_1 \geq \text{median}(z_1))\,1(z_2 < \text{median}(z_2))\,1(fe_1 = 0) \\
1(z_1 \geq \text{median}(z_1))\,1(z_2 < \text{median}(z_2))\,1(fe_1 = 1) \\
1(z_1 < \text{median}(z_1))\,1(z_2 \geq \text{median}(z_2))\,1(fe_1 = 0) \\
1(z_1 < \text{median}(z_1))\,1(z_2 \geq \text{median}(z_2))\,1(fe_1 = 1) \\
1(z_1 \geq \text{median}(z_1))\,1(z_2 \geq \text{median}(z_2))\,1(fe_1 = 0) \\
1(z_1 \geq \text{median}(z_1))\,1(z_2 \geq \text{median}(z_2))\,1(fe_1 = 1)
\end{cases}
$$
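
As a rough illustration of how these instrument functions might be coded up (a sketch under my own naming conventions, not the DM or Andrews and Shi implementation), the following builds the eight fully interacted indicators for two continuous instruments and a binary fixed effect and averages an inequality moment within each cell:

```python
# Sketch: fully interacted "hypercube" instrument functions g(Z) for two continuous
# instruments split at their medians and one binary fixed effect (eight cells total).
import numpy as np
from itertools import product

def hypercube_instruments(z1, z2, fe):
    """Return an (n, 8) matrix of indicators; each observation falls in exactly one cell."""
    cuts1 = [z1 < np.median(z1), z1 >= np.median(z1)]
    cuts2 = [z2 < np.median(z2), z2 >= np.median(z2)]
    cutsf = [fe == 0, fe == 1]
    cells = [(a & b & c).astype(float) for a, b, c in product(cuts1, cuts2, cutsf)]
    return np.column_stack(cells)

def interacted_moments(m, gZ):
    """Sample analogs of E[g_k(Z) * m_i] >= 0, one entry per instrument column."""
    return (gZ * m[:, None]).mean(axis=0)
```

Each observation belongs to exactly one cell, which is what makes the fully interacted moments orthogonal to one another, a point that matters below.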

Pairwise Rectangles

The issue with fully-interacted hypercubes is the curse of dimensionality: interacting all subsets of all covariates generates a number of moments that grows extremely rapidly in the dimensionality of the covariates and the number of partitions. Andrews and Shi (2014) suggest a different approach that uses all pairwise interactions of covariates in this case. One produces the \(g(z)\) function in the following fashion:

$$
g_{ij}(Z) = 1(\underline{z_i} \leq z_i < \bar{z_i},\ \underline{z_j} \leq z_j < \bar{z_j}), \quad \forall\, i \in \{1,\dots,K\},\ j \in \{i+1,\dots,K\},
$$
with an analogous set of moments for fixed effects (i.e. either on or off instead of partitions). The number of these moments obviously grows much more slowly than the fully interacted moments, but the approach has two drawbacks. The first is that using only the pairwise interactions loses statistical information, which can be very important in settings like DM where only some of the \(Z\) are informative about the identified set. The second, which is less immediately obvious, is that the moments become highly correlated. By construction, the fully interacted moments are orthogonal to each other, as each observation belongs to only one \(g(Z)\). With pairwise moments, however, each observation belongs to several \(g(Z)\). This causes issues when the covariance matrix of the moments is used in estimation. For example, in standard GMM with equalities the optimal weighting matrix is the inverse of the covariance matrix of the moments; when the correlation becomes high enough, this inversion is no longer numerically stable (or the inverse may not even exist). That, in turn, causes problems for estimation and inference. As a result, an objective function like the MMM criterion defined below is highly recommended (or required) when using pairwise instrumenting functions to convert conditional moments into unconditional moments.
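
Here is an analogous sketch (again my own illustrative construction, not the packaged implementation) of the pairwise approach, using quartile bins for each pair of continuous instruments:

```python
# Sketch: pairwise "rectangle" instruments. For each pair (i, j) of continuous
# instruments, interact indicator bins (here: quartiles) of z_i and z_j.
import numpy as np
from itertools import combinations

def pairwise_instruments(Z, n_bins=4):
    """Z is (n, K). Returns indicator columns 1(z_i in bin_a, z_j in bin_b) for i < j."""
    n, K = Z.shape
    # Precompute bin membership indicators for every instrument.
    bins = []
    for k in range(K):
        edges = np.quantile(Z[:, k], np.linspace(0, 1, n_bins + 1))
        # np.digitize against the interior edges assigns each observation to one bin.
        idx = np.clip(np.digitize(Z[:, k], edges[1:-1]), 0, n_bins - 1)
        bins.append([(idx == b).astype(float) for b in range(n_bins)])
    cols = []
    for i, j in combinations(range(K), 2):
        for a in bins[i]:
            for b in bins[j]:
                cols.append(a * b)
    return np.column_stack(cols)
```

Note that every observation now appears in one cell for each pair \((i,j)\), which is the source of the correlation across moments discussed above.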

Inference with Moment Inequalities

A large econometric literature has sought to answer the question of how to perform inference in settings with moment inequalities. I focus on Andrews and Soares (ECMA, 2010), although there is a substantial literature following them that refines their procedures in various ways (e.g. Bugni, Canay, and Shi (J.Econometrics, 2015)). First, define the test statistic as:
$$
T_n(\theta) = S(n^{1/2} \bar{m}_n(\theta) ,\Sigma_n(\theta)),
$$
where \(S\) is a function that takes the (scaled) moments, \( \bar{m}_n(\theta) \), and their covariances, \( \Sigma_n(\theta) \), as arguments. Two common choices of \(S\) are the quasi-likelihood ratio (QLR) and the modified method of moments (MMM); I will focus on the MMM for reasons explained below.

The MMM \(S\) function is:
$$
S_{MMM}(m,\Sigma) = \sum_{j=1}^p [m_j / \sigma_j]_{-}^2 + \sum_{j=p+1}^{J} (m_j / \sigma_j)^2,
$$
where \(J\) is the total number of moments, and we have ordered them such that the first \(p\) are inequalities and the remaining \(J - p\) are equalities. Note that this looks a lot like the standard GMM objective function, save for two differences. First, we use only the variances of the moments, not their covariances. Second, the inequalities are penalized only when they are negative.
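
A direct transcription of \(S_{MMM}\) into code might look like the following sketch (the function name and argument ordering are my own):

```python
# Sketch of the modified method of moments (MMM) criterion: inequality moments
# (the first p, ordered first) are penalized only when negative; equality moments
# are penalized in both directions. Only the standard deviations are used.
import numpy as np

def s_mmm(m, sigma, p):
    """m: vector of J moments, sigma: their standard deviations, p: number of inequalities."""
    t = m / sigma
    ineq = np.minimum(t[:p], 0.0) ** 2      # [x]_-^2: only the negative part counts
    eq = t[p:] ** 2
    return ineq.sum() + eq.sum()
```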

Andrews and Soares propose Generalized Moment Selection (GMS) as a method for producing critical values for hypothesis tests using the statistic defined above. The 10,000-foot motivation is that we do not know the limiting distribution of the test statistic, so we cannot simply read the appropriate critical value off a lookup table when performing tests. With moment inequalities, the limiting distribution of the test statistic depends on the slackness of the inequalities: when the inequalities are far from binding (i.e. slack), they should not contribute much (or anything) to the limiting distribution.

As Andrews and Soares discuss, it is useful to think about a worst-case scenario in which all the moment inequalities are valid but are exactly zero in the limit. One can construct a critical value for \(T_n(\theta)\) by taking the \(1-\alpha\) percentile of the distribution of \(T_n(\theta)\) when all the inequalities are exactly zero. This procedure generates the correct asymptotic size but has poor power, because the worst-case critical value is relatively big. Relatively big compared to what? Compared to a critical value that incorporates empirical information about just how slack each moment inequality is. When the number of inequalities is large, and only one of them is near or at zero while the rest are far from binding, the worst-case critical value is going to be huge compared to one that accounts for all of the inequalities being slack except for one. The question is: which moment inequalities are slack, and how do we incorporate that information appropriately into our econometric methodology? Andrews and Soares provide an answer.

The basic idea of the GMS procedure is to use the sample to construct measures of slackness on an inequality-by-inequality basis. The measure of slackness is a sample-based estimate of how far each inequality is from binding; that is, how far above zero it is. We then add that measure to the simulated moments when generating the distribution used to compute critical values. There are some technical details that Andrews and Soares explain comprehensively, but that is the main idea.

How does this work in practice? First, let us rearrange the test statistic by doing the matrix equivalent of normalizing by the standard deviation:

$$
T_n(\theta) = S(\hat{D}^{-1/2}_n(\theta) n^{1/2} \bar{m}_n(\theta) , \hat{\Omega}_n(\theta)),
$$

where \( \hat{D}_n(\theta) = Diag(\hat{\Sigma}_n(\theta)) \) and \( \hat{\Omega}_n(\theta) \) is the correlation matrix of the moments, defined by \( \hat{\Omega}_n(\theta) = \hat{D}^{-1/2}_n(\theta) \hat{\Sigma}_n(\theta) \hat{D}^{-1/2}_n(\theta) \). To see that this is exactly the same function as before, note that we have simply shifted the division of each moment by its standard deviation into the first argument, so that the second argument now has a unit diagonal (which is all that MMM uses). Passing this through the definition of MMM above results in exactly the same expression. So why do it? Because it lends itself to a convenient observation: we can obtain the asymptotic limiting distribution of the normalized test statistic via:

$$
S(\Omega^{1/2} Z^* + (h_1, 0_{J-p}) , \Omega),
$$
where \( Z^* \) is a vector of standard normal random variables and \( h_1 \) is an appropriately scaled \( p \times 1 \) vector containing the inequality-by-inequality measure of slackness. Note that computing this distribution is extremely fast: one only has to evaluate the test statistic function for each draw of a standard normal vector. In practice, I have found that I can take 10,000 such draws and evaluate the test statistic in near real-time. The only numerical nastiness is computing the square root of the correlation matrix. We can avoid doing so by using the bootstrap version, at the expense of computation time.

So where does the measure of slackness come from? First, define the following:
$$
\xi_n(\theta) = \kappa_n^{-1} n^{1/2} \hat{D}^{-1/2}_n(\theta) \bar{m}_n(\theta).
$$
This looks somewhat complicated at first, but it is straightforward when broken into its constituent pieces. The first part, \( \kappa_n \), is a divergent sequence, such as the Bayesian information criterion choice:
$$
\kappa_n = (\ln n)^{1/2}.
$$
The idea here is to ensure that moments that are away from zero are given an amplified measure of slackness as the sample size grows, but at a rate less than square root. The diagonal matrix is simply the normalization done above applied to the moments.

We then replace \( h_1 \) with a function of the measure of slackness and the correlation matrix of the moments, \( \psi ( \xi_n(\theta), \hat{\Omega}_n(\theta) ) \). Andrews and Soares propose several variations of \( \psi \), but all of them have the same idea: we inflate, in the limiting distribution, the moment inequalities that look far away from zero. When an element of \( \xi \) is very large, the corresponding inequality tends to be ignored in the limiting distribution. In fact, Andrews and Soares propose forms of \( \psi \) that literally set the moment to positive infinity when it is slack enough, guaranteeing that it has no influence on the resulting distribution. Here are several of the proposed functions, which operate moment by moment:

$$
\begin{align}
\psi_j^{(1)}(\xi,\Omega) & = \begin{cases} \infty & \text{if } \xi_j \geq 1 \\ 0 & \text{otherwise} \end{cases} \\
\psi_j^{(3)}(\xi,\Omega) & = \begin{cases} \xi_j & \text{if } \xi_j \geq 0 \\ 0 & \text{otherwise} \end{cases} \\
\psi_j^{(4)}(\xi,\Omega) & = \xi_j
\end{align}
$$

Putting it all together, we can compute the critical value by taking the \( 1-\alpha \) percentile of the distribution of:

$$
S(\hat{\Omega}_n^{1/2} Z^* + \psi(\xi_n(\theta), \hat{\Omega}_n(\theta) ) , \hat{\Omega}_n(\theta)).
$$
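
Putting the pieces together, here is a sketch of the GMS critical value computation for the MMM statistic, using the hard-thresholding selection function \(\psi^{(1)}\) and a Cholesky factor as the matrix square root. It assumes all \(J\) moments are inequalities and reuses the s_mmm sketch from above; the helper names and the small jitter term are my own.

```python
# Sketch: simulate the GMS critical value for the MMM statistic at a candidate theta,
# using the hard-thresholding selection function psi^(1). Assumes p = J inequalities.
import numpy as np

def gms_critical_value(moments, alpha=0.05, n_draws=10_000, seed=0):
    """moments: (n, J) matrix of individual moment contributions at a candidate theta."""
    rng = np.random.default_rng(seed)
    n, J = moments.shape
    m_bar = moments.mean(axis=0)
    Sigma = np.cov(moments, rowvar=False)
    sd = np.sqrt(np.diag(Sigma))
    Omega = Sigma / np.outer(sd, sd)                   # correlation matrix of the moments
    kappa = np.sqrt(np.log(n))                         # BIC-type divergent sequence
    xi = np.sqrt(n) * m_bar / (kappa * sd)             # slackness measure xi_n(theta)
    psi = np.where(xi >= 1.0, np.inf, 0.0)             # psi^(1): drop very slack moments
    L = np.linalg.cholesky(Omega + 1e-10 * np.eye(J))  # "square root" of Omega
    stats = np.empty(n_draws)
    for b in range(n_draws):
        z = L @ rng.standard_normal(J)
        stats[b] = s_mmm(z + psi, np.ones(J), p=J)     # normalized moments: sigma = 1
    return np.quantile(stats, 1 - alpha)
```

At a candidate \(\theta\), one would then compare the sample statistic \(S(n^{1/2}\bar{m}_n(\theta), \hat{\Sigma}_n(\theta))\) to this critical value and keep \(\theta\) in the confidence set whenever the statistic does not exceed it.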

We use the critical value as the threshold in specification tests. By finding all the parameters for which we fail to reject the null hypothesis that the moment (in)equalities hold, we generate the confidence interval/set.2 I have found the Laplace-type Estimator (LTE) of Chernozhukov and Hong (2003) [link] to be an excellent way to produce candidate sets of parameters to test.

The Max Statistic

Canay, Illanes, and Velez (2023, link) suggest following Chernozhukov, Chetverikov, and Kato (2019, link) and using a “max” statistic when the number of inequalities is large:

$$
S(\cdot, \cdot) = \max_{1 \leq j \leq p} (-1) \frac{m_j}{\sigma_j},
$$

where the statistic is formed by taking only the worst violation.3 The reason to favor a statistic like this is that its critical value grows relatively slowly in the number of inequalities, \(p\). When the statistic is negative, none of the inequalities are violated. In practice, I have found that this statistic has dramatically better coverage rates than other approaches, but only when used with the bootstrap.
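
The following sketch pairs the studentized max statistic with a simple recentered nonparametric bootstrap critical value. It is a simplified version of the procedures in Chernozhukov, Chetverikov, and Kato and in Canay, Illanes, and Velez (in particular, it omits any moment selection step), and the function names are my own.

```python
# Sketch: "max" statistic over studentized inequality moments, with a recentered
# nonparametric bootstrap critical value. Simplified; no moment selection step.
import numpy as np

def max_statistic(moments):
    """moments: (n, J) individual contributions to inequalities E[m_j] >= 0."""
    n = moments.shape[0]
    m_bar = moments.mean(axis=0)
    sd = moments.std(axis=0, ddof=1)
    return np.max(-np.sqrt(n) * m_bar / sd)

def bootstrap_critical_value(moments, alpha=0.05, n_boot=999, seed=0):
    rng = np.random.default_rng(seed)
    n = moments.shape[0]
    m_bar = moments.mean(axis=0)
    stats = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)
        mb = moments[idx]
        # Recenter at the sample mean so the bootstrap mimics the null distribution.
        stats[b] = np.max(-np.sqrt(n) * (mb.mean(axis=0) - m_bar) / mb.std(axis=0, ddof=1))
    return np.quantile(stats, 1 - alpha)

# Reject the candidate theta if max_statistic(moments) > bootstrap_critical_value(moments).
```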

Practical Considerations

All of the theory above is well and good, but when it is time to actually estimate a model, there are some additional considerations that need to be addressed.

Choice of the Instrumenting Function

First, the choice of the \(g(Z)\) function is very important. In ongoing work with Michael Greenstone, Mike Greenberg, and Mike Yankovich (updated copy with DM results hopefully coming soon), we estimate discrete choice models for first-time reenlistees in the US Army. We use the DM approach to account for the fact that soldiers may have varying expectations about the future mortality hazards they face when making their reenlistment decision (the goal of the paper is to understand how they trade off increases in that mortality risk against the financial bonuses they may receive for reenlisting). The choice of \(g(Z)\) in this setting is critical. We have lots of covariates, so using the fully interacted instrumenting functions described above is infeasible (I accidentally turned this option on when handing off to my coauthor to run it, and we crashed the computer because we were generating over a million moments). Using the pairwise instrumenting functions cuts down on that, but has the additional problem of generating substantial correlation in the moments, which makes using the QLR function infeasible and greatly reduces the statistical information in the moment function. Using the MMM function fixes the first problem, but the issue remains that the inequalities are largely uninformative. A quick peek at the first moment inequality tells us why:

$$\frac{1}{n} \sum_{i=1}^n \left[ d_i \frac{F(-X_i\hat{\beta})}{1-F(-X_i\hat{\beta})} - (1-d_i) \right] \geq 0.$$

Note that as the expectation error gets larger, the ratio \(F/(1-F)\) evaluated at the observed covariates becomes arbitrarily large. This means that, holding all else equal, the inequality is easier to satisfy. The only time the inequality will not be satisfied is when the \(-(1-d_i)\) component outweighs the ratio. That only happens in the parts of the \(X\) space where the choice is not made an obvious yes or no by extremely large or small utilities. Identification obtains in an intermediate range of the covariates where the model tries to match the rate of choices. The issue is that the econometrician does not know a priori where that part of the \(X\) space is going to be. The takeaway is that one has to be very careful about how the instrumenting functions are defined and used.

Rejection and Coverage Rates

The second issue is that rejection and coverage rates in your sample, with your estimator, may deviate from those promised by the theory. There are a large number of implementation decisions to make when taking the DM estimator to data, and all of them can, in principle, influence the estimates and their inference. A careful Monte Carlo study can give some guidance about how well you are doing in your specific situation.

Using the max function above as your test statistic, in conjunction with the GMS procedure from Andrews and Soares, has produced excellent coverage rates in my experience when the critical value is generated using a bootstrap. The asymptotic versions discussed in Canay et al. generally produce critical values that are too small in moderately large samples (up to 100,000 observations) in the DM setting when there are more than a few covariates. The bootstrap version does a much better job, in my experience, of matching the distribution under the null.

Java/Stata Github Package

A Java implementation of the DM estimator, along with its Stata integration, can be found on Github here: https://github.com/cactus911/dmDiscreteChoice.

  1. Thank you to Jihye Jeon for pointing out that I had swapped the fraction around in the first draft of this page. ↩︎
  2. With apologies to my econometrics professors in graduate school, I don’t recall ever hearing about inverting tests before reading up on the moment inequality literature (and casual investigation of my applied peers suggests that this is not a common topic in other PhD programs either, since no one I have asked outside of econometrics is aware of it). So, a quick digression on specification tests and confidence sets: a specification test evaluates whether the model is internally consistent at a point; that is, whether all the inequalities and equalities are valid at a given parameter value. In GMM, when you have an overidentified model, you can use Hansen’s J-test to evaluate the model. Take \(Q(\theta) = nG(\theta)'W(\theta)G(\theta)\), where \(W(\theta)\) is the optimal weighting matrix (the inverse of the covariance matrix of the moments), and compare it to the tail value of the chi-squared distribution with degrees of freedom equal to the number of moments minus the number of parameters. If \(Q\) is larger than this value, we reject the null hypothesis that the moments are close (enough) to zero. If we fail to reject, we can then test many parameters in a neighborhood of the minimizer. All the parameter values at which we fail to reject the null form our confidence interval/set. ↩︎
  3. In their notation, inequalities are supposed to be negative, hence the (-1) in this equation. ↩︎