A primer and user’s guide to the discrete choice inequality model proposed in Dickstein and Morales (*QJE*, 2018, “What Do Exporters Know?”).

## Dickstein and Morales: What is it and why should you care?

Dickstein and Morales (*QJE*, 2018), hereafter DM, propose a new estimator for discrete choice models in which agents form expectations about some of the covariates that enter their utility or profit functions. They derive moment inequalities that are consistent with the underlying model while allowing for *arbitrary* distributions of expectations, subject to the restriction that agents predict covariates correctly on average (i.e., rational expectations). Their application studies the decision of firms to export or not: each firm predicts the profitability of exporting when deciding whether to do so, and it can hold arbitrary individual expectations about some of the covariates in the profit function (such as the anticipated revenue from sending products to a given country), as long as firms predict those covariates correctly on average.

The econometrician does not have to specify or estimate the distribution of expectations. The cost of this flexibility is that the parameters may only be set identified. The econometrician also needs access to high-quality instruments that are correlated with the underlying agent beliefs. DM show in both Monte Carlo simulations and their application that the estimator is quite powerful and can provide meaningful estimates of the underlying utility/profit function.

Finally, the estimator can be used to test the information sets of agents. The basic idea is to use a specification test adapted to the setting of moment inequalities, such as that from Andrews and Soares (*ECMA*, 2010), to assess whether or not the statistical model is consistent with the assumption that agents took a variable into account when making the discrete choice. Under the null hypothesis, violations of the inequalities are orthogonal to the covariates, and specification tests provide a formal mechanism for testing that hypothesis.

## Model

#### The Canonical Discrete Choice Model

We will consider a binary discrete choice setting. The agent can choose between a single inside good and an outside option. The agent receives utility from each good; as only the difference in utility between the two choices matters, we normalize the utility of the outside good to zero. We let the inside good have the following utility function:

$$u = x_{1} \beta_1 + x_{2} \beta_2 + \epsilon,$$

where \(x\) are observable covariates and \(\beta\) are marginal utilities, which are the parameters of interest in this model. To rationalize that agents sometimes appear to choose “dominated” options due to private preference shocks, we allow the utility of the inside good to also depend on an additive error, \(\epsilon\); the distribution function of shocks is given by \(F(\epsilon)\). To emphasize, this agent-choice-specific shock is observed by the agent but not the econometrician.

The agent chooses the inside option, which we record as \(d=1\), if and only if it maximizes their utility:

$$d = 1(X\beta + \epsilon > 0).$$

The probability of this event is:

$$Pr(d=1|X,\beta) = Pr(X\beta + \epsilon > 0) = Pr(\epsilon > -X\beta) = 1- F(-X\beta).$$

This theoretical choice probability forms the basis for a maximum likelihood or GMM estimation procedure. For example, for a guess of the utility parameters, the log-likelihood for a sample of \(n\) agents is:

$$LLH(\hat{\beta}) = \sum_{i=1}^n d_i \ln Pr(d_i = 1 | X_i,\hat{\beta}) + (1-d_i) \ln (1-Pr(d_i = 1 | X_i,\hat{\beta})).$$
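As a minimal sketch of the likelihood above, the following Python snippet assumes that \(F\) is the standard logistic distribution. That choice is ours, made only for illustration; the text leaves \(F\) general. The function names `logistic_cdf`, `choice_prob`, and `log_likelihood` and the three-agent sample are hypothetical.

```python
import math

def logistic_cdf(t):
    """Standard logistic CDF, used here as an assumed choice of F."""
    return 1.0 / (1.0 + math.exp(-t))

def choice_prob(x, beta):
    """Pr(d=1 | x, beta) = 1 - F(-x'beta)."""
    index = sum(xk * bk for xk, bk in zip(x, beta))
    return 1.0 - logistic_cdf(-index)

def log_likelihood(d, X, beta):
    """Sum of d_i ln P_i + (1 - d_i) ln(1 - P_i) over the sample."""
    total = 0.0
    for d_i, x_i in zip(d, X):
        p = choice_prob(x_i, beta)
        total += d_i * math.log(p) + (1 - d_i) * math.log(1.0 - p)
    return total

# Tiny hypothetical sample: three agents, two covariates each.
X = [(1.0, 0.5), (0.2, -1.0), (-0.5, 0.3)]
d = [1, 0, 1]
print(log_likelihood(d, X, beta=(0.0, 0.0)))  # equals 3 * ln(0.5)
```

At \(\hat{\beta} = 0\) every choice probability is one half, so the log-likelihood collapses to \(n \ln 0.5\), which makes a convenient sanity check.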

We denote the argument in the LLH with a hat to emphasize that it is a parameter. Taking derivatives with respect to \(\hat{\beta}\), we obtain scores of the log-likelihood:

$$\sum_{i=1}^n d_i \frac{1}{Pr(d_i=1)} \frac{\partial Pr(d_i=1)}{\partial \hat{\beta}} - (1-d_i) \frac{1}{1-Pr(d_i=1)} \frac{\partial Pr(d_i=1)}{\partial \hat{\beta}} = 0,$$

where the dependence of the probability on \(X\) and \(\hat{\beta}\) has been suppressed for clarity. As long as \(F\) has a strictly positive density, \(\frac{\partial Pr(d_i=1)}{\partial \hat{\beta}}\neq 0\) and can be dropped. Dropping it amounts to reweighting the observations, and the resulting equation still has mean zero at the true parameter because \(E[d_i | X_i] = Pr(d_i=1)\). That results in the following simplified equation:

$$\sum_{i=1}^n d_i \frac{1}{Pr(d_i=1)} - (1-d_i) \frac{1}{1-Pr(d_i=1)} = 0.$$

Plugging in the definition of the choice probability gives:

$$\sum_{i=1}^n d_i \frac{1}{1-F(-X_i\hat{\beta})} - (1-d_i) \frac{1}{F(-X_i\hat{\beta})} = 0.$$

This equation is the basis of the GMM estimator; we look for the parameter that minimizes a function of the above moment, which equals zero in expectation at (only) the truth. Identification follows by inspection: under the data-generating process, \(E[d_i | X_i] = 1-F(-X_i\beta)\). If one fixes \(X_i = \bar{X}\), then the sample average of \(d_i\) converges to its theoretical counterpart, \(1-F(-\bar{X}\beta)\). Since \(F(\cdot)\) is a strictly monotone function, the first-order condition holds if and only if the parameter equals the truth, \(\hat{\beta}=\beta\).

For reasons that will become clear shortly, we rearrange the summand of the above score function in two ways. Multiplying it by \(F(-X_i\hat{\beta})\) gives the first, which we will denote by \(m_1\):

$$m_{1i}(\hat{\beta}) = d_i \frac{F(-X_i\hat{\beta})}{1-F(-X_i\hat{\beta})} - (1-d_i),$$

and multiplying it instead by \(-(1-F(-X_i\hat{\beta}))\) gives the second, \(m_2\):

$$m_{2i}(\hat{\beta}) = -d_i + (1-d_i) \frac{1-F(-X_i\hat{\beta})}{F(-X_i\hat{\beta})}.$$

A standard GMM approach would be to interact the first-order condition (either \(m_1\) or \(m_2\)) with the covariates. In our two-covariate, two-parameter utility function, two possible moments could be:

$$E[X_1 m_1]=0,$$

and

$$E[X_2 m_1]=0.$$

Note that we have taken expectations here with respect to the unknown preference shocks. These moments say that, averaging across an infinite number of draws of samples from the population, the score of the likelihood function is mean independent from levels of \(X\). Another way of stating that is that the prediction errors in our model, at the true parameter, are not functions of the covariates.

Stacking the moments into \(G(\hat{\beta})\) and minimizing a quadratic form, such as \(G(\hat{\beta})'G(\hat{\beta})\), with respect to \(\hat{\beta}\) leads to our GMM estimate.
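The GMM recipe above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the DM implementation: we assume a logistic \(F\), we write the ratio in \(m_1\) in terms of the choice probability \(P = Pr(d=1)\) as \((1-P)/P\), we simulate a hypothetical sample, and we use a coarse grid search in place of a real optimizer. All function and variable names are our own.

```python
import math
import random

def odds_against(x, beta):
    """(1 - P) / P with P = Pr(d=1 | x, beta); equals exp(-x'beta) for a logistic F."""
    return math.exp(-sum(xk * bk for xk, bk in zip(x, beta)))

def m1(d_i, x_i, beta):
    """Score-based moment d * (1 - P) / P - (1 - d); mean zero at the true beta."""
    return d_i * odds_against(x_i, beta) - (1 - d_i)

def gmm_objective(d, X, beta):
    """G(beta)'G(beta), where G stacks the sample means of x_k * m1 for each covariate k."""
    n = len(d)
    G = [sum(x_i[k] * m1(d_i, x_i, beta) for d_i, x_i in zip(d, X)) / n
         for k in range(len(beta))]
    return sum(g * g for g in G)

# Simulate a hypothetical sample from the model with a known beta.
random.seed(0)
true_beta = (0.5, -1.0)
X = [(random.uniform(-1, 1), random.uniform(-1, 1)) for _ in range(5000)]
d = [int(random.random() < 1.0 / (1.0 + odds_against(x, true_beta))) for x in X]

# A coarse grid search stands in for a proper numerical optimizer.
grid = [g / 2 for g in range(-4, 5)]  # -2.0, -1.5, ..., 2.0
beta_hat = min(((b1, b2) for b1 in grid for b2 in grid),
               key=lambda b: gmm_objective(d, X, b))
print(beta_hat, gmm_objective(d, X, beta_hat))
```

With two moments and two parameters the system is exactly identified, so the objective can in principle be driven to zero; the grid only locates the neighborhood of the minimizer.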

#### Introducing Expectations

DM depart from the canonical discrete choice model by introducing agent expectations about one (or more) of the covariates in the utility function. What exactly does this mean? It means that some of the \(X\) entering the utility function at the time the decision is made may be perceived by the agent to differ from what the econometrician observes ex post. To fix ideas, suppose that agents have heterogeneous beliefs about only \(x_1\). We denote the covariate that the econometrician observes (the realized value of \(x_1\) for agent \(i\)) as the usual \(x_i\), while the agent perceives that covariate to be \(\tilde{x}_i\). We relate the two quantities by:

$$x_i = \tilde{x}_i + \nu_i,$$

where \(\nu_i\) is the agent's expectational error: the gap between the realized covariate and what the agent expected it to be when making the choice.

While individual agents are allowed to have any beliefs that they want about \(x_1\), DM impose a rational expectations requirement: on average, the population of agents predicts \(x_1\) correctly. Stated another way, the distribution of \(\nu\) is not specified outside of the requirement that it must be mean zero. Agents may also have information that they use to form expectations about \(x_1\); some of that information may be observable to the econometrician, while some may be private. DM propose a method for both estimating utility parameters in the presence of unknown individual beliefs and testing the information sets of agents.

To see how this works, first consider the moment function \(m_1\) above, and let \(r(X_i\hat{\beta})\) denote the ratio that multiplies \(d_i\) in \(m_{1i}\), viewed as a function of the index \(X_i\hat{\beta}\). The agent makes decisions using the perceived covariates, which we collect in the vector \(\tilde{X}\), so the score holds with equality (in expectation, at the true parameter) only when it is evaluated at \(\tilde{X}\):

$$\frac{1}{n} \sum_{i=1}^n d_i r(\tilde{X}_i\hat{\beta}) - (1-d_i) = 0.$$

The econometrician does not observe \(\tilde{X}_i\), only the realized \(X_i = \tilde{X}_i + \nu_i\) (with \(\nu_i\) equal to zero for the covariates the agent knows). The introduction of beliefs therefore requires the econometrician to integrate out another level of private information. Previously, we integrated out the preference shocks when forming expectations of our moments. Now, we also need to integrate over the distribution of \(\nu\). This is a conceptually difficult problem, as (a) we do not want to impose any particular distribution on the beliefs, and (b) we often do not have any idea what the beliefs of individual agents might have been. This is where the first genius insight of DM comes in.

If the distribution function of the preference shocks belongs to the family of log-concave distributions (which includes all the usual suspects, such as normal, uniform, and type I extreme value), then \(r(\cdot)\) is convex. By Jensen’s inequality, the expectation of a convex function of a random variable is at least as large as the function evaluated at the expectation of that random variable. Since \(\nu_i\) is mean zero, \(E_{\nu}[X_i] = \tilde{X}_i\), and so:

$$ E_{\nu} \left[ r(X_i\hat{\beta}) \right] \geq r(E_{\nu}[X_i]\hat{\beta}) = r(\tilde{X}_i\hat{\beta}). $$

Why is this helpful? Because we know that, whatever the distribution of \(\nu\) may be, replacing the unobserved \(\tilde{X}_i\) with the observed \(X_i\) turns the equality above into an inequality that holds in expectation:

$$\frac{1}{n} \sum_{i=1}^n d_i r(X_i\hat{\beta}) - (1-d_i) \geq 0.$$

Why is this? Because the first term is positive and the second term is negative. Taking expectations with respect to \(\nu\) only blows up the positive contribution to the sum, which means that the equality turns into a positive inequality.
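The Jensen argument can be checked numerically. The sketch below is a simulation under assumptions of our own choosing: a logistic \(F\), uniform covariates, a uniform mean-zero \(\nu\), and hypothetical parameter values. It writes the ratio in \(m_1\) as \((1-P)/P\) with \(P = Pr(d=1)\), which for the logistic case is \(\exp(-\text{index})\). It evaluates the moment once at the agent's beliefs (infeasible in practice) and once at the covariates the econometrician observes.

```python
import math
import random

random.seed(1)
beta = (1.0, 0.5)   # hypothetical true parameters
n = 20000

sum_belief, sum_observed = 0.0, 0.0
for _ in range(n):
    x1_belief = random.uniform(-1, 1)   # what the agent expects x1 to be
    x2 = random.uniform(-1, 1)          # known to both agent and econometrician
    nu = random.uniform(-1, 1)          # mean-zero gap between realization and belief
    x1_observed = x1_belief + nu        # what the econometrician sees ex post

    # The agent decides using beliefs; the logistic shock is an assumed F.
    index_belief = beta[0] * x1_belief + beta[1] * x2
    p = 1.0 / (1.0 + math.exp(-index_belief))
    d = 1 if random.random() < p else 0

    # m1 = d * (1 - P) / P - (1 - d); for a logistic F, (1 - P) / P = exp(-index).
    index_observed = beta[0] * x1_observed + beta[1] * x2
    sum_belief += d * math.exp(-index_belief) - (1 - d)
    sum_observed += d * math.exp(-index_observed) - (1 - d)

mean_belief = sum_belief / n      # close to zero: the infeasible equality
mean_observed = sum_observed / n  # positive: the feasible inequality
print(mean_belief, mean_observed)
```

The moment evaluated at beliefs averages out near zero, while the same moment evaluated at the observed covariates is pushed above zero by the convexity of the ratio, without the code ever using the distribution of \(\nu\) in the moment itself.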

There is simultaneously a subtlety and tremendous power here: we have accounted for the presence of private expectations without ever having to actually solve for that distribution. This is in contrast to the private information revolution from auctions, for example, where equilibrium conditions for optimal play are imposed on agent behavior in order to infer what their private information had to have been. Here, we never solve for the private information, and yet the inequality is consistent with a model where agents may perceive the values of the covariates to be different from what the econometrician observes. It is an amazing feat that we can recover consistent estimates of \(\beta\) in such an environment.

The downside to this approach is that the restrictions on the data-generating process are now in the form of inequalities, and as such, the econometrician may lose point identification. That is, a range of parameters may all be equally good at satisfying the constraints imposed by the econometric model. The question becomes, how do we impose constraints from the underlying model in such a way that we obtain practically useful estimates? Fortunately, a large literature in econometrics looking at the properties of moment inequalities has some answers for us.

First, we need to think about all the restrictions imposed by the model. Above, I mentioned that the agent may have some sources of information that are related to their beliefs, and some of those sources of information may be observable to the econometrician. Denote the observable set of covariates that are relevant to agent beliefs as \(W\). If that is the case, then the moment functions should be mean independent of those information variables. Another way of stating this in plain English is that if agents use certain covariates to optimize their behavior, the assumption that we baked into our model that observed choices are derived from utility-maximizing behavior implies that they cannot systematically make mistakes with respect to those covariates. In mathematical terms, the following conditional expectation of the moments holds:

$$E[m_1(\hat{\beta}) | X, W] \geq 0.$$

While \(W\) does not directly enter the utility function of the agents, under the imposed assumption of utility-maximizing behavior it must be the case that optimization errors cannot be systematic functions of \(W\), because otherwise the agent would have conditioned on that information and changed their behavior. Fundamentally, this is exactly the same insight as the linear instrumental variables model: while the instruments do not enter directly into the linear equation, it has to be the case that those instruments cannot be systematically related to prediction errors.

This is easiest to see in the case where the econometrician knows the exact expectations of the agent. In this case, \(W = \tilde{x}\). While we still use the observed \(x\) in forming the moment inequalities, it follows that the resulting errors must be mean independent of the true expectations, \(W\).

One of the issues that practitioners have to confront when using the conditional moment inequality above is that the conditional moment is generally computationally infeasible. If the conditioning variables are continuous, one has to figure out how to convert conditional moments into unconditional moments. Fortunately, Andrews and Shi (2014, *J. Econometrics*) have a solution.

## From Conditional to Unconditional Moments

Andrews and Shi suggest using a mechanism composed of nested hypercubes to convert conditional moments into unconditional moments. Essentially, they suggest transforming the conditional moment expectation into an unconditional moment of the variety:

$$E[g(X,W) m_1 (\hat{\beta}) ] \geq 0.$$

We interact the moments with functions of the conditioning variables. The intuition behind this is clear: the idea with conditioning is to evaluate the moment at a given value of the conditioning variables, \((X=x,W=w)\). There are an infinite number of moments that one can construct this way when the conditioning variables are continuous. So, we need to aggregate observations in some way that simultaneously preserves their information while balancing that against practical considerations of sample size and computational power. There are many ways to achieve efficiency asymptotically, but there are also important practical considerations in finite samples. Andrews and Shi suggest two approaches.

###### Fully Interacted HyperCubes

The first approach begins by cutting each conditioning variable \(z \in Z\), where \(Z\) collects the variables in \((X, W)\), into disjoint subintervals; we then assign each observation to a single moment by fully interacting all of the subintervals of \(Z\).

For example, suppose that we have two continuous variables and a single fixed effect in \(Z\). We begin by cutting each of the \(z \in Z\) into subintervals. One possible cut would be to partition above and below the median for each \(z\). For the fixed effect, we separate observations by whether the fixed effect is zero or one. We then fully interact these three conditions, generating moments that look like this:

$$
g(Z) = \begin{cases}
1(z_1 < \text{median}(z_1))\,1(z_2 < \text{median}(z_2))\,1(fe_1 = 0) \\
1(z_1 < \text{median}(z_1))\,1(z_2 < \text{median}(z_2))\,1(fe_1 = 1) \\
1(z_1 \geq \text{median}(z_1))\,1(z_2 < \text{median}(z_2))\,1(fe_1 = 0) \\
1(z_1 \geq \text{median}(z_1))\,1(z_2 < \text{median}(z_2))\,1(fe_1 = 1) \\
1(z_1 < \text{median}(z_1))\,1(z_2 \geq \text{median}(z_2))\,1(fe_1 = 0) \\
1(z_1 < \text{median}(z_1))\,1(z_2 \geq \text{median}(z_2))\,1(fe_1 = 1) \\
1(z_1 \geq \text{median}(z_1))\,1(z_2 \geq \text{median}(z_2))\,1(fe_1 = 0) \\
1(z_1 \geq \text{median}(z_1))\,1(z_2 \geq \text{median}(z_2))\,1(fe_1 = 1)
\end{cases}
$$
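The eight fully interacted cells can be built mechanically. The following Python sketch is our own illustration of the median-split example, with hypothetical data and function names; it is not taken from Andrews and Shi or from the DM code. Each observation lands in exactly one cell, and the unconditional moment for a cell is the sample mean of the cell indicator times the moment function.

```python
from itertools import product
from statistics import median

def hypercube_indicators(z1, z2, fe):
    """Map each of the 8 cells to a list of 0/1 indicators, one per observation."""
    med1, med2 = median(z1), median(z2)
    # A cell is (z1 >= median, z2 >= median, fixed effect value).
    cells = list(product((False, True), (False, True), (0, 1)))
    indicators = {cell: [] for cell in cells}
    for a, b, f in zip(z1, z2, fe):
        own_cell = (a >= med1, b >= med2, f)
        for cell in cells:
            indicators[cell].append(1 if cell == own_cell else 0)
    return indicators

def interacted_moments(indicators, m):
    """Sample analogue of E[g(Z) m] for every cell, given per-observation moments m."""
    n = len(m)
    return {cell: sum(gi * mi for gi, mi in zip(ind, m)) / n
            for cell, ind in indicators.items()}

# Hypothetical data: six observations.
z1 = [0.1, 0.9, 0.4, 0.7, 0.3, 0.8]
z2 = [5.0, 1.0, 3.0, 4.0, 2.0, 6.0]
fe = [0, 1, 1, 0, 0, 1]
g = hypercube_indicators(z1, z2, fe)
print(len(g))  # 2 x 2 x 2 = 8 cells
```

With only six observations most cells are empty or hold a single observation, which is a small-scale preview of the dimensionality problem with fully interacted cells.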

###### Pairwise Rectangles

The issue with fully-interacted hypercubes is the curse of dimensionality: interacting all subsets of all covariates generates a number of moments that grows extremely rapidly in the dimensionality of the covariates and the number of partitions. Andrews and Shi (2014) suggest a different approach that uses all pairwise interactions of covariates in this case. One produces the \(g(z)\) function in the following fashion:

$$
g_{ij}(Z) = 1(\underline{z}_i \leq z_i < \bar{z}_i,\ \underline{z}_j \leq z_j < \bar{z}_j), \quad \forall i \in \{1,\dots,K\},\ j \in \{i+1,\dots,K\}.
$$
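A short Python sketch of the pairwise construction follows. It is an illustration under our own assumptions: the cutpoints are supplied by the user, every variable is split into half-open intervals, and the function name and data are hypothetical. For each pair of variables \((i, j)\) and each pair of intervals, it produces one indicator per observation.

```python
from itertools import combinations

def pairwise_rectangles(Z, cutpoints):
    """Z: list of observations (tuples of K variables).
    cutpoints[k]: sorted interval edges for variable k.
    Returns a dict keyed by (i, j, a, b) -> list of 0/1 indicators,
    where a and b index the intervals of variables i and j."""
    K = len(cutpoints)
    out = {}
    for i, j in combinations(range(K), 2):
        for a in range(len(cutpoints[i]) - 1):
            for b in range(len(cutpoints[j]) - 1):
                lo_i, hi_i = cutpoints[i][a], cutpoints[i][a + 1]
                lo_j, hi_j = cutpoints[j][b], cutpoints[j][b + 1]
                out[(i, j, a, b)] = [
                    1 if (lo_i <= z[i] < hi_i and lo_j <= z[j] < hi_j) else 0
                    for z in Z
                ]
    return out

# Hypothetical data: three observations of K = 3 variables on [0, 1).
Z = [(0.1, 0.6, 0.3), (0.7, 0.2, 0.9), (0.4, 0.4, 0.5)]
cutpoints = [[0.0, 0.5, 1.0]] * 3   # two intervals per variable
g = pairwise_rectangles(Z, cutpoints)
print(len(g))  # 3 pairs x 2 x 2 rectangles = 12
```

With \(K\) variables and two intervals each, this yields \(4 \binom{K}{2}\) moments instead of the \(2^K\) cells produced by full interaction, which is the point of the pairwise approach.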

## Inference with Moment Inequalities

## Practical Considerations

Fixed effects

Estimation of the scale parameter / ratios of coefficients

Two-step methods versus CUE

## Java/Stata GitHub Package

A Java implementation of the DM estimator and a link to Stata can be found on GitHub here: https://github.com/cactus911/dmDiscreteChoice.