This article is about bias of statistical estimators. For other uses in statistics, see Bias (statistics).
In statistics, the bias (or bias function) of an estimator is the difference between this estimator's expected value and the true value of the parameter being estimated. An estimator or decision rule with zero bias is called unbiased. Otherwise the estimator is said to be biased.
In ordinary English, the word bias is pejorative. In statistics, there are problems for which it may be good to use an estimator with a small, but nonzero, bias. In some cases, an estimator with a small bias may have a smaller mean squared error or be median-unbiased (rather than mean-unbiased, the standard unbiasedness property). The property of median-unbiasedness is invariant under monotone transformations, while the property of mean-unbiasedness may be lost under nonlinear transformations.
Definition
Suppose we have a statistical model parameterized by θ giving rise to a probability distribution for observed data, $P(x \mid \theta)$, and a statistic $\hat\theta$ which serves as an estimator of θ based on any observed data $x$. That is, we assume that our data follow some unknown distribution $P(x \mid \theta)$ (where $\theta$ is a fixed constant that is part of this distribution, but is unknown), and then we construct some estimator $\hat\theta$ that maps observed data to values that we hope are close to $\theta$. Then the bias of this estimator is defined to be
 $\operatorname{Bias}[\,\hat\theta\,] = \operatorname{E}[\,\hat{\theta}\,] - \theta = \operatorname{E}[\,\hat\theta - \theta\,],$
where E[ ] denotes expected value over the distribution $P(x \mid \theta)$, i.e. averaging over all possible observations $x$.
An estimator is said to be unbiased if its bias is equal to zero for all values of parameter θ.
There are more general notions of bias and unbiasedness. What this article calls "bias" is called "mean-bias", to distinguish mean-bias from the other notions, with the notable ones being "median-unbiased" estimators. For more details, the general theory of unbiased estimators is briefly discussed near the end of this article.
In a simulation experiment concerning the properties of an estimator, the bias of the estimator may be assessed using the mean signed difference.
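As a rough illustration of this idea, the following Python sketch estimates bias by the mean signed difference over many simulated samples. The normal data-generating model, the sample size, the seed and the helper name `mean_signed_difference` are assumptions chosen for the example, not part of the definition above.

```python
import random
import statistics


def mean_signed_difference(estimator, true_value, sampler, n_trials=20_000, seed=0):
    """Average of (estimate - true value) over repeated simulated samples."""
    rng = random.Random(seed)
    diffs = [estimator(sampler(rng)) - true_value for _ in range(n_trials)]
    return statistics.fmean(diffs)


# Assumed setup: samples of size 5 from a normal distribution with mean 3, sd 2.
TRUE_MEAN = 3.0


def draw_sample(rng, size=5):
    return [rng.gauss(TRUE_MEAN, 2.0) for _ in range(size)]


# The sample mean should show a simulated bias near zero, while the sample
# maximum, used (badly) as an estimator of the mean, should not.
bias_of_mean = mean_signed_difference(statistics.fmean, TRUE_MEAN, draw_sample)
bias_of_max = mean_signed_difference(max, TRUE_MEAN, draw_sample)

print(f"sample mean:    {bias_of_mean:+.3f}")
print(f"sample maximum: {bias_of_max:+.3f}")
```

The simulated value is itself random, so a small nonzero result for the sample mean reflects Monte Carlo noise rather than bias.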
Examples
Sample variance
Suppose X_{1}, ..., X_{n} are independent and identically distributed (i.i.d.) random variables with expectation μ and variance σ^{2}. If the sample mean and uncorrected sample variance are defined as
 $\overline{X}=\frac{1}{n}\sum_{i=1}^n X_i, \qquad S^2=\frac{1}{n}\sum_{i=1}^n\left(X_i-\overline{X}\,\right)^2,$
then S^{2} is a biased estimator of σ^{2}, because
$$
\begin{align}
\operatorname{E}[S^2]
&= \operatorname{E}\left[ \frac{1}{n}\sum_{i=1}^n \left(X_i-\overline{X}\right)^2 \right]
= \operatorname{E}\bigg[ \frac{1}{n}\sum_{i=1}^n \big((X_i-\mu)-(\overline{X}-\mu)\big)^2 \bigg] \\[8pt]
&= \operatorname{E}\bigg[ \frac{1}{n}\sum_{i=1}^n (X_i-\mu)^2 -
2(\overline{X}-\mu)\frac{1}{n}\sum_{i=1}^n (X_i-\mu) +
(\overline{X}-\mu)^2 \bigg] \\[8pt]
&= \operatorname{E}\bigg[ \frac{1}{n}\sum_{i=1}^n (X_i-\mu)^2 - (\overline{X}-\mu)^2 \bigg]
= \sigma^2 - \operatorname{E}\left[ (\overline{X}-\mu)^2 \right] < \sigma^2.
\end{align}
$$
In other words, the expected value of the uncorrected sample variance does not equal the population variance σ^{2}, unless multiplied by a normalization factor. The sample mean, on the other hand, is an unbiased estimator of the population mean μ.
The reason that S^{2} is biased stems from the fact that the sample mean is an ordinary least squares (OLS) estimator for μ: $\overline{X}$ is the number that makes the sum $\sum_{i=1}^n (X_i-\overline{X})^2$ as small as possible. That is, when any other number is plugged into this sum, the sum can only increase. In particular, the choice $\mu \ne \overline{X}$ gives
$$
\frac{1}{n}\sum_{i=1}^n (X_i-\overline{X})^2 < \frac{1}{n}\sum_{i=1}^n (X_i-\mu)^2,
$$
and then
$$
\begin{align}
\operatorname{E}[S^2]
&= \operatorname{E}\bigg[ \frac{1}{n}\sum_{i=1}^n (X_i-\overline{X})^2 \bigg]
< \operatorname{E}\bigg[ \frac{1}{n}\sum_{i=1}^n (X_i-\mu)^2 \bigg] = \sigma^2.
\end{align}
$$
Note that the usual definition of sample variance is
 $s^2=\frac{1}{n-1}\sum_{i=1}^n(X_i-\overline{X}\,)^2,$
and this is an unbiased estimator of the population variance. This can be seen by noting the following formula, which follows from the Bienaymé formula, for the term in the inequality for the expectation of the uncorrected sample variance above:
 $\operatorname{E}\big[ (\overline{X}-\mu)^2 \big] = \frac{1}{n}\sigma^2 .$
Substituting this into the expression for E[S^{2}] above gives E[S^{2}] = (n − 1)σ^{2}/n, so multiplying S^{2} by n/(n − 1) yields the unbiased estimator s^{2}. This factor n/(n − 1), the ratio of the unbiased to the biased (uncorrected) estimate of the variance, is known as Bessel's correction.
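The effect of the divisor can be checked numerically. The sketch below is a simulation under assumed settings (normal data, n = 5, σ^{2} = 4, a fixed seed); it averages both estimators over many samples and compares the averages with (n − 1)σ^{2}/n and σ^{2}.

```python
import random
import statistics


def variance_estimates(sample):
    """Return (uncorrected S^2 with divisor n, corrected s^2 with divisor n - 1)."""
    n = len(sample)
    xbar = statistics.fmean(sample)
    sum_sq = sum((x - xbar) ** 2 for x in sample)
    return sum_sq / n, sum_sq / (n - 1)


rng = random.Random(1)
n, sigma2, trials = 5, 4.0, 50_000   # assumed values for the illustration

uncorrected, corrected = [], []
for _ in range(trials):
    sample = [rng.gauss(0.0, sigma2 ** 0.5) for _ in range(n)]
    S2, s2 = variance_estimates(sample)
    uncorrected.append(S2)
    corrected.append(s2)

mean_S2 = statistics.fmean(uncorrected)
mean_s2 = statistics.fmean(corrected)

print(f"average S^2 = {mean_S2:.3f}  (theory: {(n - 1) / n * sigma2:.3f})")
print(f"average s^2 = {mean_s2:.3f}  (theory: {sigma2:.3f})")
```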
Estimating a Poisson probability
A far more extreme case of a biased estimator being better than any unbiased estimator arises from the Poisson distribution.^{[1]}^{[2]} Suppose that X has a Poisson distribution with expectation λ. Suppose it is desired to estimate
 $\operatorname{P}(X=0)^2=e^{-2\lambda}.\quad$
(For example, when incoming calls at a telephone switchboard are modeled as a Poisson process, and λ is the average number of calls per minute, then e^{−2λ} is the probability that no calls arrive in the next two minutes.)
Since the expectation of an unbiased estimator δ(X) is equal to the estimand, i.e.
 $E(\delta(X))=\sum_{x=0}^\infty \delta(x) \frac{\lambda^x e^{-\lambda}}{x!}=e^{-2\lambda},$
the only function of the data constituting an unbiased estimator is
 $\delta(x)=(-1)^x. \,$
To see this, note that after factoring e^{−λ} out of the above expression for the expectation, the sum that is left, $\sum_{x=0}^\infty (-\lambda)^x/x!$, is the Taylor series expansion of e^{−λ} as well, yielding e^{−λ}e^{−λ} = e^{−2λ} (see Characterizations of the exponential function).
If the observed value of X is 100, then the estimate is 1, although the true value of the quantity being estimated is very likely to be near 0, which is the opposite extreme. And, if X is observed to be 101, then the estimate is even more absurd: It is −1, although the quantity being estimated must be positive.
The (biased) maximum likelihood estimator
 $e^{-2X}\quad$
is far better than this unbiased estimator. Not only is its value always positive but it is also more accurate in the sense that its mean squared error
 $e^{-4\lambda}-2e^{\lambda(1/e^2-3)}+e^{\lambda(1/e^4-1)} \,$
is smaller; compare the unbiased estimator's MSE of
 $1-e^{-4\lambda}. \,$
The MSEs are functions of the true value λ. The bias of the maximum-likelihood estimator is:
 $e^{\lambda(1/e^2-1)}-e^{-2\lambda}. \,$
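These expressions can be checked numerically by summing directly over the Poisson probabilities. The following Python sketch does so for a few assumed values of λ; the helper names and the truncation of the series at 200 terms are choices made for the illustration.

```python
import math


def poisson_expectation(g, lam, terms=200):
    """E[g(X)] for X ~ Poisson(lam), truncating the series after `terms` terms."""
    total = 0.0
    pmf = math.exp(-lam)          # P(X = 0)
    for x in range(terms):
        total += g(x) * pmf
        pmf *= lam / (x + 1)      # P(X = x + 1) obtained from P(X = x)
    return total


def unbiased(x):
    """The only unbiased estimator of exp(-2 * lam)."""
    return (-1) ** x


def mle(x):
    """The (biased) maximum-likelihood estimator of exp(-2 * lam)."""
    return math.exp(-2 * x)


def mse(estimator, lam):
    target = math.exp(-2 * lam)
    return poisson_expectation(lambda x: (estimator(x) - target) ** 2, lam)


def mse_mle_closed_form(lam):
    return (math.exp(-4 * lam)
            - 2 * math.exp(lam * (math.exp(-2) - 3))
            + math.exp(lam * (math.exp(-4) - 1)))


for lam in (0.5, 1.0, 3.0):
    print(f"lambda={lam}: MSE(MLE)={mse(mle, lam):.4f}, "
          f"MSE(unbiased)={mse(unbiased, lam):.4f}")
```

For each of these values the maximum-likelihood estimator has the smaller mean squared error, in line with the closed-form expressions above.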
Maximum of a discrete uniform distribution
The bias of maximum-likelihood estimators can be substantial. Consider a case where n tickets numbered from 1 through n are placed in a box and one is selected at random, giving a value X. If n is unknown, then the maximum-likelihood estimator of n is X, even though the expectation of X is only (n + 1)/2; we can be certain only that n is at least X and is probably more. In this case, the natural unbiased estimator is 2X − 1.
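Because X takes only n equally likely values, both expectations can be computed exactly. The short Python sketch below does this with rational arithmetic for a few assumed values of n; the helper name `expectation` is introduced for the example.

```python
from fractions import Fraction


def expectation(estimator, n):
    """Exact E[estimator(X)] when X is uniform on {1, ..., n}."""
    return sum(Fraction(estimator(x), n) for x in range(1, n + 1))


for n in (1, 10, 57):
    e_mle = expectation(lambda x: x, n)               # maximum-likelihood estimator X
    e_unbiased = expectation(lambda x: 2 * x - 1, n)  # estimator 2X - 1
    print(f"n={n}: E[X]={e_mle}, E[2X-1]={e_unbiased}")
```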
Median-unbiased estimators
The theory of median-unbiased estimators was revived by George W. Brown in 1947:
An estimate of a one-dimensional parameter θ will be said to be median-unbiased, if, for fixed θ, the median of the distribution of the estimate is at the value θ; i.e., the estimate underestimates just as often as it overestimates. This requirement seems for most purposes to accomplish as much as the mean-unbiased requirement and has the additional property that it is invariant under one-to-one transformation.
Further properties of median-unbiased estimators have been noted by Lehmann, Birnbaum, van der Vaart and Pfanzagl. In particular, median-unbiased estimators exist in cases where mean-unbiased and maximum-likelihood estimators do not exist. Besides being invariant under one-to-one transformations, median-unbiased estimators have surprising robustness.
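The invariance property can be illustrated on a simulated set of estimates. In the Python sketch below (an illustration under assumed settings: normally distributed estimates, the increasing transformation exp, a fixed seed), the median of the transformed values equals the transformed median, while the same is not true of the mean.

```python
import math
import random
import statistics

rng = random.Random(7)

# Assumed setup: 10,001 simulated values of some estimator. The odd count
# makes the sample median a single observed value, not an average of two.
estimates = [rng.gauss(1.0, 0.5) for _ in range(10_001)]
transformed = [math.exp(t) for t in estimates]   # exp is strictly increasing

# The median commutes with a monotone transformation ...
median_commutes = (
    statistics.median(transformed) == math.exp(statistics.median(estimates))
)
# ... but the mean does not: for the convex function exp the gap is positive.
mean_gap = statistics.fmean(transformed) - math.exp(statistics.fmean(estimates))

print("median commutes with exp:", median_commutes)
print(f"mean(exp(T)) - exp(mean(T)) = {mean_gap:.3f}")
```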
Bias with respect to other loss functions
Any minimum-variance mean-unbiased estimator minimizes the risk (expected loss) with respect to the squared-error loss function (among mean-unbiased estimators), as observed by Gauss. A minimum-average-absolute-deviation median-unbiased estimator minimizes the risk with respect to the absolute loss function (among median-unbiased estimators), as observed by Laplace. Other loss functions are used in statistical theory, particularly in robust statistics.
Effect of transformations
Note that, when a transformation is applied to a mean-unbiased estimator, the result need not be a mean-unbiased estimator of its corresponding population statistic. By Jensen's inequality, a convex function as transformation will introduce positive bias, while a concave function will introduce negative bias, and a function of mixed convexity may introduce bias in either direction, depending on the specific function and distribution. That is, for a nonlinear function f and a mean-unbiased estimator U of a parameter p, the composite estimator f(U) need not be a mean-unbiased estimator of f(p). For example, the square root of the unbiased estimator of the population variance (the corrected sample standard deviation) is not a mean-unbiased estimator of the population standard deviation. The bias depends both on the sampling distribution of the estimator and on the transform, and can be quite involved to calculate; see unbiased estimation of standard deviation for a discussion in this case.
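The standard-deviation example can be reproduced by simulation. The following Python sketch assumes normal data with σ = 1 and samples of size 5 (choices made only for illustration) and averages both the unbiased sample variance and its square root over many samples.

```python
import random
import statistics

rng = random.Random(11)
sigma, n, trials = 1.0, 5, 40_000   # assumed values for the illustration

variances, std_devs = [], []
for _ in range(trials):
    sample = [rng.gauss(0.0, sigma) for _ in range(n)]
    xbar = statistics.fmean(sample)
    s2 = sum((x - xbar) ** 2 for x in sample) / (n - 1)  # unbiased for sigma^2
    variances.append(s2)
    std_devs.append(s2 ** 0.5)   # concave transform of an unbiased estimator

mean_s2 = statistics.fmean(variances)
mean_s = statistics.fmean(std_devs)

print(f"average s^2 = {mean_s2:.3f}  (sigma^2 = {sigma ** 2:.3f})")
print(f"average s   = {mean_s:.3f}  (sigma   = {sigma:.3f})")
```

Since the square root is concave, the average of s falls below σ even though the average of s^{2} is close to σ^{2}, as Jensen's inequality predicts.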
Bias, variance and mean squared error
While bias quantifies the average difference to be expected between an estimator and an underlying parameter, an estimator based on a finite sample can additionally be expected to differ from the parameter due to the randomness in the sample.
One measure which is used to try to reflect both types of difference is the mean square error,
 $\operatorname{MSE}(\hat{\theta})=\operatorname{E}\big[(\hat{\theta}-\theta)^2\big].$
This can be shown to be equal to the square of the bias, plus the variance:
$$
\begin{align}
\operatorname{MSE}(\hat{\theta})= & (\operatorname{E}[\hat{\theta}]-\theta)^2 + \operatorname{E}[\,(\hat{\theta} - \operatorname{E}[\,\hat{\theta}\,])^2\,]\\
= & (\operatorname{Bias}(\hat{\theta},\theta))^2 + \operatorname{Var}(\hat{\theta})
\end{align}
$$
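The decomposition also holds exactly for the empirical distribution of a set of simulated estimates, which gives a simple numerical check. The Python sketch below uses the uncorrected sample variance as the estimator; the normal population, sample size and seed are assumptions made for the example.

```python
import math
import random
import statistics

rng = random.Random(3)
theta = 4.0              # true variance of the assumed N(0, 2^2) population
n, trials = 5, 10_000

# Uncorrected sample variance (divisor n), a biased estimator of theta.
estimates = []
for _ in range(trials):
    sample = [rng.gauss(0.0, 2.0) for _ in range(n)]
    xbar = statistics.fmean(sample)
    estimates.append(sum((x - xbar) ** 2 for x in sample) / n)

mean_est = statistics.fmean(estimates)
mse = statistics.fmean([(t - theta) ** 2 for t in estimates])
bias = mean_est - theta
variance = statistics.fmean([(t - mean_est) ** 2 for t in estimates])

print(f"MSE             = {mse:.4f}")
print(f"bias^2 + var    = {bias ** 2 + variance:.4f}")
```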
When the parameter is a vector, an analogous decomposition applies:^{[4]}
 $\operatorname{MSE}(\hat{\theta}) = \operatorname{trace}(\operatorname{Var}(\hat{\theta})) + \left\Vert\operatorname{Bias}(\hat{\theta},\theta)\right\Vert^{2}$
where
 $\operatorname{trace}(\operatorname{Var}(\hat{\theta}))$
is the trace of the covariance matrix of the estimator.
An estimator that minimises the bias will not necessarily minimise the mean square error.
Example: Estimation of population variance
For example,^{[5]} suppose an estimator of the form
 $T^2 = c \sum_{i=1}^n\left(X_i-\overline{X}\,\right)^2 = c n S^2$
is sought for the population variance as above, but this time to minimise the MSE:
$$
\begin{align}
\operatorname{MSE} = & \operatorname{E}\left[(T^2 - \sigma^2)^2\right] \\
= & \left(\operatorname{E}\left[T^2 - \sigma^2\right]\right)^2 + \operatorname{Var}(T^2)
\end{align}
$$
If the variables X_{1} ... X_{n} follow a normal distribution, then nS^{2}/σ^{2} has a chi-squared distribution with n − 1 degrees of freedom, giving:
 $\operatorname{E}[nS^2] = (n-1)\sigma^2\text{ and }\operatorname{Var}(nS^2)=2(n-1)\sigma^4.$
and so
 $\operatorname{MSE} = (c (n-1) - 1)^2\sigma^4 + 2c^2(n-1)\sigma^4$
With a little algebra it can be confirmed that it is c = 1/(n + 1) which minimises this combined loss function, rather than c = 1/(n − 1) which minimises just the bias term.
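The algebra can be sketched as follows, assuming n > 1 and treating the MSE expression above as a function of c alone:

```latex
\begin{align}
\frac{\operatorname{MSE}(c)}{\sigma^4} &= \big(c(n-1)-1\big)^2 + 2c^2(n-1), \\
\frac{1}{\sigma^4}\frac{d\operatorname{MSE}}{dc} &= 2(n-1)\big(c(n-1)-1\big) + 4c(n-1)
  = 2(n-1)\big(c(n+1)-1\big) = 0
  \quad\Longrightarrow\quad c = \frac{1}{n+1}, \\
\operatorname{MSE}\Big|_{c=1/(n+1)} &= \frac{2\sigma^4}{n+1}
  \;<\; \frac{2\sigma^4}{n-1} = \operatorname{MSE}\Big|_{c=1/(n-1)} .
\end{align}
```

The second derivative, 2(n − 1)(n + 1)σ^{4}, is positive, so the stationary point is a minimum.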
More generally it is only in restricted classes of problems that there will be an estimator that minimises the MSE independently of the parameter values.
However it is very common that there may be perceived to be a bias–variance tradeoff, such that a small increase in bias can be traded for a larger decrease in variance, resulting in a more desirable estimator overall.
Bayesian view
Most Bayesians are rather unconcerned about unbiasedness (at least in the formal sampling-theory sense above) of their estimates. For example, Gelman et al. (1995) write: "From a Bayesian perspective, the principle of unbiasedness is reasonable in the limit of large samples, but otherwise it is potentially misleading."^{[6]}
Fundamentally, the difference between the Bayesian approach and the sampling-theory approach above is that in the sampling-theory approach the parameter is taken as fixed, and then probability distributions of a statistic are considered, based on the predicted sampling distribution of the data. For a Bayesian, however, it is the data which is known, and fixed, and it is the unknown parameter for which an attempt is made to construct a probability distribution, using Bayes' theorem:
 $p(\theta \mid D, I) \propto p(\theta \mid I) p(D \mid \theta, I)$
Here the second term, the likelihood of the data given the unknown parameter value θ, depends just on the data obtained and the modelling of the data generation process. However a Bayesian calculation also includes the first term, the prior probability for θ, which takes account of everything the analyst may know or suspect about θ before the data comes in. This information plays no part in the sampling-theory approach; indeed any attempt to include it would be considered "bias" away from what was pointed to purely by the data. To the extent that Bayesian calculations include prior information, it is therefore essentially inevitable that their results will not be "unbiased" in sampling-theory terms.
But the results of a Bayesian approach can differ from the sampling theory approach even if the Bayesian tries to adopt an "uninformative" prior.
For example, consider again the estimation of an unknown population variance σ^{2} of a Normal distribution with unknown mean, where it is desired to optimise c in the expected loss function
 $\operatorname{Expected\;Loss} = \operatorname{E}\left[\left(c n S^2 - \sigma^2\right)^2\right] = \operatorname{E}\left[\sigma^4 \left(c n \tfrac{S^2}{\sigma^2} - 1 \right)^2\right]$
A standard choice of uninformative prior for this problem is the Jeffreys prior, $\scriptstyle{p(\sigma^2) \;\propto\; 1 / \sigma^2}$, which is equivalent to adopting a rescaling-invariant flat prior for ln(σ^{2}).
One consequence of adopting this prior is that S^{2}/σ^{2} remains a pivotal quantity, i.e. the probability distribution of S^{2}/σ^{2} depends only on S^{2}/σ^{2}, independent of the value of S^{2} or σ^{2}:
 $p\left(\tfrac{S^2}{\sigma^2}\mid S^2\right) = p\left(\tfrac{S^2}{\sigma^2}\mid \sigma^2\right) = g\left(\tfrac{S^2}{\sigma^2}\right)$
However, whilst
 $\operatorname{E}_{p(S^2\mid \sigma^2)}\left[\sigma^4 \left(c n \tfrac{S^2}{\sigma^2} - 1 \right)^2\right] = \sigma^4 \operatorname{E}_{p(S^2\mid \sigma^2)}\left[\left(c n \tfrac{S^2}{\sigma^2} - 1 \right)^2\right]$
in contrast
 $\operatorname{E}_{p(\sigma^2\mid S^2)}\left[\sigma^4 \left(c n \tfrac{S^2}{\sigma^2} - 1 \right)^2\right] \neq \sigma^4 \operatorname{E}_{p(\sigma^2\mid S^2)}\left[\left(c n \tfrac{S^2}{\sigma^2} - 1 \right)^2\right]$
because, when the expectation is taken over the probability distribution of σ^{2} given S^{2}, as it is in the Bayesian case, rather than S^{2} given σ^{2}, one can no longer take σ^{4} as a constant and factor it out. The consequence of this is that, compared to the sampling-theory calculation, the Bayesian calculation puts more weight on larger values of σ^{2}, properly taking into account (as the sampling-theory calculation cannot) that under this squared-loss function the consequence of underestimating large values of σ^{2} is more costly in squared-loss terms than that of overestimating small values of σ^{2}.
The worked-out Bayesian calculation gives a scaled inverse chi-squared distribution with n − 1 degrees of freedom for the posterior probability distribution of σ^{2}. The expected loss is minimised when cnS^{2} = ⟨σ^{2}⟩, the posterior mean of σ^{2}; this occurs when c = 1/(n − 3).
Even with an uninformative prior, therefore, a Bayesian calculation may not give the same expected-loss minimising result as the corresponding sampling-theory calculation.
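The value c = 1/(n − 3) can be recovered from the mean of the posterior distribution. The sketch below assumes, beyond what is stated above, that the posterior scale parameter is τ^{2} = nS^{2}/(n − 1) (the usual result for a normal model with a flat prior on the mean together with the Jeffreys prior on σ^{2}) and that n > 3, so that the posterior mean exists; the symbols ν and τ^{2} are introduced here only for illustration.

```latex
\begin{align}
\sigma^2 \mid S^2 &\sim \text{Scale-inv-}\chi^2\!\left(\nu,\ \tau^2\right),
  \qquad \nu = n-1,\quad \tau^2 = \frac{nS^2}{n-1}, \\
\langle\sigma^2\rangle &= \frac{\nu\,\tau^2}{\nu-2} = \frac{nS^2}{n-3}, \\
\frac{\partial}{\partial c}\operatorname{E}\left[\left(cnS^2-\sigma^2\right)^2 \mid S^2\right]
  &= 2nS^2\left(cnS^2-\langle\sigma^2\rangle\right) = 0
  \quad\Longrightarrow\quad c = \frac{1}{n-3}.
\end{align}
```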
Notes
References
 2236236.
 2236928.
 Allan Birnbaum, 1961. "A Unified Theory of Estimation, I", The Annals of Mathematical Statistics, vol. 32, no. 1 (Mar., 1961), pp. 112–135.
 Van der Vaart, H. R., 1961. "Some Extensions of the Idea of Bias" The Annals of Mathematical Statistics, vol. 32, no. 2 (June 1961), pp. 436–447.
 Pfanzagl, Johann. 1994. Parametric Statistical Theory. Walter de Gruyter.