Statistics: Introduction

Reading time: ~25 min

Mathematical probability is about drawing conclusions about the outcomes of random experiments whose randomness is known and specified precisely. Statistics works in the opposite direction: the outcomes are observed, but the probability measure giving rise to those outcomes is unknown. The goal of statistics is to draw conclusions about probability distributions based on observations sampled from them.

For example, consider the eventual adult height X of a particular newborn child. There are no purely mathematical considerations that would suggest a specific distribution for X. Our best bet is to collect data on the heights of adults and try to infer a probability distribution which is compatible with the observed data. Suppose we measure the height (in inches) of 50 randomly selected adults and get the following numbers:

heights = [71.54, 66.62, 64.11, 62.72, 68.12,
           69.07, 64.82, 61.92, 68.45, 66.3,
           66.99, 62.2, 61.04, 63.31, 68.94,
           66.27, 66.8, 71.7, 68.93, 66.65,
           71.97, 60.27, 62.81, 70.64, 71.61,
           65.51, 63.1, 66.21, 68.23, 72.32,
           62.29, 63.12, 64.94, 71.89, 65.48,
           63.66, 56.11, 65.63, 61.26, 65.12,
           66.93, 68.51, 67.2, 71.57, 66.65,
           59.77, 61.51, 63.25, 69.12, 64.98]

Each observation provides some evidence about where the probability mass of the height distribution is. We would expect that regions with many observations have more probability mass than regions with few observations, although we should not take this too literally: none of the 50 observations in the list above fall in the interval [70.7, 71.4], but it would not make sense to conclude that adults who are taller than 70.7 inches are necessarily also taller than 71.4 inches.

Exercise
Brainstorm at least two ways to come up with a plausible density function given a list of observations like the one given above.

Nonparametric estimation

A simple way to obtain a probability distribution from a list of observations is to make a histogram. The idea is to subdivide the interval from the smallest to the largest observation into smaller intervals and make a bar chart showing the number of observations which fall into each of these intervals.

using Plots
histogram(heights,
          nbins=12,
          label="",
          xlabel="height (inches)",
          ylabel="count")

You might think of a histogram as just a visualization of the data, but it does give an actual distribution: we consider the piecewise constant function whose graph consists of the tops of the histogram bars, and we divide it by the sum of the areas of the bars (to obtain a new function which integrates to 1):

using Plots
histogram(heights,
          nbins=12,
          label="",
          xlabel="height (inches)",
          ylabel="density",
          normed=true)
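
The normalization described above can also be sketched by hand. The snippet below is a minimal illustration, not what Plots does internally: the helper name histogram_density, the choice of 4 equal-width bins, and the use of only the first ten heights are assumptions made to keep the example short.

```julia
# First ten observations from the heights list above, to keep the sketch short.
data = [71.54, 66.62, 64.11, 62.72, 68.12, 69.07, 64.82, 61.92, 68.45, 66.3]

# Hypothetical helper: split [min, max] into `nbins` equal-width bins and
# return the height of the normalized histogram over each bin.
function histogram_density(data, nbins)
    lo, hi = extrema(data)
    width = (hi - lo) / nbins
    counts = zeros(Int, nbins)
    for x in data
        # Bin index of x; the largest observation is placed in the last bin.
        k = min(floor(Int, (x - lo) / width) + 1, nbins)
        counts[k] += 1
    end
    # The bars have total area n * width, so dividing the counts by that
    # area yields a step function which integrates to 1.
    density = counts ./ (length(data) * width)
    return density, width
end

density, width = histogram_density(data, 4)
total_area = sum(density .* width)
```

Here total_area equals 1 up to floating-point rounding, which is the property that makes the step function a density.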

The arbitrariness in the density function we obtain by normalizing the histogram is hardly disguised: we would have gotten a different result if we'd used a different number of bins, and we could have even decided to use bins of different widths. Nevertheless, the histogram density approximates the actual distribution quite well if we have a lot of data:

Exercise
Call the function mysample 10000 times and make a histogram of the resulting observations. Compare the histogram density to the actual density, and observe that the two are very close.

Note: you can evaluate the pdf of N₁ at x using pdf(N₁,x).

function mysample()
    if rand() > 0.2
        3 + 0.8*randn()
    else
        -1 + randn()
    end
end

using Distributions
histogram([mysample() for _ in 1:10000],
          nbins=80,
          normed=true,
          label="histogram density")

N₁ = Normal(3,0.8)
N₂ = Normal(-1,1)
#actualdensity(x) = DENSITYFUNCTIONHERE
          
plot!(-6:0.1:6,
      actualdensity,
      linewidth=3,
      label="actual density",
      legend=:topright)    

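Solution. The condition rand() > 0.2 holds with probability \frac{4}{5}, so mysample draws from N₁ with probability \frac{4}{5} and from N₂ with probability \frac{1}{5}. The density function describing the distribution that mysample draws from is therefore a weighted average of the two given Gaussian density functions, with weights \frac{4}{5} and \frac{1}{5}: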

actualdensity(x) = 0.8 * pdf(N₁, x) + 0.2 * pdf(N₂, x)
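
As a sanity check, one can confirm numerically that this weighted average is itself a density, meaning that it integrates to 1. The sketch below avoids the Distributions package so that it runs on its own; normalpdf and mixturedensity are hypothetical names introduced for illustration, and the grid from -10 to 10 with spacing 0.01 is an arbitrary choice.

```julia
# Hypothetical helper: density of a normal distribution with mean μ and
# standard deviation σ, evaluated at x.
normalpdf(x, μ, σ) = exp(-((x - μ) / σ)^2 / 2) / (σ * sqrt(2π))

# Mixture density with the weights 4/5 and 1/5 from the solution above.
mixturedensity(x) = 0.8 * normalpdf(x, 3, 0.8) + 0.2 * normalpdf(x, -1, 1)

# Riemann sum over a grid wide enough that the tails are negligible.
dx = 0.01
xs = -10:dx:10
area = sum(mixturedensity.(xs)) * dx
```

The value of area should be very close to 1. The same check with weights that do not sum to 1 would fail, which is why the weights have to match the probabilities of the two branches in mysample.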

Parametric estimation

Another way to come up with a density function for some data is to assume that the density function belongs to a specific parametric family of densities, like the set of Gaussian distributions. Then we estimate the parameters using the data.

Exercise
Use the sliders to find the μ and σ values for which the normal distribution \mathcal{N}(\mu, \sigma) does the best job of fitting the data. (The meaning of the term "best" here is deliberately left to your discretion). Compare your results to the values obtained using standard methods for this problem by entering your choices for μ and σ in the last line below.

The best μ value is ____, and the best σ value is ____.

Later in this course, we will discuss some approaches to choosing parameters optimally, and we'll leave behind the "eyeball-it" strategy we used in this exercise.
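
The text does not say which standard method it has in mind. One common choice, assumed here, is to take μ to be the sample mean and σ to be the sample standard deviation of the observations. A minimal sketch using Julia's Statistics standard library (the names mu_hat and sigma_hat are illustrative):

```julia
using Statistics  # standard library: provides mean and std

heights = [71.54, 66.62, 64.11, 62.72, 68.12,
           69.07, 64.82, 61.92, 68.45, 66.3,
           66.99, 62.2, 61.04, 63.31, 68.94,
           66.27, 66.8, 71.7, 68.93, 66.65,
           71.97, 60.27, 62.81, 70.64, 71.61,
           65.51, 63.1, 66.21, 68.23, 72.32,
           62.29, 63.12, 64.94, 71.89, 65.48,
           63.66, 56.11, 65.63, 61.26, 65.12,
           66.93, 68.51, 67.2, 71.57, 66.65,
           59.77, 61.51, 63.25, 69.12, 64.98]

mu_hat = mean(heights)     # sample mean
sigma_hat = std(heights)   # sample standard deviation (divides by n - 1)
```

These values can be compared with the ones found by eye. Other reasonable definitions of "best" may give slightly different numbers.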

The histogram estimator is called a nonparametric estimation method, because it doesn't involve assuming that the distribution comes from a particular parametric family. The advantage of histograms is that they are flexible enough to represent a variety of density shapes, while parametric methods make more efficient use of data in situations where the parametric assumption happens to be valid.

Regression

Statistics is not limited to estimating the distribution of a single real-valued random variable like human height. Typically we want to have information about the joint distribution of such a variable with other variables whose values we are in a position to know. Such joint information allows us to make more accurate predictions, and that increased accuracy is usually critical for the business or research purposes that motivated the inquiry.

For example, if we're able to collect the heights of many adults together with the heights of each of their parents, then we can aim to understand the conditional expectation of a person's height, given the heights of their parents. Since we can measure the heights of a child's parents, we can use this information to make a better prediction for how tall the child will grow up to be. The problem of estimating the conditional expectation of one random variable given others is called regression.

In the next section, we will develop some intuitive techniques for estimating density functions for joint distributions. We'll close this section with an exercise involving the estimation of a discrete distribution.

Exercise
Consider a random variable X that you know takes values in \{0,1,2\}. Suppose that 100 independent observations are made from the distribution of X, and suppose they are the values given below. Propose an estimate of the distribution of X.

observations = [
0, 2, 2, 2, 2, 2, 0, 2, 2, 1,
0, 2, 2, 1, 0, 1, 0, 2, 1, 2,
2, 2, 1, 2, 2, 1, 2, 2, 2, 2,
2, 2, 2, 2, 2, 1, 1, 2, 2, 1,
2, 2, 2, 2, 2, 1, 2, 2, 2, 2,
1, 2, 2, 2, 2, 0, 2, 0, 1, 2,
0, 0, 2, 2, 2, 0, 2, 2, 2, 0,
2, 0, 2, 0, 2, 2, 2, 0, 0, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 0, 2, 2, 2, 2, 2, 1, 0, 2
]

Solution. Since 70% of the observations are 2's, we posit that the probability of the event \{X = 2\} is 70%. Likewise, we estimate the probabilities of the events \{X = 1\} and \{X = 0\} to be 13% and 17%, respectively.
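
The proportions in the solution can be checked by counting. The sketch below stores the empirical frequencies in a dictionary; the name estimate is an illustrative choice.

```julia
observations = [
0, 2, 2, 2, 2, 2, 0, 2, 2, 1,
0, 2, 2, 1, 0, 1, 0, 2, 1, 2,
2, 2, 1, 2, 2, 1, 2, 2, 2, 2,
2, 2, 2, 2, 2, 1, 1, 2, 2, 1,
2, 2, 2, 2, 2, 1, 2, 2, 2, 2,
1, 2, 2, 2, 2, 0, 2, 0, 1, 2,
0, 0, 2, 2, 2, 0, 2, 2, 2, 0,
2, 0, 2, 0, 2, 2, 2, 0, 0, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 0, 2, 2, 2, 2, 2, 1, 0, 2
]

# Empirical frequency of each possible value of X: the fraction of
# observations equal to k, for k = 0, 1, 2.
estimate = Dict(k => count(==(k), observations) / length(observations) for k in 0:2)
```

The three frequencies are nonnegative and sum to 1, so estimate is a valid probability mass function on \{0,1,2\}.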
