# Probability: Introduction

When we do data science, we begin with a data set and work to gain insights about the process that generated the data. Crucial to this endeavor is a robust vocabulary for discussing the behavior of data-generating processes.

It is helpful to begin with data-generating processes whose randomness properties are specified completely and precisely. The study of such processes is called **probability**. For example, "What's the probability that I get at least 7 heads in 10 independent flips of a fair coin?" is a probability question, because the setup is fully specified: the coin has exactly a 50% probability of heads on each flip, and the different flips do not affect one another.
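Because the setup of the coin-flip question is fully specified, its answer can be computed directly. The following is a minimal sketch in Python, assuming the standard binomial counting argument: there are $\binom{10}{k}$ equally likely ways to get exactly $k$ heads. The function name `prob_at_least` and its parameters are illustrative and not part of the original text.

```python
from math import comb

def prob_at_least(min_heads: int, flips: int, p_heads: float = 0.5) -> float:
    """Probability of at least `min_heads` heads in `flips` independent flips.

    Sums the binomial probabilities of exactly k heads for
    k = min_heads, ..., flips. Assumes every flip has the same
    heads probability `p_heads` and that flips are independent.
    """
    return sum(
        comb(flips, k) * p_heads**k * (1 - p_heads) ** (flips - k)
        for k in range(min_heads, flips + 1)
    )

# At least 7 heads in 10 flips of a fair coin:
# (120 + 45 + 10 + 1) / 1024 = 176/1024
print(prob_at_least(7, 10))  # 0.171875
```

So under these assumptions, the chance of at least 7 heads is a little over 17%.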

The question of whether the coin is really fair or whether the flips are really independent will be deferred to our study of *statistics*. In statistics, we will have the *outcome* of a random experiment in hand and will be looking to draw inferences about the unknown *setup*. Once we are able to answer questions in the "setup → outcome" direction, we will be well positioned to approach the "outcome → setup" direction.

**Exercise**

Each of the questions below is either a probability question or a statistics question. Select the ones that are *probability* questions.

*not* rain today?

*Solution.* The first question is **statistics**. We don't know the probability of rain, and we are trying to draw an inference about it based on observed samples.

The second question is a **probability** question. We are given the setup and asked a question which assumes its validity.

The third question is also a **probability** question. We're told the dice are fair, and we're asked a question about the outcome of the rolls.

The fourth question is a **statistics** question, since the outcome of the rolls is known, and the probabilities are in question.