Descriptive Statistics | IB Math AAHL | Perplex

Descriptive Statistics

Skill Checklist

Track your progress across all skills in your objective. Mark your confidence level and identify areas to focus on.

34 Skills Available

📖 = included in formula booklet • 🚫 = not in formula booklet

Population & Data

12 skills

Population

SL 4.1

A population is the entire group of individuals or items you want to study. It can be large (e.g., all IB students worldwide) or small (e.g., a class of 20 students), depending on the research question. For large populations, we often study a portion of the population, or sample, to make inferences about the population.

Types of Variables

SL 4.1

There are two types of data to be familiar with:

Categorical Variables
- Non-numerical categories or labels (e.g., eye color, species).
Quantitative Variables
- Numerical values that can be measured or counted.
- Discrete: can only take certain fixed values (e.g., number of students, shoe size).
- Continuous: can take any value in a range (e.g., height, temperature).

Sampling Error

SL 4.1

Sampling error occurs when there is a difference between a population parameter (e.g., the average IB grade) and the sample statistic (e.g., the average IB grade at one school) used to estimate it.

This error is random and arises simply because a sample is not the entire population, and will occur even with well-designed sampling methods.

Measurement Error

SL 4.1

Measurement error is the inaccuracy in the data collection process. This could result from faulty instruments, poorly worded questions, or misunderstanding by participants.

Coverage Error

SL 4.1

Coverage error occurs when some members of the population are not included in the sampling frame or are underrepresented, leading to a biased sample.

For example, a sample of average IB grades at Swiss schools is unlikely to represent all IB schools worldwide.

Non-response error

SL 4.1

Non-response error happens when selected respondents do not participate or cannot be contacted, possibly creating bias if non-respondents differ systematically from respondents.

Random sampling

SL 4.1

Random sampling means every member of the population has an equal probability of being chosen. This method reduces selection bias.

Convenience sampling

SL 4.1

Convenience sampling uses subjects who are easiest to reach. It is quick and low-cost but can be highly biased if the sample is not representative.

For example, estimating the heights of trees in a dense jungle by measuring only the trees on its outskirts, because they are the easiest to reach.

Systematic sampling

SL 4.1

Systematic sampling involves selecting members at regular intervals from a list or sequence. For example, sampling every 5th student from an alphabetically sorted list of names.
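
As a rough illustration of the "every 5th student" example, the selection can be sketched in Python. The function name `systematic_sample` and the class list are made up for this sketch.

```python
def systematic_sample(items, step, start=0):
    """Return every `step`-th item, beginning at index `start` (0-based)."""
    return items[start::step]

# Hypothetical alphabetically sorted class list of 20 names.
names = [f"student_{i:02d}" for i in range(1, 21)]

# start=4 picks the 5th name first, then every 5th name after it.
sample = systematic_sample(names, step=5, start=4)
print(sample)
```

In practice the starting position is often chosen at random within the first interval, so that the list order does not fix who gets picked.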

Stratified random sampling

SL 4.1

Stratified sampling splits the population into subgroups (strata) based on characteristics (e.g., age, gender). A random sample is then taken from each stratum, often in proportion to its size in the population.
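
A minimal sketch of proportional stratified sampling is given below. The strata names, their sizes, and the helper `stratified_sample` are assumptions chosen for illustration. Rounding each stratum's share can make the total differ slightly from the requested size when the proportions are not whole numbers.

```python
import random

def stratified_sample(strata, total, seed=None):
    """Randomly sample from each stratum in proportion to its size."""
    rng = random.Random(seed)
    population_size = sum(len(members) for members in strata.values())
    sample = {}
    for name, members in strata.items():
        # Share of the sample that this stratum should contribute.
        k = round(total * len(members) / population_size)
        sample[name] = rng.sample(members, k)
    return sample

# Hypothetical population: 60 first-year and 40 second-year students.
strata = {
    "year_1": list(range(0, 60)),
    "year_2": list(range(60, 100)),
}
result = stratified_sample(strata, total=10, seed=0)
print({name: len(chosen) for name, chosen in result.items()})
```

Replacing `rng.sample` with a non-random choice (for example, the first `k` members) would turn this into a quota sample.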

Quota sampling

SL 4.1

Quota sampling is very similar to stratified sampling, except the sample taken from each subgroup is not random.

Identifying and Removing Outliers

SL 4.1

Outliers in data are responses that are much higher or lower than the rest of the data. Because they are such unusual pieces of data, we often check whether outlier data points are the result of an error.

If they are the product of some error we may remove outliers, but we should not remove all of them because many are real data points.

Measuring Center

3 skills

Mean

SL 4.3

The mean of a numerical dataset is the average of all the values:

$\bar{x} = \dfrac{\text{sum of values}}{\text{number of values}}$ 🚫

The mean is also sometimes denoted μ, particularly when the data is an entire population.

Median

SL 4.3

The median of a dataset is the middle value when the values are sorted. If a dataset has an even number of values, the median is the average of the middle two.

Mathematically, the median is the value at position

$\dfrac{n+1}{2}$ 🚫

in the sorted data. Notice that if n is even, then $\frac{n+1}{2}$ is halfway between two consecutive integers, indicating that we need to average the values at those two positions.
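
The position rule can be checked with a short Python sketch. The function name `median_by_position` is hypothetical; the standard library's `statistics.median` gives the same results.

```python
import math

def median_by_position(values):
    """Median via the (n + 1) / 2 position rule (positions count from 1)."""
    data = sorted(values)
    n = len(data)
    position = (n + 1) / 2
    # For odd n the position is a whole number, so both picks coincide.
    # For even n it ends in .5, so we average the two neighbouring values.
    lower = data[math.floor(position) - 1]
    upper = data[math.ceil(position) - 1]
    return (lower + upper) / 2

print(median_by_position([3, 1, 2]))      # odd number of values
print(median_by_position([4, 1, 3, 2]))   # even number of values
```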

Mode

SL 4.3

The mode of a dataset is the most common value in a dataset.

If all the values have the same frequency (e.g., [1,2,3]), there is no mode. If several values, but not all, share the highest frequency, then there are multiple modes (e.g., [1,2,2,3,3] has modes 2 and 3).
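
The rules above can be sketched as follows. The function name `modes` is made up, and treating a dataset with a single distinct value as having that value as its mode is an assumption of this sketch.

```python
from collections import Counter

def modes(values):
    """Return the list of modes, or an empty list when there is no mode."""
    counts = Counter(values)
    highest = max(counts.values())
    top = sorted(v for v, c in counts.items() if c == highest)
    # Every distinct value shares the same frequency: no mode.
    # (Assumption: a dataset with only one distinct value keeps that value.)
    if len(counts) > 1 and len(top) == len(counts):
        return []
    return top

print(modes([1, 2, 3]))
print(modes([1, 2, 2, 3, 3]))
```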

Quartiles and Box & Whisker Plots

4 skills

Range

SL 4.3

Range is the difference between a dataset's minimum and maximum values.

Though range may give a sense of the dispersion of a set, outliers will always have a strong effect on range since they pull the minimum or maximum values far from the rest of the data.

Quartiles

SL 4.3

Quartiles are conceptually similar to the median, except that there are three of them: Q1, Q2 and Q3, which divide the sorted dataset into four parts of equal size.

Q2 is the median, dividing the datapoints in two.

Q1 is halfway between the first value and the median, at position

$\dfrac{n+1}{4}$ 🚫

Q3 is halfway between the median and the last value, at position

$\dfrac{3(n+1)}{4}$ 🚫

Note: you will not need to find quartiles by hand on IB exams.
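
For reference, here is a sketch of the position formulas in Python. Handling fractional positions by linear interpolation is an assumption of this sketch, and the function names are hypothetical. Calculators may use a slightly different convention (such as taking the median of each half), so their output can differ a little for some datasets.

```python
def value_at_position(sorted_data, position):
    """Value at a 1-based position; fractional positions are interpolated."""
    whole = int(position)
    fraction = position - whole
    if fraction == 0:
        return sorted_data[whole - 1]
    below = sorted_data[whole - 1]
    above = sorted_data[whole]
    return below + fraction * (above - below)

def quartiles(values):
    """Return (Q1, Q2, Q3) using positions k(n + 1) / 4. Assumes n >= 3."""
    data = sorted(values)
    n = len(data)
    return tuple(value_at_position(data, k * (n + 1) / 4) for k in (1, 2, 3))

# Seven values: the positions are 2, 4 and 6.
print(quartiles([1, 3, 5, 7, 9, 11, 13]))
```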

Interquartile Range & Outliers

SL 4.3

The interquartile range, denoted IQR, is the difference between the third and first quartile:

IQR = Q3 − Q1

A value x in a dataset is said to be an outlier if x < Q1 − 1.5 × IQR or x > Q3 + 1.5 × IQR.
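
A small Python sketch of the outlier test, using the lower fence Q1 − 1.5 × IQR and the upper fence Q3 + 1.5 × IQR. The function name and the sample data are invented for illustration, and the quartiles are passed in rather than computed.

```python
def find_outliers(values, q1, q3):
    """Return the values lying outside the 1.5 x IQR fences."""
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    return [x for x in values if x < lower_fence or x > upper_fence]

# Hypothetical data with Q1 = 3 and Q3 = 11, so IQR = 8.
data = [1, 3, 5, 7, 9, 11, 40]
print(find_outliers(data, q1=3, q3=11))
```

Here the fences are −9 and 23, so only 40 is flagged.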

Box & Whisker Plots

SL 4.2

A box-and-whisker plot visually summarizes data by splitting it into quarters. The box shows the middle 50% of your data (from Q1 to Q3), and the line inside marks the median. The whiskers extend to show the spread of data, excluding outliers, which are marked with a cross.

- Minimum: smallest value (left whisker end)

- Lower Quartile (Q1): median of lower half (25% mark)

- Median (Q2): middle value of data set

- Upper Quartile (Q3): median of upper half (75% mark)

- Maximum: largest value (right whisker end)

Standard Deviation and Variance

3 skills

Variance & SD on Calculator (Sx vs σ)

SL 4.3

The variance σ² of a dataset measures the spread of data around the mean.

The standard deviation σ is the square root of the variance. The advantage of the standard deviation is that it has the same units as the original data.

When you use a calculator to find standard deviation:

Enter your data into L1 using STAT > EDIT. Then, use STAT > CALC > 1-Var Stats and enter L1 as your list by clicking 2ND then 1.

You will see two values: Sx and σx. We use Sx when the data is a sample from a larger population, and σx when the data represents the entire population. Sx is slightly larger than σx because it divides by n − 1 instead of n, which corrects for the fact that the spread of a sample around its own mean tends to underestimate the spread of the population.
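
The same two quantities are available in Python's standard `statistics` module, which can be a handy cross-check away from the calculator. The data values below are made up for illustration.

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical dataset

sigma_x = statistics.pstdev(data)  # population standard deviation (divides by n)
s_x = statistics.stdev(data)       # sample standard deviation (divides by n - 1)

print(sigma_x, s_x)
```

For any dataset with at least two different values, `s_x` comes out larger than `sigma_x`.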

Finding variance by hand

AHL 4.14

The mathematical definition of variance, which the calculator is using under the hood, is

$\sigma^2 = \dfrac{\sum_{i=1}^{k} f_i (x_i - \mu)^2}{n} = \dfrac{\sum_{i=1}^{k} f_i x_i^2}{n} - \mu^2$ 📖

Here, $x_i$ are the distinct values in the dataset, $f_i$ is the frequency of each value, and $n = \sum_{i=1}^{k} f_i$ is the total number of data points.

The standard deviation is thus simply the square root of this:

$\sigma = \sqrt{\dfrac{\sum_{i=1}^{k} f_i (x_i - \mu)^2}{n}}$ 📖
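
As a sanity check, both forms of the variance formula can be evaluated on a small frequency table and compared. The values, frequencies, and function name below are assumptions for illustration.

```python
import math

def variance_both_ways(values, freqs):
    """Return the variance from the definition and from the shortcut form."""
    n = sum(freqs)
    mu = sum(f * x for x, f in zip(values, freqs)) / n
    by_definition = sum(f * (x - mu) ** 2 for x, f in zip(values, freqs)) / n
    by_shortcut = sum(f * x ** 2 for x, f in zip(values, freqs)) / n - mu ** 2
    return by_definition, by_shortcut

# Hypothetical table: the value 1 once, 2 twice, 3 once.
definition, shortcut = variance_both_ways([1, 2, 3], [1, 2, 1])
sigma = math.sqrt(definition)
print(definition, shortcut, sigma)
```

Both forms give a variance of 0.5 here, since the mean is 2 and the squared deviations are 1, 0, 0 and 1.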

Constant changes to data

SL 4.3

If we have a dataset with mean $\bar{x}$ and standard deviation σ, then if we

- add a constant b to every value, the mean increases by b and the standard deviation does not change;
- multiply every value by a, the mean is multiplied by a and the standard deviation is multiplied by |a|.
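
These effects are easy to confirm numerically. The sketch below uses made-up data, a shift of +3, and a scale factor of −2; the negative factor is chosen to show that the standard deviation is multiplied by the size of the factor and so stays non-negative.

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]       # hypothetical dataset: mean 5, sigma 2
shifted = [x + 3 for x in data]       # add a constant b = 3
scaled = [-2 * x for x in data]       # multiply by a = -2

print(statistics.mean(shifted), statistics.pstdev(shifted))
print(statistics.mean(scaled), statistics.pstdev(scaled))
```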

Frequency Tables, Histograms and cumulative frequency diagrams

5 skills

Discrete Frequency Tables

SL 4.2

Datasets can be represented in frequency tables, with a row containing the values that exist in the data and a row containing the frequency, or number of times each value appears.

The mean of frequency data can be calculated using the formula:

$\bar{x} = \dfrac{\sum_{i=1}^{k} f_i x_i}{n}, \quad n = \sum_{i=1}^{k} f_i$ 📖

Note that $f_i$ is the frequency of the value $x_i$, so $n = \sum_{i=1}^{k} f_i$ is just the total number of data points.
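
A short sketch of the frequency-table mean in Python follows. The table values and the function name are invented for illustration.

```python
def mean_from_frequencies(values, freqs):
    """Mean of a discrete frequency table: sum of f_i * x_i divided by n."""
    n = sum(freqs)  # total number of data points
    return sum(f * x for x, f in zip(values, freqs)) / n

# Hypothetical table: number of siblings (0 to 3) reported by 10 students.
siblings = [0, 1, 2, 3]
frequency = [2, 5, 2, 1]
print(mean_from_frequencies(siblings, frequency))
```

Here the weighted sum is 0 + 5 + 4 + 3 = 12 over n = 10 data points.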

Grouped Frequency Tables

SL 4.2

When data is continuous, we cannot have a column per possible value, as there are infinitely many.

Instead, we use a grouped frequency table to break up the data into specific intervals.

If all the intervals have equal size, then the modal class is the interval in which the most values fall.

We can also estimate the mean from grouped data by treating it as a discrete frequency table of mid-interval values, where each mid-interval value is the average of the lower and upper bounds of its interval.
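
The mid-interval estimate and the modal class can be sketched as below. The intervals and frequencies are illustrative values, and all intervals have equal width so the modal class is simply the one with the highest frequency.

```python
# Each interval is (lower bound, upper bound); values are illustrative.
intervals = [(3, 4), (4, 5), (5, 6), (6, 7)]
frequencies = [3, 6, 7, 3]

# Mid-interval value: average of the lower and upper bounds.
midpoints = [(low + high) / 2 for low, high in intervals]

n = sum(frequencies)
estimated_mean = sum(f * m for f, m in zip(frequencies, midpoints)) / n

# Interval containing the most values (valid because the widths are equal).
modal_class = intervals[frequencies.index(max(frequencies))]

print(round(estimated_mean, 3), modal_class)
```

The result is only an estimate, because the exact values inside each interval are unknown.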

Histograms

SL 4.2

Grouped frequency tables can also be turned into histograms by drawing rectangles whose bases correspond to the intervals and whose heights correspond to the frequencies. Unlike the bars of a bar chart, the rectangles touch, because the intervals cover a continuous range of values.

Cumulative frequency graphs and tables

SL 4.2

Cumulative frequency graphs are a powerful visual representation of continuous data.

The value of y at each point x on the curve represents the number of data points less than x.

We start with a grouped frequency table, and add a row for cumulative frequency, which is the number of items in an interval and all previous (lower) intervals. To plot the diagram, we make a point from each column. The x-coordinates are the upper bound of each interval, and the y-coordinates are the cumulative frequency.

| Length | 3≤x<4 | 4≤x<5 | 5≤x<6 | 6≤x<7 |
| --- | --- | --- | --- | --- |
| Frequency | 3 | 6 | 7 | 3 |
| Cumulative Frequency | 3 | 9 | 16 | 19 |
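
The cumulative frequency row and the points to plot can be reproduced from the table with a few lines of Python. The variable names are chosen for this sketch.

```python
from itertools import accumulate

upper_bounds = [4, 5, 6, 7]   # upper bound of each length interval
frequencies = [3, 6, 7, 3]

# Running total of the frequencies gives the cumulative frequency row.
cumulative = list(accumulate(frequencies))

# One point per column: (upper bound, cumulative frequency).
points = list(zip(upper_bounds, cumulative))
print(cumulative)
print(points)
```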

Median, quartiles & percentiles on CF Graphs

SL 4.2

Cumulative frequency diagrams can be used to find medians, quartiles, and percentiles.

In the same way that the first quartile, Q1, is the value greater than a quarter (25%) of data values, the kth percentile is the value greater than k% of the data values.

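To read a value off the graph, find the required cumulative frequency on the y-axis, go across to the curve, and read down to the x-axis. With n the total (maximum) cumulative frequency, the required heights are:

- Q1: 0.25 × n
- Median: 0.5 × n
- Q3: 0.75 × n
- kth percentile: $\frac{k}{100}$ × n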
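
Reading a percentile off the graph can be imitated in code. This sketch assumes the plotted points are joined by straight line segments (a hand-drawn curve is usually smooth, so readings may differ slightly) and that the graph starts at the lower bound of the first interval with cumulative frequency 0. The points are illustrative and the function names are hypothetical.

```python
def value_at_cumulative_frequency(points, target):
    """Find x where the piecewise-linear cumulative frequency graph reaches target."""
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if y0 <= target <= y1:
            if y1 == y0:
                return x0
            # Linear interpolation along this segment.
            return x0 + (target - y0) * (x1 - x0) / (y1 - y0)
    raise ValueError("target is outside the range of the graph")

def percentile(points, k):
    """kth percentile: go across at k/100 of the total, then read down."""
    total = points[-1][1]
    return value_at_cumulative_frequency(points, k / 100 * total)

# (upper bound, cumulative frequency), preceded by the starting point (3, 0).
cf_points = [(3, 0), (4, 3), (5, 9), (6, 16), (7, 19)]
print(percentile(cf_points, 25), percentile(cf_points, 50), percentile(cf_points, 75))
```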

Linear Regression

7 skills

Plotting approximate best fit line

SL 4.4

Best fit lines can be drawn approximately by eye. We start by finding the mean of the x-values and the mean of the y-values, giving the point $(\bar{x}, \bar{y})$. We then place a ruler on this point and adjust its slope until we find a reasonable best fit line.

Regression line y on x

SL 4.4

Linear regression is a statistical method used to model the relationship between two variables when data is given as pairs of points (x,y). We fit a straight line (called the regression line) that minimizes the sum of the squared vertical distances from the points to the line.

The general equation of the regression line is:

y=ax+b

where a is the slope and b is the y-intercept.

The values of a and b can be found using a calculator:

Use STAT > EDIT to enter the x- and y-values into L1 and L2.
Then press STAT, arrow right to the CALC menu, and select 4:LinReg(ax+b).
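
For readers who want to see what a least-squares fit computes, here is a minimal sketch using the standard formulas a = Sxy / Sxx and b = ȳ − a·x̄. These formulas are not stated in the notes above, and the data and function name are made up for illustration. The resulting line always passes through the mean point (x̄, ȳ).

```python
def regression_y_on_x(xs, ys):
    """Least-squares slope a and intercept b for the line y = a*x + b."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    a = s_xy / s_xx
    b = y_bar - a * x_bar
    return a, b

# Hypothetical paired data, as would be entered into L1 and L2.
x_values = [1, 2, 3, 4, 5]
y_values = [2, 4, 5, 4, 5]
a, b = regression_y_on_x(x_values, y_values)
print(round(a, 3), round(b, 3))
```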

Pearson's Product-Moment Correlation Coefficient

SL 4.4

Pearson's product-moment correlation coefficient, denoted by r, measures the strength and direction of a linear relationship between two numerical variables x and y. Its value always lies between −1 and +1:

r=+1: perfect positive linear relationship
r=−1: perfect negative linear relationship
r=0: no linear relationship

A positive value means y generally increases as x increases; a negative value means y generally decreases as x increases. The closer r is to ±1, the stronger the linear relationship.

If you press MODE, scroll to STAT DIAGNOSTICS, highlight ON, and press ENTER, then any time you perform a linear regression, the calculator will provide Pearson's coefficient in addition to the regression line.
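
The coefficient can also be computed directly. The sketch below uses the standard formula r = Sxy / √(Sxx · Syy), which is not given in the notes above; the data and function name are invented for illustration.

```python
import math

def pearson_r(xs, ys):
    """Pearson's product-moment correlation coefficient."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)

r = pearson_r([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
r_perfect = pearson_r([1, 2, 3], [10, 8, 6])  # points lie exactly on a falling line
print(round(r, 3), round(r_perfect, 3))
```

The first dataset gives a moderately strong positive value, while the second gives −1 because its points lie exactly on a line with negative slope.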

Predicting y from x

SL 4.4

Once we have a regression line y=ax+b, we can use it to predict y by plugging in a value of x.

Danger of extrapolation

SL 4.4

When using a regression line to predict y from x, we need to be aware of the danger of extrapolation. This occurs when we try to predict y for a value of x far outside the range of x values in our data. For such an x, we cannot trust that the relationship is the same.
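
One way to keep this danger in mind is to check whether x lies within the observed range before trusting a prediction. The sketch below flags anything outside the data's range, which is a stricter test than "far outside"; the coefficients, data, and function name are assumptions for illustration.

```python
def predict_y(a, b, x, observed_x):
    """Predict y = a*x + b and report whether x is inside the observed range."""
    within_range = min(observed_x) <= x <= max(observed_x)
    return a * x + b, within_range

observed_x = [1, 2, 3, 4, 5]  # hypothetical x-values used to fit the line
a, b = 0.6, 2.2               # hypothetical regression coefficients

inside_prediction, inside_ok = predict_y(a, b, 3, observed_x)
outside_prediction, outside_ok = predict_y(a, b, 50, observed_x)
print(inside_prediction, inside_ok)
print(outside_prediction, outside_ok)
```

The second prediction is flagged because x = 50 lies well beyond the largest observed x-value.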

Limitations of predicting x from y

SL 4.4

While it is possible to use a regression line y=ax+b to predict x with

$x = \dfrac{y - b}{a}$,

this is not a reliable process. The best fit line is chosen to minimize the differences between the real y's and the predicted y's, so the differences between real and predicted values of x may be much larger.

Regression line x on y

SL 4.10

In the same way that we can plot a straight line minimizing the vertical distances from points (x,y), we can plot a straight line minimizing the horizontal distances. This is called an x on y regression line. We calculate an x on y regression line by switching our x and y lists while using LinReg(ax+b).

With this line we can make reliable predictions of x given y, so long as we are not extrapolating.
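
The list-swapping idea can be sketched in Python, which also shows that the x on y line generally differs from rearranging the y on x line. The data and the helper `least_squares` are made up for illustration, and the helper uses the standard least-squares formulas rather than the calculator.

```python
def least_squares(inputs, outputs):
    """Slope and intercept of the least-squares line: output = slope*input + intercept."""
    n = len(inputs)
    in_bar = sum(inputs) / n
    out_bar = sum(outputs) / n
    s_io = sum((i - in_bar) * (o - out_bar) for i, o in zip(inputs, outputs))
    s_ii = sum((i - in_bar) ** 2 for i in inputs)
    slope = s_io / s_ii
    return slope, out_bar - slope * in_bar

x_values = [1, 2, 3, 4, 5]  # hypothetical data
y_values = [2, 4, 5, 4, 5]

a, b = least_squares(x_values, y_values)  # y on x: y = a*x + b
c, d = least_squares(y_values, x_values)  # x on y: lists swapped, x = c*y + d

y_given = 5
x_from_x_on_y = c * y_given + d           # prediction from the x on y line
x_from_rearranging = (y_given - b) / a    # rearranged y on x line
print(round(x_from_x_on_y, 3), round(x_from_rearranging, 3))
```

For this data the two approaches give noticeably different predictions for the same y, which is why the x on y line is preferred when predicting x.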
