erplex
    IB Math AASL

    Skills

    Skill Checklist

    Track your progress across all skills in your objective. Mark your confidence level and identify areas to focus on.

    📖 = included in formula booklet • 🚫 = not in formula booklet

    Descriptive Statistics

    33 Skills Available

    Population & Data

    12 skills
    Population
    SL 4.1

    A population is the entire group of individuals or items you want to study. It can be large (e.g., all IB students worldwide) or small (e.g., a class of 20 students), depending on the research question.

    Types of Variables
    SL 4.1

    There are two types of data to be familiar with:

    1. Categorical Variables

      • Non-numerical categories or labels (e.g., eye color, species).

    2. Quantitative Variables

      • Numerical values that can be measured or counted.

      • Discrete: can only take certain fixed values (e.g., number of students, shoe size).

      • Continuous: can take any value in a range (e.g., height, temperature).

    Sampling Error
    SL 4.1

    Sampling error occurs when there is a difference between a population parameter (e.g., the average IB grade) and the sample statistic (e.g., the average IB grade at one school) used to estimate it.


    This error is random and arises simply because a sample is not the entire population, and will occur even with well-designed sampling methods.

    Measurement Error
    SL 4.1

    Measurement error is the inaccuracy in the data collection process. This could result from faulty instruments, poorly worded questions, or misunderstanding by participants.


    For example, measuring the length of a table by counting how many hand-lengths you can fit on it is likely to yield a pretty rough estimate.

    Coverage Error
    SL 4.1

    Coverage error occurs when some members of the population are not included in the sampling frame or are underrepresented, leading to a biased sample.


    For example, a sample of average IB grades at Swiss schools is unlikely to represent all IB schools worldwide.

    Non-response error
    SL 4.1

    Non-response error happens when selected respondents do not participate or cannot be contacted, possibly creating bias if non-respondents differ systematically from respondents.


    For example, if you send out an internet survey to estimate the percentage of the world population that has internet access, then anyone who would respond "no" will not be able to access your survey.


    Similarly, if you ask students to share their grade on the latest math test, those who are unhappy with their scores are less likely to respond, biasing your sample.

    Random sampling
    SL 4.1

    Random sampling means every member of the population has an equal probability of being chosen. This method reduces selection bias.

    Convenience sampling
    SL 4.1

    Convenience sampling uses subjects who are easiest to reach. It is quick and low-cost but can be highly biased if the sample is not representative.


    For example, sampling the heights of trees on the outskirts of a dense jungle.

    Systematic sampling
    SL 4.1

    Systematic sampling involves selecting members at regular intervals from a list or sequence. For example, sampling every 5th student from an alphabetically sorted list of names.

    Stratified random sampling
    SL 4.1

    Stratified sampling splits the population into subgroups (strata) based on characteristics (e.g., age, gender). A random sample is then taken from each stratum, often in proportion to its size in the population.

    Quota sampling
    SL 4.1

    Quota sampling is very similar to stratified sampling, except the sample taken from each subgroup is not random.
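
As a rough illustration, the sketch below draws a simple random, a systematic and a proportional stratified sample from a made-up list of 20 students, using Python's standard `random` module. The student names, the year-group strata and the function names are all hypothetical, chosen only for this example.

```python
import random

def simple_random_sample(population, k, seed=None):
    """Choose k members so that every member is equally likely to be picked."""
    return random.Random(seed).sample(population, k)

def systematic_sample(population, step, start=0):
    """Take every `step`-th member of the list, beginning at index `start`."""
    return population[start::step]

def stratified_sample(strata, k, seed=None):
    """Randomly sample each stratum in proportion to its size.

    Shares are rounded, so the total can differ slightly from k in general.
    """
    rng = random.Random(seed)
    total = sum(len(members) for members in strata.values())
    sample = []
    for members in strata.values():
        share = round(k * len(members) / total)
        sample.extend(rng.sample(members, share))
    return sample

# Hypothetical population: 20 students, 12 in year 1 and 8 in year 2.
students = [f"student_{i:02d}" for i in range(1, 21)]
by_year = {"year_1": students[:12], "year_2": students[12:]}

print(simple_random_sample(students, 5, seed=1))
print(systematic_sample(students, 5))         # every 5th student, starting from the first
print(stratified_sample(by_year, 5, seed=1))  # 3 from year 1 and 2 from year 2
```

A quota sample would use the same per-stratum shares, but fill them with whichever members are easiest to reach instead of calling `rng.sample`.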


    Measuring Center

    4 skills
    Mean
    SL 4.3

    The mean of a numerical dataset is the average of all the values:

    $\bar{x} = \dfrac{\text{sum of values}}{\text{number of values}}$ 🚫


    The mean is also sometimes denoted μ.


    For example, the mean of [1,3,6,7,8] is

    $\bar{x} = \dfrac{1+3+6+7+8}{5} = \dfrac{25}{5} = 5$

    Mode
    SL 4.3

    The mode of a dataset is the most common value in a dataset.


    For example, the mode of [1,1,2,3,4,4,4,5] is 4, as it appears the most times (3 times).


    If all the values have the same frequency (e.g., [1,2,3]), there is no mode.

    Median
    SL 4.3

    The median of a dataset is the middle value when the values are sorted.


    For example, the median of [5,1,7,3,10] is 5 as when sorted the list is [1,3,5,7,10].


    If a dataset has an even number of values, the median is the average of the middle two. For example, the median of [2,3,4,10] is $\frac{3+4}{2} = 3.5$.


    Mathematically, the median is the

    $\dfrac{n+1}{2}$th 🚫

    value. Notice that if n is even then $\frac{n+1}{2}$ is halfway between two consecutive integers, indicating we need to average their values.
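
The three measures of center can be checked with Python's built-in `statistics` module. This is only a sketch using the example datasets from this section; `median_position` is a hypothetical helper name for the (n+1)/2 rule.

```python
from statistics import mean, median, multimode

# Datasets taken from the worked examples in this section.
print(mean([1, 3, 6, 7, 8]))                # the values summed in the mean example
print(multimode([1, 1, 2, 3, 4, 4, 4, 5]))  # list of the most common value(s)
print(median([5, 1, 7, 3, 10]))             # odd n: the middle value after sorting
print(median([2, 3, 4, 10]))                # even n: average of the middle two

def median_position(n):
    """1-based position of the median in a sorted list of n values."""
    return (n + 1) / 2

print(median_position(5))  # a whole number: take the value at that position
print(median_position(4))  # ends in .5: average the values on either side
```

Note that `multimode([1, 2, 3])` returns all three values because they tie, whereas the convention used here is to say such a dataset has no mode.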

    Constant changes to data
    SL 4.3

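    If we have a dataset with mean x̄ and standard deviation σ, then if we

    • add a constant b to every value, the mean increases by b and the standard deviation does not change

    • scale every value by a positive constant a (e.g., a=2 to double all the values), then both the mean and the standard deviation are scaled by a. The variance, being σ², is therefore scaled by a². (If a is negative, the mean is still multiplied by a, but the standard deviation is multiplied by |a|.)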


    Example

    A dataset has a mean of 5 and a variance of 4. Each item is doubled, and then incremented by 3. Find the new mean and variance.


    An original variance of 4 is a standard deviation of √4=2.


    First we double the items: the mean becomes 10 and the standard deviation becomes 4.


    Then we add 3: the mean becomes 13 and the standard deviation does not change.


    Hence the new mean is 13 and the new variance is $4^2 = 16$.
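
To see the rule in action, the sketch below applies the same transformation to a small made-up dataset. The list `[3, 7]` is an assumption, chosen only because it has mean 5 and (population) variance 4 like the example.

```python
from statistics import mean, pvariance, pstdev

data = [3, 7]  # hypothetical dataset with mean 5 and variance 4
transformed = [2 * x + 3 for x in data]  # double each item, then add 3

print(mean(data), pvariance(data), pstdev(data))
print(mean(transformed), pvariance(transformed), pstdev(transformed))
```

The second line of output shows the mean, variance and standard deviation after the change, which can be compared with the values found by hand above.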


    Measuring Dispersion

    2 skills
    Variance & SD on Calculator (Sx vs σ)
    SL 4.3

    The variance σ² of a dataset measures the spread of data around the mean.


    The standard deviation σ is the square root of the variance. The advantage of the standard deviation is that it has the same units as the original data.


    When you use a calculator to find standard deviation, you will see two values: Sx and σx. We use Sx when the data is a sample of a large population, and σx when the data represents the entire population.


    The difference arises because the spread of a sample about its own mean tends to underestimate the spread of the population. To correct for this, Sx divides by n−1 instead of n, so Sx is larger than σx, and the correction shrinks as the number of items in the sample grows.


    Example 1

    Find the variance of [1,5,6,7,11,−1].


    Using technology we find Sx=4.31 and σx=3.93. Since this is the whole dataset and not a sample, we use the (smaller) value σx=3.93. The variance is the square of this, so σ²≈15.5 (squaring the unrounded calculator value; squaring the rounded 3.93 gives 15.4).


    Example 2

    A scientist samples the length of fish in an aquarium. He measures the length of 3 fish, and finds [21cm,19cm,7cm]. Estimate the variance in the length of fish in the aquarium.


    Using technology we find Sx=7.57 and σx=6.18. Since we have only a sample of the population, we use Sx=7.57 ⇒ s²=57.3 cm².
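
If a calculator is not to hand, Python's `statistics` module gives the same two values: `stdev` corresponds to Sx (dividing by n−1) and `pstdev` to σx (dividing by n). The sketch below simply re-runs the two examples; the variable names are illustrative.

```python
from statistics import pstdev, pvariance, stdev, variance

whole_dataset = [1, 5, 6, 7, 11, -1]  # Example 1: the data is the whole population
print(round(stdev(whole_dataset), 2), round(pstdev(whole_dataset), 2))
print(round(pvariance(whole_dataset), 1))  # population variance

fish_lengths_cm = [21, 19, 7]  # Example 2: a sample from a larger population
print(round(stdev(fish_lengths_cm), 2), round(pstdev(fish_lengths_cm), 2))
print(round(variance(fish_lengths_cm), 1))  # sample variance
```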


    Quartiles and Box & Whisker Plots

    3 skills
    Quartiles
    SL 4.3

    Quartiles are conceptually similar to the median, except that there are three of them: Q1, Q2 and Q3, dividing the sorted dataset into 4 equal-size parts.


    Q2 is the median, dividing the datapoints in two.

    Q1 is halfway between the first value and the median, at position

    $\dfrac{n+1}{4}$ 🚫

    Q3 is halfway between the median and the last value, at position

    $\dfrac{3(n+1)}{4}$ 🚫


    Note: you will not need to find quartiles by hand on IB exams. These examples are for conceptual understanding only.

    Example

    Find the quartiles of the dataset [4,2,2,6,3,7,8,9].


    First we sort the data: [2,2,3,4,6,7,8,9].

    There are n=8 items, so

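    • Q1 is at position $\frac{8+1}{4} = 2.25$ (average of second and third).

    • Q2 (the median) is at position $\frac{8+1}{2} = 4.5$ (average of fourth and fifth).

    • Q3 is at position $\frac{3(8+1)}{4} = 6.75$ (average of sixth and seventh).

    So the quartiles are:

    [2, (2,3), (4,6), (7,8), 9], where the pairs in parentheses are averaged to give Q1, Q2 and Q3 respectively,

    i.e. Q1=2.5, Q2=5 and Q3=7.5.


    More examples:

    [2,2,3,4,6,7,8]: Q1=2 (2nd value), Q2=4 (4th value), Q3=7 (6th value).


    [2,2,3,4,6,7,8,9,10]: Q1=2.5 (average of 2 and 3), Q2=6 (5th value), Q3=8.5 (average of 8 and 9).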
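
The position rule above can be sketched in a few lines of Python. This follows the convention used in these examples (a fractional position means "average the two neighbouring values"); calculators and software packages may use slightly different conventions, and the function names here are hypothetical. It assumes the dataset has at least three values.

```python
def value_at_position(sorted_data, position):
    """Value at a 1-based position; a fractional position averages its two neighbours."""
    lower = int(position)  # 1-based index of the item at or just below the position
    if position == lower:
        return sorted_data[lower - 1]
    return (sorted_data[lower - 1] + sorted_data[lower]) / 2

def quartiles(data):
    """Return (Q1, Q2, Q3) using positions (n+1)/4, (n+1)/2 and 3(n+1)/4."""
    ordered = sorted(data)
    n = len(ordered)
    return tuple(value_at_position(ordered, k * (n + 1) / 4) for k in (1, 2, 3))

print(quartiles([4, 2, 2, 6, 3, 7, 8, 9]))      # n = 8
print(quartiles([2, 2, 3, 4, 6, 7, 8]))         # n = 7
print(quartiles([2, 2, 3, 4, 6, 7, 8, 9, 10]))  # n = 9
```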
    Interquartile Range & Outliers
    SL 4.3

    The interquartile range, denoted IQR, is the difference between the third and first quartile:

    IQR =Q3​−Q1​


    A value x in a dataset is said to be an outlier if x < Q1 − 1.5×IQR or x > Q3 + 1.5×IQR.


    For example, in the dataset [0,7,8,9,10,11,18], the quartiles are Q1=7, Q2=9, Q3=11.


    Hence

    IQR = 11−7 = 4

    Then since 0 < Q1 − 1.5×IQR = 1, 0 is an outlier.

    Similarly, since 18 > Q3 + 1.5×IQR = 17, 18 is an outlier.
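
The outlier test is easy to automate. A minimal sketch, assuming the quartiles have already been found (by calculator or otherwise); `find_outliers` is a hypothetical name.

```python
def find_outliers(data, q1, q3):
    """Return the values lying more than 1.5 x IQR below Q1 or above Q3."""
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    return [x for x in data if x < lower_fence or x > upper_fence]

# Dataset and quartiles from the example above.
print(find_outliers([0, 7, 8, 9, 10, 11, 18], q1=7, q3=11))
```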

    Box & Whisker Plots
    SL 4.2

    A box-and-whisker plot visually summarizes data by splitting it into quarters. The box shows the middle 50% of your data (from Q1 to Q3), and the line inside marks the median. The whiskers extend to show the spread of data, excluding outliers, which are marked with a cross.


    - Minimum: smallest value (left whisker end)

    - Lower Quartile (Q1): median of lower half (25% mark)

    - Median (Q2): middle value of data set

    - Upper Quartile (Q3): median of upper half (75% mark)

    - Maximum: largest value (right whisker end)



    Frequency Tables, Histograms and cumulative frequency diagrams

    5 skills
    Discrete Frequency Tables
    SL 4.2

    Datasets can be represented in frequency tables, with a row containing the values that exist in the data and a row containing the number of times each value appears.


    The mean of frequency data can be calculated using the formula:

    $\bar{x} = \dfrac{\sum_{i=1}^{k} f_i x_i}{n}, \quad n = \sum_{i=1}^{k} f_i$ 📖


    Note that $n = \sum_{i=1}^{k} f_i$ is just the total number of data points.

    Example

    x         | 1 | 2 | 4  | 5
    Frequency | 5 | 8 | 12 | 2

    1. Find x̄ without a calculator.

    $\bar{x} = \dfrac{\sum_{i=1}^{k} f_i x_i}{n} = \dfrac{1\cdot 5 + 2\cdot 8 + 4\cdot 12 + 5\cdot 2}{5+8+12+2} = \dfrac{79}{27} \approx 2.93$


    Using a calculator, find

    1. The standard deviation

      σx=1.33 ⇒ variance is 1.77.

    2. The quartiles

      Q1=2, Q2=4, Q3=4.
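
As a cross-check of the calculator values, the frequency-table formulas can be applied directly in Python. This is only a sketch; the variable names are illustrative.

```python
values = [1, 2, 4, 5]
frequencies = [5, 8, 12, 2]

n = sum(frequencies)  # total number of data points
mean_x = sum(f * x for x, f in zip(values, frequencies)) / n
variance_x = sum(f * (x - mean_x) ** 2 for x, f in zip(values, frequencies)) / n
sd_x = variance_x ** 0.5  # population standard deviation (the calculator's σx)

print(n, round(mean_x, 2), round(sd_x, 2), round(variance_x, 2))
```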

    Grouped Frequency Tables
    SL 4.2

    When data is continuous, we cannot have a column per possible value, as there are infinitely many.


    Instead, we use a grouped frequency table to break up the data into specific intervals.


    If all the intervals have equal size, then the modal class is the interval in which the most values fall.


    We can also estimate the mean from grouped data as if it were a discrete frequency table using the mid-interval values, that is the average of the upper and lower bounds of each interval.


    Example

    Using the dataset [3.1,5.4,5.6,5.9,6.0,6.9], we can fill in the following table:

    Length    | 3≤x<4 | 4≤x<5 | 5≤x<6 | 6≤x<7
    Frequency | 1     | 0     | 3     | 2


    so the modal class is 5≤x<6.


    If we only had this table (and not the actual values), we could estimate the mean using the mid-interval values:

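    Length (mid-interval value) | 3.5 | 4.5 | 5.5 | 6.5
    Frequency                   | 1   | 0   | 3   | 2

    Then

    $\bar{x} \approx \dfrac{\sum_{i=1}^{k} f_i x_i}{n} = \dfrac{1\cdot 3.5 + 0\cdot 4.5 + 3\cdot 5.5 + 2\cdot 6.5}{1+0+3+2} = \dfrac{33}{6} = 5.5$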
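
The same estimate can be reproduced in Python, which also makes it easy to compare against the mean of the raw values. A minimal sketch with illustrative variable names:

```python
intervals = [(3, 4), (4, 5), (5, 6), (6, 7)]  # each tuple is (lower bound, upper bound)
frequencies = [1, 0, 3, 2]

midpoints = [(low + high) / 2 for low, high in intervals]
estimated_mean = sum(f * m for f, m in zip(frequencies, midpoints)) / sum(frequencies)
modal_class = intervals[frequencies.index(max(frequencies))]

raw_data = [3.1, 5.4, 5.6, 5.9, 6.0, 6.9]
true_mean = sum(raw_data) / len(raw_data)

print(midpoints, estimated_mean, modal_class)
print(true_mean)  # close to, but not exactly, the grouped estimate
```

The small gap between the two means is the price of replacing each value by the middle of its interval.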
    Histograms
    SL 4.2

    Grouped frequency tables can also be turned into histograms by drawing rectangles whose bases correspond to the intervals and whose heights correspond to the frequencies. A histogram looks like a bar graph, but the bars touch because the intervals cover a continuous range.


    Example

    The histogram with bars of height 1, 4, 3 and 2 over the four unit-width intervals from 3 to 7 is equivalent to the grouped frequency table

    Length    | 3≤x<4 | 4≤x<5 | 5≤x<6 | 6≤x<7
    Frequency | 1     | 4     | 3     | 2



    We can tell from the histogram that the modal class is 4≤x<5.

    Cumulative frequency graphs and tables
    SL 4.2

    Cumulative frequency graphs are a powerful visual representation of continuous data.


    The value of y at each point x on the curve represents the number of data points less than x.


    We start with a grouped frequency table, and add a row for cumulative frequency, which is the number of items in an interval and all previous (lower) intervals:

    Length               | 3≤x<4 | 4≤x<5 | 5≤x<6  | 6≤x<7
    Frequency            | 3     | 6     | 7      | 3
    Cumulative Frequency | 3     | 3+6=9 | 9+7=16 | 16+3=19



    To plot the diagram, we make a point from each column. The x coordinates are the upper bound of each interval, and the y-coordinates are the cumulative frequency. We then draw a smooth curve through all the points, starting on the x-axis at the lower bound of the first interval, and stopping at the last point.
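
The running totals and the points to plot can be generated with `itertools.accumulate`. This sketch uses the table above; the variable names are illustrative.

```python
from itertools import accumulate

lower_bound_of_first_interval = 3
upper_bounds = [4, 5, 6, 7]
frequencies = [3, 6, 7, 3]

cumulative = list(accumulate(frequencies))  # running total of the frequencies

# The curve starts on the x-axis, then passes through (upper bound, cumulative frequency).
points = [(lower_bound_of_first_interval, 0)] + list(zip(upper_bounds, cumulative))
print(cumulative)
print(points)
```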


    Median, quartiles & percentiles on CF Graphs
    SL 4.2

    Cumulative frequency diagrams can be used to find medians, quartiles, and percentiles.


    In the same way that Q1​ is the value greater than a quarter (25%) of data values, the kth percentile is the value greater than k% of the data values.


    Example

    Since the y-values represent the number of data points with value less than x, and 50% of values are less than the median, the median is the x-value where the curve crosses 0.5× the maximum y-value on the curve.


    Similarly, the quartiles and percentiles are the x-values where the curve crosses:

    • Q1: 0.25× the max

    • Q3: 0.75× the max

    • kth percentile: $\frac{k}{100}$× the max.
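
Reading values off the curve can be approximated numerically. The sketch below assumes the cumulative frequency points from the table in the previous skill (total 19) and joins them with straight lines, whereas a hand-drawn curve is smooth, so the answers are estimates only. `value_at_fraction` is a hypothetical name.

```python
def value_at_fraction(points, fraction):
    """x-value where the cumulative frequency reaches `fraction` of the total.

    Uses straight-line interpolation between consecutive (x, cumulative frequency) points.
    """
    target = fraction * points[-1][1]
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if y0 <= target <= y1 and y1 > y0:
            return x0 + (target - y0) * (x1 - x0) / (y1 - y0)
    raise ValueError("fraction must be between 0 and 1")

points = [(3, 0), (4, 3), (5, 9), (6, 16), (7, 19)]

print(value_at_fraction(points, 0.25))  # estimate of Q1
print(value_at_fraction(points, 0.50))  # estimate of the median
print(value_at_fraction(points, 0.75))  # estimate of Q3
print(value_at_fraction(points, 0.90))  # estimate of the 90th percentile
```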


    Linear Regression

    7 skills
    Regression line y on x
    SL 4.4

    Linear regression is a statistical method used to model the relationship between two variables when data is given as pairs of points (x,y). We fit a straight line (called the regression line) that minimizes the sum of the squared vertical distances from the points to the line.


    The general equation of the regression line is:

    y=ax+b

    where a is the slope and b is the y-intercept.


    The values of a and b can be found using a calculator.


    Example

    x | 3 | 4  | 5  | 6
    y | 8 | 12 | 16 | 17

    Using a calculator, we find a=3.1 and b=−0.7. So the regression line is y=3.1x−0.7.
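
For the curious, the calculator's result can be reproduced with the standard least-squares formulas, which are not given in this section and are not needed for exams. The sketch below is one way to write them; `regression_y_on_x` is a hypothetical name.

```python
def regression_y_on_x(xs, ys):
    """Return (a, b) for the least-squares line y = a*x + b."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    a = s_xy / s_xx        # slope
    b = y_bar - a * x_bar  # intercept: the line passes through (x_bar, y_bar)
    return a, b

a, b = regression_y_on_x([3, 4, 5, 6], [8, 12, 16, 17])
print(round(a, 3), round(b, 3))
```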


    Pearson's Product-Moment Correlation Coefficient
    SL 4.4

    Pearson's product-moment correlation coefficient, denoted by r, measures the strength and direction of a linear relationship between two numerical variables x and y. Its value always lies between −1 and +1:

    • r=+1: perfect positive linear relationship

    • r=−1: perfect negative linear relationship

    • r=0: no linear relationship

    A positive value means y generally increases as x increases; a negative value means y generally decreases as x increases. The closer r is to ±1, the stronger the linear relationship.


    Pearson’s coefficient is calculated by:

    $r = \dfrac{\sum (x-\bar{x})(y-\bar{y})}{\sqrt{\sum (x-\bar{x})^2 \sum (y-\bar{y})^2}}$

    Note: you do not need to use or know this formula.


    On exams, r is determined using technology.


    Example 1

    Find r for the following data, and interpret its significance.

    x | 3  | 4   | 5   | 6
    y | −8 | −12 | −16 | −17

    Using a calculator, we find r=−0.973. This indicates a very strong negative linear relationship.


    Example 2

    Find r for the following data, and interpret its significance.

    x | 3  | 4  | 5   | 6
    y | −8 | 12 | −16 | 5

    Using a calculator, we find r=0.11. Although r is positive, it is so close to 0 that there is almost no linear correlation.
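
Although the formula does not need to be memorized, applying it directly is a useful check on the calculator. A minimal sketch, with `pearson_r` as a hypothetical function name:

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson's product-moment correlation coefficient for paired data."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / sqrt(s_xx * s_yy)

r_example_1 = pearson_r([3, 4, 5, 6], [-8, -12, -16, -17])
r_example_2 = pearson_r([3, 4, 5, 6], [-8, 12, -16, 5])
print(round(r_example_1, 3), round(r_example_2, 2))
```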

    Predicting y from x
    SL 4.4

    Once we have a regression line y=ax+b, we can use it to predict y by plugging in a given value of x.


    For example, suppose we have data on the body temperature of babies in the first 12 months after birth. Let's say the regression line relating body temperature T (in °C) to age M (in months) is:

    T=−0.05M+37.8


    We can then predict the body temperature of a 10 month old baby by plugging in M=10:

    T=−0.05⋅10+37.8=37.3°C
    Limitations of predicting x from y
    SL 4.4

    It is also possible to use a regression line y=ax+b to predict x using a given value of y:

    $x = \dfrac{y-b}{a}$ 🚫

    But this is not always a reliable process. The best fit line is chosen (via regression) to minimize the differences between the real and predicted y-values, not the x-values, so the predicted x could be far from the true value.

    Danger of extrapolation
    SL 4.4

    When using a regression line to predict y from x, we need to be aware of the danger of extrapolation. This occurs when we try to predict y for a value of x far outside the range of x values in our data. For such an x, we cannot trust that the relationship is the same.


    In our earlier example, we had data about the body temperature of babies in the first 12 months after birth. We had the regression equation:

    T=−0.05M+37.8


    It would be a foolish extrapolation to use this formula to predict the body temperature of an 80 year old... 80 years is 960 months, so

    T=−0.05⋅960+37.8=−10.2°C

    (a block of ice).
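
One way to guard against this in practice is to have the prediction report whether the input lies inside the range of the data. The sketch below uses the baby-temperature line from the example; the function name and the 0 to 12 month range check are illustrative assumptions.

```python
def predict_temperature(months, data_range=(0, 12)):
    """Return (predicted temperature in °C, whether `months` is inside the data range)."""
    temperature = -0.05 * months + 37.8
    within_data = data_range[0] <= months <= data_range[1]
    return temperature, within_data

for age_in_months in (10, 960):
    temperature, within_data = predict_temperature(age_in_months)
    note = "interpolation" if within_data else "extrapolation, do not trust"
    print(age_in_months, round(temperature, 1), note)
```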

    Plotting approximate best fit line
    SL 4.4

    Best fit lines can also be drawn approximately by eye. We start by finding the average x and y, giving the point (x̄,ȳ). We then take a ruler, place it on this point, and adjust the slope until we find a reasonable best fit line.


    Regression line x on y
    SL 4.10

    In the same way that we can fit a straight line minimizing the sum of the squared vertical distances from the points (x,y), we can fit a straight line minimizing the sum of the squared horizontal distances. This is called an x on y regression line.


    With this line we can make reliable predictions of x given y, so long as we are not extrapolating.


    Example

    Find the x on y regression line for the following data, and hence predict x when y=14.

    x | 3 | 4  | 5  | 6
    y | 8 | 12 | 16 | 17

    Using a calculator (putting the y-values in the Xlist and x-values in the Ylist) we find

    x=0.305y+0.453


    Plugging in y=14, we find x=4.72 (or 4.73 if you use exact values).
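
The "swap the lists" trick corresponds to swapping the roles of x and y in the least-squares formulas. The sketch below, which assumes those standard formulas and uses hypothetical names, also shows that the answer differs from simply rearranging the y on x line.

```python
def regression_x_on_y(xs, ys):
    """Return (c, d) for the least-squares line x = c*y + d."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    c = s_xy / s_yy        # slope, with y as the independent variable
    d = x_bar - c * y_bar  # intercept
    return c, d

c, d = regression_x_on_y([3, 4, 5, 6], [8, 12, 16, 17])
x_from_x_on_y = c * 14 + d

# For comparison: rearranging the y on x line y = 3.1x - 0.7 found earlier.
x_from_rearranging = (14 + 0.7) / 3.1

print(round(c, 3), round(d, 3))
print(round(x_from_x_on_y, 2), round(x_from_rearranging, 2))
```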
