Designing good surveys, measurement reliability, and test validity.
If you've taken a survey before, you know that they can be very poorly designed. Below are some examples of terrible survey questions, each followed by what is wrong with it:
How much do you love Perplex?
Biased question: the wording assumes a positive answer, and people don't like to be too critical
How often do your parents argue about money at home?
Too personal
Tell me about your mental health.
Unstructured! I don't know what you're asking
How many hours do you study each week?
Answer choices: Never / 3 / Quite a lot / Yes
Inconsistent answer choices
When we use statistics to come to a conclusion, the reliability of that conclusion depends on how consistent our measurements of the variables are.
To determine whether a measurement is reliable, we ask ourselves whether we would get similar results if we redid the same experiment with the same sample.
An example of an unreliable test is a lab analysis of cholesterol level. The same person could have their blood tested just days apart and get significantly different results. The measurement can be made more reliable in a variety of ways, but at its core unreliability is a measure of how much the result would vary if we repeated the measurement many times.
There are two ways to measure reliability that the IB wants you to know:
Test-retest reliability: this just means redoing the measurement some time later and comparing how much the results change. In some cases (like the cholesterol example) this is a great approach, but in others it works less well.
Imagine measuring drivers' knowledge of road laws by giving them a test. If we re-test them a week later, they would likely have learned from the test and might perform better.
Parallel forms reliability: this approach is specifically used to assess humans. It involves giving participants two similar versions of the same test and seeing how close their performance on one is to their performance on the other.
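Both approaches above (re-testing later, or giving two versions of a test) come down to comparing two sets of scores from the same people. One common way to put a number on "how close" is the Pearson correlation coefficient; using it here is an assumption for illustration, not something this page prescribes. The scores below are made up, and `pearson`, `version_a` and `version_b` are hypothetical names.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of products of deviations from the means
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Square roots of the sums of squared deviations
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


# Hypothetical road-law test scores (out of 20) for six drivers,
# each of whom sat two similar versions of the test.
version_a = [12, 15, 9, 18, 14, 11]
version_b = [13, 14, 10, 17, 15, 10]

r = pearson(version_a, version_b)
print(f"correlation between the two versions: {r:.2f}")
```

A value close to 1 would suggest the two sets of scores agree well, so the measurement looks reliable; a value near 0 would suggest the scores vary a lot between sittings or versions.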
Statistical validity describes how accurately the thing we measured represents what we're interested in.
For example, consider a military fitness test. A test that only measures one trait, like how fast you can run a 5 km race, simply doesn't test all the relevant types of physical fitness. This kind of test has low content validity, because it only measures one small aspect of the domain we care about.
The second kind of validity the IB expects you to know is called criterion validity. The criterion is the thing you actually care about, and the criterion validity measures how well a test predicts the criterion.
A real-world example is a polygraph (lie detector test). It's supposed to measure whether a person is lying, but in reality it only measures how nervous they are, so it has low criterion validity.
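As a rough sketch of what "how well a test predicts the criterion" could look like in numbers, the snippet below counts how often a polygraph result agrees with the truth. The records are invented purely for illustration, and `agreement_rate` is a hypothetical helper; a simple agreement rate is only one possible way to summarise criterion validity.

```python
# Hypothetical records: (polygraph flagged the person as lying, person was actually lying)
records = [
    (True, True),
    (True, False),   # nervous but honest
    (False, False),
    (True, False),   # nervous but honest
    (False, True),   # calm liar
    (True, True),
    (False, False),
    (True, False),   # nervous but honest
]


def agreement_rate(records):
    """Fraction of cases where the test result matches the criterion."""
    matches = sum(1 for predicted, actual in records if predicted == actual)
    return matches / len(records)


rate = agreement_rate(records)
print(f"polygraph agrees with the truth in {rate:.0%} of cases")
```

An agreement rate no better than guessing would indicate that the test says very little about the criterion we actually care about.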