Example: Right-tailed test
An engineer measured the Brinell hardness of 25 pieces of ductile iron that were subcritically annealed. The resulting data were:
The engineer hypothesized that the mean Brinell hardness of all such ductile iron pieces is greater than 170. Therefore, he was interested in testing the hypotheses:
H0 : μ = 170
HA: μ > 170
The engineer entered his data into Minitab and requested that the "one-sample t-test" be conducted for the above hypotheses. He obtained the following output:
The output tells us that the average Brinell hardness of the n = 25 pieces of ductile iron was 172.52 with a standard deviation of 10.31. (The standard error of the mean "SE Mean", calculated by dividing the standard deviation 10.31 by the square root of n = 25, is 2.06). The test statistic t* is 1.22, and the P-value is 0.117.
If the engineer set his significance level α at 0.05 and used the critical value approach to conduct his hypothesis test, he would reject the null hypothesis if his test statistic t* were greater than 1.7109 (determined using statistical software or a t-table):
Since the engineer's test statistic, t* = 1.22, is not greater than 1.7109, the engineer fails to reject the null hypothesis. That is, the test statistic does not fall in the "critical region." There is insufficient evidence, at the α = 0.05 level, to conclude that the mean Brinell hardness of all such ductile iron pieces is greater than 170.
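The arithmetic behind the critical value approach can be sketched in a few lines of Python. This is only an illustration that works from the summary statistics quoted above rather than the raw data. The function name one_sample_t is hypothetical, and the critical value 1.7109 is copied from the text rather than computed.

```python
import math

def one_sample_t(sample_mean, sample_sd, n, mu0):
    """Return (standard error, t statistic) for a one-sample t-test."""
    se = sample_sd / math.sqrt(n)
    t_star = (sample_mean - mu0) / se
    return se, t_star

# Summary statistics from the Minitab output described above
se, t_star = one_sample_t(sample_mean=172.52, sample_sd=10.31, n=25, mu0=170)

# Right-tailed critical value for alpha = 0.05 and 24 degrees of freedom,
# copied from the text (not computed here)
critical_value = 1.7109

# Right-tailed test: reject H0 only if t* exceeds the critical value
reject_h0 = t_star > critical_value
print(f"SE Mean = {se:.2f}, t* = {t_star:.2f}, reject H0: {reject_h0}")
```

Run as written, the sketch reproduces the SE Mean of 2.06 and the test statistic of 1.22, and it does not reject the null hypothesis.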
If the engineer used the P-value approach to conduct his hypothesis test, he would determine the area under a t curve with n - 1 = 24 degrees of freedom and to the right of the test statistic t* = 1.22:
In the output above, Minitab reports that the P-value is 0.117. Since the P-value, 0.117, is greater than α = 0.05, the engineer fails to reject the null hypothesis. There is insufficient evidence, at the α = 0.05 level, to conclude that the mean Brinell hardness of all such ductile iron pieces is greater than 170.
Note that the engineer obtains the same scientific conclusion regardless of the approach used. This will always be the case.
Example: Left-tailed test
A biologist was interested in determining whether sunflower seedlings treated with an extract from Vinca minor roots resulted in a lower average height of sunflower seedlings than the standard height of 15.7 cm. The biologist treated a random sample of n = 33 seedlings with the extract and subsequently obtained the following heights:
The biologist's hypotheses are:
H0 : μ = 15.7
HA: μ < 15.7
The biologist entered her data into Minitab and requested that the "one-sample t-test" be conducted for the above hypotheses. She obtained the following output:
The output tells us that the average height of the n = 33 sunflower seedlings was 13.664 with a standard deviation of 2.544. (The standard error of the mean "SE Mean", calculated by dividing the standard deviation 2.544 by the square root of n = 33, is 0.443). The test statistic t* is -4.60, and the P-value, reported to three decimal places, is 0.000.
Minitab Note. Minitab will always report P-values to only 3 decimal places. If Minitab reports the P-value as 0.000, it really means that the P-value is 0.000....something. Throughout this course (and your future research!), when you see that Minitab reports the P-value as 0.000, you should report the P-value as being "< 0.001."
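The reporting convention in this note can be captured in a small helper. The sketch below is an assumption about how one might implement it in Python. The function name format_p_value is hypothetical, and the tiny P-value passed in the second call is an illustrative number, not a value taken from the Minitab output.

```python
def format_p_value(p, decimals=3):
    """Format a P-value for reporting, avoiding a misleading '0.000'."""
    threshold = 10 ** -decimals  # 0.001 for three decimal places
    if p < threshold:
        # Too small to show at this precision: report an upper bound instead
        return f"< {threshold:.{decimals}f}"
    return f"{p:.{decimals}f}"

print(format_p_value(0.117))    # an ordinary P-value
print(format_p_value(0.00003))  # illustrative tiny P-value (hypothetical number)
```

The first call prints 0.117 unchanged, while the second prints "< 0.001" rather than 0.000.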
If the biologist set her significance level α at 0.05 and used the critical value approach to conduct her hypothesis test, she would reject the null hypothesis if her test statistic t* were less than -1.6939 (determined using statistical software or a t-table):
Since the biologist's test statistic, t* = -4.60, is less than -1.6939, the biologist rejects the null hypothesis. That is, the test statistic falls in the "critical region." There is sufficient evidence, at the α = 0.05 level, to conclude that the mean height of all such sunflower seedlings is less than 15.7 cm.
If the biologist used the P-value approach to conduct her hypothesis test, she would determine the area under a t curve with n - 1 = 32 degrees of freedom and to the left of the test statistic t* = -4.60:
In the output above, Minitab reports that the P-value is 0.000, which we take to mean < 0.001. Since the P-value is less than 0.001, it is clearly less than α = 0.05, and the biologist rejects the null hypothesis. There is sufficient evidence, at the α = 0.05 level, to conclude that the mean height of all such sunflower seedlings is less than 15.7 cm.
Note again that the biologist obtains the same scientific conclusion regardless of the approach used. This will always be the case.
Example: Two-tailed test
A manufacturer claims that the thickness of the spearmint gum it produces is 7.5 one-hundredths of an inch. A quality control specialist regularly checks this claim. On one production run, he took a random sample of n = 10 pieces of gum and measured their thickness. He obtained:
The quality control specialist's hypotheses are:
H0 : μ = 7.5
HA: μ ≠ 7.5
The quality control specialist entered his data into Minitab and requested that the "one-sample t-test" be conducted for the above hypotheses. He obtained the following output:
The output tells us that the average thickness of the n = 10 pieces of gum was 7.55 one-hundredths of an inch with a standard deviation of 0.1027. (The standard error of the mean "SE Mean", calculated by dividing the standard deviation 0.1027 by the square root of n = 10, is 0.0325). The test statistic t* is 1.54, and the P-value is 0.158.
If the quality control specialist set his significance level α at 0.05 and used the critical value approach to conduct his hypothesis test, he would reject the null hypothesis if his test statistic t* were less than -2.2622 or greater than 2.2622 (determined using statistical software or a t-table):
Since the quality control specialist's test statistic, t* = 1.54, is neither less than -2.2622 nor greater than 2.2622, the quality control specialist fails to reject the null hypothesis. That is, the test statistic does not fall in the "critical region." There is insufficient evidence, at the α = 0.05 level, to conclude that the mean thickness of all of the manufacturer's spearmint gum differs from 7.5 one-hundredths of an inch.
If the quality control specialist used the P-value approach to conduct his hypothesis test, he would determine the area under a t curve with n - 1 = 9 degrees of freedom, to the right of 1.54 and to the left of -1.54:
In the output above, Minitab reports that the P-value is 0.158. Since the P-value, 0.158, is greater than α = 0.05, the quality control specialist fails to reject the null hypothesis. There is insufficient evidence, at the α = 0.05 level, to conclude that the mean thickness of all pieces of spearmint gum differs from 7.5 one-hundredths of an inch.
Note that the quality control specialist obtains the same scientific conclusion regardless of the approach used. This will always be the case.
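The tail areas used in the P-value approach for all three examples can be approximated without statistical software. The sketch below is one possible way to do it, not how Minitab computes them: it integrates the t density numerically with Simpson's rule using only the Python standard library. The function names t_pdf, t_cdf and p_value are hypothetical, and the inputs are the rounded test statistics quoted in the text.

```python
import math

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)

def t_cdf(t, df, steps=2000):
    """P(T <= t), using Simpson's rule on [0, |t|] and the curve's symmetry."""
    a = abs(t)
    h = a / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3  # area under the curve between 0 and |t|
    return 0.5 + area if t > 0 else 0.5 - area

def p_value(t_star, df, tail):
    """P-value for a 'right', 'left' or 'two-sided' one-sample t-test."""
    if tail == "right":
        return 1 - t_cdf(t_star, df)
    if tail == "left":
        return t_cdf(t_star, df)
    if tail == "two-sided":
        return 2 * (1 - t_cdf(abs(t_star), df))
    raise ValueError("tail must be 'right', 'left' or 'two-sided'")

# Rounded test statistics and degrees of freedom from the three examples
print(f"Brinell hardness (right-tailed): {p_value(1.22, 24, 'right'):.3f}")
print(f"Sunflower seedlings (left-tailed): {p_value(-4.60, 32, 'left'):.6f}")
print(f"Gum thickness (two-tailed): {p_value(1.54, 9, 'two-sided'):.3f}")
```

Because the inputs are rounded test statistics, the results should be read as approximations of the Minitab P-values; small differences in the last decimal place are possible.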
In our review of hypothesis tests, we have focused on just one particular hypothesis test, namely that concerning the population mean μ. The important thing to recognize is that the topics discussed here (the general idea of hypothesis tests, errors in hypothesis testing, the critical value approach, and the P-value approach) generally extend to all of the hypothesis tests you will encounter.
Your investment advisor proposes a monthly income investment plan that promises a variable return each month. You will invest in it only if you are assured of an average monthly income of $180. Your advisor also tells you that over the past 300 months, the scheme's returns averaged $190 with a standard deviation of $75. Should you invest in this scheme?
Hypothesis testing helps with this kind of decision-making. (Note: This article assumes the reader is familiar with the normal distribution table, its formula, the p-value and related basics of statistics.)
What is hypothesis testing?
Hypothesis or significance testing is a mathematical model for testing a claim, idea or hypothesis about a parameter of interest in a given population set, using data measured in a sample set. Calculations are performed on selected samples to gather more decisive information about characteristics of the entire population, which enables a systematic way to test claims or ideas about the entire dataset.
Here is a simple example: A school principal reports that students in her school score an average of 7 out of 10 in exams. To test this “hypothesis”, we record marks of say 30 students (sample) from the entire student population of the school (say 300) and calculate the mean of that sample. We can then compare the (calculated) sample mean to the (reported) population mean and attempt to confirm the hypothesis.
Another example: The annual return of a particular mutual fund is 8%. Assume that mutual fund has been in existence for 20 years. We take a random sample of annual returns of the mutual fund for, say, five years (sample) and calculate its mean. We then compare the (calculated) sample mean to the (claimed) population mean to verify the hypothesis.
Different methodologies exist for hypothesis testing, but the same four basic steps are involved:
Step 1: Define the hypothesis
Usually the reported value (or the claimed statistic) is stated as the hypothesis and presumed to be true. For the above examples, the hypotheses will be:
- Example A: Students in the school score an average of 7 out of 10 in exams
- Example B: Annual return of the mutual fund is 8% per annum
This stated description constitutes the “Null Hypothesis (H0)” and is assumed to be true – the way a defendant in a jury trial is presumed innocent until proven guilty by evidence presented in court. Similarly, hypothesis testing starts by stating and assuming a “null hypothesis,” and then the process determines whether the assumption is likely to be true or false.
The important point to note is that we are testing the null hypothesis because there is an element of doubt about its validity. Whatever contradicts the stated null hypothesis is captured in the Alternative Hypothesis (H1). For the above examples, the alternative hypotheses will be:
- Students score an average which is not equal to 7
- Annual return of the mutual fund is not equal to 8% per annum
In other words, the alternative hypothesis is a direct contradiction of the null hypothesis.
As in a trial, the jury assumes the defendant's innocence (null hypothesis). The prosecutor has to prove otherwise (alternative hypothesis). Similarly, the researcher has to present evidence against the null hypothesis. If the prosecutor fails to prove the alternative hypothesis, the jury has to let the defendant go (basing the decision on the null hypothesis). Similarly, if the researcher fails to establish the alternative hypothesis (or simply does nothing), then the null hypothesis is retained.
Step 2: Set the decision criteria
The decision-making criteria have to be based on certain parameters of datasets and this is where the connection to normal distribution comes into the picture.
As per the standard statistics postulate about sampling distribution, “For any sample size n, the sampling distribution of X̅ is normal if the population X from which the sample is drawn is normally distributed.” Hence, the probabilities of all other possible sample means one could select are normally distributed.
For example, suppose we want to determine whether the average daily return of any stock listed on the XYZ stock market around New Year's Day is greater than 2%.
H0: Null Hypothesis: mean = 2%
H1: Alternative Hypothesis: mean > 2% (this is what we want to prove)
Take a sample (say, 50 stocks out of a total of 500) and compute the sample mean.
For a normal distribution, 95% of the values lie within two standard deviations of the population mean. Hence, this normal distribution and central limit assumption for the sample dataset allows us to establish 5% as a significance level. It makes sense as under this assumption, there is less than a 5% probability (100-95) of getting outliers that are beyond two standard deviations from the population mean. Depending upon the nature of datasets, other significance levels can be taken at 1%, 5% or 10%. For financial calculations (including behavioral finance), 5% is the generally accepted limit. If we find any calculations that go beyond the usual two standard deviations, then we have a strong case of outliers to reject the null hypothesis.
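The "two standard deviations" rule of thumb can be checked directly against the standard normal distribution. The short sketch below uses Python's statistics.NormalDist from the standard library; the variable names are illustrative.

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1

# Share of values lying within two standard deviations of the mean
within_two_sd = std_normal.cdf(2) - std_normal.cdf(-2)

# Cutoffs that leave exactly 5% outside: split across both tails, or in one tail
two_sided_cutoff = std_normal.inv_cdf(0.975)
one_sided_cutoff = std_normal.inv_cdf(0.95)

print(f"Within two standard deviations: {within_two_sd:.4f}")
print(f"Two-sided 5% cutoff: {two_sided_cutoff:.3f}")
print(f"One-sided 5% cutoff: {one_sided_cutoff:.3f}")
```

The share within two standard deviations comes out at about 95.45%, so "two standard deviations" is a convenient rounding: the exact two-sided 5% cutoff is about 1.96 standard deviations, and the one-sided cutoff is about 1.645.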
Graphically, it is represented as follows:
In the above example, if the mean of the sample is much larger than 2% (say 3.5%), then we reject the null hypothesis. The alternative hypothesis (mean > 2%) is accepted, which confirms that the average daily return of the stocks is indeed above 2%.
However, if the sample mean is not significantly greater than 2% (and remains at, say, around 2.2%), then we CANNOT reject the null hypothesis. The challenge lies in deciding such close-range cases. To draw a conclusion from the selected sample, a level of significance has to be chosen in advance. The level of significance in turn determines the "critical value" used to decide such close-range cases. As per a textbook standard definition, "A critical value is a cutoff value that defines the boundaries beyond which less than 5% of sample means can be obtained if the null hypothesis is true. Sample means obtained beyond a critical value will result in a decision to reject the null hypothesis." For instance, if the critical value works out to 2.1% and the calculated sample mean is 2.2%, then we reject the null hypothesis; had the critical value been 2.3%, we would retain it. A critical value establishes a clear demarcation between retention and rejection.
Step 3: Calculate the test statistic
This step involves calculating the required figure(s), known as test statistics (like mean, z-score, p-value, etc.), for the selected sample. We'll get to these in a later section.
Step 4: Make conclusions about the hypothesis
With the computed value(s), decide on the null hypothesis. If the probability of obtaining a sample mean at least as extreme as the one observed, assuming the null hypothesis is true, is less than 5%, then the conclusion is to reject the null hypothesis. Otherwise, retain (fail to reject) the null hypothesis.
Types of errors
There can be four possible outcomes in sample-based decision-making, with regards to the correct applicability to entire population:
| | Decision to Retain | Decision to Reject |
| --- | --- | --- |
| Applies to entire population | Correct | TYPE 1 Error (α) |
| Does not apply to entire population | TYPE 2 Error (β) | Correct |
The “Correct” cases are the ones where the decisions taken on the samples are truly applicable to the entire population. The cases of errors arise when one decides to retain (or reject) the null hypothesis based on sample calculations, but that decision does not really apply for the entire population. These cases constitute Type 1 (alpha) and Type 2 (beta) errors, as indicated in the table above.
Selecting an appropriate critical value limits the probability of Type 1 (alpha) errors to an acceptable level; it cannot eliminate them entirely.
Alpha denotes the level of significance, that is, the probability of a Type 1 error that the researcher is willing to tolerate, and it is chosen by the researcher. To maintain the standard 5% significance level (equivalently, a 95% confidence level) for probability calculations, alpha is set at 5%.
As per the applicable decision-making benchmarks and definitions:
- “This (alpha) criterion is usually set at 0.05 (α = 0.05), and we compare the alpha level to the p value. When the probability of a Type I error is less than 5% (p < 0.05), we decide to reject the null hypothesis; otherwise, we retain the null hypothesis.”
- The technical term used for this probability is p-value. It is defined as “the probability of obtaining a sample outcome, given that the value stated in the null hypothesis is true. The p value for obtaining a sample outcome is compared to the level of significance."
- A Type II error, or beta error, is defined as “the probability of incorrectly retaining the null hypothesis, when in fact it is not applicable to the entire population.”
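A small simulation can make the two error types concrete. The sketch below is an illustration under stated assumptions, not part of the original method: it assumes a normally distributed population with a known standard deviation, a right-tailed z-test of H0: mean = 2, and made-up parameter values (sigma = 1, n = 25, a true mean of 2.5 for the Type 2 case). The function name rejection_rate is hypothetical.

```python
import math
import random
from statistics import NormalDist, fmean

def rejection_rate(true_mean, mu0=2.0, sigma=1.0, n=25, alpha=0.05,
                   trials=10_000, seed=0):
    """Fraction of simulated samples in which a right-tailed z-test rejects H0."""
    rng = random.Random(seed)                 # fixed seed for repeatable runs
    z_crit = NormalDist().inv_cdf(1 - alpha)  # about 1.645 for alpha = 0.05
    se = sigma / math.sqrt(n)
    rejections = 0
    for _ in range(trials):
        sample = [rng.gauss(true_mean, sigma) for _ in range(n)]
        z = (fmean(sample) - mu0) / se
        if z > z_crit:
            rejections += 1
    return rejections / trials

# H0 is true (population mean equals mu0): every rejection is a Type 1 error
type1_rate = rejection_rate(true_mean=2.0)

# H0 is false (population mean is 2.5): every retention is a Type 2 error
type2_rate = 1 - rejection_rate(true_mean=2.5)

print(f"Estimated Type 1 error rate: {type1_rate:.3f}")
print(f"Estimated Type 2 error rate: {type2_rate:.3f}")
```

The estimated Type 1 rate should land close to the chosen alpha of 5%, while the Type 2 rate depends on how far the true mean lies from the hypothesized one and on the sample size.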
A few more examples will demonstrate this and other calculations.
Example 1. A monthly income investment scheme promises variable monthly returns. An investor will invest in it only if he or she is assured of an average monthly income of $180. The investor has a sample of 300 months’ returns with a mean of $190 and a standard deviation of $75. Should he or she invest in this scheme?
Let’s set up the problem. The investor will invest in the scheme if he or she is assured of the desired $180 average return. Here,
H0: Null Hypothesis: mean = 180
H1: Alternative Hypothesis: mean > 180
Method 1 - Critical Value Approach:
Identify a critical value XL for the sample mean that is large enough to reject the null hypothesis, i.e. reject the null hypothesis if sample mean >= critical value XL.
P(Type I alpha error) = P(reject H0 given that H0 is true),
which occurs when the sample mean reaches or exceeds the critical limit, i.e.
= P(sample mean >= XL given that H0 is true) = alpha
Taking alpha = 0.05 (i.e. 5% significance level), Z0.05 = 1.645 (from the Z-table or normal distribution table)
=> XL = 180 + 1.645*(75/sqrt(300)) = 187.12
Since the sample mean (190) is greater than the critical value (187.12), the null hypothesis is rejected, and the conclusion is that the average monthly return is indeed greater than $180, so the investor can consider investing in this scheme.
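The critical value calculation for Example 1 can be reproduced with Python's standard library. This is a minimal sketch: the variable names are illustrative, and the z-value is computed with statistics.NormalDist rather than read from a Z-table, so it carries a few more decimals than 1.645.

```python
import math
from statistics import NormalDist

mu0 = 180          # mean under the null hypothesis
sample_mean = 190  # observed mean of the 300 monthly returns
sd = 75            # standard deviation of the monthly returns
n = 300            # number of months in the sample
alpha = 0.05       # significance level

# One-sided z-value for alpha = 0.05 (about 1.645)
z_alpha = NormalDist().inv_cdf(1 - alpha)

# Critical value XL for the sample mean
x_l = mu0 + z_alpha * sd / math.sqrt(n)

reject_h0 = sample_mean >= x_l
print(f"XL = {x_l:.2f}, reject H0: {reject_h0}")
```

The computed XL rounds to 187.12, and since the sample mean of 190 exceeds it, the null hypothesis is rejected.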
Method 2 - Using standardized test statistics
One can also use the standardized value z.
Test statistic, Z = (sample mean - population mean)/(std-dev/sqrt(no. of samples)), i.e.
Z = (190 - 180)/(75/sqrt(300)) = 2.309
The rejection region at the 5% significance level is Z > Z0.05 = 1.645.
Since Z = 2.309 is greater than 1.645, the null hypothesis can be rejected, leading to the same conclusion as above.
Method 3 - P-value calculation
We aim to identify P(sample mean >= 190, when mean = 180)
= P(Z >= (190 - 180)/(75/sqrt(300)))
= P(Z >= 2.309) = 0.0105 = 1.05%
According to the following table for interpreting p-values, there is strong evidence that average monthly returns are higher than $180.
| p-value | Inference |
| --- | --- |
| less than 1% | Confirmed evidence supporting alternative hypothesis |
| between 1% and 5% | Strong evidence supporting alternative hypothesis |
| between 5% and 10% | Weak evidence supporting alternative hypothesis |
| greater than 10% | No evidence supporting alternative hypothesis |
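The p-value calculation and the lookup in the table above can be sketched together. The function name interpret_p_value is hypothetical. The table does not say which band a p-value of exactly 1%, 5% or 10% belongs to, so the sketch assumes each such boundary falls into the weaker-evidence band.

```python
import math
from statistics import NormalDist

def interpret_p_value(p):
    """Map a p-value to the evidence bands of the table above.

    Assumption: a p-value exactly on a boundary goes to the weaker band.
    """
    if p < 0.01:
        return "Confirmed evidence supporting alternative hypothesis"
    if p < 0.05:
        return "Strong evidence supporting alternative hypothesis"
    if p < 0.10:
        return "Weak evidence supporting alternative hypothesis"
    return "No evidence supporting alternative hypothesis"

# Example 1: right-tailed test of H0: mean = 180
z = (190 - 180) / (75 / math.sqrt(300))
p = 1 - NormalDist().cdf(z)  # area to the right of z under the standard normal

print(f"z = {z:.3f}, p = {p:.4f}")
print(interpret_p_value(p))
```

Computing the tail area in code avoids misreading a printed Z-table, which matters here because the result sits close to the 1% boundary between two bands.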
Example 2: A new stockbroker (XYZ) claims that his brokerage fees are lower than those of your current stockbroker (ABC). Data available from an independent research firm indicate that the mean and standard deviation of the brokerage fees of all ABC broker clients are $18 and $6, respectively.
A sample of 100 clients of ABC is taken, and their brokerage charges are recalculated at the new rates of XYZ broker. If the sample mean is $18.75 and the standard deviation is the same ($6), can any inference be made about the difference in the average brokerage bill between ABC and XYZ broker?
H0: Null Hypothesis: mean = 18
H1: Alternative Hypothesis: mean ≠ 18 (This is what we want to prove)
Rejection region: Z <= -Z0.025 or Z >= Z0.025 (assuming a 5% significance level, split 2.5% on either side)
Z = (sample mean - mean)/(std-dev/sqrt(no. of samples))
= (18.75 - 18)/(6/sqrt(100)) = 1.25
This calculated Z value falls between the two limits defined by -Z0.025 = -1.96 and Z0.025 = 1.96.
This concludes that there is insufficient evidence to infer that there is any difference between the rates of your existing broker and the new broker.
Alternatively, the p-value = P(Z < -1.25) + P(Z > 1.25)
= 2 * 0.1056 = 0.2112 = 21.12%, which is greater than 0.05 or 5%, leading to the same conclusion.
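Both routes to the conclusion in Example 2 can be checked with a short two-tailed z-test. This is a sketch using only the Python standard library; the function name two_tailed_z_test is hypothetical.

```python
import math
from statistics import NormalDist

def two_tailed_z_test(sample_mean, mu0, sd, n, alpha=0.05):
    """Return (z, critical z, p-value, reject?) for a two-tailed z-test."""
    std_normal = NormalDist()
    z = (sample_mean - mu0) / (sd / math.sqrt(n))
    z_crit = std_normal.inv_cdf(1 - alpha / 2)  # alpha split across both tails
    p = 2 * (1 - std_normal.cdf(abs(z)))        # area in both tails beyond |z|
    return z, z_crit, p, abs(z) >= z_crit

# Example 2: sample mean 18.75, hypothesized mean 18, std-dev 6, 100 clients
z, z_crit, p, reject_h0 = two_tailed_z_test(18.75, 18, 6, 100)
print(f"z = {z:.2f}, critical value = +/-{z_crit:.2f}, "
      f"p = {p:.4f}, reject H0: {reject_h0}")
```

The p-value computed this way may differ from the table-based 0.2112 in the fourth decimal place because Z-table entries are rounded; the decision is unaffected.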
Graphically, it is represented by the following:
Criticisms of the Hypothesis Testing Method
- Statistical method based on assumptions
- Error-prone, as detailed in terms of alpha and beta errors
- Interpretation of the p-value can be ambiguous, leading to confusing results
The Bottom Line
Hypothesis testing provides a mathematical model for validating a claim or idea at a chosen significance level. However, like the majority of statistical tools and models, it is bound by a few limitations. The use of this model for making financial decisions should be considered with a critical eye, keeping all dependencies in mind. Alternative methods such as Bayesian inference are also worth exploring for similar analysis.