Tip:
Highlight text to annotate it
X
Hello and welcome to this lecture;in this lecture, we will learn some of the statistical
methods to testthe observed data,whether they are fitting to some particular distribution
or not.You know that in earlier lectures, as well as in earliermodules also, we havereferred
several time thatwhile handling some problem,wegenerally assume that the dataset that is followinga
particular distribution andwe have seennow,we will see that howwe can testthese thingsthat
whetherthe dataset is reallyfollowing this distribution or not.
There are some of thestatistical test that I am just going totell in this in this lecture
is that,so to use that one and use the knowledge of this hypothesis testing will betestingits
statistically that howthe data is fitting to a particularmodule.So, there are some tests
and so we will just pick up some thispopulartest to see that how we cantest thisgoodness of
fit test, these test are known as the goodness of fittest,so, this is,sothat is our today’s
lecture title is theGoodness of fittest.
So,first we will justsee that howthis canwork, and after that we will taketwo test for this
lecturethat one is that chi square test,this will be chi square, not d;d is mistake, that
is a chi square test and other one is thekolmogorov smirnovtest,it is also known as this K S testas
anabbreviated form.And this K S test again can be that one sample test and two sample
test.Basically,when we take thatthis one sample test,then we generally test it forone sample
that we take and we generally take that fromwhether that particular dataset follow aparticular
distribution or not. So, in this case, one is our sample and other
one is the standard distribution and when you go for the two sample test, it is basically
both are our sample data is there andwe look for the answer whetherboth the samples are
from the same population or not,same population means,there are following the same distribution
or not. So, this ishow we go for one sample test and thistwo sample test.As you know that
for this kind of statistical test,weneed some kind ofsignificance level andsome statisticalsignificance
level. So, whenever we generally conclude or we draw somedecisionfrom whatever the statistical
test we do, we have generally that decision is associated withstatistical significance
level,so that is important.
So, what significance level that we are considering eitherbeforehand thatok,at thissignificance
level,whatever we are going to test is satisfied or notin a statistical sense that is what
we will do.Well, so now, first,we will goto this, whatever I justmentionedso far is that,
it may be a necessary to check howwell a set of observed datafit to aparticular probability
distribution, and this one, this can be describedusing the certainstatistical test known as the goodness-of-fit
test.Andas also, I just try to indicate that the basic philosophy is to measure the discrepancy
between the observed and theoretical values used in thestatistical model.So, this basically
are the generalthing forallthetestthat we going to describe now,that and this one what
we are describing is particularly for the one sample test.
Suppose, when we are saying that I have a dataset and that follows a particular distribution,
sowhat is our interest is that, whetherthere is any discrepancy between that whatever theobserveddistributions
that we can see from this data.And whatever thedistribution that we are assuming to follow,
whether it is normal or log normal or even in the discrete side Poisson or so, whether
those distribution that theoretical values and this whatever the observed from this data
are matching or not.So,thatdiscrepancy we have to see, and that discrepancywe have to
decide whichthrough some distribution, we have todraw someinference in a statistical
way.
So,the example of thisgoodness–of-fit test may be as follows,say,thus that what type
of question that we are looking for theiranswer, say that, whether a sample of a discrete variablefollows
a Poisson distribution or not. So, these are just the example, it is not that always I
am looking for this Poisson distribution or so.So, any distribution, soI know that this
randomvariable is a discrete variable,so I have somehypothesize that whether thatsample
can follow a particular discrete distribution thatwe know already, and those distribution
we have discussed inearlier lectures, in earlier modules.So, like that question,I have a sample
whether that sample follow a Poisson distribution or not.
Similarly,whether a sample of a continuous variable follows a normal distribution or
not, or gamma distribution or not, or log normal distribution or not like that,orin
case of the two samples, whether both the samples are drawn from the identical distribution
or not.So, when we are testing it, we will just take try to take that all this several
types ofthis problems, where we will try to cover almost all thesedifferent cases discrete
variable case,continuous variable case, as well as thetwo samples case taking from two
different sample we will take and whetherwe willtest that whether they are from the identical
distribution or not.
So, aswe mentioned, there aresome of thesecommonly used test,are therethese are standard test
that we generally use forthis goodness-of-fit test.The first one is this, chi square test
and then, the kolmogorov smirnov test, there alsoother test which is a Anderson-darling
test which is generally known to be a littleimprovement over this K S test.And we will take this one
also, but may be for this lecture we will justconsider thischi square test and K S test.
Now, this chi square test, the chi square distribution is used for testing the goodness
of fit of a set of data to a specific probability distribution.For this,we make the comparison
of the observed and hypothetical frequencies that follow the specific probability distribution,
so this is that what we have discussed for this basicphilosophy.And this is basically,
if you see thatoverallapproach is same for all this test,so the comparison of thiswhatever
you have observed from this data and whatever we are suppose to get from that hypothesize
distribution, whether they are matching or not.
So, here basically this chi square test is based on their frequencies,so we will find
out the observed frequency and also see, what is the hypothetical frequencyand obviously,
as we are talking about these frequencies, so we have tocategorize the data into different
beansand each bean what is thefrequency, what is the observed frequency? This observed frequency
means, whatever the data that we are having, based on that, what is the frequency that
we can see, and we are hypothesize onedistribution.So, based on that distribution obviously, that
distribution should have some parameters, with thatparameters, what are thefrequency
that we can observed.And this one can be used for the both the cases; means, both for thisdiscrete
random variable as well as for the continuous randomvariable, we can use this test.
So, what isdone that may be we will discuss nowin a step in different steps,first of all
let us consider that a samplecontains n observed values of the random variable,so whatever
the data that we are having that is that n numbers of data that we are having.And now
thisnotation, that is,O1,O2,O3up to OK, be the K observed frequencies of the variates
and the corresponding frequencies from the assumed theoretical distribution be E 1,E
2,E 3 up to EK.So, this is that observed, that is,this is from whatever that data that
we are having is the O 1, O 2,O 3,OKand these E1,E2,E3,EK, these are from that whatever
the distribution,that we are hypothesize that this dataset may follow that particular distribution.So,
from that distribution whatever the frequency that we are gettingis,E’s Ei's and whatever
from the data that is Oi’s. So,this it is required to test whether the
difference between the observed and the expected frequencies are significant or not.So, this
are the O1,O2,O3,OK and these are from the hypothesize distribution,now they are difference,
whether this difference from the observed and to the theoretical is significant, then
we can say that whatever the datawe are having is not following whatever the distribution
that we have. And on the other hand, if they are almost same,then we can say that, yes,that
it is following that particulardistribution.
Now, so,how to do that one, how tofind out that discrepancy is through thisparameter,
we generally call itas a statistics and that statistics is obtained as that Oiminus Eiwhole
square divided by E iand sum of are allKSmeans, all thisgroups that is from 1 to K. Now, this
quantity this quantity is,as I told that this is the statistics and denoted by here X s
square that we have denoted this one, and it has been found that this X s square or
this statistics,it followsa chi square distribution.If n tends to infinity, that is,thatthat mathematical
limitfor the large number of thissample data, that is available, this statistics follow
a chi square distribution and this chi square distribution is having somedegrees of freedom,here
the degrees of freedom is k minus 1. So, this one show, aswe know that this particular
statistics follows the chi square distribution with k minus 1 degrees of freedom; that means,
we know that what areitsproperties, and based on that at what significance level I want
to test whether,this ismeanssignificantly,this is significant or notthat we have to test.One
thing is, I should mention here that sometimes it is required,for example that when we are
calculating this E i,it may require to estimate some of the parameterof that hypothesize distribution
from the data itself, if that parameter is not known.And in that case,as many parameters
as we are required to estimate from this datathen,thosemany degrees of freedom will be lost.So,if you
are not estimating any parameter from that data available, that time this the degree
of freedom is k minus 1, but if we estimate one parameter then, this degrees of freedom
will be k minus 2, if you are estimating 2 parametersthen degrees of freedom will be
k minus 3.
So,similarly, so as many parameters, you are estimating from the data those many more degrees
of freedomwill be lost. So, now this onesome morethings, some moreproperties, these are
notthat,means, directly I cannotsay that why these are required,but for the betterresults
that is within quote, that for the betterresult, this we should take care before we go ahead
with the test.The first thing is that, the K should be greater than equal to 5, so that
number of a groups that we havedivided that observed data it should be greater than equal
to5 and that Eiis that is thefrequencies that we got in each beam should be at least5, so
this Eishould be greater than equal to 5.
Now, the special case,the most of thecases we may not have the parameters of the theoretical
distribution and then, the parameters should be estimated from the data itself,and the
statistics remain valid if the degree of freedom is reduced by one for every unknown parameter.
So,that is what I just explained, so if you need to estimate the parameter of the hypothesize
distribution from the data then, you need tothen you need to allowthat much degrees
of freedom will be lost. Now,assuming that the distribution follows as this onethat I
told,there will be square here, as you have seen earlier this statistics,this observed
minus, this hypothesize square divided by Eiand their summation.So,if this one is less
than this value, what is shown is,C1 minus alphamuandthis mu as it is already explained
that, this is the degrees of freedom,this C1 minus alpha mu, that is C1, minus alpha
is a value of approximate that this chi square distribution with mu degrees of freedom at
the cumulative probability1 minus alpha. Now, this alpha is that significance levelthat
we are talking about, the assumed theoretical distribution is anacceptable model at the
significant level of this alpha.So, this is what that I was mention initially, that all
these test should be associated with some significant level. Generally, this significance
level are keptaround, say, 0.01 or,sorry, 0.01or 0.05. So,at this cumulative probability
one minus this alpha, so if it is that 5 percent significance level if we say; thatmeans that
cumulative probability at which we are testing it is at the95 percent.So, if the observed
statistics is less thanis less than that cumulativeprobability of that distribution,here it is the chi square
distributionof this95 percent if the alpha is 0.05significance leveland then, we can
say, yes, thatwhatever we havehypothesized may not be rejected.
So, yes, as I told that this should be thesquare, as we have seen it here also, thisstatistics,this
y minus Eipower square. Now, we will take up one example and this example, we have taken
for the discrete case and because,thischi squaredistribution can be used for the is
for the discreteparameters. So,we have it can be used for the continuous as well as
discrete, butthe other than next one what we are going to cover is, the K S test that
isfor the continuous distribution,so here we are taking a discrete example.
Consider a given station in a watershed, where the severe rainstorms are recorded over a
period of 70years,last70 years, we have recorded the how many rainstorms are there in a particular
year.And out of these70 years,22 years were without severe rainstorms,so there are in
22years there are nosevere rainstorms, so number rainstorms is 0.And25 years,14 years,
6 years and 3 years are with 1 rainstorm,2 rainstorm,3 rainstorm and 4 rainstormsrespectively.So,25
years we have observed there is 1 rainstorm, 14years we have seen there are3 rainstorms,16
3 rainstorm, and 3years, there are4 rainstorms are there. Test whether the data can be assumed
to followa poisson distribution at 5 percent significance level.So, here as you can see
that we are hypothesizingthat whether the data that is giventhatwhatever we have recorded,
whether it is following the Poisson distribution ornot, and the significance level is given
as5 percent.
So, here now you see,so the Poisson distribution you know there is somelambda, there is one
parameter that we have discussed earliermodulus, thatparameter that lambda is the mean rate
of occurrence. So, rate so average occurrence rate of this rainstorm,so if you want to calculatethen
we have seen that there are22 years where there is norainstorms. So,22 multiplied by
0,basically, the first thing,then 25 years 1 rainstorm, 14 years2 rainstorm,6 years3
rainstorm and 3 years 4 rainstorms are there are total. So, in this way out of these70years
what are how many total rainstorms that we have seenthat divided by70 gives youthat average
rainstorm that can occur in ayear,so 1.1857 rainstormsyou can see in a year.So, lambda
here is that 1.1857 and the tthat we are- lambda t - that is the totalquantity that
t is here the per year,so one year how muchrainstorms has occurred, so this is the quantity 1.1857.
Now remember that, we have estimated this one from the data itself,so this was not supplied.Sometimes
what happens?these data could have been supplied directly, whether the test,whether the data
is following the Poisson distribution with the lambda is equals to, say,1.3 or 1.1 like
that.If that is given to us that means, we are not estimating that one from the data,
so there the degrees of freedom whatever I told that, it is k minus 1 will be there;but
here, we have already estimated one parameter, sonow the degrees of freedom will bek minus
2. So, now to the check, thegoodness- of-fit,
we use the chisquare distribution atalpha equals to 5 percent significance level, as
the data is so small, the data for the4 storms per year is combined with the3 storms per
year. So,this one is generally done, but so there issome discussion is required.We canNow,
we have to think that what distribution you have to hypothesize,so it is the Poisson distribution
that you have hypothesize andwhat is the support for that distribution. Now, the Poisson distribution
that support that we are looking for is basically starting from the 0 to0,1,2 the discrete values
and goes up to infinity. Now, the data that is obviously is given to
us is that 1 rainstorm,2 rainstorm, 3 rainstorm up to the 4 rainstorm per year, but when we
are hypothesizing that this is thethis isthe Poisson distribution of obviously, I cannotscuttle
at any particular point.So, here what is the general practice is that for the higher side
where data is becoming very small,wecan combine those things to take care two things, one
is that so I can declare that, yes, more than equal to this value is having this frequency
and the second thing is thatwe will also betesting whether that each group is having that minimum
requirement of thisfrequency is available or not.
Because this kind ofasymptotic distribution, thatis, it is goingtowards plus infinity,
so this type of it should be open bounded,but the way the problem is given it is just as
a close boundary at 4rainstormper year. So, here if we just considerthat2rainstorms are
combined together, that is, the 3rainstorm per year and4rainstorm per year case, then
we can say that will greater than equal to3rainstorm per year. So, that remains that, positive
side remains unbounded and also, this will help tocheck whether thatat least greater
than 5, that is, for the better result that we havemention earlier.
So, here also what we are doing is that the 4storms per year and the 3rainstorms per year
are combined together. So, here the null hypothesis is the random variable has a poisson distribution
with lambda equals to 1.1897;alternatehypothesis is the random variable does not follow the
distribution specified in null hypothesis, what is specified here?Significance level
is 0.05 at 5 percent significance level mentioned.And we have thek equals to 4, so the degrees of
freedom here are the k minus 2, sok minus 1 is from there and one parameter we have
estimated,so it isk minus 2, the degree of the freedom is 2. So, that criticalregion
hereis thischi square distribution, that is, chi square distribution with2 degrees of freedom
at alpha equals to0.05.
Now, if you see,so this is that chi square distribution table, basically,you know that
we have discussed this distributionearlier and this standard values are listed in any
standard text book. So, here if you see that this point at this 0.95 these that cumulative
probability are given here and for these degrees offreedom 2,this value is 0.5, 0.99,so these
value is our critical value, that we have to test against.
So, thisis that 5.99 that we got from thistable andthese are thetheseare the beanshere now,so0rainstorm
per year,1 rainstorm per year,2 rainstorm per year and this is the greater than equal
to3 rainstormper year.So, now,as it is given in this data, that there are 22 years where
this0 rainstorms are there and the 25 yearare thereis that 1 rainstorm,14years 2 rainstorm
and 6 years 3 rainstorm and that3 years4 rainstorms are there, sowe have combined it here to get
the 9 years data, where it is greater than equal to4 rainstorm per year.
So, we have avoided that one, in one occurrence it is becoming 3which is less than 5, so that
is also avoided and that right side is kept open.And this theoretical frequencies, that
is,thesevalues what we are getting is that this from this Poisson distribution that lambda
t here equals to this lambda equals to that 1.1897 and t equals to 1,so1 year, so this
lambda t value we will get,and from this poisson distribution if you just putthat x equals
to 0, x equals to 1, x equals to 2 and x equals to 3 with that value of lambda t equals to
that value, then we will get thisvalues,which are the, so are the theoreticalfrequencies
for this Poisson distribution.
Now, so these areso these values we are getting when we are putting this x equals to 0, and
this value you are putting when you are putting x equals to 1 from this Poisson distribution,
so this is the theoretical frequencies and these are the observed frequencies.Now, what
we have to do?We have to get thisstatistics, that is, y minus Eiwhole square divided by
Ei,so if we do this one then we get these values and we have to sum it upthis one, so
this is the summation from is that0.1668. So,theseonethis isourstatistics which is equals
to0.1668 that we have seen from the table. So,that so, which is now is the less than
5.99 that we have seen from the table, so this is less than this this critical value,
so hence thePoisson distribution is a valid model at a5 percent significance level.So,
the decision isthat for this test is that the null hypothesis cannot be rejected at
5 percent significance level. So,there could be some other words that we
can express, ok,the null hypothesis is not rejected and all,so what I feel is that this
should be the proper decision, properprobabilistic or the proper statistical inference should
be written as this one thatnull hypothesis is acceptedis not theright thing to declare.So,rather
the complete thing that we can declare isthat the null hypothesis cannot be rejected at
5 percent significance level because, the result that we have got it depends on what
significance level that we have optedfor. So, that is what we have to mention that at
this significance level the null hypothesis cannot be rejected.
So, next we will go to that our second test, which is the kolmogorov-Smirnov test,in brief
we generally mention here K S test. This test is alsomost commonly used to check the validity
of the assumed model, and it is not applicable for the discrete variables,so mostly that
continuous random variable if the dataset is availablewith thatwe generally get this
one. It relates to the cdf rather than the pdf to a continuousvariable,so here earlier
what we have seen that the frequency that when we are talking about in this chi square
test that is basicallya pdf that we are talking about. Here, this K S test is generally based
on this what is therecdf, that is cumulative distribution function, and it compares the
observedor data based cumulative frequency with the assumed theoretical cumulative distribution,
sodata based cumulative distribution, sorry, this should be distribution with the assumed
theoretical cumulative distribution. So,there are some kinds ofdiscrepancies orshortcomings
I should say, in this chi square test is that,we need to define that,we need to first of allget
thatthat beans thatwe have to first categorize the dataset, the full dataset that is, ok,this
is my range.And in the discrete it isfinethat whatever the example that we have seen,nowfor
the as I told this chi square distribution is also applicable for the continuous distribution.
Now, ifwe take some continuous distribution in case of this chi square test, what we have
to do, we have to firstwe have to first categorize the data into different beans and each bean
we have to see the frequency and also we have to check that whether each bean is having.So,
the number of beans should not be less than 5,as well as each bean should have minimumthatminimum
frequency should be 5, for the better results that we are mention. So, these two things
are not there in this K S test,so it is directlygetting it is directlytheaccess it is results from
the cdf itselfdirectly,so there is no need to categorize the data into beans.And so,
it is avoiding those requirements of these minimum 5 beans and that each bean should
have that minimum frequency of 5,so which is not there in this K S test.
Now,so, if suppose, that there is a random variable x and we havesome dataset of this
X1,X2,X3 up to X n,so representthe ordered sample of size n. So,whatever the data that
we are having from this actual observation,we can first of all we can make it in an arranged
in an increasing order, and that increasing order if that increasing order is that X1,X2,X
nif that isavailable.Now, from this ordered set the empirical or sample distribution function
S n X is developed and this function is basically a step function. So,this is from this data,
so from this sample how to get thatthis cumulative distribution function is as follows.
That is for, when this S n X is equals to 0,in case, when X is less than X1,S n X is
equals to K by n, when this X is in between k to k plus 1 and k can vary from 1,2,3 up
to n minus 1. So, basically for all this X1 to X n for this thing we are definingthese
values,the value of thiscumulativedistribution is k by n. And forXgreater than n is equal
to 1. So, this is basically what we aregetting is, whatever the representative, cumulative
distribution directly from the data and each point it is changing, this value is changing,
so it will look like a step function.
So, obviously as you can see this essence should will start from a value0,it will start
from a value from 0 andat some point it will go, it will increase,and again from the next
one, it should go andlike this, it will go somewhere,it will go flat, and in this way
what will happen?It will ultimately attain the value where it is 1.
Now, this is thedistribution basically we got it from this data.Now, thehypothesized
distribution, supposethat I want to match thatthat normaldistribution,so that normal
distribution with the parameters of course, whether it is supplied or obtained from thissample
data is that with that data, I can also plotwhat should be theshape of thisthat particular
distribution, say,that normal distribution if I say.
Now, if whateverwe have observed the black line here, if this is very close to this red
onewhich is the theoretical distribution obviously,then the data is from that particular distribution.
So, this discrepancy, now again, keeping the sameapproach samefor all that test that I
mention at the beginning of this class, is that here also we have seen that what is the
maximum difference between this two distribution thatcdf, so one is the theoretical and the
other one is theother one is thedatabase.So, that discrepancy we have to assess through
somestatistical test,this is what, that is why we have got this S n X.
So, this S n X is the step function and this FX is the proposed theoretical distribution
that what Ihave drawn in the reading, just now.Now,here if we see the discrepancies between
the theoretical model and the observed data is computed, and the maximum difference the
D n between the S n X and the FX is over the entire range of X is obtained, which is which
isdenoted as D n, which is the maximum for all X,FX minus S n X obvious theabsolute value.
So, now you can see here, for a particular point here as you can see,so from the blue
one that I have drawn, this is also a step function here - the blue line, and thispink
one is that youris that theoretical distribution. Now, the difference between any point to thattheoretical
distribution is your that value is the difference betweentwo things.Now, what we have to pick
up from this two information is that what is the maximum difference?So, at each point
there will be some difference between this blue line and thisand this pink line,sothat
difference at each and every pointhave to find out and we have to select, we have to
pick up the maximum one,so what is the maximumdifference.This blue line obviously,as we are coming leaps
towards this, this is 0,and from where it is ending towards the right to that one it
is equal to 1. And so over this entire range, we have to pick up what is the maximumdifference.
So, thus for aspecified significance level at alpha that K S test compares the maximum
difference with the critical value D n alpha.Now, what is this D n alpha?ThisD n alpha is defined
as the probability that D n less thanequals to D n alpha is equals to 1 minus alpha,again
this alpha here is that significance level that we have mentioned at thebeginning of
this class.So, if the observed value is less than the critical value, then the proposed
distribution is valid at the significance level alpha.So, we have to check that whether
whatever the maximum difference that we get, and whatever the critical value. So, this
probability, that is, that observed D n less than equals tothat critical value whether,
it is equal to this1 minus alpha or not.
Now, the advantage of this K S test,as in the chi square test, division of the data
into interval is not necessary in this case,so I think this things I was just mentioning
while at this starting of this K S test. So,wesothese intervals arenot necessaryhere, because we
are justobserving thisat eachand every data point.The test statistics is distribution
free unlike in the case of the distribution of the chi square test, it works for this
log normal data,however, the test can fail if the data is too far from thisnormality.
So,there is no such restrictionthat with the data should be approximate normal or so, but
it is better to get the better result again that data should be somewherenearnormality
that is why the popularity of this K S test is,we have seen, it is very frequently, it
is applied to test whether the data follow the normal distribution or not, that is, where
the maximum application of this K S test has been found.
Now, the sample distribution,if the sample distribution that n is large, not the distribution
sample size, if the sample size n is large,Smirnov has given the limiting distribution of square
root n multiplied by this D n,so this is that D n that we have defined the maximum difference
that multiplied by square root n, these quantity follow a distribution like this that limit
n tends to infinity, probability of this quantity square root n multiplied by that D n,the statistic
less than equals to z is equals to square root 2 pi by z multiplied by summation of
k equals to 1 to infinity, exponential minus 2k minus 1 square multiplied by pi square
by 8z square. So, this is what it is giving is that for
this n tends to infinity means, when this n isvery largethat time what we can get is
that, how this D n is varying is thatthrough this, thatwe have to find out.suppose that,Suppose
now, so it depends on this what is the significance level that we have fixed, suppose that this
significance level if it is 0.05; that means, thisprobability is equals to 1 minus alpha
that means this 0.95.Now, if we solve this right hand part with equal to that, what isthat,0.95,then
we will get that is z becomes 1.36,if you see this quantity hereinside this exponential
term, that is,2k minus 1 whole square multiplied by pi square by 8z square.
So, this term basically is changing that is from this k equals to 1 to ininfinity, now
k ifk is 1,you can see that this is just minus of this quantity plus exponential of,ifk becomes
2 then, it becomes 9, so minus 9,so exponential minus 9 times of this one and if k becomes3
then it is 25 times of this one,so basically if you just do just a hand calculation also
you will see that, if you just consider the first term itself that is k equals to one
only and remaining if you justignore it then, also you will see for this,this is almost
very closelymatching with this onemay be, it is justvarying afterthird decimal or so.
So, for this n greater than 50, sometimes in some textbookcan refer to thatif is n is
greater than 35 itself, and for this alpha equals to 0.05; that means, this right hand
side is equated with this 0.95 and if we just calculate what should be the value of this
z, if we just consideronly one value of k equals to 1 then,you will see that this z
becomesvery close to this 1.36.So,for this significance level alpha equals to 0.05 that
critical value is 1.36 divided by square root of n,so that means this Z becomes 1.36 and
this is that 1.22square root of n.So,this critical value this should be, so our observed
statistics that is D n should be less than thisparticular value todeclare that,at that
significancelevel, we can acceptthat particularhypothesis or that null hypothesis cannot be rejected.
And similarly,if you put thatsay alpha equals to 0.1; that means, this right hand side if
we equate to… if with that 0.9then,we will see that this quantity, this z is becoming
1.22,so the critical value is 1.22 divided bysquare root n and for these lower values
of n, this is also available inthe standard table from thatdistribution and we can refer
to those tables to get these critical values. So,we will take up one exampleto just to discussall
these things,and here we have taken oneexample of this continuousrandom variable.And as I
mentioned thatthis is mostly used when we are considering that when,whether thedataset
is followingnormal distribution or not. So,that data of the fracture toughness of the plain
concrete specimen made with the burnt brick aggregate is shown in the tablein the next
slide.That data appears to fall approximately a straight line on a normal probability paper,
that if it fallsapproximate normal on a normal probability paper, there is apossibility of
it may follow a normal distribution, and that the parameters aremu equals to 0.54 and thesigma
is equals to 0.051.
To perform thekolmogorov- Smirnov test at 5 percent significance level to statistically
justify the assumption of theassumption for the given data, so data is supplied here that
is the fractures toughness, which is having an unit of mega pascals square root of meterof
the plain concretespecimen, and this is alreadyarranged in anincreasingorder.So, you can see that,so
that there are total 25 samples are there1 to up to this 25and this one that KIC which
is the notation for this fracturestoughness isarranged in an increasing order,so from
0.451 to 0.658.
So, we have to test thatwhether this data set is following the normal distribution or
not,and normal distribution having the parameters 0.54 and 0.051.So, over null hypothesis here
is therandom variable, has a normal distribution with those parameters of course.Alternative
hypothesis,the random variable does not have the specified distribution in this null hypothesis,
level of significance is 0.05.And the critical region from the tablejust show, that is, here
what the number of data is25 that is available and significance level is 0.05.
Now, if you see this table here, that is, these are thevalues of this K S test goodness–of-
fittest.This is that for different n is listed in the first column 10,11,12 like this, and
the second one is that,u is your alpha 0.05.So, this kind of table is available to any standard
text book and here, if you just see this value is highlighted for this n equals to 25 and
the alpha equals to 0.05, the value is 0.264, so from this table we have just picked up
these the critical value of thattest.
Now, if we just do this one,do these same methods that we have explained;now whatever
the data that we are having,we have plotted it forit isdistribution with this step function
as shown in the blue line here.And the theoretical distribution for thisnormal distribution that
cdf of this normal distribution with parameter 0.54 and 0.051 is shown in the line magenta,now
whatever themaximum difference between these two that we have to pick up.
So, the cumulative frequency of the given data is plotted in this figure with respect
to the equation of this K S test and the theoretical distribution function of the normal model
is also shown, what is shown here.From the figure and of course,you can you can check
it in this calculationalso, the maximum discrepancy of the two functions is d max equals to 0.1348
which is occurring at KIC equals to 0.508,so at this value the D max is0.1348.
So, this is the only value that we can that we have to pick up from this comparison,this
is what thatK S test and sometimesfrom this point onwards, may be thatfurther improvement,
we will look for that,but that is the later part.But herefrom this graph,only one information
that we are picking up is,what is the maximum difference,just one particular value we have
to pick up and that value is 0.1348.The maximum discrepancy is 0.1348,now we can test that
it is less than that critical value thatwe have seen that also whichis the critical value
is0.264 that we have seen from the table. So,what we cansee here,so this modelis a normal
distribution with parameter 0.54 and 0.051 is a valid model at 5 percent significance
level, in other words or I should say that this should be used to declare that the null
hypothesis cannot be rejected at 5 percent significance level.
Well,now,we will go to thattwo sample test,keeping the basic philosophy, again here is the same
thing, one is that,so in the two sample; in one sample test what we have done is thatone
the datathat we have obtained and other one is that sum of some standard distribution
that we already know.So, in theexample we have seen that onenormal distribution that
we have used.Now, the two sample test means, that we are not using any standard distribution,the
twosamples are there, two samples both will have their ownthat event that observedcumulative
distributionfunction and we have to pick up the difference between two.
So,in the one sample test basically,that is one observed data and other one is some standard
known theoretical distribution.And in two sample test, both are the observed data, and
we are plotting, and what valuewe have picking up are exactly the same thing.So,the same
test used in the case ofthe one sample testcan be used to evaluate whether the two samples
come from the same distribution or not.Let the maximum absolute difference between two
empirical distributionsfunctions be D m n.
Now, let the twofunctions berepresented as the step functionG m X and S n X based on
the two samples of size m and n respectively, so that two samples are there, one sample
is having the m data other one is having n data, so this should be flexible.It is not
that both the samples may not have the same size of this data.And what we have to gowhat
we have to get is that from here, is that we have to find out what isthe G m X and what
is the S n X, andfollowing the same equation that we haveshown in thisone sample test using
K S test. So, thus here, the difference will be thatmaximum
difference that we have to get, which is the absolute value of this difference between
this G m X and that S n X and that one we havejust pick up only single value again similar
to the one sample test, which is denoted as that D m n this m is the size of this one
sample, other one is the size of the other one.
Now, this is one suchtypical example, how it will look like, the blue one is for the
one sample and the red one is for another one.So,now we have to find out, so basically
the red is again approaching towards this all are 0valuehere, and here also red is going
all are one- values are one.So,now each and every point we can find out what is the difference
there,so at this point may be around pointminus 0.4 or so,the difference is shown herebetween
this point andthis one, this is D m n for this particular value.So, for all such differences
we have to pick up the maximum one, so this K S test for the two sample goodness –of-
fits as looks like this and we have to pick up themaximum one.
Now again, the sample distribution havethe large values ofthe m and n,if this if thisvalues
are large enough that is m and n, the Smirnov has given the limiting distribution as this.The
square root of m n by m plus n, which is the data size for the two samples, multiplied
by this D m n less than equals to z is equals to square root 2 pi by z summation of k equals
to 1 to infinity, exponential of minus 2k minus 1 square multiplied by pi square by
8z square.So, basically what happens from the single sample,it was here it was square
root n and for the two samples test, it is replaced by this square root of m n by m plus
n. So, basically the sample size in the one sample
test we have used it for thisn and for the representative sample size for the two sample
test is this quantity m n bym plus n. So, if we just change this one, so if we once,
we are having this two samples, so what is the representativedata lengthwe have to calculate
first.And the remaining thing is same, whatever we havediscussed for the one sample test,
that is, if we take thatthe significance level is 0.05 say, then this quantity be equate
with this 0.95then, it will again come thatsame value which is 1.36.So now, thecritical value
that is that D m n should be less than equals to 1.36 divided by the square root of this
full quantity, earlier it was square root of n,now it is square root of this m n by
m plus n that is thedifference.
So,we will take one example here, the table showingthe modulus of rupture data for two
different groups of timber is shown in the next slide.Supplier delivereditems in two
lots.The first lot consists of the 50 samples and the second lotconsists of 30samples.Both
the lots were supplied by the same supplier and the second lot is claimed to be superior
to the first lot.Apply the K S test,the kolmogorov- Smirnov test, two sample test, to verify whether
the two samples are from the of the same type or not, that is, whetherin the statistical
sense,we should say that whether both the samples are fromthe same population or not.
So, this is what we have to decide and this50 samples and 30samples for all these timbers
what is the modulus of rapture is shown in this table.This is the first lot that is lot
A, this modulus of rapture is given in Newton permillimeter square. So, this is35.3and like
this, you can see that this is 5 by 10 columns,so all these data refers to the modulus of rapture
for the 50 samples supplied in lot A.
Similarly,in the lot B, these are30 data point of this modulus of rapture in newton per millimeter
square which is supplied in the lot B. Now, we have to test whether both thisdataset is
the sameis from the same population or not.
So,the null hypothesis is that the random variablessampled by thefirst50 values and
the random variables sampled by the next 30 values, have the same distribution or not.And
so and the alternative hypothesis are the random variables have the different distributions,
so whatever we havehypothesized in the nullhypothesis is not valid. Level of significance here is
0.05 and the calculation that we have to do is that the data from the each sample are
ranked separately,so we have to make it, we have to sort it.In the last example, it was
already sorted and that datawas supplied, but here we have to sort it,firstwe have to
give the rank and from there we have to calculate their respectivecumulative distribution,so
both are thatstep functions.
Thesamples are sorted in an increasing order and the ranked accordingly for both the samples
G m X and this S nX are determined as shown in this table in the next slide.And then,
the step functions of the both samples are plotted,the maximum absolute difference between
the empirical distribution is then determined.
So, thebasic steps what we have seen in the singlesample and this two samples are same.So,
this is for the first lot, andyou can see this rank 1, 2,this is ashortenedthis is a
table is shortenedjust to accommodate in a single slide that the rank is 1,2,3,4 and
this continuing up to 22,23,24,25,26,27 and going up to 50.
Now, thisis thatvalue of thatk by n, that is, that rank m by n,so it is 0.02,0.04, and
0.006 like thatfrom this0 it will go on, and it will come to that 1. So, this is for this
lot A, and similarly,this is for the lotB, and if we plot this one, it looks likethistwo
plots arethere, one is that shown byred and the other one is thatblue.Now, the maximum
difference we can observe that thedifference and we can get it and this maximum difference
is found to be 0.12 and as I told that now, there are two samples,one is that m is the
50 data samples and n is the 30, so if we just get it becomes at 18.75,so we have to
see what is the critical value against this sample size of 18.75.
And here if you see, that this sees approximately we have taken is 19 andobviously that for
the proper value.We can go for this linear interpolation between 18 and 19,but here we
have just taken this 0.031 as against this n equals to 19 and this D n 0.05, which is
this significance level - at 5 percent significance level.So, thisso 0.12 thatis what the maximum
difference that we got here is now less than thecriticalvaluesof this point -critical value
not s - so critical value of 0.301 which I have seen from this table,so thus the null
hypothesis cannot be rejected at 5 percent significance level.
So, that means, both the samples are basically from the same population,so what the supplier
has claimed that the second sample issuperior than the first one is not validateat least
at this5 percent significance level.So, in this lecture, we havediscussed twostatistical
test to test whether now, what type of particular distribution if particular sample is following
throughtwo statistical test,one is the chi square test and second one is the K S test.
Generally, the chi square test can be applied for both that discrete and continuous random
variable and this K S test isforthe continuous random variable, butmost of this application,
whether the dataset is following the normal distribution or not that we test,and also
we test that whether the two samples that we are having whether, they are following
the same distribution or not,that is what we can test using this K S test.
So,we will take up someothertest in the next lecture; thank you.