Mod - 06 lec - 25 markov vector approach - 5, monte carlo simulation approach - 1

In the previous lecture, we studied the question on how to characterize first passage times in the non-linear oscillators, which are driven by white noise. And to summarize what we have done till now, the basic results of Markov process theory has applied to dynamical system is displayed here. So, this is the general stochastic differential equation with multi-dimensional; this is the drift, this is the diffusion terms, and associated with this, we get the so call forward Kolmogorov equation or the Fokker Planck Kolmogorov equation - f p k equation. As based on this, we can derive the moments, equation governing the moments, that is expected value of h of X of t comma t and this equation can be derived. This is equation for moments at a single time t, but you can generalize this and get equations for moments involving two times t and t naught, and that equation again is another partial differential equation. We introduce an intermediate function call nu, which is function of x 1 x 2 t comma t naught and that is satisfied by this equation; this again the backward Kolmogorov, this is forward Kolmogorov operator. So, along for each of these equations, for the Fokker Planck equations, moment equations and this we need to write the appropriate boundary condition. Now, by taking t naught as a independent variable, we can derive an equation known as the backward Kolmogorov equation, which is useful in the study of first passage times. And we if we define a survival reliability function R, which is actually probability distribution, probability of first passage time T greater than t minus t naught condition on X of t naught is x naught tilde; this function is again satisfied by this partial differential equation, based on which, we get a sequence of equations for the nth order moments of the first passage time. So, M n is T n, where T is the first passage time. So, the Markov process approach essentially provides us with shoot of tools, which can be used to characterize stochastic response of non-linear dynamical systems. In many situations or in at least in a few situation, you can get an exact solution, either for the governing Fokker Planck equation or for the moments. In when other situations, more general situations, this forms the basis on which we can develop approximate methods for solving the problems. One such approach we were discussing, namely, how to develop models for envelope and phase of the response processes, using what are known as stochastic averaging principles. So, I briefly outline the averaging concept, as applied to deterministic systems in the previous talk. And here, we have a non-linear free vibration, u double dot omega naught square u minus epsilon this, and we made a transformation a displayed here, where a and beta is the amplitude or the envelope; beta is the phase, there taken to be slowly varying functions of time. And based on that, this is the exact transformations. we derived the governing differential equations for evaluation of a and beta. And then, we replace the right hand side by their average over cycle, and while doing that averaging, we treated a and beta as constants; and that assumption was based on the fact, that when epsilon is small, the change in a and beta over a cycle will be quite small compare to change in trigonometric functions like, cos phi, sine phi. So, that helps us to simplify, eliminate this oscillatory terms deterministic oscillatory terms from our model for envelope and phase, and we may be able to get simpler models for system response. Now, we will consider the question of extending this argument to randomly driven systems. So, I consider now, x double dot epsilon square h x comma x dot plus omega naught square x as epsilon z of t; this z of t is not a white noise; it is however taken to be process is zero mean and auto covariance R z z tau, and it is taken to be broad banded. When do you say an excitation is broad banded? It can be specified either in terms of the band width of the power spectral density; we saw the band width of the system transfer function or in terms of the time constants associated with auto covariance functions and impulse response of the system. So, suppose if you say characteristic time constant of excitation is much greater than the characteristic time constant of the system, then we can say that it is a broad band process; I think it's much less. Now, time duration over which R z z of tau decays to 10 percent of its maximum value, can be taken as the characteristic time constant of excitation. For system, it can be taken as, time duration over the impulse response decays over by 90 percent's. Now, we are taking this parameter epsilon that appears here to be small; and if indeed it is 0, and if there is no excitation, we get the solution as a cos phi a omega naught sin phi. So, this is the case for no excitation and no damping also. But when there is light damping and weak excitations, we stick to the functional form of this solution namely, x of t a cos phi and x dot of t is a omega naught sin phi, where capital phi is t omega naught plus small phi. For epsilon naught equal to 0, when epsilon is naught equal to 0, two things happen; one is system become nonlinearly damped and other is it get driven by a broad band excitation. So, we still assume that the functional form of this equations remain the same, except that the quantities a and phi are now taken to be slowly varying functions of time. So, based on this, I will now there is no again... let me emphasis, there is no approximation as of now. You can implement this transformation, and get these equations for a dot and phi dot. So, z of t is now multiplied by sin phi and cos phi; these are actually quite intimidating set of non-linear differential equations. You can see there are trigonometric terms and there is 1 by a sitting here, and this a multiplies our random excitation, and so on and so for. So, by introducing this transformation, we seem to have made the problem more complicated. But if we now bring in the fact that epsilon is small and a and beta are slowly varying, then certain simplifications become possible. The averaging here is done in two steps: one is the deterministic averaging, where we replace regular oscillatory terms by their time averages, as we did in case of deterministic averaging; we also do another component of averaging known as stochastic averaging, where we replace randomly fluctuating oscillatory terms by delta correlated processes. That means, excitations which are not truly white noise excitations, where replacing them by equivalent white noise excitations; what is that equivalents, is what this method, the theory behind this method establishes. The first stage follows the procedure used in deterministic averaging. The second stage is based on the application of what is known as stratonovich khasminiski theorem. So, what this theorem says is, if you have a non-linear system, which is excited by a broad band excitation Y of t, and this parameter epsilon is small, this X of t is n cross of vector of response processes and Y of t m cross one vector of random excitations; and we assume that, mean of this excitation is 0 and Y of t is broad banded. According to the stratonovich khasminiski theorem, the above equation can be approximated by a stochastic differential equation, where we are now replacing the band limited random excitations by equivalents increments of Brownian process, and also, we are simplifying the functions f and g by doing certain time averaging. So, that time averaged and stochastically averaged equations are given here; this drift is given by this and diffusion co efficient is given in terms of this. So, while carrying out this averaging, the quantities a and this f, this quantities X here are treated as constants. So, there is this paper I reference in the previous lecture; so, this is a review paper, which explains the basics of this method. Now, first level of exercise is to make the transformation; these two equations are equations for amplitude and phase, in terms of the random excitations, and the non-linear dissipation and other terms; these are exact. Now, if you do the averaging, that is the two stage averaging, the resulting stochastic differential equation will have this form, da of t is some F of a and this is a constant. And now, I get a increment of Brownian motion process with an equivalent multiplier. Similarly, the phase equation is govern by this, s d this function F of a and G of are given by this equation, which are quite similar to what we did it in deterministic averaging. So, you now look at these two equations; you can recognize that, the vector a and phi have Markovian vector - Markovin character. We look carefully at equation for a of t, there is no phi of t in this; that means, a of t itself is a Markov process, but you look at phi of t, there is a here. So, phi of t needs on its own is not Markov, but you look now at the vector a and phi together, that is Markovian. So, this is the case in which the response vector has a components, which has Markovian property; another components which does not have Markovian property, but taken together, they have Markovian property. Now, once this simplified equations is set of equations are obtained, I can take two routes now. One is to look at only the envelope and write the forward equations, forward Kolmogorov equations, backward Kolmogorov equations, the equation for first passage times, the moments, first passage times, and so on and so forth. Alternatively, I can consider the joint density function between a and phi, and develop the forward equation, the backward equation, the moments equations, the generalized punter-ion-wit equation, etcetera, for those problems. So, that would mean this method of stochastic averaging has now enhanced; our ability to study non-linear oscillators, which are now driven not necessarily by white noise, but by broad band excitations. So, they method establishes a equivalent Markovian model for the system behavior, which is not in reality Markov. So, this is the FPK equation for the joint density of a and phi for this function transition PDF. And if you are interested in steady state, dou p by dou t vanishes; if it is admissible and I get the reduced equations properly. This you may be able to solve lot more easily than this equation. Or you can consider now a of t itself as a scalar Markov random process, and write the FPK equation governed by p of a comma t a naught t naught; and in steady state, this becomes a ordinary differential equations, because now there are only two independent variables t and a. So, in steady state, it becomes ordinary differential equation and I can solve for the envelope of the response process in steady state. So, this is one of the major achievements now; I have got now probability distribution function of the envelope process of a non-linear systems, which is driven by broad band excitation. So, our ability to model envelopes has now considerably enhanced; this an exact solutions within the framework of approximation, that the method of averaging introduce. Now, you look at the two equations in steady state; this is equation for p of a phi a naught phi naught, whereas this equation this p is a condition on a naught; this is steady state. Now, although I use same notations p, you must understand at this equation is for a of t and this is equation is for a and phi. Now, if you now look at a possible solution which is in variable separable form, you can verify that, the first of these equation is satisfied by p of a divided by 2 phi, where p of a solution of this equation; that would mean, the amplitude, the phase is almost characterless here; it is uniformly distributed 0 into 2 phi. So, one moment you solve this ordinary differential equation, actually this ordinary differential equation, because a is the only independent variable, other problem is completely solved, because if you substitute this and look for phase being uniformly distributed in 0 to 2 phi, you substitute this solution into the first of this equations; this equation is satisfied. So, therefore, you can also get an approximation to the joint density function between the process and its time derivatives plus when velocity, because I have now model for a and phi; I can use method of transformation of random variables, and 1 by a will be a Jacobean and I can get this. And this also enables us to get the model for the parents process and its derivatives. So, again, I have a handle on the response of the non-linear system to broad band excitations, I have solved envelope process - phase process; and from them, I am now getting the probability density of the parents process, although if I directly try to write the, I mean, originally the parent process and its time derivatives do not constitute a Markov process. Now, however, transient solution, the transient solution for p of a t conditions on a naught t naught can be developed based on a method of variable separation; we will get, we will have to solve on the way an Eigen value problem, which leads to Eigen functions in terms of hypergeometric functions, and we can like laguerre hermite and that type of functions, and we will be able to solve this problem. Now, in the illustration that I have done, I have consider only external excitation. But suppose there is a parametric excitation, that two random excitations, zeta of t and s i of t; both are broad banded, zero mean, may be dependent, could be independent; actually, the method of averaging can be used for this also. Now, in the special case, where system is linear; so, what I am taking about, u double dot plus 2 eta omega u dot plus omega square u is equal to z of t. Now, I know that by a independent study, that the envelope of this process is rarely distributed. If f of t is broad banded, excitation is broad banded, the response is narrow banded; and for narrow banded process, the envelopes are rayleigh distributor. Now, the same result I get through averaging; I indeed get a rayleigh density function for this system, and the phase is uniformly distributed between 0 to 2 phi, and this sigma square is equivalent strength of the white noise; there is no white noise in this problem, because z of t is the band limited excitation. But the method of averaging replaces the excitation - band limited excitation - by an equivalent you know white noise excitation. So, the strength of that white noise is provided in the calculations of this method. Now, you can compare this with the envelope and peak distribution obtained earlier, by using you know convolution integral approach, and so on and so forth. Now, what happens if stiffness is also non-linear? I have been talking about non-linear damping; stiffness is non-linear and of course, I also added for variety parameter excitation term also, the definition of envelope now will not be in terms of trigonometric function. You saw this, that the envelope V of t can be defined in terms of the energy, that is kinetic energy plus the potential energy in the system, and we can develop a Markov model for this envelope. I am not going into get into details, I already given a reference. I am trying to provide certain key ideas, which would help you to read the relevant literature. So, in summary, what we can say is, the method of stochastic averaging enables us to study envelope and phase processes associated with weakly non-linear system response to broad band excitations. The method also provides a framework to study first passage problems for the response envelope. The method is best suited to the study of single degree freedom systems. It leads to many interesting results, which provide very useful inside into non-linear system behavior. Although there applicable to small scale systems, the insides that we gain from this are quite valuable in understanding behavior of larger systems. with this, I conclude the discussion on the use of Markov process theory in the evaluation analysis. We now move on to a new topic, which is Monte Carlo simulation methods in stochastic structural dynamics. So, as the name suggests, it essentially a numerical approach to a random evaluation studies. So, what it does? So, imagine the equilibrium equation of a dynamical system of mx double dot plus cx dot plus g of x comma x dot is f of t; this f of t let it be a random excitation is a random process. Since excitation is the random process, the response process is also a random process. So, we have f of t going to the system and producing x of t; if f of t is a random process, x of t is also the random process. What we have being doing till now, is to characterize x of t in terms of the properties of f of t, in terms of for example, if the system is linear, if I know the power spectral density function of excitation, can I get the power spectral density function of the response in steady state? Or if I know the covariance function of f of t, auto covariance of f of t, can I get an equation which governs auto covariance of x of t? So, here, I will be basically using the governing equation of motion; the transmission of uncertainty in f of t to the response is essentially through the mechanics of the problem. Now, in Markov vector approach, what we did; we assume excitation to be white in nature. So, we were aiming to get the probability density functions, first passage times, and its probability density function, moments, etcetera, directly from the governing equation and find that they have Markov properties. Now, in Monte Carlo simulation approach, we take an alternative route. So, here, what we do is, we treat excitation is the random process, that would mean, this random process is actually ensemble of time histories. So, this continues; all though I shown if you here, this is an infinite ensemble. For any single realization of this ensemble, suppose if I take this realization, I can solve this equation numerically and get the sample of the response; that means, I am essentially carrying out a deterministic analysis, as for as the realization of excitation is concerned. I take a realization of excitation, solve this equation using standard methods of solving ordinary differential equation - numerical methods, get the time history of the response. I repeat that for every realization of f of t; for corresponding to these ensembles of inputs, I get an ensemble of response time histories. From this, what I do? So, one of the task that I should do is, I should be able to produce an ensemble of inputs, obeying prescribed models for f of t; see, f of t is a random process, means, it is completely specified by its n th order joint density function. When I say that I have an ensemble of inputs, which corresponds to realization f of t, means, that I am able to produce time histories of f of t, which obey the prescribed probabilistic model for f of t. So, if f of t is say a gaussian random process with a specified power spectral density function, zero mean stationary random process with the specified power spectral density function, the problem would be, how do you simulate samples of such a random process? That would mean, if I were to simulate these ensemble of time histories, and on these ensemble of time histories, if I were to do statistical processing of these time histories and estimate the power spectral density function. I should get back the target power spectral density function, which I had at the outside, fine; that mean that is one of the capabilities that we have to develop. How to generate samples of random quantities, which obey the prescribed probabilistic loss; that is one of the important components of Monte Carlo simulation study and this has to be done digitally on a computer. Then, we simulate the system dynamics through the solution of the system behavior, which could itself be a, for example, a finite element model. This could be a wind on a cooling tower, and I have a finite element model made up of beams plates shells and soil, and more media, etcetera. And I can do a sample calculation and get the response of the systematic given location. So, at the end of the this exercise, I get ensemble of outputs, but I am interested in statistical properties of x of t, I am interested in knowing what is power spectral density function of x of t, what is first passage time associated with x of t, what is the probability distribution of first passage time associated with x of t; that would mean, after getting this ensemble, I have to perform one more exercise, where I have to process the ensemble of outputs using statistical tools and arrive at probabilistic model for x of t. Based on these probabilistic models, I will able to take a decision on the system behavior, right. So, task one is to simulate samples of f of t obeying prescribed probabilistic loss. How do you test that you actually succeeded in doing? We have to use methods of statistics. You simulate samples of time histories and get ensemble of response time histories, and you have to get a probabilistic model for this. How do you do? You have to do statistical processing of this ensemble. So, the subject of statistics underlies forms one of the main components of Monte Carlo simulation studies. So, we will split this discussion Monte Carlo simulations into four aspects: one is how to simulate realizations of random quantities, which obey prescribed probabilistic loss; second, after having simulated those realizations, how do you test that you are succeeded in doing so; Third, corresponding to ensemble of excitations and system parameters, how do you solve the governing equation for the system behavior and get an ensemble of time histories of the response; if this system behavior is characterized in terms of say a set of stochastic differential equations, how do numerically simulate samples of response of a system which is governed by an SDE. So, that is third aspect that we need to investigate. The fourth aspect is of course, after getting the ensemble of response, how do we get the probabilistic models for the response process? So, I will begin this discussion by considering questions on statistics. Before that, we could also look at the problem of simulation from a slightly different perspective. Consider the problem of evaluation of a definite integral i a to b f of x d x; this I can rewrite as, I will multiply divided by b minus a and I will write it as b minus a a to b f of x 1 by b minus a dx. Now, if you interpret 1 by, what is this function? this is a, this is b and this height is 1 by b minus a. So, what is this area? This is 1; this function between a and b represents the probability density function of a random variable, which is uniformly distributed between a and b. So, if that probability density function I write it as p X of x dx. Then, this integral can be interpreted as expected value of f of x with respect to this density function; that is, I is b minus a into expected value of this and this, this expectation is with respect to p X of x, which is uniformly distributed in a to b. Now, this expectation is nothing but an average and that average can be return as, 1 by N i equal to 1 to N f of X I; that would mean an integral which is a deterministic quantity now is being evaluated as an expectation of a random variable which is f of X, where X is uniformly distributed in between a and Now, let me just to illustrate, suppose I am interested in evaluating 0 to 1 x square dx, I know the answer is 1 by 3. Now, for a moment, let us assume that, I have a method to simulate samples of x of t on a computer x, our computer such that, p X of x is uniformly distributed in a and b, that I have to still demonstrate, but let us assume that capability exist with us. So, what I was shown here? There red line is the exact solution; this value of the integral on y axis and these samples that we use in evaluating this expectation through simulation is shown on x axis. So, as you see the smaller samples, as a fluctuation and samples become large 1000 samples, we are hovering very closely around the exact value which is 1 by 3. Now, if I repeat this exercise with another set of a samples, I get a similar curve; not exactly the same curve, because I am using different random numbers. Thus, they also show a similar behavior. Now, if I take 500 samples and do 500 runs, in each run, I will take 500 different samples, and I perform the evaluation of I and plot it as a function of different runs; there are 500 runs, and each runs, there are 500 samples, I get this kind of trajectory; that would mean, integrals can be described in probabilistic terms and we may have a way of evaluating integrals through Monte Carlo simulations; indeed one of the major application of Monte Carlo simulations lies in evaluation of multi fold integrals. And since reliability is essentially in some sense evaluation of multi-dimensional integrals, you can easily see that Monte Carlo simulations have important role to play in these exercises. So, here, what I have shown is, I would treated this, I that have got here as a random variable and plotted its probability distribution function. It turns out that, the blue line is the normal probability distribution function and red line is the empirical probability distribution function of a realizations of I; and it turns out that the nearly Gaussian. These are empirical observations, I have to put all these matters on a proper footing. So, we will begin that discussion now. So, I will again re-capsulate what are the major ingredients of Monte Carlo simulation. First, methods for generating samples of excitations and system parameters compatible with the prescribed probabilistic models. Next, to test statistically if the generated samples indeed obey the prescribed probabilistic laws. How do you know what you are doing is right? So, that have to do. Then, third one is a computational model for the system dynamics, which accepts samples of inputs and system parameters produced in the previous steps and generates an ensemble of response quantities. This as its already set could be a finite element model or a numerical model for solution of a stochastic differential equation; it could be any model that is based on the physical laws, that govern the system behavior. The last step is a statistical processing of ensemble of response time histories and inferences on system behavior, where based on which we make decisions. So, there are two places, that is, namely this step and this step, where we need to understand how to process data. So, we will begin with a review of elements of statistical methods, just to place all the terms and techniques in a unified framework. So, the word statistics can be used in different senses; it can be used in plural, to mean data like data on birth, death and marriage so called vital statistics or it is a subject - science of statistics; it is used in singular, but what we do is, we use statistic in singular to denote a random variable statistics is a set of random variables. So, it is in this sense that we are going to use this word. Statistics is a subject is a singular, but this form a, it is a set of random variables; so, what it mean? What does that mean? We will have to build a set of terminologies, vocabulary to describe these issues. So, I will briefly introduce them; the word average, we consider average is single number that describes the data. So, the data is a collection of numbers, which describes some property of certain physical phenomena; for example, if we have a material, we will describe it by its density, viscosity, stiffness, strength, etcetera. So, if we have a data, what you do? You talk about its mean, geometric mean, arithmetic mode, median, percentile, minimum, maximum, variance, skewness, kurtosis, cumulative distribution, correlation, etcetera. So, these are descriptors of data, which conveys certain specific meaning; for example, if I say material as viscosity of certain units, certain number, it means something in the same sense, if I say skewness of this data is so much, it should convert some specific meaning. We use the term population; suppose, we consider a campus with 5000 persons, we can measure all their heights, and I get 5000 numbers: X 1, X 2, X 5000; I can measure the weight, I get Y 1, Y 2, Y 1000; I can look at the income, I get another set of Y 5000 numbers. I can look if they wear specs or not, I get a sequence of yes or no 5000; gender: male, female, 5000 number. So, in study of statistics, each of this is a population; this a population of heights; this a population of weights; this a population of gender, so on and so forth. So, population is synonym somewhat analogues to the sample space, that we used in probability and defining probability. So, it is a collection of all possible observations on a particular characteristic with respect to the problem on hand. This is starting point in statistics; it analogues to sample space in probability. Any collection of measurements capable of being described by a random variable constitutes a population. This can be taken as a working description of a population; that means, the outcomes that we talk about should be... the numbers we talk about, we should be able to constitutes them as outcome of a random experiment and hence a random variable; that should be born in mind. A sample in our studies, it is not practically to studied all members of the population; therefore, we talk about sample. Sample is a part of the population which we want to study and draw conclusions about property of population. We want to study average income of people in this country; one approaches to ask everyone get that number or sample take 10000 samples, a sample of 10000 persons and get that an estimate for the average, and see, how good you are in your estimate. It is not enough to say that sample is a subset of population; the subset needs to be representative. The word sampling denotes the procedure of drawing samples. And the word sampling design describes a development of a sampling procedures to meet a specific requirement. So, we use these terms sample sampling, sample design, etcetera. So, we need to have some working definitions or descriptions of these words. Now, the word random sample, we can give a fairly rigorous definition for that. So, let X be a random variable with probability density function p x of x, and we consider an iid sequence X 1, X 2, X 3, X n, with a common probability density function p x of x; this set of random variables X 1, X I, i running from 1 to n is called a random sample of size n of x. Now, if you now consider a transformation S, on these n random variables S of X 1, X 2, X n, that is the function of X 1, X 2, X n; this function is called statistic. This is a random variable, because it is a transformation on X 1, X 2, X 3, X n. We will write the probability density function of X in the form p x of x colon theta, where theta are parameters of the probability density function; for example, in a normal random variable, we have m and sigma; in poisson, we have lambda, so on and so forth. The joint probability density function of X I, i running from 1 to n is of the form, this is X i's are all independent; it is i running from 1 to n p x of x i theta. Now, this states X i can be interpreted as values of observed data, taken from the random sample; that means, it is a realization of these n random variables, is actually what can be interpreted as a measurement that you make. An estimator of theta is a statistics S, theta is a parameter; so, essentially, the estimator is a random variable, which is a function of your random sample, and that we denote as theta S of X 1 comma X 2 comma X n; so, I am describing the word estimator. For a particular set of observations, say, X 1 is small x 1, lower case x 2, lower, etcetera, the value of this estimator is called an estimate of theta and is denoted by theta cap. So, estimator therefore is a random variable, but estimate is realization of the estimator; that means, it is a realization of this random variable. So, we will be using these terms, so you need to have some clarity on what exactly they mean. Just to fix the idea, we will consider a hypothetical problem, where there is a population with only 4 numbers, say, 1, 2, 3, 4. Now, if you now take samples, suppose I draw only one sample, you are allowed to take only one sample and tell me what is average of this; what is average of this? This is 2.5; this is 1 plus 1 plus 3 plus 4 divided by 4 is 2.5. If you draw one sample, if you get one, you will report one as an answer; if you where to draw two report, two as an answer, so on and so forth. Now, if you are allowed to take two samples, how many such samples we can draw? 1 2, 1 3, 1 4, 2 3, 2 4, 3 4; depending on which of these samples are actually drawn, you get the estimate for mean as 1.5, 2, 2.5, 2.5, 3, 3.5, etcetera, right. So, the estimate that we are getting is therefore random variable and the probability distribution of that is known as sampling distribution; that is, first term is sampling distribution. Now, if N equal to 3, you can now the number of ways in which sample reduces, you can draw 1 2 3, 1 2 4, 2 3 4, 1 3 4, and we get these numbers as a answers; of course, if you take N equal to 4, there is only one sample that you can draw and you get an answer 2.5; and we will now look at the last two entries in these tables. Now, look at what happens if I take two samples, I can draw samples in 1, 2, 3, 4, 5, 6 number of x and I get these answers; these are all estimator answer to the question that we are asking; 3.5 as good as 1.5, which is as good as 2.5, because you are drawn two samples. Now, if you take average of this, I get 2.5. And you look at its standard deviation, if you are taking only one sample, the average is still 2.5, but the standard deviation will be 1.29. I will explain how this standard deviation is computed in due course, but you can assume that it is based on these numbers. Here, the sample size has increased and standard deviation has reduced; the samples are not become 3, the mean still remains to be 2.5, but the standard deviation has becomes 0.43. In the last case, the sample size reaches its possible maximum value, and mean is 2.5 and standard deviation is 0. So, what is happening here? We will summarize now; if you say capital T is a estimator 1 by n i equal to 1 to X I, the probability distribution function of T is known as sampling distribution of T. The fact that the T is random variable is very clear, because X 1, X 2, X 3, X n are an random variables. A realization of T is known as an estimator; this is an estimator; estimator is a random variable; estimate is a realization. The estimator is said to be unbiased, if the expected value of this random variable T is actually equal to the population mean. Is it true here? Yes. As we saw here, if you take the mean of this random variable, I am getting 2.5, which is indeed the population mean which is not known to me. We say that, such an estimator is unbiased; an estimator is unbiased, if its expected value is exactly equal to the population quantity that you are trying to estimate. Now, one more thing that we saw was, as n was increasing, a sample size was increasing, the standard deviation of the estimator was reducing. So, the estimator in that case is said to be consistent. So, the estimator is said to be consistent, if as limit n tending to infinity variance of T goes to 0. So, that means, you draw more samples, you are more sure of your answer; that dispersion in your answer goes on reducing. Estimation is a finding a realization of T as an approximation to a population parameter; that would mean, what it involves, this the population, suppose with N equal to 2, I take the number say 2 4 and compute this 2 plus 4 divided by 2 which is 3 as an as the answer, for this unknown 2.5 which is not known to me; I take 3 as an approximation to 2.5. This process of drawing a sample and computing, this is known as estimation. Now, let us consider the problem of estimation of mean. So, let X be a random variable with probability distribution function P X of x, probability density function lower case P X of x, a mean mu and standard deviation sigma. We will form a sequence of identical, independent and identically distributed random variables X I, i running from 1 to n with common PDF P X of x which is given here, in the specification of X. What this means is, X I is independent of X j, for all i not equal to j in 1 to n. Since X i's are all identical, expected value of X i is mu, variance of X i is sigma square and p X i of p X is p X of x, for all i 1 to n, because there are iid's. Now, let theta, i running from 1 to n, a i x i be an estimator of mu; mu is a property of the population which is not known; theta is an estimator. Now, estimator is said to be unbiased, if expected value of theta is equal to mu. When that will happen? You consider expected value of theta; this is expected value of i running from 1 to n a i X I, but expected value of X i is mu; therefore, I get this mu i equal to 1 to n. If this has to be equal to mu, it means the summation of this a i must be equal to 1; that means, if you select an estimator, where the weights a i added to 1, then the resulting estimator is unbiased. Unbiased is a desirable property, we can start with that notion, but this is not unique; there are so many ways in which I can select a I, so that i running from 1 to n, a i is equal to one. Now, what I will do now is, I will consider variance of theta; variance of theta is variance of this summation, and since X i is are all independent, variance is a i X i minus mu summed over i to 1 square and we can show that variance indeed this. Now, what we do is, to make the unbiased estimator unique, we will select a i show that it minimizes this variance. Since theta is being used as an approximation to the population description mu, say, mean, population mean is not random, it is a deterministic quantity. So, you are approximating a deterministic quantity via random variable. So, the best that you can do is to minimize the variance of random variable; that is the logic beyond minimizing the variance here. So, variance of theta is given this; I want to minimize variance of theta given by this; subject to the constraint that I want an unbiased estimator. So, what we do? We form the langrangian sigma square, i equal to 1 to n a i square plus lambda; this is the constraint equation, i equal to 1 to n a i minus 1. Now, for optimality, dau l by dau a i must be equal to 0 for i equal to 1 and dau l by dau lambda must be equal to 0, and that basically that condition optimality of L with respect to lambda essentially gives our constraint equation; that is the idea behind doing introducing langrangian multiplier. So, we go through these calculations, dou l by dou a k equal to 0, would mean 2 sigma square a k plus lambda equal to 0, for k running from 1 to n, and from this, we get a k to be minus lambda by 2 sigma square; dou lambda by dou lambda dou l by dou lambda equal to 0, I can recover the constraint equation. And in this if I put this a k as minus lambda by 2 sigma square, I can eliminate, that an I can get lambda in terms of sigma square and n as shown here, and I get a k to be 1 by n; that means, all a 1, a 2, a 3, a n, there are all 1 by n. If n is a sample size, all a k's are one by n; therefore, the optimal corresponding to this optimal values, the minimum variance that I get is sigma square by n. So, to summarize, theta is equal to 1 by n, a i 's are all 1 by n, they come out of the summation; i equal to 1 to X i is an unbiased estimator of mu with minimum variance, and the lowest variance is sigma square by n. So, whenever we are given a set of numbers and ask to find an average, this is what we intuitively do, and that actually has this desirable properties that that procedure leads to an estimate, which is estimator which is unbiased and has its smallest variance. Now, another approach to parameter estimation is what is known as maximum likelihood estimation. Let X be a random variable with pdf p X x colon theta, and theta is a vector of parameters of the distribution; for example, if X is normal, I have sigma and mu as parameters. For the moment, let us assume that theta is known, and let X i be an idd sequence of random variables with the common pdf given by p X x colon theta; this theta is given now. Now, let us consider the joint density function of x 1, x 2, x n and that is given by, since X i 's are all independent, it is p X 1 evaluated x 1, p X 2 evaluated x 2, so on, and so forth. And using the product notation, I write it as product i equal to 1 to n p X i x i colon theta. Now, just for a illustration, let x be exponentially distributed, so that p X of x colon lambda theta is only one. Now, this is lambda, I get lambda exponential minus lambda x, this is the probability - common probability density function - and this product will be lambda to the power of n exponential minus lambda, i running from 1 to x i, where x i's are all greater than 0, and i raise from 1 to n; here, I have already said that theta is taken to be known. What is this product? Product of density functions, suppose I multiply this by dx i, this product is nothing but the probability of X 1 lying between X 1 plus dx 1 intersection X 2 lying between X 2 and X 2 dx 2, so on and so forth, right. Now, let us now consider the case, where theta is unknown and let us assume that we have observed a sample denoted by lower case x 1, x 2, x 3, x n. Now, what I do? I interpret this product as the likelihood of making the observation x i, because of this property, that this product is indeed the probability. So, the word likelihood is being used in the sense of a probability. Now, theta is unknown; so, it is a function of observed samples and the unknown parameter theta. Now, for what value of theta, this likelihood is maximum is a way to find theta now. So, what we do is, we introduce the definition; the maximum likelihood estimator of theta is the value of theta, for which this L theta condition a x 1, x 2, x n is the maximum, and this L is known as a likelihood function, which is the product of the probability density function evaluated at the sample points - observed sample points. Just to give an example, let X be a exponential random variable and suppose I have made observations t 1, t 2, t n, so the L function will be the likelihood function will be lambda to the power of n exponential minus lambda, i running from 1 to n t i. We have to maximize this; so, we take logarithm; so, this is known as log likelihood function; so, this we get as this. Now, let lambda had maximize this function; so, that means, do by dou lambda L n of this is 0. And if I do now the calculation, first term gives n by lambda, the other one gives summation t I; and from this, I get 1 by lambda as 1 by n, i running from 1 to n t i. Now, quickly recall, what was an expected value of X? This is consistent with our definition of expected value of X; that means, if you observed t 1 t 2 through t n and find the unbiased estimator with minimum variance, you will get the same formula for the expected value of X. So, this is consistent with that. How about normal random variable? It has two parameters: mu and sigma. So, the likelihood function is product of these density functions evaluated at these observed points t 1, t 2, t 3, t n and this is the likelihood function. And if I take logarithm, it becomes, I mean, it simplifies the matter a bit and our parameters of interest are buried here, and here and here. So, I have to minimize this maximize this with respect to mu and sigma. So, if I do that, I use dou by dou mu equal to 0, dou by dou sigma of this L n function is 0; and I get this as the maximum likelihood estimator for the mean, maximum likelihood estimator for standard deviation. So, these are yet another approach for finding parameters of a probability density function, based on observations. Now, I have mentioned that estimator is the random variable; to be able to assess how good is our estimation, we should characterize the probability distribution of that estimator. So, just to give an introduction to that, if you consider the estimator theta as 1 by n i equal to 1 to n X I; theta as we know is an unbiased estimator of mu with variance as this. Now, if X is Gaussian, X i will also be Gaussian; and you are adding gaussian random numbers, so theta will be Gaussian. So, sampling distribution for theta as per this description is gaussian with mean as mu and variance as sigma square by n. If X is not gaussian by using central limit theorem for large samples n being large, we can still make a have postulation that, sampling distribution for theta is still Gaussian. Now, based on our knowledge of probability distribution of theta, what can we say about the accuracy of estimation? This is a question we will consider in the next lecture; that is, how to get the sampling distribution for different estimators, say, estimators for mean, estimators for variance, estimator for correlation function, estimator for power spectral density, so on and so forth? That is one exercise. After having got that, how do you use that information in assessing the accuracy of estimates? We will consider these questions in the next lecture; we will conclude this lecture at this stage.