Physically this corresponds to adding the noises or signals represented by the
original ensembles of functions. The following result is derived in Appendix 6.

Theorem 15: Let the average power of two ensembles be $N_1$ and $N_2$ and let their entropy powers be $\bar{N}_1$ and $\bar{N}_2$. Then the entropy power of the sum, $\bar{N}_3$, is bounded by

$$\bar{N}_1 + \bar{N}_2 \le \bar{N}_3 \le N_1 + N_2.$$
White Gaussian noise has the peculiar property that it can absorb any other
noise or signal ensemble which may be added to it with a resultant entropy power
approximately equal to the sum of the white noise power and the signal power
(measured from the average signal value, which is normally zero), provided the
signal power is small, in a certain sense, compared to noise. Consider the function
space associated with these ensembles having n dimensions. The white noise corresponds
to the spherical Gaussian distribution in this space. The signal ensemble corresponds
to another probability distribution, not necessarily Gaussian or spherical.
Let the second moments of this distribution about its center of gravity be $a_{ij}$. That is, if $p(x_1,\ldots,x_n)$ is the density distribution function

$$a_{ij} = \int\cdots\int p\,(x_i - \alpha_i)(x_j - \alpha_j)\,dx_1\cdots dx_n$$

where the $\alpha_i$ are the coordinates of the center of gravity. Now $a_{ij}$ is a positive definite quadratic form, and we can rotate our coordinate system to align it with the principal directions of this form. $a_{ij}$ is then reduced to diagonal form $b_{ii}$. We require that each $b_{ii}$ be small compared to N, the squared radius
of the spherical distribution. In this case the convolution of the noise and signal produces approximately a Gaussian distribution whose corresponding quadratic form is

$$N + b_{ii},$$

and the entropy power of this distribution is approximately

$$\left[\prod_i (N + b_{ii})\right]^{1/n} \doteq N + \frac{1}{n}\sum_i b_{ii}.$$

The last term is the signal power, while the first is the noise power.
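A minimal numerical sketch of Theorem 15 and of this absorption property (Python with NumPy; the one-dimensional densities and the numbers are illustrative assumptions, not taken from the text):

```python
import numpy as np

def diff_entropy(p, x):
    """Differential entropy (nats) of a density sampled on the grid x."""
    p = np.where(p > 1e-300, p, 1e-300)     # avoid log(0)
    return -np.trapz(p * np.log(p), x)

def entropy_power(H):
    """1-D entropy power corresponding to differential entropy H (nats)."""
    return np.exp(2.0 * H) / (2.0 * np.pi * np.e)

# White Gaussian "noise" of power N and a uniform "signal" of power Q = a**2 / 3
N, a = 1.0, 0.6
x = np.linspace(-12, 12, 20001)
dx = x[1] - x[0]
gauss = np.exp(-x**2 / (2 * N)) / np.sqrt(2 * np.pi * N)
unif = np.where(np.abs(x) <= a, 1.0 / (2 * a), 0.0)

# Density of the sum of the two independent variables = convolution of densities
p_sum = np.convolve(gauss, unif, mode="same") * dx

N1_bar = N                                   # Gaussian: entropy power = average power
N2_bar = entropy_power(diff_entropy(unif, x))
N3_bar = entropy_power(diff_entropy(p_sum, x))
Q = a**2 / 3.0                               # average power of the uniform signal

print(f"entropy power of the sum     : {N3_bar:.4f}")
print(f"lower bound  N1_bar + N2_bar : {N1_bar + N2_bar:.4f}")
print(f"upper bound  N1 + N2         : {N + Q:.4f}")
```

With the signal power small compared to N, the printed entropy power of the sum lies between the two bounds and is close to the upper one, N + Q.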
PART IV: THE CONTINUOUS CHANNEL

24. THE CAPACITY OF A CONTINUOUS CHANNEL

In a continuous
channel the input or transmitted signals will be continuous functions of time
f (t) belonging to a certain set, and the output or received signals will be
perturbed versions of these. We will consider only the case where both transmitted
and received signals are limited to a certain band W. They can then be specified,
for a time T, by 2TW numbers, and their statistical structure by finite dimensional
distribution functions. Thus the statistics of the transmitted signal will be determined by

$$P(x_1,\ldots,x_n)$$

and those of the noise by the conditional probability distribution

$$P_{x_1,\ldots,x_n}(y_1,\ldots,y_n).$$

The rate of transmission of information for a continuous channel is defined in a way analogous to that for a discrete channel, namely

$$R = H(x) - H_y(x)$$

where $H(x)$ is the entropy of the input and $H_y(x)$ the equivocation.
The channel capacity C is defined as the maximum of R when we vary the input
over all possible ensembles. This means that in a finite dimensional approximation
we must vary $P(x) = P(x_1,\ldots,x_n)$ and maximize

$$\iint P(x)\,P_x(y)\log\frac{P_x(y)}{P(y)}\,dx\,dy.$$

It is obvious in this form that R and C are independent of the coordinate system, since the numerator and denominator in $\log\frac{P_x(y)}{P(y)}$ will be multiplied by the same factors when x and y are transformed in any one-to-one way. This integral expression for C is more general than $H(x) - H_y(x)$.
Properly interpreted (see Appendix 7) it will always exist while $H(x) - H_y(x)$ may assume an indeterminate form $\infty - \infty$ in some cases. This occurs, for example, if x is limited to a surface of fewer dimensions than n in its n dimensional approximation. If the logarithmic base used in computing $H(x)$ and $H_y(x)$
is two then C is the maximum number of binary digits that can be sent per second
over the channel with arbitrarily small equivocation, just as in the discrete
case. This can be seen physically by dividing the space of signals into a large
number of small cells, sufficiently small so that the probability density Px(y)
of signal x being perturbed to point y is substantially constant over a cell
(either of x or y). If the cells are considered as distinct points the situation
is essentially the same as a discrete channel and the proofs used there will
apply. But it is clear physically that this quantizing of the volume into individual
points cannot in any practical situation alter the final answer significantly,
provided the regions are sufficiently small. Thus the capacity will be the limit
of the capacities for the discrete subdivisions and this is just the continuous
capacity defined above. On the mathematical side it can be shown first (see
Appendix 7) that if u is the message, x is the signal, y is the received signal (perturbed by noise) and v is the recovered message, then

$$R(x,y) \ge R(u,v)$$

regardless of what operations are performed on u to obtain x or on y to obtain
v. Thus no matter how we encode the binary digits to obtain the signal, or how
we decode the received signal to recover the message, the discrete rate for
the binary digits does not exceed the channel capacity we have defined. On the
other hand, it is possible under very general conditions to find a coding system
for transmitting binary digits at the rate C with as small an equivocation or
frequency of errors as desired. This is true, for example, if, when we take
a finite dimensional approximating space for the signal functions, P(x,y) is
continuous in both x and y except at a set of points of probability zero. An
important special case occurs when the noise is added to the signal and is independent
of it (in the probability sense). Then $P_x(y)$ is a function only of the difference $n = y - x$,

$$P_x(y) = Q(y - x),$$

and we can assign a definite entropy to the noise (independent of the statistics of the signal), namely the entropy of the distribution Q(n). This entropy will
be denoted by H(n).

Theorem 16: If the signal and noise are independent and the received signal is the sum of the transmitted signal and the noise then the rate of transmission is

$$R = H(y) - H(n),$$

i.e., the entropy of the received signal less the entropy of the noise. The channel capacity is

$$C = \max_{P(x)}\bigl(H(y) - H(n)\bigr).$$

We have, since $y = x + n$:

$$H(x,y) = H(x,n).$$

Expanding the left side and using the fact that x and n are independent

$$H(y) + H_y(x) = H(x) + H(n).$$

Hence

$$R = H(x) - H_y(x) = H(y) - H(n).$$
Since H(n) is independent of P(x), maximizing R requires maximizing H(y), the
entropy of the received signal. If there are certain constraints on the ensemble
of transmitted signals, the entropy of the received signal must be maximized
subject to these constraints.
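The identity R = H(y) − H(n) can be checked numerically in a toy discrete setting (a sketch assuming NumPy; the modular additive channel below is a stand-in for the continuous case and is not part of the text):

```python
import numpy as np

def entropy(p):
    """Entropy in bits of a discrete distribution p."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

m = 8
rng = np.random.default_rng(1)
px = rng.dirichlet(np.ones(m))          # arbitrary input distribution P(x)
pn = rng.dirichlet(np.ones(m))          # arbitrary independent noise Q(n)

# Joint distribution of (x, y) with y = (x + n) mod m
pxy = np.zeros((m, m))
for x in range(m):
    for n in range(m):
        pxy[x, (x + n) % m] += px[x] * pn[n]
py = pxy.sum(axis=0)

H_y, H_n = entropy(py), entropy(pn)
H_x, H_joint = entropy(px), entropy(pxy.ravel())
equivocation = H_joint - H_y            # H_y(x)

print("H(y) - H(n)   :", H_y - H_n)
print("H(x) - H_y(x) :", H_x - equivocation)   # the two expressions agree
```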
25. CHANNEL CAPACITY WITH AN AVERAGE POWER LIMITATION

A simple application of Theorem 16 is the case when the noise is a white thermal
noise and the transmitted signals are limited to a certain average power P.
Then the received signals have an average power P+N where N is the average noise
power. The maximum entropy for the received signals occurs when they also form
a white noise ensemble since this is the greatest possible entropy for a power
P +N and can be obtained by a suitable choice of transmitted signals, namely
if they form a white noise ensemble of power P. The entropy (per second) of the received ensemble is then

$$W\log 2\pi e(P+N),$$

and the noise entropy is

$$W\log 2\pi e N.$$

The channel capacity is

$$C = W\log 2\pi e(P+N) - W\log 2\pi e N = W\log\frac{P+N}{N}.$$

Summarizing we have the following:

Theorem 17: The capacity of a channel of band W perturbed by white thermal noise power N when the average transmitter power is limited to P is given by

$$C = W\log\frac{P+N}{N}.$$

This means that by sufficiently involved encoding systems we can transmit binary digits at the rate $W\log_2\frac{P+N}{N}$ bits per second, with
arbitrarily small frequency of errors. It is not possible to transmit at a higher
rate by any encoding system without a definite positive frequency of errors.
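For concreteness, a direct evaluation of the Theorem 17 formula (Python; the bandwidth and powers below are illustrative values only):

```python
import math

def capacity_white_noise(W, P, N):
    """Capacity in bits per second of a band-W channel with white noise power N
    and average transmitter power P (Theorem 17, logarithm taken to base 2)."""
    return W * math.log2((P + N) / N)

# e.g. a 3000-cycle band with a 20 dB signal-to-noise ratio (illustrative values)
print(capacity_white_noise(W=3000, P=100.0, N=1.0))   # roughly 19,975 bits per second
```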
To approximate this limiting rate of transmission the transmitted signals must
approximate, in statistical properties, a white noise.[6] A system which approaches the ideal rate may be described as follows: Let $M = 2^s$ samples of white noise be constructed each of duration T. These are
assigned binary numbers from 0 to M - 1. At the transmitter the message sequences
are broken up into groups of s and for each group the corresponding noise sample
is transmitted as the signal. At the receiver the M samples are known and the
actual received signal (perturbed by noise) is compared with each of them. The
sample which has the least R.M.S. discrepancy from the received signal is chosen
as the transmitted signal and the corresponding binary number reconstructed.
This process amounts to choosing the most probable (a posteriori) signal. The
number M of noise samples used will depend on the tolerable frequency ε of errors, but for almost all selections of samples we have

$$\lim_{\epsilon\to 0}\;\lim_{T\to\infty}\frac{\log M(\epsilon,T)}{T} = W\log\frac{P+N}{N},$$

so that no matter how small ε is chosen, we can, by taking T sufficiently large, transmit as near as we wish to $TW\log\frac{P+N}{N}$ binary digits in the time T.

[6] This and other properties of the white noise case are discussed from the geometrical point of view in "Communication in the Presence of Noise," loc. cit.
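A sketch of this signaling scheme in discrete form (Python with NumPy; sample counts and powers are assumed for illustration, with discrete samples standing in for the band-limited functions):

```python
import numpy as np

rng = np.random.default_rng(2)
s, samples = 8, 64                  # s bits per block; 2TW samples per codeword
M = 2 ** s
P, N = 1.0, 0.1                     # signal power and noise power per sample

codebook = rng.normal(0.0, np.sqrt(P), size=(M, samples))   # white-noise codewords

def transmit(index):
    return codebook[index] + rng.normal(0.0, np.sqrt(N), size=samples)

def decode(received):
    # choose the codeword with least R.M.S. discrepancy (most probable a posteriori here)
    return int(np.argmin(np.sum((codebook - received) ** 2, axis=1)))

trials = 2000
sent = rng.integers(0, M, size=trials)
errors = sum(decode(transmit(i)) != i for i in sent)
print("block error frequency:", errors / trials)
```

With these values the rate is well below capacity, so the observed error frequency should be essentially zero.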
Formulas similar to $C = W\log\frac{P+N}{N}$ for the white noise case have been developed independently by several other writers, although with somewhat different interpretations. We may mention the work of N. Wiener,[7] W. G. Tuller,[8] and H. Sullivan in this connection. In the case of an arbitrary perturbing noise
(not necessarily white thermal noise) it does not appear that the maximizing
problem involved in determining the channel capacity C can be solved explicitly.
However, upper and lower bounds can be set for C in terms of the average noise
power N and the noise entropy power $N_1$. These bounds are sufficiently close together in most practical cases to furnish a satisfactory solution to the problem.

Theorem 18: The capacity of a channel of band W perturbed by an arbitrary noise is bounded by the inequalities

$$W\log\frac{P+N_1}{N_1} \le C \le W\log\frac{P+N}{N_1}$$

where

P = average transmitter power
N = average noise power
$N_1$ = entropy power of the noise.

Here again the average power of the perturbed
signals will be P + N. The maximum entropy for this power would occur if the
received signal were white noise and would be $W\log 2\pi e(P+N)$. It may not be possible to achieve this; i.e., there may not be any ensemble of transmitted signals which, added to the perturbing noise, produce a white thermal noise at the receiver, but at least this sets an upper bound to H(y). We have, therefore,

$$C = \max\bigl(H(y) - H(n)\bigr) \le W\log 2\pi e(P+N) - W\log 2\pi e N_1 = W\log\frac{P+N}{N_1}.$$

This is the upper limit given in the theorem. The lower limit can be obtained
by considering the rate if we make the transmitted signal a white noise, of
power P. In this case the entropy power of the received signal must be at least as great as that of a white noise of power $P+N_1$, since we have shown in a previous theorem that the entropy power of the sum of two ensembles is greater than or equal to the sum of the individual entropy powers. Hence

$$H(y) \ge W\log 2\pi e(P+N_1)$$

and

$$C \ge W\log 2\pi e(P+N_1) - W\log 2\pi e N_1 = W\log\frac{P+N_1}{N_1}.$$

[7] Cybernetics, loc. cit.
[8] "Theoretical Limitations on the Rate of Transmission of Information," Proceedings of the Institute of Radio Engineers, v. 37, No. 5, May, 1949, pp. 468-78.
As P increases, the upper and lower bounds approach each other, so we have
as an asymptotic rate

$$W\log\frac{P+N}{N_1}.$$

If the noise is itself white, $N = N_1$ and the result reduces to the formula proved previously:

$$C = W\log\left(1 + \frac{P}{N}\right).$$
If the noise is Gaussian but with a spectrum which is not necessarily flat, $N_1$ is the geometric mean of the noise power over the various frequencies in the band W. Thus

$$N_1 = \exp\left(\frac{1}{W}\int_W \log N(f)\,df\right)$$

where N(f) is the noise power at frequency f.
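A numerical illustration of these quantities (Python with NumPy; the noise spectrum N(f) below is an assumed shape, not one from the text), computing $N_1$ as the geometric mean of N(f) and the two bounds of Theorem 18 in bits per second:

```python
import numpy as np

W, P = 1000.0, 5.0
f = np.linspace(0.0, W, 10001)
N_f = 1.0 + 0.8 * np.sin(np.pi * f / W) ** 2      # assumed (non-flat) noise spectrum

N = np.trapz(N_f, f) / W                          # average noise power
N1 = np.exp(np.trapz(np.log(N_f), f) / W)         # entropy power = geometric mean

lower = W * np.log2((P + N1) / N1)
upper = W * np.log2((P + N) / N1)
print(f"N = {N:.3f}, N1 = {N1:.3f}")
print(f"{lower:.1f} <= C <= {upper:.1f} bits per second")
```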
Theorem 19: If we set the capacity for a given transmitter power P equal to

$$C = W\log\frac{P+N-\eta}{N_1},$$

then η is monotonic decreasing as P increases and approaches 0 as a limit.
Suppose that for a given power $P_1$ the channel capacity is

$$W\log\frac{P_1+N-\eta_1}{N_1}.$$

This means that the best signal distribution, say p(x), when added to the noise distribution q(x), gives a received distribution r(y) whose entropy power is $(P_1+N-\eta_1)$. Let us increase the power to $P_1 + \Delta P$ by adding a white noise of power $\Delta P$ to the signal. The entropy of the received signal is now at least

$$H(y) = W\log 2\pi e(P_1+N-\eta_1+\Delta P)$$

by application of the theorem on the minimum entropy power of a sum. Hence, since we can attain the H indicated, the entropy of the maximizing distribution must be at least as great and η must be monotonic decreasing. To show that $\eta \to 0$ as $P \to \infty$, consider a signal which is white noise with a large P. Whatever the perturbing noise, the received signal will be approximately a white noise, if P is sufficiently large, in the sense of having an entropy power approaching P+N.

26. THE CHANNEL CAPACITY WITH A PEAK POWER LIMITATION

In some applications
the transmitter is limited not by the average power output but by the peak
instantaneous power. The problem of calculating the channel capacity is then
that of maximizing (by variation of the ensemble of transmitted symbols) H(y) − H(n) subject to the constraint that all the functions f(t) in the ensemble be less than or equal to $\sqrt{S}$, say, for all t. A constraint of this type does not work out as well mathematically as the average power limitation. The most we have obtained for this case is a lower bound valid for all $\frac{S}{N}$, an "asymptotic" upper bound (valid for large $\frac{S}{N}$) and an asymptotic value of C for $\frac{S}{N}$ small.
Theorem 20: The channel capacity C for a band W perturbed by white thermal noise of power N is bounded by

$$C \ge W\log\frac{\frac{2}{\pi e^3}S+N}{N},$$

where S is the peak allowed transmitter power. For sufficiently large $\frac{S}{N}$

$$C \le W\log\frac{\frac{2}{\pi e}S+N}{N}\,(1+\epsilon),$$

where ε is arbitrarily small. As $\frac{S}{N}\to 0$ (and provided the band W starts at 0)

$$\frac{C}{W\log\left(1+\frac{S}{N}\right)} \to 1.$$
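The three expressions of Theorem 20, as reconstructed above, can be tabulated against S/N (a Python sketch with illustrative values; note the upper bound is asserted only for large S/N):

```python
import numpy as np

W = 1.0                                      # capacities below scale linearly with W
snr = np.array([1.0, 10.0, 100.0, 1000.0])   # S/N: peak signal power over noise power

lower = W * np.log2((2.0 / (np.pi * np.e ** 3)) * snr + 1.0)   # valid for all S/N
upper = W * np.log2((2.0 / (np.pi * np.e)) * snr + 1.0)        # asymptotic, large S/N only
small = W * np.log2(1.0 + snr)                                 # limiting form as S/N -> 0

for s, lo, hi, ap in zip(snr, lower, upper, small):
    print(f"S/N = {s:7.1f}   lower {lo:7.3f}   upper {hi:7.3f}   W*log2(1+S/N) {ap:7.3f}")
```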
We wish to maximize the entropy of the received signal. If $\frac{S}{N}$ is large this will occur very nearly when we maximize the entropy of the transmitted ensemble.
The asymptotic upper bound is obtained by relaxing the conditions on the ensemble.
Let us suppose that the power is limited to S not at every instant of time,
but only at the sample points. The maximum entropy of the transmitted ensemble
under these weakened conditions is certainly greater than or equal to that
under the original conditions. This altered problem can be solved easily.
The maximum entropy occurs if the different samples are independent and have a distribution function which is constant from $-\sqrt{S}$ to $+\sqrt{S}$. The entropy can be calculated as

$$W\log 4S.$$

The received signal will then have an entropy less than

$$W\log(4S + 2\pi e N)(1+\epsilon),$$

where $\epsilon\to 0$ as $\frac{S}{N}\to\infty$, and the channel capacity is obtained by subtracting the entropy of the white noise, $W\log 2\pi e N$:

$$C \le W\log(4S+2\pi eN)(1+\epsilon) - W\log 2\pi e N = W\log\frac{\frac{2}{\pi e}S+N}{N}\,(1+\epsilon).$$

This is the desired upper bound to the channel capacity. To obtain a lower
bound consider the same ensemble of functions. Let these functions be passed
through an ideal filter with a triangular transfer characteristic. The gain
is to be unity at frequency 0 and decline linearly down to gain 0 at frequency
W. We first show that the output functions of the filter have a peak power limitation S at all times (not just the sample points). First we note that a pulse

$$\frac{\sin 2\pi Wt}{2\pi Wt}$$

going into the filter produces

$$\frac{1}{2}\,\frac{\sin^2 \pi Wt}{(\pi Wt)^2}$$

in the output. This function is never negative. The input function (in the general case) can be thought of as the sum of a series of shifted functions

$$a\,\frac{\sin 2\pi Wt}{2\pi Wt}$$

where a, the amplitude of the sample, is not greater than $\sqrt{S}$. Hence the output is the sum of shifted functions of the non-negative form above with the same coefficients. These functions being non-negative, the greatest positive value for any t is obtained when all the coefficients a have their maximum positive values, i.e., $\sqrt{S}$. In this case the input function was a constant of amplitude $\sqrt{S}$ and since the filter has unit gain for D.C., the output is the same. Hence the output ensemble has a peak power S. The entropy of the output ensemble
can be calculated from that of the input ensemble by using the theorem dealing
with such a situation. The output entropy is equal to the input entropy plus the geometrical mean gain of the filter:

$$\int_0^W \log G^2\,df = \int_0^W \log\left(\frac{W-f}{W}\right)^2 df = -2W\log e.$$

Hence the output entropy is

$$W\log 4S - 2W\log e = W\log\frac{4S}{e^2}$$

and the channel capacity is greater than

$$W\log\frac{\frac{2}{\pi e^3}S+N}{N}.$$
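A quick numerical check of the gain integral used in this step (Python with NumPy; natural logarithms, so the result −2W corresponds to the −2W log e above):

```python
import numpy as np

W = 3.0
f = np.linspace(0.0, W, 200001)[:-1]          # stop just short of f = W (log singularity)
value = np.trapz(np.log(((W - f) / W) ** 2), f)
print(value, "  vs  -2W =", -2 * W)           # integrable singularity: value is close to -2W
```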
We now wish to show that, for small $\frac{S}{N}$ (peak signal power over average white noise power), the channel capacity is approximately

$$C = W\log\left(1+\frac{S}{N}\right).$$

Since the peak limitation implies an average power limitation of at most S, the capacity cannot exceed $W\log\left(1+\frac{S}{N}\right)$. Therefore, if we can find an ensemble of functions such that they correspond to a rate nearly $W\log\left(1+\frac{S}{N}\right)$ and are limited to band W and peak S the result
will be proved. Consider the ensemble of functions of the following type.
A series of t samples have the same value, either $+\sqrt{S}$ or $-\sqrt{S}$, then the next t samples have the same value, etc. The value for a series is chosen at random, probability $\frac{1}{2}$ for $+\sqrt{S}$ and $\frac{1}{2}$ for $-\sqrt{S}$. If this ensemble be passed through a filter with triangular gain characteristic (unit gain at D.C.), the output is peak limited to $\pm\sqrt{S}$. Furthermore the average power is nearly S and can be made to approach this by taking t sufficiently large. The entropy of the sum of this and the thermal noise can be found by
applying the theorem on the sum of a noise and a small signal. This theorem
will apply if the ratio of signal power to noise power is sufficiently small. This can be ensured by taking $\frac{S}{N}$ small enough (after t is chosen). The entropy power will be S+N to as close an approximation as desired, and hence the rate of transmission as near as we wish to

$$W\log\frac{S+N}{N}.$$
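A numerical sketch of this construction (Python with NumPy; the block length, number of blocks and the fine time grid are assumed for illustration). Each sample pulse is replaced by its filtered, non-negative form $\frac{1}{2}\frac{\sin^2\pi Wt}{(\pi Wt)^2}$, as derived above, and the peak and average power of the output are checked:

```python
import numpy as np

rng = np.random.default_rng(0)
W, S = 1.0, 1.0
block_len, n_blocks = 25, 20                 # "t" samples per series, number of series

# Blocks of block_len samples, each block constant at +sqrt(S) or -sqrt(S)
signs = rng.choice([-1.0, 1.0], size=n_blocks)
a = np.repeat(signs, block_len) * np.sqrt(S)         # sample amplitudes
t_k = np.arange(a.size) / (2 * W)                    # sample instants, spacing 1/(2W)

# After the triangular filter each sample pulse becomes 0.5*sinc(W*t)**2,
# which is never negative (np.sinc(x) = sin(pi*x)/(pi*x)).
tau = np.linspace(t_k[0], t_k[-1], 5001)             # fine time grid
pulses = 0.5 * np.sinc(W * (tau[:, None] - t_k[None, :])) ** 2
output = pulses @ a

print("peak |output| :", np.abs(output).max(), " (never exceeds sqrt(S) =", np.sqrt(S), ")")
print("average power :", np.mean(output ** 2), " (approaches S as the blocks grow longer)")
```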
PART V: THE RATE FOR A CONTINUOUS SOURCE

27. FIDELITY EVALUATION FUNCTIONS

In the case of a discrete source of information we were able to determine
In the case of a discrete source of information we were able to determine
a definite rate of generating information, namely the entropy of the underlying
stochastic process. With a continuous source the situation is considerably
more involved. In the first place a continuously variable quantity can assume
an infinite number of values and requires, therefore, an infinite number of
binary digits for exact specification. This means that to transmit the output
of a continuous source with exact recovery at the receiving point requires,
in general, a channel of infinite capacity (in bits per second). Since,
ordinarily, channels have a certain amount of noise, and therefore a finite
capacity, exact transmission is impossible. This, however, evades the real
issue. Practically, we are not interested in exact transmission when we have
a continuous source, but only in transmission to within a certain tolerance.
The question is, can we assign a definite rate to a continuous source when
we require only a certain fidelity of recovery, measured in a suitable way.
Of course, as the fidelity requirements are increased the rate will increase.
It will be shown that we can, in very general cases, define such a rate, having
the property that it is possible, by properly encoding the information, to
transmit it over a channel whose capacity is equal to the rate in question,
and satisfy the fidelity requirements. A channel of smaller capacity is insufficient.
It is first necessary to give a general mathematical formulation of the idea
of fidelity of transmission. Consider the set of messages of a long duration,
say T seconds. The source is described by giving the probability density,
in the associated space, that the source will select the message in question
P(x). A given communication system is described (from the external point of
view) by giving the conditional probability Px(y) that if message
x is produced by the source the recovered message at the receiving point will
be y. The system as a whole (including source and transmission system) is
described by the probability function P(x, y) of having message x and final
output y. If this function is known, the complete characteristics of the system
from the point of view of fidelity are known. Any evaluation of fidelity must
correspond mathematically to an operation applied to P(x,y). This operation
must at least have the properties of a simple ordering of systems; i.e., it
must be possible to say of two systems represented by $P_1(x,y)$ and $P_2(x,y)$ that, according to our fidelity criterion, either (1) the first has higher
fidelity, (2) the second has higher fidelity, or (3) they have equal fidelity.
This means that a criterion of fidelity can be represented by a numerically valued function

$$v\bigl(P(x,y)\bigr)$$

whose argument ranges over possible probability functions P(x,y). We will now show that under very general and reasonable assumptions the function $v(P(x,y))$ can be written in a seemingly much more specialized form, namely as an average of a function p(x,y) over the set of possible values of x and y:

$$v\bigl(P(x,y)\bigr) = \iint P(x,y)\,p(x,y)\,dx\,dy.$$
To obtain this we need only assume (1) that the source and system are ergodic
so that a very long sample will be, with probability nearly 1, typical of
the ensemble, and (2) that the evaluation is "reasonable" in the sense that
it is possible, by observing a typical input and output xl and yl, to form
a tentative evaluation on the basis of these samples; and if these samples
are increased in duration the tentative evaluation will, with probability
1, approach the exact evaluation based on a full knowledge of P(x,y). Let the tentative evaluation be p(x,y). Then the function p(x,y) approaches (as $T\to\infty$) a constant for almost all (x,y) which are in the high probability region corresponding to the system:

$$p(x,y) \to v\bigl(P(x,y)\bigr)$$

and we may also write

$$p(x,y) \to \iint P(x,y)\,p(x,y)\,dx\,dy$$

since

$$\iint P(x,y)\,dx\,dy = 1.$$

This establishes the desired result. The function p(x,y) has the general
nature of a "distance" between x and y.[9] It measures how undesirable it is (according to our fidelity criterion) to receive y when x is transmitted. The general result given above can be restated as follows: Any reasonable evaluation can be represented as an average of a distance function over the set of messages and recovered messages x and y weighted according to the probability P(x,y) of getting the pair in question, provided the duration T of the messages
be taken sufficiently large.

[9] It is not a "metric" in the strict sense, however, since in general it does not satisfy either p(x,y) = p(y,x) or p(x,y) + p(y,z) ≥ p(x,z).

The following are simple examples of evaluation functions:

1. R.M.S. criterion.

$$v = \overline{\bigl(x(t)-y(t)\bigr)^2}$$

In this very commonly used measure of fidelity the distance function p(x,y) is (apart from a constant factor) the square of the ordinary Euclidean distance between the points x and y in the associated function space.
2. Frequency weighted R.M.S. criterion. More generally one can apply different weights to the different frequency components before using an R.M.S. measure of fidelity. This is equivalent to passing the difference x(t) − y(t) through a shaping filter and then determining the average power in the output. Thus let

$$e(t) = x(t) - y(t)$$

and

$$f(t) = \int_{-\infty}^{\infty} e(\tau)\,k(t-\tau)\,d\tau;$$

then

$$v = \overline{f(t)^2}.$$

A numerical sketch of this weighting is given after these examples.
3. Absolute error criterion.

$$v = \overline{\bigl|\,x(t)-y(t)\,\bigr|}$$
4. The structure of the ear and brain determine implicitly an evaluation,
or rather a number of evaluations, appropriate in the case of speech or music
transmission. There is, for example, an "intelligibility" criterion in which
p(x,y) is equal to the relative frequency of incorrectly interpreted words
when message x(t) is received as y(t). Although we cannot give an explicit
representation of p(x,y) in these cases it could, in principle, be determined
by sufficient experimentation. Some of its properties follow from well-known
experimental results in hearing, e.g., the ear is relatively insensitive to
phase and the sensitivity to amplitude and frequency is roughly logarithmic.
5. The discrete case can be considered as a specialization in which we have
tacitly assumed an evaluation based on the frequency of errors. The function
p(x,y) is then defined as the number of symbols in the sequence y differing
from the corresponding symbols in x divided by the total number of symbols
in x.
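A sketch of the frequency-weighted R.M.S. evaluation of example 2 above (Python with NumPy; the signals and the shaping kernel k are assumed for illustration):

```python
import numpy as np

def weighted_rms_fidelity(x, y, k, dt):
    """Average power of the error x - y after passing through a shaping filter k."""
    e = x - y
    f = np.convolve(e, k, mode="same") * dt     # f(t) = integral of e(tau) k(t - tau) dtau
    return np.mean(f ** 2)

dt = 1e-3
t = np.arange(0.0, 1.0, dt)
x = np.sin(2 * np.pi * 5 * t)                   # original message (illustrative)
y = x + 0.1 * np.random.default_rng(3).normal(size=t.size)   # recovered message

k = np.exp(-t / 0.01) / 0.01                    # assumed low-pass weighting kernel
print("plain R.M.S. evaluation   :", np.mean((x - y) ** 2))
print("frequency-weighted R.M.S. :", weighted_rms_fidelity(x, y, k, dt))
```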
28. THE RATE FOR A SOURCE RELATIVE TO A FIDELITY EVALUATION

We are now in a position to define a rate of generating information for a continuous
source. We are given P(x) for the source and an evaluation v determined by
a distance function p(x,y) which will be assumed continuous in both x and
y. With a particular system P(x,y) the quality is measured by

$$v = \iint p(x,y)\,P(x,y)\,dx\,dy.$$

Furthermore the rate of flow of binary digits corresponding to P(x,y) is

$$R = \iint P(x,y)\log\frac{P(x,y)}{P(x)P(y)}\,dx\,dy.$$

We define the rate $R_1$ of generating information for a given quality $v_1$ of reproduction to be the minimum of R when we keep v fixed at $v_1$ and vary $P_x(y)$. That is:

$$R_1 = \min_{P_x(y)} \iint P(x,y)\log\frac{P(x,y)}{P(x)P(y)}\,dx\,dy$$

subject to the constraint:

$$v_1 = \iint P(x,y)\,p(x,y)\,dx\,dy.$$
This means that we consider, in effect, all the communication systems that
might be used and that transmit with the required fidelity. The rate of transmission
in bits per second is calculated for each one and we choose that having the
least rate. This latter rate is the rate we assign the source for the fidelity
in question. The justification of this definition lies in the following result:
Theorem 21: If a source has a rate R1 for a valuation v1 it is possible
to encode the output of the source and transmit it over a channel of capacity
C with fidelity as near v1 as desired provided R1 < C. This is not possible
if $R_1 > C$. The last statement in the theorem follows immediately from
the definition of R1 and previous results. If it were not true we could transmit
more than C bits per second over a channel of capacity C. The first part of
the theorem is proved by a method analogous to that used for Theorem 11. We
may, in the first place, divide the (x,y) space into a large number of small
cells and represent the situation as a discrete case. This will not change
the evaluation function by more than an arbitrarily small amount (when the
cells are very small) because of the continuity assumed for p(x,y). Suppose
that $P_1(x,y)$ is the particular system which minimizes the rate and gives $R_1$. We choose from the high probability y's a set at random containing $2^{(R_1+\epsilon)T}$ members where $\epsilon\to 0$ as $T\to\infty$. With large T each chosen point will be connected
by a high probability line (as in Fig. 10) to a set of x's. A calculation
similar to that used in proving Theorem 11 shows that with large T almost
all x's are covered by the fans from the chosen y points for almost all choices
of the y's. The communication system to be used operates as follows: The selected
points are assigned binary numbers. When a message x is originated it will
(with probability approaching 1 as $T\to\infty$) lie within at least one of the
fans. The corresponding binary number is transmitted (or one of them chosen
arbitrarily if there are several) over the channel by suitable coding means
to give a small probability of error. Since $R_1 < C$ this is possible. At
the receiving point the corresponding y is reconstructed and used as the recovered
message. The evaluation $v_1'$ for this system can be made arbitrarily close to $v_1$ by taking T sufficiently large. This is due to the fact that for each long
sample of message x(t) and recovered message y(t) the evaluation approaches
v1 (with probability 1). It is interesting to note that, in this system, the
noise in the recovered message is actually produced by a kind of general quantizing
at the transmitter and not produced by the noise in the channel. It is more
or less analogous to the quantizing noise in PCM.

29. THE CALCULATION OF RATES
The definition of the rate is similar in many respects to the definition of
channel capacity. In the former

$$R = \min_{P_x(y)} \iint P(x,y)\log\frac{P(x,y)}{P(x)P(y)}\,dx\,dy$$

with P(x) and $v_1 = \iint P(x,y)\,p(x,y)\,dx\,dy$ fixed. In the latter

$$C = \max_{P(x)} \iint P(x,y)\log\frac{P(x,y)}{P(x)P(y)}\,dx\,dy$$

with $P_x(y)$ fixed and possibly one or more other constraints (e.g., an average power limitation) of the form $K = \iint P(x,y)\,\lambda(x,y)\,dx\,dy$. A partial solution of the general maximizing problem for determining the rate of a source can be given. Using Lagrange's method we consider

$$\iint\left[P(x,y)\log\frac{P(x,y)}{P(x)P(y)} + \mu\,P(x,y)\,p(x,y) + \nu(x)\,P(x,y)\right]dx\,dy.$$
The variational equation (when we take the first variation on P(x,y)) leads to

$$P_y(x) = B(x)\,e^{-\lambda\,p(x,y)}$$

where λ is determined to give the required fidelity and B(x) is chosen to satisfy

$$\int B(x)\,e^{-\lambda\,p(x,y)}\,dx = 1.$$
This shows that, with best encoding, the conditional probability of a certain cause for various received y, $P_y(x)$, will decline exponentially with the distance function p(x,y) between the x and y in question. In the special case where the distance function p(x,y) depends only on the (vector) difference between x and y,

$$p(x,y) = p(x-y),$$

we have

$$\int B(x)\,e^{-\lambda\,p(x-y)}\,dx = 1.$$

Hence B(x) is constant, say α, and

$$P_y(x) = \alpha\,e^{-\lambda\,p(x-y)}.$$
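As a concrete instance of this formal solution (a worked special case added for illustration), take the squared-error distance $p(x,y) = (x-y)^2$ in one dimension. The condition on B(x) gives $B(x) = \sqrt{\lambda/\pi}$, a constant, so that

$$P_y(x) = \sqrt{\frac{\lambda}{\pi}}\;e^{-\lambda(x-y)^2};$$

with best encoding the source value x is thus distributed about the reproduced value y as a Gaussian of variance $1/(2\lambda)$, and λ is fixed by the allowed mean square error $N = 1/(2\lambda)$.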
Unfortunately these formal solutions are difficult to evaluate in particular
cases and seem to be of little value. In fact, the actual calculation of rates
has been carried out in only a few very simple cases. If the distance function
p(x, y) is the mean square discrepancy between x and y and the message ensemble
is white noise, the rate can be determined. In that case we have

$$R = \min\bigl[H(x) - H_y(x)\bigr] = H(x) - \max H_y(x)$$

with $N = \overline{(x-y)^2}$. But the $\max H_y(x)$ occurs when $y - x$ is a white noise, and is equal to $W_1\log 2\pi e N$ where $W_1$ is the bandwidth of the message ensemble. Therefore

$$R = W_1\log 2\pi e Q - W_1\log 2\pi e N = W_1\log\frac{Q}{N},$$
where Q is the average message power. This proves the following:

Theorem 22: The rate for a white noise source of power Q and band $W_1$ relative to an R.M.S. measure of fidelity is

$$R = W_1\log\frac{Q}{N}$$

where N is the allowed mean square error between original and recovered messages.
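A direct evaluation of this formula (Python; the band, source power and tolerance are illustrative assumptions):

```python
import math

def rate_white_noise_source(W1, Q, N):
    """Rate in bits per second for a band-W1 white noise source of power Q
    reproduced within mean square error N (Theorem 22, base-2 logarithm)."""
    return W1 * math.log2(Q / N)

print(rate_white_noise_source(W1=5000, Q=10.0, N=0.1))   # roughly 33,219 bits per second
```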
More generally with any message source we can obtain inequalities bounding the rate relative to a mean square error criterion.

Theorem 23: The rate for any source of band $W_1$ is bounded by

$$W_1\log\frac{Q_1}{N} \le R \le W_1\log\frac{Q}{N}$$

where Q is the average power of the source, $Q_1$ its entropy power and N the allowed mean square error. The lower bound follows from the fact that the $\max H_y(x)$ for a given $\overline{(x-y)^2} = N$ occurs in the white noise case. The upper bound results if we place points (used in the proof of Theorem 21) not in the best way but at random in a sphere of radius $\sqrt{Q-N}$.

ACKNOWLEDGMENTS

The writer is indebted to his colleagues at the Laboratories,
particularly to Dr. H. W. Bode, Dr. J. R. Pierce, Dr. B. McMillan, and Dr.
B. M. Oliver for many helpful suggestions and criticisms during the course
of this work. Credit should also be given to Professor N. Wiener, whose elegant
solution of the problems of filtering and prediction of stationary ensembles
has considerably influenced the writer's thinking in this field.

APPENDIX 5

Let $S_1$ be any measurable subset of the g ensemble, and $S_2$ the subset of
the f ensemble which gives $S_1$ under the operation T. Then $S_1 = TS_2$. Let W be the operator which shifts all functions in a set by the time λ. Then

$$W S_1 = W T S_2 = T W S_2$$

since T is invariant and therefore commutes with W. Hence if m[S] is the probability measure of the set S,

$$m\bigl[W S_1\bigr] = m\bigl[T W S_2\bigr] = m\bigl[W S_2\bigr] = m\bigl[S_2\bigr] = m\bigl[S_1\bigr]$$

where the second equality is by definition of measure in the g space, the third since the f ensemble is stationary, and the last by definition of g measure again. To prove that the ergodic property is preserved under invariant operations, let $S_1$ be a subset of the g ensemble which is invariant under W, and let $S_2$ be the set of all functions f which transform into $S_1$. Then

$$W S_1 = W T S_2 = T W S_2 = S_1,$$

so that $W S_2$ is contained in $S_2$. Since

$$m\bigl[W S_2\bigr] = m\bigl[S_2\bigr],$$

this implies

$$W S_2 = S_2$$

(apart from a set of measure zero); thus $S_2$ is invariant, and since the f ensemble is ergodic $m[S_2]$ must be 0 or 1. By the equality of measures shown above, $m[S_1] = m[S_2]$ is 0 or 1, so the g ensemble is ergodic.
APPENDIX 6

The upper bound, $\bar{N}_3 \le N_1 + N_2$, is due to the fact that the maximum possible entropy for a power $N_1+N_2$ occurs when we have a white noise of this power. In this case the entropy power is $N_1+N_2$. To obtain the lower bound, suppose we have two distributions in n dimensions $p(x_i)$ and $q(x_i)$ with entropy powers $\bar{N}_1$ and $\bar{N}_2$. What form should p and q have to minimize the entropy power $\bar{N}_3$ of their convolution $r(x_i)$:

$$r(x_i) = \int p(y_i)\,q(x_i - y_i)\,dy_i\,?$$
The entropy $H_3$ of r is to be minimized while the entropies $H_1$ of p and $H_2$ of q are held fixed; with Lagrange multipliers λ and μ we consider $H_3 - \lambda H_1 - \mu H_2$. Now suppose $p(x_i)$ and $q(x_i)$ are normal with quadratic forms $A_{ij}$ and $B_{ij}$. Then $r(x_i)$ will also be normal with quadratic form $C_{ij}$. If the inverses of these forms are $a_{ij}$, $b_{ij}$, $c_{ij}$ then

$$c_{ij} = a_{ij} + b_{ij}.$$

We wish to show that these functions satisfy the minimizing conditions if and only if $a_{ij} = K\,b_{ij}$ and thus give the minimum $H_3$ under the constraints. First we have

$$H_3 = \frac{n}{2}\log 2\pi e\,|c_{ij}|^{1/n}.$$

This should equal the corresponding expression obtained from the minimizing conditions, and the calculation shows that this is possible only when $a_{ij} = K\,b_{ij}$; the minimum entropy power of the sum is then $\bar{N}_3 = \bar{N}_1 + \bar{N}_2$.
APPENDIX 7

The following will indicate a more general and more rigorous approach to the central definitions of communication theory. Consider a probability measure space whose elements are ordered pairs (x,y). The variables x, y are to be identified as the possible transmitted and received signals of some long duration T. Let us call the set of all points whose x belongs to a subset $S_1$ of x points the strip over $S_1$, and similarly the set whose y belong to $S_2$ the strip over $S_2$. We divide x and y into a collection of non-overlapping measurable subsets $X_i$ and $Y_j$ and approximate to the rate of transmission R by

$$R_1 = \frac{1}{T}\sum_{i,j} P(X_i, Y_j)\log\frac{P(X_i,Y_j)}{P(X_i)\,P(Y_j)}$$

where $P(X_i)$ is the probability measure of the strip over $X_i$, $P(Y_j)$ that of the strip over $Y_j$, and $P(X_i,Y_j)$ that of their intersection. A further subdivision can never decrease $R_1$: when a cell is split, the term it contributed to the sum is replaced by terms whose total is at least as great,
and consequently the sum is increased. Thus the various possible subdivisions
form a directed set, with R monotonic increasing with refinement of the subdivision.
We may define R unambiguously as the least upper bound for $R_1$ and write it

$$R = \frac{1}{T}\iint P(x,y)\log\frac{P(x,y)}{P(x)P(y)}\,dx\,dy.$$
This integral, understood in the above sense, includes both the continuous
and discrete cases and of course many others which cannot be represented in
either form. It is trivial in this formulation that if x and u are in one-to-one
correspondence, the rate from u to y is equal to that from x to y. If v is
any function of y (not necessarily with an inverse) then the rate from x to
y is greater than or equal to that from x to v since, in the calculation of
the approximations, the subdivisions of y are essentially a finer subdivision
of those for v. More generally if y and v are related not functionally but
statistically, i.e., we have a probability measure space (y,v), then $R(x,v) \le R(x,y)$. This means that any operation applied to the received signal,
even though it involves statistical elements, does not increase R. Another
notion which should be defined precisely in an abstract formulation of the
theory is that of "dimension rate," that is the average number of dimensions
required per second to specify a member of an ensemble. In the band limited
case 2W numbers per second are sufficient. A general definition can be framed
as follows. Let $f_\alpha(t)$ be an ensemble of functions and let $\rho_T[f_\alpha(t), f_\beta(t)]$ be a metric measuring the "distance" between $f_\alpha$ and $f_\beta$ over the time T (for example the R.M.S. discrepancy over this interval). Let $N(\epsilon,\delta,T)$ be the least number of elements f which can be chosen such that all elements of the ensemble, apart from a set of measure δ, are within the distance ε of at least one of those chosen. The dimension rate of the ensemble is then defined as the triple limit

$$\lambda = \lim_{\delta\to 0}\;\lim_{\epsilon\to 0}\;\lim_{T\to\infty}\frac{\log N(\epsilon,\delta,T)}{T\log\frac{1}{\epsilon}}.$$

This is a generalization of the measure type definitions of dimension in topology, and agrees with the intuitive dimension rate for simple ensembles where the desired result is obvious.