Roughly then, H_y(x) is the amount of additional information that must be supplied per second at the receiving point to correct the received message. To prove the first part, consider long sequences of received message M' and corresponding original message M. There will be logarithmically T H_y(x) of the M's which could reasonably have produced each M'. Thus we have T H_y(x) binary digits to send each T seconds. This can be done with ε frequency of errors on a channel of capacity H_y(x).
The second part can be proved by noting, first, that for any discrete chance variables x, y, z

$$H_y(x, z) \geq H_y(x).$$

The left-hand side can be expanded to give

$$H_y(z) + H_{yz}(x) \geq H_y(x)$$
$$H_{yz}(x) \geq H_y(x) - H_y(z) \geq H_y(x) - H(z).$$

If we identify x as the output of the source, y as the received signal and z as the signal sent over the correction channel, then the right-hand side is the equivocation less the rate of transmission over the correction channel. If the capacity of this channel is less than the equivocation the right-hand side will be greater than zero and H_yz(x) > 0. But this is the uncertainty of what was sent, knowing both the received signal
and the correction signal. If this is greater than zero the frequency of errors
cannot be arbitrarily small. Example: Suppose the errors occur at random
in a sequence of binary digits: probability p that a digit is wrong and q
= 1 - p that it is right. These errors can be corrected if their position
is known. Thus the correction channel need only send information as to these
positions. This amounts to transmitting from a source which produces binary
digits with probability p for 1 (incorrect) and q for 0 (correct). This requires a channel of capacity

$$-[p \log p + q \log q]$$

which is the equivocation of the original system.
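As a quick numerical illustration (a minimal sketch of ours, not part of the original paper; the function name is our own), the required correction-channel capacity is just the binary entropy of the error probability:

```python
import math

def correction_capacity(p):
    """Capacity (bits per digit) the correction channel needs when each
    received binary digit is wrong with probability p: -[p log p + q log q]."""
    q = 1 - p
    return -sum(r * math.log2(r) for r in (p, q) if r > 0)  # 0 log 0 = 0

# e.g. 1% errors: only about 0.08 extra bits per digit must be supplied
print(correction_capacity(0.01))   # ~0.0808
```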
The rate of transmission R can be written in two other forms due to the identities noted above. We have

$$R = H(x) - H_y(x) = H(y) - H_x(y) = H(x) + H(y) - H(x, y).$$

The first defining expression has already been interpreted as the amount
of information sent less the uncertainty of what was sent. The second measures
the amount received less the part of this which is due to noise. The third is
the sum of the two amounts less the joint entropy and therefore in a sense is
the number of bits per second common to the two. Thus all three expressions
have a certain intuitive significance. The capacity C of a noisy channel should
be the maximum possible rate of transmission, i.e., the rate when the source
is properly matched to the channel. We therefore define the channel capacity
by

$$C = \operatorname{Max}\bigl(H(x) - H_y(x)\bigr)$$
where the maximum is with respect to all possible information sources used as
input to the channel. If the channel is noiseless, H_y(x) = 0. The definition
is then equivalent to that already given for a noiseless channel since the maximum
entropy for the channel is its capacity.
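To make the definition concrete, here is a small sketch (ours, not from the paper; the function names and the binary symmetric example are our own) that evaluates H(x), H_y(x) and their difference for a two-symbol noisy channel and maximizes over input distributions:

```python
import math

def h(probs):
    """Entropy in bits of a probability vector."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def rate(P, trans):
    """R = H(x) - H_y(x) for input distribution P and transition matrix
    trans[i][j] = Pr(receive j | send i)."""
    joint = [[P[i] * trans[i][j] for j in range(len(trans[0]))] for i in range(len(P))]
    Py = [sum(row[j] for row in joint) for j in range(len(joint[0]))]
    Hx = h(P)
    Hxy = h([p for row in joint for p in row])
    Hy_x = Hxy - h(Py)            # H_y(x) = H(x, y) - H(y)
    return Hx - Hy_x

# Binary symmetric channel with error probability 0.1
bsc = [[0.9, 0.1], [0.1, 0.9]]
C = max(rate([p, 1 - p], bsc) for p in [i / 1000 for i in range(1, 1000)])
print(C)   # ~0.531 bits per symbol, i.e. 1 - H(0.1)
```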
13. THE FUNDAMENTAL THEOREM FOR A DISCRETE CHANNEL WITH NOISE

If an attempt is made to transmit at a higher rate than C, say C + R1, then there will necessarily be an equivocation equal to or greater than the excess R1.
Nature takes payment by requiring just that much uncertainty, so that we
are not actually getting any more than C through correctly. The situation is
indicated in Fig. 9. The rate of information into the channel is plotted horizontally
and the equivocation vertically. Any point above the heavy line in the shaded
region can be attained and those below cannot. The points on the line cannot
in general be attained, but there will usually be two points on the line that
can. These results are the main justification for the definition of C and will
now be proved. Theorem 11: Let a discrete channel have the capacity C and a discrete source the entropy per second H. If H ≤ C there exists a coding
system such that the output of the source can be transmitted over
the channel with an arbitrarily small frequency of errors (or an arbitrarily
small equivocation). If H > C it is possible to encode the source
so that the equivocation is less than H − C + ε where ε is arbitrarily
small. There is no method of encoding which gives an equivocation less than
H - C. The method of proving the first part of this theorem is not by exhibiting
a coding method having the desired properties, but by showing that such a code
must exist in a certain group of codes. In fact we will average the frequency of errors over this group and show that this average can be made less than ε. If the average of a set of numbers is less than ε there must exist at least one in the set which is less than ε. This
will establish the desired result. The capacity C of a noisy channel has been
defined as

$$C = \operatorname{Max}\bigl(H(x) - H_y(x)\bigr)$$
where x is the input and y the output. The maximization is over all sources
which might be used as input to the channel. Let S_0 be a source which achieves the maximum capacity C. If this maximum is not actually achieved by any source let S_0 be a source which approximates to giving the maximum rate. Suppose S_0
is used as input to the channel. We consider the possible transmitted and received
sequences of a long duration T. The following will be true: 1. The transmitted
sequences fall into two classes, a high probability group with about 2^{TH(x)}
members and the remaining sequences of small total probability. 2. Similarly
the received sequences have a high probability set of about 2^{TH(y)} members
and a low probability set of remaining sequences. 3. Each high probability
output could be produced by about 2^{TH_y(x)} inputs. The probability of all other cases has a small total probability. All the ε's and δ's implied by the words "small" and "about" in these statements approach zero as we allow T to increase and S_0 to approach the maximizing source. The situation is summarized
in Fig. 10 where the input sequences are points on the left and output sequences
points on the right. The fan of cross lines represents the range of possible
causes for a typical output.
Fig. 10-Schematic representation of the relations between inputs and outputs
in a channel. Now suppose we have another source producing information at rate
R with R < C. In the period T this source will have 2^{TR}
high probability messages. We wish to associate these with a selection of
the possible channel inputs in such a way as to get a small frequency of errors.
We will set up this association in all possible ways (using, however, only the high probability group of inputs as determined by the source S_0) and
average the frequency of errors for this large class of possible coding systems.
This is the same as calculating the frequency of errors for a random association
of the messages and channel inputs of duration T. Suppose a particular output y_1 is observed. What is the probability of more than one message in the set of possible causes of y_1? There are 2^{TR} messages distributed at random in 2^{TH(x)} points. The probability of a particular point being a message is thus

$$2^{T(R - H(x))}.$$

The probability that none of the points in the fan is a message (apart from the actual originating message) is

$$P = \bigl[1 - 2^{T(R - H(x))}\bigr]^{2^{TH_y(x)}}.$$

Now R < H(x) − H_y(x), so R − H(x) = −H_y(x) − η with η positive. Consequently

$$P = \bigl[1 - 2^{-TH_y(x) - T\eta}\bigr]^{2^{TH_y(x)}}$$

approaches (as T → ∞)

$$1 - 2^{-T\eta}.$$

Hence the probability of an error approaches zero and the first part of the theorem is proved. The second part of the theorem is easily shown by noting
that we could merely send C bits per second from the source, completely neglecting
the remainder of the information generated. At the receiver the neglected part
gives an equivocation H(x) - C and the part transmitted need only add
ε. This limit can also be attained in many other ways, as will be shown when
we consider the continuous case. The last statement of the theorem is a simple
consequence of our definition of C. Suppose we can encode a source with H(x) = C + a in such a way as to obtain an equivocation H_y(x) = a − ε with ε positive. Then R = H(x) = C + a and

$$H(x) - H_y(x) = C + \epsilon$$

with ε positive. This contradicts the definition of C as the maximum of H(x) − H_y(x). Actually more has been proved than was stated in
the theorem. If the average of a set of numbers is within ε of their maximum, a fraction of at most √ε can be more than √ε below the maximum. Since ε is arbitrarily small we can say that almost all the systems are arbitrarily close to the ideal.

14. DISCUSSION

The demonstration of Theorem 11, while
not a pure existence proof, has some of the deficiencies of such proofs. An
attempt to obtain a good approximation to ideal coding by following the method
of the proof is generally impractical. In fact, apart from some rather trivial
cases and certain limiting situations, no explicit description of a series of
approximations to the ideal has been found. Probably this is no accident but
is related to the difficulty of giving an explicit construction for a good approximation
to a random sequence. An approximation to the ideal would have the property
that if the signal is altered in a reasonable way by the noise, the original
can still be recovered. In other words the alteration will not in general bring
it closer to another reasonable signal than the original. This is accomplished
at the cost of a certain amount of redundancy in the coding. The redundancy
must be introduced in the proper way to combat the particular noise structure
involved. However, any redundancy in the source will usually help if it is utilized
at the receiving point. In particular, if the source already has a certain redundancy
and no attempt is made to eliminate it in matching to the channel, this redundancy
will help combat noise. For example, in a noiseless telegraph channel one could
save about 50% in time by proper encoding of the messages. This is not done
and most of the redundancy of English remains in the channel symbols. This has
the advantage, however, of allowing considerable noise in the channel. A
sizable fraction of the letters can be received incorrectly and still reconstructed
by the context. In fact this is probably not a bad approximation to the ideal
in many cases, since the statistical structure of English is rather involved
and the reasonable English sequences are not too far (in the sense required
for the theorem) from a random selection. As in the noiseless case a
delay is generally required to approach the ideal encoding. It now has the additional
function of allowing a large sample of noise to affect the signal before any
judgment is made at the receiving point as to the original message. Increasing
the sample size always sharpens the possible statistical assertions. The content
of Theorem 11 and its proof can be formulated in a somewhat different way which
exhibits the connection with the noiseless case more clearly. Consider the possible
signals of duration T and suppose a subset of them is selected to be used. Let
those in the subset all be used with equal probability, and suppose the receiver
is constructed to select, as the original signal, the most probable cause from
the subset, when a perturbed signal is received. We define N(T, q) to
be the maximum number of signals we can choose for the subset such that the
probability of an incorrect interpretation is less than or equal to q.
Theorem 12:

$$\lim_{T \to \infty} \frac{\log N(T, q)}{T} = C,$$

where C is the channel capacity, provided that q does not equal 0 or 1. In other words, no matter how we set our limits of reliability, we can distinguish reliably in time T enough messages to correspond to about
CT bits, when T is sufficiently large. Theorem 12 can be compared with the definition
of the capacity of a noiseless channel given in Section 1. 15. EXAMPLE OF A
DISCRETE CHANNEL AND ITS CAPACITY A simple example of a discrete channel is
indicated in Fig. 11. There are three possible symbols. The first is never affected
by noise. The second and third each have probability p of coming through undisturbed,
and q of being changed into the other of the pair. We have (letting α = −[p log p + q log q] and P and Q be the probabilities of using the first and second symbols)

$$H(x) = -P \log P - 2Q \log Q$$
$$H_y(x) = 2Q\alpha.$$

We wish to choose P and Q in such a way as to maximize H(x) − H_y(x), subject to the constraint P + 2Q = 1. Hence we consider

$$U = -[P \log P + 2Q \log Q] - 2Q\alpha + \lambda(P + 2Q)$$
$$\frac{\partial U}{\partial P} = -1 - \log P + \lambda = 0$$
$$\frac{\partial U}{\partial Q} = -2 - 2\log Q - 2\alpha + 2\lambda = 0.$$

Eliminating λ

$$\log P = \log Q + \alpha$$
$$P = Q e^{\alpha} = Q\beta \qquad \text{where } \beta = e^{\alpha},$$

so that

$$P = \frac{\beta}{\beta + 2}, \qquad Q = \frac{1}{\beta + 2}.$$

The channel capacity is then

$$C = \log \frac{\beta + 2}{\beta}.$$
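As a check on this formula (our own sketch, not part of the original; logs are taken to base 2 so that β = 2^α), the closed form can be compared with a direct numerical maximization of H(y) − H_x(y):

```python
import math

def h2(p):
    """Binary entropy in bits."""
    return -sum(r * math.log2(r) for r in (p, 1 - p) if r > 0)

def closed_form(p):
    """C = log((beta + 2)/beta) with beta = 2**alpha."""
    beta = 2 ** h2(p)
    return math.log2((beta + 2) / beta)

def brute_force(p, steps=2000):
    """Directly maximize H(y) - H_x(y) subject to P + 2Q = 1."""
    best = 0.0
    for i in range(1, steps):
        P = i / steps
        Q = (1 - P) / 2
        Hy = -sum(t * math.log2(t) for t in (P, Q, Q) if t > 0)  # output entropy
        best = max(best, Hy - 2 * Q * h2(p))                     # minus noise entropy
    return best

print(closed_form(0.8), brute_force(0.8))   # both ~1.146 bits per symbol
```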
Note how this checks the obvious values in the cases p = 1 and p = 1/2. In the first, β = 1 and C = log 3, which is correct since the channel is then noiseless with three possible symbols. If p = 1/2, β = 2 and C = log 2. Here the second and third symbols cannot be distinguished at all and act together like one symbol. The first symbol is used with probability P = 1/2 and the second and third together with probability 1/2. This may be distributed between them in any desired way and still achieve the maximum capacity. For intermediate values of p the channel capacity will lie between log 2 and log 3. The distinction
between the second and third symbols conveys some information but not as much
as in the noiseless case. The first symbol is used somewhat more frequently
than the other two because of its freedom from noise. 16. THE CHANNEL CAPACITY
IN CERTAIN SPECIAL CASES If the noise affects successive channel symbols independently
it can be described by a set of transition probabilities p_ij. This is the probability, if symbol i is sent, that j will be received. The maximum channel rate is then given by the maximum of

$$-\sum_{i,j} P_i p_{ij} \log \sum_i P_i p_{ij} + \sum_{i,j} P_i p_{ij} \log p_{ij}$$

where we vary the P_i subject to ΣP_i = 1. This leads by the method of Lagrange to the equations,

$$\sum_j p_{sj} \log \frac{p_{sj}}{\sum_i P_i p_{ij}} = \mu, \qquad s = 1, 2, \ldots.$$
Multiplying by P_s and summing on s shows that μ = C. Let the inverse of p_{sj} (if it exists) be h_{st} so that Σ_s h_{st} p_{sj} = δ_{tj}. Then:

$$\sum_{s,j} h_{st} p_{sj} \log p_{sj} - \log \sum_i P_i p_{it} = C \sum_s h_{st}.$$

Hence:

$$\sum_i P_i p_{it} = \exp\Bigl[-C \sum_s h_{st} + \sum_{s,j} h_{st} p_{sj} \log p_{sj}\Bigr]$$

or,

$$P_i = \sum_t h_{it} \exp\Bigl[-C \sum_s h_{st} + \sum_{s,j} h_{st} p_{sj} \log p_{sj}\Bigr].$$
This is the system of equations for determining the maximizing values of Pi,
with C to be determined so that ΣP_i = 1. When this is
done C will be the channel capacity, and the Pi the proper probabilities for
the channel symbols to achieve this capacity. If each input symbol has the same
set of probabilities on the lines emerging from it, and the same is true of
each output symbol, the capacity can be easily calculated. Examples are shown
in Fig. 12. In such a case H_x(y) is independent of the distribution of probabilities on the input symbols, and is given by −Σ p_i log p_i where the p_i are the values of the transition probabilities from any input symbol. The channel capacity is

$$\operatorname{Max}\bigl[H(y) - H_x(y)\bigr] = \operatorname{Max} H(y) + \sum p_i \log p_i.$$

The maximum of H(y) is clearly log m where m is the number of output symbols, since it is possible to make them all equally probable by making the input symbols equally probable. The channel capacity is therefore

$$C = \log m + \sum p_i \log p_i.$$
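A small sketch of ours (the function name and the numerical example are hypothetical; the formula applies when every input and every output symbol sees the same set of transition probabilities):

```python
import math

def symmetric_capacity(row):
    """C = log m + sum(p_i log p_i) for a uniformly symmetric channel whose
    transition probabilities out of each symbol are given by `row`."""
    m = len(row)
    return math.log2(m) + sum(p * math.log2(p) for p in row if p > 0)

# Four symbols; each is received correctly with probability 1/2 and as one of
# two specific neighbours with probability 1/4 each (a hypothetical example).
print(symmetric_capacity([0.5, 0.25, 0.25, 0.0]))   # 0.5 bits per symbol
```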
Fig. 12-Examples of discrete channels with the same transition probabilities
for each input and for each output. In Fig. 12a it would be
In Fig. 12c we have
Suppose the symbols fall into several groups such that the noise never causes
a symbol in one group to be mistaken for a symbol in another group. Let the
capacity for the nth group be C_n (in bits per second) when we use only the symbols in this group. Then it is easily shown that, for best use of the entire set, the total probability P_n of all symbols in the nth group should be

$$P_n = \frac{2^{C_n}}{\sum 2^{C_n}}.$$

Within a group the probability is distributed just as it would be if these were the only symbols being used. The channel capacity is

$$C = \log \sum 2^{C_n}.$$
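A brief sketch (ours) of how these two formulas combine numerically:

```python
import math

def combine_groups(capacities):
    """Overall capacity and per-group probabilities when the noise never
    confuses symbols of different groups.  capacities: the C_n in bits."""
    total = sum(2 ** c for c in capacities)
    probs = [2 ** c / total for c in capacities]
    return math.log2(total), probs

C, P = combine_groups([1.0, 2.0])     # e.g. a 1-bit group and a 2-bit group
print(C, P)                           # ~2.585 bits, probabilities [1/3, 2/3]
```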
17. AN EXAMPLE OF EFFICIENT CODING The following example, although somewhat
unrealistic, is a case in which exact matching to a noisy channel is possible.
There are two channel symbols, 0 and 1, and the noise affects them in blocks
of seven symbols. A block of seven is either transmitted without error, or exactly
one symbol of the seven is incorrect. These eight possibilities are equally
likely. We have

$$C = \tfrac{1}{7}\bigl[7 - \log 8\bigr] = \tfrac{4}{7} \text{ bits per symbol.}$$
An efficient code, allowing complete correction of errors and transmitting at
the rate C, is the following (found by a method due to R. Hamming): Let a block of seven symbols be X_1, X_2, ..., X_7. Of these X_3, X_5, X_6 and X_7 are message symbols and chosen arbitrarily by the source. The other three are redundant and calculated as follows:

X_4 is chosen to make α = X_4 + X_5 + X_6 + X_7 even
X_2 is chosen to make β = X_2 + X_3 + X_6 + X_7 even
X_1 is chosen to make γ = X_1 + X_3 + X_5 + X_7 even

When a block of seven is received, α, β and γ are calculated and if even called zero, if odd called one. The binary number α β γ then gives the subscript of the X_i that is incorrect (if 0 there was no error).
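The parity checks above are easy to simulate. Here is a short sketch (ours, not from the paper) that encodes four message bits into the block X_1 ... X_7 and corrects any single error exactly as described:

```python
def encode(m3, m5, m6, m7):
    """Place message bits in X3, X5, X6, X7 and fill the check bits
    X4, X2, X1 so that alpha, beta, gamma are even (all bits 0/1)."""
    x = [0] * 8                       # x[1..7]; x[0] unused
    x[3], x[5], x[6], x[7] = m3, m5, m6, m7
    x[4] = (x[5] + x[6] + x[7]) % 2   # alpha even
    x[2] = (x[3] + x[6] + x[7]) % 2   # beta even
    x[1] = (x[3] + x[5] + x[7]) % 2   # gamma even
    return x[1:]

def correct(block):
    """Return the block with the single error (if any) corrected."""
    x = [0] + list(block)
    alpha = (x[4] + x[5] + x[6] + x[7]) % 2
    beta  = (x[2] + x[3] + x[6] + x[7]) % 2
    gamma = (x[1] + x[3] + x[5] + x[7]) % 2
    i = 4 * alpha + 2 * beta + gamma  # the binary number alpha beta gamma
    if i:                             # i = 0 means no error
        x[i] ^= 1
    return x[1:]

sent = encode(1, 0, 1, 1)
garbled = list(sent)
garbled[4] ^= 1                       # flip X5 (index 4 in the 0-based list)
assert correct(garbled) == sent
```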
the Xi that is incorrect (if 0 there was no error). APPENDIX 1 THE GROWTH OF
THE NUMBER OF BLOCKS OF SYMBOLS WITH A FINITE STATE CONDITION Let Ni (L) be
the number of blocks of symbols of length L ending in state i. Then we
have
ij i. . .
, b ~ are the length of the symbols which maybe chosen
instate i and lead to state j. These are linear difference equations and the
behavior as L must be of the type
Substituting in the difference equation
or
For this to be possible the determinant
must vanish and this determines W, which is, of course, the largest real root
of D = 0. The quantity C is then given by
and we also note that the same growth properties result if we require that all
blocks start in the same (arbitrarily chosen) state. APPENDIX 2
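For the special case of a single state (no constraints) the determinant condition reduces to Σ_s W^(-t_s) = 1, and the root can be found numerically. A brief sketch of ours (function name and example durations are hypothetical):

```python
import math

def capacity(durations, lo=1.0 + 1e-9, hi=16.0):
    """Largest real root W of sum(W**-t) = 1 by bisection, for a single-state
    graph whose symbols have the given durations; returns C = log2(W)."""
    f = lambda w: sum(w ** -t for t in durations) - 1
    for _ in range(200):
        mid = (lo + hi) / 2
        lo, hi = (mid, hi) if f(mid) > 0 else (lo, mid)
    return math.log2(lo)

# Two symbols of durations 1 and 2: W is the golden ratio, C ~ 0.694 bits per unit time
print(capacity([1, 2]))
```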
APPENDIX 2

Let H(1/n, 1/n, ..., 1/n) = A(n). From condition (3) we can decompose a choice from s^m equally likely possibilities into a series of m choices from s equally likely possibilities and obtain

$$A(s^m) = m\,A(s).$$

Similarly

$$A(t^n) = n\,A(t).$$

We can choose n arbitrarily large and find an m to satisfy

$$s^m \leq t^n < s^{m+1}.$$

Thus, taking logarithms and dividing by n log s,

$$\frac{m}{n} \leq \frac{\log t}{\log s} \leq \frac{m}{n} + \frac{1}{n} \qquad \text{or} \qquad \Bigl|\frac{m}{n} - \frac{\log t}{\log s}\Bigr| < \epsilon$$
where ε is arbitrarily small. Now from the monotonic property of A(n),

$$A(s^m) \leq A(t^n) \leq A(s^{m+1})$$
$$m\,A(s) \leq n\,A(t) \leq (m+1)\,A(s).$$

Hence, dividing by nA(s),

$$\frac{m}{n} \leq \frac{A(t)}{A(s)} \leq \frac{m}{n} + \frac{1}{n} \qquad \text{or} \qquad \Bigl|\frac{m}{n} - \frac{A(t)}{A(s)}\Bigr| < \epsilon$$
$$\Bigl|\frac{A(t)}{A(s)} - \frac{\log t}{\log s}\Bigr| < 2\epsilon, \qquad A(t) = K \log t,$$

where K must be positive to satisfy (2). Now suppose we have a choice from n possibilities with commeasurable probabilities p_i = n_i / Σn_i where
the n_i are integers. We can break down a choice from Σn_i possibilities into a choice from n possibilities with probabilities p_1, ..., p_n and then, if the ith was chosen, a choice from n_i with equal probabilities. Using condition (3) again, we equate the total choice from Σn_i as computed by two methods

$$K \log \sum n_i = H(p_1, \ldots, p_n) + K \sum p_i \log n_i.$$

Hence

$$H = K\Bigl[\sum p_i \log \sum n_i - \sum p_i \log n_i\Bigr] = -K \sum p_i \log \frac{n_i}{\sum n_i} = -K \sum p_i \log p_i.$$
If the p_i are incommeasurable, they may be approximated by rationals and the
same expression must hold by our continuity assumption. Thus the expression
holds in general. The choice of coefficient K is a matter of convenience and
amounts to the choice of a unit of measure. APPENDIX 3 THEOREMS ON ERGODIC SOURCES
If it is possible to go from any state with P > 0 to any other along a path
of probability P > 0, the system is ergodic and the strong law of large
numbers can be applied. Thus the number of times a given path p_ij in the network is traversed in a long sequence of length N is about proportional to the probability of being at i, say P_i, and then choosing this path, P_i p_ij N. If N is large enough the probability of percentage error ±δ in this is less than ε so that for all but a set of small probability the actual numbers lie within the limits

$$(P_i p_{ij} \pm \delta)N.$$
or This proves Theorem 3. Theorem 4 follows immediately from this on calculating
upper and lower bounds for n(q) based on the possible range of values of P in Theorem 3. In the mixed (not ergodic) case if
To prove Theorems 5 and 6, first note that F_N is monotonic decreasing because increasing N adds a subscript to a conditional entropy. A simple substitution for p_{B_i}(S_j) in the definition of F_N shows that

$$F_N = N G_N - (N - 1) G_{N-1}$$

and summing this for all N gives

$$G_N = \frac{1}{N} \sum_{n=1}^{N} F_n.$$

Hence G_N ≥ F_N and G_N monotonic decreasing. Also they
must approach the same limit. By using Theorem 3 we see that lim_{N→∞} G_N = H.

APPENDIX 4

MAXIMIZING
THE RATE FOR A SYSTEM OF CONSTRAINTS Suppose we have a set of constraints on
sequences of symbols that is of the finite state type and can be represented
therefore by a linear graph. Let ℓ_ij^(s) be the lengths of the various symbols
that can occur in passing from state i to state j. What distribution
of probabilities P_i for the different states and p_ij^(s) for choosing symbol s
in state i and going to state j maximizes the rate of generating information
under these constraints? The constraints define a discrete channel and the maximum
rate must be less than or equal to the capacity C of this channel, since if
all blocks of large length were equally likely, this rate would result, and
if possible this would be best. We will show that this rate can be achieved
by proper choice of the P_i and p_ij^(s). The rate in question is

$$R = \frac{-\sum_{i,j,s} P_i\, p_{ij}^{(s)} \log p_{ij}^{(s)}}{\sum_{i,j,s} P_i\, p_{ij}^{(s)} \ell_{ij}^{(s)}} = \frac{N}{M}.$$

Solving for p_ij^(s),
Since The
correct value of D is the capacity C and the Bj are solutions of
for then
or
So that if Ai satisfy
Both the sets of equations for B_i and γ_i can be satisfied since C is
such that
In this case the rate is
but
Hence the rate is C and as this could never be exceeded this is the maximum,
justifying the assumed solution. PART III: MATHEMATICAL PRELIMINARIES In this
final installment of the paper we consider the case where the signals or the
messages or both are continuously variable, in contrast with the discrete nature
assumed heretofore. To a considerable extent the continuous case can be obtained
through a limiting process from the discrete case by dividing the continuum
of messages and signals into a large but finite number of small regions and
calculating the various parameters involved on a discrete basis. As the size
of the regions is decreased these parameters in general approach as limits the
proper values for the continuous case. There are, however, a few new effects
that appear and also a general change of emphasis in the direction of specialization
of the general results to particular cases. We will not attempt, in the continuous
case, to obtain our results with the greatest generality, or with the extreme
rigor of pure mathematics, since this would involve a great deal of abstract
measure theory and would obscure the main thread of the analysis. A preliminary
study, however, indicates that the theory can be formulated in a completely
axiomatic and rigorous manner which includes both the continuous and discrete
cases and many others. The occasional liberties taken with limiting processes
in the present analysis can be justified in all cases of practical interest.
18. SETS AND ENSEMBLES OF FUNCTIONS We shall have to deal in the continuous
case with sets of functions and ensembles of functions. A set of functions,
as the name implies, is merely a class or collection of functions, generally
of one variable, time. It can be specified by giving an explicit representation
of the various functions in the set, or implicitly by giving a property which
functions in the set possess and others do not. Some examples are: 1. The set
of functions: f_θ(t) = sin(t + θ). Each particular value of θ determines a particular function in the set. 2. The set of all functions of time containing
no frequencies over W cycles per second. 3. The set of all functions limited
in band to W and in amplitude to A. 4. The set of all English speech signals
as functions of time. An ensemble of functions is a set of functions
together with a probability measure whereby we may determine the probability
of a function in the set having certain properties.¹ For example with the set

$$f_\theta(t) = \sin(t + \theta),$$

we may give a probability distribution for θ, P(θ). The set then becomes an
ensemble. Some further examples of ensembles of functions are: 1. A finite set
of functions f_k(t) (k = 1, 2, ..., n) with the probability of f_k
being p_k. 2. A finite dimensional family of functions f(α_1, α_2, ..., α_n; t) with a probability distribution on the parameters α_i: P(α_1, ..., α_n). For example we could consider the ensemble defined by

$$f(a_1, \ldots, a_n, \theta_1, \ldots, \theta_n; t) = \sum_{i=1}^{n} a_i \sin i(\omega t + \theta_i)$$
with the amplitudes ai distributed normally and independently, and the phases
θ_i distributed uniformly (from 0 to 2π) and independently. ¹In mathematical
terminology the functions belong to a measure space whose total measure is unity.
3. The ensemble

$$f(a_i; t) = \sum_{n=-\infty}^{+\infty} a_n \frac{\sin \pi(2Wt - n)}{\pi(2Wt - n)}$$

with the a_i normal and independent all with the same standard deviation √N. This is a representation of "white" noise, band limited to the band
from 0 to W cycles per second and with average power N.2 4. Let points
be distributed on the t axis according to a Poisson distribution. At
each selected point the function f (t) is placed and the different functions
added, giving the ensemble

$$\sum_{k=-\infty}^{+\infty} f(t + t_k)$$
where the tk are the points of the Poisson distribution. This ensemble
can be considered as a type of impulse or shot noise where all the impulses
are identical. 5. The set of English speech functions with the probability measure
given by the frequency of occurrence in ordinary use. An ensemble of functions f_α(t) is stationary if the same ensemble results when all functions are
shifted any fixed amount in time. The ensemble

$$f_\theta(t) = \sin(t + \theta)$$

is stationary if θ is distributed uniformly from 0 to 2π. If we shift each function by t_1 we obtain

$$f_\theta(t + t_1) = \sin(t + t_1 + \theta) = \sin(t + \varphi)$$

with φ distributed uniformly from 0 to 2π. Each function has changed
but the ensemble as a whole is invariant under the translation. The other examples
given above are also stationary. An ensemble is ergodic if it is stationary, and there is no subset of the functions in the set with a probability different from 0 and 1 which is stationary. The ensemble

$$\sin(t + \theta)$$

is ergodic. No subset of these functions of probability ≠ 0, 1 is transformed
into itself under all time translations. On the other hand the ensemble

$$a \sin(t + \theta)$$

with a distributed normally and θ uniform is stationary but not ergodic.
The subset of these functions with a between 0 and 1 for example is stationary.
Of the examples given, 3 and 4 are ergodic, and 5 may perhaps be considered
so. If an ensemble is ergodic we may say roughly that each function in the set
is typical of the ensemble. More precisely it is known that with an ergodic
ensemble an average of any statistic over the ensemble is equal (with probability
1) to an average over the time translations of a particular function of the
set 3 Roughly speaking, each function can be expected, as time progresses, to
go through, with the proper frequency, all the convolutions of any of the functions
in the set. ²This representation can be used as a definition of band limited
white noise. It has certain advantages in that it involves fewer limiting operations
than do definitions that have been used in the past. The name "white noise,"
already firmly entrenched in the literature, is perhaps somewhat unfortunate.
In optics white light means either any continuous spectrum as contrasted with
a point spectrum, or a spectrum which is flat with wavelength (which
is not the same as a spectrum flat with frequency). ³This is the famous ergodic
theorem or rather one aspect of this theorem which was proved in somewhat different
formulations by Birkhoff, von Neumann, and Koopman, and subsequently generalized
by Wiener, Hopf, Hurewicz and others. The literature on ergodic theory is quite
extensive and the reader is referred to the papers of these writers for precise
and general formulations; e.g., E. Hopf, "Ergodentheorie," Ergebnisse der
Mathematik und ihrer Grenzgebiete, v. 5; "On Causality Statistics and Probability,"
Journal of Mathematics and Physics, v. XIII, No. 1, 1934; N. Wiener,
"The Ergodic Theorem," Duke Mathematical Journal, v. 5, 1939. 33
Probability measure is defined for the set ga (t) by means of that for the set
f_α(t). The probability of a certain subset of the g_α(t) functions is equal to that of the subset of the f_α(t) functions which produce members of the given
subset of g functions under the operation T. Physically this corresponds to
passing the ensemble through some device, for example, a filter, a rectifier
or a modulator. The output functions of the device form the ensemble g_α(t).
A device or operator T will be called invariant if shifting the input merely
shifts the output, i.e., if

$$g_\alpha(t) = T f_\alpha(t)$$

implies

$$g_\alpha(t + t_1) = T f_\alpha(t + t_1)$$

for all f_α(t) and all t_1. It is easily shown (see Appendix 5) that if T is invariant
and the input ensemble is stationary then the output ensemble is stationary.
Likewise if the input is ergodic the output will also be ergodic. A filter or
a rectifier is invariant under all time translations. The operation of modulation
is not since the carrier phase gives a certain time structure. However, modulation
is invariant under all translations which are multiples of the period of the
carrier. Wiener has pointed out the intimate relation between the invariance
of physical devices under time translations and Fourier theory.⁴ He has shown,
in fact, that if a device is linear as well as invariant Fourier analysis is
then the appropriate mathematical tool for dealing with the problem. An ensemble
of functions is the appropriate mathematical representation of the messages
produced by a continuous source (for example, speech), of the signals produced
by a transmitter, and of the perturbing noise. Communication theory is properly
concerned, as has been emphasized by Wiener, not with operations on particular
functions, but with operations on ensembles of functions. A communication system
is designed not for a particular speech function and still less for a sine wave,
but for the ensemble of speech functions. 19. BAND LIMITED ENSEMBLES OF FUNCTIONS
If a function of time f (t) is limited to the band from 0 to W cycles
per second it is completely determined by giving its ordinates at a series of
discrete points spaced 1/(2W) seconds apart in the manner indicated by the following result.⁵ Theorem 13: Let f(t) contain no frequencies over W. Then

$$f(t) = \sum_{n=-\infty}^{\infty} X_n \frac{\sin \pi(2Wt - n)}{\pi(2Wt - n)}$$

where

$$X_n = f\Bigl(\frac{n}{2W}\Bigr).$$
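A small numerical sketch of the reconstruction formula (ours, not part of the paper; the test signal and sample count are arbitrary choices): sample a band-limited signal at the rate 2W and rebuild it from the samples X_n.

```python
import math

def sinc(x):
    return 1.0 if x == 0 else math.sin(math.pi * x) / (math.pi * x)

def reconstruct(samples, W, t):
    """f(t) = sum_n X_n sinc(2W t - n), with X_n = f(n / 2W)."""
    return sum(x * sinc(2 * W * t - n) for n, x in enumerate(samples))

W = 4.0                                       # band limit in cycles per second
f = lambda t: math.cos(2 * math.pi * 3 * t)   # a 3 Hz tone, within the band
samples = [f(n / (2 * W)) for n in range(200)]
t = 100.3 / (2 * W)                           # a point between sample instants
print(f(t), reconstruct(samples, W, t))       # nearly equal (finite-sum truncation only)
```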
⁴Communication theory is heavily indebted to Wiener for much of its basic philosophy
and theory. His classic NDRC report, The Interpolation, Extrapolation and
Smoothing of Stationary Time Series (Wiley, 1949), contains the first clear-cut
formulation of communication theory as a statistical problem, the study of operations
on time series. This work, although chiefly concerned with the linear prediction
and filtering problem, is an important collateral reference in connection with
the present paper. We may also refer here to Wiener's Cybernetics (Wiley,
1948), dealing with the general problems of communication and control. ⁵For
a proof of this theorem and further discussion see the author's paper "Communication
in the Presence of Noise" published in the Proceedings of the Institute of
Radio Engineers, v. 37, No. 1, Jan., 1949, pp. 10-21. A function can be
considered to be substantially limited to a time T if all the ordinates X_n
outside this interval of time are zero. In this case all but 2TW of the coordinates
will be zero. Thus functions limited to a band W and duration T correspond to
points in a space of 2TW dimensions. A subset of the functions of band W and
duration T corresponds to a region in this space. For example, the functions
whose total energy is less than or equal to E correspond to points in a 2TW
dimensional sphere with radius r = √(2WE). An ensemble of functions of
limited duration and band will be represented by a probability distribution
p(x_1, ..., x_n) in the corresponding n dimensional space. If
the ensemble is not limited in time we can consider the 2TW coordinates in a
given interval T to represent substantially the part of the function in the
interval T and the probability distribution p(x_1, ..., x_n) to give the statistical
structure of the ensemble for intervals of that duration. 20. ENTROPY OF A CONTINUOUS
DISTRIBUTION

The entropy of a discrete set of probabilities p_1, ..., p_n has been defined as:

$$H = -\sum p_i \log p_i.$$

In an analogous manner we define the entropy of a continuous distribution with the density distribution function p(x) by:

$$H = -\int_{-\infty}^{\infty} p(x) \log p(x)\, dx.$$

With an n dimensional distribution p(x_1, ..., x_n) we have

$$H = -\int \cdots \int p(x_1, \ldots, x_n) \log p(x_1, \ldots, x_n)\, dx_1 \cdots dx_n.$$

If we have two arguments x and y (which may themselves be multidimensional) the joint and conditional entropies of p(x, y) are given by

$$H(x, y) = -\iint p(x, y) \log p(x, y)\, dx\, dy$$

and

$$H_x(y) = -\iint p(x, y) \log \frac{p(x, y)}{p(x)}\, dx\, dy$$
$$H_y(x) = -\iint p(x, y) \log \frac{p(x, y)}{p(y)}\, dx\, dy$$

where

$$p(x) = \int p(x, y)\, dy, \qquad p(y) = \int p(x, y)\, dx.$$
The entropies of continuous distributions have most (but not all) of the properties
of the discrete case. In particular we have the following: 1. If x is limited
to a certain volume v in its space, then H(x) is a maximum and equal to logv
when p(x) is constant (1/v) in the volume.
2. With any two variables x, y we have

$$H(x, y) \leq H(x) + H(y)$$

with equality if (and only if) x and y are independent, i.e., p(x, y) = p(x) p(y). 3. Consider a generalized averaging operation of the following type:

$$p(y) = \int a(x, y)\, p(x)\, dx \qquad \text{with} \qquad \int a(x, y)\, dx = \int a(x, y)\, dy = 1, \quad a(x, y) \geq 0.$$
Then the entropy of the averaged distribution p(y) is equal to or greater than
that of the original distribution p(x). 4. We have

$$H(x, y) = H(x) + H_x(y) = H(y) + H_y(x)$$

and H_x(y) ≤ H(y). 5. Let p(x) be a one-dimensional distribution. The
form of p(x) giving a maximum entropy subject to the condition that the standard
deviation of x be fixed at σ is Gaussian. To show this we must maximize

$$H(x) = -\int p(x) \log p(x)\, dx$$

with

$$\sigma^2 = \int p(x)\, x^2\, dx \qquad \text{and} \qquad 1 = \int p(x)\, dx$$

as constraints. The condition for this is

$$-1 - \log p(x) + \lambda x^2 + \mu = 0$$

and consequently (adjusting the constants to satisfy the constraints)

$$p(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-x^2 / 2\sigma^2}.$$
Similarly in n dimensions, suppose the second order moments of p(x_1, ..., x_n) are fixed at A_ij:

$$A_{ij} = \int \cdots \int x_i x_j\, p(x_1, \ldots, x_n)\, dx_1 \cdots dx_n.$$

Then the maximum entropy occurs (by a similar calculation) when p(x_1, ..., x_n) is the n dimensional Gaussian distribution with the second order moments A_ij. 6. The entropy of a one-dimensional Gaussian distribution whose standard deviation is σ is given by

$$H(x) = \log \sqrt{2\pi e}\, \sigma.$$

7. If x is limited to a half line (p(x) = 0 for x ≤ 0) and the first moment of x is fixed at a:

$$a = \int_0^{\infty} p(x)\, x\, dx,$$

then the maximum entropy occurs when

$$p(x) = \frac{1}{a}\, e^{-x/a}$$

and is equal to log ea. 8. There is one important difference between
the continuous and discrete entropies. In the discrete case the entropy measures
in an absolute way the randomness of the chance variable. In the continuous
case the measurement is relative to the coordinate system. If we change
coordinates the entropy will in general change. In fact if we change to coordinates
y_1, ..., y_n the new entropy is given by

$$H(y) = -\int \cdots \int p(x)\, J(y) \log\bigl[p(x)\, J(y)\bigr]\, dy_1 \cdots dy_n$$

where J(y) is the Jacobian of the coordinate transformation. On expanding the logarithm and changing the variables to x_1, ..., x_n, we obtain:

$$H(y) = H(x) - \int \cdots \int p(x) \log J(y)\, dx_1 \cdots dx_n.$$

Thus the new entropy is the old entropy less the expected logarithm of the
Jacobian. In the continuous case the entropy can be considered a measure of
randomness relative to an assumed standard, namely the coordinate system
chosen with each small volume element dx_1 ... dx_n given equal weight.
When we change the coordinate system the entropy in the new system measures
the randomness when equal volume elements dy_1 ... dy_n in the new
system are given equal weight. In spite of this dependence on the coordinate
system the entropy concept is as important in the continuous case as the discrete
case. This is due to the fact that the derived concepts of information rate
and channel capacity depend on the difference of two entropies and this
difference does not depend on the coordinate frame, each of the two terms
being changed by the same amount. The entropy of a continuous distribution can
be negative. The scale of measurements sets an arbitrary zero corresponding
to a uniform distribution over a unit volume. A distribution which is more confined
than this has less entropy and will be negative. The rates and capacities will,
however, always be nonnegative. 9. A particular case of changing coordinates
is the linear transformation

$$y_j = \sum_i a_{ij} x_i.$$

In this case the Jacobian is simply the determinant |a_{ij}|^{-1} and

$$H(y) = H(x) + \log |a_{ij}|.$$

In the case of a rotation of coordinates (or any measure preserving transformation) J = 1 and H(y) = H(x).

21. ENTROPY OF AN ENSEMBLE OF FUNCTIONS
Consider an ergodic ensemble of functions limited to a certain band of width
W cycles per second. Let

$$p(x_1, \ldots, x_n)$$

be the density distribution function for amplitudes x_1, ..., x_n at n successive sample points. We define the entropy of the ensemble per degree of freedom by

$$H' = -\lim_{n \to \infty} \frac{1}{n} \int \cdots \int p(x_1, \ldots, x_n) \log p(x_1, \ldots, x_n)\, dx_1 \cdots dx_n.$$

We may also define an entropy H per second by dividing, not by n, but by the time T in seconds for n samples. Since n = 2TW, H = 2WH'. With white thermal noise p is Gaussian and we have

$$H' = \log \sqrt{2\pi e N},$$
$$H = W \log 2\pi e N.$$
For a given average power N, white noise has the maximum possible entropy. This
follows from the maximizing properties of the Gaussian distribution noted above.
The entropy for a continuous stochastic process has many properties analogous
to that for discrete processes. In the discrete case the entropy was related
to the logarithm of the Probability of long sequences, and to the number
of reasonably probable sequences of long length. In the continuous case
it is related in a similar fashion to the logarithm of the Probability density
for a long series of samples, and the volume of reasonably high probability
in the function space. More precisely, if we assume p(xl,...,x„) continuous
in all the x_i for all n, then for sufficiently large n

$$\Bigl|\frac{\log 1/p}{n} - H'\Bigr| < \epsilon$$

for all choices of (x_1, ..., x_n) apart from a set whose total probability
is less than δ, with δ and ε arbitrarily small. This follows from the ergodic
property if we divide the space into a large number of small cells. The relation
of H to volume can be stated as follows: Under the same assumptions consider
the n dimensional space corresponding to p(x_1, ..., x_n). Let V_n(q) be the smallest volume in this space which includes in its interior a total probability q. Then

$$\lim_{n \to \infty} \frac{\log V_n(q)}{n} = H'$$
provided q does not equal 0 or 1. These results show that for large n
there is a rather well-defined volume (at least in the logarithmic sense)
of high probability, and that within this volume the probability density is
relatively uniform (again in the logarithmic sense). In the white noise case
the distribution function is given by

$$p(x_1, \ldots, x_n) = (2\pi N)^{-n/2} \exp\Bigl(-\sum x_i^2 / 2N\Bigr).$$

Since this depends only on Σx_i² the surfaces of equal probability density are spheres and the entire distribution has spherical symmetry. The region of high probability is a sphere of radius √(nN). As n → ∞ the probability of being outside a sphere of radius √(n(N + ε)) approaches zero and 1/n times the logarithm of the volume of the sphere approaches log √(2πeN).
In the continuous case it is convenient to work not with the entropy H
of an ensemble but with a derived quantity which we will call the entropy
power. This is defined as the power in a white noise limited to the same band
as the original ensemble and having the same entropy. In other words if H'
is the entropy of an ensemble its entropy power is

$$N_1 = \frac{1}{2\pi e}\, e^{2H'}.$$
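A small sketch (ours; function names are hypothetical) of the round trip between power and entropy power, with natural logarithms as in the formula above:

```python
import math

def entropy_per_dof(N):
    """H' of white noise of average power N (natural log units)."""
    return math.log(math.sqrt(2 * math.pi * math.e * N))

def entropy_power(H1):
    """Power of the white noise having the same entropy H' per degree of freedom."""
    return math.exp(2 * H1) / (2 * math.pi * math.e)

print(entropy_power(entropy_per_dof(3.0)))   # 3.0: white noise is its own entropy power
```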
22. ENTROPY LOSS IN LINEAR FILTERS Theorem 14: If an ensemble
having an entropy H_1 per degree of freedom in band W is passed through a filter with characteristic Y(f) the output ensemble has an entropy

$$H_2 = H_1 + \frac{1}{W} \int_W \log \bigl|Y(f)\bigr|^2\, df.$$
The operation of the filter is essentially a linear transformation of coordinates.
If we think of the different frequency components as the original coordinate
system, the new frequency components are merely the old ones multiplied by factors.
The coordinate transformation matrix is thus essentially diagonalized in terms
of these coordinates. The Jacobian of the transformation is (for n sine and n cosine components)

$$J = \prod_{i=1}^{n} \bigl|Y(f_i)\bigr|^2$$
where the f_i are equally spaced through the band W. This becomes in the limit

$$\exp\Bigl(\frac{1}{W} \int_W \log |Y(f)|^2\, df\Bigr).$$
Since J is constant its average value is the same quantity and applying the
theorem on the change of entropy with a change of coordinates, the result follows.
We may also phrase it in terms of the entropy power. Thus if the entropy power
of the first ensemble is N_1 that of the second is

$$N_2 = N_1 \exp\Bigl(\frac{1}{W} \int_W \log |Y(f)|^2\, df\Bigr).$$
TABLE I
The final entropy power is the initial entropy power multiplied by the geometric
mean gain of the filter. If the gain is measured in db, then the output
entropy power will be increased by the arithmetic mean db gain over W.
In Table I the entropy power loss has been calculated (and also expressed in
db) for a number of ideal gain characteristics. The impulsive responses of these filters are also given for W = 2π, with phase assumed to be 0. The entropy loss for many other cases can be obtained from these results. For example the entropy power factor 1/e² for the first case also applies to any gain characteristic obtained from 1 − ω by a measure preserving transformation of the ω axis. In particular a linearly increasing gain G(ω) = ω, or a "saw tooth" characteristic between 0 and 1, have the same entropy loss. The reciprocal gain has the reciprocal factor. Thus 1/ω has the factor e². Raising the gain to any power raises the factor to this power.
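As a numerical check (our sketch, not from the paper; the function name is our own), the entropy power factor exp{(1/W)∫_W log|Y(f)|² df} can be evaluated for the gain characteristic 1 − ω on the unit band, and it does come out as 1/e²:

```python
import math

def entropy_power_factor(gain, W=1.0, steps=100000):
    """exp((1/W) * integral over the band of log |Y(f)|^2 df), midpoint rule."""
    total = 0.0
    for k in range(steps):
        f = (k + 0.5) * W / steps
        total += 2 * math.log(gain(f)) * (W / steps)
    return math.exp(total / W)

print(entropy_power_factor(lambda w: 1 - w))   # ~0.1353 = 1/e^2
print(entropy_power_factor(lambda w: w))       # same factor for G(w) = w
```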
23. ENTROPY OF A SUM OF TWO ENSEMBLES
If we have two ensembles of functions f_α(t) and g_β(t) we can form a new ensemble by "addition." Suppose the first ensemble has the probability density function p(x_1, ..., x_n) and the second q(x_1, ..., x_n). Then the