Roughly then, H_y(x) is the amount of additional information that must be supplied per second at the receiving point to correct the received message. To prove the first part, consider long sequences of received message M' and corresponding original message M. There will be logarithmically T H_y(x) of the M's which could reasonably have produced each M'. Thus we have T H_y(x) binary digits to send each T seconds. This can be done with ε frequency of errors on a channel of capacity H_y(x).
The second part can be proved by noting, first, that for any discrete chance variables x, y, z

$$H_y(x, z) \geq H_y(x).$$

The left-hand side can be expanded to give

$$H_y(z) + H_{yz}(x) \geq H_y(x)$$
$$H_{yz}(x) \geq H_y(x) - H_y(z) \geq H_y(x) - H(z).$$

If we identify x as the output of the source, y as the received signal and z as the signal sent over the correction channel, then the right-hand side is the equivocation less the rate of transmission over the correction channel. If the capacity of this channel is less than the equivocation the right-hand side will be greater than zero and H_yz(x) > 0. But this is the uncertainty of what was sent, knowing both the received signal
and the correction signal. If this is greater than zero the frequency of errors
cannot be arbitrarily small. Example: Suppose the errors occur at random
in a sequence of binary digits: probability p that a digit is wrong and q
= 1 - p that it is right. These errors can be corrected if their position
is known. Thus the correction channel need only send information as to these
positions. This amounts to transmitting from a source which produces binary
digits with probability p for 1 (incorrect) and q for 0 (correct). This requires a channel of capacity

$$-[p \log p + q \log q]$$

which is the equivocation of the original system.
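As a quick numerical illustration (a minimal sketch of ours, not part of the original paper; the function name is our own), the required correction-channel capacity is just the binary entropy of the error probability:

```python
import math

def correction_capacity(p):
    """Capacity (bits per digit) the correction channel needs when each
    received binary digit is wrong with probability p: -[p log p + q log q]."""
    q = 1 - p
    return -sum(r * math.log2(r) for r in (p, q) if r > 0)  # 0 log 0 = 0

# e.g. 1% errors: only about 0.08 extra bits per digit must be supplied
print(correction_capacity(0.01))   # ~0.0808
```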
The rate of transmission R can be written in two other forms due to the identities noted above. We have

$$R = H(x) - H_y(x) = H(y) - H_x(y) = H(x) + H(y) - H(x, y).$$

The first defining expression has already been interpreted as the amount
of information sent less the uncertainty of what was sent. The second measures
the amount received less the part of this which is due to noise. The third is
the sum of the two amounts less the joint entropy and therefore in a sense is
the number of bits per second common to the two. Thus all three expressions
have a certain intuitive significance. The capacity C of a noisy channel should
be the maximum possible rate of transmission, i.e., the rate when the source
is properly matched to the channel. We therefore define the channel capacity
by

$$C = \operatorname{Max}\bigl(H(x) - H_y(x)\bigr)$$
where the maximum is with respect to all possible information sources used as
input to the channel. If the channel is noiseless, H_y(x) = 0. The definition
is then equivalent to that already given for a noiseless channel since the maximum
entropy for the channel is its capacity.
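To make the definition concrete, here is a small sketch (ours, not from the paper; the function names and the binary symmetric example are our own) that evaluates H(x), H_y(x) and their difference for a two-symbol noisy channel and maximizes over input distributions:

```python
import math

def h(probs):
    """Entropy in bits of a probability vector."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def rate(P, trans):
    """R = H(x) - H_y(x) for input distribution P and transition matrix
    trans[i][j] = Pr(receive j | send i)."""
    joint = [[P[i] * trans[i][j] for j in range(len(trans[0]))] for i in range(len(P))]
    Py = [sum(row[j] for row in joint) for j in range(len(joint[0]))]
    Hx = h(P)
    Hxy = h([p for row in joint for p in row])
    Hy_x = Hxy - h(Py)            # H_y(x) = H(x, y) - H(y)
    return Hx - Hy_x

# Binary symmetric channel with error probability 0.1
bsc = [[0.9, 0.1], [0.1, 0.9]]
C = max(rate([p, 1 - p], bsc) for p in [i / 1000 for i in range(1, 1000)])
print(C)   # ~0.531 bits per symbol, i.e. 1 - H(0.1)
```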
13. THE FUNDAMENTAL THEOREM FOR A DISCRETE CHANNEL WITH NOISE

If an attempt is made to transmit at a higher rate than C, say C + R1, then there will necessarily be an equivocation equal to or greater than the excess R1.
Nature takes payment by requiring just that much uncertainty, so that we
are not actually getting any more than C through correctly. The situation is
indicated in Fig. 9. The rate of information into the channel is plotted horizontally
and the equivocation vertically. Any point above the heavy line in the shaded
region can be attained and those below cannot. The points on the line cannot
in general be attained, but there will usually be two points on the line that
can. These results are the main justification for the definition of C and will
now be proved. Theorem 11: Let a discrete channel have the capacity C and a discrete source the entropy per second H. If H ≤ C there exists a coding
system such that the output of the source can be transmitted over
the channel with an arbitrarily small frequency of errors (or an arbitrarily
small equivocation). If H > C it is possible to encode the source
so that the equivocation is less than H − C + ε where ε is arbitrarily
small. There is no method of encoding which gives an equivocation less than
H - C. The method of proving the first part of this theorem is not by exhibiting
a coding method having the desired properties, but by showing that such a code
must exist in a certain group of codes. In fact we will average the frequency of errors over this group and show that this average can be made less than ε. If the average of a set of numbers is less than ε there must exist at least one in the set which is less than ε. This
will establish the desired result. The capacity C of a noisy channel has been
defined as

$$C = \operatorname{Max}\bigl(H(x) - H_y(x)\bigr)$$
where x is the input and y the output. The maximization is over all sources
which might be used as input to the channel. Let S_0 be a source which achieves the maximum capacity C. If this maximum is not actually achieved by any source let S_0 be a source which approximates to giving the maximum rate. Suppose S_0
is used as input to the channel. We consider the possible transmitted and received
sequences of a long duration T. The following will be true: 1. The transmitted
sequences fall into two classes, a high probability group with about 2^{TH(x)}
members and the remaining sequences of small total probability. 2. Similarly
the received sequences have a high probability set of about 2^{TH(y)} members
and a low probability set of remaining sequences. 3. Each high probability
output could be produced by about 2^{TH_y(x)} inputs. The probability of all other cases has a small total probability. All the ε's and δ's implied by the words "small" and "about" in these statements approach zero as we allow T to increase and S_0 to approach the maximizing source. The situation is summarized
in Fig. 10 where the input sequences are points on the left and output sequences
points on the right. The fan of cross lines represents the range of possible
causes for a typical output.
Fig. 10-Schematic representation of the relations between inputs and outputs
in a channel. Now suppose we have another source producing information at rate
R with R < C. In the period T this source will have 2^{TR}
high probability messages. We wish to associate these with a selection of
the possible channel inputs in such a way as to get a small frequency of errors.
We will set up this association in all possible ways (using, however, only the high probability group of inputs as determined by the source S_0) and
average the frequency of errors for this large class of possible coding systems.
This is the same as calculating the frequency of errors for a random association
of the messages and channel inputs of duration T. Suppose a particular output y_1 is observed. What is the probability of more than one message in the set of possible causes of y_1? There are 2^{TR} messages distributed at random in 2^{TH(x)} points. The probability of a particular point being a message is thus

$$2^{T(R - H(x))}.$$

The probability that none of the points in the fan is a message (apart from the actual originating message) is

$$P = \bigl[1 - 2^{T(R - H(x))}\bigr]^{2^{TH_y(x)}}.$$

Now R < H(x) − H_y(x), so R − H(x) = −H_y(x) − η with η positive. Consequently

$$P = \bigl[1 - 2^{-TH_y(x) - T\eta}\bigr]^{2^{TH_y(x)}}$$

approaches (as T → ∞)

$$1 - 2^{-T\eta}.$$

Hence the probability of an error approaches zero and the first part of the theorem is proved. The second part of the theorem is easily shown by noting
that we could merely send C bits per second from the source, completely neglecting
the remainder of the information generated. At the receiver the neglected part
gives an equivocation H(x) - C and the part transmitted need only add
ε. This limit can also be attained in many other ways, as will be shown when
we consider the continuous case. The last statement of the theorem is a simple
consequence of our definition of C. Suppose we can encode a source with H(x) = C + a in such a way as to obtain an equivocation H_y(x) = a − ε with ε positive. Then R = H(x) = C + a and

$$H(x) - H_y(x) = C + \epsilon$$

with ε positive. This contradicts the definition of C as the maximum of H(x) − H_y(x). Actually more has been proved than was stated in
the theorem. If the average of a set of numbers is within ε of their maximum, a fraction of at most √ε can be more than √ε below the maximum. Since ε is arbitrarily small we can say that almost all the systems are arbitrarily close to the ideal.

14. DISCUSSION

The demonstration of Theorem 11, while
not a pure existence proof, has some of the deficiencies of such proofs. An
attempt to obtain a good approximation to ideal coding by following the method
of the proof is generally impractical. In fact, apart from some rather trivial
cases and certain limiting situations, no explicit description of a series of
approximations to the ideal has been found. Probably this is no accident but
is related to the difficulty of giving an explicit construction for a good approximation
to a random sequence. An approximation to the ideal would have the property
that if the signal is altered in a reasonable way by the noise, the original
can still be recovered. In other words the alteration will not in general bring
it closer to another reasonable signal than the original. This is accomplished
at the cost of a certain amount of redundancy in the coding. The redundancy
must be introduced in the proper way to combat the particular noise structure
involved. However, any redundancy in the source will usually help if it is utilized
at the receiving point. In particular, if the source already has a certain redundancy
and no attempt is made to eliminate it in matching to the channel, this redundancy
will help combat noise. For example, in a noiseless telegraph channel one could
save about 50% in time by proper encoding of the messages. This is not done
and most of the redundancy of English remains in the channel symbols. This has
the advantage, however, of allowing considerable noise in the channel. A
sizable fraction of the letters can be received incorrectly and still reconstructed
by the context. In fact this is probably not a bad approximation to the ideal
in many cases, since the statistical structure of English is rather involved
and the reasonable English sequences are not too far (in the sense required
for the theorem) from a random selection. As in the noiseless case a
delay is generally required to approach the ideal encoding. It now has the additional
function of allowing a large sample of noise to affect the signal before any
judgment is made at the receiving point as to the original message. Increasing
the sample size always sharpens the possible statistical assertions. The content
of Theorem 11 and its proof can be formulated in a somewhat different way which
exhibits the connection with the noiseless case more clearly. Consider the possible
signals of duration T and suppose a subset of them is selected to be used. Let
those in the subset all be used with equal probability, and suppose the receiver
is constructed to select, as the original signal, the most probable cause from
the subset, when a perturbed signal is received. We define N(T, q) to
be the maximum number of signals we can choose for the subset such that the
probability of an incorrect interpretation is less than or equal to q.
Theorem 12:

$$\lim_{T \to \infty} \frac{\log N(T, q)}{T} = C,$$

where C is the channel capacity, provided that q does not equal 0 or 1. In other words, no matter how we set our limits of reliability, we can distinguish reliably in time T enough messages to correspond to about
CT bits, when T is sufficiently large. Theorem 12 can be compared with the definition
of the capacity of a noiseless channel given in Section 1. 15. EXAMPLE OF A
DISCRETE CHANNEL AND ITS CAPACITY A simple example of a discrete channel is
indicated in Fig. 11. There are three possible symbols. The first is never affected
by noise. The second and third each have probability p of coming through undisturbed,
and q of being changed into the other of the pair. We have (letting α = −[p log p + q log q] and P and Q be the probabilities of using the first and second symbols)

$$H(x) = -P \log P - 2Q \log Q$$
$$H_y(x) = 2Q\alpha.$$

We wish to choose P and Q in such a way as to maximize H(x) − H_y(x), subject to the constraint P + 2Q = 1. Hence we consider

$$U = -[P \log P + 2Q \log Q] - 2Q\alpha + \lambda(P + 2Q)$$
$$\frac{\partial U}{\partial P} = -1 - \log P + \lambda = 0$$
$$\frac{\partial U}{\partial Q} = -2 - 2\log Q - 2\alpha + 2\lambda = 0.$$

Eliminating λ

$$\log P = \log Q + \alpha$$
$$P = Q e^{\alpha} = Q\beta \qquad \text{where } \beta = e^{\alpha},$$

so that

$$P = \frac{\beta}{\beta + 2}, \qquad Q = \frac{1}{\beta + 2}.$$

The channel capacity is then

$$C = \log \frac{\beta + 2}{\beta}.$$
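As a check on this formula (our own sketch, not part of the original; logs are taken to base 2 so that β = 2^α), the closed form can be compared with a direct numerical maximization of H(y) − H_x(y):

```python
import math

def h2(p):
    """Binary entropy in bits."""
    return -sum(r * math.log2(r) for r in (p, 1 - p) if r > 0)

def closed_form(p):
    """C = log((beta + 2)/beta) with beta = 2**alpha."""
    beta = 2 ** h2(p)
    return math.log2((beta + 2) / beta)

def brute_force(p, steps=2000):
    """Directly maximize H(y) - H_x(y) subject to P + 2Q = 1."""
    best = 0.0
    for i in range(1, steps):
        P = i / steps
        Q = (1 - P) / 2
        Hy = -sum(t * math.log2(t) for t in (P, Q, Q) if t > 0)  # output entropy
        best = max(best, Hy - 2 * Q * h2(p))                     # minus noise entropy
    return best

print(closed_form(0.8), brute_force(0.8))   # both ~1.146 bits per symbol
```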
Note how this checks the obvious values in the cases p = 1 and p = 1/2. In the first, β = 1 and C = log 3, which is correct since the channel is then noiseless with three possible symbols. If p = 1/2, β = 2 and C = log 2. Here the second and third symbols cannot be distinguished at all and act together like one symbol. The first symbol is used with probability P = 1/2 and the second and third together with probability 1/2. This may be distributed between them in any desired way and still achieve the maximum capacity. For intermediate values of p the channel capacity will lie between log 2 and log 3. The distinction
between the second and third symbols conveys some information but not as much
as in the noiseless case. The first symbol is used somewhat more frequently
than the other two because of its freedom from noise. 16. THE CHANNEL CAPACITY
IN CERTAIN SPECIAL CASES If the noise affects successive channel symbols independently
it can be described by a set of transition probabilities p_ij. This is the probability, if symbol i is sent, that j will be received. The maximum channel rate is then given by the maximum of

$$-\sum_{i,j} P_i p_{ij} \log \sum_i P_i p_{ij} + \sum_{i,j} P_i p_{ij} \log p_{ij}$$

where we vary the P_i subject to ΣP_i = 1. This leads by the method of Lagrange to the equations,

$$\sum_j p_{sj} \log \frac{p_{sj}}{\sum_i P_i p_{ij}} = \mu, \qquad s = 1, 2, \ldots.$$
Multiplying by P_s and summing on s shows that μ = C. Let the inverse of p_{sj} (if it exists) be h_{st} so that Σ_s h_{st} p_{sj} = δ_{tj}. Then:

$$\sum_{s,j} h_{st} p_{sj} \log p_{sj} - \log \sum_i P_i p_{it} = C \sum_s h_{st}.$$

Hence:

$$\sum_i P_i p_{it} = \exp\Bigl[-C \sum_s h_{st} + \sum_{s,j} h_{st} p_{sj} \log p_{sj}\Bigr]$$

or,

$$P_i = \sum_t h_{it} \exp\Bigl[-C \sum_s h_{st} + \sum_{s,j} h_{st} p_{sj} \log p_{sj}\Bigr].$$
This is the system of equations for determining the maximizing values of Pi,
with C to be determined so that ΣP_i = 1. When this is
done C will be the channel capacity, and the Pi the proper probabilities for
the channel symbols to achieve this capacity. If each input symbol has the same
set of probabilities on the lines emerging from it, and the same is true of
each output symbol, the capacity can be easily calculated. Examples are shown
in Fig. 12. In such a case H_x(y) is independent of the distribution of probabilities on the input symbols, and is given by −Σ p_i log p_i where the p_i are the values of the transition probabilities from any input symbol. The channel capacity is

$$\operatorname{Max}\bigl[H(y) - H_x(y)\bigr] = \operatorname{Max} H(y) + \sum p_i \log p_i.$$

The maximum of H(y) is clearly log m where m is the number of output symbols, since it is possible to make them all equally probable by making the input symbols equally probable. The channel capacity is therefore

$$C = \log m + \sum p_i \log p_i.$$
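A small sketch of ours (the function name and the numerical example are hypothetical; the formula applies when every input and every output symbol sees the same set of transition probabilities):

```python
import math

def symmetric_capacity(row):
    """C = log m + sum(p_i log p_i) for a uniformly symmetric channel whose
    transition probabilities out of each symbol are given by `row`."""
    m = len(row)
    return math.log2(m) + sum(p * math.log2(p) for p in row if p > 0)

# Four symbols; each is received correctly with probability 1/2 and as one of
# two specific neighbours with probability 1/4 each (a hypothetical example).
print(symmetric_capacity([0.5, 0.25, 0.25, 0.0]))   # 0.5 bits per symbol
```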
Fig. 12-Examples of discrete channels with the same transition probabilities
for each input and for each output. In Fig. 12a it would be
In Fig. 12c we have
Suppose the symbols fall into several groups such that the noise never causes
a symbol in one group to be mistaken for a symbol in another group. Let the
capacity for the nth group be C_n (in bits per second) when we use only the symbols in this group. Then it is easily shown that, for best use of the entire set, the total probability P_n of all symbols in the nth group should be

$$P_n = \frac{2^{C_n}}{\sum 2^{C_n}}.$$

Within a group the probability is distributed just as it would be if these were the only symbols being used. The channel capacity is

$$C = \log \sum 2^{C_n}.$$
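A brief sketch (ours) of how these two formulas combine numerically:

```python
import math

def combine_groups(capacities):
    """Overall capacity and per-group probabilities when the noise never
    confuses symbols of different groups.  capacities: the C_n in bits."""
    total = sum(2 ** c for c in capacities)
    probs = [2 ** c / total for c in capacities]
    return math.log2(total), probs

C, P = combine_groups([1.0, 2.0])     # e.g. a 1-bit group and a 2-bit group
print(C, P)                           # ~2.585 bits, probabilities [1/3, 2/3]
```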
17. AN EXAMPLE OF EFFICIENT CODING The following example, although somewhat
unrealistic, is a case in which exact matching to a noisy channel is possible.
There are two channel symbols, 0 and 1, and the noise affects them in blocks
of seven symbols. A block of seven is either transmitted without error, or exactly
one symbol of the seven is incorrect. These eight possibilities are equally
likely. We have

$$C = \tfrac{1}{7}\bigl[7 - \log 8\bigr] = \tfrac{4}{7} \text{ bits per symbol.}$$
An efficient code, allowing complete correction of errors and transmitting at
the rate C, is the following (found by a method due to R. Hamming): Let a block of seven symbols be X_1, X_2, ..., X_7. Of these X_3, X_5, X_6 and X_7 are message symbols and chosen arbitrarily by the source. The other three are redundant and calculated as follows:

X_4 is chosen to make α = X_4 + X_5 + X_6 + X_7 even
X_2 is chosen to make β = X_2 + X_3 + X_6 + X_7 even
X_1 is chosen to make γ = X_1 + X_3 + X_5 + X_7 even

When a block of seven is received, α, β and γ are calculated and if even called zero, if odd called one. The binary number α β γ then gives the subscript of the X_i that is incorrect (if 0 there was no error).
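The parity checks above are easy to simulate. Here is a short sketch (ours, not from the paper) that encodes four message bits into the block X_1 ... X_7 and corrects any single error exactly as described:

```python
def encode(m3, m5, m6, m7):
    """Place message bits in X3, X5, X6, X7 and fill the check bits
    X4, X2, X1 so that alpha, beta, gamma are even (all bits 0/1)."""
    x = [0] * 8                       # x[1..7]; x[0] unused
    x[3], x[5], x[6], x[7] = m3, m5, m6, m7
    x[4] = (x[5] + x[6] + x[7]) % 2   # alpha even
    x[2] = (x[3] + x[6] + x[7]) % 2   # beta even
    x[1] = (x[3] + x[5] + x[7]) % 2   # gamma even
    return x[1:]

def correct(block):
    """Return the block with the single error (if any) corrected."""
    x = [0] + list(block)
    alpha = (x[4] + x[5] + x[6] + x[7]) % 2
    beta  = (x[2] + x[3] + x[6] + x[7]) % 2
    gamma = (x[1] + x[3] + x[5] + x[7]) % 2
    i = 4 * alpha + 2 * beta + gamma  # the binary number alpha beta gamma
    if i:                             # i = 0 means no error
        x[i] ^= 1
    return x[1:]

sent = encode(1, 0, 1, 1)
garbled = list(sent)
garbled[4] ^= 1                       # flip X5 (index 4 in the 0-based list)
assert correct(garbled) == sent
```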
the Xi that is incorrect (if 0 there was no error). APPENDIX 1 THE GROWTH OF
THE NUMBER OF BLOCKS OF SYMBOLS WITH A FINITE STATE CONDITION Let Ni (L) be
the number of blocks of symbols of length L ending in state i. Then we
have
ij i. . .
, b ~ are the length of the symbols which maybe chosen
instate i and lead to state j. These are linear difference equations and the
behavior as L must be of the type
Substituting in the difference equation
or
For this to be possible the determinant
must vanish and this determines W, which is, of course, the largest real root
of D = 0. The quantity C is then given by
and we also note that the same growth properties result if we require that all
blocks start in the same (arbitrarily chosen) state. APPENDIX 2
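For the special case of a single state (no constraints) the determinant condition reduces to Σ_s W^(-t_s) = 1, and the root can be found numerically. A brief sketch of ours (function name and example durations are hypothetical):

```python
import math

def capacity(durations, lo=1.0 + 1e-9, hi=16.0):
    """Largest real root W of sum(W**-t) = 1 by bisection, for a single-state
    graph whose symbols have the given durations; returns C = log2(W)."""
    f = lambda w: sum(w ** -t for t in durations) - 1
    for _ in range(200):
        mid = (lo + hi) / 2
        lo, hi = (mid, hi) if f(mid) > 0 else (lo, mid)
    return math.log2(lo)

# Two symbols of durations 1 and 2: W is the golden ratio, C ~ 0.694 bits per unit time
print(capacity([1, 2]))
```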
APPENDIX 2

Let H(1/n, 1/n, ..., 1/n) = A(n). From condition (3) we can decompose a choice from s^m equally likely possibilities into a series of m choices from s equally likely possibilities and obtain

$$A(s^m) = m\,A(s).$$

Similarly

$$A(t^n) = n\,A(t).$$

We can choose n arbitrarily large and find an m to satisfy

$$s^m \leq t^n < s^{m+1}.$$

Thus, taking logarithms and dividing by n log s,

$$\frac{m}{n} \leq \frac{\log t}{\log s} \leq \frac{m}{n} + \frac{1}{n} \qquad \text{or} \qquad \Bigl|\frac{m}{n} - \frac{\log t}{\log s}\Bigr| < \epsilon$$
where ε is arbitrarily small. Now from the monotonic property of A(n),

$$A(s^m) \leq A(t^n) \leq A(s^{m+1})$$
$$m\,A(s) \leq n\,A(t) \leq (m+1)\,A(s).$$

Hence, dividing by nA(s),

$$\frac{m}{n} \leq \frac{A(t)}{A(s)} \leq \frac{m}{n} + \frac{1}{n} \qquad \text{or} \qquad \Bigl|\frac{m}{n} - \frac{A(t)}{A(s)}\Bigr| < \epsilon$$
$$\Bigl|\frac{A(t)}{A(s)} - \frac{\log t}{\log s}\Bigr| < 2\epsilon, \qquad A(t) = K \log t,$$

where K must be positive to satisfy (2). Now suppose we have a choice from n possibilities with commeasurable probabilities p_i = n_i / Σn_i where
the n_i are integers. We can break down a choice from Σn_i possibilities into a choice from n possibilities with probabilities p_1, ..., p_n and then, if the ith was chosen, a choice from n_i with equal probabilities. Using condition (3) again, we equate the total choice from Σn_i as computed by two methods

$$K \log \sum n_i = H(p_1, \ldots, p_n) + K \sum p_i \log n_i.$$

Hence

$$H = K\Bigl[\sum p_i \log \sum n_i - \sum p_i \log n_i\Bigr] = -K \sum p_i \log \frac{n_i}{\sum n_i} = -K \sum p_i \log p_i.$$
If the p_i are incommeasurable, they may be approximated by rationals and the
same expression must hold by our continuity assumption. Thus the expression
holds in general. The choice of coefficient K is a matter of convenience and
amounts to the choice of a unit of measure. APPENDIX 3 THEOREMS ON ERGODIC SOURCES
If it is possible to go from any state with P > 0 to any other along a path
of probability P > 0, the system is ergodic and the strong law of large
numbers can be applied. Thus the number of times a given path p_ij in the network is traversed in a long sequence of length N is about proportional to the probability of being at i, say P_i, and then choosing this path, P_i p_ij N. If N is large enough the probability of percentage error ±δ in this is less than ε so that for all but a set of small probability the actual numbers lie within the limits

$$(P_i p_{ij} \pm \delta)N.$$
or This proves Theorem 3. Theorem 4 follows immediately from this on calculating
upper and lower bounds for n(q) based on the possible range of values of P in Theorem 3. In the mixed (not ergodic) case if
To prove Theorems 5 and 6, first note that F_N is monotonic decreasing because increasing N adds a subscript to a conditional entropy. A simple substitution for p_{B_i}(S_j) in the definition of F_N shows that

$$F_N = N G_N - (N - 1) G_{N-1}$$

and summing this for all N gives

$$G_N = \frac{1}{N} \sum_{n=1}^{N} F_n.$$

Hence G_N ≥ F_N and G_N monotonic decreasing. Also they
must approach the same limit. By using Theorem 3 we see that lim_{N→∞} G_N = H.

APPENDIX 4

MAXIMIZING
THE RATE FOR A SYSTEM OF CONSTRAINTS Suppose we have a set of constraints on
sequences of symbols that is of the finite state type and can be represented
therefore by a linear graph. Let ℓ_ij^(s) be the lengths of the various symbols
that can occur in passing from state i to state j. What distribution
of probabilities P_i for the different states and p_ij^(s) for choosing symbol s
in state i and going to state j maximizes the rate of generating information
under these constraints? The constraints define a discrete channel and the maximum
rate must be less than or equal to the capacity C of this channel, since if
all blocks of large length were equally likely, this rate would result, and
if possible this would be best. We will show that this rate can be achieved
by proper choice of the P_i and p_ij^(s). The rate in question is

$$R = \frac{-\sum_{i,j,s} P_i\, p_{ij}^{(s)} \log p_{ij}^{(s)}}{\sum_{i,j,s} P_i\, p_{ij}^{(s)} \ell_{ij}^{(s)}} = \frac{N}{M}.$$

Solving for p_ij^(s),
Since The
correct value of D is the capacity C and the Bj are solutions of
for then
or
So that if Ai satisfy
Both the sets of equations for B_i and γ_i can be satisfied since C is
such that
In this case the rate is
but
Hence the rate is C and as this could never be exceeded this is the maximum,
justifying the assumed solution. PART III: MATHEMATICAL PRELIMINARIES In this
final installment of the paper we consider the case where the signals or the
messages or both are continuously variable, in contrast with the discrete nature
assumed heretofore. To a considerable extent the continuous case can be obtained
through a limiting process from the discrete case by dividing the continuum
of messages and signals into a large but finite number of small regions and
calculating the various parameters involved on a discrete basis. As the size
of the regions is decreased these parameters in general approach as limits the
proper values for the continuous case. There are, however, a few new effects
that appear and also a general change of emphasis in the direction of specialization
of the general results to particular cases. We will not attempt, in the continuous
case, to obtain our results with the greatest generality, or with the extreme
rigor of pure mathematics, since this would involve a great deal of abstract
measure theory and would obscure the main thread of the analysis. A preliminary
study, however, indicates that the theory can be formulated in a completely
axiomatic and rigorous manner which includes both the continuous and discrete
cases and many others. The occasional liberties taken with limiting processes
in the present analysis can be justified in all cases of practical interest.
18. SETS AND ENSEMBLES OF FUNCTIONS We shall have to deal in the continuous
case with sets of functions and ensembles of functions. A set of functions,
as the name implies, is merely a class or collection of functions, generally
of one variable, time. It can be specified by giving an explicit representation
of the various functions in the set, or implicitly by giving a property which
functions in the set possess and others do not. Some examples are: 1. The set
of functions: f_θ(t) = sin(t + θ). Each particular value of θ determines a particular function in the set. 2. The set of all functions of time containing
no frequencies over W cycles per second. 3. The set of all functions limited
in band to W and in amplitude to A. 4. The set of all English speech signals
as functions of time. An ensemble of functions is a set of functions
together with a probability measure whereby we may determine the probability
of a function in the set having certain properties.¹ For example with the set

$$f_\theta(t) = \sin(t + \theta),$$

we may give a probability distribution for θ, P(θ). The set then becomes an
ensemble. Some further examples of ensembles of functions are: 1. A finite set
of functions f_k(t) (k = 1, 2, ..., n) with the probability of f_k
being p_k. 2. A finite dimensional family of functions f(α_1, α_2, ..., α_n; t) with a probability distribution on the parameters α_i: P(α_1, ..., α_n). For example we could consider the ensemble defined by

$$f(a_1, \ldots, a_n, \theta_1, \ldots, \theta_n; t) = \sum_{i=1}^{n} a_i \sin i(\omega t + \theta_i)$$
with the amplitudes ai distributed normally and independently, and the phases
θ_i distributed uniformly (from 0 to 2π) and independently. ¹In mathematical
terminology the functions belong to a measure space whose total measure is unity.
3. The ensemble

$$f(a_i; t) = \sum_{n=-\infty}^{+\infty} a_n \frac{\sin \pi(2Wt - n)}{\pi(2Wt - n)}$$

with the a_i normal and independent all with the same standard deviation √N. This is a representation of "white" noise, band limited to the band
from 0 to W cycles per second and with average power N.2 4. Let points
be distributed on the t axis according to a Poisson distribution. At
each selected point the function f (t) is placed and the different functions
added, giving the ensemble

$$\sum_{k=-\infty}^{+\infty} f(t + t_k)$$
where the tk are the points of the Poisson distribution. This ensemble
can be considered as a type of impulse or shot noise where all the impulses
are identical. 5. The set of English speech functions with the probability measure
given by the frequency of occurrence in ordinary use. An ensemble of functions f_α(t) is stationary if the same ensemble results when all functions are
shifted any fixed amount in time. The ensemble

$$f_\theta(t) = \sin(t + \theta)$$

is stationary if θ is distributed uniformly from 0 to 2π. If we shift each function by t_1 we obtain

$$f_\theta(t + t_1) = \sin(t + t_1 + \theta) = \sin(t + \varphi)$$

with φ distributed uniformly from 0 to 2π. Each function has changed
but the ensemble as a whole is invariant under the translation. The other examples
given above are also stationary. An ensemble is ergodic if it is stationary, and there is no subset of the functions in the set with a probability different from 0 and 1 which is stationary. The ensemble

$$\sin(t + \theta)$$

is ergodic. No subset of these functions of probability ≠ 0, 1 is transformed
into itself under all time translations. On the other hand the ensemble

$$a \sin(t + \theta)$$

with a distributed normally and θ uniform is stationary but not ergodic.
The subset of these functions with a between 0 and 1 for example is stationary.
Of the examples given, 3 and 4 are ergodic, and 5 may perhaps be considered
so. If an ensemble is ergodic we may say roughly that each function in the set
is typical of the ensemble. More precisely it is known that with an ergodic
ensemble an average of any statistic over the ensemble is equal (with probability
1) to an average over the time translations of a particular function of the
set 3 Roughly speaking, each function can be expected, as time progresses, to
go through, with the proper frequency, all the convolutions of any of the functions
in the set. ²This representation can be used as a definition of band limited
white noise. It has certain advantages in that it involves fewer limiting operations
than do definitions that have been used in the past. The name "white noise,"
already firmly entrenched in the literature, is perhaps somewhat unfortunate.
In optics white light means either any continuous spectrum as contrasted with
a point spectrum, or a spectrum which is flat with wavelength (which
is not the same as a spectrum flat with frequency). ³This is the famous ergodic
theorem or rather one aspect of this theorem which was proved in somewhat different
formulations by Birkhoff, von Neumann, and Koopman, and subsequently generalized
by Wiener, Hopf, Hurewicz and others. The literature on ergodic theory is quite
extensive and the reader is referred to the papers of these writers for precise
and general formulations; e.g., E. Hopf, "Ergodentheorie," Ergebnisse der
Mathematik und ihrer Grenzgebiete, v. 5; "On Causality Statistics and Probability,"
Journal of Mathematics and Physics, v. XIII, No. 1, 1934; N. Wiener,
"The Ergodic Theorem," Duke Mathematical Journal, v. 5, 1939. 33
Probability measure is defined for the set ga (t) by means of that for the set
f_α(t). The probability of a certain subset of the g_α(t) functions is equal to that of the subset of the f_α(t) functions which produce members of the given
subset of g functions under the operation T. Physically this corresponds to
passing the ensemble through some device, for example, a filter, a rectifier
or a modulator. The output functions of the device form the ensemble g_α(t).
A device or operator T will be called invariant if shifting the input merely
shifts the output, i.e., if

$$g_\alpha(t) = T f_\alpha(t)$$

implies

$$g_\alpha(t + t_1) = T f_\alpha(t + t_1)$$

for all f_α(t) and all t_1. It is easily shown (see Appendix 5) that if T is invariant
and the input ensemble is stationary then the output ensemble is stationary.
Likewise if the input is ergodic the output will also be ergodic. A filter or
a rectifier is invariant under all time translations. The operation of modulation
is not since the carrier phase gives a certain time structure. However, modulation
is invariant under all translations which are multiples of the period of the
carrier. Wiener has pointed out the intimate relation between the invariance
of physical devices under time translations and Fourier theory.⁴ He has shown,
in fact, that if a device is linear as well as invariant Fourier analysis is
then the appropriate mathematical tool for dealing with the problem. An ensemble
of functions is the appropriate mathematical representation of the messages
produced by a continuous source (for example, speech), of the signals produced
by a transmitter, and of the perturbing noise. Communication theory is properly
concerned, as has been emphasized by Wiener, not with operations on particular
functions, but with operations on ensembles of functions. A communication system
is designed not for a particular speech function and still less for a sine wave,
but for the ensemble of speech functions. 19. BAND LIMITED ENSEMBLES OF FUNCTIONS
If a function of time f (t) is limited to the band from 0 to W cycles
per second it is completely determined by giving its ordinates at a series of
discrete points spaced 1/(2W) seconds apart in the manner indicated by the following result.⁵ Theorem 13: Let f(t) contain no frequencies over W. Then

$$f(t) = \sum_{n=-\infty}^{\infty} X_n \frac{\sin \pi(2Wt - n)}{\pi(2Wt - n)}$$

where

$$X_n = f\Bigl(\frac{n}{2W}\Bigr).$$
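A small numerical sketch of the reconstruction formula (ours, not part of the paper; the test signal and sample count are arbitrary choices): sample a band-limited signal at the rate 2W and rebuild it from the samples X_n.

```python
import math

def sinc(x):
    return 1.0 if x == 0 else math.sin(math.pi * x) / (math.pi * x)

def reconstruct(samples, W, t):
    """f(t) = sum_n X_n sinc(2W t - n), with X_n = f(n / 2W)."""
    return sum(x * sinc(2 * W * t - n) for n, x in enumerate(samples))

W = 4.0                                       # band limit in cycles per second
f = lambda t: math.cos(2 * math.pi * 3 * t)   # a 3 Hz tone, within the band
samples = [f(n / (2 * W)) for n in range(200)]
t = 100.3 / (2 * W)                           # a point between sample instants
print(f(t), reconstruct(samples, W, t))       # nearly equal (finite-sum truncation only)
```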
⁴Communication theory is heavily indebted to Wiener for much of its basic philosophy
and theory. His classic NDRC report, The Interpolation, Extrapolation and
Smoothing of Stationary Time Series (Wiley, 1949), contains the first clear-cut
formulation of communication theory as a statistical problem, the study of operations
on time series. This work, although chiefly concerned with the linear prediction
and filtering problem, is an important collateral reference in connection with
the present paper. We may also refer here to Wiener's Cybernetics (Wiley,
1948), dealing with the general problems of communication and control. ⁵For
a proof of this theorem and further discussion see the author's paper "Communication
in the Presence of Noise" published in the Proceedings of the Institute of
Radio Engineers, v. 37, No. 1, Jan., 1949, pp. 10-21. A function can be
considered to be substantially limited to a time T if all the ordinates X_n
outside this interval of time are zero. In this case all but 2TW of the coordinates
will be zero. Thus functions limited to a band W and duration T correspond to
points in a space of 2TW dimensions. A subset of the functions of band W and
duration T corresponds to a region in this space. For example, the functions
whose total energy is less than or equal to E correspond to points in a 2TW
dimensional sphere with radius r = √(2WE). An ensemble of functions of
limited duration and band will be represented by a probability distribution
p(x_1, ..., x_n) in the corresponding n dimensional space. If
the ensemble is not limited in time we can consider the 2TW coordinates in a
given interval T to represent substantially the part of the function in the
interval T and the probability distribution p(x_1, ..., x_n) to give the statistical
structure of the ensemble for intervals of that duration. 20. ENTROPY OF A CONTINUOUS
DISTRIBUTION

The entropy of a discrete set of probabilities p_1, ..., p_n has been defined as:

$$H = -\sum p_i \log p_i.$$

In an analogous manner we define the entropy of a continuous distribution with the density distribution function p(x) by:

$$H = -\int_{-\infty}^{\infty} p(x) \log p(x)\, dx.$$

With an n dimensional distribution p(x_1, ..., x_n) we have

$$H = -\int \cdots \int p(x_1, \ldots, x_n) \log p(x_1, \ldots, x_n)\, dx_1 \cdots dx_n.$$

If we have two arguments x and y (which may themselves be multidimensional) the joint and conditional entropies of p(x, y) are given by

$$H(x, y) = -\iint p(x, y) \log p(x, y)\, dx\, dy$$

and

$$H_x(y) = -\iint p(x, y) \log \frac{p(x, y)}{p(x)}\, dx\, dy$$
$$H_y(x) = -\iint p(x, y) \log \frac{p(x, y)}{p(y)}\, dx\, dy$$

where

$$p(x) = \int p(x, y)\, dy, \qquad p(y) = \int p(x, y)\, dx.$$
The entropies of continuous distributions have most (but not all) of the properties
of the discrete case. In particular we have the following: 1. If x is limited
to a certain volume v in its space, then H(x) is a maximum and equal to logv
when p(x) is constant (1/v) in the volume.
2. With any two variables x, y we have

$$H(x, y) \leq H(x) + H(y)$$

with equality if (and only if) x and y are independent, i.e., p(x, y) = p(x) p(y). 3. Consider a generalized averaging operation of the following type:

$$p(y) = \int a(x, y)\, p(x)\, dx \qquad \text{with} \qquad \int a(x, y)\, dx = \int a(x, y)\, dy = 1, \quad a(x, y) \geq 0.$$
Then the entropy of the averaged distribution p(y) is equal to or greater than
that of the original distribution p(x). 4. We have

$$H(x, y) = H(x) + H_x(y) = H(y) + H_y(x)$$

and H_x(y) ≤ H(y). 5. Let p(x) be a one-dimensional distribution. The
form of p(x) giving a maximum entropy subject to the condition that the standard
deviation of x be fixed at σ is Gaussian. To show this we must maximize

$$H(x) = -\int p(x) \log p(x)\, dx$$

with

$$\sigma^2 = \int p(x)\, x^2\, dx \qquad \text{and} \qquad 1 = \int p(x)\, dx$$

as constraints. The condition for this is

$$-1 - \log p(x) + \lambda x^2 + \mu = 0$$

and consequently (adjusting the constants to satisfy the constraints)

$$p(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-x^2 / 2\sigma^2}.$$
Similarly in n dimensions, suppose the second order moments of p(x_1, ..., x_n) are fixed at A_ij:

$$A_{ij} = \int \cdots \int x_i x_j\, p(x_1, \ldots, x_n)\, dx_1 \cdots dx_n.$$

Then the maximum entropy occurs (by a similar calculation) when p(x_1, ..., x_n) is the n dimensional Gaussian distribution with the second order moments A_ij. 6. The entropy of a one-dimensional Gaussian distribution whose standard deviation is σ is given by

$$H(x) = \log \sqrt{2\pi e}\, \sigma.$$

7. If x is limited to a half line (p(x) = 0 for x ≤ 0) and the first moment of x is fixed at a:

$$a = \int_0^{\infty} p(x)\, x\, dx,$$

then the maximum entropy occurs when

$$p(x) = \frac{1}{a}\, e^{-x/a}$$

and is equal to log ea. 8. There is one important difference between
the continuous and discrete entropies. In the discrete case the entropy measures
in an absolute way the randomness of the chance variable. In the continuous
case the measurement is relative to the coordinate system. If we change
coordinates the entropy will in general change. In fact if we change to coordinates
y_1, ..., y_n the new entropy is given by

$$H(y) = -\int \cdots \int p(x)\, J(y) \log\bigl[p(x)\, J(y)\bigr]\, dy_1 \cdots dy_n$$

where J(y) is the Jacobian of the coordinate transformation. On expanding the logarithm and changing the variables to x_1, ..., x_n, we obtain:

$$H(y) = H(x) - \int \cdots \int p(x) \log J(y)\, dx_1 \cdots dx_n.$$

Thus the new entropy is the old entropy less the expected logarithm of the
Jacobian. In the continuous case the entropy can be considered a measure of
randomness relative to an assumed standard, namely the coordinate system
chosen with each small volume element dx_1 ... dx_n given equal weight.
When we change the coordinate system the entropy in the new system measures
the randomness when equal volume elements dy_1 ... dy_n in the new
system are given equal weight. In spite of this dependence on the coordinate
system the entropy concept is as important in the continuous case as the discrete
case. This is due to the fact that the derived concepts of information rate
and channel capacity depend on the difference of two entropies and this
difference does not depend on the coordinate frame, each of the two terms
being changed by the same amount. The entropy of a continuous distribution can
be negative. The scale of measurements sets an arbitrary zero corresponding
to a uniform distribution over a unit volume. A distribution which is more confined
than this has less entropy and will be negative. The rates and capacities will,
however, always be nonnegative. 9. A particular case of changing coordinates
is the linear transformation

$$y_j = \sum_i a_{ij} x_i.$$

In this case the Jacobian is simply the determinant |a_{ij}|^{-1} and

$$H(y) = H(x) + \log |a_{ij}|.$$

In the case of a rotation of coordinates (or any measure preserving transformation) J = 1 and H(y) = H(x).

21. ENTROPY OF AN ENSEMBLE OF FUNCTIONS
Consider an ergodic ensemble of functions limited to a certain band of width
W cycles per second. Let

$$p(x_1, \ldots, x_n)$$

be the density distribution function for amplitudes x_1, ..., x_n at n successive sample points. We define the entropy of the ensemble per degree of freedom by

$$H' = -\lim_{n \to \infty} \frac{1}{n} \int \cdots \int p(x_1, \ldots, x_n) \log p(x_1, \ldots, x_n)\, dx_1 \cdots dx_n.$$

We may also define an entropy H per second by dividing, not by n, but by the time T in seconds for n samples. Since n = 2TW, H = 2WH'. With white thermal noise p is Gaussian and we have

$$H' = \log \sqrt{2\pi e N},$$
$$H = W \log 2\pi e N.$$
For a given average power N, white noise has the maximum possible entropy. This
follows from the maximizing properties of the Gaussian distribution noted above.
The entropy for a continuous stochastic process has many properties analogous
to that for discrete processes. In the discrete case the entropy was related
to the logarithm of the Probability of long sequences, and to the number
of reasonably probable sequences of long length. In the continuous case
it is related in a similar fashion to the logarithm of the Probability density
for a long series of samples, and the volume of reasonably high probability
in the function space. More precisely, if we assume p(xl,...,x„) continuous
in all the x_i for all n, then for sufficiently large n

$$\Bigl|\frac{\log 1/p}{n} - H'\Bigr| < \epsilon$$

for all choices of (x_1, ..., x_n) apart from a set whose total probability
is less than δ, with δ and ε arbitrarily small. This follows from the ergodic
property if we divide the space into a large number of small cells. The relation
of H to volume can be stated as follows: Under the same assumptions consider
the n dimensional space corresponding to p(x_1, ..., x_n). Let V_n(q) be the smallest volume in this space which includes in its interior a total probability q. Then

$$\lim_{n \to \infty} \frac{\log V_n(q)}{n} = H'$$
provided q does not equal 0 or 1. These results show that for large n
there is a rather well-defined volume (at least in the logarithmic sense)
of high probability, and that within this volume the probability density is
relatively uniform (again in the logarithmic sense). In the white noise case
the distribution function is given by

$$p(x_1, \ldots, x_n) = (2\pi N)^{-n/2} \exp\Bigl(-\sum x_i^2 / 2N\Bigr).$$

Since this depends only on Σx_i² the surfaces of equal probability density are spheres and the entire distribution has spherical symmetry. The region of high probability is a sphere of radius √(nN). As n → ∞ the probability of being outside a sphere of radius √(n(N + ε)) approaches zero and 1/n times the logarithm of the volume of the sphere approaches log √(2πeN).
In the continuous case it is convenient to work not with the entropy H
of an ensemble but with a derived quantity which we will call the entropy
power. This is defined as the power in a white noise limited to the same band
as the original ensemble and having the same entropy. In other words if H'
is the entropy of an ensemble its entropy power is

$$N_1 = \frac{1}{2\pi e}\, e^{2H'}.$$
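A small sketch (ours; function names are hypothetical) of the round trip between power and entropy power, with natural logarithms as in the formula above:

```python
import math

def entropy_per_dof(N):
    """H' of white noise of average power N (natural log units)."""
    return math.log(math.sqrt(2 * math.pi * math.e * N))

def entropy_power(H1):
    """Power of the white noise having the same entropy H' per degree of freedom."""
    return math.exp(2 * H1) / (2 * math.pi * math.e)

print(entropy_power(entropy_per_dof(3.0)))   # 3.0: white noise is its own entropy power
```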
22. ENTROPY LOSS IN LINEAR FILTERS Theorem 14: If an ensemble
having an entropy H_1 per degree of freedom in band W is passed through a filter with characteristic Y(f) the output ensemble has an entropy

$$H_2 = H_1 + \frac{1}{W} \int_W \log \bigl|Y(f)\bigr|^2\, df.$$
The operation of the filter is essentially a linear transformation of coordinates.
If we think of the different frequency components as the original coordinate
system, the new frequency components are merely the old ones multiplied by factors.
The coordinate transformation matrix is thus essentially diagonalized in terms
of these coordinates. The Jacobian of the transformation is (for n sine and n cosine components)

$$J = \prod_{i=1}^{n} \bigl|Y(f_i)\bigr|^2$$
where the f_i are equally spaced through the band W. This becomes in the limit

$$\exp\Bigl(\frac{1}{W} \int_W \log |Y(f)|^2\, df\Bigr).$$
Since J is constant its average value is the same quantity and applying the
theorem on the change of entropy with a change of coordinates, the result follows.
We may also phrase it in terms of the entropy power. Thus if the entropy power
of the first ensemble is N_1 that of the second is

$$N_2 = N_1 \exp\Bigl(\frac{1}{W} \int_W \log |Y(f)|^2\, df\Bigr).$$
TABLE I
The final entropy power is the initial entropy power multiplied by the geometric
mean gain of the filter. If the gain is measured in db, then the output
entropy power will be increased by the arithmetic mean db gain over W.
In Table I the entropy power loss has been calculated (and also expressed in
db) for a number of ideal gain characteristics. The impulsive responses of these filters are also given for W = 2π, with phase assumed to be 0. The entropy loss for many other cases can be obtained from these results. For example the entropy power factor 1/e² for the first case also applies to any gain characteristic obtained from 1 − ω by a measure preserving transformation of the ω axis. In particular a linearly increasing gain G(ω) = ω, or a "saw tooth" characteristic between 0 and 1, have the same entropy loss. The reciprocal gain has the reciprocal factor. Thus 1/ω has the factor e². Raising the gain to any power raises the factor to this power.
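As a numerical check (our sketch, not from the paper; the function name is our own), the entropy power factor exp{(1/W)∫_W log|Y(f)|² df} can be evaluated for the gain characteristic 1 − ω on the unit band, and it does come out as 1/e²:

```python
import math

def entropy_power_factor(gain, W=1.0, steps=100000):
    """exp((1/W) * integral over the band of log |Y(f)|^2 df), midpoint rule."""
    total = 0.0
    for k in range(steps):
        f = (k + 0.5) * W / steps
        total += 2 * math.log(gain(f)) * (W / steps)
    return math.exp(total / W)

print(entropy_power_factor(lambda w: 1 - w))   # ~0.1353 = 1/e^2
print(entropy_power_factor(lambda w: w))       # same factor for G(w) = w
```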
23. ENTROPY OF A SUM OF TWO ENSEMBLES
If we have two ensembles of functions f_α(t) and g_β(t) we can form a new ensemble by "addition." Suppose the first ensemble has the probability density function p(x_1, ..., x_n) and the second q(x_1, ..., x_n). Then the