Demographics are destiny. As Earth reaches the 7 billion human
population milestone it is more important than ever that we understand
how to model the future: how fast is our population growing? What, if
any, resource constraints will we face in the future as world
populations continue to grow? How will the shifting balance between
young and old people play out? What will be the economic, social,
political and military consequences of shifting demographic patterns?
Predicting the future is hard. Nevertheless, in studying demographic
trends, we have one big advantage over most social sciences: what
physicists would call a conservation law. Put simply: if you are age
40 today, then in ten years, either you will be age 50, or you will
be dead.
We are going to combine this fact with real data about the US
from the Centers for Disease Control and Prevention (CDC) to make some
graphs showing life expectancy as a function of age. Since some people
will live more or fewer years than expected, we will also look at the
distribution of mortality outcomes. Finally, we will look at how life
expectancy could change if medical technology reduces mortality rates
by 1% each year.
If we try to model the impact of advertising on shoppers, or the
future direction of the stock market, or who will win the next
election, we face the difficulty of predicting human behavior in the
face of all our irrationality and eccentricity. But when it comes to
demographics, we have an inescapable mathematical law:
    population in future = population now + births - deaths
At least, that works for the total population of the Earth. If we want
to track population by country, we have to add a net migration term as
well, since not everyone stays permanently in the country of their
birth. But we won't bother about migration today.
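Here is a minimal sketch of that bookkeeping in R, the language used later in this post. The function name project and every number in it are hypothetical, chosen only to illustrate the identity; none of them are measured values.

```r
## One step of the conservation law:
##   population next year = population now + births - deaths (+ net migration)
## All rates below are hypothetical illustrations, not real data.
project <- function(pop, years, birthRate, deathRate, netMigration = 0) {
    out <- numeric(years + 1)
    out[1] <- pop
    for (t in 1:years) {
        births <- birthRate * out[t]
        deaths <- deathRate * out[t]
        out[t + 1] <- out[t] + births - deaths + netMigration
    }
    out
}

## 1000 people, 2% births and 1% deaths per year, no migration
p <- project(1000, 3, birthRate = 0.02, deathRate = 0.01)
print(p)
```

With constant rates this is just compound growth at 1% per year; the interesting questions are all about how the rates themselves change.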
This fundamental law means that we know the correct mathematical
equations for modeling demographics; all that is needed is to measure
the parameters (rates of fertility and mortality). Of course, we still
don't know the future: will people choose to have more children, or
fewer, as time goes on? Will a cure for cancer result in people living
longer, or will an epidemic kill a lot of people? But at least we can
have an informed discussion around the probability and magnitude of
these kinds of scenarios, since the underlying math gives us the
framework within which to work.
To do the calculations, we have to be a little careful about
definitions. If you want, you can skip forward to the graphs, but they
will be easier to interpret if you read through
the next few paragraphs.
We are interested in predicting the future, so we need to
define metrics that are likely to be relatively constant over
time. Suppose we track the total number of deaths in the US each
year. This number will probably grow from year to year, even though
people are living longer. Why? Simple: the US population is also
growing. Suppose a constant fraction of the population dies each year
(e.g. 1%). As the total population P(t) grows with time, so does the
total number of deaths D(t) = 0.01*P(t). So the ratio D(t)/P(t) is a
better metric than plain D(t), since it controls for the size of the
population.
However, even D/P is not that useful for forecasting, because it is
not going to stay constant over time. Humans
typically live for many decades (how many varies by country), because
death rates are low for young people (except for infant mortality) and
high for old people. So, if the mix of old and young changes,
D/P will also change.
To illustrate, suppose the death rate is zero for people under
80, and is 10% per year for people over 80. Suppose 10% of our
population is over 80. Then D/P will be 0.9*0+0.1*0.1 = 0.01, or one
percent of the total population per year. Now suppose after a few
decades, the percent of people in the population who are over 80
changes to 20%. Then D/P will be 0.8*0+0.2*0.1 = 0.02, or two percent
of the population per year.
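The same arithmetic can be written as a weighted average in R. This is only a sketch of the illustration above, using its made-up rates and population shares; the variable names are my own.

```r
## Crude death rate D/P as a weighted average of age-specific rates.
## Shares and rates are the ones from the illustration in the text.
rate     <- c(under80 = 0,   over80 = 0.10)  # deaths per person per year
mixNow   <- c(under80 = 0.9, over80 = 0.1)   # population shares today
mixLater <- c(under80 = 0.8, over80 = 0.2)   # shares a few decades later

crudeNow   <- sum(mixNow * rate)
crudeLater <- sum(mixLater * rate)
print(c(now = crudeNow, later = crudeLater))
```

The age-specific rates never changed; only the mix did, yet D/P doubled.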
Since D/P depends on the age distribution, a better model will
track the D/P ratio separately for each age group; that way we can
also control for shifts in mix. For instance, we will keep track of D,
P, and the D/P ratio for 30-year old persons, and separately also for
31-year old persons, and so forth.
There is a further complication. The death rate for a given age
actually changes over time, as medical technology improves. A person
born in 1900 turned 30 in 1930, before antibiotics were common, so
their chance of dying at age 30 was higher than that of a person born
in 1950, who turned 30 in 1980.
Demographers report death rates by age in what are called mortality
tables. Since death rates vary as technology improves, we have to make
clear whether the mortality rates we collect in the table refer to one
cohort (e.g. the people born in 1900, followed over their lifetimes)
or to one period (e.g. the deaths that occurred during the year 2007,
which impacted people from a whole spectrum of birth years).
Constructing a cohort table requires taking data over a
hundred year period, and you have to wait until all of the members are
dead before you are finished. That makes period tables much more
practical.
Notice, though, that period tables come with the implicit
assumption that their rates stay constant from here on out. This
assumption can be unrealistic. Imagine a baby born in 2007. They
experience the infant mortality conditions of the table accurately
enough, but as they age, they ought to experience less mortality than
the table suggests. That's because when they are 30, for example, the
year will be 2037, and (hopefully) improved medical care will reduce
the death rates for 30 year old people compared with the value in the
table, which reflects the experience of people who were 30 in
2007. Similarly, when this 2007 baby is 80, the year will be 2087, not
2007, and so the mortality rate ought to reflect another 80 years of
progress. Thus, the period table shows the life expectancy that the
baby of 2007 would experience if medical care never improves beyond
the 2007 level.
The bottom line is that the period life table gives us a
practical way to compare death rates by age at one moment in time,
even though it cannot accurately predict what will happen to your
cohort (let alone you personally), in some distant future
year. Comparing period tables constructed in different calendar years
(or for different countries or races or genders) also gives us a
standardized way to compare and track progress.
Demographers measure mortality rates (and also fertility
rates) subdivided by country, age, gender, and sometimes additional
factors like race. They also have to track immigration, because people
can change countries over time. For today though, we are going to
look just at mortality, just in the United States, and without
subdividing by race or gender. We will leave fertility and other
elements of a complete model to a future article.
The Centers for Disease Control and Prevention (CDC) publishes "Vital
Statistics" each year, including a so-called LEWK3 table; we will work
with the most recent one, from 2007. This is a period table. From it,
we can calculate all sorts of interesting metrics, such as life
expectancy.
Here are the first few rows in the table - the spreadsheet, which you
can download from the CDC, actually goes all the way to age 100.
Let's be sure we are clear about what each column means. From the CDC
documentation file nvsr58_10.pdf page 2:
- The row labeled Age 4-5 refers to what happens to people during
the year between their 4th and 5th birthdays, i.e. while they are
4 years old.
- The probability of dying column (q) is the probability of the
person dying during that year of age. Observe that it is much higher
(more than 10 times) for age 0-1 than for subsequent years: this
is why infant mortality is such a big concern.
The documentation is a bit ambiguous: is q[A] the probability of death
at age A for all people in the original cohort, or just for people who
have attained age A? It must be the latter, since column (q) does not
sum to one, as it would need to if these probabilities applied to the
whole cohort. But since all the other columns in the spreadsheet are
calculated from column (q), we can verify this interpretation by
replicating the calculations ourselves.
The number surviving column (l) starts at 100,000 people and goes down
over time as the people age. Here is where we blur the distinction
between period and cohort. The probabilities (q) are period death
rates, e.g. in 2007 in the US, 0.0460% of children age 1-2 died, and
0.0176% of children age 4-5 died. To calculate column (l), we pretend
that these rates will not change in the future. Hence, our
hypothetical cohort of 100,000 people will experience the infant
mortality of the 2007 cohort, the age 1-2 mortality of the 2006
cohort, the age 2-3 mortality of the 2005 cohort, and so on, since all
of those rates were observed in 2007 and hence recorded in our table.
As noted above, if medical technology continues to improve, then for
the real 2007 cohort, by the time they reach any given age, their
actual mortality should be lower than given in the table, so more will
survive longer than predicted in the table.
But assuming medical care no longer improves, we can follow
our hypothetical cohort forward and observe what happens. Multiplying
100,000 by 0.006761 gives 676 infant deaths, leaving
99,324. Multiplying 99,324 by 0.000460 gives 46 deaths during age 1-2,
leaving 99,278 survivors. And so forth. If you download the
spreadsheet, you can work with exact numbers and avoid round-off
error. Following this pattern you can verify the (l) and
(d) numbers from (q).
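As a quick check of the first two steps, here is a small R sketch that uses only the two (q) values quoted above. The variable names are mine, and the rounding at the end is just for display.

```r
## Rebuild the first rows of (l) and (d) from (q), using the two
## q values quoted in the text (ages 0-1 and 1-2).
q <- c(0.006761, 0.000460)
l <- numeric(length(q) + 1)
d <- numeric(length(q))
l[1] <- 100000                 # starting cohort size
for (i in seq_along(q)) {
    d[i] <- q[i] * l[i]        # deaths during this year of age
    l[i + 1] <- l[i] - d[i]    # survivors to the next birthday
}
print(round(d))
print(round(l))
```

The rounded results match the 676 and 46 deaths and the 99,324 and 99,278 survivors worked out by hand above.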
Since there are one hundred independent numbers in the table
(the contents of the (q) column), the best way to understand the table
is via a graph. However, people often prefer having a single number
that summarizes a lot of the information in the table. There are
various numbers you could compute for this purpose, such as the
average death rate, but the one most commonly reported is called life
expectancy at birth. This is the number 77.9 found in the age 0-1 row
of column (e).
Life expectancy means the expected, or average, number of
additional years a person in our hypothetical cohort will live.
How shall we calculate it? Among people who reached age x, some die in
the next year, during their x-to-(x+1) year. Some die the following
year. Some keep living a long time further. We have to weight each of
these outcomes by the corresponding probability, and then add them up
to find the expected value.
The two columns labeled (L) and (T) help with this calculation.
Except for the first and last rows in the table, the (L) column shows
the average population during that year, assuming people who die do so
at a uniform rate throughout the year. For example, the (L) value for
age 3-4 is 99,239, which is half way between the surviving population
(l) at the start of the year (99,250) and at the end of the year
(99,228). The first and last values have been modified based on
information not shown to us, namely the age in months at which infants
died, or the age beyond 100 that people survived to. Apparently they
assumed people live exactly two years past 100.
At any rate, (L) represents the total number of person-years
lived during this age range: one year for everyone who survives the
year, plus on average half a year for those who die during the year.
The (T) column is calculated from the bottom end of the table
backward to the top, by summing the (L) column. For instance, the
value of (T) in the 0-1 age row is the sum of the entire (L) column.
Since (L) represents people-years lived during one age row, (T)
represents the total people-years lived during and after one age row.
Since those total people-years belong to the survivors listed in
column (l), if we divide (T) by (l) we get the life expectancy beyond
the current age.
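To make the (L), (T) and (e) columns concrete, here is a sketch on a tiny made-up life table. The survivor counts are hypothetical, not CDC data, and the sketch applies the midpoint rule to every row, whereas the real table treats its first and last rows specially.

```r
## Toy life table (hypothetical numbers): survivors at the start of
## ages 0..3, with nobody left at the end of the last row.
l <- c(100, 90, 60, 20, 0)
n <- length(l) - 1

## (L): person-years lived in each row, assuming deaths are spread
## uniformly through the year (midpoint of start and end populations).
L <- (l[1:n] + l[2:(n + 1)]) / 2

## (T): person-years lived in this row and all later rows,
## accumulated from the bottom of the table upward.
Ttot <- rev(cumsum(rev(L)))

## (e): remaining life expectancy for those alive at the start of the row
e <- Ttot / l[1:n]
print(data.frame(age = 0:(n - 1), l = l[1:n], L = L, T = Ttot, e = e))

## The same midpoint rule reproduces the age 3-4 value quoted in the text
print((99250 + 99228) / 2)
```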
Personally, I find it easier to think about this definition
the other way around: take the probability-weighted sum of the extra
number of years people live. The tricky part is that the probabilities
are not given by the (q) column, but rather, by the (d) column from
age x onward, divided by the (l) column at age x. To see this, note
that the people in cell l[x] die according to the values in d[x],
d[x+1], and so forth, which therefore sum to l[x]. So, for example,
starting at age x=10 (meaning row 10-11), if we multiply column (d) by
(age-9.5), sum, and divide by l[x], we get 68.6, as expected from the
spreadsheet. The spreadsheet formula is
=SUMPRODUCT(H15:H105,D15:D105)/C15 after first putting =0.5+ROW()-15
in column H.
A little algebra should convince you the two approaches give the same
answer.
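If you would rather see numbers than algebra, this sketch computes life expectancy both ways on a made-up four-row table. The table is hypothetical, and deaths are assumed to fall mid-year in every row, including the last.

```r
## Hypothetical toy table: survivors at the start of ages 0..3, then 0.
l <- c(100, 90, 60, 20, 0)
n <- length(l) - 1
age <- 0:(n - 1)
d <- l[1:n] - l[2:(n + 1)]             # deaths in each row
L <- (l[1:n] + l[2:(n + 1)]) / 2       # person-years in each row

x <- 1                                 # start from age 1 (row index x + 1)
rows <- (x + 1):n

## Approach 1: T/l, summing person-years from age x onward
e1 <- sum(L[rows]) / l[x + 1]

## Approach 2: probability-weighted extra years, with deaths assumed to
## fall mid-year, i.e. (age - x + 0.5) extra years for each row
e2 <- sum((age[rows] - x + 0.5) * d[rows]) / l[x + 1]

print(c(e1 = e1, e2 = e2))
```

Both come out to 125/90, about 1.39 years, for this toy table.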
So far, all we have done is verify the definition of life
expectancy. Now for the much more interesting part. Let's draw some
graphs, and then let's ask how these concepts may change as time goes
forward. In other words, if medical technology does continue to improve,
how much longer might we expect to live? And, rather than just look at
averages, let's look at the distribution of outcomes. There is so much
more information here than can be summarized in a single number!
If you like, you can do the calculations and make the graphs yourself
using a spreadsheet. However, I will show how to do it using the free
high-quality open-source statistical programming language R. You can
follow along by downloading your own completely free copy of R from
The Comprehensive R Archive Network. Look in my earlier post on Koch
Snowflakes for some background on why R is a good choice, or look in
System Dynamics: Feedback Models for more on population modeling using
R, or look in How Do We Know? for an even simpler introduction to R.
The first thing we need to do is use our spreadsheet program to
convert the CDC table, which was designed for human readers, into a
plain tab-delimited text file, with numbers in the age column, like this:
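Save this in a file called "us-period2007-life-table.tab". The R code
below reads the column names from the header row and refers to the
columns q, l, d and e by name. Now we can close the spreadsheet and do
the rest with R.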
Here's the R code we will use. The first line reads the table and
turns it into a "data frame", which allows us to refer to the columns
by name. The second
line prints a summary of each column so we can check that it read the
file correctly. Next, we define a function for calculating the (l),
(d) and (e) columns from the (q) column; this will let us experiment
with modifying the mortality rates. The rest of the code makes the
graphs that we will discuss below.
mort <- read.table('us-period2007-life-table.tab',
                   header=TRUE, sep='\t')
print(summary(mort))
N <- length(mort$q)   # number of age rows; row i holds age i-1

## life expectancy at each age, from the deaths column 'd'
lifeExp <- function(d) {
    y <- 1:N
    p <- 0       # deaths in row i or later
    np <- d[N]   # age-weighted deaths; the last row gets one extra year
    for(i in N:1) {
        p <- p + d[i]
        np <- np + (i-0.5)*d[i]   # row i deaths occur at age i-0.5 on average
        y[i] <- np/p-i+1          # mean age at death minus current age
    }
    y
}

## rebuild survivors 'l', deaths 'd' and life expectancy 'e' from 'q'
## (if q[N] is not 1, as when q is rescaled below, l[N+1] is not zero
## and lifeExp averages only over the deaths recorded in the table)
calc <- function(q) {
    l <- 0*q + 100000 # initial cohort size
    d <- 0*q
    for(i in 1:N) {
        d[i] <- q[i]*l[i]
        l[i+1] <- l[i] - d[i]
    }
    list(l=l, d=d, e=lifeExp(d))
}

m <- calc(mort$q)

## check we can reproduce 'l', 'd', and 'e' from 'q'
if(m$l[N+1] != 0 ||
   max(abs(m$l[1:N]-mort$l))>0.1 ||
   max(abs(m$d-mort$d))>0.01 ||
   max(abs(m$e-mort$e))>0.8)
    stop('does not match data')

## now make some graphics helper functions:
## graph() opens a PNG file and draws the first curve against age
graph <- function(name, ylabel, y, col='black', extra=0) {
    png(paste(name,'.png',sep=''), 800, 500)
    par(mar=c(5, 5, 1, 1), cex=1.5, lwd=2)
    n <- length(y)
    yy <- y[1:(n-1)] # drop last point
    plot(c(0,n-2), c(0,max(yy)+extra),
         type='n', xlab='Age', ylab=ylabel)
    add(y, col)
}

## add() draws one more curve on the current plot; row 1 is age 0
add <- function(y, col='black') {
    n <- length(y)
    lines(0:(n-2), y[1:(n-1)], col=col)
}

## now make the plots:
graph('le', 'Life Expectancy', lifeExp(mort$d))
dev.off()

graph('mr', 'Mortality Rate (%/year)',
      100*mort$q[1:(N-1)])
dev.off()

graph('sv', 'Survivors (% of cohort)', mort$l/1e3)
dev.off()

## modify mortality rates up or down 50%:
m <- calc(mort$q * 0.5)
m2 <- calc(mort$q * 1.5)
graph('le1', 'Life Expectancy', lifeExp(m$d), 'blue')
add(lifeExp(mort$d))
add(lifeExp(m2$d), 'red')
dev.off()

graph('sv1', 'Survivors (% of cohort)', m$l/1e3, 'blue')
add(mort$l/1e3)
add(m2$l/1e3, 'red')
dev.off()

## make a cohort graph following 2007 or 1957 people,
## assuming a 1%/year improvement after 2007;
## row i holds age A = i-1, so the multiplier for age A is 0.99^A
A <- 0:(N-1)
m <- calc(mort$q * (0.99^A))
m2 <- calc(mort$q * (0.99^(A-50)))
graph('le2', 'Life Expectancy', lifeExp(m$d), 'green')
add(lifeExp(mort$d))
L <- lifeExp(m2$d)
lines(50:99, L[51:100], col='blue')   # ages 50..99 are rows 51..100
dev.off()

## age of death probabilities for the 2007 cohort (black) and the
## 1957 cohort (blue), again assuming a 1%/year improvement after 2007
graph('death', 'Probability of Death by Age',
      m$d/m$l[1], col="black", 0.003)
lines(50:99, m2$d[51:100]/m2$l[51], col="blue")
dev.off()
First, the life expectancy graph, column (e) in the dataset.
This shows that at birth, life expectancy is about 78, falling
roughly linearly with age until around 60, after which it starts to
tail off. Since not everyone dies at age 78, the graph has to tail
off: it cannot go to zero, let alone be negative. For example, at 80,
life expectancy is about 9 more years. This means that having
survived all the way to 80, you will, on average, live 9 more years -
even though 80 is already past the original life expectancy at birth
of 78. Of course, "you" means a member of the hypothetical cohort
born in 2007 experiencing no further improvement in medical care
beyond 2007 levels, not the real "you". If you are already 80, today
when you read this, then these numbers are representative for your
generation, but if you are 20, you can hope that 60 years of progress
will make the numbers for your generation better, assuming you actually
make it to age 80 yourself.

Next, we plot the individual mortality rates for each age, column (q)
in the dataset. The rate for age 100 is 100%, but that is an artifact
of ending the table there, so I have suppressed it to enlarge the
vertical scale. You can see the blip at zero for "infant mortality",
followed by almost zero death rates until people reach age 60 or so.
Finally, we look at the percent of the cohort surviving to a given
age, column (l) in the dataset. Aside from the infant mortality blip
at the start,
this decays very slowly until around age 60, after which it
accelerates for a while, and then tails off, since like life
expectancy, it can never go negative.
Visually, it looks like the median age of death (the age where the survival
curve is at 50% of the population) is around 80, which makes sense:
the median will be fairly close to the mean (life expectancy), though
not identical.
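Rather than reading the median off the graph, you can look it up in the survivor column. The sketch below does this on made-up survivor counts; the numbers and the name medianAge are hypothetical. With the real data frame, the same expression could be applied to mort$l.

```r
## Hypothetical survivor counts at the start of ages 0..5 (not CDC data)
l <- c(100, 98, 95, 70, 40, 10)
age <- 0:(length(l) - 1)

## First age at which no more than half the original cohort is alive
medianAge <- age[which(l <= 0.5 * l[1])[1]]
print(medianAge)
```

In this toy table the cohort drops below 50% some time during age 3-4, so the first birthday with at most half the cohort alive is the 4th.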
Why did we bother figuring out how to calculate these columns? Why not
just graph them directly in our spreadsheet? Well, now that we know
how to do it, we can change things. In particular, we can experiment
with reducing the death rates to see how much of a difference medical
improvements might have on life expectancy.
Since we do not know the future, we will have to present
scenarios. The next two plots show life expectancy and survival curves
under the following scenarios:
- black: the base case shown above
- blue: mortality rates are cut in half across the board,
meaning we multiply column (q) by 0.5
- red: mortality rates rise by 50% across the board,
meaning we multiply column (q) by 1.5
Can you guess what will happen?
A fifty-percent change in mortality rates seems like it should have a
pronounced impact. Intuitively, we expect something like a
fifty-percent change in
life expectancy! In fact, life expectancy at birth changes
by only 5 years up or down, with progressively less impact for older
people. What is going on?
Think of it this way: until age 60, the mortality rate is
almost zero, so whether we multiply it by 0.5 or by 1.5, it is still
almost zero. Only when people are 70 or 80 or 90 are the rates high
enough that 50% up or down is a big change. As a result, life
expectancy at birth still reaches into the 70's, even for the red
line, and does not get far into the 80's, even for the blue line. That
means that changing life expectancy by even one year is hard. That
suggests that the differences in life expectancy between
industrialized countries and developing countries will be difficult to
reduce without massive improvements in health care; see Gapminder
World Map (2010) for a well-drawn diagram showing the correlation
between life expectancy and economic progress.
Similarly, if we look at the survival graph, most people live to their
70's or 80's under all three scenarios. For ages below 60, the
differences are small. "Almost zero" mortality rates do add up over
time, but it takes half a century or so to see it.
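To see how small annual rates accumulate, here is a sketch with a single flat mortality rate applied for 60 years. The rate of 0.2% per year is a hypothetical round number chosen for illustration, not a value from the CDC table, where rates vary by age.

```r
## Hypothetical flat mortality rate for ages 0-59, chosen only to show
## the arithmetic; real rates vary by age.
q <- 0.002
scale <- c(half = 0.5, base = 1, plus50 = 1.5)

## Fraction of the cohort still alive at age 60 under each scenario
surviveTo60 <- (1 - scale * q)^60
print(round(100 * surviveTo60, 1))
```

Even a 50% swing in such a low rate moves the share of survivors at 60 by only a few percentage points.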
So far, these graphs reflect the period life table - they show what
would happen to the cohort of babies born in 2007 if medical progress
freezes at 2007 levels. Let's try to make some cohort graphs,
predicting what will actually happen for those babies, as well
as for the cohort born in 1957, which is always 50 years older than
the 2007 cohort. To draw these graphs, we have to make an assumption
about how fast mortality rates will improve in the future. For simplicity
- since I have no real data on this - let us assume that the (q)
values will fall by one percent (multiply by 0.99) with each passing
year. In the next graph:
- The black line is the usual 2007 period life expectancy as in the
previous graphs,
- The green line is the "actual" average number of remaining years
of life for people from the 2007 cohort who have attained a given
age, based on our 1%/year improvement assumption, and
- The blue line is the "actual" for people from the 1957 cohort.
Can you guess what it will look like?
To make the green line, we multiply the mortality rate for age A by
0.99^A, reflecting A years of progress since 2007.
However, to make the blue line, we multiply the mortality rate for age A by
0.99^(A-50), since the blue cohort was already 50 years old in 2007.
The blue line begins at age 50, since the 1957 cohort has already
reached 50 as of 2007 - all these graphs are drawn from the
perspective of 2007, not 2011, since there is apparently a 3 year lag
in publishing the table.
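The two multipliers are easy to tabulate. This sketch only evaluates the formulas just described, under the same assumed 1%/year improvement; the variable names and the choice of ages are mine.

```r
## Improvement multipliers under the 1%/year assumption used in the text
age <- c(50, 65, 80, 95)
mult2007 <- 0.99^age          # 2007 cohort reaches age A in year 2007 + A
mult1957 <- 0.99^(age - 50)   # 1957 cohort reaches age A in year 1957 + A
print(round(data.frame(age, mult2007, mult1957), 3))
```

At every age the 2007 cohort's multiplier is smaller by the same factor, 0.99^50 or about 0.605, reflecting its extra 50 years of assumed progress.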
We see that life expectancy at birth for the 2007 babies is really 83,
up 5 from 78 years, provided medical care improves as we have
assumed. For those who survive to age 50, their expected additional
years of life after 50 also rises by 5 years, from 32 to 37 (i.e. they
can expect to live to 87, rather than 82).
For people born in 1957, though, their expected additional
years of life as of age 50 only increases by 2 years, from 32 to 34,
i.e. they can expect to live to 84, rather than 82. That's because
their old age will happen relatively soon - in just 30 years - so
there will not have been time to improve medical care as much as for
the 2007 cohort, for whom old age (i.e. years with large mortality
rates) is still 80 years away.
All the graphs so far looked at averages. For any individual person, though,
it is also interesting to know the distribution of additional
years of life. After all, some people die young, while others live
past 100. Not everyone lives exactly to the average.
What does the "bell curve" look like?
Turns out there is an easy answer. We just need to look at column
(d), which holds the number of people from the original cohort that
died at each age. Starting at a particular age A, we divide by the
total number of people that reached A, namely l[A], to get the
probability distribution.
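Here is a sketch of that calculation on a made-up four-row table. The counts are hypothetical, not CDC data, and the variable names are mine.

```r
## Hypothetical toy table: deaths in each row for ages 0..3, and the
## survivors at the start of each row (not CDC data).
d <- c(10, 30, 40, 20)
l <- c(100, 90, 60, 20)
age <- 0:3

A <- 1                                   # condition on reaching age 1
rows <- which(age >= A)
prob <- d[rows] / l[A + 1]               # P(die in row | alive at age A)
names(prob) <- paste(age[rows], age[rows] + 1, sep = "-")
print(round(prob, 3))
print(sum(prob))                         # a proper distribution sums to 1
```

Because everyone alive at age A dies in one of the later rows, the probabilities add up to one.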
In the final chart, the black line shows the probability of death at a
given age for the 2007 cohort, and the blue line for the 1957
cohort. In both cases, the calculations are as of 2007, and assume a
1% improvement in mortality rates per year throughout the 21st century.
This picture is quite interesting. First of all, on the left side, the
infant mortality piece looks much more noticeable than it did back
when we plotted the mortality rates. That's because those rates
leave out the size of the population they apply to. The population is
largest at birth, and declines over time, so the high rates at ages
near 100 do not actually translate into very many people, since so
many died along the way. This graph makes it more clear that infant
mortality is still a very big problem even in the US.
Looking at the right-hand side, we see that the peak for the 2007
cohort comes at a later age than for the 1957 cohort - again because
the 2007 cohort has an extra 50 years of medical improvements to help
them over the 1957 cohort. We calculated earlier that the 1957 cohort
could expect to live to 84 on
average; now we see that there is considerable spread around that
number, with significant numbers of people dying at every age between
50 and 100.
In fact, we see that cutting off the life table at age 100 is a bit
premature: since it is all computerized nowadays, there is no reason
not to extend it out to 110 or even 120 in order to provide more
insight into how medical improvements affect the oldest people.
I hope this example encourages you to experiment: in whatever country
you live, download the latest life tables and see what the forecasts
are for you. You could also download several life tables from
different years and compare them in order to estimate just how much
progress improved medical care is making each year. Of course, past
trends need not continue in the future, but they are at least a
starting point for discussion.
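As a starting point for that comparison, here is a sketch of how one might back out an annual improvement rate from two period tables. The two death rates and the ten-year gap are made-up values for illustration, and the sketch assumes the rate fell by a constant factor each year.

```r
## Hypothetical death rates for one age in two period tables ten years
## apart (made-up values for illustration, not real data).
qOld <- 0.0120
qNew <- 0.0105
years <- 10

## Constant annual factor f such that qNew = qOld * f^years
f <- (qNew / qOld)^(1 / years)
annualImprovement <- 1 - f
print(round(100 * annualImprovement, 2))   # percent per year
```

Repeating this for each age would show whether a single across-the-board figure, like the 1% assumed in this post, is a reasonable simplification.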
See you next time!