Saturday, November 19, 2011

Understanding Life Expectancy

Demographics are destiny. As Earth reaches the 7 billion human population milestone it is more important than ever that we understand how to model the future: how fast is our population growing? What, if any, resource constraints will we face in the future as world populations continue to grow? How will the shifting balance between young and old people play out? What will be the economic, social, political and military consequences of shifting demographic patterns?

Predicting the future is hard. Nevertheless, in studying demographic trends, we have one big advantage over most social sciences: what physicists would call a conservation law. Put simply: if you are age 40 today, then in ten years, either you will be age 50, or you will be dead.

We are going to combine this fact with real data about the US from the Centers for Disease Control and Prevention (CDC) to make some graphs showing life expectancy as a function of age. Since some people will live more or fewer years than expected, we will also look at the distribution of mortality outcomes. Finally, we will also look at how life expectancy could change if medical technology improves mortality rates by 1% each year.

If we try to model the impact of advertising on shoppers, or the future direction of the stock market, or who will win the next election, we face the difficulty of predicting human behavior in the face of all our irrationality and eccentricity. But when it comes to demographics, we have an inescapable mathematical law:
population in future equals 
  population now plus births minus deaths

At least, that works for the total population of the Earth. If we want to track population by country, we have to add a net migration term as well, since not everyone stays permanently in the country of their birth. But we won't bother about migration today.
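
As a minimal sketch of this bookkeeping in R (the language we use later in this post), here is the balance equation as a one-line function. The helper name nextPopulation and all the numbers are invented purely for illustration; they are not real data.

```r
# One step of the population balance equation.
# All figures below are made up for illustration only.
nextPopulation <- function(now, births, deaths, netMigration = 0) {
  now + births - deaths + netMigration
}

# Whole world: no migration term is needed.
world <- nextPopulation(now = 1000, births = 20, deaths = 10)

# Single country: add net migration (arrivals minus departures).
country <- nextPopulation(now = 100, births = 2, deaths = 1.5, netMigration = 1)
```

The migration argument defaults to zero, which matches the simplification we make for the rest of this article.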

This fundamental law means that we know the correct mathematical equations for modeling demographics; all that is needed is to measure the parameters (rates of fertility and mortality). Of course, we still don't know the future: will people choose to have more children, or fewer, as time goes on? Will a cure for cancer result in people living longer, or will an epidemic kill a lot of people? But at least we can have an informed discussion around the probability and magnitude of these kinds of scenarios, since the underlying math gives us the framework within which to work.

To do the calculations, we have to be a little careful about definitions. If you want, you can skip forward to the graphs, but they will be easier to interpret if you read through the next few paragraphs.

We are interested in predicting the future, so we need to define metrics that are likely to be relatively constant over time. Suppose we track the total number of deaths in the US each year. This number will probably grow from year to year, even though people are living longer. Why? Simple: the US population is also growing. Suppose a constant fraction of the population dies each year (e.g. 1%). As the total population P(t) grows with time, so does the total number of deaths D(t) = 0.01*P(t). So the ratio D(t)/P(t) is a better metric than plain D(t), since it controls for the size of the population.

However, even D/P is not that useful for forecasting, because it is not going to stay constant over time. Humans typically live for many decades (how many varies by country), because death rates are low for young people (except for infant mortality) and high for old people. So, if the mix of old and young changes, D/P will also change.

To illustrate, suppose the death rate is zero for people under 80, and is 10% per year for people over 80. Suppose 10% of our population is over 80. Then D/P will be 0.9*0+0.1*0.1 = 0.01, or one percent of the total population per year. Now suppose after a few decades, the percent of people in the population who are over 80 changes to 20%. Then D/P will be 0.8*0+0.2*0.1 = 0.02, or two percent of the population per year.
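
The same arithmetic can be sketched in R as a population-weighted average of the age-specific rates. The rates and population shares are the hypothetical ones from the example above; the names rate and crudeRate are just illustrative.

```r
# Crude death rate D/P as a weighted average of age-specific death rates.
# Rates and shares are the hypothetical ones from the example in the text.
rate <- c(under80 = 0, over80 = 0.10)   # deaths per person per year

crudeRate <- function(share) sum(share * rate)

before <- crudeRate(c(0.9, 0.1))  # 10% of people are over 80
after  <- crudeRate(c(0.8, 0.2))  # 20% of people are over 80
```

Nothing about the age-specific rates changed between the two calls; only the mix did, and D/P doubled.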

Since D/P depends on the age distribution, a better model will track the D/P ratio separately for each age group; that way we can also control for shifts in mix. For instance, we will keep track of D, P, and the D/P ratio for 30-year-old persons, and separately also for 31-year-old persons, and so forth.

There is a further complication. The death rate for a given age actually changes over time, as medical technology improves. A person born in 1900 was 30 years old in 1930, before antibiotics were common, so their chance of dying during their 30th year was higher than that of a person born in 1950, whose 30th year was 1980.

Demographers report death rates by age in what are called mortality tables. Since death rates vary as technology improves, we have to make clear whether the mortality rates we collect in the table refer to one cohort (e.g. the people born in 1900, followed over their lifetimes) or to one period (e.g. the deaths that occurred during the year 2007, which impacted people from a whole spectrum of birth years).

Constructing a cohort table requires taking data over a hundred-year period, and you have to wait until all of the members are dead before you are finished. That makes period tables much more practical.

Notice, though, that period tables come with the implicit assumption that their rates stay constant from here on out. This assumption can be unrealistic. Imagine a baby born in 2007. They experience the infant mortality conditions of the table accurately enough, but as they age, they ought to experience less mortality than the table suggests. That's because when they are 30, for example, the year will be 2037, and (hopefully) improved medical care will reduce the death rates for 30 year old people compared with the value in the table, which reflects the experience of people who were 30 in 2007. Similarly, when this 2007 baby is 80, the year will be 2087, not 2007, and so the mortality rate ought to reflect another 80 years of progress. Thus, the period table shows the life expectancy that the baby of 2007 would experience if medical care never improves beyond the 2007 level.

The bottom line is that the period life table gives us a practical way to compare death rates by age at one moment in time, even though it cannot accurately predict what will happen to your cohort (let alone you personally), in some distant future year. Comparing period tables constructed in different calendar years (or for different countries or races or genders) also gives us a standardized way to compare and track progress.

Demographers measure mortality rates (and also fertility rates) subdivided by country, age, gender, and sometimes additional factors like race. They also have to track immigration, because people can change countries over time. For today though, we are going to look just at mortality, just in the United States, and without subdividing by race or gender. We will leave fertility and other elements of a complete model to a future article.

The Centers for Disease Control and Prevention (CDC) publishes "Vital Statistics" each year, including a so-called LEWK3 table; we will work with the most recent one, from 2007. This is a period table. From it, we can calculate all sorts of interesting metrics, such as life expectancy.

Here are the first few rows in the table - the spreadsheet, which you can download from the CDC, actually goes all the way to age 100.

Let's be sure we are clear about what each column means. From the CDC documentation file nvsr58_10.pdf page 2:
  • The row labeled Age 4-5 refers to what happens to people during the year between their 4th and 5th birthdays, i.e. while they are 4 years old.
  • The probability of dying column (q) is the probability of the person dying during that year of age. Observe that it is much higher (more than 10 times) for age 0-1 than for subsequent years: this is why infant mortality is such a big concern.
The documentation is a bit ambiguous: is q[A] the probability of death at age A for all people in the original cohort, or just for people who have attained age A? It must be the latter, since column (q) does not sum to one, as it would need to if these probabilities applied to the whole cohort. But since all the other columns in the spreadsheet are calculated from column (q), we can verify this interpretation by replicating the calculations ourselves.
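
To make the distinction concrete, here is a small R sketch using a toy (q) column. The four values are invented, not CDC data; the point is only that conditional probabilities need not sum to one, while the unconditional ones must.

```r
# Toy conditional death probabilities (invented, not CDC data). The last one
# is 1 because the table is closed at the final age.
q <- c(0.1, 0.2, 0.5, 1)

# Probability that a newborn survives to the start of each age row.
surviveTo <- cumprod(c(1, 1 - q))[seq_along(q)]

# Unconditional probability, for a newborn, of dying in each age row.
dieAt <- surviveTo * q

sum(q)      # 1.8: these are conditional probabilities, so no need to sum to 1
sum(dieAt)  # 1: every member of the cohort dies exactly once
```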

The number surviving column (l) starts at 100,000 people and goes down over time as the people age. Here is where we blur the distinction between period and cohort. The probabilities (q) are period death rates, e.g. in 2007 in the US, 0.0460% of children age 1-2 died, and 0.0176% of children age 4-5 died. To calculate column (l), we pretend that these rates will not change in the future. Hence, our hypothetical cohort of 100,000 people will experience the infant mortality of the 2007 cohort, the age 1-2 mortality of the 2006 cohort, the age 2-3 mortality of the 2005 cohort, and so on, since all of those rates were observed in 2007 and hence recorded in our table. As noted above, if medical technology continues to improve, then for the real 2007 cohort, by the time they reach any given age, their actual mortality should be lower than given in the table, so more will survive longer than predicted in the table.

But assuming medical care no longer improves, we can follow our hypothetical cohort forward and observe what happens. Multiplying 100,000 by 0.006761 gives 676 infant deaths, leaving 99,324. Multiplying 99,324 by 0.000460 gives 46 deaths during age 1-2, leaving 99,278 survivors. And so forth. If you download the spreadsheet, you can work with exact numbers and avoid round-off error. Following this pattern you can verify the (l) and (d) numbers from (q).
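
Here is that pattern as a short R sketch. It uses only the two (q) values quoted above, so it reproduces just the first two rows; the variable names mirror the column letters.

```r
# Only the two q values quoted in the text are used here; the rest of the
# CDC column is not reproduced.
q <- c(0.006761, 0.000460)   # ages 0-1 and 1-2

l <- numeric(length(q) + 1)  # survivors at the start of each age row
d <- numeric(length(q))      # deaths during each age row
l[1] <- 100000               # size of the hypothetical cohort
for (i in seq_along(q)) {
  d[i]     <- q[i] * l[i]
  l[i + 1] <- l[i] - d[i]
}

round(d)  # 676 46
round(l)  # 100000 99324 99278
```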

Since there are one hundred independent numbers in the table (the contents of the (q) column), the best way to understand the table is via a graph. However, people often prefer having a single number that summarizes a lot of the information in the table. There are various numbers you could compute for this purpose, such as the average death rate, but the one most commonly reported is called life expectancy at birth. This is the number 77.9 found in the age 0-1 row of column (e).

Life expectancy means the expected, or average, number of additional years a person in our hypothetical cohort will live. How shall we calculate it? Among people who reached age x, some die in the next year, during their x-to-(x+1) year. Some die the following year. Some keep living a long time further. We have to weight each of these outcomes by the corresponding probability, and then add them up to find the expected value.

The two columns labeled (L) and (T) help with this calculation. Except for the first and last rows in the table, the (L) column shows the average population during that year, assuming people who die do so at a uniform rate throughout the year. For example, the (L) value for age 3-4 is 99,239, which is half way between the surviving population (l) at the start of the year (99,250) and at the end of the year (99,228). The first and last values have been modified based on information not shown to us, namely the age in months at which infants died, or the age beyond 100 that people survived to. Apparently they assumed people live exactly two years past 100. At any rate, (L) represents the total number of person-years lived during this age range: one year for everyone who survives the year, plus on average half a year for those who die during the year.
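
The age 3-4 example can be checked with a couple of lines of R. This sketch assumes, as the text does, that deaths are spread evenly through the year; the variable names are just for illustration.

```r
# Survivors at exact ages 3 and 4, as quoted in the text above.
l3 <- 99250
l4 <- 99228

# Person-years lived between ages 3 and 4: a full year for each survivor,
# plus on average half a year for each person who dies during the year.
deaths <- l3 - l4
L3 <- l4 * 1 + deaths * 0.5

L3             # 99239
(l3 + l4) / 2  # the same value, written as a midpoint
```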

The (T) column is calculated from the bottom end of the table backward to the top, by summing the (L) column. For instance, the value of (T) in the 0-1 age row is the sum of the entire (L) column. Since (L) represents people-years lived during one age row, (T) represents the total people-years lived during and after one age row. Since those total people-years belong to the survivors listed in column (l), if we divide (T) by (l) we get the life expectancy beyond the current age.

Personally, I find it easier to think about this definition the other way around: take the probability-weighted sum of the extra number of years people live. The tricky part is that the probabilities are not given by the (q) column, but rather, by the (d) column from age x onward, divided by the (l) column at age x. To see this, note that the people in cell l[x] die according to the values in d[x], d[x+1], and so forth, which therefore sum to l[x]. So, for example, starting at age x=10 (meaning row 10-11), if we multiply column (d) by (age-9.5), sum, and divide by l[x], we get 68.6, as expected from the spreadsheet. The spreadsheet formula is =SUMPRODUCT(H15:H105,D15:D105)/C15 after first putting =0.5+ROW()-15 in column H. A little algebra should convince you the two approaches give the same answer.
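
If you would rather check the algebra numerically, here is a sketch in R on a toy table. The (q) values are invented, and for simplicity the last row is treated like every other row (half a year on average), which is not how the CDC handles age 100.

```r
# Toy life table (invented q values, not CDC data) for ages 0, 1, 2, 3.
q <- c(0.1, 0.2, 0.5, 1)
n <- length(q)
l <- 100000 * cumprod(c(1, 1 - q))   # length n + 1; the last entry is 0
d <- l[1:n] - l[2:(n + 1)]           # deaths in each age row

# Approach 1: person-years (L), summed from the bottom up (T), divided by l.
L      <- (l[1:n] + l[2:(n + 1)]) / 2
Tx     <- rev(cumsum(rev(L)))        # named Tx because T means TRUE in R
eFromT <- Tx / l[1:n]

# Approach 2: probability-weighted average of extra years lived. Someone who
# reached row x and dies in row j lives (j - x + 0.5) more years on average.
eFromD <- sapply(1:n, function(x) {
  j <- x:n
  sum(d[j] * (j - x + 0.5)) / l[x]
})

all.equal(eFromT, eFromD)  # TRUE
```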

So far, all we have done is verify the definition of life expectancy. Now for the much more interesting part. Let's draw some graphs, and then let's ask how these concepts may change as time goes forward. In other words, if medical technology does continue to improve, how much longer might we expect to live? And, rather than just look at averages, let's look at the distribution of outcomes. There is so much more information here than can be summarized in a single number!

If you like, you can do the calculations and make the graphs yourself using a spreadsheet. However, I will show how to do it using the free high-quality open-source statistical programming language R. You can follow along by downloading your own completely free copy of R from The Comprehensive R Archive Network. Look in my earlier post on Koch Snowflakes for some background on why R is a good choice, or look in System Dynamics: Feedback Models for more on population modeling using R, or look in How Do We Know? for an even simpler introduction to R.

The first thing we need to do is use our spreadsheet program to convert the CDC table, which was designed for human readers, into a plain tab-delimited text file, with numbers in the age column, like this:

Save this in a file called "us-period2007-life-table.tab". Now we can close off the spreadsheet and do the rest with R.

Here's the R code we will use. The first line reads the table and turns it into a "data frame", which allows us to refer to the columns by name. The second line prints a summary of each column so we can check that it read the file correctly. Next, we define a function for calculating the (l), (d) and (e) columns from the (q) column; this will let us experiment with modifying the mortality rates. The rest of the code makes the graphs that we will discuss below.

mort <- read.table('us-period2007-life-table.tab', 
 header=TRUE, sep='\t')
print(summary(mort))
N <- length(mort$q)

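## Remaining life expectancy at each age, computed from the deaths column.
## Row i holds age i-1; deaths in row i are assumed to happen at age i-0.5
## on average, and the last row is credited with one extra year.
lifeExp <- function(d) {
  y <- numeric(N)
  p <- 0      # running count of deaths at this age or later
  np <- d[N]  # running sum of (age at death)*deaths, seeded with the extra year
  for(i in N:1) {
    p <- p + d[i]
    np <- np + (i-0.5)*d[i]
    y[i] <- np/p-i+1  # mean age at death minus current age
  }
  y
}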

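## Rebuild survivors (l), deaths (d) and life expectancy (e) from the
## probabilities of dying (q). Note that l comes back with N+1 entries.
## Caveat: if q is rescaled so that q[N] is no longer 1, whatever is left
## over in l[N+1] is simply ignored by lifeExp().
calc <- function(q) {
  l <- 0*q + 100000 # initial cohort size
  d <- 0*q
  for(i in 1:N) {
    d[i] <- q[i]*l[i]
    l[i+1] <- l[i] - d[i]
  }
  list(l=l, d=d, e=lifeExp(d))
}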

m <- calc(mort$q)
## check we can reproduce 'l', 'd', and 'e' from 'q'
if(m$l[N+1] != 0 ||
   max(abs(m$l[1:N]-mort$l))>0.1 ||
   max(abs(m$d-mort$d))>0.01 ||
   max(abs(m$e-mort$e))>0.8)
  stop('does not match data')

## now make some graphics helper functions:

graph <- function(name, ylabel, y, col='black', extra=0) {
  png(paste(name,'.png',sep=''), 800, 500)
  par(mar=c(5, 5, 1, 1), cex=1.5, lwd=2)
  n <- length(y)
  yy <- y[1:(n-1)] # drop last point
  plot(c(0,n-2), c(0,max(yy)+extra), 
 type='n', xlab='Age', ylab=ylabel)
  add(y, col)
}

add <- function(y, col='black') {
  n <- length(y)
  lines(0:(n-2), y[1:(n-1)], col=col)
}

## now make the plots:

graph('le', 'Life Expectancy', lifeExp(mort$d))
dev.off()

## graph() already drops the last point (age 100, where q is 100%)
graph('mr', 'Mortality Rate (%/year)', 
       100*mort$q)
dev.off()

graph('sv', 'Survivors (% of cohort)', mort$l/1e3)
dev.off()

## modify mortality rates up or down 50%:
m <- calc(mort$q * 0.5)
m2 <- calc(mort$q * 1.5)
graph('le1', 'Life Expectancy', lifeExp(m$d), 'blue')
add(lifeExp(mort$d))
add(lifeExp(m2$d), 'red')
dev.off()

graph('sv1', 'Survivors (% of cohort)', m$l/1e3, 'blue')
add(mort$l/1e3)
add(m2$l/1e3, 'red')
dev.off()

## make a cohort graph following 2007 or 1957 people,
## assuming a 1%/year improvement after 2007
## (row i of the table holds age i-1, so ages run 0:(N-1))
age <- 0:(N-1)
m <- calc(mort$q * (0.99^age))
m2 <- calc(mort$q * (0.99^(age-50)))
graph('le2', 'Life Expectancy', lifeExp(m$d), 'green')
add(lifeExp(mort$d))
L <- lifeExp(m2$d)
lines(50:99, L[51:100], col='blue')  # ages 50..99 live in rows 51..100
dev.off()

## age of death probabilities for the 2007 and 1957 cohorts, 
## again assuming a 1%/year improvement after 2007
graph('death', 'Probability of Death by Age', 
     m$d/m$l[1], col="black", 0.003)
lines(50:99, m2$d[51:100]/m2$l[51], col="blue")
dev.off()

First, the life expectancy graph, column (e) in the dataset. This shows that at birth, life expectancy is about 78, falling roughly linearly with age until around 60, after which it starts to tail off. Since not everyone dies at age 78, the graph has to tail off: it cannot go to zero, let alone be negative. For example, at 80, life expectancy is about 9 more years. This means that having survived all the way to 80, you will, on average, live 9 more years - even though 80 is already past the original life expectancy at birth of 78. Of course, "you" means a member of the hypothetical cohort born in 2007 experiencing no further improvement in medical care beyond 2007 levels, not the real "you". If you are already 80 today, as you read this, then these numbers are representative for your generation, but if you are 20, you can hope that 60 years of progress will make the numbers for your generation better, assuming you actually make it to age 80 yourself.

Next, we plot the individual mortality rates for each age, column (q) in the dataset. The rate for age 100 is 100%, but that is an artifact of ending the table there, so I have suppressed it to enlarge the vertical scale. You can see the blip at zero for "infant mortality", followed by almost zero death rates until people reach age 60 or so.

Finally, we look at the percent of the cohort surviving to a given age, column (l) in the dataset. Aside from the infant mortality blip at the start, this decays very slowly until around age 60, after which it accelerates for a while, and then tails off, since like life expectancy, it can never go negative.

Visually, it looks like the median age of death (the age where the survival curve is at 50% of the population) is around 80, which makes sense: the median will be fairly close to the mean (life expectancy), though not identical.
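
Rather than reading the median off the graph, we could compute it from column (l). The helper below is a sketch of one reasonable definition (the first age at which no more than half the cohort is still alive); the name medianAge and the toy numbers are my own, not part of the CDC table.

```r
# Median age at death: the first age at which no more than half of the
# original cohort is still alive. `l` is indexed so that l[1] is age 0.
medianAge <- function(l) {
  which(l <= l[1] / 2)[1] - 1
}

# Toy survivor column (invented numbers) for ages 0 through 4.
medianAge(c(100000, 90000, 72000, 36000, 0))  # 3
```

Applied to the real table, the call would be medianAge(mort$l), using the data frame read in at the top of the listing above.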

Why did we bother figuring out how to calculate these columns? Why not just graph them directly in our spreadsheet? Well, now that we know how to do it, we can change things. In particular, we can experiment with reducing the death rates to see how much of a difference medical improvements might have on life expectancy.

Since we do not know the future, we will have to present scenarios. The next two plots show life expectancy and survival curves under the following scenarios:
  • black: the base case shown above
  • blue: mortality rates are cut in half across the board, meaning we multiply column (q) by 0.5
  • red: mortality rates rise by 50% across the board, meaning we multiply column (q) by 1.5
Can you guess what will happen?

A fifty-percent change in mortality rates seems like it should have a pronounced impact. Intuitively, we expect something like a fifty-percent change in life expectancy! In fact, life expectancy at birth changes by only 5 years up or down, with progressively less impact for older people. What is going on?

Think of it this way: until age 60, the mortality rate is almost zero, so whether we multiply it by 0.5 or by 1.5, it is still almost zero. Only when people are 70 or 80 or 90 are the rates high enough that 50% up or down is a big change. As a result, life expectancy at birth still reaches into the 70's, even for the red line, and does not get far into the 80's, even for the blue line. That means that changing life expectancy by even one year is hard. That suggests that the differences in life expectancy between industrialized countries and developing countries will be difficult to reduce without massive improvements in health care; see Gapminder World Map (2010) for a well-drawn diagram showing the correlation between life expectancy and economic progress.
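
You can see the same effect without the CDC file. The sketch below uses a synthetic mortality curve, so everything in it is an assumption made for illustration: the exponential (Gompertz-style) formula, its two constants, and the helper name e0. Unlike calc() in the listing above, this helper also forces the last (q) to 1, so that everyone still alive at the final age dies there.

```r
# Synthetic mortality curve, NOT the CDC table: rates start tiny and grow
# exponentially with age (a Gompertz-style assumption).
ages <- 0:100
qSynthetic <- pmin(1, 0.00005 * exp(0.09 * ages))

# Life expectancy at birth for a given q column, assuming deaths fall
# mid-year and that everyone still alive at the last age dies there.
e0 <- function(q) {
  q <- pmin(q, 1)
  q[length(q)] <- 1                        # close the table
  l <- cumprod(c(1, 1 - q))[seq_along(q)]  # fraction alive at each age
  d <- l * q                               # fraction dying at each age
  sum(d * (ages + 0.5))
}

base <- e0(qSynthetic)
half <- e0(qSynthetic * 0.5)   # mortality cut in half
high <- e0(qSynthetic * 1.5)   # mortality up by 50%
c(half = half, base = base, high = high)
```

On a curve of this shape, the three results should differ by far less than fifty percent, for the reason given above: scaling a rate that is nearly zero leaves it nearly zero.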

Similarly, if we look at the survival graph, most people live to their 70's or 80's under all three scenarios. For ages below 60, the differences are small. "Almost zero" mortality rates do add up over time, but it takes half a century or so to see it.

So far, these graphs reflect the period life table - they show what would happen to the cohort of babies born in 2007 if medical progress freezes at 2007 levels. Let's try to make some cohort graphs, predicting what will actually happen for those babies, as well as for the cohort born in 1957, which is always 50 years older than the 2007 cohort. To draw these graphs, we have to make an assumption about how fast mortality rates will improve in the future. For simplicity - since I have no real data on this - let us assume that the (q) values will fall by one percent (multiply by 0.99) with each passing year. In the next graph:
  • The black line is the usual 2007 period life expectancy (medical progress frozen at 2007 levels), as in the previous graphs,
  • The green line is the "actual" average number of remaining years of life for people from the 2007 cohort who have attained a given age, based on our 1%/year improvement assumption, and
  • The blue line is the "actual" for people from the 1957 cohort.
Can you guess what it will look like?

To make the green line, we multiply the mortality rate for age A by 0.99^A, reflecting A years of progress since 2007.

However, to make the blue line, we multiply the mortality rate for age A by 0.99^(A-50), since the blue cohort was already 50 years old in 2007.
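
Both multipliers come from the same rule: count the calendar years elapsed since 2007 when the cohort reaches a given age. A short R sketch, assuming the 1% per year improvement stated above (the variable names are just illustrative):

```r
ages <- 0:100

# Calendar year in which each cohort reaches a given age.
year2007 <- 2007 + ages
year1957 <- 1957 + ages

# Mortality multiplier relative to the 2007 table, assuming rates fall by
# 1% per calendar year (the simplifying assumption made in the text).
mult2007 <- 0.99^(year2007 - 2007)   # same as 0.99^ages
mult1957 <- 0.99^(year1957 - 2007)   # same as 0.99^(ages - 50)

round(mult2007[ages == 80], 2)  # 0.45
round(mult1957[ages == 80], 2)  # 0.74
```

So at age 80 the 2007 cohort faces less than half of the tabulated mortality rate, while the 1957 cohort still faces about three quarters of it.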

The blue line begins at age 50, since the 1957 cohort has already reached 50 as of 2007 - all these graphs are drawn from the perspective of 2007, not 2011, since there is apparently a 3 year lag in publishing the table.

We see that life expectancy at birth for the 2007 babies is really 83, up 5 from 78 years, provided medical care improves as we have assumed. For those who survive to age 50, their expected additional years of life after 50 also rises by 5 years, from 32 to 37 (i.e. they can expect to live to 87, rather than 82).

For people born in 1957, though, their expected additional years of life as of age 50 increase by only 2 years, from 32 to 34, i.e. they can expect to live to 84, rather than 82. That's because their old age will happen relatively soon - in just 30 years - so there will not have been time to improve medical care as much as for the 2007 cohort, for whom old age (i.e. years with large mortality rates) is still 80 years away.

All the graphs so far looked at averages. For any individual person, though, it is also interesting to know the distribution of additional years of life. After all, some people die young, while others live past 100. Not everyone lives exactly to the average. What does the "bell curve" look like?

Turns out there is an easy answer. We just need to look at column (d), which holds the number of people from the original cohort that died at each year. Starting at a particular age A, we divide by the total number of people that reached A, namely l[A], to get the probability distribution.
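
Here is that recipe as a small R sketch. The toy columns are invented (not CDC data) and the helper name deathDist is my own; the key detail is that the row for age A sits at index A+1.

```r
# Toy columns (invented, not CDC data) for ages 0 through 3.
l <- c(100000, 90000, 72000, 36000)  # alive at the start of each age row
d <- c(10000, 18000, 36000, 36000)   # deaths during each age row

# Distribution of age at death for someone who has reached age A.
deathDist <- function(A) {
  i <- (A + 1):length(d)             # the row for age A is at index A + 1
  setNames(d[i] / l[A + 1], i - 1)   # label each probability with its age
}

deathDist(1)       # ages 1, 2, 3 -> 0.2, 0.4, 0.4
sum(deathDist(1))  # 1, as a probability distribution should
```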

In the final chart, the black line shows the probability of death at a given age for the 2007 cohort, and the blue line for the 1957 cohort. In both cases, the calculations are as of 2007, and assume a 1% improvement in mortality rates per year throughout the 21st century.

This picture is quite interesting. First of all, on the left side, the infant mortality piece looks much more noticeable than it did back when we plotted the mortality rates. That's because those rates leave out the size of the population they apply to. The population is largest at birth, and declines over time, so the high rates at ages near 100 do not actually translate into very many people, since so many died along the way. This graph makes it more clear that infant mortality is still a very big problem even in the US.

Looking at the right-hand side, we see that the peak for the 2007 cohort comes at a later age than for the 1957 cohort - again because the 2007 cohort has an extra 50 years of medical improvements to help them over the 1957 cohort. We calculated earlier that the 1957 cohort could expect to live to 84 on average; now we see that there is considerable spread around that number, with significant numbers of people dying at every age between 50 and 100.

In fact, we see that cutting off the life table at age 100 is a bit premature: since it is all computerized nowadays, there is no reason not to extend it out to 110 or even 120 in order to provide more insight into how medical improvements affect the oldest people.

I hope this example encourages you to experiment: in whatever country you live, download the latest life tables and see what the forecasts are for you. You could also download several life tables from different years and compare them in order to estimate just how much progress improved medical care is making each year. Of course, past trends need not continue in the future, but they are at least a starting point for discussion.
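
As a starting point for that comparison, here is a sketch of how you might back out a constant annual improvement rate from two period tables. The two (q) values and the 20-year gap are invented for illustration; with real tables you would repeat this for every age.

```r
# Hypothetical q values for the same age taken from two period tables
# published `years` apart (invented numbers, for illustration only).
qOld  <- 0.0200
qNew  <- 0.0164
years <- 20

# Constant annual improvement r such that qNew = qOld * (1 - r)^years.
annualImprovement <- 1 - (qNew / qOld)^(1 / years)
annualImprovement  # just under 1% per year for these made-up inputs
```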

