Of Missing Temperatures and Filled-in Data (Part 1)

One of the most shocking things about examining the GHCN data that goes into global climate models has been its inconsistency. Not only is there loss of stations, but within each station's record there may be considerable loss of monthly data. This post asks – how bad is this? (Answer – much worse than I thought – see the last graph.)

Figure 1. Station data for Mactan, Philippines 2000-2009 (GISS unadjusted/combined data). Missing months are highlighted in yellow; seasonal averages and annual means derived (by GISS) with the inclusion of in-filled data for missing months are coloured red. Only the Annual Mean for 2006 (27.87 degC) is derived from a full twelve months of observations (Dec-Nov).


GISS stated methods and QC
The methods used by NASA GISS for the calculation of the global average temperature using the GIStemp programme can be found here. Basically, deriving station annual mean temperatures relies on first calculating the long-term monthly averages of the data. These are then used to derive the monthly, seasonal and annual anomaly values. For Mactan, for example, the long-term monthly average for January is 26.92 degC, which means the anomaly value for January 2009 was 27.2 – 26.92 = 0.28 degC. NASA says:

“The trick was to find the anomalies first and then compute the absolute values from the anomalies: Whereas the absolute monthly and seasonal temperatures may have a definite seasonal cycle, the monthly and seasonal anomalies do not; hence whereas a seasonal mean may be totally distorted if we leave out the warmest or coldest month, seasonal anomalies are less impacted by dropping any monthly anomaly.”

Really? (Hmm, there’s that word ‘trick’ again). Well, I worked through the calculations for Mactan and I have to say I was convinced – the anomaly calculation actually does a good job of filling in for any missing data. And it makes some sense to do this – to maximise the data that is there and avoid large gaps. But then I thought – the temperature variations in Mactan are small. The annual average temperature for the station is 28.01 degC and the seasonal averages vary from 27.03 to 28.38 degC. The temperature plot for the reporting period (1974-2009) (Figure 2) also shows a relatively flat trend.

Figure 2. Temperatures for Mactan, Philippines (GHCN/GISS, unadjusted data)
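To make the arithmetic concrete, here is a minimal sketch of the idea in Python. The numbers are made up (roughly Mactan-like), and the real GIStemp code works through Dec-Nov seasonal means rather than calendar months, but the principle is the same: a missing month is simply skipped in the anomaly average rather than breaking the calculation.

```python
# Toy illustration of the anomaly method (NOT the actual GIStemp code).
# temps[year] is a list of 12 monthly mean temperatures; None = missing month.
temps = {
    2005: [26.9, 27.0, 27.6, 28.3, 28.5, 28.2, 27.9, 27.8, 27.7, 27.6, 27.5, 27.2],
    2006: [26.8, 26.9, 27.5, 28.2, 28.4, 28.1, 27.8, None, 27.6, 27.5, 27.4, 27.1],
}

# Step 1: long-term average for each calendar month, skipping missing values.
monthly_means = []
for m in range(12):
    vals = [row[m] for row in temps.values() if row[m] is not None]
    monthly_means.append(sum(vals) / len(vals))

# Step 2: the annual anomaly is the mean of whatever monthly anomalies exist;
# a missing month is skipped rather than estimated directly.
def annual_anomaly(year):
    anoms = [temps[year][m] - monthly_means[m]
             for m in range(12) if temps[year][m] is not None]
    return sum(anoms) / len(anoms)

# Step 3: an 'absolute' annual mean can be reconstructed by adding the
# anomaly back onto the long-term annual mean.
long_term_annual = sum(monthly_means) / 12
print(round(annual_anomaly(2006) + long_term_annual, 2))
```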


A quick scan of the temperature anomaly values for Mactan showed that most of the monthly variation is small, less than +/- 0.5 degC off the monthly mean, with exceptional months exceeding +/- 1.0 degC. The highest monthly anomaly was +1.87 degC (March 1985). But what would happen at a station with large variations?

 
Figure 3. Temperatures in Jiuquan, China (GHCN/GISS unadjusted data)

The annual temperature record for Jiuquan, China (Figure 3) does not reveal that the temperature in this region varies by more than 25 degC across the average year. The overall annual average is 7.79 degC, but the seasonal averages vary from -7.06 (DJF) to 21.19 (JJA) degC. The plot does show clearly that there was a strong cooling trend at Jiuquan from 1941 to 1968, followed by warming from 1970 to the present.

Figure 4. Monthly (left) and Seasonal (right) Anomalies for Temperatures at Jiuquan.

A quick look at examples of monthly and seasonal anomaly values for Jiuquan (Fig. 4) shows that the cooling/warming cycle is also visible at the monthly and seasonal level. Note that there is a huge variation in February anomaly values, from -7 to +5 degC, but even in July (the warmest month) there is still a variation of +/- 2 degC about the monthly mean.

Now I should say at this point that there is very little missing data in the Jiuquan record; six individual months over the record, with no more than one month missing in any one of the six affected years. But what if there were? Does the greater variation in temperatures make a difference?

Well, it has been quite instructive playing with the data. Taking out any one month of data in the Jiuquan record can affect the annual anomaly quite significantly. I was surprised. Removing any one month in Summer (June/July/Aug) can shift the anomaly value for that year by +/- 0.03-0.08 degC on average, and by up to +/- 0.18 degC at most, but removing any Winter (Dec/Jan/Feb) month can result in a change in annual anomaly of +/- 0.2-0.3 (up to a maximum of 0.6) degC for that year. Repeating this for Mactan, the maximum differences I observed were +/- 0.10 and 0.15 degC for Summer and Winter respectively. So Jiuquan (and by implication cooler stations like it with large temperature variations) can be very sensitive to missing values, even when calculating anomaly values rather than absolute temperatures.
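For anyone who wants to repeat the experiment, the sketch below (continuing the toy code above, not my actual spreadsheet) does what I did by hand: it deletes each present month of a year in turn and records how far the annual anomaly moves. The long-term monthly means are held fixed, which isolates the effect of the single gap.

```python
def deletion_sensitivity(year):
    """For each present month, the shift in annual anomaly when that
    single month is artificially deleted (long-term means held fixed)."""
    baseline = annual_anomaly(year)
    shifts = {}
    for m in range(12):
        if temps[year][m] is None:
            continue
        saved = temps[year][m]
        temps[year][m] = None             # artificially delete the month
        shifts[m + 1] = round(annual_anomaly(year) - baseline, 3)
        temps[year][m] = saved            # restore it
    return shifts

# Months whose anomaly sits far from the year's average (a very cold
# February, say) shift the annual value the most when removed.
print(deletion_sensitivity(2005))
```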

Why do I think this is important? Simply this: when you have a missing month, the ‘filling in’ by using average anomaly values is just WRONG. Look at the Jiuquan February record. The temperatures are all over the place. If January is warm, there is no guarantee that February will be warm too. So with all those missing values we are creating even more uncertainty in the data record by spreading the existing data to cover those months – averaging the data. And the main point is this – we know that Winter warming has played a major part in warming the global average temperature, and part of that has been fewer extreme lows, but the cooling/warming cycle apparent in Jiuquan’s record is far from unique (see Mapping Global Warming for examples of maps of worldwide warming, cooling and warming cycles). So, if we are now entering a cooling cycle (negative PDO/AMO etc.) with more extreme lows, and we miss them through missing months, the record will be warmer than it should be (conversely, extreme warm months may be missed and the record will be cooler than actual).

So what else can be done when there are missing months? Well, I believe BOM (the Australian Bureau of Meteorology) does not compute an Annual Mean Temperature for years with even one missing month of data; however, I have been unable to find a specific reference to this on the BOM site [if some kind soul can point me to it in comments I’ll update with a link]. This is the QC also applied by my collaborator Kevin, who is responsible for the wonderful maps I linked to above. Kevin has quantified the missing data (Fig. 5) and it is quite shocking:

Figure 5. Plot showing the percentage of stations by WMO region which have at least one month of missing temperature data in any year.

Now note that the data in Fig. 5 is for active stations. We know there is a Station Dropout Problem around 1990, but the record since then is bad too: in Africa and S. America, since the mid 1980s, 60-90% of the stations are missing at least one month of data each year. In Antarctica (one of those lovely cold places, with temperature variations much greater than Jiuquan’s) more than 50% of stations miss at least one month in any given year (and often more). Even in N. America and Europe the rate is currently hovering around 20%.
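The counting behind a plot like Fig. 5 is easy to sketch. Assuming station records keyed by (station, year) with twelve monthly values, and a region lookup (both hypothetical names here, standing in for a parsed GHCN v2.mean file, not Kevin’s actual code), it is just bookkeeping. I have also included the strict BOM-style rule as I understand it:

```python
from collections import defaultdict

# Hypothetical inputs:
# records[(station_id, year)] = 12 monthly values (None = missing)
# region_of[station_id] = WMO region name
def missing_month_percentages(records, region_of):
    """Percentage of reporting stations in each region, each year,
    that are missing at least one month of data (as in Fig. 5)."""
    active = defaultdict(int)   # (region, year) -> stations reporting
    gappy = defaultdict(int)    # (region, year) -> stations with a gap
    for (station, year), months in records.items():
        key = (region_of[station], year)
        active[key] += 1
        if any(m is None for m in months):
            gappy[key] += 1
    return {key: 100.0 * gappy[key] / active[key] for key in active}

def annual_mean_strict(months):
    """BOM-style QC as I understand it: no annual mean at all for a
    year with even one missing month."""
    if any(m is None for m in months):
        return None
    return sum(months) / 12.0
```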

Of course, just dropping a year due to a missing month does not help the calculation of a global average temperature, and if you are going to ‘fill in’, it is perhaps better to fill in using the station’s own data than data from a nearby station, as happens at grid level anyway. But you know, if I WANTED to manipulate the temperature record (and I make no accusations), this is how I would try to do it – by selective reporting of data, and a method (anomaly calculation) that seems perfectly suited to covering such inadequacies. It would be so easy to do, yet so difficult to get right 😉

[Update 8th March 2010. Just realised I’d missed something E.M. Smith had picked up in the NASA FOIA emails release – he quotes a couple of emails and it seems NASA is concerned about infill after all!] 


10 Responses to Of Missing Temperatures and Filled-in Data (Part 1)

  1. The Blob says:

    GHCN data doesn’t go into global climate models…

  2. KevinS says:

    Blob,

    “GHCN data doesn’t go into global climate models…”

    What’s that got to do with the price of fish?

    No one here is stating that GHCN data goes into climate models. On the other hand, the GCMs are hindcast against ‘homogenised’ GHCN data, and the whole case for AGW within the GCMs rests on the premise that the post-1960 warming trend cannot be explained through natural climatic variability alone, but can be if AGW is invoked.

    I and many others disagree with this assumption on the part of the climate modellers.

  3. VJones says:

    Blob,
    actually GHCN data does go into climate models – it feeds into GISS and CRU, although a lot happens to it before it comes out the other end. It is not the only data to be fed in but it is a major part. The GHCN data is fed into GIStemp in the form of the GHCN v2.mean file, which has been used as a source of data for other analyses on this blog.

  4. The Blob says:

    People will misunderstand you and end up talking at cross purposes if you describe global temperature record analyses like GISTEMP as climate models. Really climate models are something else. I was just letting you know – aside from that it’s a minor point that doesn’t bear on your post – the graphs are very useful.

  5. VJones says:

    Blob,
    yes, you are right – I am a bit careless with my wording at times – tending to focus on the data and analysis. Thanks for pointing it out.

  6. cappyquinlan@bigpond.com says:

    If you wish to contact someone in BOM, or someone who has left the Australian Met Bureau, then I suggest Bill Kinninmonth, whose website can be found via Google. Bill is well known for his scepticism and would make himself available or point you in the right direction.

    Frank Quinlan
    Australia

  7. drj11 says:

    Your proposal for missing months is that one should avoid calculating an annual average.

    If we were trying to compute a trend based on annual averages (for a station, a location, a hemisphere, a planet, or whatever), then this would lead to fewer annual averages from which to compute the trend.

    Which would lead to _greater_ uncertainty. The computation of annual mean via seasonal means is surely designed to use as much of the source data as is reasonable, and thereby _minimise_ uncertainty.

  8. VJones says:

    drj11,
    I worked through this calculation to understand it. I was asking the question – could missing months have an effect? and if so – what? how? how big? I artificially deleted months to see what effect this could have and was satisfied that it could have a big effect (although a small effect is more likely). However, in real life we can’t say, because the data is… missing, so we can’t know what effect the real data would have (obviously).

    Having worked through the data I actually think this method IS a good way to calculate seasonal and annual means, and until RomanM’s recent posting of code, I had no notion of an alternative that would be any better.

    The point, which I would still assert, is that missing months do lower accuracy; that can’t be helped even with this method of anomaly calculation. Since the level of missing data is so high in recent years, this ‘uncertainty’ needs to be quantified.

    I think my comment about “if I WANTED to manipulate the temperature record….” has been taken a bit too seriously by some people (unfortunately).

  9. drj11 says:

    Okay, but you also say “when you have a missing month, the ‘filling in’ by using average anomaly values is just WRONG”. So if that is wrong, what is right?

  10. VJones says:

    Well it is wrong. But then is it “black and white, right or wrong” or is it shades of gray?

    Also, sometimes you have no choice but to go with ‘wrong’.

    On the other hand it is usual to exclude months with missing days from climate data (US Regional Climate Data Centers):
    “MAXIMUM ALLOWABLE NUMBER OF MISSING DAYS : 5
    Individual Months not used for annual or monthly statistics if more than 5 days are missing.
    Individual Years not used for annual statistics if any month in that year has more than 5 days missing.”
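
    In code, that rule amounts to something like this (a sketch with hypothetical names, assuming daily observations with None for a missing day – not the Climate Centers’ actual code):

    ```python
    def month_usable(daily_values, max_missing=5):
        # A month is used only if it has 5 or fewer missing days.
        return sum(v is None for v in daily_values) <= max_missing

    def year_usable(months_of_daily_values):
        # A year is used only if every one of its months passes the test.
        return all(month_usable(days) for days in months_of_daily_values)
    ```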

    If you desire to maximise use of the available data, the choice is: exclude good data because there are gaps, or fill in with average data. Either way there is a loss of accuracy relative to the ‘actual’ data.
