A Tale of Two Datasets: Part 1

One of the leaked Hadley files (idl_cruts3_2005_vs_2008b.pdf) contains an intriguing array of temperature graphs: 154 sets of seasonal temperatures by world region or country. Each of the four graphs per region shows both raw and smoothed data from two data sets: 2005 (black) and 2008b (purple).



I think the two data sets are the CRU data before and after the updating work described in the Harry_Read_Me file, where I was alarmed to find frequent references to ‘synthetic data’. We know the timeline of that file begins in 2006 and runs to near the present. The file describes updating more than just the code that reads temperature records, so the question is how much the plotted data covers: is it just land temperature, or does it include sea surface temperature?

AJ Strata has done a couple of excellent posts about the graphs in this file (here, here and here), focusing on the lack of warming apparent in the graphs. My only issue with his observations is minor: he attributes the horizontal dashed lines to median values for each data set; I don’t think this is correct. Look at the two graphs below, from the Cape Verde Islands:

The median value should split the data in two, with half of it above the median line and half below. Other graphs show this too, but in the Cape Verde set it is very clear that the black and purple dashed lines do NOT represent the median values for the 2005 and 2008b datasets (respectively). I think the dashed lines are the mean of the 1961-1990 reference period that is used to calculate anomaly values for the HadCruT output. Having made that observation, there is not a lot I can do with it at present, but it may become useful as analysis of the data moves on.
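To make the distinction concrete, here is a minimal sketch in Python of the check I have in mind. The series is entirely made up (a steady upward trend in arbitrary units, not the Cape Verde data), and the helper names baseline_mean and fraction_above are my own for illustration. It only shows why a line drawn at the 1961-1990 mean need not split a trending series in half, whereas the median always does.

```python
from statistics import mean, median

def baseline_mean(years, temps, start=1961, end=1990):
    """Mean of the values whose year falls in the reference period (inclusive)."""
    return mean(t for y, t in zip(years, temps) if start <= y <= end)

def fraction_above(temps, level):
    """Fraction of values lying strictly above a horizontal line at `level`."""
    return sum(t > level for t in temps) / len(temps)

# Synthetic, illustrative series only: values 1..60 for the years 1951..2010.
years = list(range(1951, 2011))
temps = [y - 1950 for y in years]

med = median(temps)                 # 30.5
ref = baseline_mean(years, temps)   # 25.5

print("median:", med, "fraction above:", fraction_above(temps, med))
print("1961-1990 mean:", ref, "fraction above:", fraction_above(temps, ref))
```

With this invented series exactly half of the points sit above the median line, while about 58% sit above the reference-period mean. A dashed line that leaves clearly more (or less) than half of the data above it, as in the Cape Verde graphs, is therefore consistent with a baseline mean and not with a median.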

Another released file is a report to the funding body (Defra), dated March 2005, which is a contract interim report.  Ian ‘Harry’ Harris is one of the co-authors.

Revised optimally averaged global and hemispheric land and ocean surface temperature series including HadCRUT3 data set

Aims of the work (five main areas):

  • Improved land data: additional data, extra quality control.
  • Comprehensive land error model: Add estimates of observation errors, extend existing sampling and bias uncertainty estimates to arbitrary grid resolutions.
  • Flexible gridder: make gridded fields on any spatial resolution.
  • Better land-sea blending: combine land and sea data in coastal grid-boxes in a way which uses our knowledge of the uncertainties for each data source.
  • Better statistical processing: the gridded fields are refined using two important statistical processes:
    – Variance correction: removes the effect of the changing number of observing stations,
    – Optimum averaging: make global and hemispheric time-series from the gridded fields.

These processes will be checked for correctness, and simplified and refined where necessary.
The resulting dataset will be made available on the web, and the work will be published in time for the results to be available to the IPCC 4th assessment report.

We’ve had the 4th AR, so I had a look for the data. It might be here or it might not be fully transferred yet; either way, registration is required for access and there are rules and restrictions limiting use. I am assuming, rightly or wrongly, that the PDF of graphs is part of the output of this contract, so we’re looking at the data before and after improvement. The figure (above) from the Defra report is entitled: Improvements to the station data. (Plus signs are stations added, filled symbols stations deleted and hollow symbols stations edited.)

There is a further piece of the jigsaw, and of course I can’t be certain that it is the right piece, but I have to assume it is: one of the text files from the CRU code, “station-list-ncep”. This contains nearly 22,000 names, IDs and data for individual land and sea-based stations.

Since I am familiar with the GISS data set, my starting point with the graphs has been to compare the GHCN data available with the CRU stations and regional graphs. This is still only the beginning; I have done some comparisons, which I’ll start to put up here.


3 Responses to A Tale of Two Datasets: Part 1

  1. chiefio says:

    Are you saying that station-list-ncep has only station data, or that it has some temperature data in it as well?

    This is intriguing. Is “NCEP” some agency from whom a data set can be acquired?… hmmm…

    BTW, the “Hadley Sample” that was published by the Met Office that was supposed to be “the data” are in fact the homogenized product. Haven’t noticed anyone pointing that out yet…



    HadCRUT3: Global surface temperatures

    HadCRUT3 is a globally gridded product of near-surface temperatures, consisting of annual differences from 1961-90 normals. It covers the period 1850 to present and is updated monthly.

    So we can see it is a “gridded product” and has the 1850 cutoff date.

    This is offered as “proof” that AGW still exists and “the data” are clean. Someone needs to call them on this “polite deception”…

  2. chiefio says:

    Oh, and forgot this link, this is the actual subset that points to the above description to explain what it is:


  3. VJones says:

    The station-list-ncep has only station data in it – no temperature data (that would have been too good to be true).

    Thanks for the Hadley links. I haven’t had time to look yet.
