Predicting and modeling rainfall is an important problem in the
atmospheric sciences and agriculture. It is often addressed with
statistical learning methods, since global circulation and climate
change models are too coarse and inaccurate to capture the properties of
precipitation at a specific location. We consider the problem of
modeling precipitation occurrence for a network of rain stations.
Ideally, the model should capture several properties of the data, e.g.
spatial dependencies between pairs of rain stations, temporal
properties such as the run-length distributions of wet
and dry spells, and interannual variability in the
number of rainy days per season. What makes the problem difficult
is the variety of aspects of the data that must be modeled.

Predicting seasonal rainfall in the Northeast region of Brazil is of
great interest to atmospheric scientists, in particular at
IRI. As one of their goals, they are interested in modeling
rainfall occurrences during the February-March-April (FMA) season for the
state of Ceará (Figure 1). The data for the region
consist of
rainfall records from 10 rain-gauge stations for the period beginning in
1975. Once the years with a significant number of missing
observations are discarded, we are left with data for 10 rain stations
over 24 years, with 90 daily binary (rain/no rain) observations per station in each year.
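
The data layout and the properties listed earlier (station rainfall probabilities, rainy days per season, pairwise station dependence, spell lengths) can be illustrated with a minimal sketch. The array `rain` below is synthetic random data standing in for the real records, and all names (`rain`, `station_prob`, `spell_lengths`, and so on) are illustrative assumptions rather than part of the original work.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the real records (assumption: the actual station
# data are not reproduced here). Axes: (year, day of FMA season, station).
n_years, n_days, n_stations = 24, 90, 10
rain = (rng.random((n_years, n_days, n_stations)) < 0.4).astype(int)

# Daily rainfall probability per station, averaged over all years and days.
station_prob = rain.mean(axis=(0, 1))                  # shape (10,)

# Interannual variability: number of rainy days per season and station.
rainy_days = rain.sum(axis=1)                          # shape (24, 10)

# Spatial dependence: correlation between station pairs over all days.
pair_corr = np.corrcoef(rain.reshape(-1, n_stations), rowvar=False)


def spell_lengths(seq, value):
    """Return the lengths of consecutive runs of `value` in a 0/1 sequence."""
    lengths, run = [], 0
    for x in seq:
        if x == value:
            run += 1
        elif run > 0:
            lengths.append(run)
            run = 0
    if run > 0:
        lengths.append(run)
    return lengths


# Wet-spell lengths for the first station in the first year.
wet_spells = spell_lengths(rain[0, :, 0], 1)
```

Statistics of this kind are what a fitted model would be expected to reproduce on simulated sequences.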

Figure 1: Rainfall station locations with topographic contours
(meters). Circle size denotes the February-April climatological
daily rainfall probability (%) 1975-2002. The stations are: (1)
Acopiara (317 m), (2) Aracoiaba (107 m), (3) Barbalha (405 m),
(4) Boa Viagem (276 m), (5) Camocim (5 m), (6) Campos Sales (551 m),
(7) Caninde (15 m), (8) Crateus (275 m), (9) Guaraciaba Do Norte (902
m), and (10) Ibiapina (878 m). One degree of longitude/latitude
corresponds to about 110 km at the equator.

Our approach is to model daily precipitation for the network
conditioned on a small number of "weather" states. The states are
not explicitly known and are treated as random variables. A
sequence of precipitation occurrences is modeled with a hidden Markov
model (HMM): the weather states are hidden and follow a first-order Markov
dependence, and observations for different days are independent given the
values of the corresponding weather states (Figure 2). Precipitation
occurrences at individual stations on a given day are further assumed to be
independent conditioned on the value of that day's weather state.
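
Under these assumptions the joint distribution factorizes as sketched below. The notation is an assumption introduced here for illustration: $S_t$ is the hidden weather state on day $t$, $R_t^m$ is the binary rain occurrence at station $m$ on day $t$, $T$ is the number of days in a season, and $M$ is the number of stations.

$$
P(R_{1:T}, S_{1:T}) = P(S_1) \prod_{t=2}^{T} P(S_t \mid S_{t-1}) \prod_{t=1}^{T} \prod_{m=1}^{M} P(R_t^m \mid S_t)
$$

The first product encodes the first-order Markov dependence of the states, and the double product encodes the independence of days and of stations given the state.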

Figure 2: Graphical model of a hidden Markov model. States S
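
As an illustration of how such a model assigns a probability to one season of observations, the following is a minimal sketch of the scaled forward recursion for an HMM with conditionally independent Bernoulli outputs. The function name, argument names, and parameter layout are assumptions made for this example; it is not the authors' implementation.

```python
import numpy as np


def hmm_log_likelihood(obs, init, trans, emit):
    """Log-likelihood of one binary sequence under an HMM whose outputs are
    independent Bernoulli variables given the hidden state.

    obs:   (T, M) array of 0/1 rain occurrences (days x stations)
    init:  (K,)   initial state probabilities
    trans: (K, K) trans[i, j] = P(S_t = j | S_{t-1} = i)
    emit:  (K, M) emit[k, m] = P(rain at station m | state k), in (0, 1)
    """
    obs = np.asarray(obs)
    # Stations are independent given the state, so the log-probability of a
    # day's observation vector is a sum over stations:
    # log_b[t, k] = sum_m log P(obs[t, m] | state k)
    log_b = obs @ np.log(emit).T + (1 - obs) @ np.log1p(-emit).T   # (T, K)
    b = np.exp(log_b)

    alpha = init * b[0]
    log_lik = 0.0
    for t in range(len(obs)):
        if t > 0:
            alpha = (alpha @ trans) * b[t]
        norm = alpha.sum()        # rescale at each step to avoid underflow
        log_lik += np.log(norm)
        alpha = alpha / norm
    return log_lik
```

For data organized as years by days by stations, the log-likelihood of the whole data set would be the sum of this quantity over years, assuming seasons are treated as independent sequences.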

While this model can capture some global properties of the data, it cannot capture interannual variability due to outside atmospheric factors. For example, an HMM cannot predict whether a season in the test data is going to be rainier than average, since the model has no mechanism to distinguish between unseen sequences. Without a way to use information other than historical precipitation, the model cannot be used for prediction.

Atmospheric scientists often use general circulation models (GCMs) to extrapolate the future physical state of the atmosphere. GCMs can produce, with reasonable accuracy, values for sea-surface temperatures, sea-surface pressure, wind vectors, precipitation, and other atmospheric variables on a grid of typically 2.5°×2.5° at daily (or sometimes even finer) time intervals. While these predictions are not accurate enough to predict precipitation for a particular location directly, they can be used as additional input vectors to improve the descriptive power of HMMs as well as to distinguish between unseen sequences. To incorporate atmospheric variables into the HMM, we make the transition matrix, which represents the probability distribution of the next weather state given the current one, depend on the values of the atmospheric variables. The result is a non-homogeneous HMM (NHMM, Figure 3).

Figure 3: Graphical model of a non-homogeneous hidden Markov model. States S
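
One way to let state transitions depend on an input vector is a multinomial-logistic (softmax) parameterization. The sketch below shows it under stated assumptions: the functional form, the function name `nhmm_transition_matrix`, and the parameters `sigma` and `rho` are hypothetical choices made for illustration, since the text above does not specify how the dependence is parameterized.

```python
import numpy as np


def nhmm_transition_matrix(x, sigma, rho):
    """Input-dependent transition matrix for one day.

    Hypothetical multinomial-logistic form (an assumption, not taken from
    the text):
        P(S_t = j | S_{t-1} = i, x) is proportional to
        exp(sigma[i, j] + rho[j] @ x)

    x:     (D,)   atmospheric input vector for the day (e.g. GCM output)
    sigma: (K, K) baseline transition scores
    rho:   (K, D) weights linking the input to each destination state
    """
    scores = sigma + (rho @ x)[None, :]            # (K, K)
    scores -= scores.max(axis=1, keepdims=True)    # numerical stability
    probs = np.exp(scores)
    return probs / probs.sum(axis=1, keepdims=True)


# With all-zero weights the input has no effect and the matrix reduces to
# that of an ordinary (homogeneous) HMM.
sigma = np.array([[1.0, 0.0], [0.0, 1.0]])
homogeneous = nhmm_transition_matrix(np.ones(3), sigma, np.zeros((2, 3)))
```

In a forward recursion for such a model, a matrix of this kind would be recomputed for each day from that day's input instead of being held fixed across the sequence.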

We have used this framework to train models and analyze their
predictive power on the hold-out set for the Northeast Brazil
region. The results are described in detail in the related paper.

- S. Kirshner.
**Modeling of multivariate time series using hidden Markov models**, Ph.D. thesis. [PDF]

- S. Kirshner, P. Smyth, A.W. Robertson.
**Conditional Chow-Liu tree structures for modeling discrete-valued vector time series**, Proceedings of the Twentieth Conference on Uncertainty in Artificial Intelligence (UAI-2004), July 2004. [PDF]
*New method of modeling discrete-valued multivariate time series with application to modeling of multi-site precipitation occurrences.*

- A.W. Robertson, S. Kirshner, P. Smyth.
**Downscaling of daily rainfall occurrence over Northeast Brazil using a hidden Markov model**, Journal of Climate, 17(22):4407-4424, November 2004. [PDF (from Allen Press)] (or a TR version)
*Results of modeling precipitation occurrences for the Northeast region of Brazil.*

- J.P. Hughes, P. Guttorp, S.P. Charles.
**A non-homogeneous hidden Markov model for precipitation occurrence**, Journal of the Royal Statistical Society, Series C (Applied Statistics), 48(1):15-30, 1999. [PDF (from Blackwell Publishing)]
*Earlier paper describing NHMMs and how they can be applied to modeling precipitation occurrence for a network of stations.*

- J.P. Hughes, P. Guttorp.
**Incorporating spatial dependence and atmospheric data in a model of precipitation**, Journal of Applied Meteorology, 33(12):1503-1515, December 1994. [PDF (from Allen Press)]
*First paper describing NHMMs and their application to precipitation occurrence.*

This is joint work with Andrew Robertson at the International Research Institute (IRI) for Climate Prediction at Columbia University, and it is supported by the Department of Energy.

Information and Computer Science

University of California, Irvine CA 92697-3425

Last modified: December 21, 2003