GEOS 585A, Applied Time Series Analysis


Instructor

David M. Meko

Laboratory of Tree-Ring Research, Room 417, Bryant Bannister Tree-Ring Building (Bldg #45B)

Email: dmeko@arizona.edu

Office hours: Wednesday, 1:00-3:00 PM (please email to schedule a Zoom meeting)

Course Mode and University Policies

This course was last taught in Spring Semester 2021 and is now being revamped: the notes are being revised and the supporting software is moving from Matlab to R. I hope to have this transition completed by fall semester, 2025. The University of Arizona has policies generally applicable to all graduate courses. These policies cover such topics as 1) absence and class participation, 2) threatening behavior, 3) accessibility and accommodations, 4) academic integrity, and 5) nondiscrimination and anti-harassment. More information can be found at University Policies.

Course Description

Analysis tools in the time and frequency domains are introduced and illustrated with a dataset of sample time series. This year the sample dataset comes from an NSF project on hydrologic variability in the Truckee-Carson Basin of California/Nevada. This dataset includes tree-ring chronologies, precipitation records, streamflow records, and time series of snow-water equivalent measured at snow-course stations. You will assemble your own time series for use in the course. Ideally these are from your own research, but in any case they should be time series you would like to understand better.

Overview

This is an introductory course, with emphasis on practical aspects of time series analysis. Methods are introduced hierarchically -- starting with terminology and exploratory graphics, moving to descriptive statistics, and ending with basic modeling procedures. Topics include detrending, filtering, autoregressive modeling, spectral analysis and regression. You spend the first two weeks installing Matlab on your laptop, getting a basic introduction to Matlab, and assembling your dataset of time series for the course. Twelve topics, or "lessons," are then covered, each allotted a week, or two class periods. Twelve class assignments go along with the topics. Assignments consist of applying methods by running pre-written Matlab scripts (programs) on your time series and interpreting the results.

The course is 3 credits for University of Arizona students, and 1-3 credits for others.

Any time series with a constant time increment (e.g., year, month, day) is a candidate for use in the course. Examples are annual precipitation, monthly mean temperature, and daily cases of COVID-19.

Goals

As a result of taking the course, you should:

understand basic time series concepts and terminology
be able to select time series methods appropriate to goals
be able to critically evaluate scientific literature applying the time series methods covered
have improved understanding of time series properties of your own dataset
be able to concisely summarize results of time series analysis in writing

Prerequisites

An introductory statistics course
Access to a laptop computer capable of having Matlab installed on it
Permission of the instructor (undergraduates or non-University-of-Arizona students)

Other Requirements

If you are a University of Arizona (UA) student, you have access to Matlab and the required toolboxes at no cost through a UA site license. No previous experience with Matlab is required, and computer programming is not part of the course.
If you are not a University of Arizona student, you may, with the instructor's permission, be able to take the course in the Spring 2021 semester as an "iCourse". You must make sure that you have access to Matlab and the required toolboxes (see below) at your location.
Access to the internet. There is no paper exchange in the course. Notes and assignments are exchanged electronically and completed assignments are submitted electronically through the University of Arizona Desire2Learn (D2L) system.

Matlab version. I update scripts and functions now and then using the current site-license release of Matlab. For 2021, I am still using MATLAB Version 9.5.0.944444 (R2018b). Beware that scripts and functions used in the course may not run on earlier versions of Matlab.

Install the whole Matlab package (includes all toolboxes) when installing from the U of A site license. Not all of the toolboxes are needed, but this is the easiest installation. If you are not using the site license, keep in mind that my scripts and functions make extensive use of four toolboxes: Statistics, Signal Processing, System Identification, and Curve Fitting.

Availability

The course is offered in Spring Semester every other year (2019, 2021, etc.). It is open to graduate students and may also be taken by undergraduate seniors with permission of the instructor. Enrollment is capped at 20 for Spring Semester 2021.

A small number of students not at the University of Arizona can also be accommodated through the "iCourse" path described above.

Course Outline (Lessons)

Introduction to time series; organizing data for analysis
Probability distribution
Autocorrelation
Spectrum
Autoregressive-Moving Average (ARMA) modeling
Trend
Detrending
Filtering
Correlation
Lagged Correlation
Multiple linear regression
Validating the regression model

Syllabus

Calendar

Spring 2021 semester. Class meets twice a week for 75-minute sessions, 9:00-10:15 AM T/Th, over Zoom. The first day of our class is Jan 14 (Thurs). The last day of class is May 4 (Tues). There is no spring break this year, but instead there are "reading days" with no classes. Class will be cancelled for four days this semester: Feb 25 and Mar 9, which are COVID-19 "reading days"; and the two class days (Tuesday and Thursday) that fall in the U-of-A Earth Week. I do not yet know the Earth Week schedule for this year. Earth Week is usually around the last week of March.

The schedule typically allows about two weeks for gathering data and becoming familiar with Matlab. After that, one week (two class periods) is devoted to each of the 12 lessons or topics. Class meets on Tuesday and Thursday. A new topic is introduced on Tuesday, and is continued on the following Thursday. Thursday's class ends with an assignment and a demonstration of running the assignment Matlab script on my sample data. The assignment is due (must be uploaded by you to D2L) before class the following Tuesday.

Any online students not at the University of Arizona are expected to follow the same schedule of submitting assignments as regular students. In the "live online" mode, students have access to recorded Zoom lectures. All students have access to D2L for submitting assignments.

Once we are into the assignments on data analysis (after the first couple of weeks), the class routine is as follows:

Tuesday

3-minute lightning talk by a student (chosen randomly the previous Thursday).

1/2 hour for guided self-assessment, grading, and uploading of graded assignment to D2L

Remainder of class time to introduce the next topic

In the lightning talk, the student puts one or more figures from the submitted assignment up on the screen, and describes the time series analyzed and at least one finding from the analysis. Goals of this activity, new in spring 2019, are to 1) expose students to a variety of time series, 2) provide experience in communication of time series methods, and 3) give practical experience, through the lightning-talk format, in briefly and concisely describing research.

Thursday

Second part of lecture on this week's topic

Breakout-room discussion of some point raised by the instructor

Description of the next assignment and a trial run on my sample data

Random picking (sampling without replacement) of next Tuesday's lightning-talk presenter

The breakout-room discussion is new for Spring Semester 2021. Students are assigned to breakout rooms and asked to discuss a specific time series question related to the current topic. After 10 minutes, students return, and one student representing each breakout group reports on the group's discussion.

Data

You analyze data of your own choosing in the class assignments. As stated in the course overview, there is much flexibility in the choice of time series. I will make suitable time series available, but it is best to focus on your own research data. The first assignment involves running a script that stores the data and metadata you have gathered in Matlab structure variables in a "mat" file, Matlab's own binary storage format. Subsequent assignments draw data from the mat file for time series analysis.

Assignments

The 12 topics are addressed sequentially over the semester, which covers approximately 15 weeks. About the first two weeks (4-5 class meetings) are used for some introductory material, deciding on and gathering your time series, and readying Matlab on your laptop. Each week after that is devoted to one of the 12 course topics. Each assignment consists of reading a chapter of notes, running an associated Matlab script that applies selected methods of time series analysis to your data, and writing up your interpretation of the results. Assignments require understanding of the lecture topics as well as ability to use the computer and software.

You submit assignments by uploading them to D2L before the Tuesday class when the next topic is introduced. Students self-grade their assignments at the beginning of class on Tuesday. I browse the self-graded assignments the next day, assess the writing in the assignment, and may or may not change the student's self-assessed grade. To find out how to access assignments, click assignment files.

Readings

Readings consist of class notes. There are twelve sets of .pdf notes files, one for each of the course topics. You can access these pdf files through D2L. I put the pdfs up on D2L a couple of weeks before covering the topic in class. More information on the various topics covered in the course can be found through references listed at the end of each chapter of class notes. People from outside the course can also access the full set of notes from last year through the web site.

Grades

Grades are based entirely on performance on the assignments, each of which is worth 10 points. There are no exams. The total number of possible points for the 12 topics is 12 x 10 = 120. A grade of "A" requires 90-100 percent of the possible points. A grade of "B" requires 80-90 percent. A grade of "C" requires 70-80 percent, and so forth. In each assignment points are subtracted from the maximum of 10 by self-assessment guided by a rubric presented in class. You should mark the number of points earned at the top of each graded assignment, and annotate with reference to the rubric any subtraction of points.

The instructor looks over the self-graded assignments the next day, and may subtract up to an additional point for shortcomings in the writing quality (e.g., too long, incomprehensible, many spelling or grammatical errors). Assignments, given in class on Thursday, are due (must be uploaded to D2L by you) before the start of class the following Tuesday. The first half hour of Tuesday's meeting period will be dedicated to presentation of a grading rubric, self-assessment of completed assignments, and uploading of self-graded assignments to D2L. This schedule gives you 4 days to complete and upload the assignment to D2L before 9:00 am Tuesday. D2L keeps track of the time the assignment was uploaded, and no penalty is assessed as long as it is uploaded before 9:00 AM on Tuesday of the due date.

A late penalty of 3 points is assessed if the assignment is not submitted to D2L by 9 AM Tuesday. A late penalty of 1 point is assessed if the graded assignment is not uploaded to D2L by 5 AM Wednesday, which is when I begin looking over your self-graded assignments. If you have some scheduled need to be away from class (e.g., attendance at a conference), you are responsible for uploading your assignment before 9:00 AM the Tuesday it is due, and for uploading the self-graded version by 10:15 AM the same day. In other words, the schedule is the same as for the students who are in class. If an emergency comes up (e.g., you catch COVID) and you cannot do the assignment or assessment on schedule, please send me an email and we will reach some accommodation. Otherwise, the late penalties described above will apply.

Lessons

Introduction to time series; organizing data for analysis
A time series is broadly defined as any series of measurements taken at different times. Some basic descriptive categories of time series are 1) long vs short, 2) even time-step vs uneven time-step, 3) discrete vs continuous, 4) periodic vs aperiodic, 5) stationary vs nonstationary, and 6) univariate vs multivariate. These properties, as well as the temporal overlap of multiple series, must be considered in selecting a dataset for analysis in this course. You will analyze your own time series in the course. The first steps are to select those series and to store them in structures in a mat file. Uniformity in storage at the outset is convenient for this class so that attention can then be focused on understanding time series methods rather than debugging computer code to ready the data for analysis. A structure is a Matlab variable similar to a database in that the contents are accessed by textual field designators. A structure can store data of different forms. For example, one field might be a numeric time series matrix, another might be text describing the source of data, etc. In the first assignment you will run a Matlab script that reads your time series and metadata from ascii text files you prepare beforehand and stores the data in Matlab structures in a single mat file. In subsequent assignments you will apply time series methods to the data by running Matlab scripts and functions that load the mat file and operate on those structures.
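The idea of a structure that holds a series together with its metadata can be sketched outside Matlab as well. The Python example below is only an analogy, not the storage format produced by geosa1.m; the class name `SeriesRecord`, its field names, and all values are hypothetical. It also shows how the temporal overlap of two series might be checked.

```python
from dataclasses import dataclass


@dataclass
class SeriesRecord:
    """Hypothetical analog of one structure: data plus descriptive metadata."""
    name: str
    years: list
    values: list
    units: str = ""
    source: str = ""

    def common_years(self, other):
        """Return the sorted years covered by both this record and another."""
        return sorted(set(self.years) & set(other.years))


# Made-up values, for illustration only
precip = SeriesRecord("precip", [2001, 2002, 2003, 2004],
                      [310.0, 275.5, 402.1, 288.9], "mm", "example station")
flow = SeriesRecord("flow", [2002, 2003, 2004, 2005],
                    [12.4, 19.8, 13.1, 15.0], "m^3/s", "example gauge")

overlap = precip.common_years(flow)
print(overlap)
```

Fields are accessed by name (for example `precip.units`), much as a Matlab structure is accessed by textual field designators.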

Assignments
Select sample data to be used for assignments during the course
Read: (1) Notes_1.pdf, (2) "Getting Started", accessible from the MATLAB help menu
Answer: Run script geosa1.m and answer questions listed in the file a1.pdf
What to Know
- How to distinguish the categories of time series
- How to start and quit MATLAB
- How to enter MATLAB commands at command prompt
- How to create figures in figure window
- How to export figures to your word processor
- Difference between MATLAB scripts and functions
- How to run scripts and functions
- The form of a MATLAB "structure" variable
- How to apply the script geosa1.m to get a set of time series and metadata into MATLAB structures
Probability distribution
The probability distribution of a time series describes the probability that an observation falls into a specified range of values. An empirical probability distribution for a time series can be arrived at by sorting and ranking the values of the series. Quantiles and percentiles are useful statistics that can be taken directly from the empirical probability distribution. Many parametric statistical tests assume the time series is a sample from a population with a particular population probability distribution. Often the population is assumed to be normal. This chapter presents some basic definitions, statistics and plots related to the probability distribution. In addition, a test (Lilliefors test) is introduced for testing whether a sample comes from a normal distribution with unspecified mean and variance.
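The sorting-and-ranking step described above can be sketched in a few lines of Python. This is an illustration only and not the code in geosa2.m. The plotting position i/(n+1) and the linear-interpolation rule for quantiles are assumptions; other conventions are in common use. The sample values are made up.

```python
def empirical_cdf(series):
    """Sort the values and pair each with the cumulative probability i/(n+1)."""
    ordered = sorted(series)
    n = len(ordered)
    return [(x, (i + 1) / (n + 1)) for i, x in enumerate(ordered)]


def quantile(series, p):
    """Quantile for 0 <= p <= 1 by linear interpolation between sorted values."""
    ordered = sorted(series)
    pos = p * (len(ordered) - 1)
    lower = int(pos)
    upper = min(lower + 1, len(ordered) - 1)
    frac = pos - lower
    return ordered[lower] + frac * (ordered[upper] - ordered[lower])


sample = [12.0, 7.5, 9.1, 15.3, 8.8, 11.2, 10.4]  # made-up values
cdf = empirical_cdf(sample)
median = quantile(sample, 0.5)
print(cdf[0], median)
```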
Assignments
Read: Notes_2.pdf
Answer: Run script geosa2.m and answer questions listed in the file a2.pdf
What to Know
- Definitions of terms: time series, stationarity, probability density, distribution function, quantile, spread, location, mean, standard deviation, and skew
- How to interpret the most valuable graphic in time series analysis -- the time series plot
- How to interpret the box plot, histogram and normal probability plot
- Parameters and shape of the normal distribution
- Lilliefors test for normality: graphical description, assumptions, null and alternative hypotheses
- Caveat on interpretation of significance levels of statistical tests when time series not "random" in time
- How to apply geosa2.m to check the distribution properties of a time series and test the series for normality
Autocorrelation
Autocorrelation refers to the correlation of a time series with its own past and future values. Autocorrelation is also sometimes called lagged correlation or serial correlation, which refers to the correlation between members of a series of numbers arranged in time. Positive autocorrelation might be considered a specific form of persistence, a tendency for a system to remain in the same state from one observation to the next. The likelihood of tomorrow being rainy is greater if today is rainy than if today is dry. Geophysical time series are frequently autocorrelated because of inertia or carryover in the physical system. The slowly evolving low pressure systems in the atmosphere might impart persistence to daily rainfall. The slow drainage of groundwater reserves might impart correlation to successive annual flows of a river. Stored photosynthates might impart correlation to successive annual values of tree-ring indices. Autocorrelation complicates the application of statistical tests by reducing the number of independent observations. Autocorrelation can also complicate the identification of significant covariance or correlation between time series (e.g., precipitation with a tree-ring series). Autocorrelation can be exploited for predictions: an autocorrelated time series is predictable, probabilistically, because future values depend on current and past values. Three tools for assessing the autocorrelation of a time series are (1) the time series plot, (2) the lagged scatterplot, and (3) the autocorrelation function.
Assignments
Read: Notes_3.pdf
Answer: Run script geosa3.m and answer questions listed in the file a3.pdf
What to Know
- Definitions: autocorrelation, persistence, serial correlation, autocorrelation function (acf), autocovariance function (acvf), effective sample size
- How to recognize autocorrelation in the time series plot
- How to use lagged scatterplots to assess autocorrelation
- How to interpret the plotted acf
- How to adjust the sample size for autocorrelation
- Mathematical definition of the autocorrelation function
- Terms affecting the width of the computed confidence band of the acf
- The difference between a one-sided and two-sided test of significant lag-1 autocorrelation
- How to apply geosa3.m to study the autocorrelation of a time series
Spectrum
The spectrum of a time series summarizes the partitioning of variance of the series between rapid and gradual fluctuations. Rapid fluctuations are those with short wavelength, or high frequency. Gradual fluctuations are those with long wavelength, or low frequency. The spectrum by definition describes the variance of the series as a function of frequency or wavelength. The object of spectral analysis is to estimate and study the spectrum. The spectrum contains no new information beyond that in the autocovariance function (acvf), and in fact the spectrum can be computed mathematically by transformation of the acvf. But the spectrum and acvf present the information on the variance of the time series from complementary viewpoints. The acf summarizes information in the time domain and the spectrum in the frequency domain.
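The statement that the spectrum is a transformation of the acvf can be made concrete with a small Blackman-Tukey style sketch. This is an illustration and not the code in geosa4.m. The Tukey (cosine) lag window and the scaling of the estimate are assumptions; only the relative shape of the result should be read from it.

```python
import math


def autocovariance(series, max_lag):
    """Sample autocovariance at lags 0..max_lag (divisor n)."""
    n = len(series)
    mean = sum(series) / n
    dev = [x - mean for x in series]
    return [sum(dev[t] * dev[t + k] for t in range(n - k)) / n
            for k in range(max_lag + 1)]


def blackman_tukey(series, max_lag, freqs):
    """Cosine transform of the lag-windowed acvf at the requested frequencies.

    Frequencies are in cycles per time step, from 0 to 0.5 (Nyquist).
    """
    c = autocovariance(series, max_lag)
    # Tukey lag window (assumed): weight 1 at lag 0, tapering to 0 at max_lag
    w = [0.5 * (1 + math.cos(math.pi * k / max_lag)) for k in range(max_lag + 1)]
    spectrum = []
    for f in freqs:
        s = c[0] + 2 * sum(w[k] * c[k] * math.cos(2 * math.pi * f * k)
                           for k in range(1, max_lag + 1))
        spectrum.append(s)
    return spectrum


# Made-up series that flips sign every step: all variance is at high frequency
alternating = [1.0, -1.0] * 16
spec = blackman_tukey(alternating, 8, [0.0, 0.25, 0.5])
print(spec)
```

A smaller `max_lag` widens the bandwidth and smooths the estimate; a larger one resolves narrower features at the cost of higher variance.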
Assignments
Read: Notes_4.pdf
Answer: Run script geosa4.m and answer questions listed in the file a4.pdf
What to Know
- Definitions: frequency, period, wavelength, spectrum, Nyquist frequency, Fourier frequencies, bandwidth
- Reasons for analyzing a spectrum
- How to interpret a plotted spectrum in terms of distribution of variance
- The difference between a spectrum and a normalized spectrum
- Definition of the "lag window" as used in estimating the spectrum by the Blackman-Tukey method
- How the choice of lag window affects the bandwidth and variance of the estimated spectrum
- How to define a "white noise" spectrum and "autoregressive" spectrum
- How to sketch some typical spectral shapes: white noise, autoregressive, quasi-periodic, low-frequency, high-frequency
- How to apply geosa4.m to analyze the spectrum of a time series by the Blackman-Tukey method
Autoregressive-Moving Average (ARMA) modeling
Autoregressive-moving-average (ARMA) models are mathematical models of the persistence, or autocorrelation, in a time series. ARMA models are widely used in hydrology, dendrochronology, econometrics, and other fields. There are several possible reasons for fitting ARMA models to data. Modeling can contribute to understanding the physical system by revealing something about the physical process that builds persistence into the series. For example, a simple physical water-balance model consisting of terms for precipitation input, evaporation, infiltration, and groundwater storage can be shown to yield a streamflow series that follows a particular form of ARMA model. ARMA models can also be used to predict behavior of a time series from past values alone. Such a prediction can be used as a baseline to evaluate possible importance of other variables to the system. ARMA models are widely used for prediction of economic and industrial time series. ARMA models can also be used to remove persistence. In dendrochronology, for example, ARMA modeling is applied routinely to generate residual chronologies – time series of ring-width index with no dependence on past values. This operation, called prewhitening, is meant to remove biologically-related persistence from the series so that the residual may be more suitable for studying the influence of climate and other outside environmental factors on tree growth.
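Prewhitening can be illustrated with the simplest model, AR(1). The sketch below is not the code in geosa5.m, which uses Matlab toolbox functions. It assumes an AR(1) model and estimates the coefficient from the lag-1 autocorrelation; the simulated series and its coefficient of 0.7 are made up.

```python
import random


def fit_ar1(series):
    """Return the mean and the AR(1) coefficient (lag-1 autocorrelation)."""
    n = len(series)
    mean = sum(series) / n
    dev = [x - mean for x in series]
    c0 = sum(d * d for d in dev)
    c1 = sum(dev[t] * dev[t + 1] for t in range(n - 1))
    return mean, c1 / c0


def prewhiten(series):
    """Return AR(1) residuals e_t = (x_t - mean) - phi * (x_{t-1} - mean)."""
    mean, phi = fit_ar1(series)
    dev = [x - mean for x in series]
    return [dev[t] - phi * dev[t - 1] for t in range(1, len(dev))]


# Simulate a persistent series with a known coefficient of 0.7
rng = random.Random(0)
simulated = [0.0]
for _ in range(499):
    simulated.append(0.7 * simulated[-1] + rng.gauss(0.0, 1.0))

_, phi_hat = fit_ar1(simulated)        # should be near 0.7
residuals = prewhiten(simulated)       # one value shorter than the input
_, phi_resid = fit_ar1(residuals)      # should be near 0
print(round(phi_hat, 2), round(phi_resid, 2))
```

The residual series loses its first value because the model needs one past observation, and its lag-1 autocorrelation should be close to zero if AR(1) is an adequate description of the persistence.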
Assignments
Read: Notes_5.pdf
Answer: Run script geosa5.m and answer questions listed in the file a5.pdf
What to Know
- The functional form of the simplest AR and ARMA models
- Why such models are referred to as autoregressive or moving average
- The three steps in ARMA modeling
- The diagnostic patterns of the autocorrelation and partial autocorrelation functions for an AR(1) time series
- Definition of the final prediction error (FPE) and how the FPE is used to select a "best" ARMA model
- Definition of the Portmanteau statistic, and how it and the acf of residuals can be used to assess whether an ARMA model effectively models the persistence in a series
- How the principle of parsimony is applied in ARMA modeling
- Definition of prewhitening
- How prewhitening affects (1) the appearance of a time series, and (2) the spectrum of a time series
- How to apply geosa5.m to ARMA-model a time series
Trend

Trend in a time series is a slow, gradual change in some property of the series over the whole interval under investigation. Trend is sometimes loosely defined as a long term change in the mean, but can also refer to change in other statistical properties. For example, tree-ring series of measured ring width frequently have a trend in variance as well as mean. Years ago a time series was typically decomposed into trend, seasonal or periodic components, and irregular fluctuations, and the various parts were studied separately. Modern analysis techniques frequently treat the series without such routine decomposition, but separate consideration of trend is still often required. One of the most frequent questions asked about a time series is whether there is significant trend in mean.
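The text above does not name a particular test for trend in mean. As one possible illustration, the sketch below uses the Mann-Kendall test for monotonic trend, which is chosen here only as an example and may differ from what geosa6.m applies. It assumes no tied values and independent observations; an autocorrelated series would need an adjustment that is not shown. The input values are made up.

```python
import math


def mann_kendall(series):
    """Return (S, z, two-sided p) for the Mann-Kendall trend test.

    Assumes no ties and independent observations.
    """
    n = len(series)
    s = 0
    for i in range(n - 1):
        for j in range(i + 1, n):
            diff = series[j] - series[i]
            s += (diff > 0) - (diff < 0)   # sign of the pairwise difference
    var_s = n * (n - 1) * (2 * n + 5) / 18
    if s > 0:
        z = (s - 1) / math.sqrt(var_s)     # continuity correction
    elif s < 0:
        z = (s + 1) / math.sqrt(var_s)
    else:
        z = 0.0
    p = math.erfc(abs(z) / math.sqrt(2))   # normal approximation
    return s, z, p


rising = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]  # made-up values
s_stat, z_stat, p_value = mann_kendall(rising)
print(s_stat, round(z_stat, 2), p_value)
```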
Assignments
Read: Notes_6.pdf
Answer: Run script geosa6.m and answer questions listed in the file a6.pdf
What to Know
- Definitions: trend, nonstationarity, hypothesis test
- How to test for a monotonic trend in mean of a time series
- How to test for a difference of means or variances in two segments of an autocorrelated time series
Detrending
Detrending is the statistical or mathematical operation of removing trend from the series. Detrending is often applied to remove a feature thought to distort or obscure the relationships of interest. In climatology, for example, a temperature trend due to urban warming might obscure a relationship between cloudiness and air temperature. Detrending is also sometimes used as a preprocessing step to prepare time series for analysis by methods that assume stationarity. Many alternative methods are available for detrending. Simple linear trend in mean can be removed by subtracting a least-squares-fit straight line. More complicated trends might require different procedures. For example, the cubic smoothing spline is commonly used in dendrochronology to fit and remove ring-width trend that might not be linear, or not even monotonically increasing or decreasing over time. In studying and removing trend, it is important to understand the effect of detrending on the spectral properties of the time series. This effect can be summarized by the frequency response of the detrending function.
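The simplest case mentioned above, removing a linear trend in mean by subtracting a least-squares-fit straight line, can be sketched as follows. This is an illustration only; geosa7.m works with the cubic smoothing spline, which is not shown. Time is assumed to be the integer index 0, 1, 2, ..., and the input series is made up.

```python
def linear_detrend(series):
    """Fit x = intercept + slope * t by least squares and return the residuals."""
    n = len(series)
    times = list(range(n))
    t_mean = sum(times) / n
    x_mean = sum(series) / n
    sxx = sum((t - t_mean) ** 2 for t in times)
    sxy = sum((t - t_mean) * (x - x_mean) for t, x in zip(times, series))
    slope = sxy / sxx
    intercept = x_mean - slope * t_mean
    residuals = [x - (intercept + slope * t) for t, x in zip(times, series)]
    return slope, intercept, residuals


# Made-up series: a trend of 0.5 per step plus an alternating wiggle
trended = [2.0 + 0.5 * t + (1.0 if t % 2 else -1.0) for t in range(20)]
slope, intercept, detrended = linear_detrend(trended)
print(round(slope, 3), round(sum(detrended), 10))
```

This is difference detrending: the fitted line is subtracted, so the residuals are in the original units and sum to zero.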
Assignments
Read: Notes_7.pdf
Answer: Run script geosa7.m and answer questions listed in the file a7.pdf
What to Know
- Definitions: frequency response, spline, cubic smoothing spline
- Pros and cons of ratio vs difference detrending
- Interpretation of terms in the equation for the "spline parameter"
- How to choose a spline interactively from desired frequency response
- How the spectrum is affected by detrending
- How to measure the importance of the trend component in a time series
- How to apply geosa7.m to interactively choose a spline detrending function and detrend a time series
Filtering
The estimated spectrum of a time series gives the distribution of variance as a function of frequency. Depending on the purpose of analysis, some frequencies may be of greater interest than others, and it may be helpful to reduce the amplitude of variations at other frequencies by statistically filtering them out before viewing and analyzing the series. For example, the high-frequency (year-to-year) variations in a gauged discharge record of a watershed may be relatively unimportant to water supply in a basin with large reservoirs that can store several years of mean annual runoff. Where low-frequency variations are of main interest, it is desirable to smooth the discharge record to eliminate or reduce the short-period fluctuations before using the discharge record to study the importance of climatic variations to water supply. Smoothing is a form of filtering which produces a time series in which the importance of the spectral components at high frequencies is reduced. Electrical engineers call this type of filter a low-pass filter, because the low-frequency variations are allowed to pass through the filter. In a low-pass filter, the low frequency (long-period) waves are barely affected by the smoothing. It is also possible to filter a series such that the low-frequency variations are reduced and the high-frequency variations unaffected. This type of filter is called a high-pass filter. Detrending is a form of high-pass filtering: the fitted trend line tracks the lowest frequencies, and the residuals from the trend line have had those low frequencies removed. A third type of filtering, called band-pass filtering, reduces or filters out both high and low frequencies, and leaves some intermediate frequency band relatively unaffected. In this lesson, we cover several methods of smoothing, or low-pass filtering. We have already discussed how the cubic smoothing spline might be useful for this purpose. 
Four other types of filters are discussed here: 1) simple moving average, 2) binomial, 3) Gaussian, and 4) windowing (Hamming method). Considerations in choosing a type of low-pass filter are the desired frequency response and the span, or width, of the filter.
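A binomial low-pass filter and its frequency response can be sketched briefly. This is an illustration, not the code in geosa8.m. It assumes an odd number of weights, a centered filter, and that the ends of the series where the filter does not fit are simply dropped.

```python
import math


def binomial_weights(n_weights):
    """Binomial filter weights (n_weights assumed odd), scaled to sum to 1."""
    order = n_weights - 1
    coeffs = [math.comb(order, k) for k in range(n_weights)]
    total = sum(coeffs)
    return [c / total for c in coeffs]


def apply_filter(series, weights):
    """Centered weighted moving average; end values are dropped."""
    half = len(weights) // 2
    out = []
    for t in range(half, len(series) - half):
        window = series[t - half:t + half + 1]
        out.append(sum(w * x for w, x in zip(weights, window)))
    return out


def frequency_response(weights, f):
    """Amplitude response of a symmetric filter at f cycles per time step."""
    half = len(weights) // 2
    return sum(w * math.cos(2 * math.pi * f * (k - half))
               for k, w in enumerate(weights))


weights = binomial_weights(3)
smoothed = apply_filter([1.0, -1.0, 1.0, -1.0, 1.0], weights)
print(weights, smoothed)
print(frequency_response(weights, 0.0), frequency_response(weights, 0.5))
```

The response is 1 at frequency 0 and falls to 0 at the Nyquist frequency of 0.5, which is why the alternating input series is removed entirely while a constant series would pass unchanged.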
Assignments
Read: Notes_8.pdf
Answer: Run script geosa8.m and answer questions listed in the file a8.pdf
What to Know
- Definitions: filter, filter weights, filter span, low-pass filter, high-pass filter, band-pass filter; frequency response of a filter
- How the Gaussian filter is related to the Gaussian distribution
- How to build a simple binomial filter manually (without the computer)
- How to describe the frequency response function in terms of a system with sinusoidal input and output
- How to apply geosa8.m to interactively design a Gaussian, binomial or Hamming-window lowpass filter for a time series
Correlation
The Pearson product-moment correlation coefficient is probably the single most widely used statistic for summarizing the relationship between two variables. Statistical significance and caveats of interpretation of the correlation coefficient as applied to time series are topics of this lesson. Under certain assumptions, the statistical significance of a correlation coefficient depends on just the sample size, defined as the number of independent observations. If time series are autocorrelated, an effective sample size, lower than the actual sample size, should be used when evaluating significance. Transient or spurious relationships can yield significant correlation for some periods and not for others. The time variation of strength of linear correlation can be examined with plots of correlation computed for a sliding window. But if many correlation coefficients are evaluated simultaneously, confidence intervals should be adjusted (Bonferroni adjustment) to compensate for the increased likelihood of observing some high correlations where no relationship exists. Interpretation of sliding correlations can also be complicated by time variations of mean and variance of the series, as the sliding correlation reflects covariation in terms of standardized departures from means in the time window of interest, which may differ from the long-term means. Finally, it should be emphasized that the Pearson correlation coefficient measures strength of linear relationship. Scatterplots are useful for checking whether the relationship is linear.
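The effect of effective sample size on significance, and the Bonferroni adjustment, can be sketched as below. This is an illustration and not the code in geosa9.m. Two assumptions are made: the effective sample size uses the form N' = N(1 - r1x r1y)/(1 + r1x r1y), where r1x and r1y are the lag-1 autocorrelations of the two series, and the p-value uses the Fisher z transformation with a normal approximation rather than an exact test. The numbers are made up.

```python
import math


def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def effective_n(n, r1x, r1y):
    """Effective sample size given lag-1 autocorrelations (assumed form)."""
    return n * (1 - r1x * r1y) / (1 + r1x * r1y)


def correlation_p_value(r, n_eff):
    """Two-sided p-value by Fisher z and normal approximation (|r| < 1, n_eff > 3)."""
    z = math.atanh(r) * math.sqrt(n_eff - 3)
    return math.erfc(abs(z) / math.sqrt(2))


def bonferroni_alpha(alpha, n_tests):
    """Per-test significance level when n_tests correlations are evaluated."""
    return alpha / n_tests


n_eff = effective_n(100, 0.5, 0.5)          # 100 observations, both series persistent
p_large = correlation_p_value(0.5, n_eff)
p_small = correlation_p_value(0.5, 10)      # same r, far fewer independent values
print(n_eff, p_large, p_small, bonferroni_alpha(0.05, 10))
```

The same correlation of 0.5 is clearly significant with 60 effectively independent observations but not with 10.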
Assignments
Read: Notes_9.pdf
Answer: Run script geosa9.m and answer questions listed in the file a9.pdf
What to Know
- Mathematical definition of the correlation coefficient
- Assumptions and hypothesis for significance testing of correlation coefficient
- How to compute significance level of correlation coefficient and to adjust the significance level for autocorrelation in the individual time series
- Caveats to interpretation of correlation coefficient
- Bonferroni adjustment to significance level of correlation under multiple comparisons
- Inflation of variance of estimated correlation coefficient when time series autocorrelated
- Possible effects of data transformation on correlation
- How to interpret plots of sliding correlations
- How to apply geosa9.m to analyze correlations and sliding correlations between pairs of time series
Lagged correlation
Lagged relationships are characteristic of many natural physical systems. Lagged correlation refers to the correlation between two time series shifted in time relative to one another. Lagged correlation is important in studying the relationship between time series for two reasons. First, one series may have a delayed response to the other series, or perhaps a delayed response to a common stimulus that affects both series. Second, the response of one series to the other series or an outside stimulus may be smeared in time, such that a stimulus restricted to one observation elicits a response at multiple observations. For example, because of storage in reservoirs, glaciers, etc., the volume discharge of a river in one year may depend on precipitation in the several preceding years. Or because of changes in crown density and photosynthate storage, the width of a tree-ring in one year may depend on climate of several preceding years. The simple correlation coefficient between the two series properly aligned in time is inadequate to characterize the relationship in such situations. Useful functions we will examine as alternative to the simple correlation coefficient are the cross-correlation function and the impulse response function. The cross-correlation function is the correlation between the series shifted against one another as a function of number of observations of the offset. If the individual series are autocorrelated, the estimated cross-correlation function may be distorted and misleading as a measure of the lagged relationship. We will look at two approaches to clarifying the pattern of cross-correlations. One is to individually remove the persistence from, or prewhiten, the series before cross-correlation estimation. In this approach, the two series are essentially regarded on equal footing. 
An alternative is the systems approach: view the series as a dynamic linear system -- one series the input and the other the output -- and estimate the impulse response function. The impulse response function is the response of the output at current and future times to a hypothetical pulse of input restricted to the current time.
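As a rough illustration of the cross-correlation function and of prewhitening, here is a minimal Python sketch. It is not the course script geosa10.m. The function names, the synthetic series, and the use of a simple AR(1) model for prewhitening are all assumptions made for the example; in practice the order of the prewhitening model would be chosen from the data.

```python
import math
import random


def cross_correlation(x, y, max_lag):
    """Sample cross-correlation r_xy(k) for k = -max_lag .. max_lag.

    Lag k pairs x[t] with y[t + k], so a peak at k > 0 means y lags x.
    Covariances are divided by the full length n (one common convention).
    """
    n = len(x)
    if len(y) != n:
        raise ValueError("series must have equal length")
    mx, my = sum(x) / n, sum(y) / n
    sx = math.sqrt(sum((v - mx) ** 2 for v in x) / n)
    sy = math.sqrt(sum((v - my) ** 2 for v in y) / n)
    ccf = {}
    for k in range(-max_lag, max_lag + 1):
        if k >= 0:
            pairs = zip(x[: n - k], y[k:])
        else:
            pairs = zip(x[-k:], y[: n + k])
        ccf[k] = sum((a - mx) * (b - my) for a, b in pairs) / (n * sx * sy)
    return ccf


def prewhiten_ar1(x):
    """Remove lag-1 persistence: e[t] = d[t] - r1 * d[t-1], with d = x - mean.

    Assumes an AR(1) model is adequate; returns a series one value shorter.
    """
    n = len(x)
    m = sum(x) / n
    d = [v - m for v in x]
    r1 = sum(d[t] * d[t - 1] for t in range(1, n)) / sum(v * v for v in d)
    return [d[t] - r1 * d[t - 1] for t in range(1, n)]


# Synthetic (made-up) data: x is persistent, y responds to x two steps later.
rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in range(300)]
x = [noise[0]]
for t in range(1, 300):
    x.append(0.6 * x[-1] + noise[t])
y = [0.0, 0.0] + [x[t - 2] + 0.5 * rng.gauss(0.0, 1.0) for t in range(2, 300)]

raw = cross_correlation(x, y, 5)
white = cross_correlation(prewhiten_ar1(x), prewhiten_ar1(y), 5)
for k in range(-5, 6):
    print(f"lag {k:+d}: raw {raw[k]:+.2f}  prewhitened {white[k]:+.2f}")
```

With persistent series like these, the raw cross-correlations would be expected to stay elevated at lags on either side of the true delay, while the prewhitened ones should concentrate more sharply near it.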
Assignments
Read: Notes_10.pdf
Answer: Run script geosa10.m and answer the questions listed in a10.pdf
What to Know
- Definitions: cross-covariance function, cross-correlation function, impulse response function, lagged correlation, causal, linear
- How autocorrelation can distort the pattern of cross-correlations and how prewhitening is used to clarify the pattern
- The distinction between the 'equal footing' and 'systems' approaches to lagged bivariate relationships
- The types of situations for which the impulse response function (irf) is an appropriate tool
- How to represent the causal system treated by the irf in a flow diagram
- How to apply geosa10.m to analyze the lagged cross-correlation structure of a pair of time series
Multiple linear regression
Multiple linear regression (MLR) is a method used to model the linear relationship between a dependent variable and one or more independent variables. The dependent variable is sometimes also called the predictand, and the independent variables the predictors. MLR is based on least squares: the model is fit such that the sum of squares of the differences between observed and predicted values is minimized. MLR is probably the most widely used method in dendroclimatology for developing models to reconstruct climate variables from tree-ring series. Typically, a climatic variable is defined as the predictand and tree-ring variables from one or more sites are defined as predictors. The model is fit to a period -- the calibration period -- for which climatic and tree-ring data overlap. In the process of fitting, or estimating, the model, statistics are computed that summarize the accuracy of the regression model for the calibration period. The performance of the model on data not used to fit the model is usually checked in some way by a process called validation. Finally, tree-ring data from before the calibration period are substituted into the prediction equation to get a reconstruction of the predictand. The reconstruction is a "prediction" in the sense that the regression model is applied to generate estimates of the predictand variable outside the period used to fit the model. The uncertainty in the reconstruction is summarized by confidence intervals, which can be computed in several alternative ways.
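The least-squares fit and a few of the calibration statistics can be sketched in plain Python as follows. This is an illustrative sketch, not the course script geosa11.m. The function names, the dictionary keys, and the small data set are invented for the example, and the normal equations are solved directly, which assumes the predictors are not collinear and the fit is not perfect.

```python
import math


def solve(a, b):
    """Solve a*x = b by Gaussian elimination with partial pivoting.

    Assumes the matrix is nonsingular (predictors not collinear).
    """
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def fit_mlr(predictors, y):
    """Fit y = b0 + b1*x1 + ... + bp*xp by least squares (normal equations).

    predictors: list of rows, one row of p predictor values per observation.
    Assumes an imperfect fit (residual sum of squares > 0).
    """
    X = [[1.0] + list(row) for row in predictors]  # prepend intercept column
    n, k = len(X), len(X[0])
    p = k - 1
    xtx = [[sum(X[i][a] * X[i][b] for i in range(n)) for b in range(k)]
           for a in range(k)]
    xty = [sum(X[i][a] * y[i] for i in range(n)) for a in range(k)]
    beta = solve(xtx, xty)
    fitted = [sum(bj * xj for bj, xj in zip(beta, row)) for row in X]
    ybar = sum(y) / n
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))  # residual SS
    sst = sum((yi - ybar) ** 2 for yi in y)                 # total SS
    r2 = 1.0 - sse / sst
    return {
        "beta": beta,
        "fitted": fitted,
        "r2": r2,
        "adj_r2": 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1),
        "se": math.sqrt(sse / (n - p - 1)),          # standard error of estimate
        "f": ((sst - sse) / p) / (sse / (n - p - 1)),  # overall F
    }


# Made-up calibration data: two tree-ring indices and a precipitation total.
rings = [[0.8, 1.1], [1.2, 0.9], [0.6, 0.7], [1.4, 1.3],
         [1.0, 1.0], [0.9, 1.2], [1.3, 0.8], [0.7, 0.9]]
precip = [310.0, 350.0, 240.0, 450.0, 330.0, 345.0, 355.0, 270.0]
model = fit_mlr(rings, precip)
print("coefficients:", [round(b, 2) for b in model["beta"]])
print("R-squared:", round(model["r2"], 3), "adjusted:", round(model["adj_r2"], 3))
```

A reconstruction would then apply the fitted coefficients to predictor values from before the calibration period.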
Assignments
Read: Notes_11.pdf
Answer: Run script geosa11.m (Part 1) and answer the questions listed in a11.pdf
What to Know
- The equation for the MLR model
- Assumptions for the MLR model
- Definitions of MLR statistics: coefficient of determination, sums-of-squares terms, overall-F for the regression equation, standard error of the estimate, adjusted R-squared, pool of potential predictors
- The steps in an analysis of residuals
- How to apply geosa11.m (part 1) to fit an MLR model to predict one variable from a set of several predictor variables
Validating the regression model
Regression R-squared, even if adjusted for loss of degrees of freedom due to the number of predictors in the model, can give a misleading, overly optimistic view of accuracy of prediction when the model is applied outside the calibration period. Application outside the calibration period is the rule rather than the exception in dendroclimatology. The calibration-period statistics are typically biased because the model is "tuned" for maximum agreement in the calibration period. Sometimes too large a pool of potential predictors is used in automated procedures to select final predictors. Another possible problem is that the calibration period itself may be anomalous in terms of the relationships between the variables: modeled relationships may hold up for some periods of time but not for others. It is advisable therefore to "validate" the regression model by testing the model on data not used to fit the model. Several approaches to validation are available. Among these are cross-validation and split-sample validation. In cross-validation, a series of regression models is fit, each time deleting a different observation from the calibration set and using the model to predict the predictand for the deleted observation. The merged series of predictions for deleted observations is then checked for accuracy against the observed data. In split-sample validation, the model is fit to some portion of the data (say, the second half), and accuracy is measured on the predictions for the other half of the data. The calibration and validation periods are then exchanged and the process repeated. In any regression problem it is also important to keep in mind that modeled relationships may not be valid for periods when the predictors are outside their ranges for the calibration period: the multivariate distribution of the predictors for some observations outside the calibration period may have no analog in the calibration period.
The distinction of predictions as extrapolations versus interpolations is useful in flagging such occurrences.
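The two validation schemes described above can be sketched in plain Python. This is an illustrative sketch, not the course script geosa11.m. It uses a single predictor to stay short, and the function names, result labels, and data are invented for the example.

```python
import math


def fit_line(x, y):
    """Least-squares intercept and slope for a single predictor."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    slope = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    return ybar - slope * xbar, slope


def leave_one_out(x, y):
    """Cross-validation residuals: each observation is predicted by a model
    fit to all the other observations."""
    errors = []
    for i in range(len(x)):
        a, b = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        errors.append(y[i] - (a + b * x[i]))
    return errors


def split_sample(x, y):
    """Calibrate on one half, validate on the other, then exchange the halves.

    Returns the validation RMSE for each of the two arrangements.
    """
    h = len(x) // 2
    early, late = slice(0, h), slice(h, None)
    out = {}
    for label, cal, val in (("late->early", late, early),
                            ("early->late", early, late)):
        a, b = fit_line(x[cal], y[cal])
        err = [yi - (a + b * xi) for xi, yi in zip(x[val], y[val])]
        out[label] = math.sqrt(sum(e * e for e in err) / len(err))
    return out


# Made-up data: one tree-ring index (predictor) and precipitation (predictand).
ring = [0.6, 1.1, 0.9, 1.4, 0.7, 1.2, 1.0, 0.8, 1.3, 0.5]
precip = [250.0, 340.0, 305.0, 440.0, 270.0, 365.0, 330.0, 285.0, 390.0, 230.0]

a, b = fit_line(ring, precip)
sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(ring, precip))
press = sum(e * e for e in leave_one_out(ring, precip))  # PRESS statistic
n = len(ring)
print("calibration RMSE:     ", round(math.sqrt(sse / n), 2))
print("cross-validation RMSE:", round(math.sqrt(press / n), 2))
print("split-sample RMSE:    ", split_sample(ring, precip))
```

For a least-squares fit, PRESS is never smaller than the calibration residual sum of squares, which is one way of seeing why calibration statistics tend to look too optimistic.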
Assignments
Read: Notes_12.pdf
Answer: Run script geosa11.m (Part 2) and answer the questions listed in a12.pdf
What to Know
- Definitions: validation, cross-validation, split-sample validation, mean square error (MSE), root-mean-square error (RMSE), standard error of prediction, PRESS statistic, "hat" matrix, extrapolation vs interpolation
- Advantages of cross-validation over alternative validation methods
- How to apply geosa11.m (part 2) for cross-validated MLR modeling of the relationship between a predictand and predictors, including generation of a reconstruction and confidence bands

Downloading Files -- tsfiles.zip

The Matlab class scripts and user-written functions are zipped in a file called "tsfiles.zip", which students should download from D2L. To get the files, first create an empty folder on your computer. This is where you will store all functions, scripts and data used in the course. Unzip the file there. When you run Matlab, be sure that this folder is your current Matlab working directory. I no longer put the current year's tsfiles.zip on the course web site, but a version called tsfiles_Stale.zip is available there for anyone who wants the functions, etc., from the previous offering of the course.

PowerPoint lecture outlines & miscellaneous files. The downloadable file other_Stale.zip has miscellaneous files used in lectures from the previous offering of the course. Included are Matlab demo scripts, sample data files, user-written functions used by the demo scripts, and PowerPoint presentations, as PDFs (lect1a.pdf, lect1b.pdf, etc.), used in on-campus lectures. Students taking the course this semester should not use other_Stale.zip, but should instead get the file "other.zip" from the D2L contents. I update other.zip over the semester, and add the presentation for the current lecture within a couple of days after that lecture is given. The file other.zip for this semester does not exist until after the first lecture, and is then augmented after each lecture. At the end of the semester I revise the online-available other_Stale.zip.

Tailoring the Matlab Scripts

To run the Matlab scripts for the assignments, you must have your data, the class scripts, and the user-written Matlab functions called by the scripts in a single directory on your computer. The name of this directory is unimportant. Under Windows, it might be something like "C:\geos585a\". The functions and scripts provided for the course should not require any tailoring, but some changes can be made for convenience. For example, scripts and functions will typically prompt you for the name of your input data file and present "Spring21" as the default. That is because I've stored the sample data in Spring21.mat. If you want to avoid having to type over "Spring21" with the name of your own data file each time you run the script, edit the Matlab script with the Matlab editor/debugger to change one line. In the editor, search for the string "Spring21" and replace it with the name of your .mat storage file (e.g., "Smith2021"), then be sure to re-save the edited script.

Notes and Assignments

Notes and assignments are available through D2L to those taking the course. I revise the notes and assignments during the semester, and upload files to D2L at least two weeks before the topic is covered in class. The zipped file "other.zip" contains PowerPoint presentations (converted to PDF) and miscellaneous demo files used in class lectures. "other.zip" is built up cumulatively through the semester, and is updated after each class. The presentations include feedback on students' submitted assignments, and may be helpful to the correspondence students, who do not have access to the twice-a-week lectures.

I am happy to share my notes, and anyone not taking the course is welcome to download them and modify them for their own purposes. No attribution or acknowledgement of source is requested. Enjoy! Click on the zip file, which contains PDFs of the notes for each lecture from the previous semester I taught the course. These are NOT the notes for the current semester. Notes for the current semester are available to registered students only, must be accessed through D2L, and are not finalized until the end of the semester.
