GenStat® for Windows
7th Edition
Introduction
GenStat® for Windows™ (7th Edition) Introduction, by Roger Payne, Darren Murray, Simon Harding, David Baird, Duncan Soutar & Peter Lane.

GenStat Release 7.1 was developed by VSN International Ltd, in collaboration with practising statisticians at Rothamsted and other organisations in Britain, Australia and New Zealand.

Main authors of Release 7.1: R.W. Payne (Chief Science and Technology Officer, VSN), D.B. Baird, M. Cherry, A.R. Gilmour, S.A. Harding, A.F. Kane, P.W. Lane, D.A. Murray, D.M. Soutar, R. Thompson, A.D. Todd, G. Tunnicliffe Wilson, R. Webster, S.J. Welham.

Other contributors: A.E. Ainsley, N.G. Alvey, C.F. Banfield, R.I. Baxter, K.E. Bicknell, B.R. Cullis, P.G.N. Digby, A.N. Donev, M.F. Franklin, J.C. Gower, T.J. Hastie, S.K. Haywood, A. Kobilinsky, W.J. Krzanowski, P.K. Leech, J.H. Maindonald, G.W. Morgan, J.A. Nelder, H.D. Patterson, D.L. Robinson, G.J.S. Ross, H.R. Simpson, R.J. Tibshirani, L.G. Underhill, P.J. Verrier, R.W.M. Wedderburn, R.P. White and G.N. Wilkinson.

Published by VSN International, Wilkinson House, Jordan Hill Road, Oxford, UK.
E-mail: [email protected]
Website: http://www.vsn-intl.com/

GenStat is a registered trade mark in certain jurisdictions of the Lawes Agricultural Trust, Rothamsted. All rights reserved.

ISBN 1-904375-08-1
© 2003 Lawes Agricultural Trust (Rothamsted Experimental Station)
Contents

1 Getting started
  1.1 What is GenStat?
  1.2 Using menus
  1.3 Giving commands
  1.4 Working with programs
  1.5 The Windows interface
  1.6 Exercises

2 Data input, calculations and display
  2.1 GenStat data structures
  2.2 Data input
  2.3 Reading data from ASCII files
  2.4 Displaying data
  2.5 Converting data structures
  2.6 Saving data to files
  2.7 Calculations
  2.8 The GenStat graphics wizard
  2.9 Commands for data input, calculations and display
  2.10 Exercises

3 Basic statistics
  3.1 Comparing two samples
  3.2 Summarizing categorical data
  3.3 Summarizing data by groups
  3.4 Association between categorical variables
  3.5 Transferring output to other applications
  3.6 Commands for basic statistics
  3.7 Exercises

4 GenStat spreadsheet
  4.1 Entering data into a spreadsheet
  4.2 Data verification
  4.3 Inserting and deleting rows or columns
  4.4 Defining subsets of data values
  4.5 Sorting data
  4.6 Data manipulation
  4.7 Bookmarking and comments
  4.8 Dynamic Data Exchange (DDE)
  4.9 Reading and writing data to databases
  4.10 Other facilities
  4.11 Commands
  4.12 Exercises

5 Linear regression
  5.1 Fitting a straight line
  5.2 Multiple linear regression
  5.3 Stepwise and all subsets regression
  5.4 Regression with grouped data
  5.5 Fitting curves and polynomials
  5.6 Generalized linear models
  5.7 Regression commands
  5.8 Other facilities
  5.9 Exercises

6 Analysis of variance
  6.1 One-way analysis of variance
  6.2 Two-way analysis of variance
  6.3 Randomized-block designs
  6.4 Fitting contrasts
  6.5 Designing an experiment
  6.6 Syntax of model formulae
  6.7 Unbalanced designs
  6.8 Split-plot designs
  6.9 Commands for analysis of variance
  6.10 Other facilities
  6.11 Exercises

7 REML analysis of mixed models
  7.1 Linear mixed models: split-plot design
  7.2 Linear mixed models: a non-orthogonal design
  7.3 Spatial analysis
  7.4 Repeated measurements
  7.5 Commands for REML analysis
  7.6 Other facilities
  7.7 Exercises

8 Multivariate analysis
  8.1 Principal components analysis
  8.2 Canonical variates analysis
  8.3 Multidimensional scaling
  8.4 Hierarchical cluster analysis
  8.5 Non-hierarchical cluster analysis
  8.6 Multivariate analysis of variance
  8.7 Classification trees
  8.8 Other facilities
  8.9 Exercises

9 Time series
  9.1 Exploration of time series
  9.2 ARIMA model fitting
  9.3 Time series commands
  9.4 Exercises

10 More about commands and syntax
  10.1 Working with commands
  10.2 Repeating a sequence of commands
  10.3 Syntax of GenStat commands
  10.4 Abbreviation rules
  10.5 Repeating a statement
  10.6 Making lists more compact
  10.7 Suffixed identifiers and pointers
  10.8 Unnamed data structures
  10.9 Exercises

11 Other statistical methods
  11.1 Six sigma
  11.2 Survey data
  11.3 Geostatistics
  11.4 Survival analysis
  11.5 Repeated measurements
  11.6 Multiple experiments

Index
1  Getting started

1.1  What is GenStat?
GenStat is a comprehensive statistical system that allows you to summarize, display and analyse data. The use of the computer for data analysis can save a great deal of time and trouble, but telling a computer what to do can be a troublesome business in itself. General-purpose computing languages, such as Fortran or C++, are designed to deal with the details of arithmetic and communication between a person and a computer; but quite ordinary methods of analysis need long programs. Specialist statistical packages are designed to provide an easy-to-use environment, where only a few instructions or selections from a menu are needed to do a standard analysis; but for something different from the standard, packages are difficult or even impossible to change. The Windows™ implementation of GenStat gives you the best of both worlds: the flexibility of a programming language with the simplicity of operation of a menu-driven package. It provides this through a standard Windows™ interface, with multiple windows and menus for standard analyses. The menus generate commands automatically to carry out the actions you choose, using GenStat's high-level statistical programming language. The command language is also available for you to construct your own analyses simply and concisely, when you want something new or non-standard. Here are some of the things that GenStat can do:

• Manage data, entered by GenStat's own spreadsheet or imported from existing computer files;
• Illustrate data with graphics such as histograms, boxplots, scatter plots, line graphs, trellis plots, contour and 3-dimensional surface plots;
• Summarize and compare data with tabular reports, fitted distributions, and standard tests, such as t-tests, χ²-tests and various nonparametric tests;
• Transform data using a general calculation facility with a wide range of mathematical and statistical functions;
• Model relationships between variables by linear or nonlinear regression, generalized linear models, generalized additive models, generalized linear mixed models or hierarchical generalized linear models;
• Design and analyse experiments, from one-way analysis of variance to complex multi-stratum designs, using a balanced-ANOVA or a REML approach (including the modelling of correlation structures);
• Identify patterns in data by means of multivariate techniques such as canonical variates analysis, principal components analysis, principal coordinates analysis, correspondence analysis, partial least squares, classification trees and cluster analysis;
• Analyse results from stratified or from unstructured surveys;
• Plot control charts, print Pareto tables and calculate capability statistics;
• Analyse time series, using Box-Jenkins models or spectral analysis;
• Analyse repeated measurements, by analysis of variance, or using antedependence structure, or by modelling the correlation over time;
• Analyse spatial patterns, using kriging or spatial point processes.
These techniques are useful in agriculture, ecology, genetics, medical research, and other areas of biology, as well as in industrial research and quality control, and economic and social surveys; in fact in any field of research, business, government or education where statistics are used.

The version of GenStat described here is the Seventh Edition for PCs under Windows 95, 98, NT, 2000 and XP. The minimum recommended configuration under Windows XP is a Pentium PC with 32Mb RAM. The menus are based on an underlying command language, GenStat Release 7.1, which is available for you to use for non-standard analyses. In this book, we introduce the command language only briefly, at the end of each chapter and in the final chapter. However, it is described in full in a separate book, The Guide to GenStat: Part 1 Syntax and Data Management. This language is common to all implementations of GenStat, including those on workstations and mainframes. The second part of the Guide (Part 2 Statistics) gives a comprehensive account of the statistical content of GenStat, reviewing the underlying methodology, explaining the output, and describing the relevant GenStat commands. Alternatively, the GenStat Reference Manual gives an individual description of each command. These books are all accessible, in PDF format, from the Help menu (Figure 1.4).

This Introduction explains how to use GenStat for some of the commoner statistical tasks. There are exercises at the end of each chapter to encourage you to practise using GenStat. The data-files referred to in the exercises are supplied with GenStat, in a directory with a name like Gen7ed\Introduction, together with the data-files used in the illustrative examples.

Note (in this section and later in this book): Microsoft, Windows, Windows NT, Windows 2000, Windows XP, Excel, Explorer, Word and Access are trademarks or registered trademarks of Microsoft Corporation.
1.2  Using menus
You start GenStat within Windows on a PC by selecting the GenStat icon, shown in Figure 1.1, from the Programs Menu, or double-clicking on the icon if it appears on the desktop. This runs two processes on the PC. The first, known as the GenStat Client, controls the Windows interface for GenStat. It collects information from you, and sends it to the GenStat Server, which runs in the background and performs the calculations. Figure 1.2 shows the screen that appears under Windows XP once GenStat has started (Windows 95, 98, NT 4.0 and 2000 are similar). You can see the icon of the Client in the main part of the task bar, and the icon of the Server in the tray on the right-hand side.
Figure 1.2
Figure 1.2 shows that the Client provides a standard Windows interface, with a menu bar and tool bars at the top, and a status bar at the bottom. There are two tool bars. The lower tool bar is for the GenStat spreadsheet, which is described in more detail in Chapter 4. There may be several sub-windows. Initially GenStat displays a start-session menu, which allows you to open a new spreadsheet or a new text window, to access the GenStat Help system, or to open a file used in an earlier session. If you do not want to see this menu in future runs of GenStat, you can uncheck the box Show this menu on Startup. Otherwise, we are displaying a single window, called Output. This initially contains information about the version of GenStat. Later it will contain output from the operations that we perform. The title bar of the window is highlighted to show that it is the current window, and the cursor can be seen blinking inside it. The left-hand section of the status bar at the bottom also shows which window is current; the other sections show the status of the GenStat Server, the position of the cursor in the current window, the working directory, and whether the current window is in insert or overwrite mode.

We can click Window on the menu bar to see a list of the currently available windows (Figure 1.3). For example, the Input Log records the GenStat commands that the Client sends to the Server to carry out your requests.

You can change the appearance of the GenStat window by the usual Windows methods. For example, you can expand it to fill the whole screen by clicking on the box-shaped icon in the top right-hand corner. Each sub-window can then be re-sized to take advantage of the extra space. For example, you could click on the Input Log, and then resize the Output window so that they share the available space. Many of the menus provided from the menu bar are standard for Windows applications. Section 1.5 describes these briefly, and the rest of this book introduces many of the other, GenStat-specific menus for carrying out statistical analysis and presentation.
To start with, though, it would be useful to become familiar with the on-line help provided by the Help menu: the right-most pull-down menu on the menu bar, shown in Figure 1.4. This provides access to several sources of information (including this book, in PDF format, by clicking on Introduction). If you click on Contents and Index, you will enter GenStat's main help system. Figure 1.5 gives the initial display,
with the Contents tab selected, showing that the system contains three "books": GenStat Windows Help, with information about GenStat for Windows and its menus; GenStat Language Reference, with details of the GenStat command language and the methods that it provides; and GenStat Graphics, with information about the sub-system that plots graphs for GenStat. You can browse the information using the standard Windows conventions, selecting a book and clicking the Open button to display the sub-sections of information, continuing until you reach an individual page; the Display button then appears, allowing you to view the selected piece of information.
Figure 1.5
Figure 1.6
Alternatively, you can select the Index tab, and browse through an index of the available topics. In Figure 1.6, we have located the item for t-test. Clicking Display then produces a menu for you to choose which of the relevant topics to display. The individual help sections allow you to keep bookmarks, follow hot-links to other sections etc., in the usual way. You can close the Help system by clicking the button at the top right of the help screen, or by clicking an Exit or Cancel button.
There are also interactive tutorials, obtained by clicking the seventh line (Tutorials) of the Help menu, which enable you to learn about GenStat by viewing videos on the PC and running interactive exercises.

The first task when you start to use GenStat is usually to access your data. In later chapters we show several ways of entering data directly on the screen or importing it from various types of file; but here we shall use the easiest way, importing a set of data that was stored during a previous GenStat session. Clicking Data on the menu bar pulls down a menu with several options (Figure 1.7); selecting Load and then clicking on Data File brings up the Select Input File menu, which allows you to select a file containing some previously stored data. The menu starts in your working directory. Initially, when you first run GenStat, this will be the directory (or folder) that contains the GenStat executable program. You can move to the directory that contains the data for the examples and exercises in this course by clicking on the up-directory icon to the right of the Look in box, then clicking on the folder Introduction. You can make this your working directory by clicking on the Set as Working Directory button at the bottom left of the menu. In Figure 1.8, we have selected Genstat Spreadsheet File (*.gsh) from the drop-down list in the File name menu, and then clicked on the file Iron.gsh. You can then load the data by clicking on Open.
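Like all the menus, this one works by sending a command to the GenStat Server, and you can inspect the command it generated in the Input Log (introduced in Section 1.3). As a sketch of what to expect (the command name SPLOAD is our assumption from other GenStat documentation; check your own Input Log for the precise form and option settings your version generates), loading a spreadsheet file corresponds to something like:

SPLOAD 'Iron.gsh'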
GenStat loads the data from the file and displays a report in the Output window, shown below.

  Identifier    Values   Missing    Levels
      sample       136         0        12
        site       136         0         6

  Identifier   Minimum      Mean   Maximum    Values   Missing
          FE     200.6     246.6     308.2       136         0
      weight     11.36     12.42     13.00       136         0
There are four sets of information, or data structures, in this file. Two of these simply contain a series of numbers: the structure called FE stores the measured parts per million (ppm) of iron, and weight stores the weights of soil that were analysed. The values of the structures correspond to the 136 soil samples that were analysed in the whole study. The other two structures categorize these samples: each value of site records the code number of the laboratory that carried out the analysis, and sample contains the number (from 1 to 12) of the originating soil sample that was given to the laboratory to analyse. Notice that the summaries of the categorical structures are different to those of the purely numerical columns.

Depending on how your copy of GenStat has been set up, the Output window may also contain a listing of the commands that GenStat has carried out in order to load the data. If this occurs, you may want to use the Options menu, shown below in Figure 1.30, to stop the commands being echoed.

The Data menu also allows you to obtain a menu that lists the data currently stored inside GenStat; clicking on Display brings up the Data Display menu (Figure 1.9). The left-hand side of the menu allows you to select the types of data structure that you want to list on the right-hand side: in Figure 1.9, we have highlighted All Data. You can see that the structures that store lists of numbers in GenStat are known as variates, whereas the categorical structures are called factors. GenStat provides a range of data structures that are convenient for different types of data, but these two are the most common. Clicking the Options button reveals two further rows of buttons that allow you, for example, to delete, rename or display (i.e. print in the Output window) structures selected from the right-hand list. You can also drag and drop structures from the list onto most of GenStat's other menus. Clicking on Close removes the menu.

Clicking Stats on the menu bar allows you to select any of GenStat's statistical menus. In Figure 1.10, we have selected Summarize Contents of Variates, producing the Summary of Variates menu shown in Figure 1.11. This allows you to display summary statistics describing the contents of a variate, and also to produce some useful graphs. First we have used the → button to put the name of the variate FE into the Variates box. (Alternatively, you can double-click on the required variate or variates, but the button allows you to transfer several variates selected using the standard Windows techniques: clicking the mouse with the Ctrl key depressed to add or remove a variate from the selection, or clicking the mouse with the shift key depressed to add a contiguous list.) We have then moved the cursor to the By Groups box by pressing the Tab key or clicking on that box, and selected the factor site from the new entries shown in Available Data. We have also checked the Boxplot box.
When you click on OK, GenStat prints summary statistics for each site, in turn, as shown below. It also opens a Graphics window and draws six boxplots in it, one for each of the laboratories (or sites). This process may take a little more time than some of the operations done earlier, because GenStat has to start its Graphics server running first. While this is going on, the Status bar will display the message Processing Summary of Variates in place of Server Ready, to let you know that the computations are taking place.

  Summary statistics for FE: site 1

    Number of observations   = 24
    Number of missing values = 0
    Mean                     = 289.6
    Median                   = 289.1
    Minimum                  = 269.5
    Maximum                  = 308.2
    Lower quartile           = 282.1
    Upper quartile           = 295.6

  Summary statistics for FE: site 2

    Number of observations   = 22
    Number of missing values = 0
    Mean                     = 274.2
    Median                   = 273.2
    Minimum                  = 262.6
    Maximum                  = 283.1
    Lower quartile           = 270.0
    Upper quartile           = 280.1

  Summary statistics for FE: site 3

    Number of observations   = 24
    Number of missing values = 0
    Mean                     = 216.3
    Median                   = 212.8
    Minimum                  = 200.6
    Maximum                  = 252.7
    Lower quartile           = 208.7
    Upper quartile           = 218.4

  Summary statistics for FE: site 4

    Number of observations   = 18
    Number of missing values = 0
    Mean                     = 239.5
    Median                   = 238.1
    Minimum                  = 232.5
    Maximum                  = 255.7
    Lower quartile           = 236.5
    Upper quartile           = 239.4

  Summary statistics for FE: site 5

    Number of observations   = 24
    Number of missing values = 0
    Mean                     = 234.9
    Median                   = 234.6
    Minimum                  = 222.7
    Maximum                  = 251.6
    Lower quartile           = 230.2
    Upper quartile           = 237.1

  Summary statistics for FE: site 6

    Number of observations   = 24
    Number of missing values = 0
    Mean                     = 225.5
    Median                   = 224.2
    Minimum                  = 215.3
    Maximum                  = 238.6
    Lower quartile           = 221.8
    Upper quartile           = 229.1
The output is labelled using standard statistical terminology but, if you are unsure about any of the words or phrases, you can use GenStat's context-sensitive help. Put the cursor into the word of interest, or into the first word of the phrase of interest, and press the F1 key. For example, if you put the cursor into the word "quartile" and press F1, GenStat produces a help screen containing the information: "The lower quartile is the value l such that 25% of a sample are less than l. Similarly, the upper quartile is the value u such that 25% of a sample are greater than u."

Sometimes there is more than one potentially relevant topic. GenStat then provides a menu so that you can select the one that seems most appropriate. The menu for the word "median" is shown in Figure 1.12. Selecting Median (explanation from glossary) produces the definition: "Median is the value that divides a sample into two equally sized groups."

Figure 1.13 shows the graph, displayed in a separate window by GenStat's Graphics Viewer. (If you look on the task bar you will see that this has its own icon there.) You can zoom the display by using the slider on the toolbar, or by holding the left mouse button and moving the mouse up and down. If your mouse has a centre button, you can move the display within the window by moving the mouse with that button held down. Alternatively, you can use the scroll bars at the bottom and on the right-hand side of the screen.

Each boxplot consists of a central box spanning the inter-quartile range of the data from a laboratory (so that 50% of the observations lie inside the box). The horizontal bar marks the median, and the whiskers extend out to the largest and smallest observation made at that laboratory. It is clear from the display that Laboratories 1 and 2 are producing consistently higher results than the rest, and Laboratory 3's results are generally lower. To return to the main GenStat screen, you can click the GenStat icon on the task bar. Then click Cancel in the Summary of Variates menu to remove this menu, if you no longer need it.
1.3  Giving commands
As well as using menus, or instead if you prefer, you can tell GenStat what to do by giving it commands. In fact, the menus themselves work by constructing commands automatically and sending them to the GenStat Server. You can see these commands in the Input Log; by default you cannot edit within this window (it is "read only"), but this can be changed using the Options menu (see Figure 1.32). The contents of this window after drawing the boxplot are in Figure 1.14.

There are three commands here: to print the summary statistics (DESCRIBE), to produce a title for the boxplot (PRINT), and to draw the boxplot (BOXPLOT). Notice that GenStat has used a private, temporary structure _tmptext to contain the title. Structures like this always have names that begin with the underscore character _, to distinguish them from your own data structures. Previous commands to set the working directory and to load data into GenStat from the file have scrolled up above this very narrow window.

All GenStat commands have a common form of syntax: in other words, there are some general rules that apply to all the commands that you give. We introduce the basic syntax here, to help you to start using commands if you want. Further details are given in Chapter 10. There are two types of commands: directives are the basic commands of the GenStat language, while procedures are extensions of the language, using programs written in the GenStat language itself. However, both obey identical rules, so you do not need to know which you are using. GenStat can use different colours to identify the various components of the command. Here, for example, the names of the commands are in blue. This syntax highlighting of the current window can be controlled by checking or unchecking the Syntax Highlighting line of the Options menu (see Figure 1.31).

We introduce the rules in the context of the directive called PRINT. This displays data, allowing you to inspect individual values. Note that this command does not send data or results to a printer: to do that you can select the Print option from the File menu in the menu bar. There is also a menu option to display data values: just select the structures in the Data Display menu (as in Figure 1.9) and click on Display.
Figure 1.15
Figure 1.16
To give commands directly, it is best to open a new window in which to construct them. This is done by clicking on File in the menu bar and selecting New, as shown in Figure 1.15. This generates the menu shown in Figure 1.16, allowing you to choose what type of new window you want. Selecting Text Window and clicking on OK gives you a new, empty, window which will become the current window. You can type, for example, the simple command

PRINT 1
to display a single set of data: the number 1. When constructing commands in a window, you can use the usual keys for typing and deleting characters, and moving about the window. You can also switch between Insert and Overwrite mode by pressing the Insert key, and the Status bar will display, with Ins or Ovr, which mode you are in at any time. This is a trivial exercise, of course, but it serves to show how commands work. To get GenStat to execute this command, leave the cursor at the end of the line (that is, just after the 1) and select the Run menu from the menu bar. Select Submit Line, as shown in Figure 1.17, and the command will be executed. The resulting display is put in the Output window, as shown in Figure 1.18.
This is clearly not a very useful operation, because you already know what the set of data is, and because it consists only of a single number; however, this will quickly be generalized. In the meantime, you can see that the directive name, PRINT, is like a command verb which instructs GenStat to do something, and the number 1 is like the object of the command. All directives, and procedures, work like this, though not all directive names are actually verbs in the English language. The object is called the primary parameter of the command. The PRINT directive, like all others, works with sets of data. You can make it work with several sets of data at once by giving a list; for example, the command

PRINT 1,2
has two sets, each containing one number, as shown in Figure 1.19.

In GenStat, lists are always constructed using commas. You must not use just spaces; for example, the command

PRINT 1 2
would be faulted, because the space may be an accident, and you may have meant

PRINT 12
If you make a mistake like this, GenStat puts a brief explanatory message in a separate window, called the Fault Log, and copies it to the Output window. The message is shown in Figure 1.20. You can click on the Output button to go to the fault in the Output window, or on the Fault Log button to open the Fault Log. You can, though, use spaces as well as commas if you want, so the following command is acceptable:

PRINT 1 , 2
You will have noticed that PRINT commands lay out the data in a tabular form, choosing an appropriate number of decimal places for numbers. By default, a single number is displayed with four significant digits. Also, sets of data with compatible shape are laid out in parallel: that is, side-by-side. If you don't want this default display, there is a range of options to modify it. For example, the command

PRINT [SERIAL=yes] 1,2
displays the two numbers in serial rather than in parallel: that is, one number by itself, and then the other, as shown in Figure 1.21. Most GenStat directives and procedures have options like this to control the way in which the operations are done. They must always be given in square brackets following the directive or procedure name and preceding the parameters, if any. Options have the form name=setting, where here the name is SERIAL and the setting is yes. Settings can be words, as here, or numbers. If you set several options, you must separate them with a semi-colon, as in

PRINT [SERIAL=yes; INDENTATION=10] 1,2
This command would indent the output by 10 characters, so that if you arrange to send the display to a printer, you could rely on having a clear margin on the paper, perhaps for binding. The CHANNEL option of PRINT was used in the Input Log, to put the output into the GenStat text structure _tmptext. Most GenStat directives and procedures also have auxiliary parameters which control the way the command works. For example, the command

PRINT 1,2; DECIMALS=0,1
gives the output shown in Figure 1.22. The effect of the DECIMALS parameter is to specify how many decimal places to display for each set of data. The essential difference between an option and an auxiliary parameter is that an option specifies a modification once and for all for the command, whereas an auxiliary parameter specifies a modification that may be different for each of the sets of data in turn. The setting of the DECIMALS parameter above, 0,1, is matched item by item with the setting of the primary parameter, 1,2. This distinction applies to all GenStat commands. The setting of an auxiliary parameter is otherwise like that of an option, with the form name=setting, and the semi-colon separator is needed between successive parameters. The primary parameter itself has a name, except when there are no auxiliary parameters. So you could actually give the command:

PRINT STRUCTURE=1,2; DECIMALS=0,1
However, if you specify the primary parameter first in a command, its name can always be omitted. You can abbreviate directive and procedure names to the first four characters; names of options and parameters can also be abbreviated to four characters, and sometimes further. The full abbreviation rules are described in Section 10.4.

So far, we have used the very simplest sets of data, consisting of a single number each. Most practical work is done with series of numbers, like those in Section 1.2. For example, we can display the values in the variate called FE by simply giving the command:

PRINT FE
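Options and auxiliary parameters can, of course, appear together in a single command. Here is a sketch using the variates loaded in Section 1.2 (the decimal-place settings are our own choice, purely for illustration):

PRINT [SERIAL=yes] FE,weight; DECIMALS=1,2

The SERIAL option applies once to the whole command, while the DECIMALS setting is matched item by item with the primary parameter, so FE would be displayed with one decimal place and weight with two, one variate after the other.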
We gave the name FE to this set of data before saving it in the file Iron.gsh: such a name is referred to as an identifier in GenStat. You need to give identifiers
to data even when using menus, so you should be aware of what GenStat allows. You can use names consisting of up to 32 letters or digits, but they must start with a letter. Case is significant, so the identifier FE is different to fe. We have used capital letters for this identifier but lower case for the others, like sample; however, you may find it easier to stick to all lower-case or all upper-case for your identifiers, at least while you get started with the system. The PRINT command works on all types of GenStat data structures, so you can probably guess that the following command would display all the data that was loaded in Section 1.2. PRINT sample,site,FE,weight
Part of the display is shown in Figure 1.23.
Figure 1.23

Values can be assigned to data structures in many ways; for example, by loading data from a file, by saving the results of an analysis, or by doing calculations (Chapter 2). In GenStat, calculations are specified in expressions, and are most often carried out with the CALCULATE directive. As a simple example, we can calculate the amount of iron in each tested quantity of soil by multiplying the weight by the concentration of iron: CALCULATE iron = FE * weight
The weights of soil were in grams, so the resulting weights of iron will be in micrograms. This CALCULATE command is more powerful than it looks. In fact, 136 separate calculations are done here: GenStat knows to do this because FE and weight have been defined to be variates with 136 values. So each of the 136 values stored in FE is multiplied by the corresponding value in weight and the results stored successively in a new variate called iron.
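The same element-by-element rule applies to any calculation on variates of matching length. As a further sketch (the identifier logfe and the use of the LOG10 function are our own illustration here, not part of the Iron.gsh example):

```
CALCULATE logfe = LOG10(FE)
PRINT FE,logfe; DECIMALS=2
```

Each of the 136 values of FE is transformed in turn, and the DECIMALS parameter then controls the display of each variate separately, just as in the earlier PRINT examples.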
1.4 Working with programs
Instead of making GenStat execute one command at a time, and interacting with the results, you can arrange to run several GenStat commands at once. Open an edit window, as before, but construct in it the whole series of commands that you want to execute. If you like, you can store the commands for future reference in a file, by clicking on File in the menu bar, and selecting Save As. You can tell GenStat to execute a selected set of commands by highlighting them (by clicking at one corner of a block of commands and dragging with the mouse to the opposite corner). If you then choose Submit Selection from the Run menu, the commands are sent to the server, and the results are displayed in the Output window. Alternatively, you can run all the commands in the window by choosing Submit Window from the Run menu.

There are many example files of GenStat commands, which are stored in a folder called Examples alongside the Intro folder. These can be loaded automatically into an input window, and then (if you want) executed, by clicking Help on the menu bar and selecting Example Programs (see Figure 1.4). This generates the menu shown in Figure 1.24, which enables you to select the type of example that you want. Here we have selected Manipulation of data, but you can select more to see further example types, in particular for procedures in the GenStat Procedure Library. Clicking on OK produces the menu in Figure 1.25 (which varies according to the example

Figure 1.24
Figure 1.25
type), allowing you to select the example required. If we select SORT directive and click on OK, GenStat displays a final menu (Figure 1.26) for you to choose whether or not to run the example. Whatever you decide, the example program will be loaded into a new input window (Figure 1.27) and, if you have decided to run it, the resulting output will be in the Output window.

Figure 1.26
Figure 1.27
1.5 The Windows interface
The previous sections have introduced many of the features of GenStat's interface, and others will be introduced in later chapters. But we introduce here some general features that are common to many Windows-based programs, which can be very useful when using GenStat in practice, and briefly list the range of statistical techniques that GenStat provides by menus.

The Edit menu, shown in Figure 1.28, allows you to exchange information with other Windows-based programs. For example, the Copy option allows you to copy results from the Output window into a word-processing package. You can select the results to be copied by highlighting them using the mouse, or the Shift key together with the arrow keys, in the Output window. As usual with programs within Windows, the copy is sent to the Clipboard, which is a utility provided by the Windows system to help communication between applications. GenStat and many other applications provide a Paste option to insert the contents of the Clipboard into the current window, at the position of the cursor. In this way, information can be moved within one GenStat window, sent from one GenStat window to another, exported from GenStat to another program, or imported from another program into GenStat.

Figure 1.28

The Cut option works like Copy, except that a selection must be made, and the information is removed from the current window as well as being copied to the Clipboard. If no selection has been made, the Cut option is displayed in grey rather than black, indicating that it is not yet available; similarly, Paste is not available if there is no material in the Clipboard. The Delete option removes information without putting it in the Clipboard. Any of these operations (except Copy) can be undone by choosing the Undo option.

The Search menu, shown in Figure 1.29, allows you to search the current window for a given string, with options to specify case matching and occurrence as a word rather than

Figure 1.29
part of a word. Repeated searches can be done with the Find Next option, and strings can be replaced with the Replace option. The Go To option allows you to move to a specific line in the current window, if you know its line number (as displayed on the Status bar). The Go Back option is relevant to the GenStat spreadsheet (Chapter 4), enabling you to undo a Go To operation. The Bookmark option allows you to maintain markers in a window, which can be useful if you are generating a long report in the Output window, for example.

The Window menu, shown in Figure 1.30, allows you to rearrange the sub-windows in the GenStat window, either cascading them so that they overlap except for the top left corners, or tiling them so that you can see the whole of each window. You can also bring any of the named windows to the top of the display, which is useful if any window has become buried by other windows so that you cannot click the mouse on an exposed part. The Next and Previous options allow you to cycle through all the windows (except graphics), including input windows and dialogue boxes, and the Windows option brings up a list of all windows to choose from, including any open dialogue boxes.

Figure 1.30

Many of these standard Windows options are provided also by buttons on the tool bar, to make it easier to execute them. GenStat provides "tool tips": if you leave the mouse pointer on one of the buttons, a small window will appear describing the purpose of the button concerned.

The Options menu, shown in Figure 1.31, controls various aspects of GenStat, and the environment in which it runs. For example, clicking on the Customize Toolbar line brings up a menu that enables you to choose which buttons appear on the toolbar, and how they are arranged. The Working Directory line provides another method of specifying the working directory, which is where all browse menus first start. The Spreadsheet Options line allows you to control the way in which the spreadsheet operates, and the Procedure Libraries line allows you to connect your own procedure libraries.

Figure 1.31
Clicking on the Interactive line allows you to treat the current window as an interactive input device, sending command lines direct to GenStat whenever you type Enter. Conversely, clicking on the Syntax Only line requests the client to copy the commands generated from the menus to the Input Log, but not to send them to the server. The Syntax Highlighting line controls the use of different colours to highlight the various elements of the command in any text window (other than the Output window). The Save Options Now line allows you to carry all the option settings that you have changed using these menus through to your next session with GenStat, while the Save Layout line simply saves the way in which the windows are arranged.

The Options option brings up a tabbed menu with more detailed settings, as shown in Figure 1.32. For example, the Audit trail tab lets you specify what aspects of your work will be recorded in the Input Log. Here all the commands are being recorded in the Input Log, but they are not being echoed in the Output window, nor is GenStat sending out the special code to start different sections of output on new pages when they are printed.

Figure 1.32
1.6 Exercises
1(1) Start GenStat running. Experiment with re-sizing the windows; for example, make the GenStat window fill the whole screen and the Output window take up all but a narrow strip at the bottom of the GenStat window, displaying one or two lines of the Input Log in that strip.

1(2) Click on the Tutorials line in the Help menu on the menu bar, and explore some of the possibilities.

1(3) Select Load from the Data menu. Move from the directory (i.e. folder) containing the GenStat executable program to the Introduction directory and set this as the working directory. Load the file Sales.gsh and open the Data Display menu to see what data structures it contains. Use the Display button of the menu to print the contents of the structures to the Output window. Use the Summary of Variates menu to display the minimum, maximum and mean sales, and to plot a boxplot of the sales in each town. Print the skewness and kurtosis of the sales figures, and use the context-sensitive help to obtain some information about what these represent.

1(4) Open a new window and construct some commands to do a simple calculation. First set up a variate structure to store the radii of a series of circles, with the command

VARIATE [VALUES=1,2,3,5,7,11,13,17,19] radius
Then use the CALCULATE directive to evaluate the area of circles (πr²) with these radii. You can represent π by the GenStat expression C('pi')
Use the PRINT directive to display the radii and areas side by side. Finally, modify your program to do the same operation for all integer radii from 1 up to 100. To avoid typing all 100 numbers, look up progression in the Help index. If your program looks good but still gives a fault, you may find that you have tried to re-use the structure which is storing the areas of only nine circles. Use the Display option in the Data menu to delete this structure if necessary.
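One way the finished program for Exercise 1(4) might look, as a sketch rather than the only solution (the identifier area is our own choice, and the progression notation is the feature the exercise asks you to look up):

```
VARIATE [VALUES=1...100] radius
CALCULATE area = C('pi') * radius**2
PRINT radius,area
```

The progression 1...100 saves typing the 100 values by hand.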
2 Data input, calculations and display
The first step in any data analysis with GenStat (as with any computer software) is to import some data. You can access data from various types of data files on the computer: spreadsheet files (e.g. Microsoft Excel), database formats, simple text files (ASCII files), or files previously prepared by GenStat. You can also use the standard Windows cut-and-paste facilities to transfer data directly from other Windows applications via the clipboard. We will show how you can import data from Excel by reading from files and by using the clipboard. Before performing any statistical analysis it is good practice to explore your data graphically, to gain a greater understanding of the data. We will show how you can display your data visually using histograms and scatter plots. Finally, we will introduce the calculations menu, which provides a very powerful calculator that operates with any type of numerical data structure.
2.1 GenStat data structures
In Chapter 1 we introduced two different types of data structures and how these could be read into GenStat using an existing GenStat spreadsheet. When data are read into GenStat they are stored within a central data pool, and information on current data structures can be viewed using the Data Display menu. The first data structure we introduced was a variate, which stores a column of numerical values. The length (or number of values) of a variate is fixed, and two variates of different lengths cannot be used in a common calculation (unless you are calculating summary statistics from them). The second data structure was called a factor. A factor is a special data structure within GenStat for specifying an allocation of units into discrete groups. Each group can be represented with a label and/or a numerical value (level). The groups are also assigned ordinal values, numbered from 1 upwards, which indicate the order in which the levels or labels of the factor will be displayed. For example, the table below shows the three attributes of a factor that has four groups:

   Level   Label          Ordinal
     0.0   Control              1
     0.5   Half Rate            2
     1.0   Standard             3
     2.0   Double Rate          4
The levels and labels of a factor can be reordered, but the ordinal values are always numbered 1,2,3… and the order of these cannot be changed. For example,
in the table below, the labels and levels have been reordered by sorting the labels alphabetically.

   Ordinal   Level   Label
         1     0.0   Control
         2     2.0   Double Rate
         3     0.5   Half Rate
         4     1.0   Standard
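In the command language, a factor with the levels and labels shown above could be declared directly with the FACTOR directive. A hedged sketch (the identifier rate is our own invention, and no VALUES setting is given here):

```
FACTOR [LEVELS=!(0,0.5,1,2); \
        LABELS=!t('Control','Half Rate','Standard','Double Rate')] rate
```

The !(...) and !t(...) constructions are unnamed variate and text structures, and the backslash continues the command onto a second line.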
When data are imported from other file formats, GenStat will set the default order of the levels of any factor either numerically for levels, or alphabetically for labels. You can control the way factor labels and levels are imported using the spreadsheet options and the options on dialogue boxes when loading data. The spreadsheet facilities provide a number of menus to allow factor levels and labels to be manipulated (see Chapter 4). There are many other data structures available within GenStat, each with appropriate attributes. A single numerical value is stored within a scalar. A column of textual information is contained in a text. A two-dimensional array of data is contained in a matrix, and the two specialized forms of matrices (symmetric or diagonal) can also be used. Numerical results of cross-tabulations or analyses are stored in tables that are indexed by a number of classifying factors.
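Like variates and factors, these structures are declared with directives of the same names. A brief sketch, with identifiers of our own choosing (the exact option settings are assumptions to be checked against the reference manual):

```
SCALAR total
TEXT [VALUES=pea,cereal] croptype
MATRIX [ROWS=2; COLUMNS=3] m
```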
2.2 Data input
One important feature of GenStat is that it provides flexibility to make tasks easier. A good example of this is data input, where there are three different ways of importing data. The first, and most common, way is to use the File menu (this is a standard Windows approach). Opening a spreadsheet or database file through the File menu will load the data into a GenStat spreadsheet for viewing and/or editing before any updating of the data in GenStat's central data pool. The second approach is to open spreadsheet or database files using the Load option of the Data menu. Opening data this way will read the data direct into GenStat's central data pool without displaying the data in a spreadsheet. The final method is to use the Spread menu, which is similar to the File menu in that data are opened within a spreadsheet. The Spread menu is useful for creating blank spreadsheets (see Chapter 4), copying data from the clipboard and viewing data already contained within the central data pool.

To illustrate the facilities for data input within GenStat we will demonstrate how to open data from Microsoft Excel. When reading data from a foreign file, GenStat expects the data to be in a rectangular column format. In a spreadsheet, such as Microsoft Excel, the data need to be arranged in a group of columns forming a
rectangle where the columns are of the same length. If the rectangular area contains empty rows or columns then, by default, these will be removed when the data are opened in GenStat. You can specify column names for your data by entering a label for the name in the first row of the column within the rectangular block. A spreadsheet column name must start with a letter (A-Z, a-z or %) and can only contain letters, numbers or the symbols % and _. When data are read into GenStat, it checks whether each column name meets these conditions and modifies any names that include invalid characters. For example, if the first character of the column name is a number, then GenStat will create a new name by prefixing the label for the column with a %. When no column names are provided, GenStat will generate default column names using the notation C1, C2 etc. You can specify missing data values either by leaving the cells blank or by entering an asterisk (*).

When the data columns are read into GenStat, any numerical columns will be imported as variates and any column containing labels (excluding the column name) will be imported as a text data structure. Within a GenStat spreadsheet a text column is marked by a green 'T' next to its column name and the contents are right justified by default. A column of numbers or text can also be read into GenStat as a factor. You can specify a column to be a factor by appending an exclamation mark (!) onto the column name (e.g. crop!).

Figure 2.1

Figure 2.1 shows an example of a block of data within the GenStat Data worksheet of the Excel file Bacteria.xls, which has been arranged for input into GenStat. The data values are a set of counts from an experiment: the numbers of one particular type of bacteria found in small samples of soil growing two different types of crop. The second column contains categorical data and has had the symbol '!' appended to the column name to specify that the column is to be a factor.
We now look at the first method of importing data into GenStat: using the File menu. In this example we want to open the Excel file containing the data shown in Figure 2.1. To open the file we select the Open line in the File menu on the menu bar. This opens the Select Input file menu (Figure 2.2), in which we have selected Other Spreadsheet Files from the drop-down list entitled Files of type.

Figure 2.2
Figure 2.3
Figure 2.4
Selecting the file Bacteria.xls and clicking on Open, or double-clicking on the filename, gives the menu shown in Figure 2.3. This is the initial menu of a wizard for the input of data from an Excel file. It lists all the available worksheets and named ranges within the Excel file, with worksheet names prefixed by 'S:' and named ranges by 'R:'. In this example, we have selected the worksheet GenStat Data. Subsequent menus allow you to select ranges and columns, and set various other options controlling how the data are transferred to GenStat. In this case we
want to take all the data on the page, and will leave the other options with their default settings. (The subsequent menus will be shown later though; see Figures 2.7, 2.8 and 2.9.) So we click on OK to open the two columns of data into a GenStat spreadsheet, as shown in Figure 2.4. You can close the spreadsheet and transfer the values to GenStat's central data pool by selecting the Change data to GenStat and close sheet
item from the Update option on the Spread menu (Figure 2.5). Alternatively, you can just click on the Output window. Any data in the spreadsheet that have changed are automatically transferred to the central data pool when you switch to another GenStat window. After the central data pool in the GenStat server has been updated, the Output window displays a brief summary of the data that have been transferred, as shown below.

Figure 2.5

Data imported from Excel file: C:\Program Files\Gen7ed\Introduction\Bacteria.xls
on: 25-Apr-2003 16:17:39
taken from sheet "Genstat Data", cells A1:B11
 Identifier   Minimum      Mean   Maximum    Values   Missing
     counts     4.000     73.50     244.0        10         0

 Identifier    Levels    Values   Missing
       crop         2        10         0
We shall now close the GenStat spreadsheet, and input some data from the other Excel worksheet.
Data are not always stored in a single rectangular format within a spreadsheet, but may have multiple blocks of data entered on a single worksheet. Figure 2.6 shows an example of this in the worksheet Bacteria Counts from the file Bacteria.xls. In this worksheet there is a title in row 1 of column A, and two rectangular sets of data records. In this example we just want to open the second rectangle of data (counts2 and crop2) within a spreadsheet.

Figure 2.6

This time we shall use the second method of importing data, with the Data menu. This uses the same menus as those with the File menu. However, the data are loaded directly into GenStat's data pool, and no GenStat spreadsheet is formed. To load the data from the file Bacteria.xls, we select the Data File option from the Load item on the Data menu. This produces the Select Input file menu (Figure 2.2) again. So, as before, we select Other Spreadsheet Files from the drop-down list entitled Files of type, select the file Bacteria.xls, and click OK. This again leads to the initial menu of the Excel wizard, as shown earlier in Figure 2.3.

There are two ways of reading a rectangular range of data from Excel into GenStat. If we select the worksheet Bacteria Counts in Figure 2.3 and click on the Next button (instead of Finish), the second menu in the wizard allows the range to be specified explicitly. You check the Specified Range radio button (instead of the default All cells), and enter the range D3:E13 into the adjacent field, as shown in Figure 2.7.

Figure 2.7
Alternatively, you can create a named range for the rectangular block of data within Excel and select this from the worksheet list in Figure 2.3. To create a named range in Excel you first select the desired rectangle either with the mouse or by using the shift and cursor keys. Once the rectangle has been selected, you can name the range by clicking in the Name Box and typing its name. In Figure 2.6 we have selected the range D3 to E13 and entered its name as Named_Range in the Name Box. If you select Named_Range as the worksheet or range in Figure 2.3 and again click Next, you will see that the range D3 to E13 is set up automatically in the second menu of the wizard, just as in Figure 2.7.

The third menu in the wizard (Figure 2.8) allows you to choose which of the columns in the worksheet or range to read. By default they are all read. The final part of the wizard, shown in Figure 2.9, is a menu with tabs controlling more advanced aspects. This time we have not put an exclamation mark at the end of the column name to specify that the column crop2 is to be a factor. So, instead, we select the Factors tab, and check the Suggest columns with only a few unique values to be factors box. If this option is set, GenStat will check all the columns for repeated values or labels and, if any are detected, you will be prompted with a menu offering you the choice to convert them.

Figure 2.8
Figure 2.9
On clicking Finish, GenStat detects that the column crop2 has repeated labels and displays the menu shown in Figure 2.10. This menu displays all the columns that have repeating values, and the current data type for each column is indicated by a prefix to the name (T specifies a text, F a factor and V a variate). To change the type of crop2 from a text to a factor we double-click on the name crop2 in the list (alternatively you can click on the button labelled Factor). This changes the prefix from T to F, specifying that the column will be a factor. Clicking on OK loads the data range direct into the data pool and produces the summary shown below.

Figure 2.10

Data imported from Excel file: C:\Program Files\Gen7ed\Introduction\Bacteria.xls
on: 25-Apr-2003 21:21:52
taken from sheet "Bacteria Counts", cells D3:E13
 Identifier   Minimum      Mean   Maximum    Values   Missing
    counts2     4.000     98.00     311.0        10         0

 Identifier   Minimum      Mean   Maximum    Values   Missing
      crop2                                      10         0
The final way to input data is to use the facilities within the Spread menu. In this example we will copy the columns count1 and crop1 from the file Bacteria.xls (see Figure 2.6) via the clipboard into a GenStat spreadsheet. As with the layout within a spreadsheet, GenStat expects the data on the clipboard
to be in a rectangular format with columns of equal lengths. In Excel we select the data, including the column names (data range A3:B13), and then select Copy from the Edit menu. Note that when you are using Excel, if you do any other operation on the spreadsheet before going to GenStat, Excel clears the data from the clipboard. The data are available to GenStat only while the dotted lines are moving around the selected cells in Excel. Now, in GenStat, we create a spreadsheet of the data, by selecting the From Clipboard item from the New option on the Spread menu, as shown in Figure 2.11. The New Spreadsheet from Clipboard menu (Figure 2.12) is then produced to control the process. If we leave the Suggest columns to be factors box checked, GenStat will display the factor conversion menu again. This time it will show crop1 as the column with repeated values rather than crop2, as in Figure 2.10. Leaving crop1 as a text and clicking OK produces the spreadsheet shown in Figure 2.13.

Figure 2.11
Figure 2.12
Figure 2.13

2.3 Reading data from ASCII files
In Section 2.2 we described how data can be imported into GenStat in file formats from other applications such as Excel. However, there is another common way to record data in files in a form that can be understood by any application. This is to store the data in a flat (or ASCII) file. This type of file can easily be read and written by GenStat. Data in ASCII format are imported using the Data menu. The data that we have examined so far in this chapter are just a part of a larger experiment. The computer file Bacteria.dat has 40 records, starting with the ten counts we have already looked at, and continuing with a further 30. So, the file looks like this:
"Counts of bacteria and crop type, Jan-Feb 1988"
 18 pea
117 pea
 21 cereal
  7 pea
...
  1 cereal
 12 cereal
  4 pea
  * cereal
Select Load and then ASCII file from the Data menu (as shown in Figure 2.14). This option can cope with any single set of data laid out in a rectangular array, with columns for measurements and rows, or records, for observations. The values must be separated from each other by spaces or some other consistently used character such as a comma. Comments can be included in the file by placing a double-quote (") at the beginning of a line. Textual values must be delimited by single quotes (' ') if they contain a space or a comma, or a backslash character (\), which is the continuation character in the GenStat command language.

Figure 2.14

The ASCII file option displays the Read Data From ASCII File menu in Figure 2.15. You need to enter the name of the file storing the data into the ASCII Data Filename box. If the file is in the working directory (as shown on the Status Bar), the filename is sufficient. Otherwise, you can change the current directory, or give the full path of the file. It is probably easiest, however, to click on Browse to find the directory and file that you want. Figure 2.16 shows the Select ASCII data file menu, with the required file highlighted.

Figure 2.15
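To illustrate these conventions, a small constructed fragment of such a file (our own example, not one of the distributed data files) might look like this:

```
"soil samples, spring survey"
12.5 'New Town' 3
 7.1 OldField *
```

The first line is a comment, the quoted value contains a space, and the asterisk marks a missing value.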
Double-clicking on the file Bacteria.dat will enter the full filename into the box in the Read Data menu, and display the first five lines of the file in the box at the top right of this menu. The names for the data columns are typed in the Names for Data Columns box, separated either by spaces or commas; there must be one name for each column of data in the file.

Figure 2.16

You can choose to group the data automatically to form a factor, by checking the box Automatically Group Data, as in Figure 2.15. The grouping option works by checking the number of distinct values in each column of data, and querying whether to turn into a factor any column that has 10 or fewer. You can change the value for the grouping criterion by entering a value in the Maximum Number of Categories box. With this set of data, the counts will not be grouped because there are many more than 10 different counts. However, there are only two different crop names, pea and cereal, so the menu shown in Figure 2.17 will appear asking whether to turn crop into a factor or keep it as a text.

Figure 2.17

GenStat reads the data file and sets up data structures to store the values that it finds. A report is sent to the Output window, controlled by the check-boxes at the bottom of the Read Data menu, and this is shown below.

"Counts of bacteria and crop type, Jan-Feb 1988"

*** The first line of data with no missing values is:
      18   pea

*** The file C:/Program Files/Gen7ed/Introduction/Bacteria.dat
    is assumed to contain 2 structure(s), with one value for each structure
    on each record
*** Occurrence of distinct values of crop

   category   Count
     cereal      20
        pea      20

*** The file contains 40 values for each of the following structures:

 Identifier      Type   Length    Values   Missing
     counts   Variate       40   Present         3
       crop    Factor       40   Present         0
There are several other options in the Read Data menu that make it easier to deal with data that are not in GenStat's standard form. To illustrate these, we have formed another file called Bacteri2.dat which contains the same data, but with comma separators and the minus sign (-) rather than the asterisk (*) as the missing-value indicator. The identifiers for the two data structures are now included in the first line of the file. Figure 2.18 shows the Read Data menu, completed, ready to read the data from this file instead. Notice that it can save time to store identifiers for data structures in the file with the data; whenever you access the file from GenStat, you will not have to remember what order the columns are in, nor type the names that you wish to use.

Figure 2.18
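Given that description, the first few records of Bacteri2.dat would look something like this (a sketch reconstructed from the description above, not a verbatim listing of the file):

```
counts,crop
18,pea
117,pea
21,cereal
7,pea
```

Missing counts later in the file are recorded as - rather than *.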
2.4 Displaying data
There are two ways that you can view data structures from the GenStat central data pool. You can display the contents of a data structure either within a spreadsheet or within the Output window. To display data in a spreadsheet you can use the Load Spreadsheet menu. The advantage of displaying the data within a spreadsheet is that you are then able to edit the values and send these changes back to GenStat. To open the Load Spreadsheet menu we select the Data in GenStat item from the New option on the Spread menu, which opens the menu as shown in Figure 2.19. In this menu you are
offered the choice of several different types of spreadsheet, allowing you to display different kinds of data structures, such as a vector of values, a matrix or a table. In this example we have selected the default Vector Spreadsheet, which allows a spreadsheet to be formed using one or more columns (or vectors) of information. A Vector Spreadsheet can contain variates, texts and factors simultaneously if they are all the same length. To select the columns to display in the spreadsheet we have double-clicked the names counts2 and crop2 in the Available Data list to copy them into the Data to Load list. Clicking OK produces the spreadsheet shown in Figure 2.20.
Figure 2.20
Figure 2.19
To display the contents of data structures within the Output window you can use the Data Display menu. For this example we shall illustrate how to display the data structures counts and crop from Section 2.3. Figure 2.21 shows the Data Display menu, which is opened by selecting the Display option on the Data menu. We first need to select the data structures in the Data Display menu, which you can do by selecting the All Vectors data folder in the left-hand tree view. This data folder lists all the data structures of vector form (i.e. variates, factors and texts) that are currently stored within GenStat. We then select the Options button so that the additional option buttons at the bottom of the menu are visible. Now, we make a multiple selection of the two identifier names, counts and crop, by holding the Control key down and clicking on each name using the left mouse button.

Figure 2.21

Clicking on the Display button opens the Display Data in Output Window menu shown in Figure 2.22. This menu allows you to control the layout and style of how the data are to appear in the Output window, setting different attributes appropriate to the data type. In this menu we select counts in the Data to Display list, which changes the menu to display the appropriate boxes and buttons that can be used for displaying the contents of a variate. Here we have changed the field width to 8.

Figure 2.22
2.4 Displaying data
Next we select the name crop from the Data to Display list. The identifier crop is a factor, so the menu changes to show the appropriate buttons and boxes for displaying a factor. We then set the field width to 8 and set the justification to left, as shown in Figure 2.23. Clicking OK displays the data in the Output Window as shown below (note: this example shows only the first few rows).
counts 18.00 117.00 21.00 7.00 176.00 85.00 244.00 4.00 55.00 8.00 73.00 ..... ..... .....
crop pea pea cereal pea cereal cereal cereal pea cereal pea cereal
Figure 2.24
Figure 2.25
You can also use the Data Display menu to display data structures within a spreadsheet. For example, to open a spreadsheet containing the data structures counts and crop, select the two names in the data list as in the previous example, using the Control key and left mouse clicks. Clicking on either of the selected names using the right mouse button will produce the menu shown in Figure 2.24. Selecting the Create Spreadsheet menu option will produce a spreadsheet (Figure 2.25) containing the two selected data structures. As well as the Create Spreadsheet menu option, you can use the right-mouse pop-up menu to display data in the Output Window by selecting the menu option Display.
2.5 Converting data structures
After reading data into GenStat you may want to change the data type. For example, you may want to convert a column to a factor because it contains grouped data, or a factor to a text for use in a particular menu. To illustrate how to convert structures we will change the column crop from Figure 2.25 into a text using the menus, and then change it back to a factor using the right-mouse pop-up menu. So, to convert the column crop to a text structure, we select Convert from the Column item on the Spread menu.
Figure 2.26
Figure 2.27
This opens the menu shown in Figure 2.26. The different types of data structure to which you can convert are listed in the Column Type box. We have selected crop from the list of columns and Text from the Column Type box. Clicking OK on this menu changes the spreadsheet as shown in Figure 2.27. You can now see that the column heading has a green ‘T’ indicating that the column is a text structure. To change the column back to a factor, move the mouse over the column crop and click the right mouse button. This should pop up the menu shown in Figure 2.28. On this menu we have selected the item Convert to factor. This returns the spreadsheet to its original state, as shown in Figure 2.25.
Figure 2.28
2.6 Saving data to files
The easiest way to save data in GenStat to an external file is to display the data within a GenStat spreadsheet, and save this as a GenStat Spreadsheet file (*.gsh). To save such a file, select Save As from the File menu. This prompts you with a menu, as shown in Figure 2.29. In the menu select the GenStat Spreadsheet (*.gsh) option from the Save as type list. You can specify the name of the file in which to save the spreadsheet, and select the directory in which to store the file. GenStat can also save data to files in several other common file formats. For example, you can save a spreadsheet in an Excel file, a Lotus workfile or a database file. When GenStat saves the data to these file formats, the data are stored in exactly the same format as they are displayed in the spreadsheet. Storing the data in this way has the advantage that they are easily read back into GenStat. To illustrate this, we look at how the Bacteria counts spreadsheet is saved to an Excel file. To save the spreadsheet, select Save As from the File menu. In the Save As menu select the Excel File (*.xls) option from the Save as type drop-down list, and the Excel version whose file format you want to use from the File version drop-down list (see Figure 2.30). Type in the file name (here the menu shows the default
SHEET2.xls: this is the second unnamed spreadsheet that we have formed in this
session), and click OK. When GenStat saves a new spreadsheet (i.e. one that has not been saved previously) to an Excel file, the data are written to the first worksheet, which is named Genstat Data. Figure 2.31 shows the layout used to represent the bacteria data in the Genstat Data worksheet in the Excel file SHEET2.xls. The data are stored in exactly the same column format as within the GenStat Spreadsheet (see Figure 2.25), but with the identifier names in row 1 and the data values in rows 2 onwards. The identifier name for the column crop has an exclamation mark, '!', appended to the end of the name. This has been added to the name to indicate that the column of data is of type factor. Another way of saving data is to save all the contents of the central data pool to a GenStat Save File (*.gsv). To do this, select the Save option from the Data menu on the menu bar. This brings up the Save As menu, shown in Figure 2.32, allowing you to select the file in which to store all the current data. A GenStat Save File can be opened at a later date using either the File Open option or the Resume item from the Load option on the Data menu.
Figure 2.31
There is a special file format in GenStat called a Session file (*.gsn), which allows you to save the entire contents of your current GenStat session, including the data, spreadsheets, text windows and any open menus. Opening a session file will automatically re-open all of these items. To save a session, select the Save Session option from the File menu. This opens the save dialog shown in Figure 2.33. You can control what details to save by clicking on the Options button. This opens the menu shown in Figure 2.34, where you can select from a range of different features.
Figure 2.34
2.7 Calculations
The Calculate menu (Figure 2.35) is obtained by selecting Calculations from the Data menu. The Calculate menu can be used for calculations such as transforming data and forming data summaries. It can also work as an ordinary calculator. For example, you can simply multiply two numbers together: type 4, click the button for the operator *, and then type 6.25. As you do this, you can see that the expression defining the calculation is recorded at the top of the window. You can display the result in the Output window by selecting the Print in Output option and clicking OK to produce the following output.

4*6.25
25.00
Instead of (or as well as) displaying the result, you can ask to save the result in a structure by giving an identifier name, say s25, in the Save Result In box; s25 will then be defined as a scalar data structure (since the calculation has generated a single number as its result), storing the value 25. As you can see, all the usual arithmetic operators are available: + addition; - subtraction; * multiplication; / division; ** exponentiation (for example, X**2 stands for X²). There is also the operator *+ for matrix multiplication, left and right brackets, and a full set of logical and relational tests; these are described later. Most practical calculations are done on whole series of numbers stored in variates. To show what can be done, we shall work with some administrative data from a small company, recording rates of pay and hours of work over a four-week period. These are available in a GenStat spreadsheet file, Pay.gsh, which can be opened by selecting Open from the File menu on the menu bar, as described in
Section 2.2. If you look at the Calculate menu again (Figure 2.36), the Available Data box will now list the variates hours1, hours2, hours3, hours4 and rate. Our first calculation works out the wages for the first week. To specify this calculation, enter hours1*rate by double-clicking on the identifiers in the Available Data box and single-clicking on the operator button. Store the results in a new structure called pay, removing the check from the Print in Output box (see Figure 2.36). The calculation takes place for every unit of the variates, and pay is defined as a variate. The value for Foster is the appropriate value of hours1 (41) multiplied by the corresponding value of rate (10.00), and so on. This can be verified by returning to the spreadsheet and adding the calculated column (pay) back into the spreadsheet. Selecting the Data in GenStat option from the Add menu on Spread on the menu bar opens the menu shown in Figure 2.37. This menu operates in a similar way to the Load Spreadsheet menu described in Section 2.4 except that, instead of creating a new spreadsheet, the data are added as new columns within the current spreadsheet. We have double-clicked the name pay in the Available Data list to copy it across to the Data to Load box.
Clicking OK produces the spreadsheet shown in Figure 2.38, with the new column pay added to the spreadsheet. You can include a scalar in a calculation with a variate. This will apply the calculation with the scalar value to every unit of the variate. To illustrate this we will add the scalar s25 to every unit of the variate pay. First we click the Clear button to re-initialize the menu, and check the Scalars box in the Available Data box so that the scalar s25 will be displayed in addition to all the available variates. We can then enter the calculation pay + s25, and specify pay1 as the name of the structure (again a variate) to store the results. This time we place the results directly into the spreadsheet. To do this we select the Display in Spreadsheet checkbox, which enables a list of open spreadsheets from which we select Pay.gsh. Figure 2.39 shows the Calculate menu, and Figure 2.40 illustrates the effect of adding pay1 to the spreadsheet, where the calculated column is indicated by a yellow block in the column title.
Figure 2.39
Figure 2.40
GenStat provides a wide range of functions that can be included in the expression. Click on Clear in the Calculate menu to clear the expression and save box. Now, clicking on the Functions button opens the Calculate Functions menu, from which you can choose the function and set its arguments, as shown in Figure 2.41. The menu contains a drop-down list box of different classes of functions. For each function class there is a range of functions available in the Function drop-down list. In Chapter 3 we will describe how to transform data using the logarithmic function. This type of calculation produces a result that is the same type of structure as the argument of the function. Classes which produce a result in this form include transformations, inverse transformations, strings, log-likelihood, and the various types of probability. Some function classes, however, produce a result that is a different type of structure from that of the argument. For example, the functions in the Summary class produce a scalar summary of all the values in a structure.
To illustrate this we will calculate a new scalar, totalpay, as the sum of the pay for all employees in week 1. We select Summary as the class of function, and then Sum values as the function. The Sum values function has a single argument (the numbers to be summed), so we enter this into the X box using the Available Data list (Figure 2.41). Clicking OK transfers the
function and its argument to the expression box in the main Calculate menu at the current cursor position (Figure 2.42). Clicking OK in the Calculate menu creates a new scalar spreadsheet with one column containing the value of totalpay (Figure 2.43).
Figure 2.44
Figure 2.45
Most of the summary functions operate on a single data structure. One exception is CORRELATION, which calculates the correlation between two structures. In Figures 2.44 and 2.45 we illustrate how to calculate the correlation between the hours worked in week 1 and the pay rates, placing the result in the scalar paycorr. The resulting scalar spreadsheet containing the value of paycorr is shown in Figure 2.46. Notice that, when a function has more than one argument, each is separated from the next by a semicolon. (Correlations can also be calculated from the Summary Statistics option of the Stats menu.)
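The CORRELATION function computes the ordinary Pearson correlation coefficient between the units of two structures. For readers who want to see the arithmetic behind it, here is a sketch in Python rather than GenStat; the hours and rate values below are made up for illustration (the real values live in Pay.gsh).

```python
import math

def correlation(x, y):
    """Pearson correlation between two equal-length sequences,
    the quantity GenStat's CORRELATION function returns."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    # Sums of cross-products and squared deviations about the means
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical week-1 hours and pay rates:
hours1 = [41, 40, 38, 45, 40]
rate = [10.0, 12.5, 9.0, 11.0, 10.5]
paycorr = correlation(hours1, rate)
```

The result always lies between -1 and +1, with the sign indicating the direction of the association.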
Sometimes data will include missing values; for example, the column hours2 has a missing value in row 7, where there was no record of the hours worked by Edwards in week 2. GenStat has a general rule for calculations involving missing values: if any of the structures involved in a calculation has a missing value in a particular unit, the result of the calculation will be missing for that unit. To illustrate this we will calculate the variate monthpay as the total pay for each employee over the four weeks (Figure 2.47). The resulting spreadsheet is shown in Figure 2.48, where the column monthpay contains a missing result when the hours are unknown in any of the four weeks. If you wish to replace a missing value, you can use the Replace missing values function (MVREPLACE) in the Transformations menu. This has two arguments: the first specifies the identifier of the data structure with the missing values, and the second supplies the values that are to replace them. In our example we might assume that a value would be missing if an employee had not been present during the week concerned, so we should replace it by zero. To do this we use the calculation shown in Figures 2.49 and 2.50 for each of the hours columns which contain a missing value.
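GenStat's rule that a missing operand produces a missing result resembles the way NaN propagates through floating-point arithmetic. The Python sketch below illustrates both the rule and an MVREPLACE-style substitution; the hours and rate values are invented for the example, not taken from Pay.gsh.

```python
import math

NA = float("nan")  # stand-in for GenStat's missing value

def mvreplace(values, replacement):
    """Analogue of GenStat's MVREPLACE function: substitute a
    replacement for every missing value in a column."""
    return [replacement if math.isnan(v) else v for v in values]

# Hypothetical data; hours2 has a missing value in its second unit
hours1 = [41, 40, 38, 45]
hours2 = [40, NA, 39, 44]
rate = [10.0, 12.5, 9.0, 11.0]

# The missing value propagates: any NA operand gives an NA result
monthpay = [(h1 + h2) * r for h1, h2, r in zip(hours1, hours2, rate)]
assert math.isnan(monthpay[1])

# Replacing the missing hours by zero first avoids the missing result
hours2 = mvreplace(hours2, 0)
monthpay = [(h1 + h2) * r for h1, h2, r in zip(hours1, hours2, rate)]
```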
Figure 2.49
Figure 2.50
When values are changed in data structures that have been used in a calculation, any calculated structures present in a spreadsheet (e.g. pay1 and monthpay) need to be manually recalculated. We can now recalculate monthpay by selecting Recalculate from the Calculate option of the Spread menu; see Figure 2.51. In this menu we select the column monthpay and click OK to recalculate that column in the spreadsheet.
Figure 2.52
Figure 2.53
If you examine the spreadsheet in Figure 2.48 more closely, you will see that there are other problems in the data: the value in row 9 for hours4 is 400. Calculations can also involve relational and logical tests. These produce the value 1 if the result is true, and 0 if it is false. So, for example, we can use the greater-than operator (>) to set up a variate called odd4 containing 0s and 1s according to whether staff are recorded as working less than, or more than, 100 hours in the fourth week (Figure 2.52). We can then use odd4 in the Insert missing values function (MVINSERT) to place a missing value into unit 9 of the monthpay variate, since we believe this record must be wrong. This function also has two arguments: the first is the identifier of the structure with values that need changing, and the second is a variate of 0s and 1s indicating which values are to become missing values. So Figures 2.53 and 2.54 take the values of monthpay, insert a missing value whenever the corresponding value of odd4 is non-zero, and store the results back in monthpay. The resulting spreadsheet is shown in Figure 2.55.
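The two steps just described can be mimicked outside GenStat. This Python sketch (with invented data values, not the real Pay.gsh columns) shows a relational test producing 0s and 1s, followed by an MVINSERT-style operation.

```python
import math

NA = float("nan")  # stand-in for GenStat's missing value

def mvinsert(values, flags):
    """Analogue of GenStat's MVINSERT function: make a value missing
    wherever the corresponding flag is non-zero."""
    return [NA if f != 0 else v for v, f in zip(values, flags)]

# Hypothetical data; the third unit of hours4 is implausibly large
hours4 = [40, 38, 400, 41]
monthpay = [1600.0, 1520.0, 9999.0, 1640.0]

# Relational tests yield 1 for true and 0 for false, as in GenStat
odd4 = [1 if h > 100 else 0 for h in hours4]

# Insert a missing value wherever the flag is non-zero
monthpay = mvinsert(monthpay, odd4)
assert math.isnan(monthpay[2])
```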
Figure 2.54
Figure 2.55
The calculation does not have to be done in two stages: the second argument of the Insert missing values function could be replaced by the expression hours4 > 100. However, the intermediate variate odd4 makes it easier to see what is going on. The available operators for relational tests are as follows:

<      less than
<=     less than or equal to
>      greater than
>=     greater than or equal to
==     equality
/=     not equal to
.IN.   inclusion: X.IN.Vals gives the result true for each value of X that is equal to any one of the values of Vals
.NI.   non-inclusion: the opposite of .IN.
There are also logical operators that can be useful to combine the results of expressions involving relational operators.

.AND.  and: a.AND.b is true if both a and b are true
.OR.   or: a.OR.b is true if either a or b is true
.NOT.  not: .NOT.a is true if a is untrue
.EOR.  either-or: a.EOR.b is true if either a or b, but not both, is true
The precedence of the operators (that is, the order in which they are evaluated if there are several different ones contained in an expression) is much the same as you would expect from ordinary arithmetic:

(1) .NOT.  Monadic - (that is, minus as for example in -2)
(2) .IS. .ISNT. .IN. .NI. *+
(3) **
(4) * /
(5) +  Dyadic - (that is, minus as for example in x-y)
(6) < > == <= >= /= <> .LT. .GT. .EQ. .LE. .GE. .NE. .NES.
(7) .AND. .OR. .EOR.
(8) = (see the examples of CALCULATE below)

Within each class, operations are done from left to right within an expression. So, for example

5 - 1 - 2

gives the value 2. In case of any doubt, it is safest to use brackets: the expression inside a pair of brackets is always evaluated first. So, for example

5 * 2 + 3

gives the value 13, but

5 * (2 + 3)

gives the value 25.
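For the arithmetic operators these precedence and left-to-right rules match those of most programming languages, so the examples above can be checked directly in, say, Python (which also uses ** for exponentiation):

```python
# Multiplication binds tighter than addition, so no brackets are needed:
assert 5 * 2 + 3 == 13

# Brackets force the addition to be evaluated first:
assert 5 * (2 + 3) == 25

# Operators of equal precedence associate left to right:
assert 5 - 1 - 2 == 2

# Exponentiation, written ** as in GenStat:
assert 3 ** 2 == 9
```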
GenStat text structures can be used in expressions, but only with the set inclusion operators .IN. and .NI. (see above), or the string operators .EQS. (equality) and .NES. (inequality). For example, the expression text1 .EQS. text2
compares the string in each unit (or line) of text1 with that in the corresponding unit of text2, giving the result true if they are identical, while text1 .NES. text2
gives the result true if they differ. When a factor occurs in a calculation, GenStat usually works with its levels. The exception is when the factor occurs as the first operand of the operators .IN. or .NI. and the second operand is a text; the factor labels are then used instead. A factor can also receive the results of a calculation; an error is reported if any of the resulting values is not one of the levels of the factor. Two functions are provided especially for factors: the Number of levels function (NLEVELS) in the Other class provides the number of levels of a factor, and the Convert factor function (NEWLEVELS) forms a variate from the factor supplied by its first argument, using the variate supplied by its second argument to define (new) values for the levels.
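The effect of NLEVELS and NEWLEVELS can be pictured as counting distinct levels and mapping each level to a new numeric value. A Python sketch (the crop labels echo Section 2.3; the scores assigned to the levels are hypothetical):

```python
def nlevels(factor):
    """Analogue of GenStat's NLEVELS: the number of distinct levels."""
    return len(set(factor))

def newlevels(factor, mapping):
    """Analogue of GenStat's NEWLEVELS: form a variate from a factor by
    assigning a new numeric value to each level."""
    return [mapping[level] for level in factor]

crop = ["pea", "pea", "cereal", "pea", "cereal"]
assert nlevels(crop) == 2

# Hypothetical scores for the two levels:
score = newlevels(crop, {"pea": 1.0, "cereal": 2.0})
assert score == [1.0, 1.0, 2.0, 1.0, 2.0]
```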
2.8 The GenStat graphics wizard
It is often useful to explore the data using graphical summaries. These can be used to examine the structure of the data or to display its distribution. GenStat provides a convenient way of generating graphical displays using the graphics wizard. We shall investigate some data collected in 1990 to investigate changing levels of air pollution. The principal measurement is the amount of sulphur in the air each day, but associated measurements are included: the strength and direction of the wind, and whether or not it rained. The data are available in the file Sulphur.gsh and can be read using the Data file option on the Load menu from the Data menu bar. The Data Display menu, shown in Figure 2.56, lists the structures loaded from the file. (You can view the Data Display menu by selecting the Display option on the Data menu.) The data consist of two variates and two factors, each containing 114 values. We can investigate the shape of the distribution of the 114 sulphur measurements using a histogram. To draw a histogram we select the Create Graph option from the Graphics menu. This opens the first menu of the graphics wizard, as shown in Figure 2.57. On this menu we select the option 2-D Histogram from the Graph Type list. Clicking the Next button opens the second menu, which is for entering the data. The data menu for a histogram is shown in Figure 2.58, where we have selected Sulphur in the Select Data list and clicked on the arrow to place the name Sulphur into the Data currently selected for plotting list at the bottom of the menu. Now, clicking the Next button opens another menu for entering the options (see Figure 2.59).
Figure 2.58
On this menu we entered a title, "Sulphur pollution", for the graph. We have selected the Use data values option for specifying the boundaries. This option lets GenStat select the number of groupings and their locations automatically. However, you can use the Number of Groups option to specify a particular number of groups. The groups are then defined by intervals of equal width, spanning the range of values of the variate. Alternatively, if you want other intervals, you can define the boundaries explicitly by selecting the Limits option and entering either a variate containing the boundary values, or the list of values themselves. Clicking the Finish button produces the histogram shown in Figure 2.60. This shows the numbers of observations in successive equal-width categories of the sulphur scale. Clearly sulphur has a skew distribution: there are many days with little or no sulphur in the air, and then decreasing numbers in successive categories with more and more sulphur. Other useful graphical ways of displaying the data in a variate include stem-and-leaf plots and boxplots (see Figure 1.13).
Many statistical studies are concerned not with single variables, but with the relationships between several variables. Clearly it is important also to study the individual variables, as we have done already with some of the pollution data in the previous section, but with such data it is natural to ask questions like "Is there any effect of wind speed on the sulphur level?" The most effective way to begin to answer a question like this is generally to draw a scatter plot, or point plot. In GenStat, this can be done using the graphics wizard by selecting the 2-D Scatter Plot option in the Graph Type list (see Figure 2.57). If we click on the Next button, the resulting menu is shown in Figure 2.61. We wish to plot the sulphur levels against the wind speeds, so we select the Single XY option from the Type of plot list. We then select the name of the variate Sulphur from the Select Y list, and the variate Windsp from the Select X list. The two variates that have been selected for plotting are then displayed at the bottom of the menu in the Data currently selected for plotting lists. Clicking Next opens the Graphics Wizard attributes menu shown in Figure 2.62, where we have entered the title into the Graph Title box and unselected the Display key option so that no key is displayed on the graph. Clicking on Finish produces the graph shown in Figure 2.63. GenStat also displays a message in the Output window to warn that there are some missing values that have not been plotted.
Many of the graphics menus contain axes menus to allow you to modify various aspects of the x- and y-axes. For each axis there is a separate menu which allows you to set various attributes such as a title, a lower limit, an upper limit, the position of tick marks, and the origin position through which to draw the other axis, i.e. where to draw the x-axis across the y-axis, or vice versa. To illustrate the axis menu, we select the Create Graph option from the Graphics menu bar as before, then select the 2-D Scatter Plot option from the Graph Type list, and click Next. On the data menu we select Sulphur and Windsp as the data for Y and X again, and click Next. On the Options menu we now select the X Axis tab from the top of the menu, and this produces the menu shown in Figure 2.64. To set the x-axis attributes in this menu we select the Display Title option and enter the title as Wind speed m/s. We then enter the lower and upper bound values as 0 and 25.
Figure 2.64
We now click on the Y Axis tab, which produces a menu identical to the X Axis tab. The new menu is shown in Figure 2.65, where we have selected the Display Title option and entered the title Sulphur microg/m**3 into the space provided. You can set the style and colour of the symbols for the scatter plot using the Line and Symbols tab.
Figure 2.65
Clicking on this tab produces the menu shown in Figure 2.66. We have selected Plot 1 from the Graph list, and chosen Circle from the symbols. You can choose colours for the circle (in the Colour list box) and for its interior (in the Fill Colour list box). Here we have chosen dark blue circles filled with light blue.
Figure 2.66
Clicking on Finish produces the graph shown in Figure 2.67, where the points are now represented by filled circles.
Figure 2.67
2.9 Commands for data input, calculations and display
The menu options illustrated above perform actions that can also be carried out using commands in an Edit window, as outlined in Section 1.3. You may prefer not to use this level of control of the GenStat system, and restrict yourself to what can be done by the menus, but the commands are in fact much more powerful. We outline them here so that you can find out what is available in case you need it, and we shall continue to do this in an optional final section of each of the remaining chapters in this book, giving more details of commands in general in Chapter 10. The READ directive, described at the end of this section, provides very general facilities for the input of data. However, when the file is in a simple layout (like those handled by the Data menu), it is simpler to use the FILEREAD procedure. The FILEREAD procedure can automatically determine the type of data being read as either numbers or text, and then set up factors in the same way as the Data menu. For example, the data in Bacteria.dat can be input by the following single command:
FILEREAD [NAME='Bacteria.dat'] counts,crop; FGROUPS=no,yes
Similarly, the alternative data-recording style used in Bacteri2.dat can be handled with the following command (using the continuation character (\) to continue the command onto a second line):
FILEREAD [NAME='Bacteri2.dat'; IMETHOD=read; \
MISSING='-'; SEPARATOR=','] FGROUPS=no,yes
You need to be careful with using backslashes when specifying names of directories on a PC. If you need to specify the directory name, you should either duplicate each backslash character, or use the forward slash (/) instead. For example, if the data file was in directory C:\GEN6ED\INTRO\EXAMPLE, you could put either NAME='C:\\GEN6ED\\INTRO\\EXAMPLE\\Bacteria.DAT'
or NAME='C:/GEN6ED/INTRO/EXAMPLE/Bacteria.dat'
The SPLOAD directive can be used to access data within a GenStat spreadsheet. For example, to read the data from the file Sulphur.gsh you would use the following command:
SPLOAD 'C:/GEN6ED/INTRO/EXAMPLE/Sulphur.gsh'
By default, GenStat will display a list of the data structures within the spreadsheet file. However, you can suppress this summary using the PRINT option as follows:
SPLOAD [PRINT=*] 'C:/GEN6ED/INTRO/EXAMPLE/Sulphur.gsh'
The IMPORT procedure provides a way of importing data from foreign file formats such as Excel, Quattro or dBase. The basic use of the IMPORT procedure is the same for most file formats, with the exception of Excel and Quattro, which use two additional parameters. The following example shows how to read data from the GenStat Data worksheet within the Excel file Bacteria.xls.
IMPORT 'C:/GEN6ED/INTRO/EXAMPLE/Bacteria.xls'; \
SHEET='GenStat Data'
There are two ways that a range of data can be read from an Excel file. The first way is to specify the worksheet and the range. For example, the following command will read the range A3:B13 from the Bacteria Counts worksheet.
IMPORT 'C:/GEN6ED/INTRO/EXAMPLE/Bacteria.xls'; \
SHEET='Bacteria Counts'; CELL='A3:B13'
The second way to read in a range is to specify a named range in the SHEET option. For example, to read in a named range called ‘Named_Range’ you would use the following:
IMPORT 'C:/GEN6ED/INTRO/EXAMPLE/Bacteria.xls'; \
SHEET='Named_Range'
You can use the PRINT directive introduced in Chapter 1 to store data in an ASCII file. This directive has a CHANNEL option, so you can open a file for output and instruct PRINT to write data into it with a prescribed format. There are also many options available in PRINT to control details of layout; full details are in Section 10.3. For example, the following commands would write a new file containing the same data as within Bacteria.dat, but without the initial comment, and then close the file in order that it could be used by another application:
OPEN 'Bacteria.new'; CHANNEL=2; FILETYPE=output
PRINT [CHANNEL=2; IPRINT=*] counts,crop; \
FIELDWIDTH=3,7; DECIMALS=0
CLOSE 2; FILETYPE=output
You can also use the PRINT directive to display the contents of data structures in the Output window. The following command shows how to display the counts and crop structures in the output window.
PRINT counts,crop; FIELDWIDTH=3,7; DECIMALS=0
You can display data structures in a spreadsheet or add structures to an existing spreadsheet using the FSPREADSHEET procedure. The following command will display the bacteria data in a spreadsheet:
FSPREADSHEET counts,crop
To add data structures you should use the SHEET option. The number provided by SHEET is the position of the spreadsheet in the list of currently open spreadsheets. Thus SHEET=1 will add or update data in the first spreadsheet in the window list, SHEET=2 the second, and so on. Setting SHEET=0 will cause GenStat to update the first sheet with matching structures (i.e. for a variate this will be a Vector Spreadsheet with the same number of rows). The GenStat interface uses internal pointers to the spreadsheet structures, which appear as large integers, and these should not be re-used in your code. For example, the following commands will load counts into a spreadsheet and then add crop into this spreadsheet:
FSPREADSHEET counts
FSPREADSHEET [SHEET=0] crop
The procedure FSPREADSHEET can also be used to save data structures to a GenStat spreadsheet using its OUTFILE option. For example, the following shows how to save the bacteria data to a GenStat spreadsheet:
FSPREADSHEET [OUTFILE='MyData.gsh'] counts,crop
To save GenStat data structures to a foreign file you can use the EXPORT procedure. The file format that the data are saved into depends on the file extension used. For example, if you use the file extension .xls the data will be saved into an Excel file. The file types that are currently supported by GenStat include Excel (.xls), Quattro (.wq1), dBase (.dbf), Splus (.sdd), Gauss (.fmt), MatLab (.mat), Instat (.wor) and comma-delimited text files (.csv). To export data to an Excel file you would use the following command:
EXPORT ['MyData.xls'] counts,crop
If you are using EXPORT in interactive mode, a prompt will appear if you try to overwrite an existing file. You can avoid this prompt with the following command:
EXPORT ['MyData.xls'; METHOD=overwrite] counts,crop
The Calculations menu uses the GenStat CALCULATE directive. This has the form:

CALCULATE expression
where the expression specifies the calculation to be performed and where the results are to be stored. The expression first indicates the structure (or structures) to store the results (as provided by the Save Result In box of the Calculations menu). There is then an assignment operator =, and then details of the calculation to be done (as listed in the box at the top of the Calculations menu). The CALCULATE statements for the examples discussed in Section 2.7 are reproduced below.

CALCULATE s25=4*6.25
CALCULATE pay=hours1*rate
CALCULATE pay1=pay+s25
CALCULATE totalpay=SUM(pay1)
CALCULATE paycorr=CORRELATION(hours1;rate)
CALCULATE monthpay=(hours1+hours2+hours3+hours4)*rate
CALCULATE hours2=MVREPLACE(hours2; 0)
CALCULATE hours3=MVREPLACE(hours3; 0)
CALCULATE hours4=MVREPLACE(hours4; 0)
CALCULATE odd4=hours4>100
CALCULATE monthpay=MVINSERT(monthpay;odd4)
The histogram in Figure 2.60 can be produced by the DHISTOGRAM directive as follows
2 Data input, calculations and display
64
DHISTOGRAM sulphur
The DHISTOGRAM directive arranges the layout and labelling of the picture itself, but there are many options available if you want to customize the display. For example, you can define your own ranges for the bars by setting the LIMITS option to a suitable variate; this corresponds to the Limits box in the menu in Figure 2.59. Likewise, the NGROUPS option corresponds to the Number of Groups box. An extension to the capabilities available from the menus is that in command mode you can change the colour in which the bars are drawn. This is controlled by the GenStat pen that is used to draw the bars. For example

DHISTOGRAM sulphur; PEN=2
would draw the histogram bars in red rather than white, as red is the colour associated by default with pen 2 in GenStat. (We describe later how to change the default attributes of the pens.) Scatter plots are provided by the DGRAPH directive for high-resolution plots, and by GRAPH for line-printer plots. For example, the plot in Figure 2.63 can be produced by the statement

DGRAPH sulphur; windsp
However,

GRAPH sulphur; windsp
would give the line-printer alternative. The digits in the picture, shown below, represent coincident points in the coarse grid used by GRAPH (the digit 9 would indicate nine or more coincident points).

[line-printer scatter plot: Sulphur (y-axis, 0.0 to 40.0) plotted against Windsp (x-axis, 0.0 to 24.0); single points are marked by asterisks and coincident points by digits]
Sulphur v. Windsp using symbol *
As with the DHISTOGRAM directive, there are many details that can be modified if you are not satisfied with the default settings. For example, DGRAPH also has a PEN parameter that allows you to change attributes such as the colour used to plot the points, so we could set PEN=2 to draw the crosses in red. The plotting symbol can also be changed, using the PEN directive; the PEN parameters of interest here are SYMBOL and CSYMBOL. For example, the command

PEN 2; SYMBOL=7; CSYMBOL=3

sets up pen 2 to draw triangles rather than crosses, and to display them in green in the PC implementation. Figure 2.68 shows the 22 standard markers, which are the same in all implementations of GenStat. DGRAPH and GRAPH also have a TITLE option to supply a general title for the plot. Axis titles for high-resolution histograms and scatter plots are specified using the XAXIS and YAXIS directives. These provide detailed control of the axes in the picture; for example, the following commands specify titles for both the x-axis (horizontal) and the y-axis (vertical):

XAXIS 1; TITLE='Wind speed m/s'
YAXIS 1; TITLE='Sulphur microg/m**3'
The digit 1 here refers to the number of the graphical window whose axes are to be specified. By default, DGRAPH and DHISTOGRAM draw their pictures in window 1, which takes up most of the screen, and their keys in window 2, which is a narrow rectangle along the bottom of the screen. You can use up to 32 windows, so you can position several pictures on the same screen. High-resolution graphics can be produced on media other than the screen. This is done by selecting a different graphical device. The possibilities in GenStat for Windows can be found by giving the following command:

DHELP

Many word-processing systems allow you to import figures in Encapsulated PostScript format. To do this in GenStat for Windows you must select device 3, by giving the command
DEVICE 3
You also need to open a file to store the graphical codes; this is done by the OPEN directive, in the same way as opening files for input and output described earlier in this section. However, the graphical channel number must be the same as the device number:

OPEN 'Figure.eps'; CHANNEL=3; FILETYPE=graphics
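Putting these steps together, the whole sequence for capturing the plot from Figure 2.63 in an Encapsulated PostScript file might be sketched as follows; the tidying-up statements at the end are an assumption (they take device 1 to be the screen device — check the exact CLOSE syntax and device numbers for your implementation with DHELP):

```
DEVICE 3
OPEN 'Figure.eps'; CHANNEL=3; FILETYPE=graphics
DGRAPH sulphur; windsp
CLOSE 3; FILETYPE=graphics
DEVICE 1
```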
Once stored, the file Figure.eps can be incorporated into documents (if the word-processing software allows this). You can also save graphs in various formats, including the Enhanced Windows Metafile (EMF) format and a specific GenStat (gmf) format, by clicking on the Save or Save As lines in the File menu of the GenStat graphics viewer. All aspects of the graphical "environment" in GenStat are given initial settings relevant to each device, but they can be redefined if necessary (Guide to GenStat, Part 1, Section 6.8). Finally, we mention briefly some useful housekeeping commands. To list the data structures in the session, you can use the LIST directive. If you type

LIST

you will be given a list of all the currently defined structures. The DELETE directive can be used to throw away data structures once you have finished with them. By default, it throws away only their values, so you cannot reuse the identifier for another type of structure. However, if you set the REDEFINE option, you can then re-use the identifier in any way you wish. For example, if we no longer wanted the variate counts, but wanted to set up a scalar called counts instead, we could use the commands

DELETE [REDEFINE=yes] counts
SCALAR counts
If you wish to copy a structure, you can use the DUPLICATE directive. For example, the command

DUPLICATE OLD=crop; NEW=newcrop

will set up a new factor called newcrop, containing the same values as in crop. If you then delete the original structure:

DELETE [REDEFINE=yes] crop
this is equivalent to renaming the structure. Finally, we describe the READ directive, which allows you to input data values into any GenStat data structure. In the simplest use of READ, you need specify only
the identifiers of the structures to be read. GenStat will then expect you to provide the data values in free format on the next input lines, and to type a colon (:) at the end of the data. READ has a PRINT option with the settings:

   summary   to print a summary of the data
   data      to print a copy of the input lines
   errors    to print a detailed report on any errors in the data

By default PRINT=summary,errors, but we include the setting data in all the examples below, so that you can see what is being read. We have also requested GenStat to echo the lines containing the commands. With GenStat for Windows this is requested from the Options menu; with other implementations it happens by default (and can be controlled by the INPRINT option of the SET directive). All the examples in this section show READ being used in an example program (called Read.gen), which has been executed in batch (by opening it in an input window and then selecting Submit Window from the Run menu on the menu bar, as explained in Section 1.4). READ can also be used in a window that has been set to be interactive (by clicking on the Interactive line in the Options menu on the menu bar; see Section 1.5). GenStat then expects you to type the data onto the screen. It uses a special prompt string to indicate the unit of the structure whose value is to be read next, and it terminates automatically when enough data values have been supplied (see Guide to GenStat, Part 1, Section 3.1.2). We have also checked the Echo Commands box on the Audit Trail tab of the Options menu (Figure 1.32), so that the commands are echoed in the Output window.

   2  VARIATE [NVALUES=8] rain
   3  TEXT [NVALUES=8] day
   4  TEXT [VALUES=no,yes] sunlabel
   5  FACTOR [NVALUES=8; LABELS=sunlabel] sunshine
   6  READ [PRINT=data,summary,errors] rain
   7  0 5 14 2.3E1 3
   8  0 2 8 :

       Identifier   Minimum      Mean   Maximum    Values   Missing
             rain    0.0000     6.875     23.00         8         0
   9  READ [PRINT=data,summary,errors] day
  10  'Last Sunday' Monday Tuesday Wednesday Thursday Friday
  11  Saturday Sunday :

       Identifier   Minimum      Mean   Maximum    Values   Missing
              day                                        8         0
  12  READ [PRINT=data,summary,errors] sunshine; FREPRESENTATION=labels
  13  yes yes no no no yes yes no :

       Identifier    Values   Missing    Levels
         sunshine         8         0         2
The first section of the output reads data into a variate rain, a text day, and a factor sunshine. As you can see, in free format numbers can be given in either ordinary or scientific format (line 7), and they can be arranged in any way you like provided there is at least one space, newline or tab character between each one. Similar freedom is available for the strings of texts like day, but notice in line 10 that the string Last Sunday needs to be placed between quotes ('). By default, READ expects the values of a factor to be represented by its levels, but in line 12 we have set the parameter FREPRESENTATION=labels to indicate that we shall use the labels no and yes for sunshine, instead of the levels 1 and 2. Next, we print the values of the structures so that you can see what has been read.

  14  PRINT day,rain,sunshine

              day      rain  sunshine
      Last Sunday     0.000       yes
           Monday     5.000       yes
          Tuesday    14.000        no
        Wednesday    23.000        no
         Thursday     3.000        no
           Friday     0.000       yes
         Saturday     2.000       yes
           Sunday     8.000        no
Several structures can be read using a single READ statement. GenStat assumes that the values will be read in parallel (or unit by unit), and therefore that the structures will all have the same dimensions. This is illustrated in line 15, where we read day and rain again, in parallel.

  15  READ [PRINT=data,summary,errors] day,rain
  16  'Last Sunday' 0 Monday 5 Tuesday 14 Wednesday 23 Thursday 3
  17  Friday 0 Saturday 2 Sunday 8 :

       Identifier   Minimum      Mean   Maximum    Values   Missing
              day                                        8         0
             rain    0.0000     6.875     23.00         8         0
Structures with different dimensions can be read in series, by setting the option SERIAL=yes. Now the structures are read in turn, and each set of data values has its own terminating colon (:), as shown in line 18 of the next section of output.

  18  READ [PRINT=data,summary,errors; SERIAL=yes] day,rain
  19  'Last Sunday' Monday Tuesday Wednesday Thursday Friday
  20  Saturday Sunday :
  21  0 5 14 2.3E1 3
  22  0 2 8 :

       Identifier   Minimum      Mean   Maximum    Values   Missing
              day                                        8         0
             rain    0.0000     6.875     23.00         8         0
If a structure whose values are to be read has not already been declared, GenStat will define it automatically as a variate. Likewise, if the length of a vector (a variate, text or factor) is undefined, this too will be set automatically. READ first checks whether the vector is being read in parallel with other vectors whose lengths have been defined; then it looks to see whether a default length has been defined for vectors using the UNITS directive. If neither of these is available to define the length, it is set to the number of data values that are provided in the input. For example, in line 23 below, temp is defined to be a variate of length five. If you have declared a structure to be a factor (see Section 2.2), but have not yet defined its levels or labels, READ can define these for you too: levels only if FREPRESENTATION=levels, or labels (and levels as integers 1, 2, ...) if FREPRESENTATION=labels. Lengths of vectors can also be redefined according to the number of data values that are read, by setting the option SETNVALUES=yes. This is used in line 25 to redefine the lengths of day and rain also to be five.

  23  READ [PRINT=data,summary,errors] temp
  24  15.5 12 10.5 18 21 :

       Identifier   Minimum      Mean   Maximum    Values   Missing
             temp     10.50     15.40     21.00         5         0
  25  READ [PRINT=data,summary,errors; SETNVALUES=yes] day,rain
  26  Monday 5 Tuesday 14 Wednesday 23 Thursday 3 Friday 0 :

       Identifier   Minimum      Mean   Maximum    Values   Missing
              day                                        5         0
             rain    0.0000     9.000     23.00         5         0

  27  PRINT day,rain,temp; DECIMALS=0,0,1

              day      rain      temp
           Monday         5      15.5
          Tuesday        14      12.0
        Wednesday        23      10.5
         Thursday         3      18.0
           Friday         0      21.0
For factors, you can set the option SETLEVELS=yes, to get READ to set up the levels or labels according to the values that it finds when reading the data. By default it distinguishes between capital and small letters when forming factor labels, but you can set the option CASE=ignored to ignore the case of letters. Also, by default the levels or labels are sorted into ascending order, but you can set the option LDIRECTION=given to leave them in the order in which they are found in the data file. It is often convenient to have the data in a separate file on the computer, particularly when running GenStat interactively. First you need to open the file on a suitable input channel. This can be done using the OPEN directive, as explained earlier in this section. In line 28, we open the file Weather.dat on input channel 2. We can then read data from this file by setting the option CHANNEL to 2 in the READ statement. Notice that the printed input lines have their own line numbering and a two-space indentation. A file can contain more than one set of data, and GenStat will remember how far it has read through the file so that you can read them in turn. Alternatively, you can rewind the file to start again from the beginning by setting the READ option REWIND=yes. At line 30, we use the CLOSE directive (Section 2.6) to close the file. Channel 2 could then be reused, if necessary, for some other file.

  28  OPEN 'Weather.dat'; CHANNEL=2
  29  READ [PRINT=data,errors; CHANNEL=2] day,rain,temp

    1  Monday 5 15.5 Tuesday 14 12.0 Wednesday 23 10.5
    2  Thursday 3 18.0 Friday 0 21.0 :

  30  CLOSE 2
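The REWIND option mentioned above can be sketched as follows; this is an illustrative sequence (not from the example program) that reads the same data twice from one opening of the file. Text between double quotes is a GenStat comment:

```
OPEN 'Weather.dat'; CHANNEL=2
READ [PRINT=errors; CHANNEL=2] day,rain,temp
READ [PRINT=errors; CHANNEL=2; REWIND=yes] day,rain,temp  "start again from the beginning"
CLOSE 2
```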
Other facilities not described here include the ability to read data in fixed format, to skip data values in free format, and to omit or change the use of the colon to mark the end of each set of data (see Section 3.1 of the Guide to GenStat). Data structures can also be stored in, and retrieved from, special backing-store files (Section 3.5 of the Guide to GenStat).
2.10 Exercises

2(1) The file Rivers.txt is an ASCII file containing information on the 15 longest rivers.

River        Length   Continent
Nile           6695   Africa
...
Amur           4415   Asia
Lena           4269   Asia
Mackenzie      4240   N.America
Niger          4183   Africa
Mekong         4180   Asia
Yenisey        4090   Asia
Murray         3717   Oceania
Volga          3688   Europe
Use the Data menu to read all the data, storing the third column as a factor. The name Huang He contains a space, so it is enclosed in single quotes in the data file to ensure that it is treated as a single data value rather than two values. Display the three columns in the Output window using the Data Display menu. Load the data into a spreadsheet using the Spread menu, and then save the data in the spreadsheet as an Excel file with extension .xls.

2(2) The file Traffic.xls is an Excel data file with one worksheet called counts storing one set of data in the area B3:D43. Using the Open option of the File menu, load the data into a GenStat spreadsheet, converting day and month to factors. Examine the distribution of the counts using a histogram. Redraw the histogram, this time specifying 10 groups for the bars.

2(3) Details are given below of numbers of personal computers sold by a shop in the months of 2001 and the prices charged; the data are available in the spreadsheet file Computer.gsh. Open the spreadsheet using the File menu. Calculate the amount received from PC sales in each month and display it within the current spreadsheet. The price for January has been entered incorrectly and should have been 1099. Change this value and recalculate the column for the amount received. Calculate the total received over the whole year.

"month       number   price"
January          12     999
February          8    1150
March            21    1150
April            18    1250
May               7    1250
June              5    1250
July              6    1250
August           18    1099
September         5    1250
October          17    1250
November         13    1250
December         31    1150
2(4) The number of male deaths from lung cancer per million population, and the average number of cigarettes smoked by men in 1930 in 11 countries, is given below and is available in Smoking.gsh.

Country           Deaths   Smoking rate
Australia            172            452
Canada               151            508
Denmark              168            379
Finland              353           1113
'Great Britain'      468           1145
Holland              244            468
Iceland               60            226
Norway                95            258
Sweden               116            315
Switzerland          252            540
USA                  194           1290
(Data from Tufte, 1983, The Visual Display of Quantitative Information. Graphics Press: Cheshire, Connecticut, USA.) Draw a graph of cancer incidence against smoking rate, providing informative titles. Set axis limits to make sure that both the y-axis and the x-axis start at 0. Also, draw the symbols for the points as red circles using the Line and Symbols menu.

2(5) Write a program to read the month, number and price of the personal computers sold by a shop, and then display them in the Output window and within a spreadsheet. The data can be read using the SPLOAD directive. To display the data in the Output window you will need to use PRINT, and to display them in a spreadsheet you should use FSPREADSHEET. Write the data into a secondary output file using the OPEN and PRINT directives.
3  Basic statistics
In this chapter we describe how you can produce summary statistics, and introduce some analytical statistics. We remind you how to use GenStat to calculate descriptive statistics, such as the mean and median, from a set of observations, and show how to produce summary tables from categorical data. Many statistics are best viewed graphically. In Chapter 2 we described how to draw histograms and scatter plots; in this chapter we show how to draw barcharts and boxplots. We also include here some standard methods for comparing groups of observations: the t-test, corresponding non-parametric tests and χ² tests.
3.1  Comparing two samples
Much of analytical statistics is concerned with comparisons, whether of categories of people in medicine or sociology, of animals or plants in biology or agriculture, or of machinery or processes in industry. We consider here only the simplest type of comparison: when there are just two samples of a single measurement. More complicated situations are covered in Chapter 5 onwards. We shall look at two samples of measurements of sulphur in the air: those taken on rainy days are compared with those taken on dry days. (The data are in Sulphur.gsh.) The intention is to explore whether there is a difference between the amount of sulphur present in the air on wet and dry days; this is quite likely on a scientific basis, because it is known that rain tends to wash sulphur out of the air. In Chapter 2 we described the importance of first exploring the structure and distribution of your data using graphical methods before performing any statistical tests. It can also be useful to look at numerical summaries. The summary statistics and graphical displays can be obtained from the Summary of Variates menu as described in Chapter 1. To produce the menu you select Summarize Contents of Variates from the Summary Statistics option of the Statistics menu. Figure 3.1 shows the summary
options that we have chosen for this example, and we have again clicked the Boxplot option to display boxplots of the samples. If we enter Sulphur into the Variate box, and Rain into the Groups box, GenStat produces the output shown below and the boxplots shown in Figure 3.2.

Summary statistics for Sulphur: Rain no

  Number of observations    =  64
  Number of missing values  =   0
  Mean                      =  12.09
  Median                    =   7.00
  Minimum                   =   0.00
  Maximum                   =  49.00
  Lower quartile            =   4.00
  Upper quartile            =  16.50

Summary statistics for Sulphur: Rain yes

  Number of observations    =  50
  Number of missing values  =   0
  Mean                      =   8.38
  Median                    =   5.00
  Minimum                   =   1.00
  Maximum                   =  38.00
  Lower quartile            =   3.00
  Upper quartile            =  10.00
The numerical summary indicates that there is indeed a higher average sulphur content in the air on dry days. The boxplot shows that the distribution of sulphur values is more squashed towards zero on wet days, although there seem to be no wet days with precisely zero sulphur. We shall now move from descriptive to analytical methods, to describe ways of carrying out formal statistical tests on the samples. The first test we shall look at is the most well-known two-sample test, the t-test. This type of test should only be used when
the distributions of the groups of observations are reasonably Normal with similar variances. However, the boxplot (Figure 3.2) shows that the distributions are very skewed, so the assumption of Normality is not justified. We could, however, try changing the scale of the data, by transforming the values to logarithms, to satisfy the assumption of Normality. As described in Section 2.7, GenStat has a general Calculate menu that can be used to transform data. To produce the menu, select the Calculate option from the Data menu. The Calculate menu, as shown in Figure 3.3, contains a box at the top into which you enter the expression, i.e. what you want to evaluate in your calculation. To calculate the logarithmic values we need to build an expression using the standard logarithmic function provided by GenStat. This function is available within the Calculate Functions menu, which can be found by clicking on the Functions button. In the Calculate Functions menu (see Figure 3.4) the logarithmic function is stored within the Transformations function class, which contains all the standard logarithmic, trigonometric and statistical transformations, as well as absolute value, integer rounding and truncation, differences, shifts and square roots. So, we select Transformations from the
drop-down list, and then Logarithm to base 10 from the Function list. This function has a single argument (the numbers to be logged), and you enter this into the X box from the Available Data box. When you click OK, the function and its argument are transferred to the main Calculate menu, and inserted into the calculation at the current cursor position. Figure 3.5 shows the expression formed for the calculation, where the name of the function within GenStat is LOG10. We shall store the transformed values in LogSulphur, by entering the name into the Save Result In box and clicking OK. Since some of the observations are zero, GenStat produces a warning dialog, as shown in Figure 3.6. This dialog offers you the opportunity to examine the warning in either the Output window or the Fault Log. Clicking on the Fault Log button will open the Fault Log window, displaying the message shown below.
**** G5W0001 **** Warning (Code CA 7). Statement 1 on Line 34
Command: CALCULATE LogSulphur=log10(Sulphur)
Invalid value for argument of function
The first argument of the LOG10 function in unit 1 has the value 0.0000
As you can see, the warning tells us that we cannot take logs of zero values. In this situation GenStat inserts a missing value in the variate LogSulphur for any zero values in the variate Sulphur. This should also warn us to be careful in interpreting the analysis, since the logarithmic transformation has to be used with caution when applied to values that may range right down to zero. When you are examining faults or warnings within the Fault Log, a double-click on the line containing the fault code (e.g. G5W0001) will take you to the same fault within the Output window. You can suppress the prompt from the Audit Trail tab of the Options menu.
To evaluate the transformation, we shall examine a boxplot of the transformed values. This time we use the Boxplot menu from the graphics wizard, which offers some additional options not available from the Summary of Variates menu. The menu (Figure 3.7) is obtained by first selecting the Create Graph option from the Graphics menu, and then selecting the Boxplot option from the Graph Type list on the first menu of the graphics wizard. There are two ways of specifying the data for a boxplot, using the radio buttons in the How are the data organized? box: either as a list of variates, or as a single variate together with a factor indicating the groupings. In our example we have the variate LogSulphur and the factor Rain containing the groupings (yes and no). So, in this menu we have selected the single variate with groups option, selected LogSulphur from the Select Data list, and selected Rain for the groups. You can specify more options, such as titles, by clicking on Next. Clicking the Finish button produces the graph in Figure 3.8. This graph shows that the distributions of the transformed data are much more symmetrical and the variances are more equal. Therefore, we can proceed to carry out a t-test.
To produce a t-test, you first click Stats on the menu bar, select Statistical Tests and click on One and two-sample t-tests. The type of t-test is selected using the Test list box. The possibilities include one-sample, unpaired two-sample and paired two-sample, each with an appropriate selection of boxes and buttons. As the sulphur data consist of two unpaired samples, we select the two-sample (unpaired) test, which produces the menu in the form shown in Figure 3.9. The two-sample (unpaired) test provides two ways of specifying the samples, using the radio buttons in the Data Arrangement box: either in two separate variates, or as a single variate together with a two-level factor. In our example we have the variate LogSulphur and two samples indicated by the two levels (yes and no) of the factor Rain. You can control the form of the test by selecting one of the Type of test options. The possibilities include One-sided (y1 < y2), One-sided (y1 > y2) and Two-sided tests. In this example we select a two-sided test. Figure 3.10 shows the T-Test Options menu (selected by clicking the Options button), which allows you to control the output produced initially from the analysis, and to choose which method to use to estimate the variance. As we described earlier, the t-test assumes that the variances are similar. The reasonableness of this assumption should always be considered in terms of the type of data. A statistical
judgement can also be made by comparing the ratio of the two sample variances against 1 using an F-test. If the variances are similar (i.e. the F ratio is close to 1), we can use the pooled estimate of variance. Otherwise it may be more sensible to use a separate estimate for each group. Here we have selected the estimation of the variance to be Automatic, so that if there is no evidence that the sample variances are unequal, the pooled estimate will be used in the test; otherwise, separate estimates will be used. The default output is as follows:

***** Two-sample T-test *****

Variate: LogSulphur
Group factor: Rain

*** Test for equality of sample variances ***

Test statistic F = 1.08 on 62 and 49 d.f.
Probability (under null hypothesis of equal variances) = 0.79

*** Summary ***

                                       Standard   Standard error
Sample   Size     Mean   Variance     deviation          of mean
no         63   0.9216     0.1544        0.3929          0.04951
yes        50   0.7494     0.1431        0.3783          0.05350

Standard error for difference of means: 0.07321

95% confidence interval for difference in means: (0.02712, 0.3173)

*** Test of null hypothesis that mean of LogSulphur with Rain = no
    is equal to mean with Rain = yes ***

Test statistic t = 2.35 on 111 d.f.
Probability = 0.020
The output contains a summary (mean, variance, standard deviation and standard error of the mean), the test statistic, a confidence interval, and an F-test for checking the assumption of equal variances. We can see that the probability under the hypothesis of equal sample variances is 0.79, so GenStat will use the pooled estimate of variance for the t-test. We can conclude that the logarithms of the sulphur values are significantly smaller on rainy days than on dry days, and so infer that the sulphur values themselves are smaller. The difference between the two means is about 0.2, so we can also say that on average the sulphur level is about 50% higher on dry days (the antilog of 0.2 is about 1.5, or 150%).
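The back-transformation in that conclusion can be checked with a quick CALCULATE sketch in command mode, using the group means from the output above; the ratio of the back-transformed means comes out at about 1.49 (text between double quotes is a GenStat comment):

```
CALCULATE ratio = 10**(0.9216 - 0.7494)  "antilog of the difference in log10 means"
PRINT ratio
```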
Instead of transforming the sulphur data, we might consider an alternative, non-parametric, test that does not make any strong assumptions about the actual form of the distributions. One possibility is the Mann-Whitney U test. This involves calculating a test statistic from the relative orders of the observations in the two categories. The statistic should be small if there is little difference between the samples; the test is made by calculating the probability of getting a value at least as extreme as that observed if there were indeed no difference. To produce a Mann-Whitney test you select Statistical Tests from the Stats menu and click on Two-sample non-parametric tests. In this menu select the Mann-Whitney U test from the drop-down list box of tests, and select the type of data arrangement to be One Set with Groups; see Figure 3.11. We select the Data set to be Sulphur and the Groups to be Rain, and click OK.

***** Mann-Whitney U (Wilcoxon Rank-Sum) Test *****

Variate: Sulphur
Group factor: Rain

Value of U: 1226.5  (first sample has highest rank score)
Exact probability: 0.033
(under null hypothesis that group no is equal to group yes)

Sample sizes: 64, 50

The output here shows that the first level of the factor (group no) has higher values than the second level (group yes), with an exact probability of 0.033 if there were no difference between the distributions in the two categories. With a probability as small as this we conclude that there is evidence of more sulphur in the air on dry days, as we found before.
3.2  Summarizing categorical data
When data are categorical rather than continuous, different kinds of summary are needed. For example, the information about wind direction cannot be summarized easily with means or quartiles. However, the idea of the histogram is still useful here, to display the numbers of observations in each category. Usually, such a picture is called a barchart rather than a histogram, the difference being that the categories are not necessarily related to a continuous scale, or even in a particular order. To draw a barchart, select the Barchart option from the Graph Type list on the graphics wizard and click Next. Figure 3.12 shows the Data menu for a barchart. Now select the factor Winddir from the Select Data list, click on the downwards arrow to include it in the Data window, and then click on the Finish button. The resulting barchart is illustrated in Figure 3.13. This differs from the previous form of histogram in the type of labelling, and in the fact that the bars are slightly separated. This picture shows clearly how the wind direction varies, emphasizing that the prevailing wind is from the south-west. The numbers shown graphically in this
picture can also be printed in tabular form by selecting Frequency Tables from the Summary Statistics menu. The Frequency Tables menu (in the Summary Statistics section of the Stats menu on the menu bar) provides a way of tabulating frequency counts for grouped data; see Figure 3.14. Here we want to print a table of frequency counts of the occurrences of each level of the factor Winddir, so we need to enter Winddir into the Groups box.

              Count
winddir
E                 8
N                 8
NE               11
NW               11
S                15
SE               11
SW               28
W                21
Unknown           1
The display here is called a table. This is only a one-way table – that is, it is classified by only one factor (Winddir) – but GenStat allows two-way tables, as you will see in Section 3.4. In fact tables involving up to 9 factors can be produced. GenStat has a special table structure to store such information, because it is central to much statistical work. Tables of frequency counts can be formed by specifying a name to save the frequencies in the same menu. You can display a table within a spreadsheet using the Display in Spreadsheet option.
3.3 Summarizing data by groups
You can form tables of summary statistics for grouped data using the Summary by Groups menu. To produce the menu, select Summary by Groups (Tabulation) from the Summary Statistics option of the Stats menu. The Summary by Groups menu provides many types of summary, which are all selected from the Type of Summary box. In Figure 3.15 we have selected the Means and Variances boxes, selected Sulphur as the Variate and Winddir for the Groups. Clicking OK produces the following table.

Winddir       Mean   Variance
E             9.88       34.4
N            18.88      232.1
NE           14.27      138.0
NW           17.27      145.4
S             8.13       92.0
SE           11.36      155.3
SW            4.71        9.3
W            10.81       90.9
Unknown      10.00
This shows two one-way tables, for the means and variances, printed in parallel. As with the Frequency Tables menu, you can form up to nine-way summary tables. Tables of summaries can also be saved, using the Summary by Groups Save Options menu obtained by clicking the Save button.
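A command-language equivalent can be sketched with the TABULATE directive. The PRINT=mean setting is shown in Section 3.6; adding the variance setting alongside it is our assumption, by analogy with the menu's Means and Variances boxes:

    TABULATE [PRINT=mean,variance; CLASSIFICATION=Winddir] Sulphur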
3.4 Association between categorical variables
Relationships between categorical variables cannot be displayed in the same way as relationships between continuous variables. For example, it would not be possible to draw a useful scatter plot of wind direction against rain status. However, it is still reasonable to look for relationships between such variables. One way to display the relationships is to tabulate the numbers of observations in the categories. Figure 3.16 shows how to tabulate the number of rainy and dry days for each wind direction using the Frequency Tables menu.
Count         winddir
rain              E      N     NE     NW      S
no                7      5      5      8      5
yes               1      3      6      3     10

              winddir
rain             SE     SW      W
no                7     14     13
yes               4     14      8

Unknown           1
This indicates that there appears to be a higher proportion of rainy days when the wind is from a southerly direction. Notice that GenStat automatically divides the display into sections if it is too wide to be shown in a single array. The "Unknown" entry reminds us of the missing value in the Winddir factor. The layout of the display is also affected by the order of the factors listed in the Groups box: if there is more than one factor, GenStat displays the categories of the last factor across the page, and all the others down the page. In this example, a clearer display is achieved by reversing the order (Winddir and then Rain).
Count        rain
winddir        no    yes
E               7      1
N               5      3
NE              5      6
NW              8      3
S               5     10
SE              7      4
SW             14     14
W              13      8
Unknown         1
If you wished to test the apparent association between these two factors, you could carry out a chi-square test. This attempts to evaluate whether there are any significant differences in the proportion of rainy days for each wind direction; or, equivalently, whether there is a significantly different distribution of directions on wet and dry days. To produce a chi-square test you first need to click Stats on the menu bar, then click on Statistical Tests and select Contingency Tables. The type of test is selected using the drop-down list. When you select Chi-square test, the menu takes the form shown in Figure 3.17. The Data Arrangement option allows you to specify the form of the data. You can provide the data in a two-way table; this can be a table saved from the Frequency Tables menu, or you can create a table by clicking on the Create Table button. Alternatively, you can supply the row and column factors, or a single variate containing the counts together with two grouping factors to identify the rows and columns. The Method option allows you to select between two ways of calculating the test: Pearson uses the familiar method; the alternative, which may be more accurate, uses maximum likelihood.
In Figure 3.18 we have selected the Pearson method and the Row and Column Factors data arrangement. We have entered Winddir as the row factor, Rain as the column factor, and the name nraindir to save the resulting two-way table. Clicking the OK button produces the output shown below.
******** Warning from CHISQUARE: some cells in the table of expected values have less than 5.

Pearson chi-square value is 9.21 with 7 df.

Probability level (under null hypothesis) p = 0.238
The output contains a warning that one or more of the factor combinations (or cells) have an expected value less than 5. This means that the chi-square test may not be valid, since it is based on an approximation that is reliable only for large numbers. Even if it were valid, the probability (0.238) is too large to conclude that there is a significant association, even at the 20% significance level. A general rule of thumb with a chi-square test is that if fewer than 20% of the cells have expected values less than 5, and all expected values are greater than 1, then you can cautiously accept the results of the test. You can view the expected values by selecting the Expected Values display option from the Chi-Square Options menu, or by displaying them in a spreadsheet using the Save menu (Figure 3.19). To view the expected values in a spreadsheet
select the Save button, which opens the menu shown in Figure 3.19. Selecting the box for Expected Values enables a window (entitled In:) into which you enter the identifier of the structure in which the information is to be saved. In Figure 3.19 we have selected this box and entered the name texpected. To display this structure in a spreadsheet we have selected the Display in Spreadsheet box. Clicking OK returns us to the main Contingency Table menu, and clicking OK on this menu reruns the analysis and displays the expected values in a spreadsheet, as shown in Figure 3.20.
You can perform a chi-square test on a one-way table of counts using the Chi-square Goodness-of-fit menu (accessible through the Statistical Tests menu). In this case the test assesses whether the values in the various cells of the table are different; for example, we could test whether all the wind directions are roughly equally represented. There is a more general way to investigate relationships between categorical variables, using log-linear models. These are not covered in this book, but the analyses can be obtained using the Generalized Linear Models menu (accessible through the Regression Analysis section of the Stats menu).
The Contingency Table menu also offers Fisher's exact test of association between the factors classifying a 2×2 table of counts. The output from a Fisher's exact test includes the one-tailed significance level from the exact test, together with the mid-p value (which contains only half the probability of the observed data table). In addition, two methods are used to calculate a two-tailed significance level: the first simply doubles the one-tailed significance level, whereas the second calculates the cumulative probability of all outcomes that are no more probable than the observed table. Mid-p values are produced for each of these methods.
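The same test can be reproduced with commands (these are the forms given in Section 3.6): TABULATE saves the two-way table of counts, and CHISQUARE then performs the test on the saved table:

    TABULATE [CLASSIFICATION=Rain,Winddir] Sulphur; \
       NOBSERVATIONS=nraindir
    CHISQUARE nraindir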
3.5 Transferring output to other applications
When writing documentation you may want to display a table of means or counts, or a section of output, within the text. One way in which you can transfer this information to another application is to use the standard Cut, Copy and Paste options available in the Edit menu. In addition, GenStat has facilities that allow you to copy data from a spreadsheet or text window into a table using either RTF or HTML formats. These tables can then easily be inserted into other applications such as MS Word or MS Explorer. To illustrate this we will copy the table of counts for wind direction and rain from the sulphur data, and insert it into a table within MS Word. We first need to load the table into a spreadsheet, which you can do by selecting the Data in GenStat option from the New section of the Spread menu. This produces the menu shown in Figure 3.21, where we have selected the Table option from the Type of Spreadsheet radio buttons, and double-clicked on the nraindir data structure to enter it into the Data to Load field. Clicking on OK will produce the spreadsheet shown in Figure 3.22. The table that we wish to copy thus consists of 3 columns and 9 rows, including the row and column titles.
Figure 3.22
To produce a table in MS Word, you need to select the RTF Table option from the Copy Special option in the Edit menu on the menu bar. This produces the menu shown in Figure 3.23, which allows you to specify how the table is to appear in your Word document. In this example we use the default settings, so clicking OK will copy the table to the clipboard. In the Word document, put the cursor in the position where you want the table to appear, and select Paste from the Edit menu on the menu bar. The table appears in Word in the format shown in Figure 3.24, and you can change the table attributes within Word to your requirements.
Figure 3.24
In this first example we copied from a spreadsheet of type table; however, you can also apply this method to other types of spreadsheet, such as vector or scalar.
In our second example we copy a table from the Output window into Word. To display the table in the Output window, we first open the Data Display menu (Figure 3.25) by selecting the Display option in the Data menu on the menu bar. Then click on the Tables folder, select nraindir in the right-hand window, and click on the Display button to obtain the Display Data in Output Window menu, as shown in Figure 3.26.
Figure 3.25
Clicking on OK in Figure 3.26 produces the output shown in Figure 3.27. We now select the output that is to appear in the table, using the mouse. Alternatively, you can make a selection by holding the Shift key down and pressing the arrow keys. Figure 3.27 shows the selected rows of the output that are to be transferred into the table. Now selecting RTF Table from the Copy Special option on the Edit menu produces the menu shown in Figure 3.28.
Figure 3.27
GenStat output uses spaces to separate the different cells in a table. To produce RTF tables, the spaces between the columns need to be converted to tab characters. In Figure 3.28 we have specified that runs of 2 or more spaces are to be converted to tabs. Clicking on OK then opens the menu for controlling the appearance of the table, as shown in Figure 3.29. Leaving the default settings and clicking on OK copies the table to the clipboard. In Word, you put the cursor in the position where the table is to appear, and select Paste from the Edit menu. Figure 3.30 shows the resulting table in Word.
Figure 3.29
Figure 3.30
3.6 Commands for basic statistics
You can also generate the analyses and graphs described above directly using GenStat commands. The barchart in Figure 3.13 can be produced by the DHISTOGRAM directive as follows:
DHISTOGRAM winddir
The boxplots can be obtained using the BOXPLOT procedure. For example,
BOXPLOT [TITLE='sulphur pollution'; \
   AXISTITLE='logarithm of sulphur'] \
   LogSulphur; GROUPS=Rain
produces the picture in Figure 3.8. The METHOD option controls the type of boxplot, the TITLE option supplies a general title for the picture, and the AXISTITLE option gives the title for the axis. The GROUPS parameter indicates that separate boxplots are to be produced for each level of the factor Rain. Boxplots can also be drawn in line-printer style by setting option GRAPHICS=lineprinter. Descriptive statistics are produced by the DESCRIBE procedure. The default output is the same as that generated by the Summary of Variates menu in Figure 3.1. The SELECTION option controls which statistics are given; for example, you could specify
DESCRIBE [SELECTION=mean,var,skew,kurtosis] Sulphur
to get the variance, skewness and kurtosis of the sample as well as the mean, and exclude the other information shown by default. To produce a summary using groups you need to specify a factor in the GROUPS option; for example
DESCRIBE [SELECTION=mean,var,skew,kurtosis; \
   GROUPS=Rain] Sulphur
The frequency tables are generated by the TABULATE directive. The menu in Figure 3.14 generates the command TABULATE [PRINT=count; CLASSIFICATION=Winddir]
and that in Figure 3.16 generates TABULATE [PRINT=count; CLASSIFICATION=Rain,Winddir]
The CLASSIFICATION option specifies the factors that are to be used to classify the table, and the PRINT option indicates what output is to be printed. You can also produce tabular summaries of the values within a variate, as in the Summary by Groups menu (Figure 3.15). For example,
TABULATE [PRINT=mean; CLASSIFICATION=Rain] Sulphur
TABULATE also has options and parameters that allow the information to be saved
in GenStat table structures. The command TABULATE [CLASSIFICATION=Rain,Winddir] Sulphur; \ NOBSERVATIONS=nraindir
saves the numbers of observations in each of the categories of rain and wind direction in a table called nraindir. The CHISQUARE procedure can then be used to perform a chi-square test of association between the factors (Figure 3.18): CHISQUARE nraindir
The Mann-Whitney test is performed by the MANNWHITNEY procedure. For example, MANNWHITNEY [GROUPS=Rain] Sulphur
compares the values of sulphur in the two groups defined by the factor Rain (see Figure 3.11). Similarly, t-tests are performed by the TTEST procedure:
TTEST [METHOD=twosided; CIPROB=0.95; VMETHOD=automatic; \
   GROUPS=Rain] LogSulphur
compares the values in the variate LogSulphur (Figure 3.9). With both of these procedures, as in the menus, you can also specify the data values for the groups in two separate variates, for example, TTEST sample1; sample2
to compare the values in the variates sample1 and sample2. The values of sulphur are transformed using the CALCULATE directive and the standard logarithmic function (LOG10): CALCULATE LogSulphur = LOG10(Sulphur)
3.7 Exercises
3(1) The file Music.xls contains data taken from a survey of the musical tastes and other related characteristics of 227 college students. Investigate who buys the most CDs per year: those primarily interested in intellectual growth, or those primarily interested in social growth. Display the distributions for the two groups in both a histogram and a boxplot. What observations can you make about these distributions? Who is prepared to pay the most for a good quality stereo, a male or a female student? How much is he/she prepared to pay? Who owns the smallest number of CDs/cassettes in total, students who play an instrument or students who do not? Display the distributions of the two groups in a boxplot. What observations can you make?
3(2) Twelve people were tested to investigate the relationship between reflex blink rate and the difficulty of performing visual motor tasks. Here are their recorded blink rates during a simple and a difficult task:
Subject 1 2 3 4 5 6 7 8 9 10 11 12
Simple 14.0 19.5 8.2 8.5 12.1 8.0 8.2 10.1 5.5 10.1 7.2 5.6
Difficult 5.0 6.6 1.9 1.5 1.1 2.5 0.6 0.5 0.5 3.1 2.1 1.6
These results are stored in two variates, one for each task, in the spreadsheet file Blink.gsh. Display the two distributions side-by-side using boxplots. Form and display the means and standard deviations for each task. Carry out a test of the hypothesis that subjects have the same blink rate in the two tests. You need to use a paired test because the same subjects were used for both tasks: if you decide to avoid strong assumptions, use the Wilcoxon test procedure rather than the Mann-Whitney test. 3(3) A farmer wants to test the difference in yield between a new wheat variety and a standard wheat variety. The yields (tonnes per hectare) are: New: 2.5, 2.1, 2.4, 2.0, 2.6, 2.2
Standard: 2.2, 1.9, 1.8, 2.1, 2.1, 1.7, 2.3, 2.0, 1.7, 2.2
Using the data from Vartest.gsh, form and display the means and standard deviations for each variety, and display the distributions side-by-side using boxplots. Carry out a test of the hypothesis that the new variety is producing higher yields than the standard variety. You need to use a one-sided (two-sample) unpaired t-test.
3(4) File Learn.gsh contains data from a study examining whether rats can generalize learned imitation. Five rats were trained to imitate leader rats in a T-maze in order to attain a food incentive. These were then compared against four control rats in a new training situation. The comparison is in terms of how many trials each rat took to reach a criterion of 10 correct responses in 10 trials. Carry out a Mann-Whitney test of the hypothesis that the number of trials needed to reach the criterion is the same for rats previously trained to follow a leader to a food incentive as for rats not previously trained.
3(5) File Pet.gsh contains a small school survey on children with pets. The three columns indicate the sex of the child (boy or girl), whether or not they have a pet, and their ages. Produce a two-way table showing the mean ages of the children of each sex with and without pets. Carry out a chi-square test of the hypothesis that equal proportions of boys and girls have no pets. Use the Create Table button on the chi-square menu to form and save a new two-way table of the numbers of girls and boys with and without pets. Load this two-way table of counts into a spreadsheet. Copy the table to the clipboard using the Copy Special options, and paste the table into Word.
4 GenStat spreadsheet
GenStat has extensive spreadsheet facilities for data entry, import, export and manipulation. Initially we show how you can use the spreadsheet for data entry and verification. In an analysis you may sometimes want to work with subsets of your data, and we describe how these can be created using the spreadsheet. The data may not always be stored in a convenient form, or may require rearranging before analysis. We demonstrate some of the relevant spreadsheet data manipulation techniques, such as appending, merging, stacking and unstacking data. In the current Windows environment, data exchange between different applications is becoming much easier, and we show how GenStat can be used with other programs using Dynamic Data Exchange (DDE). Finally, we describe how you can use GenStat's ODBC facilities to read and write data to databases.
4.1 Entering data into a spreadsheet
The data shown below are taken from an experiment in New Zealand. Twelve sheep were divided into 4 "flocks" to follow 3 different drench programs. The initial weights of the sheep were recorded, and, after they were run for 3 months on their respective programs, their final weights were recorded.

Treatment         Rep   Weight in Kilograms
                        Initial     Final
Control             1        38        48
Control             2        31        42
Control             3        37        48
Control             4        34        41
Drenched once       1        36        52
Drenched once       2        35        50
Drenched once       3        38        52
Drenched once       4        32        49
Drenched twice      1        33        53
Drenched twice      2        34        49
Drenched twice      3        39        66
Drenched twice      4        36        57
To enter the data into a new spreadsheet, click on Spread in the menu bar and then click on New and Create, as shown in Figure 4.1. All the other options of the main Spread menu will be grey rather than black at this point, to show that they are not yet available (since the other menu options can only be selected for existing, active spreadsheets). This opens a menu containing a list of icons defining several types of spreadsheet that can be created. The last 6 icons in this list allow you to create blank spreadsheets for different types of data. The default spreadsheet type is a Vector spreadsheet, which allows columns of variates (numerics), texts (labels) and factors (grouped data) of equal length to be displayed simultaneously within a spreadsheet. The data in our example will be in columns (or vectors) of variates and factors, so we have selected the Vector Spreadsheet icon, as shown in Figure 4.2. For a Vector spreadsheet you need to specify the number of rows and columns in the boxes provided. For this example we have entered 4 columns and 12 rows. It does not matter if you do not know the number of rows and columns
needed initially for entering your data: you can easily insert or delete rows or columns later. Clicking the OK button produces a blank spreadsheet in a new window, as shown in Figure 4.3. By default, the 4 columns are initially created as variates, and all the values are set to missing values, represented by asterisks. The columns are labelled by default as C1, C2, C3 and C4. If you enter data under these column names and transfer it to GenStat, four data structures will be created with the identifiers C1, C2, C3 and C4. It is good practice to assign your own descriptive names to the columns. A column name must start with a letter or %, and the remaining characters can only be alphanumeric (A-Z, a-z, 0-9), '%' or '_'. If you do use an illegal character in a column name, GenStat will convert these characters to valid ones.
To rename the columns, select the Column option from the Spread menu and then select Rename. Alternatively, press the Alt, Control and F9 function keys simultaneously. This opens the menu shown in Figure 4.4. To rename a column, simply select the column name to change in the list and enter the new name in the box at the top of the menu. Pressing the Enter key applies the change and alters the name in the list of Column Names. In Figure 4.4 we have renamed C1 as Drench. An alternative way to rename a column is to move the cursor over the beginning of the column heading until it changes to a pencil, as shown in Figure 4.5. Clicking the left mouse button then opens a dialogue where you can enter a new name. Figure 4.6 shows the column C2 being changed to Rep. In the same way we can change C3 to Lwt1 and C4 to Lwt2.
The column Drench contains grouped data, so we need to specify that the column is to be a factor before entering the labels. To convert the column to a factor, click anywhere on the column using the right button on the mouse.
Figure 4.8
Figure 4.7
This pops up the menu shown in Figure 4.7. Selecting the option Convert to Factor opens the menu shown in Figure 4.8. GenStat has recognized that this is a new column that is being converted to a factor, and provides a menu to specify the levels and labels. The column Drench has three groups: Control, Drenched once and Drenched twice, so we have entered 3 in the Number of Levels box. We now want to change the labels to represent the three groups.
Figure 4.10
Figure 4.9
Clicking on the Labels button opens the menu in Figure 4.9. In this menu we enter Control for group 1 and press return to apply this label. Now, either using the mouse or the down arrow, we select group 2 and label this Once. We then select group 3 and label this group as Twice. Clicking OK on this menu returns us to the factor conversion menu. Clicking OK here returns us to the spreadsheet where the column name now appears in italics and has a red ! at the start of the name (see Figure 4.10).
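For completeness, the same declaration can be sketched in the command language. This is an assumed form, not taken from this chapter: the FACTOR directive declares a factor, and !t(...) supplies an unnamed text of labels; check the exact syntax in the GenStat reference documentation before relying on it.

    FACTOR [NVALUES=12; LABELS=!t('Control','Once','Twice')] Drench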
Figure 4.11
The labels can now be entered either by typing in the label name, or by double-clicking on a cell and selecting the appropriate label from the list, as shown in Figure 4.11. Alternatively, if you type the first character of the label and move to another cell, GenStat will fill in the rest of the name. For example, the letters C, O and T would be enough to specify the factor labels Control, Once and Twice respectively. Note that if labels begin with the same character, you will need to type as many characters as are required to distinguish between them. For example, if the factor contained 2 labels called Farm and Field, then you would need to enter the first 2 characters (Fa or Fi) before moving on to a new cell. Figure 4.12 shows the column complete with the new factor labels.
Figure 4.13
Figure 4.12
We now enter the data into the columns Lwt1 and Lwt2. We click on the cell for the first row of Lwt1, enter the value 38, and move to the next cell using the Enter key (alternatively you can use the down-arrow key). We then type the value 31 in the second cell, and so on. If a mistake is made and we want to edit individual characters within a cell, we can double-click on the cell. For example, double-clicking on the first cell in the column Rep opens the dialogue box shown in Figure 4.13, where we can change the value. The column Rep contains patterned data, with the values 1, 2, 3 and 4 repeated 3 times. There is a menu available to fill a column automatically with patterned data like this. Selecting Fill from the Calculate option of
the Spread menu opens the menu shown in Figure 4.14. We have selected the column Rep and have entered the Start Value as 1, and the Ending Value as 4. Using the default option Fill to Bottom, the values 1,2,3,4 will be recycled until the bottom of the column. Clicking OK produces the spreadsheet shown in Figure 4.15.
Figure 4.14
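The recycled pattern can also be generated with commands. As a hedged sketch (not from this chapter): assuming Drench and Rep have been declared as factors with 3 and 4 levels respectively, the GENERATE directive fills their values in standard order, which gives Rep the repeated 1, 2, 3, 4 pattern:

    GENERATE Drench,Rep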
Figure 4.15

4.2 Data verification
When data are entered into a spreadsheet it is easy to mistype or enter an incorrect value. GenStat provides a facility for data verification within the spreadsheet through double entry of the data. In this example we demonstrate how to check that the data within the columns Lwt1 and Lwt2 have been entered correctly. First we select Verify from the Sheet option of the Spread menu, which opens the menu shown in Figure 4.16. The columns that are to be verified are chosen by double-clicking on the name of the column in the list (or by selecting the column name and clicking the Verify button). When a column has been selected for verification its name will be prefixed by the characters 'V:' in the list. In Figure 4.16 we have double-clicked on the names Lwt1 and Lwt2 to specify that they are to be verified.
Clicking on OK changes the columns Lwt1 and Lwt2 in the spreadsheet to display three minus (-) characters in place of the values (see Figure 4.17). To verify the data, we now re-enter the values in these cells. First, we enter the value 38 in the first row of the column Lwt1 and press Enter to move the cursor to the next cell. As the value is correct, it is redisplayed. Now in the second row we enter the value 33 and move to the next cell. On moving to the next cell, GenStat recognizes that the value is different from the original value entered, and displays the menu shown in Figure 4.18. Here you can specify the correct value, and can add a comment to the cell if required. In our example we decide that the correct value should be 33, so we click on the Typed button to register this value. On clicking the Typed button, a new spreadsheet is created containing a record of the mismatch in the data entry (see Figure 4.19). Each row within the new spreadsheet contains details of the column name, the row, the original value, the value typed and the new value. Any further mismatches in the data entry will be appended to this spreadsheet. We then complete the data verification by entering the remaining values for the columns Lwt1 and Lwt2.
On entering the last value of the verification in row 12 of the column Lwt2, we are prompted with the menu shown in Figure 4.20. This menu allows you to set the verified columns as read-only, to protect them from any further changes. Clicking on the Yes button changes the columns Lwt1 and Lwt2 to read-only, and provides a visual indication of this by changing the background of the column titles to blue. You can set or remove the protection for a column at any time using the Column/Sheet Protection menu. To remove the column protection on the columns Lwt1 and Lwt2, select Protection from the Column option of the Spread menu. This opens the menu shown in Figure 4.21. The columns within the spreadsheet are listed on the left of the menu, and a protected column is identified in the list by the prefix 'P:'. So, to remove the protection on the columns Lwt1 and Lwt2, double-click the names in the list to remove the prefix. Alternatively, selecting Lwt1 and Lwt2 within the list and clicking on the Unprotect button will remove the protection. You can protect a column in a similar way, by double-clicking the name in the list or by clicking the Protect button to include the prefix indicating that the column is to be protected. Clicking OK returns you to the spreadsheet and removes the blue background from the column titles.
Comparing spreadsheets is another form of data verification. You can compare two open spreadsheets within GenStat, or you can compare a currently open GenStat spreadsheet with data from a foreign data source. For example, you could compare an open spreadsheet with another spreadsheet saved in GSH (GenStat Spreadsheet) format, or with data in an Excel file. The data set shown earlier in this section can also be found in the GenStat spreadsheet file called Drench.gsh. To illustrate the spreadsheet comparison facilities, we will now compare the data we have entered with the data in the file. Selecting Compare from the Sheet option of the Spread menu opens the menu shown in Figure 4.22. The Data Source option identifies where the data that you wish to compare are located. The data we are comparing against are in a file, so we select File. You need to enter the name of the file into the space provided; alternatively, you can select the file by clicking on Browse. The remaining options on the menu control how the comparison is to be done. Leaving the default settings and clicking on OK pops up a menu (Figure 4.23) to warn that the sheets are different, and prints a report in the Output window, as shown below. There are two differences between the spreadsheet and the file Drench.gsh. The first difference reported is the record where the data value was changed during the data verification. The second indicates that in the current spreadsheet the column Rep is a variate, but in the file Drench.gsh this column has been saved as a factor.

"Comparing Spreadsheets: New Data and Drench.GSH
Column Types don't match: Rep = Variate vs Rep = Factor
Mismatch on Lwt1 at row 2: 33 <> 31
Spreadsheets are different! "
4.3 Inserting and deleting rows or columns
Columns and rows can be deleted using the Delete options on the Spread menu. To delete the column Rep, click anywhere on the column and select Current Column from the Delete option on the Spread menu. Alternatively, you can use the mouse to delete the column Rep: click on the column title using the left button and hold the mouse down. The cursor will now appear as a hand containing a dashed box. Drag the cursor to the edge of the spreadsheet and release the mouse (see Figure 4.24). Clicking Yes on the confirmation dialogue will delete the column. Rows can be deleted in a similar way: first click on the row number, and then drag it outside the spreadsheet. Figure 4.25 illustrates this being done with row 2. You can select and drag multiple rows (or columns) for deletion in a similar fashion.
New columns and rows can be inserted using the facilities available within the Insert options on the Spread menu. To insert a new row at the bottom of the spreadsheet, click on the last row of the spreadsheet and select Row after Current Row from the Insert option on the Spread menu. This will add a new row, as shown in Figure 4.26. New values default to missing values (represented by asterisks). To insert a new column, click in the Drench column and select Column after Current Column from the Insert option
108
4 GenStat spreadsheet
on the Spread menu. This opens the menu shown in Figure 4.27, where you can choose what type of data structure the new column will be, and set an initial value for each cell. Selecting Va riate from the C olum n Type , entering the name ID and clicking OK produces a new column, initialized with missing values, as shown in Figure 4.28. Another way of inserting a new column is to create a duplicate column. So, for example, if we want to duplicate the column Lwt1, we can select Duplicate from the Column options on the Spread menu. This opens the menu shown in Figure 4.29 where we have selected the column Lwt1 and entered a new name for the duplicate column in the N ew C olum n N am e field. You can create the duplicate column as a different type using the New Type options. Figure 4.28
Selecting the New Type as Variate and clicking OK inserts the duplicate column into the spreadsheet as shown in Figure 4.30.
4.4 Defining subsets of data values
When dealing with a large set of data, you often need to be able to select a subset of values to study, either temporarily or for the remainder of a session. For example, with the pollution data in Chapter 3, we might want to concentrate only on the rainy days and draw a picture of the distribution of sulphur measurements. GenStat caters for this by allowing you to impose restrictions (filters) to define subsets of vectors (variates, texts or factors). The vectors keep all their original values, but subsequent commands working with the vectors will restrict their attention to the subset. One way of doing this is provided by the GenStat spreadsheet. For example, suppose that for the drench data we wish to display a list of the sheep whose final weight is less than 51 kilograms. First, we close the spreadsheet we have been working with, and form a new spreadsheet containing only the columns Drench and Lwt2. Selecting New and then Data in GenStat from the Spread menu generates the Load Spreadsheet menu in Figure 4.31. In this menu we select Drench and Lwt2 as the data to load. The resulting spreadsheet is shown in Figure 4.32. We now generate the menu in Figure 4.33 by selecting Restrict/Filter and By Logical Expression from the Spread menu. The Restrict Spreadsheet menu allows you to restrict or filter the data within a spreadsheet based on a logical expression. The Expression boxes define the condition, and the Restriction Type buttons indicate whether the restriction is formed by including or excluding the units (or rows) that satisfy the logical condition.
In our example we want to include within the restriction all the units where the values in the Lwt2 column are less than 51. To create the expression for this restriction we double-click Lwt2 from the Columns list into the first expression box. We then double-click the Less Than option from the Comparison list, which puts a ‘<’ symbol into the expression box, and then type 51. We select the Include option from the Restriction Type box and click OK. The resulting spreadsheet, shown in Figure 4.34, now shows only the requested subset of units. When we use these vectors in future, until we cancel the restriction, operations will be restricted to just the specified set of units. (This applies both to operations with menus and with commands.) This is illustrated in Figure 4.35, where we use the PRINT directive to print Lwt1 and Lwt2. Notice that, even though we only included Lwt2 in our restriction, the restriction is applied to both vectors because they are printed in parallel: initial weights of the sheep (Lwt1) are displayed only for the filtered values of the final weights (Lwt2).
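The printing shown in Figure 4.35 uses the standard GenStat PRINT directive. As a minimal sketch of the statement involved (the exact output layout will depend on your settings, so none is shown here):

```
"with the restriction on Lwt2 in force, printing the two
 vectors in parallel shows only the restricted units"
PRINT Lwt1,Lwt2
```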
The restricted units are not discarded, and can be viewed in the spreadsheet in an alternative colour. To do this select Display Restricted Rows from the Restrict/Filter options on the Spread menu. This will display all the rows in the spreadsheet, but with the restricted rows shown in red (the default colour); see Figure 4.36. You can also toggle the display of the restricted rows by clicking the ‘+’ button in the top right-hand corner of the spreadsheet (positioned above the scroll bar, as indicated by the cursor shown in Figure 4.36).
As the restricted units are not discarded, we can also change the restriction to look at some other set of units, or impose a further restriction. For example, say we now want to add to our restriction the condition that the sheep's treatment was to be drenched once. To combine a new restriction with the existing restriction, we could use the Restrict Spreadsheet using an Expression menu again or, alternatively, as the column Drench contains grouped data (a factor), we can use the Restrict on Factor menu, as shown in Figure 4.37. To open the menu, select the To Groups (factor levels) item from the Restrict/Filter option on the Spread menu. This menu displays the labels or levels of a factor that you can select to filter the data by. Select Once from the Selected Levels and the Include option from the Restriction Type. To combine this restriction with the current subset, select the Combine with New setting from the Existing Restrictions options. Clicking OK produces the spreadsheet shown in Figure 4.38. In creating our subset of data we have created one subset using a logical condition and then further restricted this set using a second condition. Using the Restrict Spreadsheet using an Expression menu you can instead create a restriction by combining the two logical conditions into a single condition using both the expression boxes.
Figure 4.39 shows how to do this for our example. First, we remove the current restriction by clicking on Remove All, to ensure we are using the complete set of data. Now, as before, we enter the condition for Lwt2 less than 51 in the first expression box. Then, in the second box, we enter the condition for the factor restriction: Drench.in.'Once'. The ".in." operator, which is explained in Section 2.7, can be inserted by double-clicking on Inclusion in the list of Comparisons. To combine these two conditions we have selected the And option between the boxes; that is, we want to include in our restriction sheep whose final weight is less than 51 kilograms and that have been drenched once. Clicking on OK produces the same spreadsheet as shown in Figure 4.38. To restore the data to their original form at any time you need to remove the restriction applied to the data. You can do this by selecting Remove All from the Restrict/Filter option of the Spread menu. If you want to store a subset of the units in a vector, rather than restricting the original data set, you can use the Subset menu, which is opened by selecting the Subset option on the Data menu. You can also define the restriction by specifying the rows in the spreadsheet explicitly. The rows are selected using the Select line of the Spread menu, and the Restrict/Filter menu then allows you to indicate how these are to generate the restriction. Whichever way the restriction is defined within the spreadsheet, it is imposed within GenStat using the RESTRICT directive; this provides an alternative if you wish to define very complicated restrictions, or to restrict vectors too large to be displayed in a spreadsheet.
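As a hedged sketch of the RESTRICT commands that correspond to the menu actions above (the CONDITION parameter name and exact form are assumptions to check against the statements GenStat writes to the Input Log on your release):

```
"restrict to sheep drenched once whose final weight
 is below 51 kg (sketch only)"
RESTRICT Lwt2; CONDITION=(Lwt2 < 51) .AND. Drench.IN.'Once'

"a RESTRICT with no condition cancels the restriction"
RESTRICT Lwt2
```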
4.5 Sorting data
The spreadsheet allows you to re-order the units of a list of vectors according to one or more index vectors. To illustrate this we will reintroduce the spreadsheet Drench.gsh. First we close all the currently open spreadsheets, either by using the Close option on the File menu or by clicking the “X” button at the top right-hand corner of the spreadsheet windows. We now open the file Drench.gsh using the Open option of the File menu, which produces the spreadsheet shown in Figure 4.40. We now want to sort the data in the spreadsheet so that the final weights are in ascending order. To do this, select Sort from the Spread menu; this produces the menu shown in Figure 4.41.
We have selected Lwt2 from the Sort on Column list to be the index for the sort, and selected Ascending from the Order options. Clicking OK produces the spreadsheet shown in Figure 4.42, where the rows are re-ordered so that the values in the Lwt2 column are in ascending order. If you have textual columns you can sort these alphabetically. You can also do multi-column sorts, where you specify an ordering based on a series of columns. The rows are then sorted using the first column; rows that have equal values in the first column are sorted according to the second column, and so on. To illustrate this we will sort the data into alphabetical order of Drench, and then into ascending order of Lwt1 within each drench group. Using the Sort menu we select Drench as the first column to sort by, and select Labels from the Sort Factors By option to sort the factor in order of its labels. Selecting the Multicolumn option adds the text Key;1 to the column Drench in the Sort on Column list; Key;1 tells us that this is the first column that we are going to sort by. Now select the column Lwt1; this adds Key;2 to its name, telling us that this is the second column by which the data will be sorted (see Figure 4.43). Clicking on OK produces the spreadsheet shown in Figure 4.44.
You can also sort a selection within a spreadsheet. For example, to sort the final weights for Rep 4 in descending order we first need to select the rows containing Rep 4. To make a multiple selection, click on the first row of the selection, then hold the Ctrl key down and click on the second row, and so on (keeping the Ctrl key pressed). Figure 4.45 shows the selection of all the rows for Rep 4. Opening the Sort menu when a selection has been made enables some additional options at the bottom of the menu, as shown in Figure 4.46. Select Lwt2 from the Sort on Column list, Descending from the Order options and Selected from the Rows to Sort options. When sorting a selection of rows it is useful to group the results together to see how they have been sorted; you do this by selecting Place sorted rows at bottom of sheet from the Rows to Sort options at the bottom of the menu. Figure 4.47 shows the results of this process on our example spreadsheet. Other facilities for sorting data are provided by the GenStat SORT directive.
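As a hedged sketch of the command-language equivalent of the two-key sort described earlier (the INDEX option is standard for SORT, but verify the exact statement against the Input Log on your release):

```
"sort all four columns in place, using Drench as the
 first key and Lwt1 as the second key (sketch only)"
SORT [INDEX=Drench,Lwt1] Rep,Drench,Lwt1,Lwt2
```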
4.6 Data manipulation
Before any statistical analyses are performed the data may have to be manipulated into the correct form required for the analysis. This can sometimes be time consuming and awkward. We now show some advanced data handling techniques that make data manipulation easier.
We first show how to append data to a spreadsheet. This is particularly useful when you have data split across two data files, or on separate worksheets within a workbook. The following example demonstrates how to append data that are stored on different worksheets within an Excel file. The file Toysales.xls contains a subset of a toy company's yearly sales data, over three years, for toy dogs and toy kittens. The data set includes the location of the shop, the number of toys sold and the price per unit. The worksheet Dog Sales contains the figures for the toy dogs, the worksheet Kitten Sales contains the figures for sales of the toy kitten during the same period, and the worksheet Dog and Kitten Sales contains data on both of these. First, we load the toy dog sales data into a spreadsheet. We click on the Open option of the File menu, select the file Toysales.xls and click on OK. This uses the Excel import wizard, described in Section 2.1. Here we simply need to select the worksheet Dog Sales in the Select Excel Worksheet for Import menu (Figure 4.48), and click Finish. The resulting spreadsheet is shown in Figure 4.49. To append the data for the toy kitten sales we need to use the Append Data to Sheet menu (see Figure 4.50), which is opened using Append from the Manipulate options on the Spread menu. We select File as our data source and use the Browse button to select the file Toysales.xls and put the name and path of the file into the File box. We then select Name for the Match Columns by option, as we want to match the columns from the GenStat spreadsheet file by their column names. To identify the different data sets within the spreadsheet we enter the name Toy in the Record Source in Factor box. This will create a new factor in the spreadsheet where each level of the factor represents a different appended data set. By default these are simply the numbers 1 and 2; however, you can specify labels for these by entering names into the Factor Label boxes. In Figure 4.50 we have entered Kitten to label the appended data and Dog to represent the original data. Clicking OK produces the Select Excel Worksheet for Import menu again (Figure 4.48). This time we select the Kitten Sales worksheet, and click Finish. GenStat reads the data from the file, appends the values onto the current spreadsheet, and creates a new factor Toy using the labels Dog and Kitten to represent the different data sets (see Figure 4.51). We now close this sheet, and will next input the data in the third worksheet, Dog and Kitten Sales. GenStat has two menus that enable you to easily stack or unstack your columns of data within a spreadsheet. We will first look at how you can stack columns together. The data in the worksheet Dog and Kitten Sales are shown in Figure 4.52. There are six columns: the location (now a factor), the year sold, two columns of sales and two columns of prices. We want to stack the two columns of sales (SoldDog and SoldKitten) together and the two columns of prices (CostDog and CostKitten) together. To stack the columns we select Stack from the Manipulate options on the Spread menu, which produces the menu shown in Figure 4.53. We want to stack two columns together at a time, so we enter 2 into the Number of columns to stack together box. We enter the factor name Toy into the Record column source in Factor
box; this creates a new column containing a factor where each level will represent a column that has been stacked. We click in the Stack Columns list and then select the two columns CostDog and CostKitten by clicking the -> button. This puts the names of these columns into the Stack Columns list and prefixes them with a ‘1’, indicating the columns that we wish to stack together in our first stacked column. We now select the two columns SoldDog and SoldKitten, clicking on the -> button to copy these into the Stack Columns list. This time the names are prefixed by a ‘2’, indicating that these columns will be placed into the second stacked column. You can include repeated columns in the stacked spreadsheet. For example, we have selected the columns City and Year to be repeated for each level of the stacking, by clicking in the Repeat Columns box and then double-clicking on their names in the Available data box. Clicking OK produces the new spreadsheet shown in Figure 4.54. This spreadsheet consists of five columns: a column for the source factor (Toy), a repeated column for the city, another repeated column for the year, and the two stacked columns with the costs and the numbers sold. GenStat creates default names, with suffixes "_1" for the repeated and stacked columns. However, you can change these using the Column Rename menu (obtained by selecting Rename from the Column option of the Spread menu). Now suppose that we want to unstack columns in the stacked spreadsheet, so that we have a column of data for each year.
The Unstack menu (Figure 4.55) is opened by selecting Unstack from the Manipulate option of the Spread menu. This menu splits single columns up into multiple columns based on the levels of an unstacking factor. In this example the unstacking factor is Year_1, which we created using the Stack menu. So we double-click the name Year_1 from the Available Data list to put it into the Unstacking Factor box. The columns are unstacked so that the rows for each level of the unstacking factor become a new column. There are three levels for the factor Year, so the resulting spreadsheet should contain three columns for each unstacked column. Click in the Unstack Columns list, then highlight the names CostDog_1 and SoldDog_1. Now click the -> button to transfer them across to the Unstack Columns list. The ID Factors box allows you to specify factors to identify the rows within each year, to ensure that these correspond across columns. (This is important here, as the cities are not in the same order for every year.)
Clicking OK produces the spreadsheet shown in Figure 4.56, where there are three columns of prices and sales for each year. As with the Stack menu, GenStat has given the columns default names, but these can again be changed using the Column Rename menu.
If you have data open in two spreadsheets, you can merge them together in different orders or at different levels of aggregation using the Merge menu. To illustrate this menu we will match together two sets of data that have been stored in different files. The files Health1.gsh and Health2.gsh contain data from a study carried out on university students. The file Health1.gsh contains measurements of their height, weight, age and gender, whilst the file Health2.gsh contains data on their pulse rates before and after exercising. Both files contain a column with the student's ID, which will be used as an indicator to merge the spreadsheets. To merge the data from these files they both need to be open within spreadsheets in GenStat. So, we open both of them using Open from the File menu (Figures 4.57 and 4.58).
Clicking in the Health1.gsh spreadsheet and then selecting Merge from the Manipulate option on the Spread menu opens the menu shown in Figure 4.59. We wish to merge in the data from the spreadsheet Health2.gsh, so we select this from the Merge data from Sheet list. Here, we simply wish to merge the two sheets using the student's ID, so we select ID from the Matching Column list and also from the With Column list. If we only wanted to merge a subset of columns from the sheet Health2.gsh, we could select them by clicking on the Select Columns to Transfer button. The options at the bottom of the menu allow you to control how the rows are updated in the spreadsheet, and how to update existing columns in both spreadsheets. Leaving the menu with the default settings produces the merged spreadsheet shown in Figure 4.60. Where student IDs were found in one sheet but not the other, missing values are used to complete the row. For example, students 4 and 14 were found in the spreadsheet Health1.gsh but not Health2.gsh, so these rows have missing values inserted for the columns merged from the spreadsheet Health2.gsh. The reverse can be seen for students 12 and 13. Other data manipulation methods available via the Manipulate options of the Spread menu include transposing, duplicating or converting spreadsheets.
4.7 Bookmarking and comments
Sometimes it is useful to insert place holders into your text windows or spreadsheets. This is particularly useful if you have a large spreadsheet or text file open, and want to go quickly to a particular cell or line. To bookmark a text window or spreadsheet you can use the Bookmark option on the Search menu. For spreadsheets there is an additional menu available that allows you to bookmark particular numerical values. To illustrate the bookmark facilities we will open the sulphur data, stored in the GenStat spreadsheet file Sulphur.gsh. In this example we will bookmark the maximum and minimum values within the columns Sulphur and Windsp, so that these values can be identified quickly. We select By Value from the Bookmark options on the Search menu, which opens the menu in Figure 4.61. Here we have selected both Sulphur and Windsp from the Select Columns list, and Extreme values (max, min) from the Bookmark Values list. Clicking OK produces the spreadsheet in Figure 4.62, where the bookmarked cells are shown in a user-defined colour (by default magenta). You can navigate to these cells by selecting Next on the Bookmark option on the Search menu. Each time you select this menu option, the cursor will move to the next bookmark within the spreadsheet. You can add comments to individually bookmarked cells in a spreadsheet by adding a note. Clicking in the bookmarked cell within the column Sulphur at row 20, and selecting Add Note from the Bookmark option on the Search menu, opens the menu in Figure 4.63.
This displays a small, resizable text editor where a comment can be supplied for the bookmarked cell; by default this supplies a note based on the option selected from the Bookmark Values list. In this example the default note specifies that the cell is the maximum value for Sulphur. If a spreadsheet containing bookmarks is saved into a GenStat spreadsheet file, the bookmarks will be retained when the file is opened again. Another feature, useful for visually displaying values that fall into different categories or conditions, is the Conditional Formatting menu. To open this menu select Conditional Formatting from Column on the Spread menu. The menu (shown in Figure 4.64) allows you to set up to three different conditions to format the data. You can also choose the colour in which to display the data that satisfy each of the conditions. In Figure 4.64 we have set different colours to represent different amounts of sulphur in the air (small values in red and larger amounts in blue or green). Note that we have set the condition for greater than or equal to 20 before greater than or equal to 10; this is to ensure that the values greater than 10 but less than 20 are shown in blue. Clicking OK will redisplay the values within the column Sulphur in the chosen colours. Individual spreadsheet cells can also be made temporarily missing (the value is retained in the spreadsheet but is set to missing in any calculations or statistical analysis), so that individual values can be excluded from an analysis. An example of where this could be useful is in an Analysis of Variance where, if you restricted out a row, you could get a fault reporting that the design is unbalanced. Analysis of Variance is discussed further in Chapter 6. In Chapter 3, when the values were transformed to logarithms, GenStat produced a warning that it could not calculate the logarithm of 0. So this is a case where we may wish to make the value temporarily missing before making the transformation. To do this, select Temporary missing values from Column on the Spread menu. This opens the menu shown in Figure 4.65, where we have selected Sulphur and entered row 1 (where the value 0 is located). Clicking on the Missing button changes the cell to be temporarily missing, and clicking OK produces the spreadsheet in Figure 4.66. The temporarily missing cell is disabled and has an asterisk appended to the value in the cell. Alternatively, you can toggle the status of the current cell using the Alt+F8 key.
4.8 Dynamic Data Exchange (DDE)
A Dynamic Data Exchange (DDE) server is a Windows program that can provide data to other programs. GenStat can request data from a DDE server on a given topic and item. If the data are changed in the DDE server, GenStat can ask the DDE server to notify it of the changes; this ability is known as a hot link. For example, Microsoft Excel is a DDE server where the available topics are sheets in open workbooks, and the DDE items are cell addresses or ranges. GenStat can create a spreadsheet from a given sheet within an open Excel file, and if you make a change to a cell in Excel, the corresponding cell within GenStat is automatically updated. Many DDE servers can also receive data, so any changes made in GenStat can be sent to the DDE server, keeping it synchronized with GenStat. The data from a DDE link must be in a rectangular table with no empty rows or columns in the data block. GenStat also provides the option to include the column names in the first row of the rectangular block. You can save the DDE link information in a GenStat DDE Link file (file extension “.GDE”). When a file saved in this format is opened using the Open option of the File menu, GenStat will recreate the DDE link and open the latest version of the data within a spreadsheet. If you save a spreadsheet that currently has a DDE link in the GenStat spreadsheet format (.GSH), then the information about the current DDE link will also be stored. So, the next time you
open the GenStat spreadsheet, you will be offered the opportunity to reconnect to the previous DDE server. To illustrate how to create a DDE link we will set up a link to the sulphur data stored in the sheet GenStat Data in the file Sulphur.xls. To connect GenStat to a DDE server, the server must be running with the data opened. So, first we start up Excel and open the file Sulphur.xls. Now we return to GenStat and select DDE Link from the New option of the Spread menu. This opens the menu shown in Figure 4.67. From the list of Servers we select Excel, and from the list of Topics we select the sheet [Sulphur.xls]GenStat Data. We now need to specify the rectangular range where the data are positioned within the Excel sheet. The data, including the column names, are stored within the range A1:D115, so we enter this range (A1:D115) in the Item list. You can use the Excel DDE notation of R*C* (for example R1C1:R115C4); however, GenStat also allows the more usual cell formats for Excel (A*:D*). Other notations for row and column letters can be set by clicking on the Servers button. This opens the menu shown in Figure 4.68, where you can enter the row and column letters in the Row Number Prefix and Column Number Prefix boxes respectively. This menu can also be used to store links to DDE servers so that you can start them from GenStat. For example, if you click in the Server EXE file location box and then click on Browse, you can navigate to the folder containing the executable of the DDE server. In this example, we have navigated to the folder containing the Excel executable and clicked OK to put the name into the Server EXE file location box. We have then entered the name Excel in the Server Name box to identify this link, and clicked Add to add it to the list. Now, if we select the Excel item in the list and click the Start button, Excel will start. We have selected the options for a hot link and to receive changes back from the server. Also, as the rectangular block of data includes column names, we have selected the Send/Receive changes to/from Server option. After clicking OK you will be prompted by the menu shown in Figure 4.69, which can be used to control how the spreadsheet is to be created. Using the default options in Figure 4.69 and clicking OK creates the spreadsheet shown in Figure 4.70. When a spreadsheet is created with a DDE link, it displays a different icon in the top left corner so that you can easily detect that the spreadsheet has been created this way. Now, if a value is changed within the Excel spreadsheet, it will automatically change within GenStat, and vice versa.
Closing the DDE server whilst the link is still open in GenStat automatically produces a warning in GenStat, prompting you to close the spreadsheet (Figure 4.71). We now reconnect the same DDE link using the method outlined earlier in this section, and then select Save from the File menu to save the spreadsheet and the DDE link. The Save menu shown in Figure 4.72 initially offers to save the spreadsheet as a GenStat DDE link file (.GDE). This file type does not contain any data, but is a text file containing the DDE server/topic/item strings and link status. Re-opening this file type will restart the DDE server and create a spreadsheet from the data in the DDE server. However, we change the File Type to GenStat Spreadsheet (*.gsh), enter the name Ddelink.gsh and click Save to save the file as a GenStat spreadsheet. Now if we close the spreadsheet and reopen it using the Open option of the File menu, a prompt appears offering to reconnect the DDE link (see Figure 4.73). Clicking Yes on this prompt, and OK on the subsequent Create DDE Link menu, reinitiates the DDE link.
The behaviour of the DDE link can be changed at any time by selecting Edit DDE Link from Sheet on the Spread menu. This opens the menu shown in Figure 4.74, where we have selected the two Disconnect options to permanently disconnect the link. You can temporarily suspend a link by selecting the two Suspend options instead. Now we close Excel and open the saved GenStat spreadsheet file containing the DDE link (Ddelink.gsh). At the prompt offering to reconnect we click No. To reconnect the DDE link at any time during the current session, select DDE Link from Add on the Spread menu. This opens the menu shown in Figure 4.75, where all the boxes have been filled in with the link information. Clicking on OK will re-establish the DDE link.
4.9 Reading and writing data to databases
GenStat has facilities for reading and writing to databases using Open DataBase Connectivity (ODBC). ODBC is a Microsoft standard that allows a common method of accessing databases made by different software packages. The ODBC interface is built into Windows, and the common ODBC drivers are installed as standard in all Windows versions from Windows 95 Second Edition onwards. GenStat is able to query any data source that has an ODBC interface. This includes all the main database systems (Access, Oracle, Informix, SQL Server, dBase, FoxPro, Paradox) and many spreadsheets (Excel, Quattro, etc.). It is even possible to use ODBC to read a data file from a package that is not installed on your PC. An ODBC link can be defined either using the ODBC/Data Sources applet within the Control Panel, or when you initially start an ODBC Data Query within GenStat. We will demonstrate how to create a link using the ODBC Data Query facilities within GenStat. Selecting ODBC Data Query from New on the Spread menu opens the menu shown in Figure 4.76. This menu shows all the ODBC connections currently available on your PC. Connections to databases using ODBC are made by creating Data Source Names (DSNs). A DSN stores all the information about how to connect to the data source, and is stored permanently on a PC once it has been created. There are three types of DSN available, and the DSN you choose will depend on how you want to access the database:
1. User DSN: this type of DSN can be accessed only by the user who initially created it, so any other user (different username and password) working on the same PC will not be able to access the database.
2. System DSN: this type of DSN is specific to a computer, so any user of the computer will have access to the database using this type of DSN.
3. File DSN: this type of DSN is created as a file (*.dsn), which can be copied to any computer. Anyone who can access the file containing the DSN information can then access the database.
On the menu in Figure 4.76 the file DSNs are listed under the File Data Source tab, and the User and System DSNs are listed under the Machine Data Source tab. We will now illustrate how you can create a File DSN for an MS Access data file.
4 GenStat spreadsheet
The file we will connect to is called cardata.mdb and contains data on 33 cars recorded in 1997. Selecting the File Data Source tab and clicking on New opens the menu shown in Figure 4.77. This menu lists all the ODBC drivers currently available on the PC. We are connecting to an Access database file, so we select the Microsoft Access Driver (*.mdb) from the list and click Next to proceed. This opens the menu shown in Figure 4.78, where a descriptive name for the DSN can be supplied. We enter Car Data in the space provided and click Next, which opens the menu in Figure 4.79. This menu gives a summary of the choices you have made; if you want to change any details you can click on Back. Clicking on Finish creates the DSN with the choices shown in the menu.
Figure 4.78
Figure 4.79
After clicking Finish you will be prompted with some additional menus, depending on which ODBC driver you are connecting with. These menus are specific to the ODBC driver, and are used to specify the information needed for the driver to connect to the data source, plus any other driver-specific options. The Access Setup menu is shown in Figure 4.80. Here we need to specify the file that we want to connect to (cardata.mdb). Clicking on Select opens a browse menu (see Figure 4.81) where we have selected the file cardata.mdb. Clicking on OK selects the file and displays the name in the Database options (see Figure 4.80). If the database is password protected, you can click on the Advanced button to specify a username and password for the database; if you do not provide them here, you will be prompted for a password each time you try to connect. Clicking OK on the Access Setup menu completes the DSN and enters it into the list of File Data Sources on the Select Data Source menu.
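Behind the scenes, a File DSN is just a small text file. The exact entries are written by the driver, so treat this as a sketch only, but for an Access database the file will contain roughly the following (the DBQ path here assumes the course folder used elsewhere in this chapter):

```ini
[ODBC]
DRIVER=Microsoft Access Driver (*.mdb)
DBQ=D:\Gen6ed\Course\Example\cardata.mdb
```

Because the DSN is an ordinary file, it can be copied to another PC, provided that PC also has the Access ODBC driver installed.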
Figure 4.80
Figure 4.81
To initiate the ODBC connection in GenStat, we double-click on the Car Data DSN in the list of File Data Sources. Figure 4.82 shows the resulting menu, which provides a list of all the tables and views within the DSN. Selecting a table or view from the Table list displays all the columns within that table or query in the Available Columns list. Merged data across tables are not supported by the interactive interface, but can be handled by editing the resulting SQL statement generated by GenStat. An alternative way to access merged data across tables is to create a view within the database itself, as views are displayed in the Table list; you can then select the columns as you would with a single table. In Figure 4.82 we have selected the table CarData and, from the Available Columns, we have made a multiple selection of Car, Price and Max_MPH. We have then clicked on the !> button to copy the selected columns across to the Selected Columns list.
Clicking on Next opens a Filter menu, as shown in Figure 4.83. Within this menu you can choose a subset of rows from the database based on a logical condition. The condition is entered into the space provided, and you can use the lists of available columns, functions and operators to help build the expression. For example, we want to filter to those rows where the price of the car is less than £10,000. Double-clicking on the name Price in the Available Columns list puts the name into the edit field for the expression. Similarly, double-clicking on 'less than' in the Operators list puts a '<' symbol into the expression. Finally we type 10000 and click Next to continue. Figure 4.84 shows the final menu in the process; this specifies how you want to run the query. You can simply run the query by selecting the Run the SQL Query option. Alternatively, you can view the generated SQL statement by selecting the View or Edit the SQL Query option, as shown in Figure 4.84. Selecting the Save the Query option makes the box below available for us to enter the file name D:\Gen6ed\Course\Example\Query1.gdb. This saves the whole ODBC query in a file called a GenStat GDB file. A GDB file can be opened using the Open option on the File menu, and will automatically run the query on the ODBC server specified within the file. Clicking OK opens the SQL View menu shown in Figure 4.85, which displays the SQL statement generated by the query. You can edit the SQL within this window if you wish, before actually running the query. Clicking on the Tables or Columns buttons will open lists that can be used to construct the query.
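For this walkthrough (the columns Car, Price and Max_MPH from the table CarData, filtered on price), the generated statement will be along the following lines; the exact bracketing and quoting depends on the ODBC driver, so treat this as an illustration rather than the literal text of Figure 4.85:

```sql
SELECT Car, Price, "Max_MPH" FROM CarData WHERE (Price < 10000)
```

Note the quotes around Max_MPH, which are needed because the name contains a non-alphanumeric character.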
When editing the SQL, any column names containing non-alphanumeric characters must be surrounded by quotes. For example, the column name Max_MPH contains an underscore (_), so the name has been surrounded by quotes in the generated SQL statement. Using the generated SQL statement and clicking OK produces the spreadsheet shown in Figure 4.86. A GenStat spreadsheet can also be written to a database using ODBC, provided you have the access rights to do this. There are three ways in which you can write to a database: create a new table, add new rows to a table, and update existing rows within a table.
Figure 4.87
Figure 4.88
To illustrate these methods we will write data to the Car Data database in MS Access. The file cardata1.gsh, shown in Figure 4.87, contains data on the time required to accelerate from 0 to 60 mph. With this file open in GenStat, select Create Database Table from Export on the Spread menu. As with reading data using ODBC, you are required to specify a DSN for the database you want to connect to. Double-clicking on the Car Data DSN in the File Data Source list on the Data Source menu opens the menu in Figure 4.88. Here we have entered the name NewCarData for the table that is to be created in Access. By default all the columns are transferred into the table; however, you can select specific columns from the spreadsheet to be transferred by clicking on the Select Columns to be in Table button.
Clicking OK adds the new table to the database, and on successful completion of the process a prompt appears as shown in Figure 4.89. When columns are transferred to the new table in the database, the same column names are used as in the spreadsheet.
Figure 4.91
Figure 4.90 shows the new table within Access, with the same column names as the GenStat spreadsheet. The file Peugeot.gsh contains additional data for some Peugeot cars, which need to be added to the database. To add these rows to the new table, we first open the spreadsheet in GenStat (it is shown in Figure 4.91), and then select Insert into Database Table from Export on the Spread menu. Double-clicking the Car Data DSN on the Select Data Source menu opens the menu shown in Figure 4.92. We select the table NewCarData from the Insert into Table list and select the Using Names in Sheet option. You can save the export link in a GenStat ODBC Link file (.GLK), so that you can automatically re-run the insert operation on subsequent spreadsheets without having to go through the menu steps again. We have selected the Save Export Link in GLK file option, and have entered the file name D:\Gen6ed\Intro\Example\ODBC1.glk to save the export link information.

Figure 4.90
Clicking OK prompts you with the confirmation dialogue shown in Figure 4.93, and inserts the rows from the spreadsheet into the database, as shown in Figure 4.94. The final method of writing to a database is to update existing rows within the database. The file Ford.gsh contains data from further testing on Ford cars, in which the time taken to reach 60 mph has been improved on all models. Opening the file Ford.gsh in GenStat gives the spreadsheet shown in Figure 4.95. We select Merge with existing Database Table from Export on the Spread menu and double-click the Car Data DSN on the Select Data Source menu. This opens the menu shown in Figure 4.96. Select NewCarData from the Merge data into Table list. Each car has an ID number that can be used to match records between the spreadsheet and the database. So, select the column ID from the Matching Sheet Column list and select ID from the With Table Column list. This will match the data from the spreadsheet with the database using the column ID, and replace the values for the other columns. As with the menu for inserting rows into a database, you can save the export link information in a GenStat ODBC Link file (.GLK) to automatically
run the process another time. We have specified this by selecting the Save Export Link in GLK file option, and entered the filename D:\Gen6ed\Intro\Example\ODBC2.glk in the space provided. A description of the other options available on this menu can be found by clicking on the Help button. Clicking OK prompts you with the confirmation dialogue shown in Figure 4.97, and replaces the rows in the database using those from the GenStat spreadsheet, as shown in Figure 4.98.
Figure 4.97
Figure 4.98
To run a GenStat ODBC Link file to automatically insert rows or merge data into a database, select Run ODBC export link from Export on the Spread menu. This opens the menu shown in Figure 4.99, where you can either run the link using the current spreadsheet, or run the link from a given GenStat spreadsheet file (you will need to specify the location of the file). In Figure 4.99 we have selected the file D:\Gen6ed\Intro\Example\ODBC2.glk using the Browse button, and have selected the option to run using the currently active sheet. Clicking OK will re-run the export link for the row replacement outlined above, and will produce the confirmation dialogue and spreadsheet shown in Figures 4.97 and 4.98.
4.10 Other facilities
There are many other facilities for data manipulation within the GenStat spreadsheet. One useful feature is the ability to set a spreadsheet as the active spreadsheet. If a spreadsheet is set as the active spreadsheet, then only changes made in that spreadsheet will be updated in GenStat; all other
spreadsheets will be prevented from updating GenStat until you remove this setting. Another advantage of specifying an active spreadsheet is that the Spread menu will always be available, whether you are in the spreadsheet or in a text window. You can set a spreadsheet as the active spreadsheet by selecting Set as active sheet on the Spread menu. More details on active spreadsheets can be found in the on-line help. Another method for rearranging data in GenStat is through the Paste Special menu. This is accessed using the Paste Special option on the Edit menu. With this feature you can copy data onto the clipboard from another data source, and then control how the data are to be pasted into the GenStat spreadsheet cells. For example, you can use this to paste a rectangular block of data into a single column, or to paste grouped blocks of data into multiple rows. You can also calculate summary statistics based on just the data within the current spreadsheet: for example, you may want to aggregate data to provide summaries, or perhaps expand a set of factor results to give a row for every factorial combination. This menu is accessed by selecting Summary Stats from Calculate on the Spread menu. A spreadsheet can have a set of GenStat commands embedded within it. This allows you to provide a statistical analysis along with the spreadsheet. This is explained, with an example, in Section 10.1.
4.11 Commands
Many of the menu options illustrated in this chapter can equivalently be carried out using the command language. However, some features, such as data verification, copying from the clipboard and bookmarking, can only be performed in Windows™. To filter or restrict data as outlined in Section 4.3 you can use the RESTRICT directive. The VECTOR parameter specifies the data columns that are to be restricted, and the CONDITION parameter allows you to set the condition by which the data values are restricted. For example, the following shows how to perform the restriction in Figure 4.33:
RESTRICT Drench,Lwt2; CONDITION=(Lwt2 < 51)
To remove a restriction, use RESTRICT again, omitting the condition:
RESTRICT Drench,Lwt2
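Conditions can also be combined using GenStat's logical operators .AND., .OR. and .NOT.; for example, this sketch (with an illustrative lower bound that is not part of the example above) restricts to the rows that satisfy two conditions at once:

```
RESTRICT Drench,Lwt2; CONDITION=((Lwt2 < 51) .AND. (Lwt2 > 40))
```

This is the command-language counterpart of filling in both Expression fields of the Restrict using an expression menu.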
To form a subset of data you can use the SUBSET procedure. In this procedure the condition is supplied using the CONDITION option. The NEWVECTOR parameter
allows you to specify a new vector to save the subset (otherwise it overwrites the contents of the original vector). The following example shows how to create a new subset called subLwt2 from the values of Lwt2 using the condition that all the weights are less than 51. SUBSET [CONDITION=(Lwt2 < 51)] Lwt2; NEWVECTOR=subLwt2
The sorting of data in Section 4.4 can be performed using the SORT directive. The index column that defines the sorted order is specified by the INDEX option, and the direction of sorting is specified using the DIRECTION option. The sorting performed in Figure 4.41 can be reproduced by: SORT [INDEX=Lwt2; DIRECTION=ascend] Drench,Rep,Lwt1,Lwt2
For a multi-key sort you can supply a list of identifiers for the INDEX option and the data will then be sorted by list order. The data will be sorted by the first item in the index list and then by the second item in the index list, and so on. To reproduce the sort in Figure 4.43 you could use the following. SORT [INDEX=Drench,Lwt1; DIRECTION=ascend] \ Drench,Rep,Lwt1,Lwt2
The stacking and unstacking of data can be reproduced using the STACK and UNSTACK procedures. For stacking columns together, the factor to record the source of each value is specified using the DATASET option, and the columns that are to be stacked are supplied by the parameters V1 to V100. The following commands demonstrate how to reproduce the stacked data set in Figure 4.53. STACK [DATASET=Toy] Year_1,CostDog_1,SoldDog_1;\ V1=Year,CostDog,CostKitten;\ V2=Year,SoldDog,SoldKitten
The unstacking of the data in Figure 4.55 can be produced using the command shown below: UNSTACK [DATASET=Year_1] 3(CostDog_1,SoldDog_1);\ DATASETINDEX = 1998,1999,2000; \ UNSTACKEDVECTOR =CostDog_101,CostDog_102,\ CostDog_103, SoldDog_101,SoldDog_102,SoldDog_103
The source factor, Year_1, is supplied using the DATASET option. The DATASETINDEX parameter specifies the levels or labels of the DATASET factor that indicate the group whose units are to be stored in each UNSTACKEDVECTOR. In this example we have used the levels of Year_1: 1998, 1999 and 2000. The vectors to hold the unstacked data are supplied as a list using the UNSTACKEDVECTOR parameter. For Dynamic Data Exchange, the DDEEXPORT procedure can be used for writing data to a DDE server. This allows you to build up worksheets of results in spreadsheets such as Excel. Within Excel you can write data to the worksheet cell
by cell, or alternatively you can add formulae to cells. You can also send macro commands to Excel to open files, add new worksheets, or save or close files. The location within the DDE server is specified using the options SERVER, TOPIC and ITEM. However, for the two common spreadsheets, Excel and Quattro Pro for Windows™, these have been broken down into more convenient options called OUTFILE, SHEETNAME, COLUMN and ROW. For Excel and Quattro Pro only the first cell needs to be provided, as GenStat can work out the range automatically from the size of the data. If you want to send commands you can supply these by setting METHOD=command. The following example will open Excel, create a new worksheet and copy the data to the worksheet. Open the columns crop and counts from the file Bacteria.xls, and then run the following program to copy them back into the file on a new sheet.
DDEEXPORT [METHOD=command]\
   '[OPEN(''D:\\Gen6ed\\Intro\\Example\\Bacteria.xls'')]'
DDEEXPORT [METHOD=command] '[WORKBOOK.INSERT(1)]'
DDEEXPORT [OUTFILE='Bacteria.xls'; SHEET='Sheet1';\
   ROW=1; COL=1] crop,counts
DDEEXPORT [METHOD=command] '[SAVE()]'
The DDE commands used in the example above are a subset of the Excel 4 macro language. The format of the commands is [Function(arg1,arg2,...)]. If there are text strings in the arguments then these must be supplied in double quotes (for example, "Arg1"). The following list gives some of the most useful Excel commands that can be used with the DDEEXPORT procedure.
[APP.RESTORE()]                  Restore the Excel window
[APP.MINIMIZE()]                 Minimize the Excel window
[APP.ACTIVATE()]                 Make Excel the application with focus
[OPEN("filename")]               Open a workbook in Excel
[WORKBOOK.INSERT(1)]             Insert a new workbook
[WORKBOOK.SELECT("sheetname")]   Make the named sheet the current sheet
[WORKBOOK.DELETE()]              Delete the current sheet
[SELECT("object")]               Select the cells/columns/rows specified in object
[SORT(1,"R1C1",1)]               Sort the selected cells using the key in the specified cell
[SAVE()]                         Save the current workbook
[SAVE.AS("filename",1)]          Save the current workbook as a new file
[CLOSE(1)]                       Close and save the current workbook (0 = close but do not save)
To read data from a database you can use the DBIMPORT procedure. You can supply the name of an existing GDB file containing information on the data to load using the GDBFILE parameter. Alternatively, you can supply a database connection string using the DB parameter with an SQL statement using the SQL parameter. To run the example in Section 4.9 you could supply the saved GDB file as follows: DBIMPORT GDBFILE='D:\\Gen6ed\\Intro\\Example\\Query1.gdb'
To write tables or data to a database you can use the DBEXPORT procedure. The METHOD option specifies how the data are to be written to the ODBC data source: to create a table use METHOD=create, to add rows to an existing table use METHOD=insert, and to update rows in an existing table use METHOD=merge. In its simplest form, you can just provide a previously saved GenStat ODBC Link (GLK) file. The data to be sent can be specified either as a pointer to a set of structures in GenStat, or as a text giving a GenStat spreadsheet (GSH) file. If you are using an ODBC Link file and this does not specify a GenStat spreadsheet as the data to transfer, you will need to specify the data using the DATA parameter. Column names within the ODBC table are assumed to be the same as the GenStat identifiers; if you want to use different names then you can specify COLUMNNAMES and WITH (for matching with MATCH). The COLMERGEMETHOD option controls whether columns from the data that are not found in the database table are to be added to the table. Subsets of columns can be specified using the SUBSET parameter. If METHOD=merge, the MATCH parameter must be set, and at most five columns can be matched. The WITH parameter may be set if the columns in the table do not have the same names as the structures specified by the DATA parameter. The ROWMERGEMETHOD option controls how unmatched rows are handled in a merge: the setting none does not add unmatched rows, the setting matched only adds a row if another with the same matching criteria already exists in the table, and all adds all unmatched rows to the table. The WARNINGDIALOGS option can be used to control whether warning message boxes are displayed on the Windows™ desktop when errors occur. The ERRORACTION option controls what to do when non-fatal errors occur: you can halt the process or continue. The following example shows how you can run a GenStat ODBC Link file: DBEXPORT [GLKFILE='ODBC1.GLK']
The second example will run a GenStat ODBC Link file, but this time data currently stored within GenStat will be used for the merging. DBEXPORT [GLKFILE='ODBC2.GLK'] ID,CAR,ZERO_60
The last example demonstrates how you can extract the connection string from a GenStat ODBC Link file, and create a new table in the database using data currently within GenStat.
"Read the database connection string from the GLK file"
OPEN 'ODBC1.GLK'; CHAN=2; INPUT; WIDTH=600
SKIP [CHAN=2] 1               "Skip the ODBC Link ID"
TEXT [1] DB
READ [CHAN=2; PRINT=*; LAYOUT=FIXED; FORMAT=!(600); END=*] DB
CLOSE 2; INPUT
"Create the new table in the database"
DBEXPORT [METHOD=create] ID,CAR,ZERO_60; DB=DB;\
   TABLE='NewTable'
4.12 Exercises
4(1) The following data are from an experiment assessing the durability of four different types of carpet; four machines were available to simulate the wear arising from daily use.
day  machine  carpet  wear
 1      1       d      38
 1      2       a      17
 1      3       c      38
 1      4       b      39
 2      1       a      19
 2      2       d      22
 2      3       b      26
 2      4       c      35
 3      1       b      41
 3      2       c      54
 3      3       a      11
 3      4       d      36
 4      1       c      59
 4      2       b      36
 4      3       d      22
 4      4       a      16
Enter these data into a GenStat spreadsheet. Use the Fill menu from the Calculate option on the Spread menu to generate the day and machine information. Change the first three columns to factors, and ensure that the labels for carpet are a, b, c and d. Using the Verify menu from the Sheet option on the Spread menu, check that you have entered the data correctly. The data are stored in the file Carpet.gsh; compare your spreadsheet with this data set using the Compare menu. Close the spreadsheet and clear the data pool when you have finished. 4(2) The file Computer.gsh contains the number of personal computers sold in a shop during each month of the year 2001, together with the prices charged.
Using the Restrict/Filter options on the Spread menu, subset the data to display only the rows (months) where the price is greater than £1100. Build up the subset further by filtering to the rows where the number sold is less than 15. Remove the restriction from the spreadsheet, and restrict the data again, this time using both conditions at the same time (you will need to use both Expression fields of the Restrict using an expression menu). Remove the restriction when you have finished. Sort the spreadsheet in descending order according to the number of computers sold. Sort the spreadsheet again, this time using the months in alphabetical order. Sort the spreadsheet with multiple indexes, firstly by the price and then by the number of PCs sold. Close the spreadsheet and clear the data pool when you have finished. 4(3) Experiments on cauliflowers in 1957 and 1958 provided data on the mean number of florets (y) in the plant and the temperature (x). Open the spreadsheet from file Floret.GSH, and stack the columns y1 & y2 together and x1 & x2 together. Draw a scatterplot of the mean number of florets against the temperature. Redraw the graph, but this time enter the source factor (created from the stack) into the Groups box to highlight the two different groups. 4(4) The file Ant.xls contains data from an insecticide trial for killing ants. Five types of insecticide were used on each of three types of bait. The data have been entered on different sheets in the Excel file. Open the sheet Baits 1 & 2 from the file, and then append the data from the sheet Bait 3 to make the complete data set. Bookmark the column time to show the maximum and minimum values (you can access the bookmark menu using the By Value option in the Bookmark section of the Spread menu). Clear these bookmarks. Using the Conditional Format menu, highlight the values for insecticide 2 in blue and the values for insecticide 4 in red. Clear the conditional formatting.
Use the Unstack menu to create separate columns for the times for each of the three baits. You should use the column bait as the source factor for the unstacking. 4(5) Clear all the data from the GenStat data pool and close all open spreadsheets. Open the sheet Ant.xls within Excel and set up a DDE link to the data within the Baits 1 & 2 sheet so that the data are received and sent whenever changes have been made. Note the range of the data is from A1:C31. Once the link is up
and running, check the links by changing a value in GenStat and then seeing whether it has automatically changed in Excel. Then change a value in Excel and see if the change has been updated in GenStat. Save the spreadsheet as a GenStat DDE Link file (.GDE). Close the spreadsheet and reconnect the DDE link by opening the DDE Link file. 4(6) Clear all the data from the GenStat data pool. Using the ODBC Data Query menu from the Spread menu, connect to the car data (Car.mdb) and bring in all the columns of data, but only for cars costing £10000 or more. Save the query in a GenStat GDB file. Sort (and display) the data in ascending order according to their horsepower. Clear the data from the GenStat data pool. This time, using Open on the File menu, open the GDB file.
5 Linear regression
In this chapter we describe how to fit regression models with GenStat. We start with simple linear regression, where a straight line is fitted to represent the relationship between two variables: one variable is considered as the response variable (or y-variate), and the model predicts its mean value given the value of the other, explanatory variable (or x-variate). The regression menus in GenStat allow you to explore relationships between many variables. Section 5.2 introduces multiple linear regression, where there may be several explanatory variables; Section 5.4 looks at all subsets regression; and Section 5.3 shows how to deal with explanatory variables that are categorical. Linear regression models can also be curved: this is no contradiction, because linear means linear in terms of the parameters, and not necessarily linear in the explanatory variables, as explained in Section 5.5. We show some of these curved models, and indicate how other, nonlinear, curves can also be fitted with GenStat. Further details about nonlinear models can be found in Sections 3.7 and 3.8 of Part 2 of the Guide to GenStat. All these regression models allow for random variability in the relationships: the response variable has a component of uncertainty. The analyses are based on the assumption that this component, called the error or residual, has a Normal distribution with constant variance for all the observations. We introduce some of the methods available to check these assumptions. Section 5.6 describes how GenStat can also fit generalized linear models, which use other distributions more suitable for data, like counts and proportions, that cannot be Normally distributed. Further details are given in Section 3.5 of the Guide to GenStat.
5.1 Fitting a straight line
The file Pressure.gsh is a GenStat spreadsheet file, containing recordings of blood-pressure from a sample of 38 women whose ages range from 20 to 80. It is easy to load into GenStat in this form: use the Data menu, select Load then Data File, and browse to find the file. The data can best be displayed in a graph of pressure against age. Figure 5.1 was drawn using the Graphics menu, selecting Point Plot. The graph shows clearly that there is a fairly linear relationship between blood-pressure and age. This can be quantified in terms of a linear regression model, which specifies a line of best fit or regression line between the points on the graph. It is natural here to consider that blood-pressure responds to increasing age, so we fit a line or model that predicts blood-pressure from age.
The equation of the line that we fit is
pressure_i = a + b × age_i + e_i
where a can be visualized as the intercept of the regression line, b as its slope, and e_i as the error, or vertical distance of the ith point from the line. A regression analysis produces estimates of the parameters a and b of this model, and also of the variance of the errors e, which is often of as much interest as the parameters. The details of the method of estimation, and of the assumptions that are necessary, are dealt with in many standard texts, such as Applied Regression Analysis by Draper & Smith (1981, Wiley, New York). The model can be fitted in GenStat by selecting Regression Analysis from the Stats menu, and then clicking on Linear, as shown in Figure 5.2.
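As in Section 4.11, the menu has a command-language equivalent. Assuming the data have been loaded as variates Pressure and Age, commands along these lines fit the same model (MODEL declares the response variate, and FIT fits the explanatory terms):

```
MODEL Pressure
FIT [PRINT=model,summary,estimates] Age
```

The PRINT settings shown here correspond to the default sections of output described below.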
Figure 5.2
This brings up the Linear Regression menu shown in Figure 5.3. For simple linear regression you just need to fill in the Response Variate and Explanatory Variate boxes. Clicking on OK then causes GenStat to analyse the data, and display a report in the Output window. The Linear Regression menu will remain on view, so you may need to click on the Output window to bring it to the top. Here is the report that appears.

***** Regression Analysis *****

 Response variate: Pressure
     Fitted terms: Constant, Age

*** Summary of analysis ***
              d.f.        s.s.        m.s.      v.r.   F pr.
Regression       1      2647.7     2647.69    169.73   <.001
Residual        36       561.6       15.60
Total           37      3209.3       86.74

Percentage variance accounted for 82.0
Standard error of observations is estimated to be 3.95

*** Estimates of parameters ***

            estimate      s.e.    t(36)   t pr.
Constant       63.04      2.02    31.27   <.001
Age           0.4983    0.0382    13.03   <.001
The analysis is headed by a reminder of the model, listing the response variable and the fitted terms: these are the explanatory variable and the constant or intercept term. The Linear Regression menu includes the constant by default; if you want to omit it, you can click on Options and choose not to estimate the constant term. This would constrain the fitted line to pass through the origin (that is, the response is zero when the explanatory variable is zero), but remember that the analysis would still be based on the assumptions that the variability about the line is constant for the whole range of the data, and that the relationship is linear right down to the origin.
residual. The variance ratio (v.r.) is the ratio of the mean squares, and could be used to test formally whether there is a significant relationship. The column headed F pr. gives the probability of a variance ratio as large as this occurring by chance if there were no relationship between the variables – but remember that this is based on the standard assumptions of linear regression, described later. A variance ratio as large as the one in this analysis indicates a significant relationship at the 0.1% level of significance (corresponding to probability 0.001). The percentage variance accounted for is a summary of how much of the variability of this set of response measurements can be attributed to the fitted model. It is the difference between the residual and total mean squares, expressed as a percentage of the total mean square. When expressed as a proportion rather than a percentage, this statistic is called the adjusted R²; it is not quite the same as R², the squared coefficient of correlation. The adjustment takes account of the number of parameters in the model compared to the number of observations. The final section of the analysis, shown by default, displays the estimates of the parameters in the model. So, for example, you can see that blood-pressure rises on average by 0.4983 units for each year of age, with a standard error of 0.0382. The corresponding t-statistic is large, 13.03 with 36 degrees of freedom, indicating that there is a significant association between pressure and age, as we expected from the graph. Again, the significance level is based on the standard assumptions of linear regression. Each section of output in this report is controlled by a check box in the Regression Options menu, so you can ask for specific sections of output before the analysis is carried out. Alternatively, after the analysis you can click on Further Output in the Linear Regression menu and ask for other sections.
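Returning to the summary of analysis, the relationship between the mean squares and the percentage variance accounted for can be verified directly. This is a sketch in plain Python rather than GenStat, using the values printed in the report:

```python
# The percentage variance accounted for is the adjusted R-squared,
# 1 - (residual m.s. / total m.s.), expressed as a percentage.
residual_ms = 15.60   # residual mean square from the summary
total_ms = 86.74      # total mean square from the summary

percentage_accounted = 100 * (1 - residual_ms / total_ms)
print(round(percentage_accounted, 1))   # 82.0, matching the report
```

The adjustment for the number of parameters happens automatically here, because mean squares (rather than sums of squares) already divide by the degrees of freedom.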
The resulting Linear Regression Further Output menu is shown in Figure 5.4; if you check Summary or Estimates, further options are available so that you can choose whether probabilities should appear with the analysis. The Fitted Values selection produces the display shown below, which we have curtailed to save space.
5.1 Fitting a straight line
*** Fitted values and residuals ***

                              Standardized
Unit   Response  Fitted value      residual  Leverage
   1      82.17         77.00          1.36     0.072
   2      88.19         85.97          0.57     0.028
   3      89.66         94.44         -1.24     0.042
   4      81.45         80.98          0.12     0.045
   5      85.16         83.97          0.31     0.032
 ...
  34      83.76         80.98          0.72     0.045
  35      84.35         89.95         -1.44     0.028
  36      68.64         75.00         -1.69     0.090
  37     100.50         93.44          1.82     0.038
  38     100.42        102.91         -0.67     0.111
Mean      87.95         87.95          0.00     0.053
The fitted values are those predicted by the model for each observation; that is, a + b × xᵢ. Instead of displaying the simple residuals, eᵢ, these values have been divided by their standard error: the resulting standardized residuals should be like observations from a Normal distribution with unit variance, if the assumptions made in this analysis are valid. The leverage values indicate how influential each observation is: a large value indicates that the fit of the model depends strongly on that observation. You can display the fit graphically by clicking on the Fitted Model button in the Linear Regression Further Output menu (Figure 5.4). This displays the picture shown in Figure 5.5, which shows the observed data with the fitted line superimposed.
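GenStat computes these diagnostics for you; the underlying formulae for a straight-line fit can be sketched as follows (Python, illustrative only – the function and variable names are ours, not GenStat's):

```python
import math

def simple_regression_diagnostics(x, y):
    """Fitted values, standardized residuals and leverages for a
    straight-line fit, from the textbook formulae (illustrative
    sketch, not GenStat code)."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    a = ybar - b * xbar
    fitted = [a + b * xi for xi in x]
    residuals = [yi - fi for yi, fi in zip(y, fitted)]
    # leverage of unit i: h_i = 1/n + (x_i - xbar)^2 / Sxx
    leverage = [1.0 / n + (xi - xbar) ** 2 / sxx for xi in x]
    s2 = sum(e * e for e in residuals) / (n - 2)    # residual mean square
    # standardized residual: e_i / (s * sqrt(1 - h_i))
    standardized = [e / math.sqrt(s2 * (1.0 - h))
                    for e, h in zip(residuals, leverage)]
    return fitted, standardized, leverage
```

The leverages of a straight-line fit always average 2/n, which is why the mean leverage printed above is 0.053 ≈ 2/38: observations far from the mean of x carry the most influence.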
5 Linear regression
You can draw various pictures to check the assumptions of the analysis visually. If you select Model Checking from the Regression Further Output menu, the menu shown in Figure 5.6 allows you to choose between five types of graph for any of the residuals, leverage values, or the Cook's statistics (a combination of the residual and leverage information). Figure 5.7 shows the default, which is a composite of four of these graphs: a histogram of the residuals, so that you can check that the distribution is symmetrical and reasonably Normal; a plot of residuals against fitted values, so that you can check whether the residuals are roughly symmetrically distributed with constant variance; a Normal plot which plots the ordered residuals against Normal distribution statistics – if they lie roughly on a straight line, the residuals are roughly Normally distributed; and a half-Normal plot which does the same for the absolute values of the residuals, and can be more useful for small sets of data.
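The coordinates behind the Normal and half-Normal plots can be sketched in Python (illustrative only; the plotting position (i − 0.375)/(n + 0.25) is a common convention and may not be exactly the one GenStat uses):

```python
from statistics import NormalDist

def normal_plot_points(residuals):
    """Ordered residuals paired with expected Normal scores."""
    n = len(residuals)
    nd = NormalDist()
    ordered = sorted(residuals)
    scores = [nd.inv_cdf((i - 0.375) / (n + 0.25)) for i in range(1, n + 1)]
    return list(zip(scores, ordered))

def half_normal_plot_points(residuals):
    """The same idea for absolute residuals, against half-Normal scores."""
    n = len(residuals)
    nd = NormalDist()
    ordered = sorted(abs(r) for r in residuals)
    scores = [nd.inv_cdf(0.5 + 0.5 * (i - 0.375) / (n + 0.25))
              for i in range(1, n + 1)]
    return list(zip(scores, ordered))
```

If the residuals really are Normal, both sets of points fall close to a straight line; systematic curvature signals skewness or heavy or light tails.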
These pictures indicate that the variance seems unrelated to the size of the observation, but that the distribution seems to be more constrained than the Normal: the largest residuals are a little smaller than would be expected from a Normal distribution. We should therefore be cautious in interpreting the F-statistics and t-statistics, which rely on the assumption of Normality.
As well as displaying the results of an analysis, the regression menus allow you to save the results in standard data structures. This is a common feature of most of the analysis menus in GenStat. After a regression analysis you can click on the Save button of the Linear Regression menu (Figure 5.3), which generates the Linear Regression Save Options menu. The residuals, fitted values, parameter estimates and standard errors can all be saved in variates: if you check one of these boxes, you will be prompted for the name of the variate to store the results, as shown in Figure 5.8. The variance-covariance matrix of the parameter estimates can also be saved in a symmetric matrix, another of GenStat's standard data structures.

The fitted values provide predictions of the response variable at the values of the explanatory variable that actually occurred in the data. If you want predictions at other values, you can use the prediction menu, obtained by clicking on the Predict button in the Linear Regression menu. This generates the Predictions - Simple Linear Regression menu shown in Figure 5.9. Initially the Predict Values at box has mean filled in, so that a prediction would be formed for pressure at the mean value of the ages. However, we have changed this to ask for predictions at ages 25, 50, 75 and 100. The Display box has boxes that can be checked to provide predictions, standard errors, standard errors of differences between predictions, least significant differences of predictions, and a description of how the predictions are formed. Here we print predictions, standard errors and the description.
*** Predictions from regression model ***

These predictions are estimated mean values.

The standard errors are appropriate for interpretation of the predictions as
summaries of the data rather than as forecasts of new observations.

Response variate: Pressure

    Age  Prediction    s.e.
  25.00      75.501   1.150
  50.00      87.958   0.641
  75.00     100.416   1.152
 100.00     112.874   2.018
The output explains that the standard errors are appropriate as predictions for fitted values for these ages in this data set, not as predictions for new observations. We can augment the standard errors by the additional variability arising from a new set of observations at ages 25-100 by checking the box Include variance for future observation. (For further details see Section 3.3.4 of Part 2 of the Guide to GenStat.)
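The adjustment for a future observation simply adds the residual variance to the squared standard error of the predicted mean. As an illustrative sketch (Python; the numbers below are hypothetical, not taken from this analysis):

```python
import math

def se_future_observation(se_prediction, residual_mean_square):
    """Standard error for forecasting a new observation: combines the
    uncertainty of the fitted mean with the unit-to-unit residual
    variability (illustrative formula, not GenStat code)."""
    return math.sqrt(se_prediction ** 2 + residual_mean_square)

# Hypothetical values: s.e. of a predicted mean 3.0, residual m.s. 16.0
print(se_future_observation(3.0, 16.0))   # 5.0
```

The forecast standard error is therefore always larger than both the standard error of the mean and the residual standard deviation on their own.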
5.2 Multiple linear regression
In the previous section there was no difficulty identifying the explanatory variable to be used in the model. In many studies, however, this is not the case, and there may be a large number of potential variables to consider when trying to model the response. You then need to be able to explore models, comparing alternative variables or sets of variables, as well as to display and check any model finally selected. We shall illustrate this approach with a short set of data from a production plant, on page 352 of Applied Regression Analysis by Draper & Smith (1981, Wiley, New York). Information was collected over 17 months on variables possibly associated with water usage: the average temperature, the amount of production, the number of operating days and the number of employees. The data are loaded from the spreadsheet file Water.gsh.
Linear models with more than one explanatory variable are called multiple linear regression models. If you choose this title from the drop-down list in the Linear Regression menu, you can then specify several explanatory variables as well as the single response variable, as shown in Figure 5.10. However, rather than just fitting the full model in one step, we shall illustrate how you can fit a sequence of regression models. This is best done using the General Linear Regression option from the drop-down menu. This allows you to modify the model as many times as you like, using the Change Model button in the Linear Regression menu.

It is useful in a sequential study to start by specifying a maximal model, which includes all the explanatory terms that may be used in the sequence of models to be fitted. GenStat is then able to customize the Change Model menu so that the Available Data box is replaced by a Terms box containing all the terms that may be fitted. Also, if any explanatory variables have missing values, a common set of units (for which all variables have values) is identified at the start, so that all models can be properly compared. To start with, we leave the Model to be Fitted box blank and fit only the constant, as shown in Figure 5.11.

It is important to note a small difference between the model boxes in General Linear Regression compared to the other types. Here, you can construct model formulae using the operators given in the Operators box: therefore, if you want just
a list of explanatory variates, as here, you must type in commas to separate the identifiers. With Multiple Linear Regression these are added automatically. Here is the output from this first analysis.

***** Regression Analysis *****

Response variate: Water
Fitted terms: Constant
*** Summary of analysis ***

             d.f.     s.s.     m.s.   v.r.  F pr.
Regression      0    0.000        *
Residual       16    3.193   0.1995
Total          16    3.193   0.1995

Percentage variance accounted for 0.0
Standard error of observations is estimated to be 0.447

* MESSAGE: The following units have large standardized residuals:
      Unit   Response   Residual
        16      4.488       2.73
*** Estimates of parameters ***

           estimate     s.e.   t(16)  t pr.
Constant      3.304    0.108   30.49  <.001
We can build the model using the Change Model menu (Figure 5.12), obtained by returning to the Linear Regression menu and clicking on Change Model. This has a Terms window, in which you select the explanatory variables that you want to change. As you click on each one it is highlighted to show that it has been selected. As usual, you can hold down the Ctrl key when you click a line, so that this will not affect the highlighting of the other lines. Once you have selected the variables of interest, you can click the Add button to add them to the model. Alternatively, you can click the Drop button to remove them from the model, or click the Switch button to remove those that are in the model and add those that are not. The Try button allows you to assess the effect of switching each of the selected variables, before making any change. There is also a section of the menu for stepwise regression, which is discussed in Section 5.3.
In Figure 5.12, we have selected all the variables, and checked just the Display Changes box in the Explore section of the menu. Clicking Try now generates a succinct summary of the effect of each potential change. The first column describes the change. Subsequent columns give the degrees of freedom, sum of squares and mean square of the change. Here we are simply adding single potential x-variates, so the degrees of freedom are all one. Also, the residual of the initial model is printed to indicate the general level of variation. You might want to add terms with large mean squares (or remove terms with small mean squares, if there were any in the model already).

***** Changes investigated by TRY *****

Change                      d.f.     s.s.     m.s.
+ Employ                       1    0.545    0.545
+ Opdays                       1    0.025    0.025
+ Product                      1    1.270    1.270
+ Temp                         1    0.261    0.261
Residual of initial model     16    3.193    0.200
Try is useful particularly if you have many explanatory variables and do not wish to fit them all. Here we shall be adding them all to the model, and so we will not use Try again. However, we will take its advice as to which variable to add to the model first. The output shows that Product has the largest mean square, so we use the Change Model menu to add this (by selecting the Product line, and then clicking Add). The output is given below.

***** Regression Analysis *****

Response variate: Water
Fitted terms: Constant, Product
*** Summary of analysis ***

             d.f.     s.s.     m.s.   v.r.  F pr.
Regression      1    1.270   1.2702   9.91  0.007
Residual       15    1.922   0.1282
Total          16    3.193   0.1995
Change         -1   -1.270   1.2702   9.91  0.007

Percentage variance accounted for 35.8
Standard error of observations is estimated to be 0.358

* MESSAGE: The following units have large standardized residuals:
      Unit   Response   Residual
        16      4.488       2.31

* MESSAGE: The following units have high leverage:
      Unit   Response   Leverage
         2      2.828       0.27
         3      2.891       0.25
*** Estimates of parameters ***

           estimate     s.e.   t(15)  t pr.
Constant      2.273    0.339    6.71  <.001
Product      0.0799   0.0254    3.15  0.007
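Each t-statistic is simply the estimate divided by its standard error. A quick check against the values printed above (Python; small discrepancies in the last digit arise because GenStat works with unrounded values):

```python
# t-statistic = estimate / standard error, for the output above
for name, estimate, se in [("Constant", 2.273, 0.339),
                           ("Product", 0.0799, 0.0254)]:
    # agrees with the printed t(15) values up to rounding
    print(name, round(estimate / se, 2))
```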
The messages in the summary warn about one large residual, and two months with high leverage. So we would have to be careful in interpreting the results if we suspected that these two months were special in some way. Otherwise, the output from this analysis is similar to that in Section 5.1, and it shows that the model here accounts for only 36% of the variance in water use.

We can attempt to account for more of the variance by including the effect of another explanatory variable; we shall try the effect of temperature, so the model will be as follows:

water = a + b × production + c × temperature

This can be fitted easily by returning to the Linear Regression menu and clicking on Change Model again (Figure 5.12). You can then select Temp from the Terms box and click on Add as before to fit the modified model. The output is shown below.

***** Regression Analysis *****

Response variate: Water
Fitted terms: Constant, Product, Temp

*** Summary of analysis ***

             d.f.     s.s.     m.s.   v.r.  F pr.
Regression      2    1.560   0.7798   6.68  0.009
Residual       14    1.633   0.1167
Total          16    3.193   0.1995
Change         -1   -0.289   0.2894   2.48  0.138

Percentage variance accounted for 41.5
Standard error of observations is estimated to be 0.342

* MESSAGE: The following units have large standardized residuals:
      Unit   Response   Residual
        16      4.488       2.04
*** Estimates of parameters ***

           estimate      s.e.   t(14)  t pr.
Constant      1.615     0.528    3.06  0.008
Product      0.0808    0.0242    3.34  0.005
Temp        0.00996   0.00632    1.57  0.138
The Change line and the t-statistic for Temp tell the same story here: the extra explanatory variable accounts for a further 6% of the variance, but does not seem to have a significant effect in conjunction with the amount of production.

We now include the effect of the number of operating days in each month, by adding it via the Change Model menu. To decrease the amount of output, we have clicked on Options first, and cancelled the display of the parameter estimates in the resulting General Linear Regression Options menu, so that we just get the model summary, as shown in Figure 5.13. (Notice that this menu would also allow you to specify a variate of weights if you wanted to do a weighted linear regression.) In the output, shown below, the percentage variance accounted for has increased to 50%. So this variable has a marked effect on water usage.

***** Regression Analysis *****

Response variate: Water
Fitted terms: Constant, Product, Temp, Opdays

*** Summary of analysis ***
             d.f.     s.s.      m.s.   v.r.  F pr.
Regression      3    1.893   0.63093   6.31  0.007
Residual       13    1.300   0.09999
Total          16    3.193   0.19954
Change         -1   -0.333   0.33328   3.33  0.091

Percentage variance accounted for 49.9
Standard error of observations is estimated to be 0.316
Finally, we add the fourth explanatory variable, the number of employees, returning to the default output.

***** Regression Analysis *****

Response variate: Water
Fitted terms: Constant, Product, Temp, Opdays, Employ

*** Summary of analysis ***

             d.f.      s.s.      m.s.   v.r.  F pr.
Regression      4    2.4488   0.61221   9.88  <.001
Residual       12    0.7438   0.06198
Total          16    3.1926   0.19954
Change         -1   -0.5560   0.55603   8.97  0.011

Percentage variance accounted for 68.9
Standard error of observations is estimated to be 0.249

* MESSAGE: The following units have high leverage:
      Unit   Response   Leverage
         1      3.067       0.59
*** Estimates of parameters ***

           estimate      s.e.   t(12)  t pr.
Constant       6.36      1.31    4.84  <.001
Product      0.2117    0.0455    4.65  <.001
Temp        0.01387   0.00516    2.69  0.020
Opdays      -0.1267    0.0480   -2.64  0.022
Employ     -0.02182   0.00728   -3.00  0.011
This variable, too, has a large effect, raising the percentage variance accounted for to 69%. Notice that the t-statistics now provide evidence of a significant effect of each variable when all the others are taken into account. The estimate for the Temp parameter is larger than in the model with just production and temperature, 0.01387 compared to 0.00996, and its standard error is smaller, 0.00516 compared to 0.00632. The first effect is caused by the fact that there is correlation, or confounding, between the effects of the explanatory variables: so any effect is estimated differently in the presence of a different set of other explanatory variables. The difference in standard errors is caused both by this and by the fact that more variance has been accounted for in the last model.
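The way an estimate changes when a correlated variable enters the model can be seen in a tiny constructed example (Python; the data are invented purely for illustration):

```python
# Invented data in which x2 is correlated with x1 and y depends on both
x1 = [0.0, 1.0, 2.0, 3.0]
x2 = [0.0, 1.0, 1.0, 2.0]
y = [a + b for a, b in zip(x1, x2)]          # y = x1 + x2 exactly

def centred_sum(u, v):
    """Sum of products of deviations from the means."""
    n = len(u)
    ub, vb = sum(u) / n, sum(v) / n
    return sum((ui - ub) * (vi - vb) for ui, vi in zip(u, v))

# Slope of x1 fitted alone: it absorbs part of the effect of x2
b_alone = centred_sum(x1, y) / centred_sum(x1, x1)

# Slope of x1 with x2 also in the model (two-variable normal equations)
s11, s22, s12 = centred_sum(x1, x1), centred_sum(x2, x2), centred_sum(x1, x2)
s1y, s2y = centred_sum(x1, y), centred_sum(x2, y)
b_joint = (s1y * s22 - s2y * s12) / (s11 * s22 - s12 ** 2)

print(b_alone, b_joint)   # 1.6 versus 1.0: same variable, different estimate
```

Fitted alone, x1 acts partly as a proxy for the omitted x2, so its coefficient is inflated; once x2 is included, the coefficient returns to the value used to generate the data.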
The effect of this confounding can also be highlighted by looking at an accumulated analysis of variance. This shows the sequential effects of including the variables, in the order in which they were listed, rather than their effects in the presence of all the other variables. This summary is available from the Linear Regression Further Output menu, shown in Figure 5.14, and is displayed below.

***** Regression Analysis *****

*** Accumulated analysis of variance ***

Change       d.f.      s.s.      m.s.   v.r.  F pr.
+ Product       1   1.27017   1.27017  20.49  <.001
+ Temp          1   0.28935   0.28935   4.67  0.052
+ Opdays        1   0.33328   0.33328   5.38  0.039
+ Employ        1   0.55603   0.55603   8.97  0.011
Residual       12   0.74380   0.06198

Total          16   3.19263   0.19954
The F-probability for Temp here could be used to test the effect of temperature eliminating the effect of Product but ignoring Opdays and Employ; the t-probability with the estimate of Temp above tests the effect eliminating the effects of all the other explanatory variables.
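Because the accumulated summary is sequential, the individual changes partition the regression sum of squares exactly. Checking with the values above (Python, plain arithmetic), their total reproduces the regression sum of squares, 2.4488, of the full four-variable model:

```python
# Sequential sums of squares from the accumulated analysis above
changes = {"Product": 1.27017, "Temp": 0.28935,
           "Opdays": 0.33328, "Employ": 0.55603}

# Their total equals the regression s.s. of the full model
print(round(sum(changes.values()), 5))   # 2.44883
```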
5.3 Stepwise and all subsets regression
The sequential fitting methods described in Section 5.2 can be very labour intensive if there are many variables. The Change Model menu (Figure 5.15) also provides stepwise facilities that allow you to build up the model automatically. To illustrate these with the water usage data, we first fit a model with just the constant (using the menu in Figure 5.11 in Section 5.2), and then click the Change button to produce the Change Model menu as before. The process takes the form of a number of steps (specified in the Max. Number of Steps box) in which variables are added or dropped from the model. The possible changes to consider are selected in the Terms box; in Figure 5.15 we have decided to consider all the variables.

Each possible change is assessed using a variance ratio calculated as the mean square of the change line divided by the residual mean square of the original model. If you click the Forward Selection button, at each step GenStat adds the variable with the largest variance ratio, provided that variance ratio exceeds the value specified in the Test Criterion box. The default value for the criterion is one, but many users prefer the value four; see for example page 153 of McConway, Jones & Taylor (1999, Statistical Modelling using GENSTAT, Arnold, London). If we click on Forward Selection in Figure 5.15, two steps are taken, adding first Product and then Employ, as shown below.

*** Step 1: Residual mean squares ***

 0.1282   Adding Product
 0.1765   Adding Employ
 0.1955   Adding Temp
 0.1995   No change
 0.2112   Adding Opdays

Chosen action: Adding Product
*** Step 2: Residual mean squares ***

 0.09710   Adding Employ
 0.11665   Adding Temp
 0.12816   No change
 0.13174   Adding Opdays
 0.19954   Dropping Product

Chosen action: Adding Employ
As only the Display Changes box is checked in the menu, GenStat simply produces a brief summary of the changes. The residual mean square of the original model at each step is given in the “No change” line. Notice also that, for information, GenStat also shows the effect of dropping terms. Thus, if you set the maximum number of steps equal to the number of variables, GenStat will perform a complete forward stepwise fit automatically, stopping only when no further variable seems to be useful.

The Backward Elimination button examines the effect of dropping variables from the model. Suppose we now select Employ and Product in the Change Model menu (Figure 5.16), and click on Backward Elimination. At each step, GenStat now drops the term with the smallest variance ratio, provided that variance ratio is less than the test criterion. As the output below shows, both variance ratios are greater than the criterion, so the process stops after a single step.

*** Step 1: Residual mean squares ***

 0.09710   No change
 0.12816   Dropping Employ
 0.17649   Dropping Product

Chosen action: No change
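The step logic just described can be sketched in a few lines of Python (illustrative only – this is not how GenStat is programmed). Each candidate's variance ratio is the change mean square divided by the residual mean square of the current model, and the best candidate is added if it beats the criterion; for simplicity the sketch takes one forward step from the constant-only model, with invented data:

```python
def rss_line(x, y):
    """Residual sum of squares after fitting a straight line of y on x."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    tss = sum((yi - ybar) ** 2 for yi in y)
    return tss - sxy ** 2 / sxx

def forward_step_from_null(candidates, y, criterion=1.0):
    """One forward step from the constant-only model: add the candidate
    with the largest variance ratio (change m.s. divided by the residual
    m.s. of the current model), provided it exceeds the criterion."""
    n = len(y)
    ybar = sum(y) / n
    rss0 = sum((yi - ybar) ** 2 for yi in y)
    ms_res = rss0 / (n - 1)             # residual m.s. of the current model
    best_name, best_vr = None, criterion
    for name, x in candidates.items():
        vr = (rss0 - rss_line(x, y)) / ms_res    # the change has 1 d.f.
        if vr > best_vr:
            best_name, best_vr = name, vr
    return best_name

# Invented data: y rises steadily with x1 but is almost unrelated to x2
candidates = {"x1": [0, 1, 2, 3, 4, 5], "x2": [1, 0, 1, 0, 1, 0]}
y = [3.1, 5.0, 7.1, 9.0, 11.1, 13.0]
print(forward_step_from_null(candidates, y))   # chooses x1
```

With a very strict criterion no candidate qualifies and the step makes no change, mirroring GenStat's “No change” line.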
The menu can thus be used for full automatic backwards stepwise regression by first fitting the full model with the General Linear Regression menu (Figure 5.11). Then select all the variables in the Change Model menu, set a maximum number of steps equal to the number of variables and click on Backward Elimination. Finally, if you click the Stepwise Regression button, GenStat will first look to see if any variable can be dropped. Then, if that is not possible, it looks to see if any can be added.

Automatic stepwise procedures result in only one model, and alternative models with an equivalent or even better fit can easily be overlooked. In observational studies with many correlated variables, there can be many alternative models, and selection of just one well-fitting model may be unsatisfactory and perhaps misleading. Another method is to fit all possible regression models, and to evaluate these according to some criterion. In this way several best regression models can be selected. However, the fitting of all possible regression models can be very time-consuming. It should also be used with caution, because models can be selected that appear to have a lot of explanatory power, but contain only noise variables (those representing random variation). This can occur particularly when the number of parameters is large in comparison to the number of units. The models should therefore not be selected on the basis of a statistical analysis alone, but by considering the physical plausibility of models and by taking account of any previous modelling experience.

All subsets regression can be performed using the All Subsets Regression menu. This is obtained by selecting Regression Analysis from the Stats menu, clicking on All Subsets Regression and then Linear Models (as we shall be investigating a linear regression model again), as shown in Figure 5.17.
Figure 5.18 shows the menu set up to examine all possible regression models for the water usage data. Water is entered as the response variate, the explanatory variates are listed (separated by commas) in the Model formula or List of Explanatory Data box, and the All Possible box is checked. The output provides a brief summary of all the regressions. By default, the models with each number of explanatory variables are ordered according to their percentage variances accounted for (the column headed “Adjusted”), and a statistic known as Mallows Cp is provided for further information. Cp is rather more conservative than the percentage variance accounted for (see Section 3.2.6 of the Guide to GenStat, Part 2 Statistics), but here they lead to the same conclusions. Other statistics can be selected using the All Subsets Regression Options menu (obtained by clicking the Options button as usual). This also allows you to set a limit on the total number of terms in the subsets. (It may be impracticable to fit them all if there are many variables.)

***** Model Selection *****

Response variate:  Water
Number of units:   17
Forced terms:      Constant
Forced df:         1
Free terms:        Employ + Opdays + Product + Temp
*** All possible subset selection ***

* MESSAGE: Probabilities are based on F-statistics, i.e. on variance ratios

Best subsets with 1 term

Adjusted      Cp   Df   Employ   Opdays   Product   Temp
   35.77   18.02    2        -        -      .007      -
   11.55   29.71    2     .099        -         -      -
    2.04   34.30    2        -        -         -   .266
   <0.00   38.10    2        -     .735         -      -

Best subsets with 2 terms

Adjusted      Cp   Df   Employ   Opdays   Product   Temp
   51.34   10.93    3     .030        -      .003      -
   41.54   15.35    3        -        -      .005   .138
   33.98   18.76    3        -     .454      .007      -
   16.99   26.41    3     .075        -         -   .181
    6.42   31.18    3     .107     .679         -      -
    1.51   33.39    3        -     .354         -   .168

Best subsets with 3 terms

Adjusted      Cp   Df   Employ   Opdays   Product   Temp
   54.70    9.96    4     .042     .199      .004      -
   54.06   10.22    4     .019        -      .002   .177
   49.89   11.97    4        -     .091      .002   .036
   19.70   24.61    4     .062     .247         -   .092

Best subsets with 4 terms

Adjusted      Cp   Df   Employ   Opdays   Product   Temp
   68.94    5.00    5     .011     .022      .001   .020
The output shows that the best model with a single explanatory variable is the one with production (confirming the conclusion from our use of Try in Section 5.2), the best with two variables has production and number of employees, and so on. The menu also provides some rather more flexible and powerful stepwise regression facilities which we will not demonstrate. For details see the on-line help or Section 3.2.6 of the Guide to GenStat, Part 2, Statistics, which describes the RSEARCH procedure that the menu uses.
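Mallows Cp can be reproduced from quantities already in the output: Cp = RSS/s² − n + 2p, where s² is the residual mean square of the full model and p is the number of fitted parameters including the constant (the Df column). Checking against the values printed above (Python):

```python
def mallows_cp(rss_subset, s2_full, n_units, n_params):
    """Mallows Cp (n_params includes the constant, as in the Df column)."""
    return rss_subset / s2_full - n_units + 2 * n_params

# Full model (Df 5): RSS 0.7438, residual m.s. 0.06198, 17 units
print(round(mallows_cp(0.7438, 0.06198, 17, 5), 2))   # 5.0, as printed

# Best one-term subset, Product only (Df 2): RSS 1.922
print(round(mallows_cp(1.922, 0.06198, 17, 2), 2))    # 18.01, matching 18.02 up to rounding
```

For the full model Cp always equals p exactly (here 5), so subsets with Cp close to their own p are the ones fitting about as well as the full model.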
5.4 Regression with grouped data
So far, all the variables used in regression models have been continuous. However, one of the goals of many investigations is to look for differences between groups, or to see how the relationship between two continuous variables changes from group to group. So we now show how to fit regression models that take account of groupings of the units. We shall analyse the pollution data explored in Chapter 3, and fit some linear models to try to explain the variation in the sulphur observations. First, we fit a simple linear regression on the wind speed, after loading the data from the spreadsheet file Sulphur.gsh.

***** Regression Analysis *****

Response variate: Sulphur
Fitted terms: Constant, Windsp
*** Summary of analysis ***

             d.f.     s.s.     m.s.   v.r.  F pr.
Regression      1     935.   934.52   9.49  0.003
Residual      111   10932.    98.48
Total         112   11866.   105.95

Percentage variance accounted for 7.0
Standard error of observations is estimated to be 9.92

* MESSAGE: The following units have large standardized residuals:
      Unit   Response   Residual
        20      49.00       3.57
        98      43.00       3.88

* MESSAGE: The following units have high leverage:
      Unit   Response   Leverage
        30       3.00      0.075
        72       5.00      0.051
        95      14.00      0.054
       100      25.00      0.051

*** Estimates of parameters ***

           estimate    s.e.   t(111)  t pr.
Constant      17.03    2.33     7.32  <.001
Windsp       -0.636   0.207    -3.08  0.003
We discovered in Section 3.1 that the distribution of sulphur measurements was skew, so we are forewarned that there may be problems in assuming a Normal distribution with constant variance for this variable. We use the Model Checking option to plot the residuals against the fitted values to see if this is a problem after taking account of the effect of wind speed. The plot in Figure 5.19 shows a very skew distribution of residuals, with much more spread above 0 than below. So we shall try analysing the log of the sulphur measurements instead, in the hope that they are more Normally distributed. As in Section 3.1, we form a new variate Logsulphur containing the logarithms (to base 10) of the sulphur values.

***** Regression Analysis *****
Response variate: Logsulphur
Fitted terms: Constant, Windsp

*** Summary of analysis ***

             d.f.     s.s.     m.s.   v.r.  F pr.
Regression      1     1.50   1.4952  10.35  0.002
Residual      110    15.89   0.1445
Total         111    17.39   0.1567

Percentage variance accounted for 7.8
Standard error of observations is estimated to be 0.380

* MESSAGE: The following units have large standardized residuals:
      Unit   Response   Residual
        98      1.633       2.68

* MESSAGE: The following units have high leverage:
      Unit   Response   Leverage
        30      0.477      0.076
        72      0.699      0.052
        95      1.146      0.055
       100      1.398      0.051
*** Estimates of parameters ***

           estimate      s.e.   t(110)  t pr.
Constant     1.1066    0.0892    12.41  <.001
Windsp     -0.02557   0.00795    -3.22  0.002

The residual plot in Figure 5.20 shows a much more symmetrical distribution of observations, with no evidence of changing variance with the size of sulphur measurement. The plot does show up the imprecise recording of the sulphur measurements as integers: the apparent diagonal lines of points correspond to sulphur measurements with equal values. However, another problem is still apparent: the smoothed trend line in the graph shows evidence of curvature, which indicates that the effect of wind speed is not linear. Perhaps a quadratic or exponential model, as discussed in Section 5.4, would be better. However, we shall continue with the linear model here to make it easier to show how to fit the effect of groups.

The decrease in sulphur measurements with wind speed, noted in Section 3.1, is estimated here to be about 5.7% per km/h (antilog(-0.02557) = 94.3%), and is statistically significant. We would also like to estimate the difference between wet and dry days, and see if the relationship between sulphur and wind speed is different in the two categories. These two goals can be achieved by selecting Simple Linear Regression with Groups from the drop-down
list in the Linear Regression menu. This displays a menu with an extra box to specify a factor defining the groups; the filled-in box is shown in Figure 5.21, with the factor Rain entered as the grouping factor. This selection sends three successive analyses to the Output window. The first is exactly the same as that produced already with the Simple Linear Regression option, so we did not need to do that analysis separately. The second analysis fits a model with a separate intercept for wet and dry days, as shown below.

***** Regression Analysis *****

Response variate: Logsulphur
Fitted terms: Constant + Windsp + Rain

*** Summary of analysis ***
             d.f.     s.s.     m.s.   v.r.  F pr.
Regression      2     1.89   0.9442   6.64  0.002
Residual      109    15.50   0.1422
Total         111    17.39   0.1567
Change         -1    -0.39   0.3933   2.77  0.099

Percentage variance accounted for 9.2
Standard error of observations is estimated to be 0.377

* MESSAGE: The following units have high leverage:
      Unit   Response   Leverage
        30      0.477      0.102
        72      0.699      0.073
*** Estimates of parameters ***

            estimate      s.e.   t(109)  t pr.
Constant      1.1235    0.0891    12.62  <.001
Windsp      -0.02193   0.00818    -2.68  0.008
Rain yes     -0.1240    0.0745    -1.66  0.099

Parameters for factors are differences compared with the reference level:
Factor   Reference level
Rain     no
The effect of rainfall is quantified here in terms of the difference between dry and wet days: that is, by comparing level yes of the factor Rain to its reference level no. (By default the reference level is the first level of the factor, but the FACTOR directive has an option REFERENCELEVEL that enables you to use other levels.) So the model is

Logsulph = a + b × Windsp

for dry days, and

Logsulph = a + d + b × Windsp

for wet days. The model thus consists of two parallel regression lines. The estimates show that rainfall decreases the sulphur on average by 25% (antilog(-0.1240) = 75%), but this effect is not statistically significant because of the large unexplained variation in the sulphur measurements. This version of the model is very convenient if you want to make comparisons with the reference level (which may, for example, represent a standard set of conditions or treatment). However, we show later in this section how you can obtain the alternative version with a parameter in the model for each intercept.

We can see whether the linear effect of wind speed is different in the two categories of rainfall by looking at the third and final analysis in the Output window.

***** Regression Analysis *****

Response variate: Logsulph
Fitted terms: Constant + Windsp + Rain + Windsp.Rain

*** Summary of analysis ***
             d.f.     s.s.     m.s.   v.r.  F pr.
Regression      3     1.92   0.6402   4.47  0.005
Residual      108    15.47   0.1432
Total         111    17.39   0.1567
Change         -1    -0.03   0.0323   0.23  0.636

Percentage variance accounted for 8.6
Standard error of observations is estimated to be 0.378

* MESSAGE: The following units have large standardized residuals:
      Unit   Response   Residual
        98      1.633       2.61

* MESSAGE: The following units have high leverage:
      Unit   Response   Leverage
        30      0.477      0.160
        72      0.699      0.112
        95      1.146      0.111
       104      1.580      0.093
*** Estimates of parameters ***

                  estimate     s.e.   t(108)  t pr.
Constant             1.153    0.109    10.57  <.001
Windsp             -0.0252   0.0107    -2.36  0.020
Rain yes            -0.208    0.193    -1.08  0.283
Windsp.Rain yes     0.0079   0.0167     0.47  0.636

Parameters for factors are differences compared with the reference level:
Factor   Reference level
Rain     no
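The slope for wet days is not printed directly: it is the dry-day slope (Windsp) plus the wet-minus-dry difference (Windsp.Rain yes). Using the estimates above (Python, simple arithmetic):

```python
# Slope for wet days = dry-day slope + wet-minus-dry difference,
# taken from the estimates printed above
slope_dry = -0.0252           # Windsp
slope_difference = 0.0079     # Windsp.Rain yes
slope_wet = slope_dry + slope_difference
print(round(slope_wet, 4))    # -0.0173
```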
This model includes the interaction between the explanatory factor and variate. In GenStat, interactions are represented using the dot operator, so that Windsp.Rain represents the interaction between wind speed and rain. More details about the specification of statistical models are given in Section 6.5. The output now shows the slope of the regression for dry days, titled Windsp, and the difference in slopes between wet and dry, titled Windsp.Rain yes. So again we can see immediately that the difference between the slopes is small and not significant. The graph of the fitted model is shown in Figure 5.22.

An analysis of parallelism can be carried out using the Accumulated option of the Linear Regression Further Output menu, as in Figure 5.14. This allows you to make a formal assessment of how complicated a model you need. You can then
select the appropriate model from the Final Model box (see Figure 5.21) and click on OK to fit it.

*** Accumulated analysis of variance ***

Change           d.f.      s.s.      m.s.   v.r.  F pr.
+ Windsp            1    1.4952    1.4952  10.44  0.002
+ Rain              1    0.3933    0.3933   2.75  0.100
+ Windsp.Rain       1    0.0323    0.0323   0.23  0.636
Residual          108   15.4677    0.1432

Total             111   17.3884    0.1567

Here a Common line (in fact, a simple linear regression) would be enough, but to illustrate the fitted parallel lines we have selected Parallel lines, estimate lines and clicked on OK. This fits parallel lines but now with a parameter for each intercept, rather than parameters for differences from the reference level (which would be given by the alternative setting Parallel lines, estimate differences from ref. level). The other settings are: Common line; Parallel lines, estimate lines; and Parallel lines, estimate differences from ref. level. The fitted parallel lines are shown in Figure 5.23.

***** Regression Analysis *****

Response variate: Logsulphur
Fitted terms: Windsp + Rain

*** Summary of analysis ***

             d.f.    s.s.     m.s.   v.r.  F pr.
Regression      2    1.89   0.9442   6.64  0.002
Residual      109   15.50   0.1422
Total         111   17.39   0.1567
Change         -1   -0.39   0.3933   2.77  0.099
Percentage variance accounted for 9.2
Standard error of observations is estimated to be 0.377

* MESSAGE: The following units have high leverage:
      Unit   Response   Leverage
        30      0.477      0.102
        72      0.699      0.073
*** Estimates of parameters ***

            estimate      s.e.  t(109)  t pr.
Windsp      -0.02193   0.00818   -2.68  0.008
Rain no       1.1235    0.0891   12.62  <.001
Rain yes       1.000     0.109    9.14  <.001
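Because the parallel-line model is just two lines with a common slope, the predictions shown later can be reproduced directly from these estimates; a Python sketch with the values copied from the output above:

```python
slope = -0.02193                          # common slope for Windsp
intercept = {"no": 1.1235, "yes": 1.000}  # one intercept per level of Rain

def fitted(rain, windsp):
    # Fitted value on the log scale for a given rain group and wind speed
    return intercept[rain] + slope * windsp

for w in (0, 5, 10, 15, 20):
    print(w, round(fitted("no", w), 3), round(fitted("yes", w), 3))
```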
If we now click on the Predict button in the Linear Regression menu (Figure 5.21), we can obtain predictions from this parallel-line model. The predictions menu (Figure 5.24) is now customized to include the grouping factor (Rain).

In Figure 5.24, the drop-down list box Predict at Levels is set to all, to indicate that we want to form predictions for all the levels of Rain. The alternative setting, standardize, forms averages over the levels of Rain, and the Standardization Method box then allows you to indicate whether you want ordinary averages (Equal), or whether you want the levels weighted according to their replication in the data set (Marginal), or whether you want to specify your own weights (Specify), which might correspond to the numbers of wet and dry days that you would anticipate in some future period.

The other box specifies the values of the explanatory variate (Windsp) for which we want predictions, here 0, 5, 10, 15 and 20. We have also checked the box to include variance of future observation (unlike Figure 5.9 in Section 5.1), so the standard errors in the output below are relevant for the values as predictions of the amounts of sulphur on future occasions.

*** Predictions from regression model ***

These predictions are estimated mean values.
The predictions have been formed only for those combinations of factor levels for which means can be estimated without involving aliased parameters.
The standard errors are appropriate for interpretation of the predictions as forecasts of new observations rather than as summaries of the data.

Response variate: Logsulphur

    Rain           no                  yes
           Prediction    s.e.   Prediction    s.e.
  Windsp
    0.00        1.124   0.387        1.000   0.393
    5.00        1.014   0.382        0.890   0.385
   10.00        0.904   0.380        0.780   0.381
   15.00        0.794   0.383        0.671   0.382
   20.00        0.685   0.390        0.561   0.387

5.5  Fitting curves and polynomials
Linear regression models are linear in the parameters. However, they do not have to be linear in the explanatory variables, so we can fit a model like

yi = a + b × xi + c × xi²

almost as easily as the model in the previous sections. A model like this is called a polynomial regression model and can be fitted in GenStat using the General Linear Regression option of the Linear Regression menu.

We start by fitting a polynomial model to some results from an experiment on sugar cane, where there is a curvilinear relationship. The file Cane.gsh stores the yield of sugar from four replicates of each of five levels of nitrogen fertilizer. After loading the contents of this file, and selecting Linear Regression, you need to fill in the menu to specify the response variable and the model to be fitted, as shown in Figure 5.25. The Maximal Model box does not need to be filled here (it is designed for sequential model fitting, described in Section 5.2).

The model to be fitted makes use of the POL function in the GenStat language, specifying concisely that a quadratic effect of Nitrogen should be fitted (a quadratic is a polynomial of order 2). If you simply specified Nitrogen as the fitted model, then GenStat would fit a simple linear regression as in the last section. The POL function is available in the Operators box, along with other functions and operators that can be used to specify models. The POL function will
allow models with up to the fourth power. If you want to use higher powers, you would need to fit orthogonal polynomials using the REG function (see Section 3.4.2 of Part 2 of the Guide to GenStat). The following output shows the results of fitting the model, which includes parameters for nitrogen and for the square of nitrogen.

***** Regression Analysis *****

Response variate: Yield
Fitted terms: Constant + Nitrogen
Submodels: POL(Nitrogen; 2)

*** Summary of analysis ***

             d.f.     s.s.      m.s.    v.r.  F pr.
Regression      2   34798.   17398.9  156.90  <.001
Residual       17    1885.     110.9
Total          19   36683.    1930.7
Percentage variance accounted for 94.3
Standard error of observations is estimated to be 10.5

* MESSAGE: The following units have large standardized residuals:
      Unit   Response   Residual
         6      144.0       2.10
         7      145.0       2.20

*** Estimates of parameters ***

                  estimate       s.e.   t(17)  t pr.
Constant             74.19       4.96   14.97  <.001
Nitrogen Lin         1.112      0.117    9.47  <.001
Nitrogen Quad    -0.002721   0.000563   -4.83  <.001
There is a message in the output about two large residuals: GenStat automatically checks to see if any residuals are large compared to a standard Normal distribution (see Section 3.1.2 of Part 2 of the Guide to GenStat for the exact criterion). However, these two are only just outside the range (-1.96, 1.96) which contains 95% of observations from a Normally distributed variable. The printed output shows the parameter estimates, so you can see that the fitted curve has the equation:

Yield = 74.19 + 1.112 × Nitrogen - 0.002721 × Nitrogen²
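The warning against extrapolation below can be made concrete: a quadratic a + b×x + c×x² with c negative turns over at x = -b/(2c). A Python check with the estimates above (an illustration, not GenStat output):

```python
b = 1.112       # Nitrogen Lin coefficient from the fit
c = -0.002721   # Nitrogen Quad coefficient from the fit

turning_point = -b / (2 * c)
print(round(turning_point, 1))  # about 204: beyond this the curve predicts falling yields
```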
You can display the fitted model by clicking on the Fitted Model button of the Regression Further Output menu as before. However, for all models other than simple linear regression, this option brings up the Graph of Fitted Model menu, shown in Figure 5.26, to allow for alternative types of display. In this case, there is still only a single graph that can be drawn, with Nitrogen on the x-axis as before; but when there are several explanatory variables, as in Section 5.2, this allows you to choose between them. The resulting picture, in Figure 5.27, shows the segment of the quadratic curve that has been fitted.

The polynomial model that we have fitted above has a number of problems for our experimental data. Firstly, the shape of the curve is fixed in important ways: it is a quadratic that must be symmetrical about the maximum, and the curvature changes in a fixed way. There is no scientific reason why this shape should fit well, so we should be cautious in using it to summarize the results. We should certainly beware of trying to extrapolate outside the range of the data: for larger amounts of fertilizer nitrogen, the model will predict falling yields, for which there is no evidence here.

GenStat provides other, more advanced, functions that can help to model nonlinear relationships. The SSPLINE function fits a smoothing spline, which can give a useful indication of the shape of a relationship without imposing too much pre-defined structure. A model that contains this function, or the LOESS (locally weighted regression) function, is called an additive model rather than a linear model; for further details, see Section 3.4.3 of Part 2 of the Guide to GenStat.
We illustrate the idea here by fitting a smoothing spline to the sugar yields (Figure 5.28). As with POL, you can select the SSPLINE function from the Operators window (though it appears in abbreviated form as S rather than SSPLINE) and it needs two arguments: the identifier of the explanatory variate and the order of the model (here 2). The order effectively specifies how much the original data are to be smoothed: order 1 corresponds to perfect smoothing – a straight line – while order 4 would correspond here to no smoothing – a curve passing through the mean yield at each of the five distinct values of Nitrogen.

***** Regression Analysis *****

Response variate: Yield
Fitted terms: Constant + Nitrogen
Submodels: SSPLINE(Nitrogen; 2)

*** Summary of analysis ***

             d.f.     s.s.      m.s.    v.r.  F pr.
Regression      2   34649.   17324.6  144.82  <.001
Residual       17    2034.     119.6
Total          19   36683.    1930.7
Percentage variance accounted for 93.8
Standard error of observations is estimated to be 10.9

* MESSAGE: The following units have large standardized residuals:
      Unit   Response   Residual
         6      144.0       2.14
         7      145.0       2.23
*** Estimates of parameters ***

                estimate     s.e.   t(17)  t pr.
Constant           87.80     4.24   20.73  <.001
Nitrogen Lin      0.5675   0.0346   16.41  <.001
The output does not show the equation of the fitted curve: it is rather complicated, involving cubic polynomials fitted between each distinct pair of values of Nitrogen. The linear component is, however, estimated and displayed as before. The point of this analysis is to draw the picture, shown in Figure 5.28. This shows a smooth curve quite similar to the previous polynomial curve, but still rising at the largest value of Nitrogen rather than reaching a maximum there.

Whereas the polynomial is an artificial curve for this relationship, the exponential curve or asymptotic regression has more scientific justification. It has the equation

yield = a + b × r^nitrogen

which represents a curve rising to a plateau or asymptote, quantified by the parameter a, if the rate parameter r is between 0 and 1 and the range parameter b is negative. This model is not linear, so it cannot be fitted using linear regression. However it is one of the repertoire of nonlinear curves available from the Standard Curve line in the Regression section of the Stats menu (see Figure 5.2).

If you select this, a menu appears as in Figure 5.29, including a drop-down list whose first item is the exponential curve. The box entitled Direction of Response allows you to choose between an exponential growth model (corresponding to a value of r greater than 1 in the equation above, and with an asymptote to the left in a graph) or a model in which the response increases or declines to an asymptote as the explanatory variable increases (r positive but less than 1). Here are the results of fitting the exponential curve to the sugar yields.

***** Nonlinear regression analysis *****

Response variate: Yield
Explanatory: Nitrogen
Fitted Curve: A + B*R**X
Constraints: R < 1
*** Summary of analysis ***

             d.f.     s.s.       m.s.    v.r.  F pr.
Regression      2   35046.   17523.18  182.02  <.001
Residual       17    1637.      96.27
Total          19   36683.    1930.68
Percentage variance accounted for 95.0
Standard error of observations is estimated to be 9.81

*** Estimates of parameters ***

     estimate      s.e.
R     0.98920   0.00213
B      -131.1      10.6
A       203.0      10.8
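The roles of the three parameters can be seen by evaluating the fitted curve; a Python sketch using the estimates above (an illustration, not GenStat output):

```python
a, b, r = 203.0, -131.1, 0.98920  # asymptote, range and rate from the fit

def fitted_yield(nitrogen):
    # Exponential (asymptotic regression) curve: a + b * r**x
    return a + b * r ** nitrogen

print(round(fitted_yield(0), 1))    # 71.9: fitted yield with no fertilizer
print(round(fitted_yield(200), 1))  # about 188, well on the way to the asymptote a = 203
```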
Note that no t-probabilities are shown in this nonlinear analysis, because both standard errors and t-statistics are approximations, depending on the amount of curvature of the model and how well it fits the data. The fitted model is shown in Figure 5.30. It seems to fit the data well, and has reasonable behaviour at both extremes of the nitrogen fertilizer treatments.

The Standard Curves menu covers most situations but, if you want to fit a curve that it does not cover, GenStat has an alternative menu, shown in Figure 5.31, that allows you to define and fit your own nonlinear curves. This is obtained by clicking on the Nonlinear Models line in the Regression section of the Stats menu (see Figure 5.2). We illustrate it by refitting the exponential model.
First of all we enter Yield into the Response Variate field in the usual way. Then we must define the model to be fitted. The model can contain a mixture of linear and nonlinear terms. The nonlinear terms are defined by clicking on the New button in the Model Expressions section. This pops up the Generate Expression menu (Figure 5.32), which you use to specify expressions to define the nonlinear parts of the model. Here we have defined the expression

R_N = R ** Nitrogen
The expression has been given the reference identifier Exp and, when we click on OK, this identifier is entered (automatically) by GenStat into the Model Expressions list box in the Nonlinear Models menu (Figure 5.31). The variate R_N is a linear term in the model, which can be written as

A + B*R_N
So we check the Estimation includes linear parameters box, and enter R_N into the Linear Terms fitted in Model box. It is much more efficient to estimate the parameters A and B in this way. The alternative would be to define the whole model in the expression, for example by

FitYield = A + B * R**Nitrogen
The expression sets variate FitYield to the fitted values given by the model. If the Linear Terms fitted in Model box is not checked, the Maximal Model box is replaced by a box called Fitted Values into which you should enter the name of the fitted values variate FitYield.
Notice that you can have more than one linear term. In fact you can define a maximal model and use the Change Model menu as in Section 5.2 to decide which ones are needed. The Distribution and Link Function boxes allow you to define and fit generalized nonlinear models (see Section 3.5.8 of Part 2 of the Guide to GenStat). The default settings of Normal and Identity, as in Figure 5.31, fit the usual type of nonlinear model in which the residuals are assumed to be Normally distributed.

The next step is to list the nonlinear parameters (in this case just R) in the Nonlinear Parameters box of the Nonlinear Models menu (Figure 5.31). You will need to set initial values for these, and possibly also bounds and steplengths, by using the Nonlinear Parameter Settings menu (Figure 5.33) generated by clicking on the Settings button in the Nonlinear Models menu. Here we have set an initial value of 0.9, and defined an upper bound of 1.0, but have not defined any lower bound and have left GenStat to decide on the step length to be used. Finally, clicking on OK produces the output below.

***** Nonlinear regression analysis *****

Response variate: Yield
Nonlinear parameters: R
Model calculations: Exp
Fitted terms: Constant, R_N

*** Summary of analysis ***

             d.f.     s.s.       m.s.    v.r.  F pr.
Regression      2   35046.   17523.18  182.02  <.001
Residual       17    1637.      96.27
Total          19   36683.    1930.68

Percentage variance accounted for 95.0
Standard error of observations is estimated to be 9.81

*** Estimates of parameters ***

            estimate      s.e.
R            0.98920   0.00213
* Linear
Constant       203.0      10.8
R_N           -131.1      10.6
The parameter B is now the regression coefficient of R_N, and A is now the Constant, but otherwise the results are identical to those given (rather more easily) by the Standard Curves menu. Further information about nonlinear curve fitting is in Section 3.8 of Part 2 of the Guide to GenStat.
5.6  Generalized linear models
The regression menus that we have seen so far are intended for continuous data that can be assumed to follow a Normal distribution. However, GenStat can handle many other types of data. One possibility is that the data may consist of counts. For example, you may have recorded the number of various types of items that have been sold in a shop, or numbers of accidents occurring on different types of road, or the number of fungal spores on plants with different spray treatments. Such data are generally assumed to follow a Poisson distribution. At the same time, it is usually assumed also that treatment effects will be proportionate (that is, the effect of a treatment will be to multiply the expected count by some number, rather than to increase it by some fixed amount). So, the model will be linear on a logarithmic scale rather than on the natural scale as used in ordinary linear regression. Models like this are known as log-linear models and form just one of the types of model covered by GenStat's facilities for generalized linear models.

The Generalized Linear Models menu is obtained by clicking on the Generalized Linear line in the Regression section of the Stats menu (see Figure 5.2). For a log-linear model, you should then select Loglinear modelling in the Analysis drop-down list box, as shown in Figure 5.34. The menu operates similarly to the General Linear Regression menu (Figure 5.11), allowing you to define a maximal model and then investigate which of its terms are required, as shown in Section 5.2. Here, though, we have a single explanatory variate, and so there is
no need to specify the Maximal Model. The data set (in GenStat spreadsheet file Cans.gsh) concerns the number of cans of drink (sales) sold by a machine during 30 weeks. The explanatory variate temperature is the average temperature during that week. Clicking on OK produces the output below.

***** Regression Analysis *****

Response variate: sales
Distribution: Poisson
Link function: Log
Fitted terms: Constant, temperature

*** Summary of analysis ***

             d.f.  deviance  mean deviance  deviance ratio  approx chi pr
Regression      1     52.61         52.614           52.61          <.001
Residual       28     32.05          1.145
Total          29     84.66          2.919

* MESSAGE: ratios are based on dispersion parameter with value 1

Dispersion parameter is fixed at 1.00

* MESSAGE: The following units have large standardized residuals:
      Unit   Response   Residual
        30     137.00       2.87
*** Estimates of parameters ***

                estimate      s.e.    t(*)  t pr.  antilog of estimate
Constant          4.3410    0.0303  143.49  <.001                76.78
temperature      0.01602   0.00222    7.22  <.001                1.016

* MESSAGE: s.e.s are based on dispersion parameter with value 1
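The "antilog of estimate" column is simply the exponential of each estimate, since the link is the logarithm to base e; a quick Python check:

```python
import math

constant = 4.3410    # estimate for Constant
temp_coef = 0.01602  # estimate for temperature

print(round(math.exp(constant), 1))   # about 76.8: expected sales in a week at 0 degrees
print(round(math.exp(temp_coef), 3))  # 1.016: sales multiplier per extra degree
```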
The initial description contains the extra information that the data have a Poisson distribution, and that the link function (the transformation required to give a scale on which the model is linear) is the logarithm to base e. These are the two aspects required to characterize a generalized linear model. In the Log-linear modelling menu they are set automatically, but you can also select General Model in the Analysis field to obtain a menu where you can set these explicitly, and thus fit any of GenStat's generalized linear models.
With generalized linear models, the summary of analysis contains deviances instead of sums of squares. Under the null hypothesis they have χ² distributions, and a quick rule-of-thumb is that their expected values are equal to their degrees of freedom. However, some sets of data show over-dispersion. The residual deviance is then noticeably greater than its expectation and, instead of assessing the regression line by comparing its deviance with χ², you should use the deviance ratio (and assess this using an F distribution). You should also estimate the dispersion parameter, by checking Estimate in the Dispersion Parameter section of either the Generalized Linear Model Options menu or the Generalized Linear Models Further Output menu (Figure 5.35). GenStat will then adjust the standard errors of the parameter estimates to take account of the over-dispersion.

Note, however, that the residual deviance may be large not because of over-dispersion, but simply because some important terms have been omitted from the model (and these may not even be available in the data set). You should then keep the dispersion parameter at the default value of 1, and continue to assess the deviances using χ² distributions. Further details are given in Section 3.5.1 of Part 2 of the Guide to GenStat.

Here, though, the residual deviance is not substantially more than its expectation (as illustrated by the fact that its mean deviance is 1.145). So we can treat the regression deviance as χ² on one degree of freedom, and note that there seems to be a very strong effect of temperature on sales.
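The rule-of-thumb amounts to comparing the residual mean deviance with 1; a quick check with the figures from the analysis above:

```python
residual_deviance = 32.05  # residual deviance from the summary of analysis
residual_df = 28           # residual degrees of freedom

mean_deviance = residual_deviance / residual_df
print(round(mean_deviance, 3))  # 1.145: close to 1, so no sign of over-dispersion
```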
The fitted model can be displayed by clicking on the Further Output button to obtain the Generalized Linear Models Further Output menu (Figure 5.35), and then clicking on Fitted Model to obtain the Graph of Fitted Model menu, as seen in Section 5.5 (Figure 5.26). The curve of the fitted line in the graph (Figure 5.36) illustrates the logarithmic link transformation, as well as the point with the large residual (on the top right of the plot). You can also produce the model-checking plots in the same way as in earlier sections.

The Generalized Linear Models menu also has customized menus for binomial data, where each data value records a number of subjects responding out of a total number observed. The models will often involve factors as well as variates. Section 6.6 describes how to define the model formulae that you will then need to specify in the Maximal Model and Model to be Fitted boxes. Example analyses are described in Section 3.5 of Part 2 of the Guide to GenStat.
5.7  Regression commands
The commands for regression analysis give you more control over the fitting of models, and allow more complex models to be fitted as well. We describe here only those commands that carry out analyses like those already done in this chapter; for more information, see Chapter 3 of Part 2 of the Guide to GenStat. The MODEL directive must be used before any regression analysis, to specify the response variate; for example: MODEL Pressure
MODEL can also define the distribution and link function of a generalized linear model (Section 5.6) using its DISTRIBUTION and LINK options. A simple linear regression can then be fitted with the FIT directive:
FIT Age
The FIT directive has a PRINT option to control the output that is produced, so you could ask for all sections of output with the command:

FIT [PRINT=model,summary,estimates,correlations,\
    fitted,accumulated] Age
Alternatively, after fitting a model you can use the RDISPLAY directive to display further sections of output without refitting the model; it has a PRINT option just like FIT. The RGRAPH procedure allows you to draw a picture of the fitted model. For example, RGRAPH
draws a graph of a simple linear regression. After multiple regression, you can specify the explanatory variate or a grouping factor or both, as in RGRAPH Logsulphur; GROUPS=Rain
The RCHECK procedure provides model checking. It has two parameters: the first specifies what to display in the graph (residuals, Cook's statistics or leverages) and the second specifies the type of graph (composite, histogram, fittedvalues, index, Normal or halfNormal). For example, RCHECK
draws the composite picture, and the fitted-values graph shown in Figure 5.19 can be drawn by RCHECK residual; fitted
The RKEEP directive allows you to extract information into standard structures. It has many parameters, for each piece of information. To extract the estimates and standard errors, for example, give a command like the following after fitting the model: RKEEP ESTIMATES=polest; SE=polse
Multiple linear regression is achieved simply by listing the explanatory variates in the FIT directive. The list may also include factors, which allows you to fit simple or multiple linear regression models with groups; for example: FIT Windsp,Rain
The parameter of FIT can also be a model formula (defined in Section 6.6), which can include interactions between factors or variates or both; for example: FIT Windsp*Rain
fits the linear effect of the variate Windsp, the main effect of the factor Rain, and the interaction between them (which represents separate linear effects of Windsp for each level of Rain). To explore a sequence of regression models, you can use the FIT directive and any of the ADD, DROP, TRY or SWITCH directives to modify the model in the same way as described in Section 5.2 for the Change Model menu. If you want to establish a maximal model first, for example when there are missing values, you can start by using the TERMS directive to list all possible explanatory variables, or specify the maximal model as a model formula; for example, TERMS Product,Temp,Opdays,Employ
You can use the POL and SSPLINE functions in the parameter of the FIT directive, to specify polynomial or smoothing spline models; for example, FIT SSPLINE(Nitrogen; 2)
Standard curves are fitted using the FITCURVE directive, which has a parameter and a PRINT option just like FIT, together with a CURVE option to choose the type of curve; for example: FITCURVE [PRINT=summary; CURVE=exponential] Nitrogen
General, user-defined, nonlinear models are fitted using the FITNONLINEAR directive.
5.8  Other facilities
Other menus in the regression section, but not described in this Chapter, include ordinal regression, screening tests, generalized linear mixed models, hierarchical generalized linear models (provided these have been selected in the Advanced menus section on the Menus page of the Options menu) and regression trees. Information on their use can be obtained by clicking on the Help buttons on each of the menus.
5.9  Exercises
5(1) An absorptiometer was used to measure the absorption of light passing through suspensions that contained different numbers of cells. It was intended to estimate the number of cells in future suspensions by the rapid light absorption method, so it was decided to regress cell counts on light absorption. The data are given below, and are available in the spreadsheet file Absorb.gsh, where X is the absorptiometer reading and Y the cell count (10^8/ml).
    X      Y        X      Y        X      Y
 0.37    8.2     0.64   12.1     0.84   15.8
 0.59   10.6     0.77   14.2     0.71   18.2
 0.48    7.3     0.78   16.1     1.02   16.8
 0.62   13.3     0.93   15.0     0.91   19.1
 0.74   11.4     0.81   16.9     0.94   23.4
 0.71   12.9
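If you want a numerical cross-check while working through this exercise, the least-squares line can be computed by hand; a Python sketch (an illustration only, using the data as printed above — the GenStat analysis is what the exercise asks for):

```python
# Absorptiometer readings (X) and cell counts (Y), in the order printed above
x = [0.37, 0.59, 0.48, 0.62, 0.74, 0.71, 0.64, 0.77,
     0.78, 0.93, 0.81, 0.84, 0.71, 1.02, 0.91, 0.94]
y = [8.2, 10.6, 7.3, 13.3, 11.4, 12.9, 12.1, 14.2,
     16.1, 15.0, 16.9, 15.8, 18.2, 16.8, 19.1, 23.4]

n = len(x)
mean_x, mean_y = sum(x) / n, sum(y) / n
sxx = sum((xi - mean_x) ** 2 for xi in x)
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))

slope = sxy / sxx
intercept = mean_y - slope * mean_x
print(round(intercept, 2), round(slope, 2))  # cell count rises with the reading
```

The same slope and intercept should be reproduced by the regression GenStat fits, since both are ordinary least squares.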
This example comes from Experimentation in Biology by Ridgman (1975, Blackie, Glasgow). Load these data into GenStat and fit a linear regression of cell count on absorptiometer reading. Produce a graphical display of the regression. Compare this with a model with no constant (or intercept term).

5(2) Spreadsheet file Rubber.gsh contains data from an experiment to study how the resistance of rubber to abrasion is affected by its strength and hardness. The data are from Davies & Goldsmith (1972, Statistical Methods in Research & Production, Oliver & Boyd, Edinburgh), and are also used by McConway, Jones & Taylor (1999, Statistical Modelling using GENSTAT, Arnold, London, Chapter 4). Use linear regression to see how loss depends on hardness. How well are the assumptions satisfied? Form predictions for hardness values 50, 60, 70, 80 and 90. Is strength a better explanatory variable? Fit a multiple linear regression with both hardness and strength. Are both needed in the model? How well are the assumptions satisfied? Form predictions for hardness values 50, 60, 70, 80 and 90, and strength values 150, 175, 200, 225 and 250.

5(3) An attempt was made to predict the total mark of candidates in a school examination from their mark in the compulsory papers, together with their mark in an English language paper on a previous occasion. Some of the data obtained are presented below, and are in file Mark.gsh.

 Candidate   Total mark   Compulsory papers   Previous paper
         1          476            111                   68
         2          457             92                   46
         3          540             90                   50
         4          551            107                   59
         5          575             98                   50
         6          698            150                   66
         7          545            118                   54
         8          574            110                   51
         9          645            117                   59
        10          556             94                   97
        11          634            130                   57
        12          637            118                   51
        13          390             91                   44
        14          562            118                   61
        15          560            109                   66
This example, too, comes from Applied Regression Analysis. Regress the total mark first on the compulsory mark and then on both compulsory and previous marks. Test if there is a significant improvement in prediction of the total mark by including the information from the previous paper.

5(4) Spreadsheet file Peru.gsh contains a data set, discussed in Section 6.2 of McConway, Jones & Taylor (1999, Statistical Modelling using GENSTAT, Arnold, London), which contains information about 39 Peruvian Indians. The aim is to see how the variate sbp (systolic blood pressure) is related to the other variables. Use forward stepwise regression to fit a model containing up to four variables. Can that model be improved? Use all subsets regression to see if there are any alternative models.
5(5) Spreadsheet file Calcium.gsh contains a data set, discussed in Sections 4.6 and 6.4 of McConway, Jones & Taylor (1999, Statistical Modelling using GENSTAT, Arnold, London), which contains information about mortality in 61 towns. The aim is to study how this relates to the calcium concentration in the drinking water supply. Fit a linear regression of mortality on calcium. Are the assumptions satisfied better by transforming mortality to logarithms? Perform a simple linear regression with groups to see how the relationship is affected by region.

5(6) Experiments on cauliflowers in 1957 and 1958 provided data on the mean number of florets in the plant and the temperature during the growing season (expressed as accumulated temperature above 0°C).

        1957                1958
  number    temp      number    temp
    3.8      2.5        6.0      2.5
    6.2      4.2        8.5      4.4
    7.2      5.3        9.1      5.3
    8.7      5.8       12.0      6.4
   10.2      7.2       12.6      7.2
   13.5      8.9       13.3      7.8
   15.0     10.0       15.2      9.2
Load the data from file Cauliflower.gsh and carry out an analysis of parallelism of the relationship between number of florets and accumulated temperature, checking the assumptions for linear regression.

5(7) A product is known to lose weight after manufacture. The following measurements (in 1/16 oz) were taken at half-hourly intervals, and are available in file Wtloss.GSH:

 Time after production   Weight difference
          0.0                  0.21
          0.5                 -1.46
          1.0                 -3.04
          1.5                 -3.21
          2.0                 -5.04
          2.5                 -5.37
          3.0                 -6.03
          3.5                 -7.21
          4.0                 -7.46
          4.5                 -7.96
This example comes from Applied Regression Analysis by Draper & Smith (1981, Wiley, New York). Fit a quadratic model that represents the loss of weight as a function of time. Look at the residuals from this model and draw conclusions about the validity of the model. Remove the quadratic term from the model and plot the residuals against fitted values to see the effect of omitting this term. Fit a smoothing spline to the data: try different orders to see which smooths the data satisfactorily.

5(8) The first column of file Weed.GSH contains counts of the numbers of weeds found growing in 10 plots. The second column records the amount of herbicide applied to each plot earlier in the year. Fit a log-linear model to see how the number of weeds relates to herbicide.
6  Analysis of variance
GenStat has very comprehensive facilities for analysis of variance. Almost all of these can be accessed using custom menus. In this chapter, we start with the simplest design, a one-way completely randomized experiment, before introducing factorial experiments, which have more than one treatment or fixed effect. We use an experiment with a randomized block design to show how to deal with blocks, which involve more than one stratum or component of variance in the analysis, and extend this idea by analysing a split-plot design. Many other types of design can also be analysed by GenStat; details are available in Chapter 4 of Part 2 of the Guide to GenStat, but there is not space to cover them in this book. We also introduce some of GenStat's extensive facilities for creating designed experiments, available from the Design option of the Stats menu.
6.1 One-way analysis of variance
We shall start with a simple one-way analysis of variance. This experiment was set up to study the effect of a dietary supplement on the gain in weight of rats. There were five different treatments (representing different amounts of the supplement) and 20 rats were allocated at random, four to each treatment.

Diet   Weight
a      81.5  80.7  80.3  79.8
b      81.6  81.9  80.4  80.4
c      83.5  81.6  82.2  81.3
d      82.4  83.1  82.8  81.8
e      83.2  82.8  82.1  82.1

The data are available in the file Rat.gsh and can be loaded as described in Section 1.2, by clicking Data on the menu bar, selecting Load and then Data File. Alternatively, click on File and select Open, which will load the data and display the spreadsheet as well, as in Figure 6.1. There are two columns of data: the name diet is in italics, showing that this column is a factor, and weight is a variate.

Figure 6.1

To produce an analysis of variance you first need to click Stats on the menu bar and select Analysis of Variance. The type of design is selected using the Design list box. The possibilities range from simple One-way ANOVA to General Analysis of Variance, each with its appropriate boxes and buttons.

Figure 6.2
When you select One-way ANOVA (no Blocking) the menu takes the form shown in Figure 6.2, with a box for the y-variate and another for the treatment factor. As with the boxes in other GenStat menus, the relevant settings can be selected from the Available Data window. In Figure 6.2, we have set Y-variate to weight and Treatments to diet. Clicking OK then produces the analysis shown below.

***** Analysis of variance *****

Variate: weight

Source of variation     d.f.       s.s.       m.s.    v.r.  F pr.
diet                       4    12.7930     3.1982    6.32  0.003
Residual                  15     7.5925     0.5062
Total                     19    20.3855
***** Tables of means *****

Variate: weight

Grand mean  81.76

 diet        a       b       c       d       e
         80.57   81.07   82.10   82.53   82.55

*** Standard errors of differences of means ***

Table           diet
rep.               4
d.f.              15
s.e.d.         0.503
Here we have just the default output. This contains an analysis-of-variance table, in the standard format, then the grand (or overall) mean, and a table of means for
the different diets with an accompanying standard error to assess differences between pairs of diet means. Standard errors of means and least significant differences can be obtained by modifying the ANOVA Options menu shown in Figure 6.5. We also show, later in this chapter, how to obtain further output, including plots of means and residuals.
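The 0.503 printed as the s.e.d. follows directly from the residual mean square and the replication: s.e.d. = sqrt(2s²/r). A quick check of that arithmetic (a sketch in Python, not GenStat output):

```python
import math

# Values from the one-way ANOVA above
residual_ms = 0.5062   # residual mean square on 15 d.f.
rep = 4                # four rats per diet

# Standard error of a difference between two diet means: sqrt(2 * s^2 / r)
sed = math.sqrt(2 * residual_ms / rep)
print(round(sed, 3))   # 0.503, matching the GenStat output
```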
6.2 Two-way analysis of variance
We now consider a more complicated example. This is a field experiment performed to examine the effects of sulphur and nitrogen fertilizers on the yield of canola. So there are two treatment factors, which we shall call S and N. The experiment used a randomized-block design, so there is also a factor, here called block, to indicate the block to which each of the experimental plots belonged. To analyse the experiment we first load the data from the file Canola.gsh, producing the spreadsheet in Figure 6.3. Initially, we shall ignore the (randomized-block) structure of the design, and use the data merely to illustrate how to perform a two-way analysis of variance.

Figure 6.3

Selecting Two-way ANOVA (no Blocking) from the Design list box generates the menu shown in Figure 6.4. The y-variate yield is entered into the Y-variate box as before and there are now two boxes, Treatment 1 and Treatment 2, into which the two treatment factors (N and S) are entered. The Interactions box allows you to decide whether you want to fit the interaction between the factors (All interactions), or just their main effects (No interactions). Here we have left the default setting so that the interaction between nitrogen and sulphur will be included.

Figure 6.4
Figure 6.5 shows the ANOVA Options menu (selected by clicking the Options button) which allows you to control the output produced initially from the analysis. The menu consists of a collection of boxes that can be checked to select the output components that are required. By default the following are selected: AOV Table (analysis-of-variance table), Information (details of any large residuals, non-orthogonality or aliasing in the model), Means (tables of means), and F-probability (probabilities for variance ratios in the analysis-of-variance table).

Figure 6.5

If we click to clear the Information and Means boxes, and click OK in the ANOVA Options and then the Analysis of Variance menus, only the analysis-of-variance table will be produced, as shown below.

***** Analysis of variance *****

Variate: yield

Source of variation     d.f.       s.s.       m.s.    v.r.  F pr.
N                          2    4.59223    2.29611   42.56  <.001
S                          3    0.97720    0.32573    6.04  0.003
N.S                        6    0.64851    0.10808    2.00  0.105
Residual                  24    1.29476    0.05395
Total                     35    7.51269
The table now has lines for three treatment terms: N represents the main effect of nitrogen, that is the overall way in which yield responds to nitrogen. Similarly S represents the main effect of sulphur, while N.S represents the interaction between nitrogen and sulphur. The interaction assesses the way in which the effect of nitrogen on yield differs according to the amount of sulphur or, equivalently, the way in which the sulphur effect differs according to the amount of nitrogen. If there is no interaction, we could decide on the best amount of nitrogen to apply without needing to consider how much sulphur will be used (and how much sulphur to use without needing to think about the amount of nitrogen).
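To see what an interaction means numerically, compare the response to nitrogen at different sulphur levels, using the N by S cell means printed later in this section (a Python sketch; the numbers are copied from the GenStat output):

```python
# N x S cell means from the GenStat output
# (keys are (N level, S level) pairs)
means = {
    (0,   0): 0.560, (0,   10): 0.770, (0,   20): 0.524, (0,   40): 0.552,
    (180, 0): 0.894, (180, 10): 1.289, (180, 20): 1.525, (180, 40): 1.545,
    (230, 0): 1.032, (230, 10): 1.404, (230, 20): 1.454, (230, 40): 1.700,
}

# Response to nitrogen (230 vs 0) at the lowest and highest sulphur levels
resp_at_s0  = means[(230, 0)]  - means[(0, 0)]
resp_at_s40 = means[(230, 40)] - means[(0, 40)]
print(round(resp_at_s0, 3), round(resp_at_s40, 3))   # 0.472 and 1.148

# The nitrogen response more than doubles as sulphur increases:
# this is the kind of pattern the N.S line is assessing.
```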
The Further Output button in the Analysis of Variance menu allows additional analysis of variance output to be obtained. Many of the components, shown in Figure 6.6, are the same as those in the Options menu. So we can obtain tables of means by checking the Means box, and then clicking the OK button. (Notice that GenStat can allow the use of multiple comparisons between means. The appearance of the Multiple Comparisons button is controlled by a check box in the Menus page of the Options menu; see Section 1.5.)

Figure 6.6

***** Tables of means *****

Variate: yield

Grand mean  1.104

    N     0.00   180.00   230.00
         0.601    1.313    1.398

    S     0.00    10.00    20.00    40.00
         0.829    1.155    1.167    1.266

          S     0.00    10.00    20.00    40.00
      N
   0.00        0.560    0.770    0.524    0.552
 180.00        0.894    1.289    1.525    1.545
 230.00        1.032    1.404    1.454    1.700

*** Standard errors of differences of means ***

Table              N        S      N.S
rep.              12        9        3
d.f.              24       24       24
s.e.d.        0.0948   0.1095   0.1896
Notice that GenStat has produced a table of means for every term in the analysis of variance, each with an appropriate standard error for assessing differences between pairs of means. The measures of variability to accompany the means
(standard errors of difference, least significant differences or standard errors of means) are controlled by the Standard Errors check boxes. You can click more than one of these, for example to have least significant differences as well as standard errors of differences. If you do request least significant differences, a further box appears allowing you to change the significance level used in their calculation from the default of 5%.
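Each s.e.d. above is again sqrt(2s²/r), with r the replication of the table in question. A sketch of the arithmetic in Python, using the values from the output above:

```python
import math

residual_ms = 0.05395                 # residual m.s. on 24 d.f.
reps = {"N": 12, "S": 9, "N.S": 3}    # replication of each table of means

seds = {term: math.sqrt(2 * residual_ms / r) for term, r in reps.items()}
for term, sed in seds.items():
    print(term, round(sed, 4))   # N 0.0948, S 0.1095, N.S 0.1896
```

A 5% least significant difference would then be the appropriate t critical value on 24 d.f. multiplied by the s.e.d.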
6.3 Randomized-block designs
The randomized-block design is perhaps the simplest type of designed experiment. In these designs, the experimental units are grouped together into sets known as blocks with the aim that units in the same block will be more similar than units in different blocks. Each block contains the same number of replicates of each treatment combination (usually one of each), and the allocation of the treatments is randomized independently within each block. In the analysis, the aim is to estimate and remove the between-block differences so that the treatment effects can be estimated more precisely. In our example, there is a factor called block to indicate the "block" of land to which each plot belonged. In other examples the blocking factor might represent different litters of animals, or different days on which the experiment was conducted, and so on. GenStat has three possible menus for randomized-block designs, depending on what treatments have been examined in the experiment. Here we have two treatment factors, and so it is simplest to select the Two-way ANOVA (in Randomized Blocks) line from the Design list box. The alternatives are One-way ANOVA (in Randomized Blocks) and General Treatment Structure (in Randomized Blocks).

Figure 6.7
The menu (Figure 6.7) is similar to the Two-way ANOVA (no Blocking) menu but with an extra box Blocks into which the block factor needs to be entered. Clicking the OK button generates the analysis-of-variance table again.

***** Analysis of variance *****

Variate: yield

Source of variation     d.f.       s.s.       m.s.    v.r.  F pr.

block stratum              2    0.30850    0.15425    3.44

block.*Units* stratum
N                          2    4.59223    2.29611   51.22  <.001
S                          3    0.97720    0.32573    7.27  0.001
N.S                        6    0.64851    0.10808    2.41  0.061
Residual                  22    0.98625    0.04483

Total                     35    7.51269
The differences between the blocks are placed into the line entitled "block stratum", while the "block.*Units* stratum" contains the variation of the plots within blocks. The variance ratio for the block stratum compares the variability of the blocks of land with the variability of the individual plots within each block, and its value of 3.44 shows that it was worthwhile using the design in this experiment. This can be confirmed also by the fact that the mean square for the Residual has decreased from 0.054 to 0.045. (The Residual line now represents the random variability of the experimental plots after removing block differences as well as the effects of the treatments.) So the standard errors of differences of means (printed using the Further Output menu as before) will also be smaller.

***** Tables of means *****

Variate: yield

Grand mean  1.104

    N     0.00   180.00   230.00
         0.601    1.313    1.398

    S     0.00    10.00    20.00    40.00
         0.829    1.155    1.167    1.266

          S     0.00    10.00    20.00    40.00
      N
   0.00        0.560    0.770    0.524    0.552
 180.00        0.894    1.289    1.525    1.545
 230.00        1.032    1.404    1.454    1.700

*** Standard errors of differences of means ***

Table              N        S      N.S
rep.              12        9        3
d.f.              22       22       22
s.e.d.        0.0864   0.0998   0.1729
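Fitting blocks does not change the treatment sums of squares; it simply splits the earlier residual (1.29476 on 24 d.f.) into a block stratum line and a smaller residual. The bookkeeping can be checked directly (a Python sketch using values from the two analyses):

```python
import math

old_residual_ss = 1.29476   # residual s.s. ignoring blocks (24 d.f.)
block_ss        = 0.30850   # block stratum s.s. (2 d.f.)
new_residual_ss = 0.98625   # residual s.s. after fitting blocks (22 d.f.)

# The old residual splits, up to rounding, into blocks plus new residual
print(round(block_ss + new_residual_ss, 5))   # 1.29475

# The smaller residual mean square gives smaller s.e.d.s, e.g. for N
new_residual_ms = new_residual_ss / 22
sed_N = math.sqrt(2 * new_residual_ms / 12)
print(round(sed_N, 4))   # 0.0864
```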
Now that we have performed an analysis we can click the Save button to obtain the ANOVA Save Options menu, allowing us to save variates of residuals and fitted values, and tables of means. After checking the appropriate box, a window (entitled In:) will appear into which you enter the identifier of the structure in which the information is to be saved. Figure 6.8 saves the residuals in a variate called yieldres and an N by S table of means in a table called NSmeans. You can save means for any of the treatment terms in the analysis; the name of the term is selected from the list in the Treatment Term box. By checking the Display in Spreadsheet box, we can arrange for the table of means to be loaded automatically into a table spreadsheet, from which it can conveniently be transferred into other documents, as explained in Section 3.5. To save other information from ANOVA, such as sums of squares, degrees of freedom and so on, you need to enter command mode and use the directive AKEEP (see Section 6.9).

Figure 6.8

We can also obtain plots of residuals and of means. Clicking the Residual Plots button produces the ANOVA Residual Plots menu as shown in Figure 6.9. If you check the Histogram box, a histogram is plotted of the residuals. The Fitted values box produces a plot of residuals against fitted values. Normal produces a Normal plot and Half Normal a half-Normal plot of the residuals. The added variable plot, produced by checking the Added Variable box, can be used for example to assess the usefulness of a potential covariate. Also, if the data are from a field experiment you can display the residuals in field layout. Here we leave the default settings (shown in Figure 6.9) and generate the output shown in Figure 6.10.

Figure 6.9

Figure 6.10
As described already in Section 1.2, you can click the GenStat icon on the task bar to return to GenStat and the Analysis of Variance menu. If you again click the Further Output button and then click the Mean Plots button, the ANOVA Means Plots menu appears (Figure 6.11). This menu plots one-way or two-way tables of means from the analysis. The Method box contains option buttons to select the type of plot. Means represents each mean by a point, Lines plots the point at the means and draws lines between them, and Data draws just the lines together with the original data values. The Factor for X-axis is the factor against whose levels the means are to be plotted, while Groups specifies the other factor in a two-way table. Separate lines are drawn for the groups, and the points in each group are plotted using different pens. If neither X Factor nor Groups is specified, the first two-way table of means in the analysis is plotted, or the first one-way table if there were no two-way tables. Here we have set Method to Lines, the Factor for X-axis to S and Groups to N. The resulting graph is shown in Figure 6.12.

Figure 6.11

Figure 6.12
6.4 Fitting contrasts
Sometimes there may be comparisons between the levels of a treatment factor that you are particularly keen to assess. For example, you might have had an initial suspicion that there would be little difference between the 180 and 230 levels of nitrogen in the previous section, but similar (and larger) differences between 0 and 180, and between 0 and 230. You might then want to fit a single mean for the 180 and 230 levels of nitrogen, and assess the contrast between this value and the mean for level 0. You can define contrasts like these using the ANOVA Contrasts menu which is obtained by clicking on the Contrasts button in the main Analysis of Variance menu (see Figures 6.2, 6.4 or 6.7). The Contrast Factor and Contrast Type fields in the menu shown in Figure 6.13 indicate that we want to assess comparisons between the levels of the factor N, and the Number of Contrasts field indicates that we want to fit one contrast.

Figure 6.13

When we click on OK, a GenStat spreadsheet appears containing the contrast matrix Cont whose name was specified in the Contrast Matrix field; this name was selected automatically by the ANOVA Contrasts menu, but you can specify your own name if you prefer, or if you have already formed a suitable matrix. You use the spreadsheet to specify the coefficients that define the comparison. In Figure 6.14, the matrix defines the comparison:

(N180 + N230) / 2 - N0

Notice that you can also define names for the contrasts, using the Rows column.

Figure 6.14
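The comparison defined by this matrix can be checked by hand against the nitrogen means from Section 6.3 (0.601, 1.313 and 1.398 for N = 0, 180 and 230). A Python sketch of the calculation:

```python
# Contrast coefficients as entered in the Cont matrix:
# -1 for N=0, and 0.5 each for N=180 and N=230
coeffs  = [-1.0, 0.5, 0.5]
n_means = [0.601, 1.313, 1.398]   # nitrogen means from the earlier analysis

estimate = sum(c * m for c, m in zip(coeffs, n_means))
print(round(estimate, 4))   # 0.7545, agreeing with the 0.754 GenStat reports
```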
Back in the Analysis of Variance menu (Figure 6.15) you can see that the Treatment 1 field now contains a function of N, namely COMP(N;1;Cont). The syntax of these functions is described in Section 6.6. There is a box controlling the printing of contrasts in the Display section of the ANOVA Options menu (see Figure 6.5). If we check this together with the AOV Table box, and then click on OK in the Options and main Analysis of Variance menus, the output below appears.

Figure 6.15

***** Analysis of variance *****

Variate: yield

Source of variation          d.f.       s.s.       m.s.    v.r.  F pr.

block stratum                   2    0.30850    0.15425    3.44

block.*Units* stratum
N                               2    4.59223    2.29611   51.22  <.001
  0 versus 180 and 230          1    4.54954    4.54954  101.48  <.001
S                               3    0.97720    0.32573    7.27  0.001
N.S                             6    0.64851    0.10808    2.41  0.061
  0 versus 180 and 230.S        3    0.59907    0.19969    4.45  0.014
Residual                       22    0.98625    0.04483

Total                          35    7.51269

***** Tables of contrasts *****

Variate: yield

***** block.*Units* stratum *****

*** N contrasts ***

0 versus 180 and 230    0.754     s.e. 0.0749     ss.div. 8.00

*** N.S contrasts ***

0 versus 180 and 230.S             e.s.e. 0.150    ss.div. 2.00
    S     0.00    10.00    20.00    40.00
         -0.35    -0.18     0.21     0.32
Notice that, in the analysis-of-variance table, the line for the main effect N is now accompanied by a line entitled "0 versus 180 and 230" giving the degrees of freedom, sum of squares and so on for that comparison. In addition the N.S interaction is accompanied by a line "0 versus 180 and 230.S" which represents the interaction between the comparison and the factor S (that is, it measures how the size of the comparison varies according to the level of S). The section headed "Tables of contrasts" then shows the estimate of the contrast, 0.754, with standard error 0.0749. The "ss.div." value is analogous to the replication of a table of means or effects: it is the divisor used in calculating the estimated values of the contrasts. This is useful mainly where there is a range of e.s.e.'s for a table of contrasts: the contrasts with the smallest values of the ss.div. are those with the largest e.s.e., and vice versa. (The ss.div. of each estimated contrast is in fact the sum of squares of the values of the coefficients used to calculate it, weighted according to the replication.) The N.S contrasts table shows how the overall value of the contrast varies according to the level of S. So, at level 0 of S, the estimated contrast is 0.754 - 0.35.

When a factor like sulphur (or nitrogen) has quantitative levels, you might want to investigate whether the yield increases linearly with the amount of sulphur (or nitrogen); you could also include a quadratic term to check for curvature in the response. You can fit polynomial contrasts like these by selecting Polynomial within the Contrast Type box in the ANOVA Contrasts menu. If we set the Contrast Factor to S and the Number of Contrasts to 2, the Treatment 2 box of the Analysis of Variance menu will contain the function POL(S;2). If we change the setting of the Treatment 1 box back to N, and then click on OK, we obtain the output below.

***** Analysis of variance *****

Variate: yield

Source of variation     d.f.       s.s.       m.s.    v.r.  F pr.

block stratum              2    0.30850    0.15425    3.44

block.*Units* stratum
N                          2    4.59223    2.29611   51.22  <.001
S                          3    0.97720    0.32573    7.27  0.001
  Lin                      1    0.69741    0.69741   15.56  <.001
  Quad                     1    0.19577    0.19577    4.37  0.048
  Deviations               1    0.08403    0.08403    1.87  0.185
N.S                        6    0.64851    0.10808    2.41  0.061
  N.Lin                    2    0.52294    0.26147    5.83  0.009
  N.Quad                   2    0.07788    0.03894    0.87  0.433
  Deviations               2    0.04769    0.02385    0.53  0.595
Residual                  22    0.98625    0.04483

Total                     35    7.51269
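The relationship between an estimated contrast, its ss.div. and its s.e. can be illustrated numerically. The sketch below reproduces the figures for the nitrogen comparison (0.754, ss.div. 8.00, s.e. 0.0749); the normalization of the coefficients by their sum of squares is our reading of how the printed ss.div. arises, not a statement of GenStat's internal code:

```python
import math

coeffs  = [-1.0, 0.5, 0.5]       # comparison as entered: (N180 + N230)/2 - N0
n_means = [0.601, 1.313, 1.398]  # nitrogen means
rep     = 12                     # each nitrogen mean is based on 12 plots
residual_ms = 0.04483            # residual m.s. on 22 d.f.

estimate = sum(c * m for c, m in zip(coeffs, n_means))

# Normalize the coefficients by their sum of squares; the ss.div. is then
# the replication-weighted sum of squares of the normalized coefficients,
# and the s.e. is sqrt(residual m.s. / ss.div.)
sum_c2 = sum(c * c for c in coeffs)               # 1.5
ss_div = sum(rep * (c / sum_c2) ** 2 for c in coeffs)
se = math.sqrt(residual_ms / ss_div)

print(round(ss_div, 2), round(se, 4))   # 8.0 and 0.0749
```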
***** Tables of contrasts *****

Variate: yield

***** block.*Units* stratum *****

*** S contrasts ***

Lin       0.0094     s.e. 0.00239     ss.div. 7875.

Quad    -0.00042     s.e. 0.000199    ss.div. 1131429.

Deviations                e.s.e. 0.0706     ss.div. 9.00
    S     0.00    10.00    20.00    40.00
        -0.028    0.074   -0.055    0.009

*** N.S contrasts ***

N.Lin                     e.s.e. 0.00413    ss.div. 2625.
    N     0.00   180.00   230.00
       -0.0115   0.0058   0.0058

N.Quad                    e.s.e. 0.000345   ss.div. 377143.
    N     0.00   180.00   230.00
       0.00028 -0.00035  0.00007

Deviations                e.s.e. 0.122      ss.div. 3.00
          S     0.00    10.00    20.00    40.00
      N
   0.00        -0.02     0.06    -0.05     0.01
 180.00         0.03    -0.07     0.05    -0.01
 230.00         0.00     0.01    -0.01     0.00
In the analysis of variance, the sum of squares for sulphur is partitioned into the amount that can be explained by a linear relationship of the yields with sulphur (the line marked Lin), the extra amount that can be explained if the relationship is quadratic (the line Quad), and the amount represented by deviations from a quadratic polynomial. A cubic term would be labelled as Cub, and a quartic as Quart. You are not allowed to fit more than fourth-order polynomials. The interaction of nitrogen and sulphur is also partitioned: N.Lin lets you assess the effect of fitting three different linear relationships, one for each level of nitrogen; N.Quad assesses the effect of fitting a different quadratic contrast for each level of N; and the deviations line represents deviations from these quadratic polynomials. So, the analysis shows strong evidence for linear and quadratic effects of sulphur, and for interactions between these contrasts and nitrogen (as we would have expected from the plot in Figure 6.12). The tables of contrasts again provide estimates of the parameters of the contrasts. For example, the overall linear effect is 0.0094, and the effect for level 0 of nitrogen is 0.0094 - 0.0115.
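The Lin figures can be reproduced from the sulphur means. With levels x = 0, 10, 20, 40 (each mean based on 9 plots), the linear contrast coefficient is the replication-weighted regression slope of the means on x, and its divisor is the ss.div. of 7875 printed above. A Python sketch (the means are rounded to three decimals, so the agreement is approximate):

```python
levels  = [0.0, 10.0, 20.0, 40.0]
s_means = [0.829, 1.155, 1.167, 1.266]   # sulphur means from the analysis
rep     = 9                              # each sulphur mean is based on 9 plots

xbar = sum(levels) / len(levels)                       # 17.5
ss_div = sum(rep * (x - xbar) ** 2 for x in levels)    # 7875.0
lin = sum(rep * (x - xbar) * y
          for x, y in zip(levels, s_means)) / ss_div

print(ss_div, round(lin, 4))   # 7875.0 and 0.0094
```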
You can fit more than one set of contrasts at a time. If we had retained the nitrogen comparison, we would have obtained the output below.

***** Analysis of variance *****

Variate: yield

Source of variation             d.f.       s.s.       m.s.    v.r.  F pr.

block stratum                      2    0.30850    0.15425    3.44

block.*Units* stratum
N                                  2    4.59223    2.29611   51.22  <.001
  0 versus 180 and 230             1    4.54954    4.54954  101.48  <.001
S                                  3    0.97720    0.32573    7.27  0.001
  Lin                              1    0.69741    0.69741   15.56  <.001
  Quad                             1    0.19577    0.19577    4.37  0.048
  Deviations                       1    0.08403    0.08403    1.87  0.185
N.S                                6    0.64851    0.10808    2.41  0.061
  0 versus 180 and 230.Lin         1    0.52294    0.52294   11.67  0.002
  0 versus 180 and 230.Quad        1    0.04448    0.04448    0.99  0.330
Residual                          22    0.98625    0.04483

Total                             35    7.51269

***** Tables of contrasts *****

Variate: yield

***** block.*Units* stratum *****

*** N contrasts ***

0 versus 180 and 230    0.754     s.e. 0.0749     ss.div. 8.00

*** S contrasts ***

Lin       0.0094     s.e. 0.00239     ss.div. 7875.

Quad    -0.00042     s.e. 0.000199    ss.div. 1131429.

Deviations                e.s.e. 0.0706     ss.div. 9.00
    S     0.00    10.00    20.00    40.00
        -0.028    0.074   -0.055    0.009

*** N.S contrasts ***

0 versus 180 and 230.Lin      0.0173     s.e. 0.00506     ss.div. 1750.

0 versus 180 and 230.Quad   -0.00042     s.e. 0.000422    ss.div. 251429.
The interaction between nitrogen and sulphur is now partitioned according to the nitrogen comparison. The line "0 versus 180 and 230.Lin" assesses the
effect of fitting two different linear relationships, one for level 0 of nitrogen, and one for levels 180 and 230 of nitrogen, instead of a single overall linear contrast. Similarly, the line "0 versus 180 and 230.Quad" represents the difference between the two quadratic contrasts. So you can define contrasts on any treatment factor, and GenStat will automatically estimate their interactions. To fit polynomial contrasts, GenStat calculates orthogonal polynomials and does a multiple regression of the effects of the factor using the polynomials as x-variates. Regression contrasts are similar to polynomial contrasts, except that here you can supply your own matrix of x-variates. GenStat orthogonalizes the x-variates for you, so that each one represents the effect of adding this x-variable to a model containing all the earlier ones.
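The successive-orthogonalization idea, in which each x-variate is adjusted for all earlier ones before its contribution is assessed, can be sketched with a Gram-Schmidt step in plain Python (an illustration of the principle, not GenStat's algorithm):

```python
def orthogonalize(xs):
    """Successively orthogonalize a list of x-variates (lists of numbers):
    each variate has its projection onto every earlier one removed."""
    basis = []
    for x in xs:
        r = list(x)
        for b in basis:
            proj = sum(ri * bi for ri, bi in zip(r, b)) / sum(bi * bi for bi in b)
            r = [ri - proj * bi for ri, bi in zip(r, b)]
        basis.append(r)
    return basis

# Constant, linear and quadratic terms at the sulphur levels 0, 10, 20, 40
x0 = [1.0, 1.0, 1.0, 1.0]
x1 = [0.0, 10.0, 20.0, 40.0]
x2 = [x * x for x in x1]
ortho = orthogonalize([x0, x1, x2])

# After orthogonalization, distinct variates have (numerically) zero dot product
dot = sum(a * b for a, b in zip(ortho[1], ortho[2]))
print(abs(dot) < 1e-6)   # True
```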
6.5 Designing an experiment

The Generate a Standard Design menu enables you to generate many standard experimental designs. It is obtained by clicking Stats on the menu bar and selecting Design, followed by Standard Design. The type of design is selected using the Design list box. The categories parallel those in the Analysis of Variance menu, again each with its appropriate boxes and buttons. The menu in Figure 6.16 generates a randomized-block design with four blocks (corresponding to four different laboratories) to study two treatment factors: Drug with three levels, and Dose with two levels. In the menu, we have checked the Unit Labels box to form a variate, Subjcode, of numerical labels to identify the units of the design. It is often more convenient to use a single numerical code to identify observations from an experiment, rather than having to use the levels of all the blocking factors (here subjects within laboratories).

Figure 6.16
Checking the Randomize Design box asks GenStat to randomize the design. GenStat automatically determines the appropriate type of randomization from the inter-relationships of the blocking factors of the design. For a randomized-block design, this amounts to randomizing the allocation of the treatments independently within each block; see Section 6.3. (However, if you want to do your own randomization, you can use the Randomize menu, obtained by clicking Stats on the menu bar and selecting Design, followed by Randomize.) The Randomization Seed box supplies a seed used to generate the random numbers for the randomization. GenStat suggests a seed automatically (at random), in the same way that it suggests defaults for the other fields in the menu. However, you can supply your own seed if you prefer, and keeping the same seed will generate the same randomization if you want to reproduce the exact design in future. The menu prints the design and, because the Dummy ANOVA table box is checked, it also generates a skeleton analysis-of-variance as shown below.

*** Treatment combinations on each unit of the design ***
Laboratory        1       2       3       4
Subject
   1            1 1     3 1     3 2     1 1
   2            3 2     2 1     1 1     1 2
   3            2 1     3 2     1 2     3 1
   4            1 2     1 1     2 2     3 2
   5            2 2     1 2     3 1     2 1
   6            3 1     2 2     2 1     2 2

Treatment factors are listed in the order: Drug Dose
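The randomization GenStat applies here, an independent shuffle of the six treatment combinations within each laboratory, can be sketched in a few lines of Python (a sketch of the principle only; GenStat's own random-number generator and seed handling will differ):

```python
import itertools
import random

drugs, doses = [1, 2, 3], [1, 2]
combos = list(itertools.product(drugs, doses))   # the 6 Drug x Dose combinations

rng = random.Random(12345)   # a fixed seed reproduces the same design
design = {}
for lab in range(1, 5):
    order = combos[:]        # every lab gets one replicate of each combination
    rng.shuffle(order)       # ...allocated to its subjects in random order
    design[lab] = order

# Each laboratory (block) still contains all six treatment combinations
print(all(sorted(design[lab]) == sorted(combos) for lab in design))   # True
```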
***** Analysis of variance *****

Source of variation             d.f.

Laboratory stratum                 3

Laboratory.Subject stratum
Drug                               2
Dose                               1
Drug.Dose                          2
Residual                          15

Total                             23
The menu has a check box that allows you to load the design factors into a new spreadsheet (see Figure 6.17). The spreadsheet facilities can be used to redefine the factor levels or to specify labels. To do this, you click Spread on the menu bar, followed by Factor and then either Edit Levels or Edit Labels as required.

Figure 6.17

In this example we have assumed that the number of laboratories in the trial had already been decided (to be four). Clicking on the Replications required button produces a menu to allow you to determine the replication (Figure 6.18). For a randomized-block design, the replication depends on the number of blocks (here laboratories). To make the calculation, GenStat needs to know the size of the smallest difference that you need to detect (here one), and the anticipated within-block variance (here 0.5). The variance is best obtained from an earlier analysis of similar data, and is provided by the residual mean square in the "block.plot" (in this case, Laboratory.Subject) stratum. Other boxes allow you to set the significance level that you plan to use to detect the difference (i.e. alpha) and the probability of detection (i.e. the power required for the test).

Figure 6.18
Clicking OK in Figure 6.18 pops up the menu shown in Figure 6.19, which indicates the required number of replicates (here 7). You can then either click Apply to enter that number automatically into the design menu (Figure 6.16), click Cancel to close the menu with no actions, or click Change to return to the Replications Required menu (Figure 6.18). The Replication and SEDs boxes were checked in Figure 6.18, producing the output below.

Figure 6.19

***** Standard errors of differences *****

No.reps   Resid d.f.   Resid m.s.    s.e.d.   RESPONSE/sed   t-value   Pr detected
      2            5       0.5000    0.7071          1.414     2.015         0.287
      3           10       0.5000    0.5774          1.732     1.812         0.469
      4           15       0.5000    0.5000          2.000     1.753         0.596
      5           20       0.5000    0.4472          2.236     1.725         0.693
      6           25       0.5000    0.4082          2.449     1.708         0.767
      7           30       0.5000    0.3780          2.646     1.697         0.825
      8           35       0.5000    0.3536          2.828     1.690         0.869
      9           40       0.5000    0.3333          3.000     1.684         0.902
     10           45       0.5000    0.3162          3.162     1.679         0.927
     11           50       0.5000    0.3015          3.317     1.676         0.946
     12           55       0.5000    0.2887          3.464     1.673         0.961
     13           60       0.5000    0.2774          3.606     1.671         0.971
     14           65       0.5000    0.2673          3.742     1.669         0.979
     15           70       0.5000    0.2582          3.873     1.667         0.985
     16           75       0.5000    0.2500          4.000     1.665         0.989
     17           80       0.5000    0.2425          4.123     1.664         0.992
     18           85       0.5000    0.2357          4.243     1.663         0.994
     19           90       0.5000    0.2294          4.359     1.662         0.996
     20           95       0.5000    0.2236          4.472     1.661         0.997

***** Replication *****

To detect a treatment difference of        1.000
at a significance level of                 0.0500
with a detection probability of            0.8000
requires a replication of                  7
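The table can be reproduced by noting that with n blocks the s.e.d. is sqrt(2 x variance / n), the residual has 5(n - 1) d.f. (blocks minus one times treatments minus one), and the detection probability is approximately the probability that a central t on those d.f. exceeds the critical value minus difference/s.e.d. A Python sketch of the replication search (scipy supplies the t distribution; the shifted-central-t power formula is an approximation that matches the printed figures closely):

```python
import math
from scipy import stats

variance, delta = 0.5, 1.0   # within-block variance and difference to detect
alpha, power_wanted = 0.05, 0.8

def power(n):
    df = 5 * (n - 1)                      # residual d.f. for n blocks, 6 treatments
    sed = math.sqrt(2 * variance / n)     # s.e.d. with n replicates
    t_crit = stats.t.ppf(1 - alpha, df)   # one-sided 5% point
    # Approximate detection probability via a shifted central t
    return stats.t.cdf(delta / sed - t_crit, df)

n = 2
while power(n) < power_wanted:
    n += 1
print(n)   # 7, agreeing with the replication GenStat reports
```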
The Replications required button is available for any design where the replication can be modified simply by altering the levels of one of the factors (for example split-plot designs, split-split-plot designs, criss-cross designs and so on), but not e.g. for Latin squares where the replication cannot be changed without changing the number of levels of the treatment factor.
The Standard Design Options menu (obtained by clicking on the Options button in the Generate a Standard Design menu itself) allows you to add extra replicates to the first level of any of the treatment factors. This could be useful if the first level is a control treatment against which the other levels are to be compared. If you check the box marked Extra, the two other boxes in the top line of the menu become accessible, allowing you to select the factor of interest (in the right-hand box), and specify the number of extra replications. In Figure 6.20 we have asked for one extra replicate for the first drug (making two replicates altogether).

Figure 6.20

The Added control to factorial treatments in box is relevant if you want to add a control treatment that is relevant to more than one treatment factor. Suppose we want to include a placebo drug in the example above. We shall now have seven treatment combinations: the six existing treatments (three drugs at two doses), and the additional placebo treatment (no drug at any dose). To set up the design, we need to revise the main menu as in Figure 6.21, to show One-way Design (in Randomized Blocks) in the Design box, and to give a name (here Treat) for the factor representing the full set of treatment combinations. You do not need to set the number of levels for Treat, as this will be determined automatically by the options menu.

Figure 6.21
Then, in the Standard Design Options menu (Figure 6.22), we need to check the Added control to factorial treatments box, select the factor to be subdivided into the added control plus factorial structure (here Treat), and specify names for the factors to represent the substructure within Treat. The factor Control represents the comparison between the placebo and any sort of drug or dose; Drug represents the three drugs as before, and Dose the doses.

Figure 6.22

Figure 6.23 shows the spreadsheet containing the design factors, and the skeleton analysis-of-variance table is shown below. The Control line in the analysis of variance represents the overall effect of any drug at any (non-zero) dose, Control.Drug represents overall differences between the drugs (averaged over the two doses), Control.Dose represents the comparison between the two doses (averaged over the different drugs), and Control.Drug.Dose represents the interaction between Drug and Dose (assuming that some sort of drug has been taken).

Figure 6.23

***** Analysis of variance *****

Source of variation             d.f.

Laboratories stratum               3

Laboratories.Subjects stratum
Control                            1
Control.Drug                       2
Control.Dose                       1
Control.Drug.Dose                  2
Residual                          18

Total                             27
The "factorial plus added control" treatment structure is not one of the constructs covered directly by the Analysis of Variance menu, although the necessary model formula can be entered explicitly into the Treatment Structure box that appears when General Analysis of Variance or any of the General Treatment Structure settings are selected in the Design box (see Section 6.6). However, the spreadsheet also contains commands to analyse the design, which can be used as an alternative to the Analysis of Variance menus, when the data values have been collected and entered as extra columns in the spreadsheet. The menu is obtained by clicking Spread on the menu bar and selecting Sheet, followed by Analysis. GenStat provides several more-specialized types of design. These are obtained by selecting Design from the Stats menu and then clicking on Select Design.
6.6 Syntax of model formulae
The structure of the design and the treatment terms to be fitted in a GenStat analysis of variance are specified by model formulae. In the simpler menus, like those we have used earlier in this chapter, the formulae are constructed automatically behind the scenes. However, for the more advanced menus and analyses you will need to specify your own formulae. Several of the menus allow you to specify any number of treatment factors, interactions and so on. So, for example, the General Analysis of Variance, the General Treatment Structure (no Blocking) and the General Treatment Structure (in Randomized Blocks) menus all have a box entitled Treatment Structure into which a formula (known as the treatment formula) needs to be entered. The General Analysis of Variance menu also allows you to define any underlying structure for the design (for example completely randomized, randomized-block, split-plot, split-split-plot, and so on). This is specified by a model formula (the block formula) which is entered into the Block Structure box; this can be left blank with unstructured (completely randomized) designs. This formula defines the strata and thus the error terms for the analysis. In its simplest form, a model formula is a list of model terms, linked by the operator "+". For example,

A + B
is a formula containing two terms, A and B, representing the main effects of factors A and B respectively. Higher-order terms (like interactions) are specified as series
of factors separated by dots, but their precise meaning depends on which other terms the formula contains, as we explain below. The other operators provide ways of specifying a formula more succinctly, and of representing its structure more clearly. The crossing operator * is used to specify factorial structures. The formula

N * S

was used by GenStat to specify the two-way analysis of variance introduced in Section 6.2. This is expanded to become the formula

N + S + N.S
which has three terms: N for the nitrogen main effect, S for the main effect of sulphur, and N.S for the nitrogen by sulphur interaction. Higher-order terms like N.S represent all the joint effects of the factors N and S that have not been removed by earlier terms in the formula. Thus here it represents the interaction between nitrogen and sulphur, as both main effects have been removed. The other most-commonly used operator is the nesting operator (/). This occurs most often in block formulae. For example, the formula

block / plot

is expanded to become the formula

block + block.plot
As the formula contains no "main effect" for plot, the term block.plot would represent plot-within-block effects (that is, the differences between individual plots after removing any overall similarity between plots that belong to the same block). This is similar to the block model for the randomized design in Section 6.3, except that we have the factor plot instead of *Units*. A formula can contain more than one of these operators. The three-factor factorial model

A * B * C

becomes

A + B + C + A.B + A.C + B.C + A.B.C

and the nested structure

block / wplot / subplot

which occurs as the block model of a split-plot design (Section 6.8) becomes

block + block.wplot + block.wplot.subplot
They can also be mixed in the same formula. For example, the factorial-plus-added-control study in Section 6.5 has treatment structure
Control / (Drug * Dose)
which expands to

Control + Control.Drug + Control.Dose + Control.Drug.Dose
In general, if l and m are two model formulae:

l * m  =  l + m + l.m
l / m  =  l + fac(l).m

(where l.m is the sum of all pairwise dot products of a term in l and a term in m, and fac(l) is the dot product of all factors in l). For example:

(A + B) * (C + D)  =  (A + B) + (C + D) + (A + B).(C + D)
                   =  A + B + C + D + A.C + A.D + B.C + B.D

(A + B) / C        =  A + B + fac(A + B).C
                   =  A + B + A.B.C
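Because the operators are simply shorthand for lists of terms, an expanded and an unexpanded formula request exactly the same analysis. The following two TREATMENTS statements (a sketch, using hypothetical factors A, B and C) are equivalent:

```
TREATMENTS A*B*C
"is equivalent to the fully expanded form"
TREATMENTS A + B + C + A.B + A.C + B.C + A.B.C
```

Typing the expanded form can be useful when you want to drop individual terms, for example omitting A.B.C to fit two-factor interactions only.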
Terms in the treatment formula can be partitioned into contrasts by specifying a function of the factor:

COMPARISON(factor; scalar; matrix) partitions the factor into the comparisons specified by the matrix. There is a row of the matrix for each comparison, and the scalar specifies how many of them are to be fitted.

POL(factor; scalar; variate) partitions the factor into polynomial contrasts (linear, quadratic and so on). The scalar gives the maximum order of contrast (1 for linear only, 2 for linear and quadratic, and so on) and the variate gives a numerical value for each level of the factor. If the variate is omitted, the levels defined when the factor was declared will be used.

REG(factor; scalar; matrix) partitions the factor into the (user-defined) regression contrasts specified by the coefficients in each row of the matrix. The scalar defines the number of contrasts to be fitted.
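For example, to partition the nitrogen main effect of the oats experiment of Section 6.8 into linear and quadratic polynomial contrasts, the treatment formula could be modified as in this sketch (a hedged illustration; the factor names are those used in Section 6.8, and the levels declared for nitrogen supply the numerical values):

```
TREATMENTS POL(nitrogen; 2) * variety
```

The analysis then reports the linear and quadratic components of nitrogen, and their interactions with variety, as extra lines within the usual analysis-of-variance table.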
6.7 Unbalanced designs
Most of the designs covered by the Analysis of Variance menus are balanced and, in fact, all of those discussed so far in this chapter have been orthogonal. Essentially this means that the order in which the treatment terms are fitted is unimportant (other than that each main effect must be fitted before any of its interactions). So we could have specified sulphur as the first treatment factor and nitrogen as the second treatment factor in the menus in Figures 6.4 and 6.7, and still have obtained the same sums of squares and effects. This contrasts with the situation in the multiple linear regression in Section 5.2, where the x-variates were correlated (i.e. non-orthogonal), and so different regression coefficients were obtained for each x-variate according to which other x-variates had been fitted. The GenStat spreadsheet file Product.gsh, displayed in Figure 6.24, contains the results of an experiment to study the effects of factors A, B and C on the yield Y of a production process. The intention was originally to run the experiment on two separate days, and to have two observations of each treatment combination on each day. However, due to time constraints, there were several combinations (chosen at random) in each of the days that could only be performed once. If the design had been constructed with equal replication, as planned, it could have been analysed using the General Treatment Structure (in Randomized Blocks) design setting. The block factor would be day, and the treatment structure would be a factorial with three factors: A*B*C, as shown in Figure 6.25. However, this generates a fault message (below) reporting that the design is unbalanced.
**** G5F0001 **** Fault (Code AN 1). Statement 1 on Line 111
Command: ANOVA [PRINT=aovtable,information; FACT=32; FPROB=yes] Y
Design unbalanced - cannot be analysed by ANOVA
Model term A.B (non-orthogonal to term day) is unbalanced, in the
day.*Units* stratum.
Instead we need to use the Unbalanced Treatment Structure design setting,
shown in Figure 6.26. This is not customized for any particular design, but merely has two boxes to define the model to be fitted. The Blocking (Nuisance terms) box contains the main effect of day: we are not interested in testing for day effects, we simply want to remove any day differences before assessing the treatments. The Treatment Structure box contains a factorial model with treatment factors A, B and C. The commands that are generated by this setting of the menu use the GenStat regression facilities (via procedure AUNBALANCED) rather than the analysis-of-variance facilities. So GenStat produces an accumulated analysis of variance, like those in Sections 5.2 and 5.4, indicating the order in which the terms were fitted. The term day is fitted first because this is a nuisance term, reflecting random variability which we want to eliminate before we assess the treatments. The +A line then gives the (main) effect of A after eliminating day. The +B line gives the main effect of B, eliminating day and A, and so on. Each line in the table presents the effect of a particular term, eliminating the terms in the lines above, but ignoring the terms in the lines below. This is technically true also in the examples presented earlier in this chapter but there, as already mentioned, the designs were orthogonal and so the ordering of the treatment terms was unimportant. Here, if we had specified C*A*B, the sums of squares for A, B and C would have been 1699.1, 429.4 and 1063.0 respectively, and there would also have been changes to the sums of squares for the interactions. The results would have led to the same conclusions as those from the earlier order (namely that there are main effects of A and C, and an A by C interaction), but in a design with a greater degree of non-orthogonality you would be well advised to investigate several orderings.
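The commands generated by this menu setting can also be typed directly, as described in Section 6.9. A hedged sketch for the Product.gsh example (assuming the factor and variate names used in the text):

```
"Unbalanced design: day as a nuisance blocking term."
BLOCKSTRUCTURE day
TREATMENTSTRUCTURE A*B*C
AUNBALANCED [PRINT=aovtable,means; FPROB=yes] Y
```

Note that it is AUNBALANCED, rather than ANOVA, that is called here, so the output is an accumulated analysis of variance in which the order of the treatment terms matters.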
Alternatively, the Options menu for the designs with Unbalanced Treatment Structure (Figure 6.27) contains a check box to allow you to request screening tests. In the marginal test (the column headed "mtest" below) the term is added to the simplest possible model. So A.B would be added to a model containing only the main effects A and B. This assesses the effect of the term ignoring as many other terms as possible, and so it checks to see if there is any evidence for the term having an effect. In the conditional test (the column headed "ctest" below) the term is added to the most complex possible model. So A would be added to a model containing B, C and B.C. This checks to see if the term has any effect that cannot be explained by any other terms. Ideally (as here) the tests will both lead to the same conclusion. If not, the conclusion is that there is more than one plausible model for the data, but the design is too unbalanced to allow you to choose between them.

******* Screening of terms in an unbalanced design *******

Variate: Y
*** Marginal and conditional test statistics and degrees of freedom ***

degrees of freedom for denominator (full model): 48

     term   mtest   mdf   ctest   cdf
        A    3.42     2    3.47     2
        B    0.76     2    0.84     2
        C    4.27     1    4.78     1
      A.B    1.04     4    1.00     4
      A.C    5.25     2    4.81     2
      B.C    0.71     2    0.57     2
    A.B.C    1.40     4    1.40     4
*** P-values of marginal and conditional tests ***

     term   mprob   cprob
        A   0.041   0.039
        B   0.474   0.439
        C   0.044   0.034
      A.B   0.395   0.415
      A.C   0.009   0.013
      B.C   0.498   0.569
    A.B.C   0.248   0.248
******* Analysis of an unbalanced design using GenStat regression *******

Variate: Y

***** Regression Analysis *****

*** Accumulated analysis of variance ***

Change       d.f.      s.s.      m.s.   v.r.  F pr.
+ day           1     914.0     914.0   3.67  0.061
+ A             2    1706.8     853.4   3.42  0.041
+ B             2     418.8     209.4   0.84  0.438
+ C             1    1065.9    1065.9   4.28  0.044
+ A.B           4    1166.0     291.5   1.17  0.336
+ A.C           2    2456.7    1228.3   4.93  0.011
+ B.C           2     284.4     142.2   0.57  0.569
+ A.B.C         4    1397.4     349.4   1.40  0.248
Residual       48   11960.4     249.2
Total          66   21370.4     323.8

*** Predictions from regression model ***

Response variate: Y

        Prediction
A  1         113.2
   2         101.2
   3         105.3

Minimum standard error of differences    4.679
Average standard error of differences    4.795
Maximum standard error of differences    4.909

*** Predictions from regression model ***

Response variate: Y

        Prediction
B  1         103.2
   2         108.1
   3         108.3

Minimum standard error of differences    4.724
Average standard error of differences    4.788
Maximum standard error of differences    4.896

*** Predictions from regression model ***

Response variate: Y

        Prediction
C  1         110.6
   2         102.4

Standard error of differences between predicted means    3.903

*** Predictions from regression model ***

Response variate: Y

           Prediction
        B       1       2       3
A  1        115.2   112.3   111.8
   2         97.9    99.9   106.4
   3         96.7   113.2   106.8

Minimum standard error of differences    7.894
Average standard error of differences    8.313
Maximum standard error of differences    9.393

*** Predictions from regression model ***

Response variate: Y

           Prediction
        C       1       2
A  1        125.9   100.9
   2        101.7   100.7
   3        104.6   105.9

Minimum standard error of differences    6.454
Average standard error of differences    6.778
Maximum standard error of differences    7.103

*** Predictions from regression model ***

Response variate: Y

           Prediction
        C       1       2
B  1        110.2    96.5
   2        111.9   104.5
   3        109.7   106.9

Minimum standard error of differences    6.454
Average standard error of differences    6.770
Maximum standard error of differences    7.215

*** Predictions from regression model ***

Response variate: Y

              Prediction
        C          1       2
A  1  B  1     136.1    95.1
         2     124.1   100.8
         3     116.2   107.6
   2  B  1     102.1    93.8
         2     101.8    98.1
         3     101.3   111.3
   3  B  1      92.3   101.1
         2     110.6   115.8
         3     112.6   101.2

Minimum standard error of differences    11.16
Average standard error of differences    11.74
Maximum standard error of differences    14.42
In an unbalanced design, there will usually be a different standard error for differences between each pair of means. Here we have simply printed a summary giving the minimum, average and maximum standard errors for differences between pairs of means. The Options menu (Figure 6.27) allows you to print a symmetric matrix giving the standard errors for differences between every possible pair of means, but this is omitted here to save space. In the earlier designs in this chapter, the treatment combinations were all equally replicated, and so the standard errors were the same for every pair of means.
6.8 Split-plot designs
We now show how to analyse split-plot designs with the analysis-of-variance menus. These designs were devised originally for agricultural experiments where some of the factors can be applied to smaller plots of land than others. Here there are two treatment factors: three different varieties of oats (labelled V1, V2 and V3 on the plan), and four levels of nitrogen (labelled N0 to N3). [The field plan shows the layout of the varieties and nitrogen levels on the plots within each block.] Because of limitations on the machines for sowing seed, different varieties cannot conveniently be applied to plots as small as those that can be used for the different rates of fertilizer. So the design was set up in two stages. First of all, the blocks were each divided into three plots of the size required for the varieties, and the three varieties were randomly allocated to the plots within each block (exactly as in the randomized blocks design). Then each of these plots, or whole-plots as they are usually known, was split into four sub-plots (one for each rate of nitrogen), and the allocation of nitrogen was randomized independently within each whole-plot. Split-plot designs occur not only in field experiments, but also in animal trials (where, for example, the same diet may need to be fed to all the animals in a pen but other treatments may be applied to individual animals), or in industrial experiments (where different processes may require different sized batches of
material), or even in cookery experiments. There can also be more than one treatment factor applied to any size of unit. Figure 6.28 shows the GenStat spreadsheet, Oats.gsh, which contains the data from the experiment. The blocks factor (column 1) indicates the block to which each of the individual experimental plots belongs, wplots (column 2) numbers the whole-plots within each block and subplots (column 3) numbers the sub-plots within each whole-plot. The fourth and fifth columns contain the treatment factors, variety and nitrogen, and the final column is the y-variate yield. The data can be analysed using the Split-Plot Design menu, as shown in Figure 6.29. The factor defining the blocks is entered into the Blocks box, and the factor defining the whole-plots within each block is entered into the Whole Plots box. There is no need to specify a factor for the sub-plots but, if one is available, it can be entered into the Sub-plots box. The treatment terms to be fitted are specified by entering a model formula (see Section 6.6) into the Treatment Structure box. The factors for the formula can be selected from the Available Data window, and the available operators are listed in the Operators window. Here we have specified

variety * nitrogen
to indicate that we want the main effects of variety and nitrogen, and their interaction. As explained in Section 6.6, GenStat expands the * operator so that the formula becomes

variety + nitrogen + variety.nitrogen
(and if you prefer you can enter this expanded form, where the terms to be fitted in the analysis are all specified explicitly, instead). The Interactions box can be used to control the level of interactions to be fitted: you can indicate either All Interactions, as here, or just main effects (No Interactions), or select Specify level of interaction to specify the required level of interaction (that is, set a limit on the maximum number of factors in the treatment terms that are fitted). As usual, the Y-Variate box is set to the GenStat variate containing the data values (here yield). If we now click the OK button, after resetting the options to display Information and Means, the output below is produced.

***** Analysis of variance *****

Variate: yield

Source of variation              d.f.      s.s.     m.s.   v.r.  F pr.

blocks stratum                      5   15875.3   3175.1   5.28

blocks.wplots stratum
variety                             2    1786.4    893.2   1.49  0.272
Residual                           10    6013.3    601.3   3.40

blocks.wplots.subplots stratum
nitrogen                            3   20020.5   6673.5  37.69  <.001
variety.nitrogen                    6     321.7     53.6   0.30  0.932
Residual                           45    7968.8    177.1

Total                              71   51985.9

* MESSAGE: the following units have large residuals.

blocks 1        31.4   s.e. 14.8

***** Tables of means *****

Variate: yield

Grand mean  104.0

 variety    Victory  Golden rain  Marvellous
               97.6        104.5       109.8

 nitrogen     0 cwt  0.2 cwt  0.4 cwt  0.6 cwt
               79.4     98.9    114.2    123.4

 variety      nitrogen    0 cwt  0.2 cwt  0.4 cwt  0.6 cwt
 Victory                   71.5     89.7    110.8    118.5
 Golden rain               80.0     98.5    114.7    124.8
 Marvellous                86.7    108.5    117.2    126.8

*** Standard errors of differences of means ***

Table            variety  nitrogen   variety
                                    nitrogen
rep.                  24        18         6
s.e.d.              7.08      4.44      9.72
d.f.                  10        45     30.23
Except when comparing means with the same level(s) of variety
                                        7.68
d.f.                                      45
There are now three strata. The blocks stratum contains the variation between the blocks. The blocks all contain exactly the same treatments (one of each of the possible combinations of variety and level of nitrogen), so none of this variation can arise from the effects of the treatments. There are hence no treatment terms estimated in this stratum. (For the same reason none of the treatment terms was estimated in the block stratum of the randomized-block design in Section 6.3.) However, varieties (which were applied to complete whole-plots in the design) are estimated in the blocks.wplots stratum; in conventional terminology this is called the stratum for whole-plots within blocks, and it contains the variation between the whole-plots after eliminating differences between the blocks. The variance ratio for varieties is calculated by dividing the variety mean square by the blocks.wplots residual mean square. It is easy to see that this is the correct thing to do. When we look to see whether the varieties differ we are really trying to answer the question: "Do the yields from the three sets of whole-plots, on the first of which the variety Victory was grown, on the second Golden rain, and on the third Marvellous, differ by more than the amount that we would expect for any three randomly chosen sets of whole-plots (each set containing one whole-plot from every block)?". Technically, variety is said to be confounded with whole-plots. The terms for nitrogen, which was applied to sub-plots, and for the variety.nitrogen interaction are both estimated in the stratum for sub-plots within whole-plots (blocks.wplots.subplots). Thus, these are both compared against the residual of that stratum, which measures the variability of the sub-plots after eliminating differences between the whole-plots (and blocks). The standard errors accompanying the tables of means also take account of the stratum where each treatment term was estimated.
The variety s.e.d. of 7.08 = √(2×601.3/24) is based on the residual mean square for blocks.wplots, while that for nitrogen (4.44 = √(2×177.1/18)) is based on that for blocks.wplots.subplots. The variety × nitrogen table is more interesting. There are two s.e.d.'s according to whether the two means to be compared are for the same variety. If they are, then the sub-plots from which the means are calculated will all involve the same set of whole-plots, so any whole-plot variability will cancel out, giving a smaller s.e.d. than for a pair of means involving different varieties. Finally, notice that this time the Information output category has generated a message noting that block 1 has a large residual compared to the residuals of the other five blocks. In this instance, the message can be taken as confirming the success of the choice of the blocks: that is, that the yields of the plots in block 1 are consistently higher than those in the other blocks. Large residuals in the blocks.wplots.subplots stratum, however, might indicate possibly aberrant values.
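The arithmetic behind these s.e.d.'s can be reproduced in GenStat itself using the general CALCULATE directive (a sketch; CALCULATE is not otherwise shown in this chapter, and the scalar names here are hypothetical):

```
"s.e.d. = square root of (2 x residual m.s. / replication)."
CALCULATE sedvariety  = SQRT(2*601.3/24)
CALCULATE sednitrogen = SQRT(2*177.1/18)
PRINT sedvariety,sednitrogen; DECIMALS=2
```

The two statements use the blocks.wplots and blocks.wplots.subplots residual mean squares respectively, matching the 7.08 and 4.44 reported above.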
6.9 Commands for analysis of variance
Most of the menus described in this chapter use the ANOVA directive, which analyses generally balanced designs. These include most of the commonly occurring experimental designs such as randomized blocks, Latin squares, split plots and other orthogonal designs, as well as designs with balanced confounding, like balanced lattices and balanced incomplete blocks. Many partially balanced designs can also be handled, using pseudo-factors, so a very wide range of designs can be analysed. Before using ANOVA we first need to define the model that is to be fitted in the analysis. Potentially this has three parts. The BLOCKSTRUCTURE directive defines the "underlying structure" of the design or, equivalently, the error terms for the analysis; in the simple cases where there is only a single error term this can be omitted. The TREATMENTSTRUCTURE directive specifies the treatment (or systematic, or fixed) terms for the analysis. The other directive, COVARIATE, lists the covariates if an analysis of covariance is required. At the start of a job all these model-definition directives have null settings. However, once any one of them has been used, the defined setting remains in force for all subsequent analyses in the same job until it is redefined. For example, the statements below were generated by the One-way ANOVA (no Blocking) menu to analyse the example in Section 6.1.

"One-way ANOVA (no Blocking)."
BLOCK "No Blocking"
TREATMENTS diet
COVARIATE "No Covariate"
ANOVA [PRINT=aovtable,information,mean; FPROB=yes] weight
The BLOCK (or, in full, BLOCKSTRUCTURE) directive is given a null setting to cancel any existing setting; so this indicates that the design is unstructured and has a single error term. Similarly, the COVARIATE statement cancels any covariates that may have been set in an earlier menu. The TREATMENTS (or, in full, TREATMENTSTRUCTURE) directive is used to specify that we have a single term in the analysis, the main effect of diet. The first parameter of the ANOVA directive specifies the y-variate to be analysed. The PRINT option is set to a list of strings to select the output to be printed. These are similar to the check boxes of the Further Output menu. The most commonly used settings are:

aovtable        analysis-of-variance table,
information     details of large residuals, non-orthogonality and any aliasing in the model,
covariates      estimated coefficients and standard errors of any covariates,
effects         tables of effects,
residuals       tables of residuals,
contrasts       estimated coefficients of polynomial or other contrasts,
means           tables of means,
%cv             coefficient of variation, and
missingvalues   estimated missing values.

By default PRINT=aovtable,information,covariates,means,missing. Probabilities are not printed by default for the variance ratios in the analysis-of-variance table, but these can be requested by setting the FPROBABILITY option to yes. ANOVA has a PSE option to control the standard errors printed for tables of means. The default setting is differences, which gives standard errors of differences of means. The setting means produces standard errors of means, LSD produces least significant differences, and by setting PSE=* the standard errors can be suppressed altogether. The LSDLEVEL option allows the significance level for the least significant differences to be changed from the default of 5%. ANOVA also has a FACTORIAL option which can be used to specify the maximum order (that is, number of factors) in the treatment terms to be fitted in the analysis; the default is 3.
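For example, to redisplay the means from an analysis with least significant differences at the 1% level instead of the default standard errors of differences, the options might be set as in this sketch (yield is the y-variate of the split-plot example; the model-definition directives are assumed to have been set already):

```
ANOVA [PRINT=means; PSE=LSD; LSDLEVEL=1; FPROB=yes] yield
```

Setting PSE=* in the same position would suppress the standard errors altogether.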
To show a more complicated example, these statements were generated to analyse the split-plot design in Section 6.8:

"Split-Plot Design."
BLOCK blocks/wplots/subplots
TREATMENTS nitrogen*variety
COVARIATE "No Covariate"
ANOVA [PRINT=aovtable,information,mean; FACT=3; FPROB=yes]\
   yield
The block formula

blocks/wplots/subplots

expands, as explained in Section 6.6, to give the three terms

blocks + blocks.wplots + blocks.wplots.subplots
each of which defines a stratum for the analysis. Similarly, the treatment formula

nitrogen*variety

expands to

nitrogen + variety + nitrogen.variety
to request that GenStat fits the main effects of nitrogen and variety, and their interaction. Again there are no covariates. The Further Output menu uses the ADISPLAY directive to produce the output, procedure APLOT to produce the plots of residuals and procedure AGRAPH to plot tables of means. ADISPLAY has options PRINT, FPROBABILITY, PSE and LSDPROBABILITY like those of ANOVA. However, with ADISPLAY the default for PRINT is to print nothing. Finally, the AKEEP directive is used by the ANOVA Save Options menu to save the residuals and fitted values after an analysis. This is done by two options called RESIDUALS and FITTEDVALUES. AKEEP also allows information to be saved for any of the individual terms in the analysis. The terms are defined by a formula which is specified using the TERMS parameter. The formula is expanded into a list of model terms, subject to the limit defined by the FACTORIAL option, which operates like the FACTORIAL option of ANOVA; the other parameters then specify data structures in parallel with this list, to store the information required. Tables of means are saved using the MEANS parameter. So, for example, the variate of residuals yieldres and the N by S table of means meantab in Figure 6.8 were saved by the statement

AKEEP [RESIDUALS=yieldres] N.S; MEANS=meantab
Other useful parameters of AKEEP are EFFECTS (tables of effects for treatment terms), REPLICATIONS (replication tables), RESIDUALS (tables of residuals for block terms), DF (degrees of freedom) and SS (sums of squares). Below we use AKEEP to save the sum of squares and degrees of freedom for nitrogen and variety from the analysis of the split-plot design in Section 6.8.
AKEEP nitrogen+variety; SS=N_ss,V_ss; DF=N_df,V_df
PRINT N_ss,N_df,V_ss,V_df; DECIMALS=1,0
    N_ss    N_df      V_ss    V_df
 20020.5       3    1786.4       2
Unbalanced designs are analysed using procedure AUNBALANCED, which uses the GenStat regression facilities. The method of use is similar to that for ANOVA. The treatment terms to be fitted must be specified, before calling the procedure, by the TREATMENTSTRUCTURE directive. Similarly, any covariates must be indicated by the COVARIATE directive. The procedure also takes account of any blocking structure specified by the BLOCKSTRUCTURE directive. However, it cannot produce stratified analyses like those generated by ANOVA, and is able to estimate treatments and covariates only in the "bottom stratum". So, for example, the full analysis can be produced for a randomized block design, where the treatments are all estimated on the plots within blocks, but it cannot produce the whole-plot analysis in a split-plot design. Unbalanced designs with several error terms can be analysed by the Mixed Models (REML) menu, described in Chapter 7. The parameters of AUNBALANCED are identical to those of ANOVA, and there are also FACTORIAL and FPROBABILITY options like those of ANOVA. Printed output is again controlled by the PRINT option, with settings:

aovtable    to print the analysis-of-variance table,
effects     to print the effects (as estimated by GenStat regression),
means       to print tables of predicted means with standard errors,
residuals   to print residuals and fitted values,
screen      to print "screening" tests for treatment terms, and
%cv         to print the coefficient of variation.

The default is to print the analysis-of-variance table and tables of means. AUNBALANCED calls procedure RSCREEN to provide the screening tests for the treatment terms: marginal tests to assess the effect of adding each term to the simplest possible model (i.e. a model containing any blocks and covariates, and any terms marginal to the term); conditional tests to assess the effect of adding each term to the fullest possible model (i.e. a model containing all terms other than those to which the term is marginal).
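So the screening output shown in Section 6.7 could be requested from commands as in this sketch (factor and variate names as in that section):

```
"Screening tests for the unbalanced Product.gsh example."
BLOCKSTRUCTURE day
TREATMENTSTRUCTURE A*B*C
AUNBALANCED [PRINT=screen; FPROB=yes] Y
```

The PRINT=screen setting produces the marginal (mtest) and conditional (ctest) columns without the full accumulated analysis of variance.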
For example, if we have

BLOCKSTRUCTURE Blocks

and

TREATMENTSTRUCTURE A + B + A.B
the marginal test for A will show the effect of adding A to a model containing only Blocks, while the conditional test will show the effect of adding A to a model containing Blocks and B. (The terms A and B are marginal to A.B.) AUNBALANCED forms tables of means using the PREDICT directive. The first step (A) of the calculation forms the full table of predictions, classified by every factor in the model. The second step (B) averages the full table over the factors that do not occur in the table of means. The COMBINATIONS option specifies which cells of the full table are to be formed in Step A. The default setting, estimable, fills in all the cells other than those that involve parameters that cannot be estimated, for example because of aliasing. Alternatively, setting COMBINATIONS=present excludes the cells for factor combinations that do not occur in the data. The ADJUSTMENT option then defines how the averaging is done in Step B. The default setting, marginal, forms a table of marginal weights for each factor, containing the proportion of observations with each of its levels; the full table of weights is then formed from the product of the marginal tables. The setting equal weights all the combinations equally. Finally, the setting observed uses the WEIGHTS option of PREDICT to weight each factor combination according to its own individual replication in the data. The PFACTORIAL option sets a limit on the number of factors in the terms for which means are to be printed.
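For example, to form tables of means only over the factor combinations that actually occur in the data, weighting those combinations equally, the options might be set as in this sketch (model-definition directives assumed already set for the Section 6.7 example):

```
AUNBALANCED [PRINT=means; COMBINATIONS=present; ADJUSTMENT=equal] Y
```

With the defaults (COMBINATIONS=estimable; ADJUSTMENT=marginal) the same statement would instead weight each factor's levels by their observed proportions.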
The PSE option of AUNBALANCED controls the types of standard errors that are produced to accompany the tables of means, with settings:

differences      for a summary of the standard errors for differences between pairs of means,
alldifferences   for standard errors for differences between all pairs of means,
lsd              for a summary of the least significant differences between pairs of means,
alllsd           for all the least significant differences between pairs of means, and
means            for standard errors of the means (relevant for comparing them with zero).

The default is differences. The NOMESSAGE option allows various warning messages (produced by the FIT directive) to be suppressed, and the PLOT option allows various residual plots to be requested:

fittedvalues   for a plot of residuals against fitted values,
normal         for a Normal plot,
halfnormal     for a half-Normal plot, and
histogram      for a histogram of residuals.

Procedure AUDISPLAY is used to produce further output for an unbalanced design. It has options PRINT, FPROBABILITY, COMBINATIONS, ADJUSTMENT, PSE and LSDLEVEL like those of AUNBALANCED, except that no screening tests are available.
6 Analysis of variance
232
6.10 Other facilities

In this chapter we have shown only four of the 14 design types (including synonyms) that can be analysed using the Analysis of Variance menu. Other possibilities include Latin squares, Graeco-Latin squares, strip-plot and split-split-plot designs and lattices.
6.11 Exercises

6(1) An experiment was conducted to assess the percentage of alcohol by volume of five types of wine labelled A to E. Three bottles of each type were tested in the laboratory in a random order, as listed below and stored in file Wine.gsh.

E  4.931
D  7.263
A  4.857
C  3.361
B  6.871
E  4.141
C  3.164
B  3.012
A  5.668
D 12.185
B  4.223
E  3.323
A  4.668
C  2.686
D  7.776
Analyse the experiment and plot a graph of the residuals against the fitted values. Transform the data using a logit transformation, re-analyse the data and plot another graph of residuals against fitted values.

6(2) An experimenter conducted a trial with insecticides for killing ants. Five types of insecticide were used on each of three types of bait. The experimenter measured the time from the release of a colony of ants to when the bait was picked up. Each combination of bait and insecticide was used three times, the order of the observations being decided entirely at random. The data are available in file Ant.gsh.

"Bait Insecticide Time"
3 5 35.22
2 2 29.30
1 1 30.16
3 3 35.70
2 4 38.63
1 4 40.90
1 2 28.80
6.11 Exercises 3 2 1 2 1 3 3 3 1 1 2 1 2 3 2 2 2 3 1 2 2 3 2 2 3 1 1 2 3 1 3 3 3 1 2 1 1 3
2 4 4 5 5 4 1 1 2 3 3 1 5 5 3 1 4 4 3 2 3 2 1 5 3 3 1 1 2 5 4 1 5 5 2 2 4 3
233
30.91 36.46 43.13 38.32 35.36 38.58 35.69 33.46 30.61 30.94 27.48 32.45 33.00 37.29 36.21 33.44 36.41 35.46 34.16 35.30 37.74 35.17 36.14 28.74 40.19 31.63 26.33 37.40 32.96 36.75 36.29 37.47 32.72 36.01 31.23 33.89 35.22 37.66
Analyse the experiment and plot the means.

6(3) Seven litters each of five rats were used in a randomized-block design (with litters as blocks) to study the effects of different diets on the gain in weight of rats. Analyse the data, in file Ratlitters.gsh, to see whether there are any differences between the diets.

litter 1:   B 87.9    D 80.0    A 76.4    C 54.6    E 76.2
litter 2:   C 51.7    A 74.6    E 78.9    D 63.2    B 85.6
litter 3:   D 70.8    C 62.2    A 77.6    B 88.6    E 83.2
litter 4:   A 83.0    E 70.7    C 80.6    D 84.9    B 103.6
litter 5:   C 83.7    B 100.6   E 101.3   A 94.5    D 76.6
litter 6:   E 71.9    D 72.9    B 54.2    C 47.4    A 55.8
litter 7:   A 54.9    C 76.8    D 68.6    E 82.8    B 65.7
6(4) In an experiment to assess the durability of four different types of carpet, four machines were available to simulate the wear arising from daily use. As it was thought that there might be differences between the conditions in the laboratory on each day that the experiment was run, a Latin square was used. In a Latin square there are two blocking factors, often referred to as the rows and columns of the design, and the blockstructure has rows crossed with columns. Here the block factors are Machine and Day. The results of the experiment are shown in the table below with the different types of carpet being denoted by the letters a-d. Each day corresponds to a row of the table, and each machine to a column. The data are in the file Carpet.gsh.

        Machine 1   Machine 2   Machine 3   Machine 4
Day 1     d  38       a  18       c  38       b  39
Day 2     a  19       d  22       b  26       c  35
Day 3     b  41       c  54       a  11       d  36
Day 4     c  61       b  36       d  22       a  16
The measurement is the percentage wear of the carpet. Transform the results to logits and analyse the experiment. Save the means using the ANOVA Save Options menu, transform them back to percentages, and display them with the Data Display menu.
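The transformation step can be done with CALCULATE before running the analysis; here is a sketch, assuming the percentage wear is in a variate called Wear, with factors Day, Machine and Carpet from Carpet.gsh:

```genstat
" Logit of a percentage: log(p/(100-p)). Wear, Day, Machine and Carpet
  are assumed names from Carpet.gsh. "
CALCULATE Logitwear = LOG(Wear/(100-Wear))
BLOCKSTRUCTURE Day*Machine
TREATMENTSTRUCTURE Carpet
ANOVA [PRINT=aovtable,means] Logitwear
```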
6(5) Construct a randomized block design for three factors Additive, Timing and Amount with three, two and two levels, respectively. (Hint: select the design setting General Treatment Structure (in Randomized Blocks) in the Generate a Standard Design menu.) Optional advanced follow-up exercise: try the Select Design facilities to see if you can produce an alternative design with smaller blocks, and compare the two. Click on Stats on the menu bar, select Design and click on Select Design. Then select "factorial designs (with interactions confounded with blocks)" in answer to the question "Which type of design would you like to generate?".

6(6) In an experiment to study the effect of two meat-tenderizing chemicals, the two (back) legs were taken from four carcasses of beef and one leg was treated with chemical 1 and the other with chemical 2. Three sections were then cut from each leg and allocated (at random) to three cooking temperatures, all 24 sections (4 carcasses × 2 legs × 3 sections) being cooked in separate ovens. The table below shows the force required to break a strip of meat taken from each of the cooked sections (the data are also in the file Meat.gsh). Analyse the experiment.

                            Leg 1                      Leg 2
Carcass  Section   Chemical  Temp  Force     Chemical  Temp  Force
   1        1          1       2    5.5          2       3    6.3
   1        2          1       3    6.5          2       1    3.5
   1        3          1       1    4.3          2       2    4.8
   2        1          2       1    3.2          1       3    6.2
   2        2          2       3    6.0          1       2    5.0
   2        3          2       2    4.7          1       1    4.0
   3        1          2       1    2.6          1       2    4.6
   3        2          2       2    4.3          1       1    3.8
   3        3          2       3    5.6          1       3    5.8
   4        1          1       3    5.7          2       2    4.1
   4        2          1       1    3.7          2       3    5.9
   4        3          1       2    4.9          2       1    2.9
On the assumption that the temperature levels are equally spaced and increasing, use the polynomial contrast menu to see whether the force increases linearly with temperature. Design a follow-up experiment to study two different chemicals and one further temperature (making four temperatures in all), assuming that we now have only three carcasses but can take four sections from each leg. 6(7) An experiment was carried out to compare the effects of various fungicide treatments on the growth and yield of oil seed rape. Four plots for each of the five
treatments were laid out in a randomized block design. The treatments (labelled A, B, C, D and E) were:
A - untreated control
B - standard fungicide applied at time 1
C - new fungicide applied at full rate at time 1
D - new fungicide applied at full rate at time 2
E - new fungicide applied at half rate at times 1 and 2
Analyse the results of the experiment (available in file Fungicide.gsh). Fit contrasts to assess the overall difference between the control and the new fungicide, the overall difference between the standard and the new fungicide, and the difference between application times 1 and 2 for the new fungicide.
7 REML analysis of mixed models
The Analysis of Variance menus, described in Chapter 6, deal mainly with balanced designs. This ideal situation, however, is not always achievable. The randomized-block design in Section 6.3 is balanced because every block contained one of each treatment combination. However, there may sometimes be so many treatments that the blocks would become unrealistically large. Designs where each block contains fewer than the full set of treatments include cyclic designs and Alpha designs (both of which can be generated within GenStat by clicking Stats on the menu bar, selecting Design and then Select Design), neither of which tends to be balanced. In experiments on animals, some subjects may fail to complete the experiment for reasons unconnected with the treatments. So even an initially balanced experiment may not yield a balanced set of data for analysis. The Mixed Models (REML) menus, which use the GenStat REML directive, are designed to handle these situations. They also allow you to fit models to the complex correlation structures that occur in repeated measurements or in spatially-correlated data from field experiments.
7.1 Linear mixed models: split-plot design
We start by reanalysing the split-plot design in Section 6.8, to highlight the differences and similarities between REML and ANOVA. Figure 7.1 shows the Linear Mixed Models menu, obtained by clicking Stats on the menu bar and selecting Mixed Models (REML), followed by Linear Mixed Models. The Fixed Model box corresponds to the Treatment Structure box in the split-plot menu, and specifies the terms defining the fixed effects in the model to be fitted. The Linear Mixed Models menu provides general facilities covering any type of design, and so the random effects are defined explicitly by the contents of the Random Model box, instead of being defined automatically as in the split-plot menu. The model is the same though, namely

blocks/wplots/subplots
which expands to give the three (random) terms (see Section 6.6)

blocks + blocks.wplots + blocks.wplots.subplots
Similarly, the fixed model

nitrogen*variety

expands as before to

nitrogen + variety + nitrogen.variety
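For readers who prefer the command language, the menu settings correspond to commands along these lines (a sketch; the menus generate similar code automatically):

```genstat
" Fixed and random models for the split-plot REML analysis. "
VCOMPONENTS [FIXED=nitrogen*variety] RANDOM=blocks/wplots/subplots
REML [PRINT=model,components,waldTests] yield
```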
to request that GenStat fits the main effects of nitrogen and variety, and their interaction. (The Interactions box, which operates just like the one in the Analysis of Variance menu, has requested all interactions in the fixed model to be included.) The Options button produces the menu in Figure 7.2. The standard model options (as shown in the figure) are fine for this design, so we need only select the output to display (and then click OK). Returning to the main menu (Figure 7.1): initial values are seldom required for simple REML analyses like this, and the Spline Model box is not relevant (this is mainly useful with repeated measurements), so we can click on OK and generate the output shown below.

***** REML Variance Components Analysis *****

Response Variate : yield
Fixed model   : Constant + nitrogen + variety + nitrogen.variety
Random model  : blocks + blocks.wplots + blocks.wplots.subplots

Number of units : 72

* blocks.wplots.subplots used as residual term
* Sparse algorithm with AI optimisation

*** Estimated Variance Components ***

Random term          Component      S.e.
blocks                   214.5     168.8
blocks.wplots            106.1      67.9

*** Residual variance model ***

Term                     Factor   Model(order)   Parameter   Estimate    S.e.
blocks.wplots.subplots            Identity       Sigma2         177.1    37.3

*** Wald tests for fixed effects ***

Fixed term            Wald statistic   d.f.   Wald/d.f.   Chi-sq prob

* Sequentially adding terms to fixed model
nitrogen                      113.06      3       37.69        <0.001
variety                         2.97      2        1.49         0.226
nitrogen.variety                1.82      6        0.30         0.936

* Dropping individual terms from full fixed model
nitrogen.variety                1.82      6        0.30         0.936

* Message: chi-square distribution for Wald tests is an asymptotic approximation (i.e. for large samples) and underestimates the probabilities in other cases.
The output first lists the terms in the fixed and random model, and indicates the residual term. The residual term is a random term with a parameter for every unit in the design. Here we have specified a suitable term, blocks.wplots.subplots, explicitly. However, if we had specified only blocks and blocks.wplots as the Random Model (for example by putting blocks/wplots), GenStat would have added an extra term *units* to act as residual. (*units* would be a private factor with a level for every unit in the design.)

GenStat estimates a variance component for every term in the random model, apart from the residual. The variance component measures the inherent variability of the term, over and above the variability of the sub-units of which it is composed. Generally, this is positive, indicating that the units become more variable the larger they become. So here the whole-plots are more variable than the subplots, and the blocks are more variable than the whole-plots within the blocks. (This is the same conclusion that you would draw from the analysis-of-variance table in Section 6.8 and, in fact, you can also produce the variance components as part of the stratum variances output from the Analysis of Variance menu.) However, the variance component can sometimes be negative, indicating that the larger units are less variable than you would expect from the contributions of the sub-units of which they are composed. This could happen if the sub-units were negatively correlated.

The section of output summarising the residual variance model indicates that we have not fitted any specialized correlation model on this term (see the column headed Model), and gives an estimate of the residual variance; this is the same figure as is given by the mean square in the residual line of the blocks.wplots.subplots stratum in the split-plot analysis-of-variance table.

The next section, however, illustrates a major difference between the two analyses. When the design is balanced, GenStat is able to partition the variation into strata with an appropriate random error term (or residual) for each treatment term (see Section 6.8). The variance ratios can be assumed to have F-distributions, which can be assessed in the usual way; the resulting probabilities are in the final column of the analysis-of-variance table. No such partitioning is feasible for the unbalanced situations that REML is designed to handle. Instead GenStat produces a Wald statistic to assess each fixed term. In a balanced design, like those covered in Chapter 6, the Wald statistic corresponds to the treatment sum of squares divided by the stratum mean square. The statistic would have an exact chi-square distribution if the variance parameters were known but, as they must be estimated, it is only asymptotically distributed as chi-square. In practical terms, the chi-square values will be reliable if the residual degrees of freedom for the fixed term are large compared to its own degrees of freedom. In a balanced design, the number of residual degrees of freedom for a fixed (or treatment) term is easy to ascertain: it is simply the number of residual degrees of freedom for the stratum where the term is estimated (see Section 6.8).
Also, if the design is balanced, the fourth column of the tables (that is, the Wald statistic divided by its degrees of freedom) will be distributed as F(m,n), where m is the number of degrees of freedom of the fixed term (column 3), and n is the number of residual degrees of freedom for the fixed term. For unbalanced designs the F distribution is only approximate and, in any case, it may be difficult to deduce the appropriate residual numbers of degrees of freedom. The important point to remember, though, is that use of the chi-square probabilities tends to give significant results rather too frequently, so you need to be careful, especially if the value is close to a critical value. In the example, however, the situation seems clear cut: there is clear evidence that nitrogen has an effect but no evidence for any differences between varieties nor any interaction. In the example, the treatment terms are orthogonal so it makes no difference whether nitrogen or variety is fitted first. In an unbalanced design, however, the ordering of fitting is important, and you should be aware that each Wald test
in the "Sequentially adding terms to fixed model" section represents the effect of adding the term concerned to a model containing all the terms in the preceding lines. The next section, headed "Dropping individual terms from full fixed model", looks at the effect of removing terms from the complete fixed model: so the lines here allow you to assess the effects of a term after eliminating all the other fixed terms. This is particularly useful for seeing how the model might be simplified. Notice that the only relevant term here is the variety by nitrogen interaction. We cannot remove a main effect (such as nitrogen or variety) from a model that contains an interaction involving that factor. The Further Output button generates the menu shown in Figure 7.3, in which we have checked the boxes to produce tables of predicted means and standard errors of differences between means. The Model Terms for Effects and Means box enables you to specify the terms for which you want tables of means (and, if you had checked the Estimated Effects box, tables of effects). The default, which is fine here, is to produce a table for each term in the fixed model. Clicking OK then generates the tables shown below. Because the fixed terms are orthogonal, the means are identical to those produced by the Analysis of Variance menu (Section 6.8).

*** Table of predicted means for Constant ***

    104.0
Standard error: 6.64

*** Table of predicted means for nitrogen ***

nitrogen     0 cwt   0.2 cwt   0.4 cwt   0.6 cwt
              79.4      98.9     114.2     123.4

Standard error of differences: 4.436

*** Table of predicted means for variety ***

variety    Victory   Golden rain   Marvellous
              97.6         104.5        109.8

Standard error of differences: 7.079

*** Table of predicted means for nitrogen.variety ***

 variety   Victory   Golden rain   Marvellous
nitrogen
   0 cwt      71.5          80.0         86.7
 0.2 cwt      89.7          98.5        108.5
 0.4 cwt     110.8         114.7        117.2
 0.6 cwt     118.5         124.8        126.8

Standard error of differences:   Average   9.161
                                 Maximum   9.715
                                 Minimum   7.683

Average variance of differences: 84.74

Standard error of differences for same level of factor:
            nitrogen   variety
Average        9.715     7.683
Maximum        9.715     7.683
Minimum        9.715     7.683
The REML facilities thus produce the same information as that given by the Analysis of Variance menu where this is possible in their more general context, but they are not able to match its more specialised output. The advantage of the REML menus, however, lies in the much wider range of situations that they cover; we illustrate several of these in the rest of this chapter.
7.2 Linear mixed models: a non-orthogonal design
We now consider the analysis of a rather more complicated field experiment (at Slate Hall Farm in 1976), previously analysed by Gilmour et al. (1995). The design was set up to study 25 varieties of wheat, and contained six replicates (each with one plot for every variety) laid out in a two by three array. The variety grown on each plot is shown in the plan below. Each replicate has a block structure of rows crossed with columns, so the random model is

replicates/(rows*columns)
(rows crossed with columns, nested within replicates), which expands to give

replicates + replicates.rows + replicates.columns + replicates.rows.columns
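In the command language, the corresponding specification is simply (a sketch of what the menu generates):

```genstat
VCOMPONENTS [FIXED=variety] RANDOM=replicates/(rows*columns)
REML [PRINT=model,components,waldTests,means; PSE=differences] yield
```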
So we have random terms for replicates, rows within replicates, columns within replicates and, finally, replicates.rows.columns represents the residual variation. The fixed model contains just the main effect of the factor variety.

 1  2  4  3  5 | 19 23  2  6 15 | 18 25  9 11  2
 6  7  9  8 10 |  8 12 16 25  4 |  5  7 16 23 14
21 22 24 23 25 | 11 20 24  3  7 |  6 13 22  4 20
11 12 14 13 15 | 22  1 10 14 18 | 24  1 15 17  8
16 17 19 18 20 |  5  9 13 17 21 | 12 19  3 10 21
---------------+---------------+---------------
 3 18  8 13 23 | 16 24 10 13  2 | 10  4 17 11 23
 1 16  6 11 21 | 12 20  1  9 23 | 12  6 24 18  5
 5 20 10 15 25 |  4  7 18 21 15 | 19 13  1 25  7
 2 17  7 12 22 | 25  3 14 17  6 | 21 20  8  2 14
 4 19  9 14 24 |  8 11 22  5 19 |  3 22 15  9 16
Figure 7.4 shows a GenStat spreadsheet file, stored as Slatehall.gsh, containing the data. As well as the factors already mentioned, the sheet also contains factors plotnumber (indexing the individual plots), fieldrow and fieldcolumn (defining the row and column positions within the whole field, rather than within each replicate) which we shall use to define spatial correlation structures in Section 7.3.
Figure 7.5 shows the Linear Mixed Models menu with the necessary boxes filled in. If we use the Linear Mixed Model Options menu (Figure 7.2) to request predicted means and standard errors of differences of means (in addition to the existing Display options), and then click on OK in the Linear Mixed Models menu itself, the following output is produced.

***** REML Variance Components Analysis *****

Response Variate : yield
Fixed model   : Constant + variety
Random model  : replicates + replicates.rows + replicates.columns + replicates.rows.columns

Number of units : 150

* replicates.rows.columns used as residual term
* Sparse algorithm with AI optimisation

*** Estimated Variance Components ***

Random term            Component      S.e.
replicates                0.4262    0.6890
replicates.rows           1.5595    0.5091
replicates.columns        1.4812    0.4865

*** Residual variance model ***

Term                      Factor   Model(order)   Parameter   Estimate     S.e.
replicates.rows.columns            Identity       Sigma2         0.806    0.1340

*** Wald tests for fixed effects ***

Fixed term    Wald statistic   d.f.   Wald/d.f.   Chi-sq prob

* Sequentially adding terms to fixed model
variety               212.26     24        8.84        <0.001

* Dropping individual terms from full fixed model
variety               212.26     24        8.84        <0.001

* Message: chi-square distribution for Wald tests is an asymptotic approximation (i.e. for large samples) and underestimates the probabilities in other cases.
*** Table of predicted means for Constant ***

    14.70

Standard error: 0.422

*** Table of predicted means for variety ***

variety       1       2       3       4       5
          12.84   15.49   14.21   14.52   15.33
variety       6       7       8       9      10
          15.27   14.01   14.57   12.99   11.93
variety      11      12      13      14      15
          13.27   14.84   16.19   13.27   14.98
variety      16      17      18      19      20
          13.46   14.98   15.92   16.70   16.40
variety      21      22      23      24      25
          14.93   16.44   13.29   15.46   16.31

Standard error of differences: 0.6202
Unusually for a large variety trial, this particular design is balanced (in fact it is a lattice square), and we can gain additional insights into the REML analysis by looking at the output that we could have obtained from the Analysis of Variance menu. The menu is not customised for the design, but we can use the General Analysis of Variance setting in the Design box, and specify the Treatment Structure and Block Structure as shown in Figure 7.6. The standard analysis of variance output (analysis-of-variance table, information summary, means and standard errors of differences) is shown below.

***** Analysis of variance *****

Variate: yield
Source of variation               d.f.       s.s.      m.s.    v.r.  F pr.

replicates stratum                   5   133.3273   26.6655

replicates.rows stratum
variety                             24   215.9053    8.9961

replicates.columns stratum
variety                             24   229.8094    9.5754

replicates.rows.columns stratum
variety                             24   166.7674    6.9486    8.58  <.001
Residual                            72    58.3011    0.8097

Total                              149   804.1105
***** Information summary *****

Model term                          e.f.   non-orthogonal terms

replicates.rows stratum
variety                            0.167

replicates.columns stratum
variety                            0.167   replicates.rows

replicates.rows.columns stratum
variety                            0.667   replicates.rows
                                           replicates.columns

* MESSAGE: the following units have large residuals.

replicates 6                      -1.895   s.e. 0.943

replicates 1 rows 4 columns 3     -1.665   s.e. 0.623
replicates 1 rows 5 columns 2      1.710   s.e. 0.623
***** Tables of means *****

Variate: yield

Grand mean 14.704

variety        1        2        3        4        5        6        7
          12.962   15.561   14.152   14.560   15.481   15.358   14.008
variety        8        9       10       11       12       13       14
          14.428   12.968   11.928   13.222   14.835   16.176   13.187
variety       15       16       17       18       19       20       21
          15.067   13.287   14.968   15.881   16.742   16.277   15.048
variety       22       23       24       25
          16.430   13.283   15.464   16.344

*** Standard errors of differences of means ***

Table     variety
rep.            6
d.f.           72
s.e.d.     0.6363
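The analysis of variance shown above can also be specified directly in the command language; a sketch of the commands that the menu generates:

```genstat
BLOCKSTRUCTURE replicates/(rows*columns)
TREATMENTSTRUCTURE variety
ANOVA [PRINT=aovtable,information,means] yield
```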
Notice that the analysis-of-variance table has three lines for variety. As each row contains a different set of varieties, the differences between the rows in each replicate enable us to obtain estimates of the variety effects (which appear in the replicates.rows stratum). The same is true of the columns. The design is balanced because the various comparisons between varieties are all estimated with the same efficiency in the replicates.rows stratum; the Information Summary indicates the efficiency is in fact 0.167. Similarly, they all have efficiency 0.167 in the replicates.columns stratum, and efficiency 0.667 in the replicates.rows.columns stratum. So the possible information on variety is split (1/6 : 1/6 : 2/3) between the three strata. We can see the estimates obtained in each stratum by checking the Effects box in the ANOVA Further Output menu (Figure 7.7) and then clicking OK, and you can verify that the standard table of means produced by ANOVA, above, is calculated using the estimated effects from the lowest stratum (replicates.rows.columns): the mean 12.962 for variety 1 is the grand mean 14.704 plus the effect of variety 1 in the replicates.rows.columns table, namely -1.742.

***** Tables of effects *****

Variate: yield

***** replicates.rows stratum *****

variety effects
e.s.e. *   rep. 6

variety        1        2        3        4        5        6        7
          -5.614    1.296    0.604   -1.468   -3.522    2.790   -3.458
variety        8        9       10       11       12       13       14
           1.718    0.520   -3.814   -2.718   -2.544    1.020    1.236
variety       15       16       17       18       19       20       21
           0.582    5.598    3.786    3.480    3.902    3.530   -1.294
variety       22       23       24       25
          -0.028    1.360   -3.058   -3.894
***** replicates.columns stratum *****

variety effects

e.s.e. *   rep. 6

variety        1        2        3        4        5        6        7
          -3.432   -2.588    0.812   -0.650   -1.450   -4.948    1.930
variety        8        9       10       11       12       13       14
           4.064   -3.010   -1.584    1.852    2.828    2.540   -0.752
variety       15       16       17       18       19       20       21
          -3.536   -0.642   -2.494    0.740   -1.706    4.934   -2.924
variety       22       23       24       25
           3.990   -3.730    4.434    5.332
***** replicates.rows.columns stratum *****

variety effects

e.s.e. 0.4499   rep. 6

variety        1        2        3        4        5        6        7
          -1.742    0.857   -0.553   -0.144    0.777    0.653   -0.697
variety        8        9       10       11       12       13       14
          -0.277   -1.736   -2.777   -1.482    0.130    1.471   -1.517
variety       15       16       17       18       19       20       21
           0.362   -1.418    0.263    1.176    2.037    1.573    0.343
variety       22       23       24       25
           1.726   -1.421    0.760    1.639
In contrast, the REML analysis has produced a single set of estimates, and these automatically combine (with an appropriate weighting) all the separate estimates. In fact the REML estimates correspond to the combined effects and means in the ANOVA Further Output menu. Below, we show these tables, together with the output generated by checking the Stratum Variances box, which contains the variance components. The combined means have a smaller standard error of difference than the standard means, but the complicated structure of their estimation means that we can no longer assume that differences between them follow t-distributions with a known number of degrees of freedom.

***** Tables of combined effects *****

Variate: yield

variety effects
e.s.e. 0.4385   rep. 6   effective d.f. 79.99

variety        1        2        3        4        5        6        7
          -1.869    0.786   -0.495   -0.186    0.628    0.570   -0.697
variety        8        9       10       11       12       13       14
          -0.131   -1.716   -2.772   -1.432    0.133    1.486   -1.438
variety       15       16       17       18       19       20       21
           0.276   -1.243    0.277    1.217    1.991    1.695    0.230
variety       22       23       24       25
           1.739   -1.413    0.760    1.602
***** Tables of combined means *****

Variate: yield

variety        1        2        3        4        5        6        7
          12.836   15.490   14.209   14.519   15.333   15.274   14.007
variety        8        9       10       11       12       13       14
          14.574   12.989   11.932   13.272   14.838   16.190   13.266
variety       15       16       17       18       19       20       21
          14.980   13.461   14.982   15.922   16.696   16.399   14.934
variety       22       23       24       25
          16.444   13.291   15.465   16.306

*** Standard errors of differences of combined means ***

Table            variety
rep.                   6
s.e.d.            0.6202
effective d.f.     79.99

***** Estimated stratum variances *****

Variate: yield

Stratum                    variance   effective d.f.   variance component
replicates                  26.6655            5.000               0.4262
replicates.rows              8.6037           23.464               1.5595
replicates.columns           8.2120           23.438               1.4812
replicates.rows.columns      0.8062           73.099               0.8062
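The stratum variances and the variance components are linked by simple accumulation: each plot contributes Sigma2, each row of five plots adds five times the rows component, and so on up the hierarchy. A sketch of the check with CALCULATE (the numbers are the estimates above; small discrepancies are rounding):

```genstat
" Stratum variance for replicates.rows: residual + 5 x rows component. "
CALCULATE vrows = 0.8062 + 5*1.5595
" Stratum variance for replicates: residual + 5 x rows + 5 x columns
  + 25 x replicates component. "
CALCULATE vreps = 0.8062 + 5*1.5595 + 5*1.4812 + 25*0.4262
PRINT vrows,vreps
```

The first calculation reproduces 8.6037 exactly, and the second gives approximately 26.665, matching the replicates stratum variance.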
The example reinforces the point that the REML output is the same as that given by ANOVA when both are feasible, but that the generality of the REML method leaves aspects that it cannot duplicate. More importantly, though, it shows that the REML method makes use of all the available information about each fixed effect. These aspects indicate the efficiency and appropriateness of the methodology, and the exercises at the end of the chapter will illustrate its ability to handle designs that cannot be analysed by ANOVA. In the next section we demonstrate another important advantage, namely the ability of REML to fit models to spatial correlation structures.
7.3 Spatial analysis
The analysis in Section 7.2 assumed a conventional correlation structure, with an equal correlation between any two rows within a replicate, and an equal correlation between any two columns within a replicate. An alternative approach would be to regard the layout as a two-dimensional array and model the underlying correlation structure. There are two possible menus, depending on whether the plots are on a regular grid or are in an irregular layout. Here the arrangement is regular, so we click Stats on the menu bar, select Mixed Models (REML), select Spatial Model and then click on Regular Grid (see Figure 7.8), to produce the menu in Figure 7.9. If we assume that the error process is separable (that is, that the correlations across rows and across columns are independent), the two-dimensional pattern of variances and covariances amongst the plots can be modelled as the direct product of a correlation model across rows with a correlation model across columns (see Cullis & Gleeson 1991). Within the menu we simply need to specify the row and column factors (in the Row Factor and Column Factor boxes), and select the required correlation model from those available in the Row-model and Column-model boxes. In Figure 7.9, we have set both to AR order 1, that is a first-order autoregressive model. Other possibilities include:
Identity - equal correlations between any pair of rows/columns, as fitted by the Linear Mixed Models menu in Section 7.2;
Power model (city block) - the correlation between each pair of rows (or columns) is a power of the distance, measured in numbers of rows (or columns), between the rows (or columns);
AR order 2 - a second-order autoregressive model.

There are boxes allowing you to fit a random row effect in addition to the row model, or to fit a linear trend across rows, and there are similar boxes for the columns. You can also specify additional random terms, in the Random Terms box. These might include other types of blocking, as for example if the plots had been sown or harvested on different days or by different operators. Another possibility would be to add a term for random measurement error. We can do this by specifying a term indexing the individual units of the design. However, we cannot use fieldrow.fieldcolumn, as this is being defined as a spatial correlation term. We can, however, use the factor plotnumber. (Alternatively, REML would allow you to use the string '*units*' to denote an internal factor with a level for every unit of the design, just like plotnumber.) We also need to set the Data variate to yield, and the Fixed Terms to variety, and can then click OK to produce the output below.
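Behind this menu, the model is specified with VCOMPONENTS together with a covariance-structure definition on the spatial term; a sketch follows (the exact syntax of the structure definition may vary between releases, so treat the VSTRUCTURE line as indicative rather than definitive):

```genstat
VCOMPONENTS [FIXED=variety] RANDOM=fieldrow.fieldcolumn+plotnumber
" Define a separable AR1 x AR1 structure on the spatial term (indicative syntax). "
VSTRUCTURE [TERMS=fieldrow.fieldcolumn] FACTOR=fieldrow,fieldcolumn; MODEL=AR,AR; ORDER=1,1
REML [PRINT=model,components,waldTests] yield
```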
**** G5W0001 **** Warning (Code VC 53). Statement 1 on Line 131
Command: REML [PRINT=model,components,waldTests; MAXCYCLE=20; PSE=differences;
More than one residual term specified - first term found will be used as R

***** REML Variance Components Analysis *****

Response Variate : yield

Fixed model   : Constant + variety
Random model  : fieldrow.fieldcolumn + plotnumber

Number of units : 150

* fieldrow.fieldcolumn used as residual term with covariance structure below
* Sparse algorithm with AI optimisation

*** Covariance structures defined for random model ***

Covariance structures defined within terms:

Term                    Factor        Model                        Order   Nrows
fieldrow.fieldcolumn    fieldrow      Auto-regressive (+ scalar)       1      10
                        fieldcolumn   Auto-regressive                  1      15
*** Estimated Variance Components ***

Random term        Component    S.e.
Extra units term       0.486   0.179

*** Residual variance model ***

Term                   Factor        Model(order)   Parameter   Estimate     S.e.
fieldrow.fieldcolumn                                Sigma2         4.580    1.670
                       fieldrow      AR(1)          phi_1         0.6827   0.1023
                       fieldcolumn   AR(1)          phi_1         0.8438   0.0684

*** Wald tests for fixed effects ***

Fixed term   Wald statistic   d.f.   Wald/d.f.   Chi-sq prob

* Sequentially adding terms to fixed model
variety              245.40     24       10.22        <0.001

* Dropping individual terms from full fixed model
variety              245.40     24       10.22        <0.001

* Message: chi-square distribution for Wald tests is an asymptotic approximation (i.e. for large samples) and underestimates the probabilities in other cases.
GenStat produces a diagnostic warning that we have specified more than one residual term. This is not a problem here, as fieldrow.fieldcolumn is actually a spatial covariance term. The output then describes the models, and prints their parameters. The effectiveness of the analysis is shown by the increase in the Wald statistic (from 212.26 to 245.40) compared to the previous, conventional analysis. We can assess the need for the measurement error term by studying the deviance of the models including and excluding the term. We click on the Further Output button to generate the Spatial Model Further Output menu (Figure 7.10), check the Deviance box, and click on OK. For the deviance of the model excluding plotnumber, we need to delete that term from the Random Terms box in the Spatial Model - Regular Grid menu (Figure 7.9), click on Options to bring up the Spatial Model Options menu, modify the display settings so that only the Deviance box is checked (Figure 7.11), and click on OK in the Options and then the main menu.

*** Deviance: -2*Log-Likelihood ***

Deviance   d.f.
  242.35    121

Note: deviance omits constants which depend on fixed model fitted.

*** Deviance: -2*Log-Likelihood ***

Deviance   d.f.
  249.35    122

Note: deviance omits constants which depend on fixed model fitted.
The deviance is useful for determining the appropriate random model for the data. It is best to fit the full fixed model first, and then study how each modification to the random model affects the deviance. Changes in the deviance are approximately distributed as chi-square, so here the extra variance parameter for measurement error has a chi-square value of 7.00 on one degree of freedom. The deviance itself is not usable on its own: to simplify the calculations, some constant terms (which depend only on the fixed model) are omitted, so the value printed by GenStat may even be negative. Once you have determined an appropriate random model, you can then move on to assess the terms in the fixed model.
The menu for an irregular grid (Figure 7.12) is similar to the menu for a regular grid (Figure 7.8), but with irrelevant boxes removed and the boxes for row and column factors replaced by boxes for x and y positions (or coordinates). The model choices are restricted to identity or power. The power can be of "city-block" distance (row distance plus column distance), squared distance (row distance squared plus column distance squared) or Euclidean distance (square root of row distance squared plus column distance squared).
7.4 Repeated measurements
Figure 7.13 shows a GenStat spreadsheet, ATP.gsh, containing data from an experiment to study the effects of preserving liquids on the enzyme content of dog hearts. There were 23 hearts and two treatment factors, A and B, each at two levels. Measurements were made of ATP, as a percentage of total enzyme in the heart, at one- and two-hourly intervals during a twelve-hour period following initial preservation. There are two menus for repeated measurements, depending on whether the measurements are presented in separate variates, one for each time, or all in a single variate. The available analyses are identical; the menus merely provide different ways of specifying the data for the analysis.
Here the measurements are all in a single variate, ATP, and there are factors heart and time to indicate which heart provided the measurement in each unit of ATP, and the time when the measurement took place. Figure 7.14 shows the appropriate menu, which is obtained by clicking on Stats on the menu bar, selecting Mixed Models (REML) followed by Repeated Measurements, and then clicking on Data in One Variate (the alternative being Data in Multiple Variates). In the figure we have filled in all the necessary boxes. Notice that we need to tell GenStat that the time points were the same for every subject (or heart). This would not be necessary if the measurements were in individual variates, one for each time. The types of model that can be fitted differ according to whether the times of measurement were equally spaced or irregular. Autoregressive and uniform correlation models can be fitted only to equally spaced measurements. Unstructured, ante-dependence or power models can be fitted in either situation. In this example we shall fit ante-dependence models: a set of variates observed at successive times is said to have an ante-dependence structure of order r if each ith variate (i>r), given the preceding r, is independent of all further preceding variates. In the analysis we
start with order 1, and use the options menu (Figure 7.15) to ask for only the deviance to be printed.

*** Deviance: -2*Log-Likelihood ***

    Deviance     d.f.
     1021.84      171

Note: deviance omits constants which depend on fixed model fitted.
We can generalize the model by including additional uniform correlation within subjects (this is equivalent to including a random term for subjects, here the different hearts) or by increasing the order of ante-dependence to two. To investigate the first alternative we need to check the Additional uniform correlation within subjects box in Figure 7.14.

*** Deviance: -2*Log-Likelihood ***

    Deviance     d.f.
     1011.50      170

Note: deviance omits constants which depend on fixed model fitted.
The change in deviance is 10.34, which is distributed as χ² on one degree of freedom. So there is definite evidence to support including uniform correlation within hearts. Now changing the ante-dependence structure to order two produces the deviance below.

*** Deviance: -2*Log-Likelihood ***

    Deviance     d.f.
     1005.50      162

Note: deviance omits constants which depend on fixed model fitted.
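The menu operations above correspond, in outline, to the following commands (the directives are described in Section 7.5). The fixed and random models are those shown in the analysis output later in this section; initial values for the ante-dependence parameters, which the menus supply automatically, are omitted from this sketch:

   VCOMPONENTS [FIXED=time*A*B] RANDOM=heart+heart.time
   VSTRUCTURE [heart.time] FACTOR=time; MODEL=antedependence; ORDER=1
   REML [PRINT=deviance] ATP

Changing ORDER=1 to ORDER=2 fits the second-order ante-dependence structure.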
The χ² value, 6.00 on 8 degrees of freedom, is not significant. So we retain the ante-dependence order one (and keep the additional uniform correlation), and then use the Repeated Measurements Further Output menu (Figure 7.16) to print details of the variance model and Wald tests for the fixed terms.
***** REML Variance Components Analysis *****

Response Variate : ATP

Fixed model      : Constant+time+A+B+time.A+time.B+A.B+time.A.B
Random model     : heart+heart.time

Number of units  : 230

* heart.time used as residual term with covariance structure as below
* Sparse algorithm with AI optimisation
* Units with missing factor/covariate values included

*** Covariance structures defined for random model ***

Covariance structures defined within terms:

Term         Factor   Model            Order   Nrows
heart.time   heart    Identity             0      23
             time     Antedependence       1      10

*** Estimated Variance Components ***

Random term        Component      S.e.
heart                 10.523     4.515

*** Residual variance model ***
Term         Factor   Model(order)        Parameter   Estimate       S.e.
heart.time                                Sigma2         1.000      FIXED
             heart    Identity                   -          -          -
             time     Antedependence(1)   dinv_1       0.09154    0.04171
                                          dinv_2       0.06305    0.03045
                                          dinv_3       0.03440    0.01241
                                          dinv_4       0.03129    0.01108
                                          dinv_5       0.01486    0.00543
                                          dinv_6       0.02841    0.01039
                                          dinv_7       0.009923   0.003257
                                          dinv_8       0.01197    0.00394
                                          dinv_9       0.01024    0.00335
                                          dinv_10      0.009345   0.003043
                                          u_12         0.2206     0.4181
                                          u_23        -0.03461    0.37493
                                          u_34        -0.05248    0.26525
                                          u_45         0.5698     0.3692
                                          u_56         0.2384     0.1710
                                          u_67        -0.5742     0.3830
                                          u_78        -0.5976     0.2004
                                          u_89        -0.4167     0.2089
                                          u_910       -0.6000     0.2217
*** Wald tests for fixed effects ***

Fixed term        Wald statistic    d.f.    Wald/d.f.    Chi-sq prob

* Sequentially adding terms to fixed model
time                      274.09       9        30.45         <0.001
A                           0.42       1         0.42          0.518
B                           0.29       1         0.29          0.593
time.A                     38.60       9         4.29         <0.001
time.B                     23.02       9         2.56          0.006
A.B                         5.76       1         5.76          0.016
time.A.B                    3.04       9         0.34          0.963

* Dropping individual terms from full fixed model
time.A.B                    3.04       9         0.34          0.963

* Message: chi-square distribution for Wald tests is an asymptotic
  approximation (i.e. for large samples) and underestimates the
  probabilities in other cases.
The output shows evidence of time effects and of interactions involving time, A and B. So we finish by forming A by B by time tables of predicted means (Figure 7.17).
*** Table of predicted means for time.A.B ***

                  B        1        2
   time      A
  0.000      1         77.47    84.14
             2         82.22    82.35
  1.000      1         72.95    81.26
             2         84.38    81.94
  2.000      1         79.31    82.74
             2         78.36    75.70
  3.000      1         74.99    79.98
             2         75.16    75.48
  4.000      1         76.10    75.17
             2         75.23    69.41
  5.000      1         72.37    71.86
             2         73.46    61.59
  6.000      1         64.38    61.33
             2         67.63    58.92
  8.000      1         57.87    47.62
             2         68.92    56.73
 10.000      1         48.41    43.68
             2         61.16    54.49
 12.000      1         43.43    39.31
             2         56.82    51.52

Standard error of differences:    Average    5.087
                                  Maximum    7.670
                                  Minimum    2.314
Average variance of differences:            27.31

Standard error of differences for same level of factor:
                                  time        A        B
         Average                 5.082    4.952    4.952
         Maximum                 7.670    7.670    7.670
         Minimum                 2.675    2.314    2.314
Average variance of differences: 28.63    25.92    25.92

7.5 Commands for REML analysis
The menus described in this chapter all use the REML directive. Before using REML we first need to define the model that is to be fitted in the analysis. For straightforward linear mixed models, the only directive that needs to be specified is VCOMPONENTS. The FIXED option specifies a model formula defining the fixed model terms to be fitted, and the RANDOM parameter specifies another model formula defining the random terms. There are two other parameters. INITIAL provides initial values for the estimation of each variance component. These are supplied as the ratio of the component to the residual variance, but the default value of one is usually satisfactory. The CONSTRAINT parameter can be used to indicate whether each variance component is to be constrained in any way. The default setting, none, leaves them unconstrained. The positive setting forces a variance component to be kept positive, the fixrelative setting fixes the relative value of the component to be equal to that specified by the INITIAL parameter, and the fixabsolute setting fixes it to the absolute value specified by INITIAL. The FACTORIAL option sets a limit on the number of factors and variates allowed in each fixed term (default 3); any term containing more than that number is deleted from the model. Usually, only FIXED and RANDOM need to be set. For example, the statement below defines the models for the split-plot example in Section 7.1.

   VCOMPONENTS [FIXED=variety*nitrogen] \
      RANDOM=blocks/wplots/subplots
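To illustrate the other parameters, the statement below is a sketch with illustrative settings: it supplies initial values of one and constrains each variance component to remain positive, the three settings corresponding in order to the three random terms generated by blocks/wplots/subplots.

   VCOMPONENTS [FIXED=variety*nitrogen] RANDOM=blocks/wplots/subplots; \
      INITIAL=1,1,1; CONSTRAINT=positive,positive,positive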
The VSTRUCTURE directive is used, if necessary, after VCOMPONENTS to define the covariance models required for spatial models or for repeated measurements. The TERMS option defines the term or terms for which structures are to be defined. For a random term involving more than one factor, the covariance matrix can be formed either as a single matrix for the whole term, or as the direct product of several matrices corresponding to the factors. In this chapter, we deal only with structures constructed as direct products. For example, the model in Section 7.3 is the direct product of two autoregressive structures, and is specified by

   VSTRUCTURE [fieldrow.fieldcolumn] \
      FACTOR=fieldrow,fieldcolumn; MODEL=ar,ar; ORDER=1,1
The first parameter, FACTOR, specifies the factors in the term(s) for which covariance models are to be defined, the MODEL parameter defines the models, and the ORDER parameter (if necessary) defines their orders. If a factor in a term is omitted from the FACTOR list, it is assumed to have the ordinary, uniform covariance structure (i.e. an equal covariance between each of its levels). The example in Section 7.4 has an ante-dependence structure defined over time for each heart, but uniform correlations between the hearts. This can be defined by

   VSTRUCTURE [heart.time] FACTOR=time; MODEL=antedependence; \
      ORDER=2; HETEROGENEITY=none
For ante-dependence structures it is best to supply initial values for the covariance parameters, or the estimation may not converge. The REML menus calculate these automatically. Details of how to specify them for yourself, or how to define other models, are given in Section 5.4 of Part 2 of the Guide to GenStat. Once the models have been defined, the REML directive can be used to perform the analysis. The first parameter of REML specifies the y-variate to be analysed. The PRINT option is set to a list of strings to select the output to be printed. These are similar to the check boxes of the Further Output menu. The most commonly used settings are:

   model              description of model fitted
   components         estimates of variance components and estimated
                      parameters of covariance models
   effects            estimates of parameters in the fixed and random models
   means              predicted means for factor combinations
   vcovariance        variance-covariance matrix of the estimated components
   deviance           deviance of the fitted model
   waldtests          Wald tests for all fixed terms in model
   missingvalue       estimates of missing values
   covariancemodels   estimated covariance models

The default setting, PRINT=model,components,waldtests,covariancemodels, gives a description of the model and covariance models that have been fitted, together with estimates of the variance components and the Wald tests. By default, if tables of means and effects are requested, tables for all terms in the fixed model are given, together with a summary of the standard errors of differences between effects/means. The options PTERMS and PSE can be used to change the terms or to obtain different types of standard error. For example,

   REML [PRINT=means; PTERMS=nitrogen.variety; \
PSE=allestimates]
will produce a nitrogen by variety table of predicted means with a standard error for each cell. Further output is produced by the VDISPLAY directive, which has options PRINT, PTERMS and PSE like those of REML. Information from the analysis can be saved using the VKEEP directive. For example this has options RESIDUALS and FITTEDVALUES to save the residuals and fitted values respectively. It also has parameters to allow you to save variance components, predicted means, standard errors and so on. Full details are given in Section 5.8 of Part 2 of the Guide to GenStat.
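As a sketch of how these directives might be used (the identifiers Res and Fit are our own names, and we assume the RESIDUALS and FITTEDVALUES options of VKEEP take identifiers in which to store the results):

   VDISPLAY [PRINT=means]
   VKEEP [RESIDUALS=Res; FITTEDVALUES=Fit]

Since VDISPLAY reuses the results of the most recent REML analysis, further output can be obtained without refitting the model.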
7.6 Other facilities
The Mixed Models (REML) section of the Stats menu has further menus for multivariate linear mixed models, random coefficient regression, multiple experiments, and generalized linear mixed models. It also has menus for hierarchical generalized linear models, but these need to be enabled by checking the appropriate box on the Menus page of the Options menu (see Section 1.5). Many of the mixed models menus have a Predict button, to access the REML Predictions menu. This is based on the VPREDICT directive (see Section 5.5 of Part 2 of the Guide to GenStat), and provides similar facilities to the regression prediction menus (see Figure 5.9 in Section 5.1, and Figure 5.24 in Section 5.4).
7.7 Exercises
7(1) GenStat spreadsheet file Vartrial1.gsh contains data from a trial of 35 varieties of wheat. The design has two replicates, each laid out in a five by seven plot array. Assuming that the same block structure is appropriate as in Section 7.2 (rows crossed with columns within replicates), analyse the data as a linear mixed model.

7(2) The trial in exercise 7(1) was laid out with the second replicate immediately below the first replicate. GenStat spreadsheet file Vartrial2.gsh contains the same data as in 7(1), together with an extra factor rowposition, representing the actual row locations on the field. (To avoid warnings about duplicate column names, close the sheet Vartrial1.gsh before you open Vartrial2.gsh.) The data are on a regular grid, rowposition by column. Investigate spatial models, with autoregressive structures on rows and columns.
7(3) GenStat spreadsheet Parasite.gsh contains data from a trial to compare two methods for controlling intestinal parasites in calves. (For further details see Kenward, 1987, A method for comparing profiles of repeated measurements, Applied Statistics, 36, 296-308.) Sixty calves were allocated at random to two treatment groups, and their weights were measured at the outset, and then after 2, 4, 6, 8, 10, 12, 14, 16, 18 and 19 weeks. Analyse the data using the Repeated Measurements menu. Hint: try an ante-dependence structure of order one or two with no random subject effects, and remember that the data are in one variate.

7(4) In a trial to study the effect of a dietary additive, seven rats were allocated at random to receive the standard diet, and seven to receive the enhanced diet. Their weight gains were measured after 1, 3, 5, 7 and 10 weeks. The data are contained in GenStat spreadsheet Ratmeasures.gsh. Analyse the data using the Repeated Measurements menu. Hint: try an ante-dependence structure of order one with no random subject effects, or a power model, and remember that the data are in one variate.
8 Multivariate analysis
Multivariate analysis is useful when you have several different measurements on a set of n objects. In GenStat the measurements would usually be stored in separate variates, and these would have a unit for each object. The objects are often regarded as being a set of n points in p dimensions (p being the number of variates). Many techniques, for example principal components analysis (Section 8.1) and canonical variates analysis (Section 8.2), are aimed at reducing the dimensionality. That is, they aim to find a smaller number of dimensions (usually 2 or 3) that exhibit most of the variation present in the data. This can help you determine patterns or structure in the data, as well as identify the relative importance of individual variables. GenStat has several menus for producing graphical representations, for example multidimensional scaling (Section 8.3) and principal coordinates analysis. It also has facilities for modelling multivariate data, including multivariate analysis of variance (Section 8.6) and partial least squares. Another important requirement is to take a set of units and classify them into groups based on their observed characteristics. Hierarchical cluster analysis (Section 8.4) starts with a set of groups each of which contains one of the units. These initial groups are successively merged into larger groups, according to their similarity, until there is just one group containing all the observations. GenStat also provides menus for non-hierarchical classification (Section 8.5), where the aim is to form a single grouping of the observations that optimizes some criterion such as the within-class dispersion, or the Mahalanobis squared distance between the groups, or the between-group sum of squares. Finally, in this chapter, Section 8.7 describes the facilities for constructing classification trees, which allow you to predict the classification of unknown objects using multivariate observations.
8.1 Principal components analysis
Principal components analysis aims to find linear combinations of the data variates that represent most of the variation between the units. As an example, we consider some examination marks for 88 students in the subjects Mechanics, Vectors, Algebra, Analysis and Statistics. We are interested in finding dimensions that best separate the students. In particular, the first dimension may provide insights into how best to combine the results into a single mark. (For details, see Mardia, Kent and Bibby, 1979, Multivariate Analysis, Academic Press, London.) The data are available in GenStat spreadsheet Exam.gsh (Figure 8.1). The Principal Components Analysis menu (Figure 8.2) is obtained by clicking on the Principal Components line in the Multivariate Analysis section of the Stats menu on the menu bar. You need to enter the data variates into the Data to be Analysed window, and decide whether the analysis is to be based on sums of squares and products, variances and covariances, or correlations. The first two produce essentially the same analysis (there is just a common scaling of √(n−1) applied to the variates, to convert from sums of squares to variances). The final setting, Correlation Matrix, standardizes each variate (by subtracting its mean and dividing by its standard deviation). This can be very useful if the variates do not
share a common scale and show very different amounts of variation. Here we have a common scale (0 to 100), so we have chosen to use the variance-covariance matrix (which GenStat will calculate for us automatically from the variates). Clicking on the Options button produces the Principal Components Analysis Options menu (Figure 8.3), which controls the printed output from the analysis. We have set the Display box to print Latent Roots, Loadings, Scores and Significance Tests. We have also set the Number of Dimensions box to 5, which will give all the available latent roots and vectors. If you choose to have fewer than the full number of dimensions, the Residuals check box can print residuals representing the information in the dimensions that have been excluded. The Number of Dimensions setting also applies to results saved from the Principal Components Save Options menu, which is obtained by clicking on the Save button on the Principal Components Analysis menu. Output from the analysis of the exam marks is shown below.
***** Principal components analysis *****

*** Latent Roots ***
              1        2        3        4        5
          687.0    202.1    103.7     84.6     32.2

*** Percentage variation ***
              1        2        3        4        5
          61.91    18.21     9.35     7.63     2.90

*** Trace ***
      1110

*** Latent Vectors (Loadings) ***
                     1         2         3         4         5
Mechanics     -0.50545   0.74875  -0.29979   0.29618   0.07939
Vectors       -0.36835   0.20740   0.41559  -0.78289   0.18888
Algebra       -0.34566  -0.07591   0.14532  -0.00324  -0.92392
Analysis      -0.45112  -0.30089   0.59663   0.51814   0.28552
Statistics    -0.53465  -0.54778  -0.60028  -0.17573   0.15123

*** Significance tests for equality of final K roots ***
 No. (K) Roots   Chi squared    df
             2         19.06     2
             3         28.86     5
             4         66.02     9
             5        221.38    14

*** Principal Component Scores ***
            1         2         3         4         5
  1    -66.32      6.45     -7.07     -9.65      5.46
  2    -63.62     -6.75     -0.86     -9.15     -7.57
  3    -62.93      3.08    -10.23     -3.72     -0.38
  4    -44.54     -5.58      4.38     -4.48      4.41
  5    -43.28      1.13      1.53      5.81      0.74
  6    -42.55    -10.97     -4.87     -0.48     -7.10
  7    -39.11     -8.26      0.81     -4.35     -0.13
  8    -37.53      5.60      5.50     -3.78     -4.37
  9    -39.39     -1.13     -9.41      2.51      5.33
 10    -32.14     16.40     10.28     -1.91      2.13
 11    -28.39      0.52      5.74     -0.26      1.31
 12    -24.87      9.24     11.35     -0.48      1.24
 13    -24.81     -6.71     -9.07      4.36     -6.11
 14    -21.66     21.62      7.17      3.74     -0.57
 15    -21.92    -25.67     -6.69     -5.89     -0.89
 16    -19.51     17.19      5.80     -0.08      0.96
 17    -18.73     -0.34      3.84    -11.53      5.08
 18    -17.11      1.33     11.59     -9.71     -3.11
 19    -19.65     10.93     -2.77     15.26     -5.29
 20    -17.22    -22.85      1.29      3.85     -3.83
 21    -15.75     -0.80      9.94      0.63      6.01
 22    -17.78     -7.84    -17.13      7.84     -6.68
 23    -13.26    -37.87      7.58     -9.70     -1.95
 24    -14.83      0.96      4.08      8.21      7.82
 25    -14.75      4.57    -10.62      3.97     -2.39
 26    -13.36      7.48      2.15      8.67      9.67
 27     -9.15      3.71     13.70      5.77      0.49
 28    -12.11    -41.31    -13.62     -2.50      6.45
 29     -6.65      7.48      6.93      3.94    -12.63
 30    -10.12    -21.23     -7.34      3.86      6.10
 31     -5.89     -5.92     11.80    -14.78      5.25
 32     -6.52      3.96      8.11     10.64      0.76
 33     -9.47    -18.92    -12.30     22.90     -0.60
 34     -8.95    -23.82    -12.62      3.65      6.13
 35     -5.63     -4.75     -1.78     -9.86      1.82
 36     -5.97     11.96      0.03     18.40     -4.65
 37     -3.30     10.86      8.64      0.84     -5.23
 38     -4.38      0.87      2.20     13.74     -2.89
 39     -2.25      8.37     10.46     -3.64     -1.87
 40      0.02     10.45     19.95     -2.58     -0.32
 41     -0.22    -14.50     10.59     -5.23     -6.13
 42     -2.17      7.96      1.74      7.88      2.14
 43     -0.30      8.98      8.89     -8.00      2.54
 44     -1.11      8.36     -3.18     -1.56     -3.36
 45     -1.19     11.11     -2.85    -10.32      4.10
 46      0.62      6.68     12.37      0.03      0.73
 47     -1.38      9.03     -1.72      5.50      5.55
 48      1.51    -13.43     14.85     -5.77      0.30
 49      1.45      2.16     14.53     -3.10      5.47
 50      1.61     15.07      2.70     -2.17      0.32
 51      3.01     -2.23     14.31     -2.12      1.97
 52      5.01    -20.08      1.10    -11.84     -6.94
 53      5.99     20.37      4.33     -3.15      0.68
 54      3.65     22.67    -27.90    -10.65     -3.68
 55      6.54      3.93    -10.16    -14.96     -2.58
 56      4.76     -3.11    -30.39    -10.33     -3.45
 57      5.00    -10.98     -8.73     11.95      8.01
 58     11.38      7.59      9.22      8.25     -7.14
 59      7.25      2.56    -17.78     -0.57      7.92
 60      9.51      3.47    -11.77     -4.11     -0.70
 61      7.13     24.59    -26.31     -0.70      6.98
 62      9.57     10.80     -5.35     12.81      5.04
 63     12.85      1.75      6.01      7.76     -3.38
 64     16.48    -10.86     17.86      0.50      9.01
 65     15.48      1.25      3.23     13.29     -0.45
 66     17.28     36.73     -7.72     -4.60      4.04
 67     17.63     10.24    -10.07      1.13     -0.80
 68     18.32      2.55     -9.36    -13.21      6.23
 69     21.78     -6.75      6.77    -10.89     -8.00
 70     13.53     -4.58     -3.18     15.93     -4.59
 71     23.09     22.53     -2.40      5.68     -6.99
 72     23.00    -19.81      6.83     -4.19      9.92
 73     24.25     19.16    -20.09     -1.51     -7.74
 74     26.57      3.95      8.87     14.28     -0.69
 75     29.00     -6.59      1.71    -17.25     -9.33
 76     27.99     33.73      0.78     -2.94      0.53
 77     26.21    -16.29     -3.91      7.95     12.11
 78     30.27    -17.27     -5.62    -12.74     -7.93
 79     31.75      1.11     14.61     10.66      6.09
 80     32.37      1.95     -7.25     14.14     -5.23
 81     34.45    -34.41     -4.87     22.48    -11.34
 82     42.88     -3.21      3.13    -21.57     -6.98
 83     42.32      0.89      2.14     -6.78     -7.32
 84     42.71     -0.18      4.31     -2.43     -2.72
 85     44.84    -12.66      8.88      4.82     -7.21
 86     44.30     -7.85      2.64      5.89      4.60
 87     62.49     -7.57     -7.74     -0.59     14.56
 88     65.96     -2.27     -2.52    -17.70      7.22
The first principal component defines the direction in which the student marks exhibit the greatest variation, and the latent root is the variance in that direction. The second component defines the direction with the greatest variation of the directions orthogonal to the first component. The third component defines the direction with the greatest variation of the directions orthogonal to the first two components, and so on. Here, the first component contains about 62% of the variation, and the first and second components contain about 80%. The first component is in the direction

   −0.50545 × Mechanics − 0.36835 × Vectors − 0.34566 × Algebra
   − 0.45112 × Analysis − 0.53465 × Statistics
(This might suggest that, if our aim is to produce the single score that best separates the students, Statistics should be weighted most highly, then Mechanics, then Analysis, and then Vectors and Algebra!) The significance tests for equality of final K roots can be useful for deciding how many roots are needed. Asymptotically (that is, as the number of units becomes large) these have chi-square distributions when the analysis is based on variances or on sums of squares. However, this is not true for analyses based on correlations. To use the tests, we start by testing for equality of all the roots, then all except the first, all except the first and second, and so on, until the test is nonsignificant. The rationale is that, if we are to omit the final dimension, we should also omit all dimensions that are no more variable than that dimension. Here it turns out that we need all five dimensions (which must reassure the students that they have not wasted their effort in taking five exams!). The scores are the coordinates of the units in the direction of each principal component. So, the first column of scores shows how the students would be ranked according to the first principal component. This gives an interesting comparison with the ordering in the data set, which is according to the students' mean marks. The menu uses the PCP directive, which is described in Section 6.2.1 of the Guide to GenStat, Part 2.
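The analysis above reduces to a single PCP statement in the command language. The sketch below is based on the menu settings described earlier; the option names shown are assumptions, and Section 6.2.1 of the Guide to GenStat, Part 2 gives the exact syntax:

   PCP [PRINT=roots,loadings,scores,tests; NROOTS=5] \
      Mechanics,Vectors,Algebra,Analysis,Statistics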
8.2 Canonical variates analysis
Canonical variates analysis is appropriate when the units are classified into groups. The aim is to find linear combinations of the data variates that represent most of the variation between the groups (rather than between the individual units, as in principal components analysis; Section 8.1). We illustrate the analysis using a classic data set, Fisher's Iris Data, which consists of measurements of sepal and petal lengths and widths on iris plants of three different species. This is available in the GenStat spreadsheet Iris.gsh (Figure 8.4).
The Canonical Variates Analysis menu (Figure 8.5) is obtained by clicking on the Canonical Variates line in the Multivariate Analysis section of the Stats menu on the menu bar. You need to enter the data variates into the Data to be Analysed window, and the factor defining the groups into the Grouping Factor window. Clicking on the Options button produces the Canonical Variates Analysis Options menu, which controls the printed output from the analysis. In the options menu (Figure 8.6), we have set the Display box to print Latent Roots, Loadings, Canonical Variate Means and Distances. The Number of Dimensions box is set to 2, which is the maximum possible here, as there are only three species of iris in the data set. If you choose to have fewer than the full number of dimensions, the Residuals check box can print residuals representing the information in the dimensions that have been excluded. The Number of Dimensions setting also applies to results saved from the Canonical Variates Save Options menu, which is obtained by clicking on the Save button on the Canonical Variates Analysis menu. The Graphics section of the menu is set to plot the data with the first canonical variate along the x-axis, and the second along the y-axis. The output from the analysis is shown below.
***** Canonical variate analysis *****

*** Latent Roots ***
              1        2
          32.19     0.29

*** Percentage variation ***
              1        2
          99.12     0.88

*** Trace ***
      32.48

*** Latent Vectors (Loadings) ***
              1        2
 1       -0.829    0.024
 2       -1.534    2.165
 3        2.201   -0.932
 4        2.810    2.839

*** Canonical Variate Means ***
              1        2
 1       -7.608    0.215
 2        1.825   -0.728
 3        5.783    0.513

*** Adjustment terms ***
              1        2
          2.105    6.661

*** Inter-group distances ***
              1        2        3
 1        0.000
 2        9.480    0.000
 3       13.393    4.147    0.000
The results show that 99% of the between-group variation is in the direction of the first canonical variate:

   −0.829 × Sepal-Length − 1.534 × Sepal-Width
   + 2.201 × Petal-Length + 2.810 × Petal-Width

(using the coefficients in column 1 of the latent-vectors matrix). This is confirmed by the plot in Figure 8.7.
The matrix of canonical variate means presents the coordinates (or scores) for each group in the direction of each canonical variate. These are adjusted so that the centroid of the points, weighted by the sizes of the groups, is at the origin. The adjustment term for each canonical variate gives the amount that had to be subtracted from the group means of the original variates in order to achieve this. (See Guide to GenStat, Part 2 Section 6.3.1 for more details.)
8.3 Multidimensional scaling
Multidimensional scaling operates on a symmetric matrix which is assumed to represent distances between a set of units. It aims to construct coordinates of points, in a defined number of dimensions, whose distances are approximately the same as those in the original matrix. To illustrate the analysis we will try to recreate the locations of some British towns, based on figures for the shortest distances between each of them by road. The data are in the GenStat spreadsheet Roaddist.gsh (Figure 8.8). This is a symmetric matrix spreadsheet (as shown by the blanks above the diagonal).
The Multidimensional Scaling menu is obtained by clicking on the Multidimensional Scaling line in the Multivariate Analysis section of the Stats menu. In Figure 8.9, we have entered Distances as the distance matrix to use, and set the required number of dimensions to 3. The algorithm starts with an initial configuration of points, which it then modifies using a method known as steepest descent, until no further improvements are possible (see the Guide to GenStat, Part 2 Section 6.11). To evaluate the configuration, it does a regression of the inter-point distances, calculated from the current configuration, against the supplied distances. The Method setting on the menu controls whether this is a "monotone regression" (which corresponds to what is known as non-metric scaling) or an ordinary linear regression (corresponding to metric scaling). It then compares the fitted distances from the regression with the original distances using a quantity known as the stress.
The Scaling section of the Multidimensional Scaling Options menu (Figure 8.10) controls whether the stress is calculated on a least-squares scale, a least-squares-squared scale or a logarithmic scale. The Treatment of Ties section of the options menu allows you to vary the way in which tied values in the supplied distances are treated. With the Primary setting, no restrictions are placed on the inter-point distances corresponding to tied distances. In the Secondary setting, the inter-point distances corresponding to tied distances are required to be as nearly equal as possible. The Tertiary setting is a compromise between the primary and secondary approaches to ties: the block of ties corresponding to the smallest distance is handled by the secondary method, and the remaining blocks of ties are handled by the primary method. This is particularly useful when the supplied distance matrix contains only a few distinct values. Further information is given in the Guide to GenStat, Part 2 Section 6.11, which describes the MDS directive that is used by these menus. The directive also has some additional facilities, for example the ability to try several automatically-generated initial configurations, or to supply your own. You can save information from the analysis using the Multidimensional Scaling Save Options menu (Figure 8.11), which is obtained by clicking on the Save button on the Multidimensional Scaling menu. Here we have asked to save the coordinates in a matrix called Locations, and to display these in a spreadsheet. If we click on OK, here and in the Multidimensional Scaling menu itself, GenStat produces the output below.
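The analysis can also be run directly with the MDS directive. The statement below is only a sketch of how the menu settings might translate: the option and parameter names shown are assumptions, and Section 6.11 of the Guide to GenStat, Part 2 gives the exact syntax:

   MDS [DIMENSIONS=3; PRINT=coordinates,roots] Distances; \
      COORDINATES=Locations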
***** Multidimensional scaling *****

*** Least-squares scaling criterion ***
* Distances fitted using monotonic regression (non-metric MDS)
* Primary treatment of ties

*** Coordinates ***

Locations        1        2        3
        1   1.5484   0.1783   0.0479
        2  -0.3078  -0.3633   0.3318
        3  -0.3114   0.0458   0.0425
        4   0.2217  -0.0628   0.1680
        5  -0.8890   0.0445  -0.2099
        6  -0.6226  -0.1728  -0.0704
        7  -0.6431  -0.3115   0.0887
        8   0.5885  -0.0668   0.0626
        9  -0.9355   0.7144   0.0971
       10   0.9990   0.0146  -0.0621
       11  -0.8919  -0.3042  -0.3297
       12  -0.5067  -0.5402   0.4894
       13   1.4446  -0.3659  -0.2570
       14  -0.5251  -0.0835  -0.0028
       15  -0.3561   0.9341   0.0379
       16  -0.5691   0.7727  -0.1861
       17  -0.1263  -0.2216   0.7325
       18   1.6924  -0.1077  -0.2834
       19   2.2182  -0.0953  -0.3802
       20   0.1128   0.4403   0.1190
       21  -1.2765  -0.5883  -0.6677
       22  -0.0796   0.3954  -0.0108
       23   0.0530  -0.1291   0.1804
       24   0.5924   0.2350  -0.0547
       25  -1.0095  -0.4204  -0.4627
       26  -0.9086   0.2234  -0.1553
       27  -0.0041   0.1900   0.0320
       28   0.9710  -0.2686   0.3837
       29  -0.6909  -0.3321   0.3307
       30   0.2118   0.2454  -0.0117

*** Latent Roots ***

        1       2       3
    23.43    4.14    2.43
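These menus are built on the MDS directive. For readers who prefer the command language, the analysis above might be written along the following lines. This is a sketch only: the distance matrix name Distmat is a placeholder (whatever symmetric matrix of distances you supplied to the menu), and the option and parameter names are quoted from memory, so check the Guide to GenStat, Part 2 Section 6.11 for the exact syntax:

```
"Sketch only: non-metric scaling of a supplied symmetric matrix of
 distances, saving the fitted coordinates in the matrix Locations"
MDS [NDIMENSIONS=3] Distmat; COORDINATES=Locations
```

The directive can also, for example, be asked to try several automatically-generated initial configurations, a facility that is not available from the menus.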
To plot the points, we need to convert the rectangular matrix spreadsheet of locations to a vector spreadsheet (Figure 8.12), by making this the active window and then selecting Convert in the Manipulate section of the Spread menu on the menu bar. The columns will then become variates, probably with names C1, C2 and C3, which can be used in the graphics menus in the usual way. Figure 8.13 shows a plot of the first two dimensions, and Figure 8.14 plots the second dimension against the third (showing some of the distortion in the data from a 2-dimensional solution).

Figure 8.12
Figure 8.13
Figure 8.14
8.4 Hierarchical cluster analysis
The hierarchical cluster analysis facilities in GenStat provide ways of grouping n objects into classes according to their similarity. The analysis starts with a set of n clusters (or groups), each containing a single object. These initial clusters are successively merged into larger clusters, according to their similarity, until there is just one cluster (containing all the objects). We shall use a set of data concerning mean mandible measurements of various types of modern and prehistoric dog (Higham, Kijngam & Manly, 1980, An analysis of prehistoric canid remains from Thailand, Journal of Archaeological Science, 7, 149-165). This data set is also discussed by Manly (1986, Multivariate Statistical Methods: a Primer, Chapman & Hall, London). The data are in the GenStat spreadsheet Dog.gsh (Figure 8.15).
Figure 8.15
The Hierarchical Cluster Analysis menu is obtained by clicking on the Hierarchical line in the Cluster Analysis subsection in the Multivariate Analysis section of the Stats menu on the menu bar. If you have already formed a similarity matrix, you can enter its name straight into the Similarity Matrix field in the menu (Figure 8.16).

Figure 8.16
Alternatively, you can click on the Form Similarity Matrix button and use the Form Similarity Matrix menu (Figure 8.17). The names of the variates need to be entered into the Data Values window, and you need to define a name (here dogmat) for the resulting symmetric matrix. You must also select the way in which the similarities are to be calculated. Here we have chosen one based on the geometric distance between the points representing each pair of objects. (A formal definition of this, and of the other possibilities, is in the Guide to GenStat, Part 2, Section 6.1.2, which describes the METHOD option of the FSIMILARITY directive used by the menu.) Finally, you can specify a vector (here the text called type from the first column of the spreadsheet) to label the rows and columns of the matrix. When you click on OK, the name of the matrix is automatically entered into the Similarity Matrix field in the Hierarchical Cluster Analysis menu.

Figure 8.17

The Method field in the Hierarchical Cluster Analysis menu contains a drop-down list box to specify the method of clustering to use. The methods differ in the way that they define the similarity between clusters containing more than one object:
Single Link defines the similarity to be the maximum similarity between any pair of objects (taken one from each cluster); Nearest Neighbour is a synonym for Single Link;
Complete Link defines the similarity between two clusters as the minimum similarity between any pair of objects; Furthest Neighbour is a synonym for Complete Link;
Average Link defines the similarity, between a cluster and a new cluster formed by merging two clusters, as the average of the similarities with each of the original clusters;
Group Average is similar to Average Link, except that the average is taken over all the objects in the two merging clusters;
Median Sorting can be viewed in terms of clusters being represented by points in a multidimensional space: when two clusters join, the new cluster is represented by the midpoint of the original cluster points.
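The similarity matrix and the clustering itself correspond to commands in the GenStat language: the text above mentions the FSIMILARITY directive, and the hierarchical clustering is performed by the HCLUSTER directive. A hedged sketch for the dog data might look like the following; the variate names X[1]...X[6] are placeholders for the mandible measurements, and the option settings are quoted from memory rather than the reference manual:

```
"Sketch only: form the similarity matrix dogmat from the measurement
 variates, then cluster by single linkage and print the dendrogram"
FSIMILARITY [SIMILARITY=dogmat] X[1],X[2],X[3],X[4],X[5],X[6]
HCLUSTER [PRINT=dendrogram; METHOD=singlelink] dogmat
```

The other clustering methods described above would be requested by changing the METHOD setting.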
Output from the analysis is controlled by the Hierarchical Cluster Analysis Options menu (Figure 8.18). For the dog example, we will simply print, and plot, the dendrogram. This displays the points at which the various clusters combine, allowing you to assess the relationships between the objects. If you specify a threshold in the Forming Groups field of the options menu, GenStat will form a factor grouping all the objects that have been combined into a single cluster at that level of similarity. You can arrange to save the factor using the Hierarchical Cluster Analysis Save Options menu.

Figure 8.18

**** Single linkage cluster analysis ****

**** Dendrogram ****

** Levels              100.0        90.0

Modern dog       1  ..
Prehistoric dog  7  ..)
Cuon             5  ..)..
Dingo            6  .....)
Golden jackal    2  .....)
Chinese wolf     4  .....)
Indian wolf      3  .....)...........
The dendrogram for the dogs, printed above and plotted (with better resolution) in Figure 8.19, shows that the modern and prehistoric dogs are most closely related, and that both of these are related to the Cuon and to the Dingo and Golden jackal. The Indian and Chinese wolves are more closely related to each other than to any of the other dogs, but the similarity is not close.
Figure 8.19
8.5 Non-hierarchical cluster analysis
Non-hierarchical cluster analysis aims to find a single grouping of a set of n objects by optimizing a criterion, for example by maximizing the between-group sum of squares. Other criteria in GenStat include maximizing the total between-groups Mahalanobis distance, minimizing the within-class dispersion, or a criterion known as maximal predictive classification, which is designed specifically for binary data. For full definitions, see the Guide to GenStat, Part 2 Section 6.17. This form of clustering includes the technique known as K-means clustering, where the criterion is usually the within-class dispersion. To illustrate the menus we shall use some measurements taken on 30 bronze brooches (Doran & Hodson, 1975, Mathematics and Computers in Archaeology, Edinburgh University Press, Table 9.1). These are stored in the GenStat spreadsheet Brooch.gsh (Figure 8.20).
Figure 8.20

Before doing the cluster analysis, to counteract skewness in the variables, we transform each column of measurements x to log10(x+1). This can be done using the Calculate menu in the usual way (see Section 2.7). Alternatively, to save time, the transformed data are available in spreadsheet Logbrooch.gsh. To obtain the Non-hierarchical Cluster Analysis menu you click on the Non-hierarchical line in the Cluster Analysis subsection in the Multivariate Analysis section of the Stats menu on the menu bar. Figure 8.21 shows the menu set to use all the measurements to form four groups using the between-group sum of squares criterion. The Non-hierarchical Cluster Analysis Options menu controls the way in which the search for the best grouping is carried out, and the output that is produced.

Figure 8.21
In Figure 8.22, we have asked GenStat to form the initial classification by finding the four objects that are furthest apart in the 7-dimensional space defined by the measurements, using these objects as the "cores" of the initial groups, and allocating the other objects to the group with the nearest core. (Note: this is feasible only if the number of groups does not exceed the number of variates.) The Between-group Interchanges box controls how GenStat generates new groupings from the initial classification. Here we are allowing objects both to be swapped between groups, and to be transferred from group to group. The setting Swap Only would constrain the group sizes to remain the same throughout the search (which might be useful, for example, if you wanted groups of equal sizes, and had chosen the Equal-sized Groups option for the Initial Classification), and the setting Fix at Initial Configuration makes no changes. You can save the final classification, by checking the Grouping box and defining the name of a factor to store the information in the In box of the Non-hierarchical Cluster Analysis Save Options menu (see Figure 8.23).

Figure 8.22

Figure 8.23

Output from the clustering of the brooches is shown below.
***** Non-hierarchical Clustering *****

*** Sums of Squares criterion ***

*** Initial classification ***

Number of classes = 4

Class contributions to criterion
      1       2       3       4
 0.5471  0.2434  1.4189  0.3416

Criterion value = 2.55101

Classification of units
[allocation of the 30 brooches to groups 1-4 omitted]

Class mean values
     Bow_height  Bow_thickness  Bow_width  Coil_diameter  Element_diameter  Foot_length  Length
  1       1.303          0.768      0.828          1.093             0.992        1.819   1.994
  2       1.217          0.514      1.102          0.899             0.756        1.402   1.710
  3       1.243          0.728      0.818          0.912             0.959        1.298   1.674
  4       1.112          0.564      0.619          0.874             0.540        1.259   1.623

Units rearranged into class order
[per-unit measurement listings for Groups 1-4 omitted]

*** Optimum classification ***

Number of classes = 4

Class contributions to criterion
      1       2       3       4
 0.5471  0.2434  0.9901  0.7254

Criterion value = 2.50611

Classification of units
[allocation of the 30 brooches to groups 1-4 omitted]

Class mean values
     Bow_height  Bow_thickness  Bow_width  Coil_diameter  Element_diameter  Foot_length  Length
  1       1.303          0.768      0.828          1.093             0.992        1.819   1.994
  2       1.217          0.514      1.102          0.899             0.756        1.402   1.710
  3       1.247          0.730      0.815          0.917             0.991        1.326   1.694
  4       1.146          0.615      0.692          0.873             0.606        1.207   1.594

Units rearranged into class order
[per-unit measurement listings for Groups 1-4 omitted]
The output gives details of the initial classification and of the final (optimal) classification, showing the criterion value, how the objects are allocated to the groups and the mean values of the measurements in each group. In this example, the initial classification has been very successful. The optimum classification differs only in that the eighth object has been transferred from group 3 to group 4, and the 29th object from group 4 to group 3.
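The menu drives GenStat's non-hierarchical classification command, described in the Guide to GenStat, Part 2 Section 6.17. As a hedged sketch, the settings in Figure 8.21 correspond roughly to the lines below; the directive is CLUSTER, but the option names and the pointer name Logdata are illustrative and should be checked against the Guide before use:

```
"Sketch only: four groups by the between-group sums-of-squares
 criterion; Logdata is assumed to be a pointer to the seven
 log-transformed measurement variates from Logbrooch.gsh"
CLUSTER [PRINT=initial,optimum; CRITERION=sums; NGROUPS=4] DATA=Logdata; GROUPS=Brgroup
```

The factor Brgroup would then hold the final classification, as saved by the menu in Figure 8.23.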
8.6 Multivariate analysis of variance
Multivariate analysis of variance can be viewed as the extension of ordinary analysis of variance (as in Chapter 6) to handle several response variates at once. So, for example, instead of making assumptions of Normality for the residuals from a single response variate, we are now assuming multivariate Normality of residuals from several response variates.
We illustrate the analysis using data from an experiment to investigate sex and temperature effects on the growth of tumours in rats (see page 143 of Chatfield & Collins, 1986, Introduction to Multivariate Analysis, Chapman and Hall, London). Three rats of each sex were reared at each of three temperatures (4, 20 and 34). There was no blocking (i.e. this is a completely randomized design). The weights of the rats were measured (prior to sub-cutaneous seeding of the tumours). The response variates, taken at the end of the experiment, are the tumour weight and the final weight (excluding the tumour). The data are available in spreadsheet Tumour.gsh (Figure 8.24).

Figure 8.24

The MANOVA menu (Figure 8.25) is obtained by clicking on the MANOVA line in the Multivariate Analysis section of the Stats menu on the menu bar. In Figure 8.25, we have specified a treatment structure of Sex*Temperature, to fit the main effects of sex and temperature, and their interaction (see Section 6.6). There is no block structure, but we want to treat the variate InitialWeight as a covariate (so we check the Covariates box, and enter its name into the adjacent field). Covariates are included in the treatment model like variates in a linear regression. So, GenStat estimates a regression coefficient for them, and adjusts the other estimates and sums of squares to take account of their presence in the model (see Guide to GenStat, Part 2 Section 4.3).

Figure 8.25
The MANOVA Options menu, shown in Figure 8.26, allows you to control the output from the multivariate analysis, and also to display output from the univariate ANOVAs of the individual response variates. Here we have asked just to print the various tests from the multivariate analysis, and have omitted the sums of squares and products of the treatment effects and residuals (which are involved in calculating the tests). The output, shown below, indicates that there are sex and temperature effects, but no interaction and no effect of the covariate.

Figure 8.26

***** Multivariate analysis of covariance *****
*** Sex ***

*** Tests ***
Wilk's Lambda:            0.3485
Approximate Chi sq:       10.54 on 2 d.f.; probability 0.005
Approximate F test:       9.35 on 2 and 10 d.f.; probability 0.005
Pillai-Bartlett trace:    0.6515
Roy's maximum root test:  0.6515
Lawley-Hotelling trace:   1.870

*** Temperature ***

*** Tests ***
Wilk's Lambda:            0.3269
Approximate Chi sq:       11.74 on 4 d.f.; probability 0.019
Approximate F test:       3.75 on 4 and 20 d.f.; probability 0.020
Pillai-Bartlett trace:    0.8477
Roy's maximum root test:  0.4949
Lawley-Hotelling trace:   1.525

*** Sex . Temperature ***

*** Tests ***
Wilk's Lambda:            0.7830
Approximate Chi sq:       2.57 on 4 d.f.; probability 0.632
Approximate F test:       0.65 on 4 and 20 d.f.; probability 0.633
Pillai-Bartlett trace:    0.2278
Roy's maximum root test:  0.1605
Lawley-Hotelling trace:   0.2634

*** Covariate: InitialWeight ***

*** Tests ***
Wilk's Lambda:            0.8219
Approximate Chi sq:       1.96 on 2 d.f.; probability 0.375
Approximate F test:       1.08 on 2 and 10 d.f.; probability 0.375
Pillai-Bartlett trace:    0.1781
Roy's maximum root test:  0.1781
Lawley-Hotelling trace:   0.2167
The analysis uses the MANOVA procedure (see Guide to GenStat, Part 2 Section 6.6.1). This uses the ANOVA directive, which requires the design to be balanced (see Section 6.7 or Guide to GenStat, Part 2 Section 4.7). For unbalanced data, you can use the RMULTIVARIATE procedure, but this is not currently accessible through the menus.
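Since MANOVA is a procedure built on the ANOVA directive, the menu settings above can also be reproduced in the command language. The following is a sketch only: TREATMENTSTRUCTURE and COVARIATE are the standard model-definition directives for analysis of variance, but the response-variate names are guesses at the column headings of Tumour.gsh, and the parameter names of the MANOVA procedure should be checked in the Guide to GenStat, Part 2 Section 6.6.1:

```
"Sketch only: multivariate analysis of covariance for the rat data;
 TumourWeight and FinalWeight are assumed variate names"
TREATMENTSTRUCTURE Sex*Temperature
COVARIATE InitialWeight
MANOVA [PRINT=tests] Y=!p(TumourWeight,FinalWeight)
```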
8.7 Classification trees
A classification tree is a device for predicting (or identifying) the class to which an unidentified object belongs. The starting point is a sample of objects from the various classes. Measurements recorded on the sample may be either continuous (supplied in variates) or discrete (supplied in factors). Below we shall illustrate the methods using the iris data from Section 8.2, where the data were all continuous (see Figure 8.4). The Classification Tree menu (Figure 8.27) is in the Trees subsection of the Multivariate Analysis section of the Stats menu on the menu bar. In Figure 8.27, we have specified Species as the name of the factor defining the groups to be predicted, and entered the names of all the measurements into the X-variates box. The Save Tree in box allows you to specify a name for the tree structure that GenStat will generate to represent the classification tree. If you do not do this, GenStat will use its own

Figure 8.27
private name, but you will not find it easy to use the tree outside the menus. Here we have specified the name IrisTree.

The tree progressively splits the objects into subsets based on their values for the measurements. Construction starts at a node known as the root, which contains all of the objects. A factor or variate that "best" splits the individuals into two subsets is chosen for use there. For example, in the tree for the irises, the first division is made according to whether the petal lengths are less than or greater than 2.450 (see the output below). The tree is then extended to contain two new nodes, one for each of the subsets, and factors or variates are selected for use at each of these nodes to subdivide the subsets further. The process stops when either no factor or variate provides any additional information, or the subset contains individuals all from the same group, or the subset contains fewer individuals than a limit specified by the Number of items to stop splitting field of the Classification Tree Options menu (Figure 8.28). The nodes where the construction ends are known as terminal nodes.

Figure 8.28

Factors may have either ordered or unordered levels, according to whether or not the X-Variable factor levels ordered box is checked. For example, a factor called Temperature with levels 5, 10 and 20 would usually be treated as having ordered levels, whereas levels labelled 'London', 'Moscow', 'New York', 'Ottawa' and 'Paris' of a factor called Town would be regarded as unordered. For unordered factors, all possible ways of dividing the levels into two sets are tried. With variates, or ordered factors with more than 2 levels, a suitable value p is found to partition the individuals into those with values less than or greater than p.
The radio buttons in the Method and Anti-end-cut-factor boxes in the Classification Tree Options menu allow you to choose how to assess the potential splits: whether to use Gini information or mean posterior improvement, and whether to use adaptive anti-end-cut factors. Details are given in the Guide to GenStat, Part 2 Section 6.18.1.

The Display box of the Classification Tree Options menu (Figure 8.28) is set to print the tree only in "indented" format. This is a representation analogous to those used to display botanical trees. In the iris output, printed below, the first variable
to examine is Petal_Length. If this is less than 2.450, the iris specimen is identified as Setosa. Otherwise you progress to node 2, and examine Petal_Width. So, a specimen of Versicolor might be identified by the sequence: (1) Petal_Length > 2.450; (2) Petal_Width < 1.750; (3) Petal_Length > 4.950; (5) Petal_Width > 1.550: Versicolor. Notice that the same variable can be used several times as the observed characteristics are refined on the way to an identification.

 1  Petal_Length<2.450  Setosa
 1  Petal_Length>2.450  2
 2  Petal_Width<1.750   3
 3  Petal_Length<4.950  4
 4  Petal_Width<1.650   Versicolor
 4  Petal_Width>1.650   Virginica
 3  Petal_Length>4.950  5
 5  Petal_Width<1.550   Virginica
 5  Petal_Width>1.550   Versicolor
 2  Petal_Width>1.750   6
 6  Petal_Length<4.850  Virginica
 6  Petal_Length>4.850  Virginica
Generally the construction of a classification tree will result in over-fitting. That is, it will form a tree that keeps selecting factors or variates to subdivide the individuals beyond the point that can be justified statistically. The solution is to prune the tree to remove the uninformative sub-branches. The pruning uses accuracy figures, which are stored for each node of the tree. The tree also stores a prediction for each node, which corresponds to the group with most individuals at the node. For each node of a classification tree, the accuracy is the number of misclassified individuals at the node, divided by the total number of individuals in the data set. It thus measures the "impurity" of the subset at that node (how far it is from being homogeneous, i.e. from having individuals all from a single group). You can prune the tree using the Tree Pruning menu (Figure 8.29), which is accessible from the Trees subsection in the Multivariate Analysis section of the menu bar, or by clicking on the Prune button on the Classification Tree menu. As we have loaded the menu from the Classification Tree menu, GenStat has filled in the name of the tree automatically.

Figure 8.29
In the Display box, we have asked for the relationship between the impurity and the number of terminal nodes to be presented in a graph (Figure 8.30) and a table (below). The table and graph show that the impurity of the pruned trees drops rapidly as the number of terminal nodes increases from one up to three, but then tails off more slowly. This suggests that we should prune down to three terminal nodes, but no further. This tree is the fifth in the sequence of pruned trees (count from the right of the graph, or notice the numbering in the first column of the table).

Figure 8.30

***** Characteristics of the pruned trees *****

Tree no.      RT     Number of terminal nodes
     1      0.0133              7
     2      0.0133              6
     3      0.0200              5
     4      0.0267              4
     5      0.0400              3
     6      0.3333              2
     7      0.6667              1
By clicking on the Replace with pruned button we can replace the contents of the tree IrisTree with this smaller tree. We simply need to fill in the number of the tree (5) in the resulting menu, and click on OK (and then cancel the Tree Pruning menu).

Figure 8.31
The pruned tree can be displayed using the Classification Tree Further Output menu (Figure 8.32), obtained by clicking on the Further Output button on the Classification Tree menu.
Figure 8.32

***** Summary of classification tree: IrisTree

Number of nodes: 5
Number of terminal nodes: 3
Misclassification rate: 0.040
Variables in the tree: Petal_Length Petal_Width
***** Details of classification tree: IrisTree

1  Current prediction: 1.000    Number of observations: 150
   Species       Setosa  Versicolor  Virginica
   Proportions    0.333       0.333      0.333
   Test: Petal_Length<2.450    Next nodes: 2 3

2  Current prediction: 1.000    Number of observations: 50
   Species       Setosa  Versicolor  Virginica
   Proportions    1.000       0.000      0.000
   Conclusion: Setosa

3  Current prediction: 2.000    Number of observations: 100
   Species       Setosa  Versicolor  Virginica
   Proportions    0.000       0.500      0.500
   Test: Petal_Width<1.750     Next nodes: 4 5

4  Current prediction: 2.000    Number of observations: 54
   Species       Setosa  Versicolor  Virginica
   Proportions    0.000       0.907      0.093
   Conclusion: Versicolor

5  Current prediction: 3.000    Number of observations: 46
   Species       Setosa  Versicolor  Virginica
   Proportions    0.000       0.022      0.978
   Conclusion: Virginica
 1  Petal_Length<2.450  Setosa
 1  Petal_Length>2.450  2
 2  Petal_Width<1.750   Versicolor
 2  Petal_Width>1.750   Virginica
Tree diagram
------------
1
2 -> 3
     4 -> 5
The initial summary, generated by the Summary check box, lists the number of nodes (5) and terminal nodes (3) in the tree, its misclassification rate, and the variables that it uses. The details section (from the Details check box) gives information about each node, referring to the numbering displayed in the tree diagram at the end of the output (which is generated by the Numbered Diagram check box). Note that, if possible, it is best to use "accuracy" figures that are derived from a different set or sets of data from that which was used to construct the tree. This cannot be done through the menus, but you can use the BCVALUES procedure, which is described in the Guide to GenStat, Part 2 Section 6.18.3. Another useful procedure, which also cannot currently be accessed through the menus, is BCIDENTIFY. This has a convenient interactive interface that asks you to enter the information required by the tree as and when it is needed. (For details of the options and parameters that allow you to use it in batch mode, see the Guide to GenStat, Part 2 Section 6.18.3.) To run the procedure in this way, you merely need to set the TREE option to the name of the tree, here IrisTree. If we type the command

BCIDENTIFY [TREE=IrisTree]
and execute it, for example by clicking on the Submit Line line in the Run menu on the menu bar, GenStat asks the question in Figure 8.33. (Our answer is yes.)

Figure 8.33
The next question is in Figure 8.34, to which we shall answer that the petal length is greater than 2.450. (Check the box and click on OK.)
Figure 8.34

This generates the question in Figure 8.35, to which we shall answer that the petal width is less than 1.750.
Figure 8.35

We have now reached a terminal node, and GenStat asks if we want to print the identification (Figure 8.36). It is best to take the default suggestion of yes here, as we have not set the option of BCIDENTIFY that would save the information! The output shows first a transcript of the questions and answers (as requested in Figure 8.33), and then the identification of Versicolor.

Figure 8.36

***** Identification using a classification tree *****

Observations: Petal_Length>2.450
              Petal_Width<1.750
Identification: Versicolor
GenStat also has menus and commands for constructing regression trees. These operate very similarly to those for classification trees, except that the attribute to predict is the value of a response variate rather than the level of a group factor. The menus can be found in the Trees subsection of the Multivariate Analysis section of the Stats menu on the menu bar, and the underlying methodology is explained in the Guide to GenStat, Part 2 Section 3.9.
8.8 Other facilities
This chapter illustrates menus from most of the main areas of multivariate analysis provided by GenStat. Other menus are listed below, with references to the sections of the Guide to GenStat describing the associated commands and methodology:

Correspondence Analysis            Part 2 Section 6.12
Canonical Correlations             Part 2 Section 6.9
Discriminant Analysis              Part 2 Section 6.5
Generalized Procrustes             Part 2 Section 6.15.2
Partial Least Squares Regression   Part 2 Section 6.8
Principal Coordinates Analysis     Part 2 Section 6.10
Procrustes Rotation                Part 2 Section 6.15.1

Other multivariate facilities, not available through the menus, include factor rotation (directive FACROTATE; Part 2 Section 6.4), ridge and principal-component regression (procedure RIDGE; Part 2 Section 6.7), analysis of skew symmetry (procedure SKEWSYMMETRY; Part 2 Section 6.14) and the construction of identification keys (procedure BKEY; Part 2 Section 6.19).
8.9 Exercises
8(1) GenStat spreadsheet file Goblet.gsh contains data on 25 goblets from prehistoric sites in Thailand (see page 147 of Manly, 1986, Multivariate Statistical Methods: a Primer, Chapman & Hall, London). Perform a principal components analysis to study the relationships between the goblets. Perform a cluster analysis of the goblets. How does the dendrogram reflect the closeness of the goblets in the principal-component plot? Form a non-hierarchical classification into five groups. How does this compare with the dendrogram?
8(2) GenStat spreadsheet file Skull.gsh contains data on 150 male Egyptian skulls from five different epochs (see pages 4 and 5 of Manly, 1986, Multivariate Statistical Methods: a Primer, Chapman & Hall, London). Perform a multivariate analysis of variance. Are there any epoch differences? Perform a canonical variates analysis. Plot the first two canonical variates and study how the skulls differ between epochs. Form a classification tree. Prune down to 20 terminal nodes. What is the misclassification probability?
9 Time series
A time series in GenStat is a variate containing a sequence of observations made at equally spaced points in time. The time series menus are obtained by clicking Stats on the menu bar and selecting Time Series (Figure 9.1). They are designed to allow you first to explore the data by plotting graphs and printing autocorrelations, and then to fit autoregressive integrated moving-average (ARIMA) models as advocated by Box & Jenkins (1970). Details of the full time series facilities in GenStat, and information about the underlying theory, are given in Chapter 7 of Part 2 of the Guide to GenStat. In this chapter we illustrate the menus using one of several quarterly indicators of UK pig production stored in GenStat spreadsheet file Pig.gsh (for details, see Andrews & Herzberg, 1985, Data: a Collection of Problems from Many Fields for the Student and Research Worker, Springer-Verlag). The series that we shall use is called Gilts, and represents sows in pig for the first time.

Figure 9.1
9.1 Exploration of time series
Figure 9.2
Figure 9.3
Figure 9.2 shows the Time Series Data Exploration menu, obtained by clicking on Data Exploration in Figure 9.1. We enter the identifier of the series, Gilts, into the Series box, and then click on Options to generate the Data Exploration Options menu
in Figure 9.3. In this case we have chosen only to plot the information, and not to display any of it in the Output window.
Figure 9.4

The resulting graph contains the following plots: the series itself; the autocorrelations calculated between successive observations (this is often called the correlogram); the partial autocorrelations (for each lag k, this is the excess correlation between measurements separated by k timepoints that is not accounted for by the intervening points); and the sample spectrum.
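The menu is based on the CORRELATE directive and the BJIDENTIFY procedure (see Section 9.3). A hedged command-language sketch of the same exploration, with option names quoted from memory rather than the reference manual, is:

```
"Sketch only: autocorrelations and partial autocorrelations of Gilts"
CORRELATE [GRAPH=auto,partial; MAXLAG=20] Gilts
```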
9.2 ARIMA model fitting

The ARIMA Model Fitting menu is obtained by clicking on ARIMA Model Fitting in Figure 9.1. Here, we are fitting a model to the Gilts time series with two autoregressive parameters, no differencing, and one moving-average parameter (see Figure 9.5). We can also define a seasonal model by clicking on the Seasonal Model button. In Figure 9.6, we have defined the seasonal period to be four (corresponding to a year), and set the order of differencing and the number of moving-average parameters both to be one. The resulting output is shown below.

Figure 9.5

Figure 9.6
***** Time-series analysis *****

Output series : Gilts
Noise model   : _erp

Residual deviance           = 2140.
Innovation variance         = 43.96
Number of units present     = 48
Residual degrees of freedom = 39

*** Summary of models ***

                Orders:
Model   Type    Delay B   AR P   Diff D   MA Q   Seas S
_erp    ARIMA      -        2      0        1       1
                            0      1        1       4
*** Parameter estimates ***

        Seas.   Diff.
Model   Period  Order  Delay  Parameter    Lag  Ref  Estimate      SE        t
Noise      1      0      -    Constant       -    1    -2.159    0.460   -4.70
                              Phi (AR)       1    2    1.6128   0.0959   16.82
                              Phi (AR)       2    3   -0.8604   0.0817  -10.53
                              Theta (MA)     1    4     0.684    0.163    4.18
           4      1      -    Theta (MA)     4    5     0.874    0.144    6.07

Having fitted the model, we can now click on the Forecast button in the ARIMA Model Fitting menu (Figure 9.5). In the resulting ARIMA Forecasts menu (Figure 9.7), we have asked to display forecasts and confidence limits for the next four periods, and to plot the series with the forecasts (and their confidence limits) added at the end. The output is shown below, and the graph is in Figure 9.8.

*** Forecasts ***

Maximum lead time: 4

* Forecasts for future values

Lead time   forecast   lower limit   upper limit
    1         102.51         91.60        113.41
    2         107.85         92.97        122.74
    3         101.82         85.39        118.25
    4          92.5          75.9         109.1
Figure 9.8

9.3 Time series commands
The Time Series Data Exploration menu uses the CORRELATE directive to calculate the autocorrelations and the partial autocorrelations, and the BJIDENTIFY procedure to produce the graphs. The ARIMA Model Fitting menu defines the model using the TSM directive, fits it with the ESTIMATE directive, and calculates the forecasts using the FORECAST directive. GenStat has several other commands, currently not accessible by menu, which may be useful with time series. The full list of commands is as follows.
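For example, the seasonal ARIMA fit of Section 9.2 might be written directly in commands roughly as follows. This is a sketch only: the name Pigtsm is hypothetical, and the layout of the ORDERS list and the option names shown are assumptions to be checked against the sections of the Guide to GenStat cited below.

   TSM [MODEL=arima] Pigtsm; ORDERS=!(2,0,1, 0,1,1,4)
   ESTIMATE [PRINT=model,summary,estimates] Gilts; TSM=Pigtsm
   FORECAST [MAXLEAD=4]

Here the seven numbers are intended to give the non-seasonal orders (p, d, q) followed by the seasonal orders (P, D, Q) and the period s, matching the summary-of-models table above.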
Directives:

CORRELATE         forms correlations between variates, autocorrelations of variates, and lagged cross-correlations between variates (Guide to GenStat Part 2, Section 7.1.1)
TSM               defines Box-Jenkins models (Guide to GenStat Part 2, Sections 7.3.2 and 7.5.1)
FTSM              forms preliminary estimates of parameters in time-series models (Guide to GenStat Part 2, Section 7.7.1)
TRANSFERFUNCTION  specifies input series and transfer-function models for subsequent estimation of a model for an output series (Guide to GenStat Part 2, Sections 7.4.1 and 7.5.2)
ESTIMATE          estimates parameters in Box-Jenkins models for time series (Guide to GenStat Part 2, Sections 7.3.3, 7.4.2 and 7.5.3)
TDISPLAY          displays further output after an analysis by ESTIMATE (Guide to GenStat Part 2, Section 7.3.5)
TKEEP             saves results after ESTIMATE (Guide to GenStat Part 2, Sections 7.3.6 and 7.5.4)
FORECAST          forecasts future values (Guide to GenStat Part 2, Sections 7.3.7, 7.4.3 and 7.5.5)
TSUMMARIZE        displays time series model characteristics (Guide to GenStat Part 2, Section 7.7.3)
FILTER            filters time series by time-series models (Guide to GenStat Part 2, Section 7.6.1)
FOURIER           calculates cosine or Fourier transforms of a real or complex series (Guide to GenStat Part 2, Section 7.2.1)
Procedures:

BJESTIMATE        fits an ARIMA model, with forecasts and residual checks (Guide to GenStat Part 2, Section 7.3.1)
BJFORECAST        plots forecasts of a time series using a previously fitted ARIMA model (Guide to GenStat Part 2, Section 7.3.8)
BJIDENTIFY        displays time series statistics useful for ARIMA model selection (Guide to GenStat Part 2, Section 7.1.3)
PERIODTEST        gives periodogram-based tests for white noise in time series
PREWHITEN         filters a time series before spectral analysis
REPPERIODOGRAM    gives periodogram-based analyses of replicated time series
SMOOTHSPECTRUM    forms smoothed spectrum estimates for univariate time series (Guide to GenStat Part 2, Section 7.2.3)

9.4 Exercises
9(1) Open the GenStat spreadsheet file Pig.gsh containing the time series on UK pig production mentioned in Section 9.1. Profit is an index of the profit on the sale of pigs; Slaughter is the ratio of sow and boar numbers in the following quarter to the total pig breeding herd size at the beginning of the quarter (i.e. it measures removal of pigs from the herd); Cleanpig is the number of “clean” pigs reared for meat (rather than being culled from the herd); and Herdsize is the size of the breeding herd. Study each of the series by plotting its autocorrelations, partial autocorrelations and sample spectrum. Fit an appropriate ARIMA model to the profits.
10 More about commands and syntax
The GenStat commands provide the key to the full power of GenStat, enabling you to access advanced features that are not available through the menus or to program your own methods. In the first part of this chapter, we describe some of the ways in which you might use commands. Then Sections 10.3 - 10.8 give a full definition of the GenStat language.
10.1 Working with commands

The menus in GenStat provide convenient ways of generating analyses. As we have seen in earlier chapters, they operate by generating commands which are executed by the GenStat server. Most users will have GenStat’s options set to record all these commands in the Input Log. (This is controlled by the boxes in the Audit Trail tab of the Options menu, as explained in Section 1.5.) If you have kept a full record (with the menu set as in Figure 1.32), you can recreate the same analyses later. First select the Input Log as the active window by clicking on the Input Log line of the Window menu on the menu bar (Figure 1.30). Then save it in a file using the Save menu obtained by clicking on either Save or Save As in the File menu on the menu bar (Figure 1.15). When you rerun GenStat, open the file using the Select Input File menu (obtained by clicking on Open in the File menu on the menu bar). Then click on Submit Window in the Run menu on the menu bar.

You may also want to copy commands from the Input Log into another text window and adapt them. We illustrate this by returning to the analysis of the water usage data in Section 5.2. First we reload the data by opening the spreadsheet file Water.gsh; then we fill in the Linear Regression menu as shown in Figure 10.1, to fit a regression for Water with a single explanatory variable, Product. This generates the output below.

***** Regression Analysis *****

Response variate: Water
Fitted terms: Constant, Product
*** Summary of analysis ***

             d.f.      s.s.      m.s.    v.r.  F pr.
Regression      1     1.270    1.2702    9.91  0.007
Residual       15     1.922    0.1282
Total          16     3.193    0.1995

Percentage variance accounted for 35.8
Standard error of observations is estimated to be 0.358

* MESSAGE: The following units have large standardized residuals:
      Unit    Response    Residual
        16       4.488        2.31

* MESSAGE: The following units have high leverage:
      Unit    Response    Leverage
         2       2.828        0.27
         3       2.891        0.25

*** Estimates of parameters ***

            estimate      s.e.   t(15)   t pr.
Constant       2.273     0.339    6.71   <.001
Product       0.0799    0.0254    3.15   0.007
The commands that have been executed to produce the analysis can be found at the end of the Input Log.

"Simple Linear Regression"
MODEL Water
TERMS Product
FIT [PRINT=model,summary,estimates; CONSTANT=estimate;\
   FPROB=yes; TPROB=yes] Product
(We have reformatted the FIT command from the way in which it may appear within the log, so that each line fits within the width of the page. Remember that the character \ indicates that the command continues onto the next line.) Here we are using only some of the options and parameters of the commands. You can see the full list by putting the cursor into the command name (MODEL, TERMS or FIT) and pressing the F1 key. GenStat's context-sensitive help will then load the help page for the command concerned. The help page first lists the options of the command, then its parameters, and then gives a description of how the command is used. The initial part of the page for FIT (Figure 10.2) shows that the PRINT option tells GenStat what output to produce, the CONSTANT option indicates that the model should contain the constant term, and the FPROB and TPROB options (short for FPROBABILITY and TPROBABILITY) request probabilities to be printed for F and t statistics. Here the model (specified by the parameter of FIT) contains just the single explanatory variate Product.
Figure 10.2

You will have noticed that the summary of analysis section of the output had some warning messages listing units with large standardized residuals or high leverage. We encourage you to look at these to check the validity of your analysis, so the regression menus do not provide any way of stopping them. However, the help page for FIT shows that it has an option called NOMESSAGE, which allows you to suppress any of the regression messages. To rerun the analysis without the warnings, we could open a new text window (see Section 1.3), copy the FIT command there, and edit it to become:

FIT [PRINT=model,summary,estimates; CONSTANT=estimate;\
   FPROB=yes; TPROB=yes; NOMESSAGE=leverage,residual]\
   Product
Then run the command using one of the methods provided by the Run menu on the menu bar (see Section 1.3). Here, as FIT is the only command in the window, it would be easiest to click on the line Submit Window, as shown in Figure 10.3.
Figure 10.3

You could also edit the commands to do a regression on the variate Employ instead of Product.

"Simple Linear Regression"
MODEL Water
TERMS Employ
FIT [PRINT=model,summary,estimates; CONSTANT=estimate;\
   FPROB=yes; TPROB=yes] Employ
Suppose we have decided that the right way to analyse this data set is by simple linear regression. (This is for illustration purposes only, of course, as Section 5.2 showed that we actually need a multiple linear regression, fitting all the explanatory variates.) It might then be useful to put the commands into the Analyse Spreadsheet Columns menu for the spreadsheet Water.gsh. First we need to make Water.gsh the active window. Then we select the Analysis line within the Sheet section of the Spread menu on the menu bar.
The lower half of the resulting menu (Figure 10.4) has a section into which you can type or paste commands to analyse the data. The menu provides flexibility by allowing you to perform the analysis, in turn, for several columns. The columns can be selected by highlighting them in the Select Columns for Analysis window of the menu. You refer to them in the analysis commands using the dummy whose name is given in the Dummy window. A dummy is a data structure that contains the identifier of another structure. When a command containing a dummy is executed, the dummy is replaced by the identifier that it currently contains. So, by using a dummy, you can conveniently change the data structure on which the command operates. In Figure 10.4 the dummy is called Y.

The commands in the lower half of the window were pasted straight from the Input Log (again with some reformatting to ensure that they do not need to scroll beyond the right-hand side of the window). So, they have Product as the explanatory variate. To set the commands to refer to an arbitrary explanatory variate (denoted by the dummy Y), you click on the Replace button, fill in the resulting menu (Figure 10.5), and click on OK. The commands then become

"Simple Linear Regression"
MODEL Water
TERMS Y
FIT [PRINT=model,summary,estimates;\
   CONSTANT=estimate; FPROB=yes; TPROB=yes]\
   Y
To run the commands with explanatory variates Employ and Opdays, for example, you can highlight their lines in the Select Columns for Analysis window and then click on OK. Alternatively, click on the Save button to store the commands with the sheet. You can then run the analysis at any time: first highlight the columns by clicking on their names at the top of the spreadsheet; then either select the Sheet Analysis line in the Run menu on the menu bar, or make a right-mouse click on the spreadsheet and select User Defined (Sheet Analysis) from the Analysis section of the resulting menu (see Figure 10.6). Each of these analyses is for the same dependent variable, Water. (It might be more usual to use the dummy to refer to the dependent variate of the regression, so that you would be fitting the same model but to different sets of observations. However, this discussion is intended mainly to illustrate the principles behind the spreadsheet analysis menu.)
This means that the initial line

MODEL Water

is common to all the analyses. We can arrange that this command is executed only once (and thus improve efficiency) by moving it to the Spreadsheet Analysis Setup Directives menu (Figure 10.7). This menu is obtained by clicking on the Setup button of the Analyse Spreadsheet Columns menu. After moving the line, you click on the OK button on the Spreadsheet Analysis Setup Directives menu, and then on the Save button of the Analyse Spreadsheet Columns menu.
10.2 Repeating a sequence of commands

The commands that are executed with the spreadsheet analysis commands for the columns Employ and Opdays can, as usual, be found in the Input Log.

"Analysis of Data in Spreadsheet: Water.gsh"
MODEL Water
FOR Y = Employ,Opdays
"Simple Linear Regression"
TERMS Y
FIT [PRINT=model,summary,estimates;\
   CONSTANT=estimate; FPROB=yes; TPROB=yes]\
   Y
ENDFOR
After the initial “setup” line to define Water as the dependent variate (and a comment to introduce the analysis), the TERMS and FIT lines are applied to the variates Employ and Opdays using a for loop. This is introduced by a FOR directive, and terminated by an ENDFOR directive. The parameters of FOR take the form: dummy = list of identifiers. Here we have Y = Employ,Opdays, so the contents of the loop are executed twice. The first time, Y is set to Employ; the second time, it is set to Opdays.
If FOR has more than one parameter, the dummies change in parallel. So, if we had an additional dependent variate, Coffee say, we could put

FOR Dependant = Coffee,Water; Y = Employ,Opdays
MODEL Dependant
TERMS Y
FIT [PRINT=model,summary,estimates;\
   CONSTANT=estimate; FPROB=yes; TPROB=yes]\
   Y
ENDFOR
to perform a regression for Coffee with explanatory variate Employ, and then one for Water with explanatory variate Opdays. So the two dummies, Dependant and Y, pass through their lists in parallel. If the second list, or any other subsequent list, is shorter than the first list it is “recycled”: that is, the dummy starts the list again each time it reaches the end, until the first list has finished. For example

FOR Dependant = Coffee,Water,Biscuits; Y = Employ,Opdays
MODEL Dependant
TERMS Y
FIT [PRINT=model,summary,estimates;\
   CONSTANT=estimate; FPROB=yes; TPROB=yes]\
   Y
ENDFOR
would perform a regression for Coffee with explanatory variate Employ, then one for Water with explanatory variate Opdays, and finally one for Biscuits with explanatory variate Employ. Further information about FOR can be found in the Guide to GenStat, Part 1, Section 5.2. That section also explains the other programming directives which, for example, allow you to exit from the middle of a loop or to construct if-blocks which execute different sets of commands depending on some logical tests. Details of how to form your programs into procedures are given in the Guide to GenStat, Part 1, Section 5.3.
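As an illustration of the style of those programming directives, an if-block inside a loop might skip a variate before fitting it. This is a sketch only: the directive names IF, ELSE, ENDIF follow Section 5.2 of Part 1 of the Guide, the condition and the NMV function (assumed here to count missing values) are hypothetical, and a MODEL statement is assumed to have been given already.

   FOR Y = Employ,Opdays
     IF NMV(Y) .GT. 0
       PRINT 'Missing values: variate skipped.'
     ELSE
       TERMS Y
       FIT Y
     ENDIF
   ENDFOR

Check the exact condition syntax and function names against the Guide before relying on this pattern.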
10.3 Syntax of GenStat commands

GenStat commands, or statements as we prefer to call them, all have the form:

statement-name [ option-settings ] parameter-settings :

The statement-name gives the name of the directive or procedure that is to be used, and the terminating colon (:) can be omitted if the statement is at the end of a line. Conversely, if you want to continue a statement onto the next line, you can end the line with a continuation symbol \.
Parameters specify parallel lists of arguments for the directive or procedure. For example

PRINT STRUCTURE=name,pay,hours,rate; DECIMALS=0,2,0,2
would print name and hours with no decimal places, and pay and rate with two. The list for the first parameter of the directive or procedure must be the longest; for PRINT this is the parameter STRUCTURE. Other parameters provide ancillary information, and they will be recycled if they are shorter than the first. So, for example, you could write just

PRINT STRUCTURE=name,pay,hours,rate; DECIMALS=0,2
A warning is printed if any of the other parameters is longer than the first. Options specify settings that apply to all the (parallel) sets of parameters. For example, if you were to put

PRINT [CHANNEL=2] STRUCTURE=name,pay,hours,rate; \
   DECIMALS=0,2
then name, pay, hours, and rate would all be printed to the output file on channel 2 (with their attendant numbers of decimal places). Most options have default values, chosen to be those most often required, and so usually they need not be specified. For example, here are all the options of PRINT.

CHANNEL = identifier   Channel number of file, or identifier of a text to store output; default current output file
SERIAL = string        Whether structures are to be printed in serial order, i.e. all values of the first structure, then all of the second, and so on (yes, no); default no, i.e. values in parallel
IPRINT = strings       What identifier (if any) to print for the structure (identifier, extra, associatedidentifier); for a table, associatedidentifier prints the identifier of the variate from which the table was formed (e.g. by TABULATE); IPRINT=* suppresses the identifier altogether; default identifier
RLPRINT = strings      What row labels to print (labels, integers); RLPRINT=* suppresses row labels altogether; default labels
CLPRINT = strings      What column labels to print (labels, integers); CLPRINT=* suppresses column labels altogether; default labels
RLWIDTH = scalar       Field width for row labels; default 13
INDENTATION = scalar   Number of spaces to leave before the first character in the line; default 0
WIDTH = scalar         Last allowed position for characters in the line; default width of current output file
SQUASH = string        Whether to omit blank lines in the layout of values (yes, no); default no
MISSING = text         What to print for missing value; default '*'
ORIENTATION = string   How to print vectors or pointers (down, across); default down, i.e. down the page
ACROSS = scalar or factors   Number of factors or list of factors to be printed across the page when printing tables; the default for a table with two or more classifying factors prints the final factor in the classifying set and the notional factor indexing a parallel list of tables across the page; for a one-way table only the notional factor is printed across the page
DOWN = scalar or factors     Number of factors or list of factors to be printed down the page when printing tables; default is to print all other factors down the page
WAFER = scalar or factors    Number of factors or list of factors to classify the separate "wafers" (or slices) used to print the tables; default 0
PUNKNOWN = string      When to print unknown cells of tables (present, always, zero, missing, never); default present
UNFORMATTED = string   Whether file is unformatted (yes, no); default no
REWIND = string        Whether to rewind unformatted file before printing (no, yes); default no
WRAP = string          Whether to wrap output that is too long for one line onto subsequent lines, rather than putting it into a subsequent "block" (yes, no); default no
As you have seen from the examples in Section 1.3, most of these do not usually need to be specified. However, they provide considerable flexibility of output when you want it, particularly for printing multi-way structures such as tables. Some parameters also have defaults. Here are the parameters of PRINT.

STRUCTURE = identifiers       Structures to be printed
FIELDWIDTH = scalars          Field width in which to print the values of each structure (a negative value -n prints numbers in E-format in width n); if omitted, a default is determined (for numbers, this is usually 12; for text, the width is one more character than the longest line)
DECIMALS = scalars            Number of decimal places for numbers; if omitted, a default is determined which prints the mean absolute value to 4 significant figures
CHARACTERS = scalars          Number of characters to print in strings
SKIP = scalars or variates    Number of spaces to leave before each value of a structure (* means newline before structure)
FREPRESENTATION = strings     How to represent factor values (labels, levels, ordinals); default is to use labels if available, otherwise levels
JUSTIFICATION = strings       How to position values within the field (right, left, center, centre); default right
MNAME = strings               Name to print for table margins (margin, total, nobservd, mean, minimum, maximum, variance, count, median, quantile); if omitted, "Margin" is printed
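For instance, using the STRUCTURE, FIELDWIDTH and DECIMALS parameters listed above, the statement below would print the variate rain in a field of eight characters with one decimal place (rain is the hypothetical variate defined later in this section):

   PRINT STRUCTURE=rain; FIELDWIDTH=8; DECIMALS=1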
Again, we have been able to obtain very acceptable output in previous examples using the default settings, such as 12 for the field width for variates, and right for JUSTIFICATION. The settings of options and parameters are either lists, expressions, or formulae, and each setting should be separated from the next (if any) by a semi-colon (;). A list is a sequence of items, each separated from the next by a comma. Expressions are used to define calculations, as discussed in Section 2.7, and formulae to define statistical models, as discussed in Section 6.6. Lists may be of numbers, as with the VALUES option in

VARIATE [VALUES=0,5,14,2.3E1,3,8,0,2,8] rain
Numbers in GenStat can be represented in either ordinary or "scientific" format. There are also lists of textual strings: for example TEXT [VALUES='Last Sunday',Monday,Tuesday,Wednesday,\ Thursday,Friday,Saturday,Sunday] day
In general, strings are enclosed in single quotes ('). However, in a string list, these can be omitted provided the string starts with a letter, and then contains only letters or digits. A letter in GenStat is any of the capital letters A to Z, or the lower case letters a to z, or the underscore character (_), or percent (%), while a digit is one of the numerical characters 0 up to 9. Finally, there are lists of identifiers, as in the STRUCTURE parameter of PRINT: for example
PRINT STRUCTURE=name,pay,hours,rate
An identifier is the name given to a data structure. It must start with a letter and then contain only letters or digits. Only 32 characters are stored, so total_software_sales_in_January_1999 will not be distinguished from total_software_sales_in_January_2000 (both will be stored as total_software_sales_in_January_). However, upper case is distinguished from lower case, so SALES and Sales are not the same. (It is possible to change this using the SET directive, but this is not customary. You can also use SET to request that GenStat stores and checks only the first eight characters of identifiers, as was the case in the Fourth and earlier Editions.) Identifiers can also have suffixes, enclosed in square brackets: for example x[2] or employee['grade']. These are explained in Section 10.7. Any list may contain missing values, each represented by an asterisk (*). These represent unknown (or unset) information. You can put comments anywhere in a statement, or between two statements. Comments begin and end with the double-quote character ("); between the quotes you can type anything you like.
10.4 Abbreviation rules

Names of directives, options, parameters, and functions are all system words, and they can always be abbreviated to four characters. If more than the minimum number of characters is given for any system word, they will be checked as far as the 32nd; characters from the 33rd onwards are ignored. Option and parameter names can usually be abbreviated even further. For every directive or procedure, an implicit order is defined for its options and for its parameters. For PRINT, as shown above, the ordering of the parameters is STRUCTURE, FIELDWIDTH, DECIMALS, CHARACTERS, SKIP, FREPRESENTATION, JUSTIFICATION, and MNAME. The rule is that you need specify only sufficient letters to distinguish each parameter from the parameters that occur before it in this (implicit) list. Above we have printed the minimum form for each one in bold letters; so, for example, you can abbreviate FIELDWIDTH to F, as this is preceded only by STRUCTURE, but SKIP requires the two letters SK. However, as already mentioned, if you are uncertain about the ordering for a particular directive, it is always sufficient to specify four characters. This abbreviation rule also applies to the values of options and parameters where you select one, or more, strings from the (limited) set that are recognized by the option or parameter concerned. Thus, for example,
READ [PRINT=data,errors] day
can be written (see Section 2.9) as READ [PRINT=d,e] day
An option or parameter name may be omitted altogether, along with its accompanying equals sign, if GenStat can deduce it from the position of the setting within the statement. In the statement

PRINT [CHANNEL=2; SERIAL=yes] \
   STRUCTURE=name,pay,hours,rate; DECIMALS=0,2

you can leave out CHANNEL=, as CHANNEL is the first option of PRINT: unless you say otherwise (by giving the option name explicitly), GenStat assumes that the first option setting in a statement is for the first option in the implicit ordering for the directive (or procedure). Similarly, as STRUCTURE is the first parameter of PRINT, you can also omit STRUCTURE=, to obtain

PRINT [2; SERIAL=yes] name,pay,hours,rate; DECIMALS=0,2
For subsequent options (and parameters), GenStat looks to see which option (or parameter) comes after the current one in the implicit ordering. Thus, after CHANNEL, GenStat would expect by default to find SERIAL; as our statement also has SERIAL straight after CHANNEL, this can be omitted too.

PRINT [2; yes] name,pay,hours,rate; DECIMALS=0,2
You can include null settings in a statement by typing nothing (other than spaces, comments, or continuation symbols) between two semi-colons. If you insert a null setting for FIELDWIDTH, GenStat will then be able to deduce that our third parameter setting is for DECIMALS, and so you can simply put

PRINT [2; yes] name,pay,hours,rate; ; 0,2
Again this does require you to know the implicit ordering for the directive or procedure, but it can save a great deal of typing as you get to know GenStat better.
10.5 Repeating a statement The repetition symbol & provides a very convenient way of repeating a statement. It terminates the previous statement, if necessary, and then repeats the name of the statement together with any options that were set. So, for example, you could type the statements READ [CHANNEL=2] year
as
:
READ [CHANNEL=2] day,temp
10 More about commands and syntax READ [CHANNEL=2] year
&
320
day,temp
You can also modify the options by including further settings after &. Thus

READ [CHANNEL=2] year
READ [CHANNEL=2; SERIAL=yes] day,temp
READ [CHANNEL=3; SERIAL=yes] sunshine,windspeed

can be simplified to

READ [CHANNEL=2] year
& [SERIAL=yes] day,temp
& [CHANNEL=3] sunshine,windspeed
10.6 Making lists more compact

The values of any data structure can be substituted into a list of the appropriate type, using the substitution symbol hash (#), shown as the pounds symbol (£) on some printers. So, for example, values of variates can be substituted into number lists:

VARIATE [VALUES=1,2,3,4] x
VARIATE [VALUES=#x,#x] xx

gives xx eight values (1,2,3,4,1,2,3,4). Similarly, values of texts can be substituted into lists of strings. For example, in

TEXT [VALUES=data,errors] pde
READ [PRINT=#pde] day,rain

the PRINT option of READ is given the setting data,errors.

Lists of numbers that increase or decrease in a regular way can be represented conveniently in GenStat as progressions. These have the general form x, y ... z, where x is the first number, y the second number, and z the final limit. You can put spaces anywhere within this construct except within the sequence of three dots (...). The progression generates a sequence of numbers: x, x + s, x + 2s, as far as x + ks, where s is the difference between y and x (so x + s = y) and k is the largest integer such that x + ks does not go beyond z. The step can be either positive or negative, and need not be an integer. If the step is 1 or −1, the second number y can be omitted. For example
1...5       =  1,2,3,4,5
5...1       =  5,4,3,2,1
2,4...10    =  2,4,6,8,10
2,(4...10)  =  2,4,5,6,7,8,9,10
Notice that the progression in the final example must be placed in brackets to make it clear that the second number has been omitted. Lists of numbers, strings, or identifiers that are repeated in a regular way can be compacted using multipliers. A pre-multiplier precedes a bracketed list and repeats the individual elements of the list, in turn, a specified number of times. For example

3(1,2)  =  1,1,1,2,2,2

Post-multipliers come after a bracketed list of numbers and repeat the entire list, en bloc, the specified number of times. For example

(day,temp,rain)2  =  day,temp,rain,day,temp,rain

They may be combined. For example

2((1...3)2,4)  =  1,1,2,2,3,3,1,1,2,2,3,3,4,4

You can use scalars as pre-multipliers or post-multipliers, but you must also use a substitution symbol:

SCALAR [VALUE=3] nplot
FACTOR [LEVELS=4; VALUES=#nplot(1),(2,3,4)#nplot] block

gives block the values 1,1,1, 2,3,4, 2,3,4, 2,3,4.
10.7 Suffixed identifiers and pointers

Lists of data structures can be stored in a GenStat pointer structure to save having to type the list in full every time it is used. For example

POINTER [VALUES=rain,temp,windspeed] vars
VARIATE #vars
READ [CHANNEL=2] #vars
PRINT #vars; DECIMALS=2,1,2
defines rain, temp, and windspeed to be variates, and then reads and prints their values. When none of the structures in the list is itself a pointer, the substitution symbol (#) works in the same way as with variates and texts. If, however, there are pointers in the list, they too are substituted, as are any pointers to which they point. You can also refer to the elements of pointers using suffixes. For example, you can refer to rain either using its own identifier, or as the first element of vars by using the suffix [1]: so
vars[1]  is rain
vars[2]  is temp
vars[3]  is windspeed
Furthermore, you can put a list within the brackets:

vars[3,1]  is windspeed,rain

Also, you can put a null list to mean all the available suffixes of the pointer:

vars[]  is rain,temp,windspeed
Identifiers like vars[1], vars[2], and vars[3] are called suffixed identifiers and, in fact, you can use these even without defining the identifier of the pointer explicitly. Whenever a suffixed identifier is used, GenStat automatically sets up a pointer for the unsuffixed part of the identifier if it does not already exist. Furthermore the pointer will usually be extended automatically (whether it has been set up by you or by GenStat) if you later use a new suffix, like vars[93] for example. If, however, you do not want this to happen, you should define the pointer explicitly and set option FIXNVALUES=yes. For example POINTER [VALUES=length,width,height; FIXNVALUES=yes]\ dimensions
The SUFFIXES option of the POINTER directive allows you to specify the required suffixes for pointers that are defined explicitly. Notice that the suffixes do not need to be a contiguous list, nor need they run from one upwards. For example VARIATE [VALUES=1990,1991,1992,1993] suffs POINTER [NVALUES=4; SUFFIXES=suffs] profit
defines profit to be a pointer of length four, with suffixes 1990 to 1993. You could actually omit the NVALUES option here as GenStat can determine the length of the pointer by counting the number of values. However, by supplying a text instead of a scalar for NVALUES you can define labels for the suffixes of the pointer. The length of the text defines the number of values of the pointer, and its values give the labels. For example TEXT [VALUES=name,salary,grade] labs POINTER [NVALUES=labs] employee
would allow you to refer to employee['name'], employee['salary'], and so on. Lower and upper-case labels are distinguished, unless you set option CASE=ignored in the POINTER statement. You can also set option ABBREVIATE=yes, to allow the labels to be abbreviated. (By default, CASE=significant and ABBREVIATE=no.) So, if you had specified
POINTER [NVALUES=labs; CASE=ignored; ABBREVIATE=yes]\ employee
you would be able to refer to employee['name'], for example, as employee['n'] or employee['Name'].
10.8 Unnamed data structures

It can be wasteful and cumbersome to set up a structure explicitly when it is needed only once in a program. So GenStat allows you to define an unnamed structure instead. Some particularly useful types of unnamed structures are described below. The unnamed scalar is just a number. Whenever GenStat expects the identifier of a scalar, a number may be given instead. In fact, you have been using this type of unnamed structure already: the statement

VARIATE [NVALUES=10] x
is equivalent to

SCALAR [VALUE=10] n
VARIATE [NVALUES=n] x
However, the converse, that you can use a scalar instead of a number, is not always true. The exception is with pre-multipliers and post-multipliers when, as mentioned in Section 10.7, the scalar must be preceded by the substitution symbol (#). The other forms all have a common style: they start with an exclamation mark, then a type code, and then a list enclosed in round brackets. An unnamed variate takes the form !V( list of numbers )
where the letter V can be in either upper or lower case, or equivalently !( list of numbers )
For example GRAPH rain; !(1...8)
plots rain against day number, 1 to 8. The unnamed text takes one of two forms. If the text has a single value, then the value can be placed within quotes (') and used instead of the identifier of the text. For example: GRAPH [TITLE='Rainfall during my holiday'] rain; !(1...8)
If there are several values, the form is
!T( list of strings )
where the letter T can be in either upper or lower case. For example: PRINT !t(Sat,Sun,Mon,Tues,Wed,Thur,Fri,Sat),rain
The unnamed pointer has the form !P( list of identifiers )
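As a sketch of how unnamed pointers combine with the substitution symbol of Section 10.7 (assuming rain and temp are variates already holding values), the statement

PRINT #!P(rain,temp)

should be equivalent to

PRINT rain,temp

because the pointer, although unnamed, is substituted in the list by the structures to which it points.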
where, again, the letter P can be in either upper or lower case. There are also unnamed expressions (type code E), and unnamed formulae (type code F). Finally, type code S provides another means of specifying an unnamed scalar. Unnamed structures are particularly useful when assigning values to several structures. The directives that are used to declare data structures all have a parameter as well as an option to specify the values of the structures that are defined. (This is one of the few places in the GenStat language where a directive has both an option and a parameter with the same name.) So you can use the VALUES parameter of VARIATE to specify different sets of values for x and y, as follows: VARIATE x, y; VALUES=!(1,2,3), !(4,5,6)
x now contains the values 1, 2, and 3, and y contains 4, 5, and 6.
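The distinction matters because an option setting applies to the statement as a whole, whereas a parameter supplies a setting for each structure in the list (recycled if the list of settings is shorter). As a sketch:

VARIATE [VALUES=1,2,3] x,y

gives x and y the same three values, while the parameter form above gives each variate its own set.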
10.9 Exercises

10(1) Below we list the contents of a data file, Employ.txt, containing details of the numbers of employees reporting sick on the working days of four weeks during the summer of 2001. First it contains the numbers of employees, then the week numbers (1 to 4), and the abbreviated day names (Mon–Fri). Then there are pairs of values, for each day in turn, giving the maximum temperature and an indicator (yes or no) of whether or not it rained.

31 28 20 22 23 21 25 23 23 24
26 23 20 20 21 25 24 21 22 21 :
1 1 1 1 1 2 2 2 2 2
3 3 3 3 3 4 4 4 4 4 :
Mon Tue Wed Thur Fri Mon Tue Wed Thur Fri
Mon Tue Wed Thur Fri Mon Tue Wed Thur Fri :
17 no 18 no 20 no 22 no 14 yes
13 yes 14 yes 15 yes 14 yes 16 no
15 no 17 no 21 no 18 yes 19 yes
21 no 20 no 19 yes 19 yes 22 no :
Write a program to read and print the various sets of data in the file, as follows. First read the number of sick employees and the week number into a variate and a factor respectively and then print them both in parallel. Remember that you will
need to declare your factor but you may, if you wish, let GenStat define its levels from the data. It is easiest to read these structures using two separate READ statements, but think about how you might combine them into a single statement. Now read the day name into a factor. How does the READ statement differ from that for week number? Finally read temperature and the rain indicator (as a factor) and print all the data in parallel. You may now also be interested to plot the numbers of sick employees against the temperature each day, perhaps with separate symbols for the wet and dry days. (Hint: see Section 2.8.) 10(2) Below we show a rather verbose GenStat program, stored as Shop.gen. Modify it to become compact as possible, by using abbreviations but not by removing spaces and new lines. TEXT [NVALUES=11] branch VARIATE [NVALUES=11] sales01,sales02 VARIATE [NVALUES=11] frontage,depth READ [PRINT=data,errors,summary] branch, sales01, sales02,\ frontage, depth Ashford 4741100 496700 25 33 Bradford 3386800 350100 21 32 Chelmsford 645800 395200 15 22 Dartford 2381200 298900 12 28 Fordingbridge 1379600 412000 12 25 Guildford 2727300 234700 16 26 Hereford 2993300 358500 14 32 'Milford Haven' 3409000 460600 18 24 Oxford 4752400 439100 15 30 Stafford 4117400 473700 16 28 Twyford 942500 294900 12 16 : CALCULATE [PRINT=*] sales01 = sales01/100 CALCULATE [PRINT=*] sales02 = sales02/100 CALCULATE [PRINT=summary] allsales = sales01 + sales02 CALCULATE [PRINT=summary] sale_pm2 = \ allsales / 2 / (frontage * depth) PRINT [SERIAL=no] branch,frontage, depth,allsales, sale_pm2;\ DECIMALS=0,0,0,0,2; JUSTIFIC=left,right,right,right,right STOP
10(3) The first of April 2000 was a Saturday. Using unnamed structures, write a PRINT statement to print the day numbers in the month, with the day of the week and the week number side by side:

1  Saturday  1
2  Sunday    2
3  Monday    2

and so on.
11 Other statistical methods
This Introduction covers most of the data management and manipulation menus, but does not describe all of the statistical menus in detail. Additional areas are summarised below, with cross references to other GenStat documentation. Full details of all the facilities in GenStat, and its commands, are accessible in PDF format from the Help menu (see Figure 1.4). The Reference Manual consists of three parts: 1 Summary, 2 Directives, and 3 Procedures in Procedure Library PL15; there is also an accompanying book covering the new features in GenStat Release 7.1. The Guide to GenStat has two parts: 1 Syntax and Data Management, and 2 Statistics.
11.1 Six sigma

GenStat has a wide range of facilities to support the six-sigma approach to quality improvement. There are some specialized menus in the Six Sigma section of the Stats menu on the menu bar (Figure 11.1). These include control charts, Pareto charts and capability statistics, and a wizard to form various designs popular in industrial statistics. Further details are in Section 2.10 of Part 2 of the Guide to GenStat.
11.2 Survey data

The Survey Analysis section of the Stats menu (Figure 11.2) gathers together several menus useful for the analysis of surveys. The first two lines provide simple frequency tables and tables of means, totals, maxima, minima, variances and quantiles for surveys with no special structure. The second two lines generate menus to analyse results from stratified surveys using the SVSTRATIFIED procedure (see Part 3 of the GenStat Reference Manual).
11.3 Geostatistics

GenStat has a set of menus in the Spatial Analysis section of the Stats menu for spatial analysis by "kriging". This is a method originating in geostatistics for analysing data distributed in two dimensions. The kriging model specifies how successive measurements of a variable in space are correlated with each other, in terms of a "variogram". This is analogous to the "correlogram" used in the analysis of time series, but for two-dimensional (spatial) data rather than one-dimensional (temporal) data. GenStat has a menu to form the variogram (see Figure 11.3). There is then a menu for fitting models to describe how the correlations vary with distance, and perhaps also with direction. Finally, the Krige menu allows you to generate predictions (and their variances). Examples, and a description of the underlying methodology, are in Section 8.3 of Part 2 of the Guide to GenStat.
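For reference, the variogram mentioned above has the standard geostatistical definition (the usual textbook form, not spelled out in the menu description): for a spatial variable $Z(x)$ and separation $h$,

$\gamma(h) = \tfrac{1}{2}\,\mathrm{E}\big[\{Z(x+h)-Z(x)\}^2\big]$

which the variogram menu estimates, in the standard way, by averaging squared differences over pairs of points at (approximately) the same separation.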
11.4 Survival analysis

Survival data are data in which the response variate is, for example, the lifetime of a component or the survival time of a patient. Typically these are censored, i.e. some individuals survive beyond the end of the study, and so their survival time is unknown. The survivor function F(t) is defined as the probability that an individual is still surviving at time t. The Kaplan-Meier estimate of the survivor function (provided by the Kaplan-Meier menu; see Figure 11.4) is simply the number surviving out of the number at risk in each time interval. Alternatively, GenStat can calculate the life-table (or actuarial) estimates of the survivor function. Nonparametric tests can be made to compare the survival distributions of two or more groups of right-censored survival data. There are also menus for modelling the survival times, by assuming that they follow exponential, Weibull or extreme-value distributions, or by fitting proportional hazards models. Further details are in Section 8.2 of Part 2 of the Guide to GenStat.
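For reference, the Kaplan-Meier estimate described above can be written in the standard product-limit form (the usual textbook definition, not shown in the menus). Writing $\hat{F}(t)$ for the estimate of the survivor function, with $d_i$ individuals failing at time $t_i$ out of $n_i$ still at risk just before $t_i$,

$\hat{F}(t) = \prod_{t_i \le t} \left(1 - \frac{d_i}{n_i}\right)$

which is exactly "the number surviving out of the number at risk" accumulated over successive time intervals.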
11.5 Repeated measurements

A repeated-measurements study is one in which subjects (animals, people, plots, etc.) are observed on several occasions. Each subject usually receives some randomly allocated treatment, either at the outset or repeatedly through the investigation, and is then observed at successive occasions to see how the treatment effects develop. GenStat has a comprehensive collection of menus for the analysis of such data (see Figure 11.5). The relevant REML menus are described and illustrated in Section 7.4. Other menus provide customized plotting of the observations (or profiles) against time, repeated-measures analysis of variance, and analyses based on ante-dependence structure or generalized estimating equations. For details see Sections 8.1 and 3.5.12 of Part 2 of the Guide to GenStat.
11.6 Multiple experiments

The multiple experiments menus (Figure 11.6) provide analyses that combine information from several related experiments. This process, often called meta-analysis, can be performed using the REML methods and menus; see Section 5.7 of Part 2 of the Guide to GenStat. GenStat also has a menu for fitting AMMI (additive main effects and multiplicative interaction) models; see Part 3 of the GenStat Reference Manual.
Index

*units* 251 Abbreviation 16 of directive name 16, 318 of function name 318 of option name 16, 318 of parameter name 16, 318 of procedure name 16 of repeated numbers 321 of string setting of option 318 Accumulated summary 172 Actuarial estimate 329 ADD directive 188 Added variable plot 200 Addition 45 Additive model 177 ADISPLAY directive 229 Adjusted R-squared 150 AGRAPH procedure 229 AKEEP directive 200, 229 All subsets regression 163, 165, 188 Alpha design 237 AMMI 330 Ampersand 319 Analysis of spreadsheet 311, 312 Analysis of covariance 227 Analysis of parallelism 172 Analysis of variance 193 AOV table 194, 196, 198 in regression 149, 172, 176 menu 193, 195, 197, 198, 224 one-way 193, 194, 227 two-way 195, 198 ANOVA directive 227 ANOVA Further Output menu 197 ANOVA Means Plots menu 202 ANOVA Options menu 196 ANOVA Residual Plots menu 200
ANOVA Save Options menu 200 Ante-dependence 255, 261, 329 APLOT procedure 229 Argument of a function 49 ARIMA model 301 Arithmetic operator 45 Arithmetic progression 320 ASCII file 62 Assumption for regression 147, 149, 151, 152 Asterisk as crossing operator 215 as missing value 36, 316 as multiplication 45 double as exponentiation symbol 45 Asymptotic regression 179 Audit 22 Audit trail 307 Autoregressive model 250, 251, 255 Auxiliary parameter 16 Averaging of effects 231 AXES directive 65 Axis limit 72 title 65 Backslash 34, 61, 314 Balanced design 227 Barchart 81 Basic statistics 73 Binomial data 186 Blank character 14, 34, 35 Block structure 198, 199 BLOCKSTRUCTURE directive 227 Bookmark 21 Box and Jenkins methods 299 Boxplot 9, 92 menu 8
BOXPLOT procedure 92 Bracket round 53, 321 square 15 Browse 34, 106 Calculate Column in Spreadsheet menu 52 CALCULATE directive 17, 63 Calculate Functions menu 48, 50, 52 Calculate menu 46-49 Calculation 46, 97 Canonical variates analysis 265 Capability statistics 327 Cascade 21 Case 17, 318 Categorical data 81, 84, 167 Censored survival data 328 Change Model menu 157 Change regression model 156 Chi-square test 93 CHISQUARE procedure 93 City block 251, 254 Clipboard 20 Cluster analysis 265, 278, 281 Coincident points 64 Colon end of data 67 end of statement 314 Colour 64, 65 Combined effects 248 Combined means 248 Comma 14, 34, 36, 157 Command 11, 307, 314 editing 309 running 310 Comment 318 Communication 20
Confounding 161, 226 Constant in expression 47 in regression 149 Constrained regression 149 Continuation symbol 34, 61, 314 Contrast 216 Control chart 327 Cook's statistics 187 Copy and paste 21 of structure 66 Correlation 49, 150, 161 Correlation model 240, 250 Correlation structure 250 Counts 183 Counts, table of 82, 84 Covariate 227 COVARIATE directive 227 Cross-tabulation 84 Crossing operator 215 Cursor 4 Curve 175, 188 Cut and paste 21 Cyclic design 237 DAT extension 34, 35, 106 Data deletion 66 display 7, 12, 66 export 20 grouped 35, 187 import 20 menu 6 selection 109 storage 29, 62 summary 29, 48 Data structure 7 Decimal places 15, 16 Declaration 69, 324 Default value 315, 316 DELETE directive 66 Deleting text 13 DESCRIBE procedure 92 Design for industrial experiments 327
Design of experiments 193, 208 Deviance in REML 252, 253, 256 Deviations from fitted contrasts 206 DEVICE directive 66 DGRAPH directive 64 DHISTOGRAM directive 63, 92 Diagnostic 15 Digit 317 Direct product of correlation models 250 Directive 12, 14 Directive name 14-16 Directory 34, 61 Dispersion parameter 185 Display from analysis of variance 194, 196, 197 from regression 149, 187 graphics 11 of data 7, 12, 66 Display Variables menu 7 Division 45 Dot character as operator 172, 215 three dots 320 Double-quote 34, 318 Dragging 18 DROP directive 188 Dummy data structure 311 DUPLICATE directive 66 Edit menu 20 Edit window 18 Effective standard errors for contrasts 205 Efficiency 247 Eliminated effect 162 End of command 314, 319 of data 67 Environment of graphics 66 of interface 21
Equal weights in prediction 231 Error (as mistake) in command 14 Error (as residual) in analysis of variance 227 in regression 147, 148, 152 Estimate of parameter 150 extraction 154, 187 Euclidean distance 254 Exact test 87 Excel 71 Exchanging data 20 Exclamation mark 323 Excluding data 109 Explanatory factor 172 Explanatory variable 147, 155 Exploratory regression 155 Exponential curve 179 Exponential distribution of survival times 329 Exponential model 170 Exponentiation 45 Exporting data 20 Expression 17, 63 Extension of file 35 Extracting results from analysis of variance 200, 229 from regression 154, 187 Extrapolation 177 Extreme data 176, 196 Extreme-value distribution of survival times 329 Factor 8 automatic formation 35 classifying table 82, 84 in expression 54 in regression 187 Factorial operator 215 False value in expression 52 Fault Log 15 Fault message 14 Field experiment 237, 242 File data 6
graphics 66 menu 12 of commands 18 output 21, 62 FILEREAD procedure 60 Finding a string 21 First parameter of command 14, 16, 315 Fisher's exact test 87 FIT directive 186, 188 FITCURVE directive 188 Fitted values from analysis of variance 200, 229 from regression 151 Fixed effect 237 Fixed format 70 For loop 313 Forecasting time series 302 Forward selection 163 Free format 68 Function 48, 50 argument 49 for factor 54 Further output from analysis of variance 197 from regression 150, 187 General Analysis of Variance menu 214 Generalized linear mixed models 188 Generalized linear model 147 Generalized linear models 183 Generally balanced design 227 Generating a standard design 208 Geostatistics 328 Graeco-Latin square 232 GRAPH directive 64 Graph of Fitted Model menu 177, 186 Graphics device 65
environment 66 fitted analysis of variance model 202, 229 fitted regression model 151, 187 metafile 65 model checking 152, 187, 201, 229 pen 64 symbols 65 window 9 Graphics server 9 Grouped data 35, 167 in regression 167, 187 Half-Normal plot 152, 187, 200 Hash symbol 320, 321, 323 Help 4 for commands 308 Help menu 4 Hierarchical cluster analysis 265 Hierarchical generalized linear models 188 Higher-order term 214, 215 Histogram 63, 81, 92 HPGL graphics 65 Icon 3 Identifier 16, 318 in list 66 suffixed 322 Ignoring effect 162 Importing data 20 Indentation 16 Influential data 151, 159, 187 Information summary 247 Input 20 Input Log 4, 12, 22, 307 Insert key 13 Insert mode 4, 13 Interaction in analysis of variance 196, 214, 215 in regression 172, 187 between contrasts 206, 207
Interactive mode 18, 22 Intercept 148, 149 Interface 20, 21 Irregular layout in REML 250, 254 Kaplan-Meier estimate of survivor function 328 Keeping results from analysis of variance 200, 229 from regression 154, 187 Key for graph 65 Kriging 328 Kurtosis 92 Large residual 176, 196 Latin square 232 Lattice design 232 Lattice square 245 Layout of data 34, 60, 68 of output 15 of table 84 Least significant difference 195, 228 Letter 317 Leverage 151, 159, 187 Life table estimate 328 Limit in histogram 56 of axis 72 Line number 21 Line of best fit 147 Line-printer graphics 92 Linear contrast 216 Linear mixed model 237 Linear model 147, 175 Linear Regression Further Output menu 150, 151, 162, 172 Linear Regression menu 149, 156, 170, 175 Linear Regression Options menu 150, 160 Linear Regression Save Options menu 154 Linear trend 251 Link function 184
List 14, 317 of identifiers 66, 157, 317 of numbers 317, 320 of textual strings 317 LIST directive 66 Listing of data 7, 66 Log file 4, 12 Log-linear model 87, 183 Logarithm 48 Logical operator 52 Logical test 52 Long command 34, 61, 314 Mainframe 2 Mann-Whitney test 93 MANNWHITNEY procedure 93 Margin in output 16 Marker 21 end of data 67 graphical 65 Matrix 154 multiplication 45 Maximal model 156, 188 Mean 92 Mean square 149 Memory 2 Menu bar 3 Messages suppressing 309 Meta analysis 330 Missing factor combination 231 Missing value 36, 84 in data 36 in graphics 57 in list 318 in output 316 in regression 156 in table 82, 84 insertion 52 replacement 50, 316 representation 36 Mistake 15 Mixed Model 237 Model for analysis of variance 227
for regression 148, 159, 175, 176, 179 Model checking 152, 169, 187, 200, 229 Model Checking menu 152 MODEL directive 186 Model formula 156, 187, 188, 214, 216, 224 Model term 214 Mouse 18, 20, 21 Multiple regression 155, 187 Multiplication 45 repetition of numbers 321 Multivariate analysis 265 Multivariate analysis of variance 265 Name for data structure 16 Nesting operator 215 Non-hierarchical cluster analysis 265 Nonlinear model 179 Nonparametric tests for survival data 329 Normal distribution 147, 153, 168 Normal plot 152, 187, 200 Null setting in a statement 319 Number of groups in a histogram 56, 64 Omission of option and parameter names 319 One-way analysis of variance 193, 194, 227 OPEN directive 66 Opening a file 21, 66 Operator precedence 53 Option of command 15, 315 name 16 setting 317 Options menu 21, 307 Ordinal regression 188 Origin in regression 149 Orthogonal polynomial 176 Outlier 176, 196
Output file 21, 62 graphical 9 layout 15 tabular 82 window 13 Output window 4 Over-dispersion 185 Overwrite mode 4, 13 Paired test 94 Parallel data 15 Parallel output 15 Parallelism 172 Parameter of command 14, 16, 315 name 16 setting 317 Parameter of model 148, 150 Parenthesis 53, 321 Pareto chart 327 Partial least squares 265 Partially balanced design 227 Paste and cut 21 Patterned list 321 PEN directive 65 Pen for graphics 64 Percent character 317 Percentage variance accounted for 150 Pi 23 Plotter 65 Plotting symbol 65 Point plot 57 Pointer data structure 321 POINTER directive 322 Poisson distribution 183 POL function 175, 188 Polynomial contrast 216 Polynomial regression 175, 188 Post-multiplier 321 Pounds symbol 320 Power model 251, 255 Power transformation 45 Pre-multiplier 321
Precedence of operators 53 Predicted value from analysis of variance 200, 229 from regression 151 Prediction for a mixed model 262 in regression 154, 174 Primary parameter 14, 16, 315 Prime symbol 34, 317 Principal coordinates analysis 265 PRINT directive 12, 62, 315, 316, 318 Printer 12, 16 Printing a window 21 Probability for F-statistic 150, 194, 196, 228 for t-statistic 150 Procedure 12, 14, 314 name 15, 16 Profile plot 329 Progression of numbers 320 Proportional hazards model 329 Pruning a classification tree 292 Quadratic contrast 170, 216 Question menu 35 Quotes double around comment 34, 318 single around text 34, 317 R-squared statistic 150 RAM 2 Random effect 237 Random measurement error in REML 251 Random term 251 Randomized-block design 195, 198 Range of groups in a histogram 56, 64 RCHECK procedure 187
RDISPLAY procedure 187 Re-sizing a window 4 Read Data from ASCII File menu 34, 36 READ directive 60, 66 Reading data from a file 70 parallel data 68 serial data 69 Record of commands 4, 12 Record of output 4 Recycling of parameters 315 Redefining length of vector 69 REG function 176 Regression 147 assumption 150, 152 constrained 149 fitted line 147, 151 linear 147, 175 missing value 156 model 148, 159, 175, 176, 179 multiple 155 nonlinear 179 parameter 148, 150 polynomial 175, 188 smoothed 177, 188 summary 149, 159, 176 Regression trees 188 Regular grid in REML 250 Relational test 52 Relationship between variables 57, 147 Release 2 REML 237 compared to ANOVA 249 Repeated measurements 254, 329 Repeating a command 319 Repetition symbol 319 Replacing a string 21 Residual 150
from analysis of variance 200, 227, 229 from regression 151, 176, 187 Response variable 147 RESTRICT directive 113 Restricting the units of a vector 109 Return to GenStat after graphics 11 RGRAPH procedure 187 RKEEP directive 187 Round bracket 53, 321 Run menu 13, 18 Save Data menu 22 Saving analysis of variance results 200, 229 contents of window 20, 21 data 22 interface settings 22 regression results 154, 187 Scalar 45, 47, 321 Scatter plot 57 Screening tests 188 Search menu 20 Searching for a string 20, 21 Selection of data 109 Semi-colon 16, 317 Separable error process in REML 250 Separator of data 36 Sequential model fitting 175, 188 Serial data 69 Serial output 15 Server 4, 9, 11, 18 Set inclusion 53, 54 Significance 150 Significant digit 15 Single quote 34, 317 Six sigma 327 Skewness 56, 92, 168 Slash symbol 45, 215 Slope of regression line 148 Smoothing spline 177, 188 in residual plot 169
Sorting data 114 Space character 14, 34, 35 Space for data 2 Spatial analysis 250 Spatial covariance term 252 Speed 9 Spline 238 Split-plot design 215, 223, 224, 228, 237 Split-split plot design 232 Spreadsheet analysis of 311-313 automatic transfer of data 29 use to define subsets 109 Square bracket 15 SSPLINE function 177, 188 Standard Curves menu 179 Standard error 161 of difference of means 195, 197, 199, 226, 228 of mean 195, 228 of regression parameter 150, 154, 180, 187 Standard errors for contrasts 205 Standardized residual 151 Statement name 314 Statistics 73 Stats menu 148, 165, 208, 209, 214 Status bar 3, 9, 13, 21 Storage of commands 18 of data 29, 62 of identifiers 321 of results from analysis of variance 200, 229 of results from regression 154, 187 Stratified survey 327 Stratum 199, 226, 229, 247 Stratum variance 248 Strip-plot design 232 Structure 7 Sub-plot 223 Subset of values 113
Substituting the values of a data structure 320 Substitution symbol 320-321 Subtraction 45 Suffixed identifier 322 Summary accumulated 172 of analysis 149, 159, 176 of data 29, 48 Summary by Groups menu 82, 84 Survey analysis 327 Survival analysis 328 life table 328 nonparametric tests 329 SWITCH directive 188 Symbol for plotting 65 Symmetric matrix 154 Syntax of command 12, 314 T-statistic 150 T-test 93 Table 82, 93 of means 194, 197, 202, 241 Tabular output 15, 82 TABULATE directive 93 Tabulation 82, 84 Terminal node 291 Terminating a command 314, 319 Text 34 suffix of pointer 322 use in expression 54 TEXT directive 317 Tied data 275 Tiling of windows 21 Time 9 Time series 299 Title for boxplot 92 for graph 65 Tool bar 21 Transfer of data 29 Treatment term 196, 227 TREATMENTSTRUCTURE directive 227
Trend 169 Triangle as plotting symbol 65 True value in expression 52 TRY directive 188 TTEST procedure 93 Two-way analysis of variance 195, 198 Unbalanced design 232, 240, 262 Underscore character 317 Uniform correlation model 255 Unknown table entry 84 Unnamed data structure 323 Unstructured model in REML 255 User defined nonlinear curves 180, 188 Variance 92, 148, 153 percentage accounted for 150 Variance component 239, 240, 248, 260 Variance ratio 150, 226 Variate 8, 23 VARIATE directive 23, 317, 324 Variogram 328 Vector 69 Vector spreadsheet 37 Version 2 Wald statistic 240, 252, 257 Weibull distribution of survival times 329 Whole-plot 223, 239 Wilcoxon test 94 Window edit 18 input log 4, 12 menu 21 output 4 within graphics window 65 Word-processor 20 Working directory 4, 34 Workspace 2, 22