Advice on requesting statistical advice

One of the best aspects of knowing a bit of ecological statistics is that people often come to you asking for statistical help or advice. This is great not only because, well, helping people is great, but also because it is a great way to learn. After all, in what other circumstances would a plant ecologist worry about how to analyze data from studies of ecotoxicology or genetics? Each request for help is an opportunity to learn a bit more about some form of analysis or approach, and I personally love it.

With this in mind, a few things can make the process easier for both sides. Specifically, I think that some information is essential and should be given, if possible, when the advice is requested. Below are some ideas on how to ask for statistical help or advice (I do not discuss how to analyze data here, because other people have done a much better job of that). I think the following information should always be given, briefly, when requesting statistical advice, and should be thought about when planning a study:

Question to be answered

Statistical analysis may give you the answers you are looking for. However, a good answer can only be found if you have a good question (for example, “is nest predation affected by distance from the edge?”). Analyzing data simply to extract something from them (a situation such as “I have these data on 15,000 butterflies in a savanna area and don’t know how to use them”) is much more complicated, and sometimes impossible. And finding a good question depends more on biological knowledge than on statistical skills. Of course, the question should have been formulated when the study was being planned. However, this does not always happen. Sometimes we know the general theme of the study but, for whatever reason, we have not formulated a clear enough question. This is perfectly normal, at least during one’s undergraduate or master’s studies. Other times the sampling design may not be appropriate to answer the question at hand, and we only realize this when it is too late. In these cases, my recommendation (and this is the personal recommendation of a simple Ecology PhD candidate) is not to think “Gosh, I have these data, what statistical tests may I use on them?”, but to think, instead, “Cool, I have these data, I wonder if I can use them to answer any interesting ecological questions?”. After all, you collected the data and you studied the taxon or the environment where they were collected, so no one is better placed than you to tell what kind of question your data may answer! Once you have the question, then it is time to think of the statistical test.

Sampling unit

OK, we have the question, time to move on! The sampling unit is a very important piece of information that is sometimes left out, maybe because the author of the study sees it as obvious. For example, in a study on the effect a plant extract has on seed germination, the sampling unit may be each seed, or each set of seeds in a Petri dish, or a set of Petri dishes. Once again, it depends on the question, and the sampling unit may often be viewed in more than one way. However, if we think of each seed as the sampling unit, then the statistical test has to take into account the non-independence among the seeds placed on the same dish. “Non-independence, Master Yoda?” (Consider myself a master, I do not.) Maybe one dish was placed on a spot with slightly more light (perhaps because of some defect in the germination chamber), and its seeds may germinate faster because of this difference in light, not because of the plant extract being studied. This is difficult to control. It may not alter the final result, but the possible non-independence is an important piece of information that has to be accounted for when planning the study. This also applies to other studies, such as those of plants within plots or quadrats in a forest.
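One simple way to see what changing the sampling unit means is to collapse seed-level records into one value per dish, so that each dish contributes a single observation. The sketch below is only an illustration: the dish names, treatments and germination values are made up, and summarizing by dish is just one possible way of handling the non-independence, not necessarily the best one for a given study.

```python
from collections import defaultdict

# Hypothetical records: (dish_id, treatment, germinated) for each seed.
seeds = [
    ("dish1", "extract", 1), ("dish1", "extract", 0), ("dish1", "extract", 0),
    ("dish2", "extract", 0), ("dish2", "extract", 1), ("dish2", "extract", 0),
    ("dish3", "control", 1), ("dish3", "control", 1), ("dish3", "control", 0),
    ("dish4", "control", 1), ("dish4", "control", 1), ("dish4", "control", 1),
]


def summarize_by_dish(records):
    """Collapse seed-level records into one (treatment, proportion) per dish."""
    totals = defaultdict(lambda: {"treatment": None, "germinated": 0, "n": 0})
    for dish, treatment, germinated in records:
        row = totals[dish]
        row["treatment"] = treatment
        row["germinated"] += germinated
        row["n"] += 1
    return {
        dish: (row["treatment"], row["germinated"] / row["n"])
        for dish, row in totals.items()
    }


dish_summary = summarize_by_dish(seeds)
for dish, (treatment, proportion) in sorted(dish_summary.items()):
    print(dish, treatment, round(proportion, 2))
```

With the seed as the unit there would be 12 observations here; with the dish as the unit there are only 4, which is worth stating explicitly when asking for advice.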

Nature of the response variable

What is the measure, or operational variable, being used? Is it a continuous variable (e.g. height), a count variable (e.g. number of individuals), a categorical variable (e.g. species), a binary variable (e.g. whether a nest was predated or not) or a proportion (e.g. percentage of germinated seeds)? This information is essential to determine which tests may or may not be used. For example, the famous Analysis of Variance (ANOVA) is basically meant for continuous response variables. For count data, we may think of a chi-square test or a GLM. It is also important to state whether you have one response variable or many, as the latter case would call for a multivariate analysis.
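If you are unsure how to describe your response variable, a quick look at the values themselves can help. The function below is a rough heuristic written for this post, not a standard routine: the name `describe_response` is made up, it assumes the values are either all text or all numbers, and it can be fooled (for example, a continuous variable that happens to lie between 0 and 1 will be labeled a proportion). Your knowledge of how the variable was measured should always win.

```python
def describe_response(values):
    """Guess the type of a response variable from its values (rough heuristic)."""
    if all(isinstance(v, str) for v in values):
        return "categorical"   # e.g. species names
    if all(v in (0, 1) for v in values):
        return "binary"        # e.g. nest predated (1) or not (0)
    if all(isinstance(v, int) and v >= 0 for v in values):
        return "count"         # e.g. number of individuals
    if all(0 <= v <= 1 for v in values):
        return "proportion"    # e.g. fraction of germinated seeds
    return "continuous"        # e.g. height


print(describe_response([0, 3, 12, 7]))
print(describe_response([1.52, 0.87, 2.30]))
```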

Nature of the explanatory variables

OK, we have the response variable, but what of the explanatory variables? In short, they are the variables used to explain the variation (or the pattern) of the response variable. They may refer to the different treatments; or they may be the different concentrations of a chemical agent; or they may be environmental variables measured at the site, such as pH or soil moisture; or they may even be the spatial location of the samples. The analyses tend to be less restrictive when it comes to the explanatory variables, but it is important to know, for example, whether the variables are continuous or categorical (both types of variables may exist in the same dataset).

Replication

Which brings us to the question of replication, essential to determine the type of test that may be applied. If the explanatory variables are continuous, how many replicates, or repeated sampling units, do we have? If the explanatory variables are distributed in groups or among factors (e.g. different pre-determined concentrations of a chemical compound), how many replicates are there for each treatment? In addition, is the sampling design balanced (in other words, do all treatments have the same number of replicates)? Depending on the sample size, some statistical tests should not be used, as they may not work with small samples or may have very low power. For other tests, computation may be too complex for a very large sample size.
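Counting replicates per treatment and checking for balance takes only a few lines, so it is easy to include the numbers in a request for help. The sketch below uses invented concentrations and replicate numbers; `replication_summary` is a name made up for this illustration.

```python
from collections import Counter

# Hypothetical design: one entry per sampling unit, naming its treatment.
units = ["0 mg/L"] * 5 + ["10 mg/L"] * 5 + ["50 mg/L"] * 4


def replication_summary(treatments):
    """Return replicates per treatment and whether the design is balanced."""
    counts = Counter(treatments)
    # Balanced means every treatment has the same number of replicates.
    balanced = len(set(counts.values())) == 1
    return dict(counts), balanced


counts, balanced = replication_summary(units)
print(counts)
print("balanced" if balanced else "unbalanced")
```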

Some additional details

Finally, some specific details regarding sampling design may be necessary. The sampling method (for example, whether insects were sampled with pitfall traps or with yellow plates) may be irrelevant. However, the way in which the traps were distributed may be important: depending on the distance between traps or on the way in which they are grouped, some form of correction for spatial autocorrelation may be necessary. If the traps were placed in groups, the analysis might need to account for pseudoreplication, for example by means of a mixed model. Another thing to take into account is whether the same sampling unit was measured more than once during the experiment, as the different measurements of the same sampling unit are not independent. And so on. In any case, good planning that accounts for these factors is essential, and statistics cannot solve the problems created by an ill-planned experiment.

An example

This is an example of how I would describe the data I collected during my master’s (I don’t explicitly state the sampling unit in a separate sentence because it seems clear without it, but I might of course be wrong):

I collected data on edge effects, so my objective is to know how the response variables are related to distance from the edge of the forest fragment. All the response variables are continuous and I intend to analyze them one by one; the explanatory variable is distance from the edge, measured in meters, and may assume 15 different values (0, 2, 5… 30, 40, 50… 120, 150, 180 m). The sampling was performed along linear transects, that is, plots (sampling units) along straight lines, from 0 to 180 m. I had a total of 5 transects, separated by a random distance between 20 and 40 m.
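For those who like checklists, the same description can be written down as a small record with one field per item discussed in this post. This is just a sketch of one possible way to organize the information; the class `AdviceRequest` and its field names are invented here, and the values are taken from the description above.

```python
from dataclasses import dataclass


@dataclass
class AdviceRequest:
    question: str
    sampling_unit: str
    response: str
    explanatory: str
    replication: str
    details: str = ""

    def missing(self):
        """Names of the fields that were left empty."""
        return [name for name, value in vars(self).items() if not value.strip()]

    def summary(self):
        """Render the filled-in fields as short labeled lines."""
        lines = [
            f"{name.replace('_', ' ').capitalize()}: {value}"
            for name, value in vars(self).items()
            if value.strip()
        ]
        return "\n".join(lines)


request = AdviceRequest(
    question="How are the response variables related to distance from the edge?",
    sampling_unit="plot along a linear transect",
    response="continuous, analyzed one by one",
    explanatory="distance from the edge in meters, 15 values "
                "(0, 2, 5… 30, 40, 50… 120, 150, 180 m)",
    replication="5 transects",
    details="transects separated by a random distance between 20 and 40 m",
)

print(request.summary())
print("Missing:", request.missing())
```

An empty field shows up in `missing()`, which is a cheap reminder of what still has to be stated before the email is sent.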

One last piece of advice

With the above in mind, consider where you are sending the request for help. If you are sending an email to someone specific, for example a professor or a more experienced colleague, I’d recommend including as much information as possible in the email. Conversely, if you are sending an email to a mailing list, such as R-BR, dedicated to the software R, or the mailing list of the Past software, I’d recommend writing a shorter email, explaining the general statistical problem as briefly as possible. Questions asked on a mailing list usually refer to a specific test, and therefore do not need many details. In addition, the shorter the email, the more people will read it in full, and if somebody needs more information to reply, this person can always write asking for it.

And to emphasize…

No statistics can save an ill-planned experiment or sampling design. On the other hand, many tests can be used on a well-planned study. Therefore, always plan well before beginning the study.
