This is a crucial step and also a research topic in itself. We have to satisfy two requirements:
- take all parameters available to us in order to be objective and complete;
- take parameters that can be given evolutionary states and represent different evolution stages, avoiding homoplasies and redundancies.
The first requirement is constrained essentially by the technology available at the time of the study. Hence, there is not much of a choice here, but the parameters must be homogeneously measured or estimated. In addition, the number and kind of available parameters evolve with time.
The second requirement seems to contradict the first one. Indeed, this is probably one of the most interesting parts of cladistics. There is no unique ideal set of parameters, and the behavior of the parameters has to be understood or investigated. So how can we know whether a given parameter retrieved from a database can be a good character?
We can use our knowledge, that is, our theoretical models, to predict the behavior of a given parameter. The drawback of this approach is that it introduces a strong bias, whether subjective or simply due to our limited ability to understand complex phenomena.
The best way to understand the parameters is to study them in the context we are interested in. This is where data mining approaches in general are useful.
Firstly, cladistics itself yields a cladogram which is a phylogenetic hypothesis. This word “hypothesis” is important because it emphasizes that this is a start, and not the end, of the investigation. From this hypothesis, we must derive some understanding of all evolutionary aspects implied by the cladogram, for both the objects (taxa) and the characters.
This shows that cladistics is an investigation tool, an exploration tool, not a black box that provides a definitive classification of things. If the tree obtained is not robust, then the data set is not adequate, some taxa perturb the analysis, or the parameters are not well behaved. Just looking at the evolution of the parameters along the cladogram already tells a lot about synapomorphies, homoplasies, and even contradictory characters. It is also a good idea to run several analyses with different sets of taxa and characters, and to compare the results.
Secondly, there are other, very different statistical tools that can be used to assess the discriminant power of the parameters for the object sample under study. For instance, PCA (Principal Component Analysis), ICA (Independent Component Analysis), scatter plots, correlation diagrams, etc., can already give some insights.
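As a rough illustration of this kind of preliminary exploration, here is a minimal Python sketch, assuming only that NumPy is available. It runs a PCA on standardized parameters and lists strongly correlated pairs, which are candidates for redundancy. The data are synthetic, and the parameter names and the 0.9 threshold are arbitrary choices made for the example, not values taken from any real study.

```python
import numpy as np

def standardize(X):
    """Centre each parameter (column) and scale it to unit variance."""
    return (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

def pca(X):
    """PCA through an SVD of the standardized data.

    Returns the explained-variance ratios and the loadings
    (one row per component, one column per parameter).
    """
    Z = standardize(X)
    _, s, Vt = np.linalg.svd(Z, full_matrices=False)
    variances = s**2 / (Z.shape[0] - 1)
    return variances / variances.sum(), Vt

def redundant_pairs(X, names, threshold=0.9):
    """Pairs of parameters whose absolute Pearson correlation exceeds threshold."""
    R = np.corrcoef(X, rowvar=False)
    n = len(names)
    return [(names[i], names[j], R[i, j])
            for i in range(n) for j in range(i + 1, n)
            if abs(R[i, j]) > threshold]

# Synthetic, purely illustrative sample: 200 objects, 3 parameters.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = 2.0 * a + rng.normal(scale=0.05, size=200)  # nearly redundant with a
c = rng.normal(size=200)                         # unrelated to a and b
X = np.column_stack([a, b, c])
names = ["param_a", "param_b", "param_c"]

ratios, loadings = pca(X)
pairs = redundant_pairs(X, names)
print("explained variance ratios:", np.round(ratios, 3))
print("highly correlated pairs:", pairs)
```

A strong correlation only flags a pair for closer inspection. Whether one of the two parameters should be dropped still depends on what each of them means physically.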
In Fraix-Burnet et al. (2012), we present a good illustration of how one should select parameters for a cladistic analysis of a given sample of galaxies. This approach is also valid for clustering in general.
I should now mention some obvious properties of the parameters, so obvious that they are easily forgotten.
Parameters must be intrinsic to the objects. They must describe the proper evolutionary path. If two perfectly identical objects are situated in different places of the Universe, they should be described in the same way. Hence, in astrophysics, all quantities must be “absolute”, that is corrected for distance and reddening. The parameters should not describe the environment of the objects (the number of neighbors is not a property of the object, but a property of the environment). By definition, the environment of an object is not a part of it.
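To make the "absolute" requirement concrete, here is a small Python sketch of the standard conversion from apparent to absolute magnitude, M = m - 5 log10(d / 10 pc) - A. It assumes the distance is given in parsecs and that the extinction A (in magnitudes, in the same band) is already known. It deliberately ignores K-corrections and other cosmological effects, which matter for distant galaxies. The function name and the numerical values are invented for the example.

```python
import math

def absolute_magnitude(apparent_mag, distance_pc, extinction_mag=0.0):
    """Absolute magnitude from apparent magnitude, distance and extinction.

    M = m - 5 * log10(d / 10 pc) - A
    """
    if distance_pc <= 0:
        raise ValueError("distance must be positive")
    return apparent_mag - 5.0 * math.log10(distance_pc / 10.0) - extinction_mag

# Two hypothetical, intrinsically identical objects seen at different
# distances and through different amounts of dust.
M_near = absolute_magnitude(apparent_mag=10.0, distance_pc=1.0e6)
M_far = absolute_magnitude(apparent_mag=15.3, distance_pc=1.0e7,
                           extinction_mag=0.3)
print(M_near, M_far)
```

The apparent magnitudes differ by more than five magnitudes, but once corrected the two objects are described in the same way, which is what an intrinsic parameter requires.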
Parameters can be qualitative or quantitative, but they must be objective. Hmm… forget about galaxy morphology estimated by eye; use the disk-to-bulge ratio or some other photometric profile estimator. Also forget about the de Vaucouleurs T index of morphology: it is a categorical indicator, meaning that it is arbitrary and does not correspond to physical evolutionary stages.
Parameters should represent evolutionary stages. So, for galaxies, a starburst or spiral arms are certainly not good synapomorphies, since they come and go and may come again: is the presence of a starburst ancestral or derived with respect to its absence?
Quantities derived from models can better represent the physics of an object, but they generally depend on the assumptions built into those models. Observables, on the other hand, are sometimes not directly related to the evolutionary processes. There is no absolute rule here; simply keep these points in mind.
Last but not least, a very important aspect of cladistics: it allows missing values. How? By considering all possibilities for a missing value, the algorithm searches for the most parsimonious tree, which thus corresponds to a particular value of this unknown parameter, yielding a prediction for this parameter for this object. Of course, in practice, this does not work well if more than, say, 20% of the values in your dataset are missing.
The same philosophy allows cladistics to take uncertainties into account: they are treated like missing values, but with the possibilities restricted to the range given by the error bars.
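The treatment of missing and uncertain values can be sketched in a few lines of Python. This is only a toy illustration under strong simplifying assumptions: a single character, a fixed rooted binary tree, and unordered states counted with Fitch's algorithm. Real analyses search over many trees and characters, and often use ordered characters. The tree, the taxon names and the state values are hypothetical. A missing value is represented by the set of all states, an uncertain value by the subset of states compatible with its error bar.

```python
def fitch_steps(tree, tip_states):
    """Minimum number of state changes for one character on a fixed tree.

    tree: nested 2-tuples whose leaves are taxon names (strings).
    tip_states: dict mapping each taxon name to its set of allowed states.
    """
    steps = 0

    def visit(node):
        nonlocal steps
        if isinstance(node, str):          # a tip: return its allowed states
            return set(tip_states[node])
        left, right = visit(node[0]), visit(node[1])
        common = left & right
        if common:                         # children agree: no change needed
            return common
        steps += 1                         # children disagree: one change
        return left | right

    visit(tree)
    return steps

def best_values(tree, tip_states, taxon):
    """Try every allowed state for taxon and keep those needing fewest steps."""
    scores = {}
    for value in sorted(tip_states[taxon]):
        trial = dict(tip_states)
        trial[taxon] = {value}
        scores[value] = fitch_steps(tree, trial)
    fewest = min(scores.values())
    return fewest, [v for v, s in scores.items() if s == fewest]

ALL_STATES = {0, 1, 2}
tree = ((("A", "B"), "C"), "D")            # hypothetical four-taxon tree

# B has no measurement at all: every state is allowed.
missing = {"A": {0}, "B": ALL_STATES, "C": {0}, "D": {2}}
# B was measured, but its error bar spans the bins 0 and 1.
uncertain = {"A": {0}, "B": {0, 1}, "C": {0}, "D": {2}}

steps_missing, predicted_missing = best_values(tree, missing, "B")
steps_uncertain, predicted_uncertain = best_values(tree, uncertain, "B")
print(steps_missing, predicted_missing)
print(steps_uncertain, predicted_uncertain)
```

On this tree the most parsimonious choice for B is the same in both cases. The only difference is how many candidate states had to be considered.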
Fraix-Burnet, D., Chattopadhyay, T., Chattopadhyay, A.K., Davoust, E., Thuillard, M. 2012: "A six-parameter space to describe galaxy diversification", Astronomy & Astrophysics 545, A80 (http://fr.arxiv.org/abs/1206.3690)