## Code the characters

Assigning evolutionary states to characters would logically mean coding them into discrete values. This is how cladistics was designed, and it is the easiest approach to explain. It is also well suited to presence/absence attributes (like arms, bar…) or to counts such as the number of legs. But in astrophysics, nearly all variables are quantitative and continuous.

Indeed, the same problem also occurs in biology with morphometric data, and we discuss the question of cladistics with continuous data in the next post. The main point here is that most phylogenetic software uses discrete data, and several studies have investigated the best ways to bin data.

There are three questions to solve:

1. how many bins?
2. should we take equal-width bins?
3. which bin size best represents the different evolutionary groups in the sample?

Consider a concrete example. We are used to counting the number of years since birth as the age of children, so age is coded in years. However, there can be a larger difference in evolution between two children of the same age than between two children of consecutive ages born only a month or a few days apart. Coding in months is a better option, especially for very young children, but is not adequate for adults. In astrophysics, the use of a logarithmic scale somewhat alleviates this problem, but it is not necessarily pertinent for all studies.

Also, the size of a child and its evolution depend very much on the child. Should we code the size in the same way as the age?
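To make the age example and the remark on logarithmic scales more tangible, here is a minimal Python sketch. The function name `equal_width_codes` and the list of ages are invented for illustration; this is not the procedure of any particular phylogenetic software, only one plausible way to cut a character into equal-width bins, either on the raw values or on their logarithm.

```python
import math

def equal_width_codes(values, n_bins):
    """Code each value into an integer state 0..n_bins-1 using equal-width bins."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant character carries a single state.
        return [0 for _ in values]
    width = (hi - lo) / n_bins
    # The maximum value would land on index n_bins, so clamp it into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

# Hypothetical ages in years: infants, children and adults (illustrative only).
ages = [0.1, 0.5, 1.0, 5.0, 10.0, 30.0, 50.0, 80.0]

# Equal-width bins on the raw ages.
linear_codes = equal_width_codes(ages, 4)

# Equal-width bins on log10(age), i.e. bins of equal ratio in age.
log_codes = equal_width_codes([math.log10(a) for a in ages], 4)

print(linear_codes)
print(log_codes)
```

With four linear bins, every age up to 10 years ends up in the same state, so infants and ten-year-olds become indistinguishable. The logarithmic coding spreads the young ages over several states, at the price of lumping the adults together.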

As we see, there is no unique answer to any of the above questions. Nevertheless, my experience has shown that coding the continuous data of astrophysics with the maximum number of (equal) bins allowed by the software generally yields quite reasonable results. A good test I have found is to change the number of bins and see whether the result is similar. Below 10 or 15 bins, I have often found that the trees are significantly different, whereas above 20 bins they remain stable.
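The test of changing the number of bins can be prepared with a few lines of Python. The sketch below is only an assumption about how one might organise it: the names `code_character` and `code_matrix` and the small data matrix are hypothetical. It produces one coded matrix per number of bins; the trees themselves still have to be computed and compared in the phylogenetic software.

```python
def code_character(values, n_bins):
    """Equal-width coding of one continuous character into states 0..n_bins-1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0 for _ in values]
    width = (hi - lo) / n_bins
    # Clamp the maximum value into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

def code_matrix(matrix, n_bins):
    """Code every character (column) of an objects-by-characters matrix."""
    columns = list(zip(*matrix))
    coded_columns = [code_character(col, n_bins) for col in columns]
    # Transpose back so that each row is again one object.
    return [list(row) for row in zip(*coded_columns)]

# Hypothetical data: 4 objects (rows) described by 2 continuous characters (columns).
data = [
    [0.0, 10.0],
    [2.5, 12.0],
    [5.0, 18.0],
    [10.0, 20.0],
]

# One coded matrix per number of bins, each to be analysed separately.
coded = {n: code_matrix(data, n) for n in (2, 5, 10, 20, 32)}

# Crude indicator of lost resolution: how many objects remain distinguishable.
distinct = {n: len({tuple(row) for row in m}) for n, m in coded.items()}

for n in coded:
    print(n, distinct[n], coded[n])
```

The count of distinct coded rows is only a rough proxy: it shows when too few bins merge different objects into identical descriptions, but it does not replace the comparison of the resulting trees.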

There is an important point, probably more specific to (astro)physics and certainly characteristic of quantitative data: the uncertainty associated with each value. We have seen in a previous post that cladistics allows missing values and ranges of values. With uncertainties, it is easy to derive the range of possible values for all parameters and all objects. If the bins are wide enough, the uncertainty range may fit within a single bin. The smaller the number of bins, the more often this will happen. In this case, uncertainties have no influence on the results.
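As an illustration of how an uncertainty translates into a range of coded states, here is a small Python sketch. The helper names `bin_index` and `coded_range`, the symmetric error bar, and the numerical values are assumptions made for the example. How such a range is then written in the input file depends on the software used.

```python
def bin_index(value, lo, hi, n_bins):
    """Index of the equal-width bin containing value, clamped to 0..n_bins-1."""
    width = (hi - lo) / n_bins
    return max(0, min(int((value - lo) / width), n_bins - 1))

def coded_range(value, sigma, lo, hi, n_bins):
    """States covered by the interval [value - sigma, value + sigma]."""
    first = bin_index(value - sigma, lo, hi, n_bins)
    last = bin_index(value + sigma, lo, hi, n_bins)
    return list(range(first, last + 1))

# Hypothetical measurement 5.0 +/- 0.4 for a character spanning 0 to 10.
for n_bins in (5, 20, 30):
    states = coded_range(5.0, 0.4, 0.0, 10.0, n_bins)
    print(n_bins, states)
```

With 5 bins the whole error bar falls inside one state, so the uncertainty disappears from the coded matrix. With 20 or 30 bins it covers two and four states respectively, and the object has to be entered with a range of values.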

Coding introduces sharp cuts in the continuous data, and since we are always dealing with continuous processes in astrophysics, the frontiers between groups are expected to be somewhat fuzzy at some level. Conversely, coding itself introduces some fuzziness, in the sense that each bin corresponds to a range of possible values, which could be attributed to some “cosmic variance” or to some measurement uncertainty.