Maximum Parsimony is known to be NP-complete and thus cannot really be used for more than, say, 1000 taxa. I have been concerned by this problem since the beginning of astrocladistics, and it took us several years to find a solution.
The solution is to cluster the data first, that is, to make a similarity classification before any phylogenetic analysis. This is what I called pre-clustering in our new paper on the WINGS data set of about 4100 galaxies.
I can find several reasons to justify this approach. First, phylogeny is concerned with species, not individuals. Second, many galaxies in our sample are probably similar, hence redundant when reconstructing their diversification history. Third and last, similarity classification seems to be more stable than phylogenetic classification; this comes from my limited experience and also from some criticisms of the PHYLOCODE project that inspired me at the beginning of astrocladistics.
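To give a feeling for the size of the problem, here is a small sketch that counts the possible unrooted, fully resolved trees for n taxa, using the standard combinatorial result (2n-5)!!. The function name is only illustrative and this is not code from our paper. Real parsimony software uses heuristic searches and never enumerates all trees, but the count shows why the search becomes hopeless so quickly.

```python
def count_unrooted_binary_trees(n_taxa: int) -> int:
    """Number of distinct unrooted, fully resolved trees: (2n-5)!! for n >= 3."""
    if n_taxa < 3:
        return 1  # with one or two taxa there is only one possible tree
    count = 1
    # Product of the odd numbers 3 * 5 * ... * (2n - 5)
    for k in range(3, 2 * n_taxa - 4, 2):
        count *= k
    return count


if __name__ == "__main__":
    for n in (4, 5, 10, 100, 300, 1000):
        n_trees = count_unrooted_binary_trees(n)
        # Print the number of decimal digits, since the count itself is huge
        print(f"{n:5d} taxa: a tree count with {len(str(n_trees))} digits")
```

Already for ten taxa there are about two million trees, so going from 4100 galaxies to a few hundred pre-clusters does not make the problem easy, it only brings it back within reach of the usual heuristics.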
So in this work we used a hierarchical clustering technique to reduce the number of taxa from 4100 to 300. Why 300? I ran the computation with 100, 200 and 300 pre-clusters and found that 300 is a good compromise for representing the diversity of the galaxies in our sample. There are probably more objective criteria for choosing this number, but I did not investigate this point any further for this pioneering work.
Maximum Parsimony was then applied to the 300 pre-clusters.
The WINGS sample we used has about 4100 galaxies: 2674 belong to a cluster (of galaxies) and the other 1476 are considered field galaxies. We did the pre-clustering independently for the full, cluster and field sets.
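The pre-clustering idea can be sketched in a few lines of plain Python. This is only a toy illustration under assumptions of my own choosing here: Euclidean distance on standardized characters, centroid linkage, and the centroid of each group used as the new "taxon". The linkage and distance actually used in the paper are not described in this post, and the function names are hypothetical.

```python
import random


def centroid(points):
    """Mean position of a list of equal-length tuples."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def pre_cluster(points, n_groups):
    """Agglomerative clustering: merge the two closest groups until n_groups remain.

    Closeness is the distance between group centroids (an assumed linkage).
    Naive cubic-time loop, fine for a toy sample only.
    Returns a list of groups, each a list of indices into `points`.
    """
    groups = [[i] for i in range(len(points))]
    centres = [tuple(p) for p in points]
    while len(groups) > n_groups:
        best = None
        for a in range(len(groups)):
            for b in range(a + 1, len(groups)):
                d = sq_dist(centres[a], centres[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        groups[a] = groups[a] + groups[b]
        centres[a] = centroid([points[i] for i in groups[a]])
        # b > a, so deleting b does not shift the position of a
        del groups[b]
        del centres[b]
    return groups


# Toy sample: 60 fake "galaxies" described by 7 standardized characters
random.seed(0)
galaxies = [tuple(random.gauss(0.0, 1.0) for _ in range(7)) for _ in range(60)]

groups = pre_cluster(galaxies, 10)
# One representative per pre-cluster; these would be the taxa given to the MP analysis
representatives = [centroid([galaxies[i] for i in g]) for g in groups]
print(len(groups), "pre-clusters with sizes", sorted(len(g) for g in groups))
```

For a sample of several thousand objects one would use an optimized library routine rather than this loop, and repeat the run for several target numbers (100, 200 and 300 in our case) to compare the results.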
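For readers not familiar with parsimony, the quantity being minimized is the total number of character-state changes along the tree. The sketch below computes it for one given tree with the classical Fitch algorithm, which applies to unordered discrete characters. This is the simplest textbook case, chosen here for illustration; the coding of our continuous characters and the software used for the actual search are not described in this post, and the tip names and 0/1 states are invented.

```python
def fitch_score(tree, states):
    """Minimum number of changes for one discrete character on a rooted binary tree.

    tree: nested 2-tuples whose leaves are tip names (strings).
    states: dict mapping each tip name to its observed state.
    """
    def visit(node):
        if isinstance(node, str):
            return {states[node]}, 0
        left, right = node
        left_set, left_cost = visit(left)
        right_set, right_cost = visit(right)
        common = left_set & right_set
        if common:
            # The children can share a state: no change needed at this node
            return common, left_cost + right_cost
        # No shared state: one change is required on one of the two branches
        return left_set | right_set, left_cost + right_cost + 1

    return visit(tree)[1]


def tree_length(tree, characters):
    """Parsimony length of a tree: sum of Fitch scores over all characters."""
    return sum(fitch_score(tree, states) for states in characters)


# Invented example: four pre-clusters, three characters coded as 0 (low) or 1 (high)
characters = [
    {"A": 0, "B": 0, "C": 1, "D": 1},
    {"A": 0, "B": 1, "C": 0, "D": 1},
    {"A": 0, "B": 0, "C": 1, "D": 1},
]
tree_1 = (("A", "B"), ("C", "D"))
tree_2 = (("A", "C"), ("B", "D"))

print(tree_length(tree_1, characters))  # 4 changes
print(tree_length(tree_2, characters))  # 5 changes
```

Here the first tree needs fewer changes and is therefore preferred. The hard part of Maximum Parsimony is not this score but the search over all possible trees.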
The seven characters used in the MP analyses are the colour (B-V), the logarithm of effective radius (log(Re)), the Sersic index nV, the mean effective surface brightness 〈μ〉e, the equivalent width of Hβ, the Balmer discontinuity index D4000 and the logarithm of the stellar mass (log(M∗)). The trees are rooted with low values of B-V (hence blue colour), nV and stellar mass.
Below is the tree obtained with the full sample. Colours correspond to fifteen groups defined from the tree. Each tree tip is a pre-cluster. A bar at each tree tip indicates the proportion of galaxies within that pre-cluster that belong to a cluster of galaxies. As can be seen, there are no distinct evolutionary paths for cluster and field galaxies. There is also a slight increase in the proportion of cluster galaxies as we go down the tree. This may be interpreted as cluster galaxies being globally more diversified than field ones, which is easily explained by a more violent environment that affects many properties of galaxies through interactions and mergers.
Below we show the two trees for the cluster and field samples. The second one looks slightly more linear, but it is difficult to conclude that the two evolutionary scenarios are significantly distinct.
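The bars at the tree tips are nothing more than a fraction computed per pre-cluster. A minimal sketch, with invented labels and flags and a hypothetical function name, could look like this:

```python
from collections import defaultdict


def cluster_fraction(pre_cluster_ids, in_cluster_flags):
    """Fraction of cluster (as opposed to field) galaxies in each pre-cluster.

    pre_cluster_ids: pre-cluster label of each galaxy.
    in_cluster_flags: True if the galaxy belongs to a cluster of galaxies.
    """
    totals = defaultdict(int)
    members = defaultdict(int)
    for label, in_cluster in zip(pre_cluster_ids, in_cluster_flags):
        totals[label] += 1
        if in_cluster:
            members[label] += 1
    return {label: members[label] / totals[label] for label in totals}


# Invented example: nine galaxies spread over three pre-clusters
ids = [0, 0, 0, 1, 1, 2, 2, 2, 2]
flags = [True, False, True, False, False, True, True, True, False]

fractions = cluster_fraction(ids, flags)
for label in sorted(fractions):
    print(f"pre-cluster {label}: {fractions[label]:.2f} of its galaxies are in clusters")
```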
The tree can be projected onto a scatter plot that astronomers like a lot: the colour-magnitude diagram. We show below the one for the cluster sample and the one for the field sample at the same scale. The paths are very different: in both cases, galaxies evolve from the blue region (called the blue branch) to the red one but the paths are nearly opposite.
The main outcome of our work is that the strategy for tackling big samples with Maximum Parsimony in astrophysics is validated. We are now able to analyze more complete sets of galaxies, including higher redshift data, to build a broader diversification scenario.
From the astrophysical point of view, I find it interesting that galaxies from clusters and from the field, which have lived in very different environments, do not show distinct evolutionary paths, but rather seem to have diversified at different “speeds”.