The first step toward a more general classification is to distinguish the names of the species from their descriptors. This was the solution found by Linné in the first half of the 18th century to the problem of the multiplicity of names given to the same objects by different classifications. His binomial nomenclature, still in use today, is based on the hierarchical organization of living organisms that had been observed in the many classifications existing at the end of the 17th century. The idea was to give generic names that have no link to the traits of the organisms. A name could thus be a Latin word derived from the place where the first specimen was discovered, from the name of the discoverer, and so forth. The first part of the name corresponds to the genus, the second to the species. Adanson (1763) rigorously established the basic rules of taxonomy: living organisms should be given a name, a definition and a description.
More importantly, he considered that there are two methods of classification, natural and artificial. The first leads us toward the discovery of the natural organization of objects. The second is easier to carry out, but it is then the scientist who imposes an organization on Nature. As a result, the natural method is more objective, and it must therefore consider all descriptors, not only the ones chosen by the observer.
As an example of these two kinds of classification, Adanson cites the system of Copernicus. At first, it was an artificial method (we would call it a model today); subsequently it became a natural system, and even the thing itself, reality in a “pure” state.
By promoting the natural method for building the classification of living organisms, Adanson in effect invented what is now called multivariate analysis. It revolutionized biology and has given scientists work for more than two and a half centuries. A new discipline also emerged: systematics, the science of classification.
Let us now consider again the same matrix:
  | c1 | c2 | c3 |
A | 0 | 0 | 1 |
B | 0 | 1 | 0 |
C | 0 | 1 | 1 |
The simplest way to illustrate multivariate classification is a distance analysis. Using the squared Euclidean distance (which, for 0/1 descriptors, is simply the number of descriptors on which two objects differ), we have:
d(A,B)=2
d(A,C)=1
d(B,C)=1
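As a minimal sketch, the three values above can be recomputed from the matrix. The code assumes the distance is the sum of squared differences between descriptor vectors (the squared Euclidean distance); taking the square root would give √2 for d(A,B) and would not change the ordering. The names `matrix` and `squared_euclidean` are illustrative only.

```python
from itertools import combinations

# Matrix from the text: rows are objects, columns are descriptors c1..c3.
matrix = {
    "A": (0, 0, 1),
    "B": (0, 1, 0),
    "C": (0, 1, 1),
}

def squared_euclidean(u, v):
    """Sum of squared differences between two descriptor vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

# Distance for every unordered pair of objects.
distances = {
    (p, q): squared_euclidean(matrix[p], matrix[q])
    for p, q in combinations(sorted(matrix), 2)
}

for (p, q), d in distances.items():
    print(f"d({p},{q}) = {d}")
```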
We can depict these results on the following diagram:
A – C – B
or with dendrograms:
 _________                _________
|       __|__    or    __|__       |
|      |     |        |     |      |
A      C     B        A     C      B
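The reason two dendrograms are drawn can be sketched in a few lines: C is equally close to A and to B, so a procedure that merges the closest pair first has two equally valid first steps. The function names and the bracket notation for the trees are assumptions made for this illustration.

```python
# Pairwise distances quoted in the text.
distances = {
    frozenset("AB"): 2,
    frozenset("AC"): 1,
    frozenset("BC"): 1,
}

def closest_pairs(dist):
    """Return every pair that attains the smallest distance (candidate first merges)."""
    smallest = min(dist.values())
    return sorted(tuple(sorted(pair)) for pair, d in dist.items() if d == smallest)

def tree_after_merging(first_merge, objects=("A", "B", "C")):
    """Write the tree obtained when `first_merge` is grouped first, in bracket notation."""
    left, right = first_merge
    outsider = next(o for o in objects if o not in first_merge)
    return f"({outsider},({left},{right}));"

# One line per admissible dendrogram.
for pair in closest_pairs(distances):
    print(pair, "->", tree_after_merging(pair))
```

The two trees printed correspond to the two dendrograms: either C is grouped with A, or C is grouped with B, and the distances alone cannot choose between them.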
There are three classes, determined objectively. What are the relationships between these classes? To answer this question, additional information is required. Even if we assume that some evolution has taken place, we have no way to determine its origin or its direction. Four possibilities are acceptable and cannot be discriminated without further information or assumptions:
A -> C -> B or A <- C <- B or A <- C -> B or even A -> C <- B
The two diagrams on the left yield a linear representation of evolution, as illustrated by the infamous sketch showing the “evolution” of man from monkeys.
The third one is a very small rooted tree, with C as the root and branches going to A and B.
The fourth one shows a convergence. This is a very troublesome situation, in which two different objects happen to become similar through different evolutionary processes.
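The four scenarios are simply all the ways of putting an arrow on each of the two links of the chain A – C – B. A short sketch, with illustrative function names and labels, enumerates them and labels each by how many arrows point into C:

```python
from itertools import product

# Undirected chain obtained from the distance analysis: A - C - B.
edges = [("A", "C"), ("C", "B")]

def orientations(undirected_edges):
    """Yield every way of assigning a direction to each edge."""
    for flips in product((False, True), repeat=len(undirected_edges)):
        yield [(v, u) if flip else (u, v)
               for (u, v), flip in zip(undirected_edges, flips)]

def describe(directed_edges):
    """Label a scenario by the number of arrows pointing into the middle object C."""
    into_c = sum(1 for _, target in directed_edges if target == "C")
    if into_c == 2:
        return "convergence toward C"
    if into_c == 0:
        return "divergence from C (C is the root)"
    return "linear sequence"

scenarios = list(orientations(edges))
for s in scenarios:
    arrows = ", ".join(f"{u}->{v}" for u, v in s)
    print(f"{arrows}: {describe(s)}")
```

With two links there are 2 × 2 = 4 orientations: two linear sequences, one divergence and one convergence. Nothing in the distances favours one over the others.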
Astrophysics
Multivariate analyses have been used in astrophysics for quite a while, but mainly to discriminate automatically between galaxies, stars and quasars in surveys that acquire thousands and thousands of objects. This is classification in the sense of putting objects into previously established classes; it is not a clustering analysis that tries to establish an objective, multivariate and general classification.
Some interesting work has been done on the automatic characterization of high-resolution spectra (CONNOLLY ET AL., 1995; CABANAC ET AL., 2002; LU ET AL., 2006). Compared with other traditional classifications in astrophysics, this is an enormous step forward, since all the structures of the spectra are preserved and objects can be compared on this complete and objective basis. Distance analyses are thus excellent tools (ELLIS ET AL., 2005).
Very few works have used this kind of approach (WHITMORE, 1984; WATANABE ET AL., 1985). Using Principal Component Analysis, they managed to build boxes defined by linear combinations of observables. Galaxies are thus organized in a fairly unequivocal way.
Statisticians are now getting involved in this challenge. This is just the beginning, but it is very promising (CHATTOPADHYAY AND CHATTOPADHYAY, 2006).
However, no general multivariate classification of galaxies has emerged. Why? I think there are two fundamental reasons.
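To make "a linear combination of observables" concrete, here is a minimal sketch of a principal component computed for two observables. The numbers are hypothetical, made-up values used only for illustration; they do not come from the cited papers or from any catalogue, and the closed-form eigenvector below only handles this two-variable case.

```python
import math

# Hypothetical measurements of two observables for five galaxies
# (made-up values, for illustration only).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.2, 1.9, 3.2, 3.8, 5.1]

def centered(values):
    """Subtract the mean from every value."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]

def covariance(u, v):
    """Sample covariance of two already-centered lists."""
    return sum(a * b for a, b in zip(u, v)) / (len(u) - 1)

xc, yc = centered(x), centered(y)
sxx, syy, sxy = covariance(xc, xc), covariance(yc, yc), covariance(xc, yc)

# Largest eigenvalue of the covariance matrix [[sxx, sxy], [sxy, syy]].
lam1 = (sxx + syy) / 2 + math.sqrt(((sxx - syy) / 2) ** 2 + sxy ** 2)

# Matching unit eigenvector (this expression assumes sxy != 0).
w = (sxy, lam1 - sxx)
norm = math.hypot(*w)
w = (w[0] / norm, w[1] / norm)

# First principal component: a weighted sum of the centered observables.
pc1 = [w[0] * a + w[1] * b for a, b in zip(xc, yc)]

print(f"PC1 = {w[0]:.3f} * x + {w[1]:.3f} * y (centered observables)")
```

The new axis mixes both observables through the weights `w`, so a box drawn along it no longer corresponds to a single measured quantity.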
The first one is the difficulty of giving a physical meaning to a multivariate clustering. When only one or two parameters are used to define boxes, it is possible to use models to explain why galaxies are gathered in a given box and why the boxes are different. When there are many parameters, this becomes much more difficult, because there are too many variables to adjust in the models and the parameters that define the boxes might not be causally linked. It is much worse when these parameters are linear combinations of observables, such as those produced by Principal Component Analysis.
Indeed, this first reason, which has prevented astronomers from building a more objective classification, is also cultural. Let us not forget that, as physicists, we are not educated in this kind of statistics, nor trained to deal with so much data. Things must change, of course, especially for young astronomers, who will have no choice!
The second reason is that there is a missing link. Natural objects, be they living organisms or galaxies, are not arbitrarily put into static boxes. Evolution, that is, transformation, is everywhere in our Universe, so that these boxes are merely snapshots of evolutionary pathways. To understand the result of multivariate classifications, one must consider evolution as the driver of diversity, and hence as the origin of the boxes and the link between them.
Here again, this second reason might be cultural, or psychological. Astronomers have a very affective relationship with astronomical objects. Even if the names, which come from a necessary catalog nomenclature, sometimes have nothing to do with poetry, a galaxy is an entity, with its morphology (yielding very nice pictures) and its stars (which are bright), not really a step in a long sequence of transformation processes.
——————————————————————————————————————————————————
ADANSON, M., 1763. Familles des plantes. Vincent, Paris. Digital reproduction: BNF; INALF, Paris, 1961- (Frantext; R263).
CABANAC, R.A., DE LAPPARENT, A., HICKSON, P., 2002. Astronomy & Astrophysics 389, 1090–1116.
CHATTOPADHYAY, T., CHATTOPADHYAY, A., 2006. Objective classification of spiral galaxies having extended rotation curves beyond the optical radius. The Astronomical Journal 131, 2452–2468.
CONNOLLY, A.J., SZALAY, A.S., BERSHADY, M.A., KINNEY, A.L., CALZETTI, D., 1995. The Astronomical Journal 110, 1071–1082.
ELLIS, S.C., DRIVER, S.P., ALLEN, P.D., LISKE, J., BLAND-HAWTHORN, J., DE PROPRIS, R., 2005. The Millennium Galaxy Catalogue: on the natural sub-division of galaxies. Monthly Notices of the Royal Astronomical Society 363, 1257–1271. arXiv:astro-ph/0508365.
LU, H., ZHOU, H., WANG, J., WANG, T., DONG, X., ZHUANG, Z., LI, C., 2006. Ensemble learning independent component analysis of normal galaxy spectra. The Astronomical Journal 131, 790–805. arXiv:astro-ph/0510246.
WATANABE, M., KODAIRA, K., OKAMURA, S., 1985. Digital surface photometry of galaxies toward a quantitative classification. IV. Principal component analysis of surface-photometric parameters. Astrophysical Journal 292, 72–78.
WHITMORE, B.C., 1984. An objective classification system for spiral galaxies. I. The two dominant dimensions. Astrophysical Journal 278, 61–80.