Diptera wing classification using Topological Data Analysis

Guilherme Vituri F. Pinto; Sergio Ura; Northon

Abstract

We use Topological Data Analysis (TDA) to describe Diptera wing venation and classify specimens at the family level. From 70 binarized wing images representing nine families, we extract two compact persistence summaries: H1 persistence from Vietoris-Rips filtrations of point-cloud samples and H0 persistence from radial filtrations of connected wing images. These 34 topological features are evaluated with a single balanced Random Forest model using repeated stratified 3-fold cross-validation. Because the dataset is imbalanced, performance is summarized primarily with macro-F1, macro-recall, family-level recall, and a row-normalized confusion matrix. We also use a feature-reduction screen to identify a smaller candidate set of topological summaries for biological interpretation. A direct Wasserstein-distance baseline on Rips persistence diagrams is competitive with the Random Forest, suggesting that much of the taxonomic signal is already present in the Rips diagrams themselves.

1 Introduction

Diptera wing venation is a classical taxonomic character: the arrangement of veins and enclosed cells varies among families and provides a natural morphological signature. Topological Data Analysis (TDA) is well suited to this problem because persistent homology summarizes connected components and loops in a way that is less tied to exact pixel coordinates than many raw image descriptors.

We use two complementary filtrations: a Vietoris-Rips filtration on point-cloud samples of each wing, retaining H1 persistence to describe global loops; and a radial filtration on the connected binary wing image, retaining H0 persistence to describe how vein components merge from the center of the wing outward.

The statistical goal is to test whether compact topological summaries carry family-level signal and to identify which summaries are most useful for prediction. To keep the validation aligned with the small and imbalanced dataset, we evaluate one balanced Random Forest model with repeated stratified 3-fold cross-validation and report macro-F1, macro-recall, family-level recall, and the confusion matrix.

2 Methods

2.1 Data and preprocessing

All wing images are stored in images/processed. File names encode the family and specimen identifier. We standardize family names, remove duplicated files that differ only by spacing or spelling variants, blur each image slightly, crop it, and resize it to 150 pixels in height.
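The deduplication step can be illustrated with a minimal Python sketch. It is not the authors' code: the file names and the normalization rule (ignore case, accents, spaces, underscores, and hyphens) are assumptions made for illustration, and genuine spelling variants would additionally need a curated alias table that is not shown here.

```python
import re
import unicodedata

def normalize_stem(stem: str) -> str:
    """Lower-case, strip accents, and drop spaces, underscores, and hyphens."""
    text = unicodedata.normalize("NFKD", stem)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    return re.sub(r"[\s_\-]+", "", text).lower()

def deduplicate(filenames):
    """Keep one file per normalized stem (the first in sorted order)."""
    seen = {}
    for name in sorted(filenames):
        stem = name.rsplit(".", 1)[0]
        seen.setdefault(normalize_stem(stem), name)
    return sorted(seen.values())

# Hypothetical file names; the real naming scheme is not given in the text.
files = ["Tipulidae 01.png", "tipulidae_01.png", "Asilidae-2.png"]
print(deduplicate(files))
```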

Total images after deduplication: 70
Number of families: 9

Specimens per family:

Row	family	n
1	Asilidae	8
2	Bibionidae	6
3	Ceratopogonidae	8
4	Chironomidae	8
5	Rhagionidae	4
6	Sciaridae	6
7	Simuliidae	7
8	Tabanidae	11
9	Tipulidae	12

2.2 Topological feature extraction

We compute two persistence diagrams for each wing. First, we sample 750 points from the wing point cloud and compute H1 persistence from the Vietoris-Rips filtration. Second, we construct a connected binary image and compute H0 persistence from the radial filtration.

Persistent homology overview

Persistent homology tracks topological features as a filtration parameter changes. H0 records connected components; H1 records loops. Long-lived features are treated as more stable shape information than short-lived features.
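To make the H0 computation concrete, the following Python sketch computes H0 persistence pairs for a radial filtration on a tiny binary image using a union-find structure and the elder rule. It is an illustration under stated assumptions, not the authors' implementation: the filtration value of a foreground pixel is taken to be its Euclidean distance to a chosen center, pixels are 8-connected, and the function name `radial_h0` is hypothetical.

```python
import math

def radial_h0(image, center):
    """H0 pairs (birth, death) of the sublevel filtration by distance to center."""
    cy, cx = center
    # Filtration value of each foreground pixel: distance to the center.
    radius = {
        (r, c): math.hypot(r - cy, c - cx)
        for r, row in enumerate(image)
        for c, value in enumerate(row)
        if value
    }
    parent, birth = {}, {}

    def find(p):
        while parent[p] != p:
            parent[p] = parent[parent[p]]  # path halving
            p = parent[p]
        return p

    pairs = []
    for p in sorted(radius, key=radius.get):
        parent[p], birth[p] = p, radius[p]
        r, c = p
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                q = (r + dr, c + dc)
                if q == p or q not in parent:
                    continue
                a, b = find(p), find(q)
                if a == b:
                    continue
                if birth[a] < birth[b]:
                    a, b = b, a  # a is now the younger component
                # Elder rule: the younger component dies at the current radius.
                if birth[a] < radius[p]:
                    pairs.append((birth[a], radius[p]))
                parent[a] = b  # the older root survives
    # Components that never merge get an infinite death value.
    pairs.extend((birth[root], math.inf) for root in {find(p) for p in parent})
    return sorted(pairs)

# A U-shaped toy "vein": the right arm is born at radius 2 and merges later.
toy = [
    [1, 0, 1],
    [1, 0, 1],
    [1, 1, 1],
]
print(radial_h0(toy, center=(0, 0)))
```

In this toy image the right arm appears as a separate component at radius 2 and merges with the component containing the center at radius √5, which produces one finite pair alongside the infinite pair of the surviving component.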

The plots below show three representative wings, their radial filtration, and the largest persistence values from each retained diagram.

2.3 Summary statistics

We compute 19 summary statistics for each persistence diagram and retain 17 of them, excluding skewness and kurtosis because these tail-sensitive summaries are less reliable for small persistence diagrams. With two diagrams per specimen, this gives 2 × 17 = 34 features.

Retained statistics per diagram: 17
Feature matrix: 70 samples x 34 features
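As a rough illustration of how a diagram becomes a fixed-length feature vector, the Python sketch below computes a handful of summaries from a list of (birth, death) pairs. The statistic names follow the feature names reported in Section 3.2, but the exact definitions used by the authors are not given in the text, so the formulas here (linearly interpolated quantiles of persistence, persistent entropy, midlife as the birth-death midpoint) are assumptions.

```python
import math
import statistics

def quantile(values, q):
    """Linearly interpolated quantile of a non-empty list, with 0 <= q <= 1."""
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)
    lo, hi = math.floor(pos), math.ceil(pos)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)

def diagram_summary(pairs):
    """Summaries of the finite points of a persistence diagram."""
    finite = [(b, d) for b, d in pairs if math.isfinite(d)]
    if not finite:
        raise ValueError("diagram has no finite points")
    pers = [d - b for b, d in finite]
    total = sum(pers)
    # Persistent entropy: Shannon entropy of the normalized persistence values.
    probs = [p / total for p in pers if p > 0]
    return {
        "count": len(finite),
        "total_pers": total,
        "max_pers": max(pers),
        "median": statistics.median(pers),
        "q25": quantile(pers, 0.25),
        "q75": quantile(pers, 0.75),
        "entropy": -sum(p * math.log(p) for p in probs),
        "median_birth": statistics.median(b for b, _ in finite),
        "mean_midlife": statistics.fmean((b + d) / 2 for b, d in finite),
    }

# Toy diagram; the infinite point is ignored by the summaries.
example = [(0.0, 1.0), (0.0, 3.0), (1.0, 2.0), (2.0, math.inf)]
summary = diagram_summary(example)
print(summary["count"], summary["total_pers"], summary["q75"])
```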

3 Classification

We use one classifier: a balanced Random Forest. In each training fold, minority families are upsampled by bootstrap resampling to match the largest family in that fold. Hyperparameters are fixed in advance, so no additional tuning loop is used. The validation procedure is repeated stratified 3-fold cross-validation with 30 repeats.
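The split-and-balance logic described above can be sketched in plain Python as follows. This is an assumption-laden illustration rather than the authors' function: the helper names are hypothetical, folds are filled round-robin within each family, and upsampling keeps every original training specimen and adds bootstrap draws until each family matches the largest one in the fold.

```python
import random
from collections import defaultdict

def stratified_folds(labels, k, rng):
    """Assign sample indices to k folds so each family is spread evenly."""
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    folds = [[] for _ in range(k)]
    offset = 0
    for y in sorted(by_class):
        members = by_class[y]
        rng.shuffle(members)
        for j, i in enumerate(members):
            folds[(offset + j) % k].append(i)
        offset += len(members)  # rotate the start so fold sizes stay even
    return folds

def upsample_to_largest(train_idx, labels, rng):
    """Add bootstrap draws so every family matches the largest one."""
    by_class = defaultdict(list)
    for i in train_idx:
        by_class[labels[i]].append(i)
    target = max(len(members) for members in by_class.values())
    balanced = []
    for y in sorted(by_class):
        members = by_class[y]
        balanced.extend(members)
        balanced.extend(rng.choices(members, k=target - len(members)))
    return balanced

# Family counts from Section 2.1.
counts = {"Asilidae": 8, "Bibionidae": 6, "Ceratopogonidae": 8,
          "Chironomidae": 8, "Rhagionidae": 4, "Sciaridae": 6,
          "Simuliidae": 7, "Tabanidae": 11, "Tipulidae": 12}
labels = [family for family, n in counts.items() for _ in range(n)]

rng = random.Random(0)
folds = stratified_folds(labels, 3, rng)
train = [i for fold in folds[1:] for i in fold]  # folds[0] is held out
balanced = upsample_to_largest(train, labels, rng)
print([len(fold) for fold in folds], len(balanced))
```

Balancing is applied to the training indices only, so the held-out fold keeps its original family proportions.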

Cross-validation summary (mean and standard deviation, in percent):

Row	metric	mean_percent	sd_percent
1	Macro-F1	65.6	4.8
2	Macro-recall	66.9	4.6

The pooled out-of-fold predictions combine all held-out predictions across the 30 repeats, giving 70 × 30 = 2,100 predictions in total. Pooling gives a more stable diagnostic view of which families are recovered consistently than any single split.

Pooled out-of-fold metrics (percent):

Row	metric	value_percent
1	Pooled accuracy	67.7
2	Pooled macro-F1	65.9
3	Pooled macro-recall	66.9

Per-family results from the pooled predictions, sorted by recall (percent):

Row	family	original_n	repeated_support	recall_percent	f1_percent
1	Asilidae	8	240	39.6	42.9
2	Chironomidae	8	240	46.2	59.4
3	Rhagionidae	4	120	55.0	50.8
4	Ceratopogonidae	8	240	55.8	58.5
5	Tabanidae	11	330	67.3	65.0
6	Sciaridae	6	180	76.7	65.2
7	Bibionidae	6	180	82.2	81.3
8	Tipulidae	12	360	87.2	85.1
9	Simuliidae	7	210	92.4	84.9
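The relation between the per-family table and the pooled metrics can be checked with a few lines of Python. The sketch uses only the values copied from the table above: macro-recall and macro-F1 are unweighted means over the nine families, whereas pooled accuracy equals the support-weighted mean of the family recalls.

```python
# (family, repeated_support, recall_percent, f1_percent), copied from the table above
rows = [
    ("Asilidae", 240, 39.6, 42.9),
    ("Chironomidae", 240, 46.2, 59.4),
    ("Rhagionidae", 120, 55.0, 50.8),
    ("Ceratopogonidae", 240, 55.8, 58.5),
    ("Tabanidae", 330, 67.3, 65.0),
    ("Sciaridae", 180, 76.7, 65.2),
    ("Bibionidae", 180, 82.2, 81.3),
    ("Tipulidae", 360, 87.2, 85.1),
    ("Simuliidae", 210, 92.4, 84.9),
]

total_support = sum(n for _, n, _, _ in rows)
# Macro averages give every family the same weight, whatever its size.
macro_recall = sum(r for _, _, r, _ in rows) / len(rows)
macro_f1 = sum(f for _, _, _, f in rows) / len(rows)
# Accuracy weights each family's recall by its number of held-out predictions.
pooled_accuracy = sum(n * r for _, n, r, _ in rows) / total_support

print(total_support, round(macro_recall, 1), round(macro_f1, 1), round(pooled_accuracy, 1))
# prints: 2100 66.9 65.9 67.7
```

The three values agree with the pooled metrics reported above. Pooled accuracy sits slightly above macro-recall because the two largest families, Tipulidae and Tabanidae, are recovered at or above the average rate.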

3.1 Confusion matrix

The confusion matrix below is row-normalized, so each row sums to one and can be read as the distribution of predicted families for a given true family.
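Row normalization can be sketched as follows. The function name and the toy labels are hypothetical and stand in for the pooled out-of-fold predictions; the point is only that each row of counts is divided by its row total, so the diagonal entries coincide with per-family recall.

```python
def row_normalized_confusion(y_true, y_pred, classes):
    """Confusion matrix whose rows (true classes) each sum to one."""
    index = {c: i for i, c in enumerate(classes)}
    counts = [[0] * len(classes) for _ in classes]
    for t, p in zip(y_true, y_pred):
        counts[index[t]][index[p]] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        # A class with no true samples keeps an all-zero row.
        matrix.append([v / total if total else 0.0 for v in row])
    return matrix

# Toy labels for illustration only.
y_true = ["A", "A", "A", "B", "B"]
y_pred = ["A", "A", "B", "B", "A"]
matrix = row_normalized_confusion(y_true, y_pred, ["A", "B"])
print(matrix)
```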

3.2 Feature importance

Feature importance is descriptive. It is computed from balanced bootstrap trees fit on the full feature matrix, so it should be read as a guide to which topological summaries the Random Forest uses, not as an additional estimate of held-out performance.

Top 15 features by importance, scaled so that the top feature equals 1:

Row	feature	importance
1	Radial_H0__q75	1.0
2	Radial_H0__median	0.896696
3	Radial_H0__q10	0.856581
4	Rips_H1__entropy	0.837236
5	Radial_H0__std_birth	0.831594
6	Radial_H0__q25	0.817996
7	Rips_H1__median	0.777934
8	Rips_H1__median_birth	0.682759
9	Rips_H1__total_pers	0.614264
10	Rips_H1__mean_midlife	0.501807
11	Rips_H1__max_pers	0.477192
12	Rips_H1__q75	0.473651
13	Rips_H1__std_death	0.466785
14	Radial_H0__count	0.446309
15	Rips_H1__median_death	0.416641

3.3 Essential feature reduction

To identify a compact feature set for biological interpretation, we use the Random Forest importance ranking as a reduction path. We evaluate nested subsets of the top-ranked features and select the smallest subset whose screening performance remains within one percentage point of the full 34-feature model for both pooled accuracy and pooled macro-F1. The selected subset is then re-evaluated with the full repeated stratified 3-fold CV configuration. The one-point criterion applies to the screening run only; in the re-evaluation reported below, the reduced set trails the full model by 1.5 points in pooled accuracy and 0.5 points in pooled macro-F1.

This procedure is intended to rank features for interpretation. It should not be treated as an independent performance estimate because the feature ranking is learned from the same dataset.
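The selection rule can be written as a short loop over nested prefixes of the ranking. The Python sketch below is illustrative only: `score_fn` is a hypothetical stand-in for the screening cross-validation, "within one percentage point" is read as a drop of at most one point relative to the full model, and the toy scoring function is invented so that the example runs.

```python
def smallest_adequate_subset(ranked_features, score_fn, tolerance=1.0):
    """Smallest top-k prefix whose metrics all stay within tolerance of the full set."""
    full = score_fn(ranked_features)
    for k in range(1, len(ranked_features) + 1):
        subset = ranked_features[:k]
        scores = score_fn(subset)
        if all(full[m] - scores[m] <= tolerance for m in full):
            return subset, scores
    return ranked_features, full

def toy_scores(subset):
    """Invented saturating scores; a real run would cross-validate the subset."""
    k = len(subset)
    return {"accuracy": 60.0 + min(k, 6), "macro_f1": 58.0 + min(k, 8)}

ranked = [f"feature_{i}" for i in range(1, 11)]  # hypothetical ranking
subset, scores = smallest_adequate_subset(ranked, toy_scores)
print(len(subset), scores)
```

Requiring every metric to pass means the slowest-saturating metric decides the subset size; in the toy example that is macro-F1.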

Full versus reduced feature set (percent):

Row	feature_set	n_features	accuracy_percent	macro_f1_percent	macro_recall_percent
1	All features	34	67.7	65.9	66.9
2	Essential feature set	11	66.2	65.4	65.9

Essential feature set, ranked by relative importance:

Row	rank	block	statistic	relative_importance
1	1	Radial_H0	q75	1.0
2	2	Radial_H0	median	0.897
3	3	Radial_H0	q10	0.857
4	4	Rips_H1	entropy	0.837
5	5	Radial_H0	std_birth	0.832
6	6	Radial_H0	q25	0.818
7	7	Rips_H1	median	0.778
8	8	Rips_H1	median_birth	0.683
9	9	Rips_H1	total_pers	0.614
10	10	Rips_H1	mean_midlife	0.502
11	11	Rips_H1	max_pers	0.477

4 Rips Wasserstein distance baseline

Earlier versions of this analysis used a direct distance-matrix approach: compute pairwise Wasserstein distances between the Rips H1 persistence diagrams, then classify by nearest neighbours. We revisit that idea here and add an average-linkage dendrogram. This is a deliberately “pure metric space” baseline because it uses the persistence diagrams only through a single pairwise distance matrix, without extracting interpretable summary features or fitting a flexible classifier.

The Wasserstein constructor has two relevant choices: the Wasserstein order and the ground norm used to match points in the birth-death plane. The previous experiments used the default ground norm, so we compare those settings with Euclidean ground-norm variants.
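To show what the two choices mean, here is a brute-force Python sketch of the order-q Wasserstein distance between two small diagrams, with a selectable ground norm p. It is not the library routine used in the analysis: it enumerates all matchings, which is feasible only for a handful of points, whereas practical implementations solve an assignment problem. Each diagram is augmented with diagonal slots so that any point may be matched to the diagonal.

```python
import itertools
import math

def ground_distance(u, v, p):
    """p-norm distance between two points of the birth-death plane."""
    dx, dy = abs(u[0] - v[0]), abs(u[1] - v[1])
    if math.isinf(p):
        return max(dx, dy)
    return (dx ** p + dy ** p) ** (1.0 / p)

def wasserstein(diagram_x, diagram_y, q=1, p=math.inf):
    """Order-q Wasserstein distance by exhaustive matching (tiny diagrams only)."""
    def to_diagonal(pt):
        mid = (pt[0] + pt[1]) / 2.0
        return ground_distance(pt, (mid, mid), p)

    n, m = len(diagram_x), len(diagram_y)
    size = n + m  # n points plus m diagonal slots, against m points plus n slots
    cost = [[0.0] * size for _ in range(size)]
    for i in range(size):
        for j in range(size):
            if i < n and j < m:
                cost[i][j] = ground_distance(diagram_x[i], diagram_y[j], p) ** q
            elif i < n:
                cost[i][j] = to_diagonal(diagram_x[i]) ** q
            elif j < m:
                cost[i][j] = to_diagonal(diagram_y[j]) ** q
            # Diagonal-to-diagonal matches cost nothing.
    best = min(
        sum(cost[i][perm[i]] for i in range(size))
        for perm in itertools.permutations(range(size))
    )
    return best ** (1.0 / q)

print(wasserstein([(0, 4)], [(0, 2)], q=1, p=math.inf))  # prints 2.0
print(wasserstein([(0, 4)], [], q=2, p=2))
```

In the first call, matching the two points directly costs 2, while sending both to the diagonal costs 2 + 1 = 3, so the direct matching is optimal.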

The dendrogram below uses average linkage on the W1 distance with the default ground norm. It should be read as an exploratory visualization of the diagram geometry, not as a supervised classifier.
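For readers unfamiliar with average linkage, the merge rule can be sketched naively in Python. The function below is a hypothetical illustration operating on any symmetric distance matrix, not the clustering routine used for the figure: at each step it merges the two clusters with the smallest mean pairwise distance and records the merge height.

```python
def average_linkage(dist):
    """Merge history [(cluster_a, cluster_b, height), ...] for average linkage."""
    clusters = [[i] for i in range(len(dist))]
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                pairs = [(i, j) for i in clusters[a] for j in clusters[b]]
                # Average linkage: mean distance over all cross-cluster pairs.
                height = sum(dist[i][j] for i, j in pairs) / len(pairs)
                if best is None or height < best[0]:
                    best = (height, a, b)
        height, a, b = best
        merges.append((sorted(clusters[a]), sorted(clusters[b]), height))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return merges

# Toy 3 x 3 distance matrix for illustration.
toy_dist = [
    [0.0, 1.0, 4.0],
    [1.0, 0.0, 6.0],
    [4.0, 6.0, 0.0],
]
print(average_linkage(toy_dist))
```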

To make the comparison with the Random Forest more explicit, we evaluate several distance-based classifiers with the same repeated stratified 3-fold split design used above. The classifier choices are intentionally simple: unweighted k-NN, inverse-distance weighted k-NN, and nearest-family average distance. The table below lists the k-NN variants (1-NN, 3-NN, and weighted 3-NN) for each combination of Wasserstein order (W1, W2) and ground norm (L2, Linf).
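A k-NN classifier on a precomputed distance matrix needs only one row of distances per test specimen. The Python sketch below is a hypothetical illustration of the unweighted and inverse-distance weighted variants; the tie-breaking rule (prefer the class of the closest neighbour among the tied classes) and the small constant that guards against division by zero are assumptions.

```python
from collections import defaultdict

def knn_predict(dist_row, train_idx, labels, k=3, weighted=False):
    """Predict a label from distances between one test item and the training items."""
    neighbours = sorted(train_idx, key=lambda i: dist_row[i])[:k]
    votes = defaultdict(float)
    for i in neighbours:
        # Inverse-distance weighting gives closer neighbours a larger vote.
        weight = 1.0 / (dist_row[i] + 1e-12) if weighted else 1.0
        votes[labels[i]] += weight
    top = max(votes.values())
    for i in neighbours:  # neighbours are ordered from closest to farthest
        if votes[labels[i]] == top:
            return labels[i]

# Toy example: item 0 is the test specimen, items 1 to 4 are training specimens.
labels = ["A", "B", "B", "A", "A"]
dist_row = [0.0, 1.0, 1.2, 5.0, 0.3]
train_idx = [1, 2, 3, 4]
print(knn_predict(dist_row, train_idx, labels, k=3))                 # prints B
print(knn_predict(dist_row, train_idx, labels, k=3, weighted=True))  # prints A
print(knn_predict(dist_row, train_idx, labels, k=1))                 # prints A
```

In this toy case the two variants disagree because one very close neighbour of class A outweighs two more distant neighbours of class B once votes are weighted.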

Wasserstein baselines, sorted by macro-F1 (percent):

Row	distance	classifier	accuracy_percent	macro_f1_percent	macro_recall_percent
1	W1_Linf	3-NN	66.8	63.1	64.0
2	W1_Linf	3-NN weighted	66.8	63.1	64.0
3	W2_L2	3-NN	66.1	62.9	63.3
4	W2_L2	3-NN weighted	66.1	62.9	63.3
5	W2_L2	1-NN	65.4	62.5	62.5
6	W1_L2	3-NN	65.6	61.4	62.8
7	W1_L2	3-NN weighted	65.6	61.4	62.8
8	W2_Linf	3-NN	65.6	61.2	62.2
9	W2_Linf	3-NN weighted	65.6	61.2	62.2
10	W2_Linf	1-NN	62.1	61.1	61.2
11	W1_Linf	1-NN	63.8	61.1	61.5
12	W1_L2	1-NN	63.8	61.0	61.7

Best Wasserstein baseline versus the balanced Random Forest (percent):

Row	method	representation	classifier	accuracy_percent	macro_f1_percent	macro_recall_percent
1	Best Rips Wasserstein baseline	Rips H1 diagrams as a distance matrix	3-NN	66.8	63.1	64.0
2	Balanced Random Forest	Rips H1 + radial H0 summary features	Random Forest	67.7	65.9	66.9

The direct Wasserstein approach is more than a weak diagnostic baseline. Its best result is close to the feature-based Random Forest: accuracy differs by about one percentage point (66.8 versus 67.7), while macro-F1 and macro-recall each differ by about three percentage points (63.1 versus 65.9, and 64.0 versus 66.9). Given the small sample size, this gap should be interpreted cautiously rather than as clear evidence that the Random Forest is decisively superior.

This result suggests that the Rips H1 persistence diagrams already contain substantial family-level signal. The Wasserstein pipeline is also conceptually simple: it keeps the diagrams as diagrams, compares them with an intrinsic distance, and uses nearest-neighbour classification. Its simplicity is mainly statistical and methodological, however, not necessarily computational, because Wasserstein distances require solving optimal matching problems between diagrams.

The feature-based Random Forest remains useful, but for a more modest reason. It combines Rips H1 summaries with radial H0 summaries, can use nonlinear interactions among summary statistics, and gives feature-importance diagnostics for biological interpretation. The present results therefore do not show that a purely metric-space approach is poor. They show that Wasserstein distance is a strong baseline, and that feature extraction plus Random Forest offers a small performance gain together with better interpretability and flexibility.

5 Discussion

These results suggest that compact topological summaries of wing venation contain family-level signal. The two retained filtrations capture different information: Vietoris-Rips H1 summarizes global loop structure in the vein network, while radial H0 summarizes how connected vein components organize from the center of the wing outward. The direct Wasserstein baseline strengthens this conclusion: even without radial features or feature extraction, the Rips diagrams alone support competitive classification.

The comparison between Wasserstein distance and Random Forest should be read as a tradeoff, not as a decisive ranking. The Random Forest gives a modest improvement in the current validation, but the gap is small for a dataset of this size. Wasserstein distance therefore remains an important baseline and a useful indication that the persistence diagrams themselves carry taxonomic structure. The advantage of the feature-based Random Forest is that it can combine complementary filtrations and provide interpretable feature rankings, not that it overwhelmingly outperforms the metric-space approach.

The validation design is intentionally conservative for the current dataset. Because the dataset contains only 70 specimens and the family counts are uneven, the primary summaries are macro-F1 and macro-recall rather than overall accuracy. Repeating the stratified 3-fold split 30 times reduces dependence on a single partition and gives a clearer view of which families are stable or ambiguous.

The feature-reduction results provide a candidate set of topological summaries for biological follow-up. Features retained in the essential set should be inspected against wing venation traits, including loop structure, the number of persistent components, and the scale at which vein components merge under the radial filtration.

The main practical limitation is still sample size. Seven of the nine families have fewer than ten specimens, and Rhagionidae has only four, so family-level recall should be interpreted as provisional. Image quality, binarization, and connectivity correction also affect the persistence diagrams. Follow-up work should add taxonomic context, literature references, and a biological interpretation of the retained topological summaries.

Citation

BibTeX citation:

@online{vituri_f._pinto2026,
  author = {Vituri F. Pinto, Guilherme and Ura, Sergio and {Northon}},
  title = {Diptera Wing Classification Using {Topological} {Data}
    {Analysis}},
  date = {2026-04-27},
  langid = {en},
  abstract = {We use Topological Data Analysis (TDA) to describe Diptera
    wing venation and classify specimens at the family level. From 70
    binarized wing images representing nine families, we extract two
    compact persistence summaries: H1 persistence from Vietoris-Rips
    filtrations of point-cloud samples and H0 persistence from radial
    filtrations of connected wing images. These 34 topological features
    are evaluated with a single balanced Random Forest model using
    repeated stratified 3-fold cross-validation. Because the dataset is
    imbalanced, performance is summarized primarily with macro-F1,
    macro-recall, family-level recall, and a row-normalized confusion
    matrix. We also use a feature-reduction screen to identify a smaller
    candidate set of topological summaries for biological
    interpretation. A direct Wasserstein-distance baseline on Rips
    persistence diagrams is competitive with the Random Forest,
    suggesting that much of the taxonomic signal is already present in
    the Rips diagrams themselves.}
}

For attribution, please cite this work as:

Vituri F. Pinto, Guilherme, Sergio Ura, and Northon. 2026. “Diptera Wing Classification Using Topological Data Analysis.” Earth and Space Science, April 27.