Diptera wing classification using Topological Data Analysis

Authors: Guilherme Vituri F. Pinto (Universidade Estadual Paulista), Sergio Ura, Northon

Published: April 27, 2026

Abstract

We use Topological Data Analysis (TDA) to describe Diptera wing venation and classify specimens at the family level. From 70 binarized wing images representing nine families, we extract two persistence diagrams per wing: H1 persistence from Vietoris-Rips filtrations of point-cloud samples and H0 persistence from radial filtrations of connected wing images. Each diagram is condensed into 17 summary statistics, and the resulting 34 topological features are evaluated with a single balanced Random Forest model using repeated stratified 3-fold cross-validation. Because the dataset is imbalanced, performance is summarized primarily with macro-F1, macro-recall, family-level recall, and a row-normalized confusion matrix. We also use a feature-reduction screen to identify a smaller candidate set of topological summaries for biological interpretation. A direct Wasserstein-distance baseline on Rips persistence diagrams is competitive with the Random Forest, suggesting that much of the taxonomic signal is already present in the Rips diagrams themselves.

Keywords

Topological Data Analysis, Persistent homology, Diptera classification, Wing venation

1 Introduction

Diptera wing venation is a classical taxonomic character: the arrangement of veins and enclosed cells varies among families and provides a natural morphological signature. Topological Data Analysis (TDA) is well suited to this problem because persistent homology summarizes connected components and loops in a way that is less tied to exact pixel coordinates than many raw image descriptors.

We use two complementary filtrations: a Vietoris-Rips filtration on point-cloud samples of each wing, retaining H1 persistence to describe global loops; and a radial filtration on the connected binary wing image, retaining H0 persistence to describe how vein components merge from the center of the wing outward.

The statistical goal is to test whether compact topological summaries carry family-level signal and to identify which summaries are most useful for prediction. To keep the validation aligned with the small and imbalanced dataset, we evaluate one balanced Random Forest model with repeated stratified 3-fold cross-validation and report macro-F1, macro-recall, family-level recall, and the confusion matrix.

2 Methods

2.1 Data and preprocessing

All wing images are stored in images/processed. File names encode the family and specimen identifier. We standardize family names, remove duplicated files that differ only by spacing or spelling variants, blur each image slightly, crop it, and resize it to 150 pixels in height.
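
The name standardization and deduplication step can be illustrated with a minimal Python sketch. The file-name pattern (`<family><separator><id>.<ext>`), the alias table, and the function names are all hypothetical, chosen only to show the idea; the actual naming scheme in images/processed is not specified here.

```python
import re

# Hypothetical alias table: maps a misspelled family name to its canonical form.
FAMILY_ALIASES = {"tabanidea": "tabanidae"}

def parse_wing_filename(name):
    """Split a name such as 'Tabanidae 03.png' into (family, specimen_id).

    The '<family><separator><id>.<ext>' pattern is an assumption for illustration.
    """
    stem = name.rsplit(".", 1)[0]
    match = re.match(r"^\s*([A-Za-z]+)[\s_\-]*(\d+)\s*$", stem)
    if match is None:
        raise ValueError(f"unrecognized file name: {name!r}")
    family = match.group(1).lower()
    family = FAMILY_ALIASES.get(family, family).capitalize()
    return family, int(match.group(2))

def deduplicate(names):
    """Keep the first file seen for each (family, specimen_id) key."""
    kept = {}
    for name in names:
        kept.setdefault(parse_wing_filename(name), name)
    return kept

# Three spellings of the same specimen collapse to one entry.
files = ["Tabanidae 03.png", "tabanidae_3.png", "Tabanidea-03.png", "Asilidae 1.png"]
unique = deduplicate(files)
```

Keying on the normalized (family, specimen) pair rather than on the raw string is what makes spacing and spelling variants collide.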

Total images after deduplication: 70
Number of families: 9

9×2 DataFrame
 Row │ family           n
     │ String           Int64
─────┼────────────────────────
   1 │ Asilidae         8
   2 │ Bibionidae       6
   3 │ Ceratopogonidae  8
   4 │ Chironomidae     8
   5 │ Rhagionidae      4
   6 │ Sciaridae        6
   7 │ Simuliidae       7
   8 │ Tabanidae        11
   9 │ Tipulidae        12

2.2 Topological feature extraction

We compute two persistence diagrams for each wing. First, we sample 750 points from the wing point cloud and compute H1 persistence from the Vietoris-Rips filtration. Second, we construct a connected binary image and compute H0 persistence from the radial filtration.
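
As a rough illustration of the point-cloud step, the Python sketch below draws a fixed number of foreground pixel coordinates from a binary image. Uniform random sampling without replacement is an assumption; the sampling scheme used for the 750 points is not stated, and `sample_point_cloud` is a hypothetical helper name.

```python
import random

def sample_point_cloud(image, n_points=750, seed=0):
    """Return up to n_points (x, y) coordinates of foreground pixels.

    image is a list of rows of 0/1 values, with 1 marking vein pixels.
    Uniform sampling without replacement is an assumption of this sketch.
    """
    coords = [(x, y)
              for y, row in enumerate(image)
              for x, value in enumerate(row) if value]
    if len(coords) <= n_points:
        return coords
    return random.Random(seed).sample(coords, n_points)

# Toy 3x4 image with seven foreground pixels.
toy = [
    [0, 1, 1, 0],
    [1, 1, 0, 0],
    [0, 1, 1, 1],
]
points = sample_point_cloud(toy, n_points=5, seed=1)
```

The sampled coordinates are what a Vietoris-Rips computation would take as input; fixing the seed keeps the diagram reproducible.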

Note: Persistent homology overview

Persistent homology tracks topological features as a filtration parameter changes. H0 records connected components; H1 records loops. Long-lived features are treated as more stable shape information than short-lived features.
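
To make the H0 part concrete, here is a minimal Python sketch of H0 persistence for a radial filtration, computed with a union-find structure and the elder rule. It assumes that each foreground pixel enters the filtration at its Euclidean distance from a chosen center and that pixels are 8-connected; both choices, and the function name, are illustrative assumptions rather than the notebook's implementation.

```python
import math

def radial_h0_persistence(image, center):
    """H0 persistence pairs of the sublevel filtration f(pixel) = distance to center.

    image: list of rows of 0/1 values; center: (x, y). Uses 8-connectivity.
    Returns sorted (birth, death) pairs; the oldest component has death = inf.
    """
    cx, cy = center
    value = {(x, y): math.hypot(x - cx, y - cy)
             for y, row in enumerate(image)
             for x, v in enumerate(row) if v}
    parent, birth, pairs = {}, {}, []

    def find(p):
        while parent[p] != p:
            parent[p] = parent[parent[p]]  # path halving
            p = parent[p]
        return p

    # Add pixels in order of increasing filtration value.
    for p in sorted(value, key=value.get):
        parent[p] = p
        birth[p] = value[p]
        x, y = p
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                q = (x + dx, y + dy)
                if q == p or q not in parent:
                    continue
                rp, rq = find(p), find(q)
                if rp == rq:
                    continue
                # Elder rule: the younger component dies at the current value.
                young, old = (rp, rq) if birth[rp] > birth[rq] else (rq, rp)
                if birth[young] < value[p]:  # skip zero-persistence pairs
                    pairs.append((birth[young], value[p]))
                parent[young] = old
    pairs.extend((birth[r], math.inf) for r in {find(p) for p in parent})
    return sorted(pairs)

# A U-shaped toy image: the right arm is born at distance 2 from the
# top-left center and merges with the left arm at distance sqrt(5).
u_shape = [
    [1, 0, 1],
    [1, 0, 1],
    [1, 1, 1],
]
h0_pairs = radial_h0_persistence(u_shape, center=(0, 0))
```

In the toy example the long-lived pair is the component containing the center, and the finite pair records the scale at which the second arm joins it, which is the kind of merge event the radial H0 diagram summarizes.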

The plots below show three representative wings, their radial filtration, and the largest persistence values from each retained diagram.

2.3 Summary statistics

Each persistence diagram is converted into 19 summary statistics. We retain 17 statistics per diagram and exclude skewness and kurtosis because these tail-sensitive summaries are less reliable for small persistence diagrams. This gives 34 features per specimen.

Retained statistics per diagram: 17
Feature matrix: 70 samples x 34 features
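
The sketch below shows, in Python, how a persistence diagram might be condensed into a few of the summaries named in the feature tables (count, total_pers, max_pers, median, q25, q75, entropy, median_birth, mean_midlife). The exact definitions are assumptions: quantiles are taken over persistence values with linear interpolation, and entropy is the Shannon entropy of normalized persistences. Only a subset of the 17 retained statistics is shown.

```python
import math
import statistics

def quantile(values, q):
    """Linear-interpolation quantile of a non-empty list, 0 <= q <= 1."""
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)
    lo, hi = math.floor(pos), math.ceil(pos)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)

def diagram_summary(pairs):
    """Summaries of the finite (birth, death) pairs of one diagram."""
    finite = [(b, d) for b, d in pairs if math.isfinite(d)]
    if not finite:
        return {"count": 0}
    pers = [d - b for b, d in finite]
    total = sum(pers)
    # Normalized persistences; only positive values contribute to entropy.
    probs = [p / total for p in pers if p > 0]
    return {
        "count": len(finite),
        "total_pers": total,
        "max_pers": max(pers),
        "median": statistics.median(pers),
        "q25": quantile(pers, 0.25),
        "q75": quantile(pers, 0.75),
        "entropy": -sum(p * math.log(p) for p in probs),
        "median_birth": statistics.median([b for b, _ in finite]),
        "mean_midlife": statistics.fmean([(b + d) / 2 for b, d in finite]),
    }

# Toy diagram with three finite pairs and one essential class.
toy_diagram = [(0.0, 1.0), (0.0, 3.0), (1.0, 2.0), (0.0, math.inf)]
summary = diagram_summary(toy_diagram)
```

Applying such a function to the Rips H1 diagram and to the radial H0 diagram, then concatenating the two result vectors, yields one fixed-length feature row per specimen.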

3 Classification

We use one classifier: a balanced Random Forest. In each training fold, minority families are upsampled by bootstrap resampling to match the largest family in that fold. Hyperparameters are fixed in advance, so no additional tuning loop is used. The validation procedure is repeated stratified 3-fold cross-validation with 30 repeats.
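
The split-and-balance logic can be sketched in plain Python as follows. This is an illustration under assumptions, not the notebook code: the fold assignment is a simple per-family round-robin, and the upsampling keeps every original training specimen and adds bootstrap draws until each family matches the largest one in the fold. The function names are hypothetical; the family counts are those of the data table.

```python
import random
from collections import defaultdict

def stratified_folds(labels, n_folds, rng):
    """Assign each sample index to a fold, spreading every class across folds."""
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)
    fold_of = [0] * len(labels)
    offset = 0
    for label in sorted(by_class):
        members = by_class[label]
        rng.shuffle(members)
        for j, i in enumerate(members):
            fold_of[i] = (j + offset) % n_folds
        offset += len(members)  # rotate the starting fold to even out fold sizes
    return fold_of

def balanced_upsample(train_idx, labels, rng):
    """Bootstrap-resample minority classes up to the size of the largest class."""
    by_class = defaultdict(list)
    for i in train_idx:
        by_class[labels[i]].append(i)
    target = max(len(members) for members in by_class.values())
    out = []
    for label in sorted(by_class):
        members = by_class[label]
        out.extend(members)
        out.extend(rng.choices(members, k=target - len(members)))
    return out

counts = {"Asilidae": 8, "Bibionidae": 6, "Ceratopogonidae": 8, "Chironomidae": 8,
          "Rhagionidae": 4, "Sciaridae": 6, "Simuliidae": 7, "Tabanidae": 11,
          "Tipulidae": 12}
labels = [family for family, n in counts.items() for _ in range(n)]

rng = random.Random(42)
fold_of = stratified_folds(labels, n_folds=3, rng=rng)
train_idx = [i for i, fold in enumerate(fold_of) if fold != 0]
balanced_idx = balanced_upsample(train_idx, labels, rng)
```

Resampling only inside the training fold matters: held-out specimens are never duplicated into the training set, so the test predictions stay out-of-fold. Repeating the whole procedure with a fresh shuffle gives the repeated design.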

2×3 DataFrame
 Row │ metric        mean_percent  sd_percent
     │ String        Float64       Float64
─────┼────────────────────────────────────────
   1 │ Macro-F1      65.6          4.8
   2 │ Macro-recall  66.9          4.6

The pooled out-of-fold predictions combine all held-out predictions across the 30 repeats. This gives a stable diagnostic view of which families are recovered consistently.

3×2 DataFrame
 Row │ metric               value_percent
     │ String               Float64
─────┼────────────────────────────────────
   1 │ Pooled accuracy      67.7
   2 │ Pooled macro-F1      65.9
   3 │ Pooled macro-recall  66.9

9×5 DataFrame
 Row │ family           original_n  repeated_support  recall_percent  f1_percent
     │ String           Int64       Int64             Float64         Float64
─────┼───────────────────────────────────────────────────────────────────────────
   1 │ Asilidae         8           240               39.6            42.9
   2 │ Chironomidae     8           240               46.2            59.4
   3 │ Rhagionidae      4           120               55.0            50.8
   4 │ Ceratopogonidae  8           240               55.8            58.5
   5 │ Tabanidae        11          330               67.3            65.0
   6 │ Sciaridae        6           180               76.7            65.2
   7 │ Bibionidae       6           180               82.2            81.3
   8 │ Tipulidae        12          360               87.2            85.1
   9 │ Simuliidae       7           210               92.4            84.9
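
For readers less familiar with macro-averaged metrics, the following Python sketch computes per-class precision, recall, and F1 and then averages them with equal weight per family. The helper names and the toy labels are illustrative; the last lines recompute the macro summaries from the per-family columns of the table above.

```python
def per_class_scores(y_true, y_pred):
    """Return {label: (precision, recall, f1)}, using 0.0 for undefined ratios."""
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        scores[label] = (precision, recall, f1)
    return scores

def macro_average(scores, index):
    """Unweighted mean over classes of one score (0=precision, 1=recall, 2=F1)."""
    return sum(s[index] for s in scores.values()) / len(scores)

# Toy imbalanced example: accuracy is 4/6, but each class counts equally below.
y_true = ["A", "A", "A", "A", "B", "B"]
y_pred = ["A", "A", "A", "B", "B", "A"]
scores = per_class_scores(y_true, y_pred)
macro_recall = macro_average(scores, 1)
macro_f1 = macro_average(scores, 2)

# Unweighted means of the per-family columns reported in the table.
recall_percent = [39.6, 46.2, 55.0, 55.8, 67.3, 76.7, 82.2, 87.2, 92.4]
f1_percent = [42.9, 59.4, 50.8, 58.5, 65.0, 65.2, 81.3, 85.1, 84.9]
table_macro_recall = round(sum(recall_percent) / len(recall_percent), 1)
table_macro_f1 = round(sum(f1_percent) / len(f1_percent), 1)
```

The unweighted means of the two table columns reproduce the pooled macro-recall (66.9) and pooled macro-F1 (65.9), which is a useful consistency check: a small family such as Rhagionidae influences these summaries as much as Tipulidae does.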

3.1 Confusion matrix

The confusion matrix below is row-normalized, so each row sums to one and can be read as the distribution of predicted families for a given true family.
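
Row normalization itself is a one-line operation, sketched here in Python with made-up counts for three classes (the numbers are purely illustrative and are not taken from the analysis).

```python
def row_normalize(counts):
    """Divide each row of a count matrix by its row sum (rows = true class)."""
    normalized = []
    for row in counts:
        total = sum(row)
        normalized.append([c / total if total else 0.0 for c in row])
    return normalized

# Illustrative counts only: rows are true classes, columns are predictions.
toy_counts = [
    [8, 1, 1],
    [2, 6, 2],
    [0, 5, 15],
]
toy_rates = row_normalize(toy_counts)
diagonal_recall = [toy_rates[i][i] for i in range(len(toy_rates))]
```

After normalization the diagonal entry of each row equals the recall of that class, so the matrix can be compared directly with the family-level recall column even when families have different numbers of specimens.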

3.2 Feature importance

Feature importance is descriptive. It is computed from balanced bootstrap trees fit on the full feature matrix, so it should be read as a guide to which topological summaries the Random Forest uses, not as an additional estimate of held-out performance.

15×2 DataFrame
 Row │ feature                importance
     │ String                 Float64
─────┼───────────────────────────────────
   1 │ Radial_H0__q75         1.0
   2 │ Radial_H0__median      0.896696
   3 │ Radial_H0__q10         0.856581
   4 │ Rips_H1__entropy       0.837236
   5 │ Radial_H0__std_birth   0.831594
   6 │ Radial_H0__q25         0.817996
   7 │ Rips_H1__median        0.777934
   8 │ Rips_H1__median_birth  0.682759
   9 │ Rips_H1__total_pers    0.614264
  10 │ Rips_H1__mean_midlife  0.501807
  11 │ Rips_H1__max_pers      0.477192
  12 │ Rips_H1__q75           0.473651
  13 │ Rips_H1__std_death     0.466785
  14 │ Radial_H0__count       0.446309
  15 │ Rips_H1__median_death  0.416641

3.3 Essential feature reduction

To identify a compact feature set for biological interpretation, we use the Random Forest importance ranking as a reduction path. We evaluate nested subsets of the top-ranked features and select the smallest subset whose screening performance remains within one percentage point of the full 34-feature model for both pooled accuracy and pooled macro-F1. The selected subset is then re-evaluated with the full repeated stratified 3-fold CV configuration.

This procedure is intended to rank features for interpretation. It should not be treated as an independent performance estimate because the feature ranking is learned from the same dataset.
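
The selection rule can be written down compactly. The Python sketch below assumes the screening results are available as a mapping from subset size to (accuracy, macro-F1) in percent; the function name and the screening numbers are hypothetical and serve only to show the one-percentage-point rule.

```python
def smallest_adequate_subset(screen, full_size, tolerance=1.0):
    """Return the smallest subset size whose accuracy and macro-F1 are both
    within `tolerance` percentage points of the full model.

    screen maps subset size k -> (accuracy_percent, macro_f1_percent).
    """
    full_acc, full_f1 = screen[full_size]
    for k in sorted(screen):
        acc, f1 = screen[k]
        if full_acc - acc <= tolerance and full_f1 - f1 <= tolerance:
            return k
    return full_size

# Illustrative screening values only (not results from this study).
toy_screen = {
    2: (50.0, 45.0),
    4: (59.5, 58.0),   # accuracy is close enough, macro-F1 is not
    6: (60.0, 59.2),   # both metrics within one point of the full model
    10: (60.4, 60.0),  # full model
}
chosen_size = smallest_adequate_subset(toy_screen, full_size=10)
```

Requiring both metrics to pass guards against a subset that preserves accuracy by favouring the larger families while losing macro-F1.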

2×5 DataFrame
 Row │ feature_set            n_features  accuracy_percent  macro_f1_percent  macro_recall_percent
     │ String                 Int64       Float64           Float64           Float64
─────┼─────────────────────────────────────────────────────────────────────────────────────────────
   1 │ All features           34          67.7              65.9              66.9
   2 │ Essential feature set  11          66.2              65.4              65.9

11×4 DataFrame
 Row │ rank   block      statistic     relative_importance
     │ Int64  SubStrin…  SubStrin…     Float64
─────┼─────────────────────────────────────────────────────
   1 │ 1      Radial_H0  q75           1.0
   2 │ 2      Radial_H0  median        0.897
   3 │ 3      Radial_H0  q10           0.857
   4 │ 4      Rips_H1    entropy       0.837
   5 │ 5      Radial_H0  std_birth     0.832
   6 │ 6      Radial_H0  q25           0.818
   7 │ 7      Rips_H1    median        0.778
   8 │ 8      Rips_H1    median_birth  0.683
   9 │ 9      Rips_H1    total_pers    0.614
  10 │ 10     Rips_H1    mean_midlife  0.502
  11 │ 11     Rips_H1    max_pers      0.477

4 Rips Wasserstein distance baseline

Earlier versions of this analysis used a direct distance-matrix approach: compute pairwise Wasserstein distances between the Rips H1 persistence diagrams, then classify by nearest neighbours. We revisit that idea here and add an average-linkage dendrogram. This is a deliberately “pure metric space” baseline because it uses the persistence diagrams only through a single pairwise distance matrix, without extracting interpretable summary features or fitting a flexible classifier.

The Wasserstein constructor has two relevant choices: the Wasserstein order and the ground norm used to match points in the birth-death plane. The previous experiments used the default ground norm, so we compare those settings with Euclidean ground-norm variants.
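
To show what the two choices mean, here is a small Python sketch of the Wasserstein distance between two persistence diagrams, with the order and the ground norm as explicit parameters. It is a brute-force illustration that enumerates all matchings, so it is only usable for diagrams with a handful of points; practical libraries solve the same problem with an assignment algorithm. The function names and the `"inf"` / `"l2"` labels are hypothetical.

```python
import itertools
import math

def ground_distance(p, q, norm):
    """Distance in the birth-death plane under the L-infinity or L2 norm."""
    dx, dy = abs(p[0] - q[0]), abs(p[1] - q[1])
    return max(dx, dy) if norm == "inf" else math.hypot(dx, dy)

def diagonal_projection(p):
    """Closest point to p on the diagonal birth = death."""
    mid = (p[0] + p[1]) / 2
    return (mid, mid)

def wasserstein(diag_a, diag_b, order, norm):
    """Brute-force Wasserstein distance between two small diagrams of finite
    (birth, death) pairs; unmatched points are sent to the diagonal."""
    n, m = len(diag_a), len(diag_b)
    size = n + m
    # Rows: points of A, then m diagonal slots.
    # Columns: points of B, then n diagonal slots.
    cost = [[0.0] * size for _ in range(size)]
    for i in range(size):
        for j in range(size):
            if i < n and j < m:
                d = ground_distance(diag_a[i], diag_b[j], norm)
            elif i < n:
                d = ground_distance(diag_a[i], diagonal_projection(diag_a[i]), norm)
            elif j < m:
                d = ground_distance(diag_b[j], diagonal_projection(diag_b[j]), norm)
            else:
                d = 0.0  # diagonal matched to diagonal costs nothing
            cost[i][j] = d ** order
    best = min(sum(cost[i][perm[i]] for i in range(size))
               for perm in itertools.permutations(range(size)))
    return best ** (1 / order)

diagram_a = [(0.0, 4.0), (1.0, 1.5)]
diagram_b = [(0.0, 2.0)]
w1_inf = wasserstein(diagram_a, diagram_b, order=1, norm="inf")
w2_l2 = wasserstein(diagram_a, diagram_b, order=2, norm="l2")
```

In this toy case the long bar of the first diagram is matched to the single bar of the second, and the short bar is matched to the diagonal. The order controls how strongly large mismatches dominate the sum, while the ground norm changes the cost of each individual match, including the cost of discarding a point.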

The dendrogram below uses average linkage on the W1 distance with the default ground norm. It should be read as an exploratory visualization of the diagram geometry, not as a supervised classifier.
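
Average linkage can be sketched directly on a precomputed distance matrix. The naive Python version below recomputes mean inter-cluster distances at every step, which is adequate for a matrix of 70 diagrams but is meant as an illustration rather than as the routine used to draw the dendrogram; the function name and the toy distances are hypothetical.

```python
def average_linkage(dist):
    """Naive average-linkage clustering on a symmetric distance matrix.

    Returns the merge history as (cluster_a, cluster_b, height) tuples,
    where clusters are tuples of original indices.
    """
    clusters = [(i,) for i in range(len(dist))]
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                pair_sum = sum(dist[i][j] for i in clusters[a] for j in clusters[b])
                height = pair_sum / (len(clusters[a]) * len(clusters[b]))
                if best is None or height < best[0]:
                    best = (height, a, b)
        height, a, b = best
        merges.append((clusters[a], clusters[b], height))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return merges

# Toy distances between three diagrams.
toy_dist = [
    [0.0, 1.0, 4.0],
    [1.0, 0.0, 5.0],
    [4.0, 5.0, 0.0],
]
merges = average_linkage(toy_dist)
```

The merge heights are the values plotted on the dendrogram axis. Because the procedure never sees family labels, any grouping by family that appears in the tree reflects the geometry of the diagrams alone.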

To make the comparison with the Random Forest more explicit, we evaluate several distance-based classifiers with the same repeated stratified 3-fold split design used above. The classifier choices are intentionally simple: unweighted k-NN, inverse-distance weighted k-NN, and nearest-family average distance.
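
The three classifier types can be expressed as short functions of one row of the distance matrix. The Python sketch below is illustrative: the tie-breaking rule (prefer the class whose nearest voter is closest), the small constant added before inverting distances, and the function names are assumptions, not details taken from the analysis.

```python
from collections import defaultdict

def knn_predict(dist_row, train_idx, labels, k=3, weighted=False):
    """Vote among the k nearest training samples, optionally weighting by 1/distance."""
    nearest = sorted(train_idx, key=lambda j: dist_row[j])[:k]
    votes = defaultdict(float)
    closest = {}
    for j in nearest:
        votes[labels[j]] += 1.0 / (dist_row[j] + 1e-12) if weighted else 1.0
        closest.setdefault(labels[j], dist_row[j])
    # Ties go to the class whose nearest voter is closest (an assumption).
    return max(votes, key=lambda c: (votes[c], -closest[c]))

def nearest_family_predict(dist_row, train_idx, labels):
    """Predict the family with the smallest mean distance to the query."""
    sums, counts = defaultdict(float), defaultdict(int)
    for j in train_idx:
        sums[labels[j]] += dist_row[j]
        counts[labels[j]] += 1
    return min(sums, key=lambda c: sums[c] / counts[c])

# Distances from a query (index 0) to four training diagrams (indices 1 to 4).
dist_row = [0.0, 0.2, 0.5, 0.6, 1.0]
toy_labels = ["?", "A", "B", "B", "A"]
train = [1, 2, 3, 4]

pred_1nn = knn_predict(dist_row, train, toy_labels, k=1)
pred_3nn = knn_predict(dist_row, train, toy_labels, k=3)
pred_3nn_weighted = knn_predict(dist_row, train, toy_labels, k=3, weighted=True)
pred_family = nearest_family_predict(dist_row, train, toy_labels)
```

In this toy row the unweighted 3-NN vote and the nearest-family rule both pick B, while 1-NN and the inverse-distance vote pick A because the single closest diagram dominates. Restricting `train_idx` to the training fold is what keeps the evaluation comparable with the Random Forest split design.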

12×5 DataFrame
 Row │ distance  classifier     accuracy_percent  macro_f1_percent  macro_recall_percent
     │ String    String         Float64           Float64           Float64
─────┼───────────────────────────────────────────────────────────────────────────────────
   1 │ W1_Linf   3-NN           66.8              63.1              64.0
   2 │ W1_Linf   3-NN weighted  66.8              63.1              64.0
   3 │ W2_L2     3-NN           66.1              62.9              63.3
   4 │ W2_L2     3-NN weighted  66.1              62.9              63.3
   5 │ W2_L2     1-NN           65.4              62.5              62.5
   6 │ W1_L2     3-NN           65.6              61.4              62.8
   7 │ W1_L2     3-NN weighted  65.6              61.4              62.8
   8 │ W2_Linf   3-NN           65.6              61.2              62.2
   9 │ W2_Linf   3-NN weighted  65.6              61.2              62.2
  10 │ W2_Linf   1-NN           62.1              61.1              61.2
  11 │ W1_Linf   1-NN           63.8              61.1              61.5
  12 │ W1_L2     1-NN           63.8              61.0              61.7

2×6 DataFrame
 Row │ method                          representation                          classifier     accuracy_percent  macro_f1_percent  macro_recall_percent
     │ String                          String                                  String         Float64           Float64           Float64
─────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │ Best Rips Wasserstein baseline  Rips H1 diagrams as a distance matrix   3-NN           66.8              63.1              64.0
   2 │ Balanced Random Forest          Rips H1 + radial H0 summary features    Random Forest  67.7              65.9              66.9

The direct Wasserstein approach is more than a weak diagnostic baseline. Its best result (3-NN on the W1_Linf distance) is close to the feature-based Random Forest: accuracy differs by 0.9 percentage points (66.8 versus 67.7), while macro-F1 and macro-recall differ by 2.8 and 2.9 percentage points (63.1 versus 65.9, and 64.0 versus 66.9). Given the small sample size, this gap should be interpreted cautiously rather than as clear evidence that the Random Forest is decisively superior.

This result suggests that the Rips H1 persistence diagrams already contain substantial family-level signal. The Wasserstein pipeline is also conceptually simple: it keeps the diagrams as diagrams, compares them with an intrinsic distance, and uses nearest-neighbour classification. Its simplicity is mainly statistical and methodological, however, not necessarily computational, because Wasserstein distances require solving optimal matching problems between diagrams.

The feature-based Random Forest remains useful, but for a more modest reason. It combines Rips H1 summaries with radial H0 summaries, can use nonlinear interactions among summary statistics, and gives feature-importance diagnostics for biological interpretation. The present results therefore do not show that a purely metric-space approach is poor. They show that Wasserstein distance is a strong baseline, and that feature extraction plus Random Forest offers a small performance gain together with better interpretability and flexibility.

5 Discussion

These results suggest that compact topological summaries of wing venation contain family-level signal. The two retained filtrations capture different information: Vietoris-Rips H1 summarizes global loop structure in the vein network, while radial H0 summarizes how connected vein components organize from the center of the wing outward. The direct Wasserstein baseline strengthens this conclusion: even without radial features or feature extraction, the Rips diagrams alone support competitive classification.

The comparison between Wasserstein distance and Random Forest should be read as a tradeoff, not as a decisive ranking. The Random Forest gives a modest improvement in the current validation, but the gap is small for a dataset of this size. Wasserstein distance therefore remains an important baseline and a useful indication that the persistence diagrams themselves carry taxonomic structure. The advantage of the feature-based Random Forest is that it can combine complementary filtrations and provide interpretable feature rankings, not that it overwhelmingly outperforms the metric-space approach.

The validation design is intentionally conservative for the current dataset. Because the dataset contains only 70 specimens and the family counts are uneven, the primary summaries are macro-F1 and macro-recall rather than overall accuracy. Repeating the stratified 3-fold split 30 times reduces dependence on a single partition and gives a clearer view of which families are stable or ambiguous.

The feature-reduction results provide a candidate set of topological summaries for biological follow-up. Features retained in the essential set should be inspected against wing venation traits, including loop structure, the number of persistent components, and the scale at which vein components merge under the radial filtration.

The main practical limitation is still sample size. Seven of the nine families have fewer than ten specimens, and Rhagionidae has only four, so family-level recall should be interpreted as provisional. Image quality, binarization, and connectivity correction also affect the persistence diagrams. Follow-up work should add taxonomic context, literature references, and a biological interpretation of the retained topological summaries.

Citation

BibTeX citation:
@online{vituri_f._pinto2026,
  author = {Vituri F. Pinto, Guilherme and Ura, Sergio and Northon},
  title = {Diptera Wing Classification Using {Topological} {Data}
    {Analysis}},
  date = {2026-04-27},
  langid = {en},
  abstract = {We use Topological Data Analysis (TDA) to describe Diptera
    wing venation and classify specimens at the family level. From 70
    binarized wing images representing nine families, we extract two
    compact persistence summaries: H1 persistence from Vietoris-Rips
    filtrations of point-cloud samples and H0 persistence from radial
    filtrations of connected wing images. These 34 topological features
    are evaluated with a single balanced Random Forest model using
    repeated stratified 3-fold cross-validation. Because the dataset is
    imbalanced, performance is summarized primarily with macro-F1,
    macro-recall, family-level recall, and a row-normalized confusion
    matrix. We also use a feature-reduction screen to identify a smaller
    candidate set of topological summaries for biological
    interpretation. A direct Wasserstein-distance baseline on Rips
    persistence diagrams is competitive with the Random Forest,
    suggesting that much of the taxonomic signal is already present in
    the Rips diagrams themselves.}
}
For attribution, please cite this work as:
Vituri F. Pinto, Guilherme, Sergio Ura, and Northon. 2026. “Diptera Wing Classification Using Topological Data Analysis.” Earth and Space Science, April 27.