Diptera wing venation is a classical taxonomic character: the arrangement of veins and enclosed cells varies among families and provides a natural morphological signature. Topological Data Analysis (TDA) is well suited to this problem because persistent homology summarizes connected components and loops in a way that is less tied to exact pixel coordinates than many raw image descriptors.
We use two complementary filtrations: a Vietoris-Rips filtration on point-cloud samples of each wing, retaining H1 persistence to describe global loops; and a radial filtration on the connected binary wing image, retaining H0 persistence to describe how vein components merge from the center of the wing outward.
The statistical goal is to test whether compact topological summaries carry family-level signal and to identify which summaries are most useful for prediction. To keep the validation aligned with the small and imbalanced dataset, we evaluate one balanced Random Forest model with repeated stratified 3-fold cross-validation and report macro-F1, macro-recall, family-level recall, and the confusion matrix.
All wing images are stored in `images/processed`. File names encode the family and specimen identifier. We standardize family names, remove duplicated files that differ only by spacing or spelling variants, blur each image slightly, crop it, and resize it to 150 pixels in height.
Total images after deduplication: 70
Number of families: 9
| Row | family | n |
|---|---|---|
| 1 | Asilidae | 8 |
| 2 | Bibionidae | 6 |
| 3 | Ceratopogonidae | 8 |
| 4 | Chironomidae | 8 |
| 5 | Rhagionidae | 4 |
| 6 | Sciaridae | 6 |
| 7 | Simuliidae | 7 |
| 8 | Tabanidae | 11 |
| 9 | Tipulidae | 12 |
We compute two persistence diagrams for each wing. First, we sample 750 points from the wing point cloud and compute H1 persistence from the Vietoris-Rips filtration. Second, we construct a connected binary image and compute H0 persistence from the radial filtration.
Persistent homology tracks topological features as a filtration parameter changes. H0 records connected components; H1 records loops. Long-lived features are treated as more stable shape information than short-lived features.
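To make the H0 bookkeeping concrete, the following is a minimal Python sketch of a radial filtration on a binary image, using only the standard library. It is not the code used in this analysis. The function name `radial_h0_persistence`, the use of 8-connectivity, and the choice of center are assumptions made for illustration. Foreground pixels enter in order of distance from the center, a union-find structure tracks connected components, and when two components merge the younger one dies (the elder rule).

```python
import math


def radial_h0_persistence(pixels, center):
    """H0 persistence of foreground pixels entering by distance from `center`.

    pixels: iterable of (row, col) foreground coordinates.
    Returns (birth, death) pairs; surviving components have death = inf.
    Assumption: pixels are neighbours under 8-connectivity.
    """
    pixels = set(pixels)
    radius = {p: math.dist(p, center) for p in pixels}
    parent = {}  # union-find forest over the pixels added so far
    birth = {}   # birth radius of each component, keyed by its root

    def find(p):
        while parent[p] != p:
            parent[p] = parent[parent[p]]  # path halving
            p = parent[p]
        return p

    diagram = []
    for p in sorted(pixels, key=lambda q: (radius[q], q)):
        parent[p] = p
        birth[p] = radius[p]
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                q = (p[0] + dr, p[1] + dc)
                if q == p or q not in parent:
                    continue
                a, b = find(p), find(q)
                if a == b:
                    continue
                # Elder rule: the component born later dies at this radius.
                old, young = (a, b) if birth[a] <= birth[b] else (b, a)
                if radius[p] > birth[young]:  # skip zero-persistence pairs
                    diagram.append((birth[young], radius[p]))
                parent[young] = old
    roots = {find(p) for p in pixels}
    diagram.extend((birth[r], math.inf) for r in roots)
    return diagram


# Two horizontal strokes joined by one pixel at their far end.
strokes = [(0, 0), (0, 1), (0, 2), (0, 3),
           (2, 0), (2, 1), (2, 2), (2, 3), (1, 3)]
print(radial_h0_persistence(strokes, center=(0, 0)))
```

In this toy example the second stroke is born as a separate component when the growing disk first reaches it and dies when the joining pixel enters, which is the kind of merge event the radial H0 diagram records.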
The plots below show three representative wings, their radial filtration, and the largest persistence values from each retained diagram.
Each persistence diagram is converted into 19 summary statistics. We retain 17 statistics per diagram and exclude skewness and kurtosis because these tail-sensitive summaries are less reliable for small persistence diagrams. This gives 34 features per specimen.
Retained statistics per diagram: 17
Feature matrix: 70 samples x 34 features
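The conversion from a diagram to summary statistics can be sketched in Python as follows. This is an illustration under stated assumptions, not the feature code used here. The suffixes (`q75`, `entropy`, `mean_midlife`, and so on) are borrowed from the feature names reported in the importance table. The exact definitions are assumptions: the quantiles are taken over lifetimes, entropy is the Shannon entropy of normalized lifetimes, and only a subset of the 17 retained statistics is shown. The sketch also assumes at least one finite pair with positive lifetime.

```python
import math
import statistics


def quantile(values, q):
    """Linear-interpolation quantile of a non-empty list, with 0 <= q <= 1."""
    xs = sorted(values)
    pos = q * (len(xs) - 1)
    lo, hi = math.floor(pos), math.ceil(pos)
    return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)


def diagram_summaries(diagram, prefix):
    """A few summary statistics of the finite (birth, death) pairs."""
    finite = [(b, d) for b, d in diagram if math.isfinite(d)]
    pers = [d - b for b, d in finite]  # lifetimes
    total = sum(pers)
    # Persistence entropy: Shannon entropy of lifetimes normalized to sum to 1.
    entropy = -sum((p / total) * math.log(p / total) for p in pers if p > 0)
    return {
        f"{prefix}__count": len(finite),
        f"{prefix}__total_pers": total,
        f"{prefix}__max_pers": max(pers),
        f"{prefix}__median": statistics.median(pers),
        f"{prefix}__q25": quantile(pers, 0.25),
        f"{prefix}__q75": quantile(pers, 0.75),
        f"{prefix}__entropy": entropy,
        f"{prefix}__mean_midlife": statistics.fmean((b + d) / 2 for b, d in finite),
        f"{prefix}__std_birth": statistics.pstdev(b for b, _ in finite),
    }


# Toy diagram; the infinite bar is dropped before summarizing.
toy = [(0.0, 1.0), (0.0, 1.0), (1.0, 3.0), (0.0, math.inf)]
features = diagram_summaries(toy, "Radial_H0")
```

Concatenating the dictionaries for the two diagrams of one specimen gives one row of the feature matrix.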
We use one classifier: a balanced Random Forest. In each training fold, minority families are upsampled by bootstrap resampling to match the largest family in that fold. Hyperparameters are fixed in advance, so no additional tuning loop is used. The validation procedure is repeated stratified 3-fold cross-validation with 30 repeats.
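The split-and-resample logic described above can be sketched in Python with the standard library. This is a hedged illustration rather than the function used in this analysis. The helper names are hypothetical, the classifier itself is omitted, and one detail is an assumption: each minority family keeps its original training specimens and adds bootstrap draws until it matches the largest family in the fold.

```python
import random
from collections import defaultdict


def stratified_folds(labels, n_folds, rng):
    """Spread the indices of every class as evenly as possible over the folds."""
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    folds = [[] for _ in range(n_folds)]
    offset = 0  # rotate the starting fold so fold sizes stay balanced
    for y in sorted(by_class):
        idx = by_class[y]
        rng.shuffle(idx)
        for j, i in enumerate(idx):
            folds[(offset + j) % n_folds].append(i)
        offset = (offset + len(idx)) % n_folds
    return folds


def upsample_to_largest(train_idx, labels, rng):
    """Bootstrap-resample minority classes up to the largest class size."""
    by_class = defaultdict(list)
    for i in train_idx:
        by_class[labels[i]].append(i)
    target = max(len(idx) for idx in by_class.values())
    balanced = []
    for y in sorted(by_class):
        idx = by_class[y]
        balanced.extend(idx)
        balanced.extend(rng.choices(idx, k=target - len(idx)))
    return balanced


def repeated_balanced_splits(labels, n_folds=3, n_repeats=30, seed=0):
    """Yield (balanced_train_idx, test_idx) for every fold of every repeat."""
    rng = random.Random(seed)
    for _ in range(n_repeats):
        folds = stratified_folds(labels, n_folds, rng)
        for k, test_idx in enumerate(folds):
            train_idx = [i for j, fold in enumerate(folds) if j != k for i in fold]
            yield upsample_to_largest(train_idx, labels, rng), test_idx
```

Resampling happens only inside the training indices, so held-out specimens never appear among the upsampled rows of their own fold.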
| Row | metric | mean_percent | sd_percent |
|---|---|---|---|
| 1 | Macro-F1 | 65.6 | 4.8 |
| 2 | Macro-recall | 66.9 | 4.6 |
The pooled out-of-fold predictions combine all held-out predictions across the 30 repeats. This gives a stable diagnostic view of which families are recovered consistently.
| Row | metric | value_percent |
|---|---|---|
| 1 | Pooled accuracy | 67.7 |
| 2 | Pooled macro-F1 | 65.9 |
| 3 | Pooled macro-recall | 66.9 |
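For readers less familiar with macro-averaged metrics, the sketch below shows in Python how per-family recall and F1 are averaged with equal weight per family, in contrast to accuracy, which weights every specimen equally. The function names and the toy labels are hypothetical and serve only as illustration.

```python
def per_class_scores(y_true, y_pred):
    """Precision, recall and F1 for every class present in y_true."""
    scores = {}
    for c in sorted(set(y_true)):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        recall = tp / (tp + fn)
        precision = tp / (tp + fp) if tp + fp else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[c] = {"precision": precision, "recall": recall, "f1": f1}
    return scores


def macro_average(scores, key):
    """Unweighted mean of one metric over classes."""
    return sum(s[key] for s in scores.values()) / len(scores)


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy labels: class A has four specimens, class B has two.
y_true = ["A", "A", "A", "A", "B", "B"]
y_pred = ["A", "A", "A", "B", "B", "A"]
scores = per_class_scores(y_true, y_pred)
```

In the toy example the larger class is recovered better than the smaller one, so accuracy exceeds macro-recall. The same effect is why macro metrics are preferred for an imbalanced dataset.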
| Row | family | original_n | repeated_support | recall_percent | f1_percent |
|---|---|---|---|---|---|
| 1 | Asilidae | 8 | 240 | 39.6 | 42.9 |
| 2 | Chironomidae | 8 | 240 | 46.2 | 59.4 |
| 3 | Rhagionidae | 4 | 120 | 55.0 | 50.8 |
| 4 | Ceratopogonidae | 8 | 240 | 55.8 | 58.5 |
| 5 | Tabanidae | 11 | 330 | 67.3 | 65.0 |
| 6 | Sciaridae | 6 | 180 | 76.7 | 65.2 |
| 7 | Bibionidae | 6 | 180 | 82.2 | 81.3 |
| 8 | Tipulidae | 12 | 360 | 87.2 | 85.1 |
| 9 | Simuliidae | 7 | 210 | 92.4 | 84.9 |
The confusion matrix below is row-normalized, so each row sums to one and can be read as the distribution of predicted families for a given true family.
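Row normalization can be sketched in a few lines of Python. The function name and the toy labels are hypothetical. Each row of counts is divided by its total, so the diagonal entry of a row equals the recall of that family.

```python
def row_normalized_confusion(y_true, y_pred, classes):
    """Confusion matrix with rows = true class, each row scaled to sum to one."""
    index = {c: i for i, c in enumerate(classes)}
    counts = [[0] * len(classes) for _ in classes]
    for t, p in zip(y_true, y_pred):
        counts[index[t]][index[p]] += 1
    normalized = []
    for row in counts:
        total = sum(row)
        # A class with no true specimens gets an all-zero row.
        normalized.append([v / total if total else 0.0 for v in row])
    return normalized


toy_true = ["A", "A", "A", "A", "B", "B"]
toy_pred = ["A", "A", "A", "B", "B", "A"]
cm = row_normalized_confusion(toy_true, toy_pred, classes=["A", "B"])
```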
Feature importance is descriptive. It is computed from balanced bootstrap trees fit on the full feature matrix, so it should be read as a guide to which topological summaries the Random Forest uses, not as an additional estimate of held-out performance.
| Row | feature | importance |
|---|---|---|
| 1 | Radial_H0__q75 | 1.0 |
| 2 | Radial_H0__median | 0.896696 |
| 3 | Radial_H0__q10 | 0.856581 |
| 4 | Rips_H1__entropy | 0.837236 |
| 5 | Radial_H0__std_birth | 0.831594 |
| 6 | Radial_H0__q25 | 0.817996 |
| 7 | Rips_H1__median | 0.777934 |
| 8 | Rips_H1__median_birth | 0.682759 |
| 9 | Rips_H1__total_pers | 0.614264 |
| 10 | Rips_H1__mean_midlife | 0.501807 |
| 11 | Rips_H1__max_pers | 0.477192 |
| 12 | Rips_H1__q75 | 0.473651 |
| 13 | Rips_H1__std_death | 0.466785 |
| 14 | Radial_H0__count | 0.446309 |
| 15 | Rips_H1__median_death | 0.416641 |
To identify a compact feature set for biological interpretation, we use the Random Forest importance ranking as a reduction path. We evaluate nested subsets of the top-ranked features and select the smallest subset whose screening performance remains within one percentage point of the full 34-feature model for both pooled accuracy and pooled macro-F1. The selected subset is then re-evaluated with the full repeated stratified 3-fold CV configuration.
This procedure is intended to rank features for interpretation. It should not be treated as an independent performance estimate because the feature ranking is learned from the same dataset.
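The selection rule can be written as a short Python sketch. The function name is hypothetical, and "within one percentage point" is interpreted here as "no more than one point below the full model", which is an assumption. The screening scores in the example are invented placeholders for illustration only and are not results from this study.

```python
def smallest_subset_within_tolerance(screen, full_size, tolerance=1.0):
    """Pick the smallest nested subset that stays close to the full model.

    screen: dict mapping n_features -> (accuracy_percent, macro_f1_percent).
    Returns the smallest n_features whose accuracy and macro-F1 are both at
    most `tolerance` percentage points below the full-model values.
    """
    full_acc, full_f1 = screen[full_size]
    for k in sorted(screen):
        acc, f1 = screen[k]
        if acc >= full_acc - tolerance and f1 >= full_f1 - tolerance:
            return k
    return full_size


# Hypothetical screening scores (NOT from this study), for illustration only.
hypothetical_screen = {
    3: (55.0, 52.0),
    5: (61.0, 58.5),
    8: (69.5, 64.0),   # accuracy is close enough, macro-F1 is not
    10: (69.8, 67.5),  # first size that satisfies both criteria
    20: (70.2, 68.1),
    34: (70.0, 68.0),
}
chosen = smallest_subset_within_tolerance(hypothetical_screen, full_size=34)
```

Requiring both metrics to pass guards against a subset that preserves accuracy by favouring the larger families while losing macro-F1.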
| Row | feature_set | n_features | accuracy_percent | macro_f1_percent | macro_recall_percent |
|---|---|---|---|---|---|
| 1 | All features | 34 | 67.7 | 65.9 | 66.9 |
| 2 | Essential feature set | 11 | 66.2 | 65.4 | 65.9 |
| Row | rank | block | statistic | relative_importance |
|---|---|---|---|---|
| 1 | 1 | Radial_H0 | q75 | 1.0 |
| 2 | 2 | Radial_H0 | median | 0.897 |
| 3 | 3 | Radial_H0 | q10 | 0.857 |
| 4 | 4 | Rips_H1 | entropy | 0.837 |
| 5 | 5 | Radial_H0 | std_birth | 0.832 |
| 6 | 6 | Radial_H0 | q25 | 0.818 |
| 7 | 7 | Rips_H1 | median | 0.778 |
| 8 | 8 | Rips_H1 | median_birth | 0.683 |
| 9 | 9 | Rips_H1 | total_pers | 0.614 |
| 10 | 10 | Rips_H1 | mean_midlife | 0.502 |
| 11 | 11 | Rips_H1 | max_pers | 0.477 |
Earlier versions of this analysis used a direct distance-matrix approach: compute pairwise Wasserstein distances between the Rips H1 persistence diagrams, then classify by nearest neighbours. We revisit that idea here and add an average-linkage dendrogram. This is a deliberately “pure metric space” baseline because it uses the persistence diagrams only through a single pairwise distance matrix, without extracting interpretable summary features or fitting a flexible classifier.
The Wasserstein constructor has two relevant choices: the Wasserstein order and the ground norm used to match points in the birth-death plane. The previous experiments used the default ground norm, so we compare those settings with Euclidean ground-norm variants.
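The effect of the two choices can be illustrated with a brute-force Python sketch for very small diagrams. This is not the library routine used in the analysis: it enumerates all matchings, so it is only usable for a handful of points. The function names are hypothetical, and the norm names `"Linf"` and `"L2"` simply mirror the labels used in the results table. Unmatched points are sent to the diagonal at a cost equal to their distance from it under the chosen ground norm.

```python
import itertools
import math


def ground_distance(a, b, norm):
    """Distance between two (birth, death) points under the ground norm."""
    dx, dy = abs(a[0] - b[0]), abs(a[1] - b[1])
    return max(dx, dy) if norm == "Linf" else math.hypot(dx, dy)


def to_diagonal(p, norm):
    """Distance from a (birth, death) point to the diagonal birth = death."""
    half = (p[1] - p[0]) / 2
    return half if norm == "Linf" else half * math.sqrt(2)


def wasserstein(X, Y, q=1, norm="Linf"):
    """Brute-force q-Wasserstein distance between two tiny finite diagrams."""
    # Pad each side with diagonal slots (None) so any point may stay unmatched.
    left = list(X) + [None] * len(Y)
    right = list(Y) + [None] * len(X)

    def cost(a, b):
        if a is None and b is None:
            return 0.0  # diagonal matched to diagonal is free
        if a is None:
            return to_diagonal(b, norm) ** q
        if b is None:
            return to_diagonal(a, norm) ** q
        return ground_distance(a, b, norm) ** q

    best = min(
        sum(cost(left[i], right[j]) for i, j in enumerate(perm))
        for perm in itertools.permutations(range(len(left)))
    )
    return best ** (1.0 / q)


X = [(0.0, 4.0), (0.0, 1.0)]
Y = [(1.0, 5.0)]
variants = {
    (q, norm): wasserstein(X, Y, q=q, norm=norm)
    for q in (1, 2)
    for norm in ("Linf", "L2")
}
```

In this toy example the long-lived points are matched to each other and the short-lived point goes to the diagonal. Changing the ground norm rescales both kinds of cost, while changing the order changes how the individual costs are aggregated.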
The dendrogram below uses average linkage on the W1 distance with the default ground norm. It should be read as an exploratory visualization of the diagram geometry, not as a supervised classifier.
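Average linkage on a precomputed distance matrix can be sketched naively in Python as follows. This is an illustration only: the function name and the output format are hypothetical, and the quadratic search at every step is far less efficient than a library implementation. At each step the two clusters with the smallest mean pairwise distance are merged, and the merge height is that mean distance.

```python
def average_linkage(D):
    """Naive average-linkage clustering on a symmetric distance matrix.

    Returns a list of merges (cluster_a, cluster_b, height, size). Original
    points have ids 0..n-1 and merged clusters get ids n, n+1, and so on.
    """
    clusters = {i: [i] for i in range(len(D))}
    merges = []
    next_id = len(D)
    while len(clusters) > 1:
        best = None
        ids = sorted(clusters)
        for pos, a in enumerate(ids):
            for b in ids[pos + 1:]:
                pairs = [(i, j) for i in clusters[a] for j in clusters[b]]
                height = sum(D[i][j] for i, j in pairs) / len(pairs)
                if best is None or height < best[0]:
                    best = (height, a, b)
        height, a, b = best
        clusters[next_id] = clusters.pop(a) + clusters.pop(b)
        merges.append((a, b, height, len(clusters[next_id])))
        next_id += 1
    return merges


# Toy distance matrix: four points on a line at positions 0, 1, 5 and 6.
positions = [0.0, 1.0, 5.0, 6.0]
D = [[abs(x - y) for y in positions] for x in positions]
merges = average_linkage(D)
```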
To make the comparison with the Random Forest more explicit, we evaluate several distance-based classifiers with the same repeated stratified 3-fold split design used above. The classifier choices are intentionally simple: unweighted k-NN, inverse-distance weighted k-NN, and nearest-family average distance.
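The three classifier types can be sketched in Python on top of one row of a precomputed distance matrix. The function names, the tie-breaking rule, and the small constant added to avoid division by zero are assumptions made for illustration and are not taken from the analysis code.

```python
from collections import defaultdict


def knn_predict(dist_row, train_idx, labels, k=3, weighted=False):
    """k-NN vote from one row of a distance matrix.

    dist_row[j] is the distance from the query diagram to training diagram j.
    With weighted=True every neighbour votes with weight 1 / distance.
    """
    neighbours = sorted(train_idx, key=lambda j: dist_row[j])[:k]
    votes = defaultdict(float)
    for j in neighbours:
        weight = 1.0 / (dist_row[j] + 1e-12) if weighted else 1.0
        votes[labels[j]] += weight
    top = max(votes.values())
    tied = {c for c, v in votes.items() if v == top}
    # Assumption: ties go to the tied class with the closest neighbour.
    return next(labels[j] for j in neighbours if labels[j] in tied)


def nearest_family_average(dist_row, train_idx, labels):
    """Predict the family with the smallest mean distance to the query."""
    by_family = defaultdict(list)
    for j in train_idx:
        by_family[labels[j]].append(dist_row[j])
    return min(sorted(by_family), key=lambda c: sum(by_family[c]) / len(by_family[c]))


# Toy query: index 0 is held out, indices 1 to 4 are training diagrams.
toy_labels = ["?", "A", "A", "B", "B"]
toy_row = [0.0, 1.0, 5.0, 2.0, 2.5]
train = [1, 2, 3, 4]
```

In the toy example the unweighted 3-NN vote and the inverse-distance vote disagree, because the single closest neighbour outweighs two slightly farther ones once distances are used as weights.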
| Row | distance | classifier | accuracy_percent | macro_f1_percent | macro_recall_percent |
|---|---|---|---|---|---|
| 1 | W1_Linf | 3-NN | 66.8 | 63.1 | 64.0 |
| 2 | W1_Linf | 3-NN weighted | 66.8 | 63.1 | 64.0 |
| 3 | W2_L2 | 3-NN | 66.1 | 62.9 | 63.3 |
| 4 | W2_L2 | 3-NN weighted | 66.1 | 62.9 | 63.3 |
| 5 | W2_L2 | 1-NN | 65.4 | 62.5 | 62.5 |
| 6 | W1_L2 | 3-NN | 65.6 | 61.4 | 62.8 |
| 7 | W1_L2 | 3-NN weighted | 65.6 | 61.4 | 62.8 |
| 8 | W2_Linf | 3-NN | 65.6 | 61.2 | 62.2 |
| 9 | W2_Linf | 3-NN weighted | 65.6 | 61.2 | 62.2 |
| 10 | W2_Linf | 1-NN | 62.1 | 61.1 | 61.2 |
| 11 | W1_Linf | 1-NN | 63.8 | 61.1 | 61.5 |
| 12 | W1_L2 | 1-NN | 63.8 | 61.0 | 61.7 |
| Row | method | representation | classifier | accuracy_percent | macro_f1_percent | macro_recall_percent |
|---|---|---|---|---|---|---|
| 1 | Best Rips Wasserstein baseline | Rips H1 diagrams as a distance matrix | 3-NN | 66.8 | 63.1 | 64.0 |
| 2 | Balanced Random Forest | Rips H1 + radial H0 summary features | Random Forest | 67.7 | 65.9 | 66.9 |
The direct Wasserstein approach is more than a weak diagnostic baseline. Its best result (W1_Linf with 3-NN) is close to the feature-based Random Forest: accuracy differs by about one percentage point (66.8% versus 67.7%), while macro-F1 (63.1% versus 65.9%) and macro-recall (64.0% versus 66.9%) each differ by about three percentage points. Given the small sample size, this gap should be interpreted cautiously rather than as clear evidence that the Random Forest is decisively superior.
This result suggests that the Rips H1 persistence diagrams already contain substantial family-level signal. The Wasserstein pipeline is also conceptually simple: it keeps the diagrams as diagrams, compares them with an intrinsic distance, and uses nearest-neighbour classification. Its simplicity is mainly statistical and methodological, however, not necessarily computational, because Wasserstein distances require solving optimal matching problems between diagrams.
The feature-based Random Forest remains useful, but for a more modest reason. It combines Rips H1 summaries with radial H0 summaries, can use nonlinear interactions among summary statistics, and gives feature-importance diagnostics for biological interpretation. The present results therefore do not show that a purely metric-space approach is poor. They show that Wasserstein distance is a strong baseline, and that feature extraction plus Random Forest offers a small performance gain together with better interpretability and flexibility.
These results suggest that compact topological summaries of wing venation contain family-level signal. The two retained filtrations capture different information: Vietoris-Rips H1 summarizes global loop structure in the vein network, while radial H0 summarizes how connected vein components organize from the center of the wing outward. The direct Wasserstein baseline strengthens this conclusion: even without radial features or feature extraction, the Rips diagrams alone support competitive classification.
The comparison between Wasserstein distance and Random Forest should be read as a tradeoff, not as a decisive ranking. The Random Forest gives a modest improvement in the current validation, but the gap is small for a dataset of this size. Wasserstein distance therefore remains an important baseline and a useful indication that the persistence diagrams themselves carry taxonomic structure. The advantage of the feature-based Random Forest is that it can combine complementary filtrations and provide interpretable feature rankings, not that it overwhelmingly outperforms the metric-space approach.
The validation design is intentionally conservative for the current dataset. Because the dataset contains only 70 specimens and the family counts are uneven, the primary summaries are macro-F1 and macro-recall rather than overall accuracy. Repeating the stratified 3-fold split 30 times reduces dependence on a single partition and gives a clearer view of which families are stable or ambiguous.
The feature-reduction results provide a candidate set of topological summaries for biological follow-up. Features retained in the essential set should be inspected against wing venation traits, including loop structure, the number of persistent components, and the scale at which vein components merge under the radial filtration.
The main practical limitation is still sample size. Several families have fewer than ten specimens, so family-level recall should be interpreted as provisional. Image quality, binarization, and connectivity correction also affect the persistence diagrams. Follow-up work should add taxonomic context, literature references, and a biological interpretation of the retained topological summaries.
@online{vituri_f._pinto2026,
author = {Vituri F. Pinto, Guilherme and Ura, Sergio and , Northon},
title = {Diptera Wing Classification Using {Topological} {Data}
{Analysis}},
date = {2026-04-27},
langid = {en},
abstract = {We use Topological Data Analysis (TDA) to describe Diptera
wing venation and classify specimens at the family level. From 70
binarized wing images representing nine families, we extract two
compact persistence summaries: H1 persistence from Vietoris-Rips
filtrations of point-cloud samples and H0 persistence from radial
filtrations of connected wing images. These 34 topological features
are evaluated with a single balanced Random Forest model using
repeated stratified 3-fold cross-validation. Because the dataset is
imbalanced, performance is summarized primarily with macro-F1,
macro-recall, family-level recall, and a row-normalized confusion
matrix. We also use a feature-reduction screen to identify a smaller
candidate set of topological summaries for biological
interpretation. A direct Wasserstein-distance baseline on Rips
persistence diagrams is competitive with the Random Forest,
suggesting that much of the taxonomic signal is already present in
the Rips diagrams themselves.}
}