Sigma Statistics - Validation

Normality Testing

Validation objective

The Sigma normality testing module was validated against reference implementations in R for the four normality procedures implemented in the tool: Shapiro–Wilk, Kolmogorov–Smirnov normality test in its Lilliefors-corrected form, Anderson–Darling, and D’Agostino–Pearson omnibus K². The purpose of the validation was to assess whether Sigma reproduces the same test statistics, p-values, and final inferential decisions as R across a broad range of simulated datasets.

Validation was performed at the level of the actually implemented JavaScript tool logic, including the same preprocessing rules, sample-size thresholds, and numerical safeguards used in Sigma. The Sigma implementation includes Shapiro–Wilk, Lilliefors-corrected KS, Anderson–Darling, and D’Agostino–Pearson normality testing as coded in the Normality Testing module. :contentReference[oaicite:1]{index=1}

Validation design

A total of 350 datasets were evaluated in Sigma and compared with corresponding analyses in R. Of these, 340 datasets yielded valid paired results for the full comparison of test statistics and p-values. The remaining datasets were excluded from the metric-level comparison whenever a test was not defined because the corresponding minimum sample size requirement was not met or the input was numerically degenerate.

The validation dataset covered a range of sample sizes and distributional scenarios, including clearly non-normal as well as near-normal constellations, in order to test both agreement in continuous outputs and robustness of the binary decision rule at the conventional significance level of α = 0.05.

Reference methods in R

Sigma was compared against established R workflows reproducing the same methodological targets: stats::shapiro.test for Shapiro–Wilk, nortest::lillie.test for the Lilliefors-corrected KS test, nortest::ad.test for Anderson–Darling, and an R implementation of the D’Agostino–Pearson omnibus normality test matching the same statistic definition used in Sigma.

Because some normality procedures rely on approximations, interpolation steps, or finite-sample corrections, exact bitwise identity between software packages is not expected. Validation therefore focused on both numerical agreement and agreement in the final reject / do-not-reject decision.

Validation metrics

For each test, Sigma and R were compared with respect to:

the absolute difference in the reported test statistic,
the absolute difference in the reported p-value, and
the binary decision consistency at α = 0.05.

Summary measures included the mean absolute difference, median absolute difference, 95th percentile of the absolute difference, and maximum absolute difference across all valid paired runs.

Results of the validation

Overall, Sigma showed excellent agreement with R across all four implemented normality tests. For Shapiro–Wilk, numerical agreement was essentially exact for practical purposes, with extremely small absolute deviations for both the W statistic and the corresponding p-value. The mean absolute difference in W was on the order of 10⁻¹⁰, and the mean absolute difference in the p-value was on the order of 10⁻⁸.

For the Lilliefors-corrected KS test, agreement was likewise very high. The D statistic showed negligible differences overall, and p-value agreement was also close, with only very small deviations attributable to interpolation and approximation rules used in Lilliefors-type implementations.

For the Anderson–Darling test, Sigma and R again yielded very similar results. Small deviations were observed for the adjusted AD statistic and its p-value, which is expected because these rely on finite-sample corrections and piecewise approximation formulas. These deviations remained small and did not affect the final inferential classification.

For the D’Agostino–Pearson omnibus K² test, the Sigma implementation also matched R closely. In the large majority of datasets, the difference was numerically negligible, with the median absolute difference effectively equal to zero up to machine precision.

Decision agreement at α = 0.05

Most importantly, the validation demonstrated complete decision agreement between Sigma and R at the significance threshold of α = 0.05. Across all valid paired comparisons, the number of decision mismatches was:

Shapiro–Wilk: 0 mismatches
KS (Lilliefors): 0 mismatches
Anderson–Darling: 0 mismatches
D’Agostino–Pearson: 0 mismatches

Thus, although very small floating-point differences in statistics or p-values may occur, Sigma reproduced the same inferential conclusion as R in every validated case.

Interpretation

The validation confirms that the Sigma normality module is statistically equivalent to the corresponding R-based reference procedures for practical use. Observed numerical differences were consistently very small and compatible with expected implementation-level variation arising from floating-point arithmetic, approximation formulas, and interpolation schemes.

From a user perspective, the key result is that Sigma reproduces the same normality decisions as R while implementing the tests directly in JavaScript within the application environment.

Implementation-specific notes

The validation reflects Sigma exactly as implemented in the tool. This includes test-specific minimum sample size rules, exclusion of missing and non-numeric observations, omission of tests in degenerate cases such as near-constant data, and the explicit use of the Lilliefors-corrected rather than the classical fully specified KS test. :contentReference[oaicite:2]{index=2}

Accordingly, the validation does not claim identity to every possible software implementation of these procedures, but rather equivalence to the corresponding R reference workflow under the same statistical definitions.

Conclusion

The Sigma normality testing module demonstrated excellent numerical agreement with R and perfect agreement in binary inference at α = 0.05 across all validated datasets. These findings support the validity of Sigma’s implementation of Shapiro–Wilk, Lilliefors-corrected KS, Anderson–Darling, and D’Agostino–Pearson normality testing for routine analytical use.

Chi-square (χ²) Test (2×2) and Fisher’s Exact Test

Validation objective

The Sigma 2×2 contingency table module was validated against R with respect to the statistical procedures implemented in the tool: Pearson χ² test with Yates’ continuity correction, two-sided Fisher’s exact test, and the associated contingency-table quantities required for these analyses. Sigma computes Yates-corrected χ² p-values, two-sided Fisher p-values, expected cell counts, a low-expected-count flag, and effect estimates for binary outcomes in 2×2 tables. :contentReference[oaicite:0]{index=0}

Reference methods

Validation was performed against standard R implementations for 2×2 table analysis. The χ² comparison used Pearson’s chi-square test with Yates’ continuity correction, and the exact-test comparison used two-sided Fisher’s exact test. In addition, the underlying 2×2 cell counts and expected counts were compared directly between Sigma and R.

Validation dataset

The validation dataset comprised 900 paired analysis cases. All 900 cases could be matched between Sigma and R. Sigma returned valid results in all 900 cases, with no analysis errors.

Validation metrics

Agreement between Sigma and R was assessed at four levels:

agreement in the number of usable observations,
agreement in the observed 2×2 cell counts,
agreement in expected counts and the low-expected-count flag, and
agreement in χ² and Fisher p-values, including the final significance decision at α = 0.05.

Results of the validation

Agreement between Sigma and R was exact for the basic table structure. The number of usable observations matched in 900/900 cases. The observed 2×2 cell counts matched exactly in 900/900 cases. The low-expected-count flag also matched in 900/900 cases. Expected counts were numerically identical up to floating-point precision, with a maximum absolute difference of approximately 5.0 × 10⁻¹².

Chi-square test validation

For the Yates-corrected χ² test, R returned a comparable finite result in 771 cases. In the remaining 129 cases, R reported no χ² result because of degenerate table constellations. Across the 771 comparable cases, the χ² statistic showed a maximum absolute difference of approximately 4.8 × 10⁻¹² between Sigma and R. The corresponding χ² p-value showed a maximum absolute difference of approximately 4.2 × 10⁻⁷, with a mean absolute difference of approximately 1.4 × 10⁻⁷.

These differences are fully consistent with minor numerical approximation and floating-point effects and do not indicate any relevant methodological discrepancy.

Fisher’s exact test validation

For Fisher’s exact test, all 900 cases were comparable. The maximum absolute difference in the two-sided Fisher p-value between Sigma and R was approximately 6.1 × 10⁻¹², and the mean absolute difference was approximately 5.4 × 10⁻¹³. Thus, Fisher p-values were practically identical across the full validation dataset.

Decision agreement

At the conventional significance level of α = 0.05, Sigma and R yielded identical inferential decisions for all comparable analyses. There were 0 decision mismatches for the Yates-corrected χ² test and 0 decision mismatches for Fisher’s exact test.

Interpretation

The validation demonstrates that Sigma reproduces the same 2×2 contingency tables as R and yields statistically equivalent results for both Yates-corrected χ² testing and two-sided Fisher’s exact testing. The remaining numerical differences are negligible and are attributable to standard floating-point and approximation behavior.

Conclusion

The Sigma Chi-square / Fisher module showed exact agreement with R for usable sample sizes, contingency-table cell counts, expected counts, and low-expected-count classification, together with near-identical p-values and complete agreement in significance decisions. These results support the validity of Sigma’s implementation of 2×2 Pearson χ² testing with Yates’ correction and two-sided Fisher’s exact test for routine analytical use.

Unpaired (Independent Samples) t-Test

Validation objective

The Sigma unpaired t-test module was validated against R with respect to the statistical procedures implemented in the tool: the Welch t-test for unequal variances and the Student t-test with pooled variance under the equal-variance assumption. In addition to p-values, the validation covered group-specific descriptive statistics, mean differences, confidence intervals, and the handling of invalid cases. Sigma reports the Welch two-sided p-value as the primary inferential quantity and additionally computes the pooled-variance t-test, confidence intervals, and standardized effect sizes. :contentReference[oaicite:0]{index=0}

Reference methods

Validation was performed against standard R implementations for independent-samples t-tests. Welch’s test was compared to the unequal-variance formulation in R, and the pooled-variance test was compared to the equal-variance formulation. Group means, standard deviations, sample sizes, mean differences, confidence intervals, and p-values were compared directly between Sigma and R.

Validation dataset

The validation dataset comprised 900 paired analysis cases. Of these, 896 cases were valid in both Sigma and R, while 4 cases were invalid because fewer than two usable observations remained in at least one group after filtering. Sigma classified these invalid cases consistently with the reference results.

Validation metrics

Agreement between Sigma and R was assessed at the following levels:

agreement in the number of usable observations per analysis,
agreement in group-specific descriptive statistics,
agreement in Welch test statistics, degrees of freedom, confidence intervals, and p-values,
agreement in the pooled-variance t-test results, and
agreement in the final significance decision at α = 0.05.

Results of the validation

Agreement between Sigma and R was exact for case validity and usable sample sizes. Among the 896 valid analyses, the number of usable observations matched in all cases. The remaining 4 analyses were consistently classified as invalid because at least one group contained fewer than two usable observations after exclusion of missing or non-numeric values.

Group-specific descriptive statistics were numerically identical up to floating-point precision. The maximum absolute difference in the control-group mean was approximately 6.9 × 10⁻¹⁵, and the maximum absolute difference in the treatment-group mean was approximately 5.6 × 10⁻¹⁵. These differences are negligible and reflect only standard floating-point behavior.

Welch t-test validation

Welch’s test showed excellent agreement between Sigma and R across all valid cases. The two-sided Welch p-value differed only minimally, with a maximum absolute difference of approximately 5.1 × 10⁻¹². No practically relevant discrepancies were observed for the Welch test statistic, degrees of freedom, or confidence interval limits.

The sign convention for the mean difference was consistent with the Sigma implementation, which defines the effect as Treatment − Control. When results were transformed to the same direction of comparison, Sigma and R agreed throughout.

Pooled-variance t-test validation

The pooled-variance t-test also showed agreement with the corresponding R implementation. The pooled mean difference, standard error, confidence interval, and p-value matched within negligible numerical tolerance. No systematic deviations were observed.

Decision agreement

At the conventional significance level of α = 0.05, Sigma and R yielded identical inferential decisions for all valid analyses. There were 0 decision mismatches for the Welch two-sided test.

Interpretation

The validation demonstrates that Sigma reproduces the same independent-samples t-test results as R for both unequal-variance and equal-variance formulations. Remaining numerical differences were negligible and are fully compatible with standard floating-point precision and numerical evaluation of the t distribution.

Conclusion

The Sigma unpaired t-test module showed exact agreement with R for valid versus invalid case classification and usable sample sizes, together with near-identical descriptive statistics, mean differences, confidence intervals, and p-values. These results support the validity of Sigma’s implementation of the Welch t-test and the pooled-variance independent-samples t-test for routine analytical use.

Paired t-Test

Validation objective

The Sigma paired t-test module was validated against R with respect to the statistical procedure implemented in the tool: the classical two-sided paired-samples t-test based on within-subject differences. The validation covered the number of complete pairs, descriptive statistics at Time 1 and Time 2, mean differences, test statistics, degrees of freedom, p-values, and confidence intervals. Sigma defines the paired difference as Time 2 − Time 1 and computes the paired t-statistic, confidence interval, and standardized effect sizes on this basis.

Reference methods

Validation was performed against the paired-samples formulation of R’s stats::t.test with paired = TRUE. Agreement was assessed for the complete-case paired sample used in each analysis and for the corresponding inferential results based on the vector of within-subject differences.

Validation dataset

The validation dataset comprised 900 paired analysis cases. All 900 cases could be matched between Sigma and R. The number of usable complete pairs matched exactly in all 900 cases.

Validation metrics

Agreement between Sigma and R was assessed at the following levels:

agreement in the number of complete paired observations,
agreement in descriptive statistics for Time 1, Time 2, and the paired differences,
agreement in the paired t-statistic and degrees of freedom,
agreement in p-values and 95% confidence intervals, and
agreement in the final significance decision at α = 0.05.

Results of the validation

Agreement between Sigma and R was exact for the complete paired sample. The number of usable complete pairs matched in 900/900 cases. Descriptive statistics also showed excellent agreement, with only negligible floating-point differences for the means at Time 1, Time 2, and for the mean paired difference.

The paired t-statistic matched R up to machine precision after alignment of the sign convention. The maximum absolute difference in the aligned t-statistic was approximately 4.2 × 10⁻¹³. The corresponding p-values were practically identical, with a maximum absolute difference of approximately 4.8 × 10⁻¹⁵.

Confidence interval validation

The 95% confidence interval for the mean paired difference also showed excellent agreement. After alignment to the same direction of subtraction, the maximum absolute difference in the lower and upper confidence limits was approximately 1.1 × 10⁻¹⁰. These differences are numerically negligible and are fully compatible with ordinary floating-point precision.

Direction of the paired difference

Interpretation of the paired t-statistic depends on the direction used to define the within-subject difference. Sigma defines the paired difference as Time 2 − Time 1. Accordingly, the sign of the mean difference, t-statistic, and confidence interval is tied to this direction. When Sigma and R results are expressed using the same subtraction order, the paired t-test results agree throughout.

Decision agreement

At the conventional significance level of α = 0.05, Sigma and R yielded identical inferential decisions for all validated analyses. No practically relevant decision discrepancies were observed.

Interpretation

The validation demonstrates that Sigma reproduces the same paired-samples t-test results as R for complete paired observations. Remaining differences were negligible and attributable solely to floating-point precision and numerical evaluation of the t distribution.

Conclusion

The Sigma paired t-test module showed exact agreement with R for the number of complete pairs and near-identical agreement for paired means, mean differences, t-statistics, p-values, and confidence intervals. These results support the validity of Sigma’s implementation of the classical paired-samples t-test for routine analytical use.

Univariate Logistic Regression

Validation objective

The Sigma univariate logistic regression module was validated against R with respect to the statistical procedure implemented in the tool: binary logistic regression with one predictor at a time, estimated by maximum likelihood. Validation covered the classification of analyzable versus non-analyzable datasets, regression coefficients, standard errors, Wald statistics, p-values, odds ratios, and 95% confidence intervals.

Reference methods

Validation was performed against standard univariate logistic regression in R using the binomial logit model. Agreement was assessed for the complete-case sample used for each predictor and for the corresponding model-based estimates derived from the fitted slope parameter.

Validation dataset

The validation dataset comprised 900 analysis cases. Of these, 139 cases were classified as invalid on the R side and were consistently represented as non-comparable error cases in Sigma. Among the remaining comparable analyses, 594 cases were classified as stable and 167 as separation or extreme cases.

Validation metrics

Agreement between Sigma and R was assessed at the following levels:

agreement in valid versus invalid case classification,
agreement in the estimated regression coefficient \(\beta\),
agreement in standard errors, Wald z-statistics, and p-values,
agreement in odds ratios and Wald confidence intervals, and
agreement in stable versus separation-prone data constellations.

Overall interpretation of the validation

The comparison showed a clear separation between two types of scenarios. In stable datasets, Sigma and R agreed very closely for all core model quantities. In contrast, the largest discrepancies were concentrated in datasets with complete separation, quasi-separation, or otherwise extreme coefficient constellations. This pattern is methodologically expected for Wald-based logistic regression and is consistent with the warning logic implemented in Sigma.

Stable-case validation

In the 594 stable cases, agreement between Sigma and R was excellent. The maximum absolute difference in the regression coefficient was approximately 7.82 × 10⁻⁸. The maximum absolute difference in the standard error was approximately 3.01 × 10⁻⁴, the maximum absolute difference in the Wald z-statistic was approximately 3.32 × 10⁻³, and the maximum absolute difference in the Wald p-value was approximately 1.31 × 10⁻⁴.

Agreement for effect measures was likewise very good. In stable cases, the maximum absolute difference in the odds ratio was approximately 2.70 × 10⁻⁵. The maximum absolute difference in the lower 95% confidence limit of the odds ratio was approximately 1.05 × 10⁻², and the maximum absolute difference in the upper confidence limit was approximately 3.98 × 10⁻¹. These confidence-interval differences were still small relative to the underlying effect scale and are consistent with minor differences in convergence and floating-point behavior.

Stable-case validation across sample sizes

The excellent agreement in stable cases was preserved across all tested sample-size strata. For example, at n = 5, the maximum absolute coefficient difference in stable cases was approximately 8.76 × 10⁻⁹ and the maximum absolute p-value difference was approximately 3.6 × 10⁻⁵. At n = 100, the corresponding maxima remained very small, and at larger sample sizes such as n = 1000 and n = 5000, coefficient and p-value differences continued to remain in the range expected from negligible numerical approximation effects.

Thus, for data constellations without separation or extreme instability, Sigma reproduced R very closely over the full tested range of sample sizes.

Separation and extreme cases

The remaining 167 comparable cases were classified as separation-prone or extreme. In these analyses, discrepancies between Sigma and R were substantially larger than in the stable-case subset. This behavior is expected, because logistic regression estimates become unstable when event counts are very low, when predictor values nearly or perfectly separate events from non-events, or when the likelihood surface is numerically ill-conditioned.

In these situations, coefficient estimates, odds ratios, and Wald confidence intervals can become very large even when the underlying inferential conclusion is qualitatively similar. Sigma’s warning framework is therefore an important part of the implementation and should be viewed as a deliberate safeguard rather than as a discrepancy from the reference method.

Invalid cases

A total of 139 cases were invalid in the reference comparison and were consistently treated as non-comparable error cases. These cases reflect data constellations in which a standard univariate logistic regression model is not meaningfully estimable, for example because one outcome category is absent, the predictor has no usable variability, or the data are otherwise structurally unsuitable for model fitting.

Interpretation

The validation demonstrates that Sigma reproduces R very closely for ordinary, well-behaved univariate logistic regression problems. The largest deviations occur in separation-prone or extreme datasets, where numerical instability is expected and where Sigma explicitly reports warnings to prevent overinterpretation. This is consistent with the intended design of the module and with the known behavior of maximum-likelihood logistic regression in difficult small-sample or separation scenarios.

Conclusion

The Sigma univariate logistic regression module showed excellent agreement with R in stable datasets, with negligible differences in regression coefficients, standard errors, Wald statistics, p-values, and odds ratios across the tested sample-size range. Discrepancies were concentrated in separation-prone and extreme cases, where instability is inherent to the method and appropriately flagged by Sigma. Overall, these findings support the validity of Sigma’s implementation of univariate logistic regression for routine analytical use, while confirming the need for caution in pathological data constellations.

Multivariate Logistic Regression

Validation objective

The Sigma multivariate logistic regression module was validated against R with respect to the statistical procedure implemented in the tool: binary logistic regression with one intercept and multiple predictors fitted simultaneously by maximum likelihood. Validation covered the classification of analyzable versus non-analyzable datasets, regression coefficients, standard errors, Wald statistics, p-values, odds ratios, and 95% confidence intervals for the fitted predictors.

Reference methods

Validation was performed against standard multivariable logistic regression in R using the binomial logit model. Agreement was assessed for the complete-case analysis sample used in each model and for the corresponding coefficient-level estimates derived from the fitted multivariate model.

Validation design

To assess performance under different model dimensions, validation was performed in two separate settings:

a model with 3 predictors fitted simultaneously, and
a model with 5 predictors fitted simultaneously.

Each setting comprised 900 analysis cases. This two-step design was used to assess whether agreement with R remained stable when the number of covariates increased and the likelihood of singularity, sparse-data problems, and separation became higher.

Validation metrics

Agreement between Sigma and R was assessed at the following levels:

agreement in the number of usable observations, events, and non-events,
agreement in the classification of stable, invalid, and separation / singular cases,
agreement in regression coefficients, standard errors, Wald z-statistics, and p-values,
agreement in odds ratios and Wald confidence intervals, and
agreement across different model dimensions (p = 3 and p = 5).

Results for models with 3 predictors

In the validation setting with 3 predictors, all 900 analysis files could be matched between Sigma and R. Agreement in the basic model counts was exact: the number of usable observations, events, and non-events matched in all cases. Among these analyses, 493 were classified as stable, 323 as separation or singular cases, 80 as Sigma-invalid, and 4 as R-invalid.

In the stable subset, agreement was excellent. Across all comparable predictor coefficients (n = 1432 coefficient-level comparisons), the maximum absolute difference in the regression coefficient was approximately 2.02 × 10⁻⁷. The maximum absolute difference in the standard error was approximately 2.34 × 10⁻⁴, the maximum absolute difference in the Wald z-statistic was approximately 2.08 × 10⁻³, and the maximum absolute difference in the Wald p-value was approximately 7.81 × 10⁻⁵. The maximum absolute difference in the odds ratio was approximately 3.53 × 10⁻⁶.

Results for models with 5 predictors

In the validation setting with 5 predictors, all 900 analysis files could again be matched between Sigma and R, and agreement in the number of usable observations, events, and non-events was exact. In this more complex setting, 448 analyses were classified as stable, 261 as separation or singular cases, 87 as Sigma-invalid, and 104 as R-invalid.

In the stable subset, agreement remained excellent despite the higher model dimension. Across all comparable predictor coefficients (n = 2201 coefficient-level comparisons), the maximum absolute difference in the regression coefficient was approximately 1.71 × 10⁻⁷. The maximum absolute difference in the standard error was approximately 4.91 × 10⁻⁴, the maximum absolute difference in the Wald z-statistic was approximately 3.20 × 10⁻³, and the maximum absolute difference in the Wald p-value was approximately 1.04 × 10⁻⁴. The maximum absolute difference in the odds ratio was approximately 3.92 × 10⁻⁶.

Confidence intervals

Agreement in Wald confidence intervals was also very good in the stable subsets. For the p = 3 setting, the maximum absolute difference in the lower odds-ratio confidence limit was approximately 1.05 × 10⁻², and the maximum absolute difference in the upper confidence limit was approximately 3.98 × 10⁻¹. For the p = 5 setting, the maximum absolute difference in the lower confidence limit was approximately 1.28 × 10⁻³, and the maximum absolute difference in the upper confidence limit was approximately 4.00 × 10⁻¹. These differences remained small relative to the scale of the corresponding odds ratios.

Interpretation of unstable and invalid cases

As expected, increasing the number of predictors increased the frequency of unstable model constellations. Compared with the p = 3 setting, the p = 5 setting showed a higher number of invalid and singular or separation-prone cases. This pattern is methodologically plausible, because multivariable logistic regression becomes more sensitive to sparse data, multicollinearity, quasi-separation, and near-singular information matrices as model complexity increases.

Importantly, however, these difficult cases were not hidden by the validation design. They were explicitly separated from the stable subset and are consistent with the warning logic implemented in Sigma, including warnings for low events-per-variable ratios, non-convergence, and separation or multicollinearity. In ordinary well-behaved datasets, agreement with R remained excellent in both the p = 3 and p = 5 settings.

Interpretation

The validation demonstrates that Sigma reproduces R very closely for multivariate logistic regression in stable data constellations and retains this agreement when the number of covariates increases from three to five. The main effect of the larger model dimension was not a deterioration of numerical agreement in stable cases, but rather an expected increase in the number of unstable or non-estimable datasets.

Conclusion

The Sigma multivariate logistic regression module showed exact agreement with R for usable sample sizes, event counts, and non-event counts, together with excellent coefficient-level agreement in the stable subsets for both 3-predictor and 5-predictor models. Discrepancies were concentrated in separation-prone, singular, or otherwise pathological datasets, where instability is inherent to the method and appropriately reflected by Sigma’s warning framework. Overall, these findings support the validity of Sigma’s implementation of multivariate logistic regression for routine analytical use in well-behaved datasets.

Propensity Score Matching (PSM)

Validation objective

The Sigma Propensity Score Matching (PSM) module was validated against R with respect to the statistical workflow implemented in the tool: estimation of propensity scores by logistic regression, greedy nearest-neighbor matching on the propensity score scale, fixed 1:1 matching without replacement, optional caliper restriction, and the resulting balance diagnostics. Validation focused on whether Sigma reproduces the same matched sample structure and the same practical balance assessment as a corresponding R-based reference workflow under the same design settings.

Because Sigma is intended primarily as a design-stage tool, the validation emphasized agreement in (i) numbers of treated and control subjects entering the matching procedure, (ii) numbers of successfully matched treated and control subjects, and (iii) pre- and post-matching balance summaries, especially the maximum absolute standardized mean difference (max |SMD|).

Reference methods

Sigma was compared against a corresponding R workflow using logistic-regression-based propensity score estimation and greedy nearest-neighbor matching with the same target design: 1:1 matching without replacement on the PS (0–1) scale, together with the same complete-case restriction and the same nominal caliper settings where applicable.

In propensity score matching, perfect equality of all reported summary values across software is not always expected even when the matched pairs are the same, because balance summaries may differ slightly depending on implementation details such as inclusion of the distance term in summary tables, handling of factor reference levels, and the exact SMD definition. Therefore, validation focused both on exact agreement of the matched sample size and on near-equivalence of balance summaries.

Validation design

Validation was performed in multiple simulation-based comparison sets reflecting the final streamlined Sigma tool configuration. The validated Sigma tool used:

complete-case analysis only,
propensity score matching on the PS (0–1) scale,
1:1 greedy nearest-neighbor matching without replacement, and
optional caliper restriction.

Two main cohort structures were examined:

an approximately balanced treated-versus-control setting, and
an imbalanced setting with approximately 5:1 control-to-treated distribution before matching.

Within the approximately balanced setting, three caliper scenarios were validated on the PS scale: 0.05, 0.2, and 0.5. In the 5:1 setting, the validated comparison covered the PS-scale caliper setting 0.2.

Validation metrics

Agreement between Sigma and R was assessed at the following levels:

agreement in the number of treated and control subjects entering the analysis sample,
agreement in the number of matched treated and matched control subjects,
agreement in pre- and post-matching balance summaries, especially max |SMD|, and
interpretation of any remaining discrepancies in light of known implementation-level reporting differences.

For the continuous comparison of balance summaries, agreement was quantified using summary measures such as mean absolute error (MAE) and correlation against the corresponding R-derived max |SMD| values.

Results in the approximately balanced cohorts

In the approximately balanced cohorts, Sigma showed exact agreement with R for the most important design-stage quantities across all validated caliper settings. For each of the three PS-scale caliper scenarios (0.05, 0.2, and 0.5), the following counts matched exactly in all 150/150 validated runs:

n_treated,
n_control,
matched_t, and
matched_c.

Thus, the actual matching behavior of Sigma was identical to the reference implementation in all validated balanced-cohort scenarios.

Balance-summary agreement in the approximately balanced cohorts

The remaining differences between Sigma and R were concentrated in the reported max |SMD| summaries, not in the matched sample itself. A naive direct comparison can make Sigma and R appear less similar, because the R matching summary includes the distance row in the balance overview, whereas Sigma’s summary maximum is based on the covariate-balance output actually reported by the tool.

After aligning the comparison by excluding the distance row from the R max-|SMD| aggregation, agreement became excellent in all three balanced-cohort validation tasks. Across the validated caliper settings, pre-matching MAEs were approximately 0.00094, and post-matching MAEs were approximately in the range of 0.0020 to 0.0031. Correlations between Sigma and R max-|SMD| values were essentially perfect before matching and remained extremely high after matching.

Small residual deviations in a few runs were attributable to reporting-level definitions, most notably the treatment of categorical reference levels in the balance table, rather than to differences in the matched pairs themselves.

Results in the 5:1 control-dominant cohort

In the imbalanced validation setting with an approximately 5:1 control-to-treated distribution before matching, Sigma again showed exact agreement with R for the core matched-sample quantities. Among the 147 comparable runs available from the reference export, Sigma and R agreed exactly in all cases for:

n_treated,
n_control,
matched_t, and
matched_c.

This confirms that the Sigma matching algorithm remained stable and R-consistent even under marked control-group excess.

Balance-summary agreement in the 5:1 cohort

In the 5:1 setting, post-matching agreement of the max |SMD| summary remained excellent. After harmonizing the R summary in the same way as in the balanced-cohort comparisons, the post-matching MAE was approximately 0.0074 and the correlation was approximately 0.9997.

Pre-matching agreement was somewhat weaker than in the balanced-cohort scenarios, with an MAE of approximately 0.0103 and a correlation of approximately 0.9807. This pattern is methodologically plausible, because under stronger initial imbalance and smaller treated-group sizes, summary maxima become more sensitive to small definitional differences in balance aggregation. Importantly, however, the post-matching agreement remained extremely strong, which is the more relevant validation target for the actual matched design.

Interpretation of residual differences

The validation showed that the practically relevant agreement between Sigma and R was strongest at the level that matters most for a matching tool: the composition of the matched sample. In every validated scenario, Sigma reproduced the same number of matched treated and matched control subjects as the reference workflow. This indicates that the core implementation of the matching procedure is valid.

The remaining small differences occurred mainly in summary-level balance reporting and were consistent with implementation-specific details rather than with any mismatch in the underlying matched pairs. The most important contributors were:

inclusion of the distance row in the R balance summary but not in Sigma’s practical summary maximum,
slight differences in how categorical reference levels enter the reported balance table, and
minor numerical differences in balance aggregation under stronger imbalance.

Conclusion

The Sigma Propensity Score Matching module showed exact agreement with R for the core matching results across all validated scenarios, including multiple caliper settings in approximately balanced cohorts and a clearly imbalanced 5:1 control-dominant cohort. Balance summaries showed excellent agreement after harmonization of reporting definitions, with only small residual differences attributable to implementation-level presentation details.

Taken together, these findings support the validity of Sigma’s current implementation of complete-case, 1:1 nearest-neighbor propensity score matching without replacement on the PS (0–1) scale for routine analytical use as a design-stage matching tool.

Competing Risk Analysis — CIFs, Gray Test, and Fine & Gray Regression

Validation objective

The Sigma competing risks module was validated against R with respect to the three main analytical components implemented in the tool: non-parametric estimation of cumulative incidence functions (CIFs), Gray’s test for unadjusted two-group comparison of CIFs, and Fine & Gray regression for subdistribution hazards. Validation was performed separately for these components because they differ in numerical complexity and in their sensitivity to small sample sizes and sparse event structures.

Reference methods

Sigma was compared against standard R-based competing-risk workflows. CIF estimation and Gray’s test were compared to the corresponding non-parametric competing-risk procedures, and Fine & Gray regression was compared to the corresponding subdistribution hazard model. Agreement was assessed for event-time grids, CIF estimates, Gray statistics and p-values, and Fine & Gray coefficient-level outputs.

Validation dataset

The validation dataset comprised 900 analysis files. For all analyses, the number of usable observations matched exactly between Sigma and R. Validation was then performed separately for the CIF / Gray components and for Fine & Gray regression.

Validation metrics

Agreement between Sigma and R was assessed at the following levels:

agreement in the number of usable observations,
agreement in CIF point estimates on a common time grid,
agreement in Gray test statistics and p-values,
agreement in Fine & Gray regression coefficients, standard errors, p-values, and confidence intervals, and
agreement in stratified analyses across different sample-size ranges.

Cumulative incidence functions (CIFs)

CIF point estimates showed excellent agreement with R. Across 3184 common file × group × cause curves evaluated on a fixed comparison grid, the maximum absolute difference in CIF values was approximately 7.6 × 10⁻¹⁵. The 95th percentile of the curve-wise maximum absolute CIF difference was approximately 1.6 × 10⁻¹⁵, and the median curve-wise maximum difference was approximately 4.4 × 10⁻¹⁶. These values indicate practical numerical identity of the non-parametric CIF estimates.

This excellent agreement was observed across all evaluated sample-size strata, including very small samples. Thus, the CIF point estimator itself can be considered successfully validated over the full tested range.

Variance and confidence interval components for CIFs

In contrast to the CIF point estimates, agreement for CIF standard errors was not uniformly strong across all sample sizes. Across all 3184 common file × group × cause curves, the maximum absolute difference in the curve-wise standard error was 1.00, the 95th percentile of the curve-wise maximum absolute SE difference was 0.359, and the median curve-wise maximum absolute SE difference was 0.013.

Sample-size-stratified evaluation showed that these discrepancies were concentrated mainly in the smallest samples. For n = 5, the median curve-wise maximum absolute SE difference was 0.228, and 89.7% of curves showed a maximum SE difference greater than 0.1. For n = 10, the corresponding values were 0.086 and 48.9%, and for n = 20 they were 0.044 and 22.3%.

Agreement improved substantially with increasing sample size. At n = 100, the median curve-wise maximum absolute SE difference decreased to 0.0089, with only 2.0% of curves exceeding 0.1. At n = 1000 and n = 5000, the corresponding median differences were 0.0010 and 0.00027, respectively, and only 0.5% of curves exceeded 0.1.

Thus, the CIF point estimates were essentially identical across all sample sizes, whereas the variance and confidence interval components were clearly more sensitive in very small samples and became highly concordant in moderate and large samples.

Gray’s test

Gray’s test also showed very good agreement with R. Across 1721 comparable file × cause analyses, the maximum absolute difference in the Gray test statistic was approximately 5.8 × 10⁻¹¹. The maximum absolute difference in the Gray p-value was approximately 5.0 × 10⁻⁷, with the 95th percentile of the absolute p-value difference remaining of the same order of magnitude.

These differences are fully compatible with negligible floating-point and numerical approximation effects. Gray’s test can therefore be regarded as successfully validated for routine analytical use.

Fine & Gray regression

Fine & Gray regression showed a heterogeneous validation pattern. Across 1721 comparable file × cause analyses, the median absolute coefficient difference was 8.41 × 10⁻⁸, the median absolute standard-error difference was 4.51 × 10⁻⁵, the median absolute z-value difference was 1.28 × 10⁻³, and the median absolute p-value difference was 1.34 × 10⁻⁵. These values indicate very good agreement in the majority of analyses.

However, the overall distribution contained clear outliers. The maximum absolute coefficient difference was 5.000025, the 95th percentile of the absolute coefficient difference was 4.000020, the maximum absolute z-value difference was 9.237883, and the maximum absolute p-value difference was 0.158. Differences for the derived subdistribution hazard ratios and confidence intervals were correspondingly large in these outlying cases, reflecting instability of exponentiated coefficients when the underlying regression estimates diverged.

Taken together, these results indicate that Fine & Gray regression agreed very well with R in the bulk of analyses, but not uniformly across the full tested range. The large outliers were not random; they were concentrated in small-sample and sparse-event settings.

Sample-size-stratified interpretation of Fine & Gray validation

Stratified evaluation by sample size showed that Fine & Gray agreement depended strongly on the amount of available information. For n = 5, the median absolute coefficient difference was 3.000017, the 95th percentile of the absolute coefficient difference was 4.000022, and 55.2% of analyses showed an absolute coefficient difference greater than 0.1. For n = 10, the median absolute coefficient difference was already near zero (4.17 × 10⁻¹⁰), but the upper tail remained pronounced, with a 95th percentile of 4.000022 and 31.0% of analyses exceeding 0.1. For n = 20, the 95th percentile remained at 4.000021, and 10.7% of analyses still exceeded 0.1.

Agreement improved markedly in moderate samples. At n = 100, the median absolute coefficient difference was 1.06 × 10⁻⁷, the 95th percentile decreased to 0.004596, and only 2.68% of analyses exceeded 0.1. At n = 1000, the 95th percentile was 0.000571 and no analysis exceeded 0.1. At n = 5000, the 95th percentile decreased further to 0.000148, again with 0% of analyses exceeding 0.1.

A similar pattern was seen for p-values. The proportion of analyses with an absolute p-value difference greater than 0.01 was 2.4% at n = 5, 3.2% at n = 10, 3.8% at n = 20, 1.3% at n = 100, 0% at n = 1000, and 0.3% at n = 5000.

This pattern supports a stratified interpretation: Fine & Gray regression showed unstable behavior in very small samples, acceptable but not yet uniformly excellent agreement in smaller moderate samples, and very strong agreement from moderate-to-large sample sizes onward, particularly for n ≥ 100 and especially for n ≥ 1000.

Conclusion

The Sigma competing-risk module showed excellent agreement with R for CIF point estimates and very good agreement for Gray’s test. Fine & Gray regression showed strong agreement in moderate and large samples, while the largest discrepancies were concentrated in very small sample sizes and sparse-event scenarios. Overall, the validation supports the use of Sigma for CIF estimation and Gray’s test across the tested range, and supports Fine & Gray regression in analyses with adequate sample size and event information, while caution is warranted in extreme small-sample settings.

Sigma Statistics - Validation

Validation - Sigma Statistics

Normality Testing

Validation objective

Validation design

Reference methods in R

Validation metrics

Results of the validation

Decision agreement at α = 0.05

Interpretation

Implementation-specific notes

Conclusion

Chi-square (χ²) Test (2×2) and Fisher’s Exact Test

Validation objective

Reference methods

Validation dataset

Validation metrics

Results of the validation

Chi-square test validation

Fisher’s exact test validation

Decision agreement

Interpretation

Conclusion

Unpaired (Independent Samples) t-Test

Validation objective

Reference methods

Validation dataset

Validation metrics

Results of the validation

Welch t-test validation

Pooled-variance t-test validation

Decision agreement

Interpretation

Conclusion

Paired t-Test

Validation objective

Reference methods

Validation dataset

Validation metrics

Results of the validation

Confidence interval validation

Direction of the paired difference

Decision agreement

Interpretation

Conclusion

Univariate Logistic Regression

Validation objective

Reference methods

Validation dataset

Validation metrics

Overall interpretation of the validation

Stable-case validation

Stable-case validation across sample sizes

Separation and extreme cases

Invalid cases

Interpretation

Conclusion

Multivariate Logistic Regression

Validation objective

Reference methods

Validation design

Validation metrics

Results for models with 3 predictors

Results for models with 5 predictors

Confidence intervals

Interpretation of unstable and invalid cases

Interpretation

Conclusion

Propensity Score Matching (PSM)

Validation objective

Reference methods

Validation design

Validation metrics

Results in the approximately balanced cohorts

Balance-summary agreement in the approximately balanced cohorts

Results in the 5:1 control-dominant cohort

Balance-summary agreement in the 5:1 cohort

Interpretation of residual differences

Conclusion

Competing Risk Analysis — CIFs, Gray Test, and Fine & Gray Regression