ANN - Approximate Nearest Neighbor Library wrapper
This simple tool is a wrapper I wrote in Borland C++ Builder for David M. Mount and Sunil Arya's free ANN library, which solves the k-nearest-neighbor (KNN) problem common in statistics and numerical algorithms. Some options that were previously not accessible in the library can now be changed.
Features include:
- Working with multi-dimensional data (from 1 to 100 dimensions) to calculate distances
- Input of sample data and query data from text files with any extension: columns of data (i.e. 'dimensions') must be separated by TABs
- Calculation of distances in one of four metrics: L1 (Manhattan / Cityblock distance), L2 (Euclidean), LP (custom), and L-infinity (Chebyshev)
- Ability to indicate max dimensions and data points
- Automatic calculation of max neighbors, dimension space, and data point count
- Using error bound (Epsilon) to approximate KNN search (default = 0)
- 3 search modes: brute force (full loop), kd-tree, and bd-tree (box-decomposition tree, in ANN's terminology) -- the latter is good for large searches
- 3 search 'ranges': standard, priority, and fixed radius
- Ability to tweak split and shrink rules for search trees
- Ability to stop search after reaching a specified visit count (early termination)
- Displaying comprehensive descriptive statistics for data and query arrays (min, max, mean, median, count, sum, SOS, variance, standard deviation, skewness, kurtosis)
- Displaying quick search statistics (options selected)
- Displaying a 2D graph of data, query, and nearest neighbors (dotted lines) -- only for 2 dimensions!
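To make the metrics and the brute-force mode from the list above more concrete, here is a minimal C++ sketch. It is not the wrapper's code and does not call the ANN library; the names `Metric`, `Point`, `minkowskiDistance` and `bruteForceKnn` are hypothetical and exist only for this illustration, and the default exponent p = 3 is an arbitrary assumption.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical names for illustration; not part of ANN or the wrapper.
enum class Metric { L1, L2, Lp, LInf };

using Point = std::vector<double>;

// Minkowski-family distance between two points of equal dimension.
// The exponent p is only used when metric == Metric::Lp.
double minkowskiDistance(const Point& a, const Point& b,
                         Metric metric, double p = 3.0) {
    assert(a.size() == b.size());
    double acc = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        const double d = std::fabs(a[i] - b[i]);
        switch (metric) {
            case Metric::L1:   acc += d;                break;  // sum of |d|
            case Metric::L2:   acc += d * d;            break;  // sum of d^2
            case Metric::Lp:   acc += std::pow(d, p);   break;  // sum of |d|^p
            case Metric::LInf: acc = std::max(acc, d);  break;  // max of |d|
        }
    }
    if (metric == Metric::L2) return std::sqrt(acc);
    if (metric == Metric::Lp) return std::pow(acc, 1.0 / p);
    return acc;
}

// Brute-force ("full loop") k-NN: computes the distance from the query to
// every data point and returns (distance, index) pairs for the k closest
// points, nearest first.
std::vector<std::pair<double, std::size_t>>
bruteForceKnn(const std::vector<Point>& data, const Point& query,
              std::size_t k, Metric metric, double p = 3.0) {
    std::vector<std::pair<double, std::size_t>> all;
    all.reserve(data.size());
    for (std::size_t i = 0; i < data.size(); ++i)
        all.emplace_back(minkowskiDistance(data[i], query, metric, p), i);
    k = std::min(k, all.size());
    std::partial_sort(all.begin(), all.begin() + k, all.end());
    all.resize(k);
    return all;
}
```

The full loop costs one distance evaluation per data point for every query, which is why the tree-based modes become attractive as the data set grows.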
As yet this is a rather small application, but with additional effort and time it can be converted into a full-fledged data interpolation / simulation app. For that one needs to implement:
- Covariance functions (variograms, cov properties: sill, nugget, range, angles etc.)
- Probability calculation (expected values, total / conditional / unconditional probabilities, Bayesian probability)
- Probability distributions (normal, Poisson, Student's t, Pearson's chi-square, exponential, etc. + corresponding probability density functions)
- Some linear algebra and matrix calculus (matrix manipulation, Gaussian elimination, LU / Cholesky / QR etc. decomposition)
- Kriging (simple, ordinary, co-kriging, universal, indicator)
- Random path generation
- Advanced data handling (reading and writing various file formats, lists and vectors in lieu of C arrays etc.)
- Possibly additional libraries (Boost for linear algebra, smart pointers, etc.)
- Simulation models (sequential Gaussian -- SGSIM, single normal equation -- SNESIM, sequential indicator -- SIS, truncated etc.)
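As an illustration of the first item in the list above, a covariance function could start from something like the spherical variogram model below. This is only a sketch of one common textbook form, not code from the tool; the struct `SphericalVariogram` and its field names are hypothetical, and it assumes that `sill` means the total sill (nugget included) and that the model is isotropic (no angles).

```cpp
#include <cmath>

// Hypothetical parameter bundle; field names follow the terms used in the list.
struct SphericalVariogram {
    double nugget;  // jump in semivariance just beyond lag 0
    double sill;    // plateau value (assumed to include the nugget)
    double range;   // lag distance at which the plateau is reached

    // Semivariance gamma(h) for a lag distance h.
    double operator()(double h) const {
        if (h <= 0.0) return 0.0;       // gamma(0) is zero by definition
        if (h >= range) return sill;    // beyond the range: flat at the sill
        const double r = h / range;
        return nugget + (sill - nugget) * (1.5 * r - 0.5 * r * r * r);
    }
};
```

Under the usual second-order stationarity assumption, the matching covariance at lag h would be the sill minus `gamma(h)`.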
It is clear that I am still far from calling this 'a useful tool for statistics and data interpolation'. But it's a start...
EXE + Data file samples