I have a project at work where I need to find the nearest neighbor for a set of points on a map. We store these points in SQL Server, which has spatial query support.

It looks like scikit-learn added support for nearest-neighbor queries using Haversine distance (what we need for measuring map distance) earlier in 2019 (https://github.com/scikit-learn/scikit-learn/pull/13289), but I just needed something quick – if I could keep all the work in SQL Server, that was ideal in this case.

Then I found this post, which does a lovely job of showing how we can perform some simple calculations to reduce the set of points that we need to measure distance between:

Fast nearest-neighbor location finder for SQL

The post describes how to find all the nearest neighbors for a single point. In my case, I have a large set of points, and for each one, I want its nearest neighbors within the same set. So, how can we modify the query to handle this?

A cross join is an easy way of pairing every point with every other point in the set. But that introduces a new problem – we can no longer simply use a LIMIT N clause as shown in the blog post (or TOP N in my case, since we’re using SQL Server), because a single TOP N would be applied across all points in the set rather than per point. We need to get the TOP N per point.

To address that concern, we can use a window function to construct a ranking of the nearest neighbors for each point, and then limit the results to only the top N ranks. The resulting query ends up looking like this:

DECLARE @projectId INT = 123;                -- Set this to the project ID of interest
DECLARE @topRankLimit INT = 5;               -- Show at most N nearest neighbors for each location
DECLARE @distanceThresholdKM FLOAT = 16.0934; -- Only show nearest neighbors within 16.0934km (~= 10 miles)

WITH nn AS 
(
SELECT 
    loc.Id,
    loc.Location.Lat AS Lat,
    loc.Location.Long AS Long,
    nearest_neighbor_loc.Id AS NearestNeighborId,
    nearest_neighbor_loc.[Location].Lat AS NearestNeighborLat,
    nearest_neighbor_loc.[Location].Long AS NearestNeighborLong,
    loc.[Location].STDistance(nearest_neighbor_loc.[Location]) AS DistanceM,
    ROW_NUMBER() OVER (PARTITION BY loc.Id ORDER BY loc.[Location].STDistance(nearest_neighbor_loc.[Location])) AS DistanceRank
FROM
    ProjectLocations AS loc
CROSS JOIN ProjectLocations AS nearest_neighbor_loc
WHERE
    loc.Id <> nearest_neighbor_loc.Id AND
    nearest_neighbor_loc.[Location].Lat BETWEEN 
        (loc.[Location].Lat - (@distanceThresholdKM / 111.045)) AND 
        (loc.[Location].Lat + (@distanceThresholdKM / 111.045)) AND
    nearest_neighbor_loc.[Location].Long BETWEEN 
        (loc.[Location].Long - (@distanceThresholdKM / (111.045 * COS(RADIANS(loc.[Location].Lat))))) AND
        (loc.[Location].Long + (@distanceThresholdKM / (111.045 * COS(RADIANS(loc.[Location].Lat))))) AND
    loc.[Location].STDistance(nearest_neighbor_loc.[Location]) IS NOT NULL
)
SELECT
    *
FROM
    nn
WHERE
    DistanceRank <= @topRankLimit AND
   (DistanceM / 1000.0) <= @distanceThresholdKM         
ORDER BY
    Id,
    DistanceM ASC
;
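To make the query’s logic easier to follow outside of SQL, here is a minimal Python sketch of the same approach: a bounding-box prefilter (the latitude/longitude BETWEEN conditions, using ~111.045 km per degree of latitude), an exact distance check, and a per-point top-N ranking (the ROW_NUMBER()/DistanceRank step). The function and variable names are my own for illustration; SQL Server’s STDistance returns meters on a geography type, while this sketch computes Haversine distance in kilometers directly.

```python
from math import radians, sin, cos, asin, sqrt

KM_PER_DEGREE_LAT = 111.045  # approximate km spanned by one degree of latitude

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points, in kilometers."""
    dlat = radians(lat2 - lat1)
    dlon = radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 6371.0 * 2 * asin(sqrt(a))

def top_n_neighbors(points, n, threshold_km):
    """points: dict of id -> (lat, lon).
    Returns id -> list of (neighbor_id, distance_km), nearest first, at most n entries."""
    result = {}
    for pid, (lat, lon) in points.items():
        # Bounding-box prefilter, mirroring the WHERE clause of the query.
        dlat = threshold_km / KM_PER_DEGREE_LAT
        dlon = threshold_km / (KM_PER_DEGREE_LAT * cos(radians(lat)))
        candidates = []
        for qid, (qlat, qlon) in points.items():
            if qid == pid:
                continue  # loc.Id <> nearest_neighbor_loc.Id
            if not (lat - dlat <= qlat <= lat + dlat and lon - dlon <= qlon <= lon + dlon):
                continue  # outside the bounding box; skip the expensive distance calc
            d = haversine_km(lat, lon, qlat, qlon)
            if d <= threshold_km:
                candidates.append((qid, d))
        # Rank by distance and keep the top N -- the ROW_NUMBER() step.
        candidates.sort(key=lambda t: t[1])
        result[pid] = candidates[:n]
    return result
```

Like the SQL, this is O(n²) in the number of points; the bounding box just avoids computing the full distance for most pairs.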

I’ll plan on updating this post soon with some additional discussion and explanation of the query.

Here’s a task I had the other day. While working on it I thought, “this might make a nice interview question.”

You have an m × n matrix of data, for example:

1 2 3 4 
5 6 7 8 

It has been stored into a 1-D array in column-major order:

1 5 2 6 3 7 4 8

Write a function that takes this array and returns a new array with the same data in row-major order:

1 2 3 4 5 6 7 8

The code that I ended up with:

public double[] ToRowMajor(double[] colMajor1D, int nrow)
{
    if (colMajor1D == null)
    {
        throw new ArgumentNullException(nameof(colMajor1D));
    }

    var len = colMajor1D.Length;

    if (len <= 0)
    {
        throw new ArgumentException(
            @"Data must not be empty",
            nameof(colMajor1D));
    }
    if (nrow < 1)
    {
        throw new ArgumentOutOfRangeException(
            nameof(nrow), 
            @"Must have a positive number of rows");
    }
    if (len % nrow != 0)
    {
        throw new ArgumentException(
            $@"Data length must be a multiple of the row count ({nrow})",
            nameof(nrow));
    }

    var ncol = len / nrow;
    var rowMajor1D = new double[len];

    int offset = 0;
    for (int i = 0; i < nrow; i++)
    {
        for (int j = 0; j < ncol; j++)
        {
            rowMajor1D[offset++] = colMajor1D[(j * nrow) + i];
        }
    }

    return rowMajor1D;
}
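The core of the solution is the index mapping: element (i, j) of the matrix lives at index j*nrow + i in column-major order and at index i*ncol + j in row-major order. A short Python sketch of the same conversion (names are mine, not from the C# above) makes that mapping easy to check against the example data:

```python
def to_row_major(col_major, nrow):
    """Convert a 1-D column-major array to row-major order."""
    if len(col_major) == 0 or nrow < 1 or len(col_major) % nrow != 0:
        raise ValueError("data length must be a positive multiple of nrow")
    ncol = len(col_major) // nrow
    # Element (i, j) sits at j*nrow + i in column-major order;
    # emitting rows in order (i outer, j inner) yields row-major order.
    return [col_major[j * nrow + i] for i in range(nrow) for j in range(ncol)]
```

Running it on the example above, `to_row_major([1, 5, 2, 6, 3, 7, 4, 8], 2)` recovers the row-major sequence `[1, 2, 3, 4, 5, 6, 7, 8]`.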

Romaine Chain

Head of romaine lettuce

Last week, the CDC warned Americans not to eat romaine lettuce due to an outbreak of E. coli infections.

The industry is going to adopt improved labeling to narrow down future outbreaks more quickly. This exchange appeared in my Twitter feed today:

Nitlin’s question is a good one: what about blockchain?

Earlier this month, Newsweek’s cover story focused on the state of blockchain technology, including IBM Food Trust, which is focused on traceability of our food supply.

I thought it would be interesting to try to understand the level of adoption of IBM Food Trust. IBM announced general availability less than two months ago, and integration of new technology into established business processes is always challenging, so we should certainly expect it to be very early days. But can we quantify it?

The Newsweek story stated that “4 million individual products have been logged on the blockchain-based IBM Food Trust.” The term “individual product” wasn’t defined in the story, but when a company representative provides statistics related to their offerings, you can be assured they’ll use metrics that result in the most impressive sounding quantities possible. Accordingly, I took “individual product” to mean, “the equivalent of a line item on your grocery store receipt.”

The average supermarket carries roughly 30,000 unique SKUs. And of course, there will be multiple units of each SKU kept in stock. For the purposes of this discussion, let’s assume an average of 5 units in stock for each SKU. That brings us to approximately 150,000 items on the shelf (i.e., 150,000 “individual products”) in your average supermarket.

So the number of individual products tracked through IBM Food Trust to date might fill about 27 grocery stores (4,000,000 products ÷ 150,000 products stocked per supermarket = 26.667 supermarkets) at a given moment in time. This is less than one-tenth of one percent of the 38,571 supermarkets in the United States as of 2017 (Source: FMI).

Furthermore, this figure is just a snapshot – it doesn’t account for the expected inventory turnover of supermarkets over the timeframe that Food Trust has been operational. Nor does it account for the romaine lettuce purchased through wholesale channels by restaurants and the foodservice industry.

So while blockchain may have huge potential for the food supply chain, it is understandable why improved labeling might be instituted as a stopgap measure for improving food safety.

Book Review - Vue.js: Up and Running

Book cover

The clearest introduction to Vue.js that I’ve found.

There are a few small typos in the text and code snippets; keep O’Reilly’s Errata page handy as you read the book.

The book doesn’t do much to walk you through running the code on your own. On the plus side, this keeps things concise, which I take to be a goal of the “Up and Running” series. And actually having to do a little bit of work to get code snippets running locally helped solidify my understanding of the development environment and common Vue.js project structure.

But on the downside, I think this makes the book less useful for less experienced developers, and for developers who do not already possess a certain level of familiarity with modern JavaScript and its ecosystem.

★★★★☆