Euclidean Distance

The second type of algorithm that we focused on was based on finding the Euclidean distance between phones/users. We spent the majority of our time developing and testing numerous versions of this algorithm.

Euclidean distance is defined as the length of the line segment between two points in the Cartesian plane. In our project, we replicated this concept through the Euclidean distance algorithm: rather than performing the Euclidean distance calculation on actual Cartesian coordinates, we substitute in the RSSI values from BSS networks and apply the exact same logic. The formula for Euclidean distance is derived from the Pythagorean theorem, as shown in the figure below.

Graphical representation of Euclidean distance, derived from the Pythagorean theorem.
Simple diagram illustrating how contact tracing via Wi-Fi footprints works.

As seen in the figure above, the ‘Distance’ between the two phones shown by the arrow corresponds to the value of ‘d’ in the previous figure. The blue lines represent the signals from access point 1 to phones A & B, the green lines represent the signals from access point 2 to phones A & B, and the red lines represent the same for a third access point. To calculate this distance, we use the formula for Euclidean distance shown in the equation below. This is the basic idea of how to visualize the calculation, and we expanded it into multiple modified, more complex versions, as seen in the upcoming sections on this page.

Euclidean distance formula as used in our algorithms, where coordinates are replaced by RSSI measurements.
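To make the substitution concrete, the sketch below shows the same formula applied with RSSI values in place of coordinates. It assumes a hypothetical joined table with one row per BSS and illustrative column names rssi_a and rssi_b for the two phones; this is a simplified layout, not our exact schema.

```python
import numpy as np
import pandas as pd

# Hypothetical joined scan: one row per BSS (access point), with the RSSI
# observed by each phone during the comparison interval. Column names and
# values are illustrative only.
joined = pd.DataFrame(
    {"rssi_a": [-45.0, -60.0, -72.0], "rssi_b": [-48.0, -55.0, -80.0]},
    index=["bss_1", "bss_2", "bss_3"],
)

def euclidean_distance(joined: pd.DataFrame) -> float:
    """Euclidean distance between two phones, using RSSI values as 'coordinates'."""
    diff = joined["rssi_a"] - joined["rssi_b"]
    return float(np.sqrt((diff ** 2).sum()))

print(euclidean_distance(joined))  # sqrt(3^2 + 5^2 + 8^2) ≈ 9.90 for the toy data
```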

Penalty

One modification that we made to the basic Euclidean distance algorithm concerned instances in which one phone received signal(s) from an access point that the other did not. As seen in our video on how we joined two scans of Wi-Fi data into a single DataFrame, when this situation occurs, the RSSI value recorded for the user who did not pick up activity from this access point is a null value. In our earliest implementation of the basic Euclidean distance algorithm, all access points for which this was the case were dropped from the subsequent calculation, meaning that both users needed to receive signals from an access point for its values to be included. However, we realized that this condition could omit very valuable data, including scenarios in which one user was simply too far away from an access point in comparison to the other person, indicating a high probability of a large distance between them.

To avoid this deletion of valuable information, we decided to fill the null value with some sort of ‘penalty’ for unmatched networks. Our first approach was to fill the null value with the minimum RSSI value received throughout the scans of both phones, representing the access point being very far from the user who did not scan it. Upon further analysis, we also explored using a grand average of RSSI values as the replacement, which acts less as a ‘penalty’ and more as a ‘substitution,’ acknowledging that RSSI data can be inherently irregular and that a lack of representation from an access point may not always indicate a large distance from the device.
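A minimal sketch of the two fill strategies, again assuming the hypothetical rssi_a/rssi_b layout in which unmatched networks appear as NaN:

```python
import pandas as pd

def fill_unmatched(joined: pd.DataFrame, mode: str = "average") -> pd.DataFrame:
    """Fill NaN RSSI entries for unmatched networks before the distance calculation.

    mode="minimum": penalize with the weakest RSSI seen across both phones' scans.
    mode="average": substitute the grand average RSSI across both phones' scans.
    """
    values = joined[["rssi_a", "rssi_b"]]
    if mode == "minimum":
        fill_value = values.min().min()     # weakest (most negative) observed RSSI
    else:
        fill_value = values.stack().mean()  # grand average over all observed RSSIs
    return joined.fillna({"rssi_a": fill_value, "rssi_b": fill_value})
```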

Ultimately, we found stronger results with the average penalty, and examples of our decision-making process are shown in the plots below.

Example of early plots indicating a clearer correlation between true and Euclidean distance when the RSSI values of unmatched networks are filled with an average penalty (right) as opposed to a minimum penalty (left). These results were confirmed with more detailed, extensive plots, including test statistics and lines of best fit.

Normalization

One interesting characteristic of our plots that we noticed early in the project was that the magnitudes of the y-axis (i.e., Euclidean distance values) often varied immensely across different tests. We attributed this variation in scale mainly to the number of access points observed in each test, since with more access points the Euclidean distances produced could be inherently larger. Specifically, a test performed in an environment with a dense concentration of access points will naturally include more individual terms in the Euclidean distance calculation than a test with fewer surrounding access points.

Given this fact, we knew that we would need a way to compare results fairly across all testing scenarios, leading us to methods of normalization. Our approaches included dividing the Euclidean distance by n, n², and the square root of n, where n represents the total number of access points seen amongst the joined Wi-Fi scans. Ultimately, our most successful results came from dividing the Euclidean distance by n, which logically aligned with our expectations, since this naturally weights for the total number of ‘dimensions’ seen in different environments.
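A minimal sketch of the normalization we settled on, under the same hypothetical rssi_a/rssi_b layout; the alternative divisors n² and √n would simply replace the denominator:

```python
import numpy as np
import pandas as pd

def normalized_euclidean_distance(joined: pd.DataFrame) -> float:
    """Euclidean distance divided by n, the number of access points in the joined scan."""
    n = len(joined)
    diff = joined["rssi_a"] - joined["rssi_b"]
    return float(np.sqrt((diff ** 2).sum()) / n)
```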

The plots on the left are the results of a basic Euclidean distance algorithm, including a test taken in Bauer Hall (top row) as well as two different apartment settings (middle and bottom rows). The plots on the right are the results from these same exact tests, modified algorithmically by normalizing by the number of networks picked up during the scan. This is an incredible transformation, as the direct relationship between true and Euclidean distance evidently strengthens and the dispersion of data points simultaneously decreases with the application of this normalization.

Standard Deviation

Another variation of the Euclidean distance algorithm that we developed first filtered out access points considered to be “unreliable.” We did this by calculating the standard deviation of the RSSI values from each BSS (unique access point) as part of the groupby-BSS step in our joining of data sets. We classified an access point as unreliable if there was a high level of inconsistency in the values produced by its signals, and we removed those access points before the distance calculation. We iterated through numerous values to test what would be an optimal threshold for standard deviation. While the results were not entirely conclusive, leading us to believe that this value is somewhat dependent on the specific test environment, we found values of around 7 to be an effective upper limit.
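A sketch of this filtering step, assuming a hypothetical raw scan table with one row per recorded signal and columns bss and rssi; the aggregation mirrors the groupby-BSS step, and max_std=7 reflects the upper limit described above:

```python
import pandas as pd

def drop_unreliable(scan: pd.DataFrame, max_std: float = 7.0) -> pd.DataFrame:
    """Aggregate a raw scan by BSS and drop access points whose RSSI varies too much.

    `scan` has one row per recorded signal, with columns 'bss' and 'rssi'
    (hypothetical layout). Returns one mean RSSI per reliable BSS.
    """
    grouped = scan.groupby("bss")["rssi"].agg(["mean", "std"])
    reliable = grouped[grouped["std"] <= max_std]
    return reliable[["mean"]].rename(columns={"mean": "rssi"})
```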

Rank

We observed that the range of RSSI values read was very dependent on the environment and therefore could cause inherent differences in Euclidean distance. Specifically, if one environment produced generally lower/higher RSSI values as a whole, or if one phone recorded lower/higher values because of some device-specific bias, the Euclidean distance would be inherently lower/higher. This would cause a major problem for a universal conversion from Euclidean distance to true distance in the process of classifying an exposure. Given this consideration, we created an algorithm that sorted the Wi-Fi scans for each respective device by strength and reassigned each network’s RSSI value as its “rank” of i*(1/n) for the ith strongest network. The algorithm only looks at a limited number of total networks to ensure equal ranking. It then calculated the Euclidean distance using the reassigned values.
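A sketch of the rank reassignment, again assuming the hypothetical rssi_a/rssi_b layout; the cap of 10 ranked networks is illustrative, not our actual setting:

```python
import numpy as np
import pandas as pd

def rank_transform(joined: pd.DataFrame, n_networks: int = 10) -> pd.DataFrame:
    """Replace each phone's RSSI values with ranks of i/n for its i-th strongest network.

    Only the strongest n_networks per phone are ranked so both phones share the
    same scale; networks outside the top n become NaN (hypothetical cap of 10).
    """
    ranked = joined.copy()
    for col in ("rssi_a", "rssi_b"):
        top = ranked[col].nlargest(n_networks)  # strongest (least negative) first
        ranks = pd.Series(np.arange(1, len(top) + 1) / n_networks, index=top.index)
        ranked[col] = ranks                     # aligns on BSS; all others become NaN
    return ranked
```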

N Strongest

Another way that we explored filtering the data was to consider only the top n strongest networks. The logic behind this was that access points farther from a user (access points with lower RSSI values) would be less reliable and less sensitive to changes in true distance. This algorithm essentially kept the top n strongest networks seen by each device individually and dropped the rest before calculating the Euclidean distance.
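A sketch of this filter under the same assumptions; both the value n=10 and the choice to combine the two per-device top-n lists as a union are illustrative readings of the rule, not a confirmed detail of our implementation:

```python
import pandas as pd

def keep_n_strongest(joined: pd.DataFrame, n: int = 10) -> pd.DataFrame:
    """Keep only access points that rank among the n strongest for either phone."""
    top_a = joined["rssi_a"].nlargest(n).index  # per-device top-n BSSs
    top_b = joined["rssi_b"].nlargest(n).index
    return joined.loc[top_a.union(top_b)]       # drop everything outside the top lists
```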

RSSI Threshold

We continued to address the unreliability of access points at a greater distance from the user by creating a Euclidean distance algorithm that filtered out networks based on a minimum acceptable RSSI value. In essence, this algorithm dropped all access points whose average RSSI value within the comparison interval was less than the specified threshold. We iterated through many possible thresholds and, similar to the standard deviation algorithm, did not find a single optimal value, but we did find that incorporating this condition delivered significantly improved results.
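A sketch of the threshold filter under the same assumptions; averaging the two phones' values per access point and the -75 dBm cutoff are illustrative choices, since we did not settle on a single optimal threshold:

```python
import pandas as pd

def drop_weak_networks(joined: pd.DataFrame, min_rssi: float = -75.0) -> pd.DataFrame:
    """Drop access points whose average RSSI in the comparison interval is below a threshold."""
    avg_rssi = joined[["rssi_a", "rssi_b"]].mean(axis=1)  # per-BSS average across phones
    return joined[avg_rssi >= min_rssi]
```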

BSS Frequency

Our last algorithm intended to handle “unreliable” networks did so by filtering out networks that showed up fewer than n times within a comparison interval. Most Wi-Fi networks produced 20-30 unique signal recordings in a 60-second interval; however, this level of recording was not consistent amongst all BSSs, making it an interesting criterion for investigation. Hence, we created a version of the Euclidean distance algorithm that dropped networks whose BSS showed up fewer than n times in the 60-second comparison interval.
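A sketch of the frequency filter, assuming the same hypothetical raw scan layout with a bss column; the cutoff of 10 appearances is illustrative:

```python
import pandas as pd

def drop_infrequent_bss(scan: pd.DataFrame, min_count: int = 10) -> pd.DataFrame:
    """Drop networks whose BSS shows up fewer than min_count times in the comparison interval.

    `scan` has one row per recorded signal with a 'bss' column (hypothetical layout).
    """
    counts = scan["bss"].value_counts()
    frequent = counts[counts >= min_count].index
    return scan[scan["bss"].isin(frequent)]
```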