|Markus Becker 623fff13f9||4 months ago|
Import GeoLife into SQLite for further analysis.
This tool lets you (and requires you to) ingest at least one of the following datasets:

- PRIVA'MOV
- GeoLife
After installing this tool using pip, run:

```
loclib-ingest -p /path/to/privamov/folder -g /path/to/geolife/folder -d traces.db
```
This operation will take a while (the fastest I've seen is around 10 minutes for PRIVA'MOV alone and around 5 for GeoLife), but the resulting database is smaller, a lot easier to query and, most importantly, unifies the two very different datasets.
The resulting SQLite database is kept very simple; only two tables are created:
This means it is easy to query for all samples of a trace, or all samples within an area. It even lets you ignore the underlying datasets' concept of traces entirely by querying for all samples belonging to a user within a certain timeframe.
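The three kinds of queries mentioned above could look roughly like the following sketch. The schema here is purely hypothetical (table names `traces` and `samples`, and all column names, are assumptions made for illustration); check the real database with `.schema` in the `sqlite3` shell before adapting it.

```python
import sqlite3

# Hypothetical schema: the real table and column names may differ.
SCHEMA = """
CREATE TABLE traces  (id INTEGER PRIMARY KEY, user_id TEXT, dataset TEXT);
CREATE TABLE samples (trace_id INTEGER REFERENCES traces(id),
                      timestamp INTEGER, latitude REAL, longitude REAL);
"""

def make_demo_db():
    """Build a tiny in-memory database so the queries below can be tried out."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO traces VALUES (?, ?, ?)",
        [(1, "alice", "geolife"), (2, "alice", "privamov"), (3, "bob", "geolife")],
    )
    conn.executemany(
        "INSERT INTO samples VALUES (?, ?, ?, ?)",
        [(1, 100, 39.90, 116.40), (1, 200, 39.91, 116.41),
         (2, 300, 45.76, 4.83), (3, 150, 39.95, 116.30)],
    )
    return conn

def samples_of_trace(conn, trace_id):
    """All samples of one trace, ordered by time."""
    return conn.execute(
        "SELECT timestamp, latitude, longitude FROM samples "
        "WHERE trace_id = ? ORDER BY timestamp", (trace_id,)).fetchall()

def samples_in_area(conn, lat_min, lat_max, lon_min, lon_max):
    """All samples inside a latitude/longitude bounding box."""
    return conn.execute(
        "SELECT timestamp, latitude, longitude FROM samples "
        "WHERE latitude BETWEEN ? AND ? AND longitude BETWEEN ? AND ? "
        "ORDER BY timestamp", (lat_min, lat_max, lon_min, lon_max)).fetchall()

def samples_of_user(conn, user_id, start, end):
    """All samples of a user within a timeframe, ignoring trace boundaries."""
    return conn.execute(
        "SELECT s.timestamp, s.latitude, s.longitude "
        "FROM samples s JOIN traces t ON t.id = s.trace_id "
        "WHERE t.user_id = ? AND s.timestamp BETWEEN ? AND ? "
        "ORDER BY s.timestamp", (user_id, start, end)).fetchall()
```

In this sketch the user query joins across both datasets, which is the point of unifying them in one database.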
Being a SQLite database, it does not offer the highest performance: expect around 5 s for each trace query. I'll consider adding other database backends in the future. The ingest of PRIVA'MOV is parallelized across a number of processes (`--cpu`) to speed it up and chunked (`--chunk`) to reduce memory usage. If your ingest takes very long, you can try to optimize it: reduce the number of chunks as far as your RAM allows and increase the number of processes (up to the number of CPU threads). The major bottleneck is saving into the SQLite database, which only one thread can do at a time.
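The idea is to collect (many) different functions which operate on separate traces to grade their likelihood of being legitimate. These functions return values between -1 and 1. Those values represent:

- `-1`: the reading looks completely fabricated
- `0`: no clear verdict either way
- `1`: the reading is believable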
Then the application on top can give weights to the functions which will be used to summarize them into one final score.
Here's some pseudocode on how functions are supposed to work:
```
fun grade_metric_A(history, head, hint):
    # determine how likely head is to be a true reading given the history according to metric A
    # the hint contains the rating of this metric so far, to allow for some short-term memory.
    if /* head looks completely fabricated */
        return -1
    elif /* head is believable */
        return 1
    else
        return 0
```
The application itself will determine the trustworthiness of a trace as follows:
```
fun grade_trace(trace, weights):
    metrics = [grade_metric_A, grade_metric_B, ...]
    scores = [1 for m in metrics]
    for l in 0 .. len(trace) - 1:
        for i, m in enumerate(metrics):
            # the hint maps the running score from [0, 1] to [-1, 1];
            # the metric's result is mapped from [-1, 1] back to [0, 1]
            scores[i] = scores[i] * (m(trace[0:l], trace[l], scores[i]*2-1)+1)/2
    # returns trustworthiness (between 0 = untrustworthy and 1 = trustworthy)
    return sum(weights .* scores)
```
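For reference, here is a minimal runnable Python sketch of the same aggregation. It passes the metrics in as an argument instead of hard-coding them, and the two toy metrics (`always_believable`, `always_unsure`) are hypothetical placeholders, not real metrics from this project. It also assumes the weights sum to 1, which is what keeps the result between 0 and 1.

```python
def grade_trace(trace, metrics, weights):
    """Combine per-sample metric grades into one trustworthiness value."""
    scores = [1.0] * len(metrics)
    for idx in range(len(trace)):
        history, head = trace[:idx], trace[idx]
        for i, metric in enumerate(metrics):
            hint = scores[i] * 2 - 1             # running score mapped to [-1, 1]
            grade = metric(history, head, hint)  # expected in [-1, 1]
            scores[i] *= (grade + 1) / 2         # grade mapped to [0, 1]
    # Weighted sum; in [0, 1] if the weights are non-negative and sum to 1.
    return sum(w * s for w, s in zip(weights, scores))


# Hypothetical toy metrics, only here to exercise the aggregation.
def always_believable(history, head, hint):
    return 1.0

def always_unsure(history, head, hint):
    return 0.0
```

One consequence worth noting: because the scores are multiplied, a grade of 0 halves the running score on every sample, and a single grade of -1 sets it to 0 for the rest of the trace.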
Currently, a user is expected to receive either the worst-case or the moving-average rating of their trace ratings. More sophisticated methods could be added later.
A median of all locations is determined, which could be represented by a moving average of the end of
Any datapoint close to this home base is more trustworthy than datapoints further away.
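A minimal sketch of such a home-base metric, assuming samples are `(latitude, longitude)` tuples in degrees. The exponential mapping from distance to grade and the `scale_km` parameter are assumptions chosen for illustration; the source does not specify how distance should translate into a grade.

```python
import math
from statistics import median

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points in kilometres."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def grade_home_base(history, head, hint=0.0, scale_km=50.0):
    """Grade head by its distance to the median location of the history."""
    if not history:
        return 1.0  # assumption: no history means no evidence against head
    home_lat = median(p[0] for p in history)
    home_lon = median(p[1] for p in history)
    d = haversine_km(home_lat, home_lon, head[0], head[1])
    # Assumed mapping: 1 at the home base, approaching -1 far away.
    return 2 * math.exp(-d / scale_km) - 1
```

Taking the median of latitude and longitude separately is a simplification; it is robust against a few outliers but is not a true geometric median.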
Determines the largest `l` for which `[history[-l:], head]` has a correlation coefficient R larger than some threshold (e.g. R > 0.9). For very small `l` the metric will return 1, whereas larger `l` will lead to negative values, as the path the user has taken is too linear.
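The following is one possible sketch of this linearity metric, assuming samples are `(x, y)` tuples. Several choices are assumptions not fixed by the description above: the absolute value of the Pearson coefficient is used (so lines with negative slope count as linear too), axis-aligned runs with zero variance are treated as perfectly linear, windows start at three points (two points always correlate perfectly), and the grade falls linearly from 1 to -1 as `l` approaches a hypothetical `l_max`.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 1.0  # assumption: axis-aligned or stationary counts as linear
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / math.sqrt(sxx * syy)

def longest_linear_run(history, head, r_min=0.9):
    """Largest l such that the last l history points plus head are near-linear."""
    points = list(history) + [head]
    best = 0
    for l in range(2, len(history) + 1):
        window = points[-(l + 1):]
        r = pearson([p[0] for p in window], [p[1] for p in window])
        if abs(r) > r_min:
            best = l
    return best

def grade_linearity(history, head, hint=0.0, r_min=0.9, l_max=20):
    """1 for short linear runs, falling to -1 once the run reaches l_max."""
    l = longest_linear_run(history, head, r_min)
    return 1.0 - 2.0 * min(l, l_max) / l_max
```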
In the target applications, malicious users will attempt to gain a large Ortsfaktor (English: location factor) to influence the outcome of event ratings. It is important to incorporate this into a metric.
The Ortsfaktor is calculated as exp(−0.0240259⋅x), where x is the distance between the position of the user and the location at which the event took place. If we assume the malicious user has used their last action to position themselves so as to maximise their influence via the Ortsfaktor, we have to reduce their trustworthiness in this metric by 1−exp(−0.0240259⋅x), where x now represents the distance between the last two samples. As this metric will converge on 0 over time, some factor needs to be introduced to allow long traces to remain trustworthy.
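As a small sketch of the formula: the code below computes the Ortsfaktor and turns the penalty 1−exp(−0.0240259⋅x) into a grade between -1 and 1. Reading "reduce by" as multiplying the running score by (1 − penalty) is an interpretation, and the function names are hypothetical. The unit of the distance is not stated above, so the code simply takes it as given.

```python
import math

ORTSFAKTOR_DECAY = 0.0240259  # decay constant from the formula above

def ortsfaktor(distance):
    """Location factor: 1 at the event location, decaying with distance."""
    return math.exp(-ORTSFAKTOR_DECAY * distance)

def grade_ortsfaktor_consistency(step_distance):
    """Grade in [-1, 1] based on the distance between the last two samples.

    Chosen so that (grade + 1) / 2 == 1 - penalty, i.e. a multiplicative
    aggregation shrinks the running score by exactly the penalty.
    """
    penalty = 1 - ortsfaktor(step_distance)
    return 1 - 2 * penalty
```

With this mapping a stationary user keeps a grade of 1, while every movement shrinks the running score, which is why the text above calls for a compensating factor for long traces.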
Longer traces are inherently more trustworthy. This can be represented by a function which maps longer traces onto values close to 1 and shorter traces onto values close to 0. Note that traces with a higher sampling frequency will appear longer, which is acceptable as they provide more data to determine their validity.
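One simple candidate for such a mapping is a saturating function of the sample count. The particular form `n / (n + half_length)` and the `half_length` parameter are assumptions for illustration; any monotonic function from trace length onto values between 0 and 1 would fit the description.

```python
def grade_length(history, head, hint=0.0, half_length=100):
    """Map the number of samples seen so far onto a value between 0 and 1.

    half_length (hypothetical parameter) is the sample count at which
    the grade reaches 0.5.
    """
    n = len(history) + 1  # samples seen so far, including head
    return n / (n + half_length)
```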
I kinda want to run a 2D FFT on the locations and see if there are some dominant frequencies that real data will always contain.
We check whether the user is moving at speeds above those of any land vehicle. Traces which contain any traversal at speeds above car speeds are marked as untrustworthy.
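A minimal sketch of this check, assuming samples are `(timestamp_seconds, latitude, longitude)` tuples. The threshold `max_speed_ms` (70 m/s, roughly 250 km/h) is an assumed value, and the handling of non-increasing timestamps is likewise a choice made for this example.

```python
import math

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points in metres."""
    r = 6371000.0  # mean Earth radius in m
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def grade_physical_possibility(history, head, hint=0.0, max_speed_ms=70.0):
    """Return -1 if reaching head from the previous sample is too fast, else 1."""
    if not history:
        return 1.0  # nothing to compare against yet
    t0, lat0, lon0 = history[-1]
    t1, lat1, lon1 = head
    dist = haversine_m(lat0, lon0, lat1, lon1)
    dt = t1 - t0
    if dt <= 0:
        # Assumption: moving without time passing is impossible.
        return -1.0 if dist > 0 else 1.0
    return -1.0 if dist / dt > max_speed_ms else 1.0
```

Since a grade of -1 zeroes the running score in a multiplicative aggregation, a single impossible jump marks the whole trace as untrustworthy, matching the description above.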
```mermaid
graph TD
    Metrics --> Statistics
    Statistics --> Linearity
    Statistics --> 2D-Frequency
    Metrics --> Loyalty
    Loyalty --> Length
    Loyalty --> hb[Home Base]
    Metrics --> Attacks
    Attacks --> oc[Ortsfaktor consistency]
    Attacks --> pp[Physical possibility]
```