location-ingest
Import GeoLife into SQLite for further analysis.
Ingest
This tool allows (and requires) you to ingest at least one of the following datasets:
- GeoLife (1.6 GB extracted, 800 MB ingested)
- PRIVA'MOV GPS (8.1 GB extracted, 5 GB ingested)
After installing this tool using pip, run:
loclib-ingest -p /path/to/privamov/folder -g /path/to/geolife/folder -d traces.db
This operation will take a while (the fastest I've seen is around 10 minutes for PRIVA'MOV alone and around 5 for GeoLife), but the resulting database is smaller, a lot easier to query, and, most importantly, unifies the two very different datasets.
The resulting SQLite database is kept very simple, only two tables are created:
- samples
- id: unique id
- lat: latitude
- lon: longitude
- unixtime: timestamp
- trace: unique id of trace this sample is part of
- traces
- id: unique id
- user: user id (prefixed with 1 for GeoLife and 2 for PRIVA'MOV)
This makes it easy to query for all samples of a trace, or for all samples within an area. It even allows ignoring the underlying datasets' concept of traces entirely by querying for all samples belonging to a user within a certain timeframe, as in the sketch below.
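For illustration, here is a minimal sketch of such a query from Python. Only the table and column names come from the schema above; the concrete user id and timestamps are placeholders, and the database is assumed to have been created by the ingest step.

```python
import sqlite3

con = sqlite3.connect("traces.db")

# All samples belonging to one user within a certain timeframe,
# ignoring the underlying datasets' notion of traces entirely.
rows = con.execute(
    """
    SELECT s.lat, s.lon, s.unixtime
    FROM samples AS s
    JOIN traces AS t ON s.trace = t.id
    WHERE t.user = ?
      AND s.unixtime BETWEEN ? AND ?
    ORDER BY s.unixtime
    """,
    (10042, 1_200_000_000, 1_210_000_000),  # placeholder user id and unix timestamps
).fetchall()
```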
Being a SQLite database, it doesn't bring the highest performance; you can expect around 5 s per trace query. I'll consider adding other database backends in the future. The ingest of PRIVA'MOV is parallelized across a number of processes (--cpu) to speed it up and chunked (--chunk) to reduce memory usage. If your ingest takes very long, you can try to optimize: reduce the number of chunks as far as your RAM allows and increase the number of processes (up to the number of CPU threads). The major bottleneck is saving into the SQLite database, which only one thread can do at a time.
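As a purely hypothetical example (the flags are the ones described above, the values are placeholders), a tuned ingest on a machine with 8 CPU threads and plenty of RAM could look like:
loclib-ingest -p /path/to/privamov/folder -g /path/to/geolife/folder -d traces.db --cpu 8 --chunk 16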
Method for rating traces
The idea is to collect (many) different functions which each operate on a single trace and grade how likely it is to be legit. These functions return values between -1 and 1, where:
- -1: Trace is completely unbelievable
- 1: Trace is completely believable
- 0: Can't decide either way
The application on top can then assign weights to these functions, which are used to combine their ratings into one final score.
Here's some pseudocode on how functions are supposed to work:
def grade_metric_A(history, head, hint):
    # determine how likely head is to be a true reading given the history according to metric A
    # the hint contains the rating of this metric so far, to allow for some short-term memory
    if head_looks_completely_fabricated:
        return -1
    elif head_is_believable:
        return 1
    else:
        return 0
The application itself will determine the trustworthiness of a trace as follows:
def grade_trace(trace, weights):
    metrics = [grade_metric_A, grade_metric_B, ...]
    scores = [1 for m in metrics]
    for l in range(len(trace)):
        for i, m in enumerate(metrics):
            # the hint is this metric's running score, mapped back from [0, 1] to [-1, 1]
            scores[i] = scores[i] * (m(trace[0:l], trace[l], scores[i] * 2 - 1) + 1) / 2
    # returns trustworthiness (between 0 = untrustworthy and 1 = trustworthy,
    # assuming the weights sum to 1)
    return sum(w * s for w, s in zip(weights, scores))
Method for rating users
Users are currently expected to receive either the worst-case or the moving-average rating of their trace ratings; however, more sophisticated methods could be added.
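A minimal sketch of both options, assuming trace_scores is the list of a user's per-trace scores in [0, 1]; the smoothing factor alpha is an arbitrary placeholder.

```python
def rate_user_worst_case(trace_scores):
    # The user is only as trustworthy as their least trustworthy trace.
    return min(trace_scores)

def rate_user_moving_average(trace_scores, alpha=0.3):
    # Exponential moving average over the trace scores, newest last.
    rating = trace_scores[0]
    for score in trace_scores[1:]:
        rating = alpha * score + (1 - alpha) * rating
    return rating
```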
Criteria for rating traces
Distance from home base
A median of all locations is determined, which could be represented by a moving average over the end of history. Any datapoint close to this home base is more trustworthy than datapoints further away.
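A sketch of how this could look in the grade_metric signature from above, assuming samples are (lat, lon) tuples; the 50 km decay scale is an arbitrary placeholder.

```python
import math
from statistics import median

def haversine_km(a, b):
    # Great-circle distance in kilometres between two (lat, lon) points.
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371 * math.asin(math.sqrt(h))

def grade_home_base(history, head, hint):
    if not history:
        return 0
    # Component-wise median of all previous locations as the home base.
    home = (median(p[0] for p in history), median(p[1] for p in history))
    dist = haversine_km(home, head)
    # Close to home -> +1, far away -> towards -1 (the 50 km decay is a placeholder).
    return 2 * math.exp(-dist / 50) - 1
```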
Linearity
Determines the largest l for which [history[-l:], head] has a correlation coefficient larger than some value R > 0.9. For very small l the metric will return 1, whereas larger l will lead to negative values, as the path the user has taken is too linear.
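A sketch under the same assumptions (samples as (lat, lon) tuples); R = 0.9, the cutoff small_l = 5 and the mapping from the length of the linear stretch to a score are all placeholders.

```python
import numpy as np

def grade_linearity(history, head, hint, R=0.9, small_l=5):
    # Find the largest l for which history[-l:] plus head lie on an (almost) straight
    # line, measured by the absolute correlation between latitude and longitude.
    pts = np.array(history + [head])
    largest = 0
    for l in range(2, len(pts)):
        lat, lon = pts[-(l + 1):, 0], pts[-(l + 1):, 1]
        if lat.std() == 0 or lon.std() == 0:
            r = 1.0  # axis-parallel line: perfectly linear
        else:
            r = abs(np.corrcoef(lat, lon)[0, 1])
        if r > R:
            largest = l
    if largest <= small_l:
        return 1  # short linear stretches are perfectly normal
    # Longer and longer almost perfectly linear stretches become increasingly suspicious.
    return max(-1.0, 1.0 - (largest - small_l) / small_l)
```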
Ortsfaktor consistency
In the target applications, malicious users will attempt to gain a large Ortsfaktor (English: location factor) to influence the outcome of event ratings. It is important to incorporate this into a metric.
The Ortsfaktor is calculated as exp(−0.0240259⋅x), where x is the distance between the position of the user and the location at which the event has taken place. If we assume a malicious user has used their last action to place themselves so as to maximise their influence via the Ortsfaktor, we have to reduce their trustworthiness in this metric by 1 − exp(−0.0240259⋅x), where x now represents the distance between the last two samples. As this metric will converge on 0 over time, some factor needs to be introduced to allow long traces to remain trustworthy.
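A sketch of this metric, reusing the haversine_km helper from the home-base sketch and assuming x is measured in kilometres (the unit is not stated above); the factor needed to keep long traces trustworthy is left out.

```python
import math

K = 0.0240259  # decay constant of the Ortsfaktor, taken from the formula above

def grade_ortsfaktor(history, head, hint):
    # The bigger the jump between the last two samples, the more Ortsfaktor a
    # malicious user could have gained by fabricating it.
    if not history:
        return 0
    x = haversine_km(history[-1], head)  # distance between the last two samples, in km
    gain = 1 - math.exp(-K * x)          # trustworthiness reduction from the text above
    # Map into [-1, 1]: no jump -> 1, a maximal jump -> -1.
    return 1 - 2 * gain
```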
Length
Longer traces are inherently more trustworthy. This can simply be represented by a function which maps longer traces onto values closer to 1 and shorter traces onto values close to 0. It is important to consider that traces with a higher sampling frequency will appear longer, which is acceptable, as they give more data to determine their validity.
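A sketch of one possible mapping in the same signature; the scale of 500 samples is an arbitrary placeholder.

```python
import math

def grade_length(history, head, hint, scale=500):
    # Map the number of samples seen so far onto [0, 1): short traces score close
    # to 0 (undecided), long traces approach 1 (believable).
    n = len(history) + 1
    return 1 - math.exp(-n / scale)
```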
2D-Frequency
I kind of want to run a 2D FFT on the locations and see if there are some dominant frequencies that real data will always contain.
Physical possibility
We check whether the user is moving at speeds above those of any land vehicle. Traces which contain any traversal at speeds above car speeds are marked as untrustworthy.
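A sketch of this check, reusing haversine_km from the home-base sketch and assuming each sample here additionally carries its unixtime as a third element; the 90 m/s (~320 km/h) threshold is an arbitrary placeholder above car speeds.

```python
MAX_LAND_SPEED = 90  # metres per second, ~320 km/h; placeholder above any car

def grade_physical_possibility(history, head, hint):
    if not history:
        return 0
    prev = history[-1]
    dt = head[2] - prev[2]  # seconds between the last two samples
    if dt <= 0:
        return -1           # going backwards in time is just as impossible
    speed = haversine_km(prev[:2], head[:2]) * 1000 / dt  # metres per second
    return 1 if speed <= MAX_LAND_SPEED else -1
```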
Map
graph TD
Metrics --> Statistics
Statistics --> Linearity
Statistics --> 2D-Frequency
Metrics --> Loyalty
Loyalty --> Length
Loyalty --> hb[Home Base]
Metrics --> Attacks
Attacks --> oc[Ortsfaktor consistency]
Attacks --> pp[Physical possibility]