Import GeoLife into SQLite for further analysis.


This tool allows (and requires) you to ingest at least one of the following datasets:

  • GeoLife (1.6 GB extracted, 800 MB ingested)
  • PRIVA'MOV GPS (8.1 GB extracted, 5 GB ingested)

After installing this tool with pip, run:

loclib-ingest -p /path/to/privamov/folder -g /path/to/geolife/folder -d traces.db

This operation will take a while (the fastest I've seen is around 10 minutes for PRIVA'MOV alone and around 5 for GeoLife), but the resulting database is smaller, a lot easier to query, and, most importantly, unifies the two very different datasets.

The resulting SQLite database is kept very simple, only two tables are created:

  • samples
    • id: unique id
    • lat: latitude
    • lon: longitude
    • unixtime: timestamp
    • trace: unique id of trace this sample is part of
  • traces
    • id: unique id
    • user: user id (prefixed with 1 for GeoLife and 2 for PRIVA'MOV)

This makes it easy to query for all samples of a trace, or all samples within an area. It even allows you to ignore the underlying datasets' concept of traces entirely by querying for all samples belonging to a user within a certain timeframe.
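As a sketch, such a timeframe query could look like the following, run against an in-memory database with the same two-table layout. The column types, the user id 10001, and all sample values are made up for illustration:

```python
import sqlite3

# Recreate the two-table layout described above (column types are assumptions).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE traces (
        id   INTEGER PRIMARY KEY,
        user INTEGER NOT NULL         -- prefixed: 1 for GeoLife, 2 for PRIVA'MOV
    );
    CREATE TABLE samples (
        id       INTEGER PRIMARY KEY,
        lat      REAL NOT NULL,
        lon      REAL NOT NULL,
        unixtime INTEGER NOT NULL,
        trace    INTEGER NOT NULL REFERENCES traces(id)
    );
""")
# Two traces belonging to the same (hypothetical) user 10001.
conn.executemany("INSERT INTO traces VALUES (?, ?)", [(1, 10001), (2, 10001)])
conn.executemany(
    "INSERT INTO samples VALUES (?, ?, ?, ?, ?)",
    [
        (1, 39.9, 116.3, 100, 1),  # trace 1
        (2, 39.9, 116.4, 200, 1),
        (3, 40.0, 116.3, 300, 2),  # trace 2, same user
    ],
)

# All samples of one user within a timeframe, ignoring trace boundaries:
rows = conn.execute("""
    SELECT s.lat, s.lon, s.unixtime
    FROM samples s JOIN traces t ON s.trace = t.id
    WHERE t.user = ? AND s.unixtime BETWEEN ? AND ?
    ORDER BY s.unixtime
""", (10001, 0, 250)).fetchall()
# rows -> [(39.9, 116.3, 100), (39.9, 116.4, 200)]
```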

Being a SQLite database, it doesn't offer the highest performance; you can expect around 5 s for each trace query. I'll consider adding other database backends in the future. The ingest of PRIVA'MOV is parallelized across a number of processes (--cpu) to speed it up and chunked (--chunk) to reduce memory usage. If your ingest takes very long, you can try to optimize: reduce the number of chunks as far as your RAM allows and increase the number of processes (up to the number of CPU threads). The major bottleneck is writing into the SQLite database, which only one writer can do at a time.

Method for rating traces

The idea is to collect (many) different functions which operate on individual traces and grade how likely each trace is to be legitimate. These functions return values between -1 and 1, meaning:

  • -1: Trace is completely unbelievable
  • 1: Trace is completely believable
  • 0: Can't decide either way

The application on top can then assign weights to the functions, which are used to combine them into one final score.

Here's some pseudocode on how functions are supposed to work:

fun grade_metric_A(history, head, hint):
    # determine how likely head is to be a true reading given the history, according to metric A
    # the hint contains the rating of this metric so far, to allow for some short-term memory
    if /* head looks completely fabricated */
        return -1
    elif /* head is believable */
        return 1
    else
        return 0

The application itself will determine the trustworthiness of a trace as follows:

fun grade_trace(trace, weights):
    metrics = [grade_metric_A, grade_metric_B, ...]
    scores = [1 for m in metrics]
    for l in 1..len(trace)-1:
        for i, m in enumerate(metrics):
            # pass the running score, mapped back to [-1, 1], as the hint
            scores[i] = scores[i] * (m(trace[0:l], trace[l], scores[i]*2-1) + 1)/2
    # returns trustworthiness (between 0 = untrustworthy and 1 = trustworthy)
    return sum(weights .* scores)  # element-wise product, then sum
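A runnable sketch of this loop, with a toy speed-based metric standing in for the real grade_metric_* functions. Unlike the pseudocode, metrics is passed as a parameter here; the 40/70 m/s cutoffs and the flat-earth distance conversion are assumptions:

```python
import math

def grade_metric_speed(history, head, hint):
    # Toy stand-in for a real grade_metric_* function (not from the repo):
    # rate head by the speed needed to reach it from the last known sample.
    if not history:
        return 0  # can't decide without history
    prev = history[-1]
    dt = max(head["unixtime"] - prev["unixtime"], 1)
    # crude flat-earth conversion: 1 degree ~ 111 km (assumption)
    dist_m = math.hypot(head["lat"] - prev["lat"],
                        head["lon"] - prev["lon"]) * 111_000
    speed = dist_m / dt
    if speed > 70:   # faster than any land vehicle: fabricated
        return -1
    if speed < 40:   # ordinary movement: believable
        return 1
    return 0         # grey zone: can't decide

def grade_trace(trace, weights, metrics):
    # Same loop as the pseudocode above, with metrics as a parameter
    # instead of a hard-coded list.
    scores = [1.0 for _ in metrics]
    for l in range(1, len(trace)):
        for i, m in enumerate(metrics):
            hint = scores[i] * 2 - 1        # map [0, 1] back to [-1, 1]
            rating = m(trace[:l], trace[l], hint)
            scores[i] *= (rating + 1) / 2   # fold the rating into the score
    # trustworthiness between 0 (untrustworthy) and 1 (trustworthy)
    return sum(w * s for w, s in zip(weights, scores))
```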

Method for rating users

Users are currently expected to receive either the worst-case or the moving-average rating of their trace ratings. More sophisticated methods could be added later.

Criteria for rating traces

Distance from home base

The median of all locations is determined, which could be approximated by a moving average over the end of the history. Any datapoint close to this home base is more trustworthy than datapoints further away.
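A minimal sketch of such a home-base metric. The component-wise median, the exponential falloff, and the 0.1-degree length scale are all assumptions:

```python
import math
from statistics import median

def home_base_score(history, head):
    # Rate head by its distance to the median of all past locations
    # (the "home base"). Points are (lat, lon) tuples.
    if len(history) < 3:
        return 0  # too little history to locate a home base
    home_lat = median(p[0] for p in history)
    home_lon = median(p[1] for p in history)
    # crude degree distance; real code would use proper geodesics
    d = math.hypot(head[0] - home_lat, head[1] - home_lon)
    # close to home -> 1, far away -> -1 (0.1 deg ~ 11 km scale, assumed)
    return 2 * math.exp(-d / 0.1) - 1
```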


Linearity

Determines the largest l for which [history[-l:], head] has a correlation coefficient larger than some threshold R > 0.9. For very small l the metric returns 1, whereas larger l leads to negative values, as the path the user is taking is too linear.

Ortsfaktor consistency

In the target applications, malicious users will attempt to gain a large Ortsfaktor (English: location factor) to influence the outcome of event ratings. It is important to incorporate this into a metric.

The Ortsfaktor is calculated as exp(−0.0240259⋅x), where x is the distance between the user's position and the location of the event. If we assume the malicious user has used their last action to place themselves so as to maximise their influence via the Ortsfaktor, we have to reduce their trustworthiness in this metric by 1−exp(−0.0240259⋅x), where x is the distance between the last two samples. As this metric converges to 0 over time, some factor needs to be introduced to allow long traces to remain trustworthy.
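Expressed as code, a direct transcription of the two formulas above (the unit of x is whatever the constant was fitted for, which isn't stated here):

```python
import math

ALPHA = 0.0240259  # decay constant from the formula above

def ortsfaktor(x):
    # location factor of a user at distance x from the event
    return math.exp(-ALPHA * x)

def ortsfaktor_penalty(x):
    # trust reduction for a last movement of distance x: 1 - exp(-ALPHA * x)
    return 1 - math.exp(-ALPHA * x)
```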


Length

Longer traces are inherently more trustworthy. This can simply be represented by a function which maps longer traces onto values closer to 1 and shorter traces onto values closer to 0. Note that traces with a higher sampling frequency will appear longer, which is acceptable, as they give more data to determine their validity.
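One possible shape for such a mapping; the exponential form and the 500-sample scale are assumptions, not from the repo:

```python
import math

def length_score(n_samples, scale=500):
    # Map trace length onto [0, 1): longer traces approach 1,
    # shorter ones stay near 0.
    return 1 - math.exp(-n_samples / scale)
```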


2D-Frequency

I kinda want to run a 2D FFT on the locations and see if there are some dominant frequencies real data will always contain.

Physical possibility

We check whether the user is moving at speeds above those of any land vehicle. Traces containing any traversal faster than car speeds are marked as untrustworthy.


graph TD
    Metrics --> Statistics
    Statistics --> Linearity
    Statistics --> 2D-Frequency
    Metrics --> Loyalty
    Loyalty --> Length
    Loyalty --> hb[Home Base]
    Metrics --> Attacks
    Attacks --> oc[Ortsfaktor consistency]
    Attacks --> pp[Physical possibility]