Add an option to only process log entries that haven't been processed before #232

Open
wants to merge 2 commits into
base: 3.x-dev

@mackuba (Contributor) commented Nov 17, 2018

I want to use Matomo with log analytics only. My Nginx logs are rotated every week, but I want my reports to be updated much earlier, e.g. every hour. If I just feed the same log file with already reported visits to the importer, I will have duplicated entries, so I need to either rotate logs every hour (very inconvenient) or somehow prevent logs from being imported twice. Based on what I could find, there is currently no easy way to do this.

This pull request solves this by tracking the latest visit timestamp found in an imported log file and saving it to a file specified with a --timestamp-file option. On the next run, this timestamp is loaded at startup and all visits at or before it are ignored (like --exclude-older-than, but inclusive, since the entry with the equal timestamp was already parsed).
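
A minimal sketch of the mechanism described above, not the code from this pull request. The function names are hypothetical, log entries are reduced to bare timestamps, and storing the value as an ISO 8601 string is an assumption about the file format:

```python
from datetime import datetime, timezone
from pathlib import Path
from typing import Iterable, List, Optional, Tuple


def load_timestamp(path: Path) -> Optional[datetime]:
    """Return the timestamp saved by a previous run, or None on the first run."""
    if not path.exists():
        return None
    return datetime.fromisoformat(path.read_text().strip())


def save_timestamp(path: Path, latest: datetime) -> None:
    """Persist the latest seen timestamp (assumed format: ISO 8601 text)."""
    path.write_text(latest.isoformat())


def select_new_entries(
    timestamps: Iterable[datetime], initial: Optional[datetime]
) -> Tuple[List[datetime], Optional[datetime]]:
    """Skip entries at or before `initial` and track the newest one kept."""
    latest = initial
    fresh = []
    for ts in timestamps:
        # Inclusive comparison: an entry with an equal timestamp was
        # already imported by the previous run.
        if initial is not None and ts <= initial:
            continue
        fresh.append(ts)
        if latest is None or ts > latest:
            latest = ts
    return fresh, latest
```

One consequence of the inclusive comparison is that entries sharing the exact second of the saved timestamp but written after the previous run finished would be skipped.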

This kind of solves #144.

I've put initial_timestamp (loaded from the file at the beginning) in the config and latest_timestamp (updated after every log record) in the stats. This can be moved elsewhere if it's not the best place.

I've also added some lines to the summary to print the status of the timestamp-based filtering, and included the older/newer than filtering too since it's related:

Logs import summary
-------------------

    85 requests imported successfully
    36 requests were downloads
    10627 requests ignored:
        73 HTTP errors
        2 HTTP redirects
        0 invalid log lines
        345 filtered log lines
        0 requests did not match any known site
        0 requests did not match any --hostname
        153 requests done by bots, search engines...
        10054 requests to static resources (css, js, images, ico, ttf...)
        0 requests to file downloads did not match any --download-extensions

    Processed logs since: 2018-11-06 19:05:12 +0000
    Saved last timestamp: 2018-11-07 11:59:42 +0000

I also tweaked the printing there to remove extra empty lines (runs of more than 2 newlines are compacted into 2). This was already a problem before: the space between the 2nd and 3rd sections was bigger than between 1/2 and 3/4 because of %(sites_ignored)s, but it became more visible once the date filtering section was added.
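
The newline compaction can be illustrated with a one-line regular expression. This is a sketch of the idea only; the helper name is hypothetical and the pull request may implement it differently:

```python
import re


def compact_blank_lines(text: str) -> str:
    """Collapse any run of three or more newlines into exactly two."""
    return re.sub(r"\n{3,}", "\n\n", text)
```

Runs of one or two newlines are left untouched, so the intended single blank line between summary sections is preserved.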

@cweiske (Contributor) commented Oct 5, 2019

Works fine here.
The only problem I see is that the timestamp file gets updated even when using --dry-run.
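
One possible way to address this, sketched with hypothetical names rather than the importer's actual code: skip the write whenever the run is a dry run, no timestamp file was requested, or no timestamp was seen.

```python
from pathlib import Path
from typing import Optional


def maybe_save_timestamp(
    path: Optional[Path], latest: Optional[str], dry_run: bool
) -> bool:
    """Write `latest` to `path` unless this is a dry run.

    Returns True if the file was written, False otherwise.
    """
    # A dry run must leave no side effects, including the timestamp file.
    if dry_run or path is None or latest is None:
        return False
    path.write_text(latest)
    return True
```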

@tsteur tsteur changed the base branch from master to 3.x-dev January 13, 2020 22:45
@DevDavido

This is definitely a useful enhancement. I would certainly love to see this in the Matomo Log Importer 👍.

@strager commented Feb 10, 2022

I resolved merge conflicts with 4.x-dev (commit 6f66f96) here: https://github.com/strager/matomo-log-analytics/tree/timestamp

@strager commented Jun 28, 2022

Ping. I've been using my version of this patch for a while and I've been happy with it.

strager added a commit to quick-lint/quick-lint-js that referenced this pull request Nov 16, 2022
Matomo does not automatically import Apache logs. Importing needs to be
done manually.

Write a systemd service which runs the log importer (using our fork [1]
to incrementally import logs [2]), and a systemd timer to run the
service daily.

[1] https://github.com/strager/matomo-log-analytics/tree/timestamp
[2] matomo-org/matomo-log-analytics#232
@atom-box

Help wanted!


5 participants