Fix harvest validation (advice on proper PR creation needed) #444

Closed
wants to merge 19 commits into from

bonnland
Contributor

This PR is meant to contain only the changes to ckanharvester.py, but I am struggling to isolate that file in the PR. I hope someone can help me clean it up, or suggest how to redo it so that it contains just ckanharvester.py.

This code change is based on #312 and fixes a problem for harvest instances that implement their own validators.

The WAFs that we harvest often contain metadata records that fail our extra validation rules.

Here's the behavior I'm seeing with ckanext-harvest version 1.3.3, when we harvest a WAF containing records that fail validation:

  • When the harvest queues have been newly started or purged, the harvest job is successfully marked as finished and an error email is sent.
  • A second harvest of the same WAF causes this stack trace in gather_err.log:
2021-05-18 09:31:21,153 DEBUG [ckanext.harvest.queue] Received harvest job id: 04c1c1db-087d-436b-88b1-ea9e910db1c3
Traceback (most recent call last):
  File "/usr/lib/ckan/default/bin/paster", line 8, in <module>
    sys.exit(run())
  File "/usr/lib/ckan/default/lib/python2.7/site-packages/paste/script/command.py", line 102, in run
    invoke(command, command_name, options, args[1:])
  File "/usr/lib/ckan/default/lib/python2.7/site-packages/paste/script/command.py", line 141, in invoke
    exit_code = runner.run(args)
  File "/usr/lib/ckan/default/lib/python2.7/site-packages/paste/script/command.py", line 236, in run
    result = self.command()
  File "/usr/lib/ckan/default/src/ckanext-harvest/ckanext/harvest/commands/harvester.py", line 235, in command
    utils.gather_consumer()
  File "/usr/lib/ckan/default/src/ckanext-harvest/ckanext/harvest/utils.py", line 336, in gather_consumer
    gather_callback(consumer, method, header, body)
  File "/usr/lib/ckan/default/src/ckanext-harvest/ckanext/harvest/queue.py", line 347, in gather_callback
    job = HarvestJob.get(id)
  File "/usr/lib/ckan/default/src/ckanext-harvest/ckanext/harvest/model/__init__.py", line 116, in get
    o = cls.filter(**kwds).first()
  File "/usr/lib/ckan/default/lib/python2.7/site-packages/sqlalchemy/orm/query.py", line 2755, in first
    ret = list(self[0:1])
  File "/usr/lib/ckan/default/lib/python2.7/site-packages/sqlalchemy/orm/query.py", line 2547, in __getitem__
    return list(res)
  File "/usr/lib/ckan/default/lib/python2.7/site-packages/sqlalchemy/orm/query.py", line 2855, in __iter__
    return self._execute_and_instances(context)
  File "/usr/lib/ckan/default/lib/python2.7/site-packages/sqlalchemy/orm/query.py", line 2876, in _execute_and_instances
    close_with_result=True)
  File "/usr/lib/ckan/default/lib/python2.7/site-packages/sqlalchemy/orm/query.py", line 2885, in _get_bind_args
    **kw
  File "/usr/lib/ckan/default/lib/python2.7/site-packages/sqlalchemy/orm/query.py", line 2867, in _connection_from_session
    conn = self.session.connection(**kw)
  File "/usr/lib/ckan/default/lib/python2.7/site-packages/sqlalchemy/orm/session.py", line 998, in connection
    execution_options=execution_options)
  File "/usr/lib/ckan/default/lib/python2.7/site-packages/sqlalchemy/orm/session.py", line 1003, in _connection_for_bind
    engine, execution_options)
  File "/usr/lib/ckan/default/lib/python2.7/site-packages/sqlalchemy/orm/session.py", line 382, in _connection_for_bind
    self._assert_active()
  File "/usr/lib/ckan/default/lib/python2.7/site-packages/sqlalchemy/orm/session.py", line 276, in _assert_active
    % self._rollback_exception
sqlalchemy.exc.InvalidRequestError: This Session's transaction has been rolled back due to a previous exception during flush. To begin a new transaction with this Session, first issue Session.rollback(). Original exception was: (psycopg2.DatabaseError) could not receive data from server: Connection timed out
 [SQL: 'INSERT INTO harvest_log (id, content, level, created) VALUES (%(id)s, %(content)s, %(level)s, %(created)s)'] [parameters: {'content': u'Received harvest job id: 04c1c1db-087d-436b-88b1-ea9e910db1c3', 'created': datetime.datetime(2021, 5, 18, 15, 31, 21, 155237), 'id': u'c4225513-9d7f-48ff-b9c5-4363d549054e', 'level': 'DEBUG'}]

This harvest job eventually times out, though sometimes only 1-5 hours after the 5-minute timeout that we have configured.

If, however, I purge the queues before the second harvest job, then the job finishes successfully and there is no stack trace.
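
For reference, the error message in the traceback says the session must be rolled back before it can be used again. Below is a minimal sketch of that general pattern, not the change made in this PR. `run_with_rollback` and `work` are hypothetical names, and `session` is only assumed to expose `commit()` and `rollback()` the way a SQLAlchemy session does. Whether this is the right place to handle the problem in ckanext-harvest is not established here.

```python
import logging

log = logging.getLogger(__name__)


def run_with_rollback(session, work):
    """Run work(session) and commit; on any failure, roll back and re-raise.

    `session` is assumed to expose commit() and rollback(), as a
    SQLAlchemy session does. `work` is a hypothetical callable that
    performs the database operations.
    """
    try:
        result = work(session)
        session.commit()
        return result
    except Exception:
        # After a failed flush the session refuses further use until
        # rollback() is called, which is what the traceback reports.
        session.rollback()
        log.exception("Database work failed; session rolled back")
        raise
```

The exception is re-raised so the caller still sees the original failure. The rollback only leaves the session usable for the next job.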

This PR runs validation before create_or_update(). I have verified that, with this change, I no longer have to purge the harvest queues for harvest jobs to keep being marked as finished. Instead, we get an error email on every harvest. Note that we prefer the repeated emails, as we might otherwise forget that one of our records has a problem that needs fixing.
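
The idea of validating first can be sketched roughly as follows. This is an illustration only, not the code in ckanharvester.py. `import_if_valid`, `require_title`, and the convention that each validator returns a list of error strings are all assumptions made for the example.

```python
def import_if_valid(record, validators, create_or_update):
    """Validate a record first; call create_or_update only if it passes.

    All names are hypothetical. Each validator takes the record dict
    and returns a list of error strings (empty when the record is valid).
    """
    errors = []
    for validate in validators:
        errors.extend(validate(record))
    if errors:
        # Report the failure without attempting any database write.
        return False, errors
    create_or_update(record)
    return True, []


def require_title(record):
    """Example validator (hypothetical): a record must have a title."""
    return [] if record.get("title") else ["Missing title"]
```

Because invalid records never reach the write step, a validation failure can be reported as an ordinary harvest error instead of surfacing partway through a database operation.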

@bonnland
Contributor Author

It just hit me that I can simply base this PR off master instead of our dev branch. I will resubmit a better PR.

@bonnland bonnland closed this May 18, 2021
@bonnland bonnland deleted the fix-validation branch May 18, 2021 19:08