CorrectOCR is a tool used to improve text from OCR processes on printed text in PDF documents.
Documentation is available at readthedocs.
Local development is done using docker-compose: docker-compose up
This command mounts the code, workspace and tests directories. It is based on the 'Dockerfile-dev' build, which does not include the aforementioned directories but is otherwise based on the production build file (Dockerfile).
Note that settings such as workspace location and database credentials can be set using CorrectOCR.INI and/or environment variables.
If none of these are set, the code falls back to default values.
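The layering described above can be sketched roughly as follows. This is a minimal illustration, not CorrectOCR's actual configuration code: the section name `settings`, the keys, the `CORRECTOCR_` prefix, the default values, and the choice to let environment variables win over the INI file are all assumptions made for the example.

```python
import configparser
import os

# Hypothetical names for illustration only; the real section, keys,
# environment variable names and defaults used by CorrectOCR may differ.
DEFAULTS = {"workspace": "/app/workspace", "db_host": "localhost"}
ENV_PREFIX = "CORRECTOCR_"


def load_settings(ini_path="CorrectOCR.INI", environ=None):
    """Resolve settings, assuming: defaults < INI file < environment."""
    environ = os.environ if environ is None else environ
    settings = dict(DEFAULTS)

    # Values from the INI file override the built-in defaults.
    parser = configparser.ConfigParser()
    parser.read(ini_path)  # a missing file is silently skipped
    if parser.has_section("settings"):
        for key in DEFAULTS:
            if parser.has_option("settings", key):
                settings[key] = parser.get("settings", key)

    # Environment variables (assumed here to take precedence) come last.
    for key in DEFAULTS:
        env_name = ENV_PREFIX + key.upper()
        if env_name in environ:
            settings[key] = environ[env_name]
    return settings
```

Under these assumptions, calling `load_settings(environ={"CORRECTOCR_DB_HOST": "db"})` would override only the database host and leave the other values to the INI file or the defaults.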
To start containers (use the --build flag to rebuild):
docker-compose up
To prepare tokens from /app/workspace/original/:
docker-compose exec backend python -m CorrectOCR prepare --all --step server --autocrop --precache_images --loglevel DEBUG
To open a shell on db (run in another terminal):
docker exec -it $(docker ps -q --filter name=db) bash
To open a shell on backend:
docker exec -it $(docker ps -q --filter name=backend) bash
Deployment is done using Travis-CI. When pushing new commits, Travis-CI starts a new build, which builds a Docker image and pushes it to AWS ECR.
The deployment is then done by starting a deployment process at AWS Elastic Beanstalk, which pulls the newly built image.
The code runs as a Docker container on AWS Elastic Beanstalk.
The workspace directory is mounted on the EC2 host from EFS. This ensures that changes in the workspace are kept if the EC2 instance is changed or the environment is rebuilt.
CorrectOCR depends on a database. In production it connects to an RDS database.
All settings concerning mounting and database connections are set using environment variables in Elastic Beanstalk.
CorrectOCR is based on code created by:
- Caitlin Richter
- Matthew Wickes
- Deniz Beser
- Mitchell Marcus
See their article “Low-resource Post Processing of Noisy OCR Output for Historical Corpus Digitisation” (LREC-2018) for further details. It is available online: http://www.lrec-conf.org/proceedings/lrec2018/pdf/971.pdf
The original Python 2.7 code (see the original tag in the repository) has been licensed under Creative Commons Attribution 4.0 (CC-BY-4.0, see also license.txt in the repository).
The code has subsequently been updated to Python 3 and further expanded by Mikkel Eide Eriksen for the Copenhagen City Archives (mainly structural changes; the algorithms are generally preserved as-is). Pull requests welcome!