This repository contains the source code to implement the Enterprise Knowledge Solution (EKS) on Google Cloud Platform (GCP). The solution is composed of modular components that together enable an end-to-end workflow for document processing, management, and analysis:
- Document Ingestion: Upload and import a variety of document types.
- Document Processing: Validate, extract information, and transform document content.
- Document Storage: Securely store and manage processed documents.
- Document Indexing: Enable efficient search and retrieval of document information.
- Search and Summarization: Search and summarize document content.
- Document Retrieval: Access the original document files.
The solution comprises the following key components:
| Component | Description |
| --- | --- |
| Document Processing | Python tools and deployments for executing document processing tasks (extraction, transformation, enrichment). |
| Common Infrastructure | Provides the shared infrastructure foundation for the EKS (networking, storage, datasets, etc.). |
| Workflow Orchestrator | Orchestrates the end-to-end document processing workflow using Cloud Composer. |
| Web UI | Offers a user interface for interacting with the EKS (search, summarization, document views, etc.). |
The diagram above depicts the data flow: documents uploaded to a Google Cloud Storage bucket are processed and prepared for search and summarization.
This solution assumes that you have already configured an enterprise-ready foundation. The foundation is not a technical prerequisite (meaning, you can use the deployment guide without a foundation). However, we recommend that you build an enterprise-ready foundation before releasing production workloads with sensitive data.
For more details, see Deploying Solutions to an enterprise-ready foundation
This section provides step-by-step instructions on how to deploy the Enterprise Knowledge Solution
on Google Cloud using Terraform.
To deploy this solution, perform the following steps:
- Create or select a Google Cloud project and ensure that billing is enabled for your Google Cloud project.
- To provide a secure and reliable connection to the solution's Web UI, you must own a domain name used to access the web application. An SSL load balancer with a managed certificate is provisioned for your domain and securely routes traffic to the Web UI application.
- This example code is deployed through Terraform using the identity of a least-privilege service account. To create this service account and validate other requirements with a setup script, your user identity must have the following IAM roles on your project:
- Project IAM Admin
- Role Admin
- Service Account Admin
- Service Usage Admin
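For reference, a project owner could grant these roles from the command line. This is a minimal sketch; the user email and project ID are placeholders, and the role IDs shown correspond to the role names listed above:

```sh
# Sketch: grant the IAM roles needed to run the setup script (replace the placeholders).
USER_EMAIL="you@example.com"   # placeholder: the user identity that will run the setup script
PROJECT_ID="your-project-id"   # placeholder: your Google Cloud project

for ROLE in \
  roles/resourcemanager.projectIamAdmin \
  roles/iam.roleAdmin \
  roles/iam.serviceAccountAdmin \
  roles/serviceusage.serviceUsageAdmin; do
  gcloud projects add-iam-policy-binding "$PROJECT_ID" \
    --member="user:$USER_EMAIL" \
    --role="$ROLE"
done
```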
- To deploy this repository using an online terminal with software and authentication preconfigured, use Cloud Shell.
Alternatively, to deploy this repository using a local terminal:
- install and initialize the gcloud CLI
- install Terraform
- install the Git CLI
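Before continuing from a local terminal, you can confirm the tools are installed and on your PATH (a quick sanity check; version output will vary):

```sh
# Verify that the required CLIs are available.
gcloud --version
terraform -version
git --version
```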
- In Cloud Shell or your preferred terminal, clone this repository:
git clone https://github.com/GoogleCloudPlatform/document-processing-and-understanding.git
- Navigate to the sample directory:
cd <YOUR_REPOSITORY>/sample-deployments/composer-orchestrated-process
where `<YOUR_REPOSITORY>` is the path to the directory where you cloned this repository.
- Set the following environment variables:
export PROJECT_ID="<your Google Cloud project id>"
export REGION="<Google Cloud region for deploying the resources>"
export IAP_ADMIN_ACCOUNT="<the email of the group or user identity displayed as the support_email field on the OAuth consent screen; this must be either the email of the user running the script, or a group of which they are an Owner>"
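For example, with purely illustrative values (replace them with your own):

```sh
# Illustrative values only; substitute your own project, region, and IAP admin account.
export PROJECT_ID="my-eks-project"
export REGION="us-central1"
export IAP_ADMIN_ACCOUNT="eks-admins@example.com"
```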
- (Optional) By default, this repository automatically creates and uses a service account, deployer@$PROJECT_ID.iam.gserviceaccount.com, to deploy Terraform resources. The necessary IAM roles and authentication are automatically configured in the setup script for ease of deployment. If you have a service account in your existing Terraform pipeline that you want to use instead, additionally set the following optional environment variable to configure your custom deployer service account with the least-privilege IAM roles:
export SERVICE_ACCOUNT_ID="your existing service account identity to be used for Terraform."
- Run the following script to set up your environment and your Cloud project for running Terraform. This script does the following:
- Validates software dependencies.
- Enables the required APIs defined in `project_apis.txt`.
- Enables the required IAM roles, defined in `persona_roles_DEPLOYER.txt`, on the service account you'll use to deploy Terraform resources.
- Sets up the OAuth consent screen (brand) required for IAP. While most infrastructure resources are created through Terraform, we recommend bootstrapping this resource with a user identity rather than a service account to avoid issues related to support_email ownership and destroying a Terraform-managed Brand resource.
- Enables the required IAM roles used for the underlying Cloud Build processes.
- Authenticates Application Default Credentials with the credentials of the service account to be used by Terraform.
- Builds a custom container image used for form parsing.

scripts/pre_tf_setup.sh
The script also opens a "Sign in with Google" pop-up window asking you to authenticate the Google Auth Library. Follow the directions to complete the authentication flow with your user account, which then configures Application Default Credentials using the impersonated service account credentials to be used by Terraform.
- Initialize Terraform:
terraform init
- Create a terraform.tfvars file with the following variables:
project_id                  = # Your Google Cloud project ID.
region                      = # The desired region for deploying single-region resources (e.g., "us-central1", "europe-west1").
vertex_ai_data_store_region = # The multi-region for your Agent Builder Data Store; possible values are "global", "us", or "eu".
docai_location              = # The location for the Document AI service.
webui_domains               = # Your domain name for Web UI access (e.g., ["webui.example.com"]).
iap_access_domains          = # The list of domains granted IAP access to the Web UI (e.g., ["domain:example.com"]).
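A filled-in terraform.tfvars might look like the following; the values are illustrative placeholders, not recommendations:

```hcl
# Illustrative example; replace every value with your own.
project_id                  = "my-eks-project"
region                      = "us-central1"
vertex_ai_data_store_region = "us"
docai_location              = "us"
webui_domains               = ["webui.example.com"]
iap_access_domains          = ["domain:example.com"]
```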
- (Optional) By default, this repository creates a new VPC network in the same project as the other resources. You can use an existing VPC network instead by configuring the following optional Terraform variables:
create_vpc_network = false # Default is true.
vpc_name           = # The name of your existing VPC (e.g., "myvpc").
If using an existing VPC, validate that it has firewall policies and DNS zones that enable the traffic pathways defined in vpc.tf, and grant the Compute Network User role on your shared VPC to the deployer service account.
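For example, granting Compute Network User on a Shared VPC host project could look like the following sketch; the host project ID and deployer service account email are placeholders:

```sh
# Sketch: grant the deployer service account Compute Network User on the Shared VPC host project.
HOST_PROJECT_ID="my-vpc-host-project"                          # placeholder: your Shared VPC host project
DEPLOYER_SA="deployer@my-eks-project.iam.gserviceaccount.com"  # placeholder: your deployer service account

gcloud projects add-iam-policy-binding "$HOST_PROJECT_ID" \
  --member="serviceAccount:$DEPLOYER_SA" \
  --role="roles/compute.networkUser"
```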
- Review the proposed changes, and apply them:
terraform apply
The provisioning process may take approximately an hour to complete.
- Print the DNS configuration for the Web UI and configure the DNS records for the Web UI accordingly:
terraform output webui_dns_config
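How you create the records depends on where your domain's DNS is hosted. If you manage the zone in Cloud DNS, an A record could be added roughly as follows; the zone name, domain, and load balancer IP address are placeholders for the values shown in the output above:

```sh
# Sketch: point the Web UI domain at the load balancer IP in a Cloud DNS managed zone.
gcloud dns record-sets create "webui.example.com." \
  --zone="my-managed-zone" \
  --type="A" \
  --ttl="300" \
  --rrdatas="203.0.113.10"
```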
- Migrate the Terraform state to the remote Cloud Storage backend:
terraform init -migrate-state
Terraform detects that you already have a state file locally and prompts you to migrate the state to the new Cloud Storage bucket. When prompted, enter `yes`.
If you update the source code or pull the latest changes from the repository, re-run the following command to apply the changes to your deployed environment:
terraform apply
After successfully completing the steps in the previous section, Deployment Guide, you can test the entire EKS workflow.
- Get the Input Bucket Name:
terraform output gcs_input_bucket_name
This command will display the name of the Cloud Storage bucket designated for uploading documents.
- Open the Input Bucket:
- Go to the Cloud Storage console
- Locate the input bucket using the name obtained in the previous step.
- Upload Your Documents:
- Click the "Upload Files" button or drag and drop your files into the bucket. Supported file types:
- MS Outlook (msg)
- MS Excel (xlsx, xlsm)
- MS Word (docx)
- MS PowerPoint (pptx)
- PDF with text only content
- PDF with forms
- HTML
- TXT
- ZIP containing any of the above supported file types
- Click the "Upload Files" button or drag and drop your files into the bucket. Supported file types:
- Get the Cloud Composer Airflow URI:
terraform output composer_uri
This command will display the web interface URI of the Cloud Composer Airflow environment.
- Access the Airflow UI:
- Open your web browser and navigate to the URI obtained in the previous step.
- The first time, you will need to authenticate with your Google Cloud credentials.
- Trigger the Workflow:
- In the Airflow UI, locate the DAG (Directed Acyclic Graph) named `run_docs_processing`, which represents the document processing workflow.
- Click the "Trigger DAG" button to access the trigger page. Here, you can view the input parameters for the workflow.
- Leave the default parameters as they are and click the "Trigger" button to initiate the workflow.
- Set the `classifier` parameter per your environment, with the following structure: projects/<CLASSIFIER_PROJECT>/locations/<CLASSIFIER_LOCATION>/processors/<CLASSIFIER_ID>. All of these values are available in the Cloud Console, on the classifier overview page. (A command-line alternative for triggering the DAG is shown after this list.)
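As an alternative to the Airflow UI, the DAG can be triggered from the command line through Cloud Composer; this is a sketch, and the environment name and location are placeholders for the values from your deployment:

```sh
# Sketch: trigger the document processing DAG via the Airflow CLI exposed by Cloud Composer.
COMPOSER_ENV="my-composer-environment"   # placeholder: your Composer environment name
COMPOSER_LOCATION="us-central1"          # placeholder: your Composer environment region

gcloud composer environments run "$COMPOSER_ENV" \
  --location "$COMPOSER_LOCATION" \
  dags trigger -- run_docs_processing
```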
- Monitor Execution Progress:
- Navigate to the DAG details view using the URL `<composer_uri>/dags/run_docs_processing` (replace `<composer_uri>` with the URI you obtained earlier).
- This page displays the progress of each task in the workflow, along with logs and other details.
Once the workflow completes successfully, all documents will be imported into the Vertex AI Agent Builder Data Store named Document Processing & Understanding.
- Get the Agent Builder app URI:
terraform output agent_app_uri
- Access the Agent Builder app console:
- Open your web browser and navigate to the URI obtained in the previous step.
- Search and Explore:
- On the console page, you'll find an input bar. Enter your questions or queries related to the documents you've uploaded.
- The app will provide summarized answers based on the content of your documents, along with references to the specific source documents.
- Access the EKS Web UI:
- Open your web browser and navigate to the domain address that you configured for the Web UI.
- The first time, you will need to authenticate with your Google Cloud credentials.
- Search and Explore:
- In the `Search Documents` page, enter your questions or queries related to the documents you've uploaded and press Enter to get summarized answers, along with references to the specific source documents.
- In the `Browse Documents` page, explore and view the documents stored in the Data Store.
For more information on the Web UI component, please refer to its Readme.
- Identify the document you want to delete:
- Open Agent Builder Datastore and note down the ID and URI of the document that you want to delete from DP&U.
- Make sure the file at the URI exists in the Google Cloud Storage bucket.
- Please note that this script will not delete the GCS folder that contains the file.
- Based on the URI, identify and note down the name of the BQ table that contains the document metadata.
- Please note that this script will not delete the BQ table that contains the document metadata.
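Before running the script, you can confirm that the object behind the URI still exists; the URI below is a placeholder for the value you noted from the data store:

```sh
# Sketch: confirm the object referenced by the document URI exists in Cloud Storage.
DOC_URI="gs://my-input-bucket/folder/document.pdf"   # placeholder: the URI noted from the data store
gcloud storage ls "$DOC_URI"
```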
- Execute the bash script to delete a single document:
scripts/delete_doc.sh -d <DOC_ID> -u <DOC_URI> -t <BQ_TABLE> -l <LOCATION> [-p <PROJECT_ID>]
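For example, a single-document deletion might look like the following; every value, including the BigQuery table reference format, is an illustrative placeholder for the IDs you noted above:

```sh
# Illustrative invocation; replace each value with your own document ID, URI, table, location, and project.
scripts/delete_doc.sh \
  -d "a1b2c3d4e5" \
  -u "gs://my-input-bucket/folder/document.pdf" \
  -t "my_dataset.my_metadata_table" \
  -l "us" \
  -p "my-eks-project"
```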
- Execute the bash script to delete a batch of documents:
scripts/delete_doc.sh -b <BATCH_ID> -l <LOCATION> [-p <PROJECT_ID>]
To classify documents, you must create a custom document classifier in the Google Cloud console.
- You can use the test documents and forms to train and evaluate the classifier in your GCP environment.
- We have created an annotated dataset to expedite the training process. Please contact your Google account representative to get access to the annotated dataset.
- The output labels of the classifier MUST match the labels configured in the Composer DAG configuration `doc-ai-processors`. Out of the box, the solution supports the `form` and `invoice` labels. Any other label causes the flow to treat the document as a generic document and process it without extracting structured data from the document.
- After training the custom classifier, set the classifier ID in Composer as a default argument. Add the following variable to your Terraform variables file and run `terraform apply` again:
custom_classifier_id = "projects/<CLASSIFIER_PROJECT>/locations/<CLASSIFIER_LOCATION>/processors/<CLASSIFIER_ID>"