All posts by Patrik Nagel

Content Repository for Unstructured Data with Multilingual Semantic Search: Part 2

Post Syndicated from Patrik Nagel original https://aws.amazon.com/blogs/architecture/content-repository-for-unstructured-data-with-multilingual-semantic-search-part-2/

Leveraging vast amounts of unstructured data poses challenges, particularly for global businesses that need to search data across languages. In Part 1 of this blog series, we built the architectural foundation for the content repository. Its key component was the dynamic, access control-based logic, together with a web UI to upload documents.

In Part 2, we extend the content repository with multilingual semantic search capabilities while maintaining the access control logic from Part 1. This allows users to ingest documents into the content repository in multiple languages and then run search queries that return references to semantically similar documents.

Solution overview

Building on the architectural foundation from Part 1, we introduce four new building blocks to extend the search functionality.

Optical character recognition (OCR) workflow: To automatically identify, understand, and extract text from ingested documents, we use Amazon Textract and a sample review dataset of documents in .png format (Figure 1). We use Amazon Textract synchronous application programming interfaces (APIs) to capture key-value pairs for the reviewid and reviewBody attributes. Based on your specific requirements, you can choose to capture either the complete extracted text or parts of the text.

Figure 1. Sample document for ingestion
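
To make the OCR step concrete, here is a minimal sketch of how a document such as the one in Figure 1 could be processed with the Amazon Textract synchronous analyze_document API and its FORMS feature to recover key-value pairs such as reviewid and reviewBody. The bucket name and object key are hypothetical, and production code would also need to handle multi-page documents and selection elements.

import boto3

textract = boto3.client("textract")

def get_child_text(block, blocks_by_id):
    """Concatenate the WORD children of a Textract block."""
    words = []
    for rel in block.get("Relationships", []):
        if rel["Type"] == "CHILD":
            for child_id in rel["Ids"]:
                child = blocks_by_id[child_id]
                if child["BlockType"] == "WORD":
                    words.append(child["Text"])
    return " ".join(words)

def extract_key_values(bucket, key):
    """Return the form key-value pairs detected in a single-page document."""
    response = textract.analyze_document(
        Document={"S3Object": {"Bucket": bucket, "Name": key}},
        FeatureTypes=["FORMS"],
    )
    blocks = response["Blocks"]
    blocks_by_id = {b["Id"]: b for b in blocks}
    pairs = {}
    for block in blocks:
        if block["BlockType"] == "KEY_VALUE_SET" and "KEY" in block.get("EntityTypes", []):
            key_text = get_child_text(block, blocks_by_id)
            value_text = ""
            for rel in block.get("Relationships", []):
                if rel["Type"] == "VALUE":
                    for value_id in rel["Ids"]:
                        value_text = get_child_text(blocks_by_id[value_id], blocks_by_id)
            pairs[key_text] = value_text
    return pairs

# Hypothetical bucket and object key, for illustration only
pairs = extract_key_values("my-source-bucket", "reviews/review-0001.png")
print(pairs.get("reviewid"), pairs.get("reviewBody"))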

Embedding generation: To capture the semantic relationships in the text, we use a machine learning (ML) model that maps words and sentences to high-dimensional vector embeddings. You can use Amazon SageMaker, a fully managed ML service, to build, train, and deploy your ML models to production-ready hosted environments. You can also deploy ready-to-use pre-trained models from multiple sources, such as SageMaker JumpStart. For this blog post, we use the open-source pre-trained universal-sentence-encoder-multilingual model from TensorFlow Hub. The model inference endpoint deployed to a SageMaker endpoint generates embeddings for the document text and the search query. Figure 2 shows an example of the n-dimensional vector generated as output when the reviewBody attribute text is provided to the embedding model.

Figure 2. Sample embedding representation of the value of reviewBody
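
To get a feel for what the encoder produces, the following sketch loads the same TensorFlow Hub model locally and embeds two semantically similar sentences written in different languages; in the solution itself the model runs behind a SageMaker endpoint instead. The sketch assumes the tensorflow_hub, tensorflow_text, and numpy packages are installed (tensorflow_text must be imported so that the model's text ops are registered).

import numpy as np
import tensorflow_hub as hub
import tensorflow_text  # noqa: F401  (registers the SentencePiece ops the model needs)

model = hub.load(
    "https://tfhub.dev/google/universal-sentence-encoder-multilingual/3"
)

sentences = [
    "This product works really well.",        # English
    "Dieses Produkt funktioniert sehr gut.",  # German
]
embeddings = model(sentences).numpy()         # shape: (2, 512)

# Semantically similar sentences end up close together, regardless of language.
cosine = np.dot(embeddings[0], embeddings[1]) / (
    np.linalg.norm(embeddings[0]) * np.linalg.norm(embeddings[1])
)
print(embeddings.shape, cosine)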

Embedding ingestion: To make the embeddings searchable for the content repository users, you can use the k-Nearest Neighbor (k-NN) search feature of Amazon OpenSearch Service. The OpenSearch k-NN plugin supports several search methods. For this blog post, we use the Approximate k-NN search approach, based on the Hierarchical Navigable Small World (HNSW) algorithm. HNSW uses a hierarchical set of proximity graphs in multiple layers to improve performance when searching large datasets for the “nearest neighbors” of the search query text embeddings.
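
For the embeddings to be searchable, the OpenSearch index needs a knn_vector field configured for HNSW. The following is a minimal sketch using the opensearch-py client; the endpoint, index name, field names, and vector dimension are illustrative, and authentication (for example, SigV4 request signing) is omitted for brevity.

from opensearchpy import OpenSearch

# Hypothetical domain endpoint; authentication setup omitted for brevity
client = OpenSearch(
    hosts=[{"host": "my-domain.eu-central-1.es.amazonaws.com", "port": 443}],
    use_ssl=True,
)

index_body = {
    "settings": {"index.knn": True},
    "mappings": {
        "properties": {
            "reviewid": {"type": "keyword"},
            "reviewBody": {"type": "text"},
            "department": {"type": "keyword"},   # copied from the S3 object tag
            "reviewBody_embeddings": {
                "type": "knn_vector",
                "dimension": 512,                 # output size of the encoder model
                "method": {
                    "name": "hnsw",
                    "engine": "lucene",           # Lucene engine supports efficient k-NN filtering
                    "space_type": "cosinesimil",
                },
            },
        }
    },
}

client.indices.create(index="content-repo", body=index_body)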

Semantic search: We make the search service accessible as additional backend logic on Amazon API Gateway. Authenticated content repository users send their search query through the frontend and receive the matching documents. The solution maintains end-to-end access control by comparing the department attribute in the user’s enriched Amazon Cognito identity (ID) token claim with the tags of the ingested documents.

Technical architecture

The technical architecture includes two parts:

  1. Implementing multilingual semantic search functionality: Describes the processing workflow for a document that the user uploads and makes the document searchable.
  2. Running a user-initiated search query: Covers the search workflow for an input query; finds and returns the nearest neighbors of the input text query to the user.

Part 1. Implementing multilingual semantic search functionality

Our previous blog post discussed blocks A through D (Figure 3), including user authentication, ID token enrichment, Amazon Simple Storage Service (Amazon S3) object tags for dynamic access control, and document upload to the source S3 bucket. In the following section, we cover blocks E through H. The overall workflow describes how an unstructured document is ingested into the content repository, run through the backend OCR and embedding generation process, and how the resulting vector embeddings are finally stored in the OpenSearch Service.

Figure 3. Technical architecture for implementing multilingual semantic search functionality

  1. The OCR workflow extracts text from your uploaded documents.
    • The source S3 bucket sends an event notification to Amazon Simple Queue Service (Amazon SQS).
    • The document transformation AWS Lambda function subscribed to the Amazon SQS queue invokes an Amazon Textract API call to extract the text.
  2. The document transformation Lambda function makes an inference request to the encoder model hosted on SageMaker. In this example, the Lambda function submits the reviewBody attribute to the encoder model to generate the embedding.
  3. The document transformation Lambda function writes an output file in the transformed S3 bucket. The text file consists of:
    • The reviewid and reviewBody attributes extracted from Step 1
    • An additional reviewBody_embeddings attribute from Step 2
      Note: The workflow tags the output file with the same S3 object tags as the source document for downstream access control.
  4. The transformed S3 bucket sends an event notification to invoke the indexing Lambda function.
  5. The indexing Lambda function reads the text file content. It then makes an OpenSearch index API call that includes the source document tag as one of the indexing attributes for access control (a minimal sketch of this transformation and indexing flow follows the list).
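
The following condensed sketch illustrates steps 2 through 5: request an embedding from the SageMaker endpoint, write the transformed output carrying the source document's tags, and index the result in OpenSearch. Resource names are hypothetical, the request format assumes a TensorFlow Serving-style endpoint, and in the actual solution the transformation and indexing steps run in two separate Lambda functions.

import json
import boto3
from opensearchpy import OpenSearch

s3 = boto3.client("s3")
sagemaker_runtime = boto3.client("sagemaker-runtime")
# Hypothetical domain endpoint; OpenSearch authentication omitted for brevity
opensearch = OpenSearch(
    hosts=[{"host": "my-domain.eu-central-1.es.amazonaws.com", "port": 443}],
    use_ssl=True,
)

def transform_and_index(review_id, review_body, source_bucket, source_key,
                        endpoint_name, transformed_bucket):
    # Step 2: request the embedding from the SageMaker-hosted encoder model
    response = sagemaker_runtime.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/json",
        Body=json.dumps({"instances": [review_body]}),
    )
    embedding = json.loads(response["Body"].read())["predictions"][0]

    # Step 3: copy the source object's tags so downstream access control still applies
    tags = s3.get_object_tagging(Bucket=source_bucket, Key=source_key)["TagSet"]
    tagging = "&".join(f"{t['Key']}={t['Value']}" for t in tags)
    document = {
        "reviewid": review_id,
        "reviewBody": review_body,
        "reviewBody_embeddings": embedding,
        "department": next((t["Value"] for t in tags if t["Key"] == "department"), None),
    }
    s3.put_object(Bucket=transformed_bucket, Key=f"{source_key}.json",
                  Body=json.dumps(document), Tagging=tagging)

    # Steps 4-5: the indexing Lambda function reads the transformed file and indexes it,
    # including the tag-derived department attribute used for access control
    opensearch.index(index="content-repo", id=review_id, body=document)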

Part 2. Running user-initiated search query

Next, we describe how the user’s request produces query results (Figure 4).

Figure 4. Search query lifecycle

  1. The user enters a search string in the web UI to retrieve relevant documents.
  2. Based on the active sign-in session, the UI passes the user’s ID token to the search endpoint of the API Gateway.
  3. The API Gateway uses Amazon Cognito integration to authorize the search API request.
  4. Once validated, the search API endpoint request invokes the search document Lambda function.
  5. The search document function sends the search query string as the inference request to the encoder model to receive the embedding as the inference response.
  6. The search document function uses the embedding response to build an OpenSearch k-NN search query. The HNSW algorithm is configured with the Lucene engine and its filter option to maintain the access control logic based on the custom department claim from the user’s ID token (see the sketch after this list). The OpenSearch query returns the following for the query embedding:
    • The top three approximate nearest neighbors
    • Other attributes, such as reviewid and reviewBody
  7. The workflow sends the relevant query result attributes back to the UI.
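
A sketch of the k-NN query that the search document function might build is shown below. The filter clause uses the Lucene engine's efficient filtering so that only documents tagged with the caller's department are considered; the index and field names follow the earlier sketches and are illustrative.

def build_knn_query(query_embedding, department, k=3):
    """Approximate k-NN query restricted to documents from the user's department."""
    return {
        "size": k,
        "query": {
            "knn": {
                "reviewBody_embeddings": {
                    "vector": query_embedding,
                    "k": k,
                    # With the Lucene engine, the filter is applied during the HNSW
                    # graph search, so access control is enforced inside the k-NN query.
                    "filter": {
                        "term": {"department": department}
                    },
                }
            }
        },
        "_source": ["reviewid", "reviewBody"],
    }

# Example usage with the client from the earlier sketch:
# results = opensearch.search(index="content-repo", body=build_knn_query(embedding, "sales"))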

Prerequisites

You must have the following prerequisites for this solution:

Walkthrough

Setup

The following steps deploy two AWS CDK stacks into your AWS account:

  • content-repo-search-stack (blog-content-repo-search-stack.ts) creates the environment detailed in Figure 3, except for the SageMaker endpoint, which you create in a separate step.
  • demo-data-stack (userpool-demo-data-stack.ts) deploys sample users, groups, and role mappings.

To continue setup, use the following commands:

  1. Clone the project Git repository:
    git clone https://github.com/aws-samples/content-repository-with-multilingual-search content-repository
  2. Install the necessary dependencies:
    cd content-repository/backend-cdk 
    npm install
  3. Configure environment variables:
    export CDK_DEFAULT_ACCOUNT=$(aws sts get-caller-identity --query 'Account' --output text)
    export CDK_DEFAULT_REGION=$(aws configure get region)
  4. Bootstrap your account for AWS CDK usage:
    cdk bootstrap aws://$CDK_DEFAULT_ACCOUNT/$CDK_DEFAULT_REGION
  5. Deploy the code to your AWS account:
    cdk deploy --all

The complete stack set-up may take up to 20 minutes.

Creation of SageMaker endpoint

Follow these steps to create the SageMaker endpoint in the same AWS Region where you deployed the AWS CDK stack.

    1. Sign in to the SageMaker console.
    2. In the navigation menu, select Notebook, then Notebook instances.
    3. Choose Create notebook instance.
    4. Under the Notebook instance settings, enter content-repo-notebook as the notebook instance name, and leave other defaults as-is.
    5. Under the Permissions and encryption section (Figure 5), set the IAM role to the role with the prefix content-repo-search-stack. If this role isn’t automatically populated, select it from the drop-down. Leave the rest of the defaults, and choose Create notebook instance.

      Figure 5. Notebook permissions

    6. The notebook instance status shows Pending while it is being created; the notebook becomes available for use within 3-4 minutes.
    7. Once the notebook is in the Available status, choose Open Jupyter.
    8. Choose the Upload button and upload the create-sagemaker-endpoint.ipynb file from the backend-cdk folder in the root of the blog repository.
    9. Open the create-sagemaker-endpoint.ipynb notebook. Select the option Run All from the Cell menu (Figure 6). This might take up to 10 minutes.

      Figure 6. Run create-sagemaker-endpoint notebook cells

    10. After all the cells have run successfully, verify that the AWS Systems Manager parameter sagemaker-endpoint is updated with the value of the SageMaker endpoint name. Figure 7 shows an example value as the output of the cell. If you don’t see the output, check that the preceding steps ran correctly.

      Figure 7. SSM parameter updated with SageMaker endpoint

    11. Verify in the SageMaker console that the inference endpoint with the prefix tensorflow-inference has been deployed and is set to status InService.
    12. Upload sample data to the content repository:
      • Update the S3_BUCKET_NAME variable in the upload_documents_to_S3.sh script in the root folder of the blog repository with the s3SourceBucketName from the AWS CDK output of the content-repo-search-stack.
      • Run the upload_documents_to_S3.sh script to upload 150 sample documents to the content repository. This takes 5-6 minutes. During this process, each uploaded document triggers the workflow described in Implementing multilingual semantic search functionality.

Using the search service

At this stage, you have deployed all the building blocks for the content repository in your AWS account. Next, as part of the sample data upload, you pushed a limited corpus of 150 sample documents (.png format) into the content repository. Each document is in one of four languages: English, German, Spanish, and French. With the added multilingual search capability, you can query in one language and receive semantically similar results across different languages while maintaining the access control logic.

  1. Access the frontend application:
    • Copy the amplifyHostedAppUrl value of the AWS CDK output from the content-repo-search-stack shown in the terminal.
    • Enter the URL in your web browser to access the frontend application.
    • A temporary page displays until the automated build and deployment of the React application completes after 4-5 minutes.
  2. Sign into the application:
    • The content repository provides two demo users with credentials as part of the demo-data-stack in the AWS CDK output. Copy the password from the terminal associated with the sales-user, which belongs to the sales department.
    • Follow the prompts from the React webpage to sign in with the sales-user and change the temporary password.
  3. Enter search queries and verify results. The search action invokes the workflow described in Running user-initiated search query. For example:
    • Enter works well as the search query. Note the multilingual output and the semantically similar results (Figure 8).

        Figure 8. Positive sentiment multilingual search result for the sales-user

    • Enter bad quality as the search query. Note the multilingual output and the semantically similar results (Figure 9).

      Figure 9. Negative sentiment multilingual search result for the sales-user

  4. Sign out as the sales-user with the Log Out button on the webpage.
  5. Sign in using the marketing-user credentials to verify access control:
    • Follow the sign-in procedure from step 2, but with the marketing-user.
    • This time, with works well as the search query, you see different output. This is because access control only allows the marketing-user to search documents that belong to the marketing department (Figure 10).

      Figure 10. Positive sentiment multilingual search result for the marketing-user

Cleanup

In the backend-cdk subdirectory of the cloned repository, delete the deployed resources: cdk destroy --all.

Additionally, you need to access the Amazon SageMaker console to delete the SageMaker endpoint and notebook instance that you created during the walkthrough.

Conclusion

In this blog, we enriched the content repository with multilingual semantic search features while maintaining the access control fundamentals that we implemented in Part 1. The building blocks of the semantic search for unstructured documents—Amazon Textract, Amazon SageMaker, and Amazon OpenSearch Service—set a foundation for you to customize and enhance the search capabilities for your specific use case. For example, you can leverage the fast developments in large language models (LLMs) to enhance the semantic search experience. You can replace the encoder model with an LLM capable of generating multilingual embeddings while still using the OpenSearch Service to store and index data and perform vector search.

Content Repository for Unstructured Data with Multilingual Semantic Search: Part 1

Post Syndicated from Patrik Nagel original https://aws.amazon.com/blogs/architecture/content-repository-for-unstructured-data-with-multilingual-semantic-search-part-1/

Unstructured data can make up as much as 80 percent of the data in the day-to-day business of financial organizations. For example, these organizations typically store and read PDFs and images for claim processing, underwriting, and know your customer (KYC) requirements. Organizations need to make this ingested data accessible and searchable across different entities while logically separating data access according to role requirements.

In this two-part series, we use AWS services to build an end-to-end content repository for storing and processing unstructured data with the following features:

  • Dynamic access control-based logic over unstructured data
  • Multilingual semantic search capabilities

In part 1, we build the architectural foundation for the content repository, including the resource access control logic and a web UI to upload and list documents.

Solution overview

The content repository includes four building blocks:

Frontend and interaction: For this function, we use AWS Amplify, which is a set of purpose-built tools and features to help frontend web and mobile developers quickly build full-stack applications on AWS. The React application uses the AWS Amplify authentication feature to quickly set up a complete authentication flow integrated into Amazon Cognito. Amplify also hosts the frontend application.

Authentication and authorization: Implementing dynamic resource access control with a combination of roles and attributes is fundamental to your content repository security. Amazon Cognito provides a managed, scalable user directory, user sign-up and sign-in flows, and federation capabilities through third-party identity providers. We use Amazon Cognito user pools as the source of user identity for the content repository. You can work with user pool groups to represent different types of user collections, and you can manage their permissions using a group-associated AWS Identity and Access Management (IAM) role.

Users authenticate against the Amazon Cognito user pool. The web app exchanges the user pool tokens for AWS credentials through an Amazon Cognito identity pool in the content repository. You can complement the IAM role-based authorization model by mapping your relevant attributes to principal tags that are evaluated as part of IAM permission policies. This allows a dynamic and flexible authorization strategy. For use cases that need federation with third-party identity providers, you can base your user collection on existing user group attributes, such as Active Directory group membership.
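
As a rough illustration of this exchange, the following sketch uses the Amazon Cognito identity pool APIs to trade a user pool ID token for temporary, role-scoped AWS credentials. The identity pool ID and user pool provider name are hypothetical placeholders.

import boto3

cognito_identity = boto3.client("cognito-identity", region_name="eu-central-1")

# Hypothetical IDs, for illustration only
IDENTITY_POOL_ID = "eu-central-1:00000000-0000-0000-0000-000000000000"
USER_POOL_PROVIDER = "cognito-idp.eu-central-1.amazonaws.com/eu-central-1_examplePool"

def credentials_for_id_token(id_token):
    """Exchange a Cognito user pool ID token for temporary, role-scoped AWS credentials."""
    logins = {USER_POOL_PROVIDER: id_token}
    identity_id = cognito_identity.get_id(
        IdentityPoolId=IDENTITY_POOL_ID, Logins=logins
    )["IdentityId"]
    # The identity pool assumes the role carried in the token (cognito:preferred_role)
    # and applies the principal-tag mappings configured for access control.
    return cognito_identity.get_credentials_for_identity(
        IdentityId=identity_id, Logins=logins
    )["Credentials"]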

Backend and business logic: Authenticated users are redirected to the Amazon API Gateway. API Gateway provides managed publishing for application programming interfaces (APIs) that act as the repository’s “front door.” API Gateway also interacts with the repository’s backend through RESTful APIs. This makes the business logic of the content repository extensible for future use cases, such as transcription and translation. We use AWS Lambda as a serverless, event-driven compute service to run specific business logic code, such as uploading a document to the content repository.

Content storage: Amazon Simple Storage Service (Amazon S3) provides virtually unlimited scalability and high durability. With Amazon S3, you can cost-effectively store unstructured documents in their native formats and make them accessible in a secure and scalable way. Enriching the uploaded documents with tags simplifies data governance with fine-grained access control.

Technical architecture

The technical architecture of the content repository with these four components can be found in Figure 1.

Figure 1. Technical architecture of the content repository

Let’s explore the architecture step by step.

  1. The frontend uses the Amplify JS library to add the authentication UI component to your React app, allowing authenticated users to sign in.
  2. Once the user provides their sign-in credentials, they are redirected to Amazon Cognito user pools to be authenticated.
  3. Once the authentication is successful, Amazon Cognito invokes a pre-token generation Lambda function. This function customizes the identity (ID) token with a new claim called department, derived from the Amazon Cognito group name behind the cognito:preferred_role claim (a minimal sketch of this trigger follows the list).
  4. Amazon Cognito returns the identity, access, and refresh token in JSON format to the frontend.
  5. The Amplify client library stores the tokens and handles refreshes using the refresh token while the React frontend application calls the API Gateway with the ID token. Note: Usually, you would use the access token to grant access to authorized resources. For this architecture, we use the ID token because we have enriched it with the custom claim during step 3.
  6. API Gateway uses its native integration with Amazon Cognito and validates the ID token’s signature and expiration using Amazon Cognito user pool authorizer. For more complex authorization scenarios, you can use API Gateway Lambda authorizer with the AWS JSON Web Token (JWT) Verify library for verifying JWTs signed by Amazon Cognito.
  7. After successful validation, API Gateway passes the ID token to the backend Lambda function, which can verify the token and use it to authorize access.
  8. Upon document upload action, the backend Lambda function calls the Amazon Cognito identity pool to exchange the ID token for the temporary AWS credentials associated with the cognito:preferred_role claim.
  9. The document upload Lambda function returns a pre-signed URL with the custom department claim in the Amazon S3 path prefix as well as the object tag. The Amazon S3 pre-signed URL is used for the document upload from the frontend application directly to Amazon S3.
  10. Upon document list action, similar to step 8, the backend Lambda function exchanges the ID token for the temporary AWS credentials. The Lambda function returns only the documents based on the user’s preferred group and associated custom department claim.
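
The following is a minimal sketch of what the pre-token generation trigger from step 3 might look like. The mapping from a Cognito group name to the department value is an assumption made for illustration; the actual function derives the claim from the group behind the user's preferred role.

def lambda_handler(event, context):
    """Pre-token generation trigger: add a custom 'department' claim to the ID token."""
    # Assumption for illustration: the user's first Cognito group name doubles as
    # the department (for example, 'sales' or 'marketing').
    groups = event["request"]["groupConfiguration"].get("groupsToOverride", [])
    department = groups[0] if groups else "unassigned"

    event["response"]["claimsOverrideDetails"] = {
        "claimsToAddOrOverride": {"department": department}
    }
    return event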

Prerequisites

You must have the following prerequisites for this solution:

Walkthrough

Setup

The following steps will deploy two AWS CDK stacks into your AWS account:

  • content-repo-stack (blog-content-repo-stack.ts) creates the environment detailed in Figure 1.
  • demo-data-stack (userpool-demo-data-stack.ts) deploys sample users, groups, and role mappings.

To continue setup, use the following commands:

  1. Clone the project git repository:
    git clone https://github.com/aws-samples/content-repository-with-dynamic-access-control content-repository
  2. Install the necessary dependencies:
    cd content-repository/backend-cdk
    npm install
    
  3. Configure environment variables:
    export CDK_DEFAULT_ACCOUNT=$(aws sts get-caller-identity --query 'Account' --output text)
    export CDK_DEFAULT_REGION=$(aws configure get region)
    
  4. Bootstrap your account for AWS CDK usage:
    cdk bootstrap aws://$CDK_DEFAULT_ACCOUNT/$CDK_DEFAULT_REGION
  5. Deploy the code to your AWS account:
    cdk deploy --all

Using the repository

Once you deploy the CDK stacks in your AWS account, follow these steps:

1. Access the frontend application:

a. Copy the amplifyHostedAppUrl value shown in the AWS CDK output from the content-repo-stack.

b. Use the URL with your web browser to access the frontend application.

c. A temporary page displays until the automated build and deployment of the React application completes after 4-5 minutes.

2. Application sign-in and role-based access control (RBAC):

a. The React webpage prompts you to sign in and then change the temporary password.

b. The content repository provides two demo users with credentials as part of the demo-data-stack in the AWS CDK output. In this walkthrough, we use the sales-user user, which belongs to the sales department group, to validate RBAC.

3. Upload a document to the content repository:

a. Authenticate as sales-user.

b. Select upload to upload your first document to the content repository.

c. The repository provides sample documents in the assets sub-folder.

4. List your uploaded document:

a. Select list to show the uploaded sales content.

b. To verify the dynamic access control, repeat steps 2 and 3 for the marketing-user user, which belongs to the marketing department group.

c. Sign in to the AWS Management Console and navigate to the Amazon S3 bucket with the prefix content-repo-stack-s3sourcebucket to confirm that all the uploaded content exists.

Implementation notes

Frontend deployment and cross-origin access

The content-repo-stack contains an AwsCustomResource construct. This construct uses the Amplify API to start the release job of the Amplify hosted frontend application. The preBuild step of the Amplify application build specification dynamically configures its backend for the Amazon Cognito-based authentication. The required Amazon Cognito configuration parameters are retrieved from the AWS Systems Manager Parameter Store during build time. Similarly, the Amplify application postBuild step updates the Amazon S3 cross-origin resource sharing (CORS) rule for the Amazon S3 bucket to only allow cross-origin access from the Amplify-hosted URL of the frontend application.

Application sign-in and access control

The Amazon Cognito identity pool configuration is set to Choose role from token for authenticated users, as in Figure 2. This setup permits authenticated users to pass the roles in the ID token that the Amazon Cognito user pool assigned. Backend Lambda functions use the roles that appear in the cognito:roles and cognito:preferred_role claims in the ID token for RBAC.

Figure 2. Amazon Cognito identity pool configuration – using tokens to assign roles to authenticated users

In the Attributes for access control section, we configured a custom mapping from the augmented department token claim to a tag key, as in Figure 3. The backend logic uses the tag key to match the PrincipalTag condition in IAM policies to control access to AWS resources.

Figure 3. Amazon Cognito identity pool configuration – custom mapping from claim names to tag keys

Document upload

The presigned_url.py Lambda function generates a pre-signed Amazon S3 URL using the department token claim as the key. This function automatically organizes the uploaded document into a logical structure in the Amazon S3 source bucket. Accordingly, the cognito:preferred_role used for the Amazon S3 client credentials in the Lambda function has a permission policy using the PrincipalTag department to dynamically limit access to the Amazon S3 key, as in Figure 4.

Figure 4. Permission policy using PrincipalTag to upload documents to Amazon S3
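
A minimal sketch along the lines of presigned_url.py follows: it signs a PUT URL whose key prefix and object tag both carry the department claim. Function and parameter names are illustrative, and the credentials are the temporary, role-scoped credentials obtained through the identity pool, so the PrincipalTag-based policy in Figure 4 applies when the URL is used.

import boto3

def generate_upload_url(credentials, bucket, department, file_name, expires=300):
    """Pre-signed PUT URL that places the object under the department prefix and tags it."""
    s3 = boto3.client(
        "s3",
        aws_access_key_id=credentials["AccessKeyId"],
        aws_secret_access_key=credentials["SecretKey"],
        aws_session_token=credentials["SessionToken"],
    )
    # Note: the uploading client must send a matching x-amz-tagging header
    # for the signed Tagging parameter to be accepted.
    return s3.generate_presigned_url(
        "put_object",
        Params={
            "Bucket": bucket,
            "Key": f"{department}/{file_name}",     # department claim as the S3 prefix
            "Tagging": f"department={department}",  # department claim as the object tag
        },
        ExpiresIn=expires,
    )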

Document listing

The list functionality only shows the uploaded content belonging to the preferred group of the authenticated Amazon Cognito user pool user. To only list the files that a specific user (for example, sales-user) has access to, use the s3:prefix condition together with the principal tag, as in Figure 5.

Figure 5. Permission policy using s3:prefix condition with session tags to list documents
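
For the list path, a sketch of the corresponding Lambda logic could look as follows: listing is restricted to the caller's department prefix, which is exactly what the s3:prefix condition in Figure 5 permits. Names are illustrative.

import boto3

def list_department_documents(credentials, bucket, department):
    """List only the objects under the caller's department prefix."""
    s3 = boto3.client(
        "s3",
        aws_access_key_id=credentials["AccessKeyId"],
        aws_secret_access_key=credentials["SecretKey"],
        aws_session_token=credentials["SessionToken"],
    )
    # Listing any other prefix would be denied by the s3:prefix condition
    # attached to the group's IAM role.
    response = s3.list_objects_v2(Bucket=bucket, Prefix=f"{department}/")
    return [obj["Key"] for obj in response.get("Contents", [])]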

Cleanup

In the backend-cdk subdirectory, delete the deployed resources:

cdk destroy --all

Conclusion

In this blog, we demonstrated how to build a content repository with an easy-to-use web application for unstructured data that ingests documents while maintaining dynamic access control for users within departments. These steps provide a foundation to build your own content repository to store and process documents. As next steps, based on your organization’s security requirements, you can implement more complex access control use cases by combining IAM roles and principal tags. For example, you can use Amazon Cognito user pool custom attributes for additional dimensions, such as a document “clearance” level, with an optional modification in the pre-token generation Lambda function.

In the next part of this blog series, we will enrich the content repository with multi-lingual semantic search features while maintaining the access control fundamentals we’ve already implemented. For additional information on how you can build a solution to search for information across multiple scanned documents, PDFs, and images with compliance capabilities, please refer our Document Understanding Solution from AWS Solutions Library.