Ingestion Pipeline

The Ingestion Pipeline is the foundation of Health Grade Search (HGS), transforming raw client data into structured embeddings ready for vector search.
This section provides a detailed overview of how the pipeline works, the components it uses, and how to configure it to meet your needs. Whether you are ingesting provider information, clinical documents, or electronic medical records (EMRs), the pipeline offers flexibility and scalability.

Overview

The ingestion pipeline processes and converts data into a format optimized for vector-based search using Clinia’s embedding models. It consists of two main processors:

  • Segmenter (Optional): Splits large documents into smaller segments for efficient processing and retrieval.
  • Vectorizer (Required): Transforms text data into vectors using Clinia's proprietary AI models, enabling the data to be indexed and searched.

Pipeline Processors

The pipeline is built around two key processors: the Segmenter and the Vectorizer. Here’s how they work:

1. Segmenter

The Segmenter breaks down larger documents (e.g., articles, EMRs) into smaller, manageable segments, called "passages." This approach improves the accuracy and relevance of search results by enabling the vectorizer to focus on specific, context-rich sections of text.

Configuration Options:

  • Built-in Segmenter: Our default segmenter can be configured to suit your data needs, allowing you to define properties like:
    • inputProperty: The data field to segment (e.g., "abstract" or "full_text").
    • propertyKey: The key under which segmented passages are stored (e.g., "passages").
    • maxTokens: The maximum number of tokens per segment (e.g., 512).
  • Custom Segmenter: Clients can choose to integrate their own segmenter, especially if they have proprietary algorithms or intellectual property around segmentation.

Example Configuration:

{
  "steps": [
    {
      "type": "SEGMENTER",
      "segmenter": {
        "inputProperty": "abstract",
        "propertyKey": "passages",
        "maxTokens": 512
      }
    }
  ]
}
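Conceptually, the Segmenter splits the configured input property into passages of at most maxTokens tokens. The sketch below illustrates that behavior only; Clinia's built-in segmenter is proprietary, and the whitespace-based token counting here is a stand-in assumption.

```python
def segment(text: str, max_tokens: int = 512) -> list[str]:
    """Split text into passages of at most max_tokens whitespace tokens.

    Illustrative only: real tokenization is model-specific.
    """
    tokens = text.split()
    return [
        " ".join(tokens[i:i + max_tokens])
        for i in range(0, len(tokens), max_tokens)
    ]

# A 1200-token abstract yields three passages (512 + 512 + 176 tokens).
abstract = " ".join(["token"] * 1200)
passages = segment(abstract, max_tokens=512)
```

With the example configuration above, the resulting passages would be stored under the "passages" key alongside the original "abstract" property.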

2. Vectorizer

The Vectorizer converts text segments into vectors using Clinia's embedding models. These vectors are the core of HGS's AI-powered search, enabling the platform to identify and rank relevant results based on semantic similarity.

Configuration Options:

  • Model Selection:
    Choose from available Clinia models tailored to specific healthcare data types:
    • embedder_medical_journals_qa: Ideal for searching medical journal articles and clinical documentation.
    • embedder_providers (coming soon): Optimized for provider search.
    • embedder_emr (coming soon): Designed for querying EMR data.
  • Properties:
    • inputProperty: The text property from the segmenter output (e.g., "abstract.passages").
    • propertyKey: The key under which the vectorized output is stored (e.g., "vector").
    • modelID: The Clinia model used for vectorization (e.g., "embedder_medical_journals_qa").

Example Configuration:

{
  "steps": [
    {
      "type": "VECTORIZER",
      "vectorizer": {
        "inputProperty": "abstract.passages",
        "propertyKey": "vector",
        "modelID": "embedder_medical_journals_qa"
      }
    }
  ]
}
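Once passages are vectorized, search reduces to comparing the query's vector against the stored passage vectors and ranking by similarity. The toy example below shows the idea with cosine similarity; the 3-dimensional vectors are made-up stand-ins, as real embeddings come from Clinia's models and have far more dimensions.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

query_vector = [0.9, 0.1, 0.2]
passage_vectors = {
    "passage_a": [0.8, 0.2, 0.1],  # semantically close to the query
    "passage_b": [0.1, 0.9, 0.7],  # semantically distant
}

# Rank passages by similarity to the query, most similar first.
ranked = sorted(
    passage_vectors,
    key=lambda k: cosine(query_vector, passage_vectors[k]),
    reverse=True,
)
# ranked[0] == "passage_a"
```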

Setting Up the Ingestion Pipeline

To set up and configure the ingestion pipeline, follow these steps:

1. Define Your Data Structure

Start from a previously defined Profile; for example:

{
  "type": "ROOT",
  "properties": {
    "title": {
      "type": "symbol"
    },
    "abstract": {
      "type": "markdown"
    }
  }
}
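A record conforming to this Profile might look like the following (the field values are illustrative only):

```json
{
  "title": "Hypertension management in primary care",
  "abstract": "Recent guidelines recommend lifestyle modification as first-line therapy..."
}
```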

2. Configure the Ingestion Pipeline

The pipeline configuration is a JSON object specifying the sequence of processing steps. A simple configuration example looks like this:

{
  "steps": [
    {
      "type": "SEGMENTER",
      "segmenter": {
        "inputProperty": "abstract",
        "propertyKey": "passages",
        "maxTokens": 512
      }
    },
    {
      "type": "VECTORIZER",
      "vectorizer": {
        "inputProperty": "abstract.passages",
        "propertyKey": "vector",
        "modelID": "embedder_medical_journals_qa"
      }
    }
  ]
}

This configuration instructs the pipeline to segment the abstract field into passages and then vectorize those passages using the specified model.
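The sketch below traces a record through both steps to show the resulting shape. The nesting (passages stored under the segmenter's propertyKey, a vector attached per passage) is an assumption inferred from the inputProperty/propertyKey names in the configuration above, and fake_embed is a stand-in for embedder_medical_journals_qa.

```python
def fake_embed(text: str) -> list[float]:
    # Stand-in for the Clinia embedding model; returns a dummy vector.
    return [float(len(text)), 0.0, 0.0]

record = {"title": "Example article", "abstract": "First passage. Second passage."}

# Step 1 - SEGMENTER (naive sentence split as a stand-in for real segmentation):
passages = [p.strip() + "." for p in record["abstract"].split(".") if p.strip()]

# Step 2 - VECTORIZER: attach a vector to each passage under "vector".
record["abstract"] = {
    "passages": [{"text": p, "vector": fake_embed(p)} for p in passages]
}
```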

Once configured, the pipeline runs the ingestion automatically whenever data is imported into the targeted data source, making the processed properties searchable through Partitions.