A Blog Where an SE Writes About Recent Happenings

An IT engineer writing about things I have tried and things that caught my attention.

Converting a Downloaded HTML File to Markdown with Document Intelligence

Data Analysis LLM

I put together a small process that uses Document Intelligence to convert an already-downloaded HTML file to Markdown, so here are my notes.

To run Document Intelligence, install the Document Intelligence library:

pip install azure-ai-documentintelligence
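
The sample in this post reads the endpoint and API key from the environment variables DOCUMENTINTELLIGENCE_ENDPOINT and DOCUMENTINTELLIGENCE_API_KEY. As a minimal sketch, a small pre-flight check like the one below can report which of them are missing before any request is made. The helper name missing_env_vars is hypothetical and is not part of the Azure SDK.

```python
import os
from typing import List, Mapping, Sequence

# Variable names used by the sample in this post.
REQUIRED_VARS = ("DOCUMENTINTELLIGENCE_ENDPOINT", "DOCUMENTINTELLIGENCE_API_KEY")


def missing_env_vars(
    names: Sequence[str] = REQUIRED_VARS,
    environ: Mapping[str, str] = os.environ,
) -> List[str]:
    """Return the names that are unset or empty in the given environment (hypothetical helper)."""
    return [name for name in names if not environ.get(name)]


if __name__ == "__main__":
    missing = missing_env_vars()
    if missing:
        print("Please set: " + ", ".join(missing))
    else:
        print("All required environment variables are set.")
```

Passing a plain dict as environ makes the check easy to try without touching the real environment.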

Run the code below. The key point is to read the HTML file in binary mode and pass the bytes to the body parameter.

# coding: utf-8

# -------------------------------------------------------------------------
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT License. See License.txt in the project root for
# license information.
# --------------------------------------------------------------------------

"""
FILE: sample_analyze_documents_output_in_markdown.py

DESCRIPTION:
    This sample demonstrates how to analyze a document in Markdown output format.

USAGE:
    python sample_analyze_documents_output_in_markdown.py

    Set the environment variables with your own values before running the sample:
    1) DOCUMENTINTELLIGENCE_ENDPOINT - the endpoint to your Document Intelligence resource.
    2) DOCUMENTINTELLIGENCE_API_KEY - your Document Intelligence API key.
"""

import os

def analyze_documents_output_in_markdown():
    # [START analyze_documents_output_in_markdown]
    from azure.core.credentials import AzureKeyCredential
    from azure.ai.documentintelligence import DocumentIntelligenceClient
    from azure.ai.documentintelligence.models import AnalyzeDocumentRequest, DocumentContentFormat, AnalyzeResult

    endpoint = os.environ["DOCUMENTINTELLIGENCE_ENDPOINT"]
    key = os.environ["DOCUMENTINTELLIGENCE_API_KEY"]
    document_intelligence_client = DocumentIntelligenceClient(endpoint=endpoint, credential=AzureKeyCredential(key))
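    # Read the HTML file in binary mode; replace the placeholder with your own file name.
    with open("<name of your HTML file>", "rb") as f:
        html_bytes = f.read()

    # Pass the raw bytes via the body parameter and request Markdown output.
    poller = document_intelligence_client.begin_analyze_document(
        "prebuilt-layout",
        body=AnalyzeDocumentRequest(bytes_source=html_bytes),
        output_content_format=DocumentContentFormat.MARKDOWN,
    )
    result: AnalyzeResult = poller.result()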

    print(f"Here's the full content in format {result.content_format}:\n")
    print(result.content)
    # [END analyze_documents_output_in_markdown]
    return result

result = analyze_documents_output_in_markdown()
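
The sample only prints the converted Markdown. If you want to keep it, a minimal sketch along these lines writes the text to a .md file next to the source HTML. The helper name save_markdown is hypothetical, and it assumes the text passed in is the string taken from result.content.

```python
from pathlib import Path


def save_markdown(markdown_text: str, html_path: str) -> Path:
    """Write markdown_text next to html_path, swapping the extension for .md (hypothetical helper)."""
    md_path = Path(html_path).with_suffix(".md")
    # UTF-8 keeps non-ASCII text (for example Japanese) intact.
    md_path.write_text(markdown_text, encoding="utf-8")
    return md_path
```

For example, calling save_markdown(result.content, "page.html") would write the output to page.md in the same directory, where page.html stands in for whatever file name you used.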

Reference URLs