I wrote a small process that uses Document Intelligence to convert an already-downloaded HTML file to Markdown, so here are my notes.
To run Document Intelligence, first install the Document Intelligence library:
pip install azure-ai-documentintelligence
Then run the following code. The key point is to read the HTML file in binary mode and pass the bytes to the body parameter.
# coding: utf-8

# -------------------------------------------------------------------------
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT License. See License.txt in the project root for
# license information.
# --------------------------------------------------------------------------

"""
FILE: sample_analyze_documents_output_in_markdown.py

DESCRIPTION:
    This sample demonstrates how to analyze a document in markdown output format.

USAGE:
    python sample_analyze_documents_output_in_markdown.py

    Set the environment variables with your own values before running the sample:
    1) DOCUMENTINTELLIGENCE_ENDPOINT - the endpoint to your Document Intelligence resource.
    2) DOCUMENTINTELLIGENCE_API_KEY - your Document Intelligence API key.
"""

import os


def analyze_documents_output_in_markdown():
    # [START analyze_documents_output_in_markdown]
    from azure.core.credentials import AzureKeyCredential
    from azure.ai.documentintelligence import DocumentIntelligenceClient
    from azure.ai.documentintelligence.models import AnalyzeDocumentRequest, DocumentContentFormat, AnalyzeResult

    endpoint = os.environ["DOCUMENTINTELLIGENCE_ENDPOINT"]
    key = os.environ["DOCUMENTINTELLIGENCE_API_KEY"]

    document_intelligence_client = DocumentIntelligenceClient(endpoint=endpoint, credential=AzureKeyCredential(key))

    # Read the HTML file as raw bytes ("rb"), not as decoded text.
    with open("<name of your HTML file>", "rb") as f:
        html_bytes = f.read()

    # Pass the bytes via the body parameter and request Markdown output.
    poller = document_intelligence_client.begin_analyze_document(
        "prebuilt-layout",
        body=AnalyzeDocumentRequest(bytes_source=html_bytes),
        output_content_format=DocumentContentFormat.MARKDOWN,
    )
    result: AnalyzeResult = poller.result()

    print(f"Here's the full content in format {result.content_format}:\n")
    print(result.content)
    # [END analyze_documents_output_in_markdown]
    return result


result = analyze_documents_output_in_markdown()
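The script above only prints the Markdown. As a rough sketch of how the same "read as bytes, then convert" step might be wrapped so the output is saved next to the source file, something like the following could work. The helper names (`read_html_bytes`, `save_markdown`, `convert_html_file`) and the `analyze` callable are hypothetical and not part of the Azure SDK; `analyze` is assumed to be any function that takes the HTML bytes and returns the Markdown string.

```python
from pathlib import Path
from typing import Callable


def read_html_bytes(html_path: str) -> bytes:
    """Read the HTML file in binary mode, without decoding it (hypothetical helper)."""
    return Path(html_path).read_bytes()


def save_markdown(markdown: str, html_path: str) -> Path:
    """Write the Markdown next to the source HTML, swapping the suffix for .md."""
    out_path = Path(html_path).with_suffix(".md")
    out_path.write_text(markdown, encoding="utf-8")
    return out_path


def convert_html_file(html_path: str, analyze: Callable[[bytes], str]) -> Path:
    """Convert one HTML file to Markdown.

    `analyze` is an assumption: a callable that receives the raw HTML bytes
    and returns Markdown text, for example a thin wrapper around the
    Document Intelligence call shown in the script above.
    """
    html_bytes = read_html_bytes(html_path)
    markdown = analyze(html_bytes)
    return save_markdown(markdown, html_path)
```

With the client from the script above, `analyze` could presumably be a small function that calls `begin_analyze_document` with `AnalyzeDocumentRequest(bytes_source=html_bytes)` and returns `poller.result().content`. Keeping the service call behind a callable also makes it possible to try the file handling without an Azure resource.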