1. Text-and-image all-in-one chat function via ChatGPT4o

2. Many experiments for extracting data in two ways:
page text or page image.
Blade He 2024-08-26 17:17:39 -05:00
parent 843f588015
commit 15720d8bfd
7 changed files with 216 additions and 18 deletions


@ -0,0 +1,47 @@
Instructions:
Please read the image carefully.
1. Identify the text in the PDF page image.
Output the text under the key "text".
2. Identify and format all of the tables in the PDF page image.
Output the table contents in markdown format,
ensuring the table structure and contents are exactly as in the PDF page image.
The format should be: |Column1|Column2|\n|---|---|\n|Row1Col1|Row1Col2|
Each cell in the table(s) should be in the correct position of its row and column.
Output the markdown table(s) under the key "table_contents".
3. Extract data from the parsed text and table contents above.
3.1 Use the parsed text and table contents above as the context.
3.2 Data extraction from the parsed table contents
The parsed table(s) may contain TER and performance fees data.
The TER may be reported as: Total Expense Ratio, TER, Annualised TER including performance fees, etc.
The performance fees may be reported as: performance fees, performance fees ratio, etc.
If both "TER including performance fees" and "TER excluding performance fees" are present, the performance fees should be:
TER including performance fees - TER excluding performance fees.
TER and performance fees values are percentages, so each value should be less than 100.
In most cases, the data is in the table(s) of the context.
3.3 Fund name / share class name extraction from the context above
Please extract the fund name and share class name from the context.
If the fund name or share class name cannot be found in the table contents,
please try to find them in the parsed text contents.
3.4 Output
If possible, please extract the fund name, share class name, and TER or performance fees values as the output.
One fund may have multiple share classes, each with its own TER or performance fees values.
The output should be in JSON format, like:
{
"text": "text from image",
"table_contents": ["|Column1|Column2|\n|---|---|\n|Row1Col1|Row1Col2|"],
"data":
[{
"fund name": "fund 1",
"share data": [{"share name": "share 1", "ter": 1.23, "performance fees": 0.2},{"share name": "share 2", "ter": 2.56, "performance fees": 1.2}]
},
{
"fund name": "fund 2",
"share data": [{"share name": "share a", "ter": 1.16, "performance fees": 0.5},{"share name": "share b", "ter": 1.45, "performance fees": 1.1}]
}]
}
Only output JSON data.
If no share class name can be found in the context, please output empty JSON data: []
Answer:
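
The chat wrapper in utils.gpt_utils is not part of this diff, so the exact call is not shown here. As a rough, self-contained sketch of how a page image plus the instructions above could be sent to GPT-4o, using the public OpenAI Python SDK directly (the model name, image format, and JSON-repair fallback are assumptions mirroring the rest of this commit):

import base64
import json

import json_repair  # same fallback library this commit uses in chat_with_image
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def extract_page(image_path: str, instructions_path: str) -> dict:
    """Send one page image plus the instructions above to GPT-4o and parse the JSON reply."""
    with open(instructions_path, "r", encoding="utf-8") as f:
        instructions = f.read()
    with open(image_path, "rb") as f:
        b64_image = base64.b64encode(f.read()).decode("utf-8")

    response = client.chat.completions.create(
        model="gpt-4o",  # assumed model name
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": instructions},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64_image}"}},
            ],
        }],
    )
    raw = response.choices[0].message.content
    try:
        return json.loads(raw)
    except (json.JSONDecodeError, ValueError):
        return json_repair.loads(raw)  # repair slightly malformed JSON output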


@ -0,0 +1,29 @@
Context:
{page_text}
Read the context carefully.
The context may contain TER and performance fees data. The TER may be reported as:
Total Expense Ratio, TER, Annualised TER including performance fees, etc.
The performance fees may be reported as:
performance fees, performance fees ratio, etc.
If both "TER including performance fees" and "TER excluding performance fees" are present, the performance fees should be:
TER including performance fees - TER excluding performance fees.
TER and performance fees values are percentages, so each value should be less than 100.
In most cases, the data is in the table(s) of the context.
If there are multiple TER / performance fees values in the same row, please extract the latest one.
If possible, please extract the fund name, share class name, and TER or performance fees values as the output.
One fund may have multiple share classes, each with its own TER values.
The output should be in JSON format, like:
[{
"fund name": "fund 1",
"share data": [{"share name": "share 1", "ter": 1.23, "performance fees": 0.2},{"share name": "share 2", "ter": 2.56, "performance fees": 1.2}]
},
{
"fund name": "fund 2",
"share data": [{"share name": "share a", "ter": 1.16, "performance fees": 0.5},{"share name": "share b", "ter": 1.45, "performance fees": 1.1}]
}]
Only output JSON data.
If no share class name can be found in the context, please output empty JSON data: []
Answer:
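
Both prompts return the same nested fund / share-class JSON shape. A short sketch of how that shape could be flattened into one record per share class, plus the performance-fees derivation stated above (key names follow the example output in the prompt; the helper names are illustrative only):

from typing import Any


def flatten_funds(funds: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Turn the nested fund/share-class JSON into one flat record per share class."""
    rows = []
    for fund in funds:
        for share in fund.get("share data", []):
            rows.append({
                "fund_name": fund.get("fund name"),
                "share_class": share.get("share name"),
                "ter": share.get("ter"),
                "performance_fees": share.get("performance fees"),
            })
    return rows


def derive_performance_fees(ter_including: float, ter_excluding: float) -> float:
    """Apply the rule above: performance fees = TER including - TER excluding."""
    return round(ter_including - ter_excluding, 4)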


@ -0,0 +1,35 @@
Smith is a professional who processes financial reports.
He wants to extract table(s) from a PDF and output them in markdown format.
He decides to ask ChatGPT4o to help him with this.
Smith's prompt is as below:
--------------------------------------Smith's prompts start--------------------------------------
Instructions:
Please read the image carefully.
Answer below questions:
1. Please find the table or tables in the image.
2. Output the table contents in markdown format, for example:
|name|age|hobby|
|Annie|18|music|
The contents should exactly match the image contents.
3. Please output the results in JSON format; each result member should be a valid markdown table, for example:
{
"tables": ["
|name|age|hobby|
|Annie|18|music|
"]
}
4. Only output JSON with tables
Here is the answer from ChatGPT4o:
--------------------------------------ChatGPT4o start--------------------------------------
|Share Class|TER for the year (Note 6)|\n|---|---|\n|AI - Shares| |\n|BF - Shares| |\n|BI - Shares| |\n|BP - Shares| |\n|E - Shares|0.30%|\n|HAF - SEK Shares|0.84%|\n|HAI - SEK Shares|1.59%|\n|HB - EUR Shares| |\n|HB - SEK Shares| |\n|HBC - EUR Shares|0.65%|\n|HBF - EUR Shares| |\n|HBF - NOK Shares| |\n|HBF - SEK Shares| |\n|HBI - DKK Shares| |\n|HBI - EUR Shares| |\n|HBI - NOK Shares| |\n|HBI - SEK Shares| |\n|HY - DKK Shares| |\n|HY - EUR Shares| |\n|HY - SEK Shares| |\n|LE - Shares| |\n|LP - Shares| |\n|X - Shares| |\n|Y - Shares|0.09%|
--------------------------------------ChatGPT4o end--------------------------------------
But this answer is incorrect; the correct answer is as below:
--------------------------------------correct answer start--------------------------------------
|Share Class|TER for the year (Note 6)|\n|---|---|\n|AI - Shares| |\n|BF - Shares| |\n|BI - Shares|0.30%|\n|BP - Shares|0.84%|\n|E - Shares|1.59%|\n|HAF - SEK Shares| |\n|HAI - SEK Shares| |\n|HB - EUR Shares| |\n|HB - SEK Shares| |\n|HBC - EUR Shares|0.65%|\n|HBF - EUR Shares| |\n|HBF - NOK Shares| |\n|HBF - SEK Shares| |\n|HBI - DKK Shares| |\n|HBI - EUR Shares| |\n|HBI - NOK Shares| |\n|HBI - SEK Shares| |\n|HY - DKK Shares| |\n|HY - EUR Shares| |\n|HY - SEK Shares| |\n|LE - Shares| |\n|LP - Shares| |\n|X - Shares| |\n|Y - Shares|0.09%|
--------------------------------------correct answer end--------------------------------------
Please analyze the image, the incorrect answer, and the correct answer, then help Mr. Smith optimize the instructions and output them in JSON format: {"Instructions": "optimized instructions"}
Answer:
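
For reference, the incorrect answer above shifts the TER values up by several rows relative to the correct table. A dependency-free sketch that compares two such markdown strings cell by cell, which is one way to check whether an optimized instruction set actually fixes the alignment (function names are illustrative):

def parse_markdown_table(table: str) -> list[list[str]]:
    """Split a '|a|b|\\n|---|---|\\n...' string into rows of cells, dropping the separator row."""
    rows = []
    for line in table.strip().split("\n"):
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        if cells and all(set(c) <= set("-: ") for c in cells):
            continue  # skip separator rows such as |---|---|
        rows.append(cells)
    return rows


def diff_tables(got: str, expected: str) -> list[tuple[int, list[str], list[str]]]:
    """Return (row_index, got_row, expected_row) for every row that does not match."""
    got_rows, exp_rows = parse_markdown_table(got), parse_markdown_table(expected)
    mismatches = []
    for i in range(max(len(got_rows), len(exp_rows))):
        g = got_rows[i] if i < len(got_rows) else []
        e = exp_rows[i] if i < len(exp_rows) else []
        if g != e:
            mismatches.append((i, g, e))
    return mismatches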


@ -0,0 +1,11 @@
Instructions:
Please read the image carefully.
Answer the following questions:
1. Identify the table or tables in the image.
2. Output the table contents in markdown format, ensuring the table structure and contents are exactly as in the image.
The format should be: |Column1|Column2|\n|---|---|\n|Row1Col1|Row1Col2|
3. Output the results in JSON format with the key 'tables' containing the markdown table(s).
The format should be:
{"tables": ["|Column1|Column2|\n|---|---|\n|Row1Col1|Row1Col2|"]}
4. Only output JSON with tables.
Answer:
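
A brief sketch of how the {"tables": [...]} response expected by these instructions could be validated before further parsing, reusing the json/json_repair fallback this commit already applies in chat_with_image (the helper name is illustrative):

import json

import json_repair


def load_tables(response: str) -> list[str]:
    """Parse the model reply and keep only entries that look like markdown tables."""
    try:
        payload = json.loads(response)
    except (json.JSONDecodeError, ValueError):
        payload = json_repair.loads(response)
    tables = payload.get("tables", []) if isinstance(payload, dict) else []
    return [t for t in tables if isinstance(t, str) and "|---" in t]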


@ -0,0 +1,11 @@
Instructions:
Please read the image carefully.
Answer the following questions:
1. Identify the text contents in the image.
2. Output the text contents, ensuring the contents are exactly as in the image.
The order should be exactly the same as the sequence in the image.
3. Output the results in JSON format with the key 'text' containing the text contents.
The format should be:
{"text": "image contents text"}
4. Only output JSON with text.
Answer:
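
One way to decide between the two extraction routes mentioned in the commit message (page text vs. page image) is to compare the model's image-derived text against the PDF's embedded text layer. A sketch assuming PyMuPDF (fitz); the pdf_util backend actually used by this repo is not shown in the diff:

import difflib

import fitz  # PyMuPDF; an assumption, not necessarily the repo's pdf_util backend


def text_layer_similarity(pdf_file: str, page_index: int, gpt_text: str) -> float:
    """Return a 0..1 similarity between the page's embedded text layer and the model output."""
    with fitz.open(pdf_file) as doc:
        layer_text = doc[page_index].get_text()
    a = " ".join(layer_text.split())
    b = " ".join(gpt_text.split())
    return difflib.SequenceMatcher(None, a, b).ratio()

# A low ratio suggests a scanned page or a poor text layer, i.e. the image-based
# prompts above are likely the more reliable route for that page.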


@ -7,10 +7,9 @@ from utils.logger import logger
from utils.gpt_utils import chat
def get_base64_pdf_image_list(pdf_file: str,
pdf_page_index_list: list,
output_folder: str=None) -> dict:
def get_base64_pdf_image_list(
pdf_file: str, pdf_page_index_list: list, output_folder: str = None
) -> dict:
if pdf_file is None or pdf_file == "" or not os.path.exists(pdf_file):
logger.error("pdf_file is not provided")
return None
@ -19,8 +18,9 @@ def get_base64_pdf_image_list(pdf_file: str,
pdf_page_index_list = list(range(pdf_util.get_page_count()))
if output_folder is not None and len(output_folder) > 0:
os.makedirs(output_folder, exist_ok=True)
pdf_image_info = pdf_util.extract_images(pdf_page_index_list=pdf_page_index_list,
output_folder=output_folder)
pdf_image_info = pdf_util.extract_images(
pdf_page_index_list=pdf_page_index_list, output_folder=output_folder
)
return pdf_image_info
@ -31,15 +31,20 @@ def encode_image(image_path: str):
return base64.b64encode(image_file.read()).decode("utf-8")
def chat_with_image(pdf_file: str,
pdf_page_index_list: list,
image_folder: str,
gpt_folder: str):
def chat_with_image(
pdf_file: str,
pdf_page_index_list: list,
image_instructions_file: str,
image_folder: str,
gpt_folder: str,
):
if pdf_file is None or pdf_file == "" or not os.path.exists(pdf_file):
logger.error("pdf_file is not provided")
return None
pdf_image_info = get_base64_pdf_image_list(pdf_file, pdf_page_index_list, image_folder)
image_instructions_file = r'./instructions/table_extraction_image_prompts.txt'
pdf_image_info = get_base64_pdf_image_list(
pdf_file, pdf_page_index_list, image_folder
)
with open(image_instructions_file, "r", encoding="utf-8") as file:
image_instructions = file.read()
os.makedirs(gpt_folder, exist_ok=True)
@ -57,7 +62,9 @@ def chat_with_image(pdf_file: str,
response_json = json.loads(response)
except:
response_json = json_repair.loads(response)
response_json_file = os.path.join(gpt_folder, f"{pdf_base_name}_{page_index}.json")
response_json_file = os.path.join(
gpt_folder, f"{pdf_base_name}_{page_index}.json"
)
with open(response_json_file, "w", encoding="utf-8") as file:
json.dump(response_json, file, indent=4)
logger.info(f"Response for image in page {page_index}: {response}")
@ -65,8 +72,66 @@ def chat_with_image(pdf_file: str,
if __name__ == "__main__":
pdf_file = r"/data/emea_ar/small_pdf/382366116.pdf"
pdf_page_index_list = [29, 35, 71, 77, 83, 89, 97, 103, 112, 121, 130, 140, 195, 250, 305]
# Table extraction by image
# pdf_file = r"/data/emea_ar/small_pdf/382366116.pdf"
# pdf_page_index_list = [29, 35, 71, 77, 83, 89, 97, 103, 112, 121, 130, 140, 195, 250, 305]
# pdf_file = r"/data/emea_ar/small_pdf/389171486.pdf"
# pdf_page_index_list = [13]
# pdf_file = r"/data/emea_ar/small_pdf/402181770.pdf"
# pdf_page_index_list = [29]
# image_instructions_file = r"./instructions/table_extraction_image_prompts_v2.txt"
# image_output_folder = r"/data/emea_ar/small_pdf_image/"
# gpt_output_folder = r"/data/emea_ar/output/gpt_image_response/table/"
# chat_with_image(
# pdf_file,
# pdf_page_index_list,
# image_instructions_file,
# image_output_folder,
# gpt_output_folder,
# )
# Data extraction by image
# pdf_file = r"/data/emea_ar/small_pdf/402181770.pdf"
# pdf_page_index_list = [29]
pdf_file = r"/data/emea_ar/small_pdf/389171486.pdf"
pdf_page_index_list = [13]
image_output_folder = r"/data/emea_ar/small_pdf_image/"
gpt_output_folder = r"/data/emea_ar/output/gpt_image_response/"
chat_with_image(pdf_file, pdf_page_index_list, image_output_folder, gpt_output_folder)
gpt_output_folder = r"/data/emea_ar/output/gpt_image_response/data/"
image_instructions_file = r"./instructions/data_extraction_image_prompts.txt"
chat_with_image(
pdf_file,
pdf_page_index_list,
image_instructions_file,
image_output_folder,
gpt_output_folder,
)
# Text extraction by image
# pdf_file = r"/data/emea_ar/small_pdf/389171486.pdf"
# pdf_page_index_list = [13]
# image_instructions_file = r"./instructions/text_extraction_image_prompts.txt"
# image_output_folder = r"/data/emea_ar/small_pdf_image/"
# gpt_output_folder = r"/data/emea_ar/output/gpt_image_response/text/"
# chat_with_image(
# pdf_file,
# pdf_page_index_list,
# image_instructions_file,
# image_output_folder,
# gpt_output_folder,
# )
# pdf_file = r"/data/emea_ar/small_pdf/389171486.pdf"
# pdf_page_index_list = [13]
# image_instructions_file = r"./instructions/table_extraction_image_optimize_prompts.txt"
# image_output_folder = r"/data/emea_ar/small_pdf_image/"
# gpt_output_folder = r"/data/emea_ar/output/gpt_image_response/optimized_instructions/"
# chat_with_image(
# pdf_file,
# pdf_page_index_list,
# image_instructions_file,
# image_output_folder,
# gpt_output_folder,
# )