update instructions fund name section structure

Blade He 2025-03-28 00:51:51 -05:00
parent 8a5723c150
commit 46f86b124b
3 changed files with 95 additions and 274 deletions


@@ -17,8 +17,8 @@
"data_business_features": { "data_business_features": {
"common": [ "common": [
"## General rules", "## General rules",
"- 1. The data is in the context, perhaps in table(s), semi-table(s) or paragraphs.", "1. The data is in the context, perhaps in table(s), semi-table(s) or paragraphs.",
"- 2. Fund name: ", "2. Fund name: ",
"a. The full fund name should be main fund name + sub-fund name, e,g, main fund name is Black Rock European, sub-fund name is Growth, the full fund name is: Black Rock European Growth.", "a. The full fund name should be main fund name + sub-fund name, e,g, main fund name is Black Rock European, sub-fund name is Growth, the full fund name is: Black Rock European Growth.",
"b. The sub-fund name may be as the first column or first row values in the table.", "b. The sub-fund name may be as the first column or first row values in the table.",
"b.1 fund name example:", "b.1 fund name example:",
@@ -67,12 +67,12 @@
"---Example 3 End---", "---Example 3 End---",
"Although exist \"Retirement account\" and \"Transition to Retirement account\", but the investment option is not exist, so fund name and share name should be: \"Rest Pension\".", "Although exist \"Retirement account\" and \"Transition to Retirement account\", but the investment option is not exist, so fund name and share name should be: \"Rest Pension\".",
"\n", "\n",
"- 3. Only extract the latest data from context:", "3. Only extract the latest data from context:",
"If with multiple data values in same row, please extract the latest.", "If with multiple data values in same row, please extract the latest.",
"\n", "\n",
"- 4. Reported names:", "4. Reported names:",
"Only output the values which with significant reported names.", "Only output the values which with significant reported names.",
"- Multiple data columns with same reported name but different post-fix:", "Multiple data columns with same reported name but different post-fix:",
"If there are multiple reported names with different post-fix text, here is the priority rule:", "If there are multiple reported names with different post-fix text, here is the priority rule:",
"The pos-fix text is in the brackets: (gross), (net), pick up the values from (net).", "The pos-fix text is in the brackets: (gross), (net), pick up the values from (net).",
"---Example Start---", "---Example Start---",
@@ -80,14 +80,14 @@
"---Example End---", "---Example End---",
"The output should be:", "The output should be:",
"{\"data\": [{\"fund name\": \"Allan Gray Australian Equity Fund\", \"share name\": \"Class A\", \"management_fee_and_costs\": 1.19, \"management_fee\": 0.77, \"administration_fees\": 0.42}]}", "{\"data\": [{\"fund name\": \"Allan Gray Australian Equity Fund\", \"share name\": \"Class A\", \"management_fee_and_costs\": 1.19, \"management_fee\": 0.77, \"administration_fees\": 0.42}]}",
"- 5. Please ignore these words as fund names, it means never extract these words as fund names. They are:", "5. Please ignore these words as fund names, it means never extract these words as fund names. They are:",
"\"Ready-made portfolios\", \"Simple choice\", \"Build-your-own portfolio\".", "\"Ready-made portfolios\", \"Simple choice\", \"Build-your-own portfolio\".",
"- 6. Identify the value of data point and if it is written 0% or 0.00% or 0 or 0.00 then extract the same as 0 do not assume null for the same and return its values as 0", "6. Identify the value of data point and if it is written 0% or 0.00% or 0 or 0.00 then extract the same as 0 do not assume null for the same and return its values as 0",
"---Example Start---", "---Example Start---",
"Retirement account \n\nInvestment option \n(A) Investment fees \nand costs (including \n(B) performance \nfees) (pa)* \n(B) Performance \nfees (pa) \n# \n(C) Transaction \ncosts (pa)*^ \n(A) + (C) Total \ninvestment cost \n(pa) \nBalanced Indexed 0.00% 0.00% 0.00% 0.00%\n", "Retirement account \n\nInvestment option \n(A) Investment fees \nand costs (including \n(B) performance \nfees) (pa)* \n(B) Performance \nfees (pa) \n# \n(C) Transaction \ncosts (pa)*^ \n(A) + (C) Total \ninvestment cost \n(pa) \nBalanced Indexed 0.00% 0.00% 0.00% 0.00%\n",
"---Example End---", "---Example End---",
"For this example, as \"Investment fees and costs (including (B) performance fees)\" and \"Performance fees (pa)\" mentioned as 0.00% so return 0 as datapoint values.", "For this example, as \"Investment fees and costs (including (B) performance fees)\" and \"Performance fees (pa)\" mentioned as 0.00% so return 0 as datapoint values.",
"- 7. If for data point value specifically Nil is written in the value then return NULL('') for the same" "7. If for data point value specifically Nil is written in the value then return NULL('') for the same"
], ],
"investment_level": { "investment_level": {
"total_annual_dollar_based_charges": "Total annual dollar based charges is share level data.", "total_annual_dollar_based_charges": "Total annual dollar based charges is share level data.",
@@ -320,7 +320,8 @@
"FOUND \"Cost of product\", IGNORE ALL OF INFORMATION BELOW IT!!! JUST RETURN EMPTY RESPONSE!!!", "FOUND \"Cost of product\", IGNORE ALL OF INFORMATION BELOW IT!!! JUST RETURN EMPTY RESPONSE!!!",
"The output should be:", "The output should be:",
"{\"data\": []}", "{\"data\": []}",
"L. Do NOT infer or copy investment fees or management fees from examples provided for specific funds to other investment options. Only extract 'management_fee_and_costs' and 'management_fee' if explicitly stated separately for each investment option." "L. Do NOT infer or copy investment fees or management fees from examples provided for specific funds to other investment options. Only extract 'management_fee_and_costs' and 'management_fee' if explicitly stated separately for each investment option.",
"M. Identify the value of management fee and costs, and if it is written 0% or 0.00% or 0 or 0.00, then extract the same as 0, please don't ignore it."
], ],
"administration_fees":[ "administration_fees":[
"### Administration fees and costs", "### Administration fees and costs",
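
Note on rules 6, 7 and M above: these are prompt instructions for the model, not code. If the same convention were also checked downstream on the extracted values, it might look like the sketch below. The helper name `normalize_fee_value` and the choice to return `None` for unparseable text are assumptions for illustration and do not appear in this commit.

```python
from typing import Optional, Union


def normalize_fee_value(raw: str) -> Optional[Union[float, str]]:
    """Hypothetical post-processing helper (not part of this commit).

    Mirrors the prompt rules:
    - "0%", "0.00%", "0" and "0.00" are kept as the number 0, never dropped.
    - "Nil" becomes the empty string ''.
    """
    text = raw.strip()
    if text.lower() == "nil":
        return ""
    # Drop a trailing percent sign, then parse what is left as a number.
    candidate = text.rstrip("%").strip()
    try:
        return float(candidate)
    except ValueError:
        # Assumption: unparseable text is reported as None so the caller
        # can decide how to handle it.
        return None
```

With this sketch, a zero written as `"0.00%"` stays a real value, and only a literal `"Nil"` maps to `''`.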

80
main.py

@@ -1522,8 +1522,8 @@ if __name__ == "__main__":
# get_aus_prospectus_document_category() # get_aus_prospectus_document_category()
re_run_extract_data = False re_run_extract_data = True
re_run_mapping_data = False re_run_mapping_data = True
force_save_total_data = True force_save_total_data = True
doc_source = "aus_prospectus" doc_source = "aus_prospectus"
# doc_source = "emea_ar" # doc_source = "emea_ar"
@@ -1531,46 +1531,44 @@ if __name__ == "__main__":
# document_sample_file = ( # document_sample_file = (
# r"./sample_documents/aus_prospectus_verify_6_documents_sample.txt" # r"./sample_documents/aus_prospectus_verify_6_documents_sample.txt"
# ) # )
document_sample_file_list = [ document_sample_file = (
r"./sample_documents/aus_prospectus_46_documents_sample.txt", r"./sample_documents/aus_prospectus_46_documents_sample.txt"
r"./sample_documents/aus_prospectus_verify_6_documents_sample.txt", )
] logger.info(f"Start to run document sample file: {document_sample_file}")
for document_sample_file in document_sample_file_list: with open(document_sample_file, "r", encoding="utf-8") as f:
logger.info(f"Start to run document sample file: {document_sample_file}") special_doc_id_list = [doc_id.strip() for doc_id in f.readlines()
with open(document_sample_file, "r", encoding="utf-8") as f: if len(doc_id.strip()) > 0]
special_doc_id_list = [doc_id.strip() for doc_id in f.readlines() # special_doc_id_list = ["420339794"]
if len(doc_id.strip()) > 0] pdf_folder: str = r"/data/aus_prospectus/pdf/"
# special_doc_id_list = ["401212184"] output_pdf_text_folder: str = r"/data/aus_prospectus/output/pdf_text/"
pdf_folder: str = r"/data/aus_prospectus/pdf/" output_extract_data_child_folder: str = (
output_pdf_text_folder: str = r"/data/aus_prospectus/output/pdf_text/" r"/data/aus_prospectus/output/extract_data/docs/"
output_extract_data_child_folder: str = ( )
r"/data/aus_prospectus/output/extract_data/docs/" output_extract_data_total_folder: str = (
) r"/data/aus_prospectus/output/extract_data/total/"
output_extract_data_total_folder: str = ( )
r"/data/aus_prospectus/output/extract_data/total/" output_mapping_child_folder: str = (
) r"/data/aus_prospectus/output/mapping_data/docs/"
output_mapping_child_folder: str = ( )
r"/data/aus_prospectus/output/mapping_data/docs/" output_mapping_total_folder: str = (
) r"/data/aus_prospectus/output/mapping_data/total/"
output_mapping_total_folder: str = ( )
r"/data/aus_prospectus/output/mapping_data/total/" drilldown_folder = r"/data/aus_prospectus/output/drilldown/"
)
drilldown_folder = r"/data/aus_prospectus/output/drilldown/"
batch_run_documents( batch_run_documents(
doc_source=doc_source, doc_source=doc_source,
special_doc_id_list=special_doc_id_list, special_doc_id_list=special_doc_id_list,
pdf_folder=pdf_folder, pdf_folder=pdf_folder,
output_pdf_text_folder=output_pdf_text_folder, output_pdf_text_folder=output_pdf_text_folder,
output_extract_data_child_folder=output_extract_data_child_folder, output_extract_data_child_folder=output_extract_data_child_folder,
output_extract_data_total_folder=output_extract_data_total_folder, output_extract_data_total_folder=output_extract_data_total_folder,
output_mapping_child_folder=output_mapping_child_folder, output_mapping_child_folder=output_mapping_child_folder,
output_mapping_total_folder=output_mapping_total_folder, output_mapping_total_folder=output_mapping_total_folder,
drilldown_folder=drilldown_folder, drilldown_folder=drilldown_folder,
re_run_extract_data=re_run_extract_data, re_run_extract_data=re_run_extract_data,
re_run_mapping_data=re_run_mapping_data, re_run_mapping_data=re_run_mapping_data,
force_save_total_data=force_save_total_data force_save_total_data=force_save_total_data
) )
elif doc_source == "emea_ar": elif doc_source == "emea_ar":
special_doc_id_list = ["321733631"] special_doc_id_list = ["321733631"]
batch_run_documents( batch_run_documents(

File diff suppressed because one or more lines are too long
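
Note on the main.py hunk above: the new side reads a single sample file into `special_doc_id_list`, where the old side looped over a list of sample files. The same read step can be written as a small helper, sketched below. The name `load_doc_ids` is hypothetical and does not exist in the repository; the open and strip logic follows the diff.

```python
from typing import List


def load_doc_ids(sample_file: str) -> List[str]:
    """Hypothetical helper: return the non-empty, stripped lines of a
    document sample file, one document ID per line."""
    with open(sample_file, "r", encoding="utf-8") as f:
        # Skip blank lines, as the list comprehension in main.py does.
        return [line.strip() for line in f if line.strip()]
```

Called once per path, such a helper would cover both the single-file form on the new side and the multi-file loop on the old side.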