update instructions fund name section structure
parent
8a5723c150
commit
46f86b124b
|
|
@@ -17,8 +17,8 @@
|
||||||
"data_business_features": {
|
"data_business_features": {
|
||||||
"common": [
|
"common": [
|
||||||
"## General rules",
|
"## General rules",
|
||||||
"- 1. The data is in the context, perhaps in table(s), semi-table(s) or paragraphs.",
|
"1. The data is in the context, perhaps in table(s), semi-table(s) or paragraphs.",
|
||||||
"- 2. Fund name: ",
|
"2. Fund name: ",
|
||||||
"a. The full fund name should be main fund name + sub-fund name, e,g, main fund name is Black Rock European, sub-fund name is Growth, the full fund name is: Black Rock European Growth.",
|
"a. The full fund name should be main fund name + sub-fund name, e,g, main fund name is Black Rock European, sub-fund name is Growth, the full fund name is: Black Rock European Growth.",
|
||||||
"b. The sub-fund name may be as the first column or first row values in the table.",
|
"b. The sub-fund name may be as the first column or first row values in the table.",
|
||||||
"b.1 fund name example:",
|
"b.1 fund name example:",
|
||||||
|
|
@@ -67,12 +67,12 @@
|
||||||
"---Example 3 End---",
|
"---Example 3 End---",
|
||||||
"Although exist \"Retirement account\" and \"Transition to Retirement account\", but the investment option is not exist, so fund name and share name should be: \"Rest Pension\".",
|
"Although exist \"Retirement account\" and \"Transition to Retirement account\", but the investment option is not exist, so fund name and share name should be: \"Rest Pension\".",
|
||||||
"\n",
|
"\n",
|
||||||
"- 3. Only extract the latest data from context:",
|
"3. Only extract the latest data from context:",
|
||||||
"If with multiple data values in same row, please extract the latest.",
|
"If with multiple data values in same row, please extract the latest.",
|
||||||
"\n",
|
"\n",
|
||||||
"- 4. Reported names:",
|
"4. Reported names:",
|
||||||
"Only output the values which with significant reported names.",
|
"Only output the values which with significant reported names.",
|
||||||
"- Multiple data columns with same reported name but different post-fix:",
|
"Multiple data columns with same reported name but different post-fix:",
|
||||||
"If there are multiple reported names with different post-fix text, here is the priority rule:",
|
"If there are multiple reported names with different post-fix text, here is the priority rule:",
|
||||||
"The pos-fix text is in the brackets: (gross), (net), pick up the values from (net).",
|
"The pos-fix text is in the brackets: (gross), (net), pick up the values from (net).",
|
||||||
"---Example Start---",
|
"---Example Start---",
|
||||||
|
|
@@ -80,14 +80,14 @@
|
||||||
"---Example End---",
|
"---Example End---",
|
||||||
"The output should be:",
|
"The output should be:",
|
||||||
"{\"data\": [{\"fund name\": \"Allan Gray Australian Equity Fund\", \"share name\": \"Class A\", \"management_fee_and_costs\": 1.19, \"management_fee\": 0.77, \"administration_fees\": 0.42}]}",
|
"{\"data\": [{\"fund name\": \"Allan Gray Australian Equity Fund\", \"share name\": \"Class A\", \"management_fee_and_costs\": 1.19, \"management_fee\": 0.77, \"administration_fees\": 0.42}]}",
|
||||||
"- 5. Please ignore these words as fund names, it means never extract these words as fund names. They are:",
|
"5. Please ignore these words as fund names, it means never extract these words as fund names. They are:",
|
||||||
"\"Ready-made portfolios\", \"Simple choice\", \"Build-your-own portfolio\".",
|
"\"Ready-made portfolios\", \"Simple choice\", \"Build-your-own portfolio\".",
|
||||||
"- 6. Identify the value of data point and if it is written 0% or 0.00% or 0 or 0.00 then extract the same as 0 do not assume null for the same and return its values as 0",
|
"6. Identify the value of data point and if it is written 0% or 0.00% or 0 or 0.00 then extract the same as 0 do not assume null for the same and return its values as 0",
|
||||||
"---Example Start---",
|
"---Example Start---",
|
||||||
"Retirement account \n\nInvestment option \n(A) Investment fees \nand costs (including \n(B) performance \nfees) (pa)* \n(B) Performance \nfees (pa) \n# \n(C) Transaction \ncosts (pa)*^ \n(A) + (C) Total \ninvestment cost \n(pa) \nBalanced – Indexed 0.00% 0.00% 0.00% 0.00%\n",
|
"Retirement account \n\nInvestment option \n(A) Investment fees \nand costs (including \n(B) performance \nfees) (pa)* \n(B) Performance \nfees (pa) \n# \n(C) Transaction \ncosts (pa)*^ \n(A) + (C) Total \ninvestment cost \n(pa) \nBalanced – Indexed 0.00% 0.00% 0.00% 0.00%\n",
|
||||||
"---Example End---",
|
"---Example End---",
|
||||||
"For this example, as \"Investment fees and costs (including (B) performance fees)\" and \"Performance fees (pa)\" mentioned as 0.00% so return 0 as datapoint values.",
|
"For this example, as \"Investment fees and costs (including (B) performance fees)\" and \"Performance fees (pa)\" mentioned as 0.00% so return 0 as datapoint values.",
|
||||||
"- 7. If for data point value specifically Nil is written in the value then return NULL('') for the same"
|
"7. If for data point value specifically Nil is written in the value then return NULL('') for the same"
|
||||||
],
|
],
|
||||||
"investment_level": {
|
"investment_level": {
|
||||||
"total_annual_dollar_based_charges": "Total annual dollar based charges is share level data.",
|
"total_annual_dollar_based_charges": "Total annual dollar based charges is share level data.",
|
||||||
|
|
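Rules 6 and 7 in the hunk above distinguish an explicit zero (0%, 0.00%, 0, 0.00), which should be returned as 0, from a value written as Nil, which should be returned as an empty string. The following is only a minimal sketch of how that distinction could be checked in post-processing. The helper name `normalize_fee_value` is hypothetical and does not appear in this commit, where the rule is expressed purely as prompt text.

```python
import re
from typing import Optional, Union


def normalize_fee_value(raw: Optional[str]) -> Union[float, str, None]:
    """Hypothetical helper (not part of this repository).

    - "0%", "0.00%", "0", "0.00" -> 0.0 (an explicit zero is kept, rule 6)
    - "Nil"                      -> ""  (rule 7)
    - other numbers, with or without a trailing "%", -> float
    - anything else              -> None (no value recognised)
    """
    if raw is None:
        return None
    text = raw.strip()
    if text.lower() == "nil":
        return ""
    match = re.fullmatch(r"(-?\d+(?:\.\d+)?)\s*%?", text)
    if match is None:
        return None
    return float(match.group(1))
```

Returning `None` for unrecognised text keeps "no value found" separate from both an explicit zero and Nil, which is the distinction the two rules rely on.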
@@ -320,7 +320,8 @@
|
||||||
"FOUND \"Cost of product\", IGNORE ALL OF INFORMATION BELOW IT!!! JUST RETURN EMPTY RESPONSE!!!",
|
"FOUND \"Cost of product\", IGNORE ALL OF INFORMATION BELOW IT!!! JUST RETURN EMPTY RESPONSE!!!",
|
||||||
"The output should be:",
|
"The output should be:",
|
||||||
"{\"data\": []}",
|
"{\"data\": []}",
|
||||||
"L. Do NOT infer or copy investment fees or management fees from examples provided for specific funds to other investment options. Only extract 'management_fee_and_costs' and 'management_fee' if explicitly stated separately for each investment option."
|
"L. Do NOT infer or copy investment fees or management fees from examples provided for specific funds to other investment options. Only extract 'management_fee_and_costs' and 'management_fee' if explicitly stated separately for each investment option.",
|
||||||
|
"M. Identify the value of management fee and costs, and if it is written 0% or 0.00% or 0 or 0.00, then extract the same as 0, please don't ignore it."
|
||||||
],
|
],
|
||||||
"administration_fees":[
|
"administration_fees":[
|
||||||
"### Administration fees and costs",
|
"### Administration fees and costs",
|
||||||
|
|
|
||||||
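The general rules in this file also state that when the same reported name appears with different post-fix text in brackets, (gross) and (net), the values from (net) should be picked. As a rough illustration of that priority rule, here is a sketch that assumes the column headers and values are already available as a dictionary. The function name `pick_net_values` and the sample headers in the usage note are hypothetical and not taken from the repository.

```python
import re
from typing import Dict

# Matches headers such as "Management fee (net)" or "Management fee (gross)".
_POSTFIX = re.compile(
    r"^(?P<name>.*?)\s*\((?P<postfix>gross|net)\)\s*$", re.IGNORECASE
)


def pick_net_values(columns: Dict[str, float]) -> Dict[str, float]:
    """Hypothetical helper: collapse (gross)/(net) columns, preferring (net)."""
    chosen: Dict[str, float] = {}
    for header, value in columns.items():
        match = _POSTFIX.match(header)
        if match is None:
            # No (gross)/(net) post-fix: keep the column unchanged.
            chosen[header.strip()] = value
            continue
        name = match.group("name").strip()
        postfix = match.group("postfix").lower()
        # A (net) value always wins; a (gross) value is only a fallback.
        if postfix == "net" or name not in chosen:
            chosen[name] = value
    return chosen
```

For example, calling `pick_net_values({"Management fee (gross)": 1.5, "Management fee (net)": 1.2})` keeps only the (net) figure under the name "Management fee". The (gross) value is retained only when no (net) column exists, which is an assumption beyond what the rule text states.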
80
main.py
|
|
@@ -1522,8 +1522,8 @@ if __name__ == "__main__":
|
||||||
|
|
||||||
# get_aus_prospectus_document_category()
|
# get_aus_prospectus_document_category()
|
||||||
|
|
||||||
re_run_extract_data = False
|
re_run_extract_data = True
|
||||||
re_run_mapping_data = False
|
re_run_mapping_data = True
|
||||||
force_save_total_data = True
|
force_save_total_data = True
|
||||||
doc_source = "aus_prospectus"
|
doc_source = "aus_prospectus"
|
||||||
# doc_source = "emea_ar"
|
# doc_source = "emea_ar"
|
||||||
|
|
@@ -1531,46 +1531,44 @@ if __name__ == "__main__":
|
||||||
# document_sample_file = (
|
# document_sample_file = (
|
||||||
# r"./sample_documents/aus_prospectus_verify_6_documents_sample.txt"
|
# r"./sample_documents/aus_prospectus_verify_6_documents_sample.txt"
|
||||||
# )
|
# )
|
||||||
document_sample_file_list = [
|
document_sample_file = (
|
||||||
r"./sample_documents/aus_prospectus_46_documents_sample.txt",
|
r"./sample_documents/aus_prospectus_46_documents_sample.txt"
|
||||||
r"./sample_documents/aus_prospectus_verify_6_documents_sample.txt",
|
)
|
||||||
]
|
logger.info(f"Start to run document sample file: {document_sample_file}")
|
||||||
for document_sample_file in document_sample_file_list:
|
with open(document_sample_file, "r", encoding="utf-8") as f:
|
||||||
logger.info(f"Start to run document sample file: {document_sample_file}")
|
special_doc_id_list = [doc_id.strip() for doc_id in f.readlines()
|
||||||
with open(document_sample_file, "r", encoding="utf-8") as f:
|
if len(doc_id.strip()) > 0]
|
||||||
special_doc_id_list = [doc_id.strip() for doc_id in f.readlines()
|
# special_doc_id_list = ["420339794"]
|
||||||
if len(doc_id.strip()) > 0]
|
pdf_folder: str = r"/data/aus_prospectus/pdf/"
|
||||||
# special_doc_id_list = ["401212184"]
|
output_pdf_text_folder: str = r"/data/aus_prospectus/output/pdf_text/"
|
||||||
pdf_folder: str = r"/data/aus_prospectus/pdf/"
|
output_extract_data_child_folder: str = (
|
||||||
output_pdf_text_folder: str = r"/data/aus_prospectus/output/pdf_text/"
|
r"/data/aus_prospectus/output/extract_data/docs/"
|
||||||
output_extract_data_child_folder: str = (
|
)
|
||||||
r"/data/aus_prospectus/output/extract_data/docs/"
|
output_extract_data_total_folder: str = (
|
||||||
)
|
r"/data/aus_prospectus/output/extract_data/total/"
|
||||||
output_extract_data_total_folder: str = (
|
)
|
||||||
r"/data/aus_prospectus/output/extract_data/total/"
|
output_mapping_child_folder: str = (
|
||||||
)
|
r"/data/aus_prospectus/output/mapping_data/docs/"
|
||||||
output_mapping_child_folder: str = (
|
)
|
||||||
r"/data/aus_prospectus/output/mapping_data/docs/"
|
output_mapping_total_folder: str = (
|
||||||
)
|
r"/data/aus_prospectus/output/mapping_data/total/"
|
||||||
output_mapping_total_folder: str = (
|
)
|
||||||
r"/data/aus_prospectus/output/mapping_data/total/"
|
drilldown_folder = r"/data/aus_prospectus/output/drilldown/"
|
||||||
)
|
|
||||||
drilldown_folder = r"/data/aus_prospectus/output/drilldown/"
|
|
||||||
|
|
||||||
batch_run_documents(
|
batch_run_documents(
|
||||||
doc_source=doc_source,
|
doc_source=doc_source,
|
||||||
special_doc_id_list=special_doc_id_list,
|
special_doc_id_list=special_doc_id_list,
|
||||||
pdf_folder=pdf_folder,
|
pdf_folder=pdf_folder,
|
||||||
output_pdf_text_folder=output_pdf_text_folder,
|
output_pdf_text_folder=output_pdf_text_folder,
|
||||||
output_extract_data_child_folder=output_extract_data_child_folder,
|
output_extract_data_child_folder=output_extract_data_child_folder,
|
||||||
output_extract_data_total_folder=output_extract_data_total_folder,
|
output_extract_data_total_folder=output_extract_data_total_folder,
|
||||||
output_mapping_child_folder=output_mapping_child_folder,
|
output_mapping_child_folder=output_mapping_child_folder,
|
||||||
output_mapping_total_folder=output_mapping_total_folder,
|
output_mapping_total_folder=output_mapping_total_folder,
|
||||||
drilldown_folder=drilldown_folder,
|
drilldown_folder=drilldown_folder,
|
||||||
re_run_extract_data=re_run_extract_data,
|
re_run_extract_data=re_run_extract_data,
|
||||||
re_run_mapping_data=re_run_mapping_data,
|
re_run_mapping_data=re_run_mapping_data,
|
||||||
force_save_total_data=force_save_total_data
|
force_save_total_data=force_save_total_data
|
||||||
)
|
)
|
||||||
elif doc_source == "emea_ar":
|
elif doc_source == "emea_ar":
|
||||||
special_doc_id_list = ["321733631"]
|
special_doc_id_list = ["321733631"]
|
||||||
batch_run_documents(
|
batch_run_documents(
|
||||||
|
|
|
||||||
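The main.py hunk above replaces a loop over `document_sample_file_list` with a single `document_sample_file`, reading one document ID per non-empty line. If both behaviours need to stay available, one option is a small reader that accepts any number of sample files. This is a sketch under that assumption. The name `load_doc_ids` is hypothetical and not defined in main.py, and the de-duplication step goes beyond what the original code does.

```python
from typing import Iterable, List


def load_doc_ids(sample_files: Iterable[str]) -> List[str]:
    """Hypothetical helper: read document IDs, one per line, from sample files.

    Blank lines are skipped, as in the list comprehension in main.py.
    Duplicate IDs are dropped while keeping first-seen order (an added
    assumption, not behaviour of the original code).
    """
    doc_ids: List[str] = []
    seen = set()
    for sample_file in sample_files:
        with open(sample_file, "r", encoding="utf-8") as f:
            for line in f:
                doc_id = line.strip()
                if doc_id and doc_id not in seen:
                    seen.add(doc_id)
                    doc_ids.append(doc_id)
    return doc_ids
```

With such a helper, the single-file case becomes `special_doc_id_list = load_doc_ids([document_sample_file])`, and the earlier multi-file behaviour would only need a longer list.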
File diff suppressed because one or more lines are too long