update instructions fund name section structure
parent 8a5723c150
commit 46f86b124b
@@ -17,8 +17,8 @@
 "data_business_features": {
 "common": [
 "## General rules",
-"- 1. The data is in the context, perhaps in table(s), semi-table(s) or paragraphs.",
-"- 2. Fund name: ",
+"1. The data is in the context, perhaps in table(s), semi-table(s) or paragraphs.",
+"2. Fund name: ",
 "a. The full fund name should be main fund name + sub-fund name, e,g, main fund name is Black Rock European, sub-fund name is Growth, the full fund name is: Black Rock European Growth.",
 "b. The sub-fund name may be as the first column or first row values in the table.",
 "b.1 fund name example:",
@@ -67,12 +67,12 @@
 "---Example 3 End---",
 "Although exist \"Retirement account\" and \"Transition to Retirement account\", but the investment option is not exist, so fund name and share name should be: \"Rest Pension\".",
 "\n",
-"- 3. Only extract the latest data from context:",
+"3. Only extract the latest data from context:",
 "If with multiple data values in same row, please extract the latest.",
 "\n",
-"- 4. Reported names:",
+"4. Reported names:",
 "Only output the values which with significant reported names.",
-"- Multiple data columns with same reported name but different post-fix:",
+"Multiple data columns with same reported name but different post-fix:",
 "If there are multiple reported names with different post-fix text, here is the priority rule:",
 "The pos-fix text is in the brackets: (gross), (net), pick up the values from (net).",
 "---Example Start---",
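Review note: rule 4 in the hunk above is an instruction to the model, not code. As a rough illustration of the priority it describes (prefer the "(net)" column when the same reported name appears with several bracketed post-fixes), a post-processing check might look like the sketch below. The helper name `pick_net_columns`, the regular expression, and the sample values are all hypothetical and are not part of this commit.

```python
import re

# Splits "Management fee (net)" into name="Management fee", postfix="net".
# Headers without a trailing bracket are treated as having no post-fix.
POSTFIX = re.compile(r"^(?P<name>.*?)\s*\((?P<postfix>[^)]*)\)\s*$")


def pick_net_columns(row: dict) -> dict:
    """Hypothetical helper: keep one value per reported name, preferring (net)."""
    chosen = {}
    for header, value in row.items():
        match = POSTFIX.match(header)
        if match:
            name = match.group("name")
            postfix = match.group("postfix").lower()
        else:
            name, postfix = header, ""
        # The first value seen is kept unless a (net) column overrides it.
        if name not in chosen or postfix == "net":
            chosen[name] = value
    return chosen


# Made-up row for illustration only.
row = {
    "Management fee (gross)": 0.90,
    "Management fee (net)": 0.77,
    "Administration fees": 0.42,
}
result = pick_net_columns(row)
```

A check like this could be used to verify model output against the source table; it is not a substitute for the prompt rule itself.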
@@ -80,14 +80,14 @@
 "---Example End---",
 "The output should be:",
 "{\"data\": [{\"fund name\": \"Allan Gray Australian Equity Fund\", \"share name\": \"Class A\", \"management_fee_and_costs\": 1.19, \"management_fee\": 0.77, \"administration_fees\": 0.42}]}",
-"- 5. Please ignore these words as fund names, it means never extract these words as fund names. They are:",
+"5. Please ignore these words as fund names, it means never extract these words as fund names. They are:",
 "\"Ready-made portfolios\", \"Simple choice\", \"Build-your-own portfolio\".",
-"- 6. Identify the value of data point and if it is written 0% or 0.00% or 0 or 0.00 then extract the same as 0 do not assume null for the same and return its values as 0",
+"6. Identify the value of data point and if it is written 0% or 0.00% or 0 or 0.00 then extract the same as 0 do not assume null for the same and return its values as 0",
 "---Example Start---",
 "Retirement account \n\nInvestment option \n(A) Investment fees \nand costs (including \n(B) performance \nfees) (pa)* \n(B) Performance \nfees (pa) \n# \n(C) Transaction \ncosts (pa)*^ \n(A) + (C) Total \ninvestment cost \n(pa) \nBalanced – Indexed 0.00% 0.00% 0.00% 0.00%\n",
 "---Example End---",
 "For this example, as \"Investment fees and costs (including (B) performance fees)\" and \"Performance fees (pa)\" mentioned as 0.00% so return 0 as datapoint values.",
-"- 7. If for data point value specifically Nil is written in the value then return NULL('') for the same"
+"7. If for data point value specifically Nil is written in the value then return NULL('') for the same"
 ],
 "investment_level": {
 "total_annual_dollar_based_charges": "Total annual dollar based charges is share level data.",
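Review note: rules 6 and 7 in the hunk above distinguish an explicit zero (which must come back as 0) from an explicit "Nil" (which must come back as an empty string). A minimal sketch of that distinction, assuming the raw cell text is available as a string, is shown below. The function name `normalize_fee_value` is hypothetical and does not appear in this commit.

```python
import re

# Matches a plain number with an optional trailing percent sign, e.g. "0.00%".
NUMBER = re.compile(r"(-?\d+(?:\.\d+)?)\s*%?")


def normalize_fee_value(raw: str):
    """Hypothetical helper mirroring rules 6 and 7 of the prompt."""
    text = raw.strip()
    if text.lower() == "nil":
        return ""  # rule 7: an explicit "Nil" maps to NULL ('')
    match = NUMBER.fullmatch(text)
    if match is None:
        return None  # not a plain number; left for other handling
    value = float(match.group(1))
    # Rule 6: 0%, 0.00%, 0 and 0.00 are all reported as 0, never as null.
    return 0 if value == 0 else value
```

Returning `None` for unrecognized text keeps "value missing" separate from both the zero case and the "Nil" case, which is the distinction the two rules are trying to protect.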
@@ -320,7 +320,8 @@
 "FOUND \"Cost of product\", IGNORE ALL OF INFORMATION BELOW IT!!! JUST RETURN EMPTY RESPONSE!!!",
 "The output should be:",
 "{\"data\": []}",
-"L. Do NOT infer or copy investment fees or management fees from examples provided for specific funds to other investment options. Only extract 'management_fee_and_costs' and 'management_fee' if explicitly stated separately for each investment option."
+"L. Do NOT infer or copy investment fees or management fees from examples provided for specific funds to other investment options. Only extract 'management_fee_and_costs' and 'management_fee' if explicitly stated separately for each investment option.",
+"M. Identify the value of management fee and costs, and if it is written 0% or 0.00% or 0 or 0.00, then extract the same as 0, please don't ignore it."
 ],
 "administration_fees":[
 "### Administration fees and costs",
14
main.py
@@ -1522,8 +1522,8 @@ if __name__ == "__main__":
 
 # get_aus_prospectus_document_category()
 
-re_run_extract_data = False
-re_run_mapping_data = False
+re_run_extract_data = True
+re_run_mapping_data = True
 force_save_total_data = True
 doc_source = "aus_prospectus"
 # doc_source = "emea_ar"
@@ -1531,16 +1531,14 @@ if __name__ == "__main__":
 # document_sample_file = (
 # r"./sample_documents/aus_prospectus_verify_6_documents_sample.txt"
 # )
-document_sample_file_list = [
-r"./sample_documents/aus_prospectus_46_documents_sample.txt",
-r"./sample_documents/aus_prospectus_verify_6_documents_sample.txt",
-]
-for document_sample_file in document_sample_file_list:
+document_sample_file = (
+r"./sample_documents/aus_prospectus_46_documents_sample.txt"
+)
 logger.info(f"Start to run document sample file: {document_sample_file}")
 with open(document_sample_file, "r", encoding="utf-8") as f:
 special_doc_id_list = [doc_id.strip() for doc_id in f.readlines()
 if len(doc_id.strip()) > 0]
 # special_doc_id_list = ["401212184"]
 # special_doc_id_list = ["420339794"]
 pdf_folder: str = r"/data/aus_prospectus/pdf/"
 output_pdf_text_folder: str = r"/data/aus_prospectus/output/pdf_text/"
 output_extract_data_child_folder: str = (
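Review note: the hunk above edits the script each time a different sample file or a single document ID is needed. One possible alternative, sketched here under the assumption that every sample file lists one document ID per line, is a small helper that reads the IDs and skips blank lines. The name `read_doc_ids` is hypothetical and is not defined in `main.py`.

```python
from typing import List


def read_doc_ids(sample_file: str) -> List[str]:
    """Hypothetical helper: return the non-empty, stripped lines of a sample file."""
    with open(sample_file, "r", encoding="utf-8") as f:
        # Iterating over the file avoids loading it through readlines().
        return [line.strip() for line in f if line.strip()]
```

With such a helper, the choice between one sample file and several could be reduced to the list of paths passed in, rather than commented-out assignments.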
File diff suppressed because one or more lines are too long