Commit Graph

69 Commits

Author SHA1 Message Date
Blade He 4edc4b4768 clean code 2025-03-24 17:10:16 -05:00
Blade He 9be6d1296d update benchmark check logic 2025-03-19 00:52:25 -05:00
Blade He 5ba39a394b 1. keep fund/share db list before applying LLM
2. add key words for interposed_vehicle_performance_fee_cost
2025-03-18 22:15:31 -05:00
Blade He c71936c5ff 1. optimize benchmark_name instructions
2. handle documents that may contain the same raw fund name multiple times: do not remove entries from unmatched_db_list when a relevant raw fund/share name is matched.
Otherwise, some raw names can no longer be matched to a db name.
2025-03-18 17:22:21 -05:00
Blade He 0cea2e501b For AUS Prospectus, skip the ChatGPT Vision call when the page content has no numeric text or may contain garbled text.
(Keep this logic for EMEA LUX AR, because of special provider cases in that market's documents.)
2025-03-18 14:15:43 -05:00
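The rule in commit 0cea2e501b can be sketched roughly as follows. This is an illustration under stated assumptions, not the repository's code: the function names, the `doc_source` label "emea_ar", and the "no digit on the page" heuristic are all hypothetical.

```python
import re

# Hypothetical doc_source labels for which the Vision fallback stays enabled.
# The real configuration values are not shown in this log.
VISION_FALLBACK_SOURCES = {"emea_ar"}


def has_numeric_text(page_text: str) -> bool:
    """Return True if the page text contains at least one digit."""
    return re.search(r"\d", page_text) is not None


def should_call_vision(page_text: str, doc_source: str) -> bool:
    """Decide whether to fall back to the Vision model for a page.

    Assumption: a page without any numeric text is treated as possibly
    garbled. For such pages the fallback is used only for sources listed
    in VISION_FALLBACK_SOURCES; for every other source it is skipped.
    """
    if has_numeric_text(page_text):
        return False  # the text layer looks usable, no fallback needed
    return doc_source in VISION_FALLBACK_SOURCES
```

With these assumptions, a digit-free page from an AUS Prospectus document is skipped, while the same page from an EMEA AR document would still be sent to Vision.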
Blade He b3941ee4b3 update instructions for total_annual_dollar_based_charges 2025-03-17 15:07:02 -05:00
Blade He dd15c1c48e Optimize for benchmark name 2025-03-14 11:51:10 -05:00
Blade He f539340d04 1. optimize instructions
Only load relevant fund name for investment objective, instead of full page text with the most recent investment objective
2. Exclude tables that have only one numeric column: Cost Product
2025-03-14 01:04:51 -05:00
Blade He 551f754379 Fix issue when saving extracted data 2025-03-13 18:36:04 -05:00
Blade He a48af9ddf0 A. Metrics score
Blade's updates
1. Set the secondary key to be the share class name, instead of the fund name
2. Exclude data points whose support is 0 when calculating the metrics
3. Add the message list to store the error messages
4. Support saving metrics and error messages to an Excel file
5. Support statistics for different document lists
6. Set F1-Score to the first column in the metrics table
B. Optimize instructions for benchmark_name
2025-03-13 17:52:06 -05:00
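Items 2 and 6 of commit a48af9ddf0 can be illustrated with a small sketch. The function name `f1_table` and the input shape (true positive, false positive, false negative counts per data point) are assumptions made for the example; support is taken here to mean the number of ground-truth values, i.e. TP + FN.

```python
def f1_table(counts):
    """Build metric rows from {data_point: (tp, fp, fn)}.

    Data points with support 0 (no ground-truth values) are skipped,
    and F1 is placed first among the metric columns.
    """
    rows = []
    for data_point, (tp, fp, fn) in counts.items():
        support = tp + fn
        if support == 0:
            continue  # nothing to score against, leave it out of the metrics
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / support
        denominator = precision + recall
        f1 = 2 * precision * recall / denominator if denominator else 0.0
        rows.append({
            "f1": f1,
            "precision": precision,
            "recall": recall,
            "support": support,
            "data_point": data_point,
        })
    return rows
```

Because Python dictionaries preserve insertion order, writing such rows to a spreadsheet would put the F1 column first.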
Blade He c2c0b33015 align fund name based on production name
optimize performance relevant prompts
2025-03-12 21:52:00 -05:00
Blade He 6f17c2253c optimize instructions for document 412778803 2025-03-12 17:24:39 -05:00
Blade He 765772e5a8 optimize performance_fee_costs by document 391080133 2025-03-12 14:45:48 -05:00
Blade He c7c36dbdd2 1. update performance_fee name to performance_fee_costs
2. support extract data for total_annual_dollar_based_charges
2025-03-11 17:15:39 -05:00
Blade He e9f6383258 apply configuration file to replace disordered table header contents 2025-03-10 11:09:00 -05:00
Blade He 4ee762963e optimized for management_fee_and_costs and administration_fees 2025-03-08 21:40:00 -06:00
Blade He fa2dede454 optimize for management_fee_and_costs and management_fee 2025-03-07 18:38:36 -06:00
Blade He 2cd4f5f787 Supplement provider information to ground truth data
Calculate metrics based on providers
Integrate "merge" data algorithm for AUS Prospectus final outputs
2025-03-07 15:02:12 -06:00
Blade He 52515fc152 1. simplify management_fee_and_costs instructions
2. optimize management_fee_and_costs instructions
3. resolve the issues for complex scenarios: need to sum management_fee, recoverable_expenses, indirect_costs as management_fee_and_costs
2025-03-06 17:27:18 -06:00
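The log does not say how commit 52515fc152 performs the sum, so the following only illustrates the arithmetic in item 3. The helper name and the dictionary input are hypothetical; only the four data point names come from the commit message.

```python
from decimal import Decimal

# Component data points named in the commit message.
COMPONENTS = ("management_fee", "recoverable_expenses", "indirect_costs")


def total_management_fee_and_costs(values):
    """Sum the components that are present; return None if none are.

    Decimal is used so that percentages such as 0.85 + 0.10 + 0.05
    add up exactly instead of picking up float rounding noise.
    """
    present = [
        Decimal(str(values[name]))
        for name in COMPONENTS
        if values.get(name) is not None
    ]
    return sum(present, Decimal("0")) if present else None
```

For example, components of 0.85, 0.10 and 0.05 give a management_fee_and_costs of exactly 1.00.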
Blade He c4ed65770d Try to support more complex management_fee_and_costs scenarios
Support calculating metrics for all data points
2025-03-05 17:21:13 -06:00
Blade He f4b4d00f58 optimize instructions for management fee and costs.
support dynamically loading complex instructions by keywords
2025-03-04 08:32:55 -06:00
Blade He d3be711859 optimize administration fees instructions 2025-02-28 22:12:18 -06:00
Blade He d4bc3aba4e optimize for management fees 2025-02-28 16:55:33 -06:00
Blade He d0295995d8 support checking whether the next page contains a table with the same structure as the current page.
If so, run the data extraction pipeline on the next page.
2025-02-27 23:08:57 -06:00
Blade He d0128d6279 1. optimize for administration fees.
2. optimize for management fees
2025-02-27 17:36:41 -06:00
Blade He 543cab74e1 1. get production name
2. if a data point comes with a production name, set the relevant data point value(s) on each fund/share
2025-02-27 12:07:49 -06:00
Blade He 70079d176e Support removing duplicated values, keeping only the latest ones. 2025-02-26 17:05:58 -06:00
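One way to keep only the latest of several duplicated values is sketched below. The record fields (`fund_name`, `data_point`, `value`) and the assumption that later records in the list are the newer ones are hypothetical; the log does not show how commit 70079d176e defines "latest".

```python
def keep_latest(records):
    """Drop duplicated (fund_name, data_point) entries, keeping the last one.

    Assumption: records arrive in document order, so a later record for
    the same key supersedes an earlier one.
    """
    latest = {}
    for record in records:
        key = (record["fund_name"], record["data_point"])
        latest[key] = record  # overwrite any earlier value for this key
    return list(latest.values())
```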
Blade He f467945cd4 support benchmark name data extraction 2025-02-26 10:05:46 -06:00
Blade He 357bb6d580 1. support dynamically showing fund-level data examples.
2. optimize for minimum_initial_investment data point
2025-02-25 10:35:53 -06:00
Blade He 75ea383354 support identifying the AUS Prospectus document category: MIS or Super 2025-02-24 15:08:15 -06:00
Blade He bb6862b179 update a little 2025-02-19 14:32:08 -06:00
Blade He 705933bbdd optimized for phase 2 data 2025-02-18 18:52:26 -06:00
Blade He 01e2a0e38d add configuration for datapoints data types
update configuration for minimum initial investment
support applying the value to all funds for minimum initial investment
2025-02-05 12:08:12 -06:00
Blade He a8810519f8 optimize instructions configuration
optimize drilldown part logic
2025-02-04 15:29:24 -06:00
Blade He b15d260a58 migrate name mapping algorithm from Ravi 2025-01-21 16:55:08 -06:00
Blade He f10ff8ee33 update for deployment 2025-01-16 20:34:43 -06:00
Blade He 9f0e77a11e support load configurations by doc_source parameter 2025-01-16 11:17:48 -06:00
Blade He a89aa9c4de support fetch data from Prospectus 2025-01-14 16:21:48 -06:00
Blade He 201a809ffa comment remove_abundant_data function 2025-01-06 15:27:43 -06:00
Blade He 309bb714f6 fix issue for parsing data via Vision Function. 2024-12-11 16:49:04 -06:00
Blade He d673a99e21 switch back to extracting data from the image stream directly, instead of first getting text from the image stream and then extracting data from that text.
Reason: the quality of the text extracted from the image stream is not good enough.
2024-12-10 16:17:47 -06:00
Blade He f71e2968cc simplify code 2024-12-09 22:24:40 -06:00
Blade He 75ea5e70de 1. support fetching data from pages with garbled text via the ChatGPT4o Vision function.
2. multilingual share features configuration
2024-12-09 17:47:42 -06:00
Blade He d96f77fe00 Split share class names when multiple share classes appear on the same line 2024-12-06 16:31:42 -06:00
Blade He a25991e2bb 1. Set TOR reported name priority
2. Optimize investment mapping logic
2024-12-06 09:54:43 -06:00
Blade He 95c386911c Clean fund name after getting response from ChatGPT 2024-12-04 22:08:09 -06:00
Blade He 70362b554f Fix issue for "The last fund name of previous PDF page" logic:
If the current page's fund name starts with "The last fund name of previous PDF page" and has more content after it, remove "The last fund name of previous PDF page".
2024-12-04 16:57:52 -06:00
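The fix in commit 70362b554f could look roughly like this. The function name `clean_fund_name` is hypothetical, and the handling of an optional colon after the marker is an assumption; only the marker text itself comes from the commit message.

```python
MARKER = "The last fund name of previous PDF page"


def clean_fund_name(fund_name: str) -> str:
    """Strip the carried-over marker from an extracted fund name.

    The marker is removed only when more content follows it; a name that
    consists of the marker alone is returned unchanged.
    """
    stripped = fund_name.strip()
    if stripped.startswith(MARKER):
        # Drop the marker and an optional colon separator after it.
        remainder = stripped[len(MARKER):].lstrip(":").strip()
        if remainder:
            return remainder
    return stripped
```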
Blade He 36fbaa946e Add the statement when transferring the last fund name of previous PDF page:
The last fund name of previous PDF page:
page_text = f"\nThe last fund name of previous PDF page: {previous_page_fund_name}\n{page_text}"
2024-12-03 11:50:31 -06:00
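Wrapped in a function, the statement from commit 36fbaa946e might be used as below. The function name and the guard for a missing previous fund name are assumptions; the f-string is the one quoted in the commit message.

```python
def with_previous_fund_name(page_text: str, previous_page_fund_name: str) -> str:
    """Prefix the page text with the last fund name of the previous page.

    Assumption: when there is no previous fund name (for example on the
    first page), the page text is passed through unchanged.
    """
    if not previous_page_fund_name:
        return page_text
    return f"\nThe last fund name of previous PDF page: {previous_page_fund_name}\n{page_text}"
```

Carrying the name forward this way gives the model context for a table that continues across a page break without repeating its fund heading.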
Blade He a11a99fdc3 1. Optimize instructions: do not fetch data that is stated with "up to".
2. Add exception handler in function.
2024-12-03 11:27:28 -06:00
Blade He bc32860f87 remove_abundant_data 2024-12-02 17:16:56 -06:00