Commit Graph

121 Commits

Author SHA1 Message Date
Blade He f9ef4cec96 update sql_query cache file store location
Cache for at most 5 days, then clean from local disk.
2025-01-31 10:59:54 -06:00
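The retention rule in this commit (keep cached query results for at most 5 days, then delete them from local disk) could look roughly like the sketch below. It is not taken from the repository: the function name `clean_expired_cache`, the flat cache directory layout, and the use of file modification time as the age are all assumptions made for illustration.

```python
import time
from pathlib import Path

MAX_CACHE_AGE_DAYS = 5  # the retention period named in the commit message


def clean_expired_cache(cache_dir, max_age_days=MAX_CACHE_AGE_DAYS, now=None):
    """Delete files directly inside cache_dir that are older than max_age_days.

    Hypothetical helper: age is judged by modification time, which is an
    assumption, not something the commit message states.
    Returns the sorted names of the files that were removed.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 24 * 60 * 60
    removed = []
    for path in Path(cache_dir).glob("*"):
        if path.is_file() and path.stat().st_mtime < cutoff:
            path.unlink()
            removed.append(path.name)
    return sorted(removed)
```

Passing `now` explicitly keeps the function easy to test; in normal use it can be omitted.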
Blade He 7f37f3532f switch example document 2025-01-27 14:59:26 -06:00
Blade He 6f831e241c Merge branch 'aus_prospectus_ravi' 2025-01-27 12:32:42 -06:00
Blade He 41f8c307ff a little change 2025-01-27 12:32:36 -06:00
Blade He 47c41e492f 1. only get name mapping data from document mapping
2. Compare name mapping metrics between Ravi's and mine.
2025-01-27 12:29:49 -06:00
Blade He d9b0bed39a a little change 2025-01-22 09:57:42 -06:00
Blade He 350550d1b0 fix issue for removing item from list 2025-01-21 17:24:05 -06:00
Blade He e2b9bcbdbc initial abbreviation configurations 2025-01-21 17:09:45 -06:00
Blade He b15d260a58 migrate name mapping algorithm from Ravi 2025-01-21 16:55:08 -06:00
Blade He d41fae3dba prepare for 100 multi-funds document samples 2025-01-17 16:26:31 -06:00
Blade He b93a8d55e8 update for output data as template 2025-01-17 11:41:58 -06:00
Blade He f10ff8ee33 update for deployment 2025-01-16 20:34:43 -06:00
Blade He fb4a6402f0 support output merged data format 2025-01-16 16:31:04 -06:00
Blade He 2eace81f51 support more configurable parts 2025-01-16 13:54:45 -06:00
Blade He db0827435b supplement EMEA AR configuration files 2025-01-16 11:30:44 -06:00
Blade He 9f0e77a11e support load configurations by doc_source parameter 2025-01-16 11:17:48 -06:00
Blade He acc30d4b72 if getting text via the pdf to html API fails, fall back to getting text with pymupdf. 2025-01-15 18:36:02 -06:00
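A minimal sketch of the fallback described in this commit, under stated assumptions: the primary extractor (the PDF-to-HTML API client) is passed in as a callable because its real interface is not shown in the log, and treating an empty result as a failure is an assumption. `extract_text_with_fallback` and `pymupdf_text` are hypothetical names; only `fitz.open` and `page.get_text()` are actual PyMuPDF calls.

```python
def extract_text_with_fallback(pdf_path, primary, fallback):
    """Try the primary extractor first; use the fallback if it fails.

    An exception or an empty/whitespace-only result counts as failure
    (the second condition is an assumption of this sketch).
    """
    try:
        text = primary(pdf_path)
        if text and text.strip():
            return text
    except Exception:
        # Broad catch on purpose: any primary failure triggers the fallback.
        pass
    return fallback(pdf_path)


def pymupdf_text(pdf_path):
    """Fallback extractor: concatenate the plain text of every page."""
    import fitz  # PyMuPDF, imported lazily so the sketch loads without it

    with fitz.open(pdf_path) as doc:
        return "\n".join(page.get_text() for page in doc)
```

A call would then look like `extract_text_with_fallback(path, call_pdf_to_html_api, pymupdf_text)`, where `call_pdf_to_html_api` stands in for the project's own API client.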
Blade He ace0ac2674 a little change 2025-01-15 18:22:08 -06:00
Blade He a89aa9c4de support fetch data from Prospectus 2025-01-14 16:21:48 -06:00
Blade He e230a5bf15 a little change 2025-01-09 12:19:24 -06:00
Blade He 91c86bb983 update AUS Prospectus relevant configuration 2025-01-08 17:40:57 -06:00
Blade He 0a867dcf07 complete configuration for AUS Prospectus 2025-01-07 16:25:13 -06:00
Blade He 201a809ffa comment remove_abundant_data function 2025-01-06 15:27:43 -06:00
Blade He c335992ced update requirements.txt 2025-01-06 13:56:09 -06:00
Blade He 9348e32caa support more performance fee keywords 2025-01-06 13:14:20 -06:00
Blade He 65e752e25a implement merge_output_data function; whether to output in this format depends on confirmation with the data/developer teams 2024-12-18 09:19:55 -06:00
Blade He 309bb714f6 fix issue for parsing data via Vision Function. 2024-12-11 16:49:04 -06:00
Blade He d673a99e21 switch back to extract data from image stream directly, instead of getting text from image stream as the first step, then extract data from extracted text.
Reason: the quality of the text extracted from the image stream is not good enough.
2024-12-10 16:17:47 -06:00
Blade He f71e2968cc simplify code 2024-12-09 22:24:40 -06:00
Blade He 75ea5e70de 1. support fetching data from garbled-text pages via the ChatGPT4o Vision function.
2. multilingual share features configuration
2024-12-09 17:47:42 -06:00
Blade He d96f77fe00 Split share class names when multiple share classes appear on the same line 2024-12-06 16:31:42 -06:00
Blade He d79b05885d optimize prompts for TOR 2024-12-06 14:50:34 -06:00
Blade He a25991e2bb 1. Set TOR reported name priority
2. Optimize investment mapping logic
2024-12-06 09:54:43 -06:00
Blade He 95c386911c Clean fund name after getting response from ChatGPT 2024-12-04 22:08:09 -06:00
Blade He 70362b554f Fix issue for "The last fund name of previous PDF page" logic:
If current page fund name starts with "The last fund name of previous PDF page" and with more contents below, then remove "The last fund name of previous PDF page".
2024-12-04 16:57:52 -06:00
Blade He 36fbaa946e Add the statement when transferring the last fund name of previous PDF page:
The last fund name of previous PDF page:
page_text = f"\nThe last fund name of previous PDF page: {previous_page_fund_name}\n{page_text}"
2024-12-03 11:50:31 -06:00
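The f-string in this commit can be wrapped in a small helper, sketched below. The function name `add_previous_fund_name` and the choice to leave the page untouched when no previous fund name exists are assumptions; only the f-string itself comes from the commit message.

```python
def add_previous_fund_name(page_text, previous_page_fund_name):
    """Prefix the page text with the last fund name seen on the previous page.

    Hypothetical helper: if there is no previous fund name, the page text
    is returned unchanged (an assumption of this sketch).
    """
    if not previous_page_fund_name:
        return page_text
    return (
        f"\nThe last fund name of previous PDF page: "
        f"{previous_page_fund_name}\n{page_text}"
    )
```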
Blade He a11a99fdc3 1. Optimize instructions: not to fetch the data with "up to" statement.
2. Add exception handler in function.
2024-12-03 11:27:28 -06:00
Blade He bc32860f87 remove_abundant_data 2024-12-02 17:16:56 -06:00
Blade He c146497052 optimize share feature judgment logic:
accumulation with capitalisation and institutional
income with distribution

Document: 337293427
2024-12-02 13:11:49 -06:00
Blade He 352886ade2 update instructions for TER, OGC, Performance Fees 2024-12-02 11:45:19 -06:00
Blade He 276ff93a1d Optimize drilldown algorithm
1. Share class names with currency
Reason: the currency in the document is not next to the share name.
Solution: if no relevant text can be found in the PDF page contents and the last word of the share class name is a currency, remove the currency from the share class name, then try again.
Result: after implementing this solution, recall goes from 95% to 96%.
2. Can't find relevant text in the current PDF page text
Reason: the text may come from the previous page.
Solution: merge the previous page text into the current page and search for the relevant value there.
Result: after implementing this solution, recall goes from 96% to 98%.
2024-11-26 16:35:07 -06:00
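The two retry steps in this commit can be sketched as follows. This is an illustration under assumptions, not the repository's code: the currency set is a small sample, matching is a plain case-insensitive substring test, and the names `strip_trailing_currency` and `find_share_class_page` are hypothetical.

```python
CURRENCIES = {"USD", "EUR", "GBP", "CHF", "JPY"}  # assumed sample, not the real list


def strip_trailing_currency(share_class_name):
    """Drop the last word of the name if it is a known currency code."""
    words = share_class_name.split()
    if len(words) > 1 and words[-1].upper() in CURRENCIES:
        return " ".join(words[:-1])
    return share_class_name


def find_share_class_page(share_class_name, pages, page_index):
    """Locate the share class on the current page, else on the previous one.

    On each page the full name is tried first, then the name without its
    trailing currency. Returns (page_index, matched_name) or None.
    """
    candidates = [share_class_name]
    stripped = strip_trailing_currency(share_class_name)
    if stripped != share_class_name:
        candidates.append(stripped)

    page_order = [page_index]
    if page_index > 0:
        page_order.append(page_index - 1)  # fall back to the previous page

    for idx in page_order:
        text = pages[idx].lower()
        for name in candidates:
            if name.lower() in text:
                return idx, name
    return None
```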
Blade He a09778d9d1 Create EMEA AR API code file.
Optimize annotation list for drilldown.
2024-11-26 11:24:29 -06:00
Blade He fb356fce76 1. optimize drilldown algorithm
2. support calculate drilldown recall metrics
2024-11-25 15:11:03 -06:00
Blade He 78fb283130 update python libraries 2024-11-25 11:11:02 -06:00
Blade He fc80093557 optimize investment mapping 2024-11-22 14:54:52 -06:00
Blade He f1c0290588 Optimize investment mapping algorithm.
1. Get the proper currency if multiple currencies exist in the share name, e.g. CHF EUR
2. The default currency should be based on the scenario: USD or EUR.
3. Removing special chars should be based on \W, instead of [^a-zA-Z0-9\s]
2024-11-21 16:36:58 -06:00
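A rough sketch of the three points in this commit. The commit does not say which currency is the "proper" one when several appear, so taking the last one mentioned is purely an assumption here; the currency tuple and the names `normalize_name` and `pick_currency` are hypothetical as well. The `\W` point is concrete: in Python 3, `\W` on a `str` pattern is Unicode-aware, so accented letters survive, whereas `[^a-zA-Z0-9\s]` would strip them.

```python
import re

KNOWN_CURRENCIES = ("USD", "EUR", "GBP", "CHF", "JPY")  # assumed sample list


def normalize_name(name):
    """Replace non-word characters with spaces and collapse whitespace.

    \\W keeps letters such as 'é', unlike [^a-zA-Z0-9\\s].
    """
    return re.sub(r"\s+", " ", re.sub(r"\W", " ", name)).strip()


def pick_currency(share_name, default_currency="USD"):
    """Return a currency for the share name, or the scenario default.

    Assumption: when several currencies appear (e.g. 'CHF EUR'),
    the last one mentioned is used.
    """
    tokens = normalize_name(share_name).upper().split()
    found = [token for token in tokens if token in KNOWN_CURRENCIES]
    return found[-1] if found else default_currency
```

The caller chooses `default_currency` per scenario, for example `pick_currency(name, default_currency="EUR")`.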
Blade He 5b9f9416de 1. Update for mapping multilingual share class names.
2. Optimize getting currency logic
2024-11-21 11:37:58 -06:00
Blade He 843bbbd13f dynamic loading instructions for multilingual. 2024-11-20 17:00:22 -06:00
Blade He 067d89e0f9 Add datapoint_reportedname.json for dynamic loading reported names based on document language. 2024-11-19 16:49:15 -06:00
Blade He 8223ca9a5c a little change 2024-11-18 16:13:24 -06:00