If the fund name on the current page starts with the last fund name of the previous PDF page and has more content after it, remove that leading previous-page fund name.
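The rule above can be sketched as follows. This is a minimal illustration, not the project's actual code: the function name `strip_previous_fund_prefix` is hypothetical, and "more content" is assumed to mean that some text remains after the prefix is removed.

```python
def strip_previous_fund_prefix(current_name: str, previous_last_name: str) -> str:
    """Drop a leading copy of the previous page's last fund name (hypothetical helper)."""
    current = current_name.strip()
    previous = previous_last_name.strip()
    if not previous or not current.startswith(previous):
        return current
    remainder = current[len(previous):].strip()
    # Only strip when something is left; otherwise the name is the fund itself.
    return remainder if remainder else current
```

For example, with a previous-page fund name of "Fund A", the input "Fund A Fund B Growth" becomes "Fund B Growth", while "Fund A" on its own is left unchanged.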
Share class names with currency
Reason
In the document, the currency is not next to the share class name.
Solution
If the relevant text cannot be found in the PDF page contents and the last word of the share class name is a currency, remove the currency from the share class name and try again.
After implementing this solution, recall improved from 95% to 96%.
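A minimal sketch of this retry, under stated assumptions: `CURRENCIES` is an illustrative subset, and `naive_search` is a hypothetical stand-in for whatever lookup the pipeline really uses.

```python
from typing import Callable, Optional

CURRENCIES = {"USD", "EUR", "GBP", "CHF", "JPY"}  # illustrative subset, not the real list


def strip_trailing_currency(share_name: str) -> str:
    """Remove the last word when it is a currency code."""
    words = share_name.split()
    if len(words) > 1 and words[-1].upper() in CURRENCIES:
        return " ".join(words[:-1])
    return share_name


def naive_search(name: str, page_text: str) -> Optional[str]:
    """Hypothetical stand-in for the real page lookup: plain substring match."""
    return name if name in page_text else None


def find_with_currency_fallback(
    share_name: str,
    page_text: str,
    search: Callable[[str, str], Optional[str]] = naive_search,
) -> Optional[str]:
    """Search with the full name first, then retry without the trailing currency."""
    result = search(share_name, page_text)
    if result is not None:
        return result
    stripped = strip_trailing_currency(share_name)
    if stripped != share_name:
        return search(stripped, page_text)
    return None
```

The retry only runs when the first search fails, so names that already match are not affected.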
Can't find relevant text from current PDF page text
Reason
The relevant text may be on the previous page, so the previous page text should be merged into the current page text.
Solution
Get the previous page text and search it for the relevant value.
After implementing this solution, recall improved from 96% to 98%.
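The previous-page fallback might look like the sketch below. The names `search_with_previous_page` and `find_ongoing_charge`, and the "Ongoing charges" pattern, are assumptions made for illustration only.

```python
import re
from typing import Callable, List, Optional


def find_ongoing_charge(text: str) -> Optional[str]:
    """Hypothetical search: a percentage value following an 'Ongoing charges' label."""
    match = re.search(r"Ongoing charges\s+([\d.]+)%", text)
    return match.group(1) if match else None


def search_with_previous_page(
    pages: List[str],
    page_index: int,
    search: Callable[[str], Optional[str]],
) -> Optional[str]:
    """Search the current page; on failure, retry on previous + current page text."""
    current = pages[page_index]
    result = search(current)
    if result is not None or page_index == 0:
        return result
    merged = pages[page_index - 1] + "\n" + current
    return search(merged)
```

Merging only on failure keeps the common case cheap and avoids picking up values that belong to the previous page when the current page already has a match.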
1. Get the proper currency if multiple currencies exist in the share name, e.g. CHF EUR.
2. The default currency should be based on the scenario: USD or EUR.
3. Removing special characters should be based on \W instead of [^a-zA-Z0-9\s].
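To illustrate item 3, the sketch below compares the two patterns using Python's `re` module (an assumption, since the implementation language is not stated). In Python, `\W` on a `str` keeps Unicode letters, digits and underscores, whereas `[^a-zA-Z0-9\s]` also strips accented letters. The function names are hypothetical.

```python
import re


def clean_ascii(text: str) -> str:
    """Old behaviour: anything outside ASCII letters, digits and whitespace becomes a space."""
    return re.sub(r"\s+", " ", re.sub(r"[^a-zA-Z0-9\s]", " ", text)).strip()


def clean_word(text: str) -> str:
    """New behaviour: only non-word characters (\\W) become a space."""
    return re.sub(r"\s+", " ", re.sub(r"\W", " ", text)).strip()


name = "Cr\u00e9dit Fund (A) - EUR"
print(clean_ascii(name))  # the accented letter is lost and the word is split
print(clean_word(name))   # the accented letter is preserved
```

Note that `\W` also matches whitespace, so both helpers collapse runs of spaces afterwards.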
Examples from document 532422720
M&G European Credit Investment Fund A CHFH Acc -> M&G European Credit Investment Fund A CHF H Accumulation
M&G European Credit Investment Fund A CHFHInc -> M&G European Credit Investment Fund A CHF H Income
M&G European High Yield Credit Investment Fund E GBPHedgedAcc -> M&G European High Yield Credit Investment Fund E GBP Hedged Accumulation
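One way to produce the three rewrites above is sketched below. This is an assumption about the approach, not the actual implementation: the currency list and the Acc/Inc expansion table are illustrative, and `normalize_share_name` is a hypothetical name.

```python
import re

CURRENCIES = ("CHF", "EUR", "GBP", "USD")  # illustrative subset
SUFFIXES = {"Acc": "Accumulation", "Inc": "Income"}


def normalize_share_name(name: str) -> str:
    """Split fused currency and Acc/Inc tokens, then expand the abbreviations."""
    currency = "|".join(CURRENCIES)
    # Separate a currency code from letters fused to its right: CHFH -> CHF H
    name = re.sub(rf"\b({currency})(?=[A-Za-z])", r"\1 ", name)
    # Separate a trailing Acc/Inc fused to the previous word: HedgedAcc -> Hedged Acc
    name = re.sub(r"(?<=[A-Za-z])(Acc|Inc)\b", r" \1", name)
    # Expand the standalone abbreviations
    name = re.sub(r"\b(Acc|Inc)\b", lambda m: SUFFIXES[m.group(1)], name)
    return re.sub(r"\s+", " ", name).strip()
```

The matching is case-sensitive on purpose, so "European" is not mistaken for the currency EUR.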
Some share class names contain multiple short names, e.g.
CPR Invest Global Disruptive Opportunities Class I sw EUR - Acc
The short names are I and sw.
The purpose is to support getting all short names from a share class name.
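A rough sketch of how all short names could be collected, assuming the fund name is known and that every token which is not a currency or a generic share-class word counts as a short name. `STOP_WORDS`, `CURRENCIES` and `get_short_names` are hypothetical.

```python
import re
from typing import List

STOP_WORDS = {"class", "acc", "inc", "accumulation", "income", "hedged"}  # assumed list
CURRENCIES = {"USD", "EUR", "GBP", "CHF"}  # illustrative subset


def get_short_names(share_name: str, fund_name: str) -> List[str]:
    """Return every remaining token after removing the fund name, currencies and generic words."""
    rest = share_name[len(fund_name):] if share_name.startswith(fund_name) else share_name
    tokens = re.sub(r"\W", " ", rest).split()
    return [
        token
        for token in tokens
        if token.lower() not in STOP_WORDS and token.upper() not in CURRENCIES
    ]
```

For the example above, with the fund name "CPR Invest Global Disruptive Opportunities", this yields the two short names I and sw.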
For multiple currencies in a fund/share name, if USD exists, remove it.
Fix the issue with splitting words that are not separated by a space.
If there is no currency in the share class name, try to get the currency from a document mapping entry with the same fund name and the same short share class name.
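The USD rule and the mapping fallback can be sketched together. The shape of the document mapping (a list of dicts with `fund_name`, `short_name` and `currency` keys) is an assumption, as are the function names and the currency subset.

```python
import re
from typing import Dict, List, Optional

CURRENCY_PATTERN = re.compile(r"\b(USD|EUR|GBP|CHF)\b")  # illustrative subset


def candidate_currencies(name: str) -> List[str]:
    """Find currency codes in a name; drop USD when other currencies are present."""
    found = CURRENCY_PATTERN.findall(name)
    if len(found) > 1 and "USD" in found:
        found = [code for code in found if code != "USD"]
    return found


def currency_from_mapping(
    fund_name: str,
    short_name: str,
    mapping: List[Dict[str, str]],
) -> Optional[str]:
    """Look up a currency from a mapping row with the same fund and short share class name."""
    for row in mapping:
        if (
            row.get("fund_name") == fund_name
            and row.get("short_name") == short_name
            and row.get("currency")
        ):
            return row["currency"]
    return None
```

A caller would try `candidate_currencies` on the share class name first and fall back to `currency_from_mapping` only when it returns an empty list.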