A Real Scenario Machine Learning Question

Recently I was given a real-world machine learning question for which no existing model is available. I am recording my thoughts here to save myself time later. Here is the question:

A company receives thousands of documents every day, uploaded by its users. Generally these documents are invoices or bills. We would like to extract the vendor and amount from these documents automatically (i.e. using software rather than human inspection).

The company stores the following pieces of information for each document:

- The PDF document uploaded by the user (please see example.pdf attached).
- The text extracted from that PDF (please see example.txt attached). Note: the extracted text is often not in an order that seems natural to a human reader.
- Labels of what the vendor and amount should be for each document (in the attached example, the vendor would be "Marketing Fuel Biz." and the amount would be "747.50").

Question: Describe a machine learning solution to this problem.

Addition: Some percentage of the stored labels may be incorrect. What would you change to mitigate this problem?
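To make the stored data and the label-noise concern concrete, here is a minimal Python sketch. It is not a solution to the question, only an illustration under assumptions: the `Document` class, the `amount_candidates` and `label_flags` helpers, and the sample text are all hypothetical (the real example.txt is not reproduced here). The idea is simply that a stored label that never appears in the extracted text is a reasonable candidate for review.

```python
import re
from dataclasses import dataclass


@dataclass
class Document:
    """Hypothetical record mirroring what the question says is stored."""
    text: str    # text extracted from the PDF (order may be unnatural)
    vendor: str  # stored vendor label
    amount: str  # stored amount label


def _normalize(s: str) -> str:
    # Lowercase and collapse all whitespace runs to a single space.
    return re.sub(r"\s+", " ", s).strip().lower()


def amount_candidates(text: str) -> set:
    # Numbers with exactly two decimals, e.g. 747.50 or 1,234.00;
    # thousands separators are removed before comparison.
    return {m.replace(",", "") for m in re.findall(r"\d[\d,]*\.\d{2}", text)}


def label_flags(doc: Document) -> dict:
    # Flag whether each stored label can be found in the extracted text.
    # A False value does not prove the label is wrong; it only marks the
    # document as worth a second look.
    return {
        "vendor_in_text": _normalize(doc.vendor) in _normalize(doc.text),
        "amount_in_text": doc.amount.replace(",", "") in amount_candidates(doc.text),
    }


# Made-up sample text, using the vendor and amount from the question.
sample = Document(
    text="Total due: $747.50\nMarketing  Fuel\nBiz.",
    vendor="Marketing Fuel Biz.",
    amount="747.50",
)
print(label_flags(sample))
# {'vendor_in_text': True, 'amount_in_text': True}
```

A check like this could be used to down-weight or set aside suspicious training examples, though exact string matching will miss vendors whose names are abbreviated or split differently in the extracted text.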
Abner Chou Jun 18, 2018