Recently I was given a real-world machine learning question for which there is no existing model to reuse. I would like to record my thoughts here, which may save me time later.
Here is the question:
A company receives thousands of documents every day uploaded by our users. Generally these documents are invoices or bills. We would like to extract the vendor and amount from these documents automatically (i.e. using software rather than human inspection).
They store the following pieces of information for each document:
The pdf document uploaded by the user (please see example.pdf attached)
The text extracted from that pdf (please see example.txt attached - Note: often the extracted text would not be in an order that seems natural to a human reader)
Labels of what the vendor and amount should be for each document (in the attached example, vendor would be “Marketing Fuel Biz.”, and amount would be “747.50”)
Question: Describe a machine learning solution to this problem.
Addition: Some percentage of the stored labels may be incorrect. What would you change to mitigate this problem?
The sample PDF and the OCR output text file are available for download.