For a physical bill to be turned into searchable metadata, the digital image of the bill needs processing before data can be extracted from it with OCR. (Image: Ian Hutchinson/IF).
The physical world is a complicated place
The first step in getting a computer to understand a physical document is to turn it into something digital. We thought about physical scanners, mechanical turks and even redirecting your post to the Co-op. But we decided to make use of pocket technology: the smartphone camera, which people can use to import bills into Paperfree. We looked into how phone “scanning” apps like Scanner Pro and Evernote work.
It’s tricky to get the quality of images that a computer needs because lots of people find it hard to take photos that are in focus, bright enough and high in contrast.
Something we used in development was the rectangle detection feature of iOS, which can find sheets of paper even if the camera is held at an angle. It corrects the perspective in the image and removes the background, returning a flat image of the bill. This takes the pressure off the person scanning the bills, because it allows for the fact that they won’t always hold the camera perfectly. It also makes the interface more fluid, which is important in the repetitive process of scanning documents.
Importing bills into Paperfree with a smartphone camera, referencing techniques found in “scanner” apps. (Image: Iona Wolff/IF).
Using open tools to interpret text
Now that the paper bills are in a format a computer can understand, we can work on interpreting the text on the bills. Computers do this through optical character recognition, or OCR. We used a library – reusable code that adds specific functionality to a program – called Tesseract.
Tesseract is picky about the kind of images that are fed into it, so we needed to do image processing on the photos our scanning app produced. We used OpenCV for this, an open source tool with image processing features.
Image thresholding turns the image into fully black or fully white pixels and helps the OCR library read text more accurately by removing shadows. Expanding text into blobs and finding them helps split up a bill and improves results by giving OCR smaller bits of data to work with. It also helps us identify the parts of the bill we want to lift values from.
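To make those steps concrete, here is a minimal sketch of thresholding, expanding text into blobs and finding them. It is written in plain Python on a tiny made-up image so it runs anywhere. It is not the prototype's code and it does not use OpenCV's API; OpenCV provides its own, much faster, versions of these operations. The function names, the fixed cutoff of 128 and the sample pixel values are all assumptions made for illustration.

```python
from collections import deque

def threshold(gray, cutoff=128):
    """Return a binary image: 1 for dark "ink" pixels, 0 for everything else."""
    return [[1 if px < cutoff else 0 for px in row] for row in gray]

def dilate(binary, radius=1):
    """Grow every ink pixel by `radius` pixels so nearby letters merge into one blob."""
    h, w = len(binary), len(binary[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if binary[y][x]:
                for ny in range(max(0, y - radius), min(h, y + radius + 1)):
                    for nx in range(max(0, x - radius), min(w, x + radius + 1)):
                        out[ny][nx] = 1
    return out

def find_blobs(binary):
    """Return bounding boxes (left, top, right, bottom) of 4-connected blobs."""
    h, w = len(binary), len(binary[0])
    seen = [[False] * w for _ in range(h)]
    boxes = []
    for y in range(h):
        for x in range(w):
            if binary[y][x] and not seen[y][x]:
                seen[y][x] = True
                queue = deque([(x, y)])
                left, top, right, bottom = x, y, x, y
                while queue:
                    cx, cy = queue.popleft()
                    left, right = min(left, cx), max(right, cx)
                    top, bottom = min(top, cy), max(bottom, cy)
                    for nx, ny in ((cx + 1, cy), (cx - 1, cy), (cx, cy + 1), (cx, cy - 1)):
                        if 0 <= nx < w and 0 <= ny < h and binary[ny][nx] and not seen[ny][nx]:
                            seen[ny][nx] = True
                            queue.append((nx, ny))
                boxes.append((left, top, right, bottom))
    return boxes

# Made-up 13x3 grayscale image: four "letters" on one line, with a shadow underneath.
LIGHT, SHADOW, INK = 230, 170, 20
gray = [
    [LIGHT] * 13,
    [LIGHT, INK, LIGHT, INK, LIGHT, LIGHT, LIGHT, LIGHT, LIGHT, INK, LIGHT, INK, LIGHT],
    [SHADOW] * 13,
]

binary = threshold(gray)             # the shadow row becomes plain background
letters = find_blobs(binary)         # four separate letters
words = find_blobs(dilate(binary))   # letters merged into two word-sized blobs
```

Each bounding box in `words` could then be cropped out and passed to OCR on its own, which is the "smaller bits of data" idea described above. A fixed cutoff is the simplest choice; uneven lighting usually calls for an adaptive one.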
These open source tools let us quickly get to grips with the practicalities of interpreting bills. We could work with them in Python, so it was easy to integrate them into the Django web framework this prototype is built on. We ran the app with Vagrant on our own computers, so we could also stay in control of where the data we were using with these libraries was going.
To improve the quality of OCR results, we used OpenCV to adjust the contrast of photos and identify areas on the bill. (Image: Ian Hutchinson/IF).
Bringing your own bills
To test our work on interpreting bills, we needed to give our code something to look at. We started off by creating a dummy bill. We fed our prototype an image of this, and compared what the OCR thought the bill said with our test values.
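A check like that can be sketched in a few lines of Python. This is an illustration rather than the prototype's test code: the field names, the dummy values and the `compare_fields` helper are all hypothetical, and the OCR result is typed in by hand instead of coming from Tesseract.

```python
def normalise(text):
    """Lower-case and collapse whitespace so layout differences are ignored."""
    return " ".join(text.lower().split())

def compare_fields(expected, extracted):
    """Return {field: (expected, got)} for every field that does not match."""
    mismatches = {}
    for field, want in expected.items():
        got = extracted.get(field)
        if got is None or normalise(got) != normalise(want):
            mismatches[field] = (want, got)
    return mismatches

# Values we put on the dummy bill (made up for this example).
expected = {
    "supplier": "Example Energy",
    "total due": "£42.50",
    "due date": "1 March 2017",
}

# What OCR might hand back: odd spacing and case, and a letter O read in place of a zero.
ocr_result = {
    "supplier": "example  energy",
    "total due": "£42.5O",
    "due date": "1 March 2017",
}

mismatches = compare_fields(expected, ocr_result)
```

Here only the total is reported as a mismatch, because spacing and capitalisation are normalised away before comparing. How forgiving the comparison should be is a design choice that depends on what the extracted values are used for.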
That worked well but, in reality, bills are complicated documents. They use different fonts to one another. They use different phrases (“amount due” instead of “total due”, for instance). They use colours to distinguish different sections on the bill.
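One way to cope with different wording is to match a list of known labels instead of a single phrase. The sketch below does this with a regular expression from Python's standard library. The list of labels, the assumption that amounts are in pounds with two decimal places, and the `find_total` helper are ours for illustration, not something taken from the prototype.

```python
import re

# Hypothetical label variants; a real list would grow as more bills are seen.
TOTAL_LABELS = ("total due", "amount due")

# A known label, an optional colon, an optional pound sign, then an amount such as 1,204.00.
TOTAL_PATTERN = re.compile(
    r"(?:" + "|".join(re.escape(label) for label in TOTAL_LABELS) + r")"
    r"\s*:?\s*£?\s*(\d+(?:,\d{3})*\.\d{2})",
    re.IGNORECASE,
)

def find_total(ocr_text):
    """Return the amount that follows a known label as a string, or None."""
    match = TOTAL_PATTERN.search(ocr_text)
    return match.group(1) if match else None

total_a = find_total("Total due: £42.50")
total_b = find_total("AMOUNT DUE  £1,204.00")
total_c = find_total("Balance brought forward £10.00")
```

The first two calls pick out the amount despite the different phrasing and capitalisation. The third returns `None` because the label is not in the list, which is safer than guessing at an unrelated figure.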
To make our code aware of these subtle variations, we needed to use some real bills. So we started to use our own bills, and this prompted us to be more mindful of privacy in the development environment. We used open source tools that ran locally and we worked on our own computers, so we didn’t have to worry about anyone else seeing our bills (some popular cloud-based tools send information across the web to third-party machines). For the same reason, we didn’t include any bill images in our GitHub commits.
We knew everything couldn’t stay local forever. We were keen to get our prototype up on a private server so we could show our work to others at Co-op Digital. This got us thinking beyond this exploration and into how Paperfree would work.
We’re looking at issues like how information people upload can be encrypted, how to deal with the variety of document types and how to manage backups. These are really knotty problems, and it’s great we’ve learned so much from this early build.
To find out where development is heading next, follow the Co-op Digital blog.