Making your own document scanner in 40 lines of code

Onyeka Okonji
4 min read · Nov 25, 2022


One of the benefits of being proficient in Machine Learning is understanding the algorithms behind many of the wonderful features we see on our devices. When Apple released iOS 16, one of the new functionalities was the ability to use the default Notes app as a digital scanner: think of it as a “scanner in your palm”, to borrow a phrase from the legendary Steve Jobs. Before it was introduced, I had to scan documents with my phone using apps downloaded from the App Store, some paid and some free, and many of the free ones stamp a watermark on the output, which somewhat defeats the purpose unless you subscribe to a paid version. Having worked on a number of computer vision projects, I wondered: is there a computer vision library or ML algorithm one could use to replicate what my phone does? Answering that question is the purpose of this article.

In this article, we will be using a very popular library familiar to most MLEs working in deep learning, particularly computer vision: OpenCV. By the end of this article, you should have a good understanding of the steps required to create your own digital scanner.

Step 1: Importing the libraries and the test image

import cv2
import matplotlib.pyplot as plt
import numpy as np

# load the test image and keep a copy so the original stays untouched
img = cv2.imread('b_o_a_t.jpeg')
img_copy = img.copy()

# OpenCV loads images as BGR by default; convert the copy to RGB for display
img_copy = cv2.cvtColor(img_copy, cv2.COLOR_BGR2RGB)

What I have done above should be familiar to any MLE who has done some basic work in computer vision. Concretely, I imported OpenCV, a powerful computer vision library, along with the other libraries we will need: NumPy, which we will use for numerical computations, and Matplotlib, which we will use to visualize our images before and after processing. Because OpenCV loads images in BGR format by default, the last line in the code block converts the image to the RGB format we will be working with. The loaded image can be seen here:

Credit goes to Andrew Neel at Unsplash: https://unsplash.com/photos/-FVaZbu6ZAE

Step 2: Extract corner coordinates of the object of interest

In order to extract the object of interest from the image, we need to obtain the xy-coordinates of its corners, in this case the four corners of the book. These coordinates will be needed when we perform what’s called a perspective transform. This is where the Matplotlib library comes in handy: when you view an image using the matplotlib.pyplot module, you can read the corner coordinates of the displayed image straight off the plot. In this case, we will be taking these coordinates in an anti-clockwise direction.

In addition to obtaining the coordinates, we will need to calculate the dimensions of the object, in this case its width and height. To do this, we calculate the Euclidean distance between each corner and its neighbours along the vertical and horizontal axes.

pointA = [1588, 892]
pointB = [1577, 2696]
pointC = [4026, 2707]
pointD = [3982, 914]

# calculate the width and height of the object in question
width_AD = np.sqrt(((pointA[0] - pointD[0]) ** 2) + ((pointA[1] - pointD[1]) ** 2))
width_BC = np.sqrt(((pointB[0] - pointC[0]) ** 2) + ((pointB[1] - pointC[1]) ** 2))
maxWidth = max(int(width_AD), int(width_BC))

height_AB = np.sqrt(((pointA[0] - pointB[0]) ** 2) + ((pointA[1] - pointB[1]) ** 2))
height_CD = np.sqrt(((pointC[0] - pointD[0]) ** 2) + ((pointC[1] - pointD[1]) ** 2))
maxHeight = max(int(height_AB), int(height_CD))

In the code block above, points A, B, C and D represent the xy-coordinates of the book’s corners, starting from the top left and moving anti-clockwise. Using these figures we can then invoke NumPy to calculate the Euclidean distances, and from them the maximum width and height of the object; these will be fed into OpenCV in the final stage of the process.
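As an aside, the same distances can be computed more compactly with `np.linalg.norm`, which returns exactly the Euclidean length of a difference vector. A quick check that both forms agree, using the corner values from the code block above:

```python
import numpy as np

pointA = np.array([1588, 892])
pointB = np.array([1577, 2696])
pointD = np.array([3982, 914])

# the norm of the difference vector is the Euclidean distance
width_AD = np.linalg.norm(pointA - pointD)
height_AB = np.linalg.norm(pointA - pointB)

# matches the explicit sqrt-of-squared-differences form used above
assert np.isclose(width_AD, np.sqrt((1588 - 3982) ** 2 + (892 - 914) ** 2))
assert np.isclose(height_AB, np.sqrt((1588 - 1577) ** 2 + (892 - 2696) ** 2))
```

Either form works; the explicit version in the article makes the formula visible, while `np.linalg.norm` is less error-prone once you are comfortable with it.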

Step 3: Performing a Perspective transform using OpenCV

This step comprises the final part of the process. We feed both the original xy-coordinates obtained above and the new dimensions, derived from the maximum height and width, as NumPy arrays to OpenCV’s cv2.getPerspectiveTransform method. This returns a perspective matrix representing a new “perspective” on the original image. That matrix is then passed to OpenCV’s cv2.warpPerspective method, which warps, or “transforms”, the input image from its original view to a new view containing only the object of interest. The output can be found below.

# the four corners from Step 2, anti-clockwise from the top left
input_points = np.float32([pointA, pointB, pointC, pointD])

# where those corners should land in the output image, in the same order
output_points = np.float32([[0, 0],
                            [0, maxHeight - 1],
                            [maxWidth - 1, maxHeight - 1],
                            [maxWidth - 1, 0]])

perspective_matrix = cv2.getPerspectiveTransform(src=input_points, dst=output_points)
new_img = cv2.warpPerspective(img, perspective_matrix, (maxWidth, maxHeight), flags=cv2.INTER_LINEAR)
Input image transformed with focus on the object of interest

And that’s it: the new image can be saved for future use. In this short article, we’ve built a digital document scanner from scratch in just a few lines of code. The full repo, including the few lines I left out, is available here.

I hope you found this useful. If you have any questions or comments, please drop a message, or reach me on LinkedIn, and if you loved it, please give it 50 claps 😀


Onyeka Okonji

Machine Learning Engineer passionate about Computer Vision