Email Images & PDF Data Extraction System using MODI (OCR)

Email Data Management system that filters emails based on predetermined parameters.
Brief Description:
The project I am working on is an intelligent system that filters out the emails a user needs and then moves the contents of each mail or its attachment into a database. To do this reliably, the software combines several techniques.
Tools and Techniques:
To filter mails based on predetermined parameters, I am using several tools and techniques so that the software gives the best possible results. The software is written in C# on .NET, and the database is MS SQL Server. Below I describe how the system works and the tools and techniques used in it.
•  The software is a web-based application that lets users have their mails filtered and the data they require saved in the DB.
•  The user needs an ID and password for the software. After logging in, the user gives the system his/her email ID and password so that it can filter the emails in that account.
•  The software first logs in to the user's email account with the ID and password that the user provided.
•  After logging in to the account, the system scans all the mails in it. To respond quickly and work efficiently, the system uses threading, with threads serving different purposes. First, a background call is sent to the server while the client side shows that the system is working. Second, as mails continue to be received, the system creates a thread pool and assigns a thread to each mail so that all of them are handled at the same time.
•  While the mails are being processed in these threads, Natural Language Processing (NLP) is used to find the relevant files. Since the software is made especially to handle attachments and move their contents into the DB, it looks for mails with attachments. NLP is needed because the attached file, say a CV in PDF form, is written in natural (human) language, and the software has to understand that language to work with it. The CV is the file whose data the user actually wants stored in the DB, so the parameters for finding the relevant content are given beforehand. NLP uses a parsing algorithm to take strings from the document and match them against the fields given in the DB. For example, if a field in the DB is Name and the parser (also known as a syntax analyzer) finds the string N A M E in the CV, the pattern matches, and the file is considered a wanted file whose contents need to be saved in the DB.
•  To read the email attachment, OCR is used: it reads the file and produces the output that is to be saved in the DB. Three well-known OCR packages are ABBYY FineReader, Microsoft Office Document Imaging (MODI) and the Tesseract OCR engine. Here I am using MODI to read the file.
•  If the file is in TIFF format, MODI begins reading it directly. If it is in some other format, the system first converts it to TIFF, because MODI reads TIFF files. An image file is converted to TIFF with an image converter. A PDF file is converted to TIFF with the Ghostscript library. There are other libraries for this conversion, such as ImageMagick, which is also very popular, but after trying both with my system I found that Ghostscript was faster and used less memory than ImageMagick.
•  Once the file has been converted to TIFF, it is sent to MODI to be read.
•  When MODI has read the file, it sends its output back to the software. The output that OCR produces is usually not in the same order as the original file: MODI recognizes whole words correctly, but their sequence is disturbed. For example, for a CV the output may give Name as the first word and Cell No as the second, although Name should be followed by the person's actual name. To put the contents back in the right order, the software uses a sorting algorithm and regular expressions (regex or regexp).
•  The sorting algorithm puts the contents of the file in the proper sequence. For example, the original sequence in the CV is something like Name, ABC and then on the next line Address, XYZ, and so on. I tried two sorting algorithms, Quick Sort and Heap Sort, one after the other and compared the results. Both sort in place. Quick Sort was faster in my tests, although its worst-case running time is O(n^2) (its average case is O(n*log(n))). Heap Sort was considerably slower, even though its worst-case running time is O(n*log(n)). So this software uses Quick Sort for the fastest results.
•  With Quick Sort, the software gets the right sequence for the file, such as Name, ABC, Address, XYZ. The words are now in the right order, but the output of sorting is a single line, not proper rows and columns. To build a complete file with each word on its proper line, the system then uses regular expressions, which are used for string search and pattern matching. The regex takes a string from the DB and matches it against the contents that Quick Sort produced. In this example, it takes the string Name from the DB and looks for it in the file. Since Name is in the file, the regex takes Name and treats whatever follows as the person's name until it finds the next matching string. If the next string in the DB is Address, then when the regex finds Address in the file it starts a second line with the word Address, and all the words before Address are considered the name. This is how the properly structured output file is built.
•  The software then retrieves the contents of the file and saves them in the DB. This retrieval of content is known as Information Retrieval (IR).
•  The software then repeats the same procedure for the other mails that have attachments.
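
The ordering and pattern-matching steps above can be sketched roughly as follows. This is a minimal illustration in Python, not the C# code of the actual system. It assumes that each OCR word comes with a top and left coordinate and that the field labels (Name, Address, Cell No) have already been read from the DB; the function names reading_order, is_relevant and extract_fields, the line_height band and the min_hits threshold are all hypothetical. Python's built-in sort stands in for the Quick Sort mentioned above.

```python
import re


def reading_order(words, line_height=20):
    """Sort (top, left, text) tuples top-to-bottom, then left-to-right.

    Assumption: words whose `top` falls in the same `line_height` band
    belong to the same line of the page.
    """
    ordered = sorted(words, key=lambda w: (w[0] // line_height, w[1]))
    return [text for _top, _left, text in ordered]


def is_relevant(text, labels, min_hits=2):
    """Treat a document as wanted if enough DB field labels occur in it."""
    hits = sum(
        1 for label in labels
        if re.search(r"\b" + re.escape(label) + r"\b", text, re.IGNORECASE)
    )
    return hits >= min_hits


def extract_fields(text, labels):
    """Split one-line OCR text into {label: value} at each known label.

    Everything after a label, up to the next label, is taken as its value.
    """
    # Longest labels first, so a long label wins over a shorter overlapping one.
    alternation = "|".join(
        re.escape(label) for label in sorted(labels, key=len, reverse=True)
    )
    pattern = re.compile(r"\b(" + alternation + r")\b\s*:?\s*", re.IGNORECASE)
    canonical = {label.lower(): label for label in labels}

    matches = list(pattern.finditer(text))
    fields = {}
    for i, match in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        fields[canonical[match.group(1).lower()]] = text[match.end():end].strip()
    return fields
```

For example, passing the words Address, Name, XYZ, ABC with suitable coordinates to reading_order and joining the result gives a single line such as "Name ABC Address XYZ", which extract_fields then splits into a Name value and an Address value ready to be written to the matching DB columns.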

Software Development Kit (SDK):
The tools and techniques used in this software are:
1.      NLP
2.      Parser/Syntax Analyzer
3.      Ghostscript
4.      MODI OCR
5.      Quick Sort
6.      Regular Expression (Regex or Regexp)
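
To show how the conversion and threading pieces listed above might fit together, here is a small sketch, again in Python rather than the C# of the real system. It only plans the work: it builds the Ghostscript command line for a PDF but does not run it, and it does not call MODI. The names ghostscript_command, plan_conversion and plan_all, the list of image extensions, the 300 dpi resolution and the choice of the tiffg4 output device are assumptions made for illustration; on Windows the Ghostscript console executable is usually gswin64c rather than gs.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

# Assumed set of image types that go through the image converter.
IMAGE_TYPES = {".png", ".jpg", ".jpeg", ".bmp", ".gif"}


def ghostscript_command(pdf_path, tiff_path, dpi=300, gs_exe="gs"):
    """Build (not run) a Ghostscript command that renders a PDF to TIFF."""
    return [
        gs_exe,
        "-dNOPAUSE", "-dBATCH", "-dSAFER",
        "-sDEVICE=tiffg4",            # black-and-white, Group 4 compressed TIFF
        f"-r{dpi}",                   # rendering resolution in dpi
        f"-sOutputFile={tiff_path}",
        str(pdf_path),
    ]


def plan_conversion(attachment):
    """Decide how one attachment becomes a TIFF for the OCR step."""
    path = Path(attachment)
    suffix = path.suffix.lower()
    target = path.with_suffix(".tiff")
    if suffix in {".tif", ".tiff"}:
        return ("ocr_directly", path)
    if suffix == ".pdf":
        return ("ghostscript", ghostscript_command(path, target))
    if suffix in IMAGE_TYPES:
        return ("image_converter", target)
    return ("skip", path)


def plan_all(attachments, max_workers=4):
    """Handle each attachment on a pool thread; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(plan_conversion, attachments))
```

In a real run, the command list returned for a PDF would be passed to a process launcher, and the resulting TIFF would then be handed to the OCR engine.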
