Title:
Email Data Management System that filters emails based on predetermined parameters.
Brief Description:
The project I am working on is an intelligent system that picks out the emails the user needs and then moves the contents of each mail or its attachment into a database. To carry out these tasks reliably, the software combines several techniques.
Tools and Techniques:
To filter mails based on predetermined parameters, I am using a combination of tools and techniques. The software is written in C# on .NET, and the database is MS SQL Server. Below I describe how the system works and the tools and techniques used at each step.
Ø This software is a web-based application that lets users have their mails filtered and the data they require saved in the DB.
Ø The user needs an ID and password to log in to the software. After logging in, the user provides his/her email ID and password so that the system can filter the emails in that account.
Ø The software first signs in to the user's email account with the ID and password that the user provided.
Ø After logging in to the account, the system scans all the mails in it. To respond quickly and work efficiently, the system uses threading. Multiple threads are created for different purposes. First, one thread serves the client side: it shows the user that the system is working and sends background calls to the server. Second, as mails keep arriving, the system creates a thread pool and assigns a thread to each mail so that many mails are handled at the same time.
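The thread-pool idea can be sketched as follows. This is a minimal illustration in Python rather than the project's C#. The mail structure, the `process_mail` placeholder and the worker count are assumptions made for the example, not details of the actual system.

```python
from concurrent.futures import ThreadPoolExecutor


def process_mail(mail):
    """Placeholder for the per-mail work (scan, OCR, store).

    Hypothetical rule for this sketch: report whether the mail has attachments.
    """
    return (mail["id"], len(mail["attachments"]) > 0)


def process_inbox(mails, max_workers=4):
    """Hand each mail to a pooled worker thread and collect the results."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in input order even though
        # the mails are processed concurrently.
        return list(pool.map(process_mail, mails))


# Made-up inbox used only to exercise the sketch.
inbox = [
    {"id": 1, "attachments": ["cv.pdf"]},
    {"id": 2, "attachments": []},
    {"id": 3, "attachments": ["scan.tiff"]},
]
results = process_inbox(inbox)
```

A fixed-size pool keeps the number of threads bounded no matter how many mails arrive, which is the usual reason to prefer a pool over starting one new thread per mail.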
Ø Within this threaded process, Natural Language Processing (NLP) is used to find the relevant files in the mails. Since this software is made especially to handle attachments and move their contents into the DB, it looks for mails that have attachments. NLP is needed here because of human-computer interaction: the attachments are written for people, not for computers. Suppose the attachment is a CV in PDF form. Its text is in natural (human) language, so the software needs NLP to interpret it. The CV is the file whose data the user actually wants stored in the DB, so the parameters for finding the relevant content are given beforehand. NLP uses a parsing algorithm to take strings from the document and match them against the fields defined in the DB. For example, if the DB has a field called Name and the parser (also known as a syntax analyzer) finds the string N A M E in the CV, the pattern matches, and the file is treated as a wanted file whose contents need to be saved in the DB.
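The matching step can be illustrated with a small sketch, written in Python rather than the project's C#. It is a simplification: a plain whole-word search stands in for the parser, and the field list, the sample texts and the `min_matches` threshold are assumptions made for the example.

```python
import re


def matched_fields(text, db_fields):
    """Return the DB field names that occur as whole words in the text."""
    found = []
    for field in db_fields:
        # \b stops "Name" from matching inside "Username"; case is ignored.
        if re.search(r"\b" + re.escape(field) + r"\b", text, re.IGNORECASE):
            found.append(field)
    return found


def is_wanted(text, db_fields, min_matches=2):
    """Treat a document as wanted when enough DB fields appear in it."""
    return len(matched_fields(text, db_fields)) >= min_matches


# Hypothetical DB fields and extracted texts, for illustration only.
fields = ["Name", "Address", "Cell No"]
cv_text = "NAME ABC ADDRESS XYZ CELL NO 12345"
memo_text = "Meeting moved to Friday. Username reset attached."
```

Requiring more than one matching field is a design choice for this sketch: a single common word such as Name could appear in almost any document.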
Ø To read the attachment of an email, OCR is used: it reads the file and produces the text output that is to be saved in the DB. Three well-known OCR options are ABBYY FineReader, Microsoft Office Document Imaging (MODI), and the Tesseract OCR engine. I am using MODI to read the files.
Ø MODI reads TIFF files. If the file is already in TIFF format, MODI begins reading it right away; if it is in some other format, the system first converts it to TIFF. If the file is an image, an image converter is used to convert it to TIFF. If the file is a PDF, the software uses the Ghostscript library to convert it to TIFF. There are other libraries for this conversion as well, such as ImageMagick, which is also very popular, but after trying both with my system I found that Ghostscript was faster and used less memory than ImageMagick.
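The format decision described here can be sketched as a small dispatch function, shown in Python rather than the project's C#. The function names, the list of image extensions, the 300 dpi resolution and the choice of the `tiffg4` output device are assumptions for illustration. The sketch only builds the Ghostscript command line and does not run it.

```python
from pathlib import Path

# Assumed set of image types handled by the image converter.
IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg", ".bmp", ".gif"}


def ghostscript_command(pdf_path, tiff_path, dpi=300):
    """Build (but do not run) a Ghostscript command that renders a PDF to TIFF."""
    return [
        "gs",                       # executable name differs per platform
        "-dNOPAUSE", "-dBATCH",     # run without prompts and exit when done
        "-sDEVICE=tiffg4",          # black-and-white Group 4 TIFF output
        f"-r{dpi}",                 # rendering resolution in dpi
        f"-sOutputFile={tiff_path}",
        str(pdf_path),
    ]


def conversion_plan(path):
    """Decide how a file reaches TIFF: (converter name, command or None)."""
    ext = Path(path).suffix.lower()
    if ext in {".tif", ".tiff"}:
        return ("none", None)                    # already readable by the OCR step
    if ext == ".pdf":
        out = str(Path(path).with_suffix(".tiff"))
        return ("ghostscript", ghostscript_command(path, out))
    if ext in IMAGE_EXTENSIONS:
        return ("image-converter", None)         # handled by a separate image converter
    return ("unsupported", None)
```

For example, `conversion_plan("cv.pdf")` returns the Ghostscript command as a list of arguments, which could then be passed to a process launcher.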
Ø Once the file has been converted to TIFF, it is sent to MODI to be read.
Ø When MODI has read the file, it sends its output back to the software. The output that OCR produces is usually not in the same order as the original file. MODI recognizes whole words correctly, but their sequence is disturbed in the output. For example, when a CV is read by MODI, the output may give Name as the first word and Cell No as the second, although in the original Name is followed by the person's actual name. To restore the right order of the contents, the software uses a sorting algorithm and regular expressions (regex or regexp).
Ø A sorting algorithm is used to put the contents of the file back into their proper sequence. For example, the original sequence in the CV is something like Name, ABC, then on the next line Address, XYZ, and so on. I tried two sorting algorithms, one after another, to see which gives the best results: Quick Sort and Heap Sort. Both sort in place. In my comparison, Quick Sort was faster, although its worst-case running time is O(n^2) (its average case is O(n*log(n))). Heap Sort was noticeably slower, even though its worst-case running time is O(n*log(n)). So this software uses Quick Sort for the fastest results.
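A minimal in-place Quick Sort is sketched below, in Python rather than the project's C#. The text does not say what the words are sorted by, so the sketch assumes that each recognized word carries a (line, column) position and sorts on that. The tuple layout and the sample words are hypothetical.

```python
def quick_sort(items, key, lo=0, hi=None):
    """Sort items in place by key(item) using Quick Sort (Lomuto partition)."""
    if hi is None:
        hi = len(items) - 1
    if lo >= hi:
        return
    pivot = key(items[hi])          # last element of the range is the pivot
    i = lo
    for j in range(lo, hi):
        if key(items[j]) < pivot:   # move smaller items to the left part
            items[i], items[j] = items[j], items[i]
            i += 1
    items[i], items[hi] = items[hi], items[i]   # put the pivot in its final place
    quick_sort(items, key, lo, i - 1)
    quick_sort(items, key, i + 1, hi)


# Assumed OCR output: (line, column, text) in a disturbed order.
words = [(0, 0, "Name"), (1, 0, "Address"), (0, 1, "ABC"), (1, 1, "XYZ")]
quick_sort(words, key=lambda w: (w[0], w[1]))
ordered = [w[2] for w in words]
```

With these sample words, `ordered` becomes Name, ABC, Address, XYZ, which is the sequence used as the example in the text.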
Ø With Quick Sort, the software gets the right sequence of words, such as Name, ABC, Address, XYZ, and so on. The file still has to be organized further, though, because the sorted output is a single line rather than proper rows and columns. To build a complete file with the words on their proper lines, the system then uses regular expressions (regex or regexp), which are used for string searching and pattern matching. The regex takes a string from the DB and matches it against the contents that Quick Sort has provided. In this example, it takes the string Name from the DB and finds it in the file. Whatever comes after Name is treated as the person's name until the next DB string is matched. If the next string in the DB is Address, then when the regex finds Address in the file it starts a second line beginning with the word Address, and all the words before Address are treated as the name. This is how the complete, properly laid out output file is built.
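The label-and-value splitting can be sketched with a single regular expression built from the DB field names. The example is in Python rather than the project's C#. The function name, the field list and the sample line are assumptions for illustration.

```python
import re


def split_into_fields(one_line_text, db_fields):
    """Map each DB field label to the text that follows it, up to the next label."""
    # Try longer labels first so a longer label wins over a shorter one.
    labels = sorted(db_fields, key=len, reverse=True)
    label_pattern = "|".join(re.escape(label) for label in labels)
    pattern = re.compile(
        # a label, then as little text as possible, stopping at the next label or the end
        rf"\b({label_pattern})\b\s*(.*?)\s*(?=\b(?:{label_pattern})\b|$)",
        re.IGNORECASE,
    )
    canonical = {field.lower(): field for field in db_fields}
    return {
        canonical[m.group(1).lower()]: m.group(2)
        for m in pattern.finditer(one_line_text)
    }


# Hypothetical one-line output of the sorting step.
record = split_into_fields("Name ABC Address XYZ Cell No 12345",
                           ["Name", "Address", "Cell No"])
# One "label: value" pair per line, as in the reorganized output file.
output_file = "\n".join(f"{label}: {value}" for label, value in record.items())
```

One limitation of this approach: if a value itself contains a word that is also a field label, the value is cut off at that word.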
Ø The software then retrieves the contents from the file and saves them in the DB. This step of retrieving the contents is referred to here as Information Retrieval (IR).
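Saving an extracted record can be sketched with a parameterized insert. The project stores its data in MS SQL Server. This Python sketch uses an in-memory SQLite database purely as a stand-in so that it runs anywhere, and the table and column names are hypothetical.

```python
import sqlite3


def save_record(conn, record):
    """Insert one extracted record using a parameterized query."""
    conn.execute(
        # Placeholders keep extracted text from being interpreted as SQL.
        "INSERT INTO candidates (name, address) VALUES (?, ?)",
        (record["Name"], record["Address"]),
    )
    conn.commit()


# Stand-in database and table, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE candidates (name TEXT, address TEXT)")
save_record(conn, {"Name": "ABC", "Address": "XYZ"})
rows = conn.execute("SELECT name, address FROM candidates").fetchall()
```

Parameterized queries matter here because the values come from OCR output of arbitrary attachments, so they should never be concatenated into the SQL text.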
Ø The software then repeats the same procedure for the other mails that have attachments.
Software Development Kit (SDK):
The tools and techniques used in this software are:
1. NLP
2. Parser/Syntax Analyzer
3. Ghostscript
4. MODI OCR
5. Quick Sort
6. Regular Expression (Regex or Regexp)