Word Recognition Software

Note dt: 24 Nov 1996 (when Job Portals had not arrived in India and even email resumes were a rarity. Nearly 100% of the resumes arriving in our office were typed hard-copies, sometimes barely readable, and sent through snail mail!)

(Uploaded: 04 Nov 2016)
--------------------------------------------------------------------------------------------------------------------------------
Any given word (a cluster of characters) can be classified (in English) into one of the following "categories":

* Verb
* Adverb
* Preposition
* Adjective
* Noun (Common Noun / Proper Noun)
So, the first task is to create a "Directory" of each of these categories. Then each "word" must be compared to the words contained in a given directory. If a match occurs, that word gets categorized as belonging to that category.

The process has to be repeated again and again, trying to match the word against the words contained in each of the categories, till a match is found.

If no "match" is found, that word should be separately stored in a file marked "UNMATCHED WORDS".

Every day, an expert would study all the words contained in this file and assign each of these words a definite category, using his "HUMAN INTELLIGENCE".

In this way, over a period of time, human intelligence will identify / categorize each and every word contained in the English language. This will be the process of transferring human intelligence to the computer.
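A minimal sketch of this matching loop, assuming the category "directories" are kept as plain word-list files on disk (all file names below are hypothetical, not part of the original note):

# word_categorizer.py - a sketch of the directory-matching loop described above
# (file names and paths are assumptions, not the note's actual setup)

CATEGORY_FILES = {
    "VERB": "verbs.txt",
    "ADVERB": "adverbs.txt",
    "PREPOSITION": "prepositions.txt",
    "ADJECTIVE": "adjectives.txt",
    "NOUN": "nouns.txt",
}

def load_directory(path):
    """Load one category 'directory' as a set of lower-cased words."""
    with open(path) as f:
        return {line.strip().lower() for line in f if line.strip()}

def categorize(words):
    """Match each word against every category directory; collect the rest."""
    directories = {cat: load_directory(path) for cat, path in CATEGORY_FILES.items()}
    matched, unmatched = {}, []
    for word in words:
        for category, directory in directories.items():
            if word.lower() in directory:
                matched[word] = category
                break
        else:
            unmatched.append(word)        # no match in any directory
    # park the unmatched words for the daily expert review
    with open("UNMATCHED_WORDS.txt", "a") as f:
        f.writelines(w + "\n" for w in unmatched)
    return matched, unmatched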
Essentially, the trick lies in getting the computer (Software) to MIMIC the process followed by a Human Brain while scanning a set of words (ie: reading), and, by analyzing the "Sequence" in which these words are arranged, to assign a "MEANING" to each word or to a string of words (a phrase or a sentence).
I cannot believe that no one has attempted this before (especially since it has so much commercial value). But we don't know who has developed such software or where to find it, so we will end up re-inventing the wheel!
Our computer files contain some 900,000 words which have repeatedly occurred in our records - mostly converted bio data, or words captured from bio data.

We have, in our files, some 3,500 converted bio data. It has taken us about 6 years to accomplish this feat, ie:

* Approx 600 converted bio data per year, OR
* Approx 2 bio data "converted" every working day!

Assuming that all those (converted) bio data which are older than 2 years are OBSOLETE, this means that perhaps no more than 1,200 (ie: the last 2 years' worth) are current / valid / useful!

So, one thing becomes clear:

The "Rate of Obsolescence" is faster than the "Rate of Conversion"!
Of course, we can argue:

"Why should we waste / spend our time in 'converting' a bio data? All we need to do is to capture the ESSENTIAL / MINIMUM DATA (from each bio data) which would qualify that person to get searched / spotted. If he gets short-listed, we can always, at that point of time, spend the time / effort to fully convert his bio data."

In fact, this is what we have done so far - because there was a premium on the time of the data entry operators. That time was best utilized in capturing the "essential / minimum" data.
But if the latest technology permits / enables us to convert 200 bio data each day (instead of just 2), with the SAME effort / time / cost, then why not convert 200? Why be satisfied with just 2 per day?

If this can be made to "happen", we would be in a position to send out / fax out / email converted bio data to our clients in a matter of "minutes", instead of the "days" it takes today!
That is not all.

A converted bio data has far more "Key words" (Knowledge - Skills - Attributes - Attitudes etc) than the MINIMUM DATA, so there is an improved chance of "spotting" the RIGHT MAN, using a QUERY which contains a large number of Key words.

So, today, if the client "likes" only ONE converted bio data out of the TEN sent to him (a huge waste of everybody's time / effort), then under the new situation he should be able to "like" 4 out of every 5 converted bio data sent to him!

This would vastly improve the chance of at least ONE executive getting appointed in each assignment. This should be our goal.
This goal could be achieved only if:

STEP # 1 : Each bio data received every day is "scanned" on the same day
STEP # 2 : Converted to TEXT (ASCII)
STEP # 3 : PEN (Permanent Executive Number) given serially
STEP # 4 : WORD - RECOGNIZED (a step beyond OCR - Optical Character Recognition)
STEP # 5 : Each word "categorized", indexed and stored in the appropriate FIELDS of the DATABASE
STEP # 6 : Database "re-constituted" to create a "Converted" bio data as per our Standard Format

Steps # 1, 2 and 3 are not difficult.
Step # 4 is difficult.
Step # 5 is more difficult.
Step # 6 is most difficult.
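A minimal sketch of how Steps 3 - 6 might hang together in code, once a scanned bio data has already been OCR'd to ASCII text (Steps 1 - 2); the field names and the tiny word lists are assumptions for illustration only:

# biodata_pipeline.py - a sketch of Steps 3-6 of the pipeline described above
# (field names and the toy word lists are assumptions, not the note's actual scheme)

import itertools
import re

_pen_counter = itertools.count(1)   # STEP 3: serially issued Permanent Executive Number

# STEP 5: toy category "directories"; the real ones would be large word lists
DIRECTORIES = {
    "SKILL": {"java", "accounting", "welding"},
    "QUALIFICATION": {"mba", "b.e.", "diploma"},
}

def process_biodata(ascii_text):
    """Take the ASCII text of one bio data (output of STEPS 1-2) through STEPS 3-6."""
    pen = next(_pen_counter)                               # STEP 3: assign PEN
    words = re.findall(r"[A-Za-z.]+", ascii_text)          # STEP 4: (very crude) word recognition
    fields = {cat: [] for cat in DIRECTORIES}              # STEP 5: categorize into database fields
    unmatched = []
    for w in words:
        for cat, directory in DIRECTORIES.items():
            if w.lower() in directory:
                fields[cat].append(w)
                break
        else:
            unmatched.append(w)
    record = {"PEN": pen, **fields, "UNMATCHED": unmatched}
    # STEP 6: re-constitute a "converted" bio data in a standard layout
    return "\n".join(
        f"{key}: {', '.join(value) if isinstance(value, list) else value}"
        for key, value in record.items()
    )

print(process_biodata("MBA with Java and Accounting experience"))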
But if we keep working on this problem, it can be solved:

* 50 % accurate in 3 months
* 70 % accurate in 6 months
* 90 % accurate in 12 months
Even though there are about 900,000 indexed WORDS in our ISYS file, all of these do not occur (in a bio data / record) with the same FREQUENCY. Some occur far more frequently, some frequently, some regularly, some occasionally and some rarely.

Then of course (in the English language), there must be thousands of other words which have not occurred EVEN ONCE in any of the bio data! Therefore, we won't find them amongst the existing "Indexed File" of 900,000 words. It is quite possible that some of these (so far missing) words may occur if this file (of words) were to grow to 2 million.

As this file of words grows and grows, the probabilities of:

* a word having been left out, AND
* such a "left out" word occurring in the next bio data,

are "decreasing".
The Frequency Distribution curve might look as follows (a skewed Normal Curve, skewed to the left):

X Axis > % of Words in the English language or in ISYS - 900,000 (10 % - 20 % - 30 % ... and similar intervals)
Y Axis > Number of times occurred

Meaning: some 20 % of the WORDS (in the English language) make up, may be, 90 % of all the "Occurrences".

This would become clear when we plot the "Frequency Distribution" curve of the 900,000 words which we have already indexed.
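A minimal sketch of plotting that curve, assuming the index can be dumped as a "word<TAB>count" file (the file name and format are assumptions):

# freq_curve.py - plot the cumulative frequency distribution of the indexed words
# (assumes a tab-separated "word<TAB>count" dump of the index; file name is an assumption)

import matplotlib.pyplot as plt

with open("isys_word_counts.tsv") as f:
    counts = sorted((int(line.rsplit("\t", 1)[1]) for line in f), reverse=True)

total = sum(counts)
cumulative = []
running = 0
for c in counts:                      # most frequent words first
    running += c
    cumulative.append(100 * running / total)

x = [100 * (i + 1) / len(counts) for i in range(len(counts))]   # % of words, ranked by frequency

plt.plot(x, cumulative)
plt.xlabel("% of words (ranked by frequency)")
plt.ylabel("% of all occurrences covered")
plt.title("Frequency distribution of indexed words")
plt.show()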
And even when this population grows to 2 million, the shape (the nature) of the frequency distribution curve is NOT likely to change! Only with a much larger WORD POPULATION will the "Accuracy" increase, and only marginally.
So, our search is to find:

* WHICH are these 20 % (20 % of 900,000 = 180,000) WORDS which make up 90 % of the "Area under the Curve", ie: the POPULATION?

Then focus our efforts on "Categorizing" these 180,000 words in the first place. If we manage to do this, 90 % of our battle is won.
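A minimal sketch of how that 20 % could actually be identified from the same word-count dump (again, the file name and format are assumptions):

# coverage.py - find the smallest set of words that covers 90 % of all occurrences
# (assumes the same "word<TAB>count" dump used above; the target threshold is adjustable)

def words_covering(dump_path, target=0.90):
    counts = []
    with open(dump_path) as f:
        for line in f:
            word, count = line.rstrip("\n").rsplit("\t", 1)
            counts.append((word, int(count)))
    counts.sort(key=lambda wc: wc[1], reverse=True)      # most frequent words first
    total = sum(c for _, c in counts)
    core, covered = [], 0
    for word, count in counts:
        core.append(word)
        covered += count
        if covered / total >= target:                    # reached the target coverage
            break
    print(f"{len(core)} of {len(counts)} words "
          f"({len(core) / len(counts):.0%}) cover {covered / total:.0%} of occurrences")
    return core

core_words = words_covering("isys_word_counts.tsv")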
Of course, this pre-supposes that before we can attempt "Categorization", we must be able to "Recognize" each of them as a WORD.
-------------------------------------------------------------------------------------------------------------------