Hi Friends,

Even as I launch this today ( my 80th Birthday ), I realize that there is yet so much to say and do.

There is just no time to look back, no time to wonder, "Will anyone read these pages?"

With regards,
Hemen Parekh
27 June 2013

Thursday, 3 November 2016

Word Recognition Software



( Note dated :  24  Nov  1996 , when Job Portals had not yet arrived in India and even emailed resumes were a rarity . Nearly 100 % of the resumes arriving in our office were typed hard copies ( sometimes barely readable ) , sent through snail mail !


Uploaded  :  04  Nov  2016 )

--------------------------------------------------------------------------------------------------------------------------------

Any given word ( a cluster of characters ) , can be classified ( in English ) into one of the following " categories " :

*   Verb

*   Adverb

*   Preposition

*   Adjective

*   Noun (  Common Noun  /  Proper Noun )


So , the first task is to create a " Directory " for each of these categories . Then each " word " must be compared with the words contained in a given directory


If a match occurs then that word would get categorized as belonging to that category


The process has to be repeated again and again by trying to match the word with the words contained in each of the categories , till a match is found


If no " match " is found , that word should be stored separately in a file marked " UNMATCHED  WORDS "


Every day , an expert would study all the words contained in this file and assign each of these words , a definite category , using his " HUMAN  INTELLIGENCE "


In this way , over a period of time , human intelligence will identify / categorize each and every word contained in the ENGLISH language . This will be the process of transferring human intelligence to the computer
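The lookup loop described above can be sketched in Python. This is only a minimal illustration: the tiny directories, the function names ( categorize , save_unmatched , assign_category ) and the file name unmatched_words.txt are assumptions made for the example, not part of the 1996 note.

```python
from pathlib import Path

# Hypothetical, tiny "directories"; real ones would hold thousands of words each.
DIRECTORIES = {
    "Verb": {"manage", "design", "lead"},
    "Adverb": {"quickly", "successfully"},
    "Preposition": {"in", "of", "at", "for"},
    "Adjective": {"senior", "technical"},
    "Noun": {"engineer", "manager", "bombay"},
}


def categorize(words, directories=DIRECTORIES):
    """Return ({word: category}, [unmatched words]), trying each directory in turn."""
    matched, unmatched = {}, []
    for raw in words:
        word = raw.lower()
        for category, entries in directories.items():
            if word in entries:
                matched[word] = category  # first directory that matches wins
                break
        else:
            # No directory contained the word: leave it for the human expert.
            if word not in unmatched:
                unmatched.append(word)
    return matched, unmatched


def save_unmatched(unmatched, path="unmatched_words.txt"):
    """Append unmatched words to the file the expert reviews each day."""
    with Path(path).open("a", encoding="utf-8") as handle:
        for word in unmatched:
            handle.write(word + "\n")


def assign_category(word, category, directories=DIRECTORIES):
    """Record the expert's decision so that the word matches next time."""
    directories.setdefault(category, set()).add(word.lower())
```

Each call to assign_category shrinks the unmatched file for the following day, which is the " transfer of human intelligence " in miniature. Note that the sketch stops at the first match , so a word that belongs to two categories ( " lead " as Verb and as Noun ) gets only one of them.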


Essentially , the trick lies in getting the computer ( Software ) to MIMIC the process followed by a Human Brain while scanning a set of words ( ie: reading ), and by analyzing  the " Sequence " in which these words are arranged , to assign a " MEANING " to each word or to a string of words ( a phrase or a sentence )


I cannot believe that no one has attempted this before ( especially since it has so much commercial value )


We don't know who has developed such software or where to find it , so we may end up re-inventing the wheel !

Our computer files contain some 900,000 words which have repeatedly occurred in our records - mostly converted bio data or words captured from bio data


We have , in our files , some 3500 converted bio data . It has taken us about 6 years to accomplish this feat , ie:

*  Approx 600 converted bio data per year ,  OR

*  Approx 2 bio data " converted " every working day !


Assuming that all those ( converted ) bio data which are older than 2 years , are OBSOLETE , this means that perhaps no more than 1200 are current / valid / useful !


So , one thing becomes clear

The " Rate of Obsolescence " is faster than the " Rate  of Conversion " !
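The arithmetic behind these figures can be checked with a few lines of Python. The totals come from the note ; the figure of 300 working days per year is an assumption made here so that the " 2 per working day " estimate can be reproduced.

```python
# Figures taken from the note above.
TOTAL_CONVERTED = 3500      # converted bio data on file
YEARS_TAKEN = 6             # years it took to convert them
SHELF_LIFE_YEARS = 2        # older bio data are treated as obsolete

# Assumption for illustration only; the note does not state it.
WORKING_DAYS_PER_YEAR = 300

per_year = TOTAL_CONVERTED / YEARS_TAKEN              # roughly 600 per year
per_working_day = per_year / WORKING_DAYS_PER_YEAR    # roughly 2 per day
current_stock = per_year * SHELF_LIFE_YEARS           # roughly 1200 still valid
obsolete_stock = TOTAL_CONVERTED - current_stock      # the rest has gone stale

print(round(per_year), round(per_working_day, 1),
      round(current_stock), round(obsolete_stock))
```

On these numbers , about two-thirds of the conversion effort of six years has already gone stale.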



Of course , we can argue :

" Why should we waste / spend our time in ' converting ' a bio data ? 

All we need to do is to capture the ESSENTIAL  /  MINIMUM  DATA ( from each bio data ) , which would qualify that person to get searched / spotted


If he gets short-listed , we can always , at that point of time , spend time / effort to fully convert his bio data  "



In fact , this is what we have done so far - because there was a premium on the time of the data entry operators

That time was best utilized in capturing the " essential / minimum " data


But , if the latest technology permits / enables us to convert 200 bio data each day  ( instead of just 2 ) , with the SAME effort / time / cost , then why not convert 200 ? Why be satisfied with just 2 per day ?


If this can be made to " happen " , we would be in a position to send out / fax out / email converted bio data to our clients in a matter of " minutes " , instead of the " days " which it takes today !


That is not all

A converted bio data has far more " Key words " ( Knowledge - Skills - Attributes - Attitudes etc ) than the MINIMUM DATA , so there is an improved chance of " spotting " the RIGHT MAN , using a QUERY which contains a large number of Key words


So , today , if the client " likes " only ONE converted bio data , out of TEN sent to him ( a huge waste of everybody's time / effort ) , then under the new situation , he should be able to " like " , 4 out of every 5 converted bio data sent to him !


This would vastly improve the chance of at least ONE executive getting appointed in each assignment . This should be our goal


This goal could be achieved only if ,


STEP  #  1
Each bio data received every day is " scanned " on the same day


STEP  #  2
Converted to TEXT ( ASCII )


STEP  #  3
PEN ( Permanent Executive Number ) given serially


STEP  #  4
WORD - RECOGNIZED ( a step beyond OCR - Optical Character Recognition )


STEP  #  5
Each word " categorized " and indexed and stored in appropriate FIELDS of the DATABASE


STEP  #  6
Database " Re-constituted " to create " Converted " bio data as per our Standard Format


Steps #  1 , 2  and 3  are not difficult

Step # 4 , is difficult

Step # 5 , is more difficult

Step # 6 , is most difficult


But , if we keep working on this problem , it can be solved ,

50 % accurate in 3 months

70 % accurate in 6 months

90 % accurate in 12 months


Even though there are about 900,000 indexed WORDS in our ISYS file , not all of these occur ( in a bio data / record ) with the same FREQUENCY


Some occur far more frequently , some frequently , some regularly , some occasionally and some rarely


Then of course , ( in the English language ) , there must be thousands of other words , which have not occurred EVEN ONCE in any of the bio data !


Therefore , we won't find them amongst the existing " Indexed File " of 900,000 words

It is quite possible that some of these ( so far missing words ) , may occur if this file ( of words ) , were to grow to 2 million

As this file of words grows and grows , the probabilities of ,

>  a word having been left out  ,  AND

>  such a " left out " word occurring ( in the next bio data ) ,
keep " decreasing "
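One rough way to watch this probability fall is to process the bio data in order of arrival and record , for each one , what share of its words had never been seen before. The function below is a sketch under that assumption ; the name new_word_rates and the whitespace-based word splitting are illustrative choices , not something the note specifies.

```python
def new_word_rates(documents):
    """For each document, return the share of its distinct words
    that did not appear in any earlier document."""
    seen = set()
    rates = []
    for text in documents:
        words = set(text.lower().split())
        if not words:
            rates.append(0.0)  # an empty document contributes no new words
            continue
        new_words = words - seen
        rates.append(len(new_words) / len(words))
        seen |= words  # the word file grows with every document
    return rates
```

If the reasoning above holds , the returned rates should drift towards zero as more bio data are processed.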


The Frequency Distribution curve might look as follows ( a skewed Normal Curve / Skewed to left ) :


X  Axis  

>    % of Words in English language or in ISYS - 900,000 ( 10 % - 20 % - 30 % ...and similar intervals )


Y  Axis  

>    Number of times occurred ,



Meaning :

Some 20 %  of the WORDS ( in the English language ) make up maybe 90 % of all the " Occurrences "

This would become clear when we plot the " Frequency Distribution " curve of the 900,000 words which we have already indexed

And , even when this population grows to 2 million , the shape ( the Nature ) of the frequency distribution curve is NOT likely to change !


Only with a much larger WORD POPULATION will the " Accuracy " increase , and then only marginally

So , our search is to find ,

*  WHICH are these 20 % ( 20 % * 900,000 = 180,000 ) WORDS, which make up 90 % " Area under the Curve " , ie: the POPULATION ?


We should then focus our efforts on " Categorizing " these 180,000 words in the first place


If we manage to do this , 90 % of our battle is won
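Finding those high-frequency words is a matter of sorting the index by number of occurrences and accumulating until the target share is reached. A minimal Python sketch follows ; the function name words_covering and the sample counts are hypothetical , and the real input would be the occurrence counts from the 900,000-word ISYS index.

```python
from collections import Counter


def words_covering(counts, target=0.90):
    """Return the most frequent words whose occurrences together
    reach `target` (a fraction) of all occurrences."""
    total = sum(counts.values())
    covered, chosen = 0, []
    for word, count in counts.most_common():
        if total == 0 or covered / total >= target:
            break
        chosen.append(word)
        covered += count
    return chosen


# Hypothetical occurrence counts, for illustration only.
counts = Counter({"engineer": 50, "manager": 30, "sales": 10, "export": 5,
                  "audit": 2, "welding": 1, "forging": 1, "casting": 1})
core = words_covering(counts)
share_of_vocabulary = len(core) / len(counts)
```

Whether the real share of vocabulary turns out to be 20 % or some other figure can only be settled by running this kind of count on the actual index.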


Of course , this pre-supposes that before we can attempt " Categorization " , we must be able to " Recognize " each of them as a WORD

-------------------------------------------------------------------------------------------------------------------


   
