Word Recognition Software

Note dated : 24 Nov 1996

( Written when Job Portals had not arrived in India and even emailed resumes were a rarity . Nearly 100 % of the resumes arriving in our office were typed hard-copies , sometimes barely readable , sent through snail mail ! )

Uploaded : 04 Nov 2016

--------------------------------------------------------------------------------------------------------------------------------

Any given word ( a cluster of characters ) , can be classified ( in English ) into one of the following " categories " :

* Verb

* Adverb

* Preposition

* Adjective

* Noun ( Common Noun / Proper Noun )

So , the first task is to create a " Directory " of each of these categories . Then each " word " must be compared to the words contained in a given directory

If a match occurs then that word would get categorized as belonging to that category

The process has to be repeated again and again by trying to match the word with the words contained in each of the categories , till a match is found

If no " match " is found , that word should be separately stored in a file marked ,

" UNMATCHED WORDS "

Every day , an expert would study all the words contained in this file and assign each of these words , a definite category , using his " HUMAN INTELLIGENCE "

In this way , over a period of time , human intelligence will identify / categorize each and every word contained in the ENGLISH language . This will be the process of transferring human intelligence to the computer
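The matching loop described above can be sketched in a few lines of Python. This is only an illustration, not a finished design: the five directories hold a handful of invented sample words, and the names `DIRECTORIES`, `categorize`, `process` and `teach` are hypothetical.

```python
from typing import Dict, List, Optional, Set, Tuple

# Hypothetical, tiny "directories". Real ones would hold many thousands of words.
DIRECTORIES: Dict[str, Set[str]] = {
    "Verb": {"manage", "design", "lead"},
    "Adverb": {"quickly", "successfully"},
    "Preposition": {"in", "of", "with"},
    "Adjective": {"senior", "technical"},
    "Noun": {"engineer", "bombay", "project"},
}


def categorize(word: str) -> Optional[str]:
    """Try each directory in turn; return the first category that matches, else None."""
    key = word.lower()
    for category, words in DIRECTORIES.items():
        if key in words:
            return category
    return None


def process(words: List[str]) -> Tuple[Dict[str, str], List[str]]:
    """Split words into a word -> category map and an UNMATCHED WORDS list."""
    matched: Dict[str, str] = {}
    unmatched: List[str] = []
    for word in words:
        key = word.lower()
        category = categorize(key)
        if category is not None:
            matched[key] = category
        elif key not in unmatched:
            unmatched.append(key)  # kept for the daily review by the expert
    return matched, unmatched


def teach(word: str, category: str) -> None:
    """Record the expert's decision so that the word matches from now on."""
    DIRECTORIES[category].add(word.lower())


matched, unmatched = process(["Senior", "engineer", "in", "Bombay", "metallurgy"])
print(unmatched)                  # ['metallurgy']
teach("metallurgy", "Noun")       # the expert assigns a category
print(categorize("metallurgy"))   # Noun
```

The sketch stops at the first match, exactly as described above. A word that belongs to more than one category ( for example " lead " , which can be a verb or a noun ) would therefore always be given whichever category is tried first.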

Essentially , the trick lies in getting the computer ( Software ) to MIMIC the process followed by a Human Brain while scanning a set of words ( ie: reading ), and by analyzing the " Sequence " in which these words are arranged , to assign a " MEANING " to each word or to a string of words ( a phrase or a sentence )

I cannot believe that no one has attempted this before ( especially since it has so much commercial value )

We don't know who has developed this software and where to find it , so we must end up re-inventing the wheel !

Our computer files contain some 900,000 words which have repeatedly occurred in our records - mostly converted bio data or words captured from bio data

We have , in our files , some 3500 converted bio data . It has taken us about 6 years to accomplish this feat , ie:

* Approx 600 converted bio data per year , OR

* Approx 2 bio data " converted " every working day !

Assuming that all those ( converted ) bio data which are older than 2 years , are OBSOLETE , this means that perhaps no more than 1200 ( 2 years x 600 per year ) are current / valid / useful !

So , one thing becomes clear

The " Rate of Obsolescence " is faster than the " Rate of Conversion " !
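The arithmetic behind these figures can be checked with a short calculation. The totals come from the note itself; the figure of 300 working days per year is an assumption made only for this sketch.

```python
# Figures taken from the note.
converted_total = 3500      # converted bio data on file
years_elapsed = 6           # time taken to convert them
shelf_life_years = 2        # anything older is treated as obsolete

# Assumption (not stated in the note): about 300 working days per year.
WORKING_DAYS_PER_YEAR = 300

per_year = converted_total / years_elapsed        # about 583, rounded to 600 in the note
per_day = per_year / WORKING_DAYS_PER_YEAR        # about 1.9, i.e. roughly 2 per day
current = round(per_year * shelf_life_years)      # about 1167, "no more than 1200"
obsolete = converted_total - current              # about 2333

print(f"{per_year:.0f} per year, {per_day:.1f} per working day")
print(f"{current} current, {obsolete} obsolete ({obsolete / converted_total:.0%} of the file)")
```

On these numbers roughly two thirds of the converted file is already out of date.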

Of course , we can argue :

" Why should we waste / spend our time in ' converting ' a bio data ?

All we need to do is to capture the ESSENTIAL / MINIMUM DATA ( from each bio data ) , which would qualify that person to get searched / spotted

If he gets short-listed , we can always , at that point of time , spend time / effort to fully convert his bio data "

In fact , this is what we have done so far - because there was a premium on the time of the data entry operators

That time was best utilized in capturing the " essential / minimum " data

But , if latest technology permits / enables us to convert 200 bio data each day ( instead of just 2 ) , with the SAME effort / time / cost , then why not convert 200 ? Why be satisfied with just 2 per day ?

If this can be made to " happen " , we would be in a position to send out / fax out / email , converted bio data to our clients in a matter of " minutes " instead of the " days " which it takes today !

That is not all

A converted bio data has far more " Key words " ( Knowledge - Skills - Attributes - Attitudes etc ) , than the MINIMUM DATA , so there is an improved chance of " spotting " the RIGHT MAN , using a QUERY which contains a large number of Key words

So , today , if the client " likes " only ONE converted bio data , out of TEN sent to him ( a huge waste of everybody's time / effort ) , then under the new situation , he should be able to " like " , 4 out of every 5 converted bio data sent to him !

This would vastly improve the chance of at least ONE executive getting appointed in each assignment . This should be our goal

This goal could be achieved only if ,

STEP # 1

Each bio data received every day is " scanned " on the same day

STEP # 2

Converted to TEXT ( ASCII )

STEP # 3

PEN ( Permanent Executive Number ) given serially

STEP # 4

WORD - RECOGNIZED ( a step beyond OCR - Optical Character Recognition )

STEP # 5

Each word " categorized " and indexed and stored in appropriate FIELDS of the DATABASE

STEP # 6

Database " Re-constituted " to create " Converted " bio data as per our Standard Format
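As a rough sketch of how Steps 3 to 6 might hang together, assuming the scanned text from Steps 1 and 2 is already available as a plain string. The field names, the sample lookup table and the function names below are all hypothetical; the note does not define the actual database fields or the Standard Format.

```python
import itertools
from dataclasses import dataclass, field
from typing import Dict, List

# Hypothetical word -> database field lookup (invented sample entries).
FIELD_OF_WORD: Dict[str, str] = {
    "b.e.": "Qualification",
    "metallurgy": "Skills",
    "welding": "Skills",
    "pune": "Location",
}

_pen_counter = itertools.count(1)  # Step 3: PENs are issued serially


@dataclass
class Record:
    pen: int
    fields: Dict[str, List[str]] = field(default_factory=dict)
    unmatched: List[str] = field(default_factory=list)


def convert(ascii_text: str) -> Record:
    """Steps 3 to 5 for one bio data whose ASCII text is already available."""
    record = Record(pen=next(_pen_counter))
    # Step 4, done naively: treat whitespace-separated tokens as words.
    for token in ascii_text.lower().replace(",", " ").split():
        target = FIELD_OF_WORD.get(token)
        if target is None:
            record.unmatched.append(token)
        else:
            record.fields.setdefault(target, []).append(token)  # Step 5
    return record


def reconstitute(record: Record) -> str:
    """Step 6: lay the stored fields out in a fixed (hypothetical) order."""
    lines = [f"PEN : {record.pen}"]
    for name in ("Qualification", "Skills", "Location"):
        lines.append(f"{name} : {', '.join(record.fields.get(name, []))}")
    return "\n".join(lines)


record = convert("B.E. Metallurgy, welding, Pune, cricket")
print(reconstitute(record))
print("Unmatched :", record.unmatched)
```

Splitting on whitespace is the weakest part of this sketch. Real word recognition would have to cope with OCR errors, abbreviations and multi-word terms, none of which is attempted here.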

Steps # 1 , 2 and 3 are not difficult

Step # 4 , is difficult

Step # 5 , is more difficult

Step # 6 , is most difficult

But , if we keep working on this problem , it can be solved ,

50 % accurate in 3 months

70 % accurate in 6 months

90 % accurate in 12 months

Even though there are about 900,000 indexed WORDS in our ISYS file , not all of these occur ( in a bio data / record ) , with the same FREQUENCY

Some occur far more frequently , some frequently , some regularly , some occasionally and some rarely

Then of course , ( in the English language ) , there must be thousands of other words , which have not occurred EVEN ONCE in any of the bio data !

Therefore , we won't find them amongst the existing " Indexed File " of 900,000 words

It is quite possible that some of these ( so far missing words ) , may occur if this file ( of words ) , were to grow to 2 million

As this file of words grows and grows , the probabilities of ,

> a word having been left out , AND

> such a " left out " word occurring ( in the next bio data ) ,

are " decreasing "
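One way to watch this effect on real data would be to measure, for each new batch of bio data, what fraction of its words were not yet in the file. The function below is a minimal sketch of that measurement; the name `unseen_rate_per_batch` and the three toy batches are invented for illustration and prove nothing about actual bio data.

```python
from typing import Iterable, List, Set


def unseen_rate_per_batch(batches: Iterable[List[str]]) -> List[float]:
    """For each batch, the fraction of its words missing from the file as it stood before that batch."""
    known: Set[str] = set()
    rates: List[float] = []
    for batch in batches:
        words = [w.lower() for w in batch]
        unseen = sum(1 for w in words if w not in known)
        rates.append(unseen / len(words) if words else 0.0)
        known.update(words)  # the file grows after every batch
    return rates


toy_batches = [
    ["sales", "manager", "pune"],
    ["sales", "engineer", "pune"],
    ["sales", "manager", "engineer"],
]
print(unseen_rate_per_batch(toy_batches))  # [1.0, 0.3333333333333333, 0.0]
```

With real records the rate would not necessarily fall at every step, but the expectation stated above is that it trends downward as the file grows.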

The Frequency Distribution curve might look as follows ( a skewed curve , with its peak at the left and a long tail stretching to the right ) :

X Axis

> % of Words in English language or in ISYS - 900,000 ( 10 % - 20 % - 30 % ...and similar intervals )

Y Axis

> Number of times occurred ,

Meaning :

Some 20 % of the WORDS ( in the English language ) make up , maybe , 90 % of all the " Occurrences "

This would become clear when we plot the " Frequency Distribution " curve of the 900,000 words which we have already indexed

And , even when this population grows to 2 million , the shape ( the Nature ) of the frequency distribution curve is NOT likely to change !

Only with a much larger WORD POPULATION will the " Accuracy " increase , and then only marginally

So , our search is to find ,

* WHICH are these 20 % ( 20 % * 900,000 = 180,000 ) WORDS, which make up 90 % " Area under the Curve " , ie: the POPULATION ?

Then focus our efforts on " Categorizing " these 180,000 words first

If we manage to do this , 90 % of our battle is won

Of course , this pre-supposes that before we can attempt " Categorization " , we must be able to " Recognize " each of them as a WORD
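Finding those high-frequency words is a simple counting exercise once the words are available as a list. The sketch below counts occurrences, ranks the words from most to least frequent, and keeps adding words until the chosen share of all occurrences is covered. The function name `words_covering` and the sample data are invented; the 20 % / 90 % split is the note's guess and would have to be confirmed against the real index.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def words_covering(words: Iterable[str], target: float = 0.90) -> Tuple[List[str], float]:
    """Return the most frequent words that together reach `target` of all
    occurrences, plus the share of distinct words which that list represents."""
    counts = Counter(w.lower() for w in words)
    total = sum(counts.values())
    selected: List[str] = []
    covered = 0
    for word, n in counts.most_common():
        if total == 0 or covered / total >= target:
            break
        selected.append(word)
        covered += n
    share_of_vocabulary = len(selected) / len(counts) if counts else 0.0
    return selected, share_of_vocabulary


# Invented sample: 5 distinct words, 100 occurrences in all.
sample = (["manager"] * 50 + ["sales"] * 30 + ["engineer"] * 15
          + ["metallurgy"] * 3 + ["cricket"] * 2)
top_words, share = words_covering(sample)
print(top_words, share)  # ['manager', 'sales', 'engineer'] 0.6
```

Run against the 900,000 indexed words, the length of the returned list would show whether the figure is really close to 180,000 or something quite different.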

-------------------------------------------------------------------------------------------------------------------

Thursday, 3 November 2016
