Summary of today's work:
• Wrote a new function that improves the accuracy of the word count.
• Added a few words to the exclusion list.
• Found a perfect hash function generator for a set of reserved words.
To do next:
• Find or write a function to combine words that belong together as a single term.
• Change the program to read the input a chunk at a time.
• Read more on how the hash generator works and implement a perfect hash function for the list of reserved words.
• Change the structure holding the words from an array into a splay tree.
Today I coded another function to improve the accuracy of the word count. It does two things. First, it strips special characters from words, so that a word with a period, comma or quotation mark attached is no longer counted separately from the bare word. Second, it lowercases words, so that a word at the beginning of a sentence is no longer counted separately either.
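The two steps can be sketched roughly as follows. This is a minimal illustration in Python, not the function written for this project: the name `normalize_word` is hypothetical, and it assumes that "special characters" means punctuation attached to either end of a word.

```python
import string

def normalize_word(word: str) -> str:
    """Strip punctuation attached to either end of a word, then lowercase it."""
    # string.punctuation covers periods, commas, quotation marks, colons, etc.
    # Assumption: only leading and trailing punctuation is removed.
    return word.strip(string.punctuation).lower()

words = ['"This', 'The', 'gperf.', 'Previous:']
print([normalize_word(w) for w in words])
# prints ['this', 'the', 'gperf', 'previous']
```

With this in place, "Previous:" and "previous" fall into the same bucket, as do "The" and "the".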
I also found an open-source perfect hash function generator, which I intend to use for the excluded words list.
I copied the text of the generator's web page into a test file and ran the word counter on it. After calling the function described in my first paragraph, the results improved dramatically. Forty-eight more occurrences of the most common word were found, raising its count from 71 to 119, an increase of roughly 68%. Moreover, that word happened to be the name of the function that the page was describing, which was getting beaten out in the first run by "This" and "The". Lowercasing "This" and "The" allowed them to be recognized as words that should be discarded, bringing more relevant results into the final data.
Here are the results from before the function was applied:
This 80
The 74
gperf 71
C 64
work 63
code 63
hash 61
GNU 58
Output 51
If 43
License 41
Declarations 35
Gperf 33
A 30
must 30
Previous: 29
Up: 29
keywords 28
function 27
Input 26
Format 25
search 24
Next: 24
same 24
copyright 24
And here are the results after:
gperf 119
C 113
code 103
license 103
work 100
output 79
hash 72
declarations 67
generated 65
input 62
GNU 58
file 56
program 55
table 48
keywords 47
options 46
function 45
source 45
keyword 44
functions 44
covered 41
format 40
copyright 40
perfect 37
previous 36
Note that C and GNU are still in capital letters. The function is intelligent enough to leave single capital letters and typical acronyms alone. Proper nouns will, of course, be all lower case in the results. With any luck, the user should be able to edit the resulting data to fix such small errors. As long as all the words are counted properly, this should not be an issue.
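One simple way to get this behaviour is to skip lowercasing for any word that is entirely upper case. The sketch below is in Python and rests on that assumption; the name `normalize_keep_acronyms` is hypothetical and the rule may differ from the one the real function uses.

```python
import string

def normalize_keep_acronyms(word: str) -> str:
    """Strip surrounding punctuation; lowercase unless the word is all capitals."""
    stripped = word.strip(string.punctuation)
    # Assumed rule: an all-uppercase token is a single letter ("C")
    # or an acronym ("GNU"), so it is left as it is.
    if stripped.isupper():
        return stripped
    return stripped.lower()

for w in ["C,", "GNU", "License", "Gperf."]:
    print(w, "->", normalize_keep_acronyms(w))
```

A side effect of this rule is that a sentence-initial "A" or "I" also stays capitalized, so those would need to be caught by the exclusion list in their capitalized form.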
The words "is", "are", and "may" were also added to the exclusion list.
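For illustration, here is roughly how an exclusion list interacts with the count. This is a Python sketch with hypothetical names (`EXCLUDED`, `count_words`); the set shown is only a small subset for the example, and a plain set lookup stands in for the perfect hash function planned above.

```python
from collections import Counter

# Illustrative subset of the exclusion list, stored in lower case.
EXCLUDED = {"this", "the", "is", "are", "may"}

def count_words(words):
    """Count words case-insensitively, skipping anything on the exclusion list."""
    counts = Counter()
    for w in words:
        key = w.lower()
        if key not in EXCLUDED:
            counts[key] += 1
    return counts

sample = ["The", "hash", "is", "perfect", "This", "hash", "may", "work"]
print(count_words(sample))
```

Because the lookup happens after lowercasing, "The" and "This" are discarded along with "the" and "this".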