Michael Klosiewski's CSCI460 project weblog



Friday, February 1st (all dates are 2013)
Current phase: Planning

After deciding to attempt to install Flash on my computer despite it not meeting the system requirements, I found out that the disc that came with the book is, in fact, not a copy of Flash. Now I need to find either a new place to work or, once again, a new medium to work in. I'm not going to give up on Flash yet. I'll see what can be done about a place to work.

On top of that, with three other time-consuming classes to take, I haven't really gotten anything done this week. I have ideas, but I have yet to act on them--a precarious place to be. I want to work this weekend, and I think I will have the time, but I have yet to plot out the actual times I will do so.

Since we have been asked to detail our timelines and put them on here, here is mine:

By February 24th, three weeks and two days from today, planning should be done. This includes knowing all the tasks that have to be taken care of, which algorithms and data structures I am going to use for them, which languages those will be written in, and having simple tests to see that they do what I hope they will do.

By March 12th, two weeks and two days after the end of the planning phase, coding should be done. This means having a basic working copy of all parts of the project. At a minimum, these working copies will perform the initial goals of the project when they are used correctly.

By April 6th, three weeks and four days after coding is complete, all parts of the project should be robust, having been rigorously tested, and should be fully documented. Any expected features that may have been lacking in the first draft should have been improved. I should be able to submit the project with confidence at this point.

Following April 6th, I will attempt to add any features that could not make it into the second draft. In addition, I will try to add features that make the project more usable. These might include multiple ways to open files, displaying more information about the processing, and others. Backup copies will always be kept (throughout the entire development, but especially here), because if a feature cannot be implemented, I should still be able to submit the latest working version when the project comes due.

- Mike


Sunday, February 3rd
Current phase: Planning/Design

Progress over the weekend:

I have looked at hash algorithms, which will be important for looking up words quickly, and decided on a couple I'd like to try on sample data in testing.
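
I won't name a final choice yet, but for reference, two well-known string hashes of the kind I have been looking at go roughly as follows (a sketch; the constants are the standard published ones):

    #include <cstdint>
    #include <string>

    // djb2 (Bernstein): multiply-and-add over the bytes of the word.
    uint32_t djb2(const std::string &word) {
        uint32_t h = 5381;
        for (unsigned char c : word)
            h = h * 33 + c;
        return h;
    }

    // FNV-1a: xor each byte in, then multiply by the FNV prime.
    uint32_t fnv1a(const std::string &word) {
        uint32_t h = 2166136261u;
        for (unsigned char c : word) {
            h ^= c;
            h *= 16777619u;
        }
        return h;
    }

Either one can index a word into a table with something like djb2(word) % TABLE_SIZE.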

I have also developed a comprehensive plan for how the program will handle large amounts of data as efficiently as possible, and in the process I have solved the issue of keeping the data true to the original question of "What are the top 25 words?" In other words, there should be very little risk of picking the wrong words due to not keeping enough candidate words, but at the same time I shouldn't be storing an astronomical amount of data at any one time.

As part of this plan, I have separated out what needs to be working and in place in each draft by the self-imposed deadlines set out in my previous blog entry, at least on the data-processing end.

Next steps to take:

- Talk to professors about design, both short-term and long-term

- Decide, based on design, what the Windows application needs to look like, and create a dummy interface that contains all those things

- Either get working on figuring Flash out, or move to some other graphics production method

I feel like I am already late on that third requirement, and I am eager to get working on it.

I'm leaning much more toward doing the data processing in C++ rather than C#, because I think I can avoid processing overhead that way, and I am still more familiar with C++ when it comes to handling data. The Windows interface will still be built in C#, because C# is a very good tool for making Windows applications. I have a vague desire to process the data in C, which would be one step further in the direction of fast processing, but with all of the strings I'm going to be dealing with, I don't have the courage to do so. Perhaps that's for the better, as I am functionally still a novice in C, which is more different from C++ than I originally imagined.


Monday, February 11th
Current phase: Planning/Design/Module Coding

This is really more like two entries in one, as I should have written last night about my efforts this past weekend.

I spent a decent chunk of the previous two days trying to write a simple program that counts the words in a text file. Unfortunately, although it runs, I have not yet been able to make sense of the output. Since I was not able to work on it today, my hope is that I will have time tomorrow, and that perhaps I will even be able to finish that part, so that I can move on to the smaller parts of the algorithm: throwing out words, combining similar words, and making sure I am not bringing in special characters with the words, like the periods and commas at the ends of words.

I was able to get some good ideas at the group meeting today, but after some consideration I still feel I need to make two passes through the data to collect all the information I need without using tremendous resources. I also suspect that processing the data that way may be faster. I am going to change my earlier plan a little bit, though. Rather than counting words on both passes, I am going to do all the counting in the first pass, processing all of the data rather than a sample of it. Because the program will not be trying to simultaneously count words and keep track of where they are, this pass should run relatively fast. By the second pass, I will have my top X words without a doubt, and I can focus on gathering sentence data for only those words. This should simplify the organization quite a bit. I will create a full plan for this tomorrow.
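
To make the idea concrete, pass one might look roughly like this (a sketch only; I'm using std::unordered_map for brevity here, and the final data structure is still undecided):

    #include <fstream>
    #include <string>
    #include <unordered_map>

    // Pass one: count every word and track nothing else. The cleanup
    // and exclusion steps would slot in just before the increment.
    std::unordered_map<std::string, long> countWords(const char *path)
    {
        std::unordered_map<std::string, long> counts;
        std::ifstream in(path);
        std::string word;
        while (in >> word)
            ++counts[word];
        return counts;
    }

    // Pass two (outline): with the top X words fixed, reread the data
    // and record only the sentences in which those words occur.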


Tuesday, February 12th
Current phase: Planning/Design/Module Coding

I got my word counter working. It now writes the top ten words of a test file to another file, along with their occurrence counts, in proper order (greatest first). The only strange thing it's doing is printing fifteen lines of five spaces followed by -858993460. I just figured out why it's doing this, though, so it should be an easy fix. I also learned that the number is 0xCCCCCCCC in hex; from what I've read, it's the pattern Microsoft's compiler uses to fill uninitialized memory in debug builds, precisely so that mistakes like this stand out.
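
For illustration, here is a stripped-down example (hypothetical, not my actual code) of the kind of mistake that produces output like that:

    #include <iostream>

    int main() {
        int counts[25];                 // room for 25 totals...
        int used = 10;                  // ...but only 10 were filled in
        for (int i = 0; i < used; ++i)
            counts[i] = 10 - i;         // stand-in for real counts

        // Bug: looping over the whole array reads memory that was never
        // written. MSVC debug builds fill uninitialized stack memory
        // with 0xCC bytes, so each unused int reads back as 0xCCCCCCCC,
        // which prints as -858993460.
        for (int i = 0; i < 25; ++i)
            std::cout << counts[i] << '\n';

        // Fix: stop the loop at 'used', or zero-initialize the array
        // up front with: int counts[25] = {};
        return 0;
    }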

I still have to test it with over 25 words, and then over 250, to check for proper behavior when the word count exceeds the current array sizes; then I can move on to making it more efficient and doing all the other little things, like removing unwanted words and combining others. To end today on a positive note, I think I'll stop here for the time being.


Sunday, February 17th
Current phase: Planning/Module Coding & Testing

Over the weekend I finally managed to test the word counter on slightly larger data files with more words. It has now proven reliable when the number of distinct words exceeds its maximum capacity, which I am leaning toward making at least 2000 for the initial gathering of data, if only because my 46-kilobyte test file easily broke 500 distinct words. I can only imagine how many words a data set of several gigabytes, written by many people, would contain.

Coding of virtually all the other modules that handle data should be done this week if I am to remain on schedule. It may be a rough week. I also think that planning and design for the interface are going to run well into what was originally supposed to be the coding phase. I'm not too worried about that, because the interface should be relatively simple to build in C#. I'm a little bit more worried about the Flash representation. From what I recall, I should be able to start using Flash this week, but with all the other things I feel I should get done, I probably won't move beyond initial experimentation with it until at least next week.


Monday, February 18th
Current phase: Planning/Module Coding & Testing

I have made a module that excludes words correctly, and I have fixed the cases where the data set has fewer words than asked for, not that I believe that will be a frequent problem. I still have not dealt with data structures for any part of the project: everything is being stored in arrays and accessed via linear search. This will be changed for the first draft. I am still researching appropriate data structures.
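
For reference, the exclusion check currently amounts to something like the following (the names are illustrative and the word list is abbreviated):

    #include <algorithm>
    #include <iterator>
    #include <string>

    // A plain array of excluded words, searched linearly. This is the
    // part I expect to replace with a better structure.
    static const std::string kExcluded[] = {
        "the", "a", "an", "and", "of", "to", "in", "it"
    };

    bool isExcluded(const std::string &word) {
        return std::find(std::begin(kExcluded), std::end(kExcluded), word)
               != std::end(kExcluded);
    }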


Saturday, February 23rd
Current phase: Planning/Module Coding & Testing

Summary of today's work:

• Made new function that increases accuracy of word counting.

• Added a few words to the exclusion list.

• Found a perfect hash function generator for a set of reserved words.

To do next:

• Find or write a function to combine words that belong together as one.

• Change the program to read the input a chunk at a time.

• Read more on how the hash generator works and implement a perfect hash function for the list of reserved words.

• Change the structure holding the words from an array into a splay tree.


Today I coded another function to improve the accuracy of the word count. It does two things. First, it removes special characters from words, so that periods, commas, and quotation marks attached to words don't cause those words to be counted separately from the originals. Second, it uncapitalizes words, so that words at the beginnings of sentences don't count separately either.
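
In rough outline, the function does something like this (a sketch of the idea, not the exact code):

    #include <cctype>
    #include <string>

    // Strip punctuation from the ends of a word, then lower-case it
    // unless it looks like a single capital ("C") or an acronym ("GNU").
    std::string cleanWord(std::string w) {
        while (!w.empty() && !std::isalnum((unsigned char)w.back()))
            w.pop_back();               // trailing . , " and the like
        while (!w.empty() && !std::isalnum((unsigned char)w.front()))
            w.erase(w.begin());         // leading quotation marks

        bool hasLower = false;
        for (char c : w)
            if (std::islower((unsigned char)c)) { hasLower = true; break; }
        if (!hasLower)                  // all caps: leave it alone
            return w;

        for (char &c : w)
            c = std::tolower((unsigned char)c);
        return w;
    }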

I also found an open-source perfect hash function generator, gperf, which I intend to use for the excluded-words list.
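
As I understand the documentation so far, the generator works from a plain list of keywords. An input file for my exclusion list (file name mine) would look something like:

    %%
    the
    a
    an
    and
    of
    is
    are
    may

Running gperf exclude.gperf > exclude_hash.c then produces C code whose in_word_set() function returns non-null exactly when its argument is one of the listed words, in constant time.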

I copied the words on the web page for the generator, found here, into a test file and ran the word counter on it. After calling the function described above, the results improved dramatically. Forty-eight more occurrences of the most common word were found, raising its count by roughly 68%. Moreover, that word happened to be the name of the program the page was describing, which in the first run was being beaten out by "This" and "The". Removing the capitals on "This" and "The" allowed them to be recognized as words that should be discarded, bringing more relevant results into the final data.

Here are the results from before the function was applied:

This 80
The 74
gperf 71
C 64
work 63
code 63
hash 61
GNU 58
Output 51
If 43
License 41
Declarations 35
Gperf 33
A 30
must 30
Previous: 29
Up: 29
keywords 28
function 27
Input 26
Format 25
search 24
Next: 24
same 24
copyright 24

And here are the results after:

gperf 119
C 113
code 103
license 103
work 100
output 79
hash 72
declarations 67
generated 65
input 62
GNU 58
file 56
program 55
table 48
keywords 47
options 46
function 45
source 45
keyword 44
functions 44
covered 41
format 40
copyright 40
perfect 37
previous 36

Note that C and GNU are still in capital letters: the function is intelligent enough to leave single capitalized letters and typical acronyms alone. Proper nouns will, of course, be all lower case in the results. With any luck, the user will be able to edit the resulting data to fix such small errors. As long as all the words are counted properly, this should not be an issue.

The words "is", "are", and "may" were also added to the exclusion list.


Sunday, February 24th
Current phase: Module Coding & Testing

I think the time has come for a reflection on how the project is progressing, due mostly to the fact that as of tomorrow I am supposed to be exiting my design phase and moving toward a first draft. First I will summarize how I believe I am doing in terms of progress, and second I will briefly describe my current design.

Although I feel a little bit behind, thanks to recent simplifications to my design I am not worried, provided things keep progressing steadily as they have been. In other words, I am still confident that I will have a working draft of the data-processing part of the project by March 13th, or shortly thereafter.

The second part, the actual visual representation of the data, is another story. I have not done anything on that part of the project, and I will be lucky if I can construct a good design, with some testing, by the time my first draft of the data processor is done. I pray things go smoothly, but I have no way to know.

As far as the design of the data processor is concerned, it will be a console application. Two passes will be made over the data: one to find out which words are most frequent, and a second to log where they occur. On the first pass, every time the program has counted X distinct words, an output file will be generated with a portion of the totals, and at the end of processing those partial totals will be added together. This step-by-step approach was suggested by my professors, Dr. Pankratz and Dr. McVey, as a way to achieve some results in real time, but it will also increase accuracy while saving memory space, as well as preserving some of the work already done in the event of a computer failure or other interruption. To that end, I would also like to save the file-pointer state every time a file completes.
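
In code terms, the checkpoint idea might look roughly like this (the names and the threshold are placeholders, not settled choices):

    #include <cstddef>
    #include <fstream>
    #include <string>
    #include <unordered_map>

    const std::size_t CHUNK = 100000;   // X: distinct words per dump

    // Once the in-memory table holds CHUNK distinct words, write it to
    // a numbered partial file and start fresh. A final merge step reads
    // the partial files back and sums the counts per word.
    void dumpPartial(std::unordered_map<std::string, long> &counts,
                     int &fileNo)
    {
        std::ofstream out("partial" + std::to_string(fileNo++) + ".txt");
        for (const auto &entry : counts)
            out << entry.first << ' ' << entry.second << '\n';
        counts.clear();
    }

    // In the counting loop:
    //     ++counts[word];
    //     if (counts.size() >= CHUNK) dumpPartial(counts, fileNo);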

The second file pass will take the final word list, stored in a hash table for quick access, and find the locations of each word, saving their context sentences and which words occur in each of those sentences. The final data will be structured as two files: one holding the top Y words and the locations of each within the sentence data, and the other holding the sentence data itself.
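
As a rough sketch (the exact formats are still undecided), the two files might be laid out like this:

    top_words.txt:
        <word> <count> <sentence id> <sentence id> ...

    sentences.txt:
        <sentence id> <sentence text>

where the sentence ids after each word point into the second file, so the visual front end can pull up every context sentence for a word without rescanning the raw data.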

