Michael Klosiewski's CSCI460 project weblog

Date (all year 2013)	Entry	Current phase
Friday February 1st	After deciding to attempt to install Flash on my computer despite it not meeting the system requirements, I found out that the disk that came with the book is, in fact, not a copy of Flash. Now I need to either find a new place to work, or, once again, a new medium to work in. I'm not going to give up on Flash yet. I'll see what can be done about a place to work. On top of that, with three other time-consuming classes to take, I haven't really gotten anything done this week. I have ideas, but I have yet to act on them--a precarious place to be. I want to work this weekend, and I think I will have time, but I have yet to plot out the actual times I will do so. Since we have been asked to detail our timelines and put them on here, here is mine: February 24th, three weeks and two days from today, planning should be done. This includes knowing all tasks that have to be taken care of, what algorithms and data structures I am going to use to do so and what languages they will be written in, and simple tests to see that they do what I hope they will do. March 12th, two weeks and two days from the end of the planning phase, coding should be done. This means having a basic working copy of all parts of the project. At a minimum, these working copies will perform the initial goals of the project when they are used correctly. April 6th, three weeks and four days after coding is complete, all parts of the project should be robust, having been rigorously tested, and should be fully documented. Any expected features that may have been lacking in the first draft should have been improved. I should be able to submit the project with confidence at this point. Following April 6th, I will attempt to add any features that could not be added into the second draft. In addition, I will try to add features to make the project more usable. This might include multiple ways to open files, displaying more information detailing processing, and others. 
Backup copies will always be kept (throughout the entire development, but especially here), because if a feature cannot be implemented, I should be able to submit the latest working version when the project comes due. - Mike	Planning
Sunday February 3rd	Progress over the weekend: I have looked at hash algorithms, which will be important in looking up words quickly, and decided on a couple I'd like to try for sample data in testing. I have also developed a comprehensive plan for how I am going to enable the program to handle large amounts of data as efficiently as possible, and in the process I have solved the issue of keeping data true to the original problem of "What are the top 25 words?" In other words, there should be very little risk of picking the wrong words due to not keeping enough candidate words, but at the same time I shouldn't need to worry about storing an astronomical amount of data at any one time. As part of this plan, I have separated what needs to be working and in place in each draft according to the self-imposed deadlines set out in my previous blog entry, at least on the data processing end.
Next steps to take:
- Talk to professors about design, both short term and long run
- Decide, based on design, what the Windows application needs to look like, and create a dummy interface that contains all those things
- Either get working on figuring Flash out, or move to some other graphics production method
I feel like I am already late on that third requirement, and I am eager to get working on it. I'm leaning much more toward doing data processing in C++ rather than C#, because I think I can avoid processing overhead that way, and I am still more familiar with C++ when it comes to handling data. The Windows interface will still be built in C#, because C# is a very good tool for making Windows applications. I have a vague desire to process the data in C, because that would be one step further in the direction of fast processing, but with all of the strings I'm going to be dealing with I don't feel I have the courage to do so. Perhaps that's for the better, as I am functionally still a novice in C, which is more different from C++ than I originally imagined.	Planning/Design
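To make the idea of hashing words for quick lookup concrete, here is a minimal sketch in C++. The entry does not name the hash algorithms under consideration, so FNV-1a is only an assumed stand-in, and `bucket_for` is a hypothetical helper rather than part of the project.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// FNV-1a, 32-bit variant. Chosen purely as an example; the weblog does
// not say which hash algorithms were actually shortlisted.
std::uint32_t fnv1a(const std::string& word) {
    std::uint32_t hash = 2166136261u;   // FNV offset basis
    for (unsigned char c : word) {
        hash ^= c;                      // mix in one byte of the word
        hash *= 16777619u;              // multiply by the FNV prime
    }
    return hash;
}

// Hypothetical helper: map a word to one of bucket_count table slots.
std::size_t bucket_for(const std::string& word, std::size_t bucket_count) {
    return fnv1a(word) % bucket_count;
}
```

The same word always lands in the same slot, so a lookup only has to examine one bucket instead of scanning every stored word.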
Monday February 11th	This is really more like two entries in one, as I should have written last night after my efforts this past weekend. I spent a decent chunk of the previous two days trying to write a simple program that counts the words in a text file. Unfortunately, although it runs, I have not yet been able to make sense of the output. Since I was not able to work on it today, my hope is that I will have time tomorrow, and that perhaps I will even be able to finish that part so that I can move on to the smaller parts of the algorithm that look at throwing out words, combining similar words and making sure I am not bringing in special characters with the words, like periods and commas at the ends of words. I feel like I was able to get some good ideas at the group meeting today, but after some consideration I still feel like I need to make two passes through the data to collect all the information I need without using tremendous resources. I also suspect that processing the data that way may be faster. I am going to change what I was going to do a little bit though. Rather than splitting the task of counting words across both passes, I am going to do all the counting in the first pass, processing all the data rather than a sample of it. Because I am not trying to simultaneously count words and keep track of where they are, this should run relatively fast. By the second pass, I will have my top X words without a doubt, and can focus on gathering sentence data about only those words. This should simplify organization quite a bit. I will create a full plan for this tomorrow.	Planning/Design/Module Coding
Tuesday February 12th	I got my word counter working. It now writes the top ten words of a test file to another file, along with their occurrence counts, in proper order (greatest first). The only strange thing it's doing is printing fifteen lines of five spaces followed by -858993460. I just figured out why it's doing this though, so it should be an easy fix. I also learned that the number is 0xcccccccc (in hex), the pattern Microsoft's debug builds write into uninitialized stack memory, which also helps them detect buffer overflows, or so says this site at least. I still have to test it with over 25 words, and over 250 words to test proper behavior with numbers over the current array sizes, and then I can move on to making it more efficient and doing all the other little things like removing unwanted words and combining others. To end today on a positive note, I think I'll stop here for the time being.	Planning/Design/Module Coding
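A quick way to confirm that -858993460 and 0xcccccccc are the same 32 bits is to reinterpret the pattern as a signed integer. The helper name `as_signed` below is made up for this sketch.

```cpp
#include <cstdint>
#include <cstring>

// Reinterpret a 32-bit pattern as a signed two's-complement integer.
std::int32_t as_signed(std::uint32_t bits) {
    std::int32_t value;
    std::memcpy(&value, &bits, sizeof value);  // well-defined bit copy
    return value;
}
```

Calling `as_signed(0xCCCCCCCCu)` gives -858993460, since 0xCCCCCCCC is 3435973836 unsigned and 3435973836 - 4294967296 = -858993460. Seeing that value printed is therefore a strong hint that an array slot was read before anything was stored in it.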
Sunday February 17th	Over the weekend I finally managed to test the word counter on slightly larger data files with more words. It has now been found to be reliable with a number of words exceeding its maximum capacity, which I am leaning toward making at least 2000 for the initial gathering of data, if only because my 46 kilobyte file easily broke 500 words. I can only imagine how many words a data set of several gigabytes would have if written by many people. Coding of virtually all other modules that handle data should be done this week to remain on schedule. It may be a rough week. I also think that planning and design for the interface are going to run well into what was originally supposed to be the coding phase. I'm not too worried about that because it should be relatively simple using C#. I'm a little bit more worried about the Flash representation. From what I recall I should be able to start using it this week, but due to all the other things I feel I should get done, I probably won't move beyond initial experimentation with it until at least next week.	Planning/Module Coding & Testing
Monday February 18th	I have made a module that excludes words correctly, and fixed cases where the data set has fewer words than asked for, not that I believe that will be a frequent problem. I have still not dealt with data structures for any part of the project. Everything is being stored in arrays and accessed via linear search. This will be changed for the first draft. I am still researching appropriate data structures.	Planning/Module Coding & Testing
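As a rough picture of the current state described here (an exclusion list held in an array and checked by linear search), consider the following C++ sketch. The function names and the use of `std::vector` are assumptions; the real module may be organized differently.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Linear search through the exclusion list, matching the array-based
// approach the entry describes.
bool is_excluded(const std::string& word,
                 const std::vector<std::string>& excluded) {
    return std::find(excluded.begin(), excluded.end(), word) != excluded.end();
}

// Return the words that survive the exclusion list, order preserved.
std::vector<std::string> remove_excluded(
        const std::vector<std::string>& words,
        const std::vector<std::string>& excluded) {
    std::vector<std::string> kept;
    for (const std::string& w : words) {
        if (!is_excluded(w, excluded)) kept.push_back(w);
    }
    return kept;
}
```

Each check costs time proportional to the length of the exclusion list, which is why swapping in a hash-based lookup is worth doing before the data sets get large.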
Saturday February 23rd	Summary of today's work:
• Made new function that increases accuracy of word counting.
• Added a few words to the exclusion list.
• Found a perfect hash function generator for a set of reserved words.
To do next:
• Find or write a function to combine words that belong as one.
• Change the program to read chunks at a time.
• Read more on how the hash generator works and implement a perfect hash function for the list of reserved words.
• Change the structure holding the words from an array into a splay tree.
Today I coded another function to improve the accuracy of the word count. It does two things. First, it removes special characters from words so that periods, commas and quotation marks attached to words don't end up counting those words as separate from the originals. Second, it lowercases words so that words at the beginnings of sentences don't count separately either. I also found an open source perfect hash function generator, which I intend to use for the excluded words list. I copied the words on the web page for the generator, found here, into a test file and ran the word counter on it. After calling the function described above, the results improved dramatically. Forty-eight more occurrences of the most common word were found, raising its perceived occurrence by roughly 68%. Moreover, that word happened to be the name of the function that the page was describing, which was getting beaten out in the first run by "This" and "The". Removing the capitals on "This" and "The" allowed them to be recognized as words that should be discarded, bringing more relevant results into the final data. 
Here are the results from before the function was applied:
This 80
The 74
gperf 71
C 64
work 63
code 63
hash 61
GNU 58
Output 51
If 43
License 41
Declarations 35
Gperf 33
A 30
must 30
Previous: 29
Up: 29
keywords 28
function 27
Input 26
Format 25
search 24
Next: 24
same 24
copyright 24
And here are the results after:
gperf 119
C 113
code 103
license 103
work 100
output 79
hash 72
declarations 67
generated 65
input 62
GNU 58
file 56
program 55
table 48
keywords 47
options 46
function 45
source 45
keyword 44
functions 44
covered 41
format 40
copyright 40
perfect 37
previous 36
Note that C and GNU are still in capital letters. The function is intelligent enough to leave single capitalized letters and typical acronyms alone. Proper nouns will, of course, be all lower case in the results. With any luck the user should be able to edit resulting data to fix such small errors. As long as all the words are counted properly, this should not be an issue. The words "is", "are", and "may" were also added to the exclusion list.	Planning/Module Coding & Testing
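The cleanup function described in this entry might be sketched in C++ as follows. The exact rules are not given in the weblog, so two assumptions are made here: only leading and trailing non-alphanumeric characters are stripped, and a token counts as a "single capitalized letter or typical acronym" when every remaining character is an uppercase letter. The name `normalize_word` is hypothetical.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Strip punctuation from both ends of a token and lowercase it, unless
// the token is entirely uppercase (assumed rule for acronyms such as
// "GNU" and single letters such as "C").
std::string normalize_word(const std::string& raw) {
    auto is_alnum = [](unsigned char c) { return std::isalnum(c) != 0; };
    auto first = std::find_if(raw.begin(), raw.end(), is_alnum);
    auto last = std::find_if(raw.rbegin(), raw.rend(), is_alnum).base();
    if (first >= last) return "";              // nothing but punctuation

    std::string word(first, last);
    bool all_upper = std::all_of(word.begin(), word.end(),
        [](unsigned char c) { return std::isupper(c) != 0; });
    if (all_upper) return word;                // leave "C", "GNU" alone

    std::transform(word.begin(), word.end(), word.begin(),
        [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return word;
}
```

Under these assumed rules "This" becomes "this" and "Previous:" becomes "previous", while "C" and "GNU" pass through unchanged. A side effect of the uppercase rule is that the article "A" and the pronoun "I" also stay capitalized.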
Sunday February 24th	I think the time has come for a reflection on how the project is progressing, due mostly to the fact that as of tomorrow I am supposed to be exiting my design phase and moving more toward making a first draft. First I will summarize how I believe myself to be doing in terms of progress, and second I will briefly describe my current design. Although I feel a little bit behind, because of recent simplifications to my design, I am not worried if things keep progressing steadily as they have been. In other words, I am still confident that I will have a working draft of the part of the project that handles data processing by March 13th, or shortly thereafter. The second part, the actual visual representation of data, is another story. I have not done anything on that part of the project, and will be lucky if I can construct a good design with some testing by the time my first draft of the data processor is done. I pray things go smoothly, but I have no way to know. As far as the design of the data processor is concerned, it will be a console application. Two passes will be made on the data being processed, one to find out which words are most frequent, and the second to log where they occur. On the first pass, every time the program counts X different words an output file will be generated with a portion of their totals, and at the end of processing those totals will be added. This step-by-step approach was suggested by my professors Dr. Pankratz and Dr. McVey as a way to achieve some results in real time, but it will also increase accuracy while saving memory space, as well as preserving some of the work done in the event of computer failure or other interruption. For the same reason, I would also like to try to save the file pointer state every time an output file completes. 
The second file pass will take the final word list, stored in a hash table for quick access, and find the locations of each word, saving each context sentence and the words located in it. The structure of the final data will be two files: one that has the top Y words and the locations of each within the sentence data, and one that holds the sentence data itself.	Module Coding & Testing
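The two steps in this design (adding up the partial totals from the first pass, then locating the final words in their sentences on the second pass) can be sketched in C++ as below. Everything here is an assumption made for illustration: the type and function names, keeping the data in memory instead of in files, splitting sentences on '.', '!' and '?', and matching whole whitespace-separated tokens that have already been cleaned up.

```cpp
#include <cctype>
#include <cstddef>
#include <map>
#include <sstream>
#include <string>
#include <unordered_map>
#include <unordered_set>
#include <vector>

using Counts = std::unordered_map<std::string, std::size_t>;

// End of pass 1: add up the partial totals written out along the way.
Counts merge_totals(const std::vector<Counts>& partials) {
    Counts total;
    for (const Counts& part : partials)
        for (const auto& entry : part) total[entry.first] += entry.second;
    return total;
}

struct SecondPassResult {
    std::vector<std::string> sentences;                         // sentence data
    std::map<std::string, std::vector<std::size_t>> locations;  // word -> sentence indexes
};

// Pass 2: keep only sentences containing a top word and record, for
// each top word, which kept sentences it appears in.
SecondPassResult second_pass(const std::string& text,
                             const std::unordered_set<std::string>& top_words) {
    SecondPassResult result;
    std::string sentence;
    auto flush = [&]() {
        std::istringstream tokens(sentence);
        std::string word;
        bool kept = false;
        while (tokens >> word) {
            if (top_words.count(word) == 0) continue;
            if (!kept) { result.sentences.push_back(sentence); kept = true; }
            std::size_t index = result.sentences.size() - 1;
            std::vector<std::size_t>& where = result.locations[word];
            if (where.empty() || where.back() != index) where.push_back(index);
        }
        sentence.clear();
    };
    for (char c : text) {
        if (c == '.' || c == '!' || c == '?') { flush(); continue; }
        if (sentence.empty() && std::isspace(static_cast<unsigned char>(c))) continue;
        sentence += c;
    }
    flush();   // handle trailing text with no closing punctuation
    return result;
}
```

In this sketch `sentences` plays the role of the sentence data file and `locations` the role of the word file, with each word pointing at sentence indexes rather than byte offsets.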
