My current project status:
I am using a Windows Form for my project. The words and their frequencies are read from a simple text file and then stored in a List of my tagLabel struct (a tagLabel contains a string and a frequency). I have not yet written the code that takes a set of data and produces this text file.
On my user interface, there is a panel in which I am drawing text graphics to represent the cloud. I currently have an intersection method, but have not yet worked out my placement method. I intend to place my tags in a spiral fashion, working from the middle and making my way out. My current algorithm compares the word I am placing to the last placed word, but it does not compare it to any others that came before it, resulting in overlapping words.
My thought after today is that I will represent the panel with a 2D array containing 1's and 0's. The panel will be broken into cells of something like 5 pixels x 5 pixels or 10 pixels x 10 pixels. Then, when I am placing my rectangles, I will update this array with 1's where the rectangle lies. When I begin the next rectangle, I will be sure to place it in a space that does not yet contain any text. This will be a good next step, but will introduce a significant amount of white space. From there, my intention is to "scrunch" the words in, either when I am finished going through my list or when I have run out of room and cannot place any more words.
My user interface does allow users to change the color scheme of a tag cloud, but it is missing the shape aspect (a huge component of a tag cloud). My intent is to have "bounds" for each available shape - it is unclear whether I will draw all the tags and then use a cookie cutter to get rid of the extra tags, or if I will create the shape as I go.
I have completed my IntersectsWith function to determine if two tags are overlapping. I have also placed an updated copy of my project in the Documentation section of my website.
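As a rough illustration of the 2D-array idea combined with spiral placement, here is a minimal sketch. It is written in Python rather than the project's C#, and every name in it (OccupancyGrid, spiral, place_tags), the 0.3 radian angular step, and the 5 pixel spiral spacing are assumptions made for the example, not code from the project. Because each new rectangle is checked against the grid, it is effectively compared with every tag placed so far, not just the last one.

```python
import math
from typing import Iterator, List, Optional, Tuple

Rect = Tuple[int, int, int, int]  # (x, y, width, height) in pixels


class OccupancyGrid:
    """Panel divided into square cells; a cell holds 1 once a tag covers it."""

    def __init__(self, width: int, height: int, cell: int = 10) -> None:
        self.width, self.height, self.cell = width, height, cell
        self.cols = math.ceil(width / cell)
        self.rows = math.ceil(height / cell)
        self.cells = [[0] * self.cols for _ in range(self.rows)]

    def _span(self, rect: Rect) -> Tuple[range, range]:
        # Rows and columns of every cell the rectangle touches.
        x, y, w, h = rect
        cols = range(x // self.cell, (x + w - 1) // self.cell + 1)
        rows = range(y // self.cell, (y + h - 1) // self.cell + 1)
        return rows, cols

    def fits(self, rect: Rect) -> bool:
        x, y, w, h = rect
        if x < 0 or y < 0 or x + w > self.width or y + h > self.height:
            return False  # reject anything outside the panel
        rows, cols = self._span(rect)
        return all(self.cells[r][c] == 0 for r in rows for c in cols)

    def mark(self, rect: Rect) -> None:
        rows, cols = self._span(rect)
        for r in rows:
            for c in cols:
                self.cells[r][c] = 1


def spiral(cx: float, cy: float, limit: float,
           step: float = 5.0) -> Iterator[Tuple[int, int]]:
    """Yield points on an outward spiral; radius grows by `step` per turn."""
    theta = 0.0
    while True:
        radius = step * theta / (2 * math.pi)
        if radius > limit:
            return  # ran out of room
        yield (round(cx + radius * math.cos(theta)),
               round(cy + radius * math.sin(theta)))
        theta += 0.3  # hypothetical angular step


def place_tags(sizes: List[Tuple[int, int]],
               grid: OccupancyGrid) -> List[Optional[Rect]]:
    """Place each (width, height) on the first free spiral point, else None."""
    placed: List[Optional[Rect]] = []
    limit = max(grid.width, grid.height)
    for w, h in sizes:
        spot = None
        for px, py in spiral(grid.width / 2, grid.height / 2, limit):
            rect = (px - w // 2, py - h // 2, w, h)  # centre rect on point
            if grid.fits(rect):
                grid.mark(rect)
                spot = rect
                break
        placed.append(spot)
    return placed
```

Calling place_tags([(120, 40), (80, 30)], OccupancyGrid(400, 300)) centres the first rectangle on the panel and walks the spiral for the second. A coarser cell size makes the check cheaper but leaves more white space around each tag, which is the trade-off the "scrunch" pass would have to clean up.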
Things to start tonight:
- Data Gather: I would like to make it so the user can input whatever source they like and the application will generate the cloud from it, with options for them to input their own "common words" vs. using the list I generate.
- Swirling placement: my algorithm currently places tags randomly and I would like to begin looking at a more organized way to place the tags.
Huzzah - another post tonight.
I have found an algorithm I believe could work called hierarchical bounding boxes or bounding volume hierarchies, originally discussed here: http://static.mrfeinberg.com/bv_ch03.pdf
Here is also a nice video on bounding volume hierarchies: https://www.youtube.com/watch?v=QDgkBPGQDXA
This recursively breaks a rectangle down into smaller rectangles to better define a boundary around a non-rectangular shape. It stores the bounds in a tree (if a rectangle is a bound, add it to the tree and keep going; otherwise, just keep going). The stop condition would be when the rectangles are cut down to a certain size. My problem is this - how do I tell which squares are considered to be bounds and which are not? More to come...
I need to do more research on how to tell if my labels are overlapping. I found a C# method of Rectangle called IntersectsWith that returns a boolean indicating whether one rectangle intersects with another. It looks like this method can only be used with rectangles.
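For reference, the usual overlap test for two axis-aligned rectangles is short enough to sketch. The version below is Python, not C#, and the Rect class, intersects_with, and overlaps_any are hypothetical names for illustration. It assumes that rectangles which only share an edge should not count as overlapping, which I believe matches how IntersectsWith behaves, but that is worth checking against the documentation.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Rect:
    x: int
    y: int
    width: int
    height: int

    def intersects_with(self, other: "Rect") -> bool:
        # Two rectangles overlap only if they overlap on both axes.
        # Strict inequalities: merely touching edges is not an overlap.
        return (self.x < other.x + other.width
                and other.x < self.x + self.width
                and self.y < other.y + other.height
                and other.y < self.y + self.height)


def overlaps_any(candidate: Rect, placed: Iterable[Rect]) -> bool:
    """True if the candidate overlaps any rectangle placed so far."""
    return any(candidate.intersects_with(r) for r in placed)


a = Rect(0, 0, 100, 30)
b = Rect(50, 10, 100, 30)   # overlaps a
c = Rect(100, 0, 50, 30)    # only touches a's right edge
print(a.intersects_with(b), a.intersects_with(c))  # True False
```

Checking a new tag with overlaps_any against the full list of placed rectangles, rather than only the most recent one, is what keeps earlier tags from being covered.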
With this information in mind, my thought is to create a rectangle for each tag and write the tag inside the rectangle. Then, it will be easy to test whether or not they are overlapping. I don't think this will eliminate a lot of white space, though. I would like to explore other options and need to do more research. For now, my next step is to create a program with just rectangles and randomly place them, testing for overlap as I go.
I have made some progress, but not nearly as much as intended. Learning C# has been a major setback. I think I was looking for help in the wrong places - I was browsing through sites to learn C# as quickly as possible, but it wasn't coming back to me and I was getting frustrated. After wasting numerous hours, I finally decided to just start my code and go from there - it was a good idea because things came rushing back quickly.
I started by making a program that randomly places 5 words on the screen. It does not yet check if words are overlapping. That is tonight's task. I have placed my up-to-date code in the Documents section of my website if you would like to take a peek. I have dated and documented some of the functions that I wrote, but still need to do some more. I would like to keep the working copy of my program in the documents location and I will try to be as vigilant as possible with this. More to come on tag overlap handling later tonight...
Can't seem to get to sleep tonight, so I decided to do some research. I found an algorithm to determine where to place the next word on my tag cloud!
The article attached below discusses a common tag cloud generator named Wordle. Its algorithm uses an ongoing spiral to place all words. Start with one word in the middle. For the next word, move up a "space" and check if the words are overlapping. If not, place the word. If they are, move another space along the spiral and check again. The only problem I see is that this algorithm could be lengthy. If my spaces are too short, the application will spend a lot of time testing, but space on the cloud will be maximized. If spaces are too long, the cloud will be extremely spaced apart, but the application will run more efficiently. A "just right" spacing will need to be worked out.
Although this helps a little and gives me more direction than I had before, I am unsure how to check if two words will be overlapping with C#. One step forward and one step back. Back to the books for me.
https://stackoverflow.com/questions/342687/algorithm-to-implement-a-word-cloud-like-wordle
1. Gather
I must gather data to use in my tag cloud. One thought that I had was to have a Door County themed tag cloud that utilizes data from a travel website or possibly from Facebook reviews of the Door County Visitor Bureau. I am not sure how to get the data out of Facebook reviews for a certain place, so I would need to do more research. Otherwise, it would be easy to grab the HTML behind a travel website to get my data. I may need to personalize my project website a bit more to make it a better fit if this is the route I would like to go. In reality, my code will be built so that it can be utilized for any data source, so it should not matter.
2. Analysis
Once the data is gathered, I must analyze it. The files that will need to be provided are the data file itself and a list of common words to ignore, e.g. and, the, am, to. If getting data from a website, the common words could be edited to ignore HTML keywords. From there, the program would parse the data file. It will build a string character by character until it reaches a special character or space - apostrophes and dashes would be considered part of a word. Once it reaches a special character or space, it will check if the string is a common word. If yes, it will be ignored. If no, it will then check if the word has already been added to "the list". I call it "the list" because I have not yet determined the data structure I intend to use (probably a linked list because they were always my favorite - each node stores the word and the frequency). If the word is in "the list", its frequency will be increased by one. If it is not in "the list", it will be added. "The list" will be kept in some order, either alphabetical or based upon frequency, so that it will be easy to search. Then the program will continue to the next word in the data file.
3. Representation
After generating "the list", the application will need to read it and visually represent it on the screen.
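Generating "the list" as described in the Analysis step might look something like the sketch below. It is Python rather than C#, and it swaps in a dictionary-style Counter for the linked list mentioned above, simply because it keeps the example short. The function name count_words, the lowercasing of words, and the trimming of stray leading or trailing apostrophes and dashes are all assumptions of the sketch, not decisions from the plan.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def count_words(text: str,
                common_words: Iterable[str]) -> List[Tuple[str, int]]:
    """Return (word, frequency) pairs, most frequent first."""
    ignore = {w.lower() for w in common_words}
    counts: Counter = Counter()
    current: List[str] = []  # characters of the word being built

    def flush() -> None:
        # Assumption: compare case-insensitively and trim stray ' and -
        word = "".join(current).strip("'-").lower()
        current.clear()
        if word and word not in ignore:
            counts[word] += 1  # adds the word on first sight

    for ch in text:
        if ch.isalnum() or ch in "'-":
            current.append(ch)  # apostrophes and dashes stay in the word
        else:
            flush()  # a space or special character ends the word
    flush()  # do not lose the final word
    return counts.most_common()


sample = "The bay and the lighthouse - Door County's bay is the bay."
words = count_words(sample, ["and", "the", "am", "to", "is"])
```

Here words starts with ("bay", 3), followed by the three words that appear once. Returning the pairs already sorted by frequency means the display step would not need a separate sort.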
The font size of each word will be set based upon its frequency - if "the list" is not sorted by frequency, then I will need to sort it here before displaying the words. It is probably best to start near the middle with the most frequent words - that way, the most important words will be on the screen for sure. Slowly, the other words will be added around them. I will have set boundaries and if the application attempts to place a word that is not within the boundaries, the word will be rejected. There are a couple of things that definitely need to be thought about here:
1. How do I make it so that no words overlap?
2. How do I handle words that do not fit? Do I try to put them elsewhere or are they entirely rejected? How do I control this?
I am assuming my application will need the ability to zoom in because making the outline of Door County look accurate will be tough unless a fairly small font size is used. I will have to evaluate that when I get to that point, though.
I haven't dedicated nearly as much time to this over the last week as I would have liked. However, I have had some time to develop a plan of where to start.
I would like to start by generating a "fake" data file with words and their frequencies. From there, I will read it into my application and display the words on the screen. For my initial test of this, I do not intend to start with any sort of shape or any particular order - that can always come later. Below is an example of what my first tag cloud may look like.
tag cloud illustration example project plan computer programming data fun
My thinking is that a C# application will be the best way to accomplish this (and it is what I have noticed many others who did this project chose). My experience with C# is minimal, so it will be something I have to review.
Today, I began to build my website and, hence, here I am writing my first blog post.
Last week, we were presented with our capstone projects and I received Illustrating Text by Tag Clouds. To read more about the project itself, please see the Project Description section. I was sitting on my couch at home when I received my project and after explaining it to my boyfriend, he asked, "Is that supposed to be hard?" Pretty discouraging. After thinking about the project and what it would entail, my excitement seemed to grow. There are pieces of the project that interest me and other pieces that will greatly challenge me. It's always incredible to me how much instructors seem to know about their students. After talking with Dr. Pankratz last week, we identified some initial steps and those will be my focus over the next few days:
1. Investigate tag clouds - both prior projects and what the Internet has to offer
2. Create a project plan