Tuesday, October 15, 2013

Parallel Coordinates

You can download my dataset from http://www.sci.utah.edu/~mavinm/cs6630/parallelCoordinates.tar.gz.  Note that the download requires authentication; please contact me if you would like to request access.  The dataset in this archive is the cars dataset.  If you want the other dataset, with cameras, you can download http://www.sci.utah.edu/~mavinm/cs6630/parallelCoordinatesData2.tar.gz.  Use the same password.

For the Parallel Coordinates assignment, I decided to work with the "Cars" dataset provided on the class web page.  I also decided to reuse "FloatTable", the CSV-style table reader from the last assignment.  Because the "Cars" dataset is in a different format, I had to modify the "FloatTable" class to also read the first column as data and to split values on spaces and newlines rather than tabs.
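
The gist of the change is just in how the file is tokenized.  Here's a minimal sketch of the idea, not my actual FloatTable code (the filename and array names are placeholders):

    // Split rows on newlines and values on spaces, keeping the first
    // column as data instead of treating it as a row label.
    String[] rows = loadStrings("cars.txt");  // placeholder filename
    float[][] data = new float[rows.length][];
    for (int i = 0; i < rows.length; i++) {
      String[] pieces = splitTokens(rows[i], " ");  // spaces, not tabs
      data[i] = new float[pieces.length];
      for (int j = 0; j < pieces.length; j++) {
        data[i][j] = float(pieces[j]);  // first column included too
      }
    }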

Before drawing a parallel coordinates system, I first sketched out a design I thought would work well.

As you can see, I really enjoyed playing with all the datasets, and the sketches share a common design layout.  My favorite example graph was the "Nutrient Contents" one from the class web page.

With the data loading in place, the first thing I decided to do was draw the axes with labels.  I wanted the title at both the top and the bottom of the graph so you know what you are looking at whether you're focused on the top or the bottom.


I tried this vertical axis format on all of the test cases and had to adjust the graphs accordingly.  If an axis spans a range larger than 100, I don't show decimal places, and I also drop the decimal places wherever the value is in fact an integer.  After all of these changes to make the labels look nice, I got the following output.
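
The label formatting boils down to a small helper along these lines (a sketch; the two-decimal fallback is just an example, not necessarily what my code uses):

    // Drop decimals when the axis range exceeds 100 or the value is a
    // whole number; otherwise keep a couple of decimal places.
    String formatTick(float v, float range) {
      if (range > 100 || v == int(v)) {
        return str(int(v));
      }
      return nf(v, 1, 2);
    }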


Things are starting to look the way I'd like them to.  Now it's time to render a static image of all the connecting lines on the graph.  This gave me serious trouble, since my computer renders slowly with the default renderer Processing uses.  With the help of the class tutor, I found that passing P2D as the renderer in my screen size parameters makes it render much more quickly on my computer.  Below is the result.
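
The whole fix is one extra parameter in setup() (the width and height here are placeholders):

    void setup() {
      // P2D renders much faster on my machine than the default renderer.
      size(1000, 600, P2D);
    }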

In this visualization, I was not able to include everything from the sketch.  The sketched features I didn't get to were interactivity for selecting how many data points to show and clustering the groups; I had sketched too far ahead of my implementation.  Another feature I couldn't add was interpolation.  I initially tried building the lines out of vertex calls, but ran into what appears to be a Processing bug where those coordinates weren't translated correctly, so I had to revert to another style of drawing lines.
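
The fallback drawing style I settled on just draws each row as individual line segments between neighboring axes.  A sketch, with placeholder field names:

    // Draw one data row as plain line segments instead of a vertex shape.
    void drawRow(float[] row, float[] axisX, float[] axisMin, float[] axisMax,
                 float plotTop, float plotBottom) {
      for (int col = 0; col < row.length - 1; col++) {
        float y1 = map(row[col], axisMin[col], axisMax[col], plotBottom, plotTop);
        float y2 = map(row[col+1], axisMin[col+1], axisMax[col+1], plotBottom, plotTop);
        line(axisX[col], y1, axisX[col+1], y2);
      }
    }

With the drawing sorted out, I will now be sketching the interactivity.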



As you can see, I wanted several interactive features that would make the tool nice and useful.  Starting with the top-left sketch: I want you to be able to grab a title and scroll it horizontally as you move your mouse.  I also want a vertical filter handle that you drag down an axis to filter the data you want to look at.  In the top-right case, I want to be able to reorder the axes by grabbing one axis and moving it past another.  In the bottom-left, I want to be able to focus in on one of the data lines; the other lines will be faded out unless selected.  I also want to show the percentage of data rows selected at the top right of the page, as shown in the bottom-center sketch.  Finally, the axes need to be invertible, so I'm going to make the right-click button invert an axis.

The first thing I worked on was inverting the axes.  This was a simple tool to write, and it's useful when data lines cross over.  I highlight an axis when you hover over it, and a message at the bottom states that you can right-click to invert that axis.  Below are the results.
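
Inversion amounts to one flag per axis that swaps the ends of the value-to-pixel mapping.  Roughly (the hit test is a placeholder):

    boolean[] inverted;  // one flag per axis

    float axisY(int col, float v, float lo, float hi, float top, float bottom) {
      // An inverted axis simply puts high values at the bottom.
      return inverted[col] ? map(v, lo, hi, top, bottom)
                           : map(v, lo, hi, bottom, top);
    }

    void mousePressed() {
      if (mouseButton == RIGHT) {
        int col = axisUnderMouse();  // placeholder hit test
        if (col >= 0) inverted[col] = !inverted[col];
      }
    }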


Now I'm going to work on reordering, where you can swap the positions of different axes.  This involves a grabbing-hand tool that lets me grab an axis and translate it along the x-axis.  The version I created works like a charm.  I added reset positions so every axis knows where to snap when you release the button.  Below is an example of the output with the axes mixed up.  You simply press the left mouse button while an axis is highlighted and drag it horizontally.
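
Roughly how the grab-and-snap behaves (the hit test and column swap are placeholders):

    int grabbed = -1;   // axis being dragged, -1 if none
    float[] axisX;      // current x position of each axis
    float[] slotX;      // fixed "home" positions the axes snap back to

    void mousePressed() {
      if (mouseButton == LEFT) grabbed = axisUnderMouse();  // placeholder hit test
    }

    void mouseDragged() {
      if (grabbed >= 0) axisX[grabbed] = mouseX;  // follow the mouse
    }

    void mouseReleased() {
      if (grabbed < 0) return;
      // Snap to the nearest slot, swapping column order if needed.
      int nearest = 0;
      for (int i = 1; i < slotX.length; i++) {
        if (abs(slotX[i] - axisX[grabbed]) < abs(slotX[nearest] - axisX[grabbed])) {
          nearest = i;
        }
      }
      swapColumns(grabbed, nearest);  // placeholder data-column reorder
      for (int i = 0; i < axisX.length; i++) axisX[i] = slotX[i];
      grabbed = -1;
    }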

Now I would like to build filtering, where you can select a range of rows on any axis.  First, I created affordances to help the viewer understand what's possible: different cursors that let you know what you can do with the data.  Because the cursor already tells you that you can edit an axis, I took away the hover highlight as redundant.
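
The cursor changes are just a dispatch on what's under the mouse (the hit tests are placeholders):

    void updateCursor() {
      if (overAxis()) {
        cursor(HAND);   // you can grab or invert this axis
      } else if (overFilterHandle()) {
        cursor(CROSS);  // you can drag out a filter range here
      } else {
        cursor(ARROW);
      }
    }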

As for the filtering itself, I can now show the viewer that a selection is being made, even though it doesn't do anything to the data yet.  Below is the box that shows the user they are selecting a specific subset of the data.

Now it's time to make the user's filter dim all of the data that falls outside it.  This was easy to do; it just required some extra storage, and I made sure the filtering runs in O(n) time over the rows.  Below is the filter I sketched and created.
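
The filter is a single O(n) pass that flags each row, and the draw loop dims whatever isn't flagged.  A sketch with placeholder fields:

    boolean[] selected;        // one flag per row (the extra storage)
    int filterCol;             // axis being filtered
    float filterLo, filterHi;  // brushed value range on that axis

    void applyFilter(float[][] data) {
      for (int r = 0; r < data.length; r++) {  // one pass: O(n) in the rows
        float v = data[r][filterCol];
        selected[r] = (v >= filterLo && v <= filterHi);
      }
    }

    // In the draw loop, dim unselected rows by dropping the stroke alpha:
    //   stroke(0, selected[r] ? 255 : 30);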

Everything looked too messy, so I decided I didn't like this approach at all.  I scrapped the idea of dimming by color and decided to simply hide anything not in focus, giving me the following result.


I liked this design much better, since it lets me focus on just what I'm interested in.  Overall, I'm very satisfied with the technique I've implemented for parallel coordinates visualization.  The last feature I had planned, showing the percentage of selected rows, won't be implemented because of time constraints.  Instead, I'm going to move on to data clustering.

I wanted to add help text so the user knows to middle-click for data clustering.  Rather than cluttering the view window, I decided to create a small help icon at the bottom-right of the window telling the user how to reach the modes that aren't obvious from the cursor.  All you have to do is hover over the icon.



Clustering was a hard task to implement.  I tried to implement it myself at first, but given the time constraints I decided it would be much faster to find an implementation online.  I did spatial clustering, where the closest values on a specified axis are grouped together.  Middle-clicking each axis shows a different cluster organization.  To surface this, I added a line to the help text "?" dialog saying to middle-click an axis to show the cluster groupings.  Below are the results.
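
The implementation I found online is its own library, but the core idea is a 1-D k-means over the values of one axis, something like this sketch:

    // 1-D k-means: groups the closest values on one axis together.
    int[] cluster1D(float[] values, int k, int iterations) {
      float[] centers = new float[k];
      int[] labels = new int[values.length];
      float lo = min(values), hi = max(values);
      // Spread the initial centers evenly across the value range.
      for (int c = 0; c < k; c++) centers[c] = lerp(lo, hi, (c + 0.5f) / k);
      for (int it = 0; it < iterations; it++) {
        // Assign each value to its nearest center.
        for (int i = 0; i < values.length; i++) {
          int best = 0;
          for (int c = 1; c < k; c++) {
            if (abs(values[i] - centers[c]) < abs(values[i] - centers[best])) best = c;
          }
          labels[i] = best;
        }
        // Move each center to the mean of its members.
        for (int c = 0; c < k; c++) {
          float sum = 0;
          int n = 0;
          for (int i = 0; i < values.length; i++) {
            if (labels[i] == c) { sum += values[i]; n++; }
          }
          if (n > 0) centers[c] = sum / n;
        }
      }
      return labels;
    }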

Even finding something online that worked was very hard, since most implementations didn't support what I wanted.  I found one that works most of the time, but Processing sometimes throws errors if you don't have enough memory.  The images above can be reproduced if you have enough memory and render them on a retina display; I didn't have the means to test this on another display.  For the color choices, I grouped the items into 5 colors using a palette from colorbrewer2.org.  I remembered Edward Tufte stating that it's hard for people to comprehend more than 4 colors at one time, but I felt 5 was easily comprehensible given the ColorBrewer template choices.  Later, I found in the book "Visual Thinking for Design" that it's hard to comprehend more than 6 to 12 simultaneous colors.  I redid the clustering with 6 colors, and though I could distinguish the individual colors, it was hard to take in as a whole, so I reverted to 5 colors.
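
For reference, a five-class qualitative ColorBrewer palette in Processing looks like this (this is the Set1 scheme; I believe it's the one I used, or close to it):

    // Five cluster colors from a colorbrewer2.org qualitative palette (Set1).
    color[] clusterColors = {
      #E41A1C,  // red
      #377EB8,  // blue
      #4DAF4A,  // green
      #984EA3,  // purple
      #FF7F00   // orange
    };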

The first thing I wanted to do was explore the car dataset, starting with where the cars' acceleration peaked.  My initial thought was that those cars would have lower MPG, but I was absolutely wrong; it was the opposite.  They had higher acceleration because they were lighter, and to my surprise they even had a small number of cylinders.  I then looked at the opposite end and found that the slower cars not only weighed more, they also had a lot of cylinders, mostly 8.  The funny thing was that acceleration wasn't as good in the past, since those cars were made around 1970.


Now that I had colored clustering, I became interested in correlations across the entire dataset.  I decided to cluster on MPG and look at the behavior of that clustering.  The interesting thing was that the lines stayed fairly consistent across the axes until they reached Acceleration and Year.


I wanted to dig deeper into the issue, so the first thing I did was swap the "Year" and "Acceleration" axes.  This showed that the lines were also scattered along the Year axis.  Now it was time to filter the data.  As I looked at each clustered filter set, I found some interesting results, shown below.


This showed that you can't just trust the full, unfiltered visualization, since there were lines everywhere in my dataset no matter where I looked.  Your eyes are more reliable when you follow the density of the lines.  I learned from these visualizations that it's important to focus in on filtered data so you also account for noise.

Now it's time to focus on another dataset.  For my second dataset I chose the camera data.  Since there are 13 parameters, I had to shorten some of the labels to make them fit.  I also alternated the labels between above and below the axes so they didn't clutter each other.  Not my favorite look, but it works well with many axes.  I was also interested in the camera brands, so I grouped the camera models into numbers to go on an axis: for example, model 1 ties to all of the "Agfa ePhoto" cameras, and anything in model category 2 ties to "Canon PowerShot".  I grouped the models together while ignoring version suffixes like "A10", "A20", "A200", and so on.  I wrote a Python parser that reformats the camera data into the same format as the cars dataset and applies this model-naming convention, so related models line up together along the axis.
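
The actual parser is a separate Python script, but the grouping rule it applies is roughly this (sketched here in Processing syntax to match the rest of the code; the regex is a placeholder heuristic):

    // Strip a trailing version token like "A10", "A20", or "A200" so
    // cameras differing only by version share one model group.
    String modelGroup(String name) {
      String[] words = split(name, ' ');
      String last = words[words.length - 1];
      if (words.length > 1 && last.matches("[A-Za-z]*\\d+[A-Za-z0-9]*")) {
        return join(subset(words, 0, words.length - 1), " ");
      }
      return name;
    }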



Now to do some data analysis.  The first thing I did was invert some axes and rearrange others so the data was easier to view.



The first thing that caught my interest was why some cameras cost $7,999.  I would have expected those to have the most storage and everything else maxed out.  To my surprise, there were three distinct lines, all from model 4, which maps to "Canon" in the following Models list.

Models
1: Agfa ePhoto
2: Canon PowerShot
3: Canon EOS
4: Canon
5: Casio Exilim
6: Casio
7: Contax N
8: Contax TVS
9: Epson PhotoPC
10: Epson
11: Fujifilm FinePix
12: Fujifilm
13: HP Photosmart

Another interesting thing is that they came with no storage.  They were super heavy, though, so they were probably commercial models.  They didn't even offer a zoom, which means you probably had to buy your own lens, making the total cost even higher.



The next thing I wanted to see was all cameras under $1,000.  All of these cheaper cameras seemed to follow the same pattern.  The interesting thing, as I clustered the cameras by Release Date, was that telephoto zoom dramatically improved over time in this price range.  The only attribute that didn't seem correlated with time was the normal focus range, so that must not have seen a breakthrough yet.

A final interesting thing is that you can see denser clusters as time moves forward.  The orange clusters, when focusing on the year, seem pretty spread out, but newer cameras form much larger clusters of their own as cameras have improved.

I really liked this assignment.  It was definitely hard and took a lot of time, but it was cool when it all worked, and I'm impressed with my own work.  I now see the advantages and disadvantages of parallel coordinate plots: they do help you look for correlations, though they're still probably not the most effective visualization for many data types.
