Tuesday, September 24, 2013

Time Series

The program I have written has been uploaded to http://www.sci.utah.edu/~mavinm/cs6630/timeSeries.zip.  The download requires authentication.  Please contact me if you need access to my dataset.

I'm using a program called "Processing", a Java-based environment, to create data plots and explore the visualization.  Here is what the IDE looks like.

This is a nice working interface that lets you draw visualizations using setup() and draw().  Everything else needed to create the desired graphs I will implement outside of these functions.

Following the tutorial in Ben Fry's book "Visualizing Data", I worked through Chapter 4 to recreate the following graph.

There are some very cool things in this graph to visualize.  To improve the visualization, I followed the requirements of the class I'm in: I set the numerical labels in the Georgia font and all the text titles in the Verdana font.  I also made the "Year" and "Milk" titles larger.  Finally, I rotated the "Gallons consumed per capita" label 90 degrees and took out the extra lines.

As further changes, I extended the ticks on the vertical axis all the way across the plot and made their weight lighter so they don't take away from the graph.  This makes it easier to see where data points close to a line actually lie.  Finally, I applied Edward Tufte's principle of using less ink by taking the background color away.  Below are the changes I implemented.

I continued the tutorial in the chapter and came up with the following graphs.  The most appropriate graph is the filled chart because the values are quantities (gallons consumed).  Filling the area beneath the line makes that meaning much easier to read.  Below are the images produced.





Continuing the tutorial in the book, I was able to create tabs and also an integrator, which is super cool because it animates the change between graphs when you select one by click or by keyboard.  Below are the results of switching tabs for one of the datasets.




Showing the tabs is super useful because you know which datasets exist and that there is an option to view them.  The integrator is even more useful since it animates the data moving within 1 second.  My favorite part of the integrator is that points move faster when the old and new values are far apart, and you see only small movements when the values are very close in a given year.  What is missing is feedback when hovering the mouse over one of the tabs.  I would have shown the user that something is clickable by changing its background color on hover.  For tons of data points (more than 1000), I would change the tool to let you zoom into certain regions and filter the data down to what you want to look at.
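
The "far values move fast, close values barely move" behavior is what you get from a damped spring.  Here is a minimal plain-Java sketch of that idea.  It is not the code from my program: the class name SpringIntegrator and the DAMPING and ATTRACTION constants are assumptions picked for illustration.

```java
// Minimal damped-spring integrator sketch (plain Java, no Processing dependency).
// Names and constants are hypothetical, chosen only to illustrate the technique.
class SpringIntegrator {
    static final double DAMPING = 0.5;     // assumed: fraction of velocity kept each frame
    static final double ATTRACTION = 0.2;  // assumed: pull toward the target per unit distance

    double value;     // current (animated) value
    double velocity;  // current change per frame
    double target;    // value we are animating toward

    SpringIntegrator(double start) {
        value = start;
        target = start;
    }

    void setTarget(double newTarget) {
        target = newTarget;
    }

    // Advance one animation frame: the pull grows with the distance to the target,
    // so far-away values move quickly and nearby values move slowly.
    void update() {
        double force = ATTRACTION * (target - value);
        velocity = (velocity + force) * DAMPING;
        value += velocity;
    }

    public static void main(String[] args) {
        SpringIntegrator far = new SpringIntegrator(0.0);
        SpringIntegrator near = new SpringIntegrator(0.0);
        far.setTarget(100.0);  // made-up values, not from the dataset
        near.setTarget(1.0);
        far.update();
        near.update();
        System.out.println("first step, far target:  " + far.value);
        System.out.println("first step, near target: " + near.value);
    }
}
```

In a Processing sketch you would call update() once per draw() for every data point and plot value instead of the raw number.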

For a view that displays all datasets, I combined all of the data into one plot.  I also added a key, styled like the other tabs, that shows which color belongs to which dataset.  I chose this method so that the user doesn't have to learn a new graph.  They are already used to reading data on this scale, so it is easier to compare all the data points together.  I wasn't a fan of connecting points with straight segments, so I used curved vertex lines to portray my data.  I also added the ability to hover over a data point and see its actual value.  Below is the output of my graph.
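
The hover lookup can be sketched as a nearest-point search around the mouse position.  This is only an illustration under my own assumptions: HoverPicker, pick, and the pixel radius are hypothetical names and parameters, not the ones in my program.

```java
// Sketch of a hover hit-test: find the plotted point closest to the mouse.
// All names here are hypothetical and chosen for illustration.
class HoverPicker {
    // Returns the index of the closest point within `radius` pixels of the mouse,
    // or -1 if no point is that close. px/py hold the on-screen point coordinates.
    static int pick(double mouseX, double mouseY, double[] px, double[] py, double radius) {
        int best = -1;
        double bestDistSq = radius * radius;  // compare squared distances, no sqrt needed
        for (int i = 0; i < px.length; i++) {
            double dx = px[i] - mouseX;
            double dy = py[i] - mouseY;
            double distSq = dx * dx + dy * dy;
            if (distSq <= bestDistSq) {
                bestDistSq = distSq;
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        double[] px = {10, 50, 90};  // made-up screen coordinates
        double[] py = {20, 20, 20};
        System.out.println("hovered index: " + pick(52, 21, px, py, 5));
    }
}
```

The returned index can then be used to look up the original data value and draw it as a label next to the point.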


Since I'm a graduate student, I went a step further than the assignment and calculated the linear regression of each plot, which let me see how effective the least squares estimator is.  I really liked the integrator, so I made the linear regression animate when changing between datasets.  The linear regression is more effective when the data is spread out across the whole plot.  For "Tea", it is hard to really read the regression line, and the least squares fit looks nearly constant, though that may not be the case.  There are cases where linear regression would be ineffective, such as data that follows a polynomial pattern.  A straight line cannot follow that curvature, so the leftover error would reflect the shape of the data rather than just noise.  In those cases, you would fit a curve by least squares instead.  Below I have an image of the result.
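
For reference, here is a minimal plain-Java sketch of an ordinary least squares line fit.  It illustrates the standard formula (slope = covariance of x and y divided by variance of x), not the exact code in my program, and the class and method names are made up for this example.

```java
// Ordinary least-squares fit of a straight line y = slope * x + intercept.
// LeastSquares and fit are hypothetical names used only for this sketch.
class LeastSquares {
    // Returns {slope, intercept} for the given points.
    static double[] fit(double[] x, double[] y) {
        if (x.length != y.length || x.length < 2) {
            throw new IllegalArgumentException("need at least two (x, y) pairs of equal length");
        }
        int n = x.length;
        double meanX = 0.0;
        double meanY = 0.0;
        for (int i = 0; i < n; i++) {
            meanX += x[i];
            meanY += y[i];
        }
        meanX /= n;
        meanY /= n;

        double sxy = 0.0;  // sum of (x - meanX) * (y - meanY)
        double sxx = 0.0;  // sum of (x - meanX)^2
        for (int i = 0; i < n; i++) {
            double dx = x[i] - meanX;
            sxy += dx * (y[i] - meanY);
            sxx += dx * dx;
        }
        if (sxx == 0.0) {
            throw new IllegalArgumentException("all x values are identical");
        }
        double slope = sxy / sxx;
        double intercept = meanY - slope * meanX;
        return new double[] {slope, intercept};
    }

    public static void main(String[] args) {
        double[] x = {0, 1, 2, 3};  // made-up points lying on y = 2x + 1
        double[] y = {1, 3, 5, 7};
        double[] line = fit(x, y);
        System.out.println("slope=" + line[0] + " intercept=" + line[1]);
    }
}
```

To animate the line between datasets, the slope and intercept (or the two endpoint heights) can each be fed through an integrator instead of being drawn directly.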






Tuesday, September 10, 2013

Data Exploration

All blog posts on this site demonstrate large-scale data visualization at the University of Utah.  I'm using a program called Tableau to visualize datasets.  This blog serves as my journal: I will post my progress along my journey in exploring the data on this site.

My chosen dataset is "movie data", compiled into a spreadsheet provided by the class I'm taking at the University of Utah.  You can find the ratings and revenue in this dataset at RottenTomatoes and IMDB.  My initial question is whether there is a pattern of a genre rising to the top at some point in time and then dropping almost instantly, within a year or two, to being unpopular.


Exploring the data confirmed that this question could be answered easily with specific filters.  Digging in, one of the first things I noticed is a field called "Number of Records".  Comparing it against the titles, after sorting from largest to smallest, shows that some movies have more than one record.




This is a problem because duplicate records make any measure compared against that dimension inaccurate.  I was able to filter out the inaccurate data.  As I explored other fields, I started seeing NULL values.  I decided to exclude those in the genres along with the year to keep the data understandable.  Looking at the rest of the data confirmed that my filter was good enough to start exploring.

The first thing I decided to do was explore the gross income per genre in a given year.  I split the data into Worldwide Gross and US Gross to compare the difference.  The data became super simple to compare.  The Worldwide Gross lines were pretty similar to the US Gross patterns, which shows good correlation.  The lines were steady for Western and Musical films along with some others, so I filtered those away since they didn't answer my initial question.  Because that data may be relevant to a future question, I hid it rather than excluding it.

This simplified my data to the following.  

In the image above, the data is grouped into genres stacked on top of one another, and time moves forward on the x-axis.  The orange lines display the worldwide gross, which maps to the values on the right-side axis.  The blue lines are the US gross, which maps to the values on the left.  This wouldn't be the graph I would choose if I were comparing numbers, but since I'm looking at slope changes to find a sharp drop, it answers my question very well.

As you can see, the slopes are pretty consistent worldwide and in the US, so we can almost rule out a genre being popular in the US but not popular at all outside of the US.

Now there is some weird information that shows the timeline going past 2013.  This is impossible since 2035 has not happened yet.  Looking at another bar graph of time, the noisy values seem to be before 1950 and after 2011, so I excluded those results from my data.  There were still too many values to look at, so I excluded the years where the grosses were pretty low, since money was not as big in the past.  I filtered the data to 1996 and later, giving me this result.

This is much nicer to read since we can focus on the slopes much better for details.  

To answer my own question, I'm looking for the area of the graph that looks most like a mountain, since I want a rapid increase followed by a rapid decrease, showing a spike in the chart.  Looking at the figure above, I concluded that several genres had a dramatic increase and then decrease in popularity in different years.  These are Thriller/Suspense in 1997, Drama in 2000, Comedy in 2006, Adventure in 2004, and Action in 2003.
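
I found these spikes by eye in Tableau, but the "mountain" test can also be written down.  Here is a small plain-Java sketch under my own assumptions: it scores each year by the smaller of its rise from the previous year and its fall to the next year, and picks the highest score.  SpikeFinder and sharpestPeak are hypothetical names, and the numbers in main are made up, not taken from the movie data.

```java
// Sketch of a "mountain" detector for a yearly series.
// Names are hypothetical; the scoring rule is an assumption, not what Tableau does.
class SpikeFinder {
    // Returns the index of the sharpest peak: a value higher than both neighbors,
    // scored by min(rise from previous, fall to next). Returns -1 if there is no peak.
    static int sharpestPeak(double[] values) {
        int best = -1;
        double bestScore = 0.0;
        for (int i = 1; i < values.length - 1; i++) {
            double rise = values[i] - values[i - 1];
            double fall = values[i] - values[i + 1];
            double score = Math.min(rise, fall);  // both sides must be steep to score high
            if (score > bestScore) {
                bestScore = score;
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        double[] gross = {1, 2, 8, 3, 4, 5, 4};  // made-up values for illustration only
        System.out.println("sharpest peak at index " + sharpestPeak(gross));
    }
}
```

Running this once per genre and comparing the best scores would give a rough ranking of which genre had the most significant rise and drop.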

This was surprising since I didn't think many genres would have this kind of effect.  My next question is which genre out of these results had the most significant rise and drop.  Looking at the graph, it really comes down to Action or Thriller/Suspense, so we need to compare them side by side.

Since these two look nearly identical, another factor to consider is the amount of money these movies made.  I was able to get my graph down to the following area chart.



Looking at this area graph, you can see that Action definitely made a lot more money.  Looking at the beginning and end points, we can see that Thriller/Suspense came back to where it was.  Action actually dropped a lot more, making 1.4 billion dollars and then dropping to 1.15 billion dollars in 2004, a fall of about 250 million dollars.  That is a large impact compared with Thriller/Suspense, which went up and came back down to about 160 million dollars.  I would have to conclude that Action took the greater fall since it lost more money in the drop than it had gained during the increase.

Overall, Tableau is super fun.  I never knew how much functionality a program for visualizing large-scale data could offer to help you understand the data quickly and efficiently.  My favorite thing is switching between graphs and saving my spreadsheets to come back to when I don't get anywhere with some other data formatting.  I'm excited to use Tableau in my research and studies throughout the semester.