I am thrilled to be working with the D3 library. I especially enjoyed its techniques, such as the zoom and hierarchy features. These features can make a very complex dataset much more accessible to users. For example, in my dataset, I was able to divide users’ metadata headers into categories and then add users as child nodes.
Bringing graph theory and hierarchy into D3 is very useful. To implement such a feature, you use d3.hierarchy and specify the data structure of the dataset. The data file, whether it’s CSV or JSON, must then follow that same structure. In my case, I used a nested JSON structure to configure the tree, placing the parent first and then adding its children to an array. Each child in turn has an array of its own children, and so on until we reach a leaf node, which has no children.
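To make the nested structure concrete, here is a minimal sketch in Python (chosen only for illustration, since the original work was done in D3). The category and user names are hypothetical, because the real metadata is not shown in this post. The `children` key is the one d3.hierarchy reads by default; `name` is just a label I picked.

```python
import json

# Hypothetical categories and users; the real metadata headers are not shown here.
tree = {
    "name": "users",
    "children": [
        {
            "name": "category_a",
            "children": [{"name": "user_1"}, {"name": "user_2"}],
        },
        {
            "name": "category_b",
            "children": [{"name": "user_3"}],
        },
    ],
}


def depth_of(node, level=0):
    """Return the deepest level below this node; the root is level 0."""
    children = node.get("children", [])
    if not children:
        return level
    return max(depth_of(child, level + 1) for child in children)


def leaves(node):
    """Collect the names of all leaf nodes (nodes without children)."""
    children = node.get("children", [])
    if not children:
        return [node["name"]]
    names = []
    for child in children:
        names.extend(leaves(child))
    return names


# Serialize to the nested JSON text that would be saved as the data file.
tree_json = json.dumps(tree, indent=2)
```

Here `depth_of(tree)` gives the number of levels below the root, which is also the number of zoom steps available if each zoom corresponds to one level of the tree.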
This structure greatly helped me provide the zoom-in feature in D3, where each zoom step corresponds to a level in the tree.
Sometimes we feel that having data in an Excel sheet or a text file is good enough to understand it. For instance, you can scroll through to find the sales percentage or the total number of users. However, when the dataset gets large or complex, it becomes challenging to understand.
I found many good examples in my data visualization class that support this point. For example, one group used an H-1 visa applications dataset, which included millions of rows. In this situation, a visualization is crucial to understanding the dataset, since millions of rows are beyond what a person can take in at a glance. Moreover, the group had no idea what the dataset would tell them, and once it was visualized they presented fascinating results. For example, the number of full-time applicants dropped by almost 50% in 2016, while the number of part-time applicants increased by 100%.
My dataset was midsize, and at first I had no idea how to make sense of it. However, after I visualized it, I quickly discovered many critical business performance indicators for my community partner. These indicators emerged as I visualized different parameters. Additionally, with my community partner’s feedback and guidance, I was able to pinpoint what is important to their organization.
While I was working in Tableau, I reached a point where I wanted to share my visualizations with my community partner. I knew that I had to publish them online, since sending a Tableau worksheet is tedious and requires a license to view the visualizations. Luckily, Tableau had a service to host visualizations in the browser. However, the service requires a separate license and comes with a 14-day free trial. I shared my visualization with my community partner, but the server shut down after 14 days. Thus, I had to find an alternative to Tableau Server.
I started exploring Data Studio. I was very impressed by the look and feel of its visualizations. Additionally, Data Studio includes a server to host the visualizations. I played around with the sample dataset, and once I was comfortable with the tool, I moved my visualizations to Data Studio. I prepared the dataset using jq and regular expressions and uploaded it to the tool. Finally, I produced the visualizations and was very satisfied with the final output.
Going to Viacom to talk about visualizations was very exciting for me. At the start of the day, the Viacom team presented their current work on data visualization.
The first presenter used AWS Image Recognition to extract the colors of the top hit songs of the last 30 years. The colors were then used to infer each song’s mood. She sampled ten frames per second to gather color palettes. The finding for the past decade was that many hit songs used pink and light blue. Bright colors appear only in the past decade because, before LEDs were commercialized, bright stage lighting was costly.
The second presenter talked about his journey with deep learning. He explained the basic concepts of deep networks and the challenges that can arise with time series data. For his model, he used long short-term memory (LSTM) to generate predictions for a dataset. Initially, his model could not predict the validation data. Then he realized that predicting is different from generating: if you give the model ten time series points, it must predict ten periods. He had initially been trying to predict a single future point from ten training points.
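As I understood the presenter’s point, the training data has to be cut into windows where ten input points are paired with the next ten points as the target. The sketch below is my own illustration of that windowing step in plain Python, not the presenter’s code; the function name `make_windows` and the placeholder series are assumptions, and no LSTM is trained here.

```python
def make_windows(series, n_in=10, n_out=10):
    """Split a series into (inputs, targets) pairs.

    Each pair holds n_in consecutive points and the n_out points
    that immediately follow them.
    """
    pairs = []
    last_start = len(series) - n_in - n_out
    for start in range(last_start + 1):
        inputs = series[start:start + n_in]
        targets = series[start + n_in:start + n_in + n_out]
        pairs.append((inputs, targets))
    return pairs


# Placeholder series standing in for real time series values.
series = list(range(25))
pairs = make_windows(series)
```

Setting `n_out=1` reproduces the setup he started with, where ten points are used to predict a single future point.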
The last presenter talked about how our memory stores data. He and his team developed a model to show the estimated time it takes to forget a celebrity, which ranges from 5 to 20 years.
The visit was a wonderful opportunity to network and to present to Viacom. Additionally, it allowed us to see in action how data visualizations can be extended with machine learning.
When I received a dataset from my community partner, I did not expect the files to be that large, around 0.5 GB. The files were not structured properly for a plotting library such as D3. Another challenge was that users’ locations were recorded as city and state, so I had to convert them into latitude and longitude using a geocoding service. Additionally, all the files were stored in JSON format, with each user in a separate file. Thus, to make use of users’ metadata, all these files had to be merged into a single file. Fortunately, I found jq, a command-line tool that processes JSON files. I used jq to merge all the files into one. Then, I used jq and regular expressions to apply additional filters that remove null values and create categories from users’ metadata.
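The merge-and-clean steps above can be sketched in Python. This is an illustration of the idea rather than the jq commands I actually ran; the field names, the category patterns, and the sample records are all hypothetical.

```python
import json
import re


def drop_nulls(record):
    """Return a copy of the record without keys whose value is None (JSON null)."""
    return {key: value for key, value in record.items() if value is not None}


# Hypothetical categories, each matched by a regex on the metadata field name.
CATEGORY_PATTERNS = {
    "location": re.compile(r"^(city|state|country)$"),
    "activity": re.compile(r"^(question|date)"),
}


def categorize(record):
    """Group a flat record's fields under the first category whose pattern matches."""
    grouped = {}
    for key, value in record.items():
        category = "other"
        for name, pattern in CATEGORY_PATTERNS.items():
            if pattern.search(key):
                category = name
                break
        grouped.setdefault(category, {})[key] = value
    return grouped


def merge_users(json_documents):
    """Parse one JSON document per user and return a single cleaned list."""
    return [categorize(drop_nulls(json.loads(doc))) for doc in json_documents]


# Stand-ins for the per-user files; real code would read each file's text instead.
documents = [
    '{"name": "user_1", "city": "Austin", "state": "TX", "question_1": null}',
    '{"name": "user_2", "city": null, "state": "NY", "date_joined": "2017-03-01"}',
]
merged = merge_users(documents)
```

Writing `merged` out with `json.dump` would give the single file that a plotting library can load in one request.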
I had never worked with D3 directly. In the past, I used libraries built on top of D3, such as C3, to plot basic visualizations such as line charts, pie charts, and bubble charts.
However, for the disease disparities dataset, it was very hard for me to fit the data into one of these visualizations. Thus, I looked at the D3 documentation to gauge the level of expertise I would need. I signed up for two D3 crash courses on Lynda.com to understand the D3 basics and how to work with its elements. In conclusion, although D3 might be overwhelming at first glance, it is a very powerful library with seemingly unlimited choices. I recommend that anyone working with a dataset take a look at it and learn it if time allows.
Working with a community partner is a wonderful opportunity since it puts me in a position to think deeply about my choices of visualizations and toolkit. In this blog post, I will discuss how one can find an effective visualization and identify an ineffective one.
Many end users evaluate visualization work based on how it looks. However, data processing and data cleaning are crucial steps of a visualization. For example, without proper location data, it is not possible to plot a map. I faced this problem with my dataset: it included city, state, and country but no latitude and longitude, so I was not able to produce a GeoJSON file to plot a map in Mapbox or Google Maps.
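Once coordinates are available, producing the GeoJSON is mostly a reshaping step. Here is a minimal sketch, assuming each user record already carries hypothetical `lat` and `lon` fields returned by a geocoding service; the sample coordinates are approximate and only for illustration. Note that GeoJSON stores each point as longitude first, then latitude.

```python
import json


def to_feature_collection(users):
    """Convert user records with lat/lon into a GeoJSON FeatureCollection.

    Records without coordinates are skipped, since they cannot be plotted.
    """
    features = []
    for user in users:
        if user.get("lat") is None or user.get("lon") is None:
            continue
        features.append({
            "type": "Feature",
            "geometry": {
                "type": "Point",
                # GeoJSON order is [longitude, latitude].
                "coordinates": [user["lon"], user["lat"]],
            },
            "properties": {"city": user.get("city"), "state": user.get("state")},
        })
    return {"type": "FeatureCollection", "features": features}


# Hypothetical records; the second one has not been geocoded yet.
users = [
    {"city": "Austin", "state": "TX", "lat": 30.27, "lon": -97.74},
    {"city": "Albany", "state": "NY", "lat": None, "lon": None},
]
collection = to_feature_collection(users)
geojson_text = json.dumps(collection)
```

The skipped records are worth counting in practice, because a large number of them means the map underrepresents the user base.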
Now, assuming we have processed and cleaned the data, what makes a visualization effective or ineffective? To answer simply, each visualization must have a title and, where the chart has axes, axis labels. Then, one should consider the audience to know how to present the visualizations. Additionally, one should truly understand the dataset to identify all possible charts.
Now that we know the general guidelines, let me mention one effective visualization and one ineffective visualization based on my dataset. My dataset includes users’ names, dates, locations, and questions.
An effective visualization is a map of users. Although this visualization is overused, it can instantly tell viewers about the organization’s success. Moreover, it can also be used internally to drive business decisions. One example of such a visualization is below, taken from Atlas of Knowledge. The map includes a heat scale to show intensity.
On the other hand, an ineffective visualization for my dataset would be a network map. A network map usually connects different elements to show their dependencies and how they fit into a network. My dataset does not call for examining dependencies, since all users sit at the same level of the hierarchy. Thus, a network map is not a useful plot for my dataset. A figure of a network map taken from Atlas of Knowledge is included below.
What a fun journey it has been!
To access the visualization, tap on the link.
You can hover over the chart to walk through the cases and percentage of each disease versus ethnicity and gender. As of now, I have not built it to work on mobile phones.
I decided to work on the Be More dataset to represent causes of death by ethnicity and race. I wanted to begin working on the dataset in Tableau and then look for a charting tool such as D3 or C3.
I was a bit overwhelmed by the D3 learning curve, so I decided to work with a library built on top of D3. I found dimplejs, which has many options. I then chose the chart shown in the image below.
However, the library did not have the flexibility to handle my dataset’s many variables. Thus, I pushed myself into D3 and watched “D3.js Essential Training for Data Scientists” by Emma Saunders. The class was super helpful for understanding and working with the D3 library.
Afterwards, I chose the style of visualization below.
I spent a day and a half recreating it; see below.
However, the final visualization would have many problems with scaling. For example, if the visualization is opened on a mobile phone, it will either be very small or lose its axes.
Finally, I found what I think is the best way to represent my many-variable dataset. The chart uses a hierarchy to go from disease to ethnicity to gender using pies. The final visualization is below and hosted on the cloud.
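The data shaping behind the disease-to-ethnicity-to-gender drill-down can be sketched as follows. This is a Python illustration of the idea, not the D3 code behind the hosted chart; the row fields (`disease`, `ethnicity`, `gender`, `cases`) and every value in the sample rows are made-up placeholders, not figures from the real dataset.

```python
def build_hierarchy(rows, root_name="causes of death"):
    """Nest flat rows as disease -> ethnicity -> gender, summing case counts."""
    nested = {}
    for row in rows:
        by_ethnicity = nested.setdefault(row["disease"], {})
        by_gender = by_ethnicity.setdefault(row["ethnicity"], {})
        by_gender[row["gender"]] = by_gender.get(row["gender"], 0) + row["cases"]

    def to_node(name, value):
        # Dicts become internal nodes with children; numbers become leaves.
        if isinstance(value, dict):
            return {"name": name, "children": [to_node(k, v) for k, v in value.items()]}
        return {"name": name, "value": value}

    return to_node(root_name, nested)


def total(node):
    """Sum the leaf values under a node."""
    if "children" not in node:
        return node["value"]
    return sum(total(child) for child in node["children"])


def shares(node):
    """Return each child's fraction of the node total: the slices of one pie."""
    whole = total(node)
    if whole == 0:
        return {}
    return {child["name"]: total(child) / whole for child in node.get("children", [])}


# Placeholder rows; these are not real counts.
rows = [
    {"disease": "disease_a", "ethnicity": "group_x", "gender": "female", "cases": 10},
    {"disease": "disease_a", "ethnicity": "group_x", "gender": "male", "cases": 5},
    {"disease": "disease_a", "ethnicity": "group_y", "gender": "female", "cases": 15},
    {"disease": "disease_b", "ethnicity": "group_x", "gender": "male", "cases": 10},
]
root = build_hierarchy(rows)
```

Calling `shares` on the root gives the slices of the top-level pie, and calling it on any child gives the pie shown after drilling into that slice.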
To access the visualization, tap on the link.