Data is one of the most talked-about resources these days. It has the power to narrate insightful stories, which can lead to strategically important and competitively advantageous decisions.
Last Saturday, on 25th March, we organized a data visualization meetup. The objective of the meetup was to understand the complexity and the nuances of the stellar work done by individuals and corporations in the space of data visualization.
The meetup featured speakers from two organizations:
Mr. Avinash Celestine from How India Lives
Mr. Akhilesh Srivastava & Mr. Shubhadip Biswas from Open Government Data Platform, India
Avinash has worked as a business journalist across a variety of national papers and magazines for the last 15 years. In the last couple of years he has learnt to code, and he now delivers his stories through visualizations. He also has a personal blog at datastories.in
Mr. Akhilesh is a statistician with 15 years of experience. He works as a senior technical manager and supports various data initiatives of Open Government Data. He has contributed to various research studies, has conducted statistical analyses using a wide variety of tools, and has published articles in Indian journals.
Mr. Shubhadip is a senior analyst with Open Government Data. He works in the areas of big data management and statistical models, and is involved in various research studies on better governance, the socio-economic impact of open data, and data-driven decision-making.
Avinash shared with us key case studies along with his methodologies, challenges, and guidelines for making effective visualizations. He emphasized the following principles:
- Understanding of audience
- Clear purpose of the visualization
- Identifying the right metrics
- Contextualization of the data and the visualization
- Identifying assumptions
- Data cleaning and sanitization
- Choice of type of visualization
- Explicitly guiding the audience
The objective of any data visualization is to answer certain key questions. Data is the key ingredient of a visualization, and more often than not data preparation is the step that takes the longest (up to 80% of the time). He highlighted the following challenges:
- Merging data from two sources where the definitions of the same metric could differ
- Inconsistent data types within a column
- NA/missing values
- Treatment of missing values
- Invisible characters (line breaks within fields, leading and trailing whitespace, stray spaces)
- Records of different lengths, etc.
These issues need to be addressed explicitly; otherwise they pose a serious challenge to the visualization and, in turn, to the insights drawn from it.
Although tools and technologies make it easy for us to visualize data, the real value of a visualization comes only after these challenges have been addressed.
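To make a few of these data-preparation issues concrete, here is a minimal sketch in Python using only the standard library. It is not the speakers' own code. The sample data, the column names, the `MISSING` markers and the helper functions are all hypothetical and chosen only for illustration. It trims invisible whitespace (including a line break inside a field), converts a mixed-type column to numbers, flags missing values as `None` instead of silently filling them, and sets aside records whose length does not match the header.

```python
import csv
import io

# Hypothetical raw extract (made-up numbers) showing several problems at once:
# padded whitespace, a line break inside a quoted field, "NA" markers,
# a thousands separator in a numeric column, and a record with a missing field.
RAW = (
    'state,births,private_share\n'
    ' Kerala ,1200,0.62\n'
    '"Tamil\nNadu",NA,0.55\n'
    'Bihar,"3,400",\n'
    'Goa,150\n'
)

# Assumed set of markers that should be treated as "missing".
MISSING = {"", "na", "n/a", "null", "-"}


def clean_text(value):
    """Collapse internal whitespace (including line breaks) and trim the ends."""
    return " ".join(value.split())


def to_number(value):
    """Return a float, or None when the cell is missing or not numeric."""
    text = clean_text(value).replace(",", "")
    if text.lower() in MISSING:
        return None
    try:
        return float(text)
    except ValueError:
        return None


def load_clean(raw):
    """Parse the CSV text and return (clean rows, rejected records)."""
    reader = csv.reader(io.StringIO(raw))
    header = [clean_text(h) for h in next(reader)]
    rows, rejected = [], []
    for record_no, record in enumerate(reader, start=1):
        if len(record) != len(header):
            # Different record length: set aside for review rather than guess.
            rejected.append((record_no, record))
            continue
        row = dict(zip(header, record))
        rows.append({
            "state": clean_text(row["state"]),
            "births": to_number(row["births"]),
            "private_share": to_number(row["private_share"]),
        })
    return rows, rejected


rows, rejected = load_clean(RAW)
```

Returning `None` for missing cells keeps the treatment of missing values a separate, explicit decision (drop, impute, or show as a gap) rather than hiding it inside the parsing step.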
Visualization often helps us address key questions and solve interesting problems. For example, Avinash spoke about how his team helped a diaper maker identify opportunity areas and the right market. They did this by visualizing publicly available data on the percentage of babies born in private hospitals (Data source: National Health Information Systems).
Again, the power of a visualization rests on the assumption that the underlying data is cleaned, sanitized, and free of the challenges mentioned above.
The team from NIC (National Informatics Centre) talked about the visualization engine on the open data platform. It’s an engine which allows the user to perform the following tasks:
- Use a new dataset/use an existing dataset from https://data.gov.in/
- Copy the data from the CSV or JSON URL
- Select the visualization type
- Make changes to the data like filtering, adding a field, etc.
- Create a visualization
The team also demonstrated the visualization engine at work and showed how users can create interesting maps and other visualizations with just a few clicks.
The NIC team has worked with more than 15 open source libraries and frameworks. These include D3.js, C3.js, NVD3, and jVectorMap for creating the all-India map, along with Python, Leaflet, and OpenStreetMap for geolocation.
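The same sequence of steps (take a CSV, filter it, add a field, then draw something) can be sketched in a few lines of Python. This is only an illustration of the workflow, not how the engine on data.gov.in is implemented. The inline CSV stands in for text copied from a dataset's CSV URL, its numbers are made up, and `bar_chart` is a hypothetical helper that prints a plain-text chart.

```python
import csv
import io

# Inline stand-in for the text one would copy from a dataset's CSV URL.
# All figures below are invented for illustration.
CSV_TEXT = """state,year,births,private_births
Kerala,2016,1000,620
Goa,2016,200,80
Bihar,2015,3000,300
Bihar,2016,3200,480
"""

rows = list(csv.DictReader(io.StringIO(CSV_TEXT)))

# Filter step: keep a single year.
rows_2016 = [r for r in rows if r["year"] == "2016"]

# Add-a-field step: derive the percentage of births in private hospitals.
for r in rows_2016:
    r["private_pct"] = round(100 * int(r["private_births"]) / int(r["births"]), 1)


def bar_chart(records, label, value, scale=5):
    """Render a text bar chart, largest value first, one '#' per `scale` units."""
    lines = []
    for rec in sorted(records, key=lambda x: x[value], reverse=True):
        bar = "#" * int(rec[value] // scale)
        lines.append(f"{rec[label]:<8}{bar} {rec[value]}")
    return "\n".join(lines)


# Visualization step: build and print the chart.
chart = bar_chart(rows_2016, "state", "private_pct")
print(chart)
```

In the engine these steps are point-and-click; writing them out mainly shows that the filter and the derived field are decisions the chart depends on.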
The NIC team is constantly adding features to this visualization engine and to the open data portal. Features like “Suggest a dataset” are certainly a boon for data analysts.
The open data platform has more than 74K datasets, and this visualization engine is an added incentive for users to experiment with ideas and data on the portal.
The speakers, once again, emphasized the need to clean the data before visualizing it. They had a checklist for cleaning datasets, which included the following steps:
- Remove the formulas from the Excel sheet, if any
- Unmerge the cells in the Excel sheet
- Keep the header in the first row
- Remove the blank cells and replace the NA cells with appropriate values
- Remove the special characters
- Remove the spaces
- File name, dataset title and dataset should not come in the metadata file, etc.
This is just a glimpse of all the steps that the team at NIC undertakes to make the datasets ready for visualization.
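A few of the checklist items lend themselves to a small script. The sketch below is not NIC's tooling; it assumes the sheet has already been exported as rows of text (so formulas and merged cells are out of scope), and it reads "remove the special characters" and "remove the spaces" as applying to header names and to the padding around cell values. The function names, the `NA_MARKERS` set, and the sample sheet are all hypothetical.

```python
import re

# Assumed markers for cells that count as blank or NA.
NA_MARKERS = {"", "na", "n/a", "-"}


def normalise_header(name):
    """Keep letters and digits; turn runs of anything else into one '_'."""
    return re.sub(r"[^0-9A-Za-z]+", "_", name.strip()).strip("_").lower()


def prepare_rows(rows, na_value):
    """Apply the header, blank-cell, special-character and space steps."""
    # Drop rows that are entirely blank, so the header ends up in the first row.
    kept = [r for r in rows if any(cell.strip() for cell in r)]
    header = [normalise_header(h) for h in kept[0]]
    body = []
    for record in kept[1:]:
        cleaned = []
        for cell in record:
            cell = cell.strip()  # remove leading and trailing spaces
            # Replace blank/NA cells with the caller's chosen value.
            cleaned.append(na_value if cell.lower() in NA_MARKERS else cell)
        body.append(cleaned)
    return [header] + body


# Made-up sheet with a blank first row, messy headers, and NA/blank cells.
sheet = [
    ["", "", ""],
    ["State Name ", "Total Births (2016)", "Private Share %"],
    [" Kerala", "1200", "NA"],
    ["", "", ""],
    ["Goa ", "", "40"],
]

# The "appropriate value" depends on the column; "0" is only for illustration.
result = prepare_rows(sheet, na_value="0")
```

Passing `na_value` in explicitly is deliberate: what counts as an appropriate replacement is a judgment about the dataset, not something a generic script should decide.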
We often read about the tools and techniques for making great visualizations. A great visualization is one that solves a problem or answers the questions it set out to address. To ensure that a visualization is powerful, one needs to make certain that the underlying dataset carries the right, clean information.
Both speakers emphasized the need for clean datasets and outlined the steps to get there, but these steps lived in their own lists, on their local machines and in their own discussion environments. We suggested that such steps be made publicly available so that more visualization experiments can yield powerful insights and decisions.
Srijan's data visualization team and all meetup attendees gained valuable insights from our speakers. Working to create short experimental data stories, we have experienced first-hand the challenges the speakers talked about. We are now better equipped to resolve them and deliver better solutions.