Case Study on CSE India: Web Portal with Drupal Migration
Migrations from 4 disparate content databases to Drupal
The Centre for Science and Environment (CSE) was inspired to build an environment portal as part of the push for portal development from the National Knowledge Commission, Govt. of India. CSE decided to release 20 years of its research articles, spread over 3 proprietary systems: a Library Management System; a Content Management System written in ASP/MS-SQL that managed the website of its premier science and environment magazine, Down to Earth (DTE); and some Access databases.
The total number of records exceeded 250,000. CSE also had a very extensive tag vocabulary, or thesaurus, of library tags, organized as What, Where and Who lists. These represented, respectively, the nature of the articles, the geographies they covered, and the people associated with them, including the authors. All the articles in the library and DTE databases had been classified and tagged with these terms.
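To make the What/Where/Who structure concrete, here is a minimal sketch of how three tag vocabularies can support cross-access to articles. It is only an illustration and not the schema used on the portal: the `TagIndex` class, its method names and the sample terms are all hypothetical.

```python
from collections import defaultdict

# The three vocabularies named in the text.
VOCABULARIES = ("what", "where", "who")


class TagIndex:
    """Inverted index from (vocabulary, term) to article ids (hypothetical)."""

    def __init__(self):
        self._index = defaultdict(set)

    def tag(self, article_id, vocabulary, term):
        """Attach one term from one vocabulary to an article."""
        if vocabulary not in VOCABULARIES:
            raise ValueError(f"unknown vocabulary: {vocabulary}")
        self._index[(vocabulary, term.strip().lower())].add(article_id)

    def find(self, **terms):
        """Return ids of articles that carry every requested term."""
        if not terms:
            return set()
        matches = [
            self._index.get((vocabulary, term.strip().lower()), set())
            for vocabulary, term in terms.items()
        ]
        return set.intersection(*matches)


# Made-up sample data, for illustration only.
index = TagIndex()
index.tag(1, "what", "Air pollution")
index.tag(1, "where", "Delhi")
index.tag(2, "what", "Air pollution")
index.tag(2, "where", "Mumbai")
print(index.find(what="air pollution", where="delhi"))  # {1}
```

Keeping one index entry per (vocabulary, term) pair means a query can combine terms from different lists, for example a topic and a geography, by intersecting the matching sets.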
Migrate all 250,000-plus records from these scattered systems into one common system
Have an excellent tagging system to cross-access content across the site
Have a powerful search that lets environment journalists and researchers filter content, so that 100% of the available content can be reached
CSE's vision of the portal was still evolving, and being a non-profit they had started out with a very limited budget
Srijan was engaged for a small Information Architecture exercise for the portal, which included studying their library management system and the way they tagged articles, and gaining an understanding of DTE
The DTE content management system had been around for 4-5 years and had grown along the way to meet the changing requirements of the print magazine, without any documentation whatsoever. The database was therefore in bad shape from a design perspective, with redundant tables and data adding to the confusion.
Srijan proposed to break the project into multiple phases, with the first being a pilot for data migration of the DTE database, which formed the bulk of the data, into an open source Content Management System.
The initial choices were Drupal or TYPO3. We eventually chose Drupal, for the following reasons:
Drupal had excellent core modules for vocabulary and tag management, whereas equivalent functionality would have had to be written for TYPO3
TYPO3 had a separate admin interface, which would have proved difficult for the library team to manage and use
Drupal's caching mechanism was better than TYPO3's; this would be a critical requirement given the high volume of traffic that CSE was expecting for the portal
A powerful and fast search engine was required for searching through 250,000 articles. Drupal had an Apache Solr search module integrated with it, which was a candidate for implementing this search.
Looking at the MS-SQL database, we could see that its architecture was poor, with many data redundancies.
In some cases, fields that were meant to be UNIQUE held duplicate values!
The database was not normalized, so flexibility, data integrity and efficiency all suffered. Normalization was required to safeguard the database from anomalies.
Since the data had been populated through a Library Management tool, it contained many special characters, which were stored in the database. We had to clean these up while migrating the data to the MySQL database. Without any documentation of the database, the cleanup was time-consuming and repetitive work.
We had to design our MySQL database to accommodate the concepts of box stories and cover stories that come with a magazine-like website. This required understanding the relationships between tables and data as they were maintained in the MS-SQL database. Again, these tables contained many redundancies and anomalies.
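The cleanup and duplicate checks described above can be sketched as a small pre-load step. This is a minimal illustration, not the script we used: the case study does not say which special characters were involved, so the sketch assumes control characters and irregular whitespace, and the function names and the `code` column are hypothetical.

```python
import re
import unicodedata
from collections import Counter

# Assumption: "special characters" means control characters
# (other than tab, newline and carriage return) and stray whitespace.
_CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")
_WHITESPACE = re.compile(r"\s+")


def clean_text(value):
    """Normalise one exported text field before loading it into MySQL."""
    if value is None:
        return ""
    text = unicodedata.normalize("NFC", value)
    text = _CONTROL_CHARS.sub("", text)
    return _WHITESPACE.sub(" ", text).strip()


def find_duplicates(rows, key):
    """Return cleaned, lower-cased values of `key` seen more than once."""
    counts = Counter(clean_text(row[key]).lower() for row in rows)
    return {value: n for value, n in counts.items() if n > 1}


# Made-up rows standing in for an export of the source table.
rows = [{"code": "A1"}, {"code": "a1 "}, {"code": "B2"}]
print(clean_text("  Down\x00 to\tEarth \n"))  # Down to Earth
print(find_duplicates(rows, "code"))          # {'a1': 2}
```

Running the duplicate check on cleaned values before adding a UNIQUE constraint in the target schema surfaces collisions early, instead of as failed inserts during the migration.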
The next phases of the project consisted of setting up Drupal (with a default theme) and building all the required modules to start showcasing the content in the desired manner. A key component of these phases was setting up a powerful search. We researched Apache Solr and selected it as the search engine. It turned out to be an excellent decision: in retrospect, not even Google Mini (the other search candidate) would have proved as beneficial for researchers. In another case study, specifically on Apache Solr, we will describe how researchers within CSE are using it to their absolute delight.
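As a rough illustration of the kind of filtered search Solr makes possible, the sketch below builds a request URL for Solr's select handler using its standard `q`, `fq`, `rows`, `wt`, `facet` and `facet.field` parameters. The base URL and the field names (`tag_what`, `tag_where`, `tag_who`) are assumptions made for this example; they are not the portal's actual configuration, and on the portal itself the queries went through the Drupal Apache Solr module.

```python
from urllib.parse import urlencode


def _quote_phrase(value):
    """Wrap a value in double quotes, escaping backslashes and quotes."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def build_solr_query(base_url, keywords, filters=None, facet_fields=(), rows=10):
    """Build a select URL: keyword query, filter queries and facets."""
    params = [("q", keywords), ("rows", rows), ("wt", "json")]
    # Each filter narrows the result set without affecting relevance scoring.
    for field, value in (filters or {}).items():
        params.append(("fq", f"{field}:{_quote_phrase(value)}"))
    if facet_fields:
        params.append(("facet", "true"))
        for field in facet_fields:
            params.append(("facet.field", field))
    return f"{base_url}/select?{urlencode(params)}"


# Hypothetical core URL and field names, for illustration only.
url = build_solr_query(
    "http://localhost:8983/solr/portal",
    "air pollution",
    filters={"tag_where": "Delhi"},
    facet_fields=("tag_what", "tag_who"),
)
print(url)
```

Facet counts on the tag fields are what let a researcher start from a broad keyword search and then narrow the results by topic, geography or person.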