Wednesday, January 24, 2018

Collections Database History

a.k.a. Thinking outside the relational database box


Like most software developers, I have "pet projects" that are used to try new technologies and platforms while performing some useful utility function for work or leisure. I have one such project that has been running for 36 years ... and it's finally practically finished!

The offending pet project is casually called "the collections database". It started around 1985 as a text DataSet on a Fujitsu OS/IV X8 mainframe. The DataSet originally listed my LP and EP records, but over the decades it grew to include other formats of audio and video titles as well as books, magazines and collectible items along with pictures and media samples. The various incarnations of this project accidentally provide an interesting historical timeline of changing technology and the various never ending fads that grip the IT industry now and then.

But the main reason I'm writing this post, is because for the first time in decades, after a dozen attempts using different technologies, the project is practically completed. And it was completed in about two weeks of evening hobby time, whereas previous attempts have dragged on unfinished for years.

What happened!? It's basically due to changing from a relational database (SQL Server) to a document database (Cosmos DB). The lesson in this is that RDBs can be used almost by habit because they have been so prevalent for 20 years or so, but the world is changing and there are alternative databases that can liberate you from the confined RDB world. Read on...

The Early Years

In May 1992 I purchased my first PC running Windows 3.1 (see Computers History). I immediately transcribed a fanfold printout of the mainframe DataSet into an Excel spreadsheet. Access 2.0 was released later that year and I imported the data into joined relational tables, thereby undergoing a very useful training exercise in data normalisation and RDB design principles. Despite years of hobby work in Access I remained frustrated by the restrictive, clumsy and verbose development environment and could never produce an application that I liked.

Around the same time I ran experiments with C++, MFC and the Windows SDK to write a native desktop application, but the complexity and difficulty level was so staggeringly high that it would have burnt my life and patience away.

The .NET Years

During the first 10 years after the release of .NET the collections database was restarted many times with project suites named Opus, Jade, Topaz, Nimbus, Agate and Folio. These suites used the following technologies, platforms and kits at various times as their popularity waxed and waned.

  •  Windows Forms  A rich and stable way of creating sophisticated desktop programs, but once you get used to data binding in WPF you will never go back to Windows Forms.
  •  SQL Server  The majority of my suites have used either the express or full versions as the collections database. It's a famous product, but it has a gigantic installation and runtime footprint, you need special skills and tools to maintain it, and even creating a basic database requires schema planning, scripting and management. SQL Server is heavyweight overkill for hobby projects.
  •  SQLite  One incarnation of my suite had abstracted the underlying database away so it could be replaced with SQL Server, SQLite or any other popular relational database. This delicate technical exercise eventually had no practical value and was abandoned. Despite the popularity of SQLite, I found it to be most irritating because of 32/64-bit distribution issues, the shocking difficulty of getting Visual Studio designer support working, everything inside it is stored as text, poor Unicode support, and worst of all: you cannot alter the schema without recreating the whole database. Update Feb-2021 : See SQLite Rehabilitated (it's much better now).
  •  netTiers  This CodeSmith template was originally a fabulous tool for generating database CRUD, but it eventually became dependent upon a non-free version of CodeSmith. It also generated STEs (self-tracking entities) which fell out of favour because they tie your app too closely to the data model. The release of Entity Framework eventually made netTiers redundant.
  •  Entity Framework  EF4 was the first primitive but useful release. I fell for the early trap of using their STEs which were dropped from later versions because they fell out of favour, and I agree they were a bad idea (except for simple utility work). EF6 has matured well and I use it in a variety of current projects. Never pass raw entities outside the data layer, create special DTO classes for that purpose.
  •  ASP.NET Web Forms  It's really hard to write non-trivial Web Forms apps. The complicated event pipeline combined with the sheer dumbness of HTML will have you clawing your face off trying to create an app that looks good and works reliably on different platforms. The modern chaos of the web itself is partly to blame. See: I'm in the future of the web and it doesn't work.
  •  ASP.NET MVC  This was originally going to be the "Web Forms killer" framework with a greatly simplified pipeline that gave you much more control over rendering. I soon found that they had replaced one heavy and complex Web framework with a different type of complex Web framework. There is so much hidden plumbing and conventions inside ASP.NET MVC that it's a real struggle to find out how to do something to correct way. If you have to decide which framework is the least worst, then MVC probably wins. Update Dec-2020 : Forget ASP.NET completely and read Blazor Webassembly Notes.
  •  Silverlight  One of my suites had a beautiful rich Silverlight UI, and I have Silverlight active in some live applications very effectively creating web apps that would have been impossible by any other means. Sadly, in 2012 Microsoft announced end-of-life for the product. This leaves us with no rich UI toolkit for rendering business apps in the browser. The only alternative is use huge slabs of JavaScript with SGV images and the canvas to create the illusion of rich UI. It's a tragedy. See my realted blog post Silverlight Death and Funeral.
  •  SOAP  This XML based network messaging protocol is simple to use and configure, but it has unfortunately been pushed aside by the rapid adoption of web services using the REST convention. This is also a tragedy, as SOAP is a complete standard protocol, whereas REST is a vague convention. See: Web API Status Codes and Errors.
  •  WCF  This was Microsoft's attempt to unify many communications protocols under a single set of service-oriented APIs (including SOAP). Unfortunately, the result is over-complicated and fiendishly difficult to configure and extend. WCF is also eclipsed by the rise in popularity of REST.
  •  JavaScript  Forget it. I recently researched the creation of a browser hosted app using Angular (because of its popularity) but was utterly shocked by what I discovered. The whole JavaScript ecosystem is a putrid smouldering pit. It's so appalling that I have written a separate damning blog post. Although a web search reveals there are already countless scathing anti-JavaScript and Angular articles to read!
All of the frameworks, tools and kits listed above have been used over the last 16 years to write a collections database suite. Many of them are now obsolete, not-supported, out of favour, or too difficult to use. This shows how volatile the software development industry is, riven by fads, competition, conflicts and lack of long-term focus.

    Complexity and RDB Tables

    All of my attempts to create a collections database suite have been hampered by complexity in two areas: (1) The user interface (2) Processing a relational database.

    Custom user interfaces are delicate and laborious to create, and I think all developers have learned to live with that. The introduction of WPF with XAML and binding improved the productivity and stability of writing desktop app UIs, but creating good quality web UIs remains an odious task.

    However, I have been continually frustrated by the effort required to pull and push the required data in and out of relational databases. This is the classical problem sometimes called the impedance mismatch, where data that has been beautifully normalised and stored in RDB tables is unlikely to be in a form suitable for processing by an application. Sometimes editing a single title in the collections database would require stitching together data from several tables and then shredding it to reverse the process. You can use an ORM like Entity Framework, but you have to add packages and references, create the data model, generate files, create DTO classes, perhaps write and map stored procedures, and so on. Just using an RDB bloats your code and increases complexity.

    On the Agate project Wiki page I have an archived article which describes how the collection database was reduced to 3NF (3rd normal form). The process of creating normalised RDB tables can be a rewarding experience which has a kind of mathematical elegance. It is this normalised elegance however that produces the previously mentioned impedance mismatch and increases project complexity.

    With the recent rapidly growing popularity and availability of cloud-based schema-less databases, I suddenly realised it was time to completely rethink how to implement a collections database and app suite. Read on...

    Cosmos DB

    During the 2017-18 Xmas break I decided to re-familiarise myself with Microsoft's Cosmos DB which is a rebranded Document DB product from several months earlier. I was quickly impressed. There are NuGet packages, a simple managed SQL-like API, configuration from the Azure portal and good documentation.

    As an experiment, I migrated my complete current SQL Server collections database into a Cosmos DB collection. While doing so, I flattened (de-normalised) all titles into individual documents with nested properties for values like owners, genres, artists, files, tracks, etc.

    I suddenly realised that storing my collection as "documents" instead of normalised tables is more natural and it makes manipulating the data much easier. Each title in the collection is a self-contained document that can be loaded and saved in one go. There is no need to mess about with relational tables, joins, DTOs and similar things that I previously complained about.

    To continue the experiment, I created a WPF desktop program to browse and edit the documents (see screenshot). And to my surprise, thanks to the greatly simplified processing, I had a working program over one weekend, and over two weekends it was practically finished. A similar program using an RDB would have taken several times longer.
    Side Note: You may occasionally have data which is best represented as a tree structure, or more generally as a graph of connected objects. In this case neither RDBs or simple documents are suitable, so consider using the Graph API over Cosmos DB documents which provides a startlingly powerful new way of storing and querying data.

    Summary

    Relational databases are fabulous if you have data that can be normalised and rarely changes shape, and you don't mind the burden of hosting, designing, scripting and code generation. The history of my collections reveals to me that I was using RDBs out of force-of-habit, and that I had over-normalised my data, which produced elegant tables but caused me to suffer from the impedance mismatch problem mentioned above. As a result of this, my collections projects were always clogged with complex code and libraries to manipulate relational data. Once I realised that my data was better represented as documents and I moved to Cosmos DB, it was like having a huge weight lifted off my shoulders, and thanks to the reduced code and dependencies my project was finally completed after so many decades.

    The new collections database management suite is called Hoarder and the full source code is available as a reference in an Azure DevOps repository.