Friday, February 3, 2017

Why most databases have so many useless and confusing objects in them

Recently I was working with a database system, trying to validate some output from an OLAP cube. I ran into discrepancies, and (with a lot of work and troubleshooting) narrowed it down to the fact that in one query, two tables were joined on the Product_ID field, and in another query, the same two tables were joined on the dim_Product_ID field. These two fields, Product_ID and dim_Product_ID, were identical most of the time, but not always! They were different enough to cause major problems when trying to match numbers.
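When two near-duplicate key columns like this exist, a quick sanity check can show how often they actually disagree. Here's a minimal sketch - the table name FactSales is hypothetical; only the two column names come from the situation above:

```sql
-- Count the combinations where the two candidate join keys disagree.
-- FactSales is a hypothetical table name; Product_ID and dim_Product_ID
-- are the two near-identical fields described above.
SELECT Product_ID, dim_Product_ID, COUNT(*) AS MismatchCount
FROM FactSales
WHERE Product_ID <> dim_Product_ID
   OR (Product_ID IS NULL AND dim_Product_ID IS NOT NULL)
   OR (Product_ID IS NOT NULL AND dim_Product_ID IS NULL)
GROUP BY Product_ID, dim_Product_ID
ORDER BY MismatchCount DESC;
```

If this returns zero rows, the two fields are interchangeable; any rows at all mean your choice of join column changes your numbers.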

Here's another example of useless and confusing objects - the database system I'm working with now is littered with the remains of what look like three abortive attempts to create a data warehouse, all of them currently unused. But the fact that there's all these separate tables and databases out there, replicating very similar data in different formats - it's enormously confusing to anyone trying to do new work on the system, and trying to figure out what's being used and what's not.

Why is it? Why do databases become cluttered with unused objects that cause massive headaches for anyone trying to use the data?

Here's what I think. It's a heck of a lot of work to figure out what's used and what's not, and to archive off or delete the unused objects. Not only is it a lot of work, but it's unrewarding and unappreciated work. You need to check with lots of people, try to figure out what might actually be used, and compare lots of code and data.

And most managers don't really have a solid understanding of how much harder basic database tasks become when the database contains a lot of leftover tables and other database objects. So the task of cleanup never gets prioritized, even though, over the long run, a clean, well-ordered database makes life so much easier for everyone who uses it.

The work that gets the glory and the praise is building new things - a brand new shiny data warehouse, for instance - even though it may just be adding to the general confusion in the database, especially (as happens so often) if it ends up not being used.

Wednesday, June 8, 2016

For readable code, avoid code you need to translate in your head

I'm a big advocate of readable code. That means things like good table aliases where necessary, separating code out into logical chunks, and CTEs (common table expressions) instead of derived tables.

One thing I've always tried to do is avoid code that requires translation. What do I mean by that? I try to avoid code such as this:

-- Call Answer Time After Hours
when IsBusinessHours = 'False' and SLARuleID = 2  
    then Sum([AnswrIn30Seconds]+[AnswrIn45Seconds])

It's okay when there's just one chunk of code like this, but if there's a whole set of code like this, you're more likely to make mistakes, because you need to internally translate: "So, if it's after hours, then IsBusinessHours must be False." This caused me a hard-to-troubleshoot bug recently.

What I'd rather read is something like this:

-- Call Answer Time After Hours
when IsAfterHours = 'True' and SLARuleID = 2  
    then Sum([AnswrIn30Seconds]+[AnswrIn45Seconds])

It's much easier to understand. So what I do now in these situations is to set up a field that's the opposite of the original, when it would help readability. I use something like this:

        ,IsAfterHours   = 
                    Case When IsBusinessHours = 1 Then 0
                         When IsBusinessHours = 0 Then 1
                         Else Null
                    End
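Putting it together, here's a sketch of how the derived flag might be used - the table name CallStats is hypothetical, and I'm using the CTE style I mentioned above; only IsBusinessHours, SLARuleID, AnswrIn30Seconds, and AnswrIn45Seconds come from the earlier examples:

```sql
-- Sketch only: CallStats is a hypothetical table name.
;with CallsWithFlags as
(
    select *
        ,IsAfterHours =
            case when IsBusinessHours = 1 then 0
                 when IsBusinessHours = 0 then 1
                 else null
            end
    from CallStats
)
-- Call Answer Time After Hours
select sum(AnswrIn30Seconds + AnswrIn45Seconds) as AfterHoursAnswerTime
from CallsWithFlags
where IsAfterHours = 1
  and SLARuleID = 2;
```

Now the condition reads the way you'd say it out loud, with no mental negation required.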

Monday, February 8, 2016

Flexible update trigger - take the column value when provided, otherwise have the trigger provide a value

We have a small, departmental database with some tables that will sometimes be updated via a batch process, and sometimes by people inputting the data directly in a datasheet. Here's a neat trick that lets us have auditing fields (UpdateDate, LastUpdatedBy) that will either reflect the value provided by the update statement, OR provide a default value if no value is provided in the update statement. The trigger does this using the UPDATE() function, which has apparently been around for a while, but I'd never used it before.
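For context, the table looked something like this - a hypothetical sketch, since only SLAMetricID, MetricNumerator, UpdateDate, and LastUpdatedBy actually appear in the code; the other column and all the types are my guesses:

```sql
-- Hypothetical sketch of the audited table; column types and the
-- MetricDenominator column are illustrative guesses, not the real schema.
CREATE TABLE dbo.SLAMetric
(
    SLAMetricID        int identity(1,1) PRIMARY KEY
    ,MetricNumerator   int
    ,MetricDenominator int
    ,UpdateDate        datetime
    ,LastUpdatedBy     sysname
);
```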

Here's the trigger:

 IF EXISTS (SELECT * FROM sys.triggers WHERE object_id = OBJECT_ID('dbo.trg_Update_SLAMetric'))  
 DROP TRIGGER dbo.trg_Update_SLAMetric  
 GO  
 CREATE TRIGGER trg_Update_SLAMetric  
 ON dbo.SLAMetric  
 AFTER UPDATE  
 AS  
 -- Update UpdateDate, LastUpdatedBy  
 UPDATE SLAMetric  
 SET UpdateDate = GETDATE()  
   ,LastUpdatedBy =  
       CASE WHEN UPDATE(LastUpdatedBy) THEN Inserted.LastUpdatedBy  
       ELSE SUSER_NAME()  
       END  
 FROM dbo.SLAMetric SLAMetric  
   JOIN Inserted  
     ON Inserted.SLAMetricID = SLAMetric.SLAMetricID  

It correctly updates the LastUpdatedBy field in both situations - updating the table and including the LastUpdatedBy field, and also just updating other fields (in which case it's simply set to SUSER_NAME()).

 update slametric set MetricNumerator = 0  
 select * from slametric  
 update slametric set LastUpdatedBy = 'test'  
 select * from slametric  

Tuesday, March 10, 2015

Microsoft Azure error - "Cannot open database "master" requested by the login"

I've been busy setting up a database for a TSQL programming class that I'll be teaching in a few weeks. I've run into a few things that made me scratch my head. Here's one of them.

I created a sample database, a new login, and a user to go with the login. Then I gave the user read only access to the database, via sp_addrolemember.
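The setup steps looked roughly like this - the login, user, and password here are placeholders, not the ones I actually used, and I'm assuming db_datareader as the read-only role:

```sql
-- Placeholder names and password throughout. CREATE LOGIN runs in the
-- master database; the user and role statements run in the sample database.
CREATE LOGIN ClassLogin WITH PASSWORD = 'ChangeThis!123';

CREATE USER ClassUser FOR LOGIN ClassLogin;
EXEC sp_addrolemember 'db_datareader', 'ClassUser';
```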

When I used that login to connect via SSMS (File, Connect Object Explorer, enter the new login, also click on the Options button to specify the correct database), it appears to connect fine, and I see the connection.

However, when I expand the databases tab, and try to right-click the sample database to get the New Query option, I get the message "Cannot open database "master" requested by the login. The login failed.". So even though I was trying to open a query in the sample database, it tried accessing the master database.

This is a strange one, but the workaround is to right-click on the connection object, not the database. This will let you open up a new query, and will automatically put you in the database you specified in the connection options.

Doing this involves a bit of rethinking, because most developers who have worked extensively with SSMS are accustomed to right clicking on a database and selecting New Query to open a connection. But this won't work with Azure, unless you're also a user in the Master database.

Hopefully this write-up will save a few people some time when they go searching for details on this issue. Feel free to comment with more information.

Monday, March 9, 2015

Error "The database entered is not valid" when trying to set a default database, connecting to SQL Server Azure

Are you getting the error "The database entered is not valid" when trying to set a default database in the ODBC Data Source Administrator, DSN configuration? I did too, and it took me a while to figure out the root cause.

It turns out I was using the SQL Server driver (version 6.01.7601.17514), when I should have been using the SQL Server Native Client 11.0 driver (2011.110.3000.00).

There are apparently a few other causes of this error as well, but switching to the more recent driver took care of the issue for me.

Monday, March 2, 2015

Review - MicroStrategy Analytics Express

I'm doing a review of MicroStrategy Analytics Express, and thought it might be interesting to jot down a few things I run into as I work my way through it. I'm connecting to a SQL Server Azure 11 database, onto which I've loaded the AdventureWorks 2012 sample database.

A few notes, in no particular order:

- I'm getting the below error a lot when I try to view the available tables:
TypeError - Cannot read property 'n' of undefined
I've narrowed it down, and it appears to happen whenever the table I'm trying to view has a column with a data type of HierarchyID, or any user-defined data type. Which, unfortunately, is about 90 percent of the tables in the AdventureWorks sample database. This is quite problematic. So far I haven't found a workaround.

- In the manual, there's a distinction made between the "Quick, easy-to-use dashboard" and the "Customizable, detailed dashboard". These phrases are repeated over and over again. For the sake of clarity, they really should have come up with names for these two types of dashboards - for instance, Lite Dashboard and Superuser Dashboard. It's confusing to read these phrases again and again when two object types like this would normally be given distinct names.

- It looks like for text files, only comma-delimited files are supported. That's a little surprising - most products have supported multiple delimiters for decades.

- It's not possible to edit the underlying source of a dataset in Analytics Express. A dataset is an imported (not linked) set of data. Once you've imported it, you can't edit the underlying SQL in any way. You can, however, do things like put an expression on one of the fields. You also can't name a dataset - it's automatically given the name of the underlying table or tables with the date/time appended.

- When creating a new dataset, the field names are interpreted for you - MicroStrategy tries to guess whether each one is a metric or an attribute. Frequently, however, it guesses wrong, and I haven't found a way to correct it. For instance, a column named BusinessEntityId is determined to be a metric, and there's no way to reclassify it.

- When creating a new dataset, once you've chosen your fields, you'd expect them to actually show up when you click "Continue". But they don't. The fields that show up are the fields from the first dataset you ever created. I was scratching my head over this, wondering which button I'd clicked or option I'd picked incorrectly. But I've redone it a few times now, and this is really what happens. If you actually want to see the resultset from the dataset you just created, you have to close the grid that shows up by default and open the one you just created. It's confusing.

I'm looking forward to trying out the Desktop Developer option, which I'm assuming doesn't have some of the limitations of the online Analytics Express.

Monday, September 8, 2014

Troubleshooting slow SQL Server stored procedures - using the Modulo function to improve productivity

 Recently I needed to create a stored procedure to process and cleanse a huge set of call center data on a daily basis - more than half a million calls a day. And because data could be backed up and unavailable at times, sometimes a run of multiple days would be necessary, so many millions of rows. 

While I was still working out the logic of the cleansing, my procedure was still pretty slow. Even one day's run took 10 minutes. This meant that every time I needed to tweak the logic, it would take at least 10 minutes to examine the results! 

That kind of work pattern gets old quickly. The problem with having to take a 10-minute break all the time is that task switching is a pain and makes you forget things. It's easier to sustain focus if you don't have to spend 10 minutes out of every 20 working on something else.

So, how to speed up the procedure and get rid of all those inefficient breaks? For this stage of my work, it would have been fine to work with a much smaller subset of data, as long as I got a good sampling of records. A little hack gave me exactly that. My first step was creating a temporary table to bring in the records to be processed, so while I was still working out the logic, I added a where clause like this to get a much smaller subset of data:

   convert(date, CallEnd) in (select CallDate from @DatesToFetch)  
   -- Just get 1 out of 100 records for now  
   and CustomerCallID % 100 = 0   

This uses the modulo operator (%), which gives the remainder of the division of one number by another. In the above example, CustomerCallID was an identity value increasing by 1 each time, so the filter would get every CustomerCallID ending in 00 (where the remainder of CustomerCallID divided by 100 was 0). This was a great way to cut back dramatically on the time required to run the procedure, yet still get a great sampling of records. Then when you're ready to run the full set of data, just comment out the filter.
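Put together, the staging step looked something like this - #CallsToProcess and the source table CallDetail are hypothetical names; the where clause is the one shown above:

```sql
-- Sketch of the staging step with the sampling filter in place.
-- CallDetail and #CallsToProcess are hypothetical names; CustomerCallID,
-- CallEnd, and @DatesToFetch come from the example above.
select *
into #CallsToProcess
from CallDetail
where convert(date, CallEnd) in (select CallDate from @DatesToFetch)
  -- Just get 1 out of 100 records for now; comment out for the full run
  and CustomerCallID % 100 = 0;
```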