LINQ and the Problem of Premature Materialization


I am a huge believer in the power and usefulness of LINQ. At the same time, I often see common mistakes that can have a large impact on performance. When working with an IEnumerable, it’s important to remember that you may not be working with the actual data, but with a promise that data will be available.

IEnumerable represents the concept of a collection of data. Similar to a forward-only cursor in SQL, an IEnumerable is a forward-only sequence that loads and iterates over records one at a time. If you read 1,000 records from a database through a streaming IEnumerable, only one record needs to be in memory at any given time.
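
To make the "promise of data" idea concrete, here is a minimal sketch of deferred execution. The `ReadRecords` iterator and the `RecordsRead` counter are hypothetical stand-ins for a streaming data source; they are not part of LINQ or of the code discussed in this post.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class DeferredExecutionDemo
{
    // Counts how many records the hypothetical source has produced so far.
    public static int RecordsRead;

    // Stand-in for a streaming data source: each record is produced only
    // when the consumer asks for the next item.
    public static IEnumerable<int> ReadRecords(int count)
    {
        for (var i = 1; i <= count; i++)
        {
            RecordsRead++;
            yield return i;
        }
    }

    public static void Run()
    {
        RecordsRead = 0;

        // Building the query reads nothing; it only describes the work.
        var evens = ReadRecords(1000).Where(n => n % 2 == 0);
        Console.WriteLine(RecordsRead);                          // 0

        // First() stops at the first match, after reading two records.
        var firstEven = evens.First();
        Console.WriteLine($"{firstEven}: {RecordsRead}");        // 2: 2

        // ToList() materializes the query: the source is run again from the
        // start and all 1000 records are read.
        var allEvens = evens.ToList();
        Console.WriteLine($"{allEvens.Count}: {RecordsRead}");   // 500: 1002
    }
}
```

Calling `DeferredExecutionDemo.Run()` shows that nothing is read until the query is enumerated, and that every new enumeration starts the source over.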

Materialization in Memory
When you force materialization in memory, all records must be loaded into memory for manipulation. I’m going to use the following method as an example.

    string GetDefaultSMSPhoneNumber(IEnumerable<PhoneNumbers> patientNumbers)
    {
        const int PHONE_TYPE_HOME = 1;
        const int PHONE_TYPE_OFFICE = 3;
        const int PHONE_TYPE_OTHER = 9;

        var phoneNumberByType = patientNumbers.Where(p => p.sms_capable == 1).GroupBy(p => p.phone_type_id);

        // Select the phone number last used in creating a prescription
        if (patientNumbers.Where(p => p.sms_capable == 1 && p.last_used_for_rx == 1).Count() > 0)
        {
            return patientNumbers.Where(p => p.sms_capable == 1 && p.last_used_for_rx == 1).FirstOrDefault().phone_number;
        }

        // If no number has been used, select a configured SMS number in the following order (Other, Home, Office) 
        if (patientNumbers.Where(p => p.sms_capable == 1 && p.phone_type_id == PHONE_TYPE_OTHER).Count() > 0)
        {
            return patientNumbers.Where(p => p.sms_capable == 1 && p.phone_type_id == PHONE_TYPE_OTHER).FirstOrDefault().phone_number;
        }

        // If no number has been used, select a configured SMS number in the following order (Other, Home, Office) 
        if (patientNumbers.Where(p => p.sms_capable == 1 && p.phone_type_id == PHONE_TYPE_HOME).Count() > 0)
        {
            return patientNumbers.Where(p => p.sms_capable == 1 && p.phone_type_id == PHONE_TYPE_HOME).FirstOrDefault().phone_number;
        }

        // If no number has been used, select a configured SMS number in the following order (Other, Home, Office) 
        if (patientNumbers.Where(p => p.sms_capable == 1 && p.phone_type_id == PHONE_TYPE_OFFICE).Count() > 0)
        {
            return patientNumbers.Where(p => p.sms_capable == 1 && p.phone_type_id == PHONE_TYPE_OFFICE).FirstOrDefault().phone_number;
        }

        return string.Empty;
    }

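Aside from its high cyclomatic complexity, this method enumerates the source collection at least 2 times and as many as 5 times: every Count() call walks the entire sequence, and the FirstOrDefault() that follows a successful check starts over from the beginning. (The unused phoneNumberByType query is never enumerated, so thanks to deferred execution it does not add a pass.) Some of these issues can be remedied simply by removing the unneeded or redundant calls. For example: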

    // The following statement can be cleaned up a few different ways
    if (patientNumbers.Where(p => p.sms_capable == 1 && p.last_used_for_rx == 1).Count() > 0)
    {
        return patientNumbers.Where(p => p.sms_capable == 1 && p.last_used_for_rx == 1).FirstOrDefault().phone_number;
    }

    // The criteria in the Where can be moved into the Count.
    // Because we are only checking for existence, Count should be replaced with Any.
    // This approach still iterates the collection twice: once for Any and once for FirstOrDefault.
    if (patientNumbers.Any(p => p.sms_capable == 1 && p.last_used_for_rx == 1))
    {
        // Similarly, the criteria in the Where clause can be moved directly into FirstOrDefault
        return patientNumbers.FirstOrDefault(p => p.sms_capable == 1 && p.last_used_for_rx == 1).phone_number;
    }

    // Alternatively, you can do the following, which only enumerates your collection once
    var patient = patientNumbers.FirstOrDefault(p => p.sms_capable == 1 && p.last_used_for_rx == 1);
    if (patient != null)
        return patient.phone_number;
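
The cost difference between the three variations above can be measured with a small harness. This is only a sketch: `CountingEnumerable<T>` and `ExistenceCheckDemo` are hypothetical helpers written for this illustration, and the sequence 1..1000 with the predicate `x > 10` stands in for the phone number lookup.

```csharp
using System;
using System.Collections;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper (not part of LINQ): wraps a sequence and records how
// many times it is enumerated and how many items are pulled from it.
public sealed class CountingEnumerable<T> : IEnumerable<T>
{
    private readonly IEnumerable<T> _source;
    public int Enumerations { get; private set; }
    public int ItemsRead { get; private set; }

    public CountingEnumerable(IEnumerable<T> source) => _source = source;

    public IEnumerator<T> GetEnumerator()
    {
        Enumerations++;
        return Iterate().GetEnumerator();
    }

    private IEnumerable<T> Iterate()
    {
        foreach (var item in _source)
        {
            ItemsRead++;
            yield return item;
        }
    }

    IEnumerator IEnumerable.GetEnumerator() => GetEnumerator();
}

public static class ExistenceCheckDemo
{
    // Runs a check against a fresh counted sequence of 1..1000 and reports its cost.
    public static (int Enumerations, int ItemsRead) Measure(Action<IEnumerable<int>> check)
    {
        var numbers = new CountingEnumerable<int>(Enumerable.Range(1, 1000));
        check(numbers);
        return (numbers.Enumerations, numbers.ItemsRead);
    }

    public static void Run()
    {
        // Where + Count, then Where + FirstOrDefault: Count has to read every item.
        Console.WriteLine(Measure(n =>
        {
            if (n.Where(x => x > 10).Count() > 0)
                _ = n.Where(x => x > 10).FirstOrDefault();
        }));   // (2, 1011)

        // Any, then FirstOrDefault: both stop at the first match, but there are still two passes.
        Console.WriteLine(Measure(n =>
        {
            if (n.Any(x => x > 10))
                _ = n.FirstOrDefault(x => x > 10);
        }));   // (2, 22)

        // A single FirstOrDefault: one pass that stops at the first match.
        Console.WriteLine(Measure(n => { _ = n.FirstOrDefault(x => x > 10); }));   // (1, 11)
    }
}
```

Against an in-memory list the extra passes are cheap, but if the sequence is backed by a database query or a file, each additional enumeration may repeat that I/O.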
 

While refactoring our “if” statements cleans up the code, each one still enumerates the collection, which is undesirable. If we look at the objective of the method, we can refactor the logic to use LINQ itself instead of layering logic on top of data retrieved with LINQ. We have a list of phone numbers, broken into prioritized groups, and we want to return the first eligible number.

    string GetDefaultSMSPhoneNumber(IEnumerable<PhoneNumbers> patientNumbers)
    {
        // Because the numeric values for the phone types don't match the order of importance,
        // a dictionary maps each phone type to its desired priority.
        // PhoneType is assumed to be an enum whose values match the phone_type_id constants used earlier.
        var phoneTypeSortOrder = new Dictionary<int, int> { { (int)PhoneType.Other, 1 }, { (int)PhoneType.Home, 2 }, { (int)PhoneType.Office, 3 } };

        // Filter to only SMS-capable numbers and check whether one has been previously used
        var smsNumbers = patientNumbers.Where(p => p.sms_capable == 1);
        var lastUsedNumber = smsNumbers.FirstOrDefault(p => p.last_used_for_rx == 1);
        if (lastUsedNumber != null)
            return lastUsedNumber.phone_number;

        // If no number has been used, select a configured SMS number in the following order (Other, Home, Office)
        var configuredNumber = smsNumbers.Where(x => phoneTypeSortOrder.ContainsKey(x.phone_type_id))
            .GroupBy(p => phoneTypeSortOrder[p.phone_type_id])  // Put the numbers into groups based on priority
            .OrderBy(g => g.Key)                                // Order the groups by priority
            .SelectMany(g => g)                                 // Flatten the ordered groups back into numbers
            .FirstOrDefault();                                  // Take the first phone number, if any
        return configuredNumber?.phone_number ?? string.Empty;
    }

As you can see, the refactored code is much smaller, is still easy to read, and enumerates the source fewer times in every scenario. In the best case, the collection is enumerated only once, and only partially, when the phone number last used for Rx is retrieved. In the worst case, it is enumerated twice: once for that lookup and once when GroupBy buffers the remaining SMS-capable numbers in memory so they can be ordered.
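
The enumeration counts for the refactored method can be checked with a short harness. This is a sketch under stated assumptions: the `PhoneNumbers` class and `PhoneType` enum are guesses at the shape of the real types (with enum values taken from the constants in the original method), while `Stream`, `Passes`, `Number`, and the sample phone numbers are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Assumed shapes; the real types are not shown in this post.
public enum PhoneType { Home = 1, Office = 3, Other = 9 }

public class PhoneNumbers
{
    public int phone_type_id { get; set; }
    public int sms_capable { get; set; }
    public int last_used_for_rx { get; set; }
    public string phone_number { get; set; } = string.Empty;
}

public static class EnumerationCountDemo
{
    // Number of times the hypothetical source has been enumerated.
    public static int Passes;

    // Hypothetical streaming source: the body runs once per enumeration.
    public static IEnumerable<PhoneNumbers> Stream(params PhoneNumbers[] rows)
    {
        Passes++;
        foreach (var row in rows) yield return row;
    }

    // Same logic as the refactored method above.
    public static string GetDefaultSMSPhoneNumber(IEnumerable<PhoneNumbers> patientNumbers)
    {
        var phoneTypeSortOrder = new Dictionary<int, int> { { (int)PhoneType.Other, 1 }, { (int)PhoneType.Home, 2 }, { (int)PhoneType.Office, 3 } };

        var smsNumbers = patientNumbers.Where(p => p.sms_capable == 1);
        var lastUsedNumber = smsNumbers.FirstOrDefault(p => p.last_used_for_rx == 1);
        if (lastUsedNumber != null)
            return lastUsedNumber.phone_number;

        var configuredNumber = smsNumbers.Where(x => phoneTypeSortOrder.ContainsKey(x.phone_type_id))
            .GroupBy(p => phoneTypeSortOrder[p.phone_type_id])
            .OrderBy(g => g.Key)
            .SelectMany(g => g)
            .FirstOrDefault();
        return configuredNumber?.phone_number ?? string.Empty;
    }

    // Builds an SMS-capable sample record.
    public static PhoneNumbers Number(PhoneType type, string number, bool lastUsed) =>
        new PhoneNumbers { phone_type_id = (int)type, sms_capable = 1, last_used_for_rx = lastUsed ? 1 : 0, phone_number = number };

    public static void Run()
    {
        // Best case: a number was last used for Rx, so the first lookup succeeds.
        Passes = 0;
        var best = GetDefaultSMSPhoneNumber(Stream(
            Number(PhoneType.Office, "555-0103", false),
            Number(PhoneType.Home, "555-0101", true)));
        Console.WriteLine($"{best} after {Passes} pass(es)");    // 555-0101 after 1 pass(es)

        // Worst case: no number was last used, so the priority ordering runs too.
        Passes = 0;
        var worst = GetDefaultSMSPhoneNumber(Stream(
            Number(PhoneType.Office, "555-0103", false),
            Number(PhoneType.Other, "555-0109", false)));
        Console.WriteLine($"{worst} after {Passes} pass(es)");   // 555-0109 after 2 pass(es)
    }
}
```

If even two passes are too expensive for a particular source, the caller can materialize the sequence once (for example with `ToList()`) before passing it in; that trades memory for a single read of the underlying data.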

Hopefully this provides some help and insight as you continue to work with LINQ.
Happy to hear your thoughts!