In my previous post I created an extension method to select a random element from a generic collection. It works great, but I wanted to see whether there was any way to optimize the method for better performance. Since my extension method uses the LINQ extension method ElementAt(), I first looked at the .NET code for System.Linq.Enumerable.ElementAt() and found the following:
```csharp
public static TSource ElementAt<TSource>(this IEnumerable<TSource> source, int index)
{
    TSource current;
    if (source == null)
    {
        throw Error.ArgumentNull("source");
    }
    IList<TSource> list = source as IList<TSource>;
    if (list != null)
    {
        return list[index];
    }
    if (index < 0)
    {
        throw Error.ArgumentOutOfRange("index");
    }
    using (IEnumerator<TSource> enumerator = source.GetEnumerator())
    {
    Label_0036:
        if (!enumerator.MoveNext())
        {
            throw Error.ArgumentOutOfRange("index");
        }
        if (index == 0)
        {
            current = enumerator.Current;
        }
        else
        {
            index--;
            goto Label_0036;
        }
    }
    return current;
}
```
It first tries to cast the source collection to an IList, and if that succeeds, it indexes directly into the collection at the specified position. If the cast fails, it simply walks down the source collection using the Enumerator.MoveNext() method until it reaches the desired index. After seeing this, my first thought was to test which of the two approaches, casting to an IList and indexing or walking the collection one element at a time, was faster. I figured that since the .NET Framework first attempts the IList cast, that would be the faster method.
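To see the two paths concretely, here is a small sketch of my own (not from the original post): arrays and List&lt;T&gt; implement IList&lt;T&gt;, so ElementAt() can index in constant time, while a sequence produced by an iterator block does not, forcing ElementAt() to walk the sequence.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class Program
{
    // A sequence that is NOT an IList<T>, so ElementAt() must enumerate it.
    static IEnumerable<int> Generate(int n)
    {
        for (int i = 0; i < n; i++)
            yield return i;
    }

    static void Main()
    {
        int[] array = { 10, 20, 30 };

        // Arrays implement IList<T>, so ElementAt() takes the fast indexing path.
        Console.WriteLine(array is IList<int>);        // True
        Console.WriteLine(array.ElementAt(2));         // 30

        // An iterator block does not, so ElementAt() walks element by element.
        Console.WriteLine(Generate(3) is IList<int>);  // False
        Console.WriteLine(Generate(3).ElementAt(2));   // 2
    }
}
```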
Here are the two methods I tested: the first uses my original approach of simply calling ElementAt(), and the second gets an Enumerator on the collection and traverses it until the desired element is reached.
```csharp
public static T GetRandomElement<T>(this IEnumerable<T> list)
{
    if (list == null)
        throw new ArgumentNullException("list");

    // If there are no elements in the collection, return the default value of T
    if (list.Count() == 0)
        return default(T);

    return list.ElementAt(random.Next(list.Count()));
}

public static T GetRandomElement1<T>(this IEnumerable<T> list)
{
    if (list == null)
        throw new ArgumentNullException("list");

    // Get the number of elements in the collection
    int count = list.Count();

    // If there are no elements in the collection, return the default value of T
    if (count == 0)
        return default(T);

    // Get a random index (reuse count rather than enumerating the collection again)
    int index = random.Next(count);

    using (IEnumerator<T> enumerator = list.GetEnumerator())
    {
        // Move down the collection one element at a time.
        // When index is -1 we are at the random element location
        while (index >= 0 && enumerator.MoveNext())
            index--;

        // Return the current element
        return enumerator.Current;
    }
}
```
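One caveat worth noting with both versions: when the source is a lazy sequence rather than a materialized collection, every call to Count() or ElementAt() re-runs the whole enumeration. A small sketch of my own (the names here are just placeholders) that counts how many times a sequence is enumerated:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class Program
{
    static int enumerations = 0;

    // The iterator body runs from the top on each enumeration,
    // so the counter increments once per full or partial pass.
    static IEnumerable<int> Numbers()
    {
        enumerations++;
        for (int i = 0; i < 5; i++)
            yield return i;
    }

    static void Main()
    {
        IEnumerable<int> seq = Numbers();
        int count = seq.Count();         // first enumeration
        int element = seq.ElementAt(2);  // second enumeration
        Console.WriteLine(enumerations); // 2
    }
}
```

This is why materializing with ToArray() or ToList() first can pay off when a method touches the sequence more than once.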
To perform the test I used the following code, which generates arrays of strings with 1 to 500 elements each and then uses each of the two GetRandomElement() methods to select random elements from the array. Before calling each method I generate a new array of strings to prevent false performance gains due to caching. I performed the test twice, first selecting 2,000 random elements and then selecting 20,000 random elements. (Note: this test uses the RandomText class from my previous post to generate the random strings.)
```csharp
public static void Test1()
{
    RandomText random = new RandomText();
    StringBuilder results = new StringBuilder();
    results.AppendLine("List Length,Method One,Method Two,Difference");
    Stopwatch sw = new Stopwatch();

    // Test on lists of length 1 to 500
    for (int listLength = 1; listLength <= 500; listLength++)
    {
        Console.WriteLine("Current list length: " + listLength);
        long method1 = 0;
        long method2 = 0;

        // Get an array of random strings with a length of the current listLength
        string[] strings = random.Sentance(listLength).Split(' ');

        sw.Reset();
        sw.Start();

        // Get a random element 20,000 times from the array using method 1
        for (int i = 0; i < 20000; i++)
        {
            string s = strings.GetRandomElement<string>();
        }

        sw.Stop();

        // Record the time required for method 1
        method1 = sw.ElapsedMilliseconds;

        // Get a different array of random strings to ensure we don't
        // get performance enhancement due to caching
        strings = random.Sentance(listLength).Split(' ');

        sw.Reset();
        sw.Start();

        // Get a random element 20,000 times from the array using method 2
        for (int i = 0; i < 20000; i++)
        {
            string s = strings.GetRandomElement1<string>();
        }

        sw.Stop();

        // Record the time required for method 2
        method2 = sw.ElapsedMilliseconds;

        // Store the results
        results.AppendLine(listLength + "," + method1 + "," + method2 + "," + (method1 - method2));
    }

    // Dispose the writer deterministically so the file is flushed and closed
    using (StreamWriter writer = new StreamWriter(@"C:\Temp\RandomStats.csv"))
    {
        writer.Write(results.ToString());
    }

    Console.WriteLine();
    Console.WriteLine("Press enter to close...");
    Console.ReadLine();
}
```
Here are the summarized results:
It turns out that in both tests, when the collection of strings has between 1 and 100 elements, the second method using the Enumerator actually performs better! I repeated the test multiple times and it produced the same results. I even swapped which method was called first to ensure a fair test, and the results were similar.
After my first test I decided to create a new GetRandomElement() method that takes advantage of this new information. The new method first checks the length of the collection: if it contains 100 or fewer elements, it gets a random element using the Enumerator.MoveNext() method; otherwise it just calls the Enumerable.ElementAt() method. Here is the code:
```csharp
public static T GetRandomElement<T>(this IEnumerable<T> list)
{
    if (list == null)
        throw new ArgumentNullException("list");

    // Get the number of elements in the collection
    int count = list.Count();

    // If there are no elements in the collection, return the default value of T
    if (count == 0)
        return default(T);

    // Get a random index (reuse count rather than enumerating the collection again)
    int index = random.Next(count);

    // When the collection has 100 elements or less, get the random element
    // by traversing the collection one element at a time.
    if (count <= 100)
    {
        using (IEnumerator<T> enumerator = list.GetEnumerator())
        {
            // Move down the collection one element at a time.
            // When index is -1 we are at the random element location
            while (index >= 0 && enumerator.MoveNext())
                index--;

            // Return the current element
            return enumerator.Current;
        }
    }

    // Get an element using LINQ which casts the collection
    // to an IList and indexes into it.
    return list.ElementAt(index);
}
```
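If you want to try the method yourself, here is a minimal self-contained sketch. The static random field comes from my previous post, and the class name ListExtensions is just a placeholder, not the name used in the original code.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ListExtensions
{
    private static readonly Random random = new Random();

    public static T GetRandomElement<T>(this IEnumerable<T> list)
    {
        if (list == null)
            throw new ArgumentNullException("list");

        int count = list.Count();
        if (count == 0)
            return default(T);

        int index = random.Next(count);

        // Small collections: walk with the enumerator (faster per the tests above).
        if (count <= 100)
        {
            using (IEnumerator<T> enumerator = list.GetEnumerator())
            {
                while (index >= 0 && enumerator.MoveNext())
                    index--;
                return enumerator.Current;
            }
        }

        // Large collections: let ElementAt() take the IList indexing path.
        return list.ElementAt(index);
    }
}

class Demo
{
    static void Main()
    {
        int[] small = Enumerable.Range(1, 50).ToArray();   // takes the enumerator path
        int[] large = Enumerable.Range(1, 500).ToArray();  // takes the ElementAt path

        int s = small.GetRandomElement();
        int l = large.GetRandomElement();
        Console.WriteLine(s >= 1 && s <= 50);    // True
        Console.WriteLine(l >= 1 && l <= 500);   // True
    }
}
```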
I then wanted to check the performance of my original method against that of the new one. I'll spare you the code as it is quite similar to the code above, but instead of incrementing the collection size from 1 to 500 on each iteration, I created a collection of a random size between 1 and 200 elements. I chose the range of 1 to 200 because my new method decides how to fetch the random element by checking the collection size, and the breaking point is at 100 elements; generating a collection of random size between 1 and 200 therefore gives each strategy an equal chance of being exercised. Again I performed this test selecting both 2,000 and 20,000 random elements from each collection, first using my original method and then my new method. Here are the results:
In both cases the time required to generate a random element decreased with the new optimized method. When selecting 2,000 random elements, the new method completed on average 0.338 ms quicker and 169 ms faster overall. When selecting 20,000 random elements, the new method completed on average 2.312 ms quicker and 1156 ms faster overall.
Seeing the improvement in performance as the number of random elements selected increases, I decided to perform the same test but selecting 200,000 elements on each iteration. Here are the results:
The new method completed on average 32.346 ms quicker and 16.172 seconds faster overall.
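As a sanity check, the per-iteration averages and the overall totals are consistent with the 500-iteration test loop:

```latex
0.338\ \text{ms} \times 500 = 169\ \text{ms}, \qquad
2.312\ \text{ms} \times 500 = 1156\ \text{ms}, \qquad
32.346\ \text{ms} \times 500 \approx 16.17\ \text{s}.
```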
True, most of the time we are only talking about an improvement of a few milliseconds, but those milliseconds add up if you are performing a large number of operations.
For those that are interested, the raw test result data can be found on Google Docs here.