Monday, July 19, 2010

Bing! Are you kidding me!

[image]

Seriously?? If things like this happen, Bing! will never become the search engine I wish it would be.

Friday, July 16, 2010

WPF Datagrid – Load and Performance

This post is not about performance numbers for the WPF DataGrid, but simply about what you should be aware of in order to make it perform well. I was not motivated enough to use a profiler to show realistic numbers, so I used the Stopwatch class wherever applicable. This post also does not go into techniques for handling large amounts of data, such as paging; it focuses on making the DataGrid itself work with large data.

Here is the C# class that generates the data I want to load the DataGrid with.

public class DataItem
{
    public long Id { get; set; }
    public string FirstName { get; set; }
    public string LastName { get; set; }
    public long Age { get; set; }
    public string City { get; set; }
    public string Designation { get; set; }
    public string Department { get; set; }
}

public static class DataGenerator
{
    private static readonly Random rand = new Random();
    private static int _next = 1;

    public static IEnumerable<DataItem> GetData(int count)
    {
        for (var i = 0; i < count; i++)
        {
            string nextRandomString = NextRandomString(30);
            yield return new DataItem
            {
                Age = rand.Next(100),
                City = nextRandomString,
                Department = nextRandomString,
                Designation = nextRandomString,
                FirstName = nextRandomString,
                LastName = nextRandomString,
                Id = _next++
            };
        }
    }

    // Note: random bytes decoded as UTF-8 yield arbitrary, mostly
    // unprintable text, which is fine for load testing.
    private static string NextRandomString(int size)
    {
        var bytes = new byte[size];
        rand.NextBytes(bytes);
        return Encoding.UTF8.GetString(bytes);
    }
}
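As an aside, decoding random bytes as UTF-8 mostly produces unreadable characters. That is fine for load testing, but if you want printable test strings, a sketch like the following (my illustration, not part of the original code) could replace NextRandomString:

// Illustrative alternative: random lowercase letters, so the
// generated names and cities are printable.
private static string NextRandomLetters(int size)
{
    var chars = new char[size];
    for (var i = 0; i < size; i++)
        chars[i] = (char)('a' + rand.Next(26));
    return new string(chars);
}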

My ViewModel has been defined as shown below.

public class MainWindowViewModel : INotifyPropertyChanged
{
    public event PropertyChangedEventHandler PropertyChanged;

    private void Notify(string propName)
    {
        if (PropertyChanged != null)
            PropertyChanged(this, new PropertyChangedEventArgs(propName));
    }

    private Dispatcher _current;

    public MainWindowViewModel()
    {
        _current = Dispatcher.CurrentDispatcher;
        DataSize = 50;
        EnableGrid = true;
        _data = new ObservableCollection<DataItem>();
    }

    private int _dataSize;
    public int DataSize
    {
        get { return _dataSize; }
        set
        {
            LoadData(value - _dataSize);
            _dataSize = value;
            Notify("DataSize");
        }
    }

    private ObservableCollection<DataItem> _data;
    public ObservableCollection<DataItem> Data
    {
        get { return _data; }
        set
        {
            _data = value;
            Notify("Data");
        }
    }

    private bool _enableGrid;
    public bool EnableGrid
    {
        get { return _enableGrid; }
        set { _enableGrid = value; Notify("EnableGrid"); }
    }

    private void LoadData(int more)
    {
        Action act = () =>
        {
            EnableGrid = false;
            if (more > 0)
            {
                foreach (var item in DataGenerator.GetData(more))
                    _data.Add(item);
            }
            else
            {
                int itemsToRemove = -1 * more;
                // Always remove the last item; the count shrinks as we go.
                for (var i = 0; i < itemsToRemove; i++)
                    _data.RemoveAt(_data.Count - 1);
            }
            EnableGrid = true;
        };
        //act.BeginInvoke(null, null);
        _current.BeginInvoke(act, DispatcherPriority.ApplicationIdle);
    }
}

As you can see, the data is loaded whenever DataSize changes. Currently I use a slider to change the load size. This is all pretty easy; the fun starts in the XAML.


In order to feed this "Data" to my WPF DataGrid, I assign the viewmodel instance to the DataContext of my window. Shown below is the code-behind for the window.

public partial class MainWindow : Window
{
    private MainWindowViewModel vm;

    public MainWindow()
    {
        InitializeComponent();
        vm = new MainWindowViewModel();
        this.Loaded += (s, e) => DataContext = vm;
    }
}

Let's start with the following XAML.


<StackPanel>
    <Slider Minimum="50" Maximum="100" Value="{Binding DataSize}" />
    <Label Content="{Binding DataSize}" />
    <DataGrid IsEnabled="{Binding EnableGrid}" ItemsSource="{Binding Data}" />
</StackPanel>

Now build the application and run it. The result appears as shown below.


[image: the application showing 100 items, with no scrollbar]


As you can see above, I loaded 100 items, yet I do not see a scrollbar. Let's change the slider's Maximum property from 100 to 1000, rerun the application, and drag the slider straight to 1000. Even with 1000 items, the grid does not respond that well.


[image: the application after loading 1000 items]


Let us look at the memory usage.


[image: memory usage with the DataGrid in a StackPanel]


This is pretty heavy for an application with just 1000 data items loaded. So what is using all this memory? You could hook up a memory profiler or use WinDbg to inspect the memory contents, but since I already know what is causing the issue, I will skip that.


The issue is that the DataGrid has been placed inside a StackPanel. A vertically oriented StackPanel gives its children all the vertical space they ask for, so the DataGrid creates and renders all 1000 rows (every UI element for every cell of every row!). The DataGrid's UI virtualization never comes into play.


So let us make a simple change and put the DataGrid inside a Grid. The XAML is shown below.

<Grid>
    <Grid.RowDefinitions>
        <RowDefinition Height="30"/>
        <RowDefinition Height="30"/>
        <RowDefinition Height="*"/>
    </Grid.RowDefinitions>
    <Slider Value="{Binding DataSize}" Minimum="50" Maximum="1000"/>
    <Label Content="{Binding DataSize}" Grid.Row="1"/>
    <DataGrid ItemsSource="{Binding Data}" Grid.Row="2" IsEnabled="{Binding EnableGrid}"/>
</Grid>

When I run the application and load 1000 items, the performance of the same application (no code changes, apart from the XAML change I just described) is a lot better than before. Moreover, I now see nice scrollbars.


[image: the application with 1000 items, now with scrollbars]

Let us look at the memory usage.


[image: memory usage with the DataGrid in a Grid]


Wow, a tenfold difference! Up to this point, this may read like a rehash of my previous post on WPF virtualization; the same rules apply to the DataGrid as well. Read that post if you are interested.


So what else am I talking about here?



  • If you look at the ViewModel code, you will see that I disable the grid while loading data and enable it back once done. I have not really tested whether this technique helps here, but I did use it in HTML pages where loads of items in a listbox all had to be selected, and there it was very useful. (A related idea is to defer the grid's view refresh during bulk updates; see the sketch after this list.)
  • In all the screenshots I showed, the grid is sorted. So as the data changes, the grid has to keep re-sorting and re-displaying it based on the chosen sort. This, I believe, is a big overhead. Consider removing the DataGrid's sort before you change the data, if that is a viable option and does not impact the end user. I have not tested this, but the same should apply to groupings as well (which most of the time cannot simply be removed).
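On the first bullet: an alternative (untested here) way to suspend per-item work during a bulk update is to defer the grid's view refresh, which also postpones the re-sorting mentioned in the second bullet. A minimal sketch, assuming it runs in the window's code-behind where the viewmodel vm is in scope and the DataGrid has x:Name="dg":

// DeferRefresh (from CollectionView/ItemCollection) batches view
// updates: the grid re-sorts and re-renders once, on dispose.
using (dg.Items.DeferRefresh())
{
    foreach (var item in DataGenerator.GetData(1000))
        vm.Data.Add(item);
}
// The deferred refresh (including any sort) happens here.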

Simply by hosting the DataGrid in a panel like a Grid instead of a StackPanel, you see a huge difference. The WPF DataGrid performs just fine as long as you keep its viewable region small.


Shown below is my grid with almost 1 million data items loaded. The memory footprint is pretty small compared to the amount of data loaded. This means one of two things: either WPF controls are memory intensive, or WPF UI virtualization is a boon.


Impact of sorting on the DataGrid



  • With no sorting applied on the DataGrid, it took almost 20 seconds to load 1 million items into my collection.
  • With sorting enabled, loading half those items itself took over 2 minutes, the complete set took over 5 minutes, and I killed the application because it was a pain. This matters because the CPU stays busy with all the sorting that has to happen as the data changes: since I add items directly into an ObservableCollection, every single addition can trigger a re-sort.
  • Instead, consider sorting the data on the backend rather than in the DataGrid (a sketch follows this list).
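A minimal sketch of sorting on the backend, reusing DataGenerator and the viewmodel from above (the sort key is arbitrary, purely for illustration):

// Sort once up front, then hand the grid a pre-sorted collection so
// it does not re-sort on every insertion.
var sorted = DataGenerator.GetData(1000000)
    .OrderBy(d => d.LastName)
    .ToList();
vm.Data = new ObservableCollection<DataItem>(sorted);

Assigning a new collection raises a single property change, so the grid re-binds once instead of reacting to a million insertions.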

[image: the grid with almost 1 million items loaded]


Even with the grid bound to 1 million items, I can still scroll it, because virtualization is being properly utilized.


Using BeginInit() and EndInit() on the DataGrid


I changed the ViewModel's LoadData() to raise an event as it starts loading the data and another when it is done, and hooked those up to the DataGrid's BeginInit() and EndInit(). This has helped quite a lot: loading 1 million items (without any sort applied on the grid) took only around 8 seconds, compared to the almost 20 seconds it took earlier. Unfortunately, I did not spend enough time with a profiler to show real numbers.


The changed code-behind for the Window is shown below.

public partial class MainWindow : Window
{
    private MainWindowViewModel vm;

    public MainWindow()
    {
        InitializeComponent();
        vm = new MainWindowViewModel();
        this.Loaded += (s, e) => DataContext = vm;
        // "dg" is the x:Name of the DataGrid in the XAML.
        vm.DataChangeStarted += () => dg.BeginInit();
        vm.DataChangeCompleted += () => dg.EndInit();
    }
}

I also had to add the DataChangeStarted and DataChangeCompleted events to the ViewModel class. The changed portion of the ViewModel is shown below.

public event Action DataChangeStarted;
public event Action DataChangeCompleted;

private void LoadData(int more)
{
    Action act = () =>
    {
        // Before the data starts changing, raise the event.
        if (DataChangeStarted != null) DataChangeStarted();
        var sw = Stopwatch.StartNew();
        EnableGrid = false;
        if (more > 0)
        {
            foreach (var item in DataGenerator.GetData(more))
                _data.Add(item);
        }
        else
        {
            int itemsToRemove = -1 * more;
            for (var i = 0; i < itemsToRemove; i++)
                _data.RemoveAt(_data.Count - 1);
        }
        EnableGrid = true;
        sw.Stop();
        Debug.WriteLine(sw.ElapsedMilliseconds);
        if (DataChangeCompleted != null) DataChangeCompleted();
    };
    //act.BeginInvoke(null, null);
    _current.BeginInvoke(act, DispatcherPriority.ApplicationIdle);
}

You can try this out and notice the performance difference yourself.


If sorting is applied on the DataGrid, performance still hurts in spite of the trick mentioned above. The overhead of sorting outweighs the gain from calling BeginInit and EndInit. Maybe having 1 million records is just not realistic.

Thursday, July 15, 2010

Using LINQ Aggregate to solve the previous problem

In the previous post I talked about a problem, which I briefly restate here. The data looks like this:


Name, Value

Sridhar, 1
Ashish, 2
Prasanth, 3
Ashish, 5
Sridhar, 6
Prasanth, 34
.....

I want to aggregate the values for each name. See the previous post for other approaches to this simple problem.

The LINQ way to do this would be:


[Test]
public void BTest()
{
    var nvcs = tl.GroupBy(s => s.Name)
        .Select(s => new NameValueCollection
        {
            { "Name", s.Key },
            { "DrawerId", s.Aggregate(new StringBuilder(),
                (seed, g) => seed.AppendFormat("{0};", g.DrawerId)).ToString() }
        });
    //foreach (var nvc in nvcs)
    //    Console.WriteLine(nvc["Name"] + " : " + nvc["DrawerId"]);
    Assert.AreEqual(4, nvcs.Count());
}


Note that I am generating a list of NameValueCollection instances, but that is not significant here. If you compare this with the previous implementations that used a dictionary or lists, this solution is more concise, and anyone who already knows LINQ should find it really simple.


  • All I would like you to take away from this post is that IEnumerable.Aggregate() is a great method that is not mentioned around often. We frequently accumulate some value over a collection of items, and Aggregate() lets you do just that without the extra loops and seed variables you would otherwise have to track (see the sketch below).
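To make that concrete, here is a minimal, self-contained sketch (the names are mine, not from the post):

// Fold a seed (the StringBuilder) through the collection: each value
// is appended without an explicit loop or external accumulator.
var values = new List<int> { 1, 4, 8 };
string joined = values
    .Aggregate(new StringBuilder(), (seed, v) => seed.AppendFormat("{0};", v))
    .ToString();
Console.WriteLine(joined); // prints "1;4;8;"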

Algorithms, performance and getting burnt

After a long time, I am writing something on my blog. So here it is ..

This post is about me starting to solve a small but interesting problem with different approaches, and ending up breaking my head over why an algorithm with supposedly O(n) complexity was several times slower than the O(n^2) one.

So here's the issue. I have the following data :

Name, Value

Sridhar, 1
Ashish, 2
Prasanth, 3
Sridhar, 4
Ashish, 5
Sridhar, 8
and so on .. I hope you get the idea.

Now, what I would like to do is print the following output.


Sridhar : 1;4;8;...
Ashish : 2;5;...
Prasanth : 3;...



Note that the actual values do not matter here; I am giving this data just as an example. Shown below is the setup used by my implementations (I am demoing it as a test).



private Stopwatch sw;

[SetUp]
public void SetUp()
{
    GC.GetTotalMemory(true); // I dont know why i did this!
    tl = new List<Ud>(10000);
    var names = new[] { "Krishna", "Ashish", "Sridhar", "Prasanth" };
    foreach (var name in names)
        for (var i = 0; i < 2500; i++)
            tl.Add(new Ud { Name = name, DrawerId = i.ToString() });
    // OrderBy does not sort in place; assign the result back.
    tl = tl.OrderBy(s => s.DrawerId).ToList();
    sw = Stopwatch.StartNew();
}

[TearDown]
public void TearDown()
{
    sw.Stop();
    Console.WriteLine(sw.ElapsedMilliseconds);
    sw = null;
}

public class Ud
{
    public string Name { get; set; }
    public string DrawerId { get; set; }
}

private List<Ud> tl;

The above code is self-explanatory: I basically create a lot of Ud objects carrying data like what I presented earlier. Shown below is the most straightforward way to solve the problem. It has two nested loops, which makes the complexity O(n^2).



[Test]
public void BasicImplementation()
{
    var nvcs = new List<NameValueCollection>();
    var list = new List<string>();
    foreach (var item in tl)
    {
        if (list.Contains(item.Name)) continue;

        string val = string.Empty;

        foreach (var item2 in tl)
        {
            if (item2.Name == item.Name)
                val += item2.DrawerId + ";";
        }

        nvcs.Add(new NameValueCollection { { "Name", item.Name }, { "DrawerId", val } });
        list.Add(item.Name);
    }
    //foreach (var nvc in nvcs)
    //    Console.WriteLine(nvc["Name"] + " : " + nvc["DrawerId"]);
    Assert.AreEqual(4, nvcs.Count);
}

Next I added another implementation that produces the same result but uses a dictionary to track the string being built for each name in the list. Instinctively, it seems the dictionary method should be way faster than the one above. Let's look at that code.



[Test]
public void ADictionary()
{
    var vals = new Dictionary<string, string>();
    foreach (var item in tl)
    {
        if (!vals.ContainsKey(item.Name))
            vals[item.Name] = item.DrawerId + ";";
        else
            vals[item.Name] = vals[item.Name] + item.DrawerId + ";";
    }
    Assert.AreEqual(4, vals.Values.Count);
}

When I ran these two tests, I did not notice any performance gain from the O(n) implementation; in fact, it was three times slower. So why was it slower? Look at the setup: it has GC.GetTotalMemory(true), which forces a full garbage collection, and its cost was charged to the dictionary test as well, since by the second run (when the dictionary test was executing) there were a lot of strings to clean up. So why did I put it there in the first place? The answer is "I was not thinking straight". Never force garbage collection from your code; it is a bad, bad, bad practice.


So I removed the GC call and reran the tests. Yet I still did not see any performance gain. WHY?? I spent a lot of time trying to diagnose this and eventually gave up on manual inspection. I downloaded the trial version of dotTrace Performance 4.0 (which is a freaking awesome tool) and profiled both tests. The culprit was the strings. If you look at the code, we generate a lot of strings, and their "Concat" operations were so time-consuming that they dominated the gain from the O(n) algorithm.


So the lesson here is: be watchful of the strings generated as your code executes, or you will get burned. It does not matter how small a string concatenation may seem; in cases like the above it piles up and screws up your clever algorithm. All I did was change the tests to use StringBuilder instead of strings.
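Shown below is my reconstruction of that change to the dictionary test (a sketch, not the verbatim code from back then):

[Test]
public void ADictionaryWithStringBuilder()
{
    // One StringBuilder per name: appending is cheap compared to
    // rebuilding an ever-growing string on every iteration.
    var vals = new Dictionary<string, StringBuilder>();
    foreach (var item in tl)
    {
        StringBuilder sb;
        if (!vals.TryGetValue(item.Name, out sb))
        {
            sb = new StringBuilder();
            vals[item.Name] = sb;
        }
        sb.Append(item.DrawerId).Append(';');
    }
    Assert.AreEqual(4, vals.Values.Count);
}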



  • Do not use GC calls in your code, especially those that force a collection.
  • Use a profiler to accurately capture the performance of specific methods or of your program. Stopwatch, timers, and the like are not good enough and waste your time.
  • Be aware of the impact of string operations. Use StringBuilder wherever possible, and String.Format() in the simpler cases.

I will continue in the next post with some code that shows how to approach the problem I started with using LINQ, and how simple it all becomes.

Sunday, July 11, 2010

Issues with SyntaxHighlighter on my blog

I just messed up my blog template and could not get the SyntaxHighlighter plugin to work properly. I will fix this shortly; in the meantime, if the code looks really ugly, I apologize for the inconvenience.