[[meta title="Profiles for Mozilla Trender on i965"]]
[[tag exa performance xorg]]

I recently [[posted|mozilla_trender]] results showing EXA (and XAA)
performing quite badly on the Mozilla Trender benchmarks. As a
reminder, here is the chart showing the results on an i965 card:

[[mozilla_trender/i965.png]]

As a quick followup, here are the top functions when profiling the
entire Trender suite in the NoAccel, XAA, and EXA cases.
[[NoAccel|noaccel.oprofile]]:

    CPU: Core 2, speed 2133.49 MHz (estimated)
    Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00
    (Unhalted core cycles) count 100000
    samples  %        image name          app name            symbol name
    1940211  41.7382  libxul.so           libxul.so           (no symbols)
    955760   20.5605  libc-2.5.so         libc-2.5.so         (no symbols)
    115195    2.4781  libfb.so            libfb.so            fbSolidFillmmx
    108663    2.3376  libfb.so            libfb.so            fbCopyAreammx
    78728     1.6936  libpixman.so.0.0.0  libpixman.so.0.0.0  pixman_rasterize_edges
    76356     1.6426  libpixman.so.0.0.0  libpixman.so.0.0.0  pixmanCompositeRect
    63186     1.3593  vmlinux             vmlinux             get_page_from_freelist
    59977     1.2902  libpixman.so.0.0.0  libpixman.so.0.0.0  mmxCombineOverU
    51859     1.1156  vmlinux             vmlinux             __d_lookup
    49805     1.0714  libpixman.so.0.0.0  libpixman.so.0.0.0  pixman_image_composite
    46590     1.0023  libpixman.so.0.0.0  libpixman.so.0.0.0  mmxCombineMaskU
As a baseline, this NoAccel profile looks pretty good. Mozilla itself
is taking up a little over 40% of the time in its libxul code. I'm not
sure if the 20% in libc is on behalf of Mozilla or X. Meanwhile, we can
see X doing software rasterization and compositing with the pixman
code, but no single function is chewing up any large proportion of the
time.
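To see how the time splits across images instead of individual functions, the rows of the NoAccel listing above can be summed per image. This is only a minimal sketch: it assumes the five-column layout shown in the listing, the helper name `samples_by_image` is made up for illustration, and it covers only the rows quoted in this post, not the full profile.

```python
from collections import defaultdict

# Rows copied from the NoAccel listing: samples, percent, image, app, symbol.
NOACCEL_ROWS = """\
1940211 41.7382 libxul.so libxul.so (no symbols)
955760 20.5605 libc-2.5.so libc-2.5.so (no symbols)
115195 2.4781 libfb.so libfb.so fbSolidFillmmx
108663 2.3376 libfb.so libfb.so fbCopyAreammx
78728 1.6936 libpixman.so.0.0.0 libpixman.so.0.0.0 pixman_rasterize_edges
76356 1.6426 libpixman.so.0.0.0 libpixman.so.0.0.0 pixmanCompositeRect
63186 1.3593 vmlinux vmlinux get_page_from_freelist
59977 1.2902 libpixman.so.0.0.0 libpixman.so.0.0.0 mmxCombineOverU
51859 1.1156 vmlinux vmlinux __d_lookup
49805 1.0714 libpixman.so.0.0.0 libpixman.so.0.0.0 pixman_image_composite
46590 1.0023 libpixman.so.0.0.0 libpixman.so.0.0.0 mmxCombineMaskU
"""


def samples_by_image(report):
    """Sum the sample counts of the listed rows per image (hypothetical helper)."""
    totals = defaultdict(int)
    for line in report.splitlines():
        # Split into at most five fields so "(no symbols)" stays in one piece.
        samples, _percent, image, _app, _symbol = line.split(None, 4)
        totals[image] += int(samples)
    return dict(totals)


totals = samples_by_image(NOACCEL_ROWS)
# Print the images from most to fewest samples.
for image, samples in sorted(totals.items(), key=lambda item: -item[1]):
    print(f"{samples:>8}  {image}")
```

For the rows listed, the five pixman entries add up to 311456 samples (about 6.7% of the run) and the two libfb entries to 223858 (about 4.8%), both still small next to libxul.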
XAA:

    CPU: Core 2, speed 2133.49 MHz (estimated)
    Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00
    (Unhalted core cycles) count 100000
    samples  %        image name          app name            symbol name
    1895990  32.7139  libxul.so           libxul.so           (no symbols)
    1065154  18.3785  libc-2.5.so         libc-2.5.so         (no symbols)
    790802   13.6447  libpixman.so.0.0.0  libpixman.so.0.0.0  mmxCombineOverU
    202183    3.4885  libpixman.so.0.0.0  libpixman.so.0.0.0  fbCompositeSolidMask_nx8888x8888Cmmx
    112017    1.9328  libpixman.so.0.0.0  libpixman.so.0.0.0  fbCompositeSrc_8888x8888mmx
    94824     1.6361  libpixman.so.0.0.0  libpixman.so.0.0.0  pixmanCompositeRect
    84551     1.4589  libpixman.so.0.0.0  libpixman.so.0.0.0  fbCompositeSolidMask_nx8x8888mmx
    76908     1.3270  libpixman.so.0.0.0  libpixman.so.0.0.0  pixman_rasterize_edges
    57645     0.9946  vmlinux             vmlinux             system_call
    52950     0.9136  libpixman.so.0.0.0  libpixman.so.0.0.0  mmxCombineMaskU
    52265     0.9018  intel_drv.so        intel_drv.so        I830WaitLpRing
    51640     0.8910  vmlinux             vmlinux             __d_lookup
    48207     0.8318  libpixman.so.0.0.0  libpixman.so.0.0.0  pixman_image_composite_rect
Now, this XAA profile is certainly strange. Why has `mmxCombineOverU`
jumped up from 1% to 13%? Why should there be any more compositing
happening here? Is this pixel format conversion we're seeing for some
reason?
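The size of that jump can be checked directly from the two listings above. A small sketch, using only the `mmxCombineOverU` rows quoted in this post; it prints both the growth in raw samples and the growth in share, since each percentage is relative to its own run's total.

```python
# (samples, percent) for mmxCombineOverU, copied from the two listings.
noaccel = (59977, 1.2902)
xaa = (790802, 13.6447)

# Growth in raw sample count versus growth in share of the run.
sample_ratio = xaa[0] / noaccel[0]
percent_ratio = xaa[1] / noaccel[1]

print(f"samples grew {sample_ratio:.1f}x, share of the run grew {percent_ratio:.1f}x")
```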
EXA:

    CPU: Core 2, speed 2133.49 MHz (estimated)
    Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00
    (Unhalted core cycles) count 100000
    samples  %        image name          app name            symbol name
    2465024  27.6332  intel_drv.so        intel_drv.so        i965_prepare_composite
    1951957  21.8817  libxul.so           libxul.so           (no symbols)
    1470150  16.4806  libc-2.5.so         libc-2.5.so         (no symbols)
    382399    4.2867  libexa.so           libexa.so           ExaOffscreenMarkUsed
    375330    4.2075  intel_drv.so        intel_drv.so        I830WaitLpRing
    307074    3.4423  vmlinux             vmlinux             system_call
    104493    1.1714  vmlinux             vmlinux             do_gettimeofday
    97582     1.0939  intel_drv.so        intel_drv.so        i965_composite
    79050     0.8862  libpixman.so.0.0.0  libpixman.so.0.0.0  pixman_rasterize_edges
    53810     0.6032  vmlinux             vmlinux             __copy_to_user_ll
    51434     0.5766  vmlinux             vmlinux             __d_lookup
And here with EXA we see some good, and some really bad. The good news
is that the pixman functions doing software compositing have
disappeared from the top of the profile, leaving only software
rasterization. But what's with this new `i965_prepare_composite`
function that's taking even more time than all of libxul? That seems
like rather excessive overhead.
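One way to put that overhead in perspective is to back out each run's total sample count from its libxul row and compare the runs in absolute terms. This is a rough sketch, not a rigorous measurement: it assumes each percentage is that row's share of all samples in its run, and that sample counts are comparable across runs (the three headers show the same event and count). The helper name `estimated_total` is made up for illustration.

```python
# libxul.so rows from the three listings: (samples, percent of all samples).
LIBXUL = {
    "NoAccel": (1940211, 41.7382),
    "XAA": (1895990, 32.7139),
    "EXA": (1951957, 21.8817),
}


def estimated_total(samples, percent):
    """Back out a run's total sample count from one row (hypothetical helper)."""
    return samples / (percent / 100.0)


totals = {name: estimated_total(*row) for name, row in LIBXUL.items()}
baseline = totals["NoAccel"]

for name, total in totals.items():
    print(f"{name:8} ~{total:,.0f} samples, {total / baseline:.2f}x NoAccel")

# i965_prepare_composite in the EXA run, measured against the whole NoAccel run.
prepare_composite = 2465024
print(f"i965_prepare_composite alone: {prepare_composite / baseline:.0%} of the NoAccel total")
```

On these figures libxul's own sample count barely moves between runs (roughly 1.9 to 2.0 million), so the drop in its percentage looks like the result of a longer run, not of less work in Mozilla.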
A quick glimpse at the
[function](http://cgit.freedesktop.org/xorg/driver/xf86-video-intel.git/tree/src/i965_render.c)
(starting at line 395 or so) shows that it's just a sequence of
assignment statements, and then a "long sequence of commands needed to
set up the 3D rendering pipe". Is any of that setup redundant, and
could it be easily eliminated?
I noticed two calls to `i830WaitSync` which seemed to have "slowdown"
written all over them. But I ran with these two calls removed and
didn't notice any change in the performance (and if they were causing
a problem, shouldn't they have shown up separately in the profile