Intel GPU Support

Message boards : News : Intel GPU Support

Thomas Koch
Project administrator
Project developer
Project scientist

Joined: 17 Feb 12
Posts: 436
Credit: 37,847
RAC: 0
Message 10517 - Posted: 16 Jun 2015, 12:03:46 UTC

We are currently running a version of POEM++ for Intel GPUs on our Test Server. We only have a very limited number of machines for testing this, so it would be great to find some volunteers with Intel GPU hardware to connect to the Test Project.

To do so, use the “Add project” wizard of your BOINC client and paste the Project URL “http://int-boinctest.int.kit.edu/poem/” into the corresponding text field, as our test server is not listed as an official BOINC project. We appreciate any help with testing, but please keep the following in mind:
- Test server results will not be used for scientific publications.
- Credits will not be listed in the BOINC cross-project statistics.
- Test server applications may crash your system, so be sure to save any valuable data beforehand.
Best practice is to return a few tasks and then immediately detach from the POEM@TEST project.


Thanks in advance,
Thomas
ID: 10517
ExtraTerrestrial Apes
Volunteer moderator
Joined: 18 Jan 09
Posts: 454
Credit: 581,676,488
RAC: 0
Message 10519 - Posted: 16 Jun 2015, 19:43:35 UTC - in response to Message 10517.  

Thanks for the effort, Thomas! One more remark: I got plenty of tasks for my nVidia while testing the Intel GPU. You may want to deactivate the distribution of AMD & nVidia tasks while the Intel test is ongoing. Or, if you need those for verification, you might want to limit their number.

@Everyone else: here's my initial feedback.

MrS
Scanning for our furry friends since Jan 2002
ID: 10519
Vladimir Tankovich
Volunteer developer
Volunteer tester
Joined: 12 Nov 10
Posts: 182
Credit: 429,219,133
RAC: 0
Message 10522 - Posted: 16 Jun 2015, 19:49:49 UTC - in response to Message 10519.  

Thanks, I've got a laptop with an appropriate Intel processor. I'll see if I get the same 1-2 s delays.

Maybe it would be a good time to implement some command-line parameters to allow users to trade smoothness for throughput as they wish.
ID: 10522
ExtraTerrestrial Apes
Volunteer moderator
Joined: 18 Jan 09
Posts: 454
Credit: 581,676,488
RAC: 0
Message 10523 - Posted: 16 Jun 2015, 20:14:42 UTC - in response to Message 10522.  
Last modified: 16 Jun 2015, 20:17:57 UTC

Maybe it would be a good time to implement some command-line parameters to allow users to trade smoothness for throughput as they wish.

In an ideal world this wouldn't be needed... but things don't always work out like that. At Collatz, which my Intel GPU ran before Einstein was available, those options helped a lot. I got higher throughput and lower CPU usage by tweaking things.

I noticed my screen became very sluggish running the app on my nVidia. I actually can't comment on the Intel in this regard, as I never use the dummy VGA output connected to it.

The usual way we deal with lower-than-desired GPU load is to run multiple WUs concurrently. It works at Einstein and Collatz with Intel GPUs. I haven't tried it here, but given the low CPU load of the app it might also be a viable option. Whoever uses command-line parameters could also use an app_config.

Edit: one more thought. Vladimir, how much communication is there between CPU and GPU? If you're building an executable specifically for Intel GPUs, you can safely assume they're all APUs. Using shared memory might avoid some of the memory copies that occur during traditional GPU usage. Given the low CPU load, this may not be a factor here at all, though.

MrS
Scanning for our furry friends since Jan 2002
ID: 10523
ReaDy

Joined: 19 Aug 12
Posts: 1
Credit: 5,470,403
RAC: 0
Message 10525 - Posted: 16 Jun 2015, 20:57:13 UTC - in response to Message 10519.  

You can stop POEM@TEST from sending tasks to AMD and NVIDIA GPUs by changing the cc_config.xml settings file. Example:
<?xml version="1.0" encoding="UTF-8" ?>
<cc_config>
  <options>
    <exclude_gpu>
      <url>http://int-boinctest.int.kit.edu/poem/</url>
      <type>NVIDIA</type>
    </exclude_gpu>
    <exclude_gpu>
      <url>http://int-boinctest.int.kit.edu/poem/</url>
      <type>ATI</type>
    </exclude_gpu>
  </options>
</cc_config>

After that, restart BOINC.
(Apologies for the machine translation.)
ID: 10525
Vladimir Tankovich
Volunteer developer
Volunteer tester
Joined: 12 Nov 10
Posts: 182
Credit: 429,219,133
RAC: 0
Message 10526 - Posted: 16 Jun 2015, 23:17:52 UTC - in response to Message 10523.  
Last modified: 16 Jun 2015, 23:18:51 UTC

There is very little memory traffic (10% of bandwidth on a 970X with 2 WUs) or CPU-GPU communication (2% for a 290X on 16x PCIe) for these tasks.

Unfortunately, more WUs won't help once the batch size is reduced, because a single thread block takes a long time to execute, so beyond some threshold the decrease in kernel runtime will be minimal. Since there is also significant variability in individual block runtimes, ideally I need to provide 3-5X more blocks than fit into all CUs of the GPU, to reduce the waste of waiting for the tail to finish processing. To fix this I would need to rewrite the algorithm somewhat... which will take significant time. When I have it, I'd rather spend it on fixing bugs.
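The tail effect described above can be shown with a toy simulation (illustrative only, not the actual POEM++ scheduler; block counts and runtimes are made up). When there is exactly one block per compute unit, every CU that finishes early sits idle until the slowest block completes; with 3-5x more blocks than CUs, stragglers overlap with remaining work and the idle fraction shrinks:

```python
import random

def idle_fraction(num_blocks, cus, mean=1.0, jitter=0.5, rounds=1000):
    """Average fraction of CU time wasted waiting for a batch to finish.

    Blocks with uniformly jittered runtimes are greedily assigned to the
    first CU that frees up; the batch ends when the slowest CU finishes.
    """
    idle = 0.0
    for _ in range(rounds):
        times = [random.uniform(mean - jitter, mean + jitter)
                 for _ in range(num_blocks)]
        finish = [0.0] * cus
        for t in times:
            # Give the next block to the CU that becomes free first.
            i = finish.index(min(finish))
            finish[i] += t
        makespan = max(finish)
        idle += 1.0 - sum(times) / (makespan * cus)
    return idle / rounds

random.seed(42)
cus = 16
tail_1x = idle_fraction(cus, cus)       # one block per CU: wait for slowest
tail_4x = idle_fraction(4 * cus, cus)   # 4x oversubscription
print(f"idle fraction at 1x blocks: {tail_1x:.2f}")
print(f"idle fraction at 4x blocks: {tail_4x:.2f}")
```

With these (arbitrary) parameters the 1x case wastes roughly a third of the GPU on the tail, while 4x oversubscription cuts the waste to a few percent, which matches the 3-5X figure above.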

You can always uncheck GPUs in the browser UI of your test server account.
ID: 10526
Jacob Klein

Joined: 9 Oct 09
Posts: 101
Credit: 74,313,166
RAC: 0
Message 10527 - Posted: 17 Jun 2015, 1:55:07 UTC

I know you recommend detaching when not testing, but I've found it best/easier, from a user perspective, to just stay attached.

I'd recommend that, if you want to test something specific, set things up on the server so the specific thing gets tested.

If you don't waste my time, I won't waste yours. Please don't send test tasks if you aren't trying to test something. For your reference, I got NVIDIA tasks today from your Test project. I hope we're testing something, rather than wasting time.

Thanks,
Jacob
ID: 10527
Vladimir Tankovich
Volunteer developer
Volunteer tester
Joined: 12 Nov 10
Posts: 182
Credit: 429,219,133
RAC: 0
Message 10529 - Posted: 17 Jun 2015, 6:07:45 UTC - in response to Message 10527.  
Last modified: 17 Jun 2015, 6:16:15 UTC

When I validated this change in the debugger, I got about 95% efficiency on the GCN architecture.

I tried a single WU and a pair of WUs at once on my Broadwell laptop. Neither case impacted video performance in any way - it was very smooth, well above the 12 fps it would have dropped to if the OpenCL calls were blocking. Apparently the Intel driver is better than the NVIDIA/AMD ones in this regard.
Single-WU performance seems to be severely hampered by the low CPU clock - 2 WUs or an aggressive power profile improve performance. But the laptop also appears to be hitting its thermal ceiling and downclocking the GPU.

Anyway, the first 20% of 2 WUs ran at about 80% of the efficiency of my 290X, while a whole single WU ran at around 65%. The next 20% of the 2 WUs slowed down to about the same 65% efficiency, which I think is just the GPU downclocking (I don't have a tool installed to confirm at this time).
I also started and stopped 2 WUs - the first 5% ran at about 90% efficiency, the next 5% slowed down to 65%.

Conclusion: probably not a compute task for a laptop... but the implementation has not lost much efficiency in the port, as far as I can test on this laptop. It also seems to run without errors, at least for some time, with 2 WUs.
ID: 10529
Thomas Koch
Project administrator
Project developer
Project scientist

Joined: 17 Feb 12
Posts: 436
Credit: 37,847
RAC: 0
Message 10530 - Posted: 17 Jun 2015, 13:43:57 UTC

I'd recommend that, if you want to test something specific, set things up on the server so the specific thing gets tested.

That was my intention when deprecating all NVIDIA and AMD application versions on the Test Server. Unfortunately, it looks like the deprecated versions are still being used by hosts that have already downloaded the binary (because there is no new one, I guess).
As GPU jobs for all card vendors run within the same BOINC project, I can't see another way to disable task distribution to AMD and NVIDIA cards without rewriting our sorter logic.

Maybe it would be a good time to implement some command-line parameters to allow users to trade smoothness for throughput as they wish.

Although the dynamic batch size chooser does a good job in most cases, I can additionally implement this feature for the next release.
ID: 10530
ExtraTerrestrial Apes
Volunteer moderator
Joined: 18 Jan 09
Posts: 454
Credit: 581,676,488
RAC: 0
Message 10532 - Posted: 17 Jun 2015, 18:42:06 UTC - in response to Message 10530.  

Unfortunately, it looks like the deprecated versions are still used by hosts which have already downloaded the binary (because there is no new one I guess).

I don't think so. I attached freshly to POEM@Test and the client went straight to downloading the nVidia and Intel executables. To me it seems that whatever you did to remove the AMD & nVidia apps didn't work.

You can always uncheck GPUs in browser UI of your test server account.

There is no separate setting for the different types of GPUs. So far ReaDy's suggestion seems to be the easiest one, but it still requires manual configuration.

As GPU jobs for all card vendors are running within the same BOINC project, I can't imagine another way to disable task distribution to AMD and NVIDIA cards without rewriting our sorter logic.

Isn't the plan class meant for this?

Although the dynamic batch size chooser is doing a good job for most cases, I can additionally implement this feature for the next release.

Or might it be that the batch size chooser just needs some tweaking for Intel GPUs? Vladimir said he's aiming for 3-5x more GPU threads than execution units. How do you determine the number of execution units of the Intel GPUs? Normally they're counted as "EUs", but those actually contain quite a few shaders, similar to the "CUs" of GCN.

MrS
Scanning for our furry friends since Jan 2002
ID: 10532
Vladimir Tankovich
Volunteer developer
Volunteer tester
Joined: 12 Nov 10
Posts: 182
Credit: 429,219,133
RAC: 0
Message 10533 - Posted: 17 Jun 2015, 20:46:40 UTC - in response to Message 10532.  
Last modified: 17 Jun 2015, 20:47:07 UTC

It's 3-5X of the amount that can fit on all CUs at the same time. More than that will still be better.

Unfortunately, making the machine responsive with the current codebase has a noticeable performance hit (20-40%), so I don't think an automatic way of choosing the batch size can solve this.

Also, this does not apply to the Intel GPU - it remained very responsive no matter the batch size, so the highest supported one (bounded by the memory allocation limit) can be used.
ID: 10533
Tom_unoduetre

Joined: 3 Aug 11
Posts: 6
Credit: 38,165
RAC: 0
Message 10535 - Posted: 18 Jun 2015, 10:03:51 UTC

My result:

host 628:
CPU: Intel i5-3470 CPU @ 3.20GHz
GPU: HD Graphics 2500
OS: Win 7 Enterprise 64
RAM: 8 GB
Driver: 9.18.10.3165
Boinc 7.4.42

WU:
http://int-boinctest.int.kit.edu/poem/result.php?resultid=195476

It finished successfully. However, it took more than double the time to complete on the GPU than on a CPU (3,300 seconds average on the CPU), so there's definitely room for improvement :-)
ID: 10535
Vladimir Tankovich
Volunteer developer
Volunteer tester
Joined: 12 Nov 10
Posts: 182
Credit: 429,219,133
RAC: 0
Message 10536 - Posted: 18 Jun 2015, 16:33:13 UTC - in response to Message 10535.  

The GPU WUs do way more compute than the CPU WUs.
ID: 10536
ExtraTerrestrial Apes
Volunteer moderator
Joined: 18 Jan 09
Posts: 454
Credit: 581,676,488
RAC: 0
Message 10537 - Posted: 18 Jun 2015, 18:29:39 UTC

Tom's HD2500 crunched the WU in 7170 s, whereas my HD4000 took 13130 s! Mine has 16 EUs, whereas his has just 6. The clock speed and main memory speed can't be much higher on his system either. What's going on? He's running Win 7 (which should make no difference) and an older GPU driver.

I suspect he may not have seen the frequent pauses which I saw.
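As a back-of-the-envelope check of how odd this is (runtimes from the two posts above, assuming the WUs are comparable; the 6 and 16 EU counts are the commonly published figures for these parts):

```python
# Reported runtimes, in seconds, and EU counts for the two iGPUs.
hd2500 = {"time_s": 7170, "eus": 6}    # Tom's HD Graphics 2500
hd4000 = {"time_s": 13130, "eus": 16}  # MrS's HD Graphics 4000

# Relative work per EU-second: 1 / (runtime * EUs), higher is better.
per_eu_2500 = 1.0 / (hd2500["time_s"] * hd2500["eus"])
per_eu_4000 = 1.0 / (hd4000["time_s"] * hd4000["eus"])
ratio = per_eu_2500 / per_eu_4000
print(f"per-EU throughput, HD2500 vs HD4000: {ratio:.1f}x")
```

The HD2500 did almost 5x more work per EU-second, which points at stalls (the pauses mentioned above) on the HD4000 rather than raw shader count.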

MrS
Scanning for our furry friends since Jan 2002
ID: 10537
Vladimir Tankovich
Volunteer developer
Volunteer tester
Joined: 12 Nov 10
Posts: 182
Credit: 429,219,133
RAC: 0
Message 10538 - Posted: 18 Jun 2015, 18:41:38 UTC - in response to Message 10536.  
Last modified: 18 Jun 2015, 18:58:06 UTC

Running the numbers: your GPU has a max of 110 GFLOPS, and the WU achieved about 26 GFLOPS on average over those 2 hours. Normally, on GCN and Maxwell, the app achieves 50% of the rated GFLOPS. It was running at around the same efficiency on my laptop before it downclocked itself due to heat. But your CPU is not a mobile part, so it should always run at max clock speed.
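In other words (plugging in the figures from this post; all numbers are Vladimir's estimates above, not measurements of my own):

```python
peak_gflops = 110.0      # rated peak of the HD 2500, per the estimate above
achieved_gflops = 26.0   # average achieved by the WU over ~2 hours
typical_fraction = 0.50  # what the app reaches on GCN / Maxwell

efficiency = achieved_gflops / peak_gflops
print(f"achieved {efficiency:.0%} of peak, vs ~{typical_fraction:.0%} typical")
# -> about 24% of peak, roughly half of what the app normally reaches.
```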

Was it busy with anything else?
Can you try running 2 WUs at once and if CPU was busy, make sure 1 core is 100% available for these tasks?

Thanks!

Here are results from my laptop, with a 340 GFLOPS GPU (almost the same as the 4000):
http://int-boinctest.int.kit.edu/poem/results.php?userid=268
Judging by the timers, a single iteration on your 4000 is as fast as on my laptop (batch size 32, max time about 83 ms).
ID: 10538
ExtraTerrestrial Apes
Volunteer moderator
Joined: 18 Jan 09
Posts: 454
Credit: 581,676,488
RAC: 0
Message 10539 - Posted: 18 Jun 2015, 21:09:40 UTC - in response to Message 10538.  

Yes, normally my iGPU constantly runs at maximum clock speed (1.3 GHz, a small OC from the stock 1.15 GHz). It's not doing anything other than number crunching. It does display a static desktop, but that's needed for OpenCL anyway (otherwise Intel deactivates it).

The CPU was quite busy at 90% overall load, but at any time 2 logical cores were not fully loaded. I started a new WU and reduced the CPU load until 1 physical core was completely free, but the pauses remained. I can't say whether they were any shorter, though, as they're irregular anyway.

But what cured the issue for now is running 2 WUs in parallel. Luckily the app_config for this is quite simple:
<app_config>
  <app>
    <name>poemcl</name>
    <gpu_versions>
      <gpu_usage>0.5</gpu_usage>
      <cpu_usage>0.1</cpu_usage>
    </gpu_versions>
  </app>
</app_config>


After 6 minutes I've got a 1300.4 MHz average GPU clock (so no more downclocking) and 98.0% average GPU load, which is excellent. Curiously, one WU seems to be progressing much faster than the other. I can't say much about the final runtimes yet, but can you say what the expected task variability for this test is?

MrS
Scanning for our furry friends since Jan 2002
ID: 10539
Vladimir Tankovich
Volunteer developer
Volunteer tester
Joined: 12 Nov 10
Posts: 182
Credit: 429,219,133
RAC: 0
Message 10542 - Posted: 19 Jun 2015, 5:18:19 UTC - in response to Message 10539.  
Last modified: 19 Jun 2015, 5:33:00 UTC

There are 2 types of WUs on the test server: 2k39 and the rest. The 2k39 ones involve about 6X more compute.

Can you try getting a non-2k39 WU and running it with 1 core free, so that CPU load is not an issue?
ID: 10542
Tom_unoduetre

Joined: 3 Aug 11
Posts: 6
Credit: 38,165
RAC: 0
Message 10543 - Posted: 19 Jun 2015, 7:08:33 UTC
Last modified: 19 Jun 2015, 7:13:05 UTC

Now I'm crunching a 2k39 WU, and I'm at 6,000 seconds and 13%, so I guess I will end up at something like 46k seconds, which would be more in line with the results of an HD4000.
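(The 46k figure is just a linear extrapolation from the progress bar, assuming the task progresses at a constant rate:)

```python
elapsed_s = 6000   # seconds crunched so far
progress = 0.13    # fraction complete reported by BOINC

# Estimated total runtime if progress stays linear.
est_total = elapsed_s / progress
print(f"estimated total runtime: {est_total:.0f} s")
# -> about 46154 s, i.e. the ~46k seconds estimated above.
```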

So the GPU WUs contain more information, and thus run longer than they do on a CPU?

Edit: Should we move the technical discussion to the Number Crunching forum?
ID: 10543
ExtraTerrestrial Apes
Volunteer moderator
Joined: 18 Jan 09
Posts: 454
Credit: 581,676,488
RAC: 0
Message 10544 - Posted: 19 Jun 2015, 8:17:29 UTC

The 2k39 WU started first, followed by a 2f21 shortly afterwards. When the 2f21 finished after 4060 s, the 2k39 was at ~30%. From then on the 2k39 ran together with an Einstein WU. Good news: no problems here. It took 16240 s to finish. Around that time I got an Einstein task which took 57 ks instead of the usual 55 ks, with no other visible outliers, so I suspect the iGPU shared the work about 50:50 between both projects (as the Einsteins normally do).

From the log of the 2f21:
Batch size 4 runtime: 0.0144609
Batch size 8 runtime: 0.0277798
Batch size 16 runtime: 0.0512277
Pff03_OpenCl choosing batch size 16 based on timings

From the log of the 2k39:
Batch size 4 runtime: 0.0666451
Batch size 8 runtime: 0.101164
Pff03_OpenCl choosing batch size 8 based on timings


MrS
Scanning for our furry friends since Jan 2002
ID: 10544
Vladimir Tankovich
Volunteer developer
Volunteer tester
Joined: 12 Nov 10
Posts: 182
Credit: 429,219,133
RAC: 0
Message 10545 - Posted: 19 Jun 2015, 16:35:39 UTC - in response to Message 10544.  

If I understand all the numbers in the previous post correctly, your GPU is about 20% slower than mine. That can easily be attributed to architectural changes, CPU load and such. So, not a glaring issue.
ID: 10545


Copyright © 2017 KIT-INT