[Cython] prange CEP updated

mark florisson markflorisson88 at gmail.com
Wed Apr 13 23:13:56 CEST 2011


On 13 April 2011 22:53, mark florisson <markflorisson88 at gmail.com> wrote:
> On 13 April 2011 21:57, Dag Sverre Seljebotn <d.s.seljebotn at astro.uio.no> wrote:
>> On 04/13/2011 09:31 PM, mark florisson wrote:
>>>
>>> On 5 April 2011 22:29, Dag Sverre Seljebotn<d.s.seljebotn at astro.uio.no>
>>>  wrote:
>>>>
>>>> I've done a pretty major revision to the prange CEP, bringing in a lot of
>>>> the feedback.
>>>>
>>>> Thread-private variables are now split in two cases:
>>>>
>>>>  i) The safe cases, which really require very little technical
>>>> knowledge -> automatically inferred
>>>>
>>>>  ii) As an advanced feature, unsafe cases that require some knowledge of
>>>> threading -> must be explicitly declared
>>>>
>>>> I think this split simplifies things a great deal.
>>>>
>>>> I'm rather excited over this now; this could turn out to be a really
>>>> user-friendly and safe feature that would not only allow us to support
>>>> OpenMP-like threading, but also be more convenient to use in a range of
>>>> common cases.
>>>>
>>>> http://wiki.cython.org/enhancements/prange
>>>>
>>>> Dag Sverre
>>>>
>>>
>>> If we want to support cython.parallel.threadsavailable outside of
>>> parallel regions (which does not depend on the schedule used for
>>> worksharing constructs!), then we have to disable dynamic adjustment
>>> of the number of threads. For instance, if OpenMP sees that some
>>> OpenMP threads are already busy, then with dynamic adjustment enabled
>>> it decides at runtime how many threads to use for each parallel region.
>>> So basically, if you put omp_get_num_threads() in a parallel region,
>>> you have a race when you depend on that result in a subsequent
>>> parallel region, because the number of busy OpenMP threads may have
>>> changed.
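>>>
>>> A minimal standalone C illustration of that race (nothing Cython-specific,
>>> variable names are purely for illustration): with dynamic adjustment
>>> enabled, the team size of one parallel region need not match the next.
>>>
>>> #include <stdio.h>
>>> #include <omp.h>
>>>
>>> int main(void) {
>>>     int first = 0, second = 0;
>>>     omp_set_dynamic(1);                  /* let the runtime pick team sizes */
>>>
>>>     #pragma omp parallel
>>>     {
>>>         #pragma omp single
>>>         first = omp_get_num_threads();   /* team size of region 1 */
>>>     }
>>>
>>>     /* other threads in the process may become busy here */
>>>
>>>     #pragma omp parallel
>>>     {
>>>         #pragma omp single
>>>         second = omp_get_num_threads();  /* team size of region 2 */
>>>     }
>>>
>>>     /* first == second is not guaranteed, so sizing a per-thread buffer
>>>        with `first` and indexing it by thread id in the second region
>>>        would be unsafe */
>>>     printf("region 1: %d threads, region 2: %d threads\n", first, second);
>>>     return 0;
>>> }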
>>
>> Ah, I don't know why I thought there wouldn't be a race condition. I wonder
>> if the whole threadsavailable() idea should just be ditched and we should
>> think of something else. It's not a very common use case. Starting to
>> disable some forms of scheduling just to, essentially, shoehorn in one
>> particular syntax doesn't seem like the way to go.
>>
>> Perhaps this calls for support for the critical(?) block then, after all.
>> I'm at least +1 on dropping threadsavailable() and instead require that you
>> call numthreads() in a critical block:
>>
>> with parallel:
>>    with critical:
>>        # call numthreads() and allocate global buffer
>>        # calling threadid() not allowed, if we can manage that
>>    # get buffer slice for each thread
>
> In that case I think you'd want single + a barrier. 'critical' means
> that all threads execute the section, but exclusively. I think you
> usually want to allocate either a shared worksharing buffer or a
> private thread-local buffer. In the former case you can allocate your
> buffer outside any parallel section, in the latter case within the
> parallel section. In the latter case the buffer will just not be
> available outside of the parallel section.
>
> We can still support any write-back to shared variables that are
> explicitly declared later on (supposing we'd also support single and
> barriers). Then the code would read as follows:
>
> cdef shared(void *) buf
> cdef void *localbuf
>
> with nogil, parallel:
>    with single:
>        buf = malloc(n * numthreads())
>
>    barrier()
>
>    localbuf = buf + n * threadid()
>    <actual code here that uses localbuf (or buf if you don't assign to it)>
>
> # localbuf undefined here
> # buf is well-defined here
>
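> For reference, here's a rough sketch of how that pattern could map onto
> plain OpenMP in C (just my reading of the pseudocode above, not what
> Cython would actually emit):
>
> #include <stdlib.h>
> #include <string.h>
> #include <omp.h>
>
> void example(size_t n) {
>     char *buf = NULL;                   /* the shared buf above */
>
>     #pragma omp parallel shared(buf)
>     {
>         #pragma omp single
>         buf = malloc(n * omp_get_num_threads());
>         /* the single construct ends with an implicit barrier, which plays
>            the role of the explicit barrier() in the pseudocode */
>
>         char *localbuf = buf + n * omp_get_thread_num();
>         memset(localbuf, 0, n);         /* stand-in for the per-thread work */
>     }
>
>     free(buf);                          /* buf is well-defined here */
> }
>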
> However, I don't believe it's very common to want to use private
> buffers after the loop. If you have a buffer in terms of your loop
> size, you want it shared, but I can't imagine a case where you want to
> examine buffers that were allocated specifically for each thread after
> the parallel section. So I'm +1 on dropping threadsavailable outside
> parallel sections, but currently -1 on supporting this case, because
> we can solve it later on with support for explicitly declared
> variables + single + barriers.
>
>>> So basically, to make threadsavailable() work outside parallel
>>> regions, we'd have to disable dynamic thread adjustment
>>> (omp_set_dynamic(0)). Of course, when OpenMP cannot provide the
>>> number of threads desired (because it is bounded by a configurable
>>> thread limit, and by the OS of course), the behaviour is
>>> implementation defined. So then we could just put a warning in the
>>> docs for that, and users can check for this inside the parallel
>>> region using threadsavailable() if it's really important.
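>>>
>>> As a sketch (illustrative only, not what Cython would emit), pinning the
>>> team size would look something like this:
>>>
>>> #include <stdio.h>
>>> #include <omp.h>
>>>
>>> int main(void) {
>>>     omp_set_dynamic(0);        /* no dynamic adjustment of team sizes */
>>>     omp_set_num_threads(4);    /* request a fixed team size */
>>>     #pragma omp parallel
>>>     {
>>>         #pragma omp single
>>>         printf("team size: %d\n", omp_get_num_threads());
>>>     }
>>>     return 0;
>>> }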
>>
>> Do you have any experience with what actually happens with, say, GNU OpenMP?
>> I blindly assumed from the specs that it was an error condition ("flag an
>> error any way you like"), but I guess that may be wrong.
>>
>> Just curious; I think we can just fall back to the OpenMP behaviour, unless it
>> terminates the interpreter in an error condition, in which case we should
>> look into how expensive it is to check for the condition up front...
>
> With libgomp you just get the maximum number of available threads, up
> to the number requested. So this code
>
> #include <stdio.h>
> #include <omp.h>
>
> int main(void) {
>     printf("The thread limit is: %d\n", omp_get_thread_limit());
>     #pragma omp parallel num_threads(4)
>     {
>         #pragma omp single
>         printf("We have %d threads in the thread team\n",
>                omp_get_num_threads());
>     }
>     return 0;
> }
>
> requests 4 threads, but with OMP_THREAD_LIMIT=2 it gets only 2:
>
> [0] [22:28] ~/code/openmp  ➤ OMP_THREAD_LIMIT=2 ./testomp
> The thread limit is: 2
> We have 2 threads in the thread team
>
>>
>> Dag Sverre
>>
>

Although there is omp_get_max_threads():

"The omp_get_max_threads routine returns an upper bound on the number
of threads that could be used to form a new team if a parallel region
without a num_threads clause were encountered after execution returns
from this routine."

So we could have threadsavailable() evaluate to that if encountered
outside a parallel region. Inside, it would evaluate to
omp_get_num_threads(). At worst, people would over-allocate a bit.
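
Expressed as C, that mapping would roughly be the following helper in the
generated code (the helper name is just illustrative, nothing is decided):

#include <omp.h>

static int __pyx_threadsavailable(void) {
    if (omp_in_parallel())
        return omp_get_num_threads();  /* actual team size inside a region */
    else
        return omp_get_max_threads();  /* upper bound outside a region */
}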

