Intel® C++ Compiler XE 13.1 User and Reference Guides
Elemental functions are a general language construct to express a data parallel algorithm. An elemental function is written as a regular C/C++ function, and the algorithm within describes the operation on one element, using scalar syntax. The function can then be called as a regular C/C++ function to operate on a single element, or it can be called in a data parallel context, providing many elements to operate on. In Intel® Cilk™ Plus, the data parallel context is provided as an array.
When you write an elemental function, the compiler generates a short vector form of the function, which can perform your function's operation on multiple arguments in a single invocation. The short vector version may be able to perform multiple operations as fast as the regular implementation performs a single one by utilizing the vector ISA in the CPU. In addition, upon invocation of the function, if the data set is large enough, the compiler may assign different copies of the elemental functions to different threads (or workers), executing them concurrently. The end result is that your data parallel operation executes on the CPU utilizing both the parallelism available in the multiple cores and the parallelism available in the vector ISA.
If the short vector function is called inside a vectorized parallel loop (a cilk_for loop or an auto-parallelized loop), you can achieve both vector-level and thread-level parallelism.
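As a rough mental model of the short vector form described above, the following portable C++ sketch writes by hand what a vector-length-4 variant of an addition function might do. The names ef_add_v4, add_arrays, and kVectorLength are hypothetical, and a plain loop over four lanes stands in for the SIMD instructions a compiler would actually emit; the real generated code and its calling convention are not shown in this guide.

```cpp
#include <cassert>
#include <cstddef>

// Scalar form: the operation on one element, as the programmer writes it.
double ef_add(double x, double y) {
    return x + y;
}

// Assumed vector length for this illustration.
constexpr std::size_t kVectorLength = 4;

// Hypothetical hand-written stand-in for a compiler-generated short vector
// form: a single call handles kVectorLength consecutive elements.
void ef_add_v4(const double* x, const double* y, double* out) {
    for (std::size_t lane = 0; lane < kVectorLength; ++lane) {
        out[lane] = ef_add(x[lane], y[lane]);
    }
}

// Hypothetical driver: full blocks go through the short vector form, and
// any leftover elements fall back to the scalar form.
void add_arrays(const double* b, const double* c, double* a, std::size_t n) {
    assert(n == 0 || (a != nullptr && b != nullptr && c != nullptr));
    std::size_t i = 0;
    for (; i + kVectorLength <= n; i += kVectorLength) {
        ef_add_v4(b + i, c + i, a + i);
    }
    for (; i < n; ++i) {
        a[i] = ef_add(b[i], c[i]);
    }
}
```

With six-element arrays, add_arrays(b, c, a, 6) would send elements 0 through 3 to ef_add_v4 in one call and handle elements 4 and 5 with the scalar ef_add.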
In order for the compiler to generate the short vector function, you need to provide an indication in your code.
Windows* OS:
Use the __declspec(vector (clauses)) declaration, as follows:
__declspec(vector (clauses)) return_type elemental_function_name(arguments)
Linux* OS and OS X*:
Use the __attribute__((vector (clauses))) declaration, as follows:
__attribute__((vector (clauses))) return_type elemental_function_name(arguments)
The clauses for the vector declaration take the following values:
| Clause | Description |
|---|---|
| processor(cpuid) | Where cpuid takes one of the following values: |
| vectorlength(n) | Where n is a vector length (vl). It must be an integer power of 2: 2, 4, 8, or 16. The vectorlength clause tells the compiler that each routine invocation at the call site should execute the computation equivalent to n times the scalar function execution. |
| vectorlengthfor(datatype) | Where the datatype value must be one of the following built-in types; otherwise, the behavior is undefined. When you use the vectorlengthfor clause, n is computed as the size of the vector register divided by the size of the data type, for the processor being used. For example, vectorlengthfor(float) results in n=4 for Intel® SSE2 to Intel® SSE4.2 target processors (with packed float operations available on 128-bit XMM registers), and n=8 for Intel® AVX target processors (with packed float operations available on 256-bit YMM registers). Using vectorlengthfor(int) results in n=4 for Intel® SSE2 to Intel® AVX target processors. Note: The vectorlength and vectorlengthfor clauses are mutually exclusive. |
| linear(param1:step1 [, param2:step2]…) | Where |
| uniform(param [, param]…) | Where param is a formal parameter of the specified function. The uniform clause tells the compiler that the values of the specified arguments can be broadcast to all iterations as a performance optimization. Multiple uniform clauses are merged as a union. |
| [no]mask | The mask clause tells the compiler to generate a masked vector version of the routine; nomask requests an unmasked version. |
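To make the uniform and linear clauses more concrete, the following portable C++ sketch emulates by hand what a vector-length-4 variant of a function might receive. It assumes the usual meaning of linear, namely that the argument advances by a fixed step between consecutive invocations; that description is an assumption here, not a quotation from this guide. The names scale_at, scale_at_v4, and kVectorLength are hypothetical, and the clause combination mentioned in the comment is illustrative only.

```cpp
#include <cassert>
#include <cstddef>

// Assumed vector length for this illustration.
constexpr std::size_t kVectorLength = 4;

// Scalar form: multiplies the element at index i by factor.
// Illustrative assumption: a declaration with uniform(data, factor) and
// linear(i:1) would describe the call pattern emulated below.
double scale_at(const double* data, double factor, int i) {
    return data[i] * factor;
}

// Hand-written stand-in for the vector variant.
// - data and factor are uniform: one value is shared by every lane.
// - i is linear with step 1: lane k sees first_i + k, so only the first
//   index needs to be passed in.
void scale_at_v4(const double* data, double factor, int first_i, double* out) {
    assert(first_i >= 0);
    for (std::size_t lane = 0; lane < kVectorLength; ++lane) {
        out[lane] = scale_at(data, factor, first_i + static_cast<int>(lane));
    }
}
```

The point of the sketch is the parameter list: neither uniform argument needs a per-lane copy, and the linear argument is reconstructed from a single starting value, which is why these clauses can reduce the work needed to set up a vector call.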
Write the code inside your function using existing C/C++ syntax.
Typically, the invocation of an elemental function provides arrays wherever scalar arguments are specified as formal parameters. Use the array notation syntax available in Intel® Cilk™ Plus to provide the arrays succinctly. Alternatively, you can invoke the function from a _Cilk_for loop.
The following examples show how to use elemental functions to add two large arrays and store the result in a third array, taking advantage of the parallelism available in both the cores and the vectors in the CPU:
Windows* OS:
Example:

    // Declare the elemental function.
    __declspec(vector) double ef_add(double x, double y) {
        return x + y;
    }

    // Invoke the function using array notation.
    // Operates on the whole extent of the arrays a, b, and c.
    a[:] = ef_add(b[:], c[:]);

    // Use the full array notation construct to also specify n as an extent
    // and s as a stride.
    a[0:n:s] = ef_add(b[0:n:s], c[0:n:s]);

    // Use the _Cilk_for construct to invoke the elemental function in a
    // data parallel context.
    _Cilk_for (int j = 0; j < n; ++j) {
        a[j] = ef_add(b[j], c[j]);
    }
Linux* OS and OS X*:
Example:

    // Declare the elemental function.
    __attribute__((vector)) double ef_add(double x, double y) {
        return x + y;
    }

    // Invoke the function using array notation.
    // Operates on the whole extent of the arrays a, b, and c.
    a[:] = ef_add(b[:], c[:]);

    // Use the full array notation construct to also specify n as an extent
    // and s as a stride.
    a[0:n:s] = ef_add(b[0:n:s], c[0:n:s]);

    // Use the _Cilk_for construct to invoke the elemental function in a
    // data parallel context.
    _Cilk_for (int j = 0; j < n; ++j) {
        a[j] = ef_add(b[j], c[j]);
    }
Only the _Cilk_for invocation is able to use all available parallelism. The array notation syntax, as well as a call from a regular for loop, invokes the short vector function in each iteration and uses the vector parallelism, but the loop itself runs serially and does not use multiple cores.
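For readers less familiar with array notation, the section a[0:n:s] in the examples selects n elements starting at index 0 with stride s, so the third field is a stride and the second is a count, not an end index. The following portable C++ sketch spells out the serial loop that the strided call is equivalent to in terms of results; add_strided is a hypothetical helper name, and the loop says nothing about how the compiler actually groups elements into vector calls.

```cpp
#include <cassert>
#include <cstddef>

// Scalar elemental function from the examples, without the vector attribute
// so that it compiles with any C++ compiler.
double ef_add(double x, double y) {
    return x + y;
}

// Result-equivalent of: a[0:n:s] = ef_add(b[0:n:s], c[0:n:s]);
// Visits n elements at indices 0, s, 2*s, ..., (n-1)*s.
void add_strided(const double* b, const double* c, double* a,
                 std::size_t n, std::size_t s) {
    assert(s > 0);
    for (std::size_t k = 0; k < n; ++k) {
        a[k * s] = ef_add(b[k * s], c[k * s]);
    }
}
```

With n = 3 and s = 2, only indices 0, 2, and 4 of a are written; the elements in between keep their previous values.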
Limitations
The following language constructs are disallowed within elemental functions:
The goto statement
The switch statement with 16 or more case statements
Operations on classes and structs (other than member selection)
The _Cilk_spawn keyword
Expressions with array notations