My master's thesis was on parallel mesh generation. The main objective was to develop software that can generate large unstructured meshes, in the range of hundreds of millions to billions of elements, on complex geometry. Of course, I also wanted to reduce the mesh generation time drastically, so scalability of the program was one of the main concerns.
There is a need for mesh generators that can create meshes with over a billion elements. Many scientists use some kind of refinement library in conjunction with sequential mesh generators and mesh repartitioning tools. Unfortunately, most of the time this is done manually and can be quite time-consuming. The resulting meshes should also satisfy the condition that vertices on the boundaries lie exactly on the geometry. If a parallel solver is used, the mesh will probably be partitioned, so there is an additional requirement that the partition boundaries be conforming.
Mesh generation is basically used to approximate geometry for rendering purposes or for physical simulations like FEA and CFD. Most sequential mesh generators fall into three categories:
- Delaunay Methods
- Advancing Front Methods
- Octree Methods
There are basically two approaches to parallelizing this process: the first is parallelizing the mesh generation algorithm itself, the second is decomposing the geometry or using refinement-based approaches. I selected the second approach since my time was limited, since creating parallel programs is generally harder than creating their sequential counterparts, and since, quoting N. Chrisochoides, “It takes about 10-15 years to develop the algorithmic and software infrastructure for sequential industrial strength mesh generation libraries.”
My work consisted of three different ways of generating meshes.
- Geometry Decomposition (Method 1)
- Parallel Volume Refinement (Method 2)
- Parallel Surface Refinement (Method 3)
Of course, a mesh migration algorithm was also needed to make the partition boundaries conforming after refinement and decomposition.
2. Structures Used
Two different structures were used for the project.
Tree Structure for Method 1
This one basically consists of a modified BSP tree and a mesh. Each terminal (leaf) node corresponds to a processor, so the mesh generation can be done by a number of processors that is a power of 2. This is the case for all the methods implemented in this thesis.
In addition to the basic classes used by a “computer graphics” mesh, I added a MeshTetrahedron class to represent volume elements. Also, since ownership information for elements is required, a list of owning processors was added to the MeshElement interface, which all the different types of mesh elements (Vertex, Face, Tetrahedron, etc.) implement. Basically the whole thing looks like this:
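A minimal sketch of that interface (class and attribute names here are illustrative, not the actual thesis code):

```python
class MeshElement:
    """Common base for all mesh entities: each one records the set of
    processors (partitions) on which a copy of it resides."""
    def __init__(self, global_id):
        self.global_id = global_id
        self.procs = set()          # processors that hold this element

    def is_shared(self):
        return len(self.procs) > 1


class Vertex(MeshElement):
    def __init__(self, global_id, coords):
        super().__init__(global_id)
        self.coords = coords


class MeshTetrahedron(MeshElement):
    """Volume element added on top of the usual surface-mesh classes."""
    def __init__(self, global_id, vertex_ids):
        super().__init__(global_id)
        self.vertex_ids = vertex_ids
```

Keeping the processor list on the common base class is what lets every element type answer the “who else has a copy of me?” question during migration.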
Mesh Structure for Methods 2 and 3
To save memory, not all adjacencies of a standard mesh are implemented. The mesh structure used here is based on the partitioned mesh structure of the Elmer FEM solver.
Basically, a standard mesh adjacency consists of all the relations shown in part (b) of the figure above, while the structure implemented in this work is the one shown in (a).
I also classified the mesh elements into two groups, since this is important for mesh migration and for ensuring the conformity of partition boundaries:
- Interior entities: entities that uniquely reside on a partition (processor), i.e. non-shared entities.
- Partition boundary (shared) entities: entities that are shared by multiple partitions. Since a distributed memory model is used, these entities are replicated on every processor that owns a tetrahedron containing them.
3. Geometry Decomposition Method
The main approach here is to use BSP trees to create a new geometry with partition planes. Then a fine surface mesh of this new geometry is created and distributed across the processors, and lastly a fine volume mesh is generated in parallel. The main algorithm is as follows (in this and the remaining figures the colors mean the following: blue marks parts done by the NETGEN library, green marks parts done by METIS/ParMETIS, and orange marks parts done by my own code):
One example run with a torus object looks as follows:
For the geometry decomposition I used different tree methods.
- BSP-Tree (random cutplanes)
- BSP-Tree (iterative PCA based cutplanes)
The figure above shows the 2D versions of these tree types: the octree on the left, the KD-tree in the middle, and the BSP tree on the right.
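As a sketch of how an iterative PCA-based cut plane can be chosen (my own illustrative reconstruction, not the thesis code): the plane passes through the centroid of the points and is normal to the direction of greatest spread, so each cut tends to halve the object along its longest extent.

```python
import numpy as np

def pca_cutplane(points):
    """Return (point, normal) of a plane through the centroid,
    normal to the first principal component of the point cloud."""
    pts = np.asarray(points, dtype=float)
    centroid = pts.mean(axis=0)
    # The eigenvector of the largest eigenvalue of the covariance
    # matrix is the direction of greatest spread.
    cov = np.cov((pts - centroid).T)
    _, eigvecs = np.linalg.eigh(cov)    # eigenvalues in ascending order
    return centroid, eigvecs[:, -1]

def split(points, plane_point, normal):
    """Partition points by the sign of their distance to the plane."""
    pts = np.asarray(points, dtype=float)
    side = (pts - plane_point) @ normal
    return pts[side < 0.0], pts[side >= 0.0]
```

Applying this recursively to the two halves yields the “iterative PCA” variant: each leaf of the resulting BSP tree then corresponds to one processor's sub-domain.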
In this first method I don't use METIS for partitioning; instead I use the tree methods above. The same tree that was used to create the new geometry is also used to partition the mesh among the processors. Vertices in this approach already have global IDs. The distribution of the created surface mesh is done through the leaf nodes: the data to be sent to each processor is assembled at the leaf nodes, and each terminal node corresponds to a processor. The first processor then distributes these surface meshes to the corresponding processors, where volume mesh generation from the surface meshes takes place in parallel. For the distribution formats and the global ID assignment of mesh elements, the interested reader can refer to the full thesis linked at the end of this post.
4. Parallel Volume Refinement Based Mesh Generation
In this method a coarse volume mesh is generated using NETGEN, partitioned using ParMETIS, and then refined in parallel by my code. There are some issues that have to be considered. One of them is the sub-mesh distribution, which is detailed in the thesis. The main concern is that the following must hold after the distribution:
- A shared vertex must have the same processor list on all of its sharing processors
- A shared vertex must have the same global ID on all of its sharing processors
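These two invariants can be stated as a small consistency check (an illustrative sketch; `copies` stands for the replicas of one shared vertex gathered from all of its sharers):

```python
def check_shared_vertex(copies):
    """copies: list of (global_id, sharer_set) pairs, one per replica
    of the same shared vertex collected from the sharing processors."""
    global_ids = {gid for gid, _ in copies}
    sharer_lists = {frozenset(s) for _, s in copies}
    if len(global_ids) != 1:
        raise ValueError("global IDs disagree across sharers")
    if len(sharer_lists) != 1:
        raise ValueError("processor lists disagree across sharers")
```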
Details about global ID assignment can also be found in the thesis. The surface refinement uses simple 4-way triangular refinement, and the volume refinement uses 8-way tetrahedral refinement. The basic idea is just splitting each edge exactly in the middle, as seen here:
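The 4-way triangular case can be sketched like this (illustrative code, not the thesis implementation); the midpoint cache guarantees that two triangles sharing an edge reuse the same new vertex:

```python
def refine_triangle(tri, vertices, midpoints):
    """Split one triangle (a, b, c) into four congruent triangles by
    inserting a new vertex at the midpoint of each edge."""
    def mid(i, j):
        key = (min(i, j), max(i, j))          # edge identity is unordered
        if key not in midpoints:
            vi, vj = vertices[i], vertices[j]
            vertices.append(tuple((p + q) / 2.0 for p, q in zip(vi, vj)))
            midpoints[key] = len(vertices) - 1
        return midpoints[key]

    a, b, c = tri
    ab, bc, ca = mid(a, b), mid(b, c), mid(c, a)
    # Three corner triangles plus the inner one.
    return [(a, ab, ca), (ab, b, bc), (ca, bc, c), (ab, bc, ca)]
```

The 8-way tetrahedral split works on the same principle with six edge midpoints per tetrahedron.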
Anyway, the main algorithm looks as follows:
Here, steps 5-6 are done as follows. First, a new vertex is created for each edge. Then communication between processors takes place for the shared vertices. In the next step, first the new volume elements and then the surface elements are generated. The creation of new vertices must follow some rules if conformity of the partition boundaries and correct global IDs are to be ensured. To ensure that only one processor assigns global IDs and handles the distribution of information, I use the notion of an “owner” processor. Each processor owns all of its non-shared (internal) vertices plus the subset of the shared vertices assigned to it; every vertex is owned by exactly one processor (partition). The owner of a new vertex is calculated using the following method.
- Z = (set of processors sharing vertex X) ∩ (set of processors sharing vertex Y)
- Order Z by processor ID.
- index = (idX+idY) % size(Z)
- Owner is Z[index]
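In code, the rule looks like this (a sketch; the point is that every sharer can compute the same owner locally, with no communication):

```python
def edge_owner(sharers_x, sharers_y, gid_x, gid_y):
    """Owner of the new midpoint vertex on the shared edge (X, Y).
    sharers_*: sets of processors that hold each endpoint;
    gid_*: the endpoints' global IDs."""
    z = sorted(sharers_x & sharers_y)     # candidates, ordered by rank
    return z[(gid_x + gid_y) % len(z)]
```

Because the candidate set is sorted and the index depends only on the two global IDs, all sharing processors deterministically agree on the owner.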
The communication of shared vertex information also relies on a similar assumption. Normally the owner processor sends id1 id2 glid to all processors in the set Z (id1 and id2 denote the global IDs of the vertices that form the shared edge, and glid is the global ID assigned by the owner to the newly created vertex in the ‘middle’ of this shared edge). This causes some problems that need to be addressed. They are as follows and can also be seen in the figure just below:
- Excess data is sent: the owner wrongly assumes that every processor that has both vertices of a shared edge actually has that edge.
- Data can be missing: the assumed owner may not actually have the shared edge itself, so it sends nothing to the others.
The following solutions were implemented for these problems.
Solution to Problem 1
Processors that should not have received the additional vertex ID information send id1 id2 procID to the owner processor. The owner then rebuilds the shared vertex data: it first deletes those processor IDs from the shared list and then sends the new shared information to the remaining processors again.
Solution to Problem 2
The processors that did not get the global ID data from the assumed owner send the following data to the assumed owner: id1 id2 procID. The assumed owner combines this information from the different processors and creates the shared vertex information, assigning new global IDs starting from the last updated maximum global vertex ID. It then sends both the global ID information and the shared vertex information back to the processors. Let's illustrate this solution on the problem shown in the figure above. Define the following:
- The purple processor is called processor 0.
- The light blue processor is called processor 1.
- The orange processor is called processor 2.
In this case, the vertices a and b are both shared by all three processors. Assume that the owner calculation yields processor 0 as the owner. Since this processor does not actually have the edge (a, b), it will not create c and will not send any information to the other processors. Processor 2 also does not create c, so there is no problem there. But processor 1 expects processor 0 to create c, assign it a global ID, and send that information over, and it never receives it. With my solution, processor 1 sends ‘id1 id2 procID’ (in this case a, b, 1) to the assumed owner, processor 0. Processor 0 receives a, b, 1 from processor 1 and creates shared vertex information from it. Note that processor 0 does not use this data itself, but sends it to all processors that requested the global ID for the newly created vertex on edge (a, b). In this case the newly assigned global ID is sent only to processor 1. In more complex cases the same information could be sent to every processor that creates the new vertex on that edge.
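The aggregation step on the assumed owner's side can be sketched as follows (illustrative; the actual message transport is omitted):

```python
from collections import defaultdict

def resolve_missing(requests, next_global_id):
    """requests: (id1, id2, proc) triples received by the assumed owner
    from processors that never got a global ID for their shared edge.
    Groups them per edge, assigns fresh global IDs, and returns the
    replies to send back: edge -> (new_global_id, sharer_list)."""
    by_edge = defaultdict(set)
    for id1, id2, proc in requests:
        by_edge[(min(id1, id2), max(id1, id2))].add(proc)

    replies = {}
    for edge in sorted(by_edge):
        replies[edge] = (next_global_id, sorted(by_edge[edge]))
        next_global_id += 1
    return replies, next_global_id
```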
5. Parallel Surface Refinement Based Mesh Generation
In this method, first a coarse volume mesh is generated using NETGEN and partitioned using ParMETIS. Surface meshes are extracted from the sub-meshes created during this process. These surface meshes are then refined until the required size is reached, and volume meshes are generated from them, again using NETGEN. The algorithm looks as follows:
By skinning the volume mesh I mean extracting the surface meshes of the sub-meshes. The program already knows the geometric boundary faces; what it needs to find are the partition boundary faces. Together they form a closed surface mesh from which NETGEN can generate a volume mesh.
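The skin can be found with the classic face-counting trick (a sketch, not the thesis code): in a tetrahedral sub-mesh, a face belongs to the closed boundary exactly when it appears in only one tetrahedron.

```python
from collections import Counter

def skin(tetrahedra):
    """Return the boundary faces of a tetrahedral sub-mesh: the faces
    that occur in exactly one tetrahedron. Faces shared by two
    tetrahedra are interior and cancel out."""
    count = Counter()
    for a, b, c, d in tetrahedra:
        for face in ((a, b, c), (a, b, d), (a, c, d), (b, c, d)):
            count[tuple(sorted(face))] += 1
    return [face for face, n in count.items() if n == 1]
```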
6. Mesh Migration
In mesh migration I use the “owner updates” rule. The main algorithm first finds the parent IDs of the boundary faces if they are not already available. Then the elements are sent to the destination processors in the following order: first the boundary faces, then the volume elements, and lastly the vertices. At the end, the new shared vertex information has to be computed. The boundary faces and volume elements are sent in two steps: first the number of elements each processor will receive is calculated and that information is exchanged, then the actual elements are sent to the corresponding processors. When sending vertices, the shared vertex information on partition boundaries also has to be taken into account. You can find detailed information in the thesis; basically, vertices that are no longer used by a processor must be deleted from that processor, and duplicate vertices that may arise during migration must be handled. A migration example, and what I mean by it, can be seen in these figures.
The bold points denote the owning processor; for example, point a is owned by processor 2 but shared among all four processors. The migrating parts are colored, and the arrows show their destinations. After the migration takes place, the new partitioned mesh looks like this:
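The two-phase exchange described in this section can be sketched as follows (illustrative, with the MPI transport left out; in a real run the counts would be exchanged with something like MPI_Alltoall before the payloads are sent):

```python
from collections import defaultdict

def plan_migration(elements, destination):
    """Group outgoing elements per destination processor. The per-rank
    counts are exchanged first so receivers can size their buffers;
    the actual payloads follow in a second step."""
    outbox = defaultdict(list)
    for elem_id, elem in elements.items():
        outbox[destination[elem_id]].append(elem)
    counts = {rank: len(batch) for rank, batch in outbox.items()}
    return counts, outbox
```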
7. Results and Conclusions
In the table above you can see the timing results of parallel mesh generation using geometry decomposition (i.e. Method 1), with the timings of each component reported. Method 1 was developed first in my work. It was realized, however, that it has some problems, as exemplified in the next figure. When the geometry is decomposed with cutting planes, cases like the one shown can occur: an inappropriate cut location results in thinly sliced regions and causes small elements to be generated where large elements are expected. This problem becomes more probable as the number of sub-domains increases. Because of this difficulty, Method 1 is not appropriate for a large number of processors.
In the next method (Method 2), as expected, the execution time of the sequential part increases with the number of processors, since the overhead of partitioning and distributing the initial mesh grows. The timing of the parallel component (the upper part) decreases as the number of processors increases, since, for a fixed mesh size, the sub-mesh on each processor shrinks. The scalability problem that appears here at around 1024 processors for a mesh of 1.4 billion volume elements is also due to the size of the mesh; as can be seen from the figure, the problem arises even earlier for smaller meshes. For the shaft and cube geometries (having 700M and 800M elements, respectively), the scalability problem kicks in between 256 and 512 processors. This means that if the mesh size were further increased, the processor count could also be increased without losing much efficiency. Unfortunately, I did not foresee this problem and implemented my software with 32-bit integers, which limits things like global IDs for volume elements to around 4 billion. This could be remedied in future work.
I have also tested the performance of the mesh migration algorithm. The following table shows timings obtained when 10% of the elements on each processor are migrated to 10 different random destination processors. The last entry, for the case of 512 processors, shows that migrating about 80 million elements took under a minute. Considering that migration is performed only once after repartitioning, these results are acceptable.
Unfortunately, the last method could not be tested. In general, the following conclusions can be drawn from this work:
- Automatic geometry decomposition is difficult and may not be suitable for a large number of processors. Method 1 can therefore be used on a small number of processors, perhaps with user-assisted geometry decomposition in a semi-automatic manner.
- For the refinement-based methods, mesh generation with billions of elements is possible on distributed machines.
- Since the size of the mesh that can initially be generated sequentially dictates the size and quality of the final fine mesh, using a fat node during the sequential phase might be preferable.
For future work the following can be possible:
- Making the last method work.
- Support for other input and output file types
- Extending to 64-bit architecture so that the 4 billion element limit can be surpassed
- Using alternatives to NETGEN and ParMETIS
- Integration of the software with a FEM solver.
Lastly, I want to compare the methods used in this work in a table and show a few screenshots of the resulting meshes:
Publications that resulted from this work:
- Yilmaz, Y. and Ozturan, C., “A Parallel Mesh Generator Based on Sequential NETGEN”, PARENG 2013 (won the Young Researcher Best Paper in Engineering award). Links for the paper and the prize are here.
- Yilmaz, Y. and Ozturan, C., “Using sequential NETGEN as a component for a parallel mesh generator”, Advances in Engineering Software, Volume 84, June 2015. The link for the paper is here.
This work was financially supported by the PRACE project, funded in part by the EU's 7th Framework Programme (FP7/2007-2013) under grant agreements no. RI-211528 and FP7-261557. The work was carried out using PRACE Research Infrastructure resources (the Curie supercomputer at CEA in France).
By the way, a subsequent (more robust) work on the same problem from my university can be found at PMSH. Thanks to Seren Soner for making it publicly available.
Finally, for those interested, the full thesis can be found here.