The datasets used by many software applications can be represented as graphs, defined by sets of vertices and edges. These graphs are rich with useful information and can be used to uncover patterns and relationships in the stored data. This process of discovering relevant patterns in graphs is called Graph Pattern Mining (GPM). A team of Texas Computer Science (TXCS) researchers advised by Dr. Keshav Pingali has done groundbreaking work to make GPM programs more efficient and accessible. Their work was recently accepted to Very Large Data Bases (VLDB) 2020, one of the premier conferences in computer science.
GPM has applications in many areas, including chemical engineering, bioinformatics, and the social sciences. For example, protein behavior can be mapped to a graph by representing the proteins as vertices and the interactions between them as edges. Using a GPM program, researchers can then make useful discoveries about the data, such as finding patterns that occur frequently within the graph and the features they correspond to. This information can shed light on the etiology of a disease and accelerate drug discovery.
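To make that idea concrete, here is a minimal, illustrative sketch (not taken from the team's system) of how a protein-interaction network can be stored as a graph and scanned for one simple pattern, a triangle of mutually interacting proteins. The protein names and interactions are invented for the example.

```python
# Illustrative only: a tiny protein-interaction network represented as a graph,
# with a naive count of one simple pattern -- triangles of mutually
# interacting proteins. All names and interactions below are made up.
from itertools import combinations

# Vertices are proteins; edges are observed interactions (hypothetical data).
interactions = [
    ("P1", "P2"), ("P2", "P3"), ("P1", "P3"),  # these three form a triangle
    ("P3", "P4"), ("P4", "P5"),
]

# Build an adjacency set for each protein.
adjacency = {}
for a, b in interactions:
    adjacency.setdefault(a, set()).add(b)
    adjacency.setdefault(b, set()).add(a)

# Count triangles: every group of 3 proteins that all interact with one another.
triangles = sum(
    1
    for u, v, w in combinations(sorted(adjacency), 3)
    if v in adjacency[u] and w in adjacency[u] and w in adjacency[v]
)
print(f"Triangles found: {triangles}")  # -> 1
```

Real GPM workloads search for far larger and more varied patterns over much bigger graphs, which is what makes the problem so demanding.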
The applications of these programs are seemingly endless. However, GPM problems are computationally complex, memory-intensive, and time-consuming to solve. Although considerable work has gone into making GPM applications easier for programmers to develop, existing solutions are generally inefficient or inflexible, relying on distributed systems or disks to manage the intermediate data they generate. The team's system, Pangolin, was built to address these limitations.
At the moment, Pangolin runs efficiently on a single multicore CPU or a single GPU, but its functionality still has room to grow. “The goal is to enable larger graphs and also larger patterns that could be mined within a reasonable amount of time,” says Dr. Xuhao Chen, the project’s lead researcher. Their paper is set to be published in VLDB at the end of the month.