EDBT'23 Tutorial: Mining Structures from Massive Texts by Exploring the Power of Pre-trained Language Models

Yu Zhang, Yunyi Zhang, and Jiawei Han

Time: March 29th, 2023, 11:00 AM - 12:30 PM and 4:00 PM - 5:30 PM (UTC+2)



Technologies for handling massive structured or semi-structured data have been researched extensively in database communities. However, real-world data are largely in the form of unstructured text, which poses a great challenge to their management and analysis as well as to their integration with semi-structured databases. Recent developments in deep learning methods and large pre-trained language models (PLMs) have revolutionized text mining and processing, and have shed new light on structuring massive text data and building a framework for integrated (i.e., structured and unstructured) data management and analysis.

In this tutorial, we will focus on recently developed text mining approaches, empowered by PLMs, that work without relying on heavy human annotation. We will present an organized picture of how a set of weakly supervised methods explore the power of PLMs to structure text data, with the following outline:

(1) An introduction to pre-trained language models that serve as new tools for our tasks;

(2) Mining topic structures: unsupervised and seed-guided methods for topic discovery from massive text corpora;

(3) Mining document structures: weakly supervised methods for text classification;

(4) Mining entity structures: distantly supervised and weakly supervised methods for phrase mining, named entity recognition, taxonomy construction, and structured knowledge graph construction;

(5) Towards an integrated information processing paradigm.