MEGClass: Text Classification with Extremely Weak Supervision via Mutually-Enhancing Text Granularities
We use python=3.8, torch=1.13.1, cudatoolkit=11.3, and a single NVIDIA RTX A6000 GPU. Other packages can be installed using:
pip install -r requirements.txt
Specify the variables DATA_FOLDER_PATH and INTERMEDIATE_DATA_FOLDER_PATH within utils.py. DATA_FOLDER_PATH should point to where your datasets are saved (all are provided within the datasets/ folder), and INTERMEDIATE_DATA_FOLDER_PATH is where all intermediate data is stored (e.g. pickle files for class-oriented sentence and class representations, as well as the final pseudo-training dataset).
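As a minimal sketch, the two variables in utils.py could look like the following (the exact paths here are placeholders — substitute your own local directories):

```python
# In utils.py: point these at your local directories.
# The paths below are illustrative placeholders, not the repo's defaults.
DATA_FOLDER_PATH = "./datasets/"                   # where the provided datasets live
INTERMEDIATE_DATA_FOLDER_PATH = "./intermediate/"  # pickle files and the final pseudo-training dataset
```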
To learn the contextualized sentence and document representations for a specific dataset (here, 20News), run the following command:
time CUDA_VISIBLE_DEVICES=[gpu] python run.py --gpu [gpu] --dataset_name 20News
The following are the primary arguments for MEGClass:

- dataset_name: Name of the dataset folder to run MEGClass on (e.g. 20News).
- gpu: GPU to use; refer to nvidia-smi.
- emb_dim: default=768; sentence and document embedding dimension (default based on bert-base-uncased).
- num_heads: default=2; number of heads to use for MultiHeadAttention.
- batch_size: default=64; batch size of documents.
- epochs: default=4; number of epochs used to learn contextualized representations during a single iteration.
- max_sent: default=150; maximum number of sentences within a document (used for padding).
- temp: default=0.1; temperature scaling factor (regularization).
- lr: default=1e-3; learning rate for training contextualized embeddings.
- iters: default=4; number of iterations of iterative feedback.
- k: default=0.075; top-k proportion of documents added to each class set (7.5%).
- doc_thresh: default=0.5; confidence threshold for the pseudo-training dataset.
- pca: default=64; number of dimensions projected to via PCA; -1 disables PCA.
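To illustrate how the k and doc_thresh arguments could interact when building class sets and the pseudo-training dataset, here is a minimal sketch. The function name and selection logic are assumptions for illustration, not MEGClass's actual implementation:

```python
import numpy as np

def select_confident_docs(class_probs, k=0.075, doc_thresh=0.5):
    """Hypothetical sketch: take the top-k proportion of documents per class
    (by class probability), then keep only those clearing doc_thresh."""
    class_probs = np.asarray(class_probs)
    n_docs, n_classes = class_probs.shape
    n_top = max(1, int(k * n_docs))  # top-k proportion, at least one doc
    selected = {}
    for c in range(n_classes):
        # Rank documents by descending probability for class c.
        order = np.argsort(-class_probs[:, c])[:n_top]
        # Filter out low-confidence documents before pseudo-training.
        selected[c] = [i for i in order if class_probs[i, c] >= doc_thresh]
    return selected
```

With k=0.075 and four classes, roughly 7.5% of documents enter each class set, and doc_thresh then prunes the least confident of those before they become pseudo-labels.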