Reproducing Router Training Results: A Deep Dive

by Alex Johnson

When training a router on the R2R dataset with its associated training scripts, precisely replicating published results can feel like a puzzle. You ran the default training script python script/train/train_router.py --config resource/default_training_config.json on the AnonymousPaperReview/R2R_Router_Training dataset and reached a recall of approximately 95%. However, your positive rate (also known as LLM usage) is around 31.8%, a noticeable jump from the 12.4% reported in Table 4 of the paper. This kind of discrepancy is a common point of inquiry for researchers reproducing experimental outcomes, and it usually points to subtle differences in setup, configuration, or interpretation rather than a fundamental flaw in the training process itself. Let's explore the factors that could contribute to this difference and the steps you might take to align your results more closely with the paper's findings.

Understanding the Discrepancy: Recall vs. Positive Rate

Let's first unpack the metrics themselves. You've achieved a high recall of 95%, which is fantastic! Recall, in this context, measures the proportion of instances where the router correctly identified the need for LLM usage. In simpler terms, out of all the situations where an LLM should have been used, your router correctly flagged it almost all the time. This is a crucial performance indicator for a router, as it minimizes the risk of overlooking necessary LLM calls. On the other hand, the positive rate, or LLM usage, indicates the overall percentage of instances where the router decided to use the LLM, regardless of whether it was strictly necessary. The paper reports 12.4%, while your training yields 31.8%. This means your router is opting for LLM usage much more frequently than reported in the paper, even though it's doing a good job of catching the cases where LLM is needed.
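To make the distinction concrete, here is a minimal sketch (using made-up label and prediction arrays, not the actual R2R evaluation code) of how the two metrics are computed and why they can diverge:

```python
import numpy as np

def recall_and_positive_rate(labels, predictions):
    """Compute recall and positive rate (LLM usage) for binary router decisions.

    labels:      1 where the LLM was actually needed, 0 otherwise.
    predictions: 1 where the router chose to invoke the LLM, 0 otherwise.
    """
    labels = np.asarray(labels, dtype=bool)
    predictions = np.asarray(predictions, dtype=bool)

    # Recall: of all cases that needed the LLM, how many did the router catch?
    recall = (labels & predictions).sum() / max(labels.sum(), 1)

    # Positive rate: how often did the router invoke the LLM overall?
    positive_rate = predictions.mean()
    return float(recall), float(positive_rate)

# A router that fires on every hard case plus many easy ones has perfect
# recall but a high positive rate.
labels      = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
predictions = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
print(recall_and_positive_rate(labels, predictions))  # (1.0, 0.5)
```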

This significant difference between your positive rate and the paper's could stem from several areas. It might be related to the thresholding mechanism used for making the final decision to invoke the LLM. The paper's results might have been obtained with a different default threshold or a different method of optimizing this threshold. Your training output shows a threshold of 0.33666666666666667 and final_metrics.positive_rate of 0.31790005282825. This suggests that at this specific threshold, a considerable portion of the data is classified as needing LLM. The optimizing section of your config specifies type: "threshold" and min_recall: 0.95. This means the system found a threshold that satisfies the 0.95 recall requirement. It's possible that the threshold that satisfies this high recall also results in a much higher positive rate than what was originally targeted or observed in the paper's specific experimental conditions.
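As a simplified illustration (assuming the optimizer essentially sweeps candidate thresholds over validation scores; the actual R2R implementation may be more involved), a "recall must be at least 0.95" constraint effectively selects the highest threshold that still catches 95% of the positives. If positive and negative scores overlap heavily, that threshold ends up low and the positive rate ends up high:

```python
import numpy as np

def threshold_for_min_recall(scores, labels, min_recall=0.95):
    """Return the highest threshold whose recall still meets min_recall,
    plus the positive rate that threshold implies.

    scores: router probabilities in [0, 1]; labels: 1 if the LLM is needed.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    for t in np.unique(scores)[::-1]:              # sweep high -> low
        preds = scores >= t
        recall = (preds & labels).sum() / max(labels.sum(), 1)
        if recall >= min_recall:
            return float(t), float(preds.mean())   # first hit = highest valid threshold
    return 0.0, 1.0                                # fall back to routing everything
```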

Investigating Configuration and Assumptions

To bridge this gap, let's delve into the configuration and potential underlying assumptions. The provided training JSON is quite detailed, and subtle changes here could have a significant impact. One area to scrutinize is the model_type and its init_args. You are using HiddenStatesTokenLMHeadLogitsClassifier with pretrained_model_name: "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B". While this is a valid setup, it's worth double-checking if the paper used the exact same pretrained model or if any specific fine-tuning steps were applied to the base model before it was used for router training. Sometimes, the way the hidden states, tokens, or logits are extracted and fed into the classifier can also differ. The input_type is set to ["hidden_states", "token", "logits"], which seems comprehensive.
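For intuition only, a hypothetical head over those three feature types might look like the sketch below; the real HiddenStatesTokenLMHeadLogitsClassifier in the R2R codebase almost certainly differs in its architecture and feature handling, so treat this purely as a reading aid:

```python
import torch
import torch.nn as nn

class IllustrativeRouterHead(nn.Module):
    """Purely illustrative: combines a hidden state, a token id, and the
    top-k LM-head logits into one routing score. Not the R2R classifier."""

    def __init__(self, hidden_dim, vocab_size, embed_dim=64, top_k=16):
        super().__init__()
        self.top_k = top_k
        self.token_embed = nn.Embedding(vocab_size, embed_dim)
        self.scorer = nn.Linear(hidden_dim + embed_dim + top_k, 1)

    def forward(self, hidden_states, token_ids, logits):
        # Summarize the full logit vector by its top-k values as a crude
        # confidence signal, then concatenate all three feature types.
        top_logits, _ = logits.topk(self.top_k, dim=-1)
        features = torch.cat(
            [hidden_states, self.token_embed(token_ids), top_logits], dim=-1
        )
        return torch.sigmoid(self.scorer(features)).squeeze(-1)
```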

Furthermore, the data configuration, particularly type: "divergent" and input_prefix: "small_", should be verified against the paper's description. The divergent type suggests a specific way of constructing training examples, and the small_ prefix might imply that only a subset or a particular type of input features are being used. Are these settings directly from the paper, or were they chosen based on the default script? If the paper used a different data loading or preprocessing strategy, this could explain the divergence in results. The training.params such as num_epochs, batch_size, and patience are standard, but if the paper used a different learning rate schedule or optimizer, that could also play a role. Your training.optimizer is set to lr: 5e-05 and weight_decay: 0.0005, which are typical values, but variations here might be worth exploring. The training.loss with recall_factor: 1.0 is also a key parameter, and its interaction with the optimizer and data can heavily influence the model's behavior.
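A low-effort first check is to dump the relevant sections of the config you actually trained with and diff them field by field against the paper and the repository's defaults. The snippet below assumes the top-level keys in resource/default_training_config.json match the dotted names referenced above:

```python
import json

# Print the config sections most likely to explain the divergence
# (data construction, optimizer/loss settings, and threshold selection).
with open("resource/default_training_config.json") as f:
    config = json.load(f)

for section in ("data", "training", "optimizing"):
    print(f"--- {section} ---")
    print(json.dumps(config.get(section, {}), indent=2))
```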

Exploring the Thresholding Strategy

The optimizing section, specifically type: "threshold" and min_recall: 0.95, is central to understanding your positive rate. The system optimizes by finding a threshold that ensures a minimum recall of 95%. It seems that to achieve this high recall, the algorithm might be selecting a threshold that is relatively low, thus leading to a higher positive rate. The paper's reported positive rate of 12.4% suggests that either a higher threshold was used, or the model learned to be more selective naturally, resulting in fewer LLM calls. It's possible that the paper's methodology involved a different approach to determining the optimal threshold, perhaps one that explicitly balanced recall and precision, or directly minimized a loss function that implicitly controlled the positive rate.
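One way to check whether a 12.4% positive rate is even attainable with your trained scores at 95% recall is to sweep the threshold over a held-out set and tabulate the trade-off directly. This is a hypothetical helper, assuming you can export validation scores and labels from your run:

```python
import numpy as np

def recall_vs_positive_rate(scores, labels):
    """Tabulate (threshold, recall, positive_rate) across all candidate
    thresholds so the recall / LLM-usage trade-off can be inspected."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    rows = []
    for t in np.unique(scores):
        preds = scores >= t
        recall = (preds & labels).sum() / max(labels.sum(), 1)
        rows.append((float(t), float(recall), float(preds.mean())))
    return rows

# If no threshold yields recall >= 0.95 with a positive rate near 0.124,
# the gap is unlikely to come from threshold selection alone.
```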

Consider whether the paper's threshold was selected on a different validation split or under an additional constraint beyond the minimum-recall requirement in the default config, such as a cap on the positive rate. Checking the released evaluation code, or asking the authors which selection rule produced the 12.4% figure, is the most direct way to settle this.