These kinds of errors are regrettable, but survivable. In deep learning, the dreaded shape error is the most common, arising when you try to multiply together matrices of incompatible size. I'm not going to talk much about these errors, because they are typically explicit: the program fails, and you have to figure out why. You fix it, find a new error, and the cycle repeats.

Errors that lead to inaccurate experimentation

This is a more costly kind of mistake to make, because you end up making poor decisions. Consider the story of the AirAsia pilot who ended up in Melbourne rather than Malaysia because of a wonky GPS. If your navigation is faulty, you'll quickly end up somewhere you don't want to be. For example, if you implement a new feature which adds a bunch of parameters and compare it to the performance of an existing model without redoing a hyperparameter search, you might incorrectly conclude that your new feature makes things worse. In fact, you probably need to regularize more heavily to reveal the benefit of your more expressive model. Errors that lead to inaccurate experimentation accrue over time, and thus spotting them early is valuable.

Errors that lead you to believe your results are better than they are

The big boss of errors: making a mistake that leads you to overestimate performance. These are hard to spot because we're biased against seeing them: when the model performs surprisingly badly, we're inclined to take a second look, but when it performs surprisingly well we're more likely to congratulate ourselves on our superb intuition (essentially a form of confirmation bias). Common causes of overestimation are overfitting to your test set, your data not being representative of the real world, or simply screwing up your metrics. If you take away just one thing from this post, it should be this: nothing is more embarrassing or demoralizing than realizing your model is much worse than you thought.

Mixing training data into your validation or test set is easy to do, and will produce superb model performance - as assessed by your dodgy test set - and terrible performance in the real world. Of course, your train/val/test sets should be disjoint, containing different examples. For instance, if we are trying to extract the total value from a receipt, obviously the test set should contain never-before-seen receipts. But should it also include never-before-seen merchants, to reassure us that we're not overfitting to specific shops? Sometimes you need to think carefully about what exactly constitutes a disjoint set; this determines what kinds of generalization you are quantifying by test-set performance.

It's good practice to split your data into train, validation, and test sets once, and then put them in separate folders in the filesystem. Wherever you read in data, make the naming super-explicit: TrainDataLoader and TestDataLoader, for instance.

Screwing up your loss function is quite hard to do, because most frameworks will take care of loss function specification for you. However, there are plenty of ways to abuse the rock-solid implementations you're given. The two most common errors I've seen here are confusing whether the loss function expects to receive probability distributions or logits (i.e. whether you need to add a softmax), and mixing up regression and classification. The latter is surprisingly common, even in academia. For instance, the Amazon Reviews Dataset, which contains reviews along with star ratings, is frequently used as a classification task by top labs. This is clearly not quite right, because a 5-star review is more similar to a 4-star review than to a 1-star one (the data is ordinal).
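To make the logits-versus-probabilities trap concrete, here's a minimal PyTorch sketch (the shapes and numbers are invented purely for illustration, not taken from any real project). nn.CrossEntropyLoss applies log-softmax internally, so it expects raw logits; hand it outputs that have already been through a softmax and nothing crashes, you just quietly train against a distorted objective.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

num_classes = 5
logits = torch.randn(8, num_classes)           # raw, unnormalized model outputs
targets = torch.randint(0, num_classes, (8,))  # integer class labels

loss_fn = nn.CrossEntropyLoss()  # applies log-softmax internally, so it wants logits

good = loss_fn(logits, targets)                 # correct: pass the logits straight in
bad = loss_fn(logits.softmax(dim=-1), targets)  # wrong: softmax ends up applied twice

print(f"loss from logits:            {good.item():.3f}")
print(f"loss from softmaxed 'probs': {bad.item():.3f}")  # no error raised, the number is just wrong
```

The regression-versus-classification mix-up fails in the same silent way, which is exactly what makes both mistakes so easy to ship.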
Whilst loss functions for deep learning have to be differentiable, we often use a different suite of non-differentiable metrics to articulate performance at test time. For instance, in machine translation and summarization we use BLEU or ROUGE respectively, and for other tasks we might use accuracy, precision, or recall. Often these are more human-interpretable than the value of your loss function, which can be difficult to understand (handy tip: cross-entropy should be roughly -log(1/number of classes) when you start training, if the dataset is balanced). It's therefore a good idea to log as many test-set metrics during training as you can, rather than waiting to use them only at test time. This will give you a better feel for how training is progressing, and prevent you from discovering problems only after you've finished training. Be careful about choosing these test metrics, too.
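To make that handy tip concrete, and to sketch what logging test-set metrics during training might look like, here's a small hypothetical helper. It assumes a classification model and a PyTorch-style validation loader; the names model and val_loader and the particular metrics are placeholders rather than anything prescribed by this post.

```python
import math
import torch

def evaluate(model, val_loader, device="cpu"):
    """Compute a couple of human-interpretable metrics on a held-out set."""
    model.eval()
    loss_fn = torch.nn.CrossEntropyLoss(reduction="sum")
    total_loss, correct, n = 0.0, 0, 0
    with torch.no_grad():
        for inputs, targets in val_loader:
            inputs, targets = inputs.to(device), targets.to(device)
            logits = model(inputs)
            total_loss += loss_fn(logits, targets).item()
            correct += (logits.argmax(dim=-1) == targets).sum().item()
            n += targets.numel()
    return {"val_loss": total_loss / n, "val_accuracy": correct / n}

# Sanity check: with C balanced classes, an untrained classifier should sit near
# -log(1/C) = log(C). If the first logged loss is far from this, something about
# the labels, the loss inputs, or the data loading is probably already wrong.
num_classes = 5
print(f"expected initial cross-entropy: {math.log(num_classes):.3f}")

# Inside the training loop you would then log, say, once per epoch:
#   metrics = evaluate(model, val_loader)
#   print(f"epoch {epoch}: {metrics}")
```

Logged every epoch alongside the training loss, even these two numbers give you a much better feel for whether training is actually going anywhere.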