
Could anyone advise on:

  1. The optimal dataset size needed to achieve a high F1 score for NER-based extraction tasks?
  2. Any best practices for preparing and labeling the dataset to ensure accurate evaluation?
  3. How to balance between dataset diversity and quantity for better model performance?

Any insights, examples, or resources would be greatly appreciated!

These are really REALLY great questions.

Question 1) This number changes depending on the type of work you are doing and how many classes are involved.  If you are doing something like classification or sentiment analysis, you’ll want to scale up your test dataset so you have a good representation of each class.  So if you have 5 different classes, aim for 15-25 examples of each.
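To make that concrete, here is a minimal sketch (not from the course, with made-up labels) using scikit-learn’s classification_report. The “support” column shows how many true examples of each class your test set actually contains, and those per-class F1 scores are exactly the numbers that get unstable when support is only a handful.

```python
from sklearn.metrics import classification_report

# Hypothetical true vs. predicted labels for a 5-class task.
# With only 2 examples per class, a single mistake swings that class's F1 wildly,
# which is why 15-25 examples per class is a more trustworthy floor.
y_true = ["pos", "pos", "neg", "neg", "neutral", "neutral", "mixed", "mixed", "spam", "spam"]
y_pred = ["pos", "neg", "neg", "neg", "neutral", "pos",     "mixed", "mixed", "spam", "spam"]

# Per-class precision/recall/F1, plus "support" (count of true examples per class).
print(classification_report(y_true, y_pred, zero_division=0))
```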

Something like extraction just needs a test set large enough that you can start to tell the difference between bad performance and bad luck.  50+ examples would be best.  If you have LOTS of classes in your extraction (my example had 2), you might want around 100 examples to test against.
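For extraction, entity-level F1 comes down to counting true positives, false positives, and false negatives against your ground truth. A tiny sketch (with made-up gold and predicted entities, treating only exact matches as correct) shows why a handful of documents isn’t enough: with just four gold entities, a single miss or spurious extraction swings the score dramatically.

```python
# Hypothetical gold vs. predicted entities as (doc_id, entity_type, text) tuples.
gold = {
    (1, "PERSON", "Jane Doe"), (1, "ORG", "Acme Corp"),
    (2, "PERSON", "John Smith"), (2, "ORG", "Globex"),
}
predicted = {
    (1, "PERSON", "Jane Doe"), (1, "ORG", "Acme"),   # partial match counts as a miss here
    (2, "PERSON", "John Smith"),
}

tp = len(gold & predicted)   # entities extracted correctly
fp = len(predicted - gold)   # extracted but wrong
fn = len(gold - predicted)   # missed entirely

precision = tp / (tp + fp) if (tp + fp) else 0.0
recall = tp / (tp + fn) if (tp + fn) else 0.0
f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
print(f"precision={precision:.2f} recall={recall:.2f} F1={f1:.2f}")
```

Running the same arithmetic over 50-100 documents is what smooths out the luck factor.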

Question 2) Once you have a dataset that you’ll be testing with, go in the following order:
a) Label the ground truth yourself.
b) Give the same data to a peer and have them independently create a ground truth.
c) Compare the two.  (If you aren’t in agreement, perhaps the definitions or rules associated with your desired outputs aren’t as well defined as you thought!)  Resolve those differences.
d) Build your prompts and test.

Most of those steps are covered in the F1 course, but I did not mention a peer review.
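For step c), it helps to put a number on how well you and your peer agree. Here is a minimal sketch, assuming both annotators labeled the same ten hypothetical documents, using scikit-learn’s cohen_kappa_score (chance-corrected agreement; not part of the course, just one common way to quantify the comparison):

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical labels from two independent annotators for the same 10 documents.
annotator_a = ["invoice", "invoice", "receipt", "other", "invoice",
               "receipt", "other", "invoice", "receipt", "other"]
annotator_b = ["invoice", "receipt", "receipt", "other", "invoice",
               "receipt", "invoice", "invoice", "receipt", "other"]

# Raw agreement: fraction of documents both annotators labeled identically.
raw_agreement = sum(a == b for a, b in zip(annotator_a, annotator_b)) / len(annotator_a)

# Cohen's kappa corrects raw agreement for agreement expected by chance.
kappa = cohen_kappa_score(annotator_a, annotator_b)

print(f"raw agreement: {raw_agreement:.2f}")
print(f"Cohen's kappa: {kappa:.2f}")  # low kappa = your labeling rules need tightening
```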

 

Question 3) This interacts a lot with question 1.  Your dataset diversity and quantity go hand in hand.  You need examples of each class if you’re going the classification route.  When doing extraction, you’d want examples of documents both with and without specific classes (if that’s even possible with your use case).  Part of the reason I use “representative” as a term a lot in the videos is that it’s the simplest way to think about the problem.

If you purposefully curate the information you run through your test, you’ve potentially biased the test!  If you instead pull a random sample from your data, you’ll have a more representative dataset and likely get more accurate results.
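In practice that can be as simple as drawing the test set at random (with a fixed seed so it’s reproducible) instead of hand-picking documents. A sketch under the assumption that your candidate documents are just a list of IDs:

```python
import random

# Hypothetical pool of candidate document IDs.
all_documents = [f"doc_{i:04d}" for i in range(2000)]

# A random draw keeps the test set representative of the real population;
# hand-picking "interesting" documents would bias the evaluation.
random.seed(42)                               # fixed seed for a reproducible test set
test_set = random.sample(all_documents, k=100)

print(test_set[:5])
```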

 

Thanks for asking this question @ChanduMohammad.  Feel free to keep asking and keep this thread alive.  My introduction to F1 scoring is not all-encompassing, and this could be a great discussion ground for some of the more advanced ML concepts.



Thanks @Matt.Stewart, this is helpful.

