From Data Science
Precision and recall > .90 on holdout data
I'm running ML models (XGBoost and elastic net logistic regression) predicting a 0/1 outcome in a post-period based on pre-period observations in a large, unbalanced dataset. I've undersampled the majority class to get a balanced training set that fits in memory and doesn't take hours to run.
I understand sampling can distort precision and recall metrics. However, I'm testing model performance on a raw holdout dataset (no sampling or rebalancing).
Are my crazy high precision and recall numbers valid?
Of course there could be something fishy with my data, such as leakage: a variable that measures post-period information sneaking into my feature list. I think I've ruled that out.
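For concreteness, here is a minimal sketch of the setup described above: undersample the majority class in the training split only, then score precision and recall on an untouched, imbalanced holdout. The synthetic data, seeds, and the elastic-net logistic regression stand-in (scikit-learn's `LogisticRegression` with `penalty="elasticnet"`) are my assumptions, not the original poster's actual pipeline.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the large unbalanced dataset (~95% class 0)
X, y = make_classification(n_samples=20000, n_features=20,
                           weights=[0.95], random_state=0)
X_tr, X_ho, y_tr, y_ho = train_test_split(X, y, test_size=0.3,
                                          stratify=y, random_state=0)

# Undersample the majority class (label 0) in the TRAINING split only
rng = np.random.default_rng(0)
min_idx = np.flatnonzero(y_tr == 1)
maj_idx = rng.choice(np.flatnonzero(y_tr == 0),
                     size=min_idx.size, replace=False)
keep = np.concatenate([min_idx, maj_idx])

# Elastic net logistic regression on the balanced training sample
model = LogisticRegression(penalty="elasticnet", solver="saga",
                           l1_ratio=0.5, max_iter=5000)
model.fit(X_tr[keep], y_tr[keep])

# The holdout keeps its natural class ratio -- no resampling here
pred = model.predict(X_ho)
print(f"precision={precision_score(y_ho, pred):.3f}",
      f"recall={recall_score(y_ho, pred):.3f}")
```

One caveat worth noting: metrics computed this way on the raw holdout are valid estimates, but undersampling shifts the training base rate, so the model's predicted probabilities are miscalibrated and the default 0.5 threshold sits in a different place than it would for a model trained on the natural class ratio.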