When it comes to machine learning, I’m an avid fan of attacking data where it lives. 90%+ of the time, that’s going to be a relational database, assuming we’re talking about supervised machine learning.
Python is amazing, but pulling dozens of GB of data whenever you want to train a model is a huge bottleneck, especially if you need to retrain them frequently. Eliminating data movement makes a lot of sense. SQL is your friend.
For this article, I’ll use an always-free Oracle Database 21c provisioned on Oracle Cloud. I’m not sure if you can translate the logic to other database vendors. Oracle works like a charm, and the database you provision won’t cost you a dime — ever.
I’ll leave the Python vs. Oracle for machine learning on huge dataset comparison for some other time. Today, it’s all about getting back to basics.
I’ll use the following dataset today:
- Fisher, R.A. (1936). The use of multiple measurements in taxonomic problems. University of California, Irvine, School of Information and Computer Sciences. Retrieved…