Health Strategy is committed to using the most advanced technology available. This means capitalizing on the recent explosion of interest in machine learning.
Machine learning is a field of study concerned with designing and implementing algorithms that allow computers to learn from experience; such algorithms are especially useful for analyzing and leveraging large datasets. Our interest in machine learning, like the broader trend, is driven by four factors: the analytical power of these algorithms, the democratization of software and software libraries, superior computing resources, and big data.

The power of machine learning has been well documented in fields such as predictive analytics, machine vision, natural language processing and translation, robotics, and driverless cars. Artificial intelligence systems such as Google's AlphaGo and IBM's Watson have shown that computers can be trained to outperform humans even in the incredibly complex games of Go and Jeopardy. Efforts to apply machine learning in the health industry, led by names such as IBM, GE, and Google, have already borne fruit in genomics, imaging analytics and pathology (for example, Google's application that helps clinicians detect breast cancer metastases in lymph nodes), patient and population management, clinical decision support, and drug discovery and development, and they promise to drive a new era of personalized, hyper-targeted drugs.
Data mining software such as Orange and Weka makes it easy for non-experts to apply machine learning immediately and see how quickly their datasets can yield results. Software libraries such as scikit-learn and TensorFlow accelerate development of proprietary, customizable software that can be tuned for peak performance.
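As an illustration of how little code such libraries require, here is a minimal sketch using scikit-learn on one of its bundled toy datasets; the dataset and model choice are purely illustrative, not anything Health Strategy uses in production.

```python
# Training and evaluating a classifier takes only a few lines with scikit-learn.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Bundled toy dataset, used here purely for illustration.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
print(f"Test accuracy: {model.score(X_test, y_test):.3f}")
```

The entire fit-and-score workflow is a handful of statements, which is exactly what lets a small team prototype quickly before investing in custom tooling.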
Amazon Web Services (AWS) and other cloud services allow users and companies like us to have instant access to powerful web servers that can efficiently run large datasets and computationally taxing algorithms.
Health Strategy’s greatest asset, besides our phenomenal personnel, is our data. Our carefully maintained databases give our machine-learning models fertile ground in which to grow.
After evaluating how the strengths of machine learning map onto our applications, we chose outlier claims as our first machine learning target. An outlier is defined in a contract as any claim that exceeds a certain cost threshold. While many companies stop there, and may pass the brunt of an erroneous charge on to their clients, we sometimes verify outliers with a manual review process. This provides us with a considerable amount of labeled training data, enabling us to use time-tested supervised classification algorithms.
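The labeling pipeline described above can be sketched as follows. The cost threshold, cost distribution, and review outcome here are hypothetical stand-ins, not Health Strategy's actual figures; the point is only how threshold flags plus manual review yield labeled examples.

```python
import numpy as np

rng = np.random.default_rng(0)

THRESHOLD = 10_000.0  # hypothetical per-contract cost threshold
# Synthetic claim costs standing in for real claims data.
costs = rng.lognormal(mean=7.0, sigma=1.5, size=100_000)

# Step 1: the contract definition flags every claim above the threshold.
flagged = costs > THRESHOLD

# Step 2: manual review (simulated here as a random 1% hit rate) marks
# which flagged claims are genuinely problematic. These become the labels.
true_outlier = flagged & (rng.random(costs.size) < 0.01)

print(f"Flagged: {flagged.sum()}, verified true outliers: {true_outlier.sum()}")
```

Each reviewed claim contributes one labeled row, so routine verification work doubles as training-data collection for a supervised classifier.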
A unique aspect of the outlier problem is that the dataset is extremely imbalanced: fewer than 1% of claims flagged by this definition are true outliers. A naïve classifier could label every claim a non-outlier and still achieve 99% accuracy. While this can be combated in several ways (see the links at the bottom), the simplest is to balance the dataset by discarding non-outliers. This requires a very large dataset, which, fortunately, we have.
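The balancing step described above, undersampling the majority class, might look like this; the features and labels are synthetic stand-ins for real claims data.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic stand-in: ~1% positive class, mimicking the true-outlier rate.
n = 200_000
y = (rng.random(n) < 0.01).astype(int)
X = rng.normal(size=(n, 5)) + y[:, None]  # positives shifted slightly

pos_idx = np.flatnonzero(y == 1)
neg_idx = np.flatnonzero(y == 0)

# Keep every outlier; randomly sample an equal number of non-outliers.
keep_neg = rng.choice(neg_idx, size=pos_idx.size, replace=False)
idx = np.concatenate([pos_idx, keep_neg])

X_bal, y_bal = X[idx], y[idx]
print(f"Balanced set: {len(y_bal)} rows, positive rate = {y_bal.mean():.2f}")
```

The trade-off is that most of the non-outlier rows are thrown away, which is why this approach only works when the starting dataset is very large.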
When employing machine learning, it is important not to get pigeonholed into any one algorithm too early, so we have been developing with multiple algorithms simultaneously. The added computational burden can be mitigated with more powerful computing resources such as those offered by AWS. At this early stage, results look very promising: k-nearest neighbors (k-NN), support vector machines (SVM), decision trees, random forests, and AdaBoost all score above 99% on precision, recall, and F1. We are currently pushing these numbers even higher by tuning hyperparameters. While neural networks may not be necessary for this application, we may try them as well.
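A side-by-side evaluation of the algorithms named above could be sketched like this with scikit-learn. The dataset here is synthetic and the resulting scores are illustrative only; they are not our production numbers.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.metrics import f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for a balanced claims dataset.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

models = {
    "k-NN": KNeighborsClassifier(),
    "SVM": SVC(),
    "Decision tree": DecisionTreeClassifier(random_state=0),
    "Random forest": RandomForestClassifier(random_state=0),
    "AdaBoost": AdaBoostClassifier(random_state=0),
}

# Fit each candidate on the same split and report the same three metrics,
# so the algorithms can be compared on equal footing.
results = {}
for name, model in models.items():
    y_pred = model.fit(X_tr, y_tr).predict(X_te)
    results[name] = (precision_score(y_te, y_pred),
                     recall_score(y_te, y_pred),
                     f1_score(y_te, y_pred))
    print(f"{name}: precision={results[name][0]:.3f} "
          f"recall={results[name][1]:.3f} f1={results[name][2]:.3f}")
```

Running every candidate through the same harness keeps the comparison honest and makes it cheap to add or drop an algorithm later, including neural networks if we decide to try them.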
More imbalanced dataset links: