Over the past year, machine learning and artificial intelligence technology have made significant strides. Specialized algorithms, including OpenAI’s DALL-E, have demonstrated the ability to generate images based on text prompts with increasing canniness. Natural language processing (NLP) systems have grown closer to approximating human writing and text. And some people even think that an AI has attained sentience. (Spoiler alert: It has not.)
And as Ars’ Matt Ford recently pointed out here, artificial intelligence may be artificial, but it’s not “intelligence”—and it certainly isn’t magic. What we call “AI” is dependent upon the construction of models from data using statistical approaches developed by flesh-and-blood humans, and it can fail just as spectacularly as it succeeds. Build a model from bad data and you get bad predictions and bad output—just ask the developers of Microsoft’s Tay Twitterbot about that.
For a much less spectacular failure, just look to our back pages. Readers who have been with us for a while, or at least since the summer of 2021, will remember that time we tried to use machine learning to do some analysis—and didn’t exactly succeed. (“It turns out ‘data-driven’ is not just a joke or a buzzword,” said Amazon Web Services Senior Product Manager Danny Smith when we checked in with him for some advice. “‘Data-driven’ is a reality for machine learning or data science projects!”) But we learned a lot, and the biggest lesson was that machine learning succeeds only when you ask the right questions of the right data with the right tool.
Those tools have evolved. A growing class of “no-code” and “low-code” machine learning tools is making a number of ML tasks increasingly approachable, taking the powers of machine learning analytics that were once the sole province of data scientists and programmers and making them accessible to business analysts and other non-programming end users.
While the work on DALL-E is amazing and will have a significant impact on the manufacture of memes, deep fakes, and other imagery that was once the domain of human artists (using prompts like “[insert celebrity name] in the style of Edvard Munch’s The Scream”), easy-to-use machine learning analytics involving the sorts of data that businesses and individuals create and work with every day can be just as disruptive (in the most neutral sense of that word).
ML vendors tout their products as an “easy button” for finding relationships in data that may not be obvious, uncovering the correlation between data points and overall outcomes, and pointing people to solutions that would take humans days, months, or years to uncover through traditional statistical or quantitative analysis.
We set out to perform a John Henry-esque test: to find out whether some of these no-code-required tools could outperform a code-based approach, or at least deliver results that were accurate enough to make decisions at a lower cost than a data scientist’s billable hours. But before we could do that, we needed the right data—and the right question.
Keeping it tabular
Amazon’s SageMaker Canvas and similar products, such as Google AutoML Tabular, are focused on working with tabular data sets: rows and columns of numerical and categorical values. The tools are not suited to unstructured-data tasks such as natural language processing for sentiment analysis or image recognition.
There are, however, many sorts of problems that fall into the world of tabular data. A few years back, for example, we visited GE Software to learn about their work on Predix, a machine learning-based system for modeling and predicting equipment failure so that GE could perform predictive maintenance—optimizing when systems were maintained to prevent major failures or outages in power distribution, aircraft engines, and other complex systems. These models were built with massive amounts of tabular data, derived from telemetry recorded by sensors in the equipment.
The last time we tried using machine learning, we had many rows of data, but not enough for the complex task at hand (and honestly, probably no amount of headline data would have ever answered the question we were trying to ask of it). To avoid our previous mistakes and to put a more realistic problem forward for the tools, we needed a tabular data set that provided numerical and category data we could use for a somewhat more direct question: What combination of data tells us when a system is likely to fail?
We went in search of data sources that represented real-world problems ML could be applied to and that contained multiple data points we could derive patterns from. We also wanted to see how well the new “no-code” tool from AWS (SageMaker Canvas) could handle the task in comparison to the “low-code” AWS option (Autopilot) and the more labor-intensive code-driven approach. And if at all possible, we wanted a problem set that data scientists already had documented success with.
Those requirements narrowed the field pretty quickly. We don’t have GE’s data lake of telemetry to work with, but there were some public data sources we could use that at least provided some grist for model-making. So we went digging and found a data set that was as serious as a heart attack—literally.
While searching for some solid data sets to work with, we discovered one called “Heart Disease” from UC Irvine’s Machine Learning Repository. Subsets of the data, including one posted to the ML community Kaggle by data scientist Rashik Raman, have frequently been used to demonstrate the viability of machine learning in medicine. We know one thing for sure about this data: Other people have successfully built models on it.
The heart disease data from Irvine held 14 data points of cardiac health data for 303 anonymized patients at the Cleveland Clinic, along with outcomes regarding whether they did or did not have a heart attack or some other acute cardiac health event. The data points for each patient include:
- The age and gender of the patient
- Whether or not the patient experienced exercise-induced angina (exng)
- The category of chest pain (cp) the patient experienced (typical angina, atypical angina, non-anginal pain, or none/asymptomatic)
- Resting blood pressure (trtbps)
- Blood cholesterol level in milligrams per deciliter (chol)
- A boolean value representing whether fasting blood sugar was greater than 120 mg/dl (fbs)
- Resting electrocardiographic result categories (restecg): normal, having ST-T wave abnormality, or showing probable or definite left ventricular hypertrophy by Romhilt-Estes’ criteria
- Maximum heart rate achieved (thalachh)
- Electrocardiogram measurement of ST depression induced by exercise relative to rest (oldpeak)
- The number of major vessels (0-3) shown to have blockages in an angiogram (caa)
- Whether a patient has a form of the blood disorder thalassemia (thall): 1 for no, 2 for correctable, 3 for non-correctable
- Output: Whether or not the patient ended up having a heart attack or similar acute heart disease—the original data included four categories of heart disease events; for simplification, the data had been modified to a binary output value (0 for no heart disease, 1 for heart disease).
There are some issues with the data. First of all, it’s a relatively small set—about a third of the total data from studies conducted by researchers and physicians at the Cleveland Clinic Foundation, the Hungarian Institute of Cardiology in Budapest, the Long Beach Department of Veterans Affairs Medical Center in California, and University Hospital in Zurich, Switzerland. That’s because the Cleveland Clinic data was the most complete, with diagnostic entries in all 14 data points. And 300 rows of data may not be enough to get a high level of accuracy from a predictive machine learning algorithm.
The sample is also skewed toward positive detections, so to speak. Of the 303 rows, 165 were cases where a heart attack occurred. And the set is skewed toward men: 206 of the 303 cases were men (which is in itself representative of the prevalence of heart disease in men, at least at the time). So it was not clear whether there would be an unnatural bias as a result.
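To put those two skews in proportion, here is a minimal sketch of the arithmetic. The three counts are the ones quoted above; the `share` helper is a hypothetical name introduced only for this illustration.

```python
# Counts quoted in the article for the Cleveland subset.
TOTAL_ROWS = 303
POSITIVE_ROWS = 165  # rows where the output column is 1
MALE_ROWS = 206      # rows recorded as male


def share(part: int, whole: int) -> float:
    """Return part/whole as a percentage, rounded to one decimal place."""
    return round(100 * part / whole, 1)


positive_share = share(POSITIVE_ROWS, TOTAL_ROWS)
male_share = share(MALE_ROWS, TOTAL_ROWS)

print(f"positive outcomes: {positive_share}% of rows")
print(f"male patients:     {male_share}% of rows")
```

By these numbers the outcome classes are only mildly unbalanced, while the split by sex is considerably more lopsided.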
We wanted to get a sense of how closely correlated the data points were statistically to other variables and to the output. So using a Jupyter notebook, we ran a function to calculate the Pearson correlation coefficient (Pearson’s R)—a measure of the linear relationship between the data points. A value closer to 1 (or -1) indicates a strong correlation (or negative correlation), suggesting that there is a close link between the variables. Here’s how those calculations looked for the Cleveland data:
We were most interested in the values along the bottom: the correlation to the output. Chest pain, maximum heart rate, exercise-induced angina, and ST depression on ECG results after exercise had the strongest correlations to the outcome of cases, but none of the variables had a strong correlation with each other. So it looked like there might be a way to get a somewhat accurate model out of this slice of the data. (Also, someone had claimed to have gotten 90 percent accuracy with their model based on this set, so that also suggested success was possible.)
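For readers who want to see what a Pearson calculation involves, the following is a rough sketch in plain Python. It is not the notebook code we ran; the function names and the toy rows are made up for illustration, and only the column names come from the data set. Libraries such as pandas offer the same calculation through `DataFrame.corr()`.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length, at least 2 long")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of co-deviations and of squared deviations from each mean.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant column")
    return cov / math.sqrt(var_x * var_y)


def correlation_to_target(columns, target):
    """Map each feature name to its correlation with the target column."""
    return {
        name: pearson_r(values, columns[target])
        for name, values in columns.items()
        if name != target
    }


# Made-up toy rows, NOT values from the real data set.
toy = {
    "thalachh": [150, 172, 130, 160, 120, 178],
    "exng": [0, 0, 1, 0, 1, 0],
    "output": [1, 1, 0, 1, 0, 1],
}

for name, r in correlation_to_target(toy, "output").items():
    print(f"{name}: {r:+.2f}")
```

In the toy rows, `exng` is the exact opposite of `output`, so its coefficient comes out at -1; real data is never that tidy.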
We still had reservations. We were going to have to split the data into training and testing sets, which meant that we would have only 150 to 200 rows of training data. That left us concerned that the model would underfit and show low accuracy on the testing set. So we started looking at the original archive.
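The squeeze on training rows is easy to see with a small sketch. The test fractions below (35 and 50 percent) are assumptions picked to bracket the 150-to-200-row range mentioned above, not the split we actually used, and `train_test_split_indices` is a hypothetical helper.

```python
import random


def train_test_split_indices(n_rows, test_fraction, seed=0):
    """Shuffle row indices and split them into (train, test) lists."""
    if not 0 < test_fraction < 1:
        raise ValueError("test_fraction must be between 0 and 1")
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # seeded so the split is repeatable
    n_test = int(n_rows * test_fraction)  # truncate to a whole number of rows
    return indices[n_test:], indices[:n_test]


# 303 is the number of rows in the Cleveland subset.
for fraction in (0.35, 0.5):
    train_idx, test_idx = train_test_split_indices(303, fraction)
    print(f"test fraction {fraction}: {len(train_idx)} train, {len(test_idx)} test")
```

Whatever the exact fraction, every row held out for testing is a row the model never gets to learn from, which is why a 303-row set feels tight.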
The archived data sources were, to be frank, a bit of a mess. They were space-delimited text files without headers, and each had a different way of denoting missing data. In some cases, a question mark was inserted where data did not exist; in others, a value of “-9” was assigned.
There are significant gaps in the data from the sources not included in the Cleveland set; some categories of data were not collected at all at some sites, and some may have been incomplete for other reasons, such as the patient having a heart attack before data collection was complete. And going back to original sources for completion was not an option, as the original data is from 1988.
After downloading all the data, we converted the tabulated data into comma-separated value (CSV) format. To simplify the problem set a little, we reduced the five heart condition outcome categories to two (“0 – no heart disease,” and “1 – heart disease”). We then removed the filler content used to denote missing data. The complete data set included data from 920 patients:
We decided after reviewing the full data set to create a second set combining the Cleveland Clinic data with the Hungarian Institute of Cardiology data, which is the most complete of the remaining sets from the archive. (We ruled out using the Swiss and Veterans Affairs sets because of their additional missing data.)
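The cleanup steps described above (reading space-delimited rows, blanking the two missing-data markers, collapsing the outcome to two values, and writing CSV) might look something like this sketch. It is an illustration under stated assumptions, not the script we used: the three-column layout and the sample rows are invented, and the function names are hypothetical.

```python
import csv
import io

# The two markers the archive files used for missing values.
MISSING_MARKERS = {"?", "-9"}


def parse_rows(text, columns):
    """Parse headerless, whitespace-delimited text into a list of dicts.

    Fields equal to a missing-data marker become None; others become floats.
    """
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) != len(columns):
            raise ValueError(f"expected {len(columns)} fields, got {len(fields)}")
        rows.append({
            col: None if field in MISSING_MARKERS else float(field)
            for col, field in zip(columns, fields)
        })
    return rows


def binarize_outcome(rows, target="output"):
    """Collapse outcome categories: 0 stays 0, anything above 0 becomes 1."""
    for row in rows:
        if row[target] is not None:
            row[target] = 1 if row[target] > 0 else 0
    return rows


def to_csv(rows, columns):
    """Serialize rows to CSV text, writing missing values as empty cells."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns, lineterminator="\n")
    writer.writeheader()
    for row in rows:
        writer.writerow({c: "" if row[c] is None else row[c] for c in columns})
    return buffer.getvalue()


# Invented sample rows in a simplified three-column layout.
COLUMNS = ["age", "chol", "output"]
SAMPLE = "63 233 0\n41 ? 2\n58 -9 1\n"

cleaned = binarize_outcome(parse_rows(SAMPLE, COLUMNS))
print(to_csv(cleaned, COLUMNS))
```

Treating “-9” as a marker only works because no legitimate measurement in this data takes that value; a data set with negative readings would need a more careful rule.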
The Hungarian set adds 294 additional rows of data, nearly doubling the sample size. It has drawbacks—it’s missing two categories of inputs from nearly every row. Angiogram data was not present for most of the cases, and only a few had tests for thalassemia. A breakdown of the missing data for the larger data set:
This would be bad if those two data points had a high correlation to the outcomes of the cases. Based on the analysis of the Cleveland data set, the angiogram results and thalassemia test had some correlation to the results and low correlation with any of the other variables, so we had some concern about what the model would lose if those data points had to be discarded.
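One way to reason about that tradeoff is to measure how much of each column is missing and drop the ones past some cutoff. The sketch below does that on made-up rows; the 50 percent cutoff and both function names are assumptions for illustration, not settings we or any AWS tool used.

```python
def missing_fraction(rows, columns):
    """Return the fraction of rows in which each column is missing (None)."""
    if not rows:
        raise ValueError("no rows to summarize")
    return {
        col: sum(1 for row in rows if row.get(col) is None) / len(rows)
        for col in columns
    }


def drop_sparse_columns(rows, columns, max_missing=0.5):
    """Keep only columns whose missing fraction is at most max_missing."""
    fractions = missing_fraction(rows, columns)
    kept = [col for col in columns if fractions[col] <= max_missing]
    trimmed = [{col: row.get(col) for col in kept} for row in rows]
    return trimmed, kept


# Made-up rows: caa and thall are mostly missing, as in the Hungarian data.
toy_columns = ["age", "caa", "thall", "output"]
toy_rows = [
    {"age": 63, "caa": 0, "thall": 2, "output": 1},
    {"age": 41, "caa": None, "thall": None, "output": 0},
    {"age": 58, "caa": None, "thall": None, "output": 1},
    {"age": 49, "caa": None, "thall": None, "output": 0},
]

print(missing_fraction(toy_rows, toy_columns))
trimmed_rows, kept_columns = drop_sparse_columns(toy_rows, toy_columns)
print(kept_columns)
```

Dropping a column outright is the bluntest option; filling in the gaps is another, but with most values absent there is little to fill from.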
It’s time to crunch the numbers
Given the shape of the larger data set, we’re pretty confident that we can get a good result even when missing two data points. Also, we’re less concerned about overfitting the model here, since it has fewer data features. But given that the smaller data set has a relatively high success rate with more code-intensive methods, we’ll see what happens when we feed it to the robots.
In our next episode, we’ll throw the data at AWS SageMaker’s Canvas and Autopilot no-code and low-code tools to see if the robots can deal with it. And then we’ll take a look at what happens under the covers.