In a nutshell:
- Generative AI tools like ChatGPT can be beneficial for data science projects, but they have limitations when it comes to building predictive models.
- ChatGPT is not designed for numerical data and may provide inaccurate or unreliable results.
- It may not always be up to date and can have data leakage issues.
- Pecan offers a safer alternative for building predictive models, blending accessibility with accuracy.
Generative AI tools, like ChatGPT, are trained to create new data using old data. A very simplified explanation of how tools like ChatGPT work is that they rely on a large language model (LLM) to anticipate the next word in a phrase or summary based on the statistical likelihood of that word appearing next.
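That next-word idea can be sketched in a few lines of Python. The tiny "model" below is purely illustrative: real LLMs learn probabilities over huge vocabularies from massive corpora, but the mechanic of picking the statistically most likely continuation is the same.

```python
# Toy "language model": for a given two-word context, a made-up
# probability distribution over possible next words.
# All words and probabilities here are invented for illustration.
next_word_probs = {
    ("data", "science"): {"projects": 0.5, "teams": 0.3, "is": 0.2},
    ("machine", "learning"): {"models": 0.6, "pipelines": 0.4},
}

def predict_next(context):
    """Return the most likely next word for a known two-word context."""
    probs = next_word_probs[context]
    return max(probs, key=probs.get)

print(predict_next(("data", "science")))  # -> projects
```

A real LLM does this over thousands of tokens of context, but the output is still a statistically likely continuation, not a verified fact or calculation.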
In some ways, it’s very similar to typical machine-learning models, which use historical data and algorithms to make predictions. Businesses have been using predictive machine-learning models for decades for hundreds of different use cases, from helping lenders determine if a loan applicant is likely to default to predicting customer churn.
But what happens when you use tools like ChatGPT, a newer form of AI, to create traditional predictive models?
Surprisingly, things get messy.
This fusion of predictive and generative AI can be a powerful combination — but only if done correctly. In this blog, we break down the ups and downs of using tools like ChatGPT for data science and show how a new approach to Predictive GenAI is combining the best of both worlds into one user-friendly solution.
The benefits of using ChatGPT (and friends) to support data science projects
At first glance, data science projects seem like an ideal use case for tools like ChatGPT, Gemini, Claude, and others. Some of the more notable use cases include:
- Automated data cleaning and preprocessing. Run your data by ChatGPT, and it can quickly point out any missing values, outliers, or anomalies.
- Data interpretation. Looking for a cursory analysis of your data? ChatGPT can instantly synthesize your data to point out key insights, correlations, and patterns. It can even process chunks of textual data to distill the top insights for easy interpretation.
- Algorithm and model recommendations. Simply tell ChatGPT what you’re trying to solve, give it an idea of your data, and it can recommend suitable models, complete with their pros and cons. It can even help you pick out the right features.
- Code creation and optimization. ChatGPT can quickly write code for you in any coding language and review your code for errors.
- Creating reports for stakeholders. GenAI can save hours during your reporting process by summarizing and synthesizing insights into emails or PowerPoints to present key findings to stakeholders.
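The data-cleaning checks in the first bullet are the kind of code a GenAI assistant might write for you. Here is a minimal pandas sketch, with an invented toy data set, of the two most common checks: counting missing values and flagging outliers.

```python
import pandas as pd

# Toy customer data set (values invented): one missing age and one
# obvious outlier in monthly_spend.
df = pd.DataFrame({
    "age": [34, 29, None, 41, 38],
    "monthly_spend": [120.0, 95.0, 110.0, 10_000.0, 130.0],
})

# Missing values per column.
missing = df.isna().sum()

# Simple z-score flag on monthly_spend: rows far from the mean.
spend = df["monthly_spend"]
z = (spend - spend.mean()) / spend.std()
outliers = df[z.abs() > 1.5]

print(missing)
print(outliers)  # the 10,000.0 row stands out
```

The catch, as discussed below, is that you still have to verify that generated code like this matches your data's actual types, formats, and business rules.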
Where ChatGPT and similar tools fall short
For all their accessibility and familiarity, ChatGPT and similar tools have plenty of shortcomings. You only have to check out the latest AI-generated search results on Google or Facebook’s comment summaries to see that the technology still has a long way to go.
Here are several ways ChatGPT falls short in data science:
It's built for words, not numbers.
A recent report by Stanford University shows that for all of GenAI’s ability to surpass humans in areas like image classification and basic-level reading comprehension, AI can't beat humans in competition math.
That presents a much larger challenge for data professionals, who rely on their tools to be exact, precise, and accurate; anything less devalues the large volumes of numerical data they're tasked with making useful. Handing a finely tuned data pipeline to GenAI only to have it misrepresented is a risk many are not willing to take, and for good reason.
This is especially true for building predictive models, which often use proprietary, sensitive, and very valuable company data. ChatGPT falls short because it:
- Can write code but isn't designed for building advanced machine-learning models
- Can offer prompt suggestions based on general best practices but may not be able to provide specific insights for unique situations that require heavy domain expertise
- Can predict words in a language context but is not built to generate predictions on numerical data
For example, if you were to ask ChatGPT how to improve a classification model, it could look at written text samples about other classification models and then anticipate a string of words describing the most likely solution. But that answer won't be based on your specific models, since they aren't part of the data the GenAI model references.
It's not always up to date.
Even though ChatGPT now includes real-time browsing results, this same functionality isn’t available for the company’s API. Developers still need to use a third-party or plug-in solution to bring web access to the solutions they build with the ChatGPT API.
Browsing can also be very limited among the sites ChatGPT can access. Currently, sites that don’t want to be included in “bot activity” may use robots.txt — a text file that instructs bots how to crawl their page — to keep OpenAI from accessing them.
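For example, a site that wants to stay out of those results can block OpenAI's published crawler with two lines in its robots.txt (GPTBot is the user-agent string OpenAI documents for its crawler):

```
User-agent: GPTBot
Disallow: /
```

Any site serving this file is asking OpenAI's crawler to skip all of its pages, so its content won't appear in browsing results.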
In addition, sites not approved by OpenAI, either due to subject matter or to being region-locked, won’t show up in the most updated search results.
These limitations may be acceptable for fields that haven't changed much in the past few months, but for AI/ML use cases, which evolve rapidly, they're far from acceptable. If OpenAI can’t access all the latest updates, news sources, journals, and so on, analysts are left with a lackluster experience.
It's not reliable.
Then there are the mistakes. A recent study from Purdue University found that ChatGPT gave incorrect answers to programming questions 52% of the time.
ChatGPT has also shown signs of what's called "data leakage": using data from outside the training data set to inform its answers or build a model. For example, if a model meant to predict an outcome is trained on data recorded after that outcome occurred, it isn't truly predictive. Leakage invalidates a model's estimated performance by giving it access to data it shouldn't know, and analysts and data scientists can't adequately test or apply models built that way.
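A minimal sketch of temporal leakage, with invented numbers: a feature computed over the whole time series quietly includes the future, while a safe feature only uses data available at the prediction cutoff.

```python
# Monthly sales, months 1..8 (numbers invented for illustration).
sales = [100, 102, 105, 103, 110, 140, 150, 155]

cutoff = 5  # pretend we are predicting month 6 onward

# Leaky feature: average over ALL months, including the future spike.
leaky_avg = sum(sales) / len(sales)

# Safe feature: average over only the months known at the cutoff.
safe_avg = sum(sales[:cutoff]) / cutoff

print(leaky_avg)  # inflated by months 6-8, which the model shouldn't see
print(safe_avg)
```

A model trained on the leaky feature looks better in backtesting than it ever will in production, because at prediction time the future months simply don't exist yet.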
In addition, a recent study examining the use of LLMs for data science found that ChatGPT makes a lot of assumptions. “Participants were enthusiastic about ChatGPT's ability to infer knowledge about data just from column names. However, it still made many false assumptions, including time formats, data types, and how to handle outliers. This led to some participants using code from ChatGPT that actually hid a data quality issue.”
Most data professionals won't settle for error-prone GenAI results, and they shouldn’t have to.
Build safe and trusted predictive models with Pecan
So, what can a data professional like you do when you want the seamless and accessible experience that ChatGPT provides but need the outcomes to be exact, correct, and usable?
A low-code Predictive GenAI platform like Pecan brings the best of automated data preparation, modeling, and GenAI without the risks of today's consumer GenAI tools. It offers many of the user-friendly features data professionals love about AI, including:
- Automated (and trusted) data preparation
- Automated feature engineering
- Support for tabular and numeric data sets and outcomes
- A chat-based AI assistant to walk you through every step of the data science process
- Guidance tailored to the context of each data set and model
- The ability to write SQL to define the right training datasets for your models in seconds
Pecan is built with the latest generative AI technology, so answers about the modeling process are presented in plain language you can easily understand. However, models, numbers, and code are all generated with the latest machine-learning technology — not with LLMs that may not be ready for this task.
Pecan blends the accessibility of a ChatGPT conversation with the accuracy of predictive AI. So, even data professionals without data science degrees can work through ideating, building, testing, and rapidly iterating on machine learning models in minutes.
Here's what the typical data science process might look like with Pecan:
1. Dialogue with Pecan's Predictive Chat to set up the problem and clarify what the model will do. After some back and forth, the analytic copilot will help you isolate your predictive question and the best model to use.
2. This predictive question gets added to Pecan's SQL-based Predictive Notebook, where all the data and notes get organized to build your model.
3. Pecan creates an attribute table to gather all the relevant data for an event. Then, we use all this event data to find patterns and predict what may happen in the future. This becomes the basis for your predictive model.
Pecan uses the data sets you request, as recent as you need them. It isn't prone to data leakage or hallucinations and can be fully trusted with your data.
Looking to turn your data into a predictive engine — without the risks of using ChatGPT for data science? Try a demo today.