Mastering the Essentials: Data Science Beyond Complex Models
Chapter 1: The Data Scientist's Journey
Congratulations on your decision to pursue a career in data science! As someone who has walked this path, I can assure you that the journey is both rewarding and fulfilling. However, it's important to recognize that the expectations for the role can often differ from reality.
I frequently receive inquiries from those aspiring to become data scientists, asking what areas they should concentrate on. The suggestions range from Deep Learning courses on Udacity to Advanced Statistical Analysis on Coursera, along with Tableau visualization tutorials and software engineering guides on data pipelines and Spark. While all these topics hold significance, they can also feel overwhelming. When I started my career, I certainly wasn't an expert in every area; I learned what was necessary for my immediate responsibilities. Yes, that meant dedicating some weekends to mastering specific technologies, but the ability to self-learn as needed has been crucial in navigating the vast landscape of knowledge available.
A natural curiosity to learn new tools and concepts is essential. As a data scientist or software engineer, you are likely aware of the rapid pace of change in the software industry. Tools and packages you rely on can be updated monthly, and new software solutions emerge every six months to tackle challenges you've encountered.
However, I believe there’s a vital skill every data scientist should excel at: data analysis.
Section 1.1: The Role of Data Analysis
You might wonder, shouldn't data scientists be focused on more complex tasks, like developing machine learning models? The truth is, constructing a machine learning model can be quite straightforward. For instance, if my goal is to extract text from Medium blogs to create an NLP classifier, the process can be reduced to just a few lines of code: reading in the texts and their labels, vectorizing them, training a Naive Bayes classifier, and deploying it. That's just four lines of code to bring a model into production.
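As a rough sketch of that idea (with toy data standing in for real blog posts and labels, not my actual production code), the core of such a classifier can look like this:

```python
# Minimal sketch: read texts and labels, vectorize, train Naive Bayes, predict.
# The texts and labels below are invented stand-ins for scraped Medium posts.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

texts = ["intro to neural networks", "sourdough starter tips",
         "gradient descent explained", "my favorite bread recipes"]
labels = ["data-science", "baking", "data-science", "baking"]

vectorizer = CountVectorizer()                    # turn raw text into word-count features
X = vectorizer.fit_transform(texts)               # vectorize the documents
model = MultinomialNB().fit(X, labels)            # train the Naive Bayes classifier
print(model.predict(vectorizer.transform(["tips for training neural networks"])))
```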
While some data scientists do build their own neural networks using frameworks like PyTorch, such positions typically require advanced mathematical and statistical knowledge, often reserved for top-tier tech firms with robust data infrastructures. Most data scientists tend to work with simpler machine learning models, focusing on providing them with the correct data. This necessitates a thorough analysis of the available data and the extraction of relevant features.
Let's consider prediction speed. Shouldn't we reach for more complex models to improve our predictions? Perhaps. Microsoft has developed an impressive gradient boosting framework called LightGBM, which I've tested against XGBoost, itself known for its speed. LightGBM is lightweight, optimized for fast training and prediction, and supports parallel and GPU learning. However, it comes with its own caveats: LightGBM works best with at least 10,000 training samples and risks overfitting on smaller datasets.
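Here's a hypothetical sketch of how such a head-to-head test might look, since both libraries expose a scikit-learn-style API. The synthetic dataset and parameters are stand-ins; real results depend entirely on your data and tuning.

```python
# Hedged comparison sketch on synthetic data -- illustrative only.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
import lightgbm as lgb
import xgboost as xgb

X, y = make_classification(n_samples=20_000, n_features=50, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

lgbm = lgb.LGBMClassifier(n_estimators=200).fit(X_train, y_train)   # leaf-wise tree growth
xgbm = xgb.XGBClassifier(n_estimators=200).fit(X_train, y_train)    # level-wise by default

print("LightGBM accuracy:", lgbm.score(X_test, y_test))
print("XGBoost accuracy: ", xgbm.score(X_test, y_test))
```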
It's crucial to select algorithms based on a solid understanding of how they work. For example, in my NLP classifier, I chose Naive Bayes over a more complex boosting algorithm. Naive Bayes is a simple probabilistic model and exceptionally fast, at the cost of assuming that the features are independent of one another. Boosting algorithms often deliver higher accuracy, but they require more computation time and can overfit the data.
Subsection 1.1.1: Understanding Trade-offs in Model Building
When creating an NLP classifier, you need to identify features for your model, which in this case are the unique words in the text. A single blog post can easily yield thousands of features. Building a random forest on that many features can take significant time and resources. While modern boosting libraries improve on the speed of traditional tree ensembles, Naive Bayes remains quicker and more efficient for smaller datasets.
As you build your NLP classifier, understanding the trade-offs involved is vital. The choice of algorithm often hinges on the data at hand, necessitating thorough analysis to identify which method will yield the best results.
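To see that trade-off in action, here is a rough sketch using the 20 newsgroups dataset as a stand-in for blog posts (it downloads on first run). The exact timings are illustrative only, but the gap between Naive Bayes and a tree ensemble on tens of thousands of word features is usually hard to miss.

```python
# Illustrative timing sketch: Naive Bayes vs. a random forest on word-count features.
import time
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier

data = fetch_20newsgroups(subset="train", categories=["sci.med", "rec.autos"])
X = CountVectorizer().fit_transform(data.data)    # one column per unique word
print("feature count:", X.shape[1])               # easily tens of thousands

for model in (MultinomialNB(), RandomForestClassifier(n_estimators=100)):
    start = time.time()
    model.fit(X, data.target)
    print(type(model).__name__, "train time:", round(time.time() - start, 2), "s")
```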
Note: For those interested in delving deeper into the intricacies of these algorithms, I recommend checking out StatQuest for a clearer understanding of statistics and machine learning techniques.
Chapter 2: The Data Scientist vs. Data Analyst Debate
Some may argue that data scientists are merely more advanced analysts. While it's true that data scientists possess a broader technical skill set—encompassing software engineering, algorithm design, and cloud development—the distinction is becoming blurrier as tools evolve and simplify.
You might ask, why not allow analysts to handle the data analysis while you focus on the more advanced modeling? While that may seem efficient, it can stifle your growth as a data scientist. As previously mentioned, it's often more beneficial to feed a simple model clean data than to overwhelm a complex model with poor-quality information. Obtaining clean data requires you to analyze it yourself, which in turn enables you to design an effective pipeline for building and training your models.
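As one minimal illustration (not my real medical-records pipeline), a scikit-learn Pipeline is a simple way to make "clean the data, then feed the simple model" repeatable. The cleaning function here is a deliberately crude placeholder; real text usually needs smarter handling.

```python
# Sketch of a cleaning-plus-training pipeline; the cleaner is a toy placeholder.
import re
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

def basic_clean(text: str) -> str:
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)        # strip digits and punctuation noise
    return re.sub(r"\s+", " ", text).strip()

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(preprocessor=basic_clean, stop_words="english")),
    ("clf", MultinomialNB()),
])

# A crude cleaner also shows why you analyze the data first: it mangles "L4-L5".
print(basic_clean("MRI  #2034: L4-L5 Disc Herniation."))   # -> "mri l l disc herniation"

# pipeline.fit(train_texts, train_labels)    # hypothetical training data
# pipeline.predict(new_texts)
```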
To illustrate this, let me share a real-world example from my work. Our team was tasked with developing an NLP classifier for patient medical records. The objective was to create an automated tagging system that would allow for quick navigation through lengthy medical documents. We faced the challenge of classifying over 50 labels, from heart conditions to brain injuries, with limited training data—five PDFs per category, each containing between 20 and 1,000 pages.
Despite these constraints, we achieved models with over 90% accuracy. We aimed to publish the models to GitHub so we could keep our improvements under version control. However, working with medical records required strict adherence to privacy regulations to prevent any exposure of Protected Health Information (PHI). Even with a private repository, a future data breach could have significant consequences.
During our analysis, I discovered a curious feature in our Back Injuries model: a patient's name had inadvertently made it into our feature set. Upon investigation, we found that the optical character recognition (OCR) software had misread the text and treated a hyphenated name as a single word, which then surfaced as a feature. This highlighted the importance of cleaning and analyzing data before deploying any models.
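To make that kind of audit concrete, here is a self-contained toy sketch of the review that catches this sort of artifact: train the classifier, then print the highest-weight words per label and scan them for anything that shouldn't be there. The corpus, labels, and the "smith-jones" token are all invented for illustration.

```python
# Toy feature audit: look at the top-weighted words per class and eyeball them.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

texts = ["lower back pain after fall smith-jones", "mri shows lumbar strain",
         "chest pain and arrhythmia", "echocardiogram shows heart murmur"]
labels = ["back-injury", "back-injury", "heart-condition", "heart-condition"]

vectorizer = CountVectorizer(token_pattern=r"(?u)\b[\w-]+\b")   # keep hyphenated tokens intact
X = vectorizer.fit_transform(texts)
model = MultinomialNB().fit(X, labels)

features = np.array(vectorizer.get_feature_names_out())
for i, cls in enumerate(model.classes_):
    top = features[np.argsort(model.feature_log_prob_[i])[-10:]]
    print(cls, "->", list(top))   # a patient's name surfacing here is a red flag
```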
In conclusion, analyzing and refining data is a critical step before production. A model in production continuously makes predictions on new, unseen data, and without proper data handling it risks making erroneous assumptions that could lead to serious consequences.
Thank you for reading! If you're interested in exploring more of my writing, feel free to check out my Table of Contents.
If you're not yet a paid member of Medium but want to access more articles and tutorials like this one, consider subscribing to Towards Data Science. By using this link, I receive compensation for your referral.
The first video titled "Why You Won't Be a Data Scientist" delves into the misconceptions and realities of pursuing a career in data science, shedding light on the challenges and skills required.
The second video titled "Is Having Good Data Engineers More Important Than Having Good Data Scientists?" discusses the critical roles of data engineers and scientists in the data ecosystem, examining how they complement each other in achieving successful outcomes.