Title: Dark Web AI and the Complexities of Clever Hans
Chapter 1: The Challenge of AI Training Data
Despite advances in machine-learning techniques that can learn from limited data, most of the AI systems making headlines are built on enormous collections of text and images. Although the companies behind these chatbots and image generators are often reluctant to disclose exactly what their models were trained on, it is known that models like GPT-3 were trained on vast quantities of text scraped from the web.
This situation presents several implications:
- The internet is rife with inaccuracies, meaning the training data might include misleading information.
- Bias is prevalent online, which can lead to biased outputs from AI systems. While companies strive to mitigate these biases, this process is inherently flawed.
- Not all online content is free from copyright issues, which may result in unintentional plagiarism by these AI tools.
These factors underscore the importance of recognizing that the quality of training data significantly limits the reliability of AI outputs.
When training data is sourced from the dark web (the part of the deep web not indexed by standard search engines), new dynamics emerge. Enter DarkBERT, a language model from the same broad family of technology that powers chatbots like ChatGPT, but pretrained specifically on dark web content.
It's often said that a well-crafted villain can be the most intricate character in a narrative. This notion seems applicable to AI chatbots as well. The developers of DarkBERT assert:
"Our evaluations indicate that DarkBERT surpasses current language models and can provide valuable insights for future research on dark web activities."
What could possibly go wrong?
The intention is not to create a nefarious AI figure; rather, the goal is to develop a 'dark hero' akin to Batman. The researchers have tested DarkBERT for applications such as ransomware detection, early threat identification, and decoding drug-related terminology.
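As a rough illustration (not the researchers' actual pipeline), detection tasks like these usually boil down to feeding a page of text into a domain-adapted encoder model and reading off a label. The sketch below uses the Hugging Face `pipeline` API; the checkpoint name `example-org/darkweb-activity-classifier` is a placeholder I made up, standing in for whatever fine-tuned model you would actually have on hand.

```python
# Minimal sketch: classifying dark web pages with a fine-tuned encoder model,
# in the spirit of the ransomware/threat-detection use cases described above.
# "example-org/darkweb-activity-classifier" is a placeholder, not a real
# released checkpoint; swap in your own fine-tuned model.
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="example-org/darkweb-activity-classifier",  # placeholder checkpoint
)

pages = [
    "All files encrypted. Payment instructions and proof of decryption below.",
    "Community forum rules and moderation guidelines.",
]

for page, result in zip(pages, classifier(pages)):
    # Each result is a dict of the form {"label": ..., "score": ...}
    print(f"{result['label']:>12} ({result['score']:.2f})  {page[:50]}")
```

The point of pretraining on dark web text in the first place is that the jargon and writing style there differ enough from the open web that a general-purpose model struggles; a classifier built on top of a domain-specific encoder has a better starting vocabulary for it.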
Additionally, the researchers have highlighted ethical concerns (e.g., information masking, public database utilization, data annotation) and limitations (like restricted functionality in non-English contexts and reliance on task-specific datasets).
In a related study, researchers have experimented with integrating large language models into virtual agents within interactive environments. Over time, these agents exhibit believable social behaviors, such as autonomously organizing a Valentine's Day party after being prompted by a single user-specified idea.
I have previously discussed the potential of this virtual sandbox approach for AI and remain intrigued by its future implications.
Video Description: This video explores the intersection of artificial intelligence and the dark web, discussing the ethical implications and potential dangers of AI systems trained in such environments.
Chapter 2: The Clever Hans Phenomenon
It's time to introduce our next character: Clever Hans.
Clever Hans was a horse who gained fame in the early 20th century for seemingly performing arithmetic. During public demonstrations, he would tap his hooves to indicate answers to mathematical questions posed by spectators.
However, Clever Hans was more than just a performing horse. In 1907, German psychologist Oskar Pfungst uncovered that Hans was not actually counting. Instead, he was keenly observing his trainer, who unconsciously altered his body language as Hans approached the correct answer. The horse learned to pick up on these cues, creating the illusion of mathematical prowess.
Why bring up a deceased horse? It raises the question: Are our AI chatbots merely mimicking Clever Hans? (We do label them as 'intelligent,' don’t we?)
In broad terms, I perceive two prevailing trends in the discourse surrounding contemporary AI systems that contribute to the Clever Hans effect:
- There is a tendency to overestimate the 'intelligence' of these systems, mistaking their ability to sift through vast amounts of text and generate plausible responses for genuine comprehension (a short sketch of what 'plausible responses' means mechanically follows this list).
- There is an underestimation of how easily people can be misled. Humans are inclined to attribute agency to technology, sometimes leading even experienced engineers to believe that a chatbot possesses sentience.
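To make the first point concrete, here is a minimal sketch of what "generating plausible responses" means under the hood: the model assigns probabilities to candidate next tokens and favors fluent-looking continuations. It uses the small public GPT-2 checkpoint purely as an illustration; nothing here is specific to any particular chatbot.

```python
# Minimal sketch: a language model only scores "what word plausibly comes next",
# which is the statistical trick behind fluent-sounding answers.
# Uses the publicly available GPT-2 checkpoint purely as an illustration.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "The capital of Australia is"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits[0, -1]  # scores for the next token only
probs = torch.softmax(logits, dim=-1)
top = torch.topk(probs, k=5)

for token_id, p in zip(top.indices, top.values):
    # Candidates ranked by statistical plausibility, not by understanding.
    print(f"{tokenizer.decode(token_id)!r}: {p:.3f}")
```

Whether the top-ranked continuation happens to be true is incidental; the ranking reflects patterns in the training text, which is exactly why fluency is so easy to mistake for comprehension.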
This brings us to a cautionary note from a recent study that emphasizes the risks of anthropomorphizing automated conversational systems. Various factors, including linguistic choices and interactive design, encourage us to view these chatbots as 'people' rather than merely 'software tools.'
This tendency poses significant challenges, particularly concerning misinformation. ChatGPT, for example, frequently provides incorrect information, yet individuals inclined to personify the system may find it more trustworthy.
Furthermore, this phenomenon can reinforce harmful stereotypes. For instance, we often unconsciously assign gender to voice assistants, which can perpetuate traditional gender roles. A troubling example includes men who train chatbots to simulate romantic relationships, only to engage in abusive behavior toward them.
Moreover, the language patterns utilized by these AI chatbots often default to "white, affluent American dialects." Research indicates that individuals from diverse backgrounds may feel pressured to conform to these norms when interacting with AI, potentially sidelining their authentic voices.
While I acknowledge that these AI systems can sometimes exhibit impressive capabilities, I can't help but wonder if there is a Clever Hans lurking behind the digital facade.
What are your thoughts?
Video Description: An exploration of DarkBERT, an AI model specifically trained on dark web data, examining its applications and implications in modern AI.