Experiment 1
Understanding Legalese with LLMs
Why it matters:
More and more of these agreements include clauses about biometric data—like facial scans in virtual try-ons. This experiment helps you see what’s standard, what’s unusual, and how your likeness might be stored or shared.
Experiment Overview
The first Mellonhead Labs experiment is about comparing documents, specifically those that are long, dense, and consequential. Here are the concepts you'll explore in this experiment:
  • Comparing dense, high-stakes documents
  • Extracting key details with integrity and low hallucination
  • Translating legal language into plain, readable English
Real World Scenario
Objective
Most, if not all, of us have accepted Terms and Conditions or Privacy Policies to create online accounts or join loyalty programs. While these agreements may seem routine, they increasingly include clauses about biometric data—like facial scans used in virtual try-on tools. This experiment helps us identify what's standard vs. unique in these policies, and understand how our personal and virtual likeness may be stored or used, making us more informed and empowered consumers.
Key AI Competencies Explored
  • Establishing context
  • Prompt chaining
  • Few-shot prompting
  • Prompt structure
Materials

Data

Start with the terms and conditions for any two brands that offer both digital and in-store purchases and use technologies that may collect visual likeness or biometric data through virtual try-on tools:
  • Ulta Terms and Conditions
  • Sephora Beauty Insider

Research

There's a lot of research related to both document comparison and analyzing policy and legal documents. Here's what researchers have found helps:
  • Provide context. Describe the documents provided. For instance, "Below are the terms and conditions for two beauty brands." [1][2]
  • Define categories. When extracting information into categories, provide a full-sentence definition. This performs significantly better than simply naming the data types to look for.
  • Placement matters. Placing the document text at the beginning of the prompt may slightly increase accuracy and recall. [1]
  • Prompt chaining. Break up your prompt into multiple steps, each taking on a role relevant to its instruction. [2] Only break up a prompt when the complexity is such that the AI assistant isn't giving you good enough results; studies also show that extraction and analysis may work better in a single step in some cases, perhaps because the LLM still has full context of the document. [1] The roles used in [2] are:
      ◦ Term Parser, for extracting information
      ◦ Term Verifier, for validating that extracted information is in the source document
      ◦ Analyst, for comparing terms found between documents and, lastly, making sense of the differences
  • Provide examples. It takes more time, but if you give the AI assistant sample input and output, it will use that to more accurately analyze your documents. For policy analysis, two examples may provide the best results. This technique is called "few-shot prompting" (see the sketch after the sources below). [1]
Sources
The full papers are available online.
[1] Rodriguez, D., Yang, I., Del Alamo, J. M., et al. Large language models: a new approach for privacy policy analysis at scale. Computing 106, 3879–3903 (2024). https://doi.org/10.1007/s00607-024-01331-9
[2] Mridul, M. A., Kang, I., Seneviratne, O. Terminators: Terms of service parsing and auditing agents. arXiv preprint arXiv:2505.11672 (2025). https://doi.org/10.48550/arXiv.2505.11672
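To make the few-shot finding concrete, here is a minimal sketch in Python using the OpenAI SDK; any AI assistant's API would work the same way. The example clauses, category labels, and model name are illustrative assumptions, not taken from the papers above.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in your environment

# Two worked examples, per the finding that two examples may give the
# best results for policy analysis. [1] The clauses and labels below
# are invented for illustration; swap in examples from your documents.
FEW_SHOT = """Below are terms-and-conditions excerpts labeled by category.

Excerpt: "We may collect facial geometry when you use the virtual try-on feature."
Category: biometric data collection

Excerpt: "You may unsubscribe from promotional emails at any time."
Category: marketing communications

Excerpt: "{clause}"
Category:"""

def classify_clause(clause: str) -> str:
    """Label a single policy clause using the few-shot prompt above."""
    response = client.chat.completions.create(
        model="gpt-4o",  # any capable chat model works here
        messages=[{"role": "user", "content": FEW_SHOT.format(clause=clause)}],
    )
    return response.choices[0].message.content.strip()

print(classify_clause(
    "By uploading a photo you grant us a license to store and analyze your image."
))
```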

Guiding Questions

The objective is to use AI to compare documents given our experiment guidelines. Use these guiding questions to focus your prompting:
  • Does the policy mention biometric data or virtual likeness?
  • How is that data collected, stored, or shared?
  • What rights are you granting the company over your data or image?
  • Which parts of the policy seem unusual or differ from other companies?
  • Is it clear how to opt out or remove your data?
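One way to put these questions to work is to combine the research findings above: establish context, place the document text early in the prompt, and turn each question into a full-sentence category definition. A minimal sketch follows; the category names and wording are illustrative assumptions, not part of the experiment guidelines.

```python
# Full-sentence category definitions (the "define categories" finding),
# derived from the guiding questions above.
CATEGORIES = {
    "biometric data": (
        "Any clause stating that the company collects, stores, or shares "
        "biometric data or a user's visual likeness, such as facial scans "
        "from virtual try-on tools."
    ),
    "granted rights": (
        "Any clause granting the company rights over the user's data or "
        "image, such as a license to store, analyze, or share it."
    ),
    "opt-out and deletion": (
        "Any clause describing how a user can opt out of data collection "
        "or have their data removed."
    ),
}

def build_extraction_prompt(policy_text: str) -> str:
    """Assemble one extraction prompt from the category definitions."""
    definitions = "\n".join(f"- {name}: {desc}" for name, desc in CATEGORIES.items())
    return (
        "Below are the terms and conditions for a beauty brand.\n\n"  # context [1][2]
        f"{policy_text}\n\n"  # document near the start: placement matters [1]
        "Extract every clause that matches one of these categories, "
        "quoting each clause verbatim and naming its category:\n"
        f"{definitions}"
    )
```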

Experiment on Your Own
Now take this research and experiment on your own with ChatGPT, Claude, and other AI assistants to see what works best for you. As you experiment, think of an approach as the combination of AI tool, model, prompting techniques, and individual prompts chained together.
  1. Which approach was the most accurate and the most complete?
  2. What surprised you?
Not sure where to start?
  • Copy the contents of the files linked above into a text file, ensuring the filenames include the company name.
  • Prompt #1 - Context. For each section, list what is unique to each policy and what is shared. "I am interested in understanding the terms and conditions for Ulta and Sephora. Both are attached. For each section in the T&Cs list the terms unique to Ulta, unique to Sephora, and which are shared."
  • Prompt #2 - Verification. Validate the findings and note any items that cannot be tied back to the original policies. "Act as a paralegal. Review the analysis of terms and conditions for Ulta and Sephora. Identify errors and missing information. Remove information that cannot be tied back to the original documents. Correct any mistakes."
  • Prompt #3 - Analysis & Reporting. "Put together a report that answers the questions below in a format that is easy to understand and designed for the average consumer." (A Python sketch of this three-prompt chain follows the list below.)
  • What data does the business collect, and does it explicitly list what it can or cannot do with the data?
  • What control do I have over communication and marketing?
  • How are the documents different in their scope and permissiveness or restrictiveness?
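If you would rather run this chain as a script than paste prompts into a chat window, here is a minimal sketch using the OpenAI Python SDK. The file names and model choice are assumptions; the prompts are the three above, lightly adapted for pasted-in text rather than attachments.

```python
from pathlib import Path
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in your environment

def ask(prompt: str) -> str:
    """Send one step of the chain and return the model's reply."""
    response = client.chat.completions.create(
        model="gpt-4o",  # any capable chat model works here
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

# Assumed file names; save the policy text locally first (see step 1).
ulta = Path("ulta_terms.txt").read_text()
sephora = Path("sephora_terms.txt").read_text()

# Prompt #1 - Context
comparison = ask(
    "I am interested in understanding the terms and conditions for Ulta "
    "and Sephora. Both are below. For each section in the T&Cs list the "
    "terms unique to Ulta, unique to Sephora, and which are shared.\n\n"
    f"ULTA:\n{ulta}\n\nSEPHORA:\n{sephora}"
)

# Prompt #2 - Verification (the verifier needs the source documents too)
verified = ask(
    "Act as a paralegal. Review the analysis of terms and conditions for "
    "Ulta and Sephora. Identify errors and missing information. Remove "
    "information that cannot be tied back to the original documents. "
    "Correct any mistakes.\n\n"
    f"ANALYSIS:\n{comparison}\n\nULTA:\n{ulta}\n\nSEPHORA:\n{sephora}"
)

# Prompt #3 - Analysis & Reporting
report = ask(
    "Put together a report that answers the questions below in a format "
    "that is easy to understand and designed for the average consumer.\n"
    "1. What data does the business collect, and does it explicitly list "
    "what it can or cannot do with the data?\n"
    "2. What control do I have over communication and marketing?\n"
    "3. How are the documents different in their scope and permissiveness "
    "or restrictiveness?\n\n"
    f"VERIFIED ANALYSIS:\n{verified}"
)
print(report)
```

Passing the source documents to the verifier step, not just the draft analysis, is what lets it remove claims that cannot be tied back to the originals.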
The Results
AI performs best when given specific goals, structured formatting, and personal context, and tools behave very differently unless guided carefully. While different models offer unique user experiences and stylistic nuances, the core outputs are often similar. When dealing with long, dense documents, the biggest factor in output quality isn't which generative AI model you choose; it's how you prompt it.
Read the full experiment resource guide here: