Leveraging Modern Language Modeling Technologies to Detect Duplicate Legislative Service Requests

Written by Khonzodakhon Umarova (Khonzoda)

As part of its legislative process, the New York City Council can produce bills (which, if passed, would become local laws) and resolutions (which serve "to express a collective voice of the City"). The complex multi-step legislative process starts when Council members submit their ideas for legislation in the form of legislative service requests (LSRs). These ideas may then be drafted into bills and resolutions. Generally, the Council member who first proposed an idea expects recognition as the prime sponsor of the resulting legislation.

Oftentimes, more than one Council member has a similar legislative idea that gets submitted as an LSR. Identifying duplicate legislation early on is therefore very important, as it prevents sponsorship conflicts, overlapping bills, and wasted staff time. To this end, a duplicate search is performed for every incoming LSR: first, a Duplicate Checker tool ranks potential duplicate candidates, and then the Bill Drafting Division goes through the candidates and manually flags duplicate LSRs.

The existing LSR Duplicate Checker tool is based on word2vec, a language model from 2013. As a Siegel PiTech PhD Impact Fellow, my project this summer was to use modern language modeling technologies to improve the existing Duplicate Checker. This blog post explores how I approached the problem, the technologies I used, and the challenges the New York City Council Data Team and I overcame.

Legislative Service Requests Data

I worked with a dataset containing over 17k LSRs. Each LSR records basic information: an ID, the type of legislative idea being proposed, the New York City Council committee it is intended for, the time of submission, and the content itself (both a problem description and a description of the proposed legislative solution). The dataset also includes manual annotations from the Bill Drafting Division: the IDs of other LSRs that a given LSR either overlaps with or is related to.
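For concreteness, here is a minimal sketch of how a single LSR record might be represented in code. The field names below are illustrative placeholders, not the Council's actual schema.

```python
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class LSR:
    """Illustrative schema for a legislative service request (field names are hypothetical)."""
    lsr_id: str
    idea_type: str              # e.g., bill vs. resolution
    committee: str              # intended NYC Council committee
    submitted_at: datetime      # time of submission
    problem_description: str    # what issue the legislation addresses
    solution_description: str   # the proposed legislative solution
    overlapping_ids: list[str] = field(default_factory=list)  # annotated duplicates
    related_ids: list[str] = field(default_factory=list)      # similar, but not duplicates
```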

When two LSRs are marked as "overlapping", this means that they are duplicates of one another. Being "related" means that the two LSRs are similar in some way, yet not complete duplicates. Through my interviews with the Bill Drafting Division, I learned that while there is a clearly established understanding of what constitutes an "overlapping" LSR, there is no consensus on what counts as "related". Early on, I predicted this category would be challenging to model, and indeed our final system does much better at finding "overlapping" LSRs than "related" ones.

Finding Duplicate Candidates

After conducting a comprehensive literature review of methods for identifying duplicate texts and ranking documents by similarity, I identified several techniques that could improve the Duplicate Checker.

All these methods rely on one key step: turning each LSR into a vector representation, or an embedding, that captures its meaning. To do this, I tested several language models, from older ones like word2vec to newer ones like BERT. These models learn patterns from large amounts of text data and can then represent documents as vector embeddings.
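As a rough sketch of this step, the snippet below embeds a few toy LSR-like texts with a pre-trained model and compares them by cosine similarity. The sentence-transformers library and the checkpoint name here are illustrative stand-ins, not the exact models I tested.

```python
from sentence_transformers import SentenceTransformer, util

# Any sentence-embedding model could be swapped in; this checkpoint is just an example.
model = SentenceTransformer("all-MiniLM-L6-v2")

lsr_texts = [
    "Require landlords to disclose heating outages to tenants within 24 hours.",
    "Mandate that building owners notify tenants of heat service interruptions.",
    "Create a citywide composting program for residential buildings.",
]

# Each text becomes a fixed-size vector capturing its meaning.
embeddings = model.encode(lsr_texts, convert_to_tensor=True)

# Cosine similarity between every pair of LSR embeddings.
similarities = util.cos_sim(embeddings, embeddings)
print(similarities)  # the first two (near-duplicate) texts should score highest
```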

Interestingly, language models designed specifically for legal settings didn’t do well here. This is likely because LSRs are written as short proposals or ideas, making them more similar to news articles than to formal legal text.

Another surprising finding was that more powerful pre-trained language models like BERT struggled to capture the real meaning of LSRs, sometimes performing worse than the old word2vec. One reason for this is a problem called anisotropy, where a model's embeddings are crowded into a narrow region of the vector space, making them less expressive.
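One quick way to spot anisotropy is to check how similar a model makes unrelated texts look. A small sketch of such a diagnostic:

```python
import torch

def mean_pairwise_cosine(embeddings: torch.Tensor) -> float:
    """Average cosine similarity over all distinct pairs of embeddings.

    In an anisotropic embedding space this value is high even for
    unrelated texts, leaving little room to separate true duplicates.
    """
    normed = torch.nn.functional.normalize(embeddings, dim=1)
    sims = normed @ normed.T          # full cosine-similarity matrix
    n = sims.shape[0]
    off_diag = sims.sum() - n         # drop the n self-similarities of 1.0
    return (off_diag / (n * (n - 1))).item()

# E.g., unrelated LSRs averaging around 0.8 or 0.9 would signal anisotropy.
```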

Figure 1: Contrastive Learning Diagram

To fix this expressivity issue, I used a technique called SimCSE. This method improves how models represent text by pulling similar sentences closer together and pushing unrelated ones further apart. Using SimCSE improved how well the system recognized duplicate LSRs: it captured more overlapping and related LSRs than the older Duplicate Checker.
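For the curious, here is a minimal sketch of the unsupervised SimCSE objective, assuming a PyTorch/transformers setup; the checkpoint and toy batch are placeholders. The same batch is encoded twice with dropout active, and each text's two noisy "views" are pulled together while all other texts in the batch serve as negatives.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.train()  # keep dropout active; it supplies the data augmentation

def embed(texts):
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    return model(**batch).last_hidden_state[:, 0]  # [CLS] token embeddings

texts = ["sentence a", "sentence b", "sentence c"]  # a mini-batch of LSR texts
z1, z2 = embed(texts), embed(texts)  # two dropout-noised views of each text

# Contrastive (InfoNCE) loss: each z1[i] should be closest to its own z2[i],
# and far from every other text's embedding in the batch.
temperature = 0.05
sims = F.cosine_similarity(z1.unsqueeze(1), z2.unsqueeze(0), dim=-1) / temperature
loss = F.cross_entropy(sims, torch.arange(len(texts)))
loss.backward()  # one step of the contrastive objective
```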

Ranking Duplicate LSRs

However, our task doesn't end at tagging duplicate LSR candidates: we also need to rank these LSRs (based on their likelihood of being duplicates) for the Bill Drafting Division, who later inspects this ranking.

To get the best ranking, it is imperative to capture the nuances of each LSR's meaning. Given the expressivity problems we observed with large language model embeddings, we settled on word2vec representations for this step.

Seeing how contrastive learning (i.e., pulling similar texts together while pushing unrelated ones apart) helped with tagging duplicate candidates, we applied it in this step as well. This adjustment greatly improved our LSR representations and achieved our best ranking results.
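As an illustration of the ranking step, the sketch below represents each LSR as the average of its word2vec vectors and orders candidates by cosine similarity to an incoming LSR. The contrastive tuning itself is omitted, and the helper names are hypothetical.

```python
import numpy as np
from gensim.models import KeyedVectors

def lsr_vector(text: str, wv: KeyedVectors) -> np.ndarray:
    """Average the word2vec vectors of the in-vocabulary words in the text."""
    words = [w for w in text.lower().split() if w in wv]
    if not words:
        return np.zeros(wv.vector_size)
    return np.mean([wv[w] for w in words], axis=0)

def rank_candidates(query: str, candidates: dict[str, str], wv: KeyedVectors) -> list[str]:
    """Order candidate LSR IDs by cosine similarity to the incoming LSR.

    `wv` is assumed to be pre-loaded, e.g., via KeyedVectors.load_word2vec_format(...).
    """
    q = lsr_vector(query, wv)

    def cosine(v: np.ndarray) -> float:
        return float(np.dot(q, v) / (np.linalg.norm(q) * np.linalg.norm(v) + 1e-12))

    scores = {lsr_id: cosine(lsr_vector(text, wv)) for lsr_id, text in candidates.items()}
    return sorted(scores, key=scores.get, reverse=True)  # most likely duplicates first
```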

Improved Duplicate Checker

Altogether, our new Duplicate Checker first tags duplicate LSR candidates using SimCSE, and then ranks these selected candidates using enhanced word2vec representations. Compared to the existing tool, we see an improvement of 5% when looking at the top-10 ranked LSR duplicates; 7% at top-20; 9% at top-50; and 8% at top-100.
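These numbers reflect top-k retrieval quality. As a sketch of one way such a metric can be computed (the exact metric used internally may differ), one can measure what fraction of the annotated duplicates surface in the top k ranked candidates:

```python
def recall_at_k(ranked_ids: list[str], true_duplicate_ids: set[str], k: int) -> float:
    """Fraction of annotated duplicates that appear among the top-k ranked candidates."""
    if not true_duplicate_ids:
        return 0.0
    hits = sum(1 for lsr_id in ranked_ids[:k] if lsr_id in true_duplicate_ids)
    return hits / len(true_duplicate_ids)
```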

Impact and Path Forward

Moving forward, the New York City Council Data Team plans to adopt this new implementation of the Duplicate Checker into their system. By both identifying duplicate candidates more accurately and ranking them better, the new Duplicate Checker should save considerable time for bill drafters, who currently inspect numerous LSR duplicate candidates to spot potential overlaps.

Final thoughts

This project sheds light on the inherent complexities and challenges of detecting duplicate documents in real-world settings. Although many of the methods and models I experimented with demonstrate impressive performance on established benchmarks, that promise does not necessarily hold for tasks and data in the wild, which are more complicated, messier, and noisier. This project taught me a valuable lesson: while benchmark methods and models are important guides for any exploration, many changes and adjustments are needed to achieve the desired result.

Khonzodakhon Umarova

Ph.D. Student, Computer Science, Cornell University

Acknowledgements

I would like to express my gratitude to the PiTech Initiative for the opportunity to work with the New York City Council on such an interesting project!

Thank you NYCC Data Team for your hospitality this summer!

Thank you Alaa Moussawi, Erik Brown, Rose Martinez, and Melissa Nunez for your guidance throughout the project, for the support and access to all the necessary resources, and for the weekly discussions, which were the source of great inspiration for me!
