Fixing Entity Loss: Deduplication in edsnlp's get_spans

Hey everyone, let's dive into a common issue in edsnlp: the potential loss of entities when writing documents to disk. Specifically, we're going to talk about the get_spans function, how its default behavior can lead to this problem, and how we might fix it.

The Problem: Silent Entity Deduplication

So, here's the deal, guys. The default setup in edsnlp's get_spans function has a little quirk: it deduplicates entity values. Sometimes this is exactly what you want, right? But other times, especially when you're trying to preserve all the information from your documents, it can be a real headache. This deduplication can cause entities to be lost during the writing process. In simple terms, you end up with fewer entities on disk than you actually started with, which can mess with your analysis.

Let's be clear about what's happening. When get_spans deduplicates, it's essentially saying, "Hey, if we see the same entity multiple times, let's just keep one copy." While this can be useful in some situations, it's not ideal when you need to track every single instance of an entity. Think about it like this: you're trying to record every time a patient mentions "chest pain," but the system is only saving the first mention. Subsequent mentions get lost, and you don't have a complete picture of the patient's experience.
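To make this concrete, here's a minimal, hypothetical sketch of the kind of text-keyed deduplication that causes the problem. This is not edsnlp's actual implementation, just an illustration of the failure mode:

spans = [
    ("chest pain", 10, 20),
    ("fever", 45, 50),
    ("chest pain", 120, 130),  # a second mention of the same entity text
]

# Keying on the entity text means later mentions overwrite earlier ones,
# so only one "chest pain" entry survives.
deduplicated = list({text: (text, start, end) for text, start, end in spans}.values())

print(len(spans))         # 3 mentions in the document
print(len(deduplicated))  # 2 entries written to disk

Two distinct mentions of "chest pain" collapse into a single entry, which is exactly the kind of silent loss we're talking about.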

The current implementation of get_spans in edsnlp has a deduplication process baked in. This means that, by default, the function will remove duplicate entities, which is not always desirable. This default behavior can be problematic because users might not be aware that entities are being silently removed. They might assume that all entities are being preserved, leading to incorrect analyses or incomplete datasets. I can tell you that I've dealt with this personally.

This behavior is particularly relevant when working with clinical notes or other text data where the same entity might appear multiple times. For example, a patient's medical history might include several instances of a particular condition or symptom. If the deduplication process removes these duplicates, valuable information is lost and the accuracy of the analysis is compromised. The user may not realize that the data on disk doesn't match what they saw when analyzing the text, which can lead to a lot of wasted time chasing discrepancies they didn't know existed.

To make things easier, I will suggest a solution below.

Suggested Solution: A deduplicate Argument

Here's a simple, elegant solution: introduce a deduplicate argument to the converters, specifically within the get_spans function. This argument would default to True, so the function would behave exactly as it currently does. However, if the user sets deduplicate=False, the function would preserve all entities, without deduplication. This gives the user control over the function's behavior and the data they are saving to disk.
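Here's a rough sketch of what that could look like. To be clear, this is not edsnlp's real get_spans signature; the function body and the span dictionaries are simplified stand-ins:

# Hypothetical sketch of the proposed flag; not edsnlp's actual API.
def get_spans(doc_spans, deduplicate=True):
    if deduplicate:
        # Current behavior: keep one span per (text, label) pair.
        seen = {}
        for span in doc_spans:
            seen.setdefault((span["text"], span["label"]), span)
        return list(seen.values())
    # Proposed behavior: preserve every span, duplicates included.
    return list(doc_spans)

spans = [
    {"text": "chest pain", "label": "symptom", "start": 10},
    {"text": "chest pain", "label": "symptom", "start": 120},
]
print(len(get_spans(spans)))                     # 1 (deduplicated, current default)
print(len(get_spans(spans, deduplicate=False)))  # 2 (everything preserved)

Defaulting to True keeps existing pipelines working unchanged, while deduplicate=False opts into lossless output.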

This change would be a huge win because it gives users the flexibility they need. If they want to deduplicate, they can. If they need to preserve every instance of an entity, they can do that too. It's all about control and transparency. This will also reduce confusion by making the function's behavior more predictable. Users will know exactly what to expect when they use the function, and they can make informed decisions about how to process their data.

Imagine you are a researcher analyzing clinical notes. You want to study the frequency of symptoms like “chest pain.” The current deduplication process would merge all instances of chest pain into a single entry, obscuring the actual number of times it was mentioned. With the new deduplicate argument, you can disable the deduplication and see every single mention of chest pain, giving you the real counts and allowing a deeper, more accurate analysis. Or consider a system for tracking medication dosages: merging multiple administrations of the same drug would make any dosage totals meaningless.
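Using the hypothetical get_spans sketch from above, the difference in a frequency count looks like this:

from collections import Counter

spans = [
    {"text": "chest pain", "label": "symptom", "start": 10},
    {"text": "fever", "label": "symptom", "start": 45},
    {"text": "chest pain", "label": "symptom", "start": 120},
]

print(Counter(s["text"] for s in get_spans(spans, deduplicate=False)))
# Counter({'chest pain': 2, 'fever': 1})  <- the true mention counts

print(Counter(s["text"] for s in get_spans(spans)))
# Counter({'chest pain': 1, 'fever': 1})  <- deduplicated counts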

By adding this argument, the library becomes more versatile, and the user gets a better experience because the behavior is predictable. This is one of those changes that seems small on the surface but has a big impact on the overall functionality and usability of the library. It’s about empowering the user and making the tool as effective as possible. The implementation of this is straightforward. The user can choose the behavior that meets their specific needs. No more surprises when writing to disk.

Addressing Duplicate Spans in the Code

There's another spot in the code where duplicate spans can get dropped. This line, https://github.com/aphp/edsnlp/blob/879e34034cebc77ab8d58dd00981f61a3a00e838/edsnlp/data/converters.py#L645, also discards duplicate spans. To fix this, we should replace it with:

for i, ent in enumerate(sorted(spans)):

This change ensures that all spans, including duplicates, are properly handled: iterating over the full list keeps every span, and the sorted(spans) part makes sure they are processed in a consistent, deterministic order. It's a small change, but it's essential for guaranteeing that the data written to disk is complete and reproducible.
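To see why the distinction matters, here's a small illustration. The converter context is paraphrased, and plain strings stand in for span objects so the example is self-contained:

spans = ["chest pain", "fever", "chest pain"]  # stand-ins for span objects

# A set- or dict-based pass silently drops the repeated span:
print(list(enumerate(sorted(set(spans)))))
# [(0, 'chest pain'), (1, 'fever')]

# The suggested replacement keeps every span, in a deterministic order:
print(list(enumerate(sorted(spans))))
# [(0, 'chest pain'), (1, 'chest pain'), (2, 'fever')]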

Benefits of Implementing the Changes

So, why are these changes so important? Well, first off, they prevent data loss. No more silent removal of entities. You get the data you expect. Second, it improves user control. The deduplicate argument gives users the power to decide how they want to handle duplicates. Third, it increases transparency. Users will be aware of the function's behavior and can make informed decisions. Last but not least, it enhances the reliability of the library. By fixing these issues, we make sure that the data written to disk is as accurate as possible, which is essential for any analysis.

Conclusion

Adding a deduplicate argument and sorting the spans will significantly improve how edsnlp handles entities. This makes the library more user-friendly, more accurate, and more reliable. These changes are crucial for anyone working with text data where every entity matters. They're about making sure that the data you analyze is exactly what you expect. It's a win-win for everyone involved.