The bug report the agent was given — plus the gold fix and hidden tests, shown here for reference. The agent saw only the problem statement.
Problem statement · github issue
Setting `min-similarity-lines` to `0` should stop pylint from checking duplicate code
### Current problem
Setting `min-similarity-lines` to `0` in the rcfile doesn't disable checking for duplicate code, it instead treats every line of code as duplicate and raises many errors.
### Desired solution
Setting `min-similarity-lines` to `0` should disable the duplicate code check.
It works that way in many other linters (like flake8). Setting a numerical value in flake8 to `0` (e.g. `max-line-length`) disables that check.
### Additional context
#214 requests being able to disable `R0801`, but it is still open
diff --git a/tests/checkers/unittest_similar.py b/tests/checkers/unittest_similar.py
--- a/tests/checkers/unittest_similar.py
+++ b/tests/checkers/unittest_similar.py
@@ -502,3 +502,11 @@ def test_get_map_data() -> None:
         # There doesn't seem to be a faster way of doing this, yet.
         lines = (linespec.text for linespec in lineset_obj.stripped_lines)
         assert tuple(expected_lines) == tuple(lines)
+
+
+def test_set_duplicate_lines_to_zero() -> None:
+    output = StringIO()
+    with redirect_stdout(output), pytest.raises(SystemExit) as ex:
+        similar.Run(["--duplicates=0", SIMILAR1, SIMILAR2])
+    assert ex.value.code == 0
+    assert output.getvalue() == ""
0 FAIL → PASS (of 1)
17 PASS → PASS
17 / 18 graded tests
The agent run
The agent's step-by-step attempt — its reasoning, the tools it called, and the patch it produced.
System prompt · harness.agent.build_system_prompt
You are an AI coding assistant working inside a code repository that contains a bug. Investigate the code, find the bug, and fix it.
Your working directory is: /testbed
All four tools (read_file, write_file, edit_file, run_bash) already run from that directory. Use plain relative paths; do not prefix shell commands with cd. Fix the bug by editing source files only; do not create or edit test files.
Step 01 tool use
Let me understand the problem:
1. Setting `min-similarity-lines` to `0` should disable the duplicate code check entirely.
2. Currently, setting it to `0` treats every line as duplicate and raises many errors.
I need to find the relevant code that handles `min-similarity-lines` and fix it so that `0` disables the check.
Let me start by exploring the repository structure and finding the relevant code.
Now I can see the issue. When `min-similarity-lines` is set to `0`, the algorithm doesn't handle it properly. Let me trace through the code:
1. `Similar.__init__` sets `self.min_lines = min_lines` (which would be 0)
2. `_iter_sims` iterates over pairs of linesets
3. `_find_common` calls `hash_lineset(lineset1, self.min_lines)` which passes `0` as `min_common_lines`
4. In `hash_lineset`, when `min_common_lines` is 0:
- `shifted_lines = [iter(lines[i:]) for i in range(min_common_lines)]` → this creates an empty list since `range(0)` is empty
- Then `for index_i, *succ_lines in enumerate(zip(*shifted_lines)):` → `zip()` of an empty list yields nothing, so this loop doesn't execute
- `hash2index` stays empty, `index2lines` stays empty
- This means no hashes are computed, so no similarities are found
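The trace above can be checked with a small sketch. `window_index` is a hypothetical, simplified stand-in for the windowing loop quoted from `hash_lineset` (it keys on the raw tuple of lines rather than pylint's `LinesChunk` objects), so it only illustrates the `range(0)` / `zip()` behaviour, not the real checker.

```python
from collections import defaultdict


def window_index(lines, min_common_lines):
    """Hypothetical stand-in for the windowing loop in hash_lineset."""
    hash2index = defaultdict(list)
    # One iterator per offset; range(0) yields no offsets at all.
    shifted_lines = [iter(lines[i:]) for i in range(min_common_lines)]
    # zip() called with no iterables is an empty iterator.
    for index_i, succ_lines in enumerate(zip(*shifted_lines)):
        hash2index[succ_lines].append(index_i)
    return dict(hash2index)


lines = ("a = 1", "b = 2", "a = 1", "b = 2")
empty = window_index(lines, 0)  # no windows are built
pairs = window_index(lines, 2)  # windows of two consecutive lines
```

Under these assumptions a window size of `0` produces an empty index, while a window size of `2` records the repeated pair at both of its start positions.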
Wait, that would mean it silently finds nothing. But the bug report says it "treats every line of code as duplicate and raises many errors." Let me look more carefully.
Actually wait, let me re-read the issue. The user says setting `min-similarity-lines` to `0` doesn't disable the check. But looking at the algorithm, with `min_lines=0`, the hash computation would produce empty results, meaning no similarities would be found. But maybe the issue is different...
Actually wait, let me look at `_find_common` more carefully:
```python
eff_cmn_nb = filter_noncode_lines(
    lineset1, start_index_1, lineset2, start_index_2, nb_common_lines
)
if eff_cmn_nb > self.min_lines:
    yield com
```
If `self.min_lines` is 0, then `eff_cmn_nb > 0` would be true for any match. So any similarity would be yielded.
But the question is: what does `hash_lineset` do with `min_common_lines=0`?
```python
shifted_lines = [iter(lines[i:]) for i in range(min_common_lines)]
```
With `min_common_lines=0`, `range(0)` gives `[]`, so `shifted_lines = []`.
```python
for index_i, *succ_lines in enumerate(zip(*shifted_lines)):
```
`zip()` returns an empty iterator, so the loop body never executes.
So `hash2index` is empty, `index2lines` is empty, and no commonalities are found.
So actually, setting `min-similarity-lines=0` would result in the check being effectively disabled (no similarities found). But the user reports the opposite...
Wait, let me look more carefully. Maybe the issue is in the standalone `Similar` class used by the `symilar` command, not the `SimilarChecker`. But the user mentions `R0801` which is the message from the checker.
Let me think again... Maybe the issue is not in `hash_lineset` but somewhere else. Let me check how the `Similar` class is initialized:
```python
def __init__(
    self,
    min_lines: int = DEFAULT_MIN_SIMILARITY_LINE,
    ...
```
And in `SimilarChecker.__init__`:
```python
Similar.__init__(
    self,
    min_lines=self.config.min_similarity_lines,
    ...
)
```
If `min_lines` is 0, then:
- `hash_lineset` with `min_common_lines=0` will produce empty hash dicts
- `_find_common` will find no common hashes
- `_iter_sims` will yield nothing
- The check is effectively disabled
Hmm, but the issue says it raises many errors. Let me look at this from a different angle. Maybe the problem is that the `options` definition for `min-similarity-lines` doesn't have a way to disable it, and users expect setting it to `0` would work like other tools.
Actually, let me re-read the bug description: "Setting `min-similarity-lines` to `0` in the rcfile doesn't disable checking for duplicate code, it instead treats every line of code as duplicate and raises many errors."
Wait, so it actually does raise many errors? Let me think about what happens when `min_lines=0`:
In `hash_lineset`:
```python
shifted_lines = [iter(lines[i:]) for i in range(min_common_lines)]
```
With `min_common_lines=0`, `shifted_lines = []`.
```python
for index_i, *succ_lines in enumerate(zip(*shifted_lines)):
```
`zip()` iterates over nothing, so nothing happens. `hash2index` is empty.
But wait, what if there are no stripped lines? Then `lines =
... [truncated 5,589 characters]
Let me check the test file and also look at how the min_lines value flows through the code more carefully.
Now I understand the code flow. Let me trace through what happens when `min_lines` is `0`:
1. In `hash_lineset` with `min_common_lines=0`:
- `shifted_lines = [iter(lines[i:]) for i in range(0)]` → `[]`
- `zip(*[])` → empty iterator
- The for loop doesn't execute
- `hash2index` is empty, `index2lines` is empty
2. In `_find_common`:
- `hash_1` and `hash_2` are empty frozensets
- `common_hashes` is empty
- `all_couples` stays empty
- Nothing yielded
3. In `_iter_sims`:
- Nothing yielded
4. In `close()`:
- `_compute_sims()` returns `[]`
- The for loop doesn't execute
- No messages are added
So it seems like `min_lines=0` should effectively disable the check. But the user reports it raises many errors.
Let me think about this differently. Maybe this is a different version of pylint and the bug was already reported and needs to be fixed. Let me look at the `_compute_sims` method more carefully:
```python
def _compute_sims(self) -> List[Tuple[int, Set[LinesChunkLimits_T]]]:
    """compute similarities in appended files"""
    no_duplicates: Dict[int, List[Set[LinesChunkLimits_T]]] = defaultdict(list)
    for commonality in self._iter_sims():
        ...
```
If `_iter_sims` yields nothing, then `no_duplicates` is empty. Then:
```python
sims: List[Tuple[int, Set[LinesChunkLimits_T]]] = []
ensembles: List[Set[LinesChunkLimits_T]]
for num, ensembles in no_duplicates.items():
    cpls: Set[LinesChunkLimits_T]
    for cpls in ensembles:
        sims.append((num, cpls))
sims.sort()
sims.reverse()
return sims
```
Empty list returned.
Hmm, but wait. Let me reconsider. Maybe the problem is more subtle. Let me check if there's an issue with integer conversion or option handling.
Actually, let me look at this from a completely different angle. Let me look at the `SimilarChecker.set_option` method:
```python
def set_option(self, optname, value, action=None, optdict=None):
    BaseChecker.set_option(self, optname, value, action, optdict)
    if optname == "min-similarity-lines":
        self.min_lines = self.config.min_similarity_lines
```
And in `SimilarChecker.__init__`:
```python
Similar.__init__(
    self,
    min_lines=self.config.min_similarity_lines,
    ...
)
```
It reads from `self.config.min_similarity_lines`. Looking at the option definition:
```python
(
    "min-similarity-lines",
    {
        "default": DEFAULT_MIN_SIMILARITY_LINE,
        "type": "int",
        ...
    },
),
```
So `self.config.min_similarity_lines` is the integer value from the config. If set to 0, `self.min_lines` would be 0.
OK, I need to actually test this. But let me first look at what the actual fix should be. The desired behavior is: setting `min-similarity-lines` to `0` should disable the duplicate code check.
The simplest fix would be to add a check in `_iter_sims` or `_find_common` or `close` to return early if `self.min_lines <= 0`.
But wait, I just realized something. Let me trace through the algorithm more carefully with `min_lines=0`.
In `hash_lineset`:
```python
shifted_lines = [iter(lines[i:]) for i in range(min_common_lines)]
```
With `min_common_lines=0`, this creates an empty list.
```python
for index_i, *succ_lines in enumerate(zip(*shifted_lines)):
```
`zip(*[])` = `zip()`. In Python, `zip()` with no arguments returns an empty iterator. So this loop body never executes.
So `hash2index` is an empty defaultdict(list) and `index2lines` is an empty dict.
Then back in `_find_common`:
```python
hash_1: FrozenSet[LinesChunk] = frozenset(hash_to_index_1.keys())
hash_2: FrozenSet[LinesChunk] = frozenset(hash_to_index_2.keys())
common_hashes: Iterable[LinesChunk] = sorted(
    hash_1 & hash_2, key=lambda m: hash_to_index_1[m][0]
)
```
Both are empty frozensets. `hash_1 & hash_2` is empty. `sorted([])` = `[]`.
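A minimal sketch of that intersection step, using plain strings as hypothetical hash keys in place of `LinesChunk` objects. The helper name `common_hashes` is illustrative only; the body mirrors the quoted lines.

```python
def common_hashes(hash_to_index_1, hash_to_index_2):
    """Intersect the hash keys of two files, ordered by first position in file 1."""
    hash_1 = frozenset(hash_to_index_1.keys())
    hash_2 = frozenset(hash_to_index_2.keys())
    # With an empty intersection the key function is never called.
    return sorted(hash_1 & hash_2, key=lambda m: hash_to_index_1[m][0])


# Empty indexes, as produced when no windows were hashed.
no_common = common_hashes({}, {})

# Hypothetical non-empty indexes for contrast.
some_common = common_hashes(
    {"h1": [3], "h2": [0]},
    {"h2": [5], "h1": [7], "h3": [1]},
)
```

With empty inputs the result is an empty list, so any loop over it has nothing to iterate. With shared keys the result is ordered by where each hash first appears in the first file.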
```python
all_couples: CplIndexToCplLines_T = {}
for c_hash in sorted(common_hashes, key=operator.attrgetter("_index")):
    ...
```
Loop never execut
... [truncated 11,136 characters]
Now let me look at the key function `hash_lineset` and the `_iter_sims` method more closely to understand the flow when `min_lines=0`:
run_bash{ command: "cd /testbed && python -c \"\nfrom pylint.checkers.similar import Similar, hash_lineset, LineSet\n\n# Let's test what happens with min_lines=0\nclass TestLS:\n pass…" }
tool result · run_bash
exitcode: 1
stdout:
stderr: Traceback (most recent call last):
  File "<string>", line 13, in <module>
  File "/testbed/pylint/checkers/similar.py", line 374, in append_stream
    readlines = stream.readlines
AttributeError: 'list_iterator' object has no attribute 'readlines'
Step 05 tool use
The test failed because `append_stream` expects a stream with `readlines`. Let me fix the test.
exitcode: 0
stdout: Results with min_lines=0: []
Results with min_lines=1: [Commonality(cmn_lines_nb=3, fst_lset=<pylint.checkers.similar.LineSet object at 0x7fffff2b8eb0>, fst_file_start=0, fst_file_end=3, snd_lset=<pylint.checkers.similar.LineSet object at 0x7ffffd39d070>, snd_file_start=0, snd_file_end=3)]
3 similar lines
File 1: test1 [0:3]
File 2: test2 [0:3]
stderr:
Step 06 tool use
So with `min_lines=0`, the check is effectively disabled (returns empty list). The user's claim that "it treats every line of code as duplicate" doesn't seem to be reproducible with the current code.
But wait, maybe the issue is about a different version. Or maybe the bug is that while the check is silently disabled, the user wants it to be an explicit feature: setting `min-similarity-lines=0` should be the documented way to disable the check. And perhaps in a previous version, it didn't work correctly.
Actually, let me re-read the issue one more time: "Setting `min-similarity-lines` to `0` should stop pylint from checking duplicate code"
"Setting `min-similarity-lines` to `0` in the rcfile doesn't disable checking for duplicate code, it instead treats every line of code as duplicate and raises many errors."
Hmm, but my test shows it does disable it. Let me check the `SimilarChecker` path too, just to make sure.
Actually wait, maybe the issue is that `min-similarity-lines=0` is being set in the rcfile but the config system treats `0` differently? Let me check how the config option parsing works for `int` type.
Let me look at how options are parsed...
Actually, let me try a different approach. Let me look at what happens when the `SimilarChecker` is used with `min-similarity-lines=0`. Let me check if there's something in the BaseChecker or option parsing that might treat 0 as falsy and use the default instead.
exitcode: 1
stdout: min_lines: 0
stderr: Traceback (most recent call last):
  File "<string>", line 17, in <module>
  File "/testbed/pylint/checkers/similar.py", line 817, in open
    self.stats = self.linter.add_stats(
  File "/testbed/pylint/reporters/reports_handler_mix_in.py", line 75, in add_stats
    assert key not in self.stats
TypeError: argument of type 'NoneType' is not iterable
Step 08 tool use
The test is getting complicated. Let me focus on the actual code fix.
From my analysis, I can see that:
1. When `min_lines=0`, `hash_lineset` returns empty dicts (because `range(0)` is empty and the zip loop never executes)
2. `_find_common` finds no commonalities
3. `_iter_sims` yields nothing
4. The check is effectively disabled
But the issue says it doesn't work. Let me look at this from a different angle. Maybe the issue is in how the option value is retrieved. Let me look at `set_option`:
```python
def set_option(self, optname, value, action=None, optdict=None):
    BaseChecker.set_option(self, optname, value, action, optdict)
    if optname == "min-similarity-lines":
        self.min_lines = self.config.min_similarity_lines
```
Note that it's using `self.config.min_similarity_lines` rather than `value`. Could there be a case where `value` is 0 but `self.config.min_similarity_lines` is something else? Let me check.
Actually, `BaseChecker.set_option` should set `self.config.min_similarity_lines` to `value`. So it should be fine.
But wait, let me check how options are parsed from the rcfile. Maybe there's an issue with the `int` type parsing that treats 0 as falsy...
Let me look at the options parsing code.
# Licensed under the GPL: https://www.gnu.org/licenses/old-licenses/gpl-2.0.html
# For details: https://github.com/PyCQA/pylint/blob/main/LICENSE
import optparse # pylint: disable=deprecated-module
from typing import Any, Dict, Tuple
from pylint.config.option import _validate
class UnsupportedAction(Exception):
    """raised by set_option when it doesn't know what to do for an action"""
class OptionsProviderMixIn:
    """Mixin to provide options to an OptionsManager"" …
Step 10 tool use
Now let me check the `_validate` function which is called by `set_option`:
# Licensed under the GPL: https://www.gnu.org/licenses/old-licenses/gpl-2.0.html
# For details: https://github.com/PyCQA/pylint/blob/main/LICENSE
import copy
import optparse # pylint: disable=deprecated-module
import re
from pylint import utils
# pylint: disable=unused-argument
def _csv_validator(_, name, value):
    return utils._check_csv(value)
# pylint: disable=unused-argument
def _regexp_validator(_, name, value):
    if hasattr(value, "pattern"):
        return v …
Step 11 tool use
OK, I see that the `int` type uses `int` as the validator, which would convert `"0"` to `0`. So parsing `min-similarity-lines=0` from the rcfile should work correctly and set the value to `0`.
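The "0 treated as falsy" concern can be illustrated with a short sketch. Both functions are hypothetical and are not pylint code, and the default of `4` is an assumption made for the example. The second function shows the kind of fallback pattern that would silently replace `0` with the default, which the validator described above does not appear to use.

```python
DEFAULT_MIN_SIMILARITY_LINE = 4  # assumed default, for illustration only


def parse_int_option(raw: str) -> int:
    """Plain int conversion: "0" stays 0."""
    return int(raw)


def parse_with_falsy_fallback(raw: str) -> int:
    """Hypothetical buggy pattern: `or` treats 0 as missing."""
    return int(raw) or DEFAULT_MIN_SIMILARITY_LINE


strict_zero = parse_int_option("0")
fallback_zero = parse_with_falsy_fallback("0")
fallback_six = parse_with_falsy_fallback("6")
```

A plain `int` conversion preserves `0`, whereas an `or`-based fallback would turn it back into the default.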
Now let me think about this differently. Maybe the bug is NOT about `min_lines=0` causing the algorithm to find false positives, but rather about `min_lines=0` not being treated as a way to disable the check. The user wants `0` to mean "disable", but currently `0` just means "find similarities with at least 0 common lines", which is effectively everything... but due to the algorithm, it actually finds nothing (because the hash loop never executes).
But wait, maybe I should actually look more carefully at the algorithm. What if `min_lines=0` causes `hash_lineset` to return something unexpected? Let me trace through again:
```python
def hash_lineset(
    lineset: "LineSet", min_common_lines: int = DEFAULT_MIN_SIMILARITY_LINE
) -> Tuple[HashToIndex_T, IndexToLines_T]:
    hash2index = defaultdict(list)
    index2lines = {}
    lines = tuple(x.text for x in lineset.stripped_lines)
    shifted_lines = [iter(lines[i:]) for i in range(min_common_lines)]
    for index_i, *succ_lines in enumerate(zip(*shifted_lines)):
        ...
    return hash2index, index2lines
```
With `min_common_lines=0`:
- `shifted_lines = []`
- `zip(*[])` returns an empty iterator
- Loop never executes
- Returns `defaultdict(list)` (empty) and `{}` (empty)
This is correct - no hashes means no duplicates found.
But what if the issue is that `min_lines` being 0 doesn't actually get set correctly? Let me check the `SimilarChecker.__init__`:
```python
def __init__(self, linter=None) -> None:
    BaseChecker.__init__(self, linter)
    Similar.__init__(
        self,
        min_lines=self.config.min_similarity_lines,
        ...
    )
```
And `set_option`:
```python
def set_option(self, optname, value, action=None, optdict=None):
    BaseChecker.set_option(self, optname, value, action, optdict)
    if optname == "min-similarity-lines":
        self.min_lines = self.config.min_similarity_lines
```
So if `min-similarity-lines=0`, then `self.config.min_similarity_lines` should be `0`, and `self.min_lines` should be `0`.
I'm now fairly convinced the current code already effectively disables the check when `min_lines=0`. The issue might be about making this behavior explicit and documented.
BUT wait, the issue description says "it instead treats every line of code as duplicate and raises many errors." This is a strong claim. Let me check if there's maybe a different version of the code or if I'm missing something.
Actually, let me look at the `_find_common` method one more time. There's this check:
```python
if eff_cmn_nb > self.min_lines:
    yield com
```
If `self.min_lines` is 0, then `eff_cmn_nb > 0` would be true for any match. But this line is never reached because `all_couples` is empty when `min_lines=0`.
But what if there's a bug in `hash_lineset` that I'm not seeing? Let me look at `shifted_lines` more carefully:
```python
shifted_lines = [iter(lines[i:]) for i in range(min_common_lines)]
```
When `min_common_lines=0`, `shifted_lines = []`. Then:
```python
for index_i, *succ_lines in enumerate(zip(*shifted_lines)):
```
`zip(*[])` → `zip()` → empty iterator. Loop never executes.
I'm going to accept that the current behavior with `min_lines=0` is that the check is silently disabled (returns no results). However, the user wants this to be an explicit feature. The fix should:
1. Explicitly handle `min_lines <= 0` by skipping the check
2. This makes the behavior documented and intentional rather than accidental
Let me now implement the fix. The best place is in `_iter_sims`:
```python
def _iter_sims(self) -> Generator[Commonality, None, None]:
    """iterate on similarities among all files, by making a cartesian product"""
    if self.min_lines <= 0:
        return
    for idx, lineset in enumerate(self.linesets[:-1]):
        for lineset2 in self
... [truncated 2,021 characters]
Now I have a clear picture. Let me verify exactly what happens with `min_lines=0` in the current code:
Interesting. With only 3 lines and `min_common_lines=4`, `hash2index` is also empty. That's because `zip(*shifted_lines)` will stop when the shortest iterator is exhausted. With 4 iterators and only 3 lines, the zip will be exhausted immediately (since one iterator starts at index 3 and has no elements). Let me verify with more lines:
Actually wait, `shifted_lines = [iter(lines[0:]), iter(lines[1:]), iter(lines[2:]), iter(lines[3:])]`. When we zip, the 4th iterator is exhausted (empty), so zip produces nothing. With 4 lines, the 4th iterator has 1 element, so zip produces 1 tuple.
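The shortest-iterator behaviour described here can be reproduced in isolation. `windows` is a hypothetical helper that builds the same shifted iterators and returns the zipped tuples directly; the line contents are placeholders.

```python
def windows(lines, size):
    """Return every run of `size` consecutive lines as a tuple."""
    shifted = [iter(lines[i:]) for i in range(size)]
    # zip stops as soon as its shortest input is exhausted.
    return list(zip(*shifted))


three_lines = windows(("l1", "l2", "l3"), 4)
four_lines = windows(("l1", "l2", "l3", "l4"), 4)
five_lines = windows(("l1", "l2", "l3", "l4", "l5"), 4)
```

With three lines the fourth iterator is empty, so no window is produced. Four lines give exactly one window, and each additional line adds one more.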
OK, but my main analysis holds: with `min_lines=0`, `hash2index` is always empty, so no duplicates are found.
Now, the issue says "it instead treats every line of code as duplicate and raises many errors." This doesn't match my analysis. But it's possible that:
1. The issue is from an older version of the code
2. The user is mistaken about the behavior
3. There's some other code path I'm not seeing
Regardless, the fix should be to explicitly handle `min_lines <= 0` by skipping the check. This makes the behavior intentional and documented.
Let me implement the fix now. I'll add guards in two places:
1. `_iter_sims` - to skip the similarity computation entirely when `min_lines <= 0`
2. `hash_lineset` - as a safety measure, return empty dicts when `min_common_lines <= 0`
Actually, looking at it again, I think the cleanest fix is:
1. In `hash_lineset`: add an early return if `min_common_lines <= 0`
2. In `_iter_sims`: add an early return if `self.min_lines <= 0`
This way, both the standalone `Similar` tool and the `SimilarChecker` will correctly skip the check.
Now I understand the issue. When `min_lines=0`:
1. `hash_lineset` creates `shifted_lines = [iter(lines[i:]) for i in range(0)]` which is `[]`
2. `zip(*[])` returns an empty iterator, so the loop never executes
3. `hash2index` and `index2lines` are both empty
4. No similarities are found
So currently, `min_lines=0` *accidentally* disables the check because the hash computation loop doesn't execute. But this is fragile and not intentional. The fix should explicitly handle `min_lines <= 0` by skipping the check early, making the behavior intentional and robust.
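As a rough sketch of the explicit guard being proposed, the following hypothetical `MiniSimilar` class shows how a bare `return` at the top of a generator ends iteration before any pair of files is compared. The class, its method name, and the file names are illustrative only; the real `_iter_sims` yields `Commonality` objects rather than pairs.

```python
from itertools import combinations
from typing import Iterator, List, Tuple


class MiniSimilar:
    """Hypothetical reduction of the Similar class to its pairing loop."""

    def __init__(self, min_lines: int, linesets: List[str]) -> None:
        self.min_lines = min_lines
        self.linesets = linesets

    def _iter_pairs(self) -> Iterator[Tuple[str, str]]:
        # A bare return in a generator ends it without yielding anything.
        if self.min_lines <= 0:
            return
        yield from combinations(self.linesets, 2)


files = ["a.py", "b.py", "c.py"]
disabled = list(MiniSimilar(0, files)._iter_pairs())
enabled = list(MiniSimilar(4, files)._iter_pairs())
```

The guard only affects what the generator yields. Anything a caller prints or reports after consuming the generator still runs.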
Let me implement the fix:
edit_file{ path: "/testbed/pylint/checkers/similar.py", old_string: "def hash_lineset(\n lineset: \"LineSet\", min_common_lines: int = DEFAULT_MIN_SIMILARITY_LINE\n) -> Tuple[HashToIndex_T, IndexToLines_T]:\n \"\"\"\n Return two …", new_string: "def hash_lineset(\n lineset: \"LineSet\", min_common_lines: int = DEFAULT_MIN_SIMILARITY_LINE\n) -> Tuple[HashToIndex_T, IndexToLines_T]:\n \"\"\"\n Return two …" }
Now let me add an early return in `_iter_sims` to skip the check entirely when `min_lines <= 0`:
edit_file{ path: "/testbed/pylint/checkers/similar.py", old_string: " def _iter_sims(self) -> Generator[Commonality, None, None]:\n \"\"\"iterate on similarities among all files, by making a cartesian\n product\n …", new_string: " def _iter_sims(self) -> Generator[Commonality, None, None]:\n \"\"\"iterate on similarities among all files, by making a cartesian\n product\n …" }
diff --git a/pylint/checkers/similar.py b/pylint/checkers/similar.py
index c65b1c89..9c7c1d9e 100644
--- a/pylint/checkers/similar.py
+++ b/pylint/checkers/similar.py
@@ -238,6 +238,8 @@ def hash_lineset(
     """
     hash2index = defaultdict(list)
     index2lines = {}
+    if min_common_lines <= 0:
+        return hash2index, index2lines
     # Comments, docstring and other specific patterns maybe excluded -> call to stripped_lines
     # to get only what is desired
     lines = tuple(x.text for x in lineset.stripped_lines)
@@ -531,6 +533,8 @@ class Similar:
         """iterate on similarities among all files, by making a cartesian
         product
         """
+        if self.min_lines <= 0:
+            return
         for idx, lineset in enumerate(self.linesets[:-1]):
             for lineset2 in self.linesets[idx + 1 :]:
                 yield from self._find_common(lineset, lineset2)