pylint-dev__pylint-4970

pylint-dev/pylint · v2.10  ·  15 steps · 320,881 tokens · 356s
partial
0.944

The task

The bug report the agent was given — plus the gold fix and hidden tests, shown here for reference. The agent saw only the problem statement.

Problem statement · github issue
Setting `min-similarity-lines` to `0` should stop pylint from checking duplicate code
### Current problem

Setting `min-similarity-lines` to `0` in the rcfile doesn't disable checking for duplicate code, it instead treats every line of code as duplicate and raises many errors.

### Desired solution

Setting `min-similarity-lines` to `0` should disable the duplicate code check.

It works that way in many other linters (like flake8). Setting a numerical value in flake8 to `0` (e.g. `max-line-length`) disables that check.

### Additional context

#214 requests being able to disable `R0801`, but it is still open
Gold patch · held back · the reference fix
diff --git a/pylint/checkers/similar.py b/pylint/checkers/similar.py
--- a/pylint/checkers/similar.py
+++ b/pylint/checkers/similar.py
@@ -390,6 +390,8 @@ def append_stream(self, streamid: str, stream: TextIO, encoding=None) -> None:
 
     def run(self) -> None:
         """start looking for similarities and display results on stdout"""
+        if self.min_lines == 0:
+            return
         self._display_sims(self._compute_sims())
 
     def _compute_sims(self) -> List[Tuple[int, Set[LinesChunkLimits_T]]]:
Hidden test · test_patch · applied only to grade
diff --git a/tests/checkers/unittest_similar.py b/tests/checkers/unittest_similar.py
--- a/tests/checkers/unittest_similar.py
+++ b/tests/checkers/unittest_similar.py
@@ -502,3 +502,11 @@ def test_get_map_data() -> None:
         # There doesn't seem to be a faster way of doing this, yet.
         lines = (linespec.text for linespec in lineset_obj.stripped_lines)
         assert tuple(expected_lines) == tuple(lines)
+
+
+def test_set_duplicate_lines_to_zero() -> None:
+    output = StringIO()
+    with redirect_stdout(output), pytest.raises(SystemExit) as ex:
+        similar.Run(["--duplicates=0", SIMILAR1, SIMILAR2])
+    assert ex.value.code == 0
+    assert output.getvalue() == ""
0 FAIL → PASS (of 1)
17 PASS → PASS
17 / 18 graded tests
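
The hidden test's pattern, capturing stdout while expecting a clean `SystemExit`, can be sketched without pylint. In the sketch below, `fake_run` is a hypothetical stand-in for `similar.Run`, its printed text is illustrative only, and plain `try`/`except` replaces `pytest.raises` to keep the example dependency-free. It only illustrates what the two assertions require.

```python
import sys
from contextlib import redirect_stdout
from io import StringIO


def fake_run(argv, print_report):
    """Hypothetical stand-in for a CLI entry point that always exits with 0."""
    if print_report:
        print("TOTAL lines=0 duplicates=0 percent=0.00")  # illustrative text only
    sys.exit(0)


def capture(print_report):
    """Return (exit_code, captured_stdout) for one call to fake_run."""
    output = StringIO()
    with redirect_stdout(output):
        try:
            fake_run(["--duplicates=0"], print_report)
        except SystemExit as ex:
            code = ex.code
    return code, output.getvalue()


quiet = capture(print_report=False)
noisy = capture(print_report=True)
```

Both calls exit with code 0, but only the silent one would satisfy an assertion of the form `output.getvalue() == ""`.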

The agent run

The agent's step-by-step attempt — its reasoning, the tools it called, and the patch it produced.

System prompt · harness.agent.build_system_prompt
You are an AI coding assistant working inside a code repository that contains a bug. Investigate the code, find the bug, and fix it.

Your working directory is: /testbed
All four tools (read_file, write_file, edit_file, run_bash) already run from that directory. Use plain relative paths; do not prefix shell commands with cd. Fix the bug by editing source files only; do not create or edit test files.
Step 01 tool use
Let me understand the problem:
1. Setting `min-similarity-lines` to `0` should disable the duplicate code check entirely.
2. Currently, setting it to `0` treats every line as duplicate and raises many errors.

I need to find the relevant code that handles `min-similarity-lines` and fix it so that `0` disables the check. Let me start by exploring the repository structure and finding the relevant code.
run_bash{ command: "find /testbed -type f -name \"*.py\" | xargs grep -l \"min-similarity-lines\\|min_similarity_lines\\|similarity\" 2>/dev/null" }
tool result · run_bash
exitcode: 0
stdout:
/testbed/tests/input/similar_lines_b.py
/testbed/tests/input/similar_lines_a.py
/testbed/tests/checkers/unittest_similar.py
/testbed/pylint/checkers/similar.py
/testbed/pylint/lint/parallel.py
stderr:
run_bash{ command: "grep -r \"min-similarity-lines\\|min_similarity_lines\" /testbed --include=\"*.py\" -l 2>/dev/null" }
tool result · run_bash
exitcode: 0 stdout: /testbed/pylint/checkers/similar.py stderr:
Step 02 tool use
Let me look at the similar.py checker file to understand how `min-similarity-lines` is used.
read_file{ path: "/testbed/pylint/checkers/similar.py" }
tool result · read_file
# Copyright (c) 2006, 2008-2014 LOGILAB S.A. (Paris, FRANCE) <contact@logilab.fr> # Copyright (c) 2012 Ry4an Brase <ry4an-hg@ry4an.org> # Copyright (c) 2012 Google, Inc. # Copyright (c) 2012 Anthony VEREZ <anthony.verez.external@cassidian.com> # Copyright (c) 2014-2020 Claudiu Popa <pcmanticore@gmail.com> # Copyright (c) 2014 Brett Cannon <brett@python.org> # Copyright (c) 2014 Arun Persaud <arun@nubati.net> # Copyright (c) 2015 Ionel Cristian Maries <contact@ionelmc.ro> # Co …
Step 03 tool use
Now I can see the issue. When `min-similarity-lines` is set to `0`, the algorithm doesn't handle it properly. Let me trace through the code: 1. `Similar.__init__` sets `self.min_lines = min_lines` (which would be 0) 2. `_iter_sims` iterates over pairs of linesets 3. `_find_common` calls `hash_lineset(lineset1, self.min_lines)` which passes `0` as `min_common_lines` 4. In `hash_lineset`, when `min_common_lines` is 0: - `shifted_lines = [iter(lines[i:]) for i in range(min_common_lines)]` → this creates an empty list since `range(0)` is empty - Then `for index_i, *succ_lines in enumerate(zip(*shifted_lines)):` → `zip()` of an empty list yields nothing, so this loop doesn't execute - `hash2index` stays empty, `index2lines` stays empty - This means no hashes are computed, so no similarities are found Wait, that would mean it silently finds nothing. But the bug report says it "treats every line of code as duplicate and raises many errors." Let me look more carefully. Actually wait, let me re-read the issue. The user says setting `min-similarity-lines` to `0` doesn't disable the check. But looking at the algorithm, with `min_lines=0`, the hash computation would produce empty results, meaning no similarities would be found. But maybe the issue is different... Actually wait, let me look at `_find_common` more carefully: ```python eff_cmn_nb = filter_noncode_lines( lineset1, start_index_1, lineset2, start_index_2, nb_common_lines ) if eff_cmn_nb > self.min_lines: yield com ``` If `self.min_lines` is 0, then `eff_cmn_nb > 0` would be true for any match. So any similarity would be yielded. But the question is: what does `hash_lineset` do with `min_common_lines=0`? ```python shifted_lines = [iter(lines[i:]) for i in range(min_common_lines)] ``` With `min_common_lines=0`, `range(0)` gives `[]`, so `shifted_lines = []`. ```python for index_i, *succ_lines in enumerate(zip(*shifted_lines)): ``` `zip()` returns an empty iterator, so the loop body never executes. 
So `hash2index` is empty, `index2lines` is empty, and no commonalities are found. So actually, setting `min-similarity-lines=0` would result in the check being effectively disabled (no similarities found). But the user reports the opposite... Wait, let me look more carefully. Maybe the issue is in the standalone `Similar` class used by the `symilar` command, not the `SimilarChecker`. But the user mentions `R0801` which is the message from the checker. Let me think again... Maybe the issue is not in `hash_lineset` but somewhere else. Let me check how the `Similar` class is initialized: ```python def __init__( self, min_lines: int = DEFAULT_MIN_SIMILARITY_LINE, ... ``` And in `SimilarChecker.__init__`: ```python Similar.__init__( self, min_lines=self.config.min_similarity_lines, ... ) ``` If `min_lines` is 0, then: - `hash_lineset` with `min_common_lines=0` will produce empty hash dicts - `_find_common` will find no common hashes - `_iter_sims` will yield nothing - The check is effectively disabled Hmm, but the issue says it raises many errors. Let me look at this from a different angle. Maybe the problem is that the `options` definition for `min-similarity-lines` doesn't have a way to disable it, and users expect setting it to `0` would work like other tools. Actually, let me re-read the bug description: "Setting `min-similarity-lines` to `0` in the rcfile doesn't disable checking for duplicate code, it instead treats every line of code as duplicate and raises many errors." Wait, so it actually does raise many errors? Let me think about what happens when `min_lines=0`: In `hash_lineset`: ```python shifted_lines = [iter(lines[i:]) for i in range(min_common_lines)] ``` With `min_common_lines=0`, `shifted_lines = []`. ```python for index_i, *succ_lines in enumerate(zip(*shifted_lines)): ``` `zip()` iterates over nothing, so nothing happens. `hash2index` is empty. But wait, what if there are no stripped lines? Then `lines = ... [truncated 5,589 characters]
Let me check the test file and also look at how the min_lines value flows through the code more carefully.
read_file{ path: "/testbed/tests/checkers/unittest_similar.py" }
tool result · read_file
# Copyright (c) 2010, 2012, 2014 LOGILAB S.A. (Paris, FRANCE) <contact@logilab.fr> # Copyright (c) 2012 Ry4an Brase <ry4an-hg@ry4an.org> # Copyright (c) 2014 Google, Inc. # Copyright (c) 2015 Ionel Cristian Maries <contact@ionelmc.ro> # Copyright (c) 2016-2018, 2020 Claudiu Popa <pcmanticore@gmail.com> # Copyright (c) 2016 Derek Gustafson <degustaf@gmail.com> # Copyright (c) 2018 Scott Worley <scottworley@scottworley.com> # Copyright (c) 2018 Sushobhit <31987769+sushobhit27@u …
run_bash{ command: "cd /testbed && grep -rn \"min_lines\\|min-similarity-lines\\|min_similarity\" --include=\"*.py\"" }
tool result · run_bash
exitcode: 0 stdout: pylint/checkers/similar.py:358: min_lines: int = DEFAULT_MIN_SIMILARITY_LINE, pylint/checkers/similar.py:364: self.min_lines = min_lines pylint/checkers/similar.py:478: hash_to_index_1, index_to_lines_1 = hash_lineset(lineset1, self.min_lines) pylint/checkers/similar.py:479: hash_to_index_2, index_to_lines_2 = hash_lineset(lineset2, self.min_lines) pylint/checkers/similar.py:503: effective_cmn_lines_nb=self.mi …
Step 04 tool use
Now I understand the code flow. Let me trace through what happens when `min_lines` is `0`: 1. In `hash_lineset` with `min_common_lines=0`: - `shifted_lines = [iter(lines[i:]) for i in range(0)]` → `[]` - `zip(*[])` → empty iterator - The for loop doesn't execute - `hash2index` is empty, `index2lines` is empty 2. In `_find_common`: - `hash_1` and `hash_2` are empty frozensets - `common_hashes` is empty - `all_couples` stays empty - Nothing yielded 3. In `_iter_sims`: - Nothing yielded 4. In `close()`: - `_compute_sims()` returns `[]` - The for loop doesn't execute - No messages are added So it seems like `min_lines=0` should effectively disable the check. But the user reports it raises many errors. Let me think about this differently. Maybe this is a different version of pylint and the bug was already reported and needs to be fixed. Let me look at the `_compute_sims` method more carefully: ```python def _compute_sims(self) -> List[Tuple[int, Set[LinesChunkLimits_T]]]: """compute similarities in appended files""" no_duplicates: Dict[int, List[Set[LinesChunkLimits_T]]] = defaultdict(list) for commonality in self._iter_sims(): ... ``` If `_iter_sims` yields nothing, then `no_duplicates` is empty. Then: ```python sims: List[Tuple[int, Set[LinesChunkLimits_T]]] = [] ensembles: List[Set[LinesChunkLimits_T]] for num, ensembles in no_duplicates.items(): cpls: Set[LinesChunkLimits_T] for cpls in ensembles: sims.append((num, cpls)) sims.sort() sims.reverse() return sims ``` Empty list returned. Hmm, but wait. Let me reconsider. Maybe the problem is more subtle. Let me check if there's an issue with integer conversion or option handling. Actually, let me look at this from a completely different angle. 
Let me look at the `SimilarChecker.set_option` method: ```python def set_option(self, optname, value, action=None, optdict=None): BaseChecker.set_option(self, optname, value, action, optdict) if optname == "min-similarity-lines": self.min_lines = self.config.min_similarity_lines ``` And in `SimilarChecker.__init__`: ```python Similar.__init__( self, min_lines=self.config.min_similarity_lines, ... ) ``` It reads from `self.config.min_similarity_lines`. Looking at the option definition: ```python ( "min-similarity-lines", { "default": DEFAULT_MIN_SIMILARITY_LINE, "type": "int", ... }, ), ``` So `self.config.min_similarity_lines` is the integer value from the config. If set to 0, `self.min_lines` would be 0. OK, I need to actually test this. But let me first look at what the actual fix should be. The desired behavior is: setting `min-similarity-lines` to `0` should disable the duplicate code check. The simplest fix would be to add a check in `_iter_sims` or `_find_common` or `close` to return early if `self.min_lines <= 0`. But wait, I just realized something. Let me trace through the algorithm more carefully with `min_lines=0`. In `hash_lineset`: ```python shifted_lines = [iter(lines[i:]) for i in range(min_common_lines)] ``` With `min_common_lines=0`, this creates an empty list. ```python for index_i, *succ_lines in enumerate(zip(*shifted_lines)): ``` `zip(*[])` = `zip()`. In Python, `zip()` with no arguments returns an empty iterator. So this loop body never executes. So `hash2index` is an empty defaultdict(list) and `index2lines` is an empty dict. Then back in `_find_common`: ```python hash_1: FrozenSet[LinesChunk] = frozenset(hash_to_index_1.keys()) hash_2: FrozenSet[LinesChunk] = frozenset(hash_to_index_2.keys()) common_hashes: Iterable[LinesChunk] = sorted( hash_1 & hash_2, key=lambda m: hash_to_index_1[m][0] ) ``` Both are empty frozensets. `hash_1 & hash_2` is empty. `sorted([])` = `[]`. 
```python all_couples: CplIndexToCplLines_T = {} for c_hash in sorted(common_hashes, key=operator.attrgetter("_index")): ... ``` Loop never execut ... [truncated 11,136 characters]
Now let me look at the key function `hash_lineset` and the `_iter_sims` method more closely to understand the flow when `min_lines=0`:
run_bash{ command: "cd /testbed && python -c \"\nfrom pylint.checkers.similar import Similar, hash_lineset, LineSet\n\n# Let's test what happens with min_lines=0\nclass TestLS:\n pass…" }
tool result · run_bash
exitcode: 1
stdout:
stderr:
Traceback (most recent call last):
  File "<string>", line 13, in <module>
  File "/testbed/pylint/checkers/similar.py", line 374, in append_stream
    readlines = stream.readlines
AttributeError: 'list_iterator' object has no attribute 'readlines'
Step 05 tool use
The test failed because `append_stream` expects a stream with `readlines`. Let me fix the test.
run_bash{ command: "cd /testbed && python -c \"\nfrom pylint.checkers.similar import Similar, hash_lineset, LineSet\nfrom io import StringIO\n\n# Create a simple test\nlines1 = 'a = 1\\nb…" }
tool result · run_bash
exitcode: 0
stdout:
Results with min_lines=0: []
Results with min_lines=1: [Commonality(cmn_lines_nb=3, fst_lset=<pylint.checkers.similar.LineSet object at 0x7fffff2b8eb0>, fst_file_start=0, fst_file_end=3, snd_lset=<pylint.checkers.similar.LineSet object at 0x7ffffd39d070>, snd_file_start=0, snd_file_end=3)]
3 similar lines
File 1: test1 [0:3]
File 2: test2 [0:3]
stderr:
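
The empty result for `min_lines=0` follows from how Python's `zip` behaves when it is given no iterables. The sketch below mirrors only the `shifted_lines` expression quoted in the agent's reasoning; `windows` is a hypothetical helper, not a pylint function.

```python
def windows(lines, min_common_lines):
    """Yield tuples of `min_common_lines` consecutive lines (toy version)."""
    shifted_lines = [iter(lines[i:]) for i in range(min_common_lines)]
    # With min_common_lines == 0 this is zip() with no arguments,
    # which is an empty iterator, so nothing is ever yielded.
    yield from zip(*shifted_lines)


lines = ("a = 1", "b = 2", "c = 3")
no_windows = list(windows(lines, 0))
one_line_windows = list(windows(lines, 1))
```

With a window size of 0 there is nothing to hash, which is consistent with the empty list reported for `min_lines=0`.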
Step 06 tool use
So with `min_lines=0`, the check is effectively disabled (returns empty list). The user's claim that "it treats every line of code as duplicate" doesn't seem to be reproducible with the current code. But wait, maybe the issue is about a different version. Or maybe the bug is that while the check is silently disabled, the user wants it to be an explicit feature: setting `min-similarity-lines=0` should be the documented way to disable the check. And perhaps in a previous version, it didn't work correctly. Actually, let me re-read the issue one more time: "Setting `min-similarity-lines` to `0` should stop pylint from checking duplicate code" "Setting `min-similarity-lines` to `0` in the rcfile doesn't disable checking for duplicate code, it instead treats every line of code as duplicate and raises many errors." Hmm, but my test shows it does disable it. Let me check the `SimilarChecker` path too, just to make sure. Actually wait, maybe the issue is that `min-similarity-lines=0` is being set in the rcfile but the config system treats `0` differently? Let me check how the config option parsing works for `int` type. Let me look at how options are parsed... Actually, let me try a different approach. Let me look at what happens when the `SimilarChecker` is used with `min-similarity-lines=0`. Let me check if there's something in the BaseChecker or option parsing that might treat 0 as falsy and use the default instead.
run_bash{ command: "cd /testbed && python -c \"\nfrom pylint.checkers.similar import SimilarChecker\nfrom pylint.lint import PyLinter\nfrom pylint.testutils import GenericTestReporter …" }
tool result · run_bash
exitcode: 1 stdout: stderr: Traceback (most recent call last): File "<string>", line 10, in <module> File "/testbed/pylint/lint/pylinter.py", line 661, in set_option checkers.BaseTokenChecker.set_option(self, optname, value, action, optdict) File "/testbed/pylint/config/options_provider_mixin.py", line 52, in set_option optdict = self.get_option_def(optname) File "/testbed/pylint/config/options_provider_mixin.py", line 88, in get_option_def raise optparse …
Step 07 tool use
The option needs to be set on the SimilarChecker, not the linter directly. Let me try a different approach to test.
run_bash{ command: "cd /testbed && python -c \"\nfrom pylint.checkers.similar import SimilarChecker\nfrom pylint.lint import PyLinter\nfrom pylint.testutils import GenericTestReporter …" }
tool result · run_bash
exitcode: 1
stdout:
min_lines: 0
stderr:
Traceback (most recent call last):
  File "<string>", line 17, in <module>
  File "/testbed/pylint/checkers/similar.py", line 817, in open
    self.stats = self.linter.add_stats(
  File "/testbed/pylint/reporters/reports_handler_mix_in.py", line 75, in add_stats
    assert key not in self.stats
TypeError: argument of type 'NoneType' is not iterable
Step 08 tool use
The test is getting complicated. Let me focus on the actual code fix. From my analysis, I can see that:
1. When `min_lines=0`, `hash_lineset` returns empty dicts (because `range(0)` is empty and the zip loop never executes)
2. `_find_common` finds no commonalities
3. `_iter_sims` yields nothing
4. The check is effectively disabled

But the issue says it doesn't work. Let me look at this from a different angle. Maybe the issue is in how the option value is retrieved. Let me look at `set_option`:

```python
def set_option(self, optname, value, action=None, optdict=None):
    BaseChecker.set_option(self, optname, value, action, optdict)
    if optname == "min-similarity-lines":
        self.min_lines = self.config.min_similarity_lines
```

Note that it's using `self.config.min_similarity_lines` rather than `value`. Could there be a case where `value` is 0 but `self.config.min_similarity_lines` is something else? Let me check. Actually, `BaseChecker.set_option` should set `self.config.min_similarity_lines` to `value`. So it should be fine.

But wait, let me check how options are parsed from the rcfile. Maybe there's an issue with the `int` type parsing that treats 0 as falsy... Let me look at the options parsing code.
run_bash{ command: "cd /testbed && grep -rn \"def set_option\\|def read_config_file\\|def _parse\\|option.*type.*int\" --include=\"*.py\" pylint/config/ | head -30" }
tool result · run_bash
exitcode: 0 stdout: pylint/config/option_manager_mixin.py:239: def read_config_file(self, config_file=None, verbose=None): pylint/config/options_provider_mixin.py:49: def set_option(self, optname, value, action=None, optdict=None): stderr:
Step 09 tool use
Let me check how the options are parsed from the config file, specifically how integer values are handled.
read_file{ path: "/testbed/pylint/config/options_provider_mixin.py" }
tool result · read_file
# Licensed under the GPL: https://www.gnu.org/licenses/old-licenses/gpl-2.0.html # For details: https://github.com/PyCQA/pylint/blob/main/LICENSE import optparse # pylint: disable=deprecated-module from typing import Any, Dict, Tuple from pylint.config.option import _validate class UnsupportedAction(Exception): """raised by set_option when it doesn't know what to do for an action""" class OptionsProviderMixIn: """Mixin to provide options to an OptionsManager"" …
Step 10 tool use
Now let me check the `_validate` function which is called by `set_option`:
read_file{ path: "/testbed/pylint/config/option.py" }
tool result · read_file
# Licensed under the GPL: https://www.gnu.org/licenses/old-licenses/gpl-2.0.html # For details: https://github.com/PyCQA/pylint/blob/main/LICENSE import copy import optparse # pylint: disable=deprecated-module import re from pylint import utils # pylint: disable=unused-argument def _csv_validator(_, name, value): return utils._check_csv(value) # pylint: disable=unused-argument def _regexp_validator(_, name, value): if hasattr(value, "pattern"): return v …
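
The hypothesis under test in these steps is that a configured `0` might be swallowed by a truthiness check and replaced with the default. The sketch below shows that pitfall in isolation; `DEFAULT` and both helper functions are hypothetical and are not taken from pylint's option code.

```python
DEFAULT = 4  # hypothetical default, for illustration only


def resolve_buggy(raw):
    """Falls back to DEFAULT for any falsy value, including a valid 0."""
    value = int(raw) if raw is not None else None
    return value or DEFAULT


def resolve_correct(raw):
    """Falls back to DEFAULT only when no value was supplied."""
    return DEFAULT if raw is None else int(raw)
```

A plain `int` conversion, as in `resolve_correct`, keeps `0` intact, so a parser built that way would not exhibit the pitfall.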
Step 11 tool use
OK, I see that the `int` type uses `int` as the validator, which would convert `"0"` to `0`. So parsing `min-similarity-lines=0` from the rcfile should work correctly and set the value to `0`. Now let me think about this differently. Maybe the bug is NOT about `min_lines=0` causing the algorithm to find false positives, but rather about `min_lines=0` not being treated as a way to disable the check. The user wants `0` to mean "disable", but currently `0` just means "find similarities with at least 0 common lines", which is effectively everything... but due to the algorithm, it actually finds nothing (because the hash loop never executes). But wait, maybe I should actually look more carefully at the algorithm. What if `min_lines=0` causes `hash_lineset` to return something unexpected? Let me trace through again: ```python def hash_lineset( lineset: "LineSet", min_common_lines: int = DEFAULT_MIN_SIMILARITY_LINE ) -> Tuple[HashToIndex_T, IndexToLines_T]: hash2index = defaultdict(list) index2lines = {} lines = tuple(x.text for x in lineset.stripped_lines) shifted_lines = [iter(lines[i:]) for i in range(min_common_lines)] for index_i, *succ_lines in enumerate(zip(*shifted_lines)): ... return hash2index, index2lines ``` With `min_common_lines=0`: - `shifted_lines = []` - `zip(*[])` returns an empty iterator - Loop never executes - Returns `defaultdict(list)` (empty) and `{}` (empty) This is correct - no hashes means no duplicates found. But what if the issue is that `min_lines` being 0 doesn't actually get set correctly? Let me check the `SimilarChecker.__init__`: ```python def __init__(self, linter=None) -> None: BaseChecker.__init__(self, linter) Similar.__init__( self, min_lines=self.config.min_similarity_lines, ... 
) ``` And `set_option`: ```python def set_option(self, optname, value, action=None, optdict=None): BaseChecker.set_option(self, optname, value, action, optdict) if optname == "min-similarity-lines": self.min_lines = self.config.min_similarity_lines ``` So if `min-similarity-lines=0`, then `self.config.min_similarity_lines` should be `0`, and `self.min_lines` should be `0`. I'm now fairly convinced the current code already effectively disables the check when `min_lines=0`. The issue might be about making this behavior explicit and documented. BUT wait, the issue description says "it instead treats every line of code as duplicate and raises many errors." This is a strong claim. Let me check if there's maybe a different version of the code or if I'm missing something. Actually, let me look at the `_find_common` method one more time. There's this check: ```python if eff_cmn_nb > self.min_lines: yield com ``` If `self.min_lines` is 0, then `eff_cmn_nb > 0` would be true for any match. But this line is never reached because `all_couples` is empty when `min_lines=0`. But what if there's a bug in `hash_lineset` that I'm not seeing? Let me look at `shifted_lines` more carefully: ```python shifted_lines = [iter(lines[i:]) for i in range(min_common_lines)] ``` When `min_common_lines=0`, `shifted_lines = []`. Then: ```python for index_i, *succ_lines in enumerate(zip(*shifted_lines)): ``` `zip(*[])` → `zip()` → empty iterator. Loop never executes. I'm going to accept that the current behavior with `min_lines=0` is that the check is silently disabled (returns no results). However, the user wants this to be an explicit feature. The fix should: 1. Explicitly handle `min_lines <= 0` by skipping the check 2. This makes the behavior documented and intentional rather than accidental Let me now implement the fix. 
The best place is in `_iter_sims`: ```python def _iter_sims(self) -> Generator[Commonality, None, None]: """iterate on similarities among all files, by making a cartesian product""" if self.min_lines <= 0: return for idx, lineset in enumerate(self.linesets[:-1]): for lineset2 in self ... [truncated 2,021 characters]
Now I have a clear picture. Let me verify exactly what happens with `min_lines=0` in the current code:
run_bash{ command: "cd /testbed && python -c \"\nfrom pylint.checkers.similar import hash_lineset, LineSet\nfrom collections import defaultdict\n\n# Simulate what happens when min_lines…" }
tool result · run_bash
exitcode: 0
stdout:
stripped_lines: [LineSpecifs(line_number=0, text='a = 1'), LineSpecifs(line_number=1, text='b = 2'), LineSpecifs(line_number=2, text='c = 3')]
hash2index: {}
index2lines: {}
hash2index (4): {}
index2lines (4): {}
stderr:
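
Why both runs print empty dictionaries can be checked by counting sliding windows. This is a sketch under the assumption that hashing produces one entry per window of `k` consecutive lines; `count_windows` is a hypothetical name, not part of pylint.

```python
def count_windows(n_lines, k):
    """Number of tuples zip() yields over k shifted views of n_lines items."""
    lines = tuple(range(n_lines))
    shifted = [iter(lines[i:]) for i in range(k)]
    return sum(1 for _ in zip(*shifted))


three_lines_k4 = count_windows(3, 4)  # the last shifted view is empty
four_lines_k4 = count_windows(4, 4)   # exactly one full window fits
three_lines_k0 = count_windows(3, 0)  # zip() with no arguments
```

Under this model, three lines with a window of 4 and three lines with a window of 0 both give zero windows, but for different reasons: the first runs out of lines, the second never has anything to zip.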
Step 12 tool use
Interesting. With only 3 lines and `min_common_lines=4`, `hash2index` is also empty. That's because `zip(*shifted_lines)` will stop when the shortest iterator is exhausted. With 4 iterators and only 3 lines, the zip will be exhausted immediately (since one iterator starts at index 3 and has no elements). Let me verify with more lines:

Actually wait, `shifted_lines = [iter(lines[0:]), iter(lines[1:]), iter(lines[2:]), iter(lines[3:])]`. When we zip, the 4th iterator is exhausted (empty), so zip produces nothing. With 4 lines, the 4th iterator has 1 element, so zip produces 1 tuple.

OK, but my main analysis holds: with `min_lines=0`, `hash2index` is always empty, so no duplicates are found.

Now, the issue says "it instead treats every line of code as duplicate and raises many errors." This doesn't match my analysis. But it's possible that:
1. The issue is from an older version of the code
2. The user is mistaken about the behavior
3. There's some other code path I'm not seeing

Regardless, the fix should be to explicitly handle `min_lines <= 0` by skipping the check. This makes the behavior intentional and documented. Let me implement the fix now. I'll add guards in two places:
1. `_iter_sims` - to skip the similarity computation entirely when `min_lines <= 0`
2. `hash_lineset` - as a safety measure, return empty dicts when `min_common_lines <= 0`

Actually, looking at it again, I think the cleanest fix is:
1. In `hash_lineset`: add an early return if `min_common_lines <= 0`
2. In `_iter_sims`: add an early return if `self.min_lines <= 0`

This way, both the standalone `Similar` tool and the `SimilarChecker` will correctly skip the check.
Now I understand the issue. When `min_lines=0`:
1. `hash_lineset` creates `shifted_lines = [iter(lines[i:]) for i in range(0)]` which is `[]`
2. `zip(*[])` returns an empty iterator, so the loop never executes
3. `hash2index` and `index2lines` are both empty
4. No similarities are found

So currently, `min_lines=0` *accidentally* disables the check because the hash computation loop doesn't execute. But this is fragile and not intentional. The fix should explicitly handle `min_lines <= 0` by skipping the check early, making the behavior intentional and robust. Let me implement the fix:
edit_file{ path: "/testbed/pylint/checkers/similar.py", old_string: "def hash_lineset(\n lineset: \"LineSet\", min_common_lines: int = DEFAULT_MIN_SIMILARITY_LINE\n) -> Tuple[HashToIndex_T, IndexToLines_T]:\n \"\"\"\n Return two …", new_string: "def hash_lineset(\n lineset: \"LineSet\", min_common_lines: int = DEFAULT_MIN_SIMILARITY_LINE\n) -> Tuple[HashToIndex_T, IndexToLines_T]:\n \"\"\"\n Return two …" }
tool result · edit_file
File /testbed/pylint/checkers/similar.py edited successfully.
Step 13 tool use
Now let me also add a guard in `_iter_sims`:
Now let me add an early return in `_iter_sims` to skip the check entirely when `min_lines <= 0`:
edit_file{ path: "/testbed/pylint/checkers/similar.py", old_string: " def _iter_sims(self) -> Generator[Commonality, None, None]:\n \"\"\"iterate on similarities among all files, by making a cartesian\n product\n …", new_string: " def _iter_sims(self) -> Generator[Commonality, None, None]:\n \"\"\"iterate on similarities among all files, by making a cartesian\n product\n …" }
tool result · edit_file
File /testbed/pylint/checkers/similar.py edited successfully.
Step 14 tool use
Now let me verify the changes look correct by reading the relevant sections of the file:
Let me verify the changes look correct:
read_file{ path: "/testbed/pylint/checkers/similar.py", offset: "209", limit: "50" }
tool result · read_file
ERROR: bad arguments for 'read_file': Toolbox.read_file() got an unexpected keyword argument 'offset'
read_file{ path: "/testbed/pylint/checkers/similar.py", offset: "530", limit: "15" }
tool result · read_file
ERROR: bad arguments for 'read_file': Toolbox.read_file() got an unexpected keyword argument 'offset'
Step 15 tool use
Let me read the file properly:
read_file{ path: "/testbed/pylint/checkers/similar.py" }
tool result · read_file
# Copyright (c) 2006, 2008-2014 LOGILAB S.A. (Paris, FRANCE) <contact@logilab.fr> # Copyright (c) 2012 Ry4an Brase <ry4an-hg@ry4an.org> # Copyright (c) 2012 Google, Inc. # Copyright (c) 2012 Anthony VEREZ <anthony.verez.external@cassidian.com> # Copyright (c) 2014-2020 Claudiu Popa <pcmanticore@gmail.com> # Copyright (c) 2014 Brett Cannon <brett@python.org> # Copyright (c) 2014 Arun Persaud <arun@nubati.net> # Copyright (c) 2015 Ionel Cristian Maries <contact@ionelmc.ro> # Co …
Agent diff · captured before grading
diff --git a/pylint/checkers/similar.py b/pylint/checkers/similar.py
index c65b1c89..9c7c1d9e 100644
--- a/pylint/checkers/similar.py
+++ b/pylint/checkers/similar.py
@@ -238,6 +238,8 @@ def hash_lineset(
     """
     hash2index = defaultdict(list)
     index2lines = {}
+    if min_common_lines <= 0:
+        return hash2index, index2lines
     # Comments, docstring and other specific patterns maybe excluded -> call to stripped_lines
     # to get only what is desired
     lines = tuple(x.text for x in lineset.stripped_lines)
@@ -531,6 +533,8 @@ class Similar:
         """iterate on similarities among all files, by making a cartesian
         product
         """
+        if self.min_lines <= 0:
+            return
         for idx, lineset in enumerate(self.linesets[:-1]):
             for lineset2 in self.linesets[idx + 1 :]:
                 yield from self._find_common(lineset, lineset2)
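
The agent's guards stop the similarity search, whereas the gold patch returns from `run` before anything is displayed. The toy class below illustrates why that placement can matter for a test that asserts empty output. It assumes, without confirmation from this page, that the display step prints a summary line even when no similarities are found; `ToySimilar` and its output text are hypothetical and are not pylint's implementation.

```python
from contextlib import redirect_stdout
from io import StringIO


class ToySimilar:
    """Toy model of a compute-then-display pipeline (not pylint's class)."""

    def __init__(self, min_lines, guard_in_run):
        self.min_lines = min_lines
        self.guard_in_run = guard_in_run

    def _compute_sims(self):
        # Guard in the search step: nothing is found when the check is disabled.
        if self.min_lines <= 0:
            return []
        return ["<similarity>"]

    def _display_sims(self, sims):
        for sim in sims:
            print(sim)
        print(f"TOTAL duplicates={len(sims)}")  # assumed summary line

    def run(self):
        # Guard in the entry point: skip the display step entirely.
        if self.guard_in_run and self.min_lines == 0:
            return
        self._display_sims(self._compute_sims())


def captured(guard_in_run):
    """Run the toy pipeline with min_lines=0 and return what it printed."""
    out = StringIO()
    with redirect_stdout(out):
        ToySimilar(min_lines=0, guard_in_run=guard_in_run).run()
    return out.getvalue()


search_guard_only = captured(guard_in_run=False)
run_guard = captured(guard_in_run=True)
```

In this toy model, guarding only the search still prints the summary line, while guarding `run` prints nothing. If the real display step behaves the same way, that would be one plausible explanation for the hidden test not passing with the agent's patch.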