NY Times articles don't return full text #2

@jfmontanaro

If you run an NY Times article through Readable like so, you'll notice that only 9 paragraphs (in this case) are captured, out of the 30 or so in the original article.
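For anyone who wants to reproduce the count, here is a rough sketch of how the comparison could be made. It is not part of Readable or readability-rs: `count_paragraphs` is a hypothetical helper that uses only the standard library and simply counts opening `<p>` tags in a string of HTML, so it is an approximation rather than a real parser.

```rust
/// Counts opening `<p>` tags in an HTML fragment (case-insensitive).
/// Hypothetical helper for illustration; it does not parse HTML, so `<p`
/// inside comments, scripts, or attribute values would be miscounted.
fn count_paragraphs(html: &str) -> usize {
    // ASCII lowercasing keeps byte offsets identical to the input.
    let lower = html.to_ascii_lowercase();
    let bytes = lower.as_bytes();
    let mut count = 0;
    let mut i = 0;
    while let Some(pos) = lower[i..].find("<p") {
        let after = i + pos + 2;
        // Accept `<p>`, `<p ...>`, and `<p/>`; reject `<pre>`, `<param>`, etc.
        match bytes.get(after) {
            Some(b'>') | Some(b' ') | Some(b'\t') | Some(b'\n') | Some(b'\r') | Some(b'/') => {
                count += 1
            }
            _ => {}
        }
        i = after;
    }
    count
}
```

Calling it once on the original page's HTML and once on the extracted output gives the two numbers to compare (roughly 30 versus 9 in the case described above).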

Clearly this is an upstream bug in readability-rs, not Readable, so why am I raising an issue here? Unfortunately it looks like readability-rs may be abandoned: there's been no activity from the maintainer since April of 2021, and a couple of trivial pull requests from June and September of that year are still outstanding, which makes me think the maintainer is unlikely to show up again any time soon.

I think I've identified the bug in readability-rs: the value of this line is negated when it shouldn't be, or at least I don't think it should be. I haven't fully grokked the scoring algorithm, but when I remove the negation and test against an NY Times article, it seems to extract the whole article as I would expect.

So I see three possible courses of action:

  • Submit a pull request to readability-rs and see if it gets a response (this has the advantage that if I'm wrong and there's actually some good reason the algorithm is done this way, the maintainer will presumably know)
  • Fork readability-rs and fix the bug
  • Not bother with any of this since it seems to work okay on most sites and it is, after all, just a fun side project that you threw together for the heck of it. :)

I'll probably go ahead and submit the PR to readability-rs regardless, just in case it gets a response, but I thought I'd check with you here first since I only care about readability-rs insofar as it affects this project.
