If you run an NY Times article through Readable like so, you'll notice that only 9 (in this case) paragraphs are captured, out of the 30 or so in the original article.
Clearly this is an upstream bug in readability-rs, not Readable, so why am I raising an issue here? Unfortunately it looks like readability-rs may be abandoned: there's been no activity from the maintainer since April of 2021, and there are a couple of trivial pull requests still outstanding from June and September of that year, which makes me think the maintainer is unlikely to show up again any time soon.
I think I've identified the bug in readability-rs: the value of this line is negated when it shouldn't be, or at least I don't think it should be. I haven't fully grokked the scoring algorithm, but when I remove the negation and test against a NY Times article it seems to extract the whole article as I would expect.
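To illustrate the kind of failure I suspect, here is a toy model of how one wrongly negated term can make a scorer pick a small wrapper over the full article body. This is not the readability-rs code: `Candidate`, `paragraph_bonus`, `score`, and `best` are all made-up names, and the numbers are invented purely for the example.

```rust
// Toy model only: none of these names or numbers come from readability-rs.
#[derive(Debug)]
struct Candidate {
    name: &'static str,
    base_score: f32,      // hypothetical score for the node itself
    paragraph_bonus: f32, // hypothetical bonus contributed by nested paragraphs
}

// Score a candidate. `negate_bonus` mimics the suspected bug:
// the bonus is subtracted when it should be added.
fn score(c: &Candidate, negate_bonus: bool) -> f32 {
    let bonus = if negate_bonus { -c.paragraph_bonus } else { c.paragraph_bonus };
    c.base_score + bonus
}

// Pick the highest-scoring candidate, as a "top candidate" selection would.
fn best(candidates: &[Candidate], negate_bonus: bool) -> &Candidate {
    candidates
        .iter()
        .max_by(|a, b| score(a, negate_bonus).total_cmp(&score(b, negate_bonus)))
        .expect("candidate list must not be empty")
}

// Invented sample data: a short intro wrapper and the full article body.
fn sample_candidates() -> Vec<Candidate> {
    vec![
        Candidate { name: "intro-wrapper", base_score: 10.0, paragraph_bonus: 2.0 },
        Candidate { name: "article-body", base_score: 8.0, paragraph_bonus: 12.0 },
    ]
}

fn main() {
    let candidates = sample_candidates();
    println!("with negation:    {}", best(&candidates, true).name);
    println!("without negation: {}", best(&candidates, false).name);
}
```

In this sketch the paragraph-heavy node is penalized exactly because it contains the most text, which would match the symptom of only the first few paragraphs being captured. Whether the real scoring code behaves this way is precisely what I haven't verified.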
So I see three possible courses of action:
- Submit a pull request to readability-rs and see if it gets a response (this has the advantage that if I'm wrong and there's actually some good reason the algorithm is done this way, the maintainer will presumably know)
- Fork readability-rs and fix the bug
- Not bother with any of this since it seems to work okay on most sites and it is, after all, just a fun side project that you threw together for the heck of it. :)
I'll probably go ahead and submit the PR to readability-rs regardless, just in case it gets a response, but I thought I'd check with you here first since I only care about readability-rs insofar as it affects this project.