Federated Republic of Sean is a user on retro.social.
Federated Republic of Sean @freakazoid

Anyone know of a command-line RSS to EPUB converter that will merge all new articles across multiple feeds into a single book, preferably with an option for full-text retrieval?

@Qwxlea @freakazoid
This is something I've been meaning to get set up for a while but haven't had time. Calibre can definitely do it, but it will take a little manual labor to get it all set up. I'm not aware of any other tools.

@freakazoid I know how to hack one together in a few lines, but not anything that exists off the shelf/in one step.

Pandoc will output epub, and is easy to customize. Should concatenate the files too, with the right command line options.

So it's just a matter of fetching the feed, doing full article retrieval, outputting to some temp files, and then feeding the results into pandoc.

I have a bash RSS parser tucked away somewhere, let me dig it up.
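In the meantime, the fetch → temp file → pandoc pipeline described above can be sketched in stdlib-only Python (no feedparser needed for plain RSS 2.0). This is a rough sketch, not a finished tool: it assumes `pandoc` is on `PATH` and that the feeds put their full text in `<description>`.

```python
import subprocess
import tempfile
import urllib.request
import xml.etree.ElementTree as ET

def parse_feed(xml_text):
    """Extract title/link/description from each <item> in an RSS 2.0 document."""
    root = ET.fromstring(xml_text)
    return [
        {
            "title": item.findtext("title", ""),
            "link": item.findtext("link", ""),
            "description": item.findtext("description", ""),
        }
        for item in root.iter("item")
    ]

def build_html(items):
    """Concatenate all articles into one HTML document for pandoc."""
    parts = ["<html><body>"]
    for it in items:
        parts.append("<h1>%s</h1>" % it["title"])
        parts.append(it["description"])  # full text, if the feed provides it
    parts.append("</body></html>")
    return "\n".join(parts)

def feeds_to_epub(urls, out="digest.epub"):
    """Fetch every feed, merge all articles, and hand the result to pandoc."""
    items = []
    for url in urls:
        with urllib.request.urlopen(url) as resp:
            items.extend(parse_feed(resp.read()))
    with tempfile.NamedTemporaryFile("w", suffix=".html", delete=False) as f:
        f.write(build_html(items))
    subprocess.run(["pandoc", f.name, "-o", out], check=True)
```

Full-article retrieval for excerpt-only feeds would still need a separate step before `build_html`.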

@freakazoid

I originally adapted my RSS parser from here: bbs.archlinux.org/viewtopic.ph

I don't know where my changes got to, but that might help.

@ajroach42 @freakazoid interesting, this could become a newspaper generator

@freakazoid @saper It looks like newsbeuter has the option to save articles locally, does full article downloads when the content is just a link, and will accept command line arguments.

I (don't think I have|haven't) tried to use it for making offline copies of feeds, but I assume it will handle that handily.

Once that's done, invoking Pandoc should be pretty simple.

@saper @freakazoid If you don't have a chance to poke at it this week or next, I'll have some time after we move, and would probably get some use out of this.

@ajroach42 @saper There does not appear to be a straightforward way to get Newsbeuter to save all unread articles automatically. The only two commands it supports as arguments are "reload" and "print-unread". The latter only prints the number of unread articles. What I want is "save-all".

@freakazoid @saper Darn.

Okay, I'll look into it some more after work.

@ajroach42 @saper I could write a script to dump its cache, but then I'd have to manage "read" state somehow, and then I might as well just use feedparser and the pandoc wrapper from Python. And maybe some HTML simplification crap if there's anything Pandoc can't handle.

@freakazoid @saper I'm going to be really surprised if this isn't a solved problem already.

@ajroach42 @freakazoid @saper
I would love something like this as well. I want my feeds on my ereader and calibre is kind of overkill and not real simple to do this in an automated way. Nice find on the newsbeuter thing! I'm going to look into this as well.

@kelbot @ajroach42 @freakazoid The fastest way might be a patch to #newsbeuter (also new to me, thanks!)

Can't find this mentioned here, hard to believe nobody thought about it - so maybe it's there?

github.com/akrennmair/newsbeut

Newsbeuter seems to be unmaintained though, README points to #newsboat instead

github.com/newsboat/newsboat

@kelbot @ajroach42 @freakazoid
I think #newsboat is maintained by @minoru . But the plan is to rewrite it in Rust (github.com/newsboat/newsboat/i), so I don't know what the prospects are of adding such a feature.

@saper
I still feel like even newsbeuter/newsboat may be more than is needed for this but I'm failing at finding the right combination of tools so far.

I know of rsstail, which is a CLI tool that just checks a feed at defined intervals and outputs the article if there is a new one. Piping that to pandoc?
@ajroach42 @freakazoid @minoru

@kelbot @ajroach42 @freakazoid @minoru Sure, just wanted to check whether someone had already requested that functionality.

@kelbot Newsboat isn't a good fit because it's interactive.

`rsstail | pandoc` is what I'd recommend, but be aware that rsstail doesn't persist the list of already-seen items, so re-running the command will produce duplicates.

@saper @ajroach42 @freakazoid
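One way around that missing persistence is a small seen-items cache keyed on a hash of each entry. This is a sketch of the idea, not anything rsstail provides itself; `filter_new` and the file format are made up for illustration:

```python
import hashlib

def filter_new(items, seen_path):
    """Return only items whose hash isn't already recorded in seen_path,
    then record them so the next run skips them."""
    try:
        with open(seen_path) as f:
            seen = set(f.read().split())
    except FileNotFoundError:
        seen = set()  # first run: nothing seen yet
    fresh = []
    for item in items:
        key = hashlib.sha1(item.encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            fresh.append(item)
    with open(seen_path, "w") as f:
        f.write("\n".join(sorted(seen)))
    return fresh
```

Hashing the item's GUID or link (rather than its full text) would also survive feeds that tweak article bodies after publication.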

@minoru @ajroach42 @saper @kelbot Duplicates aren't a dealbreaker since I was planning on making a new "edition" each day. It does put a damper on my idea to have it be in forward-chronological instead of reverse-chronological order, though.

What would be awesome would be a set of tools following the UNIX philosophy, where you start with a tool that outputs an RSS feed as a set of single-line JSON objects, then others to sort and filter, jq to extract the actual content, etc.
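The first tool in that chain — feed in, one JSON object per line out — is a small stdlib-only program. The field names here are illustrative, not any established schema:

```python
import json
import xml.etree.ElementTree as ET

def rss_to_jsonl(xml_text):
    """Emit one JSON object per <item>, one per line, ready to pipe
    into jq, sort, or a filter stage."""
    root = ET.fromstring(xml_text)
    for item in root.iter("item"):
        yield json.dumps({
            "title": item.findtext("title", ""),
            "link": item.findtext("link", ""),
            "date": item.findtext("pubDate", ""),
            "content": item.findtext("description", ""),
        })
```

Wrapped in a tiny stdin/stdout script, it would compose exactly as described, e.g. `curl -s $FEED | rss2jsonl | jq -r .link` (script name hypothetical).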

@freakazoid Actually, there is a --newer=<date> flag that will help you make your "editions" different by including only articles published in the last 24 hours.

The order is reversible with --reverse, too.

A set of tools would indeed be awesome, but it seems like RSS doesn't get much developer attention these days. rsstail is barely maintained, FF just dropped its built-in reader, Newsboat accumulates bugs faster than they're fixed, etc.

@kelbot @saper @ajroach42

@minoru @ajroach42 @saper @kelbot Some old tools have been dropped, but then their maintenance seems to be getting picked up again, and I haven't had trouble lately finding RSS feeds for the stuff I want. Feedparser, for example, is maintained again. While the latest release of RSSTail, 2.1, is a bit long in the tooth, it seems like it might have picked up some idea of "newness" beyond the -o option in the 1.8 release that's in Debian.

@kelbot @saper @ajroach42 @minoru Doing it by date as you suggest will probably work better for my purposes than keeping state, though it'd be even better to have --older as well so that one could generate "back-issues".
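Date-windowed editions (including the back-issue idea) are easy to express once the pubDate is parsed; `in_edition` is a hypothetical helper, not an rsstail feature:

```python
from datetime import timedelta
from email.utils import parsedate_to_datetime

def in_edition(pubdate, edition_end, days=1):
    """True if an article's RFC-2822 pubDate falls in the window ending
    at edition_end. Passing a past edition_end produces a "back-issue";
    days widens the window."""
    published = parsedate_to_datetime(pubdate)
    return edition_end - timedelta(days=days) <= published <= edition_end
```

`parsedate_to_datetime` handles the RFC-2822 dates RSS feeds use, so this pairs naturally with the `pubDate` field of each item.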

@minoru @ajroach42 @saper @kelbot Hmm, seems like rsstail doesn't want to exit, though - instead of simulating tail, it really seems to be simulating tail -f.

@freakazoid @saper @ajroach42 @minoru
That's exactly what I've had in mind. Have a computer automatically pull down the new articles from my feeds and convert them to epub once a day. Plug my ereader in first thing in the morning and get my day's articles.

You're right, duplicates aren't a big deal if it's a daily digest type thing. I can live with that. Certain feeds not including the full content is a bigger issue.

@kelbot @minoru @saper @freakazoid So we write a check that looks to see if the content of the item is just a link.

If it's just a link, we pass it to w3m?
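A rough version of that check might look like this — the function names and the regex are made up for illustration, and the fallback assumes w3m is installed:

```python
import re
import subprocess

# Matches item content that is nothing but a single anchor, optionally in a <p>
LINK_ONLY = re.compile(r'^\s*(?:<p>)?\s*<a href="[^"]+">[^<]*</a>\s*(?:</p>)?\s*$')

def is_link_only(html):
    """True if the feed item's content is just a link to the real article."""
    return bool(LINK_ONLY.match(html))

def fetch_full_text(url):
    """Render the linked page to plain text with w3m."""
    return subprocess.run(["w3m", "-dump", url],
                          capture_output=True, text=True, check=True).stdout
```

Note `w3m -dump` produces plain text, so the result would need light re-wrapping before it goes to Pandoc as HTML.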

@ajroach42 @freakazoid @saper @minoru
I have some feeds that include a few lines of the article and then a link. Would that make things more difficult?

@kelbot @minoru @saper @freakazoid Yeah...

I mean, we can still check the content of the download and look for flags that it might be truncated (lots of truncated feeds end in "more") but it'd get messy quickly.

@ajroach42 @freakazoid @saper @minoru
Lilliputing, for example, posts the first 3 lines with "[…]" at the end, then an additional line with a link to the full article and the main site. Some others are like you described, 3 lines and "more..." at the end. There will probably be odd feeds here and there with some similar-but-different convention.
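Those per-site conventions could start life as a shared marker list, extended feed by feed as the odd cases turn up. This heuristic is a sketch; the marker set and the 60-character tail window are guesses that would need tuning:

```python
# Common tails that suggest a feed carries only an excerpt
TRUNCATION_MARKERS = ("[…]", "[...]", "more...", "read more", "continue reading")

def looks_truncated(text):
    """True if the item's content ends with a known excerpt marker."""
    tail = text.strip().lower()[-60:]
    return any(marker.lower() in tail for marker in TRUNCATION_MARKERS)
```

Anything this misses falls through to the per-feed manual setting discussed below, so false negatives are annoying but not fatal.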

@kelbot @minoru @saper @ajroach42 I think this should just be set manually on a per-feed basis, the way it typically works with interactive RSS readers.

@freakazoid
Good point. Feeds don't get added or removed often enough for that to be a hassle.
@ajroach42 @saper @minoru

@kelbot @minoru @saper @ajroach42 As I thought, an RSS to JSON converter was trivial to write. I kinda feel like a tiny command line program like this should be in Rust or Go or something, though, because to make it work you still need to install feedparser, which if you're being a purist will require a virtualenv, etc.

github.com/seanlynch/misc_pyth

@kelbot
I managed to make an epub by extracting the URLs from a feed and using wget -E -k -p, but it looks like crap because of all the extraneous nonsense on each page. So I'm going to need to replicate Firefox's "reader view" or whatever Pocket does. I'm thinking some kind of heuristics to figure out the content element(s) and then strip out any tags not in an allowlist, along with most styles. Also push headings down.
@minoru @saper @ajroach42
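The allowlist-plus-heading-demotion half of that is doable with the stdlib's HTMLParser; finding the main content element (the readability-style scoring) is the harder half and isn't attempted in this sketch:

```python
from html.parser import HTMLParser

KEEP = {"p", "a", "em", "strong", "ul", "ol", "li", "blockquote", "pre",
        "code", "h1", "h2", "h3", "h4", "h5", "h6", "img", "br"}
DEMOTE = {"h1": "h2", "h2": "h3", "h3": "h4", "h4": "h5", "h5": "h6"}

class Cleaner(HTMLParser):
    """Keep only allowlisted tags (dropping the rest but keeping their text),
    strip styles/classes, and push headings down one level."""
    def __init__(self):
        super().__init__()
        self.out = []
    def handle_starttag(self, tag, attrs):
        if tag not in KEEP:
            return  # tag dropped; its text still arrives via handle_data
        tag = DEMOTE.get(tag, tag)
        if tag in ("a", "img"):
            kept = " ".join('%s="%s"' % (k, v) for k, v in attrs
                            if k in ("href", "src", "alt"))
            self.out.append("<%s %s>" % (tag, kept) if kept else "<%s>" % tag)
        else:
            self.out.append("<%s>" % tag)  # all attributes/styles dropped
    def handle_endtag(self, tag):
        if tag in KEEP:
            self.out.append("</%s>" % DEMOTE.get(tag, tag))
    def handle_data(self, data):
        self.out.append(data)

def clean(html):
    c = Cleaner()
    c.feed(html)
    return "".join(c.out)
```

Pandoc accepts the cleaned fragment directly, so this slots in between the wget step and the epub conversion.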

@ajroach42 @saper @minoru @kelbot
Extracting the content from each page seems like it could be a standalone project. Or maybe just have an updatable config file with the heuristics. I imagine something along the lines of spamassassin though hopefully much smaller.

@freakazoid @kelbot @minoru @ajroach42 w3m or lynx may do something reasonable but markup-free (it will be plain text)

@ajroach42
@minoru @saper @freakazoid
While the whole calibre suite is overkill maybe it can still help. This page in the calibre manual has quite a bit of detail on how they put the news recipes together for feeds.

manual.calibre-ebook.com/news.

@ajroach42 @minoru @saper @freakazoid
Thinking and looking at this some more. On the calibre github there are 1600+ recipes. The command to run a recipe is "ebook-convert myrecipe.recipe myrecipe.epub". Maybe using all the work already done by calibre and the community is a more efficient solution.

@kelbot @saper @minoru @ajroach42 If I can do it from a cron job without needing a machine with a display, I have no problem using Calibre to do it.

@ajroach42 @minoru @saper @kelbot That is, if someone else figures out how to use Calibre that way. Personally, I'd prefer to put the effort into writing Python code that only pulls in feedparser, lxml, and pandoc.

@ajroach42 @kelbot @saper @minoru Just need to figure out how to do those things from the command line.

The inability to do much of anything without a display has been one of the most infuriating things for me about Calibre.

@freakazoid @minoru @ajroach42 A rewrite is probably going to be inefficient; is the CLI thing suggested by @kelbot anything to start with? It would be much better to direct effort toward making that part of Calibre a library that can be re-used.

(I'll dedicate this to all folk complaining that #node projects have too many #javascript modules; this is a counter-proof)

@freakazoid
Ditto, I would love to be able to use just the CLI recipe epub conversion without loading any GUI. It looks like that's possible but I think you still have to install the whole thing to get the CLI tools.
@minoru @saper @ajroach42

@freakazoid @minoru @saper @ajroach42
Just tested the CLI ebook-convert tool that comes with calibre. It is wonderfully simple to use: "ebook-convert feedtitle.recipe feedtitle.epub" and voila. It appears you can just install calibre and start using the CLI tools without ever launching the main program. Would be great if we could pull out just the bits we want to use without the rest.

@kelbot
Great! Does it handle fetching the feed and cleaned-up fulltext where necessary, too? Because that would make it a no-brainer. If not, then we'd still need to do the same stuff one would have to do to use Pandoc. Not that it's that hard, but I was thinking that part could be done directly in Python using feedparser and lxml.
@ajroach42 @saper @minoru

@freakazoid
@minoru @saper @ajroach42
I believe the purpose of all the included recipes is that someone has worked that all out and shared it but it's site by site. The default behavior if you add a custom news source just grabs the feed as is. The good news is there are over 1600 recipes already there. So at least some feeds we're looking for may be there already or we have a bunch of examples to go off of.

@saper
Oh! I didn't realize the recipes were for specific sites. This changes everything!
@ajroach42 @minoru @kelbot

@freakazoid
Yep! I need to figure out my workflow and decide which system I want to install it on. How often I want to fetch the feeds etc.

I want to start working the ereader into my routine. It's a nice efficient device that's easy on the eyes.
@minoru @ajroach42 @saper

@minoru @ajroach42 @saper @kelbot "0 upgraded, 274 newly installed, 0 to remove and 0 not upgraded.
Need to get 144 MB of archives.
After this operation, 625 MB of additional disk space will be used."

Umm, nope, never mind.

@freakazoid @saper @ajroach42 @minoru
Yeah, unfortunately it is rather large and has quite a few dependencies due to all the other functionality included in calibre.

How difficult do you think it would be to pull out just the cli tools? It wouldn't be so bad to pull a few python programs from the github and install a couple dependencies manually.

@freakazoid @saper @ajroach42 @minoru
I will probably go ahead and use it anyway for now since it seems like it is the only option that's usable right now.

@freakazoid @saper @ajroach42 @minoru
Update: I've been playing with calibre to convert feeds to ebooks some more and it's promising. The conversion actually works really well, and it's super simple to put the commands in a script and schedule a cron job to run them. It is still a rather large program if all you're using are the CLI tools, but it works well. There is a flatpak of calibre available if that's more appealing.
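A cron-able wrapper around ebook-convert might look like this sketch — the `~/recipes` layout (one .recipe file per feed) and the output naming are assumptions, and ebook-convert must be on `PATH`:

```python
import datetime
import pathlib
import subprocess

RECIPES = pathlib.Path("~/recipes").expanduser()   # one .recipe per feed
OUTDIR = pathlib.Path("~/editions").expanduser()

def edition_path(stem, date, outdir=OUTDIR):
    """Name each output after the recipe and the date,
    e.g. lilliputing-2024-01-02.epub."""
    return outdir / ("%s-%s.epub" % (stem, date.isoformat()))

def build_editions():
    """Run ebook-convert over every recipe; point cron at this."""
    today = datetime.date.today()
    OUTDIR.mkdir(parents=True, exist_ok=True)
    for recipe in sorted(RECIPES.glob("*.recipe")):
        subprocess.run(
            ["ebook-convert", str(recipe), str(edition_path(recipe.stem, today))],
            check=True)
```

Dating the filenames keeps each day's run from clobbering the last, which fits the daily-edition idea from earlier in the thread.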

@freakazoid @saper @ajroach42 @minoru
It would also be pretty amazing if there was a way to have the system detect when my specific ereader was plugged in and kick off an rsync command to copy the new epubs to the device automatically. Any idea if that's possible?

@kelbot @minoru @ajroach42 @saper I would be surprised if there weren't a way to do this in whatever's automounting the device under /run/media/<username>. Systemd?

@freakazoid @saper @minoru @kelbot Outside of systemd, this kind of polling daemon would be trivial to run in the background.

I don't know how to do it on a modern distro, but I imagine it'd be pretty simple.
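Outside of systemd/udev, that polling daemon could be as simple as this sketch — the mount point and `/books/` destination are hypothetical and would need adjusting for the actual device label:

```python
import os
import subprocess
import time

# Hypothetical mount point; adjust for your ereader's volume label
MOUNT = "/run/media/%s/EREADER" % os.environ.get("USER", "")
SRC = os.path.expanduser("~/editions/")

def just_plugged_in(mounted_now, mounted_before):
    """Fire only on the False -> True edge, so we sync once per plug-in."""
    return mounted_now and not mounted_before

def watch(poll_seconds=5):
    """Poll for the ereader's mount point and rsync new epubs when it appears."""
    was_mounted = False
    while True:
        mounted = os.path.ismount(MOUNT)
        if just_plugged_in(mounted, was_mounted):
            subprocess.run(["rsync", "-av", SRC, MOUNT + "/books/"], check=True)
        was_mounted = mounted
        time.sleep(poll_seconds)
```

The edge detection matters: syncing on "is mounted" rather than "just became mounted" would re-run rsync every few seconds while the device sits in the dock.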

@ajroach42 @kelbot @minoru @saper My guess is that there is a way to make this work under the user systemd directory as well. askubuntu.com/a/679600

@kelbot I'd read a blog post describing how you put it all together!

@freakazoid @saper @ajroach42

@kelbot @ajroach42 @freakazoid The issue with generating good offline copies is not RSS/Atom handling. It is figuring out all the components that need to be downloaded to produce a usable offline version. But I guess sites built only on #JavaScript are out of reach for typical #ebook formats? (EPUB is basically HTML, but I don't think one should try to squeeze JS in there.)

@saper @ajroach42 @kelbot I wonder if there is a tool that will load a web page in a browser, wait for the page load to complete and for any remaining JS events to finish running, then dump out the DOM as HTML and associated CSS?

@saper @kelbot @freakazoid Because javascript engines are a nightmare.

@freakazoid Calibre might do this. It has command line tools and can turn a single feed into a book. I’m sure there must be a way to give it multiple inputs.

@freakazoid You can use newsbeuter / newsboat to export article URLs. Retrieve articles. A trivial bit of code will grab just the pristine article and accurate metadata.[1] Pandoc can create your ePub from HTML.

Notes:

1. Hahahahahahaha!!!!! Have you SEEN the shit websites pump out these days? Your scheme will fail here, though readability's JS HTML renderer may help.

I find hand tuning/touchup is almost always necessary.

@freakazoid Hmmm... I may have to write one one of these days as an experiment.