Anyone know of a command-line RSS to EPUB converter that will merge all new articles across multiple feeds into a single book, preferably with an option for full-text retrieval?
A start would be this:
Calibre comes with a bunch of command-line tools.
@freakazoid I know how to hack one together in a few lines, but not anything that exists off the shelf/in one step.
Pandoc will output epub, and is easy to customize. Should concatenate the files too, with the right command line options.
So it's just a matter of fetching the feed, doing full article retrieval, and outputting to some temp files, and then feeding the results in to pandoc.
I have a bash RSS parser tucked away somewhere, let me dig it up.
I (don't think I have|haven't) tried to use it for making offline copies of feeds, but I assume it will handle that handily.
Once that's done, invoking Pandoc should be pretty simple.
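For what it's worth, here is a minimal sketch of that Pandoc step in Python. It assumes the articles have already been fetched into local HTML files; the function names, file names, and title are made up for illustration. Pandoc concatenates multiple input files before converting, which is what gives you a single book.

```python
import shutil
import subprocess


def pandoc_command(html_files, output, title):
    """Build a pandoc argv that merges several HTML files into one EPUB.

    The file names and title are whatever the caller passes in; nothing
    here is specific to any particular feed reader.
    """
    return [
        "pandoc",
        "--from", "html",
        "--to", "epub",
        "--metadata", f"title={title}",
        "--output", str(output),
        *[str(path) for path in html_files],
    ]


def build_epub(html_files, output, title):
    """Run pandoc, failing loudly if it is missing or exits non-zero."""
    if shutil.which("pandoc") is None:
        raise RuntimeError("pandoc not found on PATH")
    subprocess.run(pandoc_command(html_files, output, title), check=True)
```

Calling build_epub(["a.html", "b.html"], "daily.epub", "Daily feeds") would then produce one EPUB containing both articles in the order given.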
Can't find this mentioned here, hard to believe nobody thought about it - so maybe it's there?
Newsbeuter seems to be unmaintained, though; its README points to #newsboat instead.
I still feel like even newsbeuter/newsboat may be more than is needed for this but I'm failing at finding the right combination of tools so far.
@minoru @ajroach42 @saper @kelbot Duplicates aren't a dealbreaker since I was planning on making a new "edition" each day. It does put a damper on my idea to have it be in forward-chronological instead of reverse-chronological order, though.
What would be awesome would be a set of tools following the UNIX philosophy, where you start with a tool that outputs an RSS feed as a set of single-line JSON objects, then others to sort and filter, jq to extract the actual content, etc.
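A rough sketch of what that first tool might look like, using only the Python standard library. It assumes a plain RSS 2.0 document (no Atom, no namespaced extensions), and the field names in the JSON output are my own choice, not any standard.

```python
import json
import xml.etree.ElementTree as ET


def rss_to_json_lines(xml_text):
    """Yield one single-line JSON object per <item> in an RSS 2.0 document."""
    root = ET.fromstring(xml_text)
    for item in root.iter("item"):
        record = {
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            "published": item.findtext("pubDate", default=""),
            "summary": item.findtext("description", default=""),
        }
        # json.dumps escapes embedded newlines, so each record stays on one line.
        yield json.dumps(record, ensure_ascii=False)
```

Wrapped in a tiny script that reads the feed from stdin and prints each line, the output could be piped through sort, grep, or jq before anything gets turned into a book.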
@freakazoid Actually, there is a --newer=<date> flag that will help you make your "editions" different by including only articles published in the last 24 hours.
The order is reversible with --reverse, too.
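If you end up rolling your own instead, the same two ideas (a 24-hour cutoff and forward-chronological order) are easy to sketch. This is not how any existing tool implements its flags; it assumes items arrive as JSON lines with an RFC 822 date in a hypothetical "published" field.

```python
import json
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime


def daily_edition(json_lines, now=None, window=timedelta(hours=24)):
    """Keep items published within `window` before `now`, oldest first."""
    now = now or datetime.now(timezone.utc)
    kept = []
    for line in json_lines:
        item = json.loads(line)
        try:
            published = parsedate_to_datetime(item["published"])
        except (KeyError, TypeError, ValueError):
            continue  # skip items without a parseable date
        if published.tzinfo is None:
            # Assumption: treat dates without a zone as UTC.
            published = published.replace(tzinfo=timezone.utc)
        if now - window <= published <= now:
            kept.append((published, item))
    kept.sort(key=lambda pair: pair[0])
    return [item for _, item in kept]
```

Running it once a day from the same scheduled job gives non-overlapping editions only if the job runs at a consistent time; otherwise articles near the cutoff can repeat or get dropped.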
A set of tools would indeed be awesome, but it seems like RSS doesn't get much developer attention these days. rsstail is barely maintained, FF just dropped its built-in reader, Newsboat accumulates bugs faster than they're fixed, etc.
@minoru @ajroach42 @saper @kelbot Some old tools have been dropped, but then their maintenance seems to be getting picked up again, and I haven't had trouble lately finding RSS feeds for the stuff I want. Feedparser, for example, is maintained again. While the latest release of RSSTail, 2.1, is a bit long in the tooth, it seems like it might have picked up some idea of "newness" beyond the -o option in the 1.8 release that's in Debian.
@freakazoid @saper @ajroach42 @minoru
That's exactly what I've had in mind. Have a computer automatically pull down the new articles from my feeds and convert them to epub once a day. Plug my ereader in first thing in the morning and get my day's articles.
You're right, duplicates aren't a big deal if it's a daily digest type thing. I can live with that. Certain feeds not including the full content is a bigger issue.
@ajroach42 @freakazoid @saper @minoru
Lilliputing, for example, posts the first 3 lines with "[…]" at the end, then an additional line with a link to the full article and the main site. Some others are like you described, 3 lines and "more..." at the end. There will probably be odd feeds here and there with some similar but different convention.
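A crude way to decide which entries need full-text retrieval is to look for those markers in the feed summary. The sketch below only knows the two conventions mentioned here ("[…]" and "more..."); the pattern and the function name are assumptions and would need extending for feeds with other habits.

```python
import re
from html import unescape

# Assumed truncation markers: "[…]" / "[...]" anywhere, or "more..." / "more…".
_MARKER_RE = re.compile(r"\[\s*(?:…|\.{3})\s*\]|\bmore\s*(?:…|\.{3})", re.IGNORECASE)


def looks_truncated(summary_html):
    """Guess whether a feed summary is a teaser rather than the full article."""
    # Strip tags first, then decode entities such as &#8230; into "…".
    text = unescape(re.sub(r"<[^>]+>", " ", summary_html))
    return bool(_MARKER_RE.search(text))
```

It will misfire on a full article that happens to contain "more...", so treat a positive as "worth fetching the page" rather than proof.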
@kelbot @minoru @saper @ajroach42 As I thought, an RSS to JSON converter was trivial to write. I kinda feel like a tiny command line program like this should be in Rust or Go or something, though, because to make it work you still need to install feedparser, which if you're being a purist will require a virtualenv, etc.
I managed to make an epub by extracting the URLs from a feed and using wget -E -k -p, but it looks like crap because of all the extraneous nonsense on each page. So I'm going to need to replicate Firefox's "reader view" or whatever Pocket does. I'm thinking some kind of heuristics to figure out the content element(s) and then strip out any tags not in an allowlist, along with most styles. Also push headings down.
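The allowlist and heading parts of that can be sketched with the standard library's HTMLParser. This does not attempt the hard bit (guessing which element holds the content); it only drops tags outside an allowlist, discards all attributes except href on links, skips the contents of obviously non-article elements, and pushes headings down one level. The tag sets are my own guesses at a reasonable starting point.

```python
from html import escape
from html.parser import HTMLParser

HEADINGS = {f"h{n}" for n in range(1, 7)}
ALLOWED = HEADINGS | {"p", "a", "em", "strong", "blockquote", "ul", "ol",
                      "li", "pre", "code", "br"}
VOID = {"br"}
KEEP_ATTRS = {"a": {"href"}}
# Elements whose whole contents are thrown away, not just the tags.
DROP_WITH_CONTENT = {"script", "style", "nav", "header", "footer", "aside", "form"}


class Sanitizer(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self._skip_depth = 0

    @staticmethod
    def _map(tag):
        # Push headings down one level, capping at h6.
        if tag in HEADINGS:
            return f"h{min(int(tag[1]) + 1, 6)}"
        return tag

    def handle_starttag(self, tag, attrs):
        if tag in DROP_WITH_CONTENT:
            self._skip_depth += 1
            return
        if self._skip_depth or tag not in ALLOWED:
            return
        keep = KEEP_ATTRS.get(tag, ())
        kept = "".join(f' {k}="{escape(v, quote=True)}"'
                       for k, v in attrs if k in keep and v is not None)
        self.out.append(f"<{self._map(tag)}{kept}>")

    def handle_endtag(self, tag):
        if tag in DROP_WITH_CONTENT:
            self._skip_depth = max(0, self._skip_depth - 1)
            return
        if self._skip_depth or tag not in ALLOWED or tag in VOID:
            return
        self.out.append(f"</{self._map(tag)}>")

    def handle_data(self, data):
        if not self._skip_depth:
            self.out.append(escape(data, quote=False))


def clean_html(html):
    parser = Sanitizer()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

Badly nested or unclosed markup can confuse the skip counter, so on real pages something like lxml in front of this would probably be needed.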
@minoru @saper @ajroach42
@freakazoid @minoru @ajroach42 a rewrite is probably going to be inefficient; is the CLI thing suggested by @kelbot anything to start with? It would be much better to direct effort toward making that part of Calibre a library that can be re-used.
@freakazoid @minoru @saper @ajroach42
Just tested the cli ebook convert tool that comes with calibre. It is wonderfully simple to use. "ebook-convert feedtitle.recipe feedtitle.epub" and voila. It appears you can just install calibre and start using the cli tools without ever launching the main program. Would be great if we could pull out just the bits we want to use without the rest.
Great! Does it handle fetching the feed and cleaned-up fulltext where necessary, too? Because that would make it a no-brainer. If not, then we'd still need to do the same stuff one would have to do to use Pandoc. Not that it's that hard, but I was thinking that part could be done directly in Python using feedparser and lxml.
@ajroach42 @saper @minoru
@minoru @saper @ajroach42
I believe the purpose of all the included recipes is that someone has worked that all out and shared it but it's site by site. The default behavior if you add a custom news source just grabs the feed as is. The good news is there are over 1600 recipes already there. So at least some feeds we're looking for may be there already or we have a bunch of examples to go off of.
How difficult do you think it would be to pull out just the cli tools? It wouldn't be so bad to pull a few python programs from the github and install a couple dependencies manually.
@freakazoid @saper @ajroach42 @minoru
Update: I've been playing with calibre to convert feeds to ebooks some more and it's promising. The conversion actually works really well, and it's super simple to put the commands in a script and schedule a cron job to run it. It is still a rather large program if all you're using are the cli tools, but it works well. There is a flatpak of calibre available if that's more appealing.
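For anyone wanting to copy this, a sketch of the kind of script cron could run. It only wraps the "ebook-convert something.recipe something.epub" invocation mentioned earlier in the thread; the dated output name, the directory, and the function names are assumptions of mine.

```python
import subprocess
from datetime import date
from pathlib import Path


def edition_command(recipe, out_dir, day=None):
    """Build the ebook-convert argv for one dated edition of a recipe."""
    day = day or date.today()
    recipe = Path(recipe)
    # Hypothetical naming scheme: <recipe name>-<YYYY-MM-DD>.epub
    output = Path(out_dir) / f"{recipe.stem}-{day.isoformat()}.epub"
    return ["ebook-convert", str(recipe), str(output)]


def build_edition(recipe, out_dir, day=None):
    """Run the conversion, raising if ebook-convert exits non-zero."""
    subprocess.run(edition_command(recipe, out_dir, day), check=True)
```

A crontab entry along the lines of "0 6 * * * python3 /path/to/edition.py" (path hypothetical), with the script calling build_edition for each recipe, would have the book ready before breakfast.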
@freakazoid You can use newsbeuter / newsboat to export article URLs. Retrieve articles. A trivial bit of code will grab just the pristine article and accurate metadata. Pandoc can create your ePub from HTML.
1. Hahahahahahaha!!!!! Have you SEEN the shit websites pump out these days? Your scheme will fail here, though readability's JS HTML renderer may help.
I find hand tuning/touchup is almost always necessary.