Inverting the Web
We use search engines because the Web does not support accessing documents by anything other than URL. This puts a huge amount of control in the hands of the search engine company and those who control the DNS hierarchy.
Given that search engine companies can barely keep up with the constant barrage of attacks, commonly known as "SEO", intended to lower the quality of their results, a distributed inverted index seems like it would be impossible to build.
@freakazoid What methods *other* than URL are you suggesting? Because it is simply a Uniform Resource Locator (or Identifier, as URI).
Not all online content is social / personal. I'm not understanding your suggestion well enough to criticise it, but it seems to have some ... capacious holes.
My read is that search engines are a necessity born of no intrinsic indexing-and-forwarding capability which would render them unnecessary. THAT still has further issues (mostly around trust)...
@freakazoid ... and reputation.
But a mechanism in which:
1. Websites could self-index.
2. Indexes could be shared, aggregated, and forwarded.
4. Search could be distributed.
5. Auditing against false/misleading indexing was supported.
6. Original authorship / first-publication was known
... might disrupt things a tad.
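A minimal sketch of the first few points in that list (self-indexing, then sharing and aggregating indexes, then searching the result). Everything here is an assumption made for illustration: the site names, the page text, and the whitespace tokenizer are hypothetical, and nothing below addresses the auditing, authorship, or trust points.

```python
from collections import defaultdict

def build_index(site, pages):
    """Self-index one site: map each lowercase word to the (site, path) pairs containing it."""
    index = defaultdict(set)
    for path, text in pages.items():
        for word in text.lower().split():
            index[word].add((site, path))
    return index

def merge_indexes(*indexes):
    """Aggregate several per-site indexes into one shared index."""
    merged = defaultdict(set)
    for index in indexes:
        for word, postings in index.items():
            merged[word] |= postings
    return merged

def search(index, query):
    """Return the postings that contain every word of the query."""
    words = query.lower().split()
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results

# Hypothetical sites and content, for illustration only.
site_a = build_index("a.example", {"/": "distributed search index", "/about": "about this site"})
site_b = build_index("b.example", {"/notes": "notes on distributed systems"})
shared = merge_indexes(site_a, site_b)
```

Because merging is just a set union per word, an aggregator could forward the merged index onward in the same form it received the parts. Whether a forwarded index can be believed is the harder problem, and this sketch does not touch it.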
NB: the reputation bits might build off social / netgraph models.
But yes, I've been thinking on this.
Also YaCy as sean mentioned.
There's also something that is/was used for Firefox keyword search, I think OpenSearch, a standard used by multiple sites, pioneered by Amazon.
Being dropped by Firefox BTW.
That provides a query API only, not a distributed index, though.
@kick HTTP isn't fully DNS-independent. For virtualhosts on the same IP, the webserver distinguishes between content based on the host portion of the HTTP request.
If you request by IP, you'll get only the default / primary host on that IP address.
That's not _necessarily_ operating through DNS, but HTTP remains hostname-aware.
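Roughly what name-based virtual hosting looks like from the server side, as a sketch only. The hostnames, the content strings, and the choice of default site are hypothetical, and a real webserver does far more than this lookup.

```python
# Hypothetical virtual-host table; names and content are illustrative only.
VHOSTS = {
    "blog.example": "blog front page",
    "shop.example": "shop front page",
}
DEFAULT_HOST = "blog.example"  # assumed primary site for this IP address

def select_site(headers):
    """Choose content from the Host header, falling back to the default site."""
    # Drop any ":port" suffix and ignore case before the lookup.
    host = headers.get("Host", "").split(":")[0].lower()
    return VHOSTS.get(host, VHOSTS[DEFAULT_HOST])
```

A request made by bare IP carries the address in the Host header, so it matches no entry and lands on the default site. No DNS lookup happens on the server here, yet the hostname still decides what gets served.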
@dredmorbius @kick @enkiv2 IP is also worse in many ways than using DNS. If you have to change where you host the content, you can generally at least update your DNS to point at the new IP. But if you use IP and your ISP kicks you off or whatever, you're screwed; all your URLs are now invalid. Dat, IPFS, FreeNet, Tor hidden sites, etc, don't have this issue. I suppose it's still technically a URL in some of these cases, but that's not my point.
@dredmorbius @kick @enkiv2 HTTP URLs don't have any way to specify the lookup mechanism. RFC3986 says the part after the // and optional authentication info followed by @ is a "registered name" or an address. It doesn't say the name has to be resolved via DNS but does say it is up to the local system to decide how to resolve it. So if you just wanted self-certifying names or whatever you can use otherwise unused TLDs the way Tor does with .onion.
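A sketch of the "local system decides" idea: pull the host out of the authority part with Python's standard `urllib.parse`, then pick a lookup mechanism from the name. The resolver labels ("tor", "dns", "literal-address") and the suffix rule are assumptions for illustration, not anything the RFC specifies.

```python
import ipaddress
from urllib.parse import urlsplit

def pick_resolver(url):
    """Return a label for how this host would be looked up (labels are hypothetical)."""
    # .hostname strips any userinfo before "@" and any ":port" after the host.
    host = urlsplit(url).hostname or ""
    try:
        ipaddress.ip_address(host)
        return "literal-address"   # already an address, nothing to resolve
    except ValueError:
        pass
    if host.endswith(".onion"):
        return "tor"               # special-use TLD handled outside DNS
    return "dns"                   # assumed default for every other name
```

For example, `pick_resolver("http://user:pw@abc.onion:8080/x")` selects the Tor branch, while the URL itself stays an ordinary http URL.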
@kick Clue seeks clue.
You're asking good questions and making good suggestions, even where wrong / confused (and I do plenty of both, that's not a criticism).
You're helping me (and I suspect Sean) think through areas I've long been bothered about concerning the Web / Internet. Which I appreciate.
(Kragen may have this all figured out, he's certainly far ahead of me on virtually all of this, and has been for decades.)
@dredmorbius @kick @enkiv2 @freakazoid building a non-distributed index has gotten a lot easier though. when I published the Nutch paper it was still not practical for a regular person to crawl most of the public textual web, from a cost perspective. (not sure if it's practical now, though, due to cloudflare)
@kragen I see a lot of this coming down to:
- What is the incremental value of additional information sources? At some point, net of validation costs, this falls below zero.
- Google's PageRank relied on inter-document and -domain relations. Author-based trust hasn't carried as much weight. I believe it needs to.
- Randomisation around ranking should help avoid systemic bias lock-ins.
- Penalties for fraud, with increasing severity and duration for repeats.
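One way the last bullet could look, as a sketch only. The halving of ranking weight, the doubling of duration, and the 7-day base are made-up parameters, and deciding what counts as a confirmed offence is the hard part left out here.

```python
def penalty(offence_count, base_days=7, base_weight=0.5):
    """Return (ranking multiplier, duration in days) for the nth confirmed offence.

    Assumed policy: weight halves and duration doubles with each repeat.
    """
    if offence_count <= 0:
        return 1.0, 0
    multiplier = base_weight ** offence_count
    duration = base_days * 2 ** (offence_count - 1)
    return multiplier, duration
```

Under these assumed defaults a first offence halves ranking weight for a week, a second quarters it for two weeks, and so on.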
@dredmorbius @kick @enkiv2 @freakazoid I've thought that it might be reasonable to bootstrap a friendnet by assigning newcomers (randomly or by payment) to "foster families" or "undergraduate faculties" to allow them to gain enough whuffie to become emancipated. ideally, gradually, rather than through an emancipation cliff analogous to legal majority or a B.S.
@kragen Challenge on any such scheme is scaling quickly enough, relative to other systems.
Though if the founding cohort is sufficiently interesting, you'll have the reverse problem: too many people wanting in.
An inspiration I've long had for this is Lawrence Lessig's "signed by" convention at the ... Yale Wall, I think, described in "Code and Other Laws of Cyberspace".
That applied to anonymous messages, but for new users might also work.
@kragen It's effectively a socialisation problem -- how do you introduce new members to a society?
But doing that *without* creating an inculcated old-boys/girls/nbs network, or any of the usual ethnic or socioeconomic cliques. Something that most systems have generally failed at.
Random assignments should help but aren't of themselves sufficient.
@dredmorbius @kick @enkiv2 @freakazoid human societies have hierarchies of prestige; we can't hope to eliminate those through incentive design. We can hope to prevent things like despotism, witch-burning, the Inquisition, the Holocaust, and the burning of the Library of Alexandria. But there's going to be an old-enbies network, unavoidably.
@dredmorbius @kragen @kick @enkiv2 @freakazoid
Stafford Beer had some ideas about ways to rotate people through groups in such a way that ideas echo through a network. Based on graph theory & permutation. I've forgotten the name. Worth looking into as a way to grow/integrate folks into a large group by making connection in a smaller one & getting mirroring/feedback.
Taking a single possibility (I listed a few) from a thing I wrote in reply to a couple of posts up-thread but didn’t send, because I want to hear someone’s opinion on a sub-problem of one of the guesses listed:
Seed with trusted users (i.e. people submitting sites to crawl), rank preferentially by age (time-limited; would eventually wear off), then rank on access-by-unique-users. Given that centralized link aggregators wouldn’t disappear, someone throws HN in, for example, the links on HN get added into the pool, whichever get clicked on most rise up, eventually get their own ranking, etc.
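One possible reading of that scheme as a scoring function, purely as a sketch. The 30-day window, the bonus of 10, and the linear fade are invented numbers, and "age" is taken here to mean days since a trusted user submitted the link.

```python
def score(unique_users, age_days, seeded_by_trusted, boost_days=30):
    """Count of unique users who clicked, plus a seed bonus that fades to zero."""
    bonus = 0.0
    if seeded_by_trusted and age_days < boost_days:
        # The bonus wears off linearly over the assumed boost window.
        bonus = 10.0 * (1 - age_days / boost_days)
    return len(unique_users) + bonus
```

Once the window passes, a link stands only on how many distinct users reached it, which is the "eventually get their own ranking" step.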
This works especially well if using what I sent the e-mail to inquire a little more about: cluster sorting rather than just barebacking text (this is what Yippy does, for example, and what Blekko used to do), because it promotes niche results better than Google’s model with smaller datasets, and when users have more seamless access to better niches, more sites can get rep easier. Example: try https://yippy.com/search?query=dredmorbius vs. throwing your username into Google. The clustering allows for much more informative/interesting results, I think, especially if doing inquisitive searching.
Kragen mentioned randomly introducing newcomers (adding noise), but I think it might work better still if noise was added to the searches for at least the beginning of it. A single previously-unclicked link on the first five pages of search results?
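A sketch of that noise idea: slip one never-clicked link into a random slot among the early results. The window of 50 assumes ten results per page over five pages, and the function assumes the unclicked links are not already in the ranked list. Both are assumptions for illustration.

```python
import random

def add_noise(ranked, unclicked, window=50, rng=random):
    """Insert one random never-clicked link somewhere in the first `window` results."""
    if not unclicked:
        return list(ranked)
    results = list(ranked)
    # Sort first so the choice is reproducible when a seeded rng is passed in.
    pick = rng.choice(sorted(unclicked))
    position = rng.randrange(min(window, len(results)) + 1)
    results.insert(position, pick)
    return results
```

Passing a seeded `random.Random` as `rng` makes a given result page repeatable, which would help when auditing why a link appeared.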
@kick As little as possible.
I've not participated online under my real name (or even vague approximations of it) for a decade or more. That was seeming increasingly unattractive to me already then. And I'd been online for at least two decades by that point.
Of the various dimensions of trust, anti-sock-puppetry is one axis. It's not the only one. It matters a lot in some contexts. Less in others.
Doxxing may be occasionally warranted.
Unmasking is a risk.
@dredmorbius @kick @enkiv2 @freakazoid yeah, although in many ways it's an improvement over Golden Horde society, Ivan the Terrible society, Third Crusade society, Diocletian society, Qin Er Shi society, Battle of the Bulge society, Khmer Rouge society, Holodomor society, People's Temple society, the society that launched the Amistad, etc. We didn't start the fire.
@kragen I'm referencing specifically the surveillance aspects, and the accelerating pace of that especially over the past two decades or so. Though you can trace the trends back to the 1970s, generally.
Paul Baran was writing of the risks ~1966-1968, which is 52-54 years ago now.
IBM were actively demonstrating the risks 1939-1945.
Herbert Simon was conveniently ignorant of this in 1978, when Zuboff discovered surveillance capitalism in her research.
@kragen Of the various drawbacks of the Mongol Hordes, massive mobile technological surveillance was not a prominent aspect.
The Battle of the Bulge and Holodomor societies _did_ benefit from informational organisation. Khmer Rouge and People's Temple may have, and the capabilities certainly existed.
General capabilities began ~1880, again with Hollerith, nascent IBM.
@dredmorbius @kick @enkiv2 @freakazoid depending on who you were and where you lived, it was easy to end up with very little privacy after the Mongol invasion. The fact that the technologies employed were things like chains and swords rather than punched cards and loyalty scores was cold comfort to the enslaved. But, yes, I meant that the societies were more regrettable overall, not necessarily specifically along the surveillance axis.
@kragen My evolving thought is that privacy is an emergent concept, it's a force that grows proportionately to the ability to invade personal space and sanctum.
Pretechnical society had busybodies, gossips, eavesdroppers, spies, and assassins.
But if you wanted to listen to or observe someone, you had to put a body in proximity to do it. Preliterate (or largely so) society plebes didn't even leave paper trails. A baptismal, marriage, and will, if you were lucky.
@kragen We're at an age where a chat amongst friends, as here, is creating a distributed global written record, doubtless being scraped by academics, corporations, and state and nonstate surveillance systems.
US phone call history records date to the mid-1980s (if not before). Purchase, social, employment, and location records are comprehensive for at least the past decade, if not five or more.
If privacy is the ability to define and defend limits on information disclosure, there is precious little left.
The information glut is so immense that even multi-billion-dollar-funded state intelligence apparatus cannot meaningfully utilise the information preemptively. And yet those same state actors leak and lose their own personnel and intelligence data. Political organisations have email leaked. Generals and possibly presidents are downed.
@kragen The same state actors drop death from the sky based on cellphone metadata and other data traces.
And those are the ones we think of as the good guys.
China, Saudi, Israel, Russia, and who knows who all else, are doing far worse.
And we're only really a decade in to this brave new mobile-data-surveillance world.
@dredmorbius @kick @enkiv2 @freakazoid Well, privacy invasion was more typically done by your father, your husband, or your owner in many of these societies, rather than by the secret police. But it was in many cases quite pervasive. Of course when we think about medieval Europe, it's easier to imagine ourselves as monks, knights, or at least yeomen, than as villeins in gross, vagabonds, or women who died in forced childbirth, precisely because of that paper trail.
@kragen And yet, as the Chinese noted: Heaven is high and the emperor far away.
The inefficiencies of medieval systems (even highly-evolved bureaucratic ones as in China) left a great deal of latitude.
The lack of *material* wealth, or useful knowledge, imposed strong constraints. But the idea of being watched by unknown eyes, from anywhere on the planet, didn't exist. Your watchers were neighbours, and had profound limitations.
Still a threat, but knowable.
@dredmorbius @kick @enkiv2 @freakazoid Trump supporters label NPR as "fake news"; Trump opponents label Fox as "fake news". Presumably one side will win and the other will be penalized for linking to fake news, with increasing severity and duration for repeats. There's no particular reason to expect that it will be the correct side. See also: the Crusades, blood libel, babies ripped out of incubators, Lysenkoism. PageRank is immune to that.
There's objective truth, and there's consensus truth. The two seldom match up.
Old Mr. Free Speech Hisself, John Stuart Mill, wasn't optimistic on the truth's capacity to out.
If it's necessary to set up competing credentialing networks which operate independently (competing churches?), that ... might have to happen.
Motivated irrationality is, unfortunately, A Thing. And can be quite lucrative and rewarding, at least in the short term.
@kragen @dredmorbius @kick @enkiv2 @freakazoid
In the absence of any negative feedback, whoever can produce the most positive feedback will win (and when competing on access to information, winning accumulates). Whoever gets an early monopoly has a lot of control over the worldview even after they lose that monopoly...
@enkiv2 Pretty much this.
It's an evolutionary problem, I think, with likely analogues and lessons in biological evolution.
Negative feedbacks are fitness checks?
Though my question was, specifically: are negative feedbacks fitness checks? That is, the "selection" process within "variation, inheritance, and selection".
And vice versa: are fitness checks / selection processes negative feedback?
Not sure that they are or aren't. Musing on this.
Within a systems context, yes, negative feedback is required for sustainable function.
elimination of options based on failure of fitness checks certainly is a subset of negative feedback. i'm not assuming that the negative feedback in question is non-arbitrary though. it's just that in the absence of any negative feedback, everything goes positive, and whoever has the largest reach cannot be beaten. with negative feedback a powerful actor can be deplatformed by a coalition.
@enkiv2 Bang simply as available notation. Now that I think of it, it might make a good routing _mechanism_ specifier:
Again, I'm not sure this is better than individual protocols.
Another option would be to specify some service proxy, which could then handle routing. URI encoding doesn't seem to directly provide that, apps/processes define own proxy use.
@dredmorbius @enkiv2 @kick @freakazoid
Bang was used in usenet addresses to separate a series of hosts in order to specify a routing, since UUCP would be done by machines calling specific other known machines nightly over landline phones. You'd see bang routing in usenet archives as late as the early 90s. I'd be surprised if it's not still theoretically supported in URLs.
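For anyone who hasn't seen one, a sketch of how a bang path reads: each `!` separates a relay host, and the last element is the recipient. The host and user names in the comments are hypothetical, and this only illustrates the notation, not actual UUCP software.

```python
def parse_bang_path(address):
    """Split 'hostA!hostB!user' into (list of relay hosts, final user)."""
    *hops, user = address.split("!")
    return hops, user

def next_hop(address):
    """Return (next host, remaining address), as a relay would when forwarding."""
    hops, user = parse_bang_path(address)
    if not hops:
        return None, user  # no hosts left: deliver locally
    return hops[0], "!".join(hops[1:] + [user])
```

Each relay strips its successor off the front and passes the rest along, so the sender, not the network, chooses the route.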
@enkiv2 Email also.
I used (though understood poorly) bang-path routing at the time.
So yes, I'm familiar with the usage and notation. The question of whether or not it's appropriate here is ... the question.
At present, HTTP URLs *presume* DNS.
The problem is that DNS itself is proving problematic in numerous ways, that ... don't seem reasonably tractable. The dot-org fiasco is pretty much the argument I've been looking for against the "just host your own domain" line.
@enkiv2 That's at best worked with difficulty for large organisations -- domain lapses, etc., occur with regularity.
Domain squatting, typosquatting, and a whole mess of other stuff, is a long-standing issue.
In that light, Google's killing the URL _might_ not be _all_ bad, but they've been Less Than Clear on what their suggested alternative is. And I trust them less far than I can throw them.
For individuals, persistent online space is a huge issue.
@enkiv2 Then there's the whole question of how many spaces is enough. There are arguments for _both_ persistence _and_ flexibility / alternatives, and locking everyone into a _single_ permanent identity generally Does Not End Well.
The notion of a time-indexed identity might address some of this. Internet Archive's done some work in this area. Assumptions of network immutability tend to break. In time.
@dredmorbius @enkiv2 @kick @freakazoid
Yeah. Any immutability needs to be enforced because when the W3C declared that changing web pages is Very Rude all the scam artists & incompetents did it anyway. Content archival projects like waybackmachine become easier if you have static addresses for static content & some kind of mechanism to repoint at a different set of static documents (like IPFS+IPNS).
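A toy sketch of the static-address-plus-repointing idea. This is not the IPFS or IPNS API: real IPFS uses multihash-based content identifiers and IPNS uses signed records, whereas this in-memory class just uses a SHA-256 hex digest as the address and a plain dictionary as the mutable pointer table.

```python
import hashlib

class ContentStore:
    """Hypothetical in-memory store: immutable blobs plus mutable names."""

    def __init__(self):
        self.blobs = {}   # digest -> bytes, never overwritten
        self.names = {}   # mutable name -> current digest

    def put(self, data):
        """Store bytes under the hash of their content and return that address."""
        digest = hashlib.sha256(data).hexdigest()
        self.blobs.setdefault(digest, data)
        return digest

    def get(self, digest):
        """Fetch by static address; the content can never differ from its hash."""
        return self.blobs[digest]

    def publish(self, name, digest):
        """Repoint a mutable name at an existing static document."""
        if digest not in self.blobs:
            raise KeyError(digest)
        self.names[name] = digest

    def resolve(self, name):
        """Follow the name to whatever document it currently points at."""
        return self.blobs[self.names[name]]
```

Repointing a name leaves the old digest valid, so an archive that recorded the earlier address still retrieves the earlier document.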
@enkiv2 I'd argue that there's a place for redacting content -- see the Bryan Cantrill thread from 1996 previously referenced. That's ... embarrassing. Not particularly useful, though perhaps as a cautionary tale.
There's a strong argument that most social media should be fairly ephemeral and reach-limited.
There are exceptions, and *both* promoting *and* concealing information can be done for good OR evil.
@dredmorbius @enkiv2 @kick @freakazoid
In terms of negative feedback -- I don't consider redaction of already-published material to be the best or most useful form. We see problems that could be solved by this, if mirroring & wayback machine & screenshots didn't exist. I'm more hopeful about solving the dunking problem with norms.
Reach is a lot more nuanced & powerful. Permanent & reach-limited like SSB feels like the right thing for nominally-public stuff.
@enkiv2 @dredmorbius @kick @freakazoid
(Secret stuff is a different concern. Encryption gets broken. Accidentally leaking secret info publicly is a problem but giving up all of the benefits of staticness -- mostly making decentralization viable -- won't solve the whole problem and also IMO isn't worth it for the few cases it does resolve.)
@freakazoid @enkiv2 @dredmorbius @kick
Depends on how much you take advantage of it. Immutability is rare so we basically don't have tech that uses it. (In plt, we have functional languages, which are basically just a matter of saying "if all variables are immutable what does that mean". Outside of plt, it's much more rare!) Social ramifications of immutability can be great or terrible depending on how we engineer norms around it.