Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rate limit concurrent peer dials #1802

Merged
merged 2 commits into from
Oct 9, 2015
Merged

rate limit concurrent peer dials #1802

merged 2 commits into from
Oct 9, 2015

Conversation

whyrusleeping
Copy link
Member

This should prevent the crashes when adding large files and directories.

addresses #1654 and other similar issues.

License: MIT
Signed-off-by: Jeromy jeromyj@gmail.com

@jbenet jbenet added the status/in-progress In progress label Oct 6, 2015
@@ -44,6 +44,9 @@ var (
// add loop back in Dial(.)
const dialAttempts = 1

// dial at most 12 peers concurrently
const concurrentPeerDials = 12
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I chose 12 here because 12*10 (conccurent dials per peer) * 2 (reuseport) is around 1/4 of the available file descriptors on linux, and is still less than the 256 limit on OSX. Although maybe its a little too close to the OSX limit. Maybe we should set it at runtime with an OS check.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can probably bump the 10 factor to 5. i worry about putting it lower than that, as otherwise we wont have the chance to prefer local addresses. once we have good estimators, maybe.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and, 12 max concurrent dials is a bit rough. Maybe this should only be a limit for socket-consuming transports?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thats really hard to tell at this point. The lock doesnt specify whats going to be dialed, just who.

How about dropping the per peer concurrency to 6 and raising the number of peers to 15? That puts us at a limit of 180 fd's (at least until we fix epoll)

@jbenet
Copy link
Member

jbenet commented Oct 6, 2015

LGTM, though maybe:

  • also bump down the 10 concurrent addr-dials per outgoing dial
  • make the ratelimit only rate limit "fd-per-conn" based conns

Also, i believe that if you fix the reuseport epoll + kqueue fd screwup, that halves the used fds

Sorry, something went wrong.

@whyrusleeping
Copy link
Member Author

rate limiting fd-per-conn dials will be a lot of work, and we dont have any such transports yet. I think its best to push this until a little later.

@jbenet
Copy link
Member

jbenet commented Oct 6, 2015

the point is that we don't need to rate limit utp and we do need to rate limit tcp


Sent from Mailbox

On Tue, Oct 6, 2015 at 10:05 AM, Jeromy Johnson notifications@github.com
wrote:

rate limiting fd-per-conn dials will be a lot of work, and we dont have any such transports yet. I think its best to push this until a little later.

Reply to this email directly or view it on GitHub:
#1802 (comment)

@whyrusleeping
Copy link
Member Author

Yeah, but Lock(peer) gives me no indication of what transport i'll actually be able to connect to them on. If we really want that, we would have to put it in the actual dial code, and then just limit the total number of concurrent tcp dials. or something like that (which works too)

@jbenet
Copy link
Member

jbenet commented Oct 6, 2015

Yep, that's why I meant by harder :)

@jbenet
Copy link
Member

jbenet commented Oct 6, 2015

(Think we should do this-- the limit will hurt)

@jbenet
Copy link
Member

jbenet commented Oct 6, 2015

Good side effect: utp addresses much more likey to be chosen

default:
if conn.IsTcpMultiaddr(remoteAddrs[i]) {
// For addresses that require an FD to dial
select {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is ugly because you cant really do conditional select cases...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here, this fixes it all for you:

func (s *Swarm) addrDialRateLimit(a ma.Multiaddr) (chan<-  struct{}) {
  if isFDCostlyTransport(a) {
    return s.fdRateLimit
  }

  // do not rate limit it at all
  return make(chan struct{}, 1)
}

// and in here: 

for _, i := range rand.Perm(len(remoteAddrs)) {

  select {
  case <-foundConn: // if one of them succeeded already
    break
  case <-worker.Closing(): // our context was cancelled
    break
  default:
  }


  workerAddr := remoteAddrs[i] // shadow variable to avoid race

  // we have to do the waiting concurrently because there are addrs 
  // that SHOULD NOT be rate limited (utp), nor blocked by other 
  // rate limited addrs (tcp).

  // we need to call limiter.Go as required by goproc/limiter semantics.
  // note: limiter.Go is not limiter.LimitedGo.
  limiter.Go(func(p process.Process) {

    // returns whatever ratelimiting is acceptable for workerAddr.
    // may not rate limit at all.
    rl := s.addrDialRateLimit(workerAddr)
    rl<- struct{}{}

    limiter.LimitedGo(func(worker process.Process) {
      dialSingleAddr(workerAddr)
    })

    <-rl
  })

}

@whyrusleeping
Copy link
Member Author

@jbenet alright, we do the thing now.

@whyrusleeping whyrusleeping added this to the 0.3.8 milestone Oct 7, 2015
@whyrusleeping
Copy link
Member Author

RFCR/M

@whyrusleeping whyrusleeping mentioned this pull request Oct 7, 2015
func IsTcpMultiaddr(a ma.Multiaddr) bool {
p := a.Protocols()
return len(p) == 2 && (p[0].Name == "ip4" || p[0].Name == "ip6") && p[1].Name == "tcp"
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

probably should have:

func isFDCostlyTransport(a ma.Multiaddr) bool {
  return isTCPMultiaddr(a)
}

@jbenet
Copy link
Member

jbenet commented Oct 8, 2015

This is very weird: https://gist.github.com/jbenet/d79f4b9cfcfeea48363e

Verified

This commit was created on GitHub.com and signed with GitHub’s verified signature. The key has expired.
License: MIT
Signed-off-by: Jeromy <jeromyj@gmail.com>
@jbenet
Copy link
Member

jbenet commented Oct 8, 2015

@whyrusleeping i pushed here -- fixed it for you i think 1c2223d

except for whatever is going on in #1802 (comment)

@whyrusleeping
Copy link
Member Author

@jbenet what changed in that commit?

@whyrusleeping
Copy link
Member Author

also, that gist is quite weird. But I doubt that "what weve been seeing isnt too many fd's", I've reproduced the error many times and confirmed the number of fd's on the system (yay for filesystem notifications on the /proc/pid/fd directories).

what did you repro that on? your laptop? does that happen every time?

@whyrusleeping
Copy link
Member Author

@jbenet i cant repro the thing you posted. Was it a fluke? Or can you repro reliably?

@whyrusleeping
Copy link
Member Author

made my roommate test this on OSX for me, looks like its an OSX issue. It also happens on master. TestSimultOpenMany opens more file descriptors than OSX can handle by default, we dont see the failures in CI because for CI we lower the number of rounds to run by half. That test opening too many file descriptors causes all the rest to fail because theyre all in the same process.

@GitCop
Copy link

GitCop commented Oct 8, 2015

There were the following issues with your Pull Request

  • Commit: f1f068d
    • Invalid signoff. Commit message must end with
      License: MIT
      Signed-off-by: .* <.*>

Guidelines and a script are available to help. Your feedback on GitCop is welcome on this issue.


This message was auto-generated by https://gitcop.com

@whyrusleeping
Copy link
Member Author

@GitCop killjoy.

License: MIT
Signed-off-by: Jeromy Johnson <why@ipfs.io>
@whyrusleeping
Copy link
Member Author

@jbenet that should fix your issue.

jbenet added a commit that referenced this pull request Oct 9, 2015
rate limit concurrent peer dials
@jbenet jbenet merged commit 26d40b2 into master Oct 9, 2015
@jbenet jbenet deleted the fix/too-many-fd branch October 9, 2015 20:06
@jbenet jbenet removed the status/in-progress In progress label Oct 9, 2015
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants