Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sopel is not starting after a timeout on python 3.7 #1888

Closed
RhinosF1 opened this issue Jun 9, 2020 · 36 comments
Closed

Sopel is not starting after a timeout on python 3.7 #1888

RhinosF1 opened this issue Jun 9, 2020 · 36 comments
Labels
Not Us Issues that are not Sopel's responsibility, e.g. a bug in the environment or a dependency

Comments

@RhinosF1
Copy link
Contributor

RhinosF1 commented Jun 9, 2020

Description

When #1868 occurs on 3.7 we get a lot of log spam from #1865 in between

[2020-06-09 04:46:41,529] sopel.irc.backends INFO - Closed!

and

[2020-06-09 04:46:43,538] sopel.irc.backends INFO - Connection closed...

but the connection never recovers

Reproduction steps

  1. Install latest sopel & python 3.7 and install the https://github.com/Pix1234/ZppixBot-Source/tree/dev plugins
  2. Wait awhile
  3. Bot isn’t detecting some pongs due to a race condition #1868 occurs
  4. Bot doesn't auto reboot

Expected behavior

The two linked issues to not occur but if they do, it should recover automatically

Logs

see https://phabricator.wikimedia.org/T254348#6204886 and https://phabricator.wikimedia.org/T254348#6204907

Environment

  • Sopel .version: 7.0.4
  • Sopel installed via: pip
  • Python version: 3.7.3
  • Operating system: Debian GNU/Linux 9.8 (stretch)
  • IRCd /version: ircd-seven-1.1.9(20191023-4b033a7b9221,charybdis-3.4-dev). wolfe.freenode.net eHIKMpSZ6 TS6ow 32N
  • Relevant plugins: Nothing touches memory

Notes

As far as I can see, find_lines shouldn't be being touched at that point as the connection never fully starts so of course it will be empty

@RhinosF1 RhinosF1 added the Bug Things to squish; generally used for issues label Jun 9, 2020
@RhinosF1
Copy link
Contributor Author

RhinosF1 commented Jun 9, 2020

The full log seems to show it just keeps looping around (14 times) with

[2020-06-09 04:46:41,529] sopel.irc.backends   INFO     - Connection closed...
[2020-06-09 04:46:41,529] sopel.bot            INFO     - Stopping the Job Scheduler.
[2020-06-09 04:46:41,529] sopel.bot            INFO     - Job Scheduler stopped.
[2020-06-09 04:46:41,529] sopel.bot            INFO     - Calling shutdown for 0 plugins.
[2020-06-09 04:46:41,529] sopel.irc.backends   INFO     - Closed!
[2020-06-09 04:46:41,535] sopel.bot            ERROR    - Unexpected error ('find_lines') from ZppixBot-test at 2020-06-09 04:46:41.535755. Message was: [2020-06-09 04:46:39,527] sopel.bot            ERROR    - Unexpected error ('find_lines') from ZppixBot-test at 2020-06-09 04:46:39.527121. Message was: [2020-06-09 04:46:37,517] sopel.bot            ERROR    - Unexpected error ('find_lines') from ZppixBot-test at 2020-06-09 04:46:37.517694. Message was: [2020-06-09 04:46:35,507] sopel.bot            ERROR    - Unexpected error ('find_lines') from ZppixBot-test at 2020-06-09 04:46:35.507173. Message was: [2020-06-09 04:46:33,496] s
Traceback (most recent call last):
  File "/data/project/zppixbot-test/zppixbottest37/lib/python3.7/site-packages/sopel/bot.py", line 606, in call
    exit_code = func(sopel, trigger)
  File "/data/project/zppixbot-test/zppixbottest37/lib/python3.7/site-packages/sopel/module.py", line 495, in guarded
    return function(bot, trigger, *args, **kwargs)
  File "/data/project/zppixbot-test/zppixbottest37/lib/python3.7/site-packages/sopel/modules/find.py", line 43, in collectlines
    if trigger.sender not in bot.memory['find_lines']:
KeyError: 'find_lines'
[2020-06-09 04:46:43,538] sopel.irc.backends   INFO     - Connection closed...
[2020-06-09 04:46:43,538] sopel.bot            INFO     - Stopping the Job Scheduler.

until it eventually gives up. This has never happened on python 3.5

@MyWay
Copy link

MyWay commented Jun 15, 2020

Same here, but on 3.5.3 running Sopel 7.0.4. I get an IRC timeout and sopel doesn't reconnect.

@RhinosF1
Copy link
Contributor Author

Same here, but on 3.5.3 running Sopel 7.0.4. I get an IRC timeout and sopel doesn't reconnect.

I've have never seen this bug on python 3.5.3 - do you have any debug logs?

The only time python 3.5.3 breaks is when I break it or the starter script is broke not sopel.

@MyWay
Copy link

MyWay commented Jun 15, 2020

It happens frequently to me, so yes, I can add logs.

@RhinosF1
Copy link
Contributor Author

Please ensure logging_level is set to DEBUG

@MyWay
Copy link

MyWay commented Jun 15, 2020

Ok, restarted with logging_level set to DEBUG. In 1-2 days I will post the log.

@MyWay
Copy link

MyWay commented Jun 17, 2020

[2020-06-16 23:25:23,636] sopel.irc.backends   ERROR    - Server timeout detected after 112s; closing.
[2020-06-16 23:25:23,688] sopel.irc.backends   INFO     - Connection closed...
[2020-06-16 23:25:23,720] sopel.bot            INFO     - Stopping the Job Scheduler.
[2020-06-16 23:25:24,049] sopel.bot            INFO     - Job Scheduler stopped.
[2020-06-16 23:25:24,061] sopel.bot            INFO     - Calling shutdown for 1 plugins.
[2020-06-16 23:25:24,072] sopel.bot            DEBUG    - Calling sopel.coretasks.shutdown
[2020-06-16 23:25:24,083] sopel.irc.backends   DEBUG    - Closing socket
[2020-06-16 23:25:24,144] sopel.irc.backends   INFO     - Closed!

@Exirel
Copy link
Contributor

Exirel commented Jun 17, 2020

And that's all? Nothing else before and/or after that? That's a bit thin. :-/

@MyWay
Copy link

MyWay commented Jun 17, 2020

Yes, before it I had a long list of:
sopel.coretasks DEBUG - Sending WHO for channel: #mychan
which I thought were unrelated.

@Exirel
Copy link
Contributor

Exirel commented Jun 17, 2020

But what happen after that? What does your bot? Is there anything in the exceptions log file?

@MyWay
Copy link

MyWay commented Jun 17, 2020

Nothing, I have to kill it and restart.Can't see anything inside default.exceptions.log.

@dgw
Copy link
Member

dgw commented Jun 17, 2020

Weird. Getting disconnected by timeout or whatever without setting the status flags to quit or restart should pass through this whole block, log a warning after the "Closed!" info line, and reach the sleep(), then restart the loop and connect again. But it's not?

sopel/sopel/cli/run.py

Lines 117 to 124 in b575787

if not isinstance(delay, int):
break
if p.wantsrestart:
return -1
if p.hasquit:
break
LOGGER.warning('Disconnected. Reconnecting in %s seconds...', delay)
time.sleep(delay)

Thought I had a tenuous lead when I spotted if not isinstance(delay, int): break, but it looks like nothing ever touches the value assigned way up at the top, so… WTF. 😕

@Exirel
Copy link
Contributor

Exirel commented Jun 17, 2020

Nothing, I have to kill it and restart.Can't see anything inside default.exceptions.log.

How long did you wait before killing it? Maybe you have a thread that takes too to end, or maybe you kill it before it can goes into the reconnecting state?

@MyWay
Copy link

MyWay commented Jun 17, 2020

I waited 20 hours, about.

@MyWay
Copy link

MyWay commented Jun 17, 2020

If you want to try anything I can modify the code, in 1-2 days it will happen again.

@RhinosF1
Copy link
Contributor Author

I can't reproduce after cleaning pyc files, reinstalling everything from pip and resetting it's cache

@dgw
Copy link
Member

dgw commented Jun 25, 2020

Closing per @RhinosF1 (both here and on IRC).

@MyWay Please try cleaning up Python's cached bytecode etc. (*.pyc files, __pycache__ directories, and so on). Unfortunately your logs don't show anything useful, but they're no less helpful than the ones we got from @RhinosF1 about this, so I'm guessing if you also clean up your installation you'll get similar success. 😸

@dgw dgw closed this as completed Jun 25, 2020
@dgw dgw added Not Us Issues that are not Sopel's responsibility, e.g. a bug in the environment or a dependency and removed Bug Things to squish; generally used for issues labels Jun 25, 2020
@MyWay
Copy link

MyWay commented Jul 3, 2020

How do you delete the cache if installing from pip?

@Exirel
Copy link
Contributor

Exirel commented Jul 3, 2020

@MyWay you can try to uninstall Sopel with pip, than re-install it?

@MyWay
Copy link

MyWay commented Jul 5, 2020

I did, same happens.

@RhinosF1
Copy link
Contributor Author

RhinosF1 commented Jul 5, 2020

I did, same happens.

See if pip has any caches, also delete any pyc files for the system.

@MyWay
Copy link

MyWay commented Jul 5, 2020

I have installed it with the no cache flag, so I don't think. About system pyc files I have no idea where they could be.

@RhinosF1
Copy link
Contributor Author

RhinosF1 commented Jul 5, 2020

I have installed it with the no cache flag, so I don't think. About system pyc files I have no idea where they could be.

Anywhere on the filesystem so cd to / and then run "find . -name "*.pyc" -delete && find . -type d -empty -delete" which will remove any pyc files and delete empty directories. I strongly suggest backups before though and being extremely careful in what you remove. You can replace -delete with -print to dry run it.

@MyWay
Copy link

MyWay commented Jan 16, 2021

Changed server too, now on python 3.7.3, but still having this issue. Any hint on how to find the root cause?

@dgw
Copy link
Member

dgw commented Jan 17, 2021

All I can say is see if this still happens on current master. Without anything useful in debug logs, the next step would be running under a debugger (pdb or strace), which would generate a LOT of output to sift through if it still takes a few days to reproduce.

@MyWay
Copy link

MyWay commented Jan 21, 2021

I see. I'm trying to disable plugins one at time. Could it be related to the whois plugin? It seems it's not happening anymore.

@dgw
Copy link
Member

dgw commented Jan 21, 2021

Could be. What's a whois plugin, and how does it work? We don't ship one.

@MyWay
Copy link

MyWay commented Jan 21, 2021

It was shipped under sopel-extras. Don't know if you replaced programmatically whois in some way.

@RustyBower
Copy link
Contributor

From the sopel-extras repo "Broken Sopel modules you might pick up and fix if you want them. Probably faster to just rewrite, though"

@dgw
Copy link
Member

dgw commented Jan 21, 2021

Indeed, the sopel-extras repo:

  • is archived
  • contains a disclaimer in its description that its contents are likely broken
  • contains even more disclaimers in the README that the plugins were removed from Sopel at some point due to conflicts, breakage, or other weirdness

We really don't support stuff in sopel-extras, and I suggest not using those plugins, maybe except for very simple ones like 8ball.

@MyWay
Copy link

MyWay commented Jan 21, 2021

I see, thanks. Are you aware of any other solution to have the whois command inside sopel and get the result?

@dgw
Copy link
Member

dgw commented Jan 21, 2021

I can't even see how to trigger a WHOIS from that whois.py file in sopel-extras. It has no commands, so you must have some kind of wrapper or an edited copy. I guess it could be part of the cause for this issue, but seems unlikely…

That said, even if it's completely unrelated, you'd probably be better off writing a new WHOIS plugin that handles things a bit differently. For example, polling for a result using @module.interval instead of a loop with sleep(), or catching more WHOIS-related numerics and posting the result back only when receiving 318 RPL_ENDOFWHOIS. If you aren't up to writing one, maybe someone in #sopel on freenode would be interested in doing it.

@MyWay
Copy link

MyWay commented Jan 21, 2021

I see, thanks. I'll wait some days to confirm the problem is somewhere there and then proceed like you said.

@RhinosF1
Copy link
Contributor Author

I'm happy if you poke me over at phab.bots.Miraheze.wiki to consider taking ownership under @MirahezeBots of the Whois plugin. It doesn't look like that much of maintenance burden.

@Exirel
Copy link
Contributor

Exirel commented Jan 21, 2021

For example, polling for a result using @module.interval instead of a loop with sleep()

The fact is: if the sleep() is called in a non-threaded function while the bot is trying to start, you will have problem, because it uses an io-loop, and not threads, to function. Any blocking I/O (such as synchronous sleep()) will block the bot.

That being said, I'd try the bot without any extra plugins, just to be sure, and see what happens. Good luck finding the source of your problem, and I hope the root cause will because obvious soon enough so it can be fixed - either on your side, or by Sopel itself.

@MyWay
Copy link

MyWay commented Jan 25, 2021

Unfortunately that wasn't the problem. I have now disabled all plugins, let's see what happens.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Not Us Issues that are not Sopel's responsibility, e.g. a bug in the environment or a dependency
Projects
None yet
Development

No branches or pull requests

5 participants