Fix 404 Errors After Pangolin Update: A Traefik Troubleshooting Guide

by Omar Yusuf

Hey guys,

I'm encountering a frustrating issue: after updating Pangolin from version 1.4.0 to 1.8.0, all my resources are returning 404 errors. I've been digging through the logs, and the only clue I've found is this message from Traefik (version 3.4.5):

ERR Provider error, retrying in 664.618717ms error="cannot decode configuration data: field not found, node: sticky" providerName=http

To give you a clearer picture, I've included my configuration files below. Any help or insights would be greatly appreciated!

Docker Compose Configuration

Here's my docker-compose.yml file:

name: pangolin
services:
  pangolin:
    image: fosrl/pangolin:1.8.0
    container_name: pangolin
    restart: unless-stopped
    volumes:
      - ./config:/app/config
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:3001/api/v1/"]
      interval: "3s"
      timeout: "3s"
      retries: 15

  gerbil:
    image: fosrl/gerbil:1.1.0
    container_name: gerbil
    restart: unless-stopped
    depends_on:
      pangolin:
        condition: service_healthy
    command:
      - --reachableAt=http://gerbil:3003
      - --generateAndSaveKeyTo=/var/config/key
      - --remoteConfig=http://pangolin:3001/api/v1/gerbil/get-config
      - --reportBandwidthTo=http://pangolin:3001/api/v1/gerbil/receive-bandwidth
    volumes:
      - ./config/:/var/config
    cap_add:
      - NET_ADMIN
      - SYS_MODULE
    ports:
      - 51820:51820/udp
      - 21820:21820/udp
      - 443:443 # Port for traefik because of the network_mode
      - 80:80 # Port for traefik because of the network_mode
      - 25565:25565
      - 25566:25566

  traefik:
    image: traefik:v3.4.5
    container_name: traefik
    restart: unless-stopped
    network_mode: service:gerbil # Ports appear on the gerbil service
    depends_on:
      pangolin:
        condition: service_healthy
    command:
      - --configFile=/etc/traefik/traefik_config.yml
    volumes:
      - ./config/traefik:/etc/traefik:ro # Volume to store the Traefik configuration
      - ./config/letsencrypt:/letsencrypt # Volume to store the Let's Encrypt certificates

networks:
  default:
    driver: bridge
    name: pangolin
    enable_ipv6: true

Key Takeaways from the Docker Compose File

  • We are using Pangolin 1.8.0, Gerbil 1.1.0, and Traefik 3.4.5.
  • Pangolin's configuration is mounted from the ./config directory.
  • Gerbil depends on Pangolin being healthy and has several command-line arguments for configuration, including remote config and bandwidth reporting.
  • Traefik is running in network_mode: service:gerbil, which means it shares the network namespace with the Gerbil container. This is a crucial detail for understanding how Traefik routes traffic.
  • Traefik's configuration is loaded from /etc/traefik/traefik_config.yml, and Let's Encrypt certificates are stored in /letsencrypt.

Pangolin Configuration (config.yml)

Here's the content of my pangolin config.yml file:

app:
  dashboard_url: https://dash.url.com
  log_level: info
  save_logs: false
domains:
  domain1:
    base_domain: url.com
    cert_resolver: letsencrypt
  domain2:
    base_domain: url2.com
    cert_resolver: letsencrypt
  domain3:
    base_domain: url3.com
    cert_resolver: letsencrypt
server:
  external_port: 3000
  internal_port: 3001
  next_port: 3002
  internal_hostname: pangolin
  session_cookie_name: p_session_token
  resource_access_token_param: p_token
  resource_access_token_headers:
    id: P-Access-Token-Id
    token: P-Access-Token
  resource_session_request_param: p_session_request
  secret: top secret ;)
  cors:
    origins:
      - https://dash.url.com
    methods:
      - GET
      - POST
      - PUT
      - DELETE
      - PATCH
    headers:
      - X-CSRF-Token
      - Content-Type
    credentials: false
traefik:
  cert_resolver: letsencrypt
  http_entrypoint: web
  https_entrypoint: websecure
gerbil:
  start_port: 51820
  base_endpoint: dash.url.com
  use_subdomain: false
  block_size: 24
  site_block_size: 30
  subnet_group: 100.89.137.0/20
rate_limits:
  global:
    window_minutes: 1
    max_requests: 500
users:
  server_admin:
    email: [email protected]
    password: secretpassword
flags:
  require_email_verification: false
  disable_signup_without_invite: true
  disable_user_create_org: true
  allow_raw_resources: true
  allow_base_domain_resources: true
email:
  smtp_host: redacted
  smtp_port: 587
  smtp_user: redacted
  smtp_pass: redacted

Delving Deeper into the Pangolin Configuration

  • This config.yml file is the heart of your Pangolin setup. It defines everything from domain configurations to server settings and user access.
  • The domains section specifies the base domains and certificate resolvers for your applications. Make sure these are correctly configured and match your actual domain names.
  • The server section defines the ports used by Pangolin internally and externally, as well as settings for session cookies, resource access tokens, and CORS (Cross-Origin Resource Sharing).
  • The traefik section is particularly important. It tells Pangolin which certificate resolver to use and which entry points (web and websecure) Traefik should use for HTTP and HTTPS traffic.
  • The gerbil section configures the Gerbil service, including the starting port, base endpoint, and subnet group. These settings are crucial for the VPN functionality.
  • Rate limits are set globally to prevent abuse, limiting requests to 500 per minute.
  • User settings and flags control user registration, email verification, and other features.
  • The email section contains sensitive information for SMTP configuration, which is redacted in the provided file.

Traefik Dynamic Configuration

Here's my traefik dynamic config (dynamic_config.yml):

http:
  middlewares:
    redirect-to-https:
      redirectScheme:
        scheme: https

  routers:
    # HTTP to HTTPS redirect router
    main-app-router-redirect:
      rule: "Host(`dash.url.com`)"
      service: next-service
      entryPoints:
        - web
      middlewares:
        - redirect-to-https

    # Next.js router (handles everything except API and WebSocket paths)
    next-router:
      rule: "Host(`dash.url.com`) && !PathPrefix(`/api/v1`)"
      service: next-service
      entryPoints:
        - websecure
      tls:
        certResolver: letsencrypt

    # API router (handles /api/v1 paths)
    api-router:
      rule: "Host(`dash.url.com`) && PathPrefix(`/api/v1`)"
      service: api-service
      entryPoints:
        - websecure
      tls:
        certResolver: letsencrypt

    # WebSocket router
    ws-router:
      rule: "Host(`dash.url.com`)"
      service: api-service
      entryPoints:
        - websecure
      tls:
        certResolver: letsencrypt

  services:
    next-service:
      loadBalancer:
        servers:
          - url: "http://pangolin:3002"  # Next.js server

    api-service:
      loadBalancer:
        servers:
          - url: "http://pangolin:3000"  # API/WebSocket server

Dissecting the Traefik Dynamic Configuration

  • This file is where you define your Traefik routes, middlewares, and services. It's what tells Traefik how to handle incoming traffic.
  • The middlewares section defines a redirect-to-https middleware, which is used to redirect HTTP traffic to HTTPS.
  • The routers section defines several routers for different purposes:
    • main-app-router-redirect: Redirects HTTP traffic for dash.url.com to HTTPS.
    • next-router: Handles HTTPS traffic for dash.url.com on every path except /api/v1 and routes it to next-service, the Next.js frontend.
    • api-router: Handles HTTPS traffic for dash.url.com under /api/v1, routing it to api-service. This is for your API endpoints.
    • ws-router: Despite its name, its rule matches any request for dash.url.com and does not check for WebSocket upgrades. With no explicit priority, Traefik ranks routers by rule length, so this shorter rule loses to next-router and api-router. Those two together already cover every path, so in practice ws-router is never selected.
  • The services section defines the backend services:
    • next-service: Points to http://pangolin:3002, the Next.js server (next_port in config.yml).
    • api-service: Points to http://pangolin:3000, the API/WebSocket server (external_port in config.yml).
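All three websecure routers compete for the same host, and Traefik has to pick one. The toy Python model below shows how that choice plays out. It assumes no explicit priority is set, in which case Traefik ranks routers by the length of their rule string. It also replaces Traefik's rule parser with hand-written matchers for the three rules. pick_router is a hypothetical helper, not Traefik's implementation. The model shows that next-router and api-router together cover every path on dash.url.com, which leaves ws-router unused. It also shows that a host with no router at all gets Traefik's 404.

```python
from typing import Callable, List, Optional, Tuple

HOST = "dash.url.com"

# (router name, rule exactly as written in dynamic_config.yml, hand-written matcher)
Router = Tuple[str, str, Callable[[str, str], bool]]

ROUTERS: List[Router] = [
    ("next-router", "Host(`dash.url.com`) && !PathPrefix(`/api/v1`)",
     lambda host, path: host == HOST and not path.startswith("/api/v1")),
    ("api-router", "Host(`dash.url.com`) && PathPrefix(`/api/v1`)",
     lambda host, path: host == HOST and path.startswith("/api/v1")),
    ("ws-router", "Host(`dash.url.com`)",
     lambda host, path: host == HOST),
]


def pick_router(host: str, path: str) -> Optional[str]:
    """Return the router a request lands on, or None if nothing matches."""
    # Without an explicit priority, the longer rule string wins.
    by_priority = sorted(ROUTERS, key=lambda r: len(r[1]), reverse=True)
    for name, _rule, matches in by_priority:
        if matches(host, path):
            return name
    return None  # no matching router: Traefik responds with 404


for host, path in [("dash.url.com", "/"), ("dash.url.com", "/api/v1/"), ("app.url2.com", "/")]:
    # prints next-router, then api-router, then None
    print(host, path, "->", pick_router(host, path))
```

The last case is the symptom in this thread. Pangolin's per-resource routers normally arrive through the HTTP provider, so if that provider's payload is rejected, every resource host ends up with no matching router.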

Traefik Configuration (traefik_config.yml)

Finally, here's the content of my main traefik config (traefik_config.yml):

api:
  insecure: true
  dashboard: true

providers:
  http:
    endpoint: "http://pangolin:3001/api/v1/traefik-config"
    pollInterval: "5s"
  file:
    filename: "/etc/traefik/dynamic_config.yml"

experimental:
  plugins:
    badger:
      moduleName: "github.com/fosrl/badger"
      version: "v1.2.0"
log:
  level: "INFO"
  format: "common"

certificatesResolvers:
  letsencrypt:
    acme:
      httpChallenge:
        entryPoint: web
      email: "[email protected]"
      storage: "/letsencrypt/acme.json"
      caServer: "https://acme-v02.api.letsencrypt.org/directory"

entryPoints:
  web:
    address: ":80"
  websecure:
    address: ":443"
  tcp-25565:
    address: ":25565/tcp"
  tcp-25566:
    address: ":25566/tcp"
    transport:
      respondingTimeouts:
        readTimeout: "30m"
    http:
      tls:
        certResolver: "letsencrypt"

serversTransport:
  insecureSkipVerify: true

Unraveling the Core Traefik Configuration

  • This is Traefik's main configuration file, defining the global settings and providers.
  • The api section enables the Traefik dashboard in insecure mode (insecure: true). It's crucial to understand the security implications of this in a production environment. Exposing the dashboard without proper authentication can be a major security risk.
  • The providers section is key. It tells Traefik where to get its configuration:
    • http: This provider dynamically fetches configuration from Pangolin's API endpoint (http://pangolin:3001/api/v1/traefik-config).
    • file: This provider loads configuration from the dynamic_config.yml file.
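  • The experimental section enables the badger plugin (github.com/fosrl/badger, version v1.2.0). The Pangolin project (fosrl) publishes it as the Traefik middleware Pangolin uses for resource authentication.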
  • The log section configures Traefik's logging level and format.
  • The certificatesResolvers section configures Let's Encrypt for automatic certificate generation and renewal. It uses the HTTP challenge and stores certificates in /letsencrypt/acme.json.
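  • The entryPoints section defines the entry points for incoming traffic: web (port 80), websecure (port 443), and two TCP entry points (25565 and 25566). Only tcp-25566 carries extra settings: a 30-minute read timeout and an http.tls block using the letsencrypt resolver. An entry point's http.tls block sets default TLS for HTTP routers attached to that entry point. It does not apply TLS to TCP routers.
  • The serversTransport section disables certificate verification when Traefik connects to backend servers over HTTPS, which is generally not recommended for production environments.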
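With api.insecure: true, Traefik serves its API on its internal traefik entry point (port 8080 by default). You can query that API's /api/http/routers endpoint to see which routers Traefik actually loaded, and from which provider. The Python sketch below assumes you saved that JSON to a file. For example: docker exec traefik wget -qO- http://localhost:8080/api/http/routers > routers.json. This assumes BusyBox wget is available in the Alpine-based Traefik image. routers_by_provider and report are hypothetical helper names, and the only API fields the sketch reads are each router's name and provider.

```python
import json
import sys
from collections import defaultdict
from typing import Dict, List


def routers_by_provider(routers: List[dict]) -> Dict[str, List[str]]:
    """Group router names by the provider that defined them."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for router in routers:
        name = router.get("name", "?")
        # Prefer the "provider" field; otherwise use the "@provider" suffix of the name.
        provider = router.get("provider") or (name.split("@", 1)[1] if "@" in name else "unknown")
        grouped[provider].append(name)
    return {provider: sorted(names) for provider, names in grouped.items()}


def report(routers: List[dict]) -> None:
    grouped = routers_by_provider(routers)
    for provider, names in sorted(grouped.items()):
        print(f"{provider}: {len(names)} router(s)")
        for name in names:
            print(f"  {name}")
    if "http" not in grouped:
        print("No routers from the http provider (Pangolin's config was rejected, unreachable, or empty).")


# Run only when given a file, e.g. `python3 routers_by_provider.py routers.json`.
if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1], encoding="utf-8") as fh:
        report(json.load(fh))
```

If the report lists file routers but nothing from http, Traefik has not applied any configuration from Pangolin. The API paginates its list endpoints (page and per_page query parameters), so with many resources you may need to fetch more than one page.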

Initial Troubleshooting Steps & Potential Causes

Okay, so we have a lot of config to unpack here, but that's good! The more info, the better we can troubleshoot. Based on the error message and your configurations, here are my initial thoughts:

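  1. Traefik's "cannot decode configuration data: field not found, node: sticky" error: This is the most important clue. It means the configuration Traefik received contains a key named sticky at a position where Traefik 3.4.5's configuration schema defines no such field. The providerName=http part tells us the rejected payload is the one Traefik polls from Pangolin (http://pangolin:3001/api/v1/traefik-config), not your dynamic_config.yml. Traefik discards that payload and keeps retrying, so none of the routers Pangolin generates for your resources are loaded. That is consistent with every resource returning 404. The root cause is either a breaking change in the Traefik configuration schema between versions or an issue with how Pangolin is generating the dynamic Traefik configuration.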
  2. Traefik Version Incompatibility: You're running Traefik 3.4.5. A "field not found" error means the configuration contains something this Traefik version doesn't recognize, so Pangolin 1.8.0 may be generating configuration meant for a newer Traefik release than yours. Pangolin updates can introduce configuration that your Traefik version can't parse.
  3. Dynamic Configuration Issues: Pangolin dynamically generates Traefik configuration. There might be a bug in how Pangolin generates the configuration for your setup, especially after the update.
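  4. File Provider Conflicts: You're using both the HTTP provider (to fetch config from Pangolin) and the File provider (to load dynamic_config.yml). Traefik qualifies each router and service name with its provider (for example, next-router@file), so identical names won't collide. However, overlapping rules from the two providers could still compete for the same requests.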
  5. Caching/Stale Configuration: Traefik might be caching an old configuration, especially if the dynamic configuration isn't updating correctly.
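A toy model of strict, all-or-nothing decoding shows how the first cause can turn every resource into a 404 rather than breaking a single route. The schema below is an invented, heavily simplified slice of Traefik's dynamic configuration. check and load_payload are hypothetical helpers, not Traefik's decoder. The point is only that one unknown key rejects the whole payload, valid routers included. In this example, the unknown key is sticky placed next to loadBalancer instead of inside it.

```python
from typing import Any, Optional

# Invented, heavily simplified slice of the dynamic-config schema (illustration only).
# A dict lists the allowed keys; "*" stands for user-chosen names such as router or
# service names; None marks a leaf this toy does not inspect.
SCHEMA: dict = {
    "http": {
        "routers": {"*": None},
        "middlewares": {"*": None},
        "services": {
            "*": {
                "loadBalancer": {"servers": None, "sticky": None},
                "weighted": {"services": None, "sticky": None},
            }
        },
    }
}


class FieldNotFound(ValueError):
    """Raised for a key the schema does not define at that position."""


def check(data: Any, schema: Optional[dict], path: str = "") -> None:
    if schema is None or not isinstance(data, dict):
        return  # leaf, or a value this toy does not validate
    for key, value in data.items():
        if key in schema:
            sub = schema[key]
        elif "*" in schema:
            sub = schema["*"]
        else:
            raise FieldNotFound(f"field not found, node: {key} (at {path or '/'})")
        check(value, sub, f"{path}/{key}")


def load_payload(payload: dict) -> dict:
    """All-or-nothing: one unknown key rejects the entire payload."""
    check(payload, SCHEMA)
    return payload


# sticky nested inside loadBalancer is accepted...
ok = {"http": {"services": {"s": {"loadBalancer": {"servers": [], "sticky": {}}}}}}
load_payload(ok)

# ...but the same key one level up rejects everything, including the valid router.
bad = {
    "http": {
        "routers": {"r": {"rule": "Host(`app.url2.com`)", "service": "s"}},
        "services": {"s": {"loadBalancer": {"servers": []}, "sticky": {}}},
    }
}
try:
    load_payload(bad)
except FieldNotFound as err:
    print(err)  # field not found, node: sticky (at /http/services/s)
```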

Next Steps: Let's Get This Fixed!

Here are some concrete steps we can take to try and resolve this:

  1. Check Traefik Compatibility: Pangolin 1.8.0 might require a newer version of Traefik. Double-check the Pangolin documentation for compatibility information. If necessary, try upgrading Traefik to the latest stable version (but be sure to test in a non-production environment first!).

  2. Inspect the Dynamic Configuration: The most crucial step is to see what configuration Pangolin is actually generating for Traefik. You can access the output of the HTTP provider by making a request to http://pangolin:3001/api/v1/traefik-config from within your Docker network. Save this output and carefully examine it for errors, missing sections, or unexpected values. Look for anything related to sticky or other load-balancing configurations.

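  3. Simplify Traefik Configuration: To rule out conflicts between providers, temporarily disable the File provider. Comment out only the file section in your traefik_config.yml and leave the http provider in place. Commenting out the providers: key itself would disable Pangolin's provider too:

    providers:
      http:
        endpoint: "http://pangolin:3001/api/v1/traefik-config"
        pollInterval: "5s"
      # file:
      #   filename: "/etc/traefik/dynamic_config.yml"

    Then restart Traefik and see if the 404 errors persist. If the issue is resolved, it points to a conflict between the Pangolin-generated configuration and your dynamic_config.yml file. Note that the dash.url.com routes defined in dynamic_config.yml will stop working while the file provider is disabled.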

  4. Clear Traefik Cache (if applicable): Traefik keeps its dynamic configuration in memory and does not write it to disk, so there is no cache to clear as such. Restarting Traefik discards it and re-fetches everything from both providers. In this setup, the only state that survives a restart is /letsencrypt/acme.json, which stores certificates, not routes.

  5. Review Pangolin Logs: Check the Pangolin logs for any errors or warnings related to Traefik configuration generation. There might be clues there about why the configuration is invalid.

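  6. Check Network Connectivity: Although it's less likely, ensure that Traefik can actually reach Pangolin at http://pangolin:3001. Test from inside the Traefik container (named traefik in your compose file) with docker exec traefik wget -qO- http://pangolin:3001/api/v1/. The official Traefik image is Alpine-based and may not include curl, but BusyBox wget is normally available. Because Traefik shares Gerbil's network namespace, this also confirms that the pangolin hostname resolves from there.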

  7. Temporary Minimal Routing: Temporarily replace the contents of dynamic_config.yml with a minimal routing configuration that sends all traffic to Pangolin's port 3000 or 3002. This will help determine whether the issue lies in Pangolin's generated configuration or in basic connectivity and routing.

    For example, your dynamic_config.yml might look like this:

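http:
  routers:
    # Traefik v3 HostRegexp takes a plain regular expression;
    # the v2 `{host:.+}` placeholder form is not v3 syntax.
    catchall-http:
      rule: "HostRegexp(`.+`)"
      service: pangolin-service
      entryPoints: [web]
    # HTTPS needs a router with TLS enabled; without a resolvable
    # domain, Traefik may serve its default self-signed certificate.
    catchall-https:
      rule: "HostRegexp(`.+`)"
      service: pangolin-service
      entryPoints: [websecure]
      tls: {}
  services:
    pangolin-service:
      loadBalancer:
        servers:
          - url: "http://pangolin:3000"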

Remember to restart Traefik after making any configuration changes.
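For the step of inspecting the dynamic configuration, a small Python script can show exactly where sticky appears in Pangolin's output. It assumes the endpoint returns JSON and that you saved it to a file first. The Pangolin container already has curl (its healthcheck uses it), so this should work: docker exec pangolin curl -s http://localhost:3001/api/v1/traefik-config > traefik-config.json. find_key is a hypothetical helper name, not part of Pangolin or Traefik.

```python
import json
import sys
from typing import Any, Iterator


def find_key(node: Any, key: str, path: str = "") -> Iterator[str]:
    """Yield a slash-separated path for every occurrence of `key` in nested JSON."""
    if isinstance(node, dict):
        for k, v in node.items():
            child = f"{path}/{k}"
            if k == key:
                yield child
            yield from find_key(v, key, child)
    elif isinstance(node, list):
        for i, item in enumerate(node):
            yield from find_key(item, key, f"{path}[{i}]")


def main(filename: str) -> None:
    with open(filename, encoding="utf-8") as fh:
        config = json.load(fh)
    for location in find_key(config, "sticky"):
        print(location)


# Run only when given a file, e.g. `python3 find_sticky.py traefik-config.json`.
if __name__ == "__main__" and len(sys.argv) > 1:
    main(sys.argv[1])
```

In Traefik v3, sticky is a valid option inside an HTTP service's loadBalancer or weighted block. Any path outside those is the most likely trigger for the provider error and worth including (redacted) when you share the config.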

Sharing is Caring: Let's Collaborate!

To help me (and others in the community) assist you better, please share the following after you've tried these steps:

  • The output of http://pangolin:3001/api/v1/traefik-config (redact any sensitive information, of course!).
  • Any relevant Pangolin logs.
  • The results of each troubleshooting step you've tried.

Let's work through this together! 404 errors can be a pain, but with a systematic approach, we can usually find the culprit. Good luck, and keep us updated!