KEP-5922: Conformance Tests for Out-of-Tree Networking Features #5923
danwinship wants to merge 1 commit into kubernetes:master
Conversation
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: danwinship

The full list of commands accepted by this bot can be found here. The pull request process is described here.
> | [KEP-2433] `TopologyAwareHints` | 1.33 | Some service proxies | :fearful: |
> | [KEP-4444] `ServiceTrafficDistribution` | 1.33 | Some service proxies | :fearful: |
> | [KEP-3015] `PreferSameTrafficDistribution` | 1.35 | Few service proxies | :rage: |
> - To promote a service DNS feature or behavior to "future
>   conformance", it would have to already be implemented correctly by
>   both `kube-dns` and `CoreDNS`.
kube-dns may not be a good example; it does not use EndpointSlices (kubernetes/dns#504), so dual-stack is not handled correctly, IIRC.
It's not an "example", it's a clarification of a requirement that already exists but was previously not explicitly stated: the existing GKE-based conformance jobs use kube-dns (right?), so therefore new conformance tests must be able to pass in a cluster using kube-dns.
kubernetes/kubernetes#132019 was supposed to have made EndpointSlice support explicitly required for conformance... I guess it made EndpointSlice-based proxying a conformance requirement, but not EndpointSlice-based service DNS...
I'll need to revisit this as part of KEP-4974... (Is Google planning to ditch kube-dns? I thought I heard something about that...)
I've added an item about this to this week's SIG Network agenda
> conformance", it would have to already be implemented correctly by
> both `kube-dns` and `CoreDNS`.
> (NetworkPolicy, cloud load balancers, Ingress, and Gateway are
I wonder if we should revisit NetworkPolicy, but this is just a drive-by comment, unrelated to the KEP itself.
I went back and forth on whether to say something here, but yes, this does pave the way for making NP a requirement. (Though as Tim pointed out, if we think "NPv2" is coming maybe we should just wait for that...)
/cc @BenTheElder @pohly for the testing framework and mechanics
/cc @dims @johnbelamaric for sig arch conformance

It is worth having this discussion, at least having a reasonable path to unblock this deadlock of third parties not investing in being conformant and us trying to be good guys and not break them.
yes please
> However, for people doing conformance testing of Kubernetes
> distributions, failures in the "future conformance" tests would merely
> result in warnings in the conformance test results, not failures. The
I am not sure how to achieve the "merely result in warnings" part. Once a test runs, any failure is recorded as a failure in the JUnit result.
Maybe we can do some post-processing (we already implement our own JUnit reporting) and turn "failed" into "warning" in https://pkg.go.dev/github.com/onsi/ginkgo/v2@v2.28.1/reporters#JUnitFailure for tests which are marked as "are allowed to fail".
But the overall test suite result then will still be a failure. Hydrophone and Sonobuoy may have to be adapted to report this differently.
I was originally imagining that maybe in the same way you can recover from a panic in go, that maybe there was some way to catch a ginkgo.Fail(). Then we could turn it into a Skip() instead. (There shouldn't normally be any Skipped tests in the conformance results, so that would be something the testing helpers could recognize as being specific to this case). But I couldn't find anything in the ginkgo docs suggesting anything like that would be possible...
A higher-effort version of that would be to have the test case itself Skip itself on failure, but that would imply it couldn't use a lot of helper functions (like, most of gomega), and there'd be the risk that we'd accidentally end up actually failing in some edge case.
Maybe "run the test suite twice" really is the best approach. I guess rather than splitting it into "all present+future tests" and "only present tests", the split could be "only present tests" and "only future tests", and the instructions for the actual conformance results would be the same as they are now, but then we'd suggest you could also run the future conformance tests separately to confirm your future conformance...
> I was originally imagining that maybe in the same way you can recover from a panic in go, that maybe there was some way to catch a ginkgo.Fail()
That does indeed not work. As soon as ginkgo.Fail is called, the test is marked as failed. You can catch the special panic and continue the test, but there's no way to intercept the actual failure. The post-processing that I mentioned could have the same effect, though.
> Maybe we can do some post-processing (we already implement our own JUnit reporting) and turn "failed" into "warning" in https://pkg.go.dev/github.com/onsi/ginkgo/v2@v2.28.1/reporters#JUnitFailure for tests which are marked as "are allowed to fail".
Oof, that seems confusing and problematic.
Maybe:

- It just fails
- We work with the CNCF before rolling out any of these to update the tooling used to enforce passing conformance to accept overall suite failure and specifically consider the tests, ignoring tests that are `[FutureConformance]` or something
People are typically running these tests under some wrapper, and often not inspecting the junit at all.
I think a "failure" will be much more visible, and encourage actually fixing things.
But if we don't want to strictly require it to pass yet, then we can just have the conformance program permit those failures when a special tag is present.
I doubt any existing pipeline would surface a "warning", including our own.
IOW: we treat this like any other tests today: excluded or included from the run. Those preparing for the future can run [(Future?)Conformance], those running current conformance can run [Conformance]
The tricky thing is getting buy-in to actually bother running these and preparing for them, otherwise this is essentially no different from just ... delaying promotion of an otherwise untagged test, as we sometimes do today.
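The "excluded or included from the run" split discussed above amounts to ordinary focus/skip filtering. A stdlib-only sketch of that selection logic (again treating `[FutureConformance]` as the hypothetical tag from this thread, and mimicking, not reusing, ginkgo's `-ginkgo.focus`/`-ginkgo.skip` regex matching):

```go
package main

import (
	"fmt"
	"regexp"
)

// Hypothetical tag from this discussion; nothing in ginkgo defines it yet.
var (
	future      = regexp.MustCompile(`\[FutureConformance\]`)
	conformance = regexp.MustCompile(`\[Conformance\]`)
)

// selectTests picks the [Conformance] tests, optionally excluding the
// future-tagged ones, like running with or without a skip pattern.
func selectTests(names []string, includeFuture bool) []string {
	var out []string
	for _, n := range names {
		if !conformance.MatchString(n) {
			continue // not a conformance test at all
		}
		if !includeFuture && future.MatchString(n) {
			continue // present-only run: skip future-tagged tests
		}
		out = append(out, n)
	}
	return out
}

func main() {
	tests := []string{
		"[sig-network] Services should work [Conformance]",
		"[sig-network] TrafficDistribution [Conformance] [FutureConformance]",
		"[sig-network] some alpha test",
	}
	fmt.Println(selectTests(tests, false)) // present conformance only
	fmt.Println(selectTests(tests, true))  // present + future
}
```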
> This may require some combination of changes to:
> - The k/k e2e framework
I recently worked with the Ginkgo maintainers on supporting labeling tests with version numbers for arbitrary components: https://pkg.go.dev/github.com/onsi/ginkgo/v2#ComponentSemVerConstraint
A test could be tagged as ComponentSemVerConstraint("KubernetesConformance", ">=1.57").
Then `ginkgo --sem-ver-filter="KubernetesConformance=1.56"` does not include this test; `ginkgo --sem-ver-filter="KubernetesConformance=1.57"` does.
I believe (to be verified!) that the test also gets excluded when --sem-ver-filter is not used. This might not be what we want.
If you use no --sem-ver-filter then it just ignores the version constraints and runs everything.
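A stdlib-only illustration of those filter semantics, written for this thread rather than taken from ginkgo's actual matcher (it only handles `>=` constraints and `major.minor` versions): a test constrained to `>=1.57` is excluded at 1.56, included at 1.57, and, per the answer above, included when no filter is given.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseMinor extracts the minor version from a "major.minor" string.
func parseMinor(v string) int {
	parts := strings.SplitN(v, ".", 2)
	n, _ := strconv.Atoi(parts[1])
	return n
}

// included mimics the described --sem-ver-filter behavior for a single
// ">=X.Y" constraint: no filter means constraints are ignored and
// everything runs; otherwise the filter version must satisfy the
// constraint. Simplified: major versions are assumed equal.
func included(constraint, filter string) bool {
	if filter == "" {
		return true // no --sem-ver-filter: run everything
	}
	min := strings.TrimPrefix(constraint, ">=")
	return parseMinor(filter) >= parseMinor(min)
}

func main() {
	// e.g. a test tagged ComponentSemVerConstraint("KubernetesConformance", ">=1.57")
	c := ">=1.57"
	fmt.Println(included(c, "1.56")) // false: excluded when filtering at 1.56
	fmt.Println(included(c, "1.57")) // true: included when filtering at 1.57
	fmt.Println(included(c, ""))     // true: no filter runs everything
}
```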
> Given that, the _simplest_ approach would just be to tell people to
> run the full present-and-future conformance suite first, and if it
> passes, submit those results, but if it fails, re-run just the present
> conformance suite, and submit the results of that.
Maybe this "present" vs. "future" split could be achieved with `--sem-ver-filter`.
Force-pushed from 6039cac to ae79957
Force-pushed from ae79957 to 1618991
@pohly updated with a clearer plan...
> - To promote a pod networking-related feature or behavior to "future
>   conformance", it would have to already be implemented correctly by
>   both `kindnet` and "GKE Dataplane v1".
We don't run CI with GKE AFAIK; is "GKE Dataplane v1" a reference to kube-up.sh?
Also, when we say kindnet, do we mean sigs.k8s.io/kindnet, or whatever runs in a kind cluster out of the box, which we use to run some project-wide conformance CI? (The former has cloud bits that the latter does not; their implementations have diverged.)
x-ref #4224
So again as I commented to Antonio in the discussion about kube-dns above, the text here is not trying to create new constraints. It is trying (and apparently failing) to describe a constraint that already exists, namely, "you can't even merge a PR adding a new conformance test unless the test passes in all of the `always_run: true, optional: false` k/k CI jobs".
By "GKE Dataplane v1", I meant "whatever networking we run in GCE CI jobs"... I thought "GKE Dataplane v1" was a correct way of describing that... In particular, to the best of my knowledge, pod networking in GCE CI jobs uses a CNI plugin called "kubenet" that is a descendant of the "kubenet" plugin that used to exist in-tree, but which no longer exists anywhere within the Kubernetes project.
Likewise, by "kindnet", I mean "whatever we run in the kind CI jobs"
> Tests for future conformance should use `framework.ConformanceIt()`,
> but should include the additional decorator
> `framework.WithConformanceVersion(version)` with a future Kubernetes
AIUI, we don't support running the e2e tests against a cluster using compatibility/emulated versions. If you want to test against a cluster being run with --emulated-version=1.28 you have to run 1.28's e2e test binary. Right?
> <<[UNRESOLVED] kube-dns? >>
> I previously thought we still depended on kube-dns, but from
> https://github.com/kubernetes/kubernetes/pull/137553, it seems we
I don't think we do as a project.
I'm not sure if any ecosystem offerings still do, or if that's even reasonable to block on ... I lean towards: ~no, no.
I was specifically worried about CI clusters, since kube-up.sh, etc, still defaults to kube-dns if `CLUSTER_DNS_CORE_DNS` is unset. But I missed the fact that `cluster/gce/config-default.sh` sets that to true, so you get CoreDNS in any job that doesn't explicitly set it to false. So it seems that all of our CI jobs use CoreDNS except for a few sig-network ones which are intentional kube-dns variants of other jobs.
As for the wider ecosystem, SIG Network is deprecating it (kubernetes/kubernetes#137556) after confirming that there is not enough usage of it for it to be reasonable to block on.
/sig network architecture testing docs
/assign @aojea @thockin